New AI research proposes a simple but effective structure-based encoder for learning the representation of proteins based on their 3D structures


Proteins, the workhorses of the cell, are used in a wide range of applications, from materials to therapeutics. They consist of a chain of amino acids that folds into a specific three-dimensional shape. A significant number of new protein sequences have recently been discovered thanks to the development of low-cost sequencing technology. Because functional annotation of a new protein sequence is still costly and time-consuming, accurate and efficient in silico protein function annotation methods are needed to bridge the current sequence-function gap.

Many data-driven approaches rely on learning the representations of protein structures because many protein functions are controlled by how they are folded. These representations can then be applied to tasks such as protein design, structure classification, model quality assessment, and function prediction.

Because experimental determination of protein structure is difficult, the number of published protein structures is orders of magnitude lower than the size of datasets in other machine learning application fields. For example, the Protein Data Bank contains 182,000 experimentally determined structures, compared to 47 million protein sequences in Pfam and 10 million annotated images in ImageNet. Several studies have used the abundance of unlabelled protein sequence data to fill this representation gap, applying self-supervised learning to pre-train protein encoders on millions of sequences.


Recent developments in accurate deep learning-based protein structure prediction techniques have made it possible to efficiently predict the structures of many protein sequences. However, sequence-based techniques do not explicitly capture or use this known structural information when determining protein function. Several structure-based protein encoders have been proposed to make better use of such information. Unfortunately, edge interactions, which are crucial in modeling protein structure, have yet to be addressed explicitly in these models. Furthermore, due to the paucity of experimentally determined protein structures, relatively little work has until recently been done on pre-training techniques that take advantage of unlabelled 3D structures.

Inspired by these advances, the researchers build a protein structure encoder that can be applied to a range of property prediction tasks and is pre-trained on as many protein structures as possible. They propose a simple yet effective structure-based encoder called the GeomEtry-Aware Relational Graph Neural Network (GearNet), which performs relational message passing on protein residue graphs after encoding spatial information through different types of sequential and structural edges. They further propose a sparse edge message passing technique to improve the protein structure encoder, the first effort to apply edge-level message passing to GNNs for protein structure encoding. This idea was inspired by the triangle attention design in the Evoformer.
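The relational message passing described above can be illustrated with a toy sketch. This is a simplified stand-in, not the paper's implementation: the edge-type construction, weight initialization, and aggregation details are assumptions chosen only to show the overall shape of a relation-aware GNN layer over a residue graph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy protein: 6 residues with random 3D coordinates and feature vectors.
n_res, d = 6, 8
coords = rng.normal(size=(n_res, 3))
h = rng.normal(size=(n_res, d))

# Build edges of several relation types, loosely mirroring a GearNet-style
# residue graph (sequential offsets and a spatial radius cutoff; simplified).
def sequential_edges(n, offset):
    return [(i, i + offset) for i in range(n - offset)]

def radius_edges(xyz, cutoff=2.0):
    edges = []
    for i in range(len(xyz)):
        for j in range(len(xyz)):
            if i != j and np.linalg.norm(xyz[i] - xyz[j]) < cutoff:
                edges.append((i, j))
    return edges

relations = {
    "seq+1": sequential_edges(n_res, 1),
    "seq+2": sequential_edges(n_res, 2),
    "radius": radius_edges(coords),
}

# One relational message-passing layer: each relation type has its own weight
# matrix, messages are summed over all relations, then a ReLU is applied.
W = {r: rng.normal(scale=0.1, size=(d, d)) for r in relations}
W_self = rng.normal(scale=0.1, size=(d, d))

def relational_layer(h):
    out = h @ W_self
    for r, edges in relations.items():
        for (src, dst) in edges:
            out[dst] += h[src] @ W[r]
    return np.maximum(out, 0.0)  # ReLU

h_next = relational_layer(h)
print(h_next.shape)  # (6, 8)
```

Using separate weights per relation type is what lets the encoder treat a sequence neighbor differently from a spatially close residue, which a single-relation GNN cannot do.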

They also provide a geometric pre-training approach, based on the well-established contrastive learning framework, to learn the protein structure encoder. To capture biologically related protein substructures that co-occur across proteins, they propose novel augmentation functions that increase the similarity between learned representations of substructures from the same protein while decreasing the similarity between those of different proteins. Alongside this, they propose a set of simple baselines based on self-prediction.
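The contrastive objective can be sketched minimally as follows. Everything here is a simplified assumption for illustration: the encoder is replaced by mean pooling, the augmentation is a plain contiguous crop rather than the paper's augmentation functions, and the loss is a standard InfoNCE over a small batch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical encoder stand-in: mean-pool residue features into one vector.
def encode(residue_feats):
    return residue_feats.mean(axis=0)

def crop(residue_feats, frac=0.5):
    """Augmentation stand-in: a random contiguous subsequence of residues."""
    n = len(residue_feats)
    k = max(1, int(n * frac))
    start = rng.integers(0, n - k + 1)
    return residue_feats[start:start + k]

def info_nce(anchors, positives, tau=0.1):
    """InfoNCE: pull two views of the same protein together, push apart
    views coming from different proteins in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # diagonal = positive pairs

# Batch of 4 toy "proteins", each a (length, feature) array.
proteins = [rng.normal(size=(rng.integers(8, 16), 16)) for _ in range(4)]
view1 = np.stack([encode(crop(p)) for p in proteins])
view2 = np.stack([encode(crop(p)) for p in proteins])
loss = info_nce(view1, view2)
print(round(float(loss), 4))
```

Minimizing this loss makes two crops of the same protein map to nearby embeddings, which is exactly the "similar substructures from the same protein" property the paper's augmentations aim for.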

By comparing their pre-training methods across diverse downstream property prediction tasks, they established a strong benchmark for pre-training protein structure representations. The self-prediction tasks include masked prediction of various geometric or physicochemical quantities, such as residue types, Euclidean distances, and dihedral angles. Extensive tests on a variety of benchmarks, including Enzyme Commission number prediction, Gene Ontology term prediction, fold classification, and reaction classification, show that GearNet enhanced with edge message passing consistently outperforms existing protein encoders on most tasks in the supervised setting.
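One of these self-prediction baselines, masked residue-type prediction, follows the familiar BERT-style recipe: hide some residues and train the encoder to recover their types from context. The sketch below shows only the shape of the objective; the random "logits" stand in for an actual structure encoder's predictions, which is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")        # 20 standard residue types
seq = [rng.integers(0, 20) for _ in range(12)]    # toy protein as type indices

# Mask a few positions; the model must recover the residue type at each
# masked position from the surrounding context.
mask_idx = rng.choice(len(seq), size=2, replace=False)

def masked_type_loss(logits, targets):
    """Cross-entropy over the 20 residue classes at the masked positions."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(targets)), targets])

logits = rng.normal(size=(len(mask_idx), 20))     # stand-in encoder output
targets = np.array([seq[i] for i in mask_idx])
loss = masked_type_loss(logits, targets)
print(np.isfinite(loss))  # True
```

The distance and dihedral-angle variants mentioned above swap the 20-way classification target for a binned geometric quantity, but keep the same masked-prediction structure.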

Moreover, using the proposed pre-training strategy, their model trained on fewer than a million samples performs on par with, or even better than, state-of-the-art sequence-based encoders pre-trained on datasets of millions or billions of sequences. The code base, written in PyTorch and TorchDrug, is publicly available on GitHub.


Aneesh Tickoo is a Consulting Intern at MarktechPost. She is currently pursuing her BA in Data Science and Artificial Intelligence from Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects that harness the power of machine learning. Her research interest is image processing and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.


