Protein Distance Matrix Prediction

 I wanted to work on a biology-related machine learning project, so I started exploring which public datasets were available. I came across the PDB, the Protein Data Bank. Predicting how proteins fold seemed like a compelling project: it is a complex task that is important and directly applicable to the biomedical field. A lot of work has already been done on this task, so I knew it was a tractable problem.

 The structure I initially thought of was a transformer-style input to read the amino acid sequence, followed by a dense encoder to represent the latent structural features, and then a transformer-style output to write the CIF file. As I looked into how that architecture might perform, I found that although it had merit on the surface, it would likely fall flat because of the complex local and non-local interactions and the need to generate geometric positions. In my research, I found that geometric models are used to handle these complex spatial interactions and coordinates. I had never worked with geometric models, so I was excited to learn more about them.

 I came across a few different geometric modeling concepts: Graph Neural Networks (GNNs), Geometric Deep Learning (GDL), Geometric Transformers, Distance Matrices, and Point Clouds.

 After researching these concepts, I realized that they would be a better fit for the protein folding task than the transformer-based architecture I had first proposed. Since I had not used them before, I decided to work on some smaller projects to get familiar with them before diving into the protein folding project.

 To start, I decided to work on a project that used GNNs to predict a distance matrix for small segments of a protein sequence. The goal was to work with segments 10-100 residues long and build a model that takes the amino acid sequence as input and outputs a distance matrix. I could use the PDB data to train the model and evaluate its performance. This project would help me get familiar with GNNs and distance matrices, both of which would be important for the protein folding project.
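
 To make the input and output concrete, here is a minimal sketch of how one training pair might be built with NumPy. It is an illustration under assumptions, not the project's actual pipeline: the helper names `one_hot_sequence` and `distance_matrix` are hypothetical, the distances are assumed to be between alpha-carbon (CA) atoms, and the coordinates are made-up toy values rather than data parsed from a PDB entry.

```python
import numpy as np

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_sequence(sequence):
    """Encode an amino acid sequence as an (L, 20) one-hot matrix.

    Hypothetical helper: only the 20 standard residues are handled,
    so any other letter raises a KeyError.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        encoded[pos, index[aa]] = 1.0
    return encoded


def distance_matrix(coords):
    """Return the (L, L) matrix of pairwise Euclidean distances.

    `coords` is an (L, 3) array with one 3D point per residue
    (assumed here to be the CA atom position, in angstroms).
    """
    coords = np.asarray(coords, dtype=np.float64)
    # Broadcasting: (L, 1, 3) - (1, L, 3) -> (L, L, 3) difference vectors.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Toy 4-residue segment with invented coordinates (not real PDB data).
sequence = "ACDG"
ca_coords = np.array(
    [
        [0.0, 0.0, 0.0],
        [3.8, 0.0, 0.0],
        [3.8, 3.8, 0.0],
        [0.0, 3.8, 0.0],
    ]
)

features = one_hot_sequence(sequence)  # model input, shape (4, 20)
target = distance_matrix(ca_coords)    # training target, shape (4, 4)
```

 The target is symmetric with a zero diagonal, so a model only needs to predict the upper triangle. Because the matrix depends only on distances between residues, it is also unchanged when the whole structure is rotated or translated, which is one reason it is a convenient first target compared with raw coordinates.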

Project Goals


Tech Stack


Example

Input Image:


Challenges


Lessons Learned


Future Improvements