Protein ML Colab Notebooks

31 Oct 2021 on Education, Protein, and ML

Seven Google Colab notebooks made for the CSBERG Synthetic Biology course. The content was delivered in Summer 2021.

The table of contents Colab notebook is here.

1. Introduction

Basic NumPy and PyTorch vectorized operations
.backward(), .grad, manual gradient optimization
Model saving and loading
Curse of dimensionality exercise
Loading .csv and .fasta files of sequences, one-hot encoding
PyTorch Dataset and DataLoader
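
As a rough illustration of the FASTA loading and one-hot encoding step in notebook 1, here is a minimal sketch. It is not the notebook's code: the helper names `read_fasta` and `one_hot`, the residue ordering, and the toy sequences are all assumptions made for this example.

```python
import io
import numpy as np

# Assumption: the 20 canonical amino acids in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def read_fasta(handle):
    """Parse FASTA text from a file-like object into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def one_hot(sequence):
    """Return a (length, 20) array; residues outside the alphabet stay all-zero."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for position, aa in enumerate(sequence.upper()):
        index = AA_TO_INDEX.get(aa)
        if index is not None:
            encoded[position, index] = 1.0
    return encoded


# Toy input (hypothetical sequences); a real file would be opened with open(path).
toy_fasta = io.StringIO(">seq1 toy example\nMKV\nLA\n>seq2\nACDX\n")
records = read_fasta(toy_fasta)
encoded = [one_hot(seq) for _, seq in records]
```

Arrays like these can be stacked (after padding or alignment to a common length) and wrapped in a PyTorch Dataset.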

2. Discriminative Models

Two-layer fully-connected neural network for catalytic activity prediction
Rough Mount Fuji model
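
For readers unfamiliar with the Rough Mount Fuji model, a common formulation assigns each genotype an additive fitness that falls off with Hamming distance from a reference sequence, plus an independent random term. The sketch below assumes binary genotypes, an all-zero reference, and Gaussian noise; the function name and parameters are hypothetical and the notebook may set things up differently.

```python
import itertools
import numpy as np


def rough_mount_fuji(length, slope, noise_scale, seed=0):
    """Enumerate all binary genotypes and assign Rough Mount Fuji fitness values.

    fitness = -slope * (Hamming distance to the reference) + Gaussian noise
    """
    rng = np.random.default_rng(seed)
    genotypes = np.array(list(itertools.product([0, 1], repeat=length)))
    reference = np.zeros(length, dtype=int)  # assumption: all-zero reference genotype
    hamming = (genotypes != reference).sum(axis=1)
    noise = rng.normal(0.0, noise_scale, size=len(genotypes))
    fitness = -slope * hamming + noise
    return genotypes, fitness


# A small landscape: 2**4 = 16 genotypes with moderate ruggedness.
genotypes, fitness = rough_mount_fuji(length=4, slope=1.0, noise_scale=0.5)
```

Setting `noise_scale` to zero gives a smooth, single-peaked landscape, while a large `noise_scale` relative to `slope` gives a rugged one.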

3. Generative Models

Representing multiple sequence alignments as matrices
Variational Auto-Encoders (VAEs) trained on aligned Pfam sequences
Sampling sequences from VAEs and visualizing results with sequence logos
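
One way to represent a multiple sequence alignment as a matrix is sketched below: each aligned sequence becomes a row of integer symbol indices, and per-column symbol frequencies (the quantity a sequence logo summarizes) follow from a one-hot expansion. The alphabet ordering, the helper names, and the toy alignment are assumptions for illustration only.

```python
import numpy as np

# Assumption: gap symbol first, then the 20 canonical amino acids.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"
SYMBOL_TO_INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def msa_to_matrix(aligned_sequences):
    """Convert equal-length aligned sequences into an (N, L) integer matrix.

    Symbols outside ALPHABET raise a KeyError in this sketch.
    """
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return np.array(
        [[SYMBOL_TO_INDEX[symbol] for symbol in seq] for seq in aligned_sequences]
    )


def column_frequencies(msa_matrix, n_symbols):
    """Return an (L, n_symbols) matrix of per-column symbol frequencies."""
    one_hot = np.eye(n_symbols)[msa_matrix]  # shape (N, L, n_symbols)
    return one_hot.mean(axis=0)


toy_msa = ["MK-A", "MKVA", "MR-A"]  # hypothetical alignment
msa_matrix = msa_to_matrix(toy_msa)
frequencies = column_frequencies(msa_matrix, len(ALPHABET))
```

The one-hot tensor of shape (N, L, 21), flattened per sequence, is the kind of input a VAE over aligned sequences typically consumes.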

4. Model-based Optimization

Latent space optimization
Conditioning by Adaptive Sampling (CbAS)

5. Inductive Bias

Potts model implementation in PyTorch
Attention (WIP) and nn.Embedding
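
To make the Potts model concrete, here is a small sketch of its energy function, written in NumPy rather than PyTorch to keep the example dependency-light. It assumes the common convention E(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j); sign and gauge conventions vary between papers, and the function name and array layout are hypothetical.

```python
import numpy as np


def potts_energy(sequence, fields, couplings):
    """Energy of one sequence under a Potts model.

    sequence:  (L,) integer symbol indices
    fields:    (L, q) single-site terms h_i(a)
    couplings: (L, L, q, q) pairwise terms J_ij(a, b); only i < j is used
    """
    length = len(sequence)
    positions = np.arange(length)
    field_term = fields[positions, sequence].sum()
    # pair_values[i, j] = J_ij(s_i, s_j), gathered with broadcasted indexing.
    pair_values = couplings[
        positions[:, None], positions[None, :], sequence[:, None], sequence[None, :]
    ]
    upper = np.triu_indices(length, k=1)  # count each pair i < j once
    coupling_term = pair_values[upper].sum()
    return -(field_term + coupling_term)


# Random toy parameters: 4 positions, 3 symbols per position.
rng = np.random.default_rng(0)
toy_fields = rng.normal(size=(4, 3))
toy_couplings = rng.normal(size=(4, 4, 3, 3))
toy_sequence = np.array([0, 2, 1, 1])
toy_energy = potts_energy(toy_sequence, toy_fields, toy_couplings)
```

In a PyTorch version, `fields` and `couplings` would be learnable parameters and the same gather-and-sum pattern would apply to batches of sequences.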

6. Language Models

bio_embeddings
Exploratory code to benchmark random embeddings for protein property prediction

7. Structure-based Models

py3Dmol for visualizing structures in Colab
Distance matrix, orientograms from trRosetta
Molecular dynamics with OpenMM
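
The distance matrix mentioned above can be computed directly from atomic coordinates. This sketch assumes one 3D coordinate per residue (for example the Cα or Cβ atom) and uses an 8 Å contact cutoff, which is a common convention rather than something the notebook specifies; the function names and toy coordinates are hypothetical.

```python
import numpy as np


def distance_matrix(coords):
    """Pairwise Euclidean distances for an (L, 3) array of coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]  # shape (L, L, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))


def contact_map(distances, threshold=8.0):
    """Boolean (L, L) map of residue pairs closer than the threshold (in Å)."""
    return distances < threshold


# Toy coordinates in Å (hypothetical, not from a real structure).
toy_coords = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 10.0]])
toy_distances = distance_matrix(toy_coords)
toy_contacts = contact_map(toy_distances)
```

The result is symmetric with a zero diagonal, so only the upper triangle carries independent information.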

Slides

The accompanying slides cover notebooks 1, 2, and 3. Slides 1-28 can be delivered in about 2 hours.