Protein ML Colab Notebooks

Seven Google Colab notebooks made for the CSBERG Synthetic Biology course, delivered in Summer 2021.

The table of contents Colab notebook is here.

1. Introduction

  • Basic numpy and pytorch vectorized operations
  • .backward(), .grad, manual gradient optimization
  • Model saving and loading
  • Curse of dimensionality exercise
  • Loading .csv and .fasta files of sequences, one-hot encoding
  • PyTorch Dataset and DataLoader
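
As a rough illustration of the FASTA loading and one-hot encoding topics above, here is a minimal, dependency-free sketch. It is not taken from the notebook: the names `read_fasta`, `one_hot`, and `ALPHABET`, the amino acid ordering, and the choice to map unknown characters to all-zero rows are all assumptions made for this example.

```python
from io import StringIO

# Assumed alphabet: the 20 standard amino acids (the ordering is arbitrary here).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def read_fasta(handle):
    """Parse FASTA text from a file-like object into {header: sequence}."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line
    return records


def one_hot(sequence):
    """Return an L x 20 list of lists; unknown characters become all-zero rows."""
    encoded = []
    for aa in sequence:
        row = [0] * len(ALPHABET)
        idx = AA_TO_INDEX.get(aa)
        if idx is not None:
            row[idx] = 1
        encoded.append(row)
    return encoded


# Hypothetical toy input standing in for a real .fasta file.
fasta_text = ">seq1\nACD\n>seq2\nWY\nA\n"
records = read_fasta(StringIO(fasta_text))
encoded = one_hot(records["seq1"])
```

In practice the resulting nested list would typically be converted to a numpy array or PyTorch tensor before being wrapped in a `Dataset`.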

2. Discriminative Models

  • Two layer fully-connected neural network for catalytic activity prediction
  • Rough Mount Fuji model
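
For readers unfamiliar with the Rough Mount Fuji model, the sketch below shows its usual form: fitness falls off linearly with Hamming distance from a reference genotype, plus an independent random term per genotype. This is an assumption-laden toy on binary genotypes, not the notebook's implementation; the function name `rough_mount_fuji` and its parameters `slope` and `noise_sd` are hypothetical.

```python
import itertools
import random


def rough_mount_fuji(length, slope, noise_sd, seed=0):
    """Build a toy RMF landscape over all binary genotypes of a given length.

    fitness(g) = -slope * hamming(g, reference) + Gaussian noise
    The reference genotype is assumed to be all zeros.
    """
    rng = random.Random(seed)
    reference = (0,) * length
    landscape = {}
    for genotype in itertools.product((0, 1), repeat=length):
        distance = sum(a != b for a, b in zip(genotype, reference))
        landscape[genotype] = -slope * distance + rng.gauss(0.0, noise_sd)
    return landscape


# noise_sd = 0 gives a perfectly smooth, additive landscape;
# larger noise_sd makes the landscape more rugged.
smooth = rough_mount_fuji(length=4, slope=1.0, noise_sd=0.0)
rugged = rough_mount_fuji(length=4, slope=1.0, noise_sd=1.0)
```

The ratio of `noise_sd` to `slope` controls ruggedness, which makes the model a convenient benchmark for how well a regressor copes with epistasis.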

3. Generative Models

  • Representing multiple sequence alignments as matrices
  • Variational Auto-Encoders trained on Pfam aligned sequences
  • Sampling sequences from VAEs and visualizing results with sequence logos
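
The idea of representing a multiple sequence alignment as a matrix can be sketched as follows. This is an illustrative example only: the 21-symbol alphabet (gap plus 20 amino acids), its ordering, and the helper names `msa_to_matrix` and `column_frequencies` are assumptions, and the toy alignment is made up. The per-column frequencies are the quantity a sequence logo is built from.

```python
# Assumed alphabet: gap symbol followed by the 20 standard amino acids.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"
SYMBOL_TO_INDEX = {s: i for i, s in enumerate(ALPHABET)}


def msa_to_matrix(aligned):
    """Encode N aligned sequences of length L as an N x L integer matrix."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [[SYMBOL_TO_INDEX[c] for c in seq] for seq in aligned]


def column_frequencies(matrix):
    """Return an L x 21 matrix of symbol frequencies per alignment column."""
    n_seqs = len(matrix)
    n_cols = len(matrix[0])
    freqs = []
    for col in range(n_cols):
        counts = [0] * len(ALPHABET)
        for row in matrix:
            counts[row[col]] += 1
        freqs.append([c / n_seqs for c in counts])
    return freqs


# Hypothetical toy alignment of three sequences and four columns.
toy_msa = ["AC-D", "ACGD", "TC-D"]
msa_matrix = msa_to_matrix(toy_msa)
freqs = column_frequencies(msa_matrix)
```

One-hot encoding each integer in `msa_matrix` would give the N x L x 21 input that a VAE trained on aligned sequences typically consumes.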

4. Model-based Optimization

  • Latent space optimization
  • Conditioning by Adaptive Sampling (CbAS)

5. Inductive Bias

  • Potts model implementation in PyTorch
  • Attention (WIP) and nn.Embedding
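
For orientation, a Potts model assigns each sequence an energy built from per-site fields and pairwise couplings, E(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j). The sketch below evaluates that energy in plain Python rather than PyTorch; it is not the notebook's code, and the function name `potts_energy`, the data layout, and the toy parameters are assumptions made for illustration.

```python
def potts_energy(sequence, fields, couplings):
    """Evaluate a Potts energy for one sequence.

    sequence:  list of integer states, one per site
    fields:    fields[i][a] is the field h_i(a)
    couplings: couplings[(i, j)][a][b] is J_ij(a, b), stored only for i < j
    """
    energy = 0.0
    length = len(sequence)
    # Single-site contributions.
    for i in range(length):
        energy -= fields[i][sequence[i]]
    # Pairwise contributions, each pair counted once.
    for i in range(length):
        for j in range(i + 1, length):
            energy -= couplings[(i, j)][sequence[i]][sequence[j]]
    return energy


# Hypothetical toy model: 2 sites, 2 states per site.
toy_fields = [[0.5, 0.0], [0.0, 1.0]]
toy_couplings = {(0, 1): [[1.0, 0.0], [0.0, 2.0]]}

energy_01 = potts_energy([0, 1], toy_fields, toy_couplings)
energy_11 = potts_energy([1, 1], toy_fields, toy_couplings)
```

Lower energy corresponds to higher probability under the model, since P(s) is proportional to exp(-E(s)).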

6. Language Models

  • bio_embeddings
  • Exploratory code to benchmark random embeddings for protein property prediction

7. Structure-based Models

  • py3Dmol for visualizing structures in Colab
  • Distance matrix, orientograms from trRosetta
  • Molecular dynamics with OpenMM
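
A distance matrix of the kind mentioned above is simply the table of pairwise Euclidean distances between residue coordinates. The following sketch computes one with the standard library only; the function name `distance_matrix` and the toy coordinates are hypothetical, and real inputs would usually be C-alpha or C-beta positions parsed from a structure file.

```python
import math


def distance_matrix(coords):
    """Return the symmetric N x N matrix of pairwise Euclidean distances."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Hypothetical coordinates (in angstroms) for three residues.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
dist = distance_matrix(toy_coords)
```

Thresholding such a matrix (8 angstroms is a common choice) yields a binary contact map.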


Accompanying slides are available for notebooks 1, 2 and 3. Slides 1-28 can be delivered in about 2 hours.