ProteinSolver

Description

ProteinSolver is a deep neural network which learns to solve (ill-defined) constraint satisfaction problems (CSPs) from training data. It has shown promising results both on a toy problem of learning how to solve Sudoku puzzles and on a real-world problem of designing protein sequences that fold into a predetermined geometric shape.

Demo notebooks

The following notebooks can be used to explore the basic functionality of proteinsolver.

Notebook name               MyBinder  Description
20_sudoku_demo.ipynb        binder    Use a pre-trained network to solve a single Sudoku puzzle.
06_sudoku_analysis.ipynb    binder    Evaluate a network trained to solve Sudoku puzzles using the validation and test datasets. (This notebook is resource-intensive and is best run on a machine with a GPU.)
20_protein_demo.ipynb       binder    Use a pre-trained network to design sequences for a single protein geometry.
06_protein_analysis.ipynb   binder    Evaluate a network trained to reconstruct protein sequences using the validation and test datasets. (This notebook is resource-intensive and is best run on a machine with a GPU.)

Other notebooks in the notebooks/ directory show how to perform more extensive validations of the networks and how to train new networks.

Installation

We recommend installing proteinsolver into a clean conda environment using the following commands:

conda create -n proteinsolver -c pytorch -c conda-forge -c kimlab -c ostrokach-forge proteinsolver
conda activate proteinsolver
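
After activating the environment, a quick import check can confirm that the package is visible to Python. This is only a minimal sketch: it assumes the importable module is named proteinsolver, like the conda package, and that the python on your PATH belongs to the activated environment. The import_status variable is a hypothetical name used for illustration.

```shell
# Try to import the package; suppress the traceback if the import fails.
# Assumption: the module name matches the conda package name (proteinsolver).
if python -c "import proteinsolver" >/dev/null 2>&1; then
    import_status="ok"
else
    import_status="failed"
fi

echo "proteinsolver import: $import_status"
```

If the check reports "failed", the most likely cause is that the proteinsolver environment is not the active one.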

Development

First, use conda to install proteinsolver into a new conda environment. This will also install all dependencies.

conda create -n proteinsolver -c pytorch -c conda-forge -c kimlab -c ostrokach-forge proteinsolver
conda activate proteinsolver

Second, run pip install --editable . inside the root directory of this package. This will force Python to use the development version of our code.

cd path/to/proteinsolver
pip install --editable .

Pre-trained models

Pre-trained models can be downloaded using gsutil, by running the following command in the root folder of the proteinsolver repository:

gsutil rsync -r gs://proteinsolver/v0.1/ ./
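
gsutil is distributed separately from proteinsolver (for example, as part of the Google Cloud SDK), so it may be worth checking that it is on your PATH before running the download. The snippet below is a small sketch of such a check; gsutil_status is a hypothetical variable name.

```shell
# Check whether the gsutil executable can be found on PATH.
if command -v gsutil >/dev/null 2>&1; then
    gsutil_status="available"
else
    gsutil_status="missing"
fi

echo "gsutil is $gsutil_status"
```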

Training and validation datasets

Data used to train and validate the “proteinsolver” network to solve Sudoku puzzles and reconstruct protein sequences can be downloaded using gsutil. The DATAPKG_DATA_DIR environment variable should be set to the folder containing the downloaded files.

gsutil rsync -r gs://deep-protein-gen/ ./

Environment variables

  • DATAPKG_DATA_DIR - Location of training and validation data.
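
As a rough sketch of how the variable might be set, the commands below create a download folder, switch into it, and export DATAPKG_DATA_DIR so that it points there. The folder name proteinsolver-data is a hypothetical choice, not something the project prescribes. The gsutil rsync command from the previous section would be run from inside this folder.

```shell
# Hypothetical folder for the downloaded datasets; pick any location you like.
mkdir -p proteinsolver-data
cd proteinsolver-data

# (Run the gsutil rsync command shown above here to download the data into ./)

# Point DATAPKG_DATA_DIR at the folder that holds the downloaded files.
export DATAPKG_DATA_DIR="$(pwd)"

echo "DATAPKG_DATA_DIR=$DATAPKG_DATA_DIR"
```

The export only lasts for the current shell session. To make it persistent, the same export line can be added to your shell startup file with an absolute path.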

Acknowledgements

References

  • Alexey Strokach, David Becerra, Carles Corbi, Albert Perez-Riba, Philip M. Kim. Designing real novel proteins using deep graph neural networks. https://doi.org/10.1101/868935