Commit b799dade authored by Vincent Mallet's avatar Vincent Mallet

expanded readme and packaged into a pipable repo

parent 8487d3a9
Copyright (c) 2021 Institut Pasteur. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that
the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following
disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The graphs are annotated with graph, node, and edge-level attributes.
## Package Structure
* `/RNAGlib/prepare_data/`: processes raw PDB structures and builds a database of 2.5D graphs with full structural annotation
* `/RNAGlib/data_loading`: custom PyTorch dataloader implementations
* `/RNAGlib/models`: pre-built GCN models
* `/RNAGlib/learning`: learning routines for the easiest use of the package
* `/RNAGlib/drawing`: utilities for visualizing 2.5D graphs
* `/RNAGlib/ged`: custom graph similarity functions
* `/RNAGlib/kernels`: custom local neighbourhood similarity functions
## Data scheme
A more detailed description of the data is presented in `/RNAGlib/prepare_data/README`.
It comes along with instructions on how to produce the data from public databases.
However, since this construction is computationally expensive, we offer a
pre-built database.
We provide a visualization of what the graphs in this database contain:
![Example graph](images/Fig1.png)
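To make the annotation scheme concrete, here is a toy sketch of the kind of information a 2.5D graph carries. The node attributes (`nt_code`, `binding_protein`) and the edge labels (`B53` for backbone, `CWW` for a canonical Watson-Crick pair) come from the examples in this README; the `LW` attribute name, the node identifiers, and the three-nucleotide fragment itself are invented for illustration, not real data:

```python
# Toy 2.5D-graph sketch: plain dicts standing in for an annotated graph.
# Attribute names other than nt_code/binding_protein are hypothetical.
nodes = {
    "1abc.A.1": {"nt_code": "G", "binding_protein": False},
    "1abc.A.2": {"nt_code": "C", "binding_protein": True},
    "1abc.A.3": {"nt_code": "A", "binding_protein": False},
}
edges = {
    ("1abc.A.1", "1abc.A.2"): {"LW": "B53"},  # backbone link
    ("1abc.A.1", "1abc.A.3"): {"LW": "CWW"},  # canonical Watson-Crick pair
}

# Separating backbone edges from base-pairing edges by their label:
backbone = [e for e, attrs in edges.items() if attrs["LW"] == "B53"]
print(len(nodes), len(backbone))
```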
## Installation
### Code
The package can be cloned and the source code used directly.
We also deploy it as a pip package and recommend using conda environments.
To install it, just run:
```
pip install rnaglib
```
Then one can start using the package's functionality by importing it in a Python script.
### Data
To perform machine learning, one needs RNA data.
The instructions to produce this data are presented in `prepare_data`.
However, we offer two ways to directly access pre-built databases:
* Direct download at http://rnaglib.cs.mcgill.ca/static/datasets/iguana
* In-code download: if one instantiates a dataloader and the data cannot be found, the corresponding data will be automatically downloaded and cached

Thanks to this second option, after installing our tool with pip,
one can start learning on RNA data without extra steps.
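The download-and-cache behaviour described above can be sketched generically: check a local path first, fetch only on a miss, then serve from the cache. This is a plain-Python illustration of the pattern, not rnaglib's actual implementation (`get_dataset` and `fake_fetch` are hypothetical names; the real package fetches over HTTP inside its dataloader):

```python
import os
import tempfile

def get_dataset(path, fetch):
    """Return the dataset at `path`, calling `fetch` to download it first
    if it is not already cached (generic sketch of the dataloader's
    behaviour -- not rnaglib's actual code)."""
    if not os.path.exists(path):
        fetch(path)  # e.g. an HTTP download in the real package
    with open(path) as f:
        return f.read()

calls = []
def fake_fetch(path):
    # Stand-in for a network download; records how often it is invoked.
    calls.append(path)
    with open(path, "w") as f:
        f.write("graphs")

cache = os.path.join(tempfile.mkdtemp(), "dataset.txt")
first = get_dataset(cache, fake_fetch)   # cache miss: triggers the "download"
second = get_dataset(cache, fake_fetch)  # cache hit: no second download
print(first, second, len(calls))
```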
## Example usage
To provide the user with a hands-on tutorial, we offer two example learning pipelines
in `RNAGlib/examples`.
If you installed the package with pip, just run:
```
rnaglib_first
rnaglib_second
```
Otherwise, after cloning the repository, run:
```
cd examples
python first_example.py
python second_example.py
```
You should see data getting downloaded and networks being trained.
The first example is a basic supervised model trained to predict
protein-binding nucleotides.
The second one starts with an unsupervised phase that pretrains
the network, and then performs the same supervised training in a principled way,
suitable for benchmarking its performance.
This simple code was used to produce the benchmark values
presented in the paper.
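Benchmarking requires a fixed, reproducible train/test split so that every run scores against the same held-out graphs (the second example obtains the official split via `evaluate.get_task_split`). A generic way to get such a split is to hash stable graph identifiers, as in this sketch; it is an illustration of the idea only, not the split rnaglib actually ships:

```python
import hashlib

def deterministic_split(graph_ids, test_fraction=0.2):
    """Split ids into train/test deterministically: the same ids always
    land in the same partition, independent of run order or machine.
    Illustration only -- the package provides its own official split."""
    train, test = [], []
    for gid in sorted(graph_ids):
        # md5 of the id gives a stable pseudo-random bucket in [0, 100)
        bucket = int(hashlib.md5(gid.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(gid)
    return train, test

ids = [f"graph_{i}" for i in range(50)]
train, test = deterministic_split(ids)
assert deterministic_split(ids) == (train, test)  # reproducible across calls
print(len(train), len(test))
```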
## Visualization
To visualize the graphs in the format described above, we provide drawing utilities in `/RNAGlib/drawing`.
## Associated Repositories
* [VeRNAl](https://github.com/cgoliver/vernal)
* [RNAMigos](https://github.com/cgoliver/RNAmigos)
## References
1. Leontis, N. B., & Zirbel, C. L. (2012). Nonredundant 3D Structure Datasets for RNA Knowledge Extraction and Benchmarking. In N. Leontis & E. Westhof (Eds.), RNA 3D Structure Analysis and Prediction (Vol. 27, pp. 281–298). Springer Berlin Heidelberg. doi:10.1007/978-3-642-25740-7_13
script_dir = os.path.dirname(os.path.realpath(__file__))
if __name__ == "__main__":
    sys.path.append(os.path.join(script_dir, '..', '..'))

from rnaglib.data_loading import loader, get_all_labels
from rnaglib.learning import learn
"""
This script is to be used for reproducible benchmarking: we propose an official train/test split
......
#!/usr/bin/env python3
import torch

from rnaglib.learning import models, learn
from rnaglib.data_loading import loader

"""
This script shows a first very basic example: learn binding-protein preferences
from the nucleotide types and the graph structure.
To do so, we choose our data, create a data loader around it, build an RGCN model and train it.
"""

# Choose the data, features and targets to use, and get the data going
node_features = ['nt_code']
node_target = ['binding_protein']
supervised_dataset = loader.SupervisedDataset(node_features=node_features,
                                              node_target=node_target)
train_loader, validation_loader, test_loader = loader.Loader(dataset=supervised_dataset).get_data()

# Define a model: we first embed our data in 10 dimensions, then add one classification layer
input_dim, target_dim = supervised_dataset.input_dim, supervised_dataset.output_dim
embedder_model = models.Embedder(dims=[10, 10], infeatures_dim=input_dim)
classifier_model = models.Classifier(embedder=embedder_model, classif_dims=[target_dim])

# Finally, get the training going
optimizer = torch.optim.Adam(classifier_model.parameters(), lr=0.001)
learn.train_supervised(model=classifier_model,
                       optimizer=optimizer,
                       train_loader=train_loader)
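The embedder/classifier split used above can be sketched abstractly: the embedder maps node features to a latent space, and the classifier is just a head composed on top of it, which is why a pretrained embedder can be reused (as the second example does). This is a plain-Python sketch of that composition, not rnaglib's torch modules; the scaling "embedding" and threshold "head" are invented stand-ins:

```python
# Plain-Python sketch of the embedder/classifier composition.
def make_embedder(weight):
    # Toy "embedding": scale each input feature (stand-in for an RGCN).
    return lambda features: [weight * x for x in features]

def make_classifier(embedder, threshold):
    # A prediction head composed on top of a (possibly pretrained) embedder.
    return lambda features: sum(embedder(features)) > threshold

embedder = make_embedder(weight=2.0)       # could be "pretrained" elsewhere
classifier = make_classifier(embedder, threshold=3.0)
print(classifier([1.0, 1.0]))
```

Because the classifier only holds a reference to the embedder, swapping in a pretrained embedder changes the predictions without touching the head.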
#!/usr/bin/env python3
import torch

from rnaglib.learning import models, learn
from rnaglib.data_loading import loader
from rnaglib.benchmark import evaluate
from rnaglib.kernels import node_sim

"""
This script shows a second, more complicated example: learn binding-protein preferences
as well as small-molecule binding from the nucleotide types and the graph structure.
We also add a pretraining phase based on the R_graphlets kernel.
"""

# Choose the data, features and targets to use
node_features = ['nt_code']
node_target = ['binding_protein']

###### Unsupervised phase ######
# Choose the data and kernel to use for pretraining
print('Starting to pretrain the network')
node_sim_func = node_sim.SimFunctionNode(method='R_graphlets', depth=2)
unsupervised_dataset = loader.UnsupervisedDataset(node_simfunc=node_sim_func,
                                                  node_features=node_features)
train_loader = loader.Loader(dataset=unsupervised_dataset, split=False,
                             num_workers=0, max_size_kernel=100).get_data()

# Then choose the embedder model and pretrain it; optionally dump a version of this pretrained model
embedder_model = models.Embedder(infeatures_dim=unsupervised_dataset.input_dim,
                                 dims=[64, 64])
optimizer = torch.optim.Adam(embedder_model.parameters())
learn.pretrain_unsupervised(model=embedder_model,
                            optimizer=optimizer,
                            train_loader=train_loader,
                            learning_routine=learn.LearningRoutine(num_epochs=10),
                            rec_params={"similarity": True, "normalize": False, "use_graph": True, "hops": 2})
# torch.save(embedder_model.state_dict(), 'pretrained_model.pth')
print()

###### Supervised phase ######
print('We have finished pretraining the network, let us fine-tune it')
# Get the data going; we want to use precise data splits to be able to use the benchmark
train_split, test_split = evaluate.get_task_split(node_target=node_target)
supervised_train_dataset = loader.SupervisedDataset(node_features=node_features,
                                                    redundancy='all_graphs',
                                                    node_target=node_target,
                                                    all_graphs=train_split)
train_loader = loader.Loader(dataset=supervised_train_dataset, split=False).get_data()

# Define a model and train it:
# we first embed our data in 64 dimensions using the pretrained embedder, then add one classification layer
classifier_model = models.Classifier(embedder=embedder_model, classif_dims=[supervised_train_dataset.output_dim])
optimizer = torch.optim.Adam(classifier_model.parameters(), lr=0.001)
learn.train_supervised(model=classifier_model,
                       optimizer=optimizer,
                       train_loader=train_loader,
                       learning_routine=learn.LearningRoutine(num_epochs=10))
# torch.save(classifier_model.state_dict(), 'final_model.pth')

# To reload a saved model instead of retraining:
# embedder_model = models.Embedder(infeatures_dim=4, dims=[64, 64])
# classifier_model = models.Classifier(embedder=embedder_model, classif_dims=[1])
# classifier_model.load_state_dict(torch.load('final_model.pth'))

# Get a benchmark performance on the official uncontaminated test set
metric = evaluate.get_performance(node_target=node_target, node_features=node_features, model=classifier_model)
print('We get a performance of:', metric)
print()
script_dir = os.path.dirname(os.path.realpath(__file__))
if __name__ == "__main__":
    sys.path.append(os.path.join(script_dir, '..', '..'))

from rnaglib.config.graph_keys import EDGE_MAP_RGLIB
s = """
,CHH,TWH,CWW,THS,CWS,CSS,CWH,CHS,TWS,TSS,TWW,THH,B53
......