Commit 88589318 authored by Roman SARRAZIN GENDRON

fixed conflict

parents 4758d8d8 024fa970
......@@ -7,4 +7,4 @@
*.png
.DS*
*checkpoint*
bayespairing/models/
*_models*
......@@ -11,7 +11,7 @@ This package includes tools for:
NOTE: The user-friendly API is currently being built and tested. The sample usage described below is still being improved.
### BayesPairing2 was submitted to RECOMB 2020
### BayesPairing2 was accepted to RECOMB 2020
The article is available on bioRxiv:
https://www.biorxiv.org/content/10.1101/834762v1.abstract
......@@ -19,17 +19,18 @@ https://www.biorxiv.org/content/10.1101/834762v1.abstract
## Requirements
* Python 3.6+
* Networkx 2.1+
* RNAlib
* BioPython
* matplotlib
* weblogo
* wrapt
* Infrared (only required for training new models)
* htd (only required for training new models)
* anytree
* pgmpy (included)
* Python 3.6+
* Python modules (installed by our installer)
* Networkx 2.1+
* BioPython
* matplotlib
* weblogo
* wrapt
* anytree
* pgmpy (included)
* RNAlib (install with conda: https://anaconda.org/bioconda/viennarna)
* LaTeX (optional)
## Installing
......@@ -40,7 +41,24 @@ cd rnabayespairing2
pip install .
```
``pip install .`` in this directory will install the required python libraries, but you will need to install RNAlib, Infrared and htd and have them in your PATH to use BayesPairing2.
``pip install .`` in this directory will install the required Python libraries, but you will need to install RNAlib separately and have it in your PATH to use BayesPairing2. You can install RNAlib via conda: https://anaconda.org/bioconda/viennarna
## Provided datasets
BayesPairing2 comes with three pre-assembled datasets you can immediately start searching sequences with. To search with a specific dataset, use the ``-d`` option with the name of the dataset.
* ``3DmotifAtlas_RELIABLE``: A subset of 60 modules from the RNA 3D Motif Atlas with the highest number of occurrences and highest sequence variation. We are confident in the prediction of those modules given the high-quality data available to train them. This is the default dataset.
* ``3DmotifAtlas_ALL``: A dataset containing all the modules we were able to convert from the 3D Motif Atlas to BayesPairing2 models (426). Some of those had only one occurrence and/or may have been trained on limited or incomplete data.
* ``rna3dmotif``: A dataset containing the 75 most recurrent modules as identified via an exhaustive search of loops in the full PDB database with rna3dmotif.
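
As a rough illustration of how the ``-d`` option fits into a call, the sketch below assembles an argument list for ``parse_sequences.py`` from Python. The helper ``build_command`` is hypothetical and not part of BayesPairing2; only the flags ``-seq``, ``-d``, ``-t`` and ``-samplesize`` and the three dataset names are taken from this README.

```python
# Hypothetical helper, NOT part of BayesPairing2: it only builds the
# command line described in this README and does not run anything.
DATASETS = ("3DmotifAtlas_RELIABLE", "3DmotifAtlas_ALL", "rna3dmotif")


def build_command(seq, dataset="3DmotifAtlas_RELIABLE", threshold=None, samplesize=None):
    """Return an argv list for parse_sequences.py with the chosen dataset."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    cmd = ["python3", "parse_sequences.py", "-seq", seq, "-d", dataset]
    if threshold is not None:
        cmd += ["-t", str(threshold)]  # score threshold
    if samplesize is not None:
        cmd += ["-samplesize", str(samplesize)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(" ".join(build_command("UUUUUUAAGGAAGAUC", dataset="rna3dmotif", threshold=4)))
```

The resulting list could be handed to ``subprocess.run`` with the working directory set to ``bayespairing/src``, assuming the package and RNAlib are installed.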
#### Interpreting dataset-specific output
BayesPairing2 returns results by index, where the indexes correspond to modules in the relevant database. The rna3dmotif modules are described by graphs and sequence logos found in ``bayespairing/DBData/rna3dmotif``.
The 3D Motif Atlas modules correspond to entries in that database. The correspondences between the indexes of the two BayesPairing2 Atlas databases and the online 3D Motif Atlas database (with a link to each relevant model) are found in the file ``bayespairing/DBData/3DmotifAtlas/3DmotifAtlas_info.csv``.
In this CSV file, you can observe that more than one module sometimes maps to the same entry of the Atlas. This is because the 3D Motif Atlas modules are clustered in 3D, and it is sometimes not possible to represent all occurrences accurately with the same graph, so they must be searched separately.
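
One way to inspect that many-to-one mapping is to group the rows of the CSV by whichever column holds the Atlas entry. This is only a sketch: it assumes the file has a header row, and since the column layout is not described here, the column name is passed in by the caller rather than hard-coded. The function names are illustrative.

```python
import csv
from collections import defaultdict


def load_rows(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def group_by_column(rows, column):
    """Group rows by the value found in `column` (name supplied by the caller)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


def shared_entries(rows, column):
    """Return only the groups where several rows share the same value."""
    return {key: grp for key, grp in group_by_column(rows, column).items() if len(grp) > 1}
```

Calling ``load_rows`` on ``3DmotifAtlas_info.csv`` and printing ``rows[0].keys()`` would show the actual column names to use with ``shared_entries``.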
## Identifying 3D modules in a sequence
......@@ -66,17 +84,17 @@ The core function of the BayesPairing package is ``parse_sequences.py``. It take
* ``--init`` To reset all trained models on the modules included in -mod
* ``-o O`` Name of the output file. Default: output
#### Example: searching for motifs on a TPP riboswitch sequence
#### Example: searching for motifs on a TPP riboswitch sequence
The scripts described in this section should be run from the ``bayespairing/src`` directory.
The first time you use BayesPairing with a full dataset, it will train all its models before searching a sequence. Those models will not need to be trained again. If you want to reset those models, you can use the ``init`` option. The default dataset available on this repository comes with 75 pre-loaded models.
The first time you use BayesPairing with a full dataset, it will train all its models before searching a sequence. Those models will not need to be trained again. If you want to reset those models, you can use the ``--init`` option. With the ``-d`` option, we are using the rna3dmotif dataset, which includes 75 pre-trained modules.
``python3 parse_sequences.py -seq "UUUUUUAAGGAAGAUCUGGCCUUCCCACAAGGGAAGGCCAAAGAAUUUCCUU" -samplesize 1000``
``python3 parse_sequences.py -seq "UUUUUUAAGGAAGAUCUGGCCUUCCCACAAGGGAAGGCCAAAGAAUUUCCUU" -samplesize 1000 -d rna3dmotif``
The output is very large, so we can raise the threshold to have a better idea of the dominating modules.
``python3 parse_sequences.py -seq "UUUUUUAAGGAAGAUCUGGCCUUCCCACAAGGGAAGGCCAAAGAAUUUCCUU" -samplesize 1000 -t 4 ``
``python3 parse_sequences.py -seq "UUUUUUAAGGAAGAUCUGGCCUUCCCACAAGGGAAGGCCAAAGAAUUUCCUU" -samplesize 1000 -t 4 -d rna3dmotif``
```
=========================================================================================
......@@ -93,7 +111,7 @@ TOTAL TIME: 5.775
If we have a secondary structure, the search is much faster:
``python3 parse_sequences.py -seq "UUUUUUAAGGAAGAUCUGGCCUUCCCACAAGGGAAGGCCAAAGAAUUUCCUU" -t 4 -ss "......(((((((.((((((((((((....)))))))))..))).)))))))"``
``python3 parse_sequences.py -seq "UUUUUUAAGGAAGAUCUGGCCUUCCCACAAGGGAAGGCCAAAGAAUUUCCUU" -t 4 -d rna3dmotif -ss "......(((((((.((((((((((((....)))))))))..))).)))))))"``
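
Before passing a structure with ``-ss``, it can be worth checking that it lines up with the sequence. The sketch below assumes the structure is plain dot-bracket notation using only ``(``, ``)`` and ``.``, as in the example above; whether BayesPairing2 accepts other bracket types is not stated here. The function name ``check_dot_bracket`` is illustrative and not part of the package.

```python
def check_dot_bracket(seq, structure):
    """Return True if `structure` is balanced dot-bracket of the same length as `seq`."""
    if len(seq) != len(structure):
        return False
    depth = 0
    for char in structure:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # closing bracket with no matching opening bracket
                return False
        elif char != ".":  # only '(', ')' and '.' are handled in this sketch
            return False
    return depth == 0  # every opening bracket must have been closed


if __name__ == "__main__":
    print(check_dot_bracket("GGGAAACCC", "(((...)))"))
```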
```
=========================================================================================
......@@ -110,17 +128,14 @@ TOTAL TIME: 2.581
To see which module a given module ID corresponds to, we can generate graphs and sequence logos for all modules and store them in the Graphs directory.
``python3 display_modules.py -n "default"``
``python3 display_modules.py -n "rna3dmotif"``
![](bayespairing/Graphs/default_logo28.png)
![](bayespairing/Graphs/default_graph28.png)
![](bayespairing/DBData/rna3dmotif/default_logo28.png)
![](bayespairing/DBData/rna3dmotif/default_graph28.png)
Module 28 is a hairpin with a trans sugar-Hoogsteen non-canonical base pair, as well as a stacking interaction.
### Searching for motifs from the RNA 3D Motif Atlas
This feature is under development
As for the RNA 3D Motif Atlas datasets, you can either use the website links in the file ``bayespairing/DBData/3DmotifAtlas/3DmotifAtlas_info.csv`` or generate the graphs and logos with ``display_modules.py``.
### Building your own dataset
......@@ -130,4 +145,4 @@ For building new datasets, full dataset cross-validation, results presented in t
### Contact
Roman Sarrazin-Gendron
roman.sarrazingendron@mail.mcgill.ca
\ No newline at end of file
roman.sarrazingendron@mail.mcgill.ca