Commit c8c6c66d authored by Roman Sarrazin-Gendron's avatar Roman Sarrazin-Gendron
Browse files

Added TeX and methods

parent e447af6b
% Formatting
% Fonts
% Graphics
% Bibliography
\usepackage[comma,sort&compress, numbers]{natbib}
% Spacing
% Misc
% custom commands
Recurrent 3D interactions networks can be learned from any dataset of RNA structure data. When representing structures as graphs, those interaction networks manifest themselves as common subgraphs between structures. Many different methods for extracting meaningful networks have been published. In this project, we present data from two recent published datasets, CARNAVAL(date) and rna3dmotif(2008).
\subsubsection{Using CARNAVAL data}
CARNAVAL takes as input the full PDB database of RNA structures and outputs a set of recurrent interaction networks (RINs), small graphs of secondary and tertiary structure interactions. Different examples of a same RIN can have different sequences, and all RINs include at least one long range interaction. The dataset we constituted with CARNAVAL for this project includes 220 RINs obtained from all RNA structures in the PDB database as of date, and every example of each RIN, as a small annotated graph.
Rna3dmotif uses a graph similarity approach to extract RNA tertiary structures, and published a large database of experimentally resolved motifs. There is one example of each motif in the dataset. Because our approach is structural and probabilistic, we are looking to predict a base pair pattern based on the sequences thus, for the needs of this project, motifs with identical base pair patterns (same type and partner) but different sequences are considered different examples of the same "3D module". To account for this, the rna3dmotif dataset was reorganized from number unique sequence and structure motifs to number structural motifs, with several examples of each.
\subsection{Learning a Bayesian Network from structural module data}
\subsubsection{Module graphs to Bayes Nets}
To model the dependence relationship between the members of a base pair, each 3D module graph can be represented as a Bayesian network. In such network, a node represents a nucleotide, and an edge, a base pair (canonical or non canonical). Bayes Nets require directed edges and the absence of cycles so, for mathematical reasons, base pairs are modeled as directed edges from the lower node number to the higher node number. The probability distribution of each node is learned from data, and has a cardinality of $4^n$ , where n is the number of parents of the node.The Bayesian networks were implement with pgmpy, a python library for probabilistic models.
\subsubsection{Learning node and edge parameters}
Because we defined 3D module as a set of examples that share the exact same base pairs, all examples of a same module have the same edges. The node values (nucleotides) are learned from the examples. For a node with no parent, the examples are used to generate the statistic $P(nucleotide)$, which is identical to a position-weight matrix. For a node with a parent, that statistic is $P(nucleotide|parent)$, i.e. the data is parsed for examples of a node in the cases where the parent is a specific nucleotide. This process is repeated until each node conditional probability distribution (CPD) is established. 0.1 example of each nucleotide were artificially added at each sampling step as a correction for potentially missing data.
\subsubsection{Leveraging dependence relationships}
When evaluating a candidate set of node values (or module sequence), the properties of the Bayes net can be used to compute a probabilistic score based on $P(module|nucleotides)$, which is defined as the product of the probability of each node taking the value of the nucleotide assigned to it by the input given its parents, obtained by inference. This probability score is then divided by the probability of a null model ($0.25^4$) to normalize by size, and then logged to output a readable score between 10 and 40. With the objective of generating an output for a large number of sequences in reasonable time, performing inference on each input sequence is not feasible, hence the probability of all possible sets of nucleotides for each module is approximated by sampling each Bayes net 10,000 times, and reporting the fraction of occurrences as the relative probability. The sampling process is very slow but only has to be performed once.
\subsection{Scanning input sequences for candidate module sites}
The software tool we present can take as input any valid RNA sequence of length 0 to 300. Longer sequences can also be analyzed but require prior knowledge of their secondary structure because of time and accuracy considerations (see lower)
\subsubsection{Defining the most likely motif candidate as a regular expression}
After sampling all possible nucleotide composition for a module Bayes net, the best 10 structures are extracted. All candidates amongst those that have probability higher than 5\% of the maximum likelihood structure are included in the following computation. For each position, the sum of the occurrences of each nucleotide weighted by probability of the structure is computed, and all nucleotides occurring at a rate higher 25\% at each position are added to the "optimal structure".
\subsubsection{Scanning the sequence with a flexible regular expression}
This "optimal structure" is then converted to a regular expression. Because the nucleotides are not consecutive, distance between different components must be established. For distant components (distance > 5 in all examples) maximum distance is obtained from the maximum distance observed in the examples, and the minimum distance is set to 4. For close components which are consistent in distance among examples, it can be assume there is a biological meaning to this distance, so the distance is set to the distance observed in the data.
\subsubsection{Obtaining candidates}
Because we are presenting a framework to assign a probability score to candidates, it is important to adapt the regular expression to gather as many reasonable candidates as possible. For this reason, the input sequence is scanned multiple times by a regex while increasing the number of allowed substitutions, from zero to 40\% of the length of the module. Moreover, random subsequences of lengths between 0.3 and 0.6 are also scanned.
\subsection{Evaluating the probability of the presence of the module at a candidate site}
\subsubsection{Probability score}
The 10 best candidates obtained with the regular expression parsing are evaluated with the $P(module|nucleotides)$ score previously presented, which is sometimes sufficient, but does not account for secondary structure compatibility. Indeed, there are situations in which a set of position have the perfect subsequence to fold into a specific module, but doing so would require a very suboptimal secondary structure. To account for that, the probability score is corrected by a secondary structure compatibility evaluation.
\subsubsection{Modeling motif insertion as secondary structure constraints}
The 3D modules described in the context of this project can have secondary and tertiary structure base pairs. In terms of secondary structure, this implies that all nodes of every module provide information about being part of a canonical base pair, or the location of an absence of canonical base pair. This situation can easily be modeled by RNAfold hard constraints. The candidate sequence is folded first with RNAfold without constraints, and then folded again with constraints ("(",")" for base paired positions, "x" for unpaired positions) at the positions of the candidate module emplacement. We use the method presented by RMDetect(ref) to assign an energy score to the quotient of the unconstrained and constrained energy.
\subsubsection{Final score and output}
This score is then logged and added to the $P(module|nucleotides)$ score, the idea being that if the secondary structure allows for the insertion of the 3D module at the candidate positions, the constrained and unconstrained folds will be similar and the score will not be significantly changed, whereas unfavorable folds will penalize the probabilistic scores proportionally to how unfavored they are. After this correction, the candidate positions with the top 4 scores are outputted to the user.
\subsubsection{Test sets from PDB}
Both CARNAVAL and rna3dmotif(see Datasets) learn their modules from PDB graphs, and each example is taken from a PDB structure. The test set we used to evaluate the software was constituted from the sequences of those PDB structures as reported in the PDB file. We performed leave-one-out cross-validation on those PDB sequences by excluding the corresponding chain from the data, learning and Sampling the Bayes net for each input.
\subsection{Longer sequences}
Because of the small alphabet, regular expressions do not perform well on longer sequences. Moreover, our methods require multiple folding steps for each sequence which strongly decreases efficiency and/or accuracy depending on the technique used. Longer sequences can still be parsed by the software we present, but with the secondary structure. If the user does not provide a secondary structure, it can be computed by the program (using RNAfold). Each helix of length under 400 is then processed as a separate sequence with a pre-established secondary structure. Module candidates are outputted in terms of their position in the initial (long) sequence.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment