Commit e4e9a187 authored by yannponty's avatar yannponty
Browse files

Merge branch 'master' of jwgitlab.cs.mcgill.ca:vreinharz/arnhack

Conflicts:
	TeX/NAR/main_NAR.tex

+ Twocolumn FIX !!!!
parents 96f4a373 8a2ba6d7
......@@ -70,7 +70,7 @@
\ExecuteOptions{custompaper,twoside,crop,reqno,proof}
\ProcessOptions
\RequirePackage{crop,inputenc,amsmath,amssymb,latexsym,
float,epsfig,wrapfig,graphicx,stfloats,
float,epsfig,wrapfig,graphicx,%stfloats,
multicol,rotating,multirow,dcolumn,
boxedminipage,makeidx,soul,url,xspace,times}
......
\documentclass[letterpaper,12pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{epsf}
\usepackage{amssymb} % AMS symbols
\usepackage{latexsym} % Old Latex symbols
\usepackage{epsfig}
\usepackage{graphics}
\usepackage{xspace}
\usepackage{url}
\usepackage{todonotes}
\renewcommand{\em}{\it}
\parindent=0in
\renewcommand{\thepage}{ } \pagestyle{empty}
\begin{document}
\includegraphics[width=0.3\linewidth]{mcgill_logo}\\
\hrule
\vspace{1em}
\hspace{11cm}
\parbox{2.5in}
{\small J\'er\^ome Waldisp\"{u}hl, Ph.D.\\
School of Computer Science\\
McGill Centre for Bioinformatics\\
McGill University\\
{\tt jeromew@cs.mcgill.ca} \\
\today}
\vskip 0.3in
Dear Dr. Alan Kimmel,
\vskip 0.3in
We thank you for giving us an opportunity to improve our manuscript. We do agree with the reviewer that additional validations will improve the impact and significance of our work.\\
We thank the reviewer for pointing at us novel mutate-and-map experiments that have been released after the initial submission of our manuscript. In this revision, we include all new mutate-and-map experiments for which we could (i) identify evolutionary conserved intermolecular interactions, (ii) find an experimentally determined structure in the PDB, and (iii) obtain an alignment from the Rfam family. We provide below a list of all experiments included in our updated benchmark, and a justification for the ones that we could not included inside.
\section*{Molecules included in the benchmark}
\subsubsection*{5S rRNA}
This molecule was already included in the benchmark of our original manuscript.
\subsubsection*{c-di-GMP Riboswitch}
This molecule was already included in the benchmark of our original manuscript.
\subsubsection*{Adenine Riboswitch (New)}
Initially, we reported the molecule in the supplementary material because only one nucleotide was found to interact with another molecule. However, we discovered that the automated annotation, originally done with a Pymol script, was incomplete.\\
We manually revisited the PDB file associated with this molecule (1Y26), and identified 9 additional interacting positions. In light of this finding, we decided to move this molecule into the main body of our manuscript.\\
The mutate-and-map repository has 3 different experiments on this molecule (named adenine\_2, adenine\_3, and adenine\_4, following the mutate-and-map annotation). However, only the first two experiments were conducted with the ligand, the last one does not have it. Strikingly, our methods perform well on the two first mutate-and-map experiments (with the ligand), but fail to identify a clear signal on the third one (without the ligand). This observation brings additional support to our approach. Indeed, in the absence of the ligand, the structure differs from the ligand-bound conformation. The mutate-and-map mutations altering the structure of the non-bound conformation are not necessarily associated with changes of the ligand-bound structure, and the evolutionary information calculated with our algorithms cannot be related to binding sites of the ligand-bound structure.
\subsubsection*{tRNA phenylalanine (New)}
The PDB contains two experimental structures of this RNA in the yeast 80S ribosome- tRNA complexes (PDB identifier 3J78), and Rfam provides a single alignment for \emph{all} tRNAs (RF00005). We clearly identified 2 conserved regions of interactions in the literature. The anticodon and the T-$\Psi$-C-G motif binding the 5S RNA in the 50S ribosomal subunit. We thus included these two sites in the revised benchmark. We present and discuss the results separately.\\
A couple of other nucleotides are also eventually implicated in interactions with other molecules (positions 1, 19 and 73-76). However, these positions seem to be weakly evolutionary conserved. In fact, interacting positions vary between the two copies of the tRNA in the PDB structure of the 80S ribosome. Our methods are not designed to work in such cases. For this reason, we did not included these interaction sites in our benchmark. Although, for the sake of transparency, we included the results on these sites in the supplementary material.
\subsubsection*{Cobalamin riboswitch (New)}
Also referred as RNAPuzzle 6 in the mutate-and-map repository, this molecule is associated to the RNA family RF00174. The PDB has an experimental structure (4GXY),which has been crystallized in the presence of magnesium and iridium. We ignored these molecules in our experiments and focused on the ligand binding pocket.\\
Interestingly,
\section*{Molecules not included in the benchmark}
\subsubsection*{Glycine Riboswitch}
We tried to integrate the Glycosine riboswitch in our benchmark. The latter is associated with the PDB structure 3P49 and Rfam family RF00504.\\
Unfortunately, we realized after the alignment of the PDB sequence with the Rfam alignment that the target sequence contains an artificial sequence of a stem loop used to bind protein stabilizing the whole RNA structure. The large discrepancy between the PDB sequence and mutate-and-map sequences prevented us to use this molecules in our benchmark.\\
Nonetheless, in the revised version of this paper, we discuss this case and provide all results in the supplementary material.\\
\subsection*{Other molecules}
The 16S rRNA Four-Way Junction has not been included in our new benchmark because we have not been able to identify evolutionary conserved intermolecular interactions.\\
The SAM riboswitch (RNAPuzzle 8) has not been included in the new benchmark because the sequence similarity of the mutate-and-map sequence with Rfam alignment (RF00162) is below 50\%. Other related experimental structures present in the PDB have similar issues.\\
The following molecules were not included in our new benchmark because we have not been able to identify an alignment in the Rfam database:
\begin{itemize}
\setlength{\itemsep}{-1pt}
\item P4-P6 domain, Tetrahymena ribozyme.
\item M-stableRNA.
\item Hox A9 mRNA 5' UTR.
\item MedLoop.
\item Class I Ligase.
\end{itemize}
The following molecules were not integrated in our benchmark because these are artificial sequences:
\begin{itemize}
\setlength{\itemsep}{-1pt}
\item Hobartner bistable switch
\item X20/H20
\item Tebowned
\end{itemize}
\end{document}
......@@ -17,7 +17,7 @@
%\usepackage{caption}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{dblfloatfix}
%\usepackage{amsthm} %pour les newtheorem
%\usepackage{epstopdf}
......@@ -83,7 +83,7 @@
\usepackage[noend,ruled,vlined]{algorithm2e}
%\usepackage{tikz}
\usepackage{todonotes}
%\usepackage{todonotes}
%\newtheorem{theorem}{Theorem}[section]
%\newtheorem{lemma}[theorem]{Lemma}
......@@ -153,7 +153,11 @@ A distinct, yet complementary, approach to analyze structural and functional pro
To date, both approaches have not been combined and even less reconciled. Nonetheless, an important observation is that the systematic mutations such as those conducted in the mutate-and-map protocol enable us to probe the evolutionary landscape of a molecule, which in turn can be used to reveal nucleotide patterns in the fitness landscape. To capture this signal, it is essential to design a formal framework that calculates correlations between the genetic robustness of the structural profile -- obtained from mutate-and-map experiments -- and the evolutionary information available for this molecule -- usually contained in multiple sequence alignments.
This paper attempts to look beyond RNA structure determination, and introduces a novel concept to leverage the information embedded in experimental structure probing data sets of mutant RNAs. We apply neutral theory principles \cite{Schuster:1994aa} to detect functional dependencies between distant nucleotides in a single stranded RNA. More precisely, we first use mutate-and-map data to identify mutations that significantly destabilize the native structure of the molecule, i.e. the mutations associated with the most divergent SHAPE profiles. Then, we retrieve from RNA multiple sequence alignments (\rfam database~\cite{burge2012rfam}) homologous sequences containing those destabilizing mutations, and compare their nucleotide distribution to the background distribution observed in the \rfam multiple sequence alignment. Finally, the ensemble of positions with highest mutual information is used to reveal nucleotide networks of functional dependencies. This protocol aims to capture non-trivial covariations or geometric conservations that are key to guarantee the stability and specificity of the binding site structure.
This paper attempts to look beyond RNA structure determination, and introduces a novel concept to leverage the information embedded in experimental structure probing data sets of mutant RNAs. We apply neutral theory principles \cite{Schuster:1994aa} to detect functional dependencies between distant nucleotides in a single stranded RNA. More precisely, we first use mutate-and-map data to identify mutations that significantly destabilize the native structure of the molecule, i.e. the mutations associated with the most divergent SHAPE profiles. Then, we retrieve from RNA multiple sequence alignments (\rfam database~\cite{burge2012rfam}) homologous sequences containing those destabilizing mutations, and compare their nucleotide distribution to the background distribution observed in the \rfam multiple sequence alignment. Finally, the ensemble of
\newpage
\noindent
positions with highest mutual information is used to reveal nucleotide networks of functional dependencies. This protocol aims to capture non-trivial covariations or geometric conservations that are key to guarantee the stability and specificity of the binding site structure.
We implement our model in a software named \soft. We illustrate potential applications of the signal captured with our theoretical framework, and apply \soft to analyze mutate-and-map data sets available on the RNA Mapping Database~\cite{Cordero:2012aa}. Our experiments reveal non-trivial long-range dependencies within ncRNA primary structures of 5S ribosomal RNA, {\color{red}the yeast phenylalanine tRNA and the cobalamin, adenine, and glycine riboswitches}. We investigate the biological significance of these patterns by looking at the distribution of these nucleotides on the RNA 3D structures. Interestingly, we find significant correlations between the sets of nucleotides produced by our method and those identified as participating in RNA-RNA, RNA-protein{\color{red}, RNA-DNA and RNA-ligand} interfaces.
......@@ -243,7 +247,7 @@ prediction~\cite{eddy1994rna}. The \rfam database~\cite{burge2012rfam} is a repo
To identify those pairs of nucleotides at specific positions which vary together, we use the \Def{normalized point-wise mutual information}~(\NPMI) measure~\cite{eddy1994rna}.
Given $x$ and $y$ two mutations, each indicated by a column of an alignment and a nucleotide present at the position, the \NPMI is defined as:
$$\displaystyle \text{\NPMI}(x,y)=\frac{\log\frac{\mathbb P\left(x,y\right)}{\mathbb P\left(x\right)\mathbb P\left(y\right)}}{-\log \mathbb P\left(x, y\right)} \in [-1,1]$$
$$\displaystyle \text{\NPMI}(x,y)=\frac{\log\frac{\mathbb P\left(x,y\right)}{\mathbb P\left(x\right)\,\mathbb P\left(y\right)}}{-\log \mathbb P\left(x, y\right)} \in [-1,1]$$
where probabilities $\mathbb{P}(\cdot)$ are estimated from their frequencies in the multiple sequence alignment.
An \NPMI of $ -1$ indicates that $x$ and $y$ never appear together. On the opposite side of the spectrum, a value of $1$ signifies a perfect correlation. If $x$ and $y$ can be considered as two independent random variables, then the \NPMI will be $0$.
......@@ -253,14 +257,14 @@ all \NPMI{}s greater than $-1$ is called $\zeta$.
The procedure to compute the positions over a cutoff percentile $\zeta_c$ given a mutation $m$, a list of positions $p$ and a multiple sequence alignment $MSA$ is described in Algo.~\ref{algo:npmi}.
\subsubsection{Structure as Graph}
\subsubsection{Structures as Graphs}
Most disruptive mutations, when observed in multiple alignments, are found in combination with compensatory mutations which reestablish the structure. Since a common secondary structure is posited in this work, such local covariations are scarcely informative and should be ignored. However, RNAs are three dimensional, and thus one cannot use the sequence distance between the mutation and positions of interests.
In order to assess a realistic notion of distance, we transform the secondary structure into a graph $G$ where the positions are the vertices. The edges are composed of the phosphodiester bonds and the canonical base pairs. For $G$ to adequately reflect the pairwise proximity of nucleotides involved in a loop, an edge is added between every pair of position belonging to the same loop. Effectively every loop becomes a clique. Fig~\ref{fig:sses_graph} shows the full graph of a secondary structure. The distance from a loop to a position $x$ is defined as the maximal shortest path in $G$ from $x$ to any position in that loop.
\begin{figure}[t!]
\centering
\includegraphics[width=.8\linewidth]{Figure3.png}
\includegraphics[width=.9\linewidth]{Figure3.png}
\caption{{\bf RNA secondary structure graph used for proximity filtering.}}
\label{fig:sses_graph}
\end{figure}
......@@ -274,7 +278,7 @@ In RNA, a large proportion of observed covariations are adequately explained by
%\item Stems, due to their rigidity, constitute the scaffold on an RNA structure. It is however the loops that have the flexibility to create complex three dimensional structures~\cite{stombaugh2009frequency} \todo{probably better ref} allowing the RNAs to uphold their functions.
%Given our goal of detecting interfaces between the RNA and other chains, we restrict our attention to positions located inside loops;
%\item
Since this structure is, to a large extent, already revealed by comparative analysis (and already present in the \rfam{} profile taken as input to the method), it does not constitute the primary object of interest of our study. In order to minimize the probability of detecting a local structural compensation, we require a minimal distance $\gamma$ between the mutation and the loops identified for their good NPMI values. This definition is formalized in Algo.~\ref{algo:pos}.
Since the secondary structure is, to a large extent, already revealed by comparative analysis (and already present in the \rfam{} profile taken as input to the method), it does not constitute the primary object of interest of our study. In order to minimize the probability of detecting a local structural compensation, we require a minimal distance $\gamma$ between the mutation and the loops identified for their good NPMI values. This definition is formalized in Algo.~\ref{algo:pos}.
\subsubsection{{\color{red}Binding interfaces positions}}
......@@ -381,22 +385,16 @@ $M_{\text{disruption}}\leftarrow \filterMutations(MaM, \delta)$\;
\subsection{Implementation}
Our software, \soft, is implemented in \texttt{python 2.7}.
To identify mutations of interest, the threshold of \shape profile disruption was tested between the $95$ and $99$ percentiles.
The parameter $\delta$ was set to $10$ creating windows of size $21$ to measure the local \shape profile disruptions.
The parameter $\gamma$ was evaluated for values between $0$ and $30$.
The whole implementation is freely available at:
Our software, \soft, is implemented in \texttt{python 2.7}. To identify mutations of interest, the threshold of \shape profile disruption was tested between the $95$ and $99$ percentiles. The parameter $\delta$ was set to $10$ creating windows of size $21$ to measure the local \shape profile disruptions. The parameter $\gamma$ was evaluated for values between $0$ and $30$. The whole implementation is freely available at:
{\centering \softweb \\ }
\noindent It requires \texttt{BioPython}~\cite{cock2009biopython} for reading the multiple sequence alignments, \texttt{networkx}~\cite{hagberg-2008-exploring} for modeling the graphs, and \texttt{matplotlib}~\cite{Hunter:2007} for visualizing the results. The 3D complex analysis used in our validation is performed using the Python API provided by the {\tt PyMol} software~\cite{PyMOL}.
\subsection{Dataset}
{\color{red} We evaluated our methods on molecule for which we could obtain simultaneously (i) a mutate-and-map experiment data set, (ii) a determined three-dimensional structures interacting with other chain(s), and (iii) a \rfam alignment. This search resulted in six RNAs: the 5S ribosomal RNA, the c-di-GMP riboswitch, the cobalamin riboswitch (Puzzle 6), the phenylalanine tRNA, the adenine riboswitch and the glycine riboswitch. }
To evaluate the efficiency of our method, data with the desired properties was available for {\color{red}six} RNAs~\cite{Cordero:2012aa}. These required properties are a mutate-and-map experiment data set, a determined three-dimensional structures interacting with other chain(s) and and \rfam alignment. Those {\color{red}six} RNAs are the 5S ribosomal RNA, {\color{red} the c-di-GMP riboswitch, the cobalamin riboswitch (Puzzle 6), the phenylalanine tRNA, the adenine riboswitch and the glycine riboswitch . }
{\color{red}The latest gave poor results, potentially due to an artificial hairpin, introduced in the sequence, binding to a small protein to help the crystalisation. The protein was missing in the MaM experiments. It was thus omitted in the present analysis. The results are shown in supplementary material (Fig.~S7).}
{\color{red} However, the experimental structure of the glycine riboswitch found in the PDB contains an artificial stem loop binding a protein used to stabilize the RNA structure and faciliate the crystallization. Since this protein is missing in the MaM experiments, we decided to exclude this riboswitch from our test set. Nonetheless, we show our results in the supplementary material (Fig.~S7).}
\begin{table}[t!]
\centering
......@@ -411,75 +409,57 @@ To evaluate the efficiency of our method, data with the desired properties was a
glycine ribo. & \phantom{0}52.85 \\
\end{tabular}
}
\caption{{\color{red}Each sequence total relative entropy when aligned to its RFAM sequence, obtained from \texttt{infernal}}}
\caption{{\color{red} Bit scores of mutate-and-map sequences on covariance model built from related \rfam alignments using \texttt{infernal}}}
\label{table:ali_entropy}
\end{table}
The 5S ribosomal RNA is the family \texttt{RF00001} on \rfam. Its seed alignment consist of $713$ sequences. The family also provides the consensus structure. The mutate-and-map protocol was applied to the consensus sequence of $4$ structures which have as PDB identifiers \texttt{2WWQ}~\cite{2WWQ}, \texttt{3OAS} and \texttt{3OFC}~\cite{3OAS_3OFC}, and \texttt{3ORB}~\cite{3ORB}. Those four determined structures have almost the same sequence with slight differences in the length on their $5'$ and $3'$ extremities.
{\color{red}
The yeast phenylalanine tRNA is included in family \texttt{RF00005} which has $960$ seed sequences . Since its known crystal structure (PDB identifier \texttt{1EHZ}) is only
in presence of magnesium and manganese, we also considered structures of the two tRNAs in the structure of the yeast 80S ribosome-tRNA complexes (PDB identifier \texttt{3J78}).
}
{\color{red} The yeast phenylalanine tRNA is included in the \rfam family \texttt{RF00005} which has $960$ seed sequences from various tRNAs. Its structure has been crystallized in presence of magnesium and manganese (PDB identifier \texttt{1EHZ}). Although, for a complete characterization of its structural context and interactions with other molecules, we also considered structures of the two tRNAs in the structure of the yeast 80S ribosome-tRNA complexes (PDB identifier \texttt{3J78}).}
The c-di-GMP riboswitch is present in family \texttt{RF01051} in \rfam, which contains $156$ sequences in its seed alignment, and a consensus structure. The consensus sequence was also built from $4$ structures, with PDB identifiers \texttt{3IWN}~\cite{3IWN} and \texttt{3MXH}, \texttt{3MUV}, \texttt{3MUT}~\cite{3MXH_3MUV_3MUT}.
Importantly, c-di-GMP is known to bind a pocket inside the 3-way junction at positions $11\mhyphen 13$, $40\mhyphen 41$ and 85 of the sequence on which the mutate-and-map experiments were run~\cite{smith2009structural, kulshina2009recognition}, and the MaM experiment was done in presence of its ligand. It is also worth noting that, in order to facilitate the crystallization, the hairpin loop L2 of this molecule has been artificially designed to bind the U1A protein. {\color{red}Here, we included only the positions binding to its ligand. Nonetheless, for completeness, we also show in the supplementary material the results obtained with these positions, hence with the c-di-gmp binding interface only.}
The c-di-GMP riboswitch is present in family \texttt{RF01051} in \rfam, which contains $156$ sequences in its seed alignment, and a consensus structure. The consensus sequence was also built from $4$ structures, with PDB identifiers \texttt{3IWN}~\cite{3IWN} and \texttt{3MXH}, \texttt{3MUV}, \texttt{3MUT}~\cite{3MXH_3MUV_3MUT}. Importantly, c-di-GMP is known to bind a pocket inside the 3-way junction at positions $11\mhyphen 13$, $40\mhyphen 41$ and 85 of the sequence on which the mutate-and-map experiments were run~\cite{smith2009structural, kulshina2009recognition}, and the MaM experiment was done in presence of its ligand. It is also worth noting that, in order to facilitate the crystallization, the hairpin loop L2 of this molecule has been artificially designed to bind the U1A protein. {\color{red} Here, we included only the positions binding the ligand. Nonetheless, for completeness, we also show in the supplementary material the results including the stem loop L2 binding interface.}
{\color{red}
The cobalamin riboswitch is in family \texttt{RF00174} which has $430$ seed sequences. The structure bounded to its ligand is known (PDB identifier \texttt{4GXY}). The MaM experiments
were done in the presence of cobalamin ligands.
The MaM cobalamin riboswitch sequence can be found in the \rfam family \texttt{RF00174} which has $430$ seed sequences. The PDB contains the structure bounded to its ligand (PDB identifier \texttt{4GXY}). Noticeably, the MaM experiments were done in the presence of cobalamin ligands.
The adenine riboswitch belongs to family \texttt{RF00167} which has $133$ seed sequences. The structure with the adenine ligand�has PDB identifier \texttt {1Y26}. Three different MaM experiments were conducted on this molecule. Experiments \texttt{Adenine\char`_2} and \texttt{Adenine\char`_3} where done in presence of the ligand, and are used in this paper. The third experiment \texttt{Adenine\char`_4} has been performed in absence of the ligand, and thus was omitted from this benchmark since disruptive mutations cannot be used to detect key structural elements of the ligand-bound structure. Nonetheless, the results are indicated in the supplementary material.
The adenine riboswitch belongs to family \texttt{RF00167} which has $133$ seed sequences. The structure with the adenine ligand has PDB identifier \texttt {1Y26}. Three different MaM experiments were done on that particular molecule. Experiments \texttt{Adenine\char`_2} and \texttt{Adenine\char`_3} where done in presence of the ligand, and are shown in this paper.
Experiments \texttt{Adenine\char`_4} which was done in absence of the ligand gave poor results, since the MaM experiment was done on the unbounded structure. Those additional results are shown in the Supp. Mat.
For each molecule, we present in Fig.~\ref{fig:shape}, for every position $i$, the disruption of the shape profile when a mutation occurs at that position (i.e. $\Delta(S, S_i)$), with the aligned \rfam consensus secondary structure below. We notice the rugged landscape and the sharp differences between the adenine riboswitch profile 4, where the MaM was done without the presence of adenine, and the other two. Additionally, the fact that the glycosine riboswitch sequence is mostly aligned to non structured part might be an additional reason for the poor results of our method.
%Fig.~\ref{fig:shape} shows for every molecule and position $i$, the disruption of the shape profile when a mutation occurs at that position (i.e. $\Delta(S, S_i)$). We also display the \rfam consensus secondary structure at the bottom. We notice the rugged landscape and the sharp differences between the adenine riboswitch profile 4, where the MaM was done without the presence of adenine, and the other two. Additionally, the fact that the glycosine riboswitch sequence is mostly aligned to non structured part might be an additional reason for the poor results of our method.
}
\begin{figure*}[ht!]
\centering
\colorbox{red}{
\includegraphics[width=0.96\textwidth]{Figure4.png}
}
\caption{{\bf Disruptive impact of mutations, as measured from MaM data.} On the $x$-axis the position of the mutation, and on the $y$-axis the value of the SHAPE profile distance $\Delta(S, S_i)$. {\color{red} The adenine\_4 MaM experiment was done in the absence of adenine.}}
\label{fig:shape}
\end{figure*}
%\begin{figure*}[ht!]
% \centering
% \colorbox{red}{
% \includegraphics[width=0.96\textwidth]{Figure4.png}
% }
% \caption{{\bf Disruptive impact of mutations, as measured from MaM data.} On the $x$-axis the position of the mutation, and on the $y$-axis the value of the SHAPE profile distance $\Delta(S, S_i)$. {\color{red} The adenine\_4 MaM experiment was done in the absence of adenine.}}
% \label{fig:shape}
%\end{figure*}
We also built a secondary test set of \rfam families with experimentally determined 3D structures. We selected all \rfam families with sequences having a size ranging from 35 to 150 nucleotides, and with PDB files containing at least one other molecule in the vicinity of the RNA. In total, we found 14 families matching 729 different structures.
We omitted the shortest sequences (i.e. \rfam families RF00032, RF00037 and RF00175) because our distance metric $\delta$ would be too coarse to extract a signal on such small structures. Large molecules (i.e. more than 150 nucleotides) were not considered because the accuracy of the nearest-neighbor model decreases significantly after this size. Hence, this is also the case for \remu with which will use this data set.
{\color{red} To complete our benchmark, we also built a secondary test set of \rfam families with experimentally determined 3D structures, but for which MaM experiments were not available.} We selected all \rfam families with sequences having a size ranging from 35 to 150 nucleotides, and with PDB files containing at least one other molecule in the vicinity of the RNA. In total, we found 14 families matching 729 different structures.
We omitted the shortest sequences (i.e. \rfam families RF00032, RF00037 and RF00175) because our distance metric $\delta$ would be too coarse to extract a signal on such small structures. {\color{red} Similarly, we also removed large molecules (i.e. more than 150 nucleotides) because the accuracy of the nearest-neighbor model decreases significantly beyond this size. So it is the case for computational tools with which we simulated MaM data (i.e. \remu).}
\subsection{Experimental design}
The \texttt{Infernal 1.1}~\cite{nawrocki2009infernal} software was used with default parameter values to: 1) create a covariance model for each alignment, and; 2) align the sequence from the mutate-and-map experiment with the generated covariance model. The consensus secondary structure was then restricted to gapless positions within the aligned sequence $\Seq$.
For each mutation over the \shape profile percentile cutoff $\delta$, the data set was composed of the regions of interest given $\gamma$, i.e. the set of positions returned by the Algo~\ref{algo:pos}.
For each mutation over the \shape profile percentile cutoff~$\delta$, the data set was composed of the regions of interest given $\gamma$, i.e. the set of positions returned by the Algo~\ref{algo:pos}.
{\color{red}
To determine the positive datasets, we proceeded in three different ways.
For the 5S RNA, for each PDB model, the positive data set is composed of the positions in those regions which have the center of any of their atom at most at $5$\AA\xspace from the center of any atom of another chain the the complex. An implementation using the PyMOL Python API is included in the provided code.
We used different strategies to determine the interaction sites (i.e. positive data set), depending of the nature and context of these interactions. All interactions were manually verified.
For the tRNA, we located positions at most at $5$\AA to another chain in the two tRNAs inside
the structure of the yeast 80S ribosome-tRNA complexes (PDB identifier \texttt{3J78}).
Because those were note the phenyla;anine tRNA, we used \texttt{LocaRNA}\cite{will2007inferring} to align them to our sequence, resulting in the positions 1, 19, $34\mhyphen 36$, $56\mhyphen 57$, $73 \mhyphen76$ (containing the anticodon) as the positive set. Given the poor results (shown in Supp. Mat.) we located two sets of positions known as binding interfaces.
The anticodon, identified using \texttt{tRNAscan-SE}~\cite{schattner2005trnascan}, and the T-$\psi$-C-G motif
which is known to bind the 5S RNA in the 50S ribosomal subunit~\cite{schwarz1976codon}.
For the riboswitches, from their respective crystal structures we used \texttt{Ligand Explorer}~\cite{moreland2005molecular} to identify nucleotide at most $5$\AA\xspace from the ligand.
The set of all positions is found in the Supp. Mat. Table 1.}
For the 5S RNA, we implemented a PyMOL script to extract nucleotides of each PDB model, whose position any of their atom is at most at $5$\AA{} from any atom of another chain the the complex. An implementation of this script is included in the distribution of \soft
For the tRNA, we extracted positions that are at most $5$\AA{} away from another chain in the two tRNAs found inside the structure of the yeast 80S ribosome-tRNA complexes (PDB identifier \texttt{3J78}). However, because those were not phenylalanine tRNAs, we aligned them to the MaM sequence with \texttt{LocaRNA}\cite{will2007inferring}, and used this alignment to map the interaction sites on the latter. We identified the positions 1, 19, $34\mhyphen 36$, $56\mhyphen 57$, $73 \mhyphen76$ (containing the anticodon) in this positive set. Among them, only the anticodon and T-$\psi$-C-G, known to bind the 5S RNA in the 50S ribosomal subunit~\cite{schwarz1976codon}, motif appeared to us to be strongly conserved. Thus, we considered only these two interactions sites in our experiments and presented the results separately. For completeness, the results obtained on other positions have been included in the supplementary material. Finally, we also confirmed the location of the anticodon using \texttt{tRNAscan-SE}~\cite{schattner2005trnascan}.
For the riboswitches, we used \texttt{Ligand Explorer}~\cite{moreland2005molecular} to identify nucleotide at most $5$\AA{} from the ligand in their respective crystal structures.
The set of all positions is found in the Supplementary Material Table 1.}
The remaining positions compose the negative dataset. The positions not present in the model were ignored. This highlights one of the challenges of benchmarking. For the 5S rRNA, out of 121 positions, two models had 3 nucleotides missing, one had 4 missing and the other 6. {\color{red} For c-di-GMP, out of 103 positions, one model had 8 nucleotides missing, two others 21 and the last 22. Which explains some of the discrepancies between the models.}
All other remaining positions compose the negative dataset. The positions not present in the model were ignored. This highlights one of the challenges of this benchmark. For the 5S rRNA, out of 121 positions, two models had 3 nucleotides missing, one had 4 missing and the other 6. {\color{red} For c-di-GMP, out of 103 positions, one model had 8 nucleotides missing, two others 21 and the last 22. Which explains some discrepancies between the models.}
The set $\zeta$ is composed of the NPMI between every pairs of positions and every possible nucleotide (i.e. \Ab, \Cb, \Gb, \Ub\xspace and \gapb{}) in the resulting alignment. The thresholds on the NPMIs, $\zeta^+$ (resp. $\zeta^-$) was sliced from the $0^{\text{th}}$ to the $100^{\text{th}}$ percentile of the positive values of $\zeta$ (resp. negative values of $\zeta$).
......@@ -488,8 +468,7 @@ Thus, for each \shape profile distance measure, each \shape distance threshold $
\section{Results}
\label{sec:results}
We evaluated \soft on a comprehensive set of values for $\delta$ the \shape profile distance measure and $\gamma$ the proximity threshold. For each $(\delta,\gamma)$ pair, a Receiver operating characteristic (ROC) curve was computed and the Area Under the Curve (AUC){\color{red} are shown in Fig.~\ref{fig:auc}, averaged for each the 5S rRNA and for c-di-GMP riboswitch who have four models.
It is important to recall that the set of positives and negatives is influenced by the value of $\gamma$, as they are determined by the Algo.~\ref{algo:pos}. Notice also that we only show the pair of parameters $\delta, \gamma$ such that the all structures for each RNA had a set of positives and negatives. As discussed before, many nucleotides are missing in the PDB models. of 5S and c-di-GMP.}
We evaluated \soft on a comprehensive set of values for $\delta$ the \shape profile distance measure and $\gamma$ the proximity threshold. For each $(\delta,\gamma)$ pair, {\color{red} we computed a Receiver Operating Characteristic (ROC) curve and its Area Under the Curve (AUC). The data are shown in Fig.~\ref{fig:auc}. Noticeably, we averaged the values for the 5S rRNA and c-di-GMP riboswitch who have four PDB models. Importantly, we remind that the set of positives and negatives is influenced by the value of $\gamma$, as calculated by Algo.~\ref{algo:pos}. We also note that we only show the pair of parameters $\delta, \gamma$ such that for all structures for each RNA the positive and negative sets are non-empty. As discussed before, many nucleotides are missing in the PDB models of 5S and c-di-GMP.}
\begin{figure*}[ht!]
\centering
......@@ -517,24 +496,17 @@ It is important to recall that the set of positives and negatives is influenced
\end{figure*}
\subsection{Evolutionary stabilization of {\itshape in vitro} disruptive mutations reveal binding sites}
{\color{red}
The results shown in Fig.~\ref{fig:auc} show that in all cases the maximal AUCs are found wen the \shape disruption percentile cutoff $\delta$ is around $97\%$, with the maximal distance possible distance $\gamma$ from those mutations.
We show our results in Fig.~\ref{fig:auc}. In all cases, it exists a pair of parameters $(\delta,\gamma)$ for which \soft achieve good predictive performance. Importantly, the maximal AUCs are found when the \shape disruption percentile cutoff $\delta$ is around $97\%$, with the largest possible distance $\gamma$ from those mutations.
The 5S RNA and tRNA show a much more diffuse pattern. We hypothesize that the weaker signal is due to the complex
and numerous interactions helping to shape and stabilize their 3D structures which was not captured during the MaM
experiment. Thus creating noise as the disruptive effect of each mutations.
Results on 5S RNA and tRNA appear to have a slightly more diffuse pattern than other experiments. We hypothesize that it is due to the complexity of the nucleotide interaction network used to stabilize their 3D structures. Such a network would be easily disrupted by any mutation, hence the amplifed noise in MaM experiments.
The strong signal of the c-di-GMP riboswitch might be a consequence of a small artificial helix that was introduced to stabilize the crystal structure, and conserved in the MaM experiments.
By constrast, the c-di-GMP riboswitch exhibits one of the strongest signal, most likely because of a strong evolutionary conservation and the central location of the binding site.
The cobalamin riboswitch is the only that exhibits a strong negative correlation at a slightly smaller disruption cutoff than the optimal. This region correspond to the selection of the positions in the leftmost hairpin as disruptive, as seen in Fig.~\ref{fig:shape}, and might indicate another binding interface not identified by our approach.
Interestingly, the cobalamin riboswitch exhibits a negative correlation with smaller disruption cutoff $\delta$ than the optimal value for which a positive correlation is found. In fact, these values of $\delta$ are strongly associated with positions in the leftmost hairpin of this structure. This suggests a conserved, yet currently unannotated, structural motif or binding interface that would warrant further investigations.
The two adenine riboswitch MaM experiment show that although similar results are achieved, the variation between
experiments remains a concern and some the quality of the \shape experiments must be taken into account. The necessity of the correct structure when applying the MaM protocol is necessary as negative results are obtained when using the non-bonded form (shown in Supp. Mat. FigS9).
Finally, the two MaM experiments on the adenine riboswitch show that, although similar results are observed, the variation between experiments remains a concern and that the quality of the \shape experiments must be taken into account. The necessity of the correct structure when applying the MaM protocol is necessary as negative results are obtained when using the unbound form (See Fig. S9 in supplementary material).
}
%The ROC curves for the 4 PDB structures are shown on the same graph, for a specific \shape distance threshold $\delta$ and $\gamma$.
......@@ -550,11 +522,7 @@ experiments remains a concern and some the quality of the \shape experiments mus
%The set of parameters associated with the best average AUC correspond to 3 curves having AUC of $1$, plus another around $0.5$. We suggest that at high value of $\delta$ the region of interests is restrained to nucleotides that are not in the PDB models, resulting in a minimal validation case.
We conjecture that the differences in the influence of the $\gamma$ parameter, minimal distance from the mutation, are due to
structural differences. We present in Fig.~\ref{fig:dist} the distribution of path lengths (distance) between every pair of secondary structure elements, weighted by the number of combinations of positions that are not in the intersection of the secondary structure element. We observe that the 5S rRNA has a much more Normal-like distance distribution, {\color{red} while the c-di-GMP and cobalamin riboswitch are more Poisson-like. In contrast, the tRNA and adenine riboswitch havemuch more bimodal distributions.
Those distributions determine how smoothly the number of positions considered decreases as the parameter $\gamma$, minimal distance from the mutation, is increased.
We conjecture that the differences in the influence of the $\gamma$ parameter, minimal distance from the mutation, are due to structural differences. We present in Fig.~\ref{fig:dist} the distribution of path lengths (distance) between every pair of secondary structure elements, weighted by the number of combinations of positions that are not in the intersection of the secondary structure element. {\color{red} We observe that the distribution of distances on 5S rRNA tends toward a Normal distribution, while on the c-di-GMP and cobalamin riboswitches it seems instead to follows a Poisson pattern. By contrast, the tRNA and adenine riboswitches have distributions tending toward bimodal modes. Those distributions determine how smoothly the number of positions considered could decrease as the parameter $\gamma$, minimal distance from the mutation, increases.
}
......@@ -593,11 +561,11 @@ The Boltzmann conformational ensemble $\mathds B_w$ of a sequence $w$ is the pro
distributions $\mathds B_{wt}$ and $\mathds B_{m}$. The latter provides an estimate of the destabilization of the conformational landscape induced by the mutation. Given the set of all secondary structure $\mathcal S$,
the {\em relative entropy} is defined as
$$
\sum_{S\in \mathcal S}\mathbb P(S\mid\mathds B_{wt})\log\left(\frac{\mathbb P(S\mid\mathds B_{wt})}{\mathbb P(S\mid\mathds B_m}\right).
\sum_{S\in \mathcal S}\mathbb P(S\mid\mathds B_{wt})\log\left(\frac{\mathbb P(S\mid\mathds B_{wt})}{\mathbb P(S\mid\mathds B_m)}\right).
$$
We report our results in Fig.~\ref{fig:auc_remu}. Here again, our data unveil a signal that shows a correlation between the mutation identified with \soft and the RNA-binding interfaces. Nonetheless, the strength of the signal extracted with \remu is of lower magnitude than the one achieved with the \shape experiments and the mutate-and-map protocol.
We report our results in Fig.~\ref{fig:auc_remu}. Here again, our data unveil a signal that shows a correlation between the mutation identified with \soft and the RNA-binding interfaces. Nonetheless, the strength of the signal extracted with \remu is of lower magnitude than the one achieved with the \shape experiments and the mutate-and-map protocol. {\color{red} An exception is the tRNA for which \soft achieves better performance with \remu than MaM data. We conjecture that this could be due to a difficulty of the MaM protocol to capture a clear signal on these structures.}
To further validate our model, we applied this protocol based on \remu prediction on a data set made of \rfam families with experimentally determined 3D structures (See Methods). It took us 1 CPU year to complete this experiment on each of the 727 structures. For each family, we extracted the sequences annotated by \rfam as having the best matching score (i.e. the Bit score measuring the fitness of the PDB sequence to the \rfam covariance model). This restrained the set to 52 sequences since some families had many sequences with the same score. We present in Fig.~\ref{fig:rfam_best_bit} our analysis on those top scoring sequences, showing the same trend. Complete results including omitted (short) families RF00032, RF00037 and RF00175 are available in the supplementary material (See Fig.~S6).
......@@ -612,8 +580,7 @@ The poorest results are achieved in family RF01118. Interestingly, one of the co
\centering
\includegraphics[width=0.96\textwidth]{Figure8.png}
\caption{ {\bf Performance of \soft for \remu-predicted disruptions.} For each \rfam family, we consider all PDBs having less than 150 nucleotides, and having maximal matching score to family. For a set of extreme percentile cutoff of the \shape profile disruption in the first column (computational \remu disruption in the second
column) $\delta$ and a minimal distance $\gamma$ from the mutation. Note that the PDB models considered for the 5S family (RF0001) do not match those investigated by MaM, which explains the discrepancies observed between the results above and those of Fig.~\ref{fig:aucremumam}.}
\label{fig:rfam_best_bit}
column) $\delta$ and a minimal distance $\gamma$ from the mutation. Note that the PDB models considered for the 5S family (RF0001) do not match those investigated by MaM, which explains the discrepancies observed between the results above and those of Fig.~\ref{fig:aucremumam}.\label{fig:rfam_best_bit}}
\end{figure*}
These observations validate our methodology and at the same time justify the use of \shape data.
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment