Commit 12cb0a6a authored by Carlos GO's avatar Carlos GO
Browse files

small comments

parent 03c95e75
import sys
from math import log;
def loop_counter(structure):
......@@ -189,3 +190,5 @@ def compute_entropy(listseqs):
outdic['avg'] /= lenseq;
return outdic['avg'];
if __name__ == "__main__":
print(loop_counter(sys.argv[1]))
import numpy as np
import pandas as pd
def contiguity(stem):
"""
return: contiguity score for stem
"""
stem_length = len(stem)
bps = stem.count('(')
def count_stacks(ss):
stacks = 0
#count number of contiguous stacks
i = 0
j = len(stem) - 1
stack_size = 0
in_stack = False
i = 0
j = len(ss)-1
while i < j:
l = stem[i]
r = stem[j]
l = ss[i]
r = ss[j]
if l == "(" and r == ")":
stack_size += 1
......@@ -35,6 +24,7 @@ def contiguity(stem):
if in_stack:
stacks += 1
in_stack = False
stack_size = 0
if l == "(" and r == ".":
j -= 1
if l == "." and r == ")":
......@@ -42,17 +32,26 @@ def contiguity(stem):
if l == "." and r == ".":
i += 1
j -= 1
return stacks
def contiguity(stem):
"""
return: contiguity score for stem
"""
stem_length = len(stem)
bps = stem.count('(')
#count number of contiguous stacks
stacks = count_stacks(stem)
return np.log((stem_length - bps) / stacks)
def stem_find(ss):
"""
return: list containint start and end indices of all stems in RNA
return: list containing start and end indices of all stems in RNA
"""
#remove dangles
ss = ss.strip('.')
print(ss)
stack = []
stem_start = None
......@@ -74,6 +73,8 @@ def stem_find(ss):
if len(stack) == 0:
stem_end = i
stem_indices.append((stem_start, stem_end))
elif b == "." and len(stack) == 0:
continue
else:
continue
......@@ -92,9 +93,32 @@ def ml_contiguity():
df = pd.read_csv("../Data/rnamuts_multiloops.csv")
sss = df['structure']
for ss in sss:
print(stem_find(ss))
stems = stem_find(ss)
print(ss)
print(stems)
# print(mean_contig(ss, stems))
break
def contigs(ss):
stack = []
in_stack = False
stack_count = 0
stacks = 0
for i, s in enumerate(ss):
if s == "(":
stack.append(i)
stack_count += 1
elif s == ")":
pass
else:
if stack_count > 1:
stacks += 1
stack_count = 0
return stacks
if __name__ == "__main__":
ml_contiguity()
pass
# ml_contiguity()
ss = "((((..)).))"
print(contigs(ss))
......@@ -57,13 +57,13 @@ Interestingly, \textit{in vitro} experiments revealed the extreme versatility of
Various theoretical models have been proposed to highlight mechanisms that may have favoured the birth and growth of structural complexity from replications of small monomers. Computational studies have been of tremendous help to validate these theories and quantify their impact. In particular, numerical simulations enabled us to explore the effects of polymerization on mineral surfaces \citep{Szabo:2002aa,Briones:2009aa} or the importance of spatial distribution \citep{Shay:2015aa}. Still, the debate about the necessity for such hypothesis remains open.
%\subsection{Our contribution}
In this work, we show that structural complexity can naturally emerge without the help of any sophisticated molecular mechanisms. We reveal subtle topological features of RNA mutational networks that helped to promote the discovery of functional RNAs at the early stages of the RNA world hypothesis. We demonstrate that in the absence of selective pressure, self-replicating RNA populations naturally drift toward a singular region of the sequence landscape enriched in complex structures, allowing for the simultaneous discovery of all molecular components needed to form a complete functional system.
In this work, we show that structural complexity can naturally emerge without the help of any sophisticated molecular mechanisms. We reveal subtle topological features of RNA mutational networks that helped to promote the discovery of functional RNAs at the early stages of the RNA world hypothesis. We demonstrate that in the absence of selective pressure, self-replicating RNA populations naturally drift toward \st{a singular region} \hlt{regions} of the sequence landscape enriched in complex structures, allowing for the simultaneous discovery of all molecular components needed to form a complete functional system.
For the first time, we apply customized algorithms to map secondary structures on all mutant sequences with $50$ nucleotides \citep{Waldispuhl:2008aa,waldispuhl2011unbiased}. This approach considerably expands the scope and significance of comprehensive RNA evolutionary studies that were previously limited to sequences with less than $20$ nucleotides \citep{Gruner:1996aa,Cowperthwaite:2008aa}, or restricted to explore a small fraction of the sequence landscape of sequences \citep{stich2008structural,Dingle:2015aa}. This technical breakthrough is essential to observe the formation of complex multi-branched structures often used to carry essential molecular functions that cannot be assembled on shorter sequences.
Our simulations reveal the unexpected presence of a large pool of remarkably stable multi-branched structures in
a region of the RNA mutational landscape characterized by an average distance of $30$ to $40$ mutations from a random sequence, and a balanced GC content (i.e. $0.5$). Strikingly, these multi-branched RNAs have similar energies ($-15 \pm 5\:\kcalmol$) to those observed in the Rfam database \citep{Nawrocki:2015aa} (See {\bf Fig.~\ref{subfig:rfam_stats}}).
a region of the RNA mutational landscape characterized by an average distance of $30$ to $40$ mutations from a random sequence, and a balanced GC content (i.e. $0.5$). Strikingly, these multi-branched RNAs have similar energies ($-15 \pm 5\:\kcalmol$) to those observed in the Rfam database \citep{Nawrocki:2015aa} \hlt{on the same length scale} (See {\bf Fig.~\ref{subfig:rfam_stats}}).
We compare these data to populations that evolved under a selective pressure eliciting stable structures. Although this evolutionary mechanism shows a remarkable capacity to quickly improve the stability of structures, it fails to reproduce the structural complexity observed in RNA families of similar lengths.
Finally, we show that a population of RNA molecules replicating itself with random errors but preserving a balanced GC content \citep{Tamura:1992aa}, naturally evolves toward regions of the landscape enriched with multi-branched structures potentially capable of supporting essential biochemical functions. Our results argue for a simple scenario of the origin of life in which an initial pool of nucleic acids would irresistibly evolve to promote a spontaneous and simultaneous discovery of the basic bricks of life.
Finally, we show that a population of RNA molecules replicating itself \hlt{randomly} with random errors but preserving a balanced GC content \citep{Tamura:1992aa}, naturally evolves toward regions of the landscape enriched with multi-branched structures potentially capable of supporting essential biochemical functions. Our results argue for a simple scenario of the origin of life in which an initial pool of nucleic acids would irresistibly evolve to promote a spontaneous and simultaneous discovery of the basic bricks of life.
......@@ -12,7 +12,9 @@
\usepackage{amsmath}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{todonotes}
\usepackage{tikz}
\usepackage{soul}
\usetikzlibrary{shapes,arrows}
\usepackage[]{algorithm2e}
\usepackage{siunitx}
......@@ -40,6 +42,8 @@
\newcommand{\kcalmol}{\si{\kilo\calorie\per\mol}}
\newcommand{\kcalmolk}{\si{\kilo\calorie\per\mol\per\kelvin}}
\newcommand{\hlt}[1]{\colorbox{pink}{#1}}
\begin{document}
......@@ -54,6 +58,8 @@ $\\\small$^1$ School of Computer Science, McGill University, Montreal, Canada\\\
The RNA world hypothesis relies on the ability of ribonucleic acids to replicate and spontaneously acquire complex structures capable of supporting essential biological functions. Multiple sophisticated evolutionary models have been proposed, but they often assume specific conditions.
%
In this work we explore a simple and parsimonious scenario describing the emergence of complex molecular structures at the early stages of life. We show that at specific GC-content regimes, an undirected replication model is sufficient to explain the apparition of multi-branched RNA secondary structures -- a structural signature of many essential ribozymes. We ran a large scale computational study to map energetically stable structures on complete mutational networks of 50-nucleotide-long RNA sequences. Our results reveal a distinct region of the sequence landscape enriched with multi-branched structures bearing strong similarities to those observed in databases. A random replication mechanism preserving a $50\%$ GC-content suffices to explain a natural drift of RNA populations toward this particular region.
\end{abstract}
\newpage
......@@ -70,7 +76,7 @@ In this work we explore a simple and parsimonious scenario describing the emerge
\input{contributions.tex}
\input{declaration.tex}
%\input{declaration.tex}
\bibliographystyle{unsrtnat}
\bibliography{biblio}
......
......@@ -3,7 +3,7 @@
% Overview
\subsection{Our approach}
We apply two complementary mutation space search techniques to characterize the influence of a selection process to the repertoire of shapes accessible from an initial pool of random sequences (See \textbf{Fig.~\ref{fig:summary}}). Importantly, our analysis explicitly models the impact of the GC content bias. Our first algorithm \RNAmutants enumerates all mutated sequences and samples the ones with the \emph{globally} lowest folding energy \citep{Waldispuhl:2008aa}. It enables us to calculate the structures accessible from a random replication process. Our other algorithm named \maternal, has been developed for this study to simulate the evolution of a population of RNA sequences that preferentially selects the most stable structures under nucleotide bias.
We apply two complementary mutation space search techniques to characterize the influence of \st{a selection} \hlt{sampling} process to the repertoire of shapes accessible from an initial pool of random sequences (See \textbf{Fig.~\ref{fig:summary}}). Importantly, our analysis explicitly models the impact of the GC content bias. Our first algorithm \RNAmutants enumerates all mutated sequences and samples the ones with the \emph{globally} lowest folding energy \citep{Waldispuhl:2008aa}. It enables us to calculate the structures accessible from a random replication process. \hlt{In contrast, } our other algorithm named \maternal, has been developed for this study to simulate the evolution of a population of RNA sequences that preferentially selects the most stable structures under nucleotide bias.
\begin{figure}
......@@ -15,12 +15,12 @@ We apply two complementary mutation space search techniques to characterize the
Using \RNAmutants, we compute for the very first time, the energy landscape of the complete mutational network of RNA sequences of length $50$ preserving the GC content (See {\bf Fig.~\ref{fig:energy}}). This result contrasts with previous studies that used brute-force approaches to calculate the complete sequence landscape, and were therefore limited to sizes of $20$ nucleotides. This technical breakthrough is essential because multi-branched structures occur only on RNAs with sizes above $40$ (See {\bf Fig.~\ref{fig:rfam}})
More precisely, given an initial sequence (i.e. the \emph{seed}), \RNAmutants calculates mutated sequences with the lowest folding energies among all possible secondary structures (i.e. no target secondary structure). Importantly, \RNAmutants allows us to control the GC content of mutated sequences \citep{waldispuhl2011unbiased} (fraction of bases in a sequence that are either G or C). We ran \RNAmutants on $20$ independently generated random sequences for each GC content regime in $\big\{ 0.1, 0.3, 0.5, 0.7, 0.9 \big\}$ ($\pm 0.1$). At each mutational distance (i.e. number of mutations from the seed) from 0 to 50, we sample from the Boltzmann distribution approximately \num{10000} sequence-structure pairs with the target GC content. It is worth mentioning that the secondary structures sampled by \RNAmutants are not necessarily MFE structures of the sequences associated with. However, our data shows that the vast majority of these sampled structures are very close to their MFE counterpart (See {\bf Supp. Fig.~\ref{fig:freq}}). This observation enables us to simplify analysis by retaining MFE structures and produce data that is directly comparable with \maternal.
More precisely, given an initial sequence (i.e. the \emph{seed}), \RNAmutants calculates mutated sequences with the lowest folding energies among all possible secondary structures (i.e. no target secondary structure). Importantly, \RNAmutants allows us to control the GC content of mutated sequences \citep{waldispuhl2011unbiased} (fraction of bases in a sequence that are either G or C). We ran \RNAmutants on $20$ independently generated random sequences for each GC content regime in $\big\{ 0.1, 0.3, 0.5, 0.7, 0.9 \big\}$ ($\pm 0.1$). At each mutational distance (i.e. number of mutations from the seed) from 0 to 50, we sample from the Boltzmann distribution approximately \num{10000} sequence-structure pairs with the target GC content. \todo{word on sample size} It is worth mentioning that the secondary structures sampled by \RNAmutants are not necessarily MFE structures of the sequences associated with. However, our data shows that the vast majority of these sampled structures are very close to their MFE counterpart (See {\bf Supp. Fig.~\ref{fig:freq}}). This observation enables us to simplify analysis by retaining MFE structures and produce data that is directly comparable with \maternal.
\subsection{Energy landscape of RNA mutational networks}
We start by characterizing the distribution of folding energies of stable structures accessible from random seeds in the mutational landscape using \RNAmutants. Our simulations show that, initially, increasing mutational distances result in more stable structures at all GC content regimes (See {\bf Fig.~\ref{fig:energy}}). In particular, we observe that the folding energies of the samples represent at least 80\% of the global minimum energy attainable over all $k$ mutations for all GC contents within less than 10 mutations from the seed (see {\bf Supp. Fig.~\ref{fig:minfrac}}). This finding suggests that over short evolutionary periods, mutations are likely to play an important stabilizing role.
We start by characterizing the distribution of folding energies of stable structures accessible from random seeds in the mutational landscape using \RNAmutants. Our simulations show that, initially, increasing mutational distances result in more stable structures at all GC content regimes (See {\bf Fig.~\ref{fig:energy} solid line}). In particular, we observe that the folding energies of the samples represent at least 80\% of the global minimum energy attainable over all $k$ mutations for all GC contents within less than 10 mutations from the seed (see {\bf Supp. Fig.~\ref{fig:minfrac}}). This finding suggests that over short evolutionary periods, mutations are likely to play an important stabilizing role.
By contrast, at larger mutational distances (i.e. 15 mutations and above) we notice an increase in the ensemble energy for all GC content regimes. This behaviour is likely due to the exponential growth in sequence space that accompanies higher mutational distances, which results in a more uniform sampling of the ensemble. In this case, lower energy sequences with high Boltzmann weights become less represented among more abundant and less stable sequences (See {\bf Fig.~\ref{fig:freq}}).
......@@ -44,9 +44,9 @@ As expected, we observe that the nucleotide content is an important factor in de
%\subsubsection{Structural diversity}
Having established that mutational neighbourhoods of random sequences yield stable and low energy states, it is important to also understand the structural features of these states. This is of particular interest trying to provide an account of the emergence of functional and complex RNAs. While previous studies on the structure of short randomly sampled RNA sequences have shown that simple hairpin structures dominate the landscape, we find that for longer molecules the landscape of stable mutant neighbourhood reveals ensembles that are well populated with diverse structures (See {\bf Fig.~\ref{fig:struct_diversity}}). Interestingly, for all GC content sampling and particularly at low to intermediate GC contents, we find that after an initial stabilization period at short mutational distances ($\sim 10$ mutations) more complex structures begin to emerge. This sudden and unexpected change of regime seems to correlate with the decrease of the average folding energies observed in {\bf Fig.~\ref{fig:energy}}.
Having established that mutational neighbourhoods of random sequences yield stable and low energy states, it is important to also understand the structural features of these states. This is of particular interest trying to provide an account of the emergence of functional and complex RNAs. While previous studies on the structure of short randomly sampled RNA sequences have shown that simple hairpin structures dominate the landscape, we find that for longer molecules the landscape of stable mutant neighbourhood reveals ensembles that are well populated with diverse structures (See {\bf Fig.~\ref{fig:struct_diversity}}). Interestingly, for all GC content sampling and particularly at low to intermediate GC contents, we find that after an initial stabilization period at short mutational distances ($\sim 10$ mutations) more complex structures begin to emerge. \todo{figure?} This sudden and unexpected change of regime seems to correlate with the decrease of the average folding energies observed in {\bf Fig.~\ref{fig:energy}}.
We can observe this change as an increase in the number of structures with internal loops ({\bf Fig.~\ref{fig:rnamuts_internal}}) and remarkably, multi-loops ({\bf Fig.~\ref{fig:rnamuts_multi}}). This finding is in good qualitative agreement with databases of evolved structures which contain structures with more diverse motifs than has been previously estimated from pools of random RNAs.
We can observe this change as an increase in the number of structures with internal loops ({\bf Fig.~\ref{fig:rnamuts_internal}}) and remarkably, multi-loops ({\bf Fig.~\ref{fig:rnamuts_multi}}). This finding is in good qualitative agreement with databases of evolved structures which contain structures with more diverse motifs than has been previously estimated from pools of random RNAs. \todo{size difference}
These secondary structure features (i.e. internal and multi-loops) are key components of RNAs that allow for higher order interactions that can support catalytic function. Although multi-loops occur at relatively low frequencies in the \RNAmutants mutational landscapes (at most 0.01), we focus on them for their high degree of structural complexity and importance to many functional RNAs observed in vivo such as the hammerhead ribozyme.
......@@ -88,7 +88,7 @@ Eventually, we also note a smaller peak of multi-loop occurences closer from the
%\subsubsection{Energy landscape}
Our \RNAmutants simulations suggest that mutational networks traversed in an energy based manner can yield stable and diverse structures. We revealed that multi-branched structures reside in specific regions of the mutational landscape at fixed mutational distances (primarily at 30--40 mutations from random seeds) and GC contents ($0.5$ for the majority of structures). At this point, our main question is to determine if a natural selection process, independent of a particular target, is capable of reaching these regions.
Our \RNAmutants simulations suggest that mutational networks traversed in an energy based manner can yield stable and diverse structures. We revealed that multi-branched structures reside in \st{specific} regions of the mutational landscape at fixed mutational distances (primarily at 30--40 mutations from random seeds) and GC contents ($0.5$ for the majority of structures). At this point, our main question is to determine if a natural selection process, independent of a particular target, is capable of reaching these regions.
To address this question we build an evolutionary algorithm named \maternal, where the fitness is proportional to the folding energies of the molecules. Intuitively, \maternal enables us to simulate the behaviour of a population of RNAs selecting the most functional sequences (i.e. most stable) regardless of the structures used to carry the functions. The selected structures are therefore by-products of intrinsic adaptive forces.
......@@ -104,7 +104,7 @@ While we observe very efficient energy optimization from the natural selection p
\subsection{An undirected evolutionary scenario}
\label{sec:undirected_evolution}
We are now showing that random replication \emph{without natural selection} is sufficient to explain the diversity of structures observed in RNA databases, and further the emergence of RNA structural complexity. We consider a simple model in which RNA molecules are duplicated with a small error rate, but preserving the GC content \citep{Tamura:1992aa}. In our simulations, we use an error rate of $0.02$ to allow a immediate comparison of the number of elapsed generations. We also apply identical transitions and transversions rates. Under these assumptions, we can directly compute the expected number of mutations in sequences at the $i^{th}$ generation (see Methods). \textbf{Fig.~\ref{fig:tamura}} shows the results of this calculation for GC content biases varying from $0.1$ to $0.9$. Strikingly, our data reveals that after a short initialization phase (i.e. after $\sim 50$ generations), sequences with a GC content of $0.5$ have on average slightly more than $35$ mutations. This observation is in perfect adequacy with the peak of multi-branched structures identified in \textbf{Fig.~\ref{fig:rnamuts_multi}} and \textbf{Fig.~\ref{fig:multigc}}. It follows that a simple undirected replication mechanism, not subject to natural selection mechanism, is sufficient to explain a drift of populations of RNA molecules toward regions of the sequence landscape enriched with multi-branched structures.
We are now showing that random replication \emph{without natural selection} is sufficient to explain the diversity of structures observed in RNA databases, and further the emergence of RNA structural complexity. We consider a simple model in which RNA molecules are duplicated with a small error rate, but preserving the GC content \citep{Tamura:1992aa}. In our simulations, we use an error rate of $0.02$ to allow a immediate comparison of the number of elapsed generations. We also apply identical transitions and transversions rates. Under these assumptions, we can directly compute the expected number of mutations in sequences at the $i^{th}$ generation (see Methods). \textbf{Fig.~\ref{fig:tamura}} shows the results of this calculation for GC content biases varying from $0.1$ to $0.9$. Strikingly, our data reveals that after a short initialization phase (i.e. after $\sim 50$ generations), sequences with a GC content of $0.5$ have on average slightly more than $35$ mutations. This observation is in perfect adequacy with the peak of multi-branched structures identified in \textbf{Fig.~\ref{fig:rnamuts_multi}} and \textbf{Fig.~\ref{fig:multigc}}. It follows that a simple undirected replication mechanism, not subject to natural selection, is sufficient to explain a drift of populations of RNA molecules toward regions of the sequence landscape enriched with multi-branched structures.
\begin{figure}[htb!]
\centerline{\includegraphics[width=0.5\textwidth]{Fig_5.pdf}}
......@@ -114,7 +114,7 @@ We are now showing that random replication \emph{without natural selection} is s
The abundance of complex structures at large mutational distances (i.e. 30 to 40 mutations from the seed) could eventually be linked to larger sizes of the hamming neighbourhoods (See \textbf{Fig.~\ref{fig:mutant_count}}). Nonetheless, here we also show that this phenomenon indeed coincides with an enrichment of stable multi-branched structures at specific GC contents (i.e. between 0.3 and 0.5).
In \textbf{Fig.~\ref{fig:ML_distribs}}, we compare the average minimum free energy (MFE), as well as the frequency of multi-loops in MFE structures, of sequences sampled from an uniform distribution of mutations (``Random'') to sequences sampled from the low-energy ensemble (``RNAmutants'').
In \textbf{Fig.~\ref{fig:ML_distribs}}, we compare the average minimum free energy (MFE), as well as the frequency of multi-loops in MFE structures, of sequences sampled from a uniform distribution of mutations (``Random'') to sequences sampled from the low-energy ensemble (``RNAmutants'').
\begin{figure}[htb!]
\includegraphics[width=\textwidth]{Fig_6.pdf}
......@@ -124,7 +124,7 @@ In \textbf{Fig.~\ref{fig:ML_distribs}}, we compare the average minimum free ener
We uniformly sampled random sequences in each mutational neighbourhood and calculated their minimum free energy (MFE) secondary structures. Our data confirms previous observations made on shorter sequences showing that multi-branched structures are rare and relatively unstable in random populations, but that their frequency increases with higher GC content biases \citep{stich2008structural}. Strikingly, we note that the frequency of stable multi-branched structures in the low energy ensemble (i.e. ``RNAmutants'') is almost matching those obtained with random sequences at GC contents between 0.3 and 0.5 and mutational distances from 30 and 40 (second row in the third and fourth columns of \textbf{Fig.~\ref{fig:ML_distribs}}), but with higher absolute values for higher GC content regimes (i.e. 0.5).
The analysis of folding energies shed a different light on this phenomenon. While the average energies of multi-branched structures remains steady at all mutational distances, this is not the case of other structures with simpler architectures (first row of \textbf{Fig.~\ref{fig:ML_distribs}}). Lower GC content regimes from 0.3 to 0.5 are characterized by a clear increase of average energies (i.e. MFE structures are less stable), which is not observed at higher GC contents. We conclude from these observations that the relative weight of multi-branched structures in the low energy ensemble increases due to a better (collective) resilience of this architecture to point-wise mutations and/or a more uniform distribution in the sequence landscape. In turn, it increases their density in the large/distant mutational neighbourhoods.
The analysis of folding energies shed a different light on this phenomenon. While the average energies of multi-branched structures remains steady at all mutational distances, this is not the case of other structures with simpler architectures (first row of \textbf{Fig.~\ref{fig:ML_distribs}}). Lower GC content regimes from 0.3 to 0.5 are characterized by a clear increase of average energies \hlt{at mutational distances over 20} (i.e. MFE structures are less stable), which is not observed at higher GC contents. We conclude from these observations that the relative weight of multi-branched structures in the low energy ensemble increases due to a better (collective) resilience of this architecture to point-wise mutations and/or a more uniform distribution in the sequence landscape. In turn, it increases their density in the large/distant mutational neighbourhoods.
Eventually, we also distinguish a secondary peak of occurrences of multi-branched structures in the vicinity of the seeds (i.e. 5-10 mutations) at higher GC regimes (0.7). By contrast, this higher density appears to result from mutants folding with marginally lower energies. It suggests the presence of mutants with improved fitness to the structures of the seeds rather than a global enrichment of multi-branched structures in these neighbourhoods.
......
% Title: A LaTeX Template For Responses To a Referees' Reports
`% Title: A LaTeX Template For Responses To a Referees' Reports
% Author: Petr Zemek <s3rvac@gmail.com>
% Homepage: https://blog.petrzemek.net/2016/07/17/latex-template-for-responses-to-referees-reports/
% License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment