Commit 179a7318 authored by Carlos GO

region edits

parent 9c14c998
......@@ -9,11 +9,11 @@ Our results offer solid foundations to parsimonious evolutionary scenarios based
The preservation of intermediate GC content values appeared to us as a reasonable assumption, which could reflect the availability of various nucleotides in the prebiotic milieu. This nucleotide composition bias can be interpreted as an intrinsic force that favoured the emergence of life. It also offers novel insights into fundamental properties of the genetic alphabet~\citep{Gardner:2003aa}.
In addition to this central reservoir of complex structures, our data also revealed the occasional presence of stable multi-branched structures in the vicinity of random sequences at GC content regimes of 0.7 (See \textbf{Fig.~\ref{fig:rnamuts_multi}} and \textbf{Fig.~\ref{fig:ML_distribs}}). This finding is in agreement with previous theoretical studies that showed that neutral networks percolate the whole sequence landscape~\citep{Schuster:1994aa,Fontana:1998aa}. However, our simulations also suggest that a random replication model would only transiently occupy these regions, hence with significantly lower probabilities to find these structures. Eventually these close multi-branched structures appear to have a GC content ($\geq 0.7$) similar to those of RNA families with longer structures of more than 70 nucleotides (See \textbf{Fig.~\ref{subfig:rfam_stats}}). In these particular cases, we conjecture that the structures were selected later in evolution by natural selection processes.
In addition to this central reservoir \todo{reservoir} of complex structures, our data also revealed the occasional presence of stable multi-branched structures in the vicinity of random sequences at GC content regimes of 0.7 (See \textbf{Fig.~\ref{fig:rnamuts_multi}} and \textbf{Fig.~\ref{fig:ML_distribs}}). This finding is in agreement with previous theoretical studies that showed that neutral networks percolate the whole sequence landscape~\citep{Schuster:1994aa,Fontana:1998aa}. However, our simulations also suggest that a random replication model would only transiently occupy these regions, hence with significantly lower probabilities to find these structures. Eventually these close multi-branched structures appear to have a GC content ($\geq 0.7$) similar to those of RNA families with longer structures of more than 70 nucleotides (See \textbf{Fig.~\ref{subfig:rfam_stats}}). In these particular cases, we conjecture that the structures were selected later in evolution by natural selection processes.
Eventually, our results could be used to put in perspective earlier findings suggesting that natural selection is not required to explain pattern composition in rRNAs~\citep{Smit:2006aa}. Incidentally, these observations suggest further investigations into the role of more complex nucleotide distributions~\citep{Levin:2012aa}.
Our analysis completes recent studies that aimed to characterize fundamental properties of genotype-phenotype maps~\citep{Greenbury:2015aa,Manrubia:2017aa}, and showed that their structure may contribute to the emergence of functional molecules~\citep{Dingle:2015aa}. \red{Whereas previous studies focused on characterizing the genotype-phenotype map of random sequences, we show that the map of stable mutants arising from random seeds favours the discovery of complex structures.} It also emphasizes the relevance of theoretical models based on a thermodynamical view of prebiotic evolution~\cite{Pascal:2013aa}.
Our analysis completes recent studies that aimed to characterize fundamental properties of genotype-phenotype maps~\citep{Greenbury:2015aa,Manrubia:2017aa}, and showed that their structure may contribute to the emergence of functional molecules~\citep{Dingle:2015aa}. \red{Whereas previous studies focused on characterizing the static genotype-phenotype map of random sequences, we show that the map of stable mutants arising from random seeds favours the discovery of complex structures.} It also emphasizes the relevance of theoretical models based on a thermodynamical view of prebiotic evolution~\cite{Pascal:2013aa}.
The size of the RNA sequences considered in this study has been fixed at 50 nucleotides. This length appears to be the current upper limit for non-enzymatic synthesis~\citep{Hill:1993aa}, and therefore maximizes the expressivity of our evolutionary scenario. Variations of the sizes of populations or lengths of RNA sequences resulting from indels could eventually be considered with the implementation of dedicated algorithms~\citep{Waldispuhl:2002aa}. However, if these variations remain modest, we do not expect any major impact on our conclusions.
......
......@@ -74,4 +74,4 @@ a region of the RNA mutational landscape characterized by an average distance of
We compare these data to populations that evolved under a selective pressure eliciting stable structures. Although this evolutionary mechanism shows a remarkable capacity to quickly improve the stability of structures, it fails to reproduce the structural complexity observed in RNA families of similar lengths.
Finally, we show that a population of RNA molecules replicating itself randomly with accidental errors but preserving a balanced GC content~\citep{Tamura:1992aa}, naturally evolves toward regions of the landscape enriched with multi-branched structures potentially capable of supporting essential biochemical functions. Our results argue for a simple scenario of the origin of life in which an initial pool of nucleic acids would irresistibly evolve to promote a spontaneous and simultaneous discovery of the basic bricks of life.
Finally, we show that a population of RNA molecules replicating itself randomly with accidental errors but preserving a balanced GC content~\citep{Tamura:1992aa}, naturally evolves toward regions \todo{regions} of the landscape enriched with multi-branched structures potentially capable of supporting essential biochemical functions. Our results argue for a simple scenario of the origin of life in which an initial pool of nucleic acids would irresistibly evolve to promote a spontaneous and simultaneous discovery of the basic bricks of life.
......@@ -14,7 +14,7 @@
\usepackage{xcolor}
\usepackage{subcaption}
%\usepackage[disable]{todonotes}
\usepackage{todonotes}
\usepackage[textsize=tiny]{todonotes}
\usepackage{tikz}
\usepackage{soul}
\usetikzlibrary{shapes,arrows}
......@@ -66,7 +66,7 @@ $\\\small$^1$ School of Computer Science, McGill University, Montreal, Canada\\\
The RNA world hypothesis relies on the ability of ribonucleic acids to replicate and spontaneously acquire complex structures capable of supporting essential biological functions. Multiple sophisticated evolutionary models have been proposed, but they often assume specific conditions.
%
In this work we explore a simple and parsimonious scenario describing the emergence of complex molecular structures at the early stages of life. We show that at specific GC-content regimes, an undirected replication model is sufficient to explain the apparition of multi-branched RNA secondary structures -- a structural signature of many essential ribozymes. We ran a large scale computational study to map energetically stable structures on complete mutational networks of 50-nucleotide-long RNA sequences. Our results reveal \red{that regions of the sequence landscape with stable structures are enriched with multi-branched structures at a length scale coinciding with the appearance of complex structures in RNA databases.} A random replication mechanism preserving a $50\%$ GC-content suffices to explain a natural drift of RNA populations toward \hlt{complex stable structures}.
In this work we explore a simple and parsimonious scenario describing the emergence of complex molecular structures at the early stages of life. We show that at specific GC-content regimes, an undirected replication model is sufficient to explain the apparition of multi-branched RNA secondary structures -- a structural signature of many essential ribozymes. We ran a large scale computational study to map energetically stable structures on complete mutational networks of 50-nucleotide-long RNA sequences. Our results reveal \red{that regions of the sequence landscape with stable structures are enriched with multi-branched structures at a length scale coinciding with the appearance of complex structures in RNA databases.} A random replication mechanism preserving a $50\%$ GC-content suffices to explain a natural drift of RNA populations toward \hlt{complex stable structures}. \todo{regions}
\end{abstract}
......
......@@ -4,7 +4,7 @@
% Overview
\subsection{Our approach}
We apply two complementary mutation space search techniques to characterize the influence of \hlt{sampling} process to the repertoire of shapes accessible from an initial pool of random sequences (See \textbf{Fig.~\ref{fig:summary}}). Importantly, our analysis explicitly models the impact of the GC content bias to \red{understand} the effect of potential nucleotide scarcity in prebiotic conditions \red{~\citep{penny1999nature, Gardner:2003aa}}. Our first algorithm \RNAmutants \red{ (see Sec.~\ref{sec:RNAmutants})} enumerates all mutated sequences and samples the ones with the \emph{globally} lowest folding energy~\citep{Waldispuhl:2008aa}. It enables us to calculate the structures accessible from a random replication process. \hlt{In contrast, } our other algorithm named \maternal, has been developed for this study to simulate the evolution of a population of RNA sequences that preferentially selects the most stable structures under nucleotide bias.
We apply two complementary mutation space search techniques to characterize the influence of \hlt{sampling} process to the repertoire of shapes accessible from an initial pool of random sequences (See \textbf{Fig.~\ref{fig:summary}}). Importantly, our analysis explicitly models the impact of the GC content bias to \red{understand} the effect of potential nucleotide scarcity in prebiotic conditions \red{~\citep{penny1999nature, Gardner:2003aa}}. Our first algorithm \RNAmutants \red{ (see Sec.~\ref{sec:RNAmutants})} enumerates all mutated sequences and samples the ones with the \emph{globally} lowest folding energy~\citep{Waldispuhl:2008aa}. It enables us to calculate the structures accessible from a random replication process. \todo[inline]{distinguish RNAmutants from a replication process!}\hlt{In contrast, } our other algorithm named \maternal, has been developed for this study to simulate the evolution of a population of RNA sequences that preferentially selects the most stable structures under nucleotide bias.
\begin{figure}
\centering
......@@ -25,7 +25,7 @@ More precisely, given an initial sequence (i.e. the \emph{seed}), \RNAmutants~ \
\begin{itemize}
\item \red{\RNAmutants reveals that around 50\% GC content, regions of the mutant landscape composed of sequences with stable secondary structures are enriched with multi-branched architectures.}
\item \red{Simple random replication models without selection natural selection suffices to explain a drift toward the same mutational neighbourhood.}
\item \red{Simple random replication models without selection natural selection suffices to explain a drift toward the same mutational neighbourhood.} \todo{reword}
\item \red{A natural selection exploration eliciting stable folds fails to reach this region.}
\end{itemize}
......@@ -57,7 +57,7 @@ As expected, we observe that the nucleotide content is an important factor in de
%\subsubsection{Structural diversity}
Having established that mutational neighbourhoods of random sequences yield stable and low energy states, it is important to also understand the structural features of these states. This is of particular interest trying to provide an account of the emergence of functional and complex RNAs. While previous studies on the structure of short randomly sampled RNA sequences have shown that simple hairpin structures dominate the landscape, we find that for longer molecules the landscape of stable mutant neighbourhood reveals ensembles that are well populated with diverse structures (See {\bf Fig.~\ref{fig:struct_diversity}}). Interestingly, for all GC content sampling and particularly at low to intermediate GC contents, we find that after an initial stabilization period at short mutational distances ($\sim 10$ mutations) more complex structures begin to emerge. This sudden and unexpected change of regime seems to correlate with the decrease of the average folding energies observed in {\bf Fig.~\ref{fig:energy}}.
Having established that mutational neighbourhoods of random sequences yield stable and low energy states, it is important to also understand the structural features of these states. This is of particular interest in trying to provide an account of the emergence of functional and complex RNAs. While previous studies on the structure of short randomly sampled RNA sequences have shown that simple hairpin structures dominate the landscape, we find that for longer molecules the landscape of stable mutant neighbourhood reveals ensembles that are well populated with diverse structures (See {\bf Fig.~\ref{fig:struct_diversity}}). Interestingly, for all GC content sampling and particularly at low to intermediate GC contents, we find that after an initial stabilization period at short mutational distances ($\sim 10$ mutations) complex structures begin to emerge. This sudden and unexpected change of regime seems to accompany the increase of the average folding energies observed in {\bf Fig.~\ref{fig:energy}}.
We can observe this change as an increase in the number of structures with internal loops ({\bf Fig.~\ref{fig:rnamuts_internal}}) and remarkably, multi-loops. This finding is in good qualitative agreement with databases of evolved structures ({\bf Fig.~\ref{subfig:rfam_count}}) \red{where multiloop structures begin to emerge in families slightly under 50 nucleotides long.}
......@@ -65,6 +65,8 @@ These secondary structure features (i.e. internal and multi-loops) are key compo
Across all runs, we sampled a total of \num{9419} sequences containing a multi-loop (no more than 1 multi-loop per structure was ever observed as is to be expected for such length scales). Interestingly, we find that unlike internal loops, multi-loops occur under very specific conditions in our sampling. We identify a clear surge of multi-loop frequency at a mutational distance of $\sim 35$ (See {\bf Fig.~\ref{fig:rnamuts_multi}}), with a mean GC content of $\sim 0.45$ (See {\bf Fig.~\ref{fig:multigc}}). Furthermore, their energy distribution is tightly centered around $\sim -15\:\kcalmol$ (See {\bf Fig.~\ref{fig:multieng}}). These values are remarkably close to those of multi-branched structures of similar lengths observed in the Rfam database (See {\bf Fig.~\ref{subfig:rfam_stats}}). In particular, the latter shows a clear bias toward medium GC contents as we identified 148 Rfam families with multi-loops with a GC content of 0.5 (among all Rfam families with sequences having at most 200 nucleotides), but only 80 with a GC content of 0.3 and 40 with a GC content of 0.7. This serves as further evidence that GC content is an important determinant of the evolution of structural complexity. It also appears that these features are a general property of the distribution of multi-loops in the mutational landscape given that the sequence entropy of the set of sequences containing multi-loops is \red{quite high ($0.945$ out of $1$)}. \red{This indicates that the observed properties are likely a feature of the GC content bias and not due to over-representation by isolated groups of similar sequences.} \red{Further} analysis carried out in Section \ref{sec:undirected_evolution} demonstrates that this enrichment of complex structures is not simply an artifact of larger Hamming neighbourhoods that accompany deeper mutational explorations.
Eventually, we also note a smaller peak of multi-loop occurrences closer to the seed sequences ($\sim 6$ mutations) for higher GC contents around $0.7$. Interestingly, with folding energies ranging from $-25$ to $-40\:\kcalmol$, these multi-branched structures are significantly more stable than those present in the main peak (See {\bf Fig.~\ref{fig:energy_map}}). This is also in agreement with the energies observed in the Rfam database for structures within this range of GC content values (See {\bf Fig.~\ref{subfig:rfam_stats}}).
\begin{figure}[htb!]
......@@ -97,10 +99,9 @@ Eventually, we also note a smaller peak of multi-loop occurences closer from the
\end{figure}
\subsection{Random replication {\it without} selection}\label{sec:undirected_evolution}
\label{sec:undirected_evolution}
We are now showing that random replication \emph{without natural selection} is sufficient to explain the diversity of structures observed in RNA databases, and further the emergence of RNA structural complexity. We consider a simple model in which RNA molecules are duplicated with a small error rate, but preserving the GC content~\citep{Tamura:1992aa}. In our simulations, we use an error rate of $0.02$ to allow a immediate comparison of the number of elapsed generations. We also apply identical transitions and transversions rates. Under these assumptions, we can directly compute the {\it expected} number of mutations in sequences at the $i^{th}$ generation (see Methods). \textbf{Fig.~\ref{fig:tamura}} shows the results of this calculation for GC content biases varying from $0.1$ to $0.9$. Strikingly, our data reveals that after a short initialization phase (i.e. after $\sim 50$ generations), sequences with a GC content of $0.5$ have on average slightly more than $35$ mutations. This observation is in perfect adequacy with the peak of multi-branched structures identified in \textbf{Fig.~\ref{fig:rnamuts_multi}} and \textbf{Fig.~\ref{fig:multigc}}. It follows that a simple undirected replication mechanism, not subject to natural selection, is sufficient to explain a drift of populations of RNA molecules toward regions of the sequence landscape enriched with multi-branched structures.
We are now showing that random replication \emph{without natural selection} is sufficient to explain the diversity of structures observed in RNA databases, and further the emergence of RNA structural complexity. We consider a simple model in which RNA molecules are duplicated with a small error rate, but preserving the GC content~\citep{Tamura:1992aa}. In our simulations, we use an error rate of $0.02$ to allow immediate comparison of the number of elapsed generations. We also apply identical transitions and transversions rates. Under these assumptions, we can directly compute the {\it expected} number of mutations in sequences at the $i^{th}$ generation (see Methods). \textbf{Fig.~\ref{fig:tamura}} shows the results of this calculation for GC content biases varying from $0.1$ to $0.9$. Strikingly, our data reveals that after a short initialization phase (i.e. after $\sim 50$ generations), sequences with a GC content of $0.5$ have on average slightly more than $35$ mutations. This observation is in perfect adequacy with the peak of multi-branched structures identified in \textbf{Fig.~\ref{fig:rnamuts_multi}} and \textbf{Fig.~\ref{fig:multigc}}. It follows that a simple undirected replication mechanism, not subject to natural selection, is sufficient to explain a drift of populations of RNA molecules toward regions of the sequence landscape enriched with multi-branched structures. \todo{regions.}
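A back-of-the-envelope check of the long-run value can be sketched under assumptions that are not taken from the Methods. With identical transition and transversion rates, the model of~\citet{Tamura:1992aa} reduces to one in which each substituted site is redrawn from the stationary composition $\pi_G = \pi_C = \theta/2$ and $\pi_A = \pi_U = (1-\theta)/2$, where $\theta$ denotes the GC content. Assuming further that the seed is itself drawn from this composition, the expected number of positions at which a sequence of length $L$ differs from its seed, in the limit of many generations, would be
\[
  \bar{d}_\infty \;=\; L\Bigl(1 - \sum_{x \in \{A,C,G,U\}} \pi_x^2\Bigr) \;=\; L\Bigl(1 - \frac{\theta^2 + (1-\theta)^2}{2}\Bigr).
\]
For $L = 50$ and $\theta = 0.5$ this gives $50 \times 0.75 = 37.5$, which is compatible with the value of slightly more than $35$ reported above, whereas $\theta = 0.1$ or $\theta = 0.9$ gives $50 \times 0.59 = 29.5$. Note that $\bar{d}_\infty$ counts positions differing from the seed rather than cumulative substitution events, and that it says nothing about how quickly this limit is approached.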
\begin{figure}[htb!]
\centerline{\includegraphics[width=0.5\textwidth]{Fig_5.pdf}}
......@@ -126,7 +127,7 @@ The analysis of folding energies sheds a different light on this phenomenon. Whi
Eventually, we also distinguish a secondary peak of occurrences of multi-branched structures in the vicinity of the seeds (i.e. 5-10 mutations) at higher GC regimes (0.7). By contrast, this higher density appears to result from mutants folding with marginally lower energies. It suggests the presence of mutants with improved fitness to the structures of the seeds rather than a global enrichment of multi-branched structures in these neighbourhoods.
In conclusion, assuming that functional structures are preferentially fixed on stable structures available in the sequence landscape, our simulations suggest that GC content regimes of 0.5 favour the discovery of multi-branched structures. The probability of spontaneous emergence of complex structures in a random replication model \red{without selection} may thus be higher than currently estimated.
In conclusion, assuming that functional structures are preferentially fixed \todo[inline]{elaborate on this point} on stable structures available in the sequence landscape, our simulations suggest that GC content regimes of 0.5 favour the discovery of multi-branched structures. The probability of spontaneous emergence of complex structures in a random replication model \red{without selection} may thus be higher than currently estimated.
\subsection{An energy-based evolutionary model \red{with natural selection}}
......