methods.tex 11.7 KB
 Vladimir Reinharz committed Jul 25, 2014 1 2 3 4 5 6 7 8 9 10 11 %!TEX root = main_maternal.tex %Define block styles \tikzstyle{decision} = [diamond, draw, fill=blue!20, text width=4.5em, text badly centered, node distance=3cm, inner sep=0pt] \tikzstyle{block} = [rectangle, draw, fill=blue!20, text width=5em, text centered, rounded corners, minimum height=4em] \tikzstyle{line} = [draw, -latex'] \tikzstyle{cloud} = [draw, ellipse,fill=red!20, node distance=3cm, minimum height=2em]  Vladimir Reinharz committed Sep 29, 2014 12 \section{Materials and Methods}  Vladimir Reinharz committed Jul 25, 2014 13 14 15 16 \label{sec:methods} \label{subsec:methodcarlos}  Carlos GO committed Oct 08, 2016 17   Jerome Waldispuhl committed Jun 07, 2017 18 \subsection{Evolutionary Algorithm (\maternal)}  Carlos GO committed Oct 08, 2016 19   Vladimir Reinharz committed Jul 25, 2014 20 \subsubsection{Initial Population}  Jerome Waldispuhl committed Jun 07, 2017 21 22 \label{sec:initialization} Here we describe an evolutionary algorithm (EA) for energy-based selection with GC content bias. The algorithm is implemented in Python and is freely available at \maternalurl.  Carlos GO committed Jun 04, 2017 23   Vladimir Reinharz committed Aug 10, 2017 24 We first define a population at a generation $t$ as a set $P_{t}$ of sequence-structure pairs. We denote a sequence-structure pair as $(\omega, s)$ such that $s$ is the minimum free energy structure on sequence $\omega$ as computed by the software \texttt{RNAfold} version 2.1.9 \citep{Lorenz:2011aa}. Each sequence is formed as a string from the alphabet $\mathbb{B}:=\{A,U,C, G\}$. For all experiments we work with constant population size of $|P_t| =1000 \quad \forall \quad t$, and constant sequence length $\texttt{len}(s_i) = 50 \quad \forall s_i \in P_t$ . We then apply principles of natural selection under Wright-Fisher sampling to iteratively obtain $P_{t+1}$ from $P_{t}$ for the desired number of generations in the simulation.  Jerome Waldispuhl committed Jun 07, 2017 25 26  Sequences in the initial population, i.e. generation $t=0$, are generated by sampling sequences of the appropriate length uniformly at random from the alphabet $\mathbb{B}$.  Vladimir Reinharz committed Jul 25, 2014 27 28   Jerome Waldispuhl committed Jun 07, 2017 29 30 %\subsubsection{Replication and Selection} %\textit{Fitness Function and Evolutionary Algorithm} \\  Vladimir Reinharz committed Jul 25, 2014 31   Jerome Waldispuhl committed Jun 07, 2017 32 33 \subsubsection{Fitness Function} \label{sec:replication_selection}  Vladimir Reinharz committed Jul 25, 2014 34   Carlos GO committed Aug 13, 2017 35 In order to obtain subsequent generations, we iterate through $P_{t}$ and sample 1000 sequences with replacement according to their relative fitness in the population. Selected sequences generate one offspring that is added to the next generation's population $P_{t+1}$. Because we are sampling with replacement, higher fitness sequences on average contribute more offspring than lower fitness sequences. The relative fitness, or reproduction probability of a sequence $\omega$ is defined as the probability $F(\omega, s)$ that $\omega$ will undergo replication and contribute one offspring to generation $t + 1$. In previous studies, $F(\omega_{i}, s_i)$ has been typically defined as a function of the base pair distance between the MFE structure of $\omega$ and a given target structure. However, in our model, this function is proportional to the free energy of the sequence-structure pair, $E(\omega, s)$ as computed by \texttt{RNAfold}.  Vladimir Reinharz committed Jul 25, 2014 36 37   Carlos GO committed Apr 07, 2017 38  F(\omega, s) = N^{-1}e^{\frac{-\beta E(\omega, s)}{RT}}  Vladimir Reinharz committed Jul 25, 2014 39 40   Vladimir Reinharz committed Aug 13, 2017 41 The exponential term is also known as the Boltzmann weight of a sequence-structure pair. $N$ is a normalization factor obtained by summing over all other sequence-structure pairs in the population as $N = \sum_{\omega', s' \in S_t}\exp[\frac{-\beta E(\omega', s')}{RT}]$. This normalization enforces that reproduction probability of a sequence-structure pair is weighted against the Boltzmann weight of the entire population. $\beta$ is the selection coefficient that takes the value $\beta = -1$ in our simulations. $R=\num{0.00019871}~\kcalmolk$ and $T=310.15K$ are the Boltzmann constant and standard temperature respectively.  Jerome Waldispuhl committed Jun 07, 2017 42 43 44 45 46 47 48  When a sequence is selected for replication, the child sequence is formed by copying each nucleotide of the parent RNA with an error rate of $\mu$ known as the mutation rate. $\mu$ defines the probability of incorrectly copying a nucleotide and instead randomly sampling one of the other 3 bases in $\mathbb{B}$. \subsubsection{Controlling population GC content} \label{sec:gc_control} There are two obstacles to maintaining evolving populations within the desired GC content range of $\pm 0.1$. First, an initial population of random sequences sampled uniformly from the full alphabet naturally tends converge to a GC content of $0.5$. To avoid this, we sample from the alphabet with probability of sampling GC and AU equal to the desired GC content. This way our initial population has the desired nucleotide distribution. Second, when running the simulation, random mutations are able to move replicating sequences outside of the desired range. Given that we are selecting for stable structures, it is likely to drive the population to higher GC contents. To avoid this, at the selection stage, we do not select mutations that would take the sequence outside of this range. Instead, if a mutation takes a replicating sequence outside the GC range, we simply repeat the mutation process on the sequence until the child sequence has the appropriate GC content (See {\bf Alg. ~\ref{alg:gc}}).\\  Vladimir Reinharz committed Jul 25, 2014 49   Carlos GO committed Feb 21, 2017 50 51 52 53 54 \IncMargin{1em} \begin{algorithm}[H] \SetKwInOut{Input}{input}\SetKwInOut{Output}{output} \SetKwFunction{computeGC}{computeGC} \SetKwFunction{mutate}{mutate}  Vladimir Reinharz committed Jul 25, 2014 55   Carlos GO committed Feb 21, 2017 56 57  \Input{$parentSeq$, $targetGC$} \Output{$childSeq$}  Vladimir Reinharz committed Jul 25, 2014 58   Carlos GO committed Feb 21, 2017 59  $childSeq \leftarrow$ \mutate{$parentSequence$}  Vladimir Reinharz committed Jul 25, 2014 60   Carlos GO committed Jun 01, 2017 61  \While{\computeGC{$childSequence$} not in $targetGC \pm 0.1$}{  Carlos GO committed Feb 21, 2017 62 63 64 65 66 67 68  $childSeq \leftarrow$ \mutate{$parentSeq$} } \Return $childSeq$ \caption{GC content maintaining replication} \label{alg:gc} \end{algorithm}  Vladimir Reinharz committed Mar 29, 2017 69 \subsection{\RNAmutants}  Jerome Waldispuhl committed Jun 07, 2017 70 \label{sec:RNAmutants}  Vladimir Reinharz committed Mar 29, 2017 71 72 73 74 The evolutionary algorithm is similar to a local search. At every time step new sequences are close to the previous population and in particular to the elements with higher \emph{fitness}.  Carlos GO committed Jun 01, 2017 75 In contrast \RNAmutants~\citep{Waldispuhl:2008aa} can sample pairs of sequence-structure $(\omega, s)$  Carlos GO committed Mar 31, 2017 76  such that (1) the sequence is a $k$-mutant from a given seed $\omega_0$---for any $k$---%  Vladimir Reinharz committed Mar 29, 2017 77 and (2) the probability of seeing the pair is proportional to its \emph{fitness} compared to all pairs  Vladimir Reinharz committed Aug 10, 2017 78 $(\omega', s')$ where $w'$ is also an $k$-mutant of $\omega_0$.  Vladimir Reinharz committed Mar 29, 2017 79   Jerome Waldispuhl committed Jun 07, 2017 80 In addition \RNAmutants provides an unbiased control of the samples GC content allowing direct comparisons with  Vladimir Reinharz committed Mar 29, 2017 81 82 83 84 85 86 \maternal. We note that although the structure sampled is not in general the MFE replacing them by it does not significantly change the results, as shown in Fig.~\ref{fig:freq}. Therefore we replace the sampled structure with the MFE to simplify the study.  Vladimir Reinharz committed Jun 09, 2017 87 88 89 90 Fo each GC content in $\{0.1,0.3,0.5,0.7,0.9\} (\pm 0.1)$ we generated 20 random seed of length 50. For each seed, at each mutational distance (i.e. number of mutations from the seed) from 0 to 50, at least 10 000 sequence-structure pairs within the target GC content of the seed were sampled from the Boltzmann distribution. The software was run on Dual Intel Westmere EP Xeon X5650 (6-core, 2.66 GHz, 12MB Cache, 95W) on the Guillimin High Performance Computing Cluster of Calcul Qu\'ebec. It took over \num{12000} CPU hours to complete the sampling.  Vladimir Reinharz committed Mar 29, 2017 91 92 \subsubsection{Sequence-structure pairs weighted sampling}  Jerome Waldispuhl committed Jun 07, 2017 93 Given a seed sequence $\omega_0$, a fixed number of mutations $k$, and the ensemble $\mathbb{S}_{\omega_0}^k$ of all pairs sequence-structure such that the sequences are at hamming $k$ from $\omega_0$. Similarly to Sec.~\ref{sec:replication_selection} the \emph{fitness} of a sequence-structure pair $(\omega, s)\in \mathbb{S}_{\omega_0}^k$ will be its Boltzmann weight, a function of its energy. Formally, if the energy of the sequence $\omega$ in conformation $s$ is $E(\omega , s)$ then the weight of the pair is:  Carlos GO committed Mar 31, 2017 94   Vladimir Reinharz committed Mar 29, 2017 95 $ Vladimir Reinharz committed Apr 07, 2017 96  e^{\frac{-E(\omega, s)}{RT}}  Vladimir Reinharz committed Mar 29, 2017 97 $  Vladimir Reinharz committed Aug 13, 2017 98 where as before the Boltzmann constant $R$ equals \num{0.00019871}~\kcalmolk and the temperature $T$ is set at $310.15K$.  Vladimir Reinharz committed Apr 07, 2017 99 The normalization factor, or partition function, $\mathcal Z$ can now be defined as:  Vladimir Reinharz committed Mar 29, 2017 100 $ Vladimir Reinharz committed Apr 07, 2017 101  \mathcal Z = \sum_{(\omega', s')\in \mathbb{S}_{\omega_0}^k}e^{\frac{-E(\omega', s')}{RT}}  Vladimir Reinharz committed Mar 29, 2017 102 $  Vladimir Reinharz committed Aug 10, 2017 103 and thus the probability of sampling a pair $(\omega, s)$ is:  Vladimir Reinharz committed Mar 29, 2017 104 $ Carlos GO committed Mar 31, 2017 105  \mathbb{P}(\omega, s) = \frac{e^{\frac{-E(\omega, s)}{RT}}}{\mathcal Z}.  Vladimir Reinharz committed Mar 29, 2017 106 107 $  Carlos GO committed Mar 31, 2017 108 By increasing $k$ from 1 to $|\omega_0|$ an exploration of whole mutational landscape of $\omega_0$  Vladimir Reinharz committed Apr 07, 2017 109 110 111 is performed. To compute $\mathcal Z$ for each value of $k$, \RNAmutants has a complexity of $\mathcal O(n^3k^2)$. This has to be done only once per seed. The weighted sampling of the sequences themselves has complexity of $\mathcal O(n^2)$.  Vladimir Reinharz committed Mar 29, 2017 112   Carlos GO committed Nov 02, 2016 113   Vladimir Reinharz committed Mar 29, 2017 114 \subsubsection{Controlling samples GC content}  Carlos GO committed May 20, 2017 115 Due to the deep correlation between the GC content of the sequence and its energy, the GC base pair being the most energetic in the Turner model~\citep{turner2009nndb} which is used by \RNAmutants,  Vladimir Reinharz committed Apr 07, 2017 116 sampling from any ensemble $\mathbb S$ will be highly biased towards sequences with high GC content.  Vladimir Reinharz committed Apr 14, 2017 117 To get a sample $(\omega, s)$ at a specific target GC content, a natural approach is to continuously sample and reject any sequence not  Vladimir Reinharz committed Apr 14, 2017 118 119 fitting the requirements. Such an approach can yield an exponential time so a technique developed in~\citep{waldispuhl2011unbiased} is applied.  Vladimir Reinharz committed Apr 07, 2017 120   Vladimir Reinharz committed Apr 14, 2017 121 An unbiased sampling of pairs $(\omega, s)$ for any given GC target can be obtained by modifying the Boltzmann weights of any element $(\omega, s)$ with a term $\mathbf w^{\omega}\in[0, 1]$ which depends on the GC content of $\omega$. At its simplest, it can be the proportion of GC in $\omega$.  Vladimir Reinharz committed Apr 07, 2017 122 123 124 125 126 127 128 129 130 The weight of $(\omega, s)$ becomes $\mathbf w^{\omega}e^{\frac{-E(\omega, s)}{RT}}$ which implies that a new partition function $\mathcal Z^{\mathbf w}$ needs to be defined as follows: $\mathcal Z^{\mathbf w} = \sum_{(\omega', s')\in \mathbb{S}_{\omega_0}^k}\mathbf w^{\omega}e^{\frac{-E(\omega', s')}{RT}}.$  Jerome Waldispuhl committed May 13, 2017 131 132 133 To find the weights $\mathbf w$ for any target GC an exact solution could be found but in practice an efficient solution consists in applying a bisection algorithm to $\mathbf w$. The general idea is to sample a limited number of sequences. Keep those with the desired GC content and then update $\mathbf w$ upwards if the average GC of the samples is lower than the target, and downwards else. In practice, only a couple of iterations are required. \subsection{Sequence divergence in an random replication model}  Jerome Waldispuhl committed Jun 07, 2017 134 \label{sec:tamura}  Jerome Waldispuhl committed May 13, 2017 135 136 137 138 139 140 141 142 143 144  %$$%\begin{pmatrix} %1 - (\theta * \alpha + \beta) & (1 - \theta ) \beta & \theta \beta & \theta \alpha \\ %(1 - \theta ) \beta & 1 - (\theta * \alpha + \beta) & \theta \alpha & \theta \beta \\ %(1 - \theta ) \beta & (1 - \theta ) \alpha & 1 - ( 1 - (\theta * \alpha + \beta) ) & \theta \beta \\ %(1 - \theta ) \alpha & (1 - \theta ) \beta & \theta \beta & 1 - ( 1 - (\theta * \alpha + \beta) ) \\ %\end{pmatrix} %$$  Carlos GO committed Jun 01, 2017 145 We estimate the expected number of mutations in randomly replicated sequences (section \ref{sec:undirected_evolution}) using the transition matrix defined by K. Tamura \citep{Tamura:1992aa}. We use a mutation rate $\alpha = \num{0.02}$ mirroring the mutation rate used in \maternal, and assume that transition and transversion rates are identical. The target GC content is represented with the variable $\theta = \{ 0.1, 0.3, 0.5, 0.7, 0.9 \}$. The transition matrix is shown below.  Jerome Waldispuhl committed May 13, 2017 146   Jerome Waldispuhl committed May 20, 2017 147 \begin{center}  Jerome Waldispuhl committed May 13, 2017 148 {\scriptsize  Jerome Waldispuhl committed May 20, 2017 149 \begin{tabular}{c|cccc}  Jerome Waldispuhl committed May 13, 2017 150 & A & U & C & G \\  Jerome Waldispuhl committed May 20, 2017 151 152 153 154 155 156 \hline A & $1 - \alpha (1 + \theta)$ & $(1 - \theta ) \alpha$ & $\theta \alpha$ & $\theta \alpha$ \\ U & $(1 - \theta ) \alpha$ & $1 - \alpha (1 + \theta)$ & $\theta \alpha$ &$\theta \alpha$ \\ C & $(1 - \theta ) \alpha$ & $(1 - \theta ) \alpha$ & $1 - ( 1 - \alpha (1 + \theta) )$ & $\theta \alpha$ \\ G & $(1 - \theta ) \alpha$ & $(1 - \theta ) \alpha$ & $\theta \alpha$ & $1 - ( 1 - \alpha (1 + \theta ) )$ \\ \end{tabular}  Jerome Waldispuhl committed May 13, 2017 157 }  Jerome Waldispuhl committed May 20, 2017 158 \end{center}  Jerome Waldispuhl committed May 13, 2017 159 160 161  This matrix gives us the transition rate from one generation to the next one. To obtain the mutation probabilities at the $k^{th}$ generation, we calculate the $k^{th}$ exponent of this matrix. Then, we sum the values along the main diagonal to estimate the probability of a nucleotide to be the same at the initial and $k^{th}$ generation.