methods.tex 11.7 KB
Newer Older
Vladimir Reinharz's avatar
Vladimir Reinharz committed
1
2
3
4
5
6
7
8
9
10
11
%!TEX root = main_maternal.tex

%Define block styles
\tikzstyle{decision} = [diamond, draw, fill=blue!20, 
    text width=4.5em, text badly centered, node distance=3cm, inner sep=0pt]
\tikzstyle{block} = [rectangle, draw, fill=blue!20, 
    text width=5em, text centered, rounded corners, minimum height=4em]
\tikzstyle{line} = [draw, -latex']
\tikzstyle{cloud} = [draw, ellipse,fill=red!20, node distance=3cm,
    minimum height=2em]
    
Vladimir Reinharz's avatar
Vladimir Reinharz committed
12
\section{Materials and Methods}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
13
14
15
16
\label{sec:methods}

\label{subsec:methodcarlos}

Carlos GO's avatar
Carlos GO committed
17

Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
18
\subsection{Evolutionary Algorithm (\maternal)}
Carlos GO's avatar
Carlos GO committed
19

Vladimir Reinharz's avatar
Vladimir Reinharz committed
20
\subsubsection{Initial Population}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
21
22
\label{sec:initialization}
Here we describe an evolutionary algorithm (EA) for energy-based selection with GC content bias. The algorithm is implemented in Python and is freely available at \maternalurl. 
Carlos GO's avatar
Carlos GO committed
23
	
Vladimir Reinharz's avatar
Vladimir Reinharz committed
24
We first define a population at a generation $t$ as a set $P_{t}$ of sequence-structure pairs. We denote a sequence-structure pair as $(\omega, s)$ such that $s$ is the minimum free energy structure on sequence $\omega$ as computed by the software \texttt{RNAfold} version 2.1.9 \citep{Lorenz:2011aa}. Each sequence is formed as a string from the alphabet $\mathbb{B}:=\{A,U,C, G\}$. For all experiments we work with constant population size of $|P_t| =1000 \quad \forall \quad t$, and constant sequence length $\texttt{len}(s_i) = 50 \quad \forall s_i \in P_t$ . We then apply principles of natural selection under Wright-Fisher sampling to iteratively obtain $P_{t+1}$ from $P_{t}$ for the desired number of generations in the simulation.
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
25
26

Sequences in the initial population, i.e. generation $t=0$, are generated by sampling sequences of the appropriate length uniformly at random from the alphabet $\mathbb{B}$.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
27
28


Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
29
30
%\subsubsection{Replication and Selection}
%\textit{Fitness Function and Evolutionary Algorithm} \\
Vladimir Reinharz's avatar
Vladimir Reinharz committed
31

Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
32
33
\subsubsection{Fitness Function}
\label{sec:replication_selection}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
34

Carlos GO's avatar
Carlos GO committed
35
In order to obtain subsequent generations, we iterate through $P_{t}$ and sample 1000 sequences with replacement according to their relative fitness in the population. Selected sequences generate one offspring that is added to the next generation's population $P_{t+1}$. Because we are sampling with replacement, higher fitness sequences on average contribute more offspring than lower fitness sequences. The relative fitness, or reproduction probability of a sequence $\omega$ is defined as the probability $F(\omega, s)$ that $\omega$ will undergo replication and contribute one offspring to generation $t + 1$. In previous studies, $F(\omega_{i}, s_i)$ has been typically defined as a function of the base pair distance between the MFE structure of $\omega$ and a given target structure. However, in our model, this function is proportional to the free energy of the sequence-structure pair, $E(\omega, s)$ as computed by \texttt{RNAfold}.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
36
37

\begin{equation}
38
	F(\omega, s) = N^{-1}e^{\frac{-\beta E(\omega, s)}{RT}}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
39
40
\end{equation}

Vladimir Reinharz's avatar
Vladimir Reinharz committed
41
The exponential term is also known as the Boltzmann weight of a sequence-structure pair. $N$ is a normalization factor obtained by summing over all other sequence-structure pairs in the population as $N = \sum_{\omega', s' \in S_t}\exp[\frac{-\beta E(\omega', s')}{RT}]$. This normalization enforces that reproduction probability of a sequence-structure pair is weighted against the Boltzmann weight of the entire population. $\beta$ is the selection coefficient that takes the value $\beta = -1$ in our simulations. $R=\num{0.00019871}~\kcalmolk$ and $T=310.15K$ are the Boltzmann constant and standard temperature respectively. 
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
42
43
44
45
46
47
48

When a sequence is selected for replication, the child sequence is formed by copying each nucleotide of the parent RNA with an error rate of $\mu$ known as the mutation rate. $\mu$ defines the probability of incorrectly copying a nucleotide and instead randomly sampling one of the other 3 bases in $\mathbb{B}$.
	

\subsubsection{Controlling population GC content} 
\label{sec:gc_control}
There are two obstacles to maintaining evolving populations within the desired GC content range of $\pm 0.1$. First, an initial population of random sequences sampled uniformly from the full alphabet naturally tends converge to a GC content of $0.5$. To avoid this, we sample from the alphabet with probability of sampling GC and AU equal to the desired GC content. This way our initial population has the desired nucleotide distribution. Second, when running the simulation, random mutations are able to move replicating sequences outside of the desired range. Given that we are selecting for stable structures, it is likely to drive the population to higher GC contents. To avoid this, at the selection stage, we do not select mutations that would take the sequence outside of this range. Instead, if a mutation takes a replicating sequence outside the GC range, we simply repeat the mutation process on the sequence until the child sequence has the appropriate GC content (See {\bf Alg. ~\ref{alg:gc}}).\\
Vladimir Reinharz's avatar
Vladimir Reinharz committed
49

Carlos GO's avatar
Carlos GO committed
50
51
52
53
54
\IncMargin{1em}
\begin{algorithm}[H]
	\SetKwInOut{Input}{input}\SetKwInOut{Output}{output}
	\SetKwFunction{computeGC}{computeGC}
	\SetKwFunction{mutate}{mutate}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
55

Carlos GO's avatar
Carlos GO committed
56
57
	\Input{$parentSeq$, $targetGC$}
	\Output{$childSeq$}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
58
	
Carlos GO's avatar
Carlos GO committed
59
	$childSeq \leftarrow$ \mutate{$parentSequence$}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
60
	
Carlos GO's avatar
Carlos GO committed
61
	\While{\computeGC{$childSequence$} not in $targetGC \pm 0.1$}{
Carlos GO's avatar
Carlos GO committed
62
63
64
65
66
67
68
		$childSeq \leftarrow$ \mutate{$parentSeq$}
	}
	\Return $childSeq$
\caption{GC content maintaining replication}
\label{alg:gc}
\end{algorithm}

Vladimir Reinharz's avatar
Vladimir Reinharz committed
69
\subsection{\RNAmutants}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
70
\label{sec:RNAmutants}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
71
72
73
74
The evolutionary algorithm is similar to a local search. 
At every time step new sequences are close to the previous population and in particular to 
the elements with higher \emph{fitness}.

Carlos GO's avatar
done.    
Carlos GO committed
75
In contrast \RNAmutants~\citep{Waldispuhl:2008aa} can sample pairs of sequence-structure $(\omega, s)$
76
 such that (1) the sequence is a $k$-mutant from a given seed $\omega_0$---for any $k$---%
Vladimir Reinharz's avatar
Vladimir Reinharz committed
77
and (2) the probability of seeing the pair is  proportional to its \emph{fitness} compared to all  pairs
Vladimir Reinharz's avatar
Vladimir Reinharz committed
78
$(\omega', s')$ where $w'$ is also an $k$-mutant of $\omega_0$.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
79

Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
80
In addition \RNAmutants provides an unbiased control of the samples GC content allowing direct comparisons with
Vladimir Reinharz's avatar
Vladimir Reinharz committed
81
82
83
84
85
86
\maternal. 

We note that although the structure sampled is not in general the MFE replacing them by it 
does not significantly change the results, as shown in Fig.~\ref{fig:freq}. 
Therefore we replace the sampled structure with the MFE to simplify the study.

Vladimir Reinharz's avatar
Vladimir Reinharz committed
87
88
89
90
Fo each  GC content in $\{0.1,0.3,0.5,0.7,0.9\}   (\pm 0.1)$ we generated 20 random seed of length 50. For each seed, at each mutational distance (i.e. number of mutations from the seed) from 0 to 50, at least 10 000 sequence-structure pairs within the target GC content of the seed were sampled from the Boltzmann distribution.
The software was run on Dual Intel Westmere EP Xeon X5650 (6-core, 2.66 GHz, 12MB Cache, 95W)	 on the Guillimin  High Performance Computing Cluster of Calcul Qu\'ebec. 
It took over \num{12000} CPU hours to complete the sampling.

Vladimir Reinharz's avatar
Vladimir Reinharz committed
91
92
\subsubsection{Sequence-structure pairs weighted sampling}

Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
93
Given a seed sequence $\omega_0$, a fixed number of mutations $k$, and the ensemble $\mathbb{S}_{\omega_0}^k$ of all pairs sequence-structure such that the sequences are at hamming $k$ from $\omega_0$. Similarly to Sec.~\ref{sec:replication_selection} the \emph{fitness} of a sequence-structure pair $(\omega, s)\in \mathbb{S}_{\omega_0}^k$ will be its Boltzmann weight, a function of its energy.  Formally, if the energy of the sequence $\omega$ in conformation $s$ is $E(\omega , s)$ then the weight of the pair is:
94

Vladimir Reinharz's avatar
Vladimir Reinharz committed
95
\[
Vladimir Reinharz's avatar
Vladimir Reinharz committed
96
	e^{\frac{-E(\omega, s)}{RT}}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
97
\]
Vladimir Reinharz's avatar
Vladimir Reinharz committed
98
where as before the Boltzmann constant $R$ equals \num{0.00019871}~\kcalmolk and  the temperature $T$ is set at $310.15K$.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
99
The normalization factor, or partition function, $\mathcal Z$ can now be defined as:
Vladimir Reinharz's avatar
Vladimir Reinharz committed
100
\[
Vladimir Reinharz's avatar
Vladimir Reinharz committed
101
	\mathcal Z =  \sum_{(\omega', s')\in \mathbb{S}_{\omega_0}^k}e^{\frac{-E(\omega', s')}{RT}}
Vladimir Reinharz's avatar
Vladimir Reinharz committed
102
\]
Vladimir Reinharz's avatar
Vladimir Reinharz committed
103
and thus the probability of sampling a pair $(\omega, s)$ is:
Vladimir Reinharz's avatar
Vladimir Reinharz committed
104
\[
105
	\mathbb{P}(\omega, s) =  \frac{e^{\frac{-E(\omega, s)}{RT}}}{\mathcal Z}.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
106
107
\]

108
By increasing $k$ from 1 to $|\omega_0|$ an exploration of whole mutational landscape of $\omega_0$ 
Vladimir Reinharz's avatar
Vladimir Reinharz committed
109
110
111
is performed. To compute $\mathcal Z$ for each value of $k$,   \RNAmutants has a complexity of $\mathcal O(n^3k^2)$.
This has to be done only once per seed. The weighted sampling of the sequences themselves has complexity
of $\mathcal O(n^2)$.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
112

Carlos GO's avatar
Carlos GO committed
113

Vladimir Reinharz's avatar
Vladimir Reinharz committed
114
\subsubsection{Controlling samples GC content}
Carlos GO's avatar
Carlos GO committed
115
Due to the deep correlation between the GC content of the sequence and its energy, the GC base pair being the most energetic in the Turner model~\citep{turner2009nndb} which is used by \RNAmutants,
Vladimir Reinharz's avatar
Vladimir Reinharz committed
116
sampling from any ensemble $\mathbb S$ will be highly biased towards sequences with high GC content. 
117
To get a sample $(\omega, s)$ at a specific target GC content, a natural approach is to continuously sample and  reject any sequence not
Vladimir Reinharz's avatar
Vladimir Reinharz committed
118
119
fitting the requirements. 
Such an approach can yield an exponential time so a technique developed in~\citep{waldispuhl2011unbiased} is applied.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
120

121
An unbiased sampling of pairs $(\omega, s)$  for any given GC target can be obtained by modifying the Boltzmann weights of any element $(\omega, s)$ with a term $\mathbf w^{\omega}\in[0, 1]$ which depends on the GC content of $\omega$. At its simplest, it can be the proportion of GC in $\omega$.
Vladimir Reinharz's avatar
Vladimir Reinharz committed
122
123
124
125
126
127
128
129
130
The weight of $(\omega, s)$ becomes
\[
	\mathbf w^{\omega}e^{\frac{-E(\omega, s)}{RT}}
\]
which implies that a new partition function $\mathcal Z^{\mathbf w}$ needs to be defined as follows:
\[
		\mathcal Z^{\mathbf w} =  \sum_{(\omega', s')\in \mathbb{S}_{\omega_0}^k}\mathbf w^{\omega}e^{\frac{-E(\omega', s')}{RT}}.
\]

Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
131
132
133
To find the weights $\mathbf w$ for any target GC an exact solution could be found but in practice an efficient solution consists in applying a bisection algorithm to $\mathbf w$. The general idea is to sample a limited number of sequences. Keep those with the desired GC content and then update $\mathbf w$ upwards if the average GC of the samples is lower than the target, and downwards else. In practice, only a couple of iterations are required.

\subsection{Sequence divergence in an random replication model}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
134
\label{sec:tamura}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
135
136
137
138
139
140
141
142
143
144

%$$
%\begin{pmatrix}
%1 - (\theta * \alpha + \beta) & (1 - \theta ) \beta & \theta \beta & \theta \alpha \\
%(1 - \theta ) \beta & 1 - (\theta * \alpha + \beta) & \theta \alpha & \theta \beta \\
%(1 - \theta ) \beta & (1 - \theta ) \alpha & 1 - ( 1 - (\theta * \alpha + \beta) ) & \theta \beta \\
%(1 - \theta ) \alpha  & (1 - \theta ) \beta & \theta \beta & 1 - ( 1 - (\theta * \alpha + \beta) ) \\
%\end{pmatrix}
%$$

Carlos GO's avatar
done.    
Carlos GO committed
145
We estimate the expected number of mutations in randomly replicated sequences (section \ref{sec:undirected_evolution}) using the transition matrix defined by K. Tamura \citep{Tamura:1992aa}. We use a mutation rate $\alpha = \num{0.02}$ mirroring the mutation rate used in \maternal, and assume that transition and transversion rates are identical. The target GC content is represented with the variable $\theta = \{ 0.1, 0.3, 0.5, 0.7, 0.9 \}$. The transition matrix is shown below.
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
146

Jerome Waldispuhl's avatar
intro    
Jerome Waldispuhl committed
147
\begin{center}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
148
{\scriptsize
Jerome Waldispuhl's avatar
intro    
Jerome Waldispuhl committed
149
\begin{tabular}{c|cccc}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
150
& A & U & C & G \\
Jerome Waldispuhl's avatar
intro    
Jerome Waldispuhl committed
151
152
153
154
155
156
\hline
A & $1 - \alpha (1 + \theta)$ & $(1 - \theta ) \alpha$ & $\theta \alpha$ & $\theta \alpha$ \\
U & $(1 - \theta ) \alpha$ & $1 - \alpha (1 + \theta)$ & $\theta \alpha$ &$ \theta \alpha$ \\
C & $(1 - \theta ) \alpha$ & $(1 - \theta ) \alpha$ & $1 - ( 1 - \alpha (1 + \theta) )$ & $\theta \alpha$ \\
G & $(1 - \theta ) \alpha$  & $(1 - \theta ) \alpha$ & $\theta \alpha$ & $1 - ( 1 - \alpha (1 + \theta ) )$ \\
\end{tabular}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
157
}
Jerome Waldispuhl's avatar
intro    
Jerome Waldispuhl committed
158
\end{center}
Jerome Waldispuhl's avatar
Jerome Waldispuhl committed
159
160
161

This matrix gives us the transition rate from one generation to the next one. To obtain the mutation probabilities at the $k^{th}$ generation, we calculate the $k^{th}$ exponent of this matrix. Then, we sum the values along the main diagonal to estimate the probability of a nucleotide to be the same at the initial and $k^{th}$ generation.