Commit a2ca84be authored by Carlos GO

figs

parent 81039242
File added
@phdthesis{reinharz2016algorithmic,
title={Algorithmic Properties of Evolved Structured {RNA}s},
author={Reinharz, Vladimir},
year={2016},
school={McGill University Libraries}
}
......@@ -3,13 +3,24 @@
\geometry{letterpaper} % ... or a4paper or a5paper or ...
%\geometry{landscape} % Activate for for rotated page geometry
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{fdsymbol}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
\usepackage[ruled,vlined,linesnumbered,noresetcount]{algorithm2e}
\usepackage{tikz}
\usetikzlibrary{timeline}
\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\graphicspath{{Figs/}}
\newcommand{\maternal}{\texttt{mateRNAl}}
\newcommand{\rnamigos}{\texttt{RNAmigos}}
\newcommand{\vernal}{\texttt{veRNAl}}
\newcommand{\garl}{\texttt{garl}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\title{PhD Proposal Exam \\ Computational Tools for Complex RNA Structure Analysis}
\author{Carlos G. Oliver}
\date{\today} % Activate to display a given date or no date
......@@ -23,22 +34,32 @@
\section{Abstract}
RNA (Ribonucleic Acid) is a family of functional molecules which control vital processes in all kingdoms of life.
Complex 3D structures in these molecules encode a signature required to specify complex functions such as: regulating gene expression, catalyzing reactions, and SOMETHING.
The unique flexibility of RNA molecules allows for a large range of complex structures to be adopted which can determine very specific functions.
In this thesis we address two major questions: 1) the emergence of such structures through evolutionary processes, and 2) the analysis of complex RNA structures as functional signatures.
A major question is how complex structures could have arisen in prebiotic conditions.
Project I addresses this question with a unique evolutionary reactor for studying the thermodynamic and environmental scenarios which could lead to the emergence of complex RNA structures.
While statistical and thermodynamic methods have been successful at modelling RNA at a 2D (planar) level, these are ultimately a proxy for the true determinants of structure at the 3D level.
In Project II we
RNA (Ribonucleic Acid) molecules form a family of functional molecules which control vital cellular processes in all kingdoms of life.
These functions depend on complex architectures adopted by the molecule through a process of hierarchical folding.
The unique flexibility and structural organization of RNA molecules supports a diverse range of structures which can determine many specific functions.
Efficient computational tools are essential for studying these structures and can provide valuable insights into the mechanisms underlying the most important processes of life.
To date, a large number of algorithmic advances have tackled many challenges in RNA structure at the 2D level including structure prediction, comparison and design.
However, most tools have been developed to handle RNA at the 2D (planar) level of organization, whereas information for highly specific function is typically encoded at the level of 3D structure.
In this thesis we address two major challenges: (i) the emergence of such structures through evolutionary processes, and (ii) the analysis of complex RNA structures as functional signatures.
\section{Thesis Objectives and Contributions}
The central goal of my PhD research is to develop tools for understanding complex interactions arising in RNA molecules.
Successful development of such tools would represent a step towards a systematic understanding of the patterns governing key biological processes.
{\bf Project I} addresses the question of how RNA molecules could have evolved the basis of higher-order structural patterns in a pre-biotic setting.
To this end, we develop an evolutionary reactor for studying the thermodynamic and environmental scenarios which could lead to the emergence of complex RNA structures.
In {\bf Projects II-IV} we build a set of tools for extracting useful patterns from RNA at the 3D level.
The main challenge addressed by these tools is the computational complexity imposed by modeling higher-order interactions which break some of the central assumptions of earlier models such as planarity of graphs and the convenience of experimental energy models.
The result is \rnamigos, \garl, and \vernal.
These tools respectively address the problems of (i) building predictive models on complex RNA networks, (ii) making efficient comparisons and alignments, and (iii) mining recurrent structural signatures.
\begin{itemize}
\item maternal: complex structures evolve autonomously at 2D level
\item rnamigos: at 3D level we can learn useful functional signatures using graph representations
\item garl: tool for automated structure matching with custom cost functions
\item vernal: we can isolate new complex signatures using graph neural networks
\item \maternal: complex structures evolve autonomously at 2D level
\item \rnamigos: at 3D level we can learn useful functional signatures using graph representations
\item \garl: tool for automated structure matching with custom cost functions
\item \vernal: we can isolate new complex signatures using graph neural networks
\end{itemize}
......@@ -72,27 +93,166 @@ In Project II we
\end{tikzpicture}%
}
\end{figure}
\section{Background}
RNAs possess multiple levels of structural organization, ranging from the secondary structure, made of Watson-Crick (\texttt{A-U, C-G}) and Wobble (\texttt{G-U}) base pairs, to the full tertiary structure modelling the position of all atoms.
In a seminal work, Leontis and Westhof expanded the base-pairing nomenclature by identifying 12 different types of base-pairing interactions according to the relative 3D geometry of the participating nucleotides~\cite{leontis2001geometric,leontis1998conserved}.
Among them, the canonical pairs (i.e. \texttt{A-U, C-G}) are the most studied class.
Notably, they create a series of stable stacks that form a scaffold for the full structure~\cite{tinoco1999rna}.
This feature naturally defines the RNA secondary structure level.
Non-canonical pairs on the other hand are enriched in loops (i.e. regions without canonical pairs) and create more complex patterns~\cite{leontis2003analysis,petrov2013automated}.
These interactions fine-tune the specificity of RNA interactions by determining structure at the 3D level~\cite{leontis2006building}.
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{struc.jpeg}
\caption{This is an RNA. \cite{reinharz2016algorithmic}}
\end{figure}
\section{{\bf Project I:} \maternal -- Explaining the emergence of complex RNA}
\subsection{Problem}
\subsection{Proposed Solution}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{maternal.pdf}
\caption{This is an RNA. \cite{reinharz2016algorithmic}}
\end{figure}
\section{{\bf Project II:} \rnamigos -- Controlling RNA using drug-like molecules}
\subsection{Problem}
As suggested by {\bf Project I}, non-covalent small molecules are important modulators of RNA structure in looped regions.
Indeed, recent studies identified small molecules as important non-covalent regulators of RNA function in many cellular pathways~\cite{donlic2018targeting}.
These discoveries contribute to a better understanding of molecular mechanisms regulating biological systems, but also pose RNA molecules as a large class of promising novel drug targets.
We are already witnessing the development of the first drugs targeting RNAs.
Among them, Ribocil, which has recently been uncovered through a phenotypic assay to target the FMN riboswitch, is currently undergoing clinical trials as a novel antibiotic~\cite{howe2015selective}.
Various other small-molecule-mediated RNA control systems are also being proposed~\cite{wagner2018small,porter2017recurrent}, including some for CRISPR activation regulation~\cite{kundert2019controlling}, and are likely to play an important role in genetic disorder treatment and synthetic biology.
As observed by Warner and co-workers~\cite{warner2018principles}, only a small fraction of the genome is translated into protein (1.5\%) while the vast majority is transcribed into non-coding RNA (70\%).
Interestingly, non-canonical pairs were found to be involved in ligand binding sites~\cite{david2010structural,kligun2013conformational}, which is corroborated by further findings showing that some secondary structure motifs can specify ligand binding~\cite{childs2018massively,wang2019hairpin}.
These observations, together with the observation that the complexity of an RNA structure appears to be associated with its ligand binding specificity~\cite{warner2018principles}, lead us to hypothesize that RNA structures at the extended base-pairing level (i.e. including non-canonical pairs) hold useful spatial and chemical information about target sites with which to characterize ligand binding.
In practice, this means that a graph using vertices to represent nucleotides and edges to encode base-pairing interactions could offer a signature for RNA ligand binding sites (see {\bf Fig.~\ref{fig:pipeline}}).
This paradigm distinguishes RNA from the well-studied protein-ligand interactions where surface cavity topologies drive binding preferences~\cite{luo2019challenges}.
Indeed, graphical representations of RNA base pairing networks have been developed in various tools~\cite{reinharz2018mining,sarrazin2019automated,petrov2013automated,cruz2011sequence} for their ability to capture RNA-specific interactions in a scalable and interpretable manner.
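
As a sketch of this representation (the notation $G$, $V$, $E$, $\ell$ and $\Lambda$ is introduced here for illustration and is not fixed by the text above), a binding site can be written as a labelled graph
\begin{equation}
G = (V, E, \ell), \qquad E \subseteq V \times V, \qquad \ell : E \rightarrow \Lambda,
\end{equation}
where each vertex in $V$ is a nucleotide, each edge in $E$ is an interaction between two nucleotides, and $\Lambda$ is the set of edge labels, assumed here to contain the 12 Leontis-Westhof base-pairing types together with a backbone label.
For example, a six-nucleotide hairpin closed by one canonical pair between nucleotides $1$ and $6$ would contain the edge $(1, 6)$ labelled cis-Watson-Crick, plus backbone edges $(i, i+1)$ for $i = 1, \dots, 5$.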
\subsection{Proposed Solution}
We show that base pairing networks can be used to automatically predict the binding of small molecules to RNAs.
To this end, we propose a new prediction task which aims to bridge the gap between ligand and structure-based approaches.
More specifically, we train a machine learning algorithm to use structural patterns in crystal structures of known RNA-ligand complexes to make predictions which allow us to identify potentially active ligands.
In machine learning terms, the RNA-ligand complex is treated as an input-output pair where the target structure is the input to the model and the ligand is the output.
In order to allow for ligand-based applications, we use molecular fingerprints of ligands as the outputs of our model.
These are vector-based representations of chemicals designed for ligand space similarity searches and which can be conveniently handled by machine learning models.
The prediction thus serves as a ligand-based tool since it can be used to search for active compounds in the ligand-space.
At the same time, we include target information by training the model to produce fingerprints based on known RNA-ligand complexes.
Similar methods have been proposed in recent preliminary works~\cite{mallet2019leveraging,aumentado2018latent} for protein binding.
We implement this strategy in \rnamigos, a data-driven tool for assisting the RNA-binding drug discovery process.
Leveraging a network representation of RNA structures, \rnamigos{} learns structural patterns in known RNA-ligand complexes from crystal structure databases to predict chemical descriptors (i.e. fingerprints) for potential ligands.
We demonstrate that the resulting molecular fingerprints serve as effective ligand search tools across different ligand classes, and provide evidence of its effectiveness at identifying binding sites in full RNA riboswitch structures.
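
To make the similarity search concrete, one common choice is the Tanimoto coefficient; this is an assumption for illustration, since the text above does not fix a particular metric. For a predicted fingerprint $\hat{y} \in \{0,1\}^p$ (after thresholding to binary values) and the fingerprint $y \in \{0,1\}^p$ of a library ligand,
\begin{equation}
T(\hat{y}, y) = \frac{\sum_{i} \hat{y}_i y_i}{\sum_{i} \hat{y}_i + \sum_{i} y_i - \sum_{i} \hat{y}_i y_i}.
\end{equation}
For instance, with $\hat{y} = (1, 1, 0, 1)$ and $y = (1, 0, 0, 1)$ the two fingerprints share $2$ set bits out of $3 + 2 - 2 = 3$ distinct set bits, so $T(\hat{y}, y) = 2/3$.
Ranking a ligand library by decreasing $T$ then yields the candidate ligands for the input site.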
\begin{figure}
\centering
\includegraphics[width=\textwidth]{rnamigos.pdf}
\caption{Outline of the \rnamigos{} pipeline. The user begins by providing either a 3D structure of an RNA site, or a base-pairing graph of the site. In this example, the input graph is drawn using the Leontis-Westhof convention for base-pairing annotation. This example graph contains 4 cis-Watson-Crick $\medblackcircle$ edges which define the secondary structure, and one cis-Hoogsteen $\medblacksquare$ edge which is a non-canonical base pair. The graph is represented as a real-valued vector of fixed size in the Graph-Representation stage by applying the Graph Edit Distance (GED) graph comparison algorithm.
The resulting vector is then passed to a machine learning model in the Fingerprint Prediction module which produces a molecular fingerprint.
Finally, the fingerprint is used in a similarity search to identify molecules matching the prediction as candidate ligands for the input site.}
\label{fig:pipeline}
\end{figure}
\section{{\bf Project III:} \garl -- Learning to compare complex RNA}
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{GARL.pdf}
\caption{This is an RNA. \cite{reinharz2016algorithmic}}
\end{figure}
\begin{algorithm}
\SetAlgoLined
\KwData{$\mathcal{D}$ graph dataset, $c$ graph cost function}
\KwResult{Trained DQN agent for graph alignment}
$\Theta \leftarrow \text{GCN network parameters}$\\
\For{episode $e = 1, \dots, E$}{
$G, G' \sim \mathcal{D} \qquad \text{i.i.d.\ random graph pair}$ \\
$t \leftarrow 0$\\
$\mathcal{M}_t \leftarrow \emptyset$\\
\While{$\vert \mathcal{M}_t \vert \leq \vert N_{G} \vert $}{
\[
(v, v')_t =
\begin{cases}
\text{random pair} & \text{w.p. } \epsilon \\
\operatorname*{arg\,max}_{(v,v') \in \bar{\mathcal{M}}_t} Q(h(S_t, v, v'); \Theta) & \text{otherwise}
\end{cases}
\]
$\mathcal{M}_{t+1} \leftarrow \mathcal{M}_t \cup \{(v, v')_t\}$\\
$R_t \leftarrow c(G, G', \mathcal{M}_{t+1}) - c(G, G', \mathcal{M}_t)$ \\
$\Theta \leftarrow \text{TD update w. SGD over GCN}$\\
$t \leftarrow t + 1$
}
}
\Return $\Theta$
\caption{Deep Q-Learning for Graph Alignment}
\label{algo:dqn}
\end{algorithm}
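
The TD update in Algorithm~\ref{algo:dqn} is left abstract. A standard one-step Q-learning form is sketched below. It is an illustration, not necessarily the exact update used in \garl. Write $s_t$ for the state after $t$ matched pairs and $a_t = (v, v')_t$ for the pair chosen at step $t$. The discount factor $\gamma \in [0, 1]$ and the learning rate $\alpha > 0$ are assumed hyperparameters that do not appear in the algorithm. The regression target is
\begin{equation}
y_t = R_t + \gamma \max_{a'} Q(h(s_{t+1}, a'); \Theta),
\end{equation}
where the maximum runs over the pairs still available in $s_{t+1}$, and $y_t = R_t$ when the matching is complete. The parameters are then moved by one stochastic gradient step on the squared error, holding $y_t$ fixed:
\begin{equation}
\Theta \leftarrow \Theta - \alpha \nabla_{\Theta} \bigl( y_t - Q(h(s_t, a_t); \Theta) \bigr)^2 .
\end{equation}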
\section{{\bf Project IV:} \vernal -- Searching for conserved RNA structures}
\subsection{Problem}
Functional RNA molecules adopt detailed 3D structural patterns (motifs) to carry out complex functions. Recent work has shown that such motifs are conserved across different RNA molecules [1], suggesting that the set of structural motifs may constitute an alphabet out of which evolution can build complex RNA functions.
We can think of a motif as a sub-structure which is identical or similar to some sub-structure in other RNA molecules. Hence, the task of identifying structural motifs from a large set of full RNA crystal structures requires 1) structure comparison (recognizing similar sub-structures), and 2) sub-structure searching (a way to navigate large structures to test for similarity). The main challenge is that comparing structures is computationally expensive, and the number of possible structures to explore explodes for even small structures.
State-of-the-art techniques for mining structural motifs rely on two major constraints to address the computational challenges. The first is that they apply strong limitations on which sub-structures to evaluate. The second is that they assume instances of motifs will be exactly identical to each other. For a molecule as flexible as RNA, it is very likely that motifs can adopt a range of possible conformations. Therefore, our current view of the repertoire of RNA structural motifs is narrow.
\section{{\bf Project I:} Maternal}
\subsection{Proposed solution}
\section{{\bf Project II:} RNAmigos}
We propose \vernal, the first tool which addresses both limitations of current methods. Our tool is built on a graph neural network which encodes local structural information across very large sets of structures into a vector space. Similar sub-structures are thus represented as nearby vectors. The problem of comparing structures is reduced to comparing vectors, and the problem of searching the structures becomes a vector-based search. We show that this method produces richer (more flexible) motifs in a fully scalable manner.
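
To make the reduction to vector search concrete, here is a sketch under assumed notation (the embeddings $z_u \in \mathbb{R}^d$, the query embedding $z_q$ and the threshold $\tau$ are illustrative and not defined in the text above). Retrieving candidate instances of a motif amounts to collecting the sub-structures whose embeddings fall within a fixed radius of the query:
\begin{equation}
\mathcal{N}_\tau(q) = \{ u : \| z_u - z_q \|_2 \leq \tau \}.
\end{equation}
For instance, with $z_q = (1, 0)$, $z_1 = (0.8, 0.6)$ and $z_2 = (0, 1)$, the distances to the query are $\sqrt{0.4} \approx 0.63$ and $\sqrt{2} \approx 1.41$, so a threshold of $\tau = 1$ retrieves sub-structure $1$ but not sub-structure $2$.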
\section{{\bf Project III:} GARL}
\section{{\bf Project IV:} VERNAL}
\begin{itemize}
\item A motif is a subgraph-level object,
\item i.e. a group of nodes whose combined embeddings are recurrent.
\item $Z \in \mathbb{R}^{n \times d}$ embedding matrix.
\item $\Sigma \in [0,1]^{n \times m}$ soft assignment of each node ($n$) to a motif ($m$)
\item $E \in [0,1]^{m \times d}$ dictionary matrix.
Motivation: Functional RNA molecules adopt detailed 3D structural patterns (motifs) to carry out complex functions. Recent work has shown that such motifs are conserved across different RNA molecules [1], suggesting that the set of structural motifs may constitute an alphabet out of which evolution can build complex RNA functions.
\begin{equation}
\mathcal{L} = \lambda_1 \norm{ DM(Z) - K}_2^2 + \lambda_2 \norm{\Sigma^T Z - E}_2^2 + \lambda_3 \norm{E^TE - I}_2^2
\end{equation}
Problem: We can think of a motif as a sub-structure which is identical or similar to some sub-structure in other RNA molecules. Hence, the task of identifying structural motifs from a large set of full RNA crystal structures requires 1) structure comparison (recognizing similar sub-structures), and 2) sub-structure searching (a way to navigate large structures to test for similarity). The main challenge is that comparing structures is computationally expensive, and the number of possible structures to explore explodes for even small structures.
\item Entry $(i, j)$ of $\Sigma^T Z \in \mathbb{R}^{m \times d}$, for cluster $i$ and dimension $j$, is a weighted average over node embeddings, with $\sigma_i$ the $i$-th column of $\Sigma$.
\begin{equation}
(\Sigma^T Z)_{ij} = \frac{\sum_{k=1}^{n} \sigma_{ki} z_{kj}}{\norm{\sigma_i}}
\end{equation}
\item In dictionary learning $\Sigma^T Z$ is compared directly to $E$, which results in clusters of similar nodes.
Current approaches: State-of-the-art techniques for mining structural motifs rely on two major constraints to address the computational challenges. The first is that they apply strong limitations on which sub-structures to evaluate. The second is that they assume instances of motifs will be exactly identical to each other. For a molecule as flexible as RNA, it is very likely that motifs can adopt a range of possible conformations. Therefore, our current view of the repertoire of RNA structural motifs is narrow.
\end{itemize}
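
As a small numerical illustration of the weighted average above (the values are made up, and normalizing by the sum of the weights is an assumption about the intended norm), take $n = 2$ nodes, $m = 1$ motif and $d = 2$, with
\begin{equation}
\Sigma = \left(\begin{array}{c} 0.6 \\ 0.2 \end{array}\right), \qquad
Z = \left(\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right).
\end{equation}
Then $\Sigma^T Z = (0.6, \; 0.2)$, and dividing by the total weight $0.6 + 0.2 = 0.8$ gives the motif-level embedding $(0.75, \; 0.25)$: the motif sits closer to the embedding of the node that is assigned to it more strongly.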
Our approach: We propose VERNAL, the first tool which addresses both limitations of current methods. Our tool is built on a graph neural network which encodes local structural information across very large sets of structures into a vector space. Similar sub-structures are thus represented as nearby vectors. The problem of comparing structures is reduced to comparing vectors, and the problem of searching the structures becomes a vector-based search. We show that this method produces richer (more flexible) motifs in a fully scalable manner.
\bibliographystyle{plain}
\bibliography{biblio}
\end{document}
\ No newline at end of file
\end{document}