German/Russian Summer School 2008
Evolution, Systems Biology and High Performance Computing
Novosibirsk, June 29-July 2, 2008
Analysis of biological networks and related data
Prof. Falk Schreiber (IPK Gatersleben, Germany)
This talk will give an overview of the structural analysis of biological networks, a topic located at the interface of biology and computer science. Biological networks represent processes in cells, organisms, or entire ecosystems. Large amounts of data that represent (or are related to) biological networks have been gathered in the past, not least with the help of the latest technological advances. Thus, the analysis of these networks is an important research topic in modern bioinformatics, and it is gaining more and more attention in the life sciences, in particular in the growing field of
Network analysis methods for biological networks are presented and discussed. This includes global network properties and network models, centrality analysis, which helps in ranking network elements, network motifs, which can represent potentially important network parts, and clustering methods. Furthermore, we discuss the analysis of -omics data (e.g. transcriptomics, proteomics, and metabolomics). To support an integrative, systems-biology-directed approach, the interactions of the biological entities (e.g. DNA, RNA, proteins, metabolites) are important, and the data has to be linked to relevant networks. We will discuss methods for the visualisation and analysis of networks with related experimental data and present VANTED, a system implementing these methods. Different data such as transcript, enzyme, and metabolite data can be integrated and presented in the context of their underlying networks, e.g. metabolic pathways or classification hierarchies such as the Gene Ontology. Statistical methods allow analysis and comparison of multiple data sets. Correlation networks can be automatically generated from the data, and substances can be clustered according to similar behavior over time. Sophisticated visualisation approaches support an easy visual analysis of the data-enriched networks. VANTED is available free of charge at http://vanted.ipk-gatersleben.de/.
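To illustrate how a correlation network can be derived from time-series data, here is a minimal sketch. It is not VANTED's implementation: the function names, the threshold of 0.9, and the substance profiles are all assumptions made up for this example. Two substances are connected whenever the absolute Pearson correlation of their profiles reaches the threshold.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_network(profiles, threshold=0.9):
    """Return edges (a, b, r) for all pairs with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical measurements at four time points (values are invented).
profiles = {
    "glucose": [1.0, 2.0, 3.0, 4.0],
    "sucrose": [2.1, 3.9, 6.2, 8.0],
    "starch": [4.0, 3.0, 2.0, 1.0],
    "citrate": [1.0, 3.0, 1.0, 3.0],
}
edges = correlation_network(profiles)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:.3f}")
```

With these invented values, glucose, sucrose and starch end up connected to one another, while citrate stays isolated because its profile correlates only weakly with the others.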
Metabolic network analysis
Prof. Ralf Hofestädt (Bielefeld University, Germany)
Currently, there are about 1000 databases and information systems and various analysis tools available via the internet. The challenge is to integrate these parts lists from genomics and proteomics at novel levels of understanding. Integrative bioinformatics is the new area of research that applies the tools of computer science to biotechnology. Ultimately, these tools will form the backbone of the concept of the virtual cell, which is both a scientific vision and a challenge for bioinformatics. This talk will present the architecture of a federated database concept for the integration of metabolic database systems. Moreover, beyond the prediction of networks, we will discuss the modelling and simulation of metabolic networks using automata and Petri nets.
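As a rough illustration of Petri net simulation for a metabolic reaction, the sketch below treats metabolites as places, a reaction as a transition, and tokens as discrete units of each metabolite. The data layout, function names, and token counts are assumptions chosen for this example and are not the speaker's system. The single transition models the hexokinase reaction (glucose + ATP -> glucose-6-phosphate + ADP).

```python
def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in transition["inputs"].items())


def fire(marking, transition):
    """Return the marking obtained by firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, n in transition["inputs"].items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in transition["outputs"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


def run(marking, transitions, max_steps=100):
    """Fire the first enabled transition repeatedly until none is enabled."""
    for _ in range(max_steps):
        for t in transitions:
            if enabled(marking, t):
                marking = fire(marking, t)
                break
        else:
            return marking  # no transition enabled: simulation stops
    return marking


hexokinase = {
    "inputs": {"glucose": 1, "ATP": 1},
    "outputs": {"glucose-6-phosphate": 1, "ADP": 1},
}
initial = {"glucose": 3, "ATP": 2}  # hypothetical token counts
final = run(initial, [hexokinase])
print(final)
```

The reaction fires twice and then stops because ATP is exhausted, leaving one unit of glucose unconsumed.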
Integration of reaction kinetics data and modeling of metabolic networks: SABIO-RK and SYCAMORE
Dr. Olga Krebs (EML Research gGmbH, Heidelberg, Germany)
Systems biology involves analyzing and predicting the behavior of complex biological systems like cells or organisms. This requires qualitative information about the interplay of genes, proteins, chemical compounds, and biochemical reactions. It also calls for quantitative data describing the dynamics of these networks.
To provide quantitative experimental data for systems biology, we have developed SABIO-RK, a database system offering information about biochemical reactions and their corresponding kinetics. It not only describes participants (enzymes, substrates, products, inhibitors, activators) and kinetic parameters of the reactions, but also provides both the environmental conditions for parameter determination and detailed information about the reaction mechanisms, including the mechanism type of a reaction and its related kinetic law equation defining the reaction rate with its corresponding parameters.
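To make the notion of a kinetic law equation with its parameters concrete, the sketch below evaluates the irreversible Michaelis-Menten rate law, v = Vmax * [S] / (Km + [S]), a standard textbook example. The function name, parameter names, and numerical values are assumptions for illustration only and are not taken from SABIO-RK.

```python
def michaelis_menten_rate(vmax, km, substrate):
    """Reaction rate under the irreversible Michaelis-Menten law.

    vmax      -- maximal rate (e.g. in mM/s)
    km        -- Michaelis constant (same concentration unit as substrate)
    substrate -- substrate concentration
    """
    return vmax * substrate / (km + substrate)


# Hypothetical parameter values.
vmax, km = 10.0, 2.0
for s in (0.5, 2.0, 20.0):
    print(f"[S] = {s:5.1f}  ->  v = {michaelis_menten_rate(vmax, km, s):.3f}")
```

At a substrate concentration equal to Km the rate is exactly half of Vmax, which is why both the parameter values and the conditions under which they were measured matter when such data are reused in a model.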
The SABIO-RK database is populated by merging information about biochemical reactions, mainly obtained from existing databases like KEGG (Kyoto Encyclopedia of Genes and Genomes), with their corresponding kinetic data, manually extracted from the literature. The kinetic data from articles are entered into the database using a web-based input interface, and subsequently curated, unified and systematically structured. The use of controlled vocabularies, synonymic notations and annotations to external resources offers the possibility of comparing and augmenting information about biochemical reactions and their kinetics.
SABIO-RK can be accessed in two different ways: via a web-based user interface to browse and search the data manually, and, more recently, via web services that can be called automatically by external tools, e.g. by other databases or simulation programs for biochemical network models. In both interfaces, reactions with kinetic data can be exported in SBML (Systems Biology Markup Language), a data-exchange format widely used in systems biology.
SYCAMORE is a browser-based application that facilitates construction, simulation and analysis of kinetic models in systems biology. It allows, for example, database-supported modelling, basic model checking and the estimation of unknown kinetic parameters based on protein structures. In addition, it offers some guidance so that non-expert users can perform basic computational modelling tasks. SYCAMORE provides an interface to the SABIO-RK database that lets the user locate and select the relevant kinetic data for the reaction steps of interest.
SABIO-RK: http://sabio.villa-bosch.de/SABIORK
Platform "From Gene to Lead Compound": integration of in silico and in vitro technologies
Prof. A.S. Ivanov (V.N. Orechovich Institute of Biomedical Chemistry RAMS, Moscow, Russia)
Motivation and Aim. The pathway of drug discovery from idea to market consists of 7 basic steps: 1) disease selection, 2) target selection, 3) lead compound identification, 4) lead optimization, 5) preclinical trial evaluation, 6) clinical trials, 7) drug manufacturing. The two final stages are time- and money-consuming, and shortening them is practically impossible owing to strict state standards and laws. Therefore, researchers pay special attention to increasing the efficiency of drug development at earlier stages using computer modeling and bioinformatics integrated with new experimental methods. This methodology is directed at accelerating and optimizing the discovery of new biologically active compounds suitable as drug candidates (lead compounds). Recently these approaches have merged into a "from gene to lead compound" platform that covers the principal part of the pipeline. Several steps of this platform include computer modeling, virtual screening, and property prediction. Bioinformatics methods can reduce the number of compounds that are synthesized and tested by up to 2 orders of magnitude. Nonetheless, these approaches cannot completely replace real experiments. The purpose of computer methods is to generate highly probable hypotheses about new targets and/or ligands that must be tested later in real experiments.
Methods and Algorithms. The following methods and approaches are highlighted in the lecture: 1) bioinformatics approaches to genome-based selection of anti-infective targets; 2) experimental technologies for target validation; 3) determining the 3D structure of the target with experimental and computer modeling technologies; 4) strategy of computer-aided drug design; 5) experimental testing of probable lead compounds.
Results. Some examples of carrying out bioinformatics steps of the "from gene to lead compound" platform are presented: 1) target selection in the genome of M. tuberculosis and beyond; 2) 3D modeling of cytochrome P450 1A2 and database mining for new leads using a docking procedure; 3) dimerization inhibitor of HIV protease: screening in silico and in vitro.
Conclusion. This lecture describes the integration of computer and experimental approaches in a complementary manner and some specific examples of the steps in implementing this platform.
Acknowledgments. This work was supported in part by the Russian Foundation for Basic Research (grant 07-04-00575) and the Russian Federal Space Agency (in the frame of ground preparation of space research).
Detecting positive selection in protein-coding genes
Prof. Maria Anisimova (ETH Zurich, Switzerland)
Continued genome sequencing has progressed simultaneously with new statistical methodology for understanding the actions of natural selection.
I review various statistical methodologies (and their applicability) for detecting adaptation events and functional divergence of proteins.
As large-scale automatic studies become more frequent, they provide a great resource for generating biological null hypotheses for further experimental and statistical testing and shed more light on typical patterns of lineage-specific organismal evolution, of the functional and structural evolution of protein families, and of the interplay between the two.
Increasingly, models are being developed that derive from underlying biological and chemical processes to complement simpler statistical models.
Linking processes to their statistical signatures can be complicated, and the proper application of statistical models is discussed.
I first present the general modeling framework of a codon substitution process and the estimation of model parameters by maximum likelihood.
Next, the likelihood ratio tests for detecting positive selection will be introduced. The codon models are classified into those allowing among-site, branch or site-branch variation of selective pressure.
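As a minimal sketch of a likelihood ratio test between two nested models, the code below compares twice the log-likelihood difference with a chi-square distribution. It assumes that the models differ by two free parameters, so that the chi-square survival function has the closed form exp(-x/2), and that the chi-square approximation is adequate. In practice the degrees of freedom depend on the pair of models compared. The function name and log-likelihood values are invented for this example.

```python
from math import exp


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Return (statistic, p_value) for nested models differing by 2 parameters.

    lnl_null -- maximised log-likelihood under the null model
    lnl_alt  -- maximised log-likelihood under the alternative model
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    statistic = max(statistic, 0.0)  # guard against small negative values from optimisation noise
    p_value = exp(-statistic / 2.0)  # chi-square survival function for 2 degrees of freedom
    return statistic, p_value


# Hypothetical log-likelihoods from two model fits.
stat, p = likelihood_ratio_test(lnl_null=-1234.5, lnl_alt=-1230.0)
print(f"2*delta(lnL) = {stat:.2f}, p = {p:.4f}")
```

A small p-value indicates that the alternative model, which permits positive selection, fits the data significantly better than the null model.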
The use of the Bayesian approach to predict sites under positive selection is demonstrated. For tests of positive selection along individual lineages, an a priori hypothesis is required. Alternatively, the use of various multiple-testing corrections is discussed.
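The abstract does not name the corrections it covers. As an assumed example, the sketch below implements two widely used ones, Bonferroni and Holm, for the case where every lineage is tested in turn without an a priori hypothesis. The p-values are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: test p-values in ascending order
    against alpha / (m - rank) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


# Hypothetical p-values from tests on four branches of a tree.
p_values = [0.01, 0.015, 0.03, 0.2]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

With these values Bonferroni rejects only the first hypothesis, whereas Holm also rejects the second. Both control the family-wise error rate, and Holm is never less powerful than Bonferroni.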
Discrete models for molecular evolution simulation at the population level.
Prof. Dmitry Scherbakov (Limnological Institute SB RAS, Irkutsk)
(To be announced)
Introduction to high-performance computing.
Prof. Thomas Ludwig (Ruprecht-Karls-Universität, Heidelberg, Germany)
Parallel computation on supercomputers is finding an ever-increasing number of applications in solving typical problems of modern science and technology. Its use is driven by the emergence of a new class of very large problems. The course of lectures will cover the history of supercomputing, a review of the most powerful supercomputers, the classes of problems that require parallel computation, and the main trends in technology development. Architectures of supercomputers with shared and distributed memory will be described, as well as the distinctions between supercomputers and parallel clusters. Specific features of data storage during parallel computation will be considered, as well as the technologies of parallel programming.
Lecture 1: Architecture of high performance computers
To start with, we will discuss the architectural principles of high performance computers and, in particular, of compute clusters. We will take a closer look at processors, interconnect technology, storage, and especially the memory architecture. The latter defines the classes of shared and distributed memory computers. The lecture will also present some data from the current TOP500 list of the most powerful computers in the world. Finally, an overview of operating system aspects will be presented.
Lecture 2: Parallel programming principles
We will now learn how parallel programs are characterized and how, in principle, we design and implement such programs. A good knowledge of compiler and hardware details is often necessary in order to get optimal performance from the program. The parallelization paradigm of data partitioning and message passing will be introduced. Two measures will be presented to evaluate the performance of a parallel program.
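The abstract does not say which two measures are meant. Speedup and parallel efficiency are the usual candidates, and the sketch below computes them under that assumption. The timings are invented.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency E = S / p, where p is the number of processes."""
    return speedup(t_serial, t_parallel) / n_procs


# Hypothetical wall-clock times in seconds.
t_serial, t_parallel, n_procs = 120.0, 20.0, 8
print("speedup:   ", speedup(t_serial, t_parallel))
print("efficiency:", efficiency(t_serial, t_parallel, n_procs))
```

An efficiency below 1 (here 0.75) reflects overheads such as communication, load imbalance, or serial portions of the program.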
Lecture 3: Message passing with MPI
The first step into parallel programming will be taken with the Message Passing Interface (MPI). We will write a small program that distributes data to different compute nodes, calculates some data, and finally collects the results. A few basic library calls for message passing will be introduced, which are already sufficient to write a first parallel program. Problematic issues like debugging and performance analysis will be covered.
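The distribute, compute, and collect pattern described above can be sketched without an MPI installation. The code below uses threads from Python's standard library as stand-ins for MPI ranks. It is not an MPI program, and the threads serve only to show the pattern, not to demonstrate real speedup. In an actual MPI program, calls such as MPI_Scatter and MPI_Gather would take the place of the partitioning and collection steps. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size (the 'scatter' step)."""
    base, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for rank in range(n_parts):
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def local_work(chunk):
    """Work done independently by each worker: here, a sum of squares."""
    return sum(x * x for x in chunk)


def distribute_compute_collect(data, n_workers=4):
    """Distribute chunks to workers, compute locally, and collect the results."""
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map returns results in chunk order, like a gather in rank order
        partial_results = list(pool.map(local_work, chunks))
    return partial_results, sum(partial_results)


partials, total = distribute_compute_collect(list(range(10)), n_workers=4)
print(partials, total)
```

Each worker sees only its own chunk, and only the small partial results are sent back for the final combination. This is the essence of data partitioning with message passing.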
Lecture 4: Advanced issues with message passing
MPI offers a huge number of library calls, most of which simply combine several basic calls and thus carry out complicated activities in a single call. We will have a look at collective calls and sophisticated communication patterns. As bioinformatics is particularly data intensive, a first introduction to parallel input/output via MPI will be given. We will conclude with an outlook on advanced features of the MPI-2 standard and what they are used for.
Models, Algorithms, and Parallel Computing for Large-Scale Phylogenetic Inference
Dr. Alexandros Stamatakis (Ludwig-Maximilians-University, Munich, Germany)
The computation of ever larger as well as more accurate phylogenetic trees with the ultimate goal to compute the "tree of life" represents one of the grand challenges in high performance computing (HPC) Bioinformatics. Statistical methods of phylogeny reconstruction such as Maximum Likelihood (ML) and Bayesian inference have proved to be the most accurate models for evolutionary tree reconstruction and are becoming increasingly popular.
Unfortunately, the size of trees which can be computed in reasonable time is limited by the severe computational cost induced by these methods coupled with the explosive accumulation of sequence data, and the increasing popularity of large "gappy" multi-gene alignments.
There exist two orthogonal research directions to overcome this challenging computational burden which will be covered in this lecture:
Firstly, the development of faster and more accurate heuristic search algorithms as well as the implementation of efficient data-structures for multi-gene alignments.
Secondly, the application of high performance computing techniques to provide the required computational power, mainly in terms of CPU hours.
Initially, I will provide an introduction to phylogenetic inference under ML and outline the major computational challenges. Thereafter, I will discuss some of the basic search techniques as well as recent algorithmic advances in the field, especially with respect to rapid inference of support values. In the second part of my talk I will describe how the ML function can be adapted to a large variety of hardware architectures, ranging from multi-core processors to the IBM BlueGene supercomputer.
I will conclude with an overview of future challenges in the field.
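One commonly cited reason why heuristic search is needed (added here for illustration; the abstract itself does not state it) is the size of the search space: the number of distinct unrooted binary tree topologies for n taxa is (2n-5)!!, the product of all odd numbers up to 2n-5. The sketch below computes this count; the function name is hypothetical.

```python
def num_unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies for n_taxa >= 3 labelled taxa: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count


for n in (4, 10, 20, 50):
    print(n, num_unrooted_topologies(n))
```

Ten taxa already admit more than two million topologies, so exhaustively scoring every tree under ML is out of reach for data sets of realistic size.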
Introduction to high performance computing.
Yurii Vyatkin, ICG SB RAS
Practical course on PAML
Dr. Maria Anisimova (ETH Zurich, Switzerland)
Practical course on SABIO-RK and SYCAMORE
Dr. Olga Krebs (EML Research gGmbH, Heidelberg, Germany)