The 2005 BGRS International Summer School for young scientists
"Evolution, Systems Biology and
High Performance Computing Bioinformatics"
Novosibirsk, Russia
September 11-16, 2005
LECTURE COURSE OUTLINE
Prof. Thomas Casavant, Professor and Director
The UI Center for Bioinformatics and Computational Biology,
Parallel Processing Laboratory,
Departments of Electrical and Computer and
Biomedical Engineering, Genetics, Ophthalmology, and
The Holden Comprehensive Cancer Center
University of Iowa
USA
"Grid Computing Approaches to Finding Distant Orthologs and
Horizontal Gene Transfer Events".
Abstract:
This talk describes and evaluates a coarse-grained parallel computational approach to identifying rare evolutionary events often referred to as "horizontal gene transfers". Unlike classical genetic evolution, in which variations in genes accumulate gradually within and among species, horizontal transfer events result in a set of potentially important genes that "jump" directly from the genetic material of one species to another. Such genes, known as xenologs, appear as anomalies when phylogenetic trees are compared for normal and xenologous genes from the same sets of species. Until now, however, such large-scale comparison has not been possible because of a lack of data and computational capacity. With the availability of large numbers of computer clusters, as well as genomic sequence from more than 2,000 species containing as many as 35,000 genes each, and trillions of sequence nucleotides in all, it becomes possible to examine "clusters" of genes using phylogenetic tree "similarity" as a distance metric. The full version of this problem requires years of CPU time, yet makes only modest demands on interprocess communication (IPC) and memory; thus, it is an ideal candidate for a grid computing approach. The talk presents such a solution and preliminary benchmarking results that show a reduction in total execution time from approximately two years to less than two weeks. I will also report on several trade-off issues in various partitionings of the problem across WAN nodes and LAN/WAN networks of tightly coupled computing clusters.
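The idea of using tree "similarity" as a distance metric and splitting the work into independent pieces can be sketched as follows. This is only an illustration: the abstract does not name the metric or the partitioning scheme, so the clade-based (Robinson-Foulds style) distance, the nested-tuple tree encoding and the round-robin split in `pair_chunks` are all assumptions made for the example.

```python
from itertools import combinations


def clades(tree):
    """Return the clades of a rooted tree as a set of frozensets of leaf names.

    A tree is either a leaf name (str) or a tuple of subtrees (assumed encoding).
    """
    found = set()

    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaf_set(child) for child in node))
        found.add(members)  # record every internal node as a clade
        return members

    leaf_set(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


def pair_chunks(n_trees, n_workers):
    """Split all tree pairs into independent chunks, one per worker.

    Each pair can be compared without talking to other workers, which is
    why the workload needs little interprocess communication.
    """
    pairs = list(combinations(range(n_trees), 2))
    return [pairs[w::n_workers] for w in range(n_workers)]


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
print(rf_distance(t1, t2))
print(pair_chunks(4, 3))
```

Each chunk returned by `pair_chunks` could be handed to a separate cluster or grid node; only the resulting distances need to be collected afterwards.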
Prof. Vassily Lyubetsky, Institute for Information Transmission Problems,
Russian Academy of Sciences, Moscow, Russia
"Reconstruction Of Evolutionary Events At Molecular Level And Inference Of Species Phylogeny"
Abstract: Mathematical methods and models for
comparative analysis of large sets of protein phylogenies
are described. The processes modeled are gene duplication,
loss, gain and horizontal transfer. Initially, a species tree
is constructed as a consensus of the corresponding gene trees
using a probability distribution on the source data. Algorithms
are then applied to identify the vertices that account for
topological disparities between gene and species trees, making
it possible to infer the underlying evolutionary events. The analysis
is illustrated by case studies of a prokaryotic protein family
and a set of protein phylogenies deduced from families in the
COGs database (NCBI). The potential of the described methods to
infer phylogeny and gene evolution events is discussed.
The methods and algorithms described here address
two tasks: reconstructing prokaryotic species trees and analyzing
hypotheses about gene evolution. The main emphasis is placed on
original algorithms and their performance, although, due to space
limits, only general descriptions are provided along with the necessary references.
Events in gene evolution are usually viewed as gene divergence
during species differentiation, gene duplication, gene gain, loss
and horizontal gene transfer (HGT). The molecular data are protein
sequences grouped according to their amino acid and functional
similarity into clusters of orthologous groups of proteins (COGs)
(Tatusov et al., 2001).
The general approach to reconstructing gene evolution events has long
been defined (Goodman et al., 1979; Eulenstein et al., 1998). A protein
gene family is selected, usually from among COGs; a multiple sequence
alignment is then assembled and a gene tree G (also referred to as a
protein tree or COG tree) is reconstructed. Topological similarities
and disparities between the gene trees of a set {Gi} are then analyzed
in order to reconstruct the species tree and to infer gene
evolution events, respectively. Topological differences are reconciled
to produce the species tree S. Alternatively, when inferring gene evolution
events, considerable topological differences between a particular gene
tree G (often belonging to the family {Gi}) and the species tree S are the
basis of the analysis.
Mathematical models of gene evolution are formulated to accommodate the
observed differences, and optimization of model parameters is used as
a tool to reconstruct the evolutionary history of a microbial gene family.
The evolutionary model is defined as a procedure for comparing gene and
species trees, while its parameters are defined as sets of tree vertices
with assigned evolutionary events. An optimized model has parameters
corresponding to the extremes of the relevant evolutionary characteristics.
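As an illustration of how vertices of a gene tree can be assigned evolutionary events by comparison with a species tree, the sketch below uses the classic least-common-ancestor (LCA) mapping associated with the general approach cited above. It is not the authors' algorithm, which the abstract does not spell out. It handles duplications only, and the nested-tuple tree encoding and function names are assumptions.

```python
def species_clades(species_tree):
    """List every clade of the species tree as a frozenset of species names."""
    out = []

    def walk(node):
        if isinstance(node, str):
            members = frozenset([node])
        else:
            members = frozenset().union(*(walk(child) for child in node))
        out.append(members)
        return members

    walk(species_tree)
    return out


def lca(clade_list, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clade_list if species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree vertices that the LCA mapping marks as duplications.

    Gene-tree leaves are labeled with the species they come from. A vertex
    is a duplication if it maps to the same species-tree vertex as one of
    its children.
    """
    clade_list = species_clades(species_tree)
    duplications = 0

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([node])
        child_sets = [walk(child) for child in node]
        here = frozenset().union(*child_sets)
        mapped = lca(clade_list, here)
        if any(lca(clade_list, cs) == mapped for cs in child_sets):
            duplications += 1
        return here

    walk(gene_tree)
    return duplications


S = (("A", "B"), "C")
print(count_duplications((("A", "B"), "C"), S))   # gene tree agrees with S
print(count_duplications((("A", "C"), "B"), S))   # topological disparity
```

A gene tree that agrees with the species tree needs no duplications, while a conflicting topology forces at least one. Counting such events is one example of an "evolutionary characteristic" that a model could seek to minimize.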
Prof. Vassily Lyubetsky, Institute for Information Transmission Problems,
Russian Academy of Sciences, Moscow, Russia
"A mathematical model for regulation of gene expression by formation of
alternative RNA structures".
The computer model of the regulation of gene expression in bacteria
mediated by the dynamic formation of RNA secondary structures is
described in [Lyubetsky, Molecular Biology, 2005] and [Lyubetsky,
Information processes, 2005].
The present version of the program implements Monte Carlo modeling of
the regulation process, from the binding of the RNA polymerase
to the DNA chain, through the binding of the ribosome to the Shine-Dalgarno box of the
leader peptide, until termination of transcription (or antitermination).
In the present version all microstates and macrostates of the secondary
structure in the window between the positions of the ribosome and the polymerase
are constructed anew at each step of the process. This decreases the
efficiency of the algorithm and makes modeling of long nucleotide sequences
difficult. We plan to develop a new version of the algorithm in which the set of
states is recomputed from the previous set of states, taking into account the change in the
positions of the ribosome and polymerase.
Testing. The developed version of the program was successfully tested, in
particular, on S. venezuelae ISP5230, S. avermitilis MA-4680 and S. coelicolor
A3(2). The modeling results agree well with the biological
data that exist for the first case [C. Lin, A. Paradkar, L. Vining,
Microbiology, 1998, 144, p. 1971-1980].
Algorithm. The RNA folding process is represented as a Markov process with
states corresponding to RNA secondary structures and transition
probabilities corresponding to transformations of a secondary structure
caused by formation or disintegration of a helix. A macrostate is defined as a
set of secondary structures having the same topological (bracket) structure
of bound helices. The transition probabilities (kinetic constants) for
transitions between secondary structures (microstates) belonging to the same
macrostate can be chosen arbitrarily, with the only condition that the
equilibrium Gibbs distribution on this macrostate (as on the set of
microstates) is invariant. The transition probabilities for transitions
corresponding to formation or disintegration of a helix are chosen in a
special way for which the principle of detailed balance is satisfied and the
probability of disintegration of the helix depends only on binding energy of
helices. Conversely, the probability of helix formation depends only on the
free energy of loops corresponding to the given set of helices. It should be
noted that for several (not all) organisms the binding energy of a helix
also contains a term depending on the total length of its loop. In
typical situations, however, the binding energy is calculated just from experimental
data on stacking. The free energy of the loop depends on its length, taking
into account the Flory entropy term and the elastic energy term. The
topological correction term which differentiates the end loops and the side
loops is also included following experimental data. Then the averaged
transition rates for transitions between different macrostates (topologically
different secondary structures) are calculated. These rates define the part
of the model which does not contain the processes of transcription or
translation. The process of transcription is described by the transition
probability (kinetic constant) for the change of the RNA-polymerase position
on the DNA chain. The nominal value of this constant is 40 sec^-1, but in
reality it strongly depends on the secondary structure of the RNA chain.
There is experimental evidence for the influence of helices on the
transcription rate. We propose a resonance-type formula for the interaction
between an RNA hairpin and the RNA polymerase molecule. This formula roughly
corresponds to the experimental data. The process of transcription can be
terminated if the RNA-polymerase is located at a T-rich segment of DNA
chain. We consider several physical mechanisms of the termination. All of
them give the same dependence of the termination rate constant on the
transcription rate. Again we use experimental data to evaluate the parameter
value in the formula of that dependence. The last process incorporated in
the model is translation. The role of this process in the termination is the
following: the ribosome influences the secondary structure of the RNA chain by
destroying some helices. As a result, the presence (or absence) of some
helices determines the transcription rate and hence the termination rate. The
kinetic constant of translation is 15 sec^-1 for translation of any
non-regulatory codon, but for regulatory codons the translation rate depends
on the concentration of the corresponding amino acid, or, more exactly, on
the concentration of charged tRNA. Assuming a Michaelis-Menten type
formula for each of these dependences, we obtain a formula of the same type for the
overall dependence of the translation rate on the amino acid concentration in
the medium. The corresponding Michaelis-Menten parameter has no direct
physical meaning, because it reflects a series of different processes:
amino acid diffusion, amino acid binding to tRNA, tRNA binding to the
ribosome and so on.
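Two of the rate rules described above can be illustrated with a minimal sketch. The exponential (Arrhenius-like) form, the common prefactor `k0`, the temperature behind `RT`, the use of 15 sec^-1 as the saturation rate for regulatory codons and the parameter `k_m` are all assumptions made for the example, not values or formulas taken from the model itself.

```python
import math

RT = 0.616  # kcal/mol, roughly 37 degrees C (assumed temperature)


def helix_rates(helix_energy, loop_free_energy, k0=1.0):
    """Return (formation rate, disintegration rate) for one helix.

    Formation depends only on the loop free energy and disintegration only
    on the helix binding energy (negative for a stable helix), so their
    ratio is exp(-(helix_energy + loop_free_energy) / RT), as detailed
    balance requires.
    """
    k_form = k0 * math.exp(-loop_free_energy / RT)
    k_dis = k0 * math.exp(helix_energy / RT)
    return k_form, k_dis


def translation_rate(concentration, k_max=15.0, k_m=1.0):
    """Michaelis-Menten type dependence of the codon translation rate.

    k_max and k_m are hypothetical; k_m is an effective parameter that
    lumps together several underlying processes.
    """
    return k_max * concentration / (k_m + concentration)


k_form, k_dis = helix_rates(helix_energy=-5.0, loop_free_energy=3.0)
print(k_form / k_dis)
print(translation_rate(1.0))
```

With these forms, the equilibrium ratio of the two rates depends only on the total free energy change of forming the helix, while the translation rate rises with concentration and saturates at `k_max`.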
The result of the modeling procedure shows the dependence of the probability
of termination of transcription on the concentration of the amino acid in
the medium. These results for several cases are in good (at least
qualitative) correspondence with experiments.
Prof. Maria Samsonova, St. Petersburg State Polytechnical University, St. Petersburg, Russia
"Methods for the integration of distributed heterogeneous bioinformatics tools and data resources"
Abstract:
It is well known that bioinformatics has to cope with large amounts of information in all knowledge domains. Hundreds of resources and applications are available to today's biologist as command-line applications, databases, flat files, web forms or graphical user interfaces. These may be either local to the user or provided by remote sites. Moreover, these resources are updated frequently and have different semantics.
Recently, technologies have begun to appear that make it possible to move from an interactive to an automated approach to biological information management by providing a distributed environment that supports the in silico experimental process in bioinformatics. At the core of these technologies is the construction of workflows. Workflow tools are currently under considerable development; however, this is still a broad area with many competing proposals and no accepted standards.
In my lecture I will present the technology that we have developed to understand the dynamical regulatory mechanisms controlling the expression of segmentation genes in the fruit fly Drosophila (Jaeger et al., 2004, Nature, 430). This technology was used to construct a Laboratory Information Management System (LIMS) known as PIPE. PIPE is easily extended to new data processing and analysis methods, is flexible in the specification and modification of these methods, is scalable, and supports distributed processing and analysis of data and images.
Prof. Dmitry Scherbakov, Limnological Institute SB RAS, Irkutsk, Russia
"Modeling of molecular evolution processes in different speciation scenarios"
Individual-oriented modeling may be an efficient tool for studying the possible mechanisms of the evolutionary process. A serious problem with this approach remains the difficulty of checking its results experimentally. To facilitate the interaction between experimental and theoretical studies of the microevolutionary process, we propose to include in the models objects that accumulate neutral mutations or objects that are similar to proteins. As a result, it becomes possible to examine the question of whether sets of homologous segments of nucleic acids obtained in experimental studies of populations can help to build a phylogenetic criterion for evolutionary hypotheses. We illustrate this approach with two examples: a simulation of the co-evolution of hosts and vertically transmitted intracellular parasites that cause feminization of males, and a simulation of coordinated changes in protein sequences.
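A minimal sketch of an individual-oriented model in which every individual carries a neutral marker sequence is given below. It is not the model presented in the lecture: the asexual resampling scheme, the fixed population size, the four-letter alphabet and all parameter values are assumptions chosen only to show how such marker sequences could be produced for later phylogenetic analysis.

```python
import random


def evolve(pop_size, seq_len, generations, mu, seed=0):
    """Simulate a population whose individuals carry a neutral sequence.

    Each generation, every offspring copies the sequence of a randomly
    chosen parent. With probability mu per site, the site is replaced by a
    random base (possibly the same one). The marker does not affect fitness.
    """
    rng = random.Random(seed)
    population = ["A" * seq_len] * pop_size
    for _ in range(generations):
        next_generation = []
        for _ in range(pop_size):
            parent = rng.choice(population)
            child = "".join(
                rng.choice("ACGT") if rng.random() < mu else base
                for base in parent
            )
            next_generation.append(child)
        population = next_generation
    return population


sample = evolve(pop_size=10, seq_len=20, generations=50, mu=0.01, seed=1)
for sequence in sample[:3]:
    print(sequence)
```

The sequences in `sample` play the role of homologous segments sampled from a population; a tree built from them could then be compared with the known simulated history.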
Prof. Alexis Ivanov, Institute of Biomedical Chemistry RAMS, Moscow, Russia
"3D modeling and molecular dynamics simulation of peripheral membrane proteins: case study with cytochrome b5 in explicit lipid bilayer"
Alexis S. Ivanov, Yulia Yu. Smolinskaya, Alexander V. Veselovsky, Alexander I. Archakov
Institute of Biomedical Chemistry RAMS, Moscow, Russia
Abstract:
Molecular dynamics (MD) simulations of full-length cytochrome b5
(b5) in a membrane environment were carried out to investigate the
structure and probable membrane topology of b5. All computations
were executed on a Linux cluster (32 CPUs) running the Gromacs 3.2 suite
of programs. MD simulations were performed in a complex system consisting
of an explicit dipalmitoylphosphatidylcholine (DPPC) bilayer (338 lipid molecules)
and two water phases (about 15000 molecules each). Preliminary
equilibration of this membrane system was done through 3.5 ns of MD
at constant temperature and pressure. Several structural parameters of
the membrane model (bilayer thickness, surface area and volume per lipid,
ordering of the DPPC chains, etc.) reproduced the available
experimental data quite well.
The resulting model system was validated as a membrane environment for
modeling and MD simulation of membrane proteins. A reference protein
with known structure and peripheral membrane topology (monoamine oxidase
A, MAO) was used. Comparison of the crystal and MD equilibrium structures
of MAO confirms that the membrane system is suitable for
simulating the structure and membrane topology of peripheral membrane proteins.
In order to study the probable membrane topology of full-length b5, we
performed several series of MD simulations with simulation times of up to 3.0 ns.
Two hypothetical structures of b5, with a transmembrane or a loop membrane anchor,
were analyzed. Special attention was paid to the interaction of the
membrane-bound part of the protein with the lipid bilayer. The results of the
simulation demonstrate that b5 with either type of anchor can be stable
in a complex membrane environment, which provides an explanation for the known
contradictory experimental data.
Dr. Dmitry Afonnikov, Institute of Cytology and Genetics, Novosibirsk, Russia
"Analysis of co-ordinated substitutions in protein sequences"
Recent results suggest that during evolution certain
substitutions at protein sites may occur in a coordinated
manner due to interactions between amino acid residues.
Information about these coordinated substitutions may be
helpful in analysis of protein structural/functional relationships.
Here, I will consider coordinated amino acid substitutions
as a model of protein evolution. Experimental evidence for
the existence of dependent substitutions will be reevaluated.
Currently available methods for the detection and evaluation of
coordinated substitutions will be reviewed.
Possible practical applications of information about coordinated
substitutions to analysis of protein evolution and function
will be described.
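One widely used way to quantify coordinated substitutions between two alignment positions is the mutual information of their columns. The sketch below is only an illustration of that general idea. The abstract does not say which methods the lecture covers, and the toy alignment and function names are invented for the example.

```python
import math
from collections import Counter


def column(alignment, k):
    """Residues found at position k in every aligned sequence."""
    return [sequence[k] for sequence in alignment]


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    High values mean that the residue at one position helps predict the
    residue at the other, as expected for coordinated substitutions.
    """
    n = len(col_i)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
        for (a, b), c in count_ij.items()
    )


alignment = ["AKL", "AKL", "GEL", "GEL"]  # toy data, not real sequences
print(mutual_information(column(alignment, 0), column(alignment, 1)))
print(mutual_information(column(alignment, 0), column(alignment, 2)))
```

In the toy alignment, positions 0 and 1 always change together and share one bit of information, whereas the invariant position 2 carries none. On real data, such scores are also inflated by shared phylogenetic history, which is one reason why evaluating them requires care.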
Dr. Luciano Milanesi, National Research Council - Institute of Biomedical
Technology, Italy
"Distributed Applications, Web Services, Tools and GRID Infrastructures for Bioinformatics"
Due to the increasing number of nucleotide and protein sequences that are produced by high-throughput techniques and have to be analyzed by bioinformatics tools, it will be necessary to increase current computational resources. Therefore, to meet these new challenges successfully, it will be necessary to develop dedicated supercomputers, parallel computers based on clustering technologies and high-performance distributed platforms such as GRID.
The next generation of Grid infrastructures aims to implement a distributed computing model in which easy access to large, geographically distributed computing and data management resources is provided to large multi- and interdisciplinary Virtual Organizations (VOs) made up of both research and user entities.
Indeed, computational and data Grids are "de facto" considered as the way to realise the concept of virtual places where scientists and researchers work together to solve complex problems in Bioinformatics, despite their geographic and organizational boundaries.
In these respects, Grid computing heralds another technological and societal revolution in high-performance distributed computing, as the World Wide Web has done over the last ten years for the meaning and availability of global information. The aim is to operate this widely distributed computing environment as a uniform service that looks after resource management, exploitation and security independently of individual technology choices.
A general overview of GRID technologies and of the application of computer clusters to distributed bioinformatics tasks, such as data mining, gene discovery and sequence similarity searching of DNA and protein, will be presented.