(Updated version, 02.07.2006)
July 12, 2006
(I) Introduction to high-performance computing. Part I.
Parallel supercomputing finds an ever-increasing number of applications
in solving topical problems in modern science and technology. Its use
is driven by the emergence of a new class of very large problems. The
lecture course will cover the history of supercomputing, a review of
the most powerful supercomputers, the classes of problems that require
parallel computation, and the main trends in technology development.
Architectures of supercomputers with shared and distributed memory will
be described, as well as the distinctions between supercomputers and
parallel clusters. Specific features of data storage during parallel
computations will be considered, along with parallel programming
technologies.
Lecture 1. Architecture of high performance computers. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
To start with, we will discuss the architectural principles of high
performance computers and, in particular, of compute clusters. We will
take a closer look at processors, interconnect technology, storage,
and especially the memory architecture. The latter defines the classes
of shared and distributed memory computers. The lecture will also
present some data from the current TOP500 list of the most powerful
computers in the world. Finally, an overview of operating system
aspects will be presented.
Lecture 2. Parallel programming principles. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
We will now learn how parallel programs are characterized and how, in
principle, such programs are designed and implemented. A good knowledge
of compiler and hardware details is often necessary to get optimal
performance from a program. The parallelization paradigm of data
partitioning and message passing will be introduced. Two measures will
be presented for evaluating the performance of a parallel program.
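The abstract does not name the two measures. Speedup and parallel efficiency are common choices for this purpose, so the minimal sketch below assumes them; the function names and the timings are hypothetical and serve only as illustration.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T1 / Tp: how many times faster the parallel run is."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Efficiency E = S / p: fraction of ideal linear speedup achieved."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(t_serial, t_parallel) / n_procs


# Hypothetical timings: 120 s on one processor, 20 s on eight processors.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
print(f"speedup = {s:.1f}, efficiency = {e:.2f}")
```

An efficiency well below 1 usually indicates that communication or load imbalance is consuming part of the added processor time.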
Seminar 1. Phylogenetic analysis of a protein family.
Dr. Daniil G. Naumoff, State Institute for Genetics and
Selection of Industrial Microorganisms, Moscow, Russia.
I am going to present a complete procedure for protein family analysis,
from database searching to visualization of the phylogenetic tree.
Special attention will be paid to handling multi-domain protein
structure and to clarifying the phylogenetic status of 'atypical' members
of a protein family. Results of the phylogenetic analysis of several
glycosidase families will be shown as examples. The suggested methods and
programs can be applied to any protein family, but they work more
effectively with globular soluble proteins. Protein family analysis can be
started using any protein sequence as a query. It does not matter whether
the protein has been studied enzymatically or corresponds to a biochemically
uncharacterized ORF. A preliminary version of the lecture (in Russian) has
been published online in the Zbio journal (http://zbio.net/bio/001/003.html).
July 13, 2006
(II) Introduction to high-performance computing. Part II.
Lecture 3. Message passing with MPI. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
The first step into parallel programming will be taken with the
Message Passing Interface (MPI). We will write a small program that
distributes data to different compute nodes, performs some computation,
and finally collects the results. A few basic library calls for message
passing will be introduced, which are already sufficient to write a
first parallel program. Problematic issues such as debugging and
performance analysis will be covered.
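The distribute, compute, and collect phases described above can be mimicked sequentially. The sketch below is plain Python, not MPI: no messages are actually sent, and the function names, the block partitioning scheme, and the sum-of-squares workload are assumptions chosen for illustration. In a real MPI program the first and last phases would be carried out by library calls such as MPI_Scatter and MPI_Gather (or MPI_Send and MPI_Recv).

```python
def block_partition(n_items, n_ranks):
    """Return one (start, stop) index pair per rank.

    Items are split as evenly as possible; the first n_items % n_ranks
    ranks receive one extra item.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    base, extra = divmod(n_items, n_ranks)
    bounds = []
    start = 0
    for rank in range(n_ranks):
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def distribute_compute_collect(data, n_ranks):
    """Sequentially mimic the three phases of the message-passing pattern."""
    # Phase 1 (distribute): the root cuts the data into one chunk per rank.
    chunks = [data[a:b] for a, b in block_partition(len(data), n_ranks)]
    # Phase 2 (compute): each rank works only on its own chunk.
    partial_sums = [sum(x * x for x in chunk) for chunk in chunks]
    # Phase 3 (collect): the root combines the partial results.
    return chunks, sum(partial_sums)


chunks, total = distribute_compute_collect(list(range(10)), 4)
print(chunks)
print(total)
```

Keeping the partitioning in its own function makes it easy to check that every item is assigned to exactly one rank before any communication code is written.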
Lecture 4. Advanced issues with message passing. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
MPI offers a huge number of library calls, most of which simply
combine several basic calls and thus carry out complicated operations
in a single call. We will have a look at collective calls and
sophisticated communication patterns. As bioinformatics is particularly
data intensive, a first introduction to parallel input/output via MPI
will be given. We will present an outlook on advanced features in the
MPI-2 standard and what they are used for.
(III) Application of high-performance computing to the construction and analysis of phylogenies.
Construction of phylogenetic relationships is among the most important
biological problems. Of great importance here are molecular data: DNA,
RNA, and protein sequences. Active genome sequencing, which yields
large numbers of sequences, has made the construction of phylogenies
involving many sequences (1000 and more) a topical problem. This part
of the lecture course will detail algorithms for the construction of
phylogenetic trees and for their comparison. Special attention will be
paid to the specifics of implementing these algorithms on parallel
computer architectures. Problems of large phylogenies (from 100 to 4000
sequences) will be considered as examples.
Lecture 5. Computation of large phylogenetic trees: algorithmic and technical solutions. Dr. Alexandros Stamatakis, Swiss Federal Institute of Technology, Lausanne, Switzerland
The computation of ever larger as well as more accurate phylogenetic
trees with the ultimate goal to compute the "tree of life" represents
one of the grand challenges in high performance computing (HPC)
Bioinformatics. Statistical methods of phylogenetic analysis such as
maximum likelihood and Bayesian inference have proved to be the most
accurate models for evolutionary tree reconstruction.
Unfortunately, the size of trees which can be computed in reasonable
time is limited by the severe computational cost induced by these
methods. There exist two orthogonal research directions to overcome
this challenging computational burden: Firstly, the development of
novel, faster, and more accurate heuristic algorithms. Secondly, the
application of high performance computing techniques, the deployment of
supercomputers, and Grid-computing to provide the required
computational power, mainly in terms of CPU hours.
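One contributor to this cost is the size of the search space. As a standard combinatorial result (not taken from the abstract), the number of distinct unrooted binary tree topologies on n labeled sequences is (2n - 5)!! = 3 * 5 * ... * (2n - 5). The sketch below evaluates it; the function name is hypothetical.

```python
def count_unrooted_binary_trees(n_taxa):
    """Number of unrooted binary tree topologies on n_taxa labeled leaves.

    Equals (2n - 5)!! = 3 * 5 * ... * (2n - 5) for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10):
    print(n, count_unrooted_binary_trees(n))

# For larger inputs only the number of decimal digits is printed.
print(len(str(count_unrooted_binary_trees(50))), "digits for 50 taxa")
```

Ten sequences already allow 2,027,025 topologies, which is why exhaustive search is not an option and heuristic search strategies are needed.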
The field has witnessed significant algorithmic advances over the last
2-3 years, which allow for inference of large phylogenetic trees
containing 500-1000 sequences on a single PC processor within a couple
of hours using maximum likelihood. On the other hand, the main problem
that high performance computing implementations of maximum likelihood
analyses face is that technical development lags behind algorithmic
development, i.e., the programs being parallelized no longer represent
the state-of-the-art algorithms.
Within this context, the talk initially aims to provide a brief
overview of the computational challenges that large-scale phylogenetic
inference faces concerning both algorithmic and supercomputing aspects.
The benefits of simultaneous algorithmic and technical development are
outlined using the example of the program RAxML (Randomized Axelerated
Maximum Likelihood). The sequential version of RAxML has been used to
compute the largest maximum likelihood tree to date (comprising 25,000
organisms) on a single CPU.
In addition, recent algorithmic developments including novel genetic
search algorithms and search techniques will be discussed. Finally, an
overview of possible future HPC implementations of those novel
algorithms is provided, including Grid-based solutions, implementations
for hybrid supercomputer architectures, and exploitation of vector-like
peripheral processors such as Graphics Processing Units (GPUs).
Seminar 2. The Parallelization of Bioinformatics Problems: A
Tutorial. Yury Vyatkin, Institute of Cytology and Genetics, Novosibirsk,
Russia.
In this tutorial we are going to follow the entire
path from a serial program to its completely parallel version, to learn how
to use the features of modern high performance computing systems to the
full. This tutorial could be useful to everyone who knows a little C
and wants to learn how to solve bioinformatics problems with
modern tools. We are going to cover the following topics:
(1) What is High Performance Computing?
- Modern computers and supercomputers. Their types and features.
- Parallelization and how to use it.
- Models of programming on supercomputers.
(2) Problems that could be solved on HPC systems.
- Is my problem worth parallelizing, and how to determine that?
- Usage of a profiler tool.
(3) Parallelization with Message Passing Interface.
- The places in programs most frequently chosen for parallelization.
- How to find a place in a program to parallelize?
- The way
- The most frequently used MPI operators.
- Insert some code into the Plato program.
(4) Further practice with Plato.
July 14, 2006
(IV) Computational modeling of biological macromolecules.
Modeling the structure and functions of biological macromolecules is
among the most resource-intensive problems in bioinformatics.
Therefore, high-throughput computations are used intensively to solve
them. This part of the lecture course will briefly cover the
application of computer algorithms and programs to the analysis of the
structure and function of genetic macromolecules.
Lecture 6. Inhibitors of protein-protein interactions as lead compounds for new drugs generation. Prof. Alexis Ivanov, V.N. Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, Russia
Protein-protein interactions represent a new and extremely attractive
class of molecular targets for creating an essentially new generation
of drugs. The reason is that the contact areas of protein molecules in
complexes are highly conserved with respect to mutations and, hence,
the probability of mutational drug resistance is low for drugs
targeted to these areas.
The authors' laboratory works in the area of computer-aided design and
experimental testing of inhibitors of protein-protein interactions. The
computer technologies include methods of 3D molecular modeling, methods
of molecular mechanics, molecular dynamics simulation, molecular
docking, analysis of intermolecular interactions, virtual alanine
screening, molecular database mining, de novo design, etc. The basic
experimental approach is in vitro analysis of intermolecular
interactions using the Biacore-3000 optical biosensor, which relies on
the effect of surface plasmon resonance. Particular examples of
approaches and results will be presented based on the study of the
tetramer of bacterial L-asparaginase and of inhibitors of HIV-1
protease dimerization.
Lecture 7. Transcription and translation regulations of amino
acid metabolism genes in Actinobacteria and intron-containing genes in
chloroplasts of algae and plants. Prof. Vassily Lyubetsky, Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
Formation of alternative structures in mRNA in response to external
stimuli, either direct or mediated by proteins or other RNAs, is a
major mechanism of regulation of gene expression in bacteria. This
mechanism has been studied in detail using experimental and
computational approaches in proteobacteria and Firmicutes, but not in
other groups of bacteria. Comparative analysis of amino acid
biosynthesis operons in Actinobacteria resulted in identification of
conserved regions upstream of several operons. Classical attenuators
were predicted upstream of trp operons in Corynebacterium spp. and
Streptomyces spp., and trpS and leuS genes in some Streptomyces spp.
Candidate leader peptides with terminators were observed upstream of
ilvB genes in Corynebacterium spp., Mycobacterium spp. and Streptomyces
spp. Candidate leader peptides without obvious terminators were found
upstream of cys operons in Mycobacterium spp. and several other species.
A conserved pseudoknot (named LEU element) was identified upstream of
leuA operons in most Actinobacteria.
Finally, T-boxes likely involved in the regulation of translation
initiation were observed upstream of ileS genes from several
Actinobacteria. The metabolism of tryptophan, cysteine and leucine in
Actinobacteria seems to be regulated on the RNA level. In some cases
the mechanism is classical attenuation, but in many cases some
components of attenuators are missing. The most interesting case seems
to be the leuA operon preceded by the LEU element that may fold into a
conserved pseudoknot or an alternative structure. A LEU element has
been observed in a transposase gene from Bifidobacterium longum,
but it is not conserved in genes encoding closely related transposases
despite a very high level of protein similarity. One possibility is
that the regulatory region of leuA has been co-opted from some
element involved in transposition. Analysis of phylogenetic patterns
allowed for identification of ML1624 of M. leprae and its orthologs as
the candidate regulatory proteins that may bind to the LEU element.
T-boxes upstream of the ileS genes are unusual, as their regulatory
mechanism seems to be inhibition of translation initiation via a
hairpin sequestering the Shine-Dalgarno box.
A short description of our original algorithms for finding conserved
protein-RNA binding sites will be provided. One of these algorithms
is applied to the analysis of chloroplast genes. Candidate
protein-RNA binding sites were detected upstream of atpF, petB, clpP,
psaA, psbA and psbB genes in many chloroplasts of algae and plants. We
surmise that some of these sites are involved in suppressing
translation until splicing is completed.
The lecture includes results of two original publications and describes several novel algorithms in bioinformatics.
(V) High Performance computing in systems biology: analysis of complex biological processes and data.
Analysis of gene and metabolic networks is one of the newest fields of
bioinformatics, having formed over the last 10 years. Researchers in
this field need to handle a tremendous volume of molecular genetic
data (genes, proteins, metabolites) and at the same time take into
account various interactions between these objects. This results in a
growth in the number of parameters describing the behavior of gene
networks and requires large computational resources for data processing
and modeling. Storage of these data and quick access to them is also an
important problem.
Lecture 8. Gene expression patterns: methods for visualization, processing, and quantification. Dr. Konstantin Kolzov, St. Petersburg State Polytechnic University, St. Petersburg, Russia
High-quality, high-resolution images of gene expression patterns have
become available in developmental biology thanks to confocal scanning
microscopy. Extraction of quantitative information is important for
gaining insight into the underlying regulation, constructing
mathematical models, and planning new experiments.
We introduce a new image processing software package, ProStak,
integrated into a distributed computing environment. ProStak includes
all operations needed to extract quantitative information from 2D and
3D biological images. The chain of processing steps can be constructed
visually using a graphical user interface that provides a convenient
environment for digital image processing for all groups of scientists:
beginners, non-programmers, and experts for whom the speed of
obtaining results is critical. All processing methods can also be
accessed through the command line interface, as well as through shared
and static libraries. This combination of features distinguishes
ProStak from other image processing packages such as the commercial
systems Matlab and VisiQuest, and the freely available SIVIL, SCIRun,
and TiViPe.
Seminar 3. Transcription and translation regulations of amino acid
metabolism genes in Actinobacteria and intron-containing genes in
chloroplasts of algae and plants (accompanying Lecture 7).
Dr. Alexander Seliverstov, Institute for Information Transmission Problems,
Russian Academy of Sciences, Moscow, Russia
July 15, 2006
Session of the presentations of young scientists.
Lecture 9. Distributed applications, web services, tools and GRID infrastructures for bioinformatics. Dr. Luciano Milanesi, National Research Council - Institute of Biomedical Technology, Italy
Because of the increasing number of nucleotide and protein sequences
produced by high-throughput techniques, which have to be analyzed with
bioinformatics tools, it will be necessary to increase current
computational resources. Therefore, to meet these new challenges
successfully, it will be necessary to develop dedicated supercomputers,
parallel computers based on clustering technologies, and high
performance distributed platforms such as the GRID.
Next-generation GRID infrastructures aim to implement a distributed
computing model in which easy access to large, geographically
distributed computing and data management resources is provided to
large multi/inter-disciplinary Virtual Organizations (VOs) made up of
both research and user entities.
Indeed, computational and data Grids are "de facto" considered the
way to realize the concept of virtual places where scientists and
researchers work together to solve complex problems in bioinformatics,
regardless of geographic and organizational boundaries.
In this respect, Grid computing heralds another technological and
societal revolution in high performance distributed computing, much as
the World Wide Web has done over the last ten years for the meaning
and availability of global information.
The aim is to operate this widely distributed computing environment as
a uniform service, which looks after resource management, exploitation,
and security independently of individual technology choices.
A general overview will be given of GRID technologies and computer
clusters used to run distributed bioinformatics applications for data
mining, gene discovery, and similarity searching of DNA and protein
sequences.
Seminar 4. The models of adaptive
dynamics as tools for studying of neutral molecular evolution. Dr. Yury
Bukin, Limnological Institute SB RAS, Irkutsk, Russia.