| The concept of phylogenomic profiling: |
|
One of the central goals of bioinformatics is to assign functions to
proteins from genomic sequences. For this purpose, alignment methods
based on sequence similarity remain the most developed and the most
widely used. Yet they give indications of function for only fifty
percent of the proteins of an organism. This limit encourages the
development of new methods that exploit the information contained
in the full sequence of a genome. These phylogenomic approaches
have become possible thanks to the recent, massive sequencing of
genomes.
Phylogenomic profiling is one of the major methods that are not based
on sequence homology. It is designed to infer a likely functional
relationship between proteins, on the assumption that proteins involved
in a common metabolic pathway or forming a multi-molecular complex are
likely to evolve in a correlated manner. This paradigm was put to use
by Pellegrini et al. (Assigning protein functions by
comparative genome analysis: protein phylogenetic
profiles. PNAS. 1999 Apr 13;96(8):4285-8).
Taking Escherichia coli as a reference, that study showed that
the evolution of genes contains some information about their function.
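
As an illustration of what "evolving in a correlated manner" can mean in practice, here is a minimal sketch that compares two profiles with a Pearson correlation. The choice of Pearson correlation, the function name `pearson`, and the toy presence/absence profiles are assumptions made for this example. They are not taken from the study cited above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-variation signal
    return cov / sqrt(var_x * var_y)

# Toy profiles over five genomes (1 = homolog present, 0 = absent); made-up data
profile_a = [1, 1, 0, 0, 1]
profile_b = [1, 1, 0, 0, 1]   # same pattern as profile_a
profile_c = [0, 0, 1, 1, 0]   # opposite pattern

same = pearson(profile_a, profile_b)
opposite = pearson(profile_a, profile_c)
```

Two proteins whose profiles rise and fall together across genomes get a correlation close to 1, which is the kind of signal the method interprets as a possible functional link.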
|
| The generation of our data: |
|
To begin with, we focused on the bacterium Escherichia coli.
The first step is to collect all the protein-coding genes of E. coli
K-12. As we are going to look for sequences significantly
similar to those genes, we excluded from the study the genes shorter
than 150 nucleotides. Thus, 4263 of the 4279 protein-coding genes in
GenBank are used. We work with the protein
sequences because they preserve information over evolutionary
time much better than DNA sequences. This is due to the existence of
multiple alternative codons for many amino acids and to the fact that
amino acids have distinctive structural and chemical properties.
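
The length filter described above can be sketched as follows. The 150-nucleotide threshold comes from the text. The function name `keep_long_genes`, the gene names, and the toy sequences are hypothetical.

```python
MIN_GENE_LENGTH_NT = 150  # threshold stated in the text

def keep_long_genes(genes, min_length_nt=MIN_GENE_LENGTH_NT):
    """Return only the genes with at least min_length_nt nucleotides."""
    return {name: seq for name, seq in genes.items() if len(seq) >= min_length_nt}

# Toy input: hypothetical gene names with made-up sequences
genes = {
    "geneA": "ATG" + "GCT" * 60 + "TAA",  # 186 nt, kept
    "geneB": "ATG" + "GCT" * 10 + "TAA",  # 36 nt, discarded
}
kept = keep_long_genes(genes)
```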
|
 |
Schematic representation
of the E. coli genome. Six of the 4263 known protein-coding
genes are coloured.
|
|
The 71 finished genomes whose sequences were public were downloaded
from the NCBI. We have 16 archaebacteria and 55 bacteria, among which
there are 20 Gram-positive and 29 Gram-negative. We took only one strain
per species so as to give each species the same weight in the
determination of genes that have co-evolved. To find sequences similar
to E. coli proteins in the other bacterial genomes, we predicted a large
set of putative coding sequences, the ORFs, in those 71 genomes. We
preferred to work with these sets of predicted proteins because the sets
of known proteins of each bacterium did not seem exhaustive, especially
for recently sequenced bacteria.
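
The text does not say which tool or criteria were used to predict the ORFs. As a rough illustration only, the sketch below scans the six reading frames of a DNA sequence and reports every stretch free of stop codons that reaches a minimum length. The stop-to-stop definition, the `min_length_nt` parameter, and the function names are assumptions for this example, not the procedure actually used here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_length_nt=150):
    """Yield (strand, frame, sequence) for stop-free stretches in all six frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = frame
            # Walk the frame codon by codon; a stop codon closes the current stretch
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOP_CODONS:
                    if pos - start >= min_length_nt:
                        yield strand, frame, seq[start:pos]
                    start = pos + 3
            # Stretch still open when the sequence ends (whole codons only)
            end = frame + ((len(seq) - frame) // 3) * 3
            if end - start >= min_length_nt:
                yield strand, frame, seq[start:end]

# Toy sequence (made up): 20 alanine codons, a stop codon, then 5 more codons
toy = "GCT" * 20 + "TAA" + "GCT" * 5
orfs = list(find_orfs(toy, min_length_nt=30))
```

A permissive definition such as this one over-predicts coding regions, which is consistent with the stated aim of obtaining a large set of putative coding sequences.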
|
 |
Schematic representation
of the E. coli genome surrounded by five other bacterial
genomes. Six genes of E. coli are coloured, and the darker
parts of the other genomes are predicted ORFs (putative coding regions).
|
|
Each E. coli protein was finally compared with all those
bacterial ORFs using Blastp. Each point of a profile is the best
Blastp bit score between the target protein and all the ORFs of one
bacterium, divided by the self-score of the target protein (the score
obtained when it is aligned with itself). As the
alignment with itself is always the best one, the profile values
(called normalized scores) range between zero and one. Using
normalized Blastp scores allows each point to be weighted
in proportion to the length and quality of the corresponding
alignment. A second normalization procedure is then applied. In order to
compensate for the decrease in protein similarity (i.e. score) expected
when comparing homologous genes from bacteria at increasing
evolutionary distance, we normalize each column (i.e. each bacterium)
by the average of the non-zero normalized scores (those above the bit
score threshold) obtained with this bacterium.
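
The two normalization steps can be sketched as below, under stated assumptions: the best bit scores are already available as plain lists (one value per genome), scores at or below the threshold are set to zero, and the threshold value of 45 is purely illustrative since the text does not give it. Function and variable names are hypothetical.

```python
def normalized_scores(best_bit_scores, self_score, bit_score_threshold=0.0):
    """First normalization: best bit score per genome divided by the self-score.

    Scores not above the threshold are treated as "no hit" and set to zero.
    """
    return [
        (score / self_score) if score > bit_score_threshold else 0.0
        for score in best_bit_scores
    ]

def normalize_columns(matrix):
    """Second normalization: divide each column by the mean of its non-zero values."""
    result = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        nonzero = [row[j] for row in matrix if row[j] > 0.0]
        if not nonzero:
            continue  # no hit at all in this genome, leave the column at zero
        mean = sum(nonzero) / len(nonzero)
        for row in result:
            row[j] /= mean
    return result

# Toy data: two proteins (rows) against three genomes (columns); made-up scores
row_a = normalized_scores([100.0, 0.0, 50.0], self_score=200.0, bit_score_threshold=45.0)
row_b = normalized_scores([300.0, 40.0, 0.0], self_score=400.0, bit_score_threshold=45.0)
profiles = normalize_columns([row_a, row_b])
```

In this toy case the first step gives 0.5, 0 and 0.25 for the first protein and 0.75, 0 and 0 for the second. After the second step a value can exceed one, since each column is rescaled by its own average.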
|
 |
The coloured parts of the
bacterial genomes are regions significantly similar to the
E. coli gene of the corresponding colour.
|
|
We have constructed the profiles of the 4263 protein-coding genes, and
several visualizations of those profiles are possible. As the
profiles are made of continuous, positive values, we could simply have
plotted them. Instead, we chose a DNA-chip-like way of
representing the profiles. For a given E. coli sequence, if no
homologous sequence is found in a bacterium, the corresponding box
is light green. The more similar the homologous sequence found is
to the given protein, the darker red the box.
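
One possible way to turn a profile value into a box colour is a linear blend between the two end colours. The exact RGB values, the linear interpolation, and the `max_value` parameter are assumptions for this sketch. The text only fixes light green for "no homolog" and darker red for higher similarity.

```python
LIGHT_GREEN = (144, 238, 144)  # assumed RGB for "no homologous sequence found"
DARK_RED = (139, 0, 0)         # assumed RGB for the most similar homolog

def profile_colour(value, max_value=1.0):
    """Map a profile value to an RGB triple between light green and dark red."""
    # Clamp to [0, 1] so that values above max_value saturate at dark red
    t = min(max(value / max_value, 0.0), 1.0)
    return tuple(round(g + t * (r - g)) for g, r in zip(LIGHT_GREEN, DARK_RED))

no_hit = profile_colour(0.0)
best_hit = profile_colour(1.0)
```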
|