HELP : Introduction
The concept of phylogenomic profiling :
    One of the central goals of bioinformatics is to assign functions to proteins from genomic sequences. For this purpose, alignment methods based on sequence similarity remain the most developed and most widely used. Yet they give indications of function for only about fifty percent of the proteins of an organism. This limit encourages the development of new methods that exploit the information contained in the full sequence of a genome. Such phylogenomic approaches have become possible thanks to the recent, massive sequencing of genomes.
    Phylogenomic profiling is one of the major methods that are not based on sequence homology. It is designed to infer a likely functional relationship between proteins. It rests on the assumption that proteins involved in a common metabolic pathway, or forming a multi-molecular complex, are likely to evolve in a correlated manner. This paradigm was put to use by Pellegrini et al. (Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. PNAS. 1999 Apr 13;96(8):4285-8). Taking Escherichia coli as reference, this study showed that the evolutionary pattern of genes carries information about their function.
The generation of our data :
    To begin with, we focused on the bacterium Escherichia coli. The first step was to collect all the protein-coding genes of E. coli K-12. Because we look for sequences significantly similar to those genes, we excluded genes shorter than 150 nucleotides from the study. As a result, 4263 of the 4279 protein-coding genes listed in GenBank are used. We work with protein sequences because they preserve information over evolutionary time much better than DNA sequences. This is due to the existence of multiple alternative codons for many amino acids and to the fact that amino acids have distinctive structural and chemical properties.
Schematic representation of the E. coli genome. Six of the 4263 known protein coding genes are colorized.
    The 71 finished genomes whose sequences were public were downloaded from the NCBI. They comprise 16 archaea and 55 bacteria, the latter including 20 Gram-positive and 29 Gram-negative species. We took only one strain per species, so that each species has the same weight when determining which genes have co-evolved. To find sequences similar to E. coli proteins in the other genomes, we predicted a large set of putative coding sequences, the ORFs, in those 71 genomes. We preferred to work with these sets of predicted proteins because the sets of known proteins did not seem exhaustive, especially for recently sequenced organisms.
Schematic representation of the E. coli genome surrounded with five other bacterial genomes. Six genes of E. coli are colorized and the darker parts of the other genomes are predicted ORFs (putative coding regions).
    Each E. coli protein was finally compared with all those predicted ORFs using Blastp. Each point of a profile is the best Blastp bit score between the target protein and all ORFs of one genome, divided by the self-score of the target protein (the score obtained when it is aligned with itself). As the alignment with itself is always the best one, these profile values (called normalized scores) range between zero and one. Using normalized Blastp scores allows each point to be weighted in proportion to the length and quality of the corresponding alignment. A second normalization is then applied. To compensate for the decrease in protein similarity (i.e. score) expected when comparing homologous genes from organisms at increasing evolutionary distance, we normalize each column (i.e. each genome) by the average of the non-zero normalized scores (those above the bit score threshold) obtained with that genome.
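The two normalization steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used to build the database: the function name `build_profiles`, the input layout, the gene and genome names, the scores and the threshold value are all hypothetical, and "normalize each column by the average" is read here as a division.

```python
def build_profiles(best_scores, self_scores, threshold):
    """Build two-step normalized profiles.

    best_scores[protein][genome]: best Blastp bit score of the protein
        against all ORFs of that genome (hypothetical input layout).
    self_scores[protein]: bit score of the protein aligned with itself.
    threshold: bit score below which a hit is treated as absent
        (the actual value is not given in the text).
    """
    genomes = sorted({g for row in best_scores.values() for g in row})

    # Step 1: divide each best score by the protein's self-score.
    normalized = {}
    for protein, row in best_scores.items():
        normalized[protein] = {}
        for genome in genomes:
            score = row.get(genome, 0.0)
            if score < threshold:
                score = 0.0  # no significant homolog in this genome
            normalized[protein][genome] = score / self_scores[protein]

    # Step 2: divide each column by the mean of its non-zero values
    # (assumption: "normalize by the average" means division).
    column_mean = {}
    for genome in genomes:
        hits = [normalized[p][genome] for p in normalized
                if normalized[p][genome] > 0.0]
        column_mean[genome] = sum(hits) / len(hits) if hits else 1.0

    profiles = {p: [normalized[p][g] / column_mean[g] for g in genomes]
                for p in normalized}
    return genomes, profiles


# Toy data: two genes, two genomes (made-up scores).
best = {"geneA": {"genome1": 100.0, "genome2": 50.0},
        "geneB": {"genome1": 20.0, "genome2": 5.0}}
self_score = {"geneA": 200.0, "geneB": 80.0}
genomes, profiles = build_profiles(best, self_score, threshold=10.0)
```

In this toy run, the hit of `geneB` in `genome2` falls below the threshold and is therefore set to zero before either normalization is applied.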
The colorized parts of the bacterial genomes are regions significantly similar to the E. coli gene of the corresponding colour.
    We have constructed the profiles of the 4263 protein-coding genes, and several visualizations of those profiles are possible. As the profiles are made of continuous, positive values, we could simply have plotted them. Instead, we chose a DNA-chip-like representation. For a given E. coli sequence, if no homologous sequence is found in a genome, the corresponding box is light green. The more similar the homologous sequence is to the given E. coli protein, the darker red the box.
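As an illustration of this colour coding, the sketch below maps a profile value to an RGB colour. The text only specifies the two extremes (light green for no homolog, dark red for the most similar homologs); the exact RGB values, the linear interpolation between them, and the names `value_to_rgb` and `max_value` are assumptions made for this example.

```python
LIGHT_GREEN = (144, 238, 144)  # assumed colour for "no homolog found"
DARK_RED = (139, 0, 0)         # assumed colour for the highest similarity


def value_to_rgb(value, max_value=1.0):
    """Map a profile value to a colour between light green and dark red.

    Values are clamped to the range [0, max_value] before interpolation.
    """
    t = max(0.0, min(value / max_value, 1.0))
    return tuple(round(a + t * (b - a))
                 for a, b in zip(LIGHT_GREEN, DARK_RED))


# One row of boxes for a made-up profile of four genomes.
row = [value_to_rgb(v) for v in [0.0, 0.2, 0.7, 1.0]]
```

Each tuple in `row` is the colour of one box, in the same genome order as the profile.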
Two possible representations of the profiles of the six E. coli genes: the plotted values and the DNA-chip-like representation.
    Once those profiles are constructed, a relation between them can be computed: similar profiles reflect a similar evolution of the corresponding genes. Different distances were computed and compared using data on known metabolic pathways. The best one, i.e. the most suitable for detecting co-evolution between proteins, was kept. Two proteins that have evolved similarly have similar profiles and lie at a very small phylogenomic distance. For a target protein, identifying its phylogenomic neighbors, i.e. the proteins that best co-evolved with it, and analysing their annotations may suggest a new hypothetical function for this target.
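The neighbor search can be sketched as follows. The text does not say which distance was finally kept, so the Euclidean distance is used here purely as a stand-in; the function names, gene names and profile values are hypothetical.

```python
import math


def profile_distance(p, q):
    """Euclidean distance between two profiles of equal length
    (a stand-in for the distance actually retained, which is not named)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_neighbors(target, profiles, k=3):
    """Return the k genes whose profiles are closest to the target's."""
    ranked = sorted(
        (profile_distance(profiles[target], vec), name)
        for name, vec in profiles.items()
        if name != target
    )
    return [name for _, name in ranked[:k]]


# Toy profiles over three genomes (made-up values).
toy_profiles = {
    "geneA": [1.0, 0.0, 0.8],
    "geneB": [0.9, 0.0, 0.7],
    "geneC": [0.0, 1.0, 0.0],
}
neighbors = nearest_neighbors("geneA", toy_profiles, k=1)
```

Here `geneB` is present and absent in the same genomes as `geneA`, so it comes out as the closest neighbor; its annotation would be the first one to examine when looking for a hypothetical function for `geneA`.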
Distances between the six E. coli genes. These distances quantify the relation between the profiles, which are strings of numbers.