
2.5 Annotation policies: Evolutionary annotation

In the evolutionary annotation of H-InvDB, orthologs between human and other vertebrates were detected for all human genes defined by human transcriptome analyses. Orthologs are genes in different species that evolved from a common ancestral gene by speciation; genes that arose by duplication within a species are paralogs. For human representative transcripts (one transcript per gene, 2.3.2), orthologs were detected by comparative genomics and amino acid sequence analysis ("Computational analysis"). Human duplicate gene families were then detected, and phylogenetic trees including their orthologs were constructed. Based on these trees, orthologs whose gene phylogeny agreed with the species phylogeny were identified ("Manual curation", currently performed by in-house tools). These orthologs are available in Evola (4.9 Evola).

2.5.1 Ortholog detection

Genome alignments between human and other vertebrates were constructed by BLASTZ [Schwartz et al. 2003] following our procedures [Fujii et al. 2005, Kawahara et al. 2009]. The most similar genome alignments were used for ortholog detection. The original genome assemblies were downloaded from UCSC.
For human genes, H-InvDB representative transcripts (2.3.2) were used. For other species, all transcripts (RNAs) were downloaded from DDBJ, RefSeq and Ensembl. They were mapped onto the genome of the same species by the H-InvDB mapping pipelines using BLAT [Kent 2002], BLAST [Altschul et al. 1990] and est2genome [Rice et al. 2000] to locate their genomic positions (best loci). Transcripts whose exons overlapped each other on the same strand were grouped into a gene locus. For each gene locus, a representative transcript and representative alternative splicing variants (RASVs) [Takeda et al. 2007] (2.2) were determined. Both the representative transcripts and the RASVs were used for ortholog detection.
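The locus definition above can be pictured as a clustering of transcripts by exon overlap. The following is only a minimal sketch and not the H-InvDB mapping pipeline: the function names, the input layout (transcript id mapped to chromosome, strand and exon intervals) and the choice to group transitively (single linkage) are assumptions made for illustration.

```python
from collections import defaultdict

def exons_overlap(exons_a, exons_b):
    """Return True if any exon of A overlaps any exon of B.

    Exons are (start, end) pairs with inclusive coordinates.
    """
    return any(a_start <= b_end and b_start <= a_end
               for a_start, a_end in exons_a
               for b_start, b_end in exons_b)

def build_loci(transcripts):
    """Group transcripts into loci (hypothetical helper).

    transcripts: dict of id -> (chrom, strand, [(start, end), ...]).
    Transcripts are joined when their exons overlap on the same
    chromosome and strand; grouping is transitive (an assumption).
    Returns a sorted list of sorted id lists, one per locus.
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only transcripts on the same chromosome and strand are compared.
    by_strand = defaultdict(list)
    for tid, (chrom, strand, _exons) in transcripts.items():
        by_strand[(chrom, strand)].append(tid)

    for tids in by_strand.values():
        for i, a in enumerate(tids):
            for b in tids[i + 1:]:
                if exons_overlap(transcripts[a][2], transcripts[b][2]):
                    parent[find(a)] = find(b)

    loci = defaultdict(list)
    for tid in transcripts:
        loci[find(tid)].append(tid)
    return sorted(sorted(members) for members in loci.values())
```

The pairwise comparison is quadratic in the number of transcripts per strand; a production pipeline would more likely sort intervals first, but the grouping rule is the same.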
Pairs of a human transcript and an other-species transcript were first selected if their exons overlapped in the genome alignments and the overlap was the longest for both the human side and the other-species side. Then, pairs that were alignable over 50% or more of the length of the human amino acid sequence were designated "orthologs detected by computational analysis". Amino acid sequences of human transcripts were taken from H-InvDB; those of other species were taken from the original databases (DDBJ, Ensembl and RefSeq). If no amino acid sequence was available for a transcript of another species, it was predicted by running FASTY [Pearson 2000] with the other-species transcript as the query and the human amino acid sequence as the reference.
In this analysis, not only one-to-one orthologs but also many-to-many orthologs may be detected.
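The two selection steps described in this section can be sketched as follows. This is an illustrative reading of the text, not the actual implementation: the function names and input dictionaries are hypothetical, "longest for both sides" is interpreted as a reciprocal-longest exon overlap, and ties are kept, which is one way a sketch like this can yield more than one partner per transcript.

```python
def reciprocal_longest_pairs(overlap):
    """Select pairs whose exon overlap is the longest for both sides.

    overlap: dict of (human_id, other_id) -> overlap length in aligned
    bases (hypothetical input derived from the genome alignments).
    """
    best_human = {}
    best_other = {}
    for (human, other), length in overlap.items():
        best_human[human] = max(best_human.get(human, 0), length)
        best_other[other] = max(best_other.get(other, 0), length)
    return sorted(
        (human, other)
        for (human, other), length in overlap.items()
        if length > 0
        and length == best_human[human]
        and length == best_other[other]
    )

def computational_orthologs(overlap, aligned_aa, human_aa_len,
                            min_fraction=0.5):
    """Keep reciprocal-longest pairs alignable over at least
    min_fraction of the human amino acid sequence length.

    aligned_aa: dict of (human_id, other_id) -> aligned residues.
    human_aa_len: dict of human_id -> protein length in residues.
    """
    return [
        (human, other)
        for human, other in reciprocal_longest_pairs(overlap)
        if aligned_aa.get((human, other), 0)
        >= min_fraction * human_aa_len[human]
    ]
```

For example, a pair with 90 aligned residues against a 200-residue human protein is dropped, while a pair with exactly 100 aligned residues is kept, because the threshold is "50% or more".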

2.5.2 Gene family detection and phylogenetic tree construction

To analyze the orthologs of duplicate genes phylogenetically, a gene family dataset was constructed. First, similarity-based single-linkage analysis [Gu et al. 2002] of the amino acid sequences of the human representative transcripts was conducted to divide the human genes into groups. Then, if two or more human genes were orthologous to a single gene of another species but had been placed in different groups, those groups were merged. This procedure was repeated for all orthologous relations to detect gene families. The resulting gene families corresponded one-to-one among species (orthologous gene families).
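The grouping and merging steps can be expressed with a union-find (disjoint-set) structure. The sketch below is a hypothetical illustration: build_families and its inputs are not part of H-InvDB, and the similarity criterion that produces similar_pairs is assumed to have been applied beforehand.

```python
def build_families(human_genes, similar_pairs, orthologs):
    """Cluster human genes into families (illustrative sketch).

    human_genes: iterable of human gene ids.
    similar_pairs: (gene_a, gene_b) pairs of human genes that passed
        the sequence similarity criterion (single-linkage edges).
    orthologs: (human_gene, other_species_gene) pairs.
    Returns a sorted list of sorted gene id lists, one per family.
    """
    parent = {gene: gene for gene in human_genes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Step 1: single-linkage grouping by sequence similarity.
    for a, b in similar_pairs:
        union(a, b)

    # Step 2: merge groups whose members share an ortholog
    # in another species.
    first_human_for = {}
    for human, other in orthologs:
        if other in first_human_for:
            union(human, first_human_for[other])
        else:
            first_human_for[other] = human

    families = {}
    for gene in parent:
        families.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in families.values())
```

Because unions are transitive, one pass over all orthologous relations has the same effect as repeating the merge until no groups change.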
Multiple alignments of the human and other-species amino acid sequences belonging to each orthologous family were constructed by ClustalW [Thompson et al. 1994]. Non-alignable sequences were then removed from the alignments [Endo et al. 2002] to obtain more rigorous alignments. Phylogenetic trees were constructed from these alignments by the neighbor-joining (NJ) method [Saitou and Nei 1987].
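For readers unfamiliar with neighbor joining, a single joining step of the standard algorithm is sketched below. This is a textbook-style illustration, not the tool used for Evola; the function name nj_join_step and the list-of-lists matrix layout are assumptions. Repeating the step until three nodes remain yields the full unrooted tree.

```python
def nj_join_step(dist):
    """Perform one neighbor-joining step on a distance matrix.

    dist: symmetric matrix (list of lists) for n >= 3 taxa.
    Returns (i, j, len_i, len_j, new_dist): the joined pair, their
    branch lengths to the new internal node, and the reduced matrix
    in which the new node is the last row and column.
    """
    n = len(dist)
    totals = [sum(row) for row in dist]

    # Choose the pair minimizing Q(i, j) = (n-2)*d(i,j) - r_i - r_j.
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best

    # Branch lengths from i and j to the new internal node.
    len_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i

    # Distances from the new node to every remaining taxon.
    rest = [k for k in range(n) if k not in (i, j)]
    new_row = [0.5 * (dist[i][k] + dist[j][k] - dist[i][j]) for k in rest]

    new_dist = [[dist[a][b] for b in rest] + [new_row[x]]
                for x, a in enumerate(rest)]
    new_dist.append(new_row + [0.0])
    return i, j, len_i, len_j, new_dist
```

In practice the distance matrix would be estimated from the cleaned multiple alignment, and bootstrap support would be obtained by resampling alignment columns and rebuilding the tree for each replicate.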
Using the phylogenetic trees, the orthologs detected by computational analysis were re-analyzed. If the phylogenetic relationships estimated for the genes agreed with those of the species, the orthologs were designated "ortholog determined by manual curation". The manual curation criteria (bootstrap value 900/1000, assumption of a trichotomy of primates, rodents and Laurasiatherians, etc.) were initially determined at the All Human Genes Evolutionary Annotation Meeting (2006) [Matsuya et al. 2008]. Although the trees were originally examined by researchers in the field of biology, the manual curation is currently automated by in-house tools that incorporate these criteria for efficiency. If a family lacked phylogenetic information (not enough transcripts, no outgroup species, etc.), its orthologs remained as "ortholog detected by computational analysis".
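The status assignment can be summarized as a small decision rule. The sketch below is hypothetical: it assumes that the quoted bootstrap value acts as a minimum threshold (900 or more out of 1000 replicates), that the agreement between gene tree and species tree has already been evaluated elsewhere, and that any ortholog failing a check keeps the computational status. The in-house tools apply further criteria not listed in the text.

```python
MIN_BOOTSTRAP = 900  # out of 1000 replicates; treated as a minimum (assumption)

COMPUTATIONAL = "Computational analysis"
MANUAL = "Manual curation"

def annotation_status(bootstrap, topology_agrees, has_outgroup=True):
    """Return the annotation status of an ortholog (illustrative only).

    bootstrap: bootstrap support (0-1000) of the relevant node, or
        None when no tree could be built for the family.
    topology_agrees: True if the gene phylogeny matches the species
        phylogeny under the curation criteria.
    has_outgroup: False if the family has no outgroup species.
    """
    if bootstrap is None or not has_outgroup:
        # The family lacks phylogenetic information.
        return COMPUTATIONAL
    if bootstrap >= MIN_BOOTSTRAP and topology_agrees:
        return MANUAL
    return COMPUTATIONAL
```

Under these assumptions, annotation_status(950, True) gives "Manual curation", while a family without a tree, or a node with support below the threshold, stays at "Computational analysis".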
Consequently, orthologs were provided with two annotation statuses: comprehensive orthologs (Computational analysis + Manual curation) and more reliable orthologs (Manual curation, supported by phylogenetic trees).

2.5.3 References

  1. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
  2. Endo T, Ogishima S and Tanaka H. ETools: Tools to Handle Biological Sequences and Alignments for Evolutionary Studies. Genome Inform. 13, 543-544 (2002).
  3. Fujii Y, Itoh T, Sakate R, et al. A web tool for comparative genomics: G-compass. Gene 364, 45-52 (2005).
  4. Gu Z, Cavalcanti A, Chen FC, et al. Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol. Biol. Evol. 19, 256-262 (2002).
  5. Kawahara Y, Sakate R, Matsuya A, et al. G-compass: A web-based comparative genome browser between human and other vertebrate genomes. submitted (2009).
  6. Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656-664 (2002).
  7. Matsuya A, Sakate R, Kawahara Y, et al. Evola: Ortholog database of all human genes in H-InvDB with manual curation of phylogenetic trees. Nucleic Acids Res. D787-792 (2008).
  8. Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132, 185-219 (2000).
  9. Rice P, Longden I and Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276-277 (2000).
  10. Saitou N and Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406-425 (1987).
  11. Schwartz S, Kent WJ, Smit A, et al. Human-mouse alignments with BLASTZ. Genome Res. 13, 103-107 (2003).
  12. Takeda J, Suzuki Y, Nakao M, et al. H-DBAS: Alternative splicing database of completely sequenced and manually annotated full-length cDNAs based on H-Invitational. Nucleic Acids Res. D104-109 (2007).
  13. Thompson JD, Higgins DG and Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680 (1994).
Revised: August 11, 2009