In the evolutionary annotation of H-InvDB, orthologs between human and other vertebrates were detected for all human genes defined by human transcriptome analyses. Orthologs are genes in different species that evolved from a common ancestral gene by speciation; genes that arose by duplication within a species are paralogs. For human representative transcripts (one transcript per gene, 2.3.2), orthologs were detected by comparative genomics and amino acid sequence analysis ("Computational analysis"). Human duplicate gene families were then detected in order to construct phylogenetic trees that include the orthologs. Based on these trees, orthologs for which the gene phylogeny agreed with the species phylogeny were detected ("Manual curation", currently performed by in-house tools). These orthologs are available in Evola (4.9 Evola).
Genome alignments between human and other vertebrates were constructed with BLASTZ [Schwartz et al. 2003] following our procedures [Fujii et al. 2005, Kawahara et al. 2009]. The most similar genome alignments were used for ortholog detection. The original genome assemblies were downloaded from UCSC.
For human genes, H-InvDB representative transcripts (2.3.2) were used. For other species, all transcripts (RNAs) were downloaded from DDBJ, RefSeq and Ensembl. They were mapped onto the genome of the same species by the H-InvDB mapping pipelines using BLAT [Kent 2002], BLAST [Altschul et al. 1990] and est2genome [Rice et al. 2000] to locate their genomic positions (best loci). Transcripts whose exons overlapped each other on the same strand were grouped into a gene locus. For each gene locus, a representative transcript and representative alternative splicing variants (RASVs) [Takeda et al. 2007] (2.2) were determined. The representative transcripts and RASVs were used for ortholog detection.
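The grouping of transcripts into gene loci described above can be sketched as follows. This is a minimal illustration, not the H-InvDB pipeline itself: the function name `build_gene_loci`, the input layout, and the use of 1-based inclusive coordinates are assumptions made for the example, and overlap is treated transitively (if A overlaps B and B overlaps C, all three share a locus).

```python
from collections import defaultdict


def build_gene_loci(transcripts):
    """Group transcripts whose exons overlap on the same chromosome and strand.

    transcripts: dict mapping transcript id -> (chrom, strand, exons), where
    exons is a list of (start, end) pairs in 1-based inclusive coordinates
    (hypothetical layout chosen for this sketch).
    Returns a sorted list of loci, each a sorted list of transcript ids.
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Find the set representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Collect exons per (chromosome, strand) so opposite strands never merge.
    exons_by_strand = defaultdict(list)
    for tid, (chrom, strand, exons) in transcripts.items():
        for start, end in exons:
            exons_by_strand[(chrom, strand)].append((start, end, tid))

    # Sweep over exons sorted by start; an exon starting inside the current
    # block overlaps at least one exon already in that block.
    for exons in exons_by_strand.values():
        exons.sort()
        cur_end, cur_tid = None, None
        for start, end, tid in exons:
            if cur_tid is not None and start <= cur_end:
                union(cur_tid, tid)
                cur_end = max(cur_end, end)
            else:
                cur_end, cur_tid = end, tid

    loci = defaultdict(set)
    for tid in transcripts:
        loci[find(tid)].add(tid)
    return sorted(sorted(members) for members in loci.values())


# Illustrative input: T4 lies in an intron of T1 and T3 is on the other strand.
example_transcripts = {
    "T1": ("chr1", "+", [(100, 200), (300, 400)]),
    "T2": ("chr1", "+", [(350, 450), (600, 700)]),
    "T3": ("chr1", "-", [(100, 200)]),
    "T4": ("chr1", "+", [(210, 290)]),
}
example_loci = build_gene_loci(example_transcripts)
```

In this example only T1 and T2 share an exon overlap on the same strand, so they form one locus, while T3 and T4 each form their own.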
Pairs of a human transcript and an other-species transcript were first selected if their exons overlapped in the genome alignments and the overlap was the longest for both the human side and the other-species side. Then, pairs that were alignable over 50% or more of the length of the human amino acid sequence were detected as "orthologs detected by computational analysis". Amino acid sequences of human transcripts were taken from H-InvDB; those of other species were taken from the original databases (DDBJ, Ensembl and RefSeq). If no amino acid sequence was available for an other-species transcript, it was predicted by running FASTY [Pearson 2000] with the other-species transcript as the query and the human amino acid sequence as the reference.
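The two-step selection above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary inputs, and the reading of "longest for both sides" as a reciprocal-longest exon overlap (with ties kept) are hypothetical choices for the example, not the documented H-InvDB implementation. The 50% threshold is taken from the text.

```python
from collections import defaultdict

MIN_COVERAGE = 0.5  # "50% or more of the human amino acid sequence"


def select_candidate_pairs(exon_overlaps):
    """Keep pairs whose exon overlap is the longest for both transcripts.

    exon_overlaps: dict mapping (human_tx, other_tx) -> overlapping exon
    length in bp within the genome alignment (hypothetical input layout).
    Ties are kept, so more than one partner per transcript is possible.
    """
    best_for_human = defaultdict(int)
    best_for_other = defaultdict(int)
    for (human_tx, other_tx), bp in exon_overlaps.items():
        best_for_human[human_tx] = max(best_for_human[human_tx], bp)
        best_for_other[other_tx] = max(best_for_other[other_tx], bp)
    return {
        (human_tx, other_tx)
        for (human_tx, other_tx), bp in exon_overlaps.items()
        if bp > 0
        and bp == best_for_human[human_tx]
        and bp == best_for_other[other_tx]
    }


def computational_orthologs(exon_overlaps, aligned_aa, human_aa_len):
    """Apply the coverage filter to the reciprocal-longest candidate pairs.

    aligned_aa: dict (human_tx, other_tx) -> aligned length in residues.
    human_aa_len: dict human_tx -> length of the human protein in residues.
    """
    orthologs = set()
    for human_tx, other_tx in select_candidate_pairs(exon_overlaps):
        coverage = aligned_aa.get((human_tx, other_tx), 0) / human_aa_len[human_tx]
        if coverage >= MIN_COVERAGE:
            orthologs.add((human_tx, other_tx))
    return orthologs


# Illustrative data only (not taken from H-InvDB).
example_overlaps = {
    ("H1", "M1"): 900,
    ("H1", "M2"): 300,
    ("H2", "M2"): 500,
    ("H3", "M3"): 400,
}
example_aligned = {("H1", "M1"): 280, ("H2", "M2"): 60, ("H3", "M3"): 100}
example_lengths = {"H1": 300, "H2": 200, "H3": 200}
example_orthologs = computational_orthologs(
    example_overlaps, example_aligned, example_lengths
)
```

Here the pair H2/M2 passes the overlap step but is dropped by the coverage filter (60 of 200 residues), while H3/M3 is kept because it reaches exactly 50%.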
In this analysis, not only one-to-one orthologs but also many-to-many orthologs may be detected.
To analyze the orthologs of duplicate genes phylogenetically, a gene family dataset was constructed. First, similarity-based single-linkage analysis [Gu et al. 2002] of the amino acid sequences of the human representative transcripts was conducted to divide the human genes into groups. Then, if two or more human genes were orthologous to a single gene of another species but had been placed in different groups, those groups were merged. This procedure was repeated for all orthologous relations to detect gene families. The resulting gene families corresponded one-to-one among species (orthologous gene families).
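The family construction above amounts to single-linkage grouping followed by merging of groups that share an ortholog, which a union-find structure expresses compactly. The sketch below is illustrative: `build_gene_families` and its inputs are hypothetical names, and the similarity criteria of [Gu et al. 2002] are not implemented here; pairs already judged similar are assumed to be given.

```python
from collections import defaultdict


def build_gene_families(human_genes, similar_pairs, orthologs):
    """Group human genes into families.

    human_genes: iterable of human gene ids.
    similar_pairs: iterable of (gene_a, gene_b) pairs already judged similar
        (the similarity test itself is outside this sketch).
    orthologs: iterable of (human_gene, other_species_gene) pairs.
    Returns a sorted list of families, each a sorted list of human gene ids.
    """
    parent = {gene: gene for gene in human_genes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Step 1: single linkage, any similar pair joins two groups.
    for gene_a, gene_b in similar_pairs:
        union(gene_a, gene_b)

    # Step 2: human genes orthologous to the same other-species gene
    # must end up in the same family, so their groups are merged.
    first_human_for = {}
    for human_gene, other_gene in orthologs:
        if other_gene in first_human_for:
            union(first_human_for[other_gene], human_gene)
        else:
            first_human_for[other_gene] = human_gene

    families = defaultdict(set)
    for gene in parent:
        families[find(gene)].add(gene)
    return sorted(sorted(members) for members in families.values())


# Illustrative data: B and C share the ortholog m1, so C joins the A/B group.
example_families = build_gene_families(
    human_genes=["A", "B", "C", "D", "E"],
    similar_pairs=[("A", "B")],
    orthologs=[("B", "m1"), ("C", "m1"), ("D", "m2")],
)
```

Because unions are transitive, a single pass over all orthologous relations gives the same result as repeating the merge until no groups change.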
For each orthologous family, a multiple alignment of the human and other-species amino acid sequences was constructed with ClustalW [Thompson et al. 1994] as the basis for a phylogenetic tree. Non-alignable sequences were removed from the alignments [Endo et al. 2002] to obtain more rigorous alignments. Phylogenetic trees were then constructed by the neighbor-joining (NJ) method [Saitou et al. 1987].
Using the phylogenetic trees, the orthologs detected by computational analysis were re-analyzed. Orthologs for which the gene phylogeny agreed with the species phylogeny were classified as "ortholog determined by manual curation". The manual curation procedures (bootstrap value 900/1000, assumption of trichotomy of primates-rodents-Laurasiatherians, etc.) were initially determined at the All Human Genes Evolutionary Annotation Meeting (2006) [Matsuya et al. 2008]. Although the trees were initially examined by researchers in the fields of biology, the manual curation is currently automated, for efficiency, by in-house tools that incorporate these procedures. If a family lacked sufficient phylogenetic information (not enough transcripts, no outgroup species, etc.), its orthologs remained "ortholog detected by computational analysis".
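The status assignment described above can be summarized as a small decision rule. The sketch below rests on assumptions that the text does not state explicitly: the value 900/1000 is read as a minimum bootstrap support, agreement between gene tree and species tree is supplied as a precomputed flag, and orthologs that fail the check are assumed to keep the computational status. The function name and parameters are hypothetical.

```python
BOOTSTRAP_REPLICATES = 1000
BOOTSTRAP_MIN = 900  # assumption: "900/1000" is treated as a minimum support


def annotation_status(has_tree, bootstrap=0, agrees_with_species_tree=False):
    """Return the annotation status of one computationally detected ortholog.

    has_tree: False when the family lacks phylogenetic information
        (not enough transcripts, no outgroup species, etc.).
    bootstrap: support for the relevant node, out of BOOTSTRAP_REPLICATES.
    agrees_with_species_tree: True when the gene phylogeny matches the
        species phylogeny (the comparison itself is outside this sketch).
    """
    if has_tree and bootstrap >= BOOTSTRAP_MIN and agrees_with_species_tree:
        return "Manual curation"
    # Assumption: everything else keeps the status from the first step.
    return "Computational analysis"


# Illustrative calls covering the three situations named in the text.
status_supported = annotation_status(True, bootstrap=950, agrees_with_species_tree=True)
status_weak = annotation_status(True, bootstrap=700, agrees_with_species_tree=True)
status_no_tree = annotation_status(False)
```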
Consequently, each ortholog was assigned one of two annotation statuses, yielding a comprehensive set of orthologs (Computational analysis + Manual curation) and a more reliable subset (Manual curation, supported by phylogenetic trees).