2.1 Annotation policies: Mapping and clustering

2.1.1 Annotation of mapped clusters

The mapping and clustering pipeline is summarized below.

[A]The human transcripts used for gene clustering consist of full-length cDNAs, other mRNAs, RefSeq (NM, NR), EnsEMBL (ENST), and eHIT/pHIT gene models constructed in the H-InvDB (link). The freeze date was May 9, 2007.

[B]The repeat library used was downloaded from Repbase (RepBase Update 20061006 and RM database version 20061006). For repeat masking of the transcripts, RepeatMasker version open-3.1.6 Beta was used. The options used were "-nolow -xsmall".
[C]Repetitive regions of the human genome were masked by UCSC. "Chr6_hla_hap2" of build 35 was also used in the mapping procedure. To take polymorphism into account, we also used an SNP-replaced human genome for transcript mapping. The SNP-replaced genome was constructed by replacing the nucleotide at each validated SNP site with the other allele. Data from dbSNP build 127 were used for the construction.
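
The construction of an SNP-replaced genome can be illustrated with a minimal sketch. The function name, the 0-based coordinates, and the assumption that each SNP is biallelic and given on the forward strand are illustrative choices, not details taken from the H-InvDB pipeline.

```python
def snp_replaced(reference, snps):
    """Return a copy of `reference` with each SNP site set to the other allele.

    `snps` is an iterable of (position, ref_allele, alt_allele) tuples with
    0-based positions (an assumption made for this sketch).
    """
    seq = list(reference)
    for pos, ref, alt in snps:
        # Guard against coordinate or strand mix-ups before substituting.
        if seq[pos].upper() != ref.upper():
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos] = alt
    return "".join(seq)


replaced = snp_replaced("ACGTACGT", [(2, "G", "A"), (7, "T", "C")])
```

Transcripts are then mapped against both the original and the replaced sequence, so that a transcript carrying the non-reference allele does not lose identity at the SNP site.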

[D]Trimming of adaptor and cloning vector origin sequences in the transcripts.

[E]3'-end poly(A) and 5'-end poly(T) sequences are also masked. They are detected as follows:

For detecting 3'-end poly(A):

    1.Search for ten consecutive adenines in a transcript.
    2.If the content of adenine in the region from the first position of the consecutive ten adenines to the last position of the sequence was over 90%, the region was defined as a poly(A) tail.

For detecting 5'-end poly(T) (in order to find the poly(A) tail of a reverse complementary sequence):

    1.Reverse the transcript sequence, i.e. read the sequence in the reverse direction (do not take the complement).
    2.Search for ten consecutive thymines in the reversed sequence.
    3.If the content of thymine in the region from the first position of the ten consecutive thymines to the last position of the reversed sequence was over 90%, the region was defined as a 5'-end poly(T).
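
The two detection procedures above can be sketched in Python as follows. The function names are hypothetical, and trying each run of ten bases in turn from the left is an assumption, since the text does not say what happens when the first run fails the 90% test.

```python
def tail_start(seq, base, run=10, min_fraction=0.9):
    """Return the 0-based start of a tail rich in `base`, or None.

    A tail starts at a run of `run` consecutive `base` characters and extends
    to the end of `seq`; it qualifies if more than `min_fraction` of it is `base`.
    """
    seq = seq.upper()
    pos = seq.find(base * run)
    while pos != -1:
        tail = seq[pos:]
        if tail.count(base) / len(tail) > min_fraction:
            return pos
        pos = seq.find(base * run, pos + 1)
    return None


def poly_a_tail(seq):
    """3'-end poly(A) as a half-open interval (start, end), or None."""
    start = tail_start(seq, "A")
    return None if start is None else (start, len(seq))


def poly_t_head(seq):
    """5'-end poly(T) as a half-open interval (0, end), or None.

    The sequence is reversed (not complemented) so the same tail search applies.
    """
    start = tail_start(seq[::-1], "T")
    return None if start is None else (0, len(seq) - start)
```

Reusing one search routine for both ends mirrors the description in the text, where the poly(T) case is reduced to the poly(A) case by reading the sequence backwards.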

For EnsEMBL transcripts, this procedure is skipped because EnsEMBL transcripts consist of exon sequences taken from the genome.

[F]The sequences will be used in further analysis including annotation.

[G]The gene loci of the transcripts were roughly estimated by BLASTN (ver. 2.2.16) and BLAT (blatSrc34).
The BLASTN options used were as follows:

    -F 'm D' -U T -e 0.01
    The BLAT option used allowed insertion of a long intron (1.1 Mbp at maximum).
[H]These sequences were not used in annotation.

[I]To detect exon sequences, est2genome (EMBOSS-4.1.0) was utilized.
The options used were default and modified parameters as follows:
    Gap open penalty = 4 (default: 2)
    Gap extension penalty = 4 (default: 2)
    Mismatch penalty = 3 (default: 1)
    Splice site penalty = 12 (default: 20)
    Intron penalty = 24 (default: 40)
    Minscore = 10
    Space = 512

    In several trials, these parameters proved effective at preventing over-fitting to canonical splice sites and at locating short (13 bp) terminal exons. In addition, we set the non-canonical splice site penalty to 20 (default: 35) to obtain correct alignments at non-canonical GC-AG and AT-AC splice sites. We treated short introns (≤ 30 bp) as "gaps".
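
Treating short introns as gaps amounts to merging neighboring exon blocks. The following is a minimal sketch under assumed conventions: exon blocks are sorted (start, end) pairs in 1-based inclusive genome coordinates, and the function name is hypothetical.

```python
def merge_short_introns(exons, max_gap=30):
    """Merge adjacent exon blocks separated by an intron of at most `max_gap` bp.

    `exons` is a list of (start, end) pairs, sorted by start, 1-based inclusive.
    """
    merged = []
    for start, end in exons:
        if merged:
            prev_start, prev_end = merged[-1]
            intron_len = start - prev_end - 1
            if intron_len <= max_gap:
                # The short intron is regarded as an alignment gap.
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged


blocks = merge_short_introns([(100, 200), (221, 300), (1000, 1100)])
```

In this example the 20 bp intron between the first two blocks is absorbed, while the 699 bp intron is kept.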
[J]Alignments were filtered with thresholds of %ID ≥ 95 and %Cov. ≥ 90 to remove rough alignments. Adaptor and vector sequences and the 3'-end poly(A) or 5'-end poly(T) region in the alignment were excluded from the calculation of %Cov.
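
As a rough illustration of this filter, the sketch below computes %Cov. over the transcript length that remains after the masked regions are excluded. The exact formula used in the pipeline is not given in the text, so the denominator and the function names here are assumptions.

```python
def percent_coverage(aligned_len, transcript_len, masked_len=0):
    """%Cov. of the aligned region, ignoring masked bases (assumed formula).

    `masked_len` counts adaptor, vector, poly(A), and poly(T) bases.
    """
    effective_len = transcript_len - masked_len
    if effective_len <= 0:
        return 0.0
    return 100.0 * aligned_len / effective_len


def passes_filter(pct_id, pct_cov, min_id=95.0, min_cov=90.0):
    """Keep an alignment only if both thresholds are met."""
    return pct_id >= min_id and pct_cov >= min_cov


cov = percent_coverage(aligned_len=900, transcript_len=1050, masked_len=50)
keep = passes_filter(pct_id=98.2, pct_cov=cov)
```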

[K]Mapped loci. At this stage, some transcripts are mapped onto multiple locations. This can happen when the genome contains multiple copies of a sequence, paralogs, pseudogenes, statistical coincidences, or artifactual assembly duplications, or when the query itself contains repeats.

[L]Best locus selection (1). The best locus was selected as follows.
    1.The candidate locus with the highest %ID in the pairwise alignment between the transcript and genome sequences was selected from all the candidate loci of a mapped transcript.
    2.The candidate locus with the highest %Cov. of the aligned region relative to the full length of the transcript was then selected.
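
The two-step selection can be sketched as below. Applying %Cov. only among the loci tied for the highest %ID, and keeping all loci that remain tied, are assumptions of this sketch; the dictionary keys are hypothetical.

```python
def select_best_loci_1(candidates):
    """Return the candidate loci with the highest %ID, then the highest %Cov.

    Each candidate is a dict with 'pct_id' and 'pct_cov' keys. Loci that are
    still tied after both steps are all returned.
    """
    if not candidates:
        return []
    best_id = max(c["pct_id"] for c in candidates)
    top = [c for c in candidates if c["pct_id"] == best_id]
    best_cov = max(c["pct_cov"] for c in top)
    return [c for c in top if c["pct_cov"] == best_cov]


loci = [
    {"locus": "a", "pct_id": 99.0, "pct_cov": 95.0},
    {"locus": "b", "pct_id": 99.0, "pct_cov": 98.0},
    {"locus": "c", "pct_id": 97.5, "pct_cov": 100.0},
]
best = select_best_loci_1(loci)
```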
[M]Correction of the mapped strand is necessary because some transcripts have been registered in DNA databanks as complementary strands, and these sequences might be mapped on the opposite strand of the genome. We determined the correct mapped strand of each transcript using the following criteria:
    1.CDS on complementary sequence
    2.CT-AC splice site motif
    3.Existence of 5'-end poly(T)
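
Criterion 2 follows from the fact that a canonical GT-AG intron, read on the opposite strand, appears as CT-AC. A small sketch makes this concrete; the helper names are hypothetical, and how the three criteria are combined is not specified in the text and is not modeled here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def splice_motif(intron):
    """Return the donor-acceptor dinucleotide pair of an intron, e.g. 'GT-AG'."""
    return f"{intron[:2]}-{intron[-2:]}".upper()


intron = "GTAAGTCCCTTTCAG"                           # canonical GT-AG intron
forward = splice_motif(intron)                       # motif on the sense strand
flipped = splice_motif(reverse_complement(intron))   # motif seen on the opposite strand
```

An alignment whose introns show the flipped motif is therefore a candidate for strand correction.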
[N]Best locus selection (2) was processed as follows.
    1.Candidate loci with short introns (<70 bp) were discarded. If all the candidate loci of a mapped transcript contained such a short intron, all candidate loci were retained.
    2.The candidate locus with the smallest number of mismatches in the alignment within 10 bp of the splice boundaries was selected as the best locus.
    3.If the ratio of canonical splice sites (GT-AG) to the number of introns in the candidate locus was under 50%, the candidate locus was discarded.
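
The three rules above can be sketched as successive filters. The record fields are hypothetical, the rules are applied in the listed order, and retaining intronless loci in step 3 is an assumption, since the text does not cover that case.

```python
def select_best_loci_2(candidates, min_intron=70, min_canonical=0.5):
    """Apply the three selection rules to the candidate loci of one transcript.

    Each candidate is a dict with 'intron_lengths' (list of int),
    'boundary_mismatches' (int), and 'canonical_introns' (int).
    """
    if not candidates:
        return []
    # Rule 1: drop loci with a short intron, unless every locus has one.
    no_short = [c for c in candidates
                if all(length >= min_intron for length in c["intron_lengths"])]
    pool = no_short or list(candidates)
    # Rule 2: keep the loci with the fewest mismatches near splice boundaries.
    fewest = min(c["boundary_mismatches"] for c in pool)
    pool = [c for c in pool if c["boundary_mismatches"] == fewest]
    # Rule 3: discard loci in which fewer than half of the introns are GT-AG.
    kept = []
    for c in pool:
        n_introns = len(c["intron_lengths"])
        if n_introns == 0 or c["canonical_introns"] / n_introns >= min_canonical:
            kept.append(c)
    return kept
```

More than one locus can survive these filters, which is the situation addressed by the suffixed identifiers described next.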

    If multiple loci were still retained for a mapped transcript after best locus selection, we appended a two-digit number to the HIT-ID of each mapped locus to give distinct identifiers (Example: HIT00000001_01).
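
The identifier format can be reproduced with zero-padded formatting. Sequential numbering starting at 01 is inferred from the example and is an assumption; the function name is hypothetical.

```python
def locus_ids(hit_id, n_loci):
    """Return HIT-IDs with a two-digit locus suffix, e.g. HIT00000001_01."""
    return [f"{hit_id}_{i:02d}" for i in range(1, n_loci + 1)]


ids = locus_ids("HIT00000001", 2)
```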
[O]All the mapped loci retained after best locus selection (2) were used for clustering. "eHIT" and "pHIT" gene models were also included in the clustering. For explanations of "eHIT" and "pHIT", please refer to the respective sections (eHIT: section 2.0.3, pHIT: section 2.0.4). Same-position clustering was performed to create genetic loci. If exon sequences overlapped by one or more nucleotides, the mapped transcripts were clustered into a single locus.
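
Same-position clustering can be sketched with a sweep over sorted exons and a union-find structure. Grouping by chromosome and strand, the 1-based inclusive coordinates, and all names below are assumptions made for illustration.

```python
from collections import defaultdict


def cluster_by_exon_overlap(transcripts):
    """Cluster transcripts whose exons share at least one nucleotide.

    `transcripts` maps an ID to (chrom, strand, [(start, end), ...]).
    Returns a sorted list of sorted ID lists.
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    exons = defaultdict(list)
    for tid, (chrom, strand, blocks) in transcripts.items():
        for start, end in blocks:
            exons[(chrom, strand)].append((start, end, tid))

    for blocks in exons.values():
        blocks.sort()
        cur_end, cur_tid = None, None
        for start, end, tid in blocks:
            if cur_end is not None and start <= cur_end:
                parent[find(tid)] = find(cur_tid)  # overlapping exons: same locus
                cur_end = max(cur_end, end)
            else:
                cur_end, cur_tid = end, tid

    clusters = defaultdict(list)
    for tid in transcripts:
        clusters[find(tid)].append(tid)
    return sorted(sorted(members) for members in clusters.values())
```

A transcript lying entirely within an intron of another transcript stays in its own cluster, because only exon overlap links two transcripts.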

[P]Cluster ID (HIX) assignment.

[Q]Single-linkage clustering based on sequence similarity was performed to obtain gene clusters of unmapped transcripts. Prior to the clustering, the quality of the sequences was checked, and the transcripts listed below were not used for the similarity search.
    1.Transcripts that were mapped on two separate genomic loci. These are suspected to be chimeric or trans-spliced transcripts, or to be transcribed from a rearranged genome sequence. A misassembled genome sequence is another possible cause of the separate mapped regions. The aim of this operation is to avoid clustering errors.
    2.Transcripts that were partially mapped onto a sequenced locus and that overlapped with other mapped transcripts on the same strand. The aim is to avoid counting the same gene on the genome multiple times.

    The sequence similarity between unmapped transcripts was evaluated by the E-value of a BLASTN search. When the E-value was calculated to be 0, the sequences were clustered together.
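
Single-linkage clustering on E-value 0 hits is equivalent to finding connected components in a graph whose edges are those hits. The sketch below assumes the BLASTN results have already been reduced to (query, subject, E-value) triples; that input format and the function name are illustrative.

```python
from collections import defaultdict, deque


def single_linkage(ids, hits, max_evalue=0.0):
    """Group transcript IDs connected by hits with E-value <= `max_evalue`.

    `hits` is an iterable of (query_id, subject_id, evalue) triples.
    Hits involving IDs outside `ids` are ignored.
    """
    known = set(ids)
    graph = defaultdict(set)
    for query, subject, evalue in hits:
        if query != subject and evalue <= max_evalue and query in known and subject in known:
            graph[query].add(subject)
            graph[subject].add(query)

    seen, clusters = set(), []
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:  # breadth-first walk over one connected component
            node = queue.popleft()
            members.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(members))
    return clusters
```

With single linkage, two transcripts end up in the same cluster even without a direct hit between them, as long as a chain of E-value 0 hits connects them.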
[R]Unmapped clusters. Considering the sequence quality and the characteristics of the coding genes, the clusters of unmapped transcripts listed below, which show only weak evidence of being genes or which should be merged with another mapped locus, were excluded from H-InvDB.
    1.Unmapped clusters including transcripts of Ig, MHC and TCR gene.
    2.Unmapped clusters including transcripts that were unexpectedly mapped on the mouse genome.
    3.Unmapped clusters including representative transcripts that were mapped on none of Celera's human, chimpanzee, macaque, or mouse genome assemblies.
    4.Unmapped clusters including representative transcripts whose CDSs were not identical to known human proteins.
    5.Single member transcripts without other transcripts' support.

2.1.2 Annotation of UM clusters

Overview of the UM annotation

Unmapped (UM) transcripts, which cannot be mapped on the reference genome (UCSC hg18 assembly) at the thresholds of %ID ≥ 95 and %Cov. ≥ 90, consist of artifactual errors, partially mapped sequences, and transcripts from unsequenced genomic regions. After removing unreliable or partially mapped sequences, we defined UM genes as those transcribed from unsequenced genomic regions. We also annotated chimeric transcripts, which may include trans-splicing candidates or fusion transcripts transcribed from a rearranged genome.

Direct access to the UM annotation page

Users can access the top search page of the UM annotation at the following URL.
http://www.h-invitational.jp/hinv/topic_annotation/um.cgi

Annotation pipeline

The annotation pipeline is summarized in Figure 2.2.

Fig.2.2 Annotation pipeline for the UM transcripts. The number of UM transcripts remaining after each annotation step is shown at the bottom left of the step.

    Points of visual check
    1. Revision of immune-related gene annotation by visual inspection of the definition of the transcript and its mapped positions.
    2. Revision of contaminations by visual inspection of phylogenetic trees, which were constructed by submitting the transcript sequence to all databases (nr, WGS, EST, etc.) in NCBI BLAST. We checked the consistency between the gene tree and the species tree.
    3. Manual annotation of experimental support.

Annotation of chimera transcripts

If fragments of a transcript were mapped separately on two distant regions of the genome and the transcript was judged to be unmapped because of insufficient sequence coverage at each locus, the transcript was analyzed further to detect chimeric transcripts, based on the method established by Hahn et al. (PNAS 2004, 101; 13257). First, the unmapped transcripts were aligned to the genome again using BLAT. Second, if the %ID and %Cov. of the combined alignments of the aligned regions both exceeded 80%, the pair of aligned genomic regions, each with additional 1 kb margins at both termini, was concatenated to make an artificial genome sequence. The unmapped transcripts were then aligned to the corresponding concatenated genome sequences by est2genome, and rough alignments were filtered out using the criteria of %ID < 97 and %Cov. < 95. The remaining transcripts were defined as chimeric transcripts.

Partially mapped transcripts

If part of a transcript sequence was aligned to the genome sequence with high nucleotide identity (%ID) and the aligned position overlapped an exon of another correctly mapped transcript, the transcript was defined as a "partially mapped transcript" and discarded from the candidates for unmapped genes. The mapping criteria used were ≥ 97% for %ID and ≥ 35% for %Cov. These partially mapped transcripts are thought to arise from polymorphism, experimental artifacts, or somatic mutation. We provide information on the overlaps between partially mapped transcripts and mapped genes.
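
For the chimeric transcript detection described earlier in this section, the construction of the artificial genome sequence can be sketched as follows. The in-memory genome dictionary, the 0-based half-open coordinates, and the function name are assumptions for illustration, and strand orientation of the two regions is not handled.

```python
def concat_regions(genome, region_a, region_b, margin=1000):
    """Concatenate two aligned genomic regions, each padded by `margin` bp.

    `genome` maps a chromosome name to its sequence; each region is a
    (chrom, start, end) tuple. Margins are clipped at chromosome ends.
    """
    parts = []
    for chrom, start, end in (region_a, region_b):
        seq = genome[chrom]
        lo = max(0, start - margin)
        hi = min(len(seq), end + margin)
        parts.append(seq[lo:hi])
    return "".join(parts)


toy_genome = {"chr1": "A" * 5000, "chr2": "C" * 3000}
artificial = concat_regions(toy_genome, ("chr1", 2000, 2500), ("chr2", 500, 900))
```

The transcript is then realigned against the concatenated sequence, so that both fragments can be covered by a single alignment.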

Revised: December 18, 2008