We predicted the coding sequence (CDS) of each H-Invitational transcript by a computational approach. After removing redundant cDNAs as described previously, we classified the loci as protein-coding or non-protein-coding; a subset of the protein-coding loci was annotated as pseudogene candidates (see 2.3.4).
Because the protein products of alternative splicing isoforms are expected to be quite similar in structure and function, we selected one 'representative transcript' for each locus. Together, the protein-coding loci define a set of human proteins, which we call 'H-Inv proteins'. All 'H-Inv proteins' were determined by careful human curation, followed by computational prediction.
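The selection of one representative transcript per locus can be sketched as follows. This is only an illustration: the text does not state the selection criterion, so the rule used here (longest predicted CDS, ties broken by transcript ID) is an assumption, and the names `Transcript` and `pick_representatives` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str  # hypothetical identifier
    locus_id: str       # locus the transcript was clustered into
    cds_length: int     # predicted CDS length in nucleotides (0 if none)


def _rank(t):
    # Assumed rule: longer CDS ranks first; ties go to the smaller ID
    # so that the result does not depend on input order.
    return (-t.cds_length, t.transcript_id)


def pick_representatives(transcripts):
    """Return {locus_id: Transcript} with one transcript kept per locus."""
    best = {}
    for t in transcripts:
        current = best.get(t.locus_id)
        if current is None or _rank(t) < _rank(current):
            best[t.locus_id] = t
    return best
```

Keeping the ranking rule in a separate function makes it easy to swap in whatever criterion the curation pipeline actually applies.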
After determining the 'H-Inv proteins', we assigned a standardized functional annotation to each of them. Based on the results of the similarity search and InterProScan, we assigned the most suitable 'data source ID' to each 'H-Inv protein'. According to the level of sequence similarity, we classified the 'H-Inv proteins' into seven categories, as illustrated in Fig 2.3.1.
The diagram illustrates the human curation pipeline used to classify H-Inv proteins into seven similarity categories: Category I, II, III, IV, V, VI and VII proteins.
H-InvDB transcribed pseudogene candidates were predicted in the following two steps:
[Step1] Filtering of functional protein-coding genes and detection of frameshift and nonsense mutations
Using the results of the functional annotation, we filtered out the functional protein-coding genes by targeting only representative Category II transcripts. We then identified the transcripts carrying a frameshift error or a nonsense mutation, based on their alignment with the target protein by FASTY.
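To make the notion of a nonsense mutation concrete, the sketch below scans a CDS for in-frame stop codons that occur before the final codon. It is a simplified stand-in, not the procedure described above: it does not run or parse FASTY, it assumes the standard genetic code and a CDS that starts in frame, and it cannot detect frameshifts, which require the alignment to the target protein. The function name `premature_stops` is hypothetical.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stop codons
    located before the final complete codon of the sequence."""
    seq = cds.upper().replace("U", "T")  # accept RNA or lower-case input
    n_codons = len(seq) // 3             # ignore a trailing partial codon
    return [
        i
        for i in range(n_codons - 1)     # the last codon may be the real stop
        if seq[3 * i:3 * i + 3] in STOP_CODONS
    ]
```

A non-empty result flags the transcript as carrying a candidate nonsense mutation under these assumptions.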
[Step2] Prediction of transcribed pseudogene candidates based on support vector machine (SVM)
We applied a support vector machine (SVM) method to predict transcribed pseudogene candidates using the selected parameters.
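As an illustration of how an SVM separates two classes, the following is a minimal linear SVM trained by full-batch subgradient descent on the regularised hinge loss. It is a sketch under stated assumptions, not the classifier used for H-InvDB: the text does not specify the kernel, the features, or the training data, so the two standardized features and the toy values below are hypothetical, as are the function names.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on
    lam/2 * |w|^2 + mean(max(0, 1 - y_i * (w.x_i + b))), with y_i in {-1, +1}."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:               # point violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (pseudogene candidate) or -1 (functional) for one sample."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, already standardized features per transcript:
# [disruption score, similarity to the target protein].
TOY_X = [[1.0, -1.0], [1.5, -0.5], [0.8, -1.2],
         [-1.0, 1.0], [-1.2, 0.8], [-0.7, 1.1]]
TOY_Y = [1, 1, 1, -1, -1, -1]  # +1 = pseudogene candidate (assumed labels)

w, b = train_linear_svm(TOY_X, TOY_Y)
```

In practice an established SVM library with a tuned kernel would replace this hand-written trainer; the sketch only shows the decision rule and the margin-based training objective.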