*--------------------------------------------------------------------------------------------------*
Release information of H-InvDB_0.0
http://www.h-invitational.jp
Released on 14th February, 2004
*--------------------------------------------------------------------------------------------------*

---------------------------
Datasets
---------------------------
1. human full-length cDNA data set
   The dataset contains sequences produced by six institutes.
   All the sequences are already in DDBJ/EMBL/GenBank.
   cDNA dataset fixed on 15th July, 2002

2. human genome dataset
   Human genome assembly was provided by NCBI (build 30).
   Human genome dataset fixed on 15th July, 2002

---------------------------
Databases
---------------------------
1. RefSeq mRNA
   RefSeq curated mRNAs were obtained from NCBI on 15th July, 2002.

2. Ensembl gene transcripts
   Ensembl gene transcripts were downloaded from Ensembl (v.7.29a.2).
   This data set was predicted by the Ensembl project based on the NCBI build 29
   assembly of the human genome.
   The data were used for the cDNA-based clustering.

3. non-redundant SWISS-PROT/TrEMBL/TrEMBLnew
   The non-redundant SWISS-PROT/TrEMBL/TrEMBLnew set (external version) was kindly
   provided by EBI for the H-Invitational annotation jamboree.
   The following redundancy-removal steps were applied:
   - All TrEMBLnew entries 100% identical to SWISS-PROT and TrEMBL entries were removed.
   - All TrEMBLnew entries which are updates of SWISS-PROT and TrEMBL entries were removed.
   - All TrEMBLnew entries which are 100% fragment matches of SWISS-PROT and TrEMBL
     entries were removed.
   - For human and mouse entries, the non-redundant set was created from SWISS-PROT,
     TrEMBL and TrEMBLnew, all with cDNA/mRNA supporting evidence.
     Additional splice variants are also included.
   These data sets were used for similarity searches (fasty and blastx).
   Protein dataset fixed on 15th July, 2002

4. HUGO approved gene symbol (GENEW)
   http://www.gene.ucl.ac.uk/nomenclature/
   Human gene name data fixed on 15th July, 2002.
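
Note on the redundancy steps for database 3: the removal of identical entries and of fragment matches can be illustrated with a minimal Python sketch. This is not the procedure EBI used. The function name filter_new_entries, the accession-to-sequence dictionaries, the toy sequences, and the reading of "100% fragment match" as "exact substring of a reference sequence" are all assumptions made for illustration. The removal of entries that are updates of existing entries depends on entry metadata and is not modelled here.

```python
def filter_new_entries(reference, new_entries):
    """Drop new entries that are redundant with respect to a reference set.

    Both arguments map accession -> protein sequence (a hypothetical layout,
    not the actual SWISS-PROT/TrEMBL flat-file format).
    """
    ref_seqs = set(reference.values())
    kept = {}
    for accession, seq in new_entries.items():
        # Step 1: entry is 100% identical to a reference sequence.
        if seq in ref_seqs:
            continue
        # Step 3 (assumed meaning): entry is an exact fragment (substring)
        # of some reference sequence.
        if any(seq in ref for ref in ref_seqs):
            continue
        kept[accession] = seq
    return kept


# Toy data, invented for this example only.
reference = {"REF1": "MKTAYIAKQR", "REF2": "MSLLTEVETP"}
new_entries = {
    "NEW1": "MKTAYIAKQR",  # identical to REF1
    "NEW2": "TAYIAK",      # fragment of REF1
    "NEW3": "MGDVEKGKKI",  # novel
}

print(filter_new_entries(reference, new_entries))
```

The substring test alone would also catch identical entries; the two checks are kept separate only to mirror the listed steps.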
---------------------------
Repeats in cDNAs
---------------------------
Repetitive sequences in cDNAs were masked using "RepeatMasker".
The "xsmall" option was used to mask repetitive sequences in lowercase, and the
"nolow" option was used to leave low complexity sequences unmasked.
The versions of the programs and databases used are as follows:
   RepeatMasker version 07/07/2001
   run with cross_match version 0.990329
   RepBase Update 7.5, vs 06052002.

*--------------------------------------------------------------------------------------------------*
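
Note on lowercase masking: because repeats are written in lowercase rather than replaced, downstream tools can recover the masked regions directly from the sequence. The following is a minimal Python sketch of that idea. The function names lowercase_intervals and masked_fraction and the example sequence are hypothetical and are not part of RepeatMasker or H-InvDB.

```python
import re


def lowercase_intervals(seq):
    """Return 0-based, half-open (start, end) intervals of lowercase runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


def masked_fraction(seq):
    """Return the fraction of the sequence that is lowercase (soft-masked)."""
    if not seq:
        return 0.0
    masked = sum(end - start for start, end in lowercase_intervals(seq))
    return masked / len(seq)


# Invented example: two lowercase (masked) runs in a 24-base sequence.
example = "ACGTacgtacgtTTGGCCaaNNAC"
print(lowercase_intervals(example))
print(masked_fraction(example))
```

Uppercase N characters are treated as unmasked here, since only case is inspected.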