The aim of this study is as follows.
Our purpose is to extract new disease-susceptibility genes using
H-InvDB.
Our concept for extracting new candidate genes is that new
candidates have characteristics similar to those of reported disease-related genes.
Similar characteristics means not only sequence homology or domain structure
but also a shared metabolic pathway or expression pattern.
H-InvDB has the advantage of providing
this information for each gene.
So, based on this concept, we constructed the Priority Analysis for Disease
Association (PANDA) system. In our method, we used 7 kinds of
annotated information.
Paralogous genes that result from a duplication anterior to the divergence
of vertebrates have high degrees of similarity in their sequence and may
share some redundancy in function.
Genes associated with a disease may be identified through their functional
similarities to known disease genes or their localization along the same
physiological pathway as known disease genes.
So, we compared the genes with one another based on
InterPro,
EC numbers,
KEGG pathways,
and GO terms.
Next, we converted this information into scores using a set of formulas.
This gives 7 kinds of scores for each gene.
Finally, we tried to select new disease-susceptibility candidates by
discriminant analysis.
First, we selected disease-related genes (called known genes) using
OMIM and
LocusLink (LL).
We queried OMIM and LL for a given disease, and
they returned the list of genes related to the
queried disease. After that, we read the
PubMed
abstracts on the relationship between each gene and the disease.
For example, when the relationship between a gene and the disease was reported
only in mouse experiments, we removed the gene from the list.
In this way, we selected the "reported disease-related gene set" from
H-InvDB.
We call these genes known genes,
and the rest of the genes are called "Others".
This is the structure of the
PANDA
system.
First, we selected the known genes. After that, we converted the gene information into
scores using the formulas.
We also calculated frequency scores for the target disease. We used these
scores as the training set for machine learning.
After that, we scored the unknown genes.
Finally, we tried to prioritize the Others by discriminant analysis
and select new candidates.
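The talk does not give the scoring formulas, so the following is only a hypothetical reading of a "frequency score" for one annotation type: the mean, over a gene's terms, of the fraction of known genes that carry each term. The function names, the averaging rule, and the example GO identifiers are all assumptions made for illustration, not the PANDA formulas.

```python
from collections import Counter


def term_frequencies(known_annotations):
    """Fraction of known genes that carry each annotation term."""
    counts = Counter(
        term for terms in known_annotations for term in set(terms)
    )
    n = len(known_annotations)
    return {term: c / n for term, c in counts.items()}


def frequency_score(gene_terms, freqs):
    """Mean frequency of the gene's terms among known genes (assumed rule).

    Returns 0.0 for a gene with no terms of this annotation type.
    """
    terms = set(gene_terms)
    if not terms:
        return 0.0
    return sum(freqs.get(t, 0.0) for t in terms) / len(terms)


# Illustrative annotations for three known genes (placeholder data).
known_go = [{"GO:0006954", "GO:0005125"}, {"GO:0006954"}, {"GO:0007165"}]
freqs = term_frequencies(known_go)

# Score one unknown gene against the known-gene frequencies.
score = frequency_score({"GO:0006954", "GO:0007165"}, freqs)
```

Repeating this for each annotation type would give one score per type, which is the shape of input the discriminant step expects.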
We used the Mahalanobis distance
for the discriminant analysis.
As mentioned, we get 7 kinds of scores for each gene.
We calculated the Mahalanobis distance in 7 dimensions with equal weighting.
First, we calculated the centers of gravity (C.G.) of the known genes and
the unknown genes.
Second, we calculated the distance of Gene X from the C.G. of the
known genes (MD1) and from the C.G. of the unknown genes (MD2).
Finally, we compared MD1 and MD2 by calculating MD2-MD1.
When MD2-MD1 is greater than 0, Gene X is nearer to the known genes than to the unknown genes,
and we considered such a gene a candidate.
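The MD2-MD1 rule can be sketched as follows. This is a minimal illustration with NumPy, assuming that each group's own covariance matrix is used for its distance (the talk does not say whether the covariance is per-group or pooled) and using synthetic 7-dimensional scores in place of real data. The function names are hypothetical.

```python
import numpy as np


def mahalanobis(x, group):
    """Mahalanobis distance from point x to the centroid of `group`.

    `group` is an (n_genes, n_scores) array; its own covariance is used
    (an assumption, the source does not specify this).
    """
    center = group.mean(axis=0)
    cov = np.cov(group, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = x - center
    return float(np.sqrt(diff @ cov_inv @ diff))


def md_difference(x, known, unknown):
    """Return MD2 - MD1; a positive value means x is nearer to the known genes."""
    md1 = mahalanobis(x, known)    # distance to C.G. of known genes
    md2 = mahalanobis(x, unknown)  # distance to C.G. of unknown genes
    return md2 - md1


# Synthetic score vectors (placeholder data, 7 scores per gene).
rng = np.random.default_rng(0)
known = rng.normal(loc=2.0, scale=1.0, size=(50, 7))
unknown = rng.normal(loc=0.0, scale=1.0, size=(200, 7))

gene_x = np.full(7, 2.0)
diff_score = md_difference(gene_x, known, unknown)
is_candidate = diff_score > 0
```

Sorting the unknown genes by MD2-MD1 in descending order would give a priority ranking rather than only a yes/no call.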