International Mammalian Genome Society

18th International Mouse Genome Conference

17-22 October 2004, Seattle, USA


Morris QD1, Zhang W2, Robinson MD1, Frey BJ1, Hughes TR2

1 Electrical and Computer Engineering, University of Toronto, Toronto, Canada, 2 Banting and Best Department of Medical Research, University of Toronto, Toronto, Canada

Microarray expression data are useful for predicting gene functions, exploring gene regulation mechanisms, and uncovering novel aspects of physiology. We have generated a microarray expression data set encompassing 55 tissues analyzed on custom ink-jet microarrays that represent over 42,000 known and predicted genes in the mouse draft genome sequence. Here, we compare our data to the Novartis Gene Atlas (Su et al., 2004) and the Riken Transcriptome Analysis (Bono et al., 2003).

To allow uniform comparison, we first mapped the array sequences from all three data sets to 34,343 MGI-curated genes. Genes in any one data set were retained only if a single MGI gene was detected unambiguously. We then compared the data sets on the basis of their ability to support the prediction of the known functional roles of genes shared by all three data sets. We used Support Vector Machines (SVMs) to “learn” the relationship between each gene’s expression profile in each of the data sets and that gene’s MGI Gene Ontology Biological Process (GO-BP) annotation(s). SVMs are typically used to predict new functional roles; here we use them to calculate the predictive value of each data set for known functions on the basis of precision (% of predicted annotations that were confirmed by the annotation database) and recall (% of confirmed annotations that were predicted). When restricting profiles to the 14 tissues shared by all three data sets, we found profiles drawn from our data set to be more predictive of GO-BP annotation than those of the other two. Using the entire expression profiles of the shared genes (20 tissues and/or cell lines in Riken, 122 in Novartis, 55 in our data), we found that the Novartis profiles generated more precise predictions at 10% recall, but at higher recall, predictions based on our data were more precise.
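To make the "precision at a given recall" comparison concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each gene has a classifier score for one GO-BP category and a binary label from the annotation database; the function name `precision_at_recall` and the toy inputs are hypothetical.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision of the top-ranked predictions once recall reaches target_recall.

    scores: one classifier score per gene (higher = more confident).
    labels: 1 if the gene carries the annotation in the database, else 0.
    Ties between scores are not treated specially in this sketch.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one annotated gene is required")

    # Rank genes from most to least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    for n_predicted, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positives
        if recall >= target_recall:
            # Precision = confirmed predictions / all predictions made so far.
            return true_positives / n_predicted
    return true_positives / len(ranked)


# Toy example (made-up values): six genes, three of them annotated.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 0]
low_recall_precision = precision_at_recall(scores, labels, 0.3)
full_recall_precision = precision_at_recall(scores, labels, 1.0)
```

Computing this value for each data set at the same recall level gives a like-for-like comparison of the kind described above.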

We are currently producing a combined “validated” data set by retaining only the gene pairs in each data set that show a significantly positive “correlation of correlations” (Parmigiani, 2003) with the same pair in one of the other two data sets. We expect this combined data set to be superior to all three of the original data sets, and we will distribute it to the community once it is completed.
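As an illustration of the “correlation of correlations” idea, here is a minimal sketch under one common reading of the term: for a given gene, compute its correlation with every other shared gene within each data set, then correlate those two correlation vectors across data sets. The abstract does not specify the exact pair-level filter or the significance test, so both are omitted here; the function names and toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_of_correlations(gene, study_a, study_b):
    """Agreement between two studies on how `gene` co-varies with the other shared genes.

    study_a, study_b: dicts mapping gene name -> expression profile (list of floats).
    The two studies may measure different tissues; only the gene names must match.
    """
    others = sorted(g for g in study_a if g != gene and g in study_b)
    # Within-study correlations of `gene` with each other shared gene.
    corr_a = [pearson(study_a[gene], study_a[g]) for g in others]
    corr_b = [pearson(study_b[gene], study_b[g]) for g in others]
    # Cross-study correlation of those two correlation vectors.
    return pearson(corr_a, corr_b)


# Toy profiles (made-up values) for four genes in two studies.
study_a = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 9], "g3": [4, 3, 2, 1], "g4": [1, 3, 2, 4]}
study_b = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [2, 4, 6, 9], "g4": [1, 3, 2, 4]}
same_structure = correlation_of_correlations("g1", study_a, study_a)
swapped_structure = correlation_of_correlations("g1", study_a, study_b)
```

A value near 1 indicates that the two data sets agree on a gene's co-expression pattern, while a value near 0 or below indicates disagreement; a real filter would also need a significance threshold.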
