The 14th International Mouse Genome Conference (2000)

A7. Estimating the Number of Mouse Genes and the Duplicated Regions within the Mouse Genome

Yasuhiko Wada 1, Tadashi Imanishi 2, and Takashi Gojobori 2
1Faculty of Agriculture, Saga University, Saga 840-8502, JAPAN;
2Center for Information Biology, Natl. Inst. Genet., Mishima, Shizuoka 411-8540, JAPAN

To elucidate the evolution of mammalian genomes, it is crucial to estimate the number of genes in the genome and to measure the degree of redundancy in the genomes of various species. The number of human protein-coding genes was recently estimated at 35,000-40,000, though this figure is still controversial. In addition, traces of ancient duplications of extensive chromosomal regions have been discovered within the human genome. In this study, we estimated the number of mouse genes using expressed sequence tags from a full-length cDNA library and a set of genes obtained by clustering mRNA sequences from GenBank. We also identified duplicated chromosomal regions within the mouse genome using map information derived from the Mouse Genome Database and numerous homologous gene pairs from GenBank.

To estimate the number of mouse genes, we adopted a method reported by Waterston et al. (1992) and Ewing and Green (2000). The method involves determining the overlap between two independently derived sets of gene sequences. The first set should contain full-length sequences for an unbiased sample of genes from the genome. The second set may contain sequences that are incomplete or redundant, provided they are accurate enough to reliably determine matches to genes in the first set. Under these assumptions, if the first set has n1 genes, the second set has n2 genes, and the number of sequences in the second set that are matched by the first is m2, the total number G of genes in the genome may be estimated as G = (n1 × n2) / m2.
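The estimator above has the same form as the two-sample capture-recapture (Lincoln-Petersen) estimate. A minimal Python sketch of the arithmetic follows; the function name and the example counts are illustrative assumptions and are not values from this study.

```python
def estimate_gene_count(n1: int, n2: int, m2: int) -> float:
    """Estimate the total gene number G = (n1 * n2) / m2.

    n1: genes in the first (full-length, unbiased) set
    n2: genes in the second set
    m2: members of the second set matched by the first set
    """
    if m2 <= 0:
        raise ValueError("m2 must be positive; no overlap gives no estimate")
    if m2 > n2:
        raise ValueError("m2 cannot exceed n2")
    return n1 * n2 / m2


# Hypothetical counts chosen only to show the arithmetic.
example = estimate_gene_count(n1=1000, n2=500, m2=50)
print(example)  # 10000.0
```

Because m2 is in the denominator, any change in how matches are counted shifts the estimate in the opposite direction.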

For the first set of gene sequences, we used a set of 3,752 genes obtained by clustering mRNA sequences from GenBank (r.118). For the second set, we used redundant expressed sequence tags (Riken-MEI 4.02) from a full-length cDNA library generated by the Genome Exploration Research Group of the RIKEN Tsukuba Life Science Center. For comparison of the two sequence sets, we accepted only matches in which the aligned region was at least 100 bases long and the sequences showed 95% or higher identity. According to our preliminary results, the total number of mouse genes was estimated at 75,327. However, the estimate depends heavily on the threshold for sequence matches; if we accept matches that show lower sequence identity, the estimated number becomes much smaller.
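The acceptance rule for matches can be sketched as a simple filter. The snippet below is only an illustration: the `Alignment` record, its field names, and the sample hits are hypothetical, and only the two thresholds (100 aligned bases, 95% identity) come from the text.

```python
from dataclasses import dataclass

MIN_ALIGNED_BASES = 100   # threshold stated in the text
MIN_IDENTITY_PCT = 95.0   # threshold stated in the text


@dataclass(frozen=True)
class Alignment:
    query_id: str        # hypothetical: EST identifier
    subject_id: str      # hypothetical: gene cluster identifier
    aligned_bases: int   # length of the aligned region
    identity_pct: float  # percent identity over the aligned region


def is_accepted(hit, min_bases=MIN_ALIGNED_BASES, min_identity=MIN_IDENTITY_PCT):
    """Return True if the alignment meets both thresholds."""
    return hit.aligned_bases >= min_bases and hit.identity_pct >= min_identity


def matched_queries(hits, min_bases=MIN_ALIGNED_BASES, min_identity=MIN_IDENTITY_PCT):
    """Return the set of query identifiers with at least one accepted match."""
    return {h.query_id for h in hits if is_accepted(h, min_bases, min_identity)}


# Made-up alignments, used only to show the effect of the identity threshold.
hits = [
    Alignment("est1", "g1", 250, 98.0),
    Alignment("est2", "g2", 120, 93.5),
    Alignment("est3", "g3", 80, 99.0),
]
strict = matched_queries(hits)                      # {'est1'}
relaxed = matched_queries(hits, min_identity=90.0)  # {'est1', 'est2'}
```

Relaxing the identity threshold admits more matches, which raises the match count and therefore lowers the estimated gene number, in line with the dependence noted above.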

To identify the duplicated regions within the mouse genome, we searched for homologous gene pairs among genomic sequences extracted from GenBank (r.118). We defined candidate duplicated regions as those having more than two homologous gene pairs located within 5 cM on each chromosome. We then conducted a statistical test that considered the distribution of homologous gene pairs and tandem repeats in the mouse genome. Twenty-seven pairs of duplicated regions were found in this study.
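One possible reading of the candidate definition is sketched below, assuming each homologous gene pair is given as a tuple of two chromosome labels and two map positions in cM. The input layout, the function name, and the sample data are hypothetical; the sketch reports overlapping windows separately and does not reproduce the statistical test or the handling of tandem repeats.

```python
from collections import defaultdict

WINDOW_CM = 5.0   # window size stated in the text
MIN_PAIRS = 3     # "more than two" homologous gene pairs


def candidate_regions(pairs, window=WINDOW_CM, min_pairs=MIN_PAIRS):
    """Return groups of homologous gene pairs clustered on both chromosomes.

    pairs: iterable of (chrom_a, cm_a, chrom_b, cm_b), one per gene pair
           (hypothetical input layout).
    """
    by_chroms = defaultdict(list)
    for chrom_a, cm_a, chrom_b, cm_b in pairs:
        by_chroms[(chrom_a, chrom_b)].append((cm_a, cm_b))

    found = []
    for chroms, points in by_chroms.items():
        points.sort()
        for i, (start_a, _) in enumerate(points):
            # Pairs whose first gene lies within the window on chromosome A.
            in_a = [p for p in points[i:] if p[0] - start_a <= window]
            in_a.sort(key=lambda p: p[1])
            for j, (_, start_b) in enumerate(in_a):
                # Of those, pairs whose second gene lies within the window on B.
                group = [p for p in in_a[j:] if p[1] - start_b <= window]
                if len(group) >= min_pairs:
                    found.append((chroms, group))
    return found


# Made-up map positions on placeholder chromosomes "A" to "D".
sample = [
    ("A", 10.0, "B", 40.0),
    ("A", 12.0, "B", 42.5),
    ("A", 13.5, "B", 41.0),
    ("C", 30.0, "D", 5.0),
]
result = candidate_regions(sample)
print(len(result))  # 1
```

In this example only the three pairs linking "A" and "B" fall within 5 cM on both chromosomes, so a single candidate region is reported.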

