International Mammalian Genome Society

18th International Mouse Genome Conference

17-22 October 2004, Seattle, USA


Church DM 1, Bailey JA 3, Eichler EE 2, Agarwala R 1

1 NIH/NCBI, Bethesda, United States, 2 University of Washington, Seattle, United States, 3 Case Western Reserve University Medical School, Cleveland, United States

The mouse genome is in the unique position of having been sequenced using two distinct strategies, with all of the data for both strategies publicly available. In late 2002, a Whole Genome Assembly (WGA), referred to as MGSCv3, was published. Work is ongoing to produce a high-quality finished genome using clone-based sequence. Currently, over 50% of the mouse genome is covered by high-quality finished sequence; the rest is covered by draft sequence and whole genome shotgun (WGS) contigs. Finished sequence is hand-curated and assembled into non-redundant contiguous sequences (contigs).

In an effort to provide access to the most current data, we have been performing genome assemblies using all available data (Whole Genome Shotgun contigs and BAC-based sequence). Performing these composite assemblies has given us some insight into the differences between clone-based assemblies and the WGA. Recently, we have performed several composite assemblies using different parameters on the same set of sequences. In addition to producing a better mouse genome assembly, we have been able to assess the errors that are likely to occur when a given data source is allowed to drive the assembly. Assemblies driven by clone-based data can have local order and orientation errors when a large amount of draft sequence is included. By contrast, assemblies driven by the WGS contigs have an increased rate of artificial duplication. This has led us to characterize the differences between WGA and draft clone-based sequence more systematically.

In addition, we have been able to assess the presence of segmental duplication within the mouse genome. Regions of segmental duplication cause assembly problems regardless of the assembly strategy adopted. Preliminary analysis of MGSCv3 and a small amount of finished data suggested that large (>10 kb) regions of segmental duplication are under-represented in the WGA. We are extending this analysis now that over half of the genome is in finished form. Earlier attempts to characterize the location and content of these duplications were limited by our inability to place them in the assembly. Improved mapping and assembly have increased our ability to assess these regions. The results of these assessments will be discussed.
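
As an illustration only, and not the method used in this work, the kind of tally implied by comparing large (>10 kb) duplications between assemblies could be sketched as follows. The hit format (start, end, fractional identity), the function names, and the 90% identity cutoff are all assumptions introduced for this example; only the 10 kb length threshold comes from the text above.

```python
from typing import Iterable, List, Tuple

# Hypothetical input: self-alignment hits on one sequence, given as
# (start, end, identity) with half-open coordinates [start, end).
Hit = Tuple[int, int, float]
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def duplicated_bases(
    hits: Iterable[Hit],
    min_length: int = 10_000,     # ">10 kb" threshold from the abstract
    min_identity: float = 0.90,   # assumed cutoff, not stated in the abstract
) -> int:
    """Count bases covered by hits longer than min_length at >= min_identity."""
    kept = [
        (start, end)
        for start, end, identity in hits
        if end - start > min_length and identity >= min_identity
    ]
    return sum(end - start for start, end in merge_intervals(kept))


def duplicated_fraction(hits: Iterable[Hit], assembly_length: int) -> float:
    """Fraction of the assembly covered by large duplications."""
    return duplicated_bases(hits) / assembly_length
```

Running the same tally on hits from the WGA and from finished sequence over the same region would give two fractions to compare; a lower value in the WGA would be consistent with the under-representation described above.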
