International Mammalian Genome Society

The 14th International Mouse Genome Conference (2000)

C12. Large-Scale Submission of Mouse EST Data and their Handling at the DNA Data Bank of Japan (DDBJ)

Satoru Miyazaki¹, Jun Mashima¹, Tadashi Mizunuma², Hideaki Sugawara¹, Yoshio Tateno¹ and Takashi Gojobori¹
¹ Center for Information Biology, National Institute of Genetics
² Hitachi Software Engineering Co., Ltd

Since April 1999, the DDBJ/EMBL/GenBank international nucleotide data banks have accepted more than 900,000 entries of mouse EST data from the Institute of Physical and Chemical Research (RIKEN). Most of these entries are included in the latest release of DDBJ and are freely accessible both online and by FTP, along with various services on the DDBJ web server. The first large-scale submission sent to DDBJ was composed of 175,734 entries. These were separated into 33 files that correspond to categories defined by developmental stage and organ of expression. A second submission (176,943 entries) comprised 61 categories of developmental stages, and the latest (561,684 entries) comprised 84 categories. This information is summarized in the DEFINITION line of each flat file. Together, these submissions contain almost five times as many entries as DDBJ accepted over the previous decade. To handle large-scale submissions efficiently, we have been developing a new file format and a set of tools, the Mass Submission System (MSM), an initiative begun in 1998. The main targets of this system are as follows:

1. Single DNA sequences of more than 50000 bp.

2. Submissions including more than 50 entries.

3. Single DNA sequences with more than 30 features.
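
The three targets above can be read as a simple eligibility check. The following is a minimal sketch in Java, not the MSM implementation: the class and method names are hypothetical, and it assumes that meeting any one of the three criteria is enough to route a submission to the mass submission path, which the abstract does not state explicitly.

```java
import java.util.List;

/** Hypothetical model of one submitted sequence; not DDBJ's actual data model. */
final class SequenceEntry {
    final int lengthBp;      // sequence length in base pairs
    final int featureCount;  // number of annotated features

    SequenceEntry(int lengthBp, int featureCount) {
        this.lengthBp = lengthBp;
        this.featureCount = featureCount;
    }
}

/** Illustrative check against the three targets listed in the abstract. */
final class MassSubmissionCriteria {
    // Thresholds taken from the list above; each is a strict "more than" bound.
    static final int LENGTH_THRESHOLD_BP = 50_000;
    static final int ENTRY_THRESHOLD = 50;
    static final int FEATURE_THRESHOLD = 30;

    /**
     * Returns true if the submission matches at least one target.
     * Assumption: the criteria are alternatives, not joint requirements.
     */
    static boolean isTarget(List<SequenceEntry> submission) {
        if (submission.size() > ENTRY_THRESHOLD) {
            return true; // target 2: more than 50 entries
        }
        for (SequenceEntry entry : submission) {
            if (entry.lengthBp > LENGTH_THRESHOLD_BP) {
                return true; // target 1: a single sequence longer than 50000 bp
            }
            if (entry.featureCount > FEATURE_THRESHOLD) {
                return true; // target 3: a single sequence with more than 30 features
            }
        }
        return false;
    }
}
```

Under this reading, a batch of short EST entries qualifies through its entry count alone, while a single long draft genomic sequence qualifies through its length.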

We used this system for submissions from human genome projects, and it has worked successfully for the submission of draft sequence entries. However, the number of mouse EST entries is 100 times bigger than that for the human genome, and the expected descriptions in the DEFINITION and COMMENT lines were quite different from those of human draft sequences. Thus, some improvements were required in our loading system for the DBMS in order to enhance the loading speed. We designed MSM using an object-oriented structure, so we can customize it and add new functions to its programs very easily. The main parts of the application were written in Java to provide an interface for the submitter to select various options. With these improvements, we used MSM for the submission of mouse ESTs. Currently, efforts are underway to sequence sets of full-length cDNAs for organisms including mouse and Arabidopsis. To handle these types of entries efficiently, DDBJ proposed the construction of a new division, and this proposal was accepted by the three collaborating DNA data banks. The definition of the new division (HTC, High Throughput cDNA) is as follows:

This division includes unfinished high throughput cDNA sequences, each of which has a 5' UTR and a 3' UTR at its ends and part of a coding region. The sequence may also include introns. When the sequence is later finished, it moves to the corresponding taxonomic division. The sequence is accompanied by a keyword, HTC, which is dropped when it moves to the taxonomic division.
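
The lifecycle in this definition (an entry starts in HTC carrying the HTC keyword, then moves to its taxonomic division and loses the keyword once finished) can be sketched as a small state change. This is an illustration only, assuming a hypothetical `CdnaEntry` class; the division code "ROD" (rodents) used in the comments is just an example value, and none of this reflects how DDBJ actually stores entries.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of a cDNA entry moving out of the HTC division. */
final class CdnaEntry {
    static final String HTC = "HTC";

    private final String taxonomicDivision;             // e.g. "ROD" for a mouse entry (example value)
    private final Set<String> keywords = new LinkedHashSet<>();
    private String division = HTC;                      // unfinished entries start in HTC

    CdnaEntry(String taxonomicDivision) {
        this.taxonomicDivision = taxonomicDivision;
        keywords.add(HTC);                              // the HTC keyword accompanies unfinished entries
    }

    /** Marks the sequence as finished: move it to its taxonomic division and drop the keyword. */
    void markFinished() {
        division = taxonomicDivision;
        keywords.remove(HTC);
    }

    String division() {
        return division;
    }

    boolean hasKeyword(String keyword) {
        return keywords.contains(keyword);
    }
}
```

Keeping the division and the keyword in one method makes it hard to end up with a finished entry that still carries the HTC keyword, which is the invariant the definition describes.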

These cDNA sequences will play a very important role in the post-genome-sequencing era.

Last modified: Saturday, November 3, 2012