C13. Analysis of Functional Motifs in Mouse Full-Length cDNA Sequences
Kawaji1,2, Hideo Matsuda1, Shinji Kondo3,
Jun Kawai3,4, Yoshihide Hayashizaki3,4, Akihiro Hashimoto1
1 Department of Informatics and Mathematical Science, Graduate School of Engineering Science, Osaka University 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan.
2 NTT Software Corporation, 223-1 Yamashita-cho, Naka-ku, Yokohama, Kanagawa 231-8554, Japan.
3 Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC) 3-1-1 Koyadai, Tsukuba 305-0074, Japan.
4 Genome Science Laboratory, RIKEN Tsukuba Institute, 3-1-1 Koyadai, Tsukuba 305-0074, Japan. E-mail: email@example.com
Functional annotation of DNA sequences is recognized as one of the most important processes in genome analysis. In this poster, we present an analysis of detecting functional motifs from the mouse full-length cDNA data, which are being sequenced by RIKEN. Our method detects functional motifs as commonly conserved regions between given cDNA sequences. Detection of such regions may give useful information for the functional assignment of cDNA sequences if the regions can be related with specific functions.
Our method mainly consists of the following two steps, both of which are based on the graph theory.
(1) Clustering of sequences: First, we compute all pairwise similarities of given protein sequences (which are translated from full-length cDNA sequences), and construct a linkage graph. In the graph, a vertex corresponds to a protein, and an edge is weighted by the score expressing the similarity between the two vertices (proteins) only when the similarity is higher than a threshold. Second, we find all maximal P-quasi complete graphs from the above linkage graph. Here, a P-quasi complete subgraph is a subgraph such that every member is connected with at least a P fraction of the other members, where P ranges from 0 to 1. Third, we select a subset from all P-quasi complete graphs found above, satisfying the condition that any vertex of the linkage graph is covered by some of the selected graphs. Finally, the selected graphs represent clusters of protein sequences.
(2) Detection of conserved regions: For each cluster of sequences classified in the previous step; first, we generate all possible fixed-length ungapped subsequences (or blocks) from the sequences. We compute all pairwise similarities of the blocks, and construct a block graph. In a block graph, a vertex corresponds to a block, and an edge is weighted by the score expressing the similarity between the two blocks only when the similarity is higher than a threshold. Second, we extract the maximum-density subgraph from every connected component of the block graph. A maximum-density subgraph represents a set of fixed-length regions conserved commonly between the sequences. Third, we combine the maximum-density subgraphs obtained when their blocks overlap each other. The resulting set of selected graphs represents a set of detected conserved regions.
If one can relate some detected regions with the sites whose functions are already characterized, a set of sequences that have the same regions can be annotated based on the function of the sites. By applying our method to the mouse full-length cDNA data, we have detected a number of functional motifs (such as, zinc finger C2H2 type, eukaryotic protein kinase, and G-protein beta WD-40 repeats). Also, our method has detected some conserved regions whose functions have not been characterized yet. The functional analysis of those regions is still in progress.