**C13. Analysis of Functional Motifs in Mouse Full-Length cDNA Sequences**

**Hideya
Kawaji**^{1,2}, Hideo Matsuda^{1}, Shinji Kondo^{3},
Jun Kawai^{3,4}, Yoshihide Hayashizaki^{3,4}, Akihiro Hashimoto^{1
}^{1} Department of Informatics
and Mathematical Science, Graduate School of Engineering Science, Osaka University
1-3 Machikaneyama, Toyonaka, Osaka 560-8531,
Japan.

^{2} NTT Software Corporation, 223-1
Yamashita-cho, Naka-ku, Yokohama, Kanagawa 231-8554, Japan.

^{3} Laboratory for Genome Exploration
Research Group, RIKEN Genomic Sciences Center (GSC) 3-1-1
Koyadai, Tsukuba 305-0074, Japan.

^{4} Genome Science Laboratory,
RIKEN Tsukuba Institute, 3-1-1 Koyadai, Tsukuba 305-0074, Japan. E-mail:
kawaji@ics.es.osaka-u.ac.jp

Functional annotation of DNA sequences is recognized as one of the most important processes in genome analysis. In this poster, we present an analysis of detecting functional motifs from the mouse full-length cDNA data, which are being sequenced by RIKEN. Our method detects functional motifs as commonly conserved regions between given cDNA sequences. Detection of such regions may give useful information for the functional assignment of cDNA sequences if the regions can be related with specific functions.

Our method mainly consists of the following two steps, both of which are based on the graph theory.

(1) *Clustering of sequences*: First, we
compute all pairwise similarities of given protein sequences (which are translated
from full-length cDNA sequences), and construct a linkage graph. In the graph,
a vertex corresponds to a protein, and an edge is weighted by the score expressing
the similarity between the two vertices (proteins) only when the similarity
is higher than a threshold. Second, we find all maximal *P*-quasi complete
graphs from the above linkage graph. Here, a *P*-quasi complete subgraph
is a subgraph such that every member is connected with at least a *P*
fraction of the other members, where *P* ranges from 0 to 1. Third, we
select a subset from all *P*-quasi complete graphs found above, satisfying
the condition that any vertex of the linkage graph is covered by some of the
selected graphs. Finally, the selected graphs represent clusters of protein
sequences.

(2) *Detection of conserved regions*: For
each cluster of sequences classified in the previous step; first, we generate
all possible fixed-length ungapped subsequences (or blocks) from the sequences.
We compute all pairwise similarities of the blocks, and construct a block graph.
In a block graph, a vertex corresponds to a block, and an edge is weighted by
the score expressing the similarity between the two blocks only when the similarity
is higher than a threshold. Second, we extract the maximum-density subgraph
from every connected component of the block graph. A maximum-density subgraph
represents a set of fixed-length regions conserved commonly between the sequences.
Third, we combine the maximum-density subgraphs obtained when their blocks overlap
each other. The resulting set of selected graphs represents a set of detected
conserved regions.

If one can relate some detected regions with the sites whose functions are already characterized, a set of sequences that have the same regions can be annotated based on the function of the sites. By applying our method to the mouse full-length cDNA data, we have detected a number of functional motifs (such as, zinc finger C2H2 type, eukaryotic protein kinase, and G-protein beta WD-40 repeats). Also, our method has detected some conserved regions whose functions have not been characterized yet. The functional analysis of those regions is still in progress.