CIEM   05476
CENTRO DE INVESTIGACION Y ESTUDIOS DE MATEMATICA
Executing Unit - UE
Conferences and scientific meetings
Title:
Pattern Recognition on Random Trees
Author(s):
ANA GEORGINA FLESIA, RICARDO FRAIMAN AND FLORENCIA LEONARDI
Venue:
La Falda (Hotel del Lago)- Sierras de Córdoba
Meeting:
Conference; II Escuela de Matemática y Biología; 2007
Organizing institution:
FaMAF- UNC
Abstract:
In this talk, we address the problem of identifying protein functionality using the information contained in its amino acid sequence. We propose a method to define sequence similarity relationships that can be used as input for classification and clustering via well-known metric-based statistical methods. To obtain our measure of sequence similarity, we first fit a Variable Length Markov Chain (VLMC) model to each sequence in our database, generating estimated context trees, and then compute the BFFS distance in tree space between each pair of trees. The BFFS distance takes into account the structure of each tree, which is directly related to the most relevant motifs of the sequence and, indirectly, to the probability of occurrence of each motif. This approach is motivated by the idea that proteins with the same functionality could be modeled by the same VLMC, so their estimated context trees are observations of the same random variable and should be close together in tree space. In our examples, we address three problems of supervised and unsupervised learning in structural genomics via simple metric-based techniques on the space of trees: (i) unsupervised detection of functionality families via K-means clustering in the space of trees; (ii) classification of new proteins into known families via k-nearest-neighbor trees; and (iii) detection of the evolutionary relationships within a known family using Ward linkage of trees. We found evidence that the similarity measure induced by our approach concentrates information for discrimination. Classification performs as well as other VLMC approaches, and the evolutionary tree of the FGF family is recovered via Ward linkage.
Clustering is a harder task. However, our clustering approach is alignment-free and automatic, and it may lead to many interesting variations by choosing other clustering or classification procedures that are based on pre-computed similarity information, such as those that perform clustering using flow simulation, for instance TribeMCL.
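
As an illustration of how pre-computed distances can drive classification, the following is a minimal sketch of k-nearest-neighbor voting over distances between trees. It does not implement the BFFS distance or the VLMC fitting, neither of which is specified in this abstract. The function name `knn_classify`, the family labels, and the distance values are hypothetical and chosen only for the example.

```python
from collections import Counter


def knn_classify(dist_to_train, train_labels, k=3):
    """Classify one query item from its distances to labeled training items.

    dist_to_train[i] is a pre-computed distance between the query and
    training item i (here: a stand-in for a distance between context trees).
    """
    # Sort training indices by distance to the query, nearest first.
    order = sorted(range(len(train_labels)), key=lambda i: dist_to_train[i])
    nearest = [train_labels[i] for i in order[:k]]
    # Majority vote among the k nearest labels; on a tie, the label
    # encountered first among the nearest neighbors wins.
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical data: five training trees from two protein families, and
# made-up distances from a new protein's tree to each of them.
train_labels = ["family_A", "family_A", "family_B", "family_B", "family_B"]
dist_to_train = [0.10, 0.15, 0.80, 0.75, 0.20]

predicted = knn_classify(dist_to_train, train_labels, k=3)
print(predicted)
```

Because the vote only needs distances, the same function would work with any other pre-computed dissimilarity between sequences or trees.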