CIEM   05476
CENTRO DE INVESTIGACION Y ESTUDIOS DE MATEMATICA
Executing Unit - UE
Articles
Title:
Pattern Recognition on Random Trees associated to protein functionality families
Author(s):
FLESIA, A.G.; FRAIMAN, R.; LEONARDI, F.G.
Journal:
Actas de la Academia Nacional de Ciencias
Publisher:
Academia Nacional de Ciencias de Cordoba
References:
Place: Cordoba; Year: 2008, pp. 85-97
ISSN:
0325-7533
Abstract:
In this paper, we address the problem of identifying protein functionality using the information contained in its amino acid sequence. We propose a method to define sequence similarity relationships that can be used as input for classification and clustering via well-known metric-based statistical methods. To obtain our measure of sequence similarity, we first fit a Variable Length Markov Chain (VLMC) model to each sequence in our database, generating estimated context trees, and then we compute the BFFS distance in tree space between each pair of trees. The BFFS distance takes into account the structure of each tree, which is directly related to the most relevant motifs of the sequence and, indirectly, to the probability of occurrence of each motif. This approach is motivated by the idea that proteins having the same functionality could be modeled with the same VLMC, so their estimated context trees are observations of the same random variable and should be close together in tree space. In our examples, we specifically address two problems of supervised and unsupervised learning in structural genomics via simple metric-based techniques on the space of trees: (1) unsupervised detection of functionality families via K-means clustering in the space of trees, and (2) classification of new proteins into known families via k nearest neighbor trees. We found evidence that the similarity measure induced by our approach concentrates information for discrimination. Classification achieves the same high performance as other VLMC approaches. Clustering is a harder task, but our approach to clustering is alignment-free and automatic, and may lead to many interesting variations by choosing other clustering or classification procedures that are based on pre-computed similarity information, such as those that perform clustering using flow simulation; see (Yona et al., 2000; Enright et al., 2003).
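The k nearest neighbor step mentioned in the abstract needs nothing beyond pre-computed pairwise distances, so it can be illustrated without reproducing the BFFS distance itself. The following is a minimal sketch, not the authors' implementation: the function name `knn_classify`, the tie-breaking rule, the family labels, and the distance values are all hypothetical, and the distances are simply assumed to have been computed beforehand between a new protein's context tree and each labeled tree.

```python
from collections import Counter


def knn_classify(dist_to_train, train_labels, k=3):
    """Assign a family label to one query item.

    dist_to_train[i] is the (assumed pre-computed) tree distance between
    the query and the i-th labeled training item.
    """
    if len(dist_to_train) != len(train_labels):
        raise ValueError("one distance is needed per training item")
    if not 1 <= k <= len(train_labels):
        raise ValueError("k must be between 1 and the number of training items")

    # Indices of the k closest training items, closest first.
    order = sorted(range(len(train_labels)), key=lambda i: dist_to_train[i])
    nearest = order[:k]

    # Majority vote among the k neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    top = max(votes.values())

    # Hypothetical tie rule: among the most voted families,
    # return the one that owns the closest neighbor.
    for i in nearest:
        if votes[train_labels[i]] == top:
            return train_labels[i]


# Made-up example: four labeled trees from two families.
families = ["family_A", "family_A", "family_B", "family_B"]
query_distances = [0.2, 0.3, 0.9, 0.8]
predicted = knn_classify(query_distances, families, k=3)
```

Because the classifier only reads distances, any other tree metric or similarity-derived distance could be substituted without changing this code, which is the kind of variation the abstract alludes to.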