RESEARCHERS
FLESIA Ana Georgina
Articles
Title:
Testing statistical hypothesis on random trees and applications to the protein classification problem
Author(s):
BUSCH, J.; FERRARI, P.; FLESIA, A.G.; FRAIMAN, R.; GRYNBERG, S.; LEONARDI, F.G.
Journal:
ANNALS OF APPLIED STATISTICS
Publisher:
INST MATHEMATICAL STATISTICS
References:
Place: Cleveland, OH; Year: 2009 vol. 3 p. 542-563
ISSN:
1932-6157
Abstract:
Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test whether two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov-Smirnov-type goodness-of-fit test proposed by Balding et al. (2008) (BFFS test). The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow problem over a related graph, which can be solved using a Ford-Fulkerson algorithm in time polynomial in that number. We apply the test to 10 randomly chosen protein domain families from the seed of the Pfam-A database (high-quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton-Watson related processes.
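
Illustrative note: the abstract reduces the computation of the test statistic to a max-flow problem solved with a Ford-Fulkerson algorithm. The following is a minimal Python sketch of that generic building block only, using the Edmonds-Karp variant (Ford-Fulkerson with breadth-first augmenting paths). It does not reproduce the graph construction from the article; the function name `max_flow` and the toy graph are hypothetical and chosen purely for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: Ford-Fulkerson with BFS-chosen augmenting paths.

    capacity: dict mapping node -> {neighbor: non-negative capacity}.
    Returns the value of a maximum flow from source to sink.
    """
    if source == sink:
        raise ValueError("source and sink must differ")

    # Residual graph: forward edges plus reverse edges of capacity 0.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        residual.setdefault(u, {})
        for v, c in edges.items():
            residual[u][v] = residual[u].get(v, 0) + c
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left: the flow is maximal

        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy graph, not taken from the article.
toy = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
print(max_flow(toy, "s", "t"))  # 5
```

Each augmentation runs in time linear in the number of edges, and the breadth-first path choice bounds the number of augmentations polynomially in the graph size, which is the kind of guarantee the abstract refers to.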