CIEM   05476
CENTRO DE INVESTIGACION Y ESTUDIOS DE MATEMATICA
Executing Unit (UE)
Articles
Title:
Testing statistical hypothesis on random trees
Authors:
BUSCH, J.; FERRARI, P.; FLESIA, A.G.; FRAIMAN, R.; GRYNBERG, S.; LEONARDI, F.G.
Journal:
arXiv
References:
Year: 2007, pp. 1-20
Abstract:
In this paper we address the problem of identifying differences between populations of trees. Besides the theoretical relevance of this problem, we are interested in testing whether trees characterizing protein sequences from different families constitute samples from significantly different distributions. In this context, trees are obtained by modelling protein sequences as Variable Length Markov Chains (VLMC), estimating the relevant motifs that are sufficient to predict the next amino acid in the sequence. We assign to each protein family an underlying VLMC model, which induces a distribution on the space of all trees. Our goal is to test whether two (or more) populations of trees come from different distributions. Our approach is based on a hypothesis test proposed recently by Balding et al. (2004) (the BFFS test), which involves a Kolmogorov-type statistic that, roughly speaking, maximizes the difference between the expected distance structures that characterize the samples of the populations. Computing the test statistic naively is quite difficult, since it is based on a supremum defined over the space of all trees, which grows exponentially fast. We show how to transform this problem into a max-flow problem over a network, which can be solved using a Ford-Fulkerson algorithm in time polynomial in the maximal number of nodes of the random tree. We also describe conditions that imply the characterization of the measure by the marginal distributions of each node (node occupancy probabilities) of the random tree, which validate the use of the BFFS test for measure discrimination. We study the performance of the test via simulations on Galton-Watson processes. We apply the test to 10 randomly chosen samples of protein families from the Pfam database, and show that the underlying distributions over the set of trees are significantly different, confirming the coherence of the selected families.
This is a novel mathematical approach to validating whether the results of automatic clustering are indeed coherent families.
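
As a rough illustration of the max-flow step mentioned in the abstract, the following is a minimal Python sketch of the Edmonds-Karp variant of Ford-Fulkerson (augmenting paths found by breadth-first search). The function name `max_flow`, the dictionary-based network encoding, and the toy network are assumptions made for this example only; the sketch does not reproduce the network that the paper builds from the tree samples.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Maximum flow from source to sink (Edmonds-Karp, a BFS-based Ford-Fulkerson).

    capacity: dict mapping node -> {neighbour: non-negative capacity}.
    This encoding is a hypothetical choice for the sketch.
    """
    # residual[u][v] is the remaining capacity on the edge u -> v.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
            residual[v][u] += 0  # make sure the reverse edge exists

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left: the flow is maximal

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Toy network, invented for illustration only.
toy_network = {
    "s": {"a": 3, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
}
print(max_flow(toy_network, "s", "t"))
```

With breadth-first augmentation the running time is polynomial in the number of nodes and edges of the network, which is the property the abstract relies on to avoid enumerating all trees.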