INVESTIGADORES
GOLOBOFF Pablo Augusto
congresos y reuniones científicas
Título:
On divide-and-conquer strategies for parsimony analysis of large data sets
Autor/es:
GOLOBOFF, PABLO
Lugar:
Columbus, Ohio, USA
Reunión:
Workshop; The Problems of Phylogenetic Analysis of Large Datasets; 2005
Institución organizadora:
Mathematical Biosciences Institute, Ohio State University
Resumen:
A recent paper (Roshan et al., 2004) described a "divide-and-conquer" technique for analysis of large data sets, rec-i-dcm3, and stated that it compares very favorably to results using TNT (the fastest parsimony program; Goloboff et al., 2003).  The technique is based on creating reduced data sets, finding most parsimonious trees for the reduced data set, and then merging all parts together.  The software Roshan et al. developed submits the reduced data sets to TNT; all their software does is creating the reduced data sets (selecting disjoint sets of terminals) and then combining the results produced by TNT.  Although not stated in the paper, the programs for rec-i-dcm3 actually do a complete round of TBR swapping to the complete data set (via TNT) every time the results from the reduced data sets are combined (optimality for a reduced data set rarely produces score improvements for the entire data set).  The divide-and-conquer method in TNT (sectorial searches; Goloboff, 1999), avoids this by selecting non-disjoint sets of HTU's (which produces exact length calculations).  In this talk, I will show that Roshan et al.'s claims that rec-i-dcm3 outperforms the techniques in TNT is poorly substantiated.  First, the settings they used for the TNT runs were extremely poor; very simple settings for TNT would have done a much better job. Second, having TNT analyze larger subproblems, with more exhaustive algorithms (earlier versions only used multiple random addition sequences with TBR, or a few rounds of tree-drifting), produces much better results (which is merely a difference in implementation; the basic algorithms are the same).  When this is done, the combined techniques used in TNT clearly outperform rec-i-dcm3.  On the other side, rec-i-dcm3 depends on a global round of TBR after each cycle of subdivision, so that using as search engine any program other than TNT becomes unfeasible in the case of very large data sets (e.g. for Roshan et al.'s largest data set, the TBR swapper in PAUP* runs about 800 times slower than the one in TNT).  The global TBR becomes more and more critical as the data set is divided into smaller subproblems (because then the combination of the results produces a more suboptimal tree), so that there is a clear limit to the number of taxa that can be reasonably analyzed with rec-i-dcm3.  Additionally, since creating reduced data sets to improve the results produces invariably a worse tree (which is subsequently improved by global TBR), rec-i-dcm3, is not truly a divide-and-conquer strategy (as publicized), but instead a technique for cyclic perturbations and improvements.