INSTITUTO DE INVESTIGACIONES CIENTIFICAS Y TECNOLOGICAS EN ELECTRONICA
Executing Unit - UE
Conferences and scientific meetings
Quality Assessment of HMMERCTTER Protein Superfamily Classifier using Phylogeny and Benchmarking.
AMALFITANO, A; PAGNUCO IA; STOCCHI, N; BRUN M; REVUELTA MV; ARJEN TEN HAVE
Conference; VIII Argentinian Bioinformatics and Computational Biology; 2017
Asociación Argentina de Bioinformática y Biología Computacional
Agustín Amalfitano, Nicolas Stocchi, María Victoria Revuelta, Inti Anabela Pagnuco, Marcel Brun and Arjen ten Have
BACKGROUND: We recently developed HMMERCTTER, a highly efficient protein superfamily classifier based on phylogenetic clustering and subfamily HMMER profiling. To demonstrate the efficiency of the method, we need a quality assessment that compares HMMERCTTER with SVM feature selection methods as well as with phylogenomics platforms such as Panther. Since phylogenomics has been shown to outperform distance-based methods, which include feature selection methods, and HMMERCTTER uses ML phylogeny, we decided to develop an adapted F1-measure that assesses the results by comparing the classification to a high-fidelity phylogenetic tree. However, this approach is not feasible for benchmark datasets that are too large for the reconstruction of high-fidelity phylogenetic trees. Hence, for large case studies we used a standard comparison of the classifications with the reference classification.
RESULTS: The phylogeny-based method was applied to the three original case studies, comparing HMMERCTTER with Panther and the SVM feature selection method of PSE analysis. For each method, the resulting sequence classification was compared to a phylogenetic tree that included all the sequences. In all cases HMMERCTTER obtained higher F1 scores than Panther, and both obtained higher F1 scores than PSE analysis. A first analysis of the Enolase superfamily benchmark dataset (n=31345) provided by the Structure-Function Linkage Database (SFLD) showed that HMMERCTTER has a precision of 0.9996, a recall of 0.7872 and an F1 of 0.8807. Since many subfamilies showed 100% recall, we are now investigating why other subfamilies show low recall.
CONCLUSIONS: Our studies show that HMMERCTTER outperforms Panther, and that both phylogenomics methods outperform PSE analysis.
Additional benchmark datasets from the SFLD are currently being analyzed, comparing HMMERCTTER with both Panther and PSE analysis. Methods to deal with nested clusters and varying cluster numbers, as obtained by the different methods, will be discussed. Furthermore, we will discuss the quality of the available benchmark datasets and compare the novel phylogeny-based F1 score to the standard F1 score.
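As an illustration of the standard comparison against a reference classification, the following minimal Python sketch computes precision, recall and the standard F1 score. It does not reproduce the adapted phylogeny-based F1, which is not specified in this abstract. The function names, the dictionary representation and the counting rule (precision over classified sequences, recall over all reference sequences) are assumptions made for the example, not the authors' implementation.

```python
from typing import Dict, Optional, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Standard F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(
    predicted: Dict[str, Optional[str]],
    reference: Dict[str, str],
) -> Tuple[float, float]:
    """Compare a predicted classification with a reference classification.

    Assumed convention (hypothetical, for illustration only):
    - predicted maps sequence id -> subfamily label, or None if unclassified
    - precision = correctly classified / all classified sequences
    - recall    = correctly classified / all sequences in the reference
    """
    classified = 0
    correct = 0
    for seq_id, ref_label in reference.items():
        label = predicted.get(seq_id)
        if label is None:
            continue  # an unclassified sequence lowers recall only
        classified += 1
        if label == ref_label:
            correct += 1
    precision = correct / classified if classified else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Toy data, invented for the example.
reference = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
predicted = {"s1": "A", "s2": "A", "s3": None, "s4": "A"}

p, r = precision_recall(predicted, reference)
print(f"precision={p:.4f} recall={r:.4f} F1={f1_score(p, r):.4f}")

# F1 recomputed from the precision and recall reported for the Enolase benchmark.
print(f"Enolase F1 from reported values: {f1_score(0.9996, 0.7872):.4f}")
```

Recomputing F1 from the rounded precision and recall quoted above gives a value within 0.001 of the reported 0.8807; the small difference presumably comes from rounding of the published figures.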