RESEARCHERS
TEN HAVE Arjen
Conferences and scientific meetings
Title:
Quality Assessment of HMMERCTTER Protein Superfamily Classifier using Phylogeny and Benchmarking
Author(s):
AMALFITANO, A; STOCCHI, N; REVUELTA, MV; PAGNUCO, IA; BRUN, M; TEN HAVE, A
Location:
Posadas
Meeting:
Conference; 8vo Congreso Argentino de Bioinformática y Biología Computacional; 2017
Organizing institution:
A2B2C
Abstract:
BACKGROUND: We recently developed HMMERCTTER, a highly efficient protein superfamily classifier based on phylogenetic clustering and subfamily HMMER profiling. To demonstrate the efficiency of the method, we need to develop a quality assessment and compare HMMERCTTER with SVM feature selection methods as well as with phylogenomics platforms such as Panther. Since phylogenomics has been shown to outperform distance-based methods, which include feature selection methods, and HMMERCTTER uses ML phylogeny, we decided to develop an adapted F1-measure that assesses the results by comparing the classification to a high-fidelity phylogenetic tree. However, this approach is not feasible for benchmark datasets that are too large for the reconstruction of high-fidelity phylogenetic trees. Hence, for large case studies we used a standard comparison of classifications with the reference classification.

RESULTS: The phylogeny-based method was applied to the three original case studies, comparing HMMERCTTER with Panther and the SVM feature selection method of PSE analysis. For each method, the resulting sequence classification was compared to a phylogenetic tree that included all the sequences. In all cases HMMERCTTER obtained higher F1 scores than Panther, and both obtained higher F1 scores than PSE analysis. A first analysis of the Enolase superfamily benchmark dataset (n=31345) provided by the Structure-Function Linkage Database (SFLD) showed that HMMERCTTER has a precision of 0.9996, a recall of 0.7872, and an F1 of 0.8807. Since many subfamilies showed 100% recall, we are now investigating why other subfamilies show low recall.

CONCLUSIONS: Our studies show that HMMERCTTER outperforms Panther, and that both phylogenomics methods outperform PSE analysis. Additional benchmark datasets from the SFLD are currently being analyzed, comparing HMMERCTTER with both Panther and PSE analysis. Methods to deal with nested clusters and varying cluster numbers, as obtained by the different methods, will be discussed. Furthermore, we will discuss the quality of the available benchmark datasets and compare the novel phylogeny-based F1 score to the standard F1 score.
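For readers unfamiliar with the standard F1 score mentioned in the abstract, the sketch below shows how it follows from precision and recall as their harmonic mean. It illustrates only the standard measure; the adapted, phylogeny-based F1 is not specified in the abstract and is not reproduced here. The function name `f1_score` is a hypothetical helper chosen for illustration, not part of HMMERCTTER.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: the harmonic mean of precision and recall.

    Hypothetical helper for illustration; not HMMERCTTER code.
    """
    if precision + recall == 0:
        # Both are zero, so the harmonic mean is defined as 0 by convention.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the Enolase benchmark.
enolase_f1 = f1_score(0.9996, 0.7872)

# Recomputing from the rounded inputs gives about 0.8808, which agrees with
# the reported 0.8807 to within rounding of the published values.
print(f"{enolase_f1:.4f}")
```

Because the harmonic mean is dominated by the smaller of the two values, the near-perfect precision cannot compensate for the lower recall, which is why the abstract singles out the low-recall subfamilies for further investigation.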