Title:
An Improved Version of HMMERCTTER Protein Superfamily Classifier
Author(s):
AGUSTIN AMALFITANO; NICOLÁS STOCCHI; ARJEN TEN HAVE; MARCEL BRUN
Location:
Mar del Plata
Meeting:
Conference; Noveno Congreso Argentino de Bioinformática y Biología Computacional (Ninth Argentine Congress of Bioinformatics and Computational Biology); 2018
Organizing institution:
A2B2C
Abstract:
Background: We recently developed HMMERCTTER, a highly efficient protein superfamily classifier based on phylogenetic clustering and subfamily HMMER profiling. HMMERCTTER clusters superfamily training sequences into monophyletic clusters with 100% Precision and Recall Self-Detection (100% P&R-SD). These clusters are used to classify new sequences while maintaining 100% P&R-SD on the updated groups. Classification is iterated, and in each step the profile is updated by including the accepted sequences. Both algorithms are guided by the user, which makes the analyses subjective. In addition, the first version lacks quality scores.
Objective: We want to obtain new algorithms that 1) show improved performance; 2) are more efficient; 3) are completely automated; and 4) output quality scores.
Results: We developed a number of new, user-independent algorithms for both clustering and classification. In order to estimate their performance, quality scores are required. We first focused on the classification algorithm, using the preexisting multiclass F1 score. Once the best classification algorithm is selected, we set out to determine a clustering quality measure that predicts the quality of an eventual classification. Typically, a number of different partitions with 100% P&R-SD clusters exist. These will be identified using a recursive partitioning of a high-fidelity phylogeny of a superfamily training set. Then, for all these partitions, a number of quality measures, such as the Silhouette index, are determined. Subsequently, each partition is used to classify the target set, and the resulting F1 classification score is compared to the different clustering quality measures. A first set of analyses including four published datasets will be presented, alongside details of the algorithms as well as a number of security loops that might be required to maintain reasonable computation times.
Conclusions and perspectives: The final analysis will be performed using 100 medium-fidelity superfamily datasets that will be generated in a separate project. Each superfamily will be split 100 times into two random sequence sets containing 20% (train) and 80% of the sequences. The clustering measure that, on average, shows the best correlation with the classification F1 score will then be selected and included in the final second version of HMMERCTTER.
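Illustrative sketch (not part of the original abstract): the evaluation protocol described above, a random 20%/80% split, a multiclass F1 score per partition, and selection of the clustering measure that correlates best with F1 on average, could look roughly like the following in Python. This is not HMMERCTTER code. The abstract does not say which F1 averaging or which correlation coefficient is used, so macro-averaged F1 and Pearson correlation are assumptions here, and all function names, as well as the measure name "other_index", are hypothetical.

```python
import random
from statistics import mean


def split_train_target(ids, train_fraction=0.2, seed=0):
    """Randomly split sequence ids into a train part and the remainder."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed flavour of multiclass F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_measure(runs):
    """Pick the measure whose correlation with F1 is highest on average.

    runs: list of (measure_values, f1_scores) pairs, one per random split,
    where measure_values maps a measure name to one value per partition and
    f1_scores holds the classification F1 obtained with each partition.
    """
    names = list(runs[0][0])
    average = {
        name: mean(pearson(values[name], f1) for values, f1 in runs)
        for name in names
    }
    return max(average, key=average.get), average


# Toy data for a single split with three candidate partitions.
toy_runs = [({"silhouette": [0.1, 0.5, 0.9],
              "other_index": [0.9, 0.1, 0.5]},
             [0.6, 0.8, 1.0])]
best_name, average_correlation = select_measure(toy_runs)
```

In the study described above, the list passed to select_measure would hold one entry per random split of each superfamily dataset, with the measure values computed on the candidate 100% P&R-SD partitions of the train set.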