ICYTE   26279
INSTITUTO DE INVESTIGACIONES CIENTIFICAS Y TECNOLOGICAS EN ELECTRONICA
Executing Unit - UE
Conferences and scientific meetings
Title:
SwissProtCluster: The New Protein Superfamily Database for Reliable Function Assignation by HMMERCTTER
Author(s):
AGUSTÍN AMALFITANO; MARCEL BRUN; NICOLAS STOCCHI; ARJEN TEN HAVE
Location:
Prague
Meeting:
Conference; 25th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and 16th European Conference on Computational Biology (ECCB); 2017
Organizing institution:
International Society for Computational Biology (ISCB)
Abstract:
Background
HMMER databases such as Pfam are used for sequence function assignation. They use trusted cut-offs to obtain specificity at the cost of reduced sensitivity. The HMMER Cut-off Threshold Tool (HMMERCTTER) comprises HMMERCTTER_Clust, which identifies monophyletic clusters with 100% precision and recall (P&R), i.e. clusters that identify all cluster sequences with higher scores than non-cluster sequences, and HMMERCTTER_Class, which then classifies target sequences using the identified clusters. HMMERCTTER_Class can also use any sequence clustering, provided it contains only 100% P&R clusters. Therefore, we developed a 100% P&R HMMER-cluster database based on UniProtKB-SwissProt, providing a reliable tool for function assignation of complete proteomes. Here we report the construction of the single-domain database.

Results
Single-domain sequences were grouped based on family annotation codes and tested for 100% P&R. SwissProtCluster_1D.v1 contains 4143 groups of at least four sequences, totaling 69518 sequences, as well as 5871 ungrouped sequences. Of these groups, 3853 show 100% P&R; the remaining 290 were scrutinized by a script that removes outliers until the group is 100% P&R. Ungrouped sequences were clustered into new groups using a combination of CD-Hit and HMMERCTTER. SwissProtCluster_1D.v2 covers 86% of the UniProtKB-SwissProt single-domain sequence space. Sequences from small groups (n
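
The 100% P&R criterion and the outlier-removal step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script. The function names, the toy scores, and the rule of dropping the lowest-scoring member first are assumptions made for illustration only. A real scorer would rebuild the HMMER profile from the current members and re-score all sequences; the `toy_score_fn` stand-in simply returns fixed scores. Stopping below four members is likewise an assumption, borrowed from the abstract's "groups of at least four sequences".

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Scores = Dict[str, float]


def is_perfect_cluster(scores: Scores, members: Set[str]) -> bool:
    """Return True if every member scores strictly higher than every non-member.

    This is the 100% precision-and-recall condition: a single score cut-off
    then separates cluster sequences from all other sequences.
    """
    in_scores = [s for seq, s in scores.items() if seq in members]
    out_scores = [s for seq, s in scores.items() if seq not in members]
    if not in_scores:
        return False
    if not out_scores:
        return True
    return min(in_scores) > max(out_scores)


def trim_to_perfect(
    score_fn: Callable[[Set[str]], Scores],
    members: Set[str],
    min_size: int = 4,  # assumption: mirrors "groups of at least four sequences"
) -> Optional[Tuple[Set[str], List[str]]]:
    """Drop the lowest-scoring member until the group is 100% P&R.

    Returns (remaining members, removed sequences in order), or None if the
    group shrinks below min_size before the condition is met.
    """
    current = set(members)
    removed: List[str] = []
    while len(current) >= min_size:
        scores = score_fn(current)  # hypothetical scorer, re-run on each pass
        if is_perfect_cluster(scores, current):
            return current, removed
        worst = min(current, key=lambda seq: scores[seq])
        current.remove(worst)
        removed.append(worst)
    return None


# Toy data: invented scores, not taken from SwissProtCluster.
TOY_SCORES: Scores = {
    "a": 90.0, "b": 85.0, "c": 80.0, "d": 75.0, "e": 40.0,  # annotated group
    "x": 60.0, "y": 30.0,                                    # other sequences
}


def toy_score_fn(members: Set[str]) -> Scores:
    """Stand-in scorer that ignores the member set and returns fixed scores."""
    return TOY_SCORES


if __name__ == "__main__":
    print(trim_to_perfect(toy_score_fn, {"a", "b", "c", "d", "e"}))
```

In the toy data, sequence "e" scores below the non-member "x", so the initial group fails the condition; removing "e" leaves a group whose lowest score is above every non-member score.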