RESEARCHERS
TEN HAVE Arjen
Conferences and scientific meetings
Title:
SwissProt Select: The New Protein Superfamily Database for Reliable Function Assignation
Author(s):
NICOLÁS STOCCHI; AGUSTIN AMALFITANO; ARJEN TEN HAVE; MARCEL BRUN
Location:
Buenos Aires
Meeting:
Conference; 4th International Society of Computational Biology-Latin America Conference; 2016
Organizing institution:
ISCB/A2B2C
Abstract:
Background
Pfam is a HMMER database routinely used for function assignment, particularly of complete proteomes. Since it is not equipped with a reliable cut-off, it suffers from poor Precision and Recall (P&R). We previously developed HMMERCTTER, which uses HMMER to classify sequences using phylogenetically clustered training sequences, where each cluster shows 100% P&R. HMMERCTTER can also use phylogeny-independent clustering to generate a HMMER profile database of protein subfamilies. We set out to develop that database, starting with the re-clustering of SwissProt.
Results
SwissProt annotation shorthand codes were used for initial clustering. Following filters for annotation quality and cluster size, we obtained SwissProtSelect v1 with 69561 sequences clustered in 10139 families of at least four sequences. Of these, 6745 families showed 100% P&R. We reasoned that the 3394 remaining clusters lack 100% P&R either because single clusters contain unrelated sequences or because different clusters contain related sequences. Combiner identifies cluster relationships and joins subfamilies. BadBoys identifies and removes outliers. Their automated, iterative application resulted in SwissProtSelect v2, which covers over 95% of the 69561 sequences with 100% P&R, as well as many sequences initially excluded because of the originally small size of their corresponding cluster.
Conclusion and perspective
SwissProt can be re-clustered into clusters that show 100% P&R without a large increase in cluster number. Apparently, SwissProt contains more errors than generally assumed. To show the feasibility of a more complete HMMER profile database with reliable cut-offs, we will combine SwissProt Select with other databases and large complete proteome datasets.
Supported by CONICET and AGENCIA.
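The "100% P&R" criterion used throughout the abstract can be illustrated with a minimal sketch. It assumes the standard definitions of precision and recall, applied to the set of sequences a cluster's profile retrieves versus the sequences that belong to that cluster. The function names and sequence identifiers are hypothetical and are not taken from HMMERCTTER or SwissProt Select.

```python
def precision_recall(hits, members):
    """Return (precision, recall) of retrieved hits against true cluster members.

    hits    -- identifiers retrieved by a cluster's profile (assumed input)
    members -- identifiers that truly belong to the cluster (assumed input)
    """
    hits, members = set(hits), set(members)
    true_positives = len(hits & members)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(members) if members else 0.0
    return precision, recall


def is_perfect(hits, members):
    """True when the cluster shows 100% precision and 100% recall."""
    return precision_recall(hits, members) == (1.0, 1.0)


# Hypothetical identifiers, for illustration only.
cluster = {"seqA", "seqB", "seqC", "seqD"}
clean_hits = {"seqA", "seqB", "seqC", "seqD"}   # retrieves exactly the cluster
noisy_hits = {"seqA", "seqB", "seqC", "seqX"}   # misses seqD, adds unrelated seqX

print(precision_recall(clean_hits, cluster))  # (1.0, 1.0)
print(precision_recall(noisy_hits, cluster))  # (0.75, 0.75)
```

Under these definitions, an unrelated sequence inside a cluster lowers precision, and a related sequence placed in a different cluster lowers recall. These correspond to the two failure cases the abstract describes for the 3394 remaining clusters.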