RESEARCHERS
TEN HAVE Arjen
Conferences and scientific meetings
Title:
SwissProtCluster: Improvements on database development and performance by HMMERCTTER and an initial approach to multi-domain sequences.
Author(s):
NICOLÁS STOCCHI; AGUSTIN AMALFITANO; MARCEL BRUN; ARJEN TEN HAVE
Location:
Mar del Plata
Meeting:
Conference; Noveno Congreso Argentino de Bioinformática y Biología Computacional (Ninth Argentine Congress of Bioinformatics and Computational Biology); 2018
Organizing institution:
A2B2C
Abstract:
Background. HMMER Cut-off Threshold Tool (HMMERCTTER) is a training-based tool for reliable, automated classification of superfamily protein sequences. Using a phylogeny and the corresponding sequences as training set, HMMERCTTER_Clust identifies subfamilies with 100% Precision and Recall Self Detection (P&R-SD), i.e. clusters that identify all cluster sequences with higher HMMER scores than all non-cluster sequences. This determines a subfamily-specific cut-off threshold. HMMERCTTER_Class then classifies target sequences, iterating with adapted HMMER profiles and cut-off thresholds, providing both high sensitivity and specificity. Interestingly, HMMERCTTER_Class can use any sequence partition, provided its clusters show 100% P&R-SD. Hence, our long-term goal is to develop a 100% P&R-SD protein subfamily database with sequence space coverage comparable to Pfam.

Results. We set out to develop the database using UniProtKB-SwissProt. The first problem we encountered was that multi-domain architectures lead to transitive hits, by which sequences with different domain architectures become clustered. Thus, we first used only single-domain sequences to construct an initial single-domain sequence space database. Single-domain sequences, as detected by Pfam, were grouped based on family annotation codes. SwissProtClusterSD_v1 (SPC-SD_v1) contains 3971 groups of at least four sequences, totaling 90465 sequences, as well as 1965 ungrouped sequences. 3766 groups show 100% P&R-SD, covering 70.3% of the corresponding sequence space. The remaining groups (22.5%) were scrutinized by several algorithms. A first algorithm divides a big group into smaller subgroups, whereas a second joins small groups. A third algorithm identifies contaminations. SPC-SD_v2 covers ~88% of the UniProtKB-SwissProt single-domain sequence space. The database was then expanded using PDB and UniProt Reference Proteomes 15 (UPRP15). The resulting SPC-SD_v3 could classify half of the Pfam sequences, confirming that UniProtKB-SwissProt shows poor coverage of sequence space. Finally, all unrepresented Pfam domains will be included, using the sequences of UPRP15 initially clustered with Pfam and scrutinized using the same algorithms as in the construction of SPC-SD_v2, resulting in SPC-SD_v4.

Conclusions. Sequence space complexity and poor annotations in UniProtKB-SwissProt posed major problems in the construction of the single-domain database. The single-domain database will next be used to develop the multi-domain database. Future versions of SPC will include multi-domain sequences, using protein domain architectures hierarchically grouped by superfamily.
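
The 100% P&R-SD criterion described in the Background can be illustrated with a minimal Python sketch. This is not HMMERCTTER's code: the function name `self_detection_cutoff`, the `scores` mapping (sequence ID to score against one cluster's profile) and the choice of the lowest member score as the cut-off are all assumptions made for illustration. The abstract only states that every cluster sequence must score higher than every non-cluster sequence.

```python
from typing import Dict, Optional, Set


def self_detection_cutoff(scores: Dict[str, float],
                          members: Set[str]) -> Optional[float]:
    """Return a cut-off if the cluster shows 100% P&R-SD, else None.

    scores  -- hypothetical mapping: sequence ID -> score of that
               sequence against the cluster's profile
    members -- IDs of the sequences assigned to the cluster
    """
    member_scores = [s for seq, s in scores.items() if seq in members]
    other_scores = [s for seq, s in scores.items() if seq not in members]
    if not member_scores:
        return None  # an empty cluster cannot be evaluated

    lowest_member = min(member_scores)
    highest_other = max(other_scores, default=float("-inf"))

    # 100% precision and recall: every member outscores every non-member.
    if lowest_member > highest_other:
        # Assumption: use the lowest member score as the cut-off; any value
        # above highest_other and up to lowest_member separates the two sets.
        return lowest_member
    return None


# Made-up scores for four sequences against one cluster profile.
example_scores = {"seqA": 310.0, "seqB": 295.5, "seqC": 120.0, "seqD": 80.2}
clean_cutoff = self_detection_cutoff(example_scores, {"seqA", "seqB"})
mixed_cutoff = self_detection_cutoff(example_scores, {"seqA", "seqC"})
```

With these made-up scores, the cluster {seqA, seqB} is cleanly separated and receives a cut-off, whereas {seqA, seqC} does not, because the non-member seqB outscores the member seqC.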