
Background The sheer number of protein sequences derived from public genome sequences provides many opportunities but also challenges to biologists. Many protein superfamilies appear to consist of various, sometimes unknown, subfamilies that are often difficult to distinguish. Computational analyses play an important role in what is referred to as function assignment but typically require specific biological knowledge, insight into the available biocomputational tools and heavy computation of large phylogenies. We set out to develop a tool for the bioinformatics layman that, based on a training set of high-quality expert annotation, automatically clusters superfamily protein sequences into subfamilies. We developed an automatic but user-supervised procedure that results in a high-quality clustering, cluster-specific HMMer profiles and corresponding cut-off threshold values for reliable sequence identification and clustering. Hence, we refer to this new tool as HMMer Cut-off Threshold Tool or HMMerCTTer.

Results HMMerCTTer depends on an expert-provided training set that consists of a phylogeny and the underlying Multiple Sequence Alignment (MSA). First, HMMerCTTer assigns monophyletic clusters using a ranking algorithm based on the Silhouette Index with weight correction. The Silhouette Index measures the compactness and separation of clusters based on the distances provided by the tree. Then, a HMMer profile is built for each Silhouette-qualified cluster using the user-provided MSA. Each cluster-specific HMMer profile will, theoretically, identify sequences belonging to the same cluster with a high alignment score, whereas sequences from other clusters will have significantly lower scores. Sudden drops in alignment scores are thus indicative of cut-off thresholds. This does, however, depend on the quality of the tree and the corresponding MSA, but also on the variation and conservation observed within and among the different subfamilies of the superfamily.
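To make the Silhouette Index concrete, the sketch below computes the standard (unweighted) mean silhouette width per cluster from a matrix of pairwise tree distances. It is only an illustration: the weight correction and the ranking algorithm used by HMMerCTTer are not described here and are not reproduced, and the function name `silhouette_per_cluster`, the input layout and the toy distances are assumptions made for this example.

```python
from statistics import mean


def silhouette_per_cluster(dist, labels):
    """Mean silhouette width of each cluster (standard, unweighted form).

    dist   -- symmetric list-of-lists of pairwise distances, e.g. patristic
              distances read from the tree (assumed to be precomputed)
    labels -- labels[i] is the cluster name of sequence i
    Assumes at least two clusters are present.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = {lab: [] for lab in clusters}
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths[lab].append(0.0)  # usual convention for singleton clusters
            continue
        # a: compactness, mean distance to the other members of the own cluster
        a = mean(dist[i][j] for j in own)
        # b: separation, mean distance to the nearest other cluster
        b = min(
            mean(dist[i][j] for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths[lab].append((b - a) / denom if denom > 0 else 0.0)
    return {lab: mean(w) for lab, w in widths.items()}


# Toy data (invented): two tight clusters that are far apart.
toy_dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
toy_labels = ["A", "A", "B", "B"]
toy_widths = silhouette_per_cluster(toy_dist, toy_labels)
```

Values close to 1 indicate a compact, well-separated cluster; values near 0 or below suggest that a cluster overlaps with its neighbours and may be a poor candidate for its own profile.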
Hence, the procedure is supervised by the biological expert in order to optimize both sensitivity and specificity. In a second step, the sensitivity and the specificity of the HMMer profiles are tested using either the ungapped sequences from the training set or the corresponding complete proteomes. Based on graphically represented data, the user either accepts clusters or asks for an iterative refinement. For instance, large clusters with a high Silhouette Index but nevertheless a non-discriminative HMMer profile can be re-analysed by running only each such cluster's subtree through the ranking algorithm again. This results in smaller clusters and more specific profiles. Another refinement deals with clusters that are considered too small, using an iteration through the HMMer profiling loop. A third refinement is a manual override of the clustering provided by the ranking algorithm, in order to enable paraphyletic clustering. The idea of HMMerCTTer was first applied to our recently published plant-ACD superfamily study. In this study, 29 custom-defined HMMer profiles were constructed and manually selected based on a phylogeny of 406 sequences derived from seven complete plant proteomes. The generated profiles were used to screen 17 complete plant proteomes and yielded a single false positive (829 sequences, 17 complete proteomes), whereas all true positives were detected (training set, 7 complete proteomes). The automated HMMerCTTer identified a slightly higher number of clusters than the manual procedure, but the HMMer profiles generated reliable cut-offs. Hence, the same 829-sequence collection and clustering would have been achieved if HMMerCTTer, rather than a time-costly expert analysis, had been applied.

Conclusions HMMerCTTer provides biologists with an easy and powerful tool for the reliable classification of subfamilies of superfamilies.
Since nature provides an almost unlimited variety of superfamilies, more benchmarking will be required in order to further improve HMMerCTTer and to test its general applicability and limits. Currently we are analysing HMMerCTTer using the highly complex superfamilies of aspartic proteases, polygalacturonases and phospholipases C. For the future we foresee the development of a HMMerCTTer-based tool for the supervised annotation of complete proteomes.
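The cut-off idea from the Results (a sudden drop in alignment scores marks the threshold) and the subsequent sensitivity and specificity check can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual procedure: placing the cut-off at the midpoint of the largest gap is an assumption, the function names `cutoff_from_largest_drop` and `sensitivity_specificity` are hypothetical, and the scores are invented rather than taken from the plant-ACD study.

```python
def cutoff_from_largest_drop(scores):
    """Return a cut-off placed midway across the largest gap between
    consecutive scores after sorting them from high to low.
    (The midpoint rule is an assumption made for this sketch.)"""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two scores to find a drop")
    drops = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    k = drops.index(max(drops))
    return (ranked[k] + ranked[k + 1]) / 2


def sensitivity_specificity(hits, members, cutoff):
    """Evaluate one profile at a given cut-off.

    hits    -- dict mapping sequence id to its alignment score for the profile
    members -- set of sequence ids that truly belong to the cluster
    Cluster members absent from `hits` are counted as missed.
    """
    tp = sum(1 for s, sc in hits.items() if sc >= cutoff and s in members)
    fp = sum(1 for s, sc in hits.items() if sc >= cutoff and s not in members)
    tn = sum(1 for s, sc in hits.items() if sc < cutoff and s not in members)
    fn = len(members) - tp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Invented scores: three cluster members and three sequences from other clusters.
toy_hits = {"a1": 310.5, "a2": 295.0, "a3": 288.2,
            "b1": 92.4, "b2": 80.1, "c1": 35.0}
toy_members = {"a1", "a2", "a3"}

toy_cutoff = cutoff_from_largest_drop(toy_hits.values())
toy_sens, toy_spec = sensitivity_specificity(toy_hits, toy_members, toy_cutoff)
```

When the largest drop does not separate members from non-members cleanly, sensitivity or specificity falls below 1, which is the kind of case where the abstract describes the expert asking for an iterative refinement of the cluster.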
