RESEARCHERS
TEN HAVE Arjen
Conferences and scientific meetings
Title:
HMMerCTTer: Tailor-made Decision Making for the Semi-automatic Clustering of large Protein Superfamilies
Author(s):
BONDINO HG; PAGNUCO IA; REVUELTA MV; BRUN M; ARJEN TEN HAVE
Place:
Oro Verde
Meeting:
Congress; 3rd Congress of the Asociación Argentina de Bioinformática y Biología Computacional; 2012
Organizing institution:
Asociación Argentina de la Bioinformática y Biología Computacional
Abstract:
Background
The sheer number of protein sequences derived from public genome
sequences provides many opportunities but also challenges to
biologists. Many protein superfamilies appear to consist of various,
sometimes unknown, subfamilies that are often difficult to
distinguish. Computational analyses play an important role in what
is referred to as function assignment but typically require specific
biological knowledge, insight into the available biocomputational tools
and heavy computation of large phylogenies. We set out to develop a
tool for the bioinformatics layman that, based on a training set with
high-quality expert annotation, automatically clusters superfamily
protein sequences into subfamilies. We developed an automatic but
user-supervised procedure that results in a high-quality clustering,
cluster-specific HMMer profiles and corresponding cut-off threshold
values for reliable sequence identification and clustering. Hence, we
refer to this new tool as HMMer Cut-off Threshold Tool or HMMerCTTer.
Results
HMMerCTTer depends on an expert-provided training set that consists of a
phylogeny and the underlying Multiple Sequence Alignment (MSA).
First, HMMerCTTer assigns monophyletic clusters using a ranking
algorithm based on the Silhouette Index with weight correction. The
Silhouette Index measures the compactness and separation of clusters
based on the distances provided by the tree. Then, a HMMer profile is
built for each Silhouette-qualified cluster using the user-provided
MSA. Each cluster-specific HMMer profile will, theoretically,
identify sequences belonging to the same cluster with a high
alignment-score, whereas sequences from other clusters will have
significantly lower scores. Sudden drops in alignment-scores are thus
indicative of cut-off thresholds. This does, however, depend not only on
the quality of the tree and corresponding MSA but also on the variation
and conservation observed within and among the different subfamilies
of the superfamily. Hence, the procedure is supervised by the
biological expert in order to optimize both sensitivity and
specificity.
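As an illustration of the Silhouette Index mentioned above, the following Python sketch computes the mean silhouette width of a clustering from a matrix of pairwise distances, such as leaf-to-leaf distances read from a tree. It is a minimal sketch and not the HMMerCTTer implementation: the function name `silhouette_index`, the matrix layout and the convention of scoring singleton clusters as 0 are assumptions, and the weight correction used by the ranking algorithm is not specified in this abstract and is therefore omitted.

```python
from statistics import mean


def silhouette_index(dist, labels):
    """Return the mean silhouette width of a clustering.

    dist   -- square, symmetric matrix (list of lists) of pairwise distances,
              for example patristic distances taken from a phylogeny
    labels -- cluster label of each item, in the same order as dist
    Assumes at least two distinct clusters.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Assumed convention: a singleton cluster contributes 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = mean(dist[i][j] for j in own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean(dist[i][j] for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        widths.append((b - a) / denominator if denominator > 0 else 0.0)
    return mean(widths)
```

Values close to 1 indicate compact, well-separated clusters, whereas values near or below 0 indicate items that lie at least as close to another cluster as to their own.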
In a second step, the sensitivity and the specificity of the HMMer
profiles are tested using either the ungapped sequences from the
training set or the corresponding complete proteomes. Based on
graphically represented data, the user either accepts clusters or
asks for an iterative refinement. For instance, a large cluster with a
high Silhouette Index but nevertheless a poorly discriminating HMMer
profile can be re-analysed by iterating only that
cluster's subtree through the ranking algorithm. This results in
smaller and more specific profiles. A second refinement deals
with clusters that are considered too small, using an iteration
through the HMMer profiling loop. A third refinement is a manual
override of the clustering provided by the ranking algorithm, in
order to enable paraphyletic clustering.
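The threshold selection and the sensitivity and specificity test described above can be sketched in Python as follows. The sketch assumes that the scores of all test sequences against one cluster profile are already available, for example from an HMMER search. The function names, the rule of placing the cut-off halfway across the largest score drop and the idea of a fully automatic choice are illustrative assumptions; in HMMerCTTer the data are presented graphically and the decision is supervised by the user.

```python
def cutoff_from_largest_drop(scores):
    """Place a cut-off in the middle of the largest drop between ranked scores."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two scores to find a drop")
    # Difference between each score and the next lower one.
    drops = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    k = max(range(len(drops)), key=drops.__getitem__)
    return (ranked[k] + ranked[k + 1]) / 2.0


def sensitivity_specificity(scores, is_member, cutoff):
    """Return (sensitivity, specificity) of the rule 'score >= cutoff'.

    scores    -- score of each test sequence against one cluster profile
    is_member -- True where the sequence belongs to that cluster according
                 to the expert-annotated training set
    """
    tp = fn = tn = fp = 0
    for score, member in zip(scores, is_member):
        hit = score >= cutoff
        if member and hit:
            tp += 1
        elif member:
            fn += 1
        elif hit:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one member and one non-member")
    return tp / (tp + fn), tn / (tn + fp)
```

A profile whose ranked scores show no clear drop, or whose best cut-off still gives low sensitivity or specificity, would be a candidate for one of the refinements listed above.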
The idea of HMMerCTTer was first applied to our recently published
plant-ACD superfamily study. In this study, 29 custom-defined HMMer
profiles were constructed and manually selected based on a phylogeny
of 406 sequences derived from seven complete plant proteomes. The
generated profiles were used to screen 17 complete plant proteomes
and yielded a single false positive (829 sequences, 17 complete
proteomes) whereas all true positives were detected (training set, 7
complete proteomes). The automated HMMerCTTer identified a slightly
higher number of clusters than the manual procedure but the HMMer
profiles generated reliable cut-offs. Hence, the same 829-sequence
collection and clustering would have been achieved if HMMerCTTer
rather than a time-consuming expert analysis had been applied.
Conclusions
HMMerCTTer provides biologists with an easy and powerful tool for the reliable
classification of subfamilies of superfamilies. Since Nature provides
us with infinite scenarios of superfamilies, more benchmarking will
be required in order to further improve HMMerCTTer and to test its
general applicability and limits. Currently, we are evaluating
HMMerCTTer using the highly complex superfamilies of aspartic
proteases, polygalacturonases and phospholipases C. For the future, we
foresee the development of a HMMerCTTer-based tool for the supervised
annotation of complete proteomes.