IIBBA   05544
INSTITUTO DE INVESTIGACIONES BIOQUIMICAS DE BUENOS AIRES
Executing Unit - UE
Conferences and scientific meetings
Title:
Using coevolution to improve protein subfamily classification
Author(s):
FRANCO SIMONETTI; ARIEL BERENSTEIN; ARIEL CHERNOMORETZ; CRISTINA MARINO BUSLJE
Location:
San Carlos de Bariloche
Meeting:
Conference; V Congreso Argentino de Bioinformática y Biología Computacional; 2014
Organizing institution:
Asociación Argentina de Bioinformática y Biología Computacional
Abstract:
Background
The common approach to protein subfamily classification groups protein
sequences according to their degree of similarity. However, no single sequence
similarity threshold accurately groups sequences into isofunctional groups.
Most methods take protein superfamilies as the starting point for subfamily
classification. A superfamily is defined as a set of homologous proteins in
which conserved sequence or structural characteristics can be associated with
conserved functional characteristics. Superfamily members can be highly
divergent and catalyze quite different overall reactions. A subfamily is
defined as a set of homologous proteins within a superfamily that perform an
identical function by the same mechanism.
Current subfamily classification methods use bottom-up clustering to construct
a cluster hierarchy, then cut the hierarchy at the most appropriate locations
to obtain a single partitioning [1, 2]. These methods usually integrate data
such as protein sequence similarity, residue conservation within groups and
HMM profiles. However, they usually predict a large number of subfamilies with
few members and limited biological meaning.
The goal of this study is to identify subsets of functionally closely related
sequences within a given superfamily. Since all proteins within a superfamily
share a common ancestor, we hypothesize that functional diversity within
superfamilies has arisen through a series of concerted changes that must have
left an identifiable coevolutionary signal.
Materials and Methods
The challenge is to separate the coevolutionary signals of the individual
subfamilies and use them for subfamily classification. This information can
then guide a hierarchical clustering. Our approach combines Mutual
Information, used to calculate covariation [3], with commonly used clustering
methods based on sequence similarity. We selected a group of superfamilies
from the Structure-Function Linkage Database as our gold standard dataset [4].
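As an illustration of the covariation measure, the sketch below computes plain
Mutual Information (in bits) between two columns of a multiple sequence
alignment. It is a minimal example under stated assumptions, not the method of
[3]: the function name column_mutual_information, the gap handling (rows with a
gap in either column are skipped) and the toy alignment are all hypothetical,
and no background correction or sequence weighting is applied.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j, gap="-"):
    """Plain MI (bits) between alignment columns i and j.

    Assumption: rows with a gap in either column are ignored, and no
    correction or sequence weighting is applied.
    """
    pairs = [(s[i], s[j]) for s in msa if s[i] != gap and s[j] != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of residue pairs (a, b)
    left = Counter(a for a, _ in pairs)    # counts of residues in column i
    right = Counter(b for _, b in pairs)   # counts of residues in column j
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 2 change together,
# column 1 is fully conserved.
toy_msa = ["ACDK", "ACDK", "GCEK", "GCER"]
covarying = column_mutual_information(toy_msa, 0, 2)
conserved = column_mutual_information(toy_msa, 0, 1)
```

In this toy alignment the two columns that change together score higher than
the pair that includes a fully conserved column, which carries no covariation
signal.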
Results
Different approaches were considered for integrating Mutual Information data
into sequence clustering. Since Mutual Information can only be calculated for
a group of sequences, a preliminary sequence clustering is performed first.
Using covariation data alone, our method can cluster groups of sequences from
the same subfamily. For a complete clustering solution, it performs almost as
well as a hierarchical clustering based on sequence similarity. The next step
will be to integrate both methods.
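For reference, a sequence-similarity baseline of the kind mentioned above can
be sketched as bottom-up (agglomerative) clustering that stops merging once the
closest clusters are farther apart than a cutoff. This is only an illustrative
sketch: the average-linkage criterion, the identity-based distance, the names
identity_distance and agglomerative_clusters, and the toy sequences are
assumptions, not details taken from the study.

```python
from itertools import combinations


def identity_distance(a, b):
    """1 - fraction of identical positions in two aligned, equal-length sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)


def agglomerative_clusters(seqs, cutoff):
    """Average-linkage bottom-up clustering of aligned sequences.

    Merging stops when the two closest clusters are farther apart than
    `cutoff`, which plays the role of cutting the hierarchy at one level.
    Returns clusters as sorted lists of sequence indices.
    """
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(identity_distance(seqs[i], seqs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (p, q), d = min(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy sequences forming two similar pairs.
toy_seqs = ["ACDK", "ACDR", "GCEK", "GCER"]
groups = agglomerative_clusters(toy_seqs, cutoff=0.3)
```

The choice of cutoff determines the partitioning: a loose cutoff merges
everything into one group, while a strict one leaves every sequence on its own.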
Conclusions
Automated protein classification remains an active topic of research, and
state-of-the-art methods are still far from producing biologically meaningful
results. Covariation data have not previously been used in this context, and
further analyses are needed to improve the method.