IRAZOQUI José Matías
Conferences and scientific meetings
Title:
ANSI: A tool for the Automatic Identification of Novel and Undescribed Protein Subfamilies
Author(s):
MATÍAS IRAZOQUI; MARIA VICTORIA REVUELTA; MARCEL BRUN; ARJEN TEN HAVE
Location:
Rosario
Meeting:
Conference; IV Congreso Argentino de Bioinformática y Biología Computacional; 2013
Organizing institution:
Asociación Argentina de Bioinformática y Biología Computacional
Abstract:
Background
The accurate prediction of protein function from primary structure has long been one of the major objectives of bioinformatics research. One of the first steps in studying a protein superfamily is to build a Multiple Sequence Alignment (MSA). However, given the large number of incorrect gene models and the high levels of variation caused by "non-functional homologues" (homologues that lack hallmark residues, i.e. the sequence positions required for a given function), this can be a rather complicated task. The quality of the MSA can be improved by applying restrictions based on known hallmarks. This approach has a drawback: known hallmark residues are mostly biased towards sequences found in model organisms. MSA curation may therefore discard sequences that belong to novel subfamilies that have not yet been described. If sequences with mutations in their known hallmarks are functional, a rather high level of conservation would be expected at those sites. For example, M14 carboxypeptidases (M14CPs) have a strictly conserved H180-E183-R238-H307-E381 motif. Data mining of 40 completely sequenced fungal genomes identified 40 M14CP homologues with a conserved mutation at position 381, carrying a K instead of an E [ten Have, unpublished]. Since these sequences do not have paralogues, it is clear that they are functional homologues with a strictly conserved mutation. MSA curation based on the animal-derived hallmark would have left those sequences uncharacterized. For that reason, we aim to identify groups of proteins that exhibit key mutations and that could form novel protein subfamilies within a known family. ANSI is a package for the identification of new subfamilies consisting of sequences hitherto annotated as "non-functional homologues".
Methodology - Proposed Workflow
ANSI consists of a number of consecutive modules (see figure 1).
The first module (A) clusters the homologous sequences into functional and non-functional homologues. This is achieved by scanning the sequences with a PROSITE-formatted inclusion pattern (Patternscan). The pattern can be provided by the user or obtained using Patternbuild. Patternbuild requires an MSA as input and has two modes: (AI) a probabilistic mode, comparable to PRATT and the Gibbs sampler, and (AII) a deterministic mode, in which the user indicates which columns must at least be included and with which values. In the deterministic mode, Patternbuild then only determines the exact pattern definition. The output of this first module is a pair of coupled MSAs, one containing the functional sequences and one containing the non-functional sequences. The second module (B) will then determine whether the non-functional homologues show an overrepresentation of a hallmark-site mutation, based on likelihood substitution models. The third module (C) will identify putative compensating mutations, i.e. mutations at sites other than the hallmark sites. In short, it will identify columns that show a gain in conservation when the MSA of the non-functional homologues is compared with the MSA of the functional homologues. In addition, it will identify conserved mutations in highly conserved columns. Phylogenetic clustering will be included to corroborate whether the mutation is conserved, i.e. whether it is conserved within certain clades. This last step will be aided by the construction of HMMER profiles and the scanning of multiple complete proteomes, which will add a large amount of data to the putative subfamilies.
Conclusions
ANSI is a tool that will assist in the identification of novel subfamilies by identifying over-represented mutations in proteins that belong to a given family (as classified by overall similarity).
Not only will this tool make it possible to identify potentially functional mutations, but it will also make it possible to track the appearance of key mutations across a phylogeny. Module A is being tested. Modules B and C are currently under development.
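As an illustration of the workflow described above, the following is a minimal sketch of two of its steps: splitting sequences with a PROSITE-style inclusion pattern (as in module A) and comparing the two resulting alignments column by column (as in module C). It is not ANSI's implementation. All function names, the toy motif and the toy sequences are hypothetical, the pattern translator covers only a simplified subset of the PROSITE syntax, and Shannon entropy is assumed here as the conservation measure because the abstract does not name one.

```python
import math
import re
from collections import Counter


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE pattern (x, [..], {..}, (n), (n,m)) to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat count.
        core, lo, hi = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        if lo:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


def split_by_pattern(aligned, pattern):
    """Split aligned sequences into those matching the pattern and those that do not."""
    regex = re.compile(prosite_to_regex(pattern))
    # Gaps are removed only for scanning; the aligned strings are kept as they are.
    functional = {n: s for n, s in aligned.items() if regex.search(s.replace("-", ""))}
    non_functional = {n: s for n, s in aligned.items() if n not in functional}
    return functional, non_functional


def column_conservation(msa, col):
    """Conservation of one column: 1 - normalised Shannon entropy (assumed measure)."""
    residues = [s[col] for s in msa if s[col] != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum((c / total) * math.log2(c / total) for c in Counter(residues).values())
    return 1.0 - entropy / math.log2(20)


def conservation_gain(functional, non_functional):
    """Per-column conservation of the non-functional MSA minus that of the functional MSA."""
    return [column_conservation(non_functional, i) - column_conservation(functional, i)
            for i in range(len(functional[0]))]


def conserved_substitutions(functional, non_functional):
    """Columns invariant in both MSAs but carrying a different residue in each."""
    hits = []
    for i in range(len(functional[0])):
        f = {s[i] for s in functional}
        n = {s[i] for s in non_functional}
        if len(f) == 1 and len(n) == 1 and f != n:
            hits.append((i, f.pop(), n.pop()))
    return hits


# Hypothetical toy data: a made-up motif and six made-up aligned sequences.
TOY_PATTERN = "H-x(2)-E-x(2)-E"
TOY_MSA = {
    "f1": "AHGTEVLE", "f2": "AHSTEMLE", "f3": "CHGTEVIE",
    "n1": "AHGTEVLK", "n2": "SHGTEILK", "n3": "THATEVLK",
}

functional, non_functional = split_by_pattern(TOY_MSA, TOY_PATTERN)
gains = conservation_gain(list(functional.values()), list(non_functional.values()))
substitutions = conserved_substitutions(list(functional.values()),
                                        list(non_functional.values()))
```

In this sketch, `gains` holds one value per alignment column (positive where the non-matching group is more conserved), and `substitutions` lists columns where both groups are invariant but differ, which is the kind of signal the E-to-K case at position 381 would produce. The statistical test of module B, based on likelihood substitution models, is not covered.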