Title:
ANSI: A tool for the Automatic Identification of Novel and Undescribed Protein Subfamilies.
Author(s):
IRAZOQUI, M; REVUELTA MV; BRUN M; TEN HAVE A
Location:
Rosario
Meeting:
Conference; 4th International Conference of the Iberoamerican Society of Bioinformatics and 4th Congress of the Asociación Argentina de Bioinformática y Biología Computacional; 2013
Organizing institution:
Asociación Argentina de Bioinformática y Biología Computacional.
Abstract:
The accurate prediction of protein function from primary structure has been one of the major objectives in bioinformatics research. One of the first steps in studying a protein superfamily is to build a Multiple Sequence Alignment (MSA). However, given the large number of incorrect gene models and the high levels of variation due to "Non Functional Homologues" (homologues that lack hallmark residues, i.e. sequence characters that are required for a given function), this can be a rather complicated task. If restrictions based on known hallmarks are applied, the quality of the MSA can be improved. This process has a drawback: hallmark residues are mostly biased towards sequences found in model organisms. Curating the MSA in this way may therefore leave out sequences that belong to novel subfamilies that have not yet been described. If sequences with mutations in their known hallmarks are functional, a rather high level of conservation would be expected at those sites. For example, M14 carboxypeptidases (M14CPs) have a strictly conserved H180-E183-R238-H307-E381 motif. Mining 40 completely sequenced fungal genomes identified 40 M14CP homologues with a conserved mutation at position 381, carrying a K instead of an E [ten Have, unpublished]. Since these sequences do not have paralogues, it is clear that they are functional homologues with a strictly conserved mutation. An MSA curation based on the animal-derived hallmark would have left those sequences uncharacterized. For that reason, we aim to identify groups of proteins that exhibit key mutations and that could be part of novel protein subfamilies within a known family.
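The hallmark check described above can be illustrated with a minimal Python sketch. It is not part of ANSI: the function names are hypothetical, and the motif positions are treated as 1-based columns of a toy alignment, which is an assumption made only for illustration.

```python
from collections import Counter

# Hallmark motif quoted in the abstract: position -> expected residue.
# Assumption: positions are 1-based columns of the aligned sequences.
HALLMARKS = {180: "H", 183: "E", 238: "R", 307: "H", 381: "E"}


def hallmark_mutations(aligned_seq, hallmarks):
    """Return labels such as 'E381K' for every hallmark site that differs."""
    mutations = []
    for pos, expected in sorted(hallmarks.items()):
        observed = aligned_seq[pos - 1]  # convert 1-based position to index
        if observed != expected:
            mutations.append(f"{expected}{pos}{observed}")
    return mutations


def count_mutations(alignment, hallmarks):
    """Count how often each hallmark mutation occurs across an alignment.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    counts = Counter()
    for seq in alignment.values():
        counts.update(hallmark_mutations(seq, hallmarks))
    return counts
```

A mutation label that recurs across many sequences without paralogues, such as E381K in the fungal example, is the kind of signal that would flag a candidate subfamily rather than a set of defective sequences.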
ANSI is a package for the identification of new subfamilies consisting of sequences hitherto annotated as "non-functional homologues".

Methodology - Proposed Workflow

ANSI consists of a number of consecutive modules (see figure 1). The first module (A) clusters the homologous sequences into functional and non-functional homologues. This is achieved by scanning the sequences with a Prosite-formatted inclusion pattern (Patternscan). This pattern can be provided by the user or obtained using Patternbuild. Patternbuild requires an MSA as input and has two modes: AI) Probabilistic Mode, comparable to PRATT and the Gibbs sampler, and AII) Deterministic Mode, in which the user indicates which columns must be included and with which values; Patternbuild then only determines the exact pattern definition. The output of this first module is a pair of coupled MSAs, one containing the functional sequences and one containing the non-functional sequences. The second module (B) will then determine whether the non-functional homologues show an overrepresentation of a hallmark site mutation, based on likelihood substitution models. The third module (C) will identify putative compensating mutations, i.e. mutations at sites other than hallmark sites. In short, it will identify columns that show a gain in conservation when the non-functional homologues MSA is compared with the functional homologues MSA. In addition, it will identify conserved mutations in highly conserved columns. Phylogenetic clustering will be included in order to corroborate whether the mutation is conserved, i.e. whether it is conserved within certain clades. This last step will be aided by the construction of HMMER profiles and the scanning of multiple complete proteomes, which will provide a large amount of data on the putative subfamilies.

Conclusions

ANSI is a tool that will assist in the identification of novel subfamilies by identifying overrepresented mutations in proteins that belong to a certain family (as classified by overall similarity). Not only will this tool allow the identification of possible functional mutations, it will also make it possible to track the appearance of the key mutations across a phylogeny. Module A is being tested. Modules B and C are currently under development.
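The split performed by module A, in which sequences are scanned with a Prosite-formatted inclusion pattern and separated into matching and non-matching sets, can be sketched as follows. This is an illustration under stated assumptions, not the Patternscan implementation: `prosite_to_regex` and `split_by_pattern` are hypothetical names, and only a simplified subset of the PROSITE syntax is handled (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, `<`, `>` and the trailing period).

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def split_by_pattern(sequences, pattern):
    """Split named aligned sequences into pattern matches and non-matches.

    Gaps ('-') are removed before scanning; the aligned strings are kept
    as values so both outputs stay in the same coordinate system.
    """
    regex = re.compile(prosite_to_regex(pattern))
    matching, non_matching = {}, {}
    for name, seq in sequences.items():
        target = matching if regex.search(seq.replace("-", "")) else non_matching
        target[name] = seq
    return matching, non_matching
```

Calling `split_by_pattern(seqs, "H-x(2)-E.")` on a dictionary of aligned sequences returns two dictionaries that play the role of the functional and non-functional MSAs; the pattern shown is a toy example, not the M14CP hallmark pattern.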