IIB   20738
INSTITUTO DE INVESTIGACIONES BIOLOGICAS
Executing Unit - UE
Conferences and scientific meetings
Title:
The alignment of protein superfamily sequences part 1: Identifying Cluster Specific Subsequences (CSS) in protein Families
Author(s):
M.V.REVUELTA; FACUNDO ORTS; ARJEN TEN HAVE
Location:
Bahía Blanca
Meeting:
Conference; CAB2C; 2015
Organizing institution:
Argentinian Association of Bioinformatics and Computational Biology
Abstract:
Background: Multiple Sequence Alignments (MSAs) are crucial tools in protein bioinformatics. Despite recent advances such as those achieved by MAFFT and Promals3D, MSA construction for complex superfamilies remains problematic and requires rigorous manual correction. Here we present two posters that describe our attempts to develop a protocol and, subsequently, software directed at aligning many (>500) sequences of complex protein superfamilies. Part of the problem is the presence of Cluster Specific Subsequences (CSSs), defined as subsequences present in a subfamily but absent in the rest of the superfamily, which generate gaps and misalignments. The objective is to design and test a pipeline that addresses the CSS problem by identifying CSSs, using two different protein family datasets: Aspartic Proteinases and Sedolisins.

Results: Although the definition of a CSS is straightforward, the resulting problem is not easily defined. Two types of situations regarding CSSs are envisaged. Type 1 CSSs are those where proteins of a subfamily (clade) have a CSS at a structural position where the other superfamily proteins lack sequence. It should be stressed, however, that this is not necessarily reflected by the MSA. Type 2 CSSs, which likely have a low occurrence, are those where two clades have independent CSSs at the same structural position. Type 2 CSSs will have a severe effect on the MSA. Although type 1 CSSs will likely generate many gaps, it is our experience that type 1 CSSs often attract residues from neighboring sites, which also results in poorly aligned blocks. Poorly aligned regions will be identified by means of Information Content (IC) and realigned clade-wise. An increment in the IC (ΔIC) then indicates putative CSSs. To validate whether the increment is significant, we perform a permutation test: ΔIC values for 10,000 N-sized subsequence samples are calculated as the mean ΔIC for the region, before and after realigning. Regions that show a high improvement should fall in the 95% or 99% tail of the ΔIC score distribution of the null model and are then considered confirmed CSSs. Similarly, comparison with null models derived from other random subsequences will indicate whether the CSS contributes significantly to the cluster. For the Aspartic Proteinase set, preliminary analyses show that the pipeline identifies several CSSs that have already been described [1], and a new CSS is confirmed for an AP subfamily that consists of fungal phytopathogen sequences only. The Sedolisins case is currently under investigation.

Conclusion: The ΔIC score is a valuable tool for detecting cluster specific subsequences in protein superfamily MSAs.
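
The ΔIC score and the permutation test described in the Results could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the IC formula, so the sketch assumes Shannon information content over a 20-letter alphabet, weighted by column occupancy so that gappy columns score low. All function names (`column_ic`, `mean_ic`, `delta_ic`, `null_delta_ics`, `empirical_p`) are hypothetical, and `left_pack` is only a toy stand-in for a real clade-wise realignment step such as a call to an external aligner.

```python
import math
import random
from collections import Counter


def column_ic(column, alphabet_size=20):
    """Occupancy-weighted Shannon IC (bits) of one alignment column.

    Assumption: IC = occupancy * (log2(alphabet_size) - entropy), gaps are '-'.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())
    occupancy = n / len(column)
    return occupancy * (math.log2(alphabet_size) - entropy)


def mean_ic(block):
    """Mean column IC over a block of aligned, equal-length sequences."""
    columns = list(zip(*block))
    return sum(column_ic(col) for col in columns) / len(columns)


def delta_ic(block, realign):
    """IC gained by realigning a block; `realign` is a caller-supplied callable."""
    return mean_ic(realign(block)) - mean_ic(block)


def null_delta_ics(msa, width, realign, n_samples=10_000, seed=0):
    """Delta-IC values for randomly placed windows of `width` columns (null model)."""
    rng = random.Random(seed)
    length = len(msa[0])
    values = []
    for _ in range(n_samples):
        start = rng.randrange(length - width + 1)
        window = [row[start:start + width] for row in msa]
        values.append(delta_ic(window, realign))
    return values


def empirical_p(observed, null_values):
    """One-sided empirical p-value with the usual +1 correction."""
    exceed = sum(v >= observed for v in null_values)
    return (exceed + 1) / (len(null_values) + 1)


def left_pack(block):
    """Toy realigner (hypothetical): drop gaps and left-justify each row."""
    width = len(block[0])
    return [row.replace("-", "").ljust(width, "-") for row in block]
```

In this sketch a candidate region would be reported as a putative CSS when `empirical_p(delta_ic(region, realign), null_delta_ics(msa, width, realign))` falls below 0.05 or 0.01, mirroring the 95% and 99% tails mentioned above.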