RESEARCHERS
LLERA Andrea Sabina
conferences and scientific meetings
Title:
GOBoot: towards a robust SEA Analysis
Author(s):
CRISTOBAL FRESNO; ANDREA LLERA; MARIA ROMINA GIROTTI; MARIA PIA VALACCO; JUAN ANTONIO LÓPEZ; LAURA ZINGARETTI; LAURA PRATO; OSVALDO PODHAJCER; MONICA BALZARINI; FEDERICO PRADA; ELMER FERNÁNDEZ
Place:
Oro Verde
Meeting:
Conference; 3° Congreso Argentino de Bioinformática y Biología Computacional (A2B2C); 2012
Abstract:
Background
Set enrichment analysis (SEA) is the traditional approach to Gene Ontology (GO) analysis, owing to its long track record and its availability in commercial and public tools and websites [1-2]. In the GO structure, terms are evaluated statistically one at a time: a term is declared enriched if the observed proportion of differentially expressed proteins/genes differs from the proportion expected given a background reference (BR). An appropriate BR is difficult to devise, and GO results tend to depend on it, so a term may or may not appear enriched depending on the BR used. Here, a new method is presented that evaluates the enrichment robustness of nodes by means of bootstrap perturbations of the BR. Each node thus receives a "power score": high-stability nodes are candidates to be explored, while spuriously enriched terms are left out of the analysis.

Methods
A resampling technique was implemented to provide a stability (power) measure for SEA, in order to evaluate how effectively a given BR identifies truly enriched terms. Simulated BRs were generated by bootstrapping a BR, keeping each simulated BR as close as possible to the length of the original BR (in order to introduce only small perturbations in the length of both the GO members and the BR). The power value was calculated as the percentage of times a term is enriched over a large number of simulated BRs; higher power therefore implies greater stability of the term. DAVID [3] was the tool chosen to test SEA in one proteomic experiment (Girotti et al., unpublished) and three microarray experiments freely available at Gene Expression Omnibus [4-6] under different BRs: the genome of the species (BR-I), the chip gene list (BR-II, if possible) and a user-defined reference (BR-III [7]). BR-III was the reference used for power calculation (although the method is not restricted to it), as it is considered the one that fulfills the statistical assumption.
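
The power calculation described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not call DAVID. The function names, the toy gene identifiers, the significance threshold (0.05) and the number of resamples (200) are all assumptions made for illustration. Enrichment is assessed here with a one-sided Fisher exact test that treats the study list and the simulated BR as two separate lists, and each simulated BR is drawn with replacement at the original length with repeated genes kept; both choices are simplifications that the abstract does not specify.

```python
import random
from math import comb


def upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable (one-sided Fisher exact test)."""
    top = min(successes, draws)
    hits = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, top + 1))
    return hits / comb(population, draws)


def enriched_terms(study, background, annotations, alpha=0.05):
    """Terms over-represented in `study` relative to `background`.

    `background` is a list and may contain repeated genes (bootstrap draws).
    The two lists are compared as independent samples (an assumption).
    """
    n, N = len(study), len(background)
    result = set()
    for term, members in annotations.items():
        k = sum(g in members for g in study)        # study genes in the term
        K = sum(g in members for g in background)   # background genes in the term
        if k and upper_tail(k, n + N, k + K, n) < alpha:
            result.add(term)
    return result


def power_scores(study, background, annotations, n_boot=200, alpha=0.05, seed=0):
    """Percentage of bootstrapped backgrounds in which each term is enriched."""
    rng = random.Random(seed)
    hits = {term: 0 for term in annotations}
    for _ in range(n_boot):
        # Resample the BR with replacement, keeping the original length.
        simulated = rng.choices(background, k=len(background))
        for term in enriched_terms(study, simulated, annotations, alpha):
            hits[term] += 1
    return {term: 100.0 * count / n_boot for term, count in hits.items()}


# Toy data (hypothetical): 200 background genes and two GO-like terms.
background = [f"g{i}" for i in range(200)]
annotations = {
    "term_A": set(background[:10]),    # small term, heavily hit by the study list
    "term_B": set(background[:100]),   # broad term, hit at the background rate
}
study = background[:8] + background[100:108]
scores = power_scores(study, background, annotations)
print(scores)
```

In this toy setting the small, heavily hit term stays enriched across nearly all simulated BRs, while the broad term hit at the background rate almost never does, which is the contrast the power score is meant to expose.
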
Boxplots of the enriched terms of the main GO category (Biological Process) were plotted, using a Venn-diagram color pattern to contrast the enrichment with that obtained under the typical BR selections (BR-I or BR-II).

Results
Figure 1 shows that the power boxplots of all enriched nodes (in white) lie above 40% for most datasets. Almost all nodes found with BR-III reached power values above 50%. In contrast, nodes that appeared enriched when bootstrapping BR-III and had previously been found with BR-I, or shared by BR-I and BR-II, showed power values below 40% in all cases. This suggests that the enriched nodes found with BR-III were highly consistent and potentially meaningful. These enriched terms were validated against the literature.

Figure 1: Biological Process power boxplots of bootstrapped enriched nodes, coded by the overlapping source obtained with the full-length BR (BR-I to BR-III). The white boxplot corresponds to all bootstrapped enriched nodes.

Discussion
Stability analysis showed that non-consensus nodes identified only with BR-I and/or BR-II are unstable, suggesting spurious enrichment. In contrast, enriched terms found with BR-III showed high power, suggesting greater confidence (robustness) and making these terms good candidates for further exploration. We found that the robust terms were biologically relevant to the experimental setting [7]. In this context, the proposed tool provides additional information (power values) that guides ontology exploration and reveals terms blurred by the traditional approaches, assisting researchers in ontology analysis.

References
[1] P. Khatri, S. Drăghici, Ontological analysis of gene expression data: current tools, limitations, and open problems, Bioinformatics, 21, 3587-3595 (2005)
[2] D. Wei Huang et al., Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Res., 37, 1-13 (2009)
[3] I. Rivals, L. Personnaz, L. Taing, M-C. Potier, Enrichment or depletion of a GO category within a class of genes: which test?, Bioinformatics, 23, 401-407 (2007)
[4] L. M. Packer et al., Gene expression profiling in melanoma identifies novel downstream effectors of p14ARF, Int. J. Cancer, 121, 784-790 (2007)
[5] A. Spira et al., Effects of cigarette smoke on the human airway epithelial cell transcriptome, Proc. Natl. Acad. Sci. U. S. A., 101, 10143-10148 (2004)
[6] S. McGrath-Morrow et al., Impaired lung homeostasis in neonatal mice exposed to cigarette smoke, Am. J. Respir. Cell. Mol. Biol., 38, 393-400 (2008)
[7] C. Fresno, A. S. Llera, M. R. Girotti, M. P. Valacco, J. A. López, O. L. Podhajcer, M. G. Balzarini, F. Prada, E. A. Fernández, The Multi-Reference Contrast Method: facilitating set enrichment analysis, Comput. Biol. Med., 42, 188-194 (2012)