RESEARCHERS
FIORINI DE MAGALHAES Ivan Luiz
Conferences and scientific meetings
Title:
Extraction of sentence-specific co-occurrences of proteins using text-mining.
Author(s):
IVAN LUIZ FIORINI DE MAGALHÃES; ADRIANO BARBOSA-SILVA; JOSÉ MIGUEL ORTEGA
Location:
São Paulo, São Paulo
Meeting:
Conference; X-Meeting 2007: AB3C III International Conference; 2007
Organizing institution:
Associação Brasileira de Bioinformática e Biologia Computacional
Abstract:
The data explosion in the biological sciences is reflected in the number of papers available in current literature databases. For example, the 17 million articles indexed in the PubMed database make it difficult for scientists to retrieve all the details about their research field and to keep up to date with new concepts. In this context, information extraction methods serve as tools to mine this information. Natural language processing, co-occurrence analysis and machine learning methods have been used successfully in text-mining approaches [1]. We present here a method to reconstruct biological networks from the co-occurrence of pairs of protein names, together with a biointeraction term, in the same sentence of biological abstracts. As a case study, we applied the method to rebuild a network of proteins related to disease resistance in plants. The query “disease AND resistance AND proteins AND plants” on the PubMed database retrieved abstract entries formatted in XML (eXtensible Markup Language). These abstracts were passed as input to a protein name tagging program, NLProt 2.0 [2], which enclosed each protein name between “<n>” and “</n>” tags and generated a report listing the identified protein names, the corresponding organism, and their position in the text. Only protein names with an assigned UniProt ID were considered. A discovery script called ProCON (Protein Cooccurrences in Networks), written in PHP-CLI (version 5.1), was developed to find pairs of co-occurring protein names in the sentences of NLProt-tagged abstracts. ProCON runs in several steps and is the middle layer between the software for protein tagging (NLProt) and for network visualization (Medusa), both used in our text-mining approach. ProCON uses dictionaries of protein names and of biointeraction terms, the latter including synonyms for each verb, to validate the tagging.
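To make the tagging step concrete, the following Python sketch shows how names enclosed in “<n>” and “</n>” might be collected from a tagged sentence and then filtered by UniProt ID. It is not the authors' PHP code: the function names, the lookup table and its placeholder ID are assumptions made only for illustration, and the real NLProt report format is not reproduced here.

```python
import re

# Assumption: each protein name is marked as <n>NAME</n>, as the abstract describes.
TAG = re.compile(r"<n>(.*?)</n>")


def tagged_names(text):
    """Return the names marked with <n>...</n>, in order of appearance."""
    return [m.group(1).strip() for m in TAG.finditer(text)]


def keep_with_uniprot(names, uniprot_ids):
    """Keep only names present in a (hypothetical) name -> UniProt ID table."""
    return [name for name in names if name in uniprot_ids]


if __name__ == "__main__":
    sentence = "<n>SGT1</n> associates with <n>SKP1</n>"
    # Placeholder ID, not a real UniProt accession.
    lookup = {"SGT1": "PLACEHOLDER_ID"}
    names = tagged_names(sentence)
    print(names)
    print(keep_with_uniprot(names, lookup))
```

Names without an entry in the lookup table are simply dropped, mirroring the rule that only names with an assigned UniProt ID are considered.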
The algorithm scans the abstracts word by word and records a biointeraction term occurring between two tagged protein names if no punctuation marks are found between them. In addition, we extended the algorithm to select sentences in which the biointeraction term appears up to three words after the second protein name of the pair. The script stores the filtered interactions and uses them to create another file: the interaction network input to be visualized in Medusa 1.02 [3]. Medusa is a Java-based program that displays the co-occurrences as a graph, connecting nodes (protein names) through edges (biointeraction terms) and assigning each edge a confidence value, in our case the total number of non-distinct biointeractions between two protein names in the whole abstract dataset. Finally, an HTML file with highlighted co-occurrences is generated.

The query returned 805 abstracts, in which NLProt identified 6,858 protein name mentions, corresponding to 1,990 distinct protein names. Only 333 of these names were considered further, because they had an assigned UniProt ID. Fifty co-occurring protein pairs were found in 41 of the 805 abstracts (5%). Precision (the proportion of extracted co-occurrences that are correct) and recall (the proportion of existing co-occurrences that were extracted) were ~80% and ~12.5%, respectively. Recall was calculated on a sample of ~10% (80 out of 805) of the abstracts. Frequently, interacting protein pairs are correctly derived from co-occurrence sentences with the syntactic structures ‘protein + verb + protein’ and ‘protein + protein + verb’, as in the sentences ‘SGT1 associates with SKP1...’ (PMID: 11847307) and ‘EDS1 and PAD4 are both required...’ (PMID: 11574472), respectively. However, wrong protein name assignments may lead to incorrect interacting pairs, as in the example ‘Psm expressing the avirulence gene...’ (PMID: 11069695), where the term “Psm” refers to a bacterium (P. syringae pv. maculicola) rather than a protein.
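The sentence scan and the edge confidence value described in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ProCON implementation: the tiny term dictionary, the tokenizer, and all function names are hypothetical. Two details the abstract leaves open are also assumed here: that the "up to three words after" rule is tried only when no term lies between the two names, and that the three-word window stops at punctuation.

```python
import re
from collections import Counter

# Tagged protein name, word (hyphens allowed), or single punctuation mark.
TOKEN = re.compile(r"<n>(.*?)</n>|([\w-]+)|([^\w\s])")

# Assumption: a tiny illustrative dictionary of biointeraction terms.
INTERACTION_TERMS = {"associates", "binds", "interacts", "required"}


def tokenize(sentence):
    """Split a tagged sentence into (kind, value) tokens."""
    tokens = []
    for m in TOKEN.finditer(sentence):
        if m.group(1) is not None:
            tokens.append(("protein", m.group(1)))
        elif m.group(2) is not None:
            tokens.append(("word", m.group(2).lower()))
        else:
            tokens.append(("punct", m.group(3)))
    return tokens


def first_term(tokens):
    """Return the first biointeraction term before any punctuation, else None."""
    for kind, value in tokens:
        if kind == "punct":
            return None
        if kind == "word" and value in INTERACTION_TERMS:
            return value
    return None


def find_cooccurrences(sentence, window=3):
    """Return (protein, term, protein) triples for consecutive tagged names."""
    tokens = tokenize(sentence)
    positions = [i for i, (kind, _) in enumerate(tokens) if kind == "protein"]
    hits = []
    for a, b in zip(positions, positions[1:]):
        between = tokens[a + 1:b]
        if any(kind == "punct" for kind, _ in between):
            continue  # punctuation between the two names: skip the pair
        term = first_term(between)
        if term is None:
            # 'protein + protein + verb': look up to `window` tokens ahead.
            term = first_term(tokens[b + 1:b + 1 + window])
        if term is not None:
            hits.append((tokens[a][1], term, tokens[b][1]))
    return hits


def edge_weights(sentences):
    """Count non-distinct co-occurrences per unordered protein pair."""
    weights = Counter()
    for sentence in sentences:
        for a, _term, b in find_cooccurrences(sentence):
            weights[tuple(sorted((a, b)))] += 1
    return weights


if __name__ == "__main__":
    examples = [
        "<n>SGT1</n> associates with <n>SKP1</n>",
        "<n>EDS1</n> and <n>PAD4</n> are both required",
    ]
    for sentence in examples:
        print(find_cooccurrences(sentence))
    print(edge_weights(examples))
```

The counts returned by `edge_weights` play the role of the per-edge confidence value; writing them out in the input format expected by Medusa is left out, since that format is not described here.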
Sometimes spurious interacting pairs are assigned because of the syntactic complexity of the sentence. For instance, in the sentence ‘Silencing of HSP90 but not of PP5 completely blocked cell...’ (PMID: 15998314), “HSP90” and “PP5” do not interact, despite co-occurring in the same sentence along with a biointeraction term (“blocked”). The relationship graph resulting from the text-mining analysis of the 41 abstracts displaying co-occurrences had 57 nodes and 43 edges. Clusters related to the PTO, RPS2 and NPR1 signaling pathways were represented separately, surrounded by less prominent satellite networks. This work presents the initial results of an open-source text-mining resource under development. Information retrieval from biological abstracts using co-occurrence analysis has supported the generation of new hypotheses in several fields of science. Our tool offers this capability, allowing scientists to mine a large body of work in their specific field in less time by focusing on the co-occurrence of their proteins/genes of interest in sentences of the scientific literature. Supplementary material is available at http://biodados.icb.ufmg.br.

References
[1] Jensen, L.J., Saric, J., Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet. (2006) 7:119-129.
[2] Mika, S., Rost, B. NLProt: extracting protein names and sequences from papers. Nucl. Acids Res. (2004) 32:W634-W637.
[3] Hooper, S.D., Bork, P. Medusa: a simple tool for interaction graph analysis. Bioinformatics. (2005) 21(24):4432-4433.