ICIC   25583
INSTITUTO DE CIENCIAS E INGENIERIA DE LA COMPUTACION
Executing Unit (UE)
Conferences and scientific meetings
Title:
Language modeling tools for massive historical OCR post-processing
Author(s):
XAMENA, EDUARDO; MAGUITMAN, ANA GABRIELA
Location:
Buenos Aires
Meeting:
Symposium; VI Simposio Argentino de Ciencia de Datos y GRANdes DAtos (AGRANDA 2020); 2020
Organizing institution:
Sociedad Argentina de Informática e Investigación Operativa (SADIO)
Abstract:
Nowadays, a large number of historical documentary collections are available that have not yet been exploited to extract information. Many efforts are being made to digitize these volumes and make them available on digital platforms. However, various obstacles arise when processing their content. Because of the deterioration of the documents and other factors, such as differing dialects and language variants, the quality of the digitizations is usually low. NLP tools make it possible to increase the quality of the texts. The current proposal consists of employing NLP tools, particularly neural language models, to process the output of different OCR mechanisms. Important improvements in text quality are expected, as has been the case in many related tasks. The ultimate purpose of this work is to use the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms.