RESEARCHERS
XAMENA Eduardo
Conferences and scientific meetings
Title:
A web platform for collaborative semi-automatic OCR Post-processing
Author(s):
MECHACA, ANA LIDIA; MARMANILLO, WALTER GABRIEL; XAMENA, EDUARDO; RAMIREZ-ORTA, JUAN; MAGUITMAN, ANA GABRIELA; MILIOS, EVANGELOS E.
Location:
Buenos Aires (Virtual Event)
Meeting:
Conference; AGRANDA 2021: Simposio Argentino de Grandes Datos - 50 JAIIO: 50th Jornadas Argentinas de Informática. Buenos Aires, Argentina 10/2021; 2021
Organizing institution:
SADIO - Sociedad Argentina de Investigación Operativa
Abstract:
Digital Humanities researchers often rely on software that helps them find non-trivial relationships among characters in historical texts. The source texts that contain such information usually come from OCR-acquired volumes, which carry a high number of errors. This work describes the development of a web platform for OCR post-processing and ground-truth generation. The platform employs machine learning to accurately predict the correct text from noisy OCR strings. The method relies on transformers for character-based denoising language models. An active learning workflow is proposed: users can feed their corrections back to the platform, generating new annotated data for re-training the underlying machine learning correction models.