RESEARCHERS
GARRIDO Alejandra
Book chapters
Title:
Flexible Detection of Similar DOM Elements
Author(s):
GRIGERA, JULIÁN; GARDEY, JUAN CRUZ; ROSSI, GUSTAVO; GARRIDO, ALEJANDRA
Book:
Web Information Systems and Technologies. Lecture Notes in Business Information Processing
Publisher:
Springer
References:
Year: 2023; pp. 174-195
Abstract:
Different research fields related to the web require detecting similarity between DOM elements. In the field of information extraction, many approaches have emerged to extract structured data from web documents, most of which require comparing sample documents to extract their underlying structure. Other application fields, such as web augmentation or transcoding, also require analyzing structural similarity, but on UI components with smaller structures than full documents, which makes them unsuitable for the algorithms generally used in information extraction. Instead, these approaches tend to rely on the DOM elements’ location, but this is not robust to structural changes in the document and cannot locate similar elements placed in different positions. In this paper, we present two flexible algorithms to measure similarity between DOM elements by using a mixed approach that considers both the elements’ location and inner structure, together with a wrapper induction technique. We evaluated our algorithms against other known approaches in the literature by comparing how they cluster a dataset of 1200+ DOM elements, using a manual clustering as ground truth. Results show that both proposed algorithms outperform all baseline ones. The proposed algorithms run in linear time, so they are faster than most approaches that analyze structural similarity.