ISISTAN   23985
INSTITUTO SUPERIOR DE INGENIERIA DEL SOFTWARE
Executing Unit (UE)
Congresses and scientific meetings
Title:
Is my model biased? Exploring unintended bias in misogyny detection tasks
Author(s):
Daniela Godoy; Antonela Tommasel
Location:
Montreal
Meeting:
Workshop; AIofAI'21: 1st Workshop on Adverse Impacts and Collateral Effects of Artificial Intelligence; 2021
Abstract:
Although hate speech detection has been extensively tackled in the literature as a classification task, recent works have raised concerns about the robustness of such systems. Understanding hate speech remains a significant challenge, both for creating reliable datasets and for automating its detection. An essential goal for detection techniques is to ensure that they are not unduly biased towards or against particular norms of offense; for example, models should not reproduce common societal biases that associate certain terms with hateful content. This phenomenon is known as unintended bias: models learn spurious associations with particular words (commonly called identity terms), which leads them to classify content as hateful merely because it contains one of these words. In this work, we tackle the issue of measuring and explaining the sensitivity of models to the presence of identity terms during training. To this end, focusing on a misogyny detection task, we study how models behave in the presence of the identified terms, and whether these terms bias the performance of the trained models.
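To make the notion of unintended bias concrete, the sketch below illustrates one probing technique commonly used in the unintended-bias literature: scoring otherwise-neutral template sentences that differ only in the identity term they contain. This is a minimal illustration, not the method of the paper; the identity terms, the templates, and the toy_biased_scorer stand-in are all illustrative assumptions. In practice, the scorer would be the trained misogyny classifier under evaluation, and high scores on neutral sentences would indicate that the model reacts to the identity term itself rather than to the context.

from typing import Callable

# Illustrative identity terms and neutral templates (assumptions, not the
# paper's actual lexicon or dataset).
IDENTITY_TERMS = ["women", "girls", "feminists"]
TEMPLATES = [
    "I really admire {term}.",
    "{term} wrote an interesting article.",
    "Some {term} attended the conference.",
]

def identity_term_sensitivity(score: Callable[[str], float]) -> dict:
    """For each identity term, average the model's 'misogynous' score over
    otherwise-neutral template sentences. High per-term averages on neutral
    text suggest the model keys on the term itself (unintended bias)."""
    results = {}
    for term in IDENTITY_TERMS:
        sentences = [t.format(term=term) for t in TEMPLATES]
        results[term] = sum(score(s) for s in sentences) / len(sentences)
    return results

# Toy stand-in that mimics a biased classifier: it raises the score whenever
# an identity term is present, regardless of context. A real evaluation would
# plug in the trained model's scoring function here.
def toy_biased_scorer(text: str) -> float:
    return 0.9 if any(t in text.lower() for t in IDENTITY_TERMS) else 0.1

if __name__ == "__main__":
    for term, avg in identity_term_sensitivity(toy_biased_scorer).items():
        print(f"{term}: mean score on neutral templates = {avg:.2f}")

Running the sketch prints a high mean score (0.90) for every identity term on purely neutral sentences, which is exactly the signature of unintended bias the abstract describes: the classification is driven by the presence of the term, not by hateful content.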