Predicting human-like labels from images is an important task in many real-world applications that involve understanding and organizing visual content at large scale. However, training visual recognition models from scratch requires large amounts of labeled data and is time-consuming. For this reason, most computer vision practitioners rely on pre-trained models that are either fine-tuned [3] or used as a generic feature extractor [11] on top of which a more specific model is built. For image classification, the outputs of these models are commonly aligned with the one thousand categories of ILSVRC [10]. The categories predicted by such models, however, can be overly specific, because ILSVRC categories correspond to leaf nodes in the WordNet lexical ontology [7]. In this work, we aim to develop a generic tagging system that leverages existing models without the need to fine-tune them for a specific end task. Our work is closely related to the problem of entry-level category prediction [4, 8], i.e., predicting the categories that people would use to name objects.
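To make the specificity issue concrete, the following minimal sketch shows one simple way to turn leaf-level predictions into scores for more generic categories: each leaf probability is added to all of its ancestors in the hierarchy. This is an illustration under stated assumptions, not the method developed in this work. The hierarchy `PARENT`, the helper names `ancestors` and `roll_up`, and the probability values are all hypothetical; a real system would read the hierarchy from WordNet and the probabilities from a pre-trained classifier.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent), standing in for WordNet.
PARENT = {
    "beagle": "dog",
    "pug": "dog",
    "dog": "animal",
    "tabby": "cat",
    "cat": "animal",
}


def ancestors(label):
    """Yield the label itself, then each ancestor up to the root."""
    while label is not None:
        yield label
        label = PARENT.get(label)


def roll_up(leaf_probs):
    """Add each leaf probability to the leaf and to all of its ancestors."""
    scores = defaultdict(float)
    for leaf, prob in leaf_probs.items():
        for node in ancestors(leaf):
            scores[node] += prob
    return dict(scores)


# Made-up leaf-level output of a classifier, for illustration only.
leaf_probs = {"beagle": 0.40, "pug": 0.35, "tabby": 0.25}
scores = roll_up(leaf_probs)
```

In this toy case the model is uncertain between two dog breeds, yet the aggregated score for the generic node "dog" is higher than that of either leaf, which is the kind of label a person would more plausibly use.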