IMASL   20939
INSTITUTO DE MATEMATICA APLICADA DE SAN LUIS "PROF. EZIO MARCHI"
Executing Unit - UE
Conferences and scientific meetings
Title:
Is it necessary to read the entire document to classify?
Author(s):
JUAN MARTÍN LOYOLA; MARCELO LUIS ERRECALDE
Place:
San Luis
Meeting:
Conference; PyData San Luis; 2017
Organizing institution:
Universidad Nacional de San Luis
Abstract:
Classification is a widely studied problem in supervised learning. Nonetheless, some scenarios have received little attention despite their applicability. One such scenario is early text classification, which deals with the development of predictive models that determine the class a document belongs to as soon as possible. Here a document is assumed to be processed sequentially, starting at the beginning and reading its parts one by one. In this context, the goal is to make predictions with as little information (that is, as early) as possible. The importance of this variant of the classification problem is evident in tasks like sexual predator detection, where one wants to identify an offender as early as possible.

It is important to note that the early text classification problem consists of two related and complementary tasks. On the one hand, there is the task of classification with partial information, which consists of obtaining an efficient predictive model when only the part of the document read sequentially up to a certain point in time is available. The emphasis in this case is on determining which classification methods are more likely to achieve performance comparable to that obtained when classifying with the entire document. On the other hand, there is the task of deciding the moment of classification, that is, at which point in time one can stop reading and classify with some degree of confidence that the prediction will be correct [1].

Here, we focus on the problem of classification with partial information, comparing the performance of different predictive models on the R8 dataset provided by Cachopo in [2]. We would like to know how much time (measured as the percentage of the document read) it takes to correctly classify most of the documents, and to find out which model does so earliest.
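
As an illustration of classification with partial information, the following is a minimal sketch rather than the method used in the talk. It assumes a hand-written multinomial Naive Bayes classifier and a tiny made-up corpus (not R8); the label names and helper names such as `read_prefix` and `accuracy_by_fraction` are hypothetical. The model is trained on complete documents and then evaluated on prefixes of the test documents, so that accuracy can be compared across the percentage of the document read.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on whole documents."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict_nb(model, tokens):
    """Return the most probable class for a (possibly partial) token list."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                count = word_counts[label][tok] + 1
                score += math.log(count / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def read_prefix(text, fraction):
    """Simulate sequential reading: keep the first `fraction` of the tokens."""
    tokens = text.lower().split()
    n = max(1, math.ceil(fraction * len(tokens)))
    return tokens[:n]


def accuracy_by_fraction(model, docs, labels, fractions):
    """Accuracy of the model when only a prefix of each document is read."""
    results = {}
    for f in fractions:
        hits = sum(
            predict_nb(model, read_prefix(d, f)) == y
            for d, y in zip(docs, labels)
        )
        results[f] = hits / len(docs)
    return results


# Toy corpus invented for this sketch; it is NOT the R8 dataset.
train_docs = [
    "company reports higher quarterly profit and dividend",
    "net profit rose as quarterly earnings beat forecasts",
    "firm agrees to acquire rival in cash merger deal",
    "board approves takeover bid and merger with rival",
]
train_labels = ["earn", "earn", "acq", "acq"]
test_docs = [
    "the group said quarterly profit and dividend rose",
    "the group said it will acquire a rival in a merger",
]
test_labels = ["earn", "acq"]

model = train_nb(train_docs, train_labels)
results = accuracy_by_fraction(model, test_docs, test_labels, [0.25, 0.5, 0.75, 1.0])
for fraction, acc in results.items():
    print(f"{fraction:.0%} of document read -> accuracy {acc:.2f}")
```

In this toy setting the opening words of both test documents are uninformative, so the prediction made after reading a quarter of each document falls back on the class prior. Running the same loop with several classifiers on a real dataset would show which one reaches its full-document accuracy earliest.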