Title

Using the Web as corpus for self-training text categorization

Authors

RAFAEL GUZMAN CABRERA

MANUEL MONTES Y GOMEZ

Paolo ROSSO

LUIS VILLASEÑOR PINEDA

Access level

Open access

Abstract or description

Most current methods for automatic text categorization are based on supervised learning techniques and therefore require a large number of training instances to construct an accurate classifier. To tackle this problem, this paper proposes a new semi-supervised method for text categorization, which combines the automatic extraction of unlabeled examples from the Web with an enriched self-training approach for constructing the classifier. Although language independent, the method is most relevant in scenarios where large sets of labeled resources do not exist. This is the case, for instance, in several application domains for languages other than English, such as Spanish. The method was evaluated experimentally on three different tasks and in two different languages. The results demonstrate the applicability and usefulness of the proposed method.

Publisher

Springer Science+Business Media

Publication date

2009

Publication type

Article

Publication version

Accepted version

Format

application/pdf

Language

English

Audience

Students

Researchers

General public

Suggested citation

Guzmán-Cabrera, R., et al. (2009). Using the Web as corpus for self-training text categorization. Springer Science, Inf. Retrieval (12): 400–415.

Source repository

Repositorio Institucional del INAOE

Downloads

284
