Title
Using the Web as corpus for self-training text categorization
Authors
RAFAEL GUZMAN CABRERA
MANUEL MONTES Y GOMEZ
Paolo ROSSO
LUIS VILLASEÑOR PINEDA
Access level
Open access
Subjects
Text categorization; Semi-supervised learning; Self-training; Web as corpus; Authorship attribution; Physical-mathematical sciences and earth sciences (CTI); Mathematics (CTI); Computer science (CTI)
Abstract or description
Most current methods for automatic text categorization are based on supervised learning techniques and therefore require a large number of training instances to construct an accurate classifier. To tackle this problem, this paper proposes a new semi-supervised method for text categorization, which combines the automatic extraction of unlabeled examples from the Web with an enriched self-training approach for constructing the classifier. Although language independent, the method is most pertinent for scenarios where large sets of labeled resources do not exist. This could be the case, for instance, in several application domains in non-English languages such as Spanish. The method was evaluated experimentally on three different tasks and in two different languages. The results demonstrate the applicability and usefulness of the proposed method.
Publisher
Springer Science+Business Media
Publication date
2009
Publication type
Article
Publication version
Accepted version
Information resource
Format
application/pdf
Language
English
Audience
Students
Researchers
General public
Suggested citation
Guzmán-Cabrera, R., et al. (2009). Using the Web as corpus for self-training text categorization. Springer Science, Inf. Retrieval (12): 400–415
Source repository
Repositorio Institucional del INAOE