Atribución de autoría utilizando distintos tipos de características a través de una nueva representación

ADRIAN PASTOR LOPEZ MONROY

Title

Author

ADRIAN PASTOR LOPEZ MONROY

Contributor

MANUEL MONTES Y GOMEZ (Thesis advisor)

LUIS VILLASEÑOR PINEDA (Thesis advisor)

Access Level

Open Access

License

http://creativecommons.org/licenses/by-nc-nd/4.0

Subjects

Attribute grammars

Computer applications

Classification

Automatic programming

Assembly languages

PHYSICAL-MATHEMATICAL SCIENCES AND EARTH SCIENCES - (CTI)

MATHEMATICS - (CTI)

COMPUTER SCIENCE - (CTI)

Abstract or description

The amount of information available on the Web is huge and constantly growing. Much of it is plain text written by users in different contexts, such as social networks, forums, blogs, and emails. Automated tools are therefore needed to assist in analyzing this information. One task that has gained interest in recent years is Authorship Attribution (AA). In general, the main goal of AA is to automatically identify the documents belonging to one or more authors; applications include terrorist message verification, spam filtering, and copyright disputes. Different algorithms and strategies for addressing AA have been proposed, especially machine learning approaches, which build classifiers from a set of training documents. Unfortunately, the available document set is not always ideal: in some scenarios the instances are few, imbalanced, or both. In these situations, the textual features that best represent the style of each author, together with the representation of the documents, play a key role in the performance of machine learning algorithms. This thesis proposes an alternative method for AA that takes advantage of different types of attributes through a new representation. It follows the idea that different types of attributes (e.g., character n-grams, punctuation marks) provide different perspectives on the style of documents and therefore of authors. In particular, we propose: i) using sets of attributes that can retain the style of the authors, ii) characterizing textual features with a representation that considers the relationships between documents and authors, and iii) proposing alternatives for integrating the representations of different types of attributes into a classification model. The evaluation is performed on the c50 corpus, which has been used in several AA works. In our experiments we measure classification accuracy, considering scenarios with little training data and imbalanced classes for a set of authors. The experimental results show that the proposed method and our representation are a good alternative for AA, even in settings where the training data is limited or imbalanced.

Today, the immense amount of information available through the internet keeps growing. Much of it is text written by users in different contexts, for example social networks, forums, blogs, and emails. This creates the need for automatic mechanisms that facilitate the analysis of such information. One of the problems that has been gaining interest in recent years is Authorship Attribution (AA). Broadly, AA consists of automatically identifying the documents of one or more authors. For example, there is interest in developing methods to deal with terrorist message verification, spam filtering, and copyright disputes. Different algorithms and strategies have been proposed to carry out AA, especially machine learning approaches, which aim to build classifiers from a set of training documents. Unfortunately, an ideal set of documents is not always available; that is, there are scenarios where the data are scarce or imbalanced. In these situations, the textual attributes that best represent the style of each author, as well as the representation of the documents, play a fundamental role in the performance of learning algorithms. This thesis proposes an alternative method for AA that exploits different types of attributes by means of a new representation. It follows the idea that different types of attributes (e.g., character n-grams, punctuation marks) provide different perspectives on the style of the documents and, consequently, of the authors. In particular, we propose: i) using sets of attributes that can retain the style of the authors, ii) characterizing them with a representation that considers the relationships between documents and authors, and iii) proposing alternatives for integrating the representations of different types of attributes into a classification model. The evaluation is carried out on the c50 corpus, which has been used in several AA works. During the evaluation we use accuracy to measure classification, considering scenarios with little and imbalanced training data.
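As a rough illustration of the two attribute types named in the abstract (character n-grams and punctuation marks), the following minimal Python sketch counts each of them for a single text. The function names, the choice of n = 3, the use of ASCII punctuation, and the sample sentence are all assumptions made for this example; the sketch does not reproduce the representation or the classification model proposed in the thesis.

```python
from collections import Counter
import string


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def punctuation_counts(text: str) -> Counter:
    """Count ASCII punctuation marks in a text (hypothetical helper)."""
    return Counter(ch for ch in text if ch in string.punctuation)


# Illustrative sample only; not taken from the c50 corpus.
sample = "Well, well... who wrote this?"

# Two separate "views" of the same text, one per attribute type.
ngram_view = char_ngrams(sample, n=3)
punct_view = punctuation_counts(sample)
```

Each counter gives a different stylistic view of the same document, which is the intuition behind combining several attribute types.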

Publisher

Instituto Nacional de Astrofísica, Óptica y Electrónica

Publication date

2012

Publication type

Master's thesis

Publication version

Accepted version

Information resource

http://inaoe.repositorioinstitucional.mx/jspui/handle/1009/755

Format

application/pdf

Language

Spanish

Audience

Students

Researchers

General public

Suggested citation

Lopez-Monroy A.P.

Source repository

Repositorio Institucional del INAOE

Downloads

931
