Detección automática de plagio basada en la distinción y fragmentación del texto reutilizado

JOSE FERNANDO SANCHEZ VEGA

Title

Author

JOSE FERNANDO SANCHEZ VEGA

Contributor

LUIS VILLASEÑOR PINEDA (Thesis advisor)

Access level

Open access

License

http://creativecommons.org/licenses/by-nc-nd/4.0

Subjects

Artificial intelligence; Natural languages; Text analysis; Physical and mathematical sciences and earth sciences (CTI); Mathematics (CTI); Computer science (CTI)

Abstract or description

A large number of digital documents can now be consulted easily; these large collections (whether virtual libraries or the public Internet) contain works covering a wide variety of topics from a huge diversity of approaches. Alongside this boom in "easy" information there has been a resurgence of text reuse, and the problem is that this reuse is unscrupulous: content is used without acknowledging that it comes from original works, which separates the material from its true authors and denies them the corresponding credit. Such misuse of information constitutes a theft of intellectual material known as plagiarism. Plagiarism detection is the natural response to the imbalance that information technology has created to the detriment of authors who continue to produce original material. It is important to attend to these sectors, because this is where knowledge is produced and communicated.

In automatic plagiarism detection (APD), a document (one suspected of containing some kind of plagiarism, or one to be verified as free of it) is automatically compared by a computer with a particular source to assess whether it is plagiarized.

APD techniques typically perform detection by measuring the amount of text shared between the two documents. The main difficulty in this task is that not all shared text is due to plagiarism, because thematic or stylistic coincidences also produce portions of common text. To address this difficulty we propose a representation with attributes that capture the fragmentation and the distinction of the shared text. Fragmentation is captured with a series of attributes that count the shared text sequences, each attribute specializing in a particular sequence length. Distinction is used to weight each shared sequence; this weighting measures both the thematic relevance and the usability (how much of the text was used by the potential plagiarist) of the shared text.

In addition, to deal with plagiarized text that has been modified to avoid detection, we propose a new model that increases the detection rate of text taken from the original document. We improve certain flexible criteria used in the search for reused text, considering not only exact copies but also some possible modifications.
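
The fragmentation attributes described in the abstract (one counter per shared-sequence length) can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `fragmentation_attributes`, the `max_len` cap, whitespace tokenization, and the use of Python's `difflib.SequenceMatcher` to find shared word sequences are all assumptions made for illustration. The distinction weighting (thematic relevance and usability) is not sketched, because the abstract does not give its formulas.

```python
from collections import Counter
from difflib import SequenceMatcher


def fragmentation_attributes(suspicious: str, source: str, max_len: int = 5) -> dict:
    """Count shared word sequences, one attribute per sequence length.

    Hypothetical helper: sequences longer than max_len are pooled into
    the max_len bucket so the attribute vector has a fixed size.
    """
    # Assumption: lowercase whitespace tokens stand in for real tokenization.
    a = suspicious.lower().split()
    b = source.lower().split()

    # SequenceMatcher returns non-overlapping, in-order matching blocks;
    # reordered copies are therefore missed by this simple sketch.
    matcher = SequenceMatcher(None, a, b, autojunk=False)

    counts = Counter()
    for block in matcher.get_matching_blocks():
        if block.size > 0:  # the final block is a zero-length sentinel
            counts[min(block.size, max_len)] += 1

    return {n: counts.get(n, 0) for n in range(1, max_len + 1)}


suspicious = "the quick brown fox jumps over a sleeping dog"
source = "a quick brown fox jumps high while the dog sleeps"
print(fragmentation_attributes(suspicious, source))
# {1: 1, 2: 0, 3: 0, 4: 1, 5: 0}
```

In this example the two texts share one four-word sequence ("quick brown fox jumps") and one isolated word ("dog"). Many short shared sequences and few long ones would suggest thematic overlap, whereas long shared sequences are more typical of copied text; how the thesis turns such counts into a decision is not stated in this record.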

Publisher

Instituto Nacional de Astrofísica, Óptica y Electrónica

Publication date

January 2011

Publication type

Master's thesis

Publication version

Accepted version

Information resource

http://inaoe.repositorioinstitucional.mx/jspui/handle/1009/727

Format

application/pdf

Language

Spanish

Audience

Students

Researchers

General public

Suggested citation

Sanchez-Vega J.F

Source repository

Repositorio Institucional del INAOE

Downloads

592
