Desarrollo de algoritmos para el descubrimiento de patrones secuenciales maximales

RENE ARNULFO GARCIA HERNANDEZ

Título

Autor

RENE ARNULFO GARCIA HERNANDEZ

Colaborador

JOSE FRANCISCO MARTINEZ TRINIDAD (Asesor de tesis)

JESUS ARIEL CARRAZCO OCHOA (Asesor de tesis)

Nivel de Acceso

Acceso Abierto

Licencia

http://creativecommons.org/licenses/by-nc-nd/4.0

Materias

Sequence - (SECUENCIA) Pattern recognition - (RECONOCIMIENTO DE PATRONES) Sequential pattern mining - (EXTRACCIÓN SECUENCIAL DE PATRONES) INGENIERÍA Y TECNOLOGÍA - (CTI) CIENCIAS TECNOLÓGICAS - (CTI) TECNOLOGÍA DE LOS ORDENADORES - (CTI) SISTEMAS DE RECONOCIMIENTO DE CARACTERES - (CTI)

Resumen o descripción

Information of a document is described by words in a sequential way. From this point of view,

the knowledge transmitted in a text is sequentially structured by its author. Thus, text it is not

only useful for expressing the author’s ideas, but also it is useful for discovering new knowledge

from the frequent sequential order of the words in the text. The latter aspect has been part of

our motivation, since there are a big amount of electronic documents that could be useful for

discovering sequential patterns, which often cannot be seen at first sight and could be helpful

for text analysis; achieving in this manner a text mining process.

Since the number of frequent sequences can be huge, it is possible considering only those

which are not contained in another frequent sequence, it means, the Maximal Frequent

Sequences (MFS’s) can be considered as a compact representation of all frequent

sequences. When a MFS is found, the words, length and frequency of such MFS are

determined by the text. The last characteristic is very important because the frequency allows

having support for that MFS. Other important feature of the MFS’s is that they can be

extracted independently of the language. Frequent sequences preserve, in some way, the

natural sequential order of the text. Moreover, by its legibility, MFS’s are easy to understand by

humans.

The above mentioned characteristics make MFS’s suitable to be applied in specific text

mining tasks like document clustering and classification; or in tasks like information retrieval,

question answering, text summarization, etc. However, the MFS discovering problem has

received special attention due to the big amount of combinations that have to be reviewed

for discovering such patterns. For example, if a frequent sequence has 100 elements then

2 100 − 1 ≈ 10 30 combinations have to be reviewed for extracting such pattern. This problem

has been classified as NP-hard.

This dissertation deals with the problem of discovering MFS’s in text. It is important to remark,

that the main objective of this dissertation was to propose algorithms for improving the search

of MFS’s from textual information. However, the proposed algorithms can work over any other

kind of sequential information like DNA sequences, WEB logs, etc.; it is, with objects that

describe a sequential behavior through symbols.

La información de un documento de texto la encontramos descrita de manera secuencial

mediante palabras. Desde este punto de vista, el conocimiento transmitido por el autor de

un texto es estructurado de manera secuencial. Así, el texto no sólo sirve para el fin que

determinó el autor originalmente, sino que también es posible descubrir conocimiento nuevo

a partir del orden secuencial de las palabras que frecuentemente se presenta en un texto.

Este último aspecto ha sido precisamente parte de la motivación de esta disertación, pues

existe una gran cantidad de documentos electrónicos disponibles que permitirían descubrir

patrones en el texto que difícilmente pueden determinarse a primera vista y que pueden ser

de gran utilidad para el análisis de texto; realizando de esta manera un proceso de minería

sobre el texto.

Debido a que el número de secuencias frecuentes SF’s encontradas puede ser enorme, es

posible considerar únicamente aquellas SF’s que no están contenidas dentro de otras, es

decir, utilizar las secuencias frecuentes maximales (SFM’s) como la presentación compacta

del conjunto de SF’s. Cuando se descubre una SFM, es el contenido del texto el que

determina las palabras, longitud y frecuencia de la SFM. Esta última característica es muy

importante porque la frecuencia nos permite tener un soporte sobre la existencia de dicha

SFM. Otra de las características importantes de las SFM’s radica en su extracción de manera

independiente del lenguaje, incluso de texto que no esté bien escrito. Las SF’s de texto

preservan, en cierto modo, la naturaleza secuencial del texto. Incluso, por su legibilidad, las

SFM’s son entendibles por el humano.

Por las características mencionadas anteriormente, las SFM’s son potencialmente útiles para

tareas específicas de la minería de texto como la clasificación y agrupamiento de

documentos; en el análisis automático de texto; en la recuperación de información; en la

búsqueda de respuestas; extracción de hipónimos y en la elaboración de resúmenes, entre

otras. Sin embargo, el problema de descubrimiento de SFM’s ha requerido de atención

especial debido al gran número de combinaciones que se tienen que revisar en su

extracción. Por ejemplo, si una SF tiene una longitud de 100 elementos se tendrían que revisar

2 100 − 1 ≈ 10 30 combinaciones antes de poder extraer dicho patrón. Este problema ha sido

clasificado como un problema NP-difícil.

Editor

Instituto Nacional de Astrofísica, Óptica y Electrónica

Fecha de publicación

septiembre de 2007

Tipo de publicación

Tesis de doctorado

Versión de la publicación

Versión aceptada

Recurso de información

http://inaoe.repositorioinstitucional.mx/jspui/handle/1009/600

Formato

application/pdf

Idioma

Español

Audiencia

Estudiantes

Investigadores

Público en general

Sugerencia de citación

García-Hernández RA

Repositorio Orígen

Repositorio Institucional del INAOE

Descargas

2721

Comentarios

Necesitas iniciar sesión o registrarte para comentar.