A Case Study of Spanish Text Transformations for Twitter Sentiment Analysis

Oscar Sánchez Siordia; Eric Tellez; Sabino Miranda Jimenez; Mario Graff; Daniela Moctezuma; Elio Atenógenes Villaseñor García

Access level

Under embargo

License

http://creativecommons.org/licenses/by-nc-nd/4.0

Alternate identifier

doi: https://doi.org/10.1016/j.eswa.2017.03.071

Subjects

Sentiment Analysis (author); Error-robust text representations (author); Opinion mining (author); Engineering and Technology (CTI); Technological Sciences (CTI); Computer Technology (CTI); Artificial Intelligence (CTI)

Abstract or description

Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., whether it is positive or negative. It has recently received considerable attention because of the interest in opinion mining on micro-blogging platforms. These new forms of textual expression pose new challenges for text analysis because of slang and orthographic and grammatical errors, among others. In addition to these challenges, a practical sentiment classifier should handle large workloads efficiently. The aim of this research is to identify, within a large set of combinations, which text transformations (lemmatization, stemming, entity removal, among others), tokenizers (e.g., word n-grams), and token-weighting schemes have the greatest impact on the accuracy of a classifier (Support Vector Machine) trained on two Spanish datasets. The methodology is to exhaustively analyze all combinations of text transformations and their respective parameters to find out what characteristics the best-performing classifiers have in common. Furthermore, we introduce a novel approach based on the combination of word-based n-grams and character-based q-grams. The results show that this novel combination of words and characters produces a classifier that outperforms the traditional word-based combination by 11.17% and 5.62% on the INEGI and TASS’15 datasets, respectively.
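
The combination of word-based n-grams and character-based q-grams mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `w:`/`q:` prefixes, and the choice of n = 1, 2 and q = 3 are assumptions made only for illustration, and the text transformations, token weighting, and SVM training described in the abstract are left out.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word-based n-grams of a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_qgrams(text, q):
    """Return character-based q-grams of the lowercased text, spaces included."""
    s = " ".join(text.lower().split())  # collapse runs of whitespace
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def combined_tokens(text, n_values=(1, 2), q_values=(3,)):
    """Build one bag of tokens that mixes word n-grams and character q-grams.

    The hypothetical "w:" and "q:" prefixes keep the two token families
    from colliding (e.g. the word "muy" versus the 3-gram "muy").
    """
    bag = Counter()
    for n in n_values:
        bag.update("w:" + t for t in word_ngrams(text, n))
    for q in q_values:
        bag.update("q:" + t for t in char_qgrams(text, q))
    return bag


bag = combined_tokens("muy buen dia")
```

Because character q-grams overlap, a misspelled word still shares most of its q-grams with the correct form, which is one plausible reason such a mixed vocabulary tolerates the orthographic errors the abstract mentions.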

Publisher

Elsevier

Publication date

September 2017

Publication type

Article

Publication version

Accepted version

Information resource

http://centrogeo.repositorioinstitucional.mx/jspui/handle/1012/243

Format

application/pdf

Source

Expert Systems with Applications, Volume 81, 15 September 2017, Pages 457-471

Language

English

Audience

Students

Researchers

Teachers

Source repository

Repositorio Institucional de CENTROGEO
