Effect of Text Processing Steps on Twitter Sentiment Classification using Word Embedding

Paper Title

Effect of Text Processing Steps on Twitter Sentiment Classification using Word Embedding

Authors

Samad, Manar D.; Khounviengxay, Nalin D.; Witherow, Megan A.

Abstract

Processing raw text is the crucial first step in text classification and sentiment analysis. However, text processing steps are often performed using off-the-shelf routines and pre-built word dictionaries without optimizing for domain, application, and context. This paper investigates the effect of seven text processing scenarios on a particular text domain (Twitter) and application (sentiment classification). Skip-gram-based word embeddings are developed to include Twitter colloquial words, emojis, and hashtag keywords that are often removed because they are unavailable in conventional literature corpora. Our experiments reveal that two common text processing steps have negative effects on sentiment classification: 1) stop word removal and 2) averaging of word vectors to represent individual tweets. New effective steps for 1) including non-ASCII emoji characters, 2) measuring word importance from the word embedding, 3) aggregating word vectors into a tweet embedding, and 4) developing a linearly separable feature space are proposed to optimize the sentiment classification pipeline. The best combination of text processing steps yields the highest average area under the curve (AUC) of 88.4 (+/-0.4) in classifying 14,640 tweets with three sentiment labels. Word selection from the context-driven word embedding reveals that only the ten most important words in tweets cumulatively yield over 98% of the maximum accuracy. The results demonstrate a means for data-driven selection of important words in tweet classification, as opposed to using pre-built word dictionaries. The proposed tweet embedding is robust to several text processing steps and alleviates the need for them.
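To make two of the ideas in the abstract concrete, the sketch below shows the baseline it criticizes (averaging word vectors into one tweet vector) next to a simple top-k word selection by importance score. It is an illustration only: the abstract does not say how the authors measure word importance or aggregate vectors, so the function names, the toy 2-dimensional embeddings, and the importance scores here are all hypothetical placeholders rather than the paper's method.

```python
from typing import Dict, List

Vector = List[float]


def tokenize(tweet: str) -> List[str]:
    """Whitespace tokenizer that keeps emojis, hashtags, and stop words."""
    return tweet.lower().split()


def mean_tweet_vector(tokens: List[str],
                      embeddings: Dict[str, Vector],
                      dim: int) -> Vector:
    """Baseline: unweighted average of the vectors of in-vocabulary tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def top_k_tokens(tokens: List[str],
                 importance: Dict[str, float],
                 k: int = 10) -> List[str]:
    """Keep the k highest-scoring tokens, preserving their original order.

    The importance scores are assumed inputs; how to derive them from a
    word embedding is not specified in the abstract.
    """
    ranked = sorted(range(len(tokens)),
                    key=lambda i: importance.get(tokens[i], 0.0),
                    reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]


# Toy data (hypothetical): 2-dimensional embeddings and made-up scores.
embeddings = {
    "love": [1.0, 0.0],
    "delay": [0.0, 1.0],
    "😡": [-1.0, 1.0],
    "#fail": [-1.0, 0.0],
}
importance = {"love": 0.9, "the": 0.1, "delay": 0.5, "😡": 0.8, "#fail": 0.7}

tokens = tokenize("Love the delay 😡 #fail")
baseline = mean_tweet_vector(tokens, embeddings, dim=2)
selected = top_k_tokens(tokens, importance, k=3)
```

The emoji and the hashtag stay in the token list because the tokenizer only splits on whitespace, which mirrors the abstract's point that such tokens carry sentiment and should not be stripped. A real pipeline would swap the toy dictionaries for trained skip-gram vectors and a data-driven importance measure.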
