Paper title
Sentiment Analysis for Sinhala Language using Deep Learning Techniques
Paper authors
Paper abstract
Driven by rapid progress in machine learning and deep learning, Natural Language Processing (NLP) tasks have reached strong performance for highly resourced languages such as English and Chinese. However, Sinhala, an under-resourced language with a rich morphology, has not benefited from these advancements. For sentiment analysis, there exist only two previous studies that use deep learning approaches, and both focused only on document-level sentiment analysis for the binary case. They experimented with only three types of deep learning models. In contrast, this paper presents a much more comprehensive study on the use of standard sequence models such as RNN, LSTM, and Bi-LSTM, as well as more recent state-of-the-art models such as hierarchical attention hybrid neural networks and capsule networks. Classification is done at the document level but with more granularity, by considering POSITIVE, NEGATIVE, NEUTRAL, and CONFLICT classes. A data set of 15,059 Sinhala news comments annotated with these four classes, and a corpus consisting of 9.48 million tokens, are publicly released. This is the largest sentiment-annotated data set for Sinhala so far.