Paper Title
Predicting the Type and Target of Offensive Social Media Posts in Marathi
Paper Authors
Paper Abstract
The presence of offensive language on social media is very common, motivating platforms to invest in strategies to make communities safer. This includes developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper, we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0, or MOLD 2.0, and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD, with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.