Paper Title
Comparing Performance of Different Linguistically-Backed Word Embeddings for Cyberbullying Detection
Paper Authors
Paper Abstract
In most cases, word embeddings are learned only from raw tokens or, in some cases, lemmas. This includes pre-trained language models like BERT. To investigate the potential of capturing deeper relations between lexical items and structures, and to filter out redundant information, we propose to preserve morphological, syntactic and other types of linguistic information by combining them with the raw tokens or lemmas. This means, for example, including part-of-speech or dependency information within the lexical features used. The word embeddings can then be trained on these combinations instead of just raw tokens. This method could also later be applied to the pre-training of large language models, possibly enhancing their performance. This would aid in tackling problems that are more sophisticated from the point of view of linguistic representation, such as the detection of cyberbullying.
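To make the core idea concrete, below is a minimal sketch of training embeddings on linguistically-backed units rather than raw tokens. It assumes spaCy for tagging and gensim's Word2Vec for training; the combination schemes (e.g. token_POS, lemma_POS_dep), the tiny example corpus, and the hyperparameters are illustrative choices, not the paper's actual pipeline.

```python
# Sketch: concatenate each token (or lemma) with its POS tag and/or
# dependency label, then train word2vec on the combined units.
# spaCy and gensim are assumed here; the scheme names are hypothetical.
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")  # assumed English tagging model

corpus = [
    "You are such a loser, nobody likes you.",
    "Thanks for sharing, this was really helpful!",
]

def to_linguistic_units(text, scheme="token_pos"):
    """Turn a sentence into combined lexical+linguistic units."""
    doc = nlp(text)
    if scheme == "token_pos":        # e.g. "loser_NOUN"
        return [f"{t.text.lower()}_{t.pos_}" for t in doc]
    if scheme == "lemma_pos_dep":    # e.g. "loser_NOUN_attr"
        return [f"{t.lemma_.lower()}_{t.pos_}_{t.dep_}" for t in doc]
    return [t.text.lower() for t in doc]  # fallback: raw tokens

sentences = [to_linguistic_units(s, scheme="token_pos") for s in corpus]

# Embeddings are learned over the combined vocabulary, so "loser_NOUN"
# and a hypothetical "loser_ADJ" would get distinct vectors.
model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                 min_count=1, epochs=50, seed=1)

print(model.wv.most_similar("loser_NOUN", topn=3))
```

One consequence of this design is that homographs with different grammatical roles no longer share a single vector, at the cost of a larger, sparser vocabulary; whether the trade-off pays off is exactly what a comparison across schemes, as in this paper, would measure.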