Paper Title
L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models
Paper Authors
Paper Abstract
Named Entity Recognition (NER) is a fundamental NLP task with major applications in conversational and search systems. It helps identify key entities in a sentence that are used by downstream applications. NER and similar slot-filling systems for popular languages are heavily used in commercial applications. In this work, we focus on Marathi, an Indian language spoken prominently by the people of the Maharashtra state. Marathi is a low-resource language and still lacks useful NER resources. We present L3Cube-MahaNER, the first major gold-standard named entity recognition dataset in Marathi. We also describe the manual annotation guidelines followed during the process. Finally, we benchmark the dataset on different CNN, LSTM, and Transformer-based models such as mBERT, XLM-RoBERTa, IndicBERT, and MahaBERT. MahaBERT provides the best performance among all the models. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .
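Since the abstract reports that the released models are Transformer checkpoints for token classification, a minimal sketch of how such a model could be queried with the Hugging Face transformers library is shown below. The model identifier MODEL_ID is a hypothetical placeholder, not a name confirmed by the paper; the actual released checkpoints should be taken from the linked repository.

```python
# Minimal inference sketch for a Marathi NER (token-classification) model.
# ASSUMPTION: MODEL_ID is a placeholder; see https://github.com/l3cube-pune/MarathiNLP
# for the actual released model names.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

MODEL_ID = "l3cube-pune/marathi-ner"  # hypothetical checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

# aggregation_strategy="simple" merges word-piece predictions into entity spans.
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

sentence = "पुणे महाराष्ट्रातील एक शहर आहे."  # "Pune is a city in Maharashtra."
for entity in ner(sentence):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```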