Paper title
HashSet -- A Dataset For Hashtag Segmentation
Paper authors
Paper abstract
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information such as topic and sentiment, which is useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways: transliterating and mixing languages, spelling variations, and creative named entities. The benchmark datasets used for the hashtag segmentation task, STAN and BOUN, are small and are extracted from a single set of tweets. However, datasets should reflect the variations in the writing styles of hashtags and also account for domain and language specificity; otherwise the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and that datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising: a) a manually annotated dataset of 1.9k hashtags; b) a loosely supervised dataset of 3.3M hashtags. HashSet is sampled from a different set of tweets than existing datasets and provides an alternate distribution of hashtags on which to build and validate hashtag segmentation models. We show that the performance of SOTA models for hashtag segmentation drops substantially on the proposed dataset, indicating that it provides an alternate set of hashtags with which to train and assess models.
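To make the task definition concrete, the following is a minimal sketch of hashtag segmentation, assuming a toy vocabulary and a fewest-words dynamic program. It is not the method of the SOTA models mentioned in the abstract, and the names `VOCAB` and `segment` are hypothetical, introduced only for illustration.

```python
from typing import List, Optional, Set

# Toy vocabulary, purely illustrative (an assumption; not drawn from
# HashSet, STAN, or BOUN).
VOCAB: Set[str] = {"we", "need", "a", "national", "park"}


def segment(hashtag: str, vocab: Set[str] = VOCAB) -> Optional[List[str]]:
    """Split a hashtag into the fewest vocabulary words.

    Returns None when the hashtag cannot be covered by vocabulary words.
    """
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds the shortest token list covering text[:i],
    # or None if no segmentation of that prefix exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = text[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


print(segment("#WeNeedANationalPark"))
```

A fixed vocabulary like this one has no entry for transliterated, mixed-language, or creatively spelled hashtags, so `segment` returns `None` for them. That limitation is one way to see why the variety of hashtags in an evaluation set matters.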