DECK: Behavioral Tests to Improve Interpretability and Generalizability of BERT Models Detecting Depression from Text

Paper title

DECK: Behavioral Tests to Improve Interpretability and Generalizability of BERT Models Detecting Depression from Text

Authors

Jekaterina Novikova, Ksenia Shkaruta

Abstract

Models that accurately detect depression from text are important tools for addressing the post-pandemic mental health crisis. The promising performance and off-the-shelf availability of BERT-based classifiers make them great candidates for this task. However, these models are known to suffer from performance inconsistencies and poor generalization. In this paper, we introduce DECK (DEpression ChecKlist), a set of depression-specific model behavioural tests that allow better interpretability and improve the generalizability of BERT classifiers in the depression domain. We create 23 tests to evaluate BERT, RoBERTa and ALBERT depression classifiers on three datasets, two Twitter-based and one clinical interview-based. Our evaluation shows that these models: 1) are robust to certain gender-sensitive variations in text; 2) rely on an important depressive language marker, the increased use of first-person pronouns; 3) fail to detect some other depression symptoms, such as suicidal ideation. We also demonstrate that DECK tests can be used to incorporate symptom-specific information in the training data and consistently improve the generalizability of all three BERT models, with an out-of-distribution F1-score increase of up to 53.93%.
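
To make the idea of a behavioural test concrete, the following is a minimal sketch of an invariance test in the spirit of the gender-sensitivity checks mentioned in the abstract. It is not the authors' implementation: `toy_classifier`, `GENDER_SWAPS`, `swap_gender`, and `invariance_failure_rate` are hypothetical names, the word list is illustrative, and the rule-based classifier only stands in for a trained BERT model so that the example runs without any model download.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for a trained depression classifier (assumption):
# predicts 1 ("depressed") when first-person pronouns make up at least
# 20% of the tokens, otherwise 0.
FIRST_PERSON = {"i", "me", "my", "myself", "mine"}


def toy_classifier(text: str) -> int:
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0
    share = sum(t in FIRST_PERSON for t in tokens) / len(tokens)
    return int(share >= 0.2)


# Illustrative swap list (assumption, not the word list used in the paper).
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "husband": "wife", "wife": "husband",
    "boyfriend": "girlfriend", "girlfriend": "boyfriend",
}


def swap_gender(text: str) -> str:
    """Replace each gendered word with its counterpart, keeping capitalisation."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = GENDER_SWAPS.get(word.lower())
        if swapped is None:
            return word  # leave all other words untouched
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"[A-Za-z]+", repl, text)


def invariance_failure_rate(
    predict: Callable[[str], int],
    texts: List[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of texts whose predicted label changes after the perturbation."""
    failures = sum(predict(t) != predict(perturb(t)) for t in texts)
    return failures / len(texts)


texts = [
    "I feel like my husband never listens to me",
    "She said the weather was nice today",
]
rate = invariance_failure_rate(toy_classifier, texts, swap_gender)
print(f"Invariance failure rate: {rate:.2f}")
```

A failure rate near zero suggests the classifier is robust to this perturbation. To test a real model, `toy_classifier` would be replaced by a function that wraps the model's prediction call and returns a label.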
