Paper Title
Prediction of Depression Severity Based on the Prosodic and Semantic Features with Bidirectional LSTM and Time Distributed CNN
Paper Authors
Paper Abstract
Depression is increasingly affecting individuals worldwide, both physically and psychologically. It has become a major global public health problem and has attracted attention from various research fields. Traditionally, depression is diagnosed through semi-structured interviews and supplementary questionnaires, which makes the diagnosis heavily reliant on physicians' experience and subject to bias. Mental health monitoring and cloud-based remote diagnosis can be implemented through an automated depression diagnosis system. In this article, we propose an attention-based multimodal speech and text representation for depression prediction. Our model is trained to estimate the depression severity of participants using the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset. For the audio modality, we use the collaborative voice analysis repository (COVAREP) features provided by the dataset and employ a Bidirectional Long Short-Term Memory network (Bi-LSTM) followed by a Time-distributed Convolutional Neural Network (T-CNN). For the text modality, we use Global Vectors for Word Representation (GloVe) to perform word embeddings, and the embeddings are fed into the Bi-LSTM network. Results show that both the audio and text models perform well on the depression severity estimation task over five classes (healthy, mild, moderate, moderately severe, and severe): the audio model achieves a best sequence-level F1 score of 0.9870 and a patient-level F1 score of 0.9074, while the text model achieves a sequence-level F1 score of 0.9709 and a patient-level F1 score of 0.9245. Results are similar for the multimodal fused model, which achieves the highest F1 score of 0.9580 on the patient-level depression detection task over five classes. Experiments show statistically significant improvements over previous work.
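As a rough illustration of the audio branch described above (a Bi-LSTM over COVAREP frame features followed by a time-distributed CNN head classifying into the five severity classes), a minimal sketch in Keras/TensorFlow is given below. The sequence length, feature dimension, layer sizes, and exact stacking order are assumptions made for illustration only and are not taken from the paper.

# Minimal sketch of the audio branch: Bi-LSTM over COVAREP frames,
# then a time-distributed CNN head, then a five-class softmax.
# All shapes and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5                 # healthy, mild, moderate, moderately severe, severe
SEQ_LEN, N_FEATURES = 100, 74   # hypothetical COVAREP sequence shape

def build_audio_model() -> tf.keras.Model:
    inputs = layers.Input(shape=(SEQ_LEN, N_FEATURES))
    # Bidirectional LSTM over the frame sequence, keeping per-step outputs
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)  # (SEQ_LEN, 256)
    # Apply the same small 1-D CNN to every time step (time-distributed CNN)
    x = layers.Reshape((SEQ_LEN, 256, 1))(x)
    x = layers.TimeDistributed(layers.Conv1D(64, kernel_size=3, activation="relu"))(x)
    x = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)                 # (SEQ_LEN, 64)
    # Pool over time and classify into the five severity classes
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_audio_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

Keeping the full per-step Bi-LSTM output (return_sequences=True) is what allows the convolutional head to operate on every time step before pooling over time, which is consistent with the paper reporting both sequence-level and patient-level F1 scores.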