FrAUG: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals

Paper title

FrAUG: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals

Authors

Vijay Ravi, Jinhan Wang, Jonathan Flint, Abeer Alwan

Abstract

In this paper, a data augmentation method is proposed for depression detection from speech signals. Augmented samples are created by changing the frame-width and frame-shift parameters during the feature extraction process. Unlike other data augmentation methods (such as vocal tract length perturbation (VTLP), pitch perturbation, or speed perturbation), the proposed method does not explicitly change acoustic parameters but rather the time-frequency resolution of frame-level features. The proposed method was evaluated using two different datasets, models, and input acoustic features. For the DAIC-WOZ (English) dataset, when using the DepAudioNet model with mel-spectrograms as input, the proposed method resulted in an improvement of 5.97% (validation) and 25.13% (test) compared to the baseline. For the CONVERGE (Mandarin) dataset, when using x-vector embeddings with a CNN backend and MFCCs as input features, the improvements were 9.32% (validation) and 12.99% (test). The baseline systems do not incorporate any data augmentation. Further, the proposed method outperformed commonly used data augmentation methods such as noise augmentation, VTLP, speed perturbation, and pitch perturbation. All improvements were statistically significant.
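The core idea in the abstract, extracting features from the same recording under several frame-width and frame-shift settings, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the log power spectrum used as a stand-in for mel-spectrograms or MFCCs, the FFT size, and the (width, shift) values below are all assumptions chosen for illustration.

```python
import numpy as np


def frame_signal(signal, sample_rate, width_ms, shift_ms):
    """Slice a 1-D signal into overlapping frames (no padding at the end)."""
    width = int(round(sample_rate * width_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    if len(signal) < width:
        return np.empty((0, width))
    n_frames = 1 + (len(signal) - width) // shift
    # Row i holds samples [i * shift, i * shift + width).
    idx = np.arange(width)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]


def log_power_spectrum(signal, sample_rate, width_ms, shift_ms, n_fft=1024):
    """Frame-level log power spectrum; n_fft must be >= the frame width."""
    frames = frame_signal(signal, sample_rate, width_ms, shift_ms)
    window = np.hamming(frames.shape[1])
    spectrum = np.fft.rfft(frames * window, n=n_fft, axis=1)
    return np.log(np.abs(spectrum) ** 2 + 1e-10)


def frame_rate_augment(signal, sample_rate, settings):
    """Return one feature matrix per (width_ms, shift_ms) setting."""
    return {s: log_power_spectrum(signal, sample_rate, *s) for s in settings}


# Hypothetical (width_ms, shift_ms) pairs; NOT the values used in the paper.
SETTINGS = [(25.0, 10.0), (32.0, 16.0), (64.0, 32.0)]

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # 1 s of synthetic noise at 16 kHz
features = frame_rate_augment(audio, 16000, SETTINGS)

for setting, matrix in features.items():
    print(setting, matrix.shape)
```

Each setting yields a feature matrix with a different number of frames for the same audio, so the waveform itself is never modified; only the time-frequency resolution of the frame-level features changes. In a training pipeline, each matrix would be treated as an additional sample carrying the label of the original recording.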
