Title
Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
Authors
Abstract
The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.
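The core tradeoff in the abstract — long STFT frames yield far fewer frames per utterance, which matters because transformer self-attention cost grows with sequence length — can be sketched numerically. The snippet below is a generic magnitude-STFT front-end, not the paper's exact configuration; the sample rate (8 kHz), the short learned-encoder-style frame (2 ms, 50% overlap), and the long STFT frame (32 ms, 50% overlap) are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(signal, frame_len, hop):
    """Magnitude STFT with a Hann window and real FFT.
    A minimal sketch; not the paper's exact front-end."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Shape: (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=-1))

# Hypothetical settings: a 10-second utterance at 8 kHz.
sr = 8000
x = np.random.randn(10 * sr)

# Short frames (16 samples = 2 ms), as in learned-encoder models.
n_short = stft_magnitude(x, frame_len=16, hop=8).shape[0]
# Long frames (256 samples = 32 ms), as an STFT front-end allows.
n_long = stft_magnitude(x, frame_len=256, hop=128).shape[0]

print(n_short, n_long)  # the long-frame sequence is roughly 16x shorter
```

Under these assumed settings the long-frame input sequence is about 16 times shorter, which shrinks the dominant attention cost inside the transformer; the paper reports an overall reduction of roughly 8x in operations for a 10-second utterance.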