Title
Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
Authors
Abstract
The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.
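The core tradeoff in the abstract — long STFT frames yield far fewer frames per utterance, which matters because transformer self-attention cost grows with sequence length — can be sketched numerically. The snippet below is a generic magnitude-STFT front-end, not the paper's exact configuration; the sample rate (8 kHz), the short learned-encoder-style frame (2 ms, 50% overlap), and the long STFT frame (32 ms, 50% overlap) are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(signal, frame_len, hop):
    """Magnitude STFT with a Hann window and real FFT.
    A minimal sketch; not the paper's exact front-end."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Shape: (n_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=-1))

# Hypothetical settings: a 10-second utterance at 8 kHz.
sr = 8000
x = np.random.randn(10 * sr)

# Short frames (16 samples = 2 ms), as in learned-encoder models.
n_short = stft_magnitude(x, frame_len=16, hop=8).shape[0]
# Long frames (256 samples = 32 ms), as an STFT front-end allows.
n_long = stft_magnitude(x, frame_len=256, hop=128).shape[0]

print(n_short, n_long)  # the long-frame sequence is roughly 16x shorter
```

Under these assumed settings the long-frame input sequence is about 16 times shorter, which shrinks the dominant attention cost inside the transformer; the paper reports an overall reduction of roughly 8x in operations for a 10-second utterance.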