Mel Spectrogram Inversion with Stable Pitch

Paper Title

Mel Spectrogram Inversion with Stable Pitch

Authors

Bruno Di Giorgi, Mark Levy, Richard Sharp

Abstract

Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent vocoder models developed for speech achieve a high degree of realism, such that it is natural to wonder how they would perform on music signals. Compared to speech, the heterogeneity and structure of the musical sound texture offer new challenges. In this work we focus on one specific artifact that some vocoder models designed for speech tend to exhibit when applied to music: the perceived instability of pitch when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to the lack of horizontal phase coherence, which is often the result of using a time-domain target space with a model that is invariant to time shifts, such as a convolutional neural network. We propose a new vocoder model that is specifically designed for music. Key to improving the pitch stability is the choice of a shift-invariant target space that consists of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals. Using a novel harmonic error metric, our method improves the reconstruction of sustained notes and chords by 60% and 10%, respectively, relative to existing models.
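
The shift-invariance argument in the abstract can be illustrated with a small numerical sketch. This is not the authors' model or code. It only checks, for a steady sinusoid, that the STFT magnitude and the frame-to-frame phase difference (one direction of a phase gradient) stay the same when the signal is shifted in time, while the raw phase does not. The sample rate, FFT size, hop size, test frequency, and shift below are all assumed values chosen for illustration, and `stft` and `time_phase_gradient` are hypothetical helper names.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Plain Hann-windowed STFT; returns an array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def time_phase_gradient(spec):
    """Frame-to-frame phase difference, wrapped to (-pi, pi]."""
    return np.angle(spec[1:] * np.conj(spec[:-1]))

# Assumed illustration parameters (not taken from the paper).
sr, f0, shift, n_fft = 16000, 440.0, 10, 1024
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * f0 * t)                         # sustained note
x_shifted = np.sin(2 * np.pi * f0 * (t + shift / sr))  # same note, advanced by `shift` samples

k = int(round(f0 * n_fft / sr))                        # frequency bin closest to f0
S, S_shifted = stft(x, n_fft), stft(x_shifted, n_fft)

# Raw phase at bin k: changes with the shift.
phase_offset = np.mean(np.abs(np.angle(S_shifted[:, k] * np.conj(S[:, k]))))

# Magnitude at bin k: essentially unchanged (relative error).
mag_err = (np.max(np.abs(np.abs(S_shifted[:, k]) - np.abs(S[:, k])))
           / np.max(np.abs(S[:, k])))

# Phase gradient along time at bin k: essentially unchanged.
g = time_phase_gradient(S)[:, k]
g_shifted = time_phase_gradient(S_shifted)[:, k]
grad_err = np.max(np.abs(np.angle(np.exp(1j * (g_shifted - g)))))

print(f"raw phase offset (rad):  {phase_offset:.3f}")
print(f"relative magnitude diff: {mag_err:.2e}")
print(f"phase gradient diff:     {grad_err:.2e}")
```

Under these assumptions the raw phase offset is well away from zero, while the magnitude and phase-gradient differences are negligible. This matches the intuition that a shift-invariant model is better paired with targets that do not depend on where in time the note happens to start.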
