Paper title
Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism
Paper authors
Paper abstract
This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal modeling is attractive. However, because the temporal modeling of singing voices is difficult, many recent SVS systems with an encoder-decoder-based model still rely explicitly on duration information generated by additional modules. Although some studies perform simultaneous modeling using seq2seq models with an attention mechanism, their temporal modeling is not sufficiently robust. The proposed attention mechanism is designed to estimate the attention weights by considering the rhythm given by the musical score. Furthermore, several techniques are introduced to improve the modeling performance of the singing voice. Experimental results indicate that the proposed model is effective in terms of both naturalness and robustness of timing.
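The abstract does not give the formulation of the attention mechanism, so the following is only an illustrative sketch of the general idea of letting score rhythm influence attention weights. It is not the paper's method. The function name `note_position_aware_attention`, the `sharpness` parameter, and the linear distance penalty are all assumptions made for this example: each score token carries the start and end time of its note (derived from the score's tempo and note lengths), and tokens whose note lies far from the current output frame are penalized before the softmax.

```python
import math

def note_position_aware_attention(content_scores, token_note_times,
                                  frame_time, sharpness=4.0):
    """Hypothetical sketch: softmax attention over score tokens, biased
    toward tokens whose note interval is close to the current frame time.

    content_scores   : one content-based attention score per token
    token_note_times : (start, end) of each token's note, in seconds
    frame_time       : time of the current output frame, in seconds
    sharpness        : assumed penalty per second of timing mismatch
    """
    logits = []
    for score, (start, end) in zip(content_scores, token_note_times):
        # Distance from the frame to the note interval (zero inside the note).
        if frame_time < start:
            dist = start - frame_time
        elif frame_time > end:
            dist = frame_time - end
        else:
            dist = 0.0
        logits.append(score - sharpness * dist)

    # Numerically stable softmax over the biased logits.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three tokens on notes lasting 0.5 s, 0.5 s, and 1.0 s
# (for example, quarter, quarter, half note at 120 BPM).
note_times = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0)]
weights = note_position_aware_attention([0.0, 0.0, 0.0], note_times,
                                        frame_time=1.4)
```

With equal content scores, the frame at 1.4 s falls inside the third note, so the third token receives the largest weight. In a real model the content scores would come from the encoder and decoder states, and the bias would likely be learned rather than fixed.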