Paper Title
A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation
Paper Authors
Paper Abstract
Automatic Music Transcription (AMT) has been recognized as a key enabling technology with a wide range of applications. Given the task's complexity, best results have typically been reported for systems focusing on specific settings, e.g., instrument-specific systems tend to yield improved results over instrument-agnostic methods. Similarly, higher accuracy can be obtained when only estimating frame-wise $f_0$ values and neglecting the harder note event detection. Despite their high accuracy, such specialized systems often cannot be deployed in the real world. Storage and network constraints prohibit the use of multiple specialized models, while memory and run-time constraints limit their complexity. In this paper, we propose a lightweight neural network for musical instrument transcription, which supports polyphonic outputs and generalizes to a wide variety of instruments (including vocals). Our model is trained to jointly predict frame-wise onsets, multipitch and note activations, and we experimentally show that this multi-output structure improves the resulting frame-level note accuracy. Despite its simplicity, benchmark results show our system's note estimation to be substantially better than a comparable baseline, and its frame-level accuracy to be only marginally below those of specialized state-of-the-art AMT systems. With this work we hope to encourage the community to further investigate low-resource, instrument-agnostic AMT systems.
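To illustrate the joint multi-output idea described in the abstract, below is a minimal sketch (not the authors' published architecture; all layer shapes, names, and the loss weighting are illustrative assumptions) of a small shared convolutional trunk producing three aligned frame-wise posteriorgrams for multipitch, note, and onset activations, trained jointly with binary cross-entropy.

```python
# Minimal sketch of a multi-output transcription model (assumed layer sizes;
# not the paper's exact network). Input: a time-frequency representation,
# e.g. a CQT of shape (batch, 1, frames, freq_bins).
import torch
import torch.nn as nn

class MultiOutputTranscriber(nn.Module):
    def __init__(self, n_bins: int = 264):
        super().__init__()
        # Shared convolutional trunk.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, 5), padding=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=(3, 3), padding=1), nn.ReLU(),
        )
        # Three lightweight heads sharing the trunk features.
        self.pitch_head = nn.Conv2d(16, 1, kernel_size=(3, 3), padding=1)  # multipitch
        self.note_head = nn.Conv2d(16, 1, kernel_size=(3, 3), padding=1)   # note activations
        self.onset_head = nn.Conv2d(16, 1, kernel_size=(3, 3), padding=1)  # onsets

    def forward(self, x):
        h = self.trunk(x)
        return {
            "multipitch": torch.sigmoid(self.pitch_head(h)).squeeze(1),
            "note": torch.sigmoid(self.note_head(h)).squeeze(1),
            "onset": torch.sigmoid(self.onset_head(h)).squeeze(1),
        }

if __name__ == "__main__":
    # Joint training step: sum binary cross-entropy over the three outputs.
    model = MultiOutputTranscriber()
    x = torch.randn(2, 1, 100, 264)  # (batch, channel, frames, freq bins)
    targets = {k: torch.rand(2, 100, 264) for k in ("multipitch", "note", "onset")}
    outputs = model(x)
    loss = sum(nn.functional.binary_cross_entropy(outputs[k], targets[k]) for k in outputs)
    loss.backward()
    print(float(loss))
```

In practice the three heads could use different frequency resolutions (e.g., finer bins for multipitch than for note activations), but the key point mirrored here is that one shared trunk feeds all three outputs, which the abstract reports improves frame-level note accuracy over a single-output model.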