Paper Title


Multimodal Audio-Visual Information Fusion using Canonical-Correlated Graph Neural Network for Energy-Efficient Speech Enhancement

Paper Authors

Leandro Aparecido Passos, João Paulo Papa, Javier Del Ser, Amir Hussain, Ahsan Adeel

Paper Abstract


This paper proposes a novel multimodal self-supervised architecture for energy-efficient audio-visual (AV) speech enhancement that integrates Graph Neural Networks with canonical correlation analysis (CCA-GNN). The proposed approach lays its foundations on a state-of-the-art CCA-GNN that learns representative embeddings by maximizing the correlation between pairs of augmented views of the same input while decorrelating disconnected features. The key idea of the conventional CCA-GNN is to discard augmentation-variant information and preserve augmentation-invariant information, while preventing the capture of redundant information. Our proposed AV CCA-GNN model extends this idea to a multimodal representation-learning context. Specifically, our model improves contextual AV speech processing by maximizing both the canonical correlation between augmented views of the same channel and the canonical correlation between audio and visual embeddings. In addition, it proposes a positional node encoding that considers prior-frame sequence distance, instead of a feature-space representation, when computing each node's nearest neighbors, introducing temporal information into the embeddings through the neighborhood's connectivity. Experiments conducted on the benchmark CHiME-3 dataset show that our proposed prior-frame-based AV CCA-GNN ensures better feature learning in the temporal context, leading to more energy-efficient speech reconstruction than the state-of-the-art CCA-GNN and a multilayer perceptron.
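To make the abstract's two core ideas concrete, here is a minimal, illustrative sketch, not the authors' implementation: (i) a CCA-style self-supervised loss, assuming the common CCA-SSG-style formulation (invariance term plus soft feature-decorrelation penalty), applied within a modality and across the audio and visual embeddings, and (ii) an edge builder that connects graph nodes by prior-frame sequence distance rather than feature-space nearest neighbors. All function names, weights, and hyper-parameters (`cca_loss`, `temporal_knn_edges`, `av_cca_loss`, `lam`, `beta`, `k`) are assumptions made for illustration.

```python
# Illustrative sketch only; names and hyper-parameters are assumptions,
# not taken from the paper's implementation.
import torch


def cca_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """CCA-SSG-style objective on two embedding matrices of shape (N, D):
    pull the paired embeddings together (maximize correlation) while pushing
    each view's feature-wise covariance towards the identity (decorrelation)."""
    n, d = z1.shape
    # Standardize each feature dimension to zero mean and unit variance.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    invariance = ((z1 - z2) ** 2).sum() / n
    eye = torch.eye(d, device=z1.device)
    c1, c2 = (z1.T @ z1) / n, (z2.T @ z2) / n
    decorrelation = ((c1 - eye) ** 2).sum() + ((c2 - eye) ** 2).sum()
    return invariance + lam * decorrelation


def temporal_knn_edges(num_frames: int, k: int = 2) -> torch.Tensor:
    """Edge index (2, E) connecting each frame node to its k previous frames,
    i.e. neighbors chosen by sequence distance instead of feature distance."""
    src, dst = [], []
    for t in range(num_frames):
        for offset in range(1, k + 1):
            if t - offset >= 0:
                src.append(t - offset)
                dst.append(t)
    return torch.tensor([src, dst], dtype=torch.long)


def av_cca_loss(za1, za2, zv1, zv2, beta: float = 1.0) -> torch.Tensor:
    """Combined objective, assuming za1/za2 and zv1/zv2 are GNN embeddings of
    two augmented audio views and two augmented visual views (encoders not
    shown): within-modality CCA terms plus a cross-modal audio-visual term."""
    within = cca_loss(za1, za2) + cca_loss(zv1, zv2)
    cross = cca_loss(za1, zv1)
    return within + beta * cross
```

The point of the prior-frame neighborhood is that message passing then aggregates information from the preceding frames of the sequence, injecting temporal context into each node's embedding; a feature-space k-NN graph would instead connect frames that merely look similar, regardless of when they occur.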
