Title
Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
Authors
Abstract
The heterogeneity gap is the main challenge in cross-modal retrieval: cross-modal data (e.g., audio-visual) have different distributions and representations that cannot be compared directly. To bridge the gap between the audio and visual modalities, we learn a common subspace for them by exploiting the intrinsic correlation in naturally synchronized audio-visual data with the aid of annotated labels. TNN-CCCA is the best audio-visual cross-modal retrieval (AV-CMR) model to date, but its training is sensitive to hard negative samples because it learns the common subspace with a triplet loss that predicts the relative distances between inputs. In this paper, to reduce the interference of hard negative samples in representation learning, we propose a new AV-CMR model that optimizes semantic features by directly predicting labels and then measuring the intrinsic correlation between audio and visual data with a complete cross-triplet loss. Specifically, our model projects audio-visual features into the label space by minimizing the distance between the predicted label features after feature projection and the ground-truth label representations. Moreover, we adopt the complete cross-triplet loss to optimize the predicted label features by leveraging all possible similar and dissimilar semantic relationships across modalities. Extensive experiments on two double-checked audio-visual datasets show an improvement of approximately 2.1% in average MAP over the current state-of-the-art method TNN-CCCA on the AV-CMR task, which indicates the effectiveness of our proposed model.
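The abstract gives no formulas, but the two training objectives it names can be sketched. The snippet below is a minimal NumPy illustration, not the paper's implementation: `label_space_projection_loss` stands in for the distance between predicted label features and ground-truth label representations, and `complete_cross_triplet_loss` forms every cross-modal (anchor, positive, negative) triplet in both directions, which is one plausible reading of "all possible similar and dissimilar semantic relationships across modalities". All function names, the MSE choice, and the margin value are assumptions.

```python
import numpy as np

def label_space_projection_loss(pred_audio, pred_visual, labels_onehot):
    """Hypothetical projection loss: MSE between each modality's predicted
    label features and the ground-truth (one-hot) label representations."""
    return (np.mean((pred_audio - labels_onehot) ** 2)
            + np.mean((pred_visual - labels_onehot) ** 2))

def complete_cross_triplet_loss(audio, visual, labels, margin=1.0):
    """Hinge triplet loss over ALL cross-modal triplets, in both directions:
    for each anchor in one modality, every same-class sample of the other
    modality is a positive and every different-class sample a negative."""
    n = len(labels)
    total, count = 0.0, 0
    # (anchor modality, opposite modality) in both directions
    for anc, opp in ((audio, visual), (visual, audio)):
        for a in range(n):
            for p in range(n):
                if labels[p] != labels[a]:
                    continue  # p must share the anchor's class
                for q in range(n):
                    if labels[q] == labels[a]:
                        continue  # q must differ from the anchor's class
                    d_pos = np.sum((anc[a] - opp[p]) ** 2)
                    d_neg = np.sum((anc[a] - opp[q]) ** 2)
                    total += max(0.0, margin + d_pos - d_neg)
                    count += 1
    return total / max(count, 1)
```

With perfectly aligned one-hot features the cross-triplet loss vanishes, since every positive distance is 0 and every negative distance exceeds the margin; training would drive both modalities toward that configuration in label space.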