Paper Title

Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Authors

Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu

Abstract

Self-supervised pre-training has recently demonstrated success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce feature consistency across cross-modality inputs, such as video/audio or video/text pairs. Despite its convenience to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy form of supervision, since two modalities can be semantically misaligned even when they are temporally aligned. For example, even in the commonly adopted instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame; and the semantic misalignment is only more unpredictable for raw videos from the internet. We conjecture that this misalignment might cause conflicts and biases among modalities, and may hence prohibit CMA from scaling up to training with larger and more heterogeneous data. This paper first verifies our conjecture by observing that, even in the latest VATT pre-training using only instructional videos, there exist strong gradient conflicts between different CMA losses within the same (video, audio, text) triplet, indicating that they are a noisy source of supervision. We then propose to harmonize such gradients via two techniques: (i) cross-modality gradient realignment: modifying the different CMA loss gradients for each sample triplet so that their gradient directions are more aligned; and (ii) gradient-based curriculum learning: leveraging the gradient conflict information as an indicator of sample noisiness, to develop a curriculum learning strategy that prioritizes training on less noisy sample triplets. Applying these techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated, non-narrative YouTube8M dataset, further improving the state of the art.

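The two techniques outlined in the abstract lend themselves to a compact sketch. Below is a minimal, illustrative PyTorch snippet, not the authors' released code: `harmonize_cma_gradients` applies a PCGrad-style projection to make the video/audio and video/text CMA gradients of a triplet more aligned, and `curriculum_weights` turns the per-triplet gradient cosine similarity into a simple curriculum mask that admits increasingly conflicting triplets as training progresses. All function names, signatures, and the linear curriculum schedule are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F


def harmonize_cma_gradients(grad_va: torch.Tensor, grad_vt: torch.Tensor):
    """Realign two CMA loss gradients for one sample triplet (illustrative).

    grad_va: gradient of the video/audio contrastive loss (flattened parameters)
    grad_vt: gradient of the video/text contrastive loss (same shape)

    If the gradients conflict (negative cosine similarity), each is projected
    onto the normal plane of the other (PCGrad-style gradient surgery), so
    that their directions become more aligned. The cosine similarity is also
    returned, since it can double as a per-triplet noisiness indicator.
    """
    va, vt = grad_va.flatten(), grad_vt.flatten()
    cos = F.cosine_similarity(va, vt, dim=0)
    g_va, g_vt = grad_va, grad_vt
    if cos < 0:  # conflicting gradient directions
        # remove from each gradient the component that opposes the other
        g_va = grad_va - (torch.dot(va, vt) / vt.pow(2).sum()) * grad_vt
        g_vt = grad_vt - (torch.dot(vt, va) / va.pow(2).sum()) * grad_va
    return g_va, g_vt, cos


def curriculum_weights(cos_sims: torch.Tensor, epoch: int, total_epochs: int):
    """Gradient-based curriculum mask (illustrative schedule).

    cos_sims: per-triplet cosine similarity between the two CMA gradients.
    Early in training only well-aligned (presumably less noisy) triplets are
    kept; the threshold relaxes linearly so that all triplets are admitted
    by the end of training.
    """
    threshold = -float(epoch) / max(total_epochs - 1, 1)  # goes from 0 to -1
    return (cos_sims >= threshold).float()
```

The projection follows the standard gradient-surgery formulation, g_i ← g_i − (g_i·g_j / ‖g_j‖²) g_j whenever the two gradients conflict; the exact realignment rule and curriculum schedule used in the paper may differ from this sketch.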