C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Paper Title

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Authors

Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

Abstract

Multilingual text-video retrieval methods have improved significantly in recent years, but performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross-entropy-based objective that forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
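
The cross-entropy objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository. The function name `distillation_loss`, the `temperature` parameter, the row-wise softmax over the videos in a batch, and the use of a single teacher are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def distillation_loss(student_sim, teacher_sim, temperature=1.0):
    """Cross entropy between teacher and student distributions over videos.

    student_sim: (B, B) similarities between non-English captions and videos.
    teacher_sim: (B, B) similarities between English captions and videos.
    temperature: hypothetical softmax temperature (an assumption here).
    """
    # Teacher distribution over videos for each English caption (soft targets).
    p_teacher = softmax(teacher_sim / temperature, axis=1)
    # Student log-distribution over videos for each non-English caption.
    log_q_student = log_softmax(student_sim / temperature, axis=1)
    # Cross entropy per caption, averaged over the batch.
    return float(-(p_teacher * log_q_student).sum(axis=1).mean())


# Toy batch of 4 caption-video pairs with random similarity scores.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 4))
student = rng.normal(size=(4, 4))
print(distillation_loss(student, teacher))
```

For a fixed teacher, this loss is smallest when the student's distribution over videos equals the teacher's, which is the sense in which the objective pushes the two similarity distributions together.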
