Paper Title
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
Paper Authors
Paper Abstract
Searching troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their respective videos and (2) assuming all other videos to be negatives. However, this methodology leads to a fundamental flaw during evaluation: since captions are marked as relevant only to their original video, the many alternate videos that also match the caption are treated as irrelevant, which introduces false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points -- a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the recomputed metrics are up to 25% recall points higher for the best models, (2) these benchmarks are nearing saturation for Recall@10, (3) caption length (generality) is related to the number of positives, and (4) annotation costs can be mitigated through sampling. We recommend retiring these benchmarks in their current form, and we make recommendations for future text-to-video retrieval benchmarks.
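To make the evaluation flaw concrete, the minimal sketch below (not the paper's released code; all identifiers and the toy data are hypothetical) contrasts the standard single-positive Recall@K with a corrected variant that also credits annotated alternate positives. When a model ranks a valid alternate video ahead of the caption's original video, only the corrected metric gives credit, which is how a gap of up to 25 recall points can arise.

```python
# Minimal sketch (not the authors' implementation) of single-positive vs.
# corrected multi-positive Recall@K. Video IDs and rankings are toy data.

def recall_at_k(ranked_video_ids, relevant_ids, k):
    """Return 1.0 if any relevant video appears in the top-k, else 0.0."""
    return float(any(v in relevant_ids for v in ranked_video_ids[:k]))

def evaluate(queries, k=10):
    """queries: list of (ranked_video_ids, original_id, extra_positive_ids)."""
    single = multi = 0.0
    for ranked, original_id, extra_positives in queries:
        # (1) Standard protocol: only the caption's source video counts.
        single += recall_at_k(ranked, {original_id}, k)
        # (2) Corrected protocol: annotated alternate matches also count,
        # so retrieving a "false negative" is no longer penalized.
        multi += recall_at_k(ranked, {original_id} | set(extra_positives), k)
    n = len(queries)
    return single / n, multi / n

# Toy example: the model ranks a valid alternate video ("v7") first,
# ahead of the caption's original video ("v3").
queries = [(["v7", "v3", "v1"], "v3", ["v7"])]
print(evaluate(queries, k=1))  # -> (0.0, 1.0): the corrected score is higher
```

Under the standard protocol the query above scores zero at K=1 even though the top-ranked video genuinely matches the caption; crediting the annotated alternate positive removes that penalty, mirroring the recomputed metrics reported for MSR-VTT and MSVD.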