Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates

Paper title

Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates

Paper authors

Labbé, Etienne; Pellegrini, Thomas; Pinquier, Julien

Paper abstract

Automatic Audio Captioning (AAC) is the task of describing an audio signal in natural language. AAC systems take an audio signal as input and output a free-form text sentence, called a caption. Evaluating such systems is not trivial, since there are many ways to express the same idea. For this reason, several complementary metrics, such as BLEU, CIDEr, SPICE and SPIDEr, are used to compare a single automatic caption to one or several reference captions produced by human annotators. Nevertheless, an automatic system can produce several caption candidates, either by using some randomness in the sentence generation process or by considering the various competing hypothesized captions during decoding, with beam search for instance. For an end user of an AAC system, presenting several captions instead of a single one seems relevant to provide some diversity, similarly to information retrieval systems. In this work, we explore the possibility of considering several predicted captions in the evaluation process instead of one. For this purpose, we propose SPIDEr-max, a metric that takes the maximum SPIDEr value among the scores of several caption candidates. To advocate for our metric, we report experiments on Clotho v2.1 and AudioCaps with a Transformer-based system. On AudioCaps, for example, this system reached a SPIDEr-max value (with 5 candidates) close to the reference human SPIDEr score.
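
The idea of taking the maximum score over several caption candidates can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define SPIDEr; the sketch assumes its usual definition as the mean of CIDEr-D and SPICE. Because real CIDEr-D and SPICE scorers need external tooling, a toy word-overlap score stands in for SPIDEr in the example. The names `spider`, `spider_max`, and `toy_score` are hypothetical, chosen only for illustration.

```python
from typing import Callable, Sequence


def spider(cider_d: float, spice: float) -> float:
    """Assumed usual definition of SPIDEr: mean of CIDEr-D and SPICE."""
    return 0.5 * (cider_d + spice)


def spider_max(
    candidates: Sequence[str],
    references: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], float],
) -> float:
    """Return the best score among several caption candidates for one audio clip.

    `score_fn` is a placeholder for a per-caption SPIDEr scorer.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    return max(score_fn(candidate, references) for candidate in candidates)


def toy_score(candidate: str, references: Sequence[str]) -> float:
    """Stand-in scorer (NOT SPIDEr): best Jaccard word overlap with any reference."""
    cand_words = set(candidate.lower().split())
    best = 0.0
    for reference in references:
        ref_words = set(reference.lower().split())
        union = cand_words | ref_words
        if union:
            best = max(best, len(cand_words & ref_words) / len(union))
    return best


if __name__ == "__main__":
    refs = ["rain falls on a roof"]
    cands = ["a dog barks", "rain falls on a roof"]
    # The second candidate matches the reference exactly, so it sets the maximum.
    print(spider_max(cands, refs, toy_score))
```

With a real per-caption SPIDEr scorer passed as `score_fn`, a corpus-level value would presumably be obtained by averaging this per-clip maximum over all audio files; the abstract does not spell out that aggregation, so treat it as an assumption.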
