Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Paper Title

Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Authors

Mortazavi, Masood S.

Abstract

Semantically-aligned $(speech, image)$ datasets can be used to explore "visually-grounded speech". In a majority of existing investigations, features of an image signal are extracted using neural networks "pre-trained" on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without "transfer learning" through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low rates of recall in $speech \rightarrow image$ and $image \rightarrow speech$ queries. Choosing appropriate neural architectures for encoders in the speech and image branches and using large datasets, one can obtain competitive recall rates without any reliance on any pre-trained initialization or feature extraction: $(speech,image)$ semantic alignment and $speech \rightarrow image$ and $image \rightarrow speech$ retrieval are canonical tasks worthy of independent investigation of their own and allow one to explore other questions---e.g., the size of the audio embedder can be reduced significantly with little loss of recall rates in $speech \rightarrow image$ and $image \rightarrow speech$ queries.
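The recall rates mentioned in the abstract for $speech \rightarrow image$ and $image \rightarrow speech$ queries are commonly reported as recall@K over a set of paired embeddings. The sketch below shows one plausible way to compute that metric. It is an illustration only, not the paper's evaluation code: the function name `recall_at_k`, the use of cosine similarity, and the assumption that item `i` in one list is the true match of item `i` in the other are all choices made here for the example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose true match ranks among the top k candidates.

    Assumption for this sketch: queries[i] and candidates[i] are a matched
    (speech, image) pair, and similarity is cosine similarity.
    """
    q = [_normalize(v) for v in queries]
    c = [_normalize(v) for v in candidates]
    hits = 0
    for i, qi in enumerate(q):
        # Similarity of query i to every candidate.
        scores = [sum(a * b for a, b in zip(qi, cj)) for cj in c]
        # Zero-based rank of the true match: how many candidates score strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return hits / len(q)


# Toy 2-D embeddings (hypothetical values, not data from the paper).
speech_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_emb = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

print("speech -> image recall@1:", recall_at_k(speech_emb, image_emb, 1))
print("image -> speech recall@1:", recall_at_k(image_emb, speech_emb, 1))
```

Swapping the argument order switches the retrieval direction, so the same function covers both query types. Because the metric depends only on the final embeddings, it can be applied unchanged whether the encoders were pre-trained or trained from scratch.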
