Paper Title
Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training
Paper Authors
Paper Abstract
In this paper, we present a cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is inspired by ACME (Adversarial Cross-Modal Embedding) and H-T (Hierarchical Transformer). TNLBT aims to accomplish retrieval tasks while also generating images from recipe embeddings. We apply a Hierarchical Transformer-based recipe text encoder, a Vision Transformer (ViT)-based recipe image encoder, and an adversarial network architecture to enable better cross-modal embedding learning for recipe texts and images. In addition, we use self-supervised learning to exploit the rich information in recipe texts that have no corresponding images. Since contrastive learning can benefit from a larger batch size according to the recent literature on self-supervised learning, we adopt a large batch size during training and validate its effectiveness. In the experiments, the proposed framework significantly outperformed the current state-of-the-art frameworks on both cross-modal recipe retrieval and image generation tasks on the Recipe1M benchmark. This is the first work to confirm the effectiveness of large batch training for cross-modal recipe embeddings.
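The abstract does not specify the exact training objective, so the following is only a minimal sketch of why contrastive learning benefits from larger batches: a symmetric InfoNCE-style loss (a common choice, assumed here for illustration) contrasts each image-recipe pair against all other pairs in the batch, so a larger batch supplies more in-batch negatives. The embedding dimensions and batch size below are hypothetical.

```python
# Minimal sketch of a symmetric contrastive loss between image and recipe-text
# embeddings (assumed for illustration; not necessarily the paper's exact loss).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch_size, dim) outputs of the image and recipe encoders."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # Similarity matrix: every image is compared with every recipe in the batch,
    # so each positive pair is contrasted against (batch_size - 1) negatives.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss covers both retrieval directions (image-to-recipe and recipe-to-image).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with hypothetical sizes: a larger batch yields more in-batch negatives per pair.
img = torch.randn(512, 1024)  # e.g. ViT-based image embeddings
txt = torch.randn(512, 1024)  # e.g. Hierarchical Transformer recipe embeddings
loss = contrastive_loss(img, txt)
```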