Paper Title
Using Text to Teach Image Retrieval
Authors
Abstract
Image retrieval relies heavily on the quality of the data modeling and the distance measurement in the feature space. Building on the concept of the image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are then defined by the geodesic distance between images, represented as graph vertices or manifold samples. When only limited images are available, this manifold is sparsely sampled, making the geodesic computation and the corresponding retrieval harder. To address this, we augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. In addition to extensive results on standard datasets illustrating the power of text to help in image retrieval, a new public dataset based on CLEVR is introduced to quantify the semantic similarity between visual data and text data. The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval given only an image and a textual instruction on the desired modifications over the image.
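To make the abstract's core idea concrete, here is a minimal sketch (not the paper's code, and with toy 2D "features" standing in for learned embeddings) of geodesic distance on a sparsely sampled manifold, approximated by shortest paths over a k-nearest-neighbor graph of the samples:

```python
# Hypothetical sketch: geodesic distance between feature-space samples,
# approximated as the shortest path on a k-NN graph (cf. Isomap-style retrieval).
import heapq
import math

def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    adj = {i: {} for i in range(len(points))}
    for i in range(len(points)):
        neighbors = sorted(
            (dist(points[i], points[j]), j) for j in range(len(points)) if j != i
        )
        for d, j in neighbors[:k]:
            adj[i][j] = d
            adj[j][i] = d  # symmetrize so the graph is undirected
    return adj

def geodesic(adj, src):
    """Dijkstra from src: the graph approximation of geodesic distance."""
    dist = {v: math.inf for v in adj}
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Toy samples along a curve: the straight-line (Euclidean) distance between
# the endpoints (= 4.0) understates the geodesic distance along the manifold.
points = [(0, 0), (1, 0.5), (2, 1), (3, 0.5), (4, 0)]
d = geodesic(knn_graph(points, k=2), src=0)
```

Here `d[4]` (roughly 4.47) exceeds the Euclidean distance of 4.0 between the endpoints, illustrating why neighborhoods defined by geodesic rather than straight-line distance better respect the manifold, and why sparse sampling (few images, hence missing graph vertices) degrades the estimate that the paper's text augmentation is meant to repair.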