Embodied vision for learning object representations

Paper title

Embodied vision for learning object representations

Authors

Arthur Aubret, Céline Teulière, Jochen Triesch

Abstract

Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution only available in the central region of the field of view. Second, objects may be seen against a blurry background due to infants' limited depth of field. Third, during object manipulation a toddler mostly observes close objects filling a large part of the field of view due to their rather short arms. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent "play" with objects in different locations of a near photo-realistic flat. During each play session the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions. We conclude that the embodied nature of visual learning may be crucial for understanding the development of human object perception.
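
The abstract describes time-contrastive learning only in words: successive views of an object are mapped onto close-by internal representations. The abstract does not state which loss the authors use, so the following is a minimal sketch under the assumption of an InfoNCE-style objective, in which the positive for the frame at time t is the frame at time t + 1 and all other frames in the sequence serve as negatives. The function names `cosine` and `time_contrastive_loss`, the `temperature` parameter, and the toy two-dimensional embeddings are all illustrative choices, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def time_contrastive_loss(embeddings, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's stated loss).

    embeddings: list of vectors, one per frame, in temporal order.
    For frame t the positive is frame t + 1; every other frame is a negative.
    """
    n = len(embeddings)
    losses = []
    for t in range(n - 1):
        # Similarities of frame t to every other frame, scaled by temperature.
        logits = [cosine(embeddings[t], embeddings[j]) / temperature
                  for j in range(n) if j != t]
        # With index t dropped, frame t + 1 sits at position t in `logits`.
        positive = logits[t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - positive)
    return sum(losses) / len(losses)


# Toy embeddings: two views each of two hypothetical objects A and B.
a1, a2 = [1.0, 0.0], [0.9, 0.1]
b1, b2 = [0.0, 1.0], [0.1, 0.9]

# A "play session" order: all views of A, then all views of B.
ordered = time_contrastive_loss([a1, a2, b1, b2])
# An order in which every successive pair shows a different object.
interleaved = time_contrastive_loss([a1, b1, a2, b2])
```

In this toy setting the loss is lower when successive frames mostly show the same object, which is the temporal structure the play sessions in the abstract are described as providing. The one transition between objects in the ordered sequence still contributes a large loss term, since the frame after the body turn is treated as a positive even though it shows a different object.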
