Paper Title
Predicting Visual Attention and Distraction During Visual Search Using Convolutional Neural Networks
Paper Authors
Paper Abstract
Most studies in the computational modeling of visual attention address task-free observation of images, yet free-viewing saliency covers only a limited range of everyday scenarios. Most visual activities are goal-oriented and demand substantial top-down control of attention; visual search, in particular, requires more top-down control than free viewing. In this paper, we present two approaches to modeling observers' visual attention and distraction during visual search. Our first approach adapts a lightweight free-viewing saliency model to predict human observers' eye-fixation density maps over the pixels of search images, using a two-stream convolutional encoder-decoder network trained and evaluated on the COCO-Search18 dataset. This method predicts which locations are more distracting when searching for a particular target. Our network achieves good results on standard saliency metrics (AUC-Judd=0.95, AUC-Borji=0.85, sAUC=0.84, NSS=4.64, KLD=0.93, CC=0.72, SIM=0.54, and IG=2.59). Our second approach is object-based and predicts the distractor and target objects during visual search; distractors are all objects, other than the target, that observers fixate on during search. This method uses a Mask R-CNN segmentation network pre-trained on MS-COCO and fine-tuned on the COCO-Search18 dataset. We release our segmentation annotations of targets and distractors in COCO-Search18 for three target categories: bottle, bowl, and car. The average scores over the three categories are F1-score=0.64, mAP(IoU=0.5)=0.57, and mAR(IoU=0.5)=0.73. Our TensorFlow implementation is publicly available at https://github.com/ManooshSamiei/Distraction-Visual-Search .
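The abstract does not spell out the layer configuration of the two-stream encoder-decoder, so the sketch below is only a minimal TensorFlow/Keras illustration of the general idea: one stream encodes the search image, a second stream encodes the target exemplar, the two feature maps are fused, and a decoder produces a fixation density map. The input size, the three-level encoder/decoder, the concatenation fusion, and the KL-divergence loss are all illustrative assumptions, not the architecture or training setup used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_two_stream_model(input_shape=(320, 512, 3)):
    """Illustrative two-stream encoder-decoder for fixation density prediction.

    Assumptions (not from the paper): both streams take equal-sized inputs,
    each encoder downsamples by 8x, and features are fused by concatenation.
    """
    def encoder(name):
        inp = layers.Input(shape=input_shape, name=f"{name}_input")
        x = inp
        for filters in (32, 64, 128):
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
            x = layers.MaxPooling2D(2)(x)  # downsample by 2 at each level
        return inp, x

    img_in, img_feat = encoder("image")    # search image stream
    tgt_in, tgt_feat = encoder("target")   # target exemplar stream

    # Fuse the two streams and decode back to the input resolution.
    x = layers.Concatenate()([img_feat, tgt_feat])
    for filters in (128, 64, 32):
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    density = layers.Conv2D(1, 1, activation="sigmoid", name="fixation_density")(x)

    model = Model(inputs=[img_in, tgt_in], outputs=density)
    # KL divergence is a common loss for saliency/density prediction;
    # chosen here only for illustration.
    model.compile(optimizer="adam", loss=tf.keras.losses.KLDivergence())
    return model

if __name__ == "__main__":
    model = build_two_stream_model()
    model.summary()
```

In practice the target stream would typically receive a cropped or resized target exemplar and the predicted map would be normalized before computing distribution-based metrics such as KLD or CC; those details are left out of this sketch.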