Paper title
Evolution of a Web-Scale Near Duplicate Image Detection System
Authors
Abstract
Detecting near duplicate images is fundamental to the content ecosystem of photo sharing web applications. However, such a task is challenging when involving a web-scale image corpus containing billions of images. In this paper, we present an efficient system for detecting near duplicate images across 8 billion images. Our system consists of three stages: candidate generation, candidate selection, and clustering. We also demonstrate that this system can be used to greatly improve the quality of recommendations and search results across a number of real-world applications. In addition, we include the evolution of the system over the course of six years, bringing out experiences and lessons on how new systems are designed to accommodate organic content growth as well as the latest technology. Finally, we are releasing a human-labeled dataset of ~53,000 pairs of images introduced in this paper.
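The abstract names three stages (candidate generation, candidate selection, clustering) but does not say how each is implemented. The following is a minimal toy sketch of that three-stage shape, not the paper's method. Every technique in it is an assumption chosen for illustration: random-hyperplane hashing stands in for candidate generation, a cosine-similarity threshold stands in for candidate selection, and union-find connected components stand in for clustering. All function names, the `threshold` value, and the use of precomputed embedding vectors are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_key(vec, planes):
    # Assumption: one bit per random hyperplane (sign of the dot product).
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def generate_candidates(embeddings, planes):
    # Stage 1 (illustrative): images sharing a hash bucket become candidate pairs.
    buckets = defaultdict(list)
    for image_id, vec in embeddings.items():
        buckets[lsh_key(vec, planes)].append(image_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_candidates(pairs, embeddings, threshold=0.95):
    # Stage 2 (illustrative): keep pairs whose similarity clears a
    # hypothetical threshold; a real system could use a learned classifier.
    return [(a, b) for a, b in pairs
            if cosine(embeddings[a], embeddings[b]) >= threshold]


def cluster(image_ids, accepted_pairs):
    # Stage 3 (illustrative): connected components via union-find.
    parent = {i: i for i in image_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in accepted_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for i in image_ids:
        groups[find(i)].add(i)
    return sorted(sorted(g) for g in groups.values())


def detect_near_duplicates(embeddings, num_planes=4, threshold=0.95, seed=0):
    rng = random.Random(seed)
    dim = len(next(iter(embeddings.values())))
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    pairs = generate_candidates(embeddings, planes)
    accepted = select_candidates(pairs, embeddings, threshold)
    return cluster(list(embeddings), accepted)
```

Splitting the work this way keeps the expensive pairwise comparison off the full corpus: the cheap hashing step limits which pairs are ever scored, and clustering then merges accepted pairs transitively. With a single hash table, near-identical (but not identical) vectors can land in different buckets and be missed, so a realistic variant of this sketch would need several tables or a different retrieval scheme.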