Paper Title
A Comprehensive Survey of Transformers for Computer Vision
Authors
Abstract
Vision Transformers (ViTs) are a special type of transformer used in various computer vision (CV) applications, such as image recognition. Convolutional neural networks (CNNs) have several potential problems that ViTs can address. Different ViT variants are used for image coding tasks such as compression, super-resolution, segmentation, and denoising. The purpose of this survey is to present the applications of ViTs in CV. To the best of our knowledge, it is the first survey of its kind on ViTs for CV. We first classify the CV applications where ViTs are applicable: image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection. We then review the state of the art in each category and list the available models. Following that, we analyze and compare the models in detail and list the pros and cons of each. We then present our insights and lessons learned for each category. Finally, we discuss several open research challenges and future research directions.