Paper Title

Image Compression with Product Quantized Masked Image Modeling

Paper Authors

Alaaeldin El-Nouby, Matthew J. Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou

Paper Abstract

Recent neural compression methods have been based on the popular hyperprior framework. It relies on Scalar Quantization and offers very strong compression performance. This contrasts with recent advances in image generation and representation learning, where Vector Quantization is more commonly employed. In this work, we attempt to bring these lines of research closer by revisiting vector quantization for image compression. We build upon the VQ-VAE framework and introduce several modifications. First, we replace the vanilla vector quantizer with a product quantizer. This intermediate solution between vector and scalar quantization allows for a much wider set of rate-distortion points: it implicitly defines high-quality quantizers that would otherwise require intractably large codebooks. Second, inspired by the success of Masked Image Modeling (MIM) in the context of self-supervised learning and generative image models, we propose a novel conditional entropy model which improves entropy coding by modelling the co-dependencies of the quantized latent codes. The resulting PQ-MIM model is surprisingly effective: its compression performance is on par with recent hyperprior methods. It also outperforms HiFiC in terms of FID and KID metrics when optimized with perceptual losses (e.g. adversarial). Finally, since PQ-MIM is compatible with image generation frameworks, we show qualitatively that it can operate in a hybrid mode between compression and generation, with no further training or finetuning. As a result, we explore the extreme compression regime where an image is compressed into 200 bytes, i.e., less than a tweet.
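The product quantizer mentioned in the abstract is standard product quantization: the latent vector at each spatial position is split into M sub-vectors, each quantized against its own small codebook, so the effective joint codebook grows as K^M without ever being stored explicitly. Below is a minimal NumPy sketch of that idea only; the dimensions, the choice of M and K, and the random codebooks are illustrative assumptions and not the paper's actual configuration or training procedure.

```python
import numpy as np

# Minimal product quantization sketch (illustrative; not the paper's exact setup).
# A D-dimensional latent vector is split into M sub-vectors of size D // M.
# Each sub-vector is quantized independently against its own K-entry codebook,
# so the implicit joint codebook has K**M entries while only M * K centroids are stored.

rng = np.random.default_rng(0)

D, M, K = 32, 8, 256          # latent dim, number of sub-quantizers, codebook size (assumed values)
d = D // M                    # dimension of each sub-vector

# Codebooks: in practice these would be learned (e.g. with k-means or end-to-end);
# random centroids are used here purely for illustration.
codebooks = rng.normal(size=(M, K, d))

def pq_encode(x):
    """Map a latent vector x of shape [D] to M codebook indices."""
    subvectors = x.reshape(M, d)                                   # [M, d]
    dists = ((subvectors[:, None, :] - codebooks) ** 2).sum(-1)    # [M, K] squared distances
    return dists.argmin(axis=1)                                    # [M] integer codes

def pq_decode(codes):
    """Reconstruct the latent vector from its M codebook indices."""
    return np.stack([codebooks[m, codes[m]] for m in range(M)]).reshape(D)

x = rng.normal(size=D)
codes = pq_encode(x)          # M symbols, each in [0, K)
x_hat = pq_decode(codes)
print(codes.shape, x_hat.shape)   # (8,) (32,)
```

With M = 8 sub-quantizers of K = 256 entries each, one latent position costs at most M · log2(K) = 64 bits before entropy coding, while the implicit joint codebook has 256^8 = 2^64 entries. This is the sense in which product quantization reaches rate points that a single vector quantizer could only match with an intractably large codebook; the paper's conditional MIM entropy model then reduces the rate further by exploiting dependencies between these codes.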
