Paper Title

HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders

Paper Authors

Dhingra, Naina

Paper Abstract

In this paper, HeadPosr is proposed to predict head poses from a single RGB image. \textit{HeadPosr} uses a novel architecture that includes a transformer encoder. Concretely, it consists of: (1) a backbone; (2) a connector; (3) a transformer encoder; (4) a prediction head. The significance of using a transformer encoder for HPE is studied. An extensive ablation study is performed by varying (1) the number of encoders; (2) the number of heads; (3) the position embeddings; (4) the activations; and (5) the input channel size of the transformer used in HeadPosr. Further studies on using (1) different backbones and (2) different learning rates are also shown. The detailed experiments and ablation studies are conducted on three widely used open-source datasets for HPE, i.e., the 300W-LP, AFLW2000, and BIWI datasets. Experiments show that \textit{HeadPosr} outperforms all state-of-the-art methods, including both landmark-free methods and those based on landmark or depth estimation, on the AFLW2000 and BIWI datasets when trained on 300W-LP. It also outperforms them when the results are averaged across the compared datasets, thereby setting a benchmark for the HPE problem and demonstrating the effectiveness of transformers over the state of the art.
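
To make the four-stage pipeline described above concrete, the following is a minimal sketch of a HeadPosr-style model in PyTorch. The choice of a ResNet-50 backbone, a 1x1-conv connector, a learned positional embedding, the layer/head counts, and the mean-pooled prediction head regressing yaw/pitch/roll are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a HeadPosr-style architecture:
# backbone -> connector -> transformer encoder -> prediction head.
# Sizes and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class HeadPosrSketch(nn.Module):
    def __init__(self, d_model=256, num_encoders=6, num_heads=8):
        super().__init__()
        # (1) Backbone: ResNet-50 without its pooling/classification layers.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, 7, 7) for 224x224 input
        # (2) Connector: project backbone channels to the transformer width.
        self.connector = nn.Conv2d(2048, d_model, kernel_size=1)
        # Learned positional embedding over the flattened 7x7 = 49 spatial tokens (assumption).
        self.pos_embed = nn.Parameter(torch.zeros(1, 49, d_model))
        # (3) Transformer encoder stack.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True, activation="relu"
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_encoders)
        # (4) Prediction head: pooled token features -> (yaw, pitch, roll).
        self.head = nn.Linear(d_model, 3)

    def forward(self, x):
        f = self.connector(self.backbone(x))            # (B, d_model, 7, 7)
        tokens = f.flatten(2).transpose(1, 2)           # (B, 49, d_model)
        tokens = self.encoder(tokens + self.pos_embed)  # (B, 49, d_model)
        return self.head(tokens.mean(dim=1))            # (B, 3) Euler angles


# Example: one forward pass on a single 224x224 RGB image.
if __name__ == "__main__":
    model = HeadPosrSketch()
    angles = model(torch.randn(1, 3, 224, 224))
    print(angles.shape)  # torch.Size([1, 3])
```

The ablation axes mentioned in the abstract (number of encoders, number of heads, activation, input channel size) map directly to the constructor arguments of this sketch, which is why they are exposed as parameters here.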
