Paper Title
Combining Efficient and Precise Sign Language Recognition: Good pose estimation library is all you need
Paper Authors
Paper Abstract
Sign language recognition could significantly improve the user experience for d/Deaf people using general consumer technology, such as IoT devices or videoconferencing. However, current sign language recognition architectures are usually computationally heavy and require robust GPU-equipped hardware to run in real time. Some models target lower-end devices (such as smartphones) by minimizing their size and complexity, which leads to worse accuracy. This trade-off hinders accurate in-the-wild applications. We build upon the SPOTER architecture, which belongs to the latter group of lightweight methods, as it comes close to the performance of the large models employed for this task. By substituting its original third-party pose estimation module with the MediaPipe library, we achieve an overall state-of-the-art result on the WLASL100 dataset. Notably, our method beats previous larger architectures while remaining twice as computationally efficient and almost 11 times faster on inference when compared to a relevant benchmark. To demonstrate our method's combined efficiency and precision, we built an online demo that enables users to translate sign lemmas of American Sign Language in their browsers. To the best of our knowledge, this is the first publicly available online application demonstrating this task.