Paper Title
Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos
Paper Authors
Paper Abstract
Video action segmentation has been widely applied in many fields. Most previous studies employed video-based vision models for this purpose; however, these models typically rely on large receptive fields, LSTMs, or Transformers to capture long-term dependencies within videos, leading to substantial computational resource requirements. Graph-based models have been proposed to address this challenge, but earlier graph-based approaches are less accurate. Hence, this study introduces a graph-structured approach named Semantic2Graph to model long-term dependencies in videos, thereby reducing computational cost while improving accuracy. We construct a frame-level graph structure of the video. Temporal edges model the temporal relations and action order within the video. In addition, we design positive and negative semantic edges, with corresponding edge weights, to capture both long-term and short-term semantic relationships among video actions. Node attributes comprise a rich set of multi-modal features extracted from the video content, the graph structure, and label text, covering visual, structural, and semantic cues. To synthesize this multi-modal information effectively, we employ a graph neural network (GNN) model to fuse the multi-modal features for node-level action label classification. Experimental results demonstrate that Semantic2Graph outperforms state-of-the-art methods, particularly on benchmark datasets such as GTEA and 50Salads. Multiple ablation experiments further validate the effectiveness of the semantic features in enhancing model performance. Notably, the inclusion of semantic edges in Semantic2Graph enables cost-effective capture of long-term dependencies, confirming its utility in addressing the computational resource constraints of video-based vision models.
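To make the pipeline described in the abstract concrete, the following is a minimal sketch of a frame-level video graph with weighted temporal edges and a small GNN that classifies each frame (node) into an action label. It assumes PyTorch Geometric and is not the authors' implementation: the helper names (build_frame_graph, FrameGNN), feature dimensions, edge weights, and the omitted semantic-edge sampling rule are all illustrative placeholders.

```python
# Minimal sketch (illustrative only, assuming PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

def build_frame_graph(frame_feats, labels=None):
    """Connect consecutive frames with bidirectional temporal edges.
    Positive/negative semantic edges with their own weights would be
    appended here following the paper's (unspecified in the abstract) rule."""
    T = frame_feats.size(0)
    src = torch.arange(T - 1)
    dst = torch.arange(1, T)
    edge_index = torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
    edge_weight = torch.ones(edge_index.size(1))  # placeholder temporal-edge weights
    return Data(x=frame_feats, edge_index=edge_index, edge_weight=edge_weight, y=labels)

class FrameGNN(torch.nn.Module):
    """Two-layer GCN that fuses node features over the graph and
    predicts a per-frame action label."""
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, num_classes)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index, data.edge_weight))
        return self.conv2(h, data.edge_index, data.edge_weight)

# Usage: 300 frames of hypothetical 2048-d per-frame visual features.
graph = build_frame_graph(torch.randn(300, 2048))
logits = FrameGNN(2048, 256, num_classes=11)(graph)  # [300, 11] frame-level action scores
```

In this sketch the multi-modal fusion happens implicitly through message passing over the weighted edges; the actual node features would concatenate visual, structural, and label-text embeddings as the abstract describes.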