Dynamic Graph Modeling for Weakly-Supervised Temporal Action Localization

ACM MM 2022 2022.06.30,

Haichao Shi, Xiao-Yu Zhang*, Changsheng Li, Lixing Gong, Yong Li, Yongjun Bao.


Weakly supervised action localization is a challenging task that aims to localize action instances in untrimmed videos given only video-level supervision. Existing methods mostly distinguish action from background via attentive feature fusion with RGB and optical flow modalities. Unfortunately, this strategy fails to retain the distinct characteristics of each modality, leading to inaccurate localization under hard-to-discriminate cases such as action-context interference and in-action stationary period. As an action is typically comprised of multiple stages, an intuitive solution is to model the relation between the finer-grained action segments to obtain a more detailed analysis. In this paper, we propose a dynamic graph-based method, namely DGCNN, to explore the two-stream relation between action segments. To be specific, segments within a video which are likely to be actions are dynamically selected to construct an action graph. For each graph, a triplet adjacency matrix is devised to explore the temporal and contextual correlations between the pseudo action segments, which consists of three components, i.e., mutual importance, feature similarity, and high-level contextual similarity. The two-stream dynamic pseudo graphs, along with the pseudo background segments, are used to derive more detailed video representation. For action localization, a non-local based temporal refinement module is proposed to fully leverage the temporal consistency between consecutive segments. Experimental results on three datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3, demonstrate that our method is superior to the state-of-the-arts.