SAPS: Self-Attentive Pathway Search for weakly-supervised action localization with background-action augmentation

Computer Vision and Image Understanding (CVIU), 2021.07.24.

Xiao-Yu Zhang*, Yaru Zhang, Haichao Shi, Jing Dong.

Abstract

Weakly supervised temporal action localization is a challenging computer vision task that aims to derive frame-level action labels from video-level supervision alone. The attention mechanism is a widely used paradigm for action recognition and localization in recent methods. However, existing attention-based methods mostly capture the global dependency of the frame sequence while ignoring local inter-frame distances. Moreover, during background modeling, diverse background contents are typically lumped into a single category, which inevitably jeopardizes the discriminative ability of classifiers and introduces irrelevant noise. In this paper, we present a novel self-attentive pathway search framework, namely SAPS, to address these challenges. To achieve a comprehensive representation with discriminative attention weights, we design a NAS-based attentive module with a path-level search process and construct a competitive attention structure that reveals both local and global dependencies. Furthermore, we propose action-related background modeling for robust background-action augmentation, where knowledge derived from the background provides informative clues for action recognition. An ensemble T-CAM operation is subsequently designed to incorporate background information and further refine the temporal action localization results. Extensive experiments on two benchmark datasets (i.e., THUMOS14 and ActivityNet1.2) clearly corroborate the efficacy of our method.
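To make the ensemble T-CAM idea concrete, the following is a minimal sketch of one plausible fusion scheme: per-frame action scores are down-weighted by background evidence before thresholding into segments. The function names, the subtraction-based fusion rule, and the parameters `alpha` and `threshold` are illustrative assumptions, not the paper's exact formulation.

```python
def ensemble_tcam(action_scores, background_scores, alpha=0.5):
    """Fuse a frame-level action T-CAM with background evidence.

    action_scores, background_scores: one float per frame in [0, 1].
    Hypothetical rule: frames with strong background evidence are
    suppressed in proportion to alpha.
    """
    assert len(action_scores) == len(background_scores)
    return [a - alpha * b for a, b in zip(action_scores, background_scores)]


def localize(fused_scores, threshold=0.3):
    """Group consecutive above-threshold frames into (start, end) segments."""
    segments, start = [], None
    for i, s in enumerate(fused_scores):
        if s >= threshold and start is None:
            start = i                      # segment opens
        elif s < threshold and start is not None:
            segments.append((start, i - 1))  # segment closes
            start = None
    if start is not None:                  # segment reaches the last frame
        segments.append((start, len(fused_scores) - 1))
    return segments
```

For example, fusing `[0.9, 0.8, 0.2, 0.9]` with background scores `[0.1, 0.1, 0.9, 0.2]` suppresses the third frame, so localization yields two segments, `(0, 1)` and `(3, 3)`.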