Enhancing Temporal Action Localization: Advanced State Space Modeling with a Recurrent Mechanism
- Keywords: Temporal action localization, State space model, Recurrent mechanism, Architecture design, Feature aggregation
- Issuing institution: Sogang University, Graduate School
- Advisor: 박형민
- Publication year: 2024
- Date of degree conferral: August 2024
- Degree: Master's
- Department and major: Graduate School, Department of Electronic Engineering
- URI: http://www.dcollection.net/handler/sogang/000000079229
- UCI: I804:11029-000000079229
- Language of text: English
- Copyright: Theses of Sogang University are protected by copyright.
Abstract
Temporal Action Localization (TAL) is crucial for understanding actions in videos: it classifies actions and determines their temporal boundaries. Traditional deep learning approaches based on CNNs, RNNs, GCNs, and Transformers have made significant strides, yet they often struggle to capture long-range dependencies (LRD) and to process long video sequences efficiently. This thesis draws on insights from these methods to enhance TAL performance, specifically by combining a Feature Aggregated Bi-S6 block design, a Dual Bi-S6 structure, and a Recurrent mechanism.

Our approach involves three key components. First, the Feature Aggregated Bi-S6 block applies multiple Conv1D layers with different kernel sizes in parallel and sums their outputs to capture local contexts of different ranges. The aggregated result is then fed into a Bi-directional S6 (Bi-S6) block, enhancing its capacity to model complex features. Second, the Dual Bi-S6 structure employs two parallel Feature Aggregated Bi-S6 blocks: one (TFA-Bi-S6) processes the temporal dimension and the other (C-Bi-S6) processes the channel dimension. Their outputs are combined by point-wise multiplication, effectively integrating temporal dependencies of spatiotemporal features with spatiotemporal dependencies of temporal features, thereby making TAL more robust. Third, the Recurrent mechanism applies the Dual Bi-S6 structure recursively r times in a residual manner, allowing the model to iteratively refine its representation of long-range dependencies.

The proposed architecture, inspired by ActionFormer and ActionMamba, includes a pretrained video encoder that extracts spatiotemporal features for each clip of a video, a Backbone that captures dependencies and extracts features at multiple temporal resolutions from the resulting clip-level feature sequence, a simple post-processing Neck that handles the multiple temporal resolutions, and a Head that classifies actions and regresses video segments from the post-processed results.

Extensive evaluation on the THUMOS-14, ActivityNet, FineAction, and HACS benchmarks validates our approach, with our models outperforming existing state-of-the-art solutions. We achieve mean Average Precision (mAP) scores of 74.2% on THUMOS-14, 42.9% on ActivityNet, 29.6% on FineAction, and 45.8% on HACS.
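The following is a minimal, runnable PyTorch sketch of the three components described in the abstract: multi-kernel Conv1D feature aggregation feeding a bi-directional sequence block, the dual temporal/channel branches fused by point-wise multiplication, and the recurrent residual refinement applied r times. It is an illustration under stated assumptions, not the thesis implementation: the BiS6 class below is a hypothetical stand-in (a bi-directional GRU) for the actual selective state space (S6/Mamba-style) block, and the class names, kernel sizes, and r = 2 are chosen only for the example.

# Sketch of Feature Aggregated Bi-S6, Dual Bi-S6, and the Recurrent mechanism.
# BiS6 is a placeholder so the sketch runs end to end; the thesis uses an
# S6 (selective state space) scan, which is not reproduced here.
import torch
import torch.nn as nn


class BiS6(nn.Module):
    """Hypothetical stand-in for a bi-directional S6 block (here: a Bi-GRU)."""

    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        out, _ = self.rnn(x)
        return out


class FeatureAggregatedBiS6(nn.Module):
    """Parallel Conv1D layers with different kernel sizes, summed, then Bi-S6."""

    def __init__(self, dim: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.bi_s6 = BiS6(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        h = x.transpose(1, 2)                    # (B, C, T) for Conv1d
        h = sum(conv(h) for conv in self.convs)  # aggregate local contexts
        return self.bi_s6(h.transpose(1, 2))     # back to (B, T, C)


class DualBiS6(nn.Module):
    """Temporal branch (TFA-Bi-S6) and channel branch (C-Bi-S6), fused point-wise."""

    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        self.temporal = FeatureAggregatedBiS6(dim)      # scans over time T
        self.channel = FeatureAggregatedBiS6(seq_len)   # scans over channels C

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        t = self.temporal(x)                                  # (B, T, C)
        c = self.channel(x.transpose(1, 2)).transpose(1, 2)   # (B, T, C)
        return t * c                                          # point-wise multiplication


class RecurrentDualBiS6(nn.Module):
    """Apply the Dual Bi-S6 structure r times with residual connections."""

    def __init__(self, dim: int, seq_len: int, r: int = 2):
        super().__init__()
        self.block = DualBiS6(dim, seq_len)
        self.r = r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.r):
            x = x + self.block(x)  # iterative residual refinement
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 64, 128)   # (batch, clips T, feature dim C)
    model = RecurrentDualBiS6(dim=128, seq_len=64, r=2)
    print(model(feats).shape)         # torch.Size([2, 64, 128])

One design note on this sketch: treating the channel branch as a sequence over C, with the temporal length as its feature dimension, mirrors the abstract's description of C-Bi-S6, but it fixes the clip-sequence length at construction time; the sketch also reuses one Dual Bi-S6 instance across the r recurrent steps, i.e. weights are shared between iterations.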
Table of Contents
1 Introduction
1.1 Motivation
1.2 Overview of the Proposed Methods
2 Related Works
2.1 CNNs
2.2 RNNs
2.3 GNNs
2.4 Transformers
3 Proposed Methods
3.1 State Space Model (SSM)
3.2 Proposed Methods
4 Experiments
4.1 Dataset and Evaluation Metrics
4.2 Evaluation on Benchmarks
4.3 Ablation Studies
5 Conclusion
5.1 Conclusion
5.2 Limitations and Future Work
Bibliography