Abstract
The rapid growth of video data across surveillance, healthcare, robotics, and entertainment applications has created a demand for efficient action recognition approaches. Traditional supervised models rely heavily on annotated datasets, which require extensive human labor and introduce annotation bias. This limitation has motivated research interest in learning meaningful video representations from unlabeled sequences. The present work addresses this challenge by developing a self-supervised learning framework that exploits temporal consistency and cross-frame feature alignment. The method incorporates contrastive learning, spatial-temporal masking, and proxy tasks that predict motion direction and frame order. The framework avoids manual annotation and instead learns discriminative features by enforcing relationships across multiple augmented versions of the same video. A video encoder extracts spatio-temporal cues, while a temporal transformer module preserves motion dynamics across frames. A contrastive objective aligns augmented views, and a pretext classifier predicts masked patches and shuffled segments. Experimental results indicate that the proposed framework achieves 88.1% accuracy, 86.4% precision, 85.2% recall, and an 85.8% F1-score, a significant improvement over existing approaches. Latency stabilizes at 126 ms, confirming the framework's suitability for near real-time applications. These results validate that the proposed method provides strong temporal reasoning, enhanced representation consistency, and improved downstream action recognition performance on unlabeled datasets.
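To make the described pipeline concrete, the following is a minimal illustrative sketch in PyTorch of the components named above: a video encoder, a temporal transformer, a contrastive objective aligning two augmented views, and a frame-order pretext task. It is not the paper's implementation; all class names, layer sizes, and the toy 3D-CNN backbone are assumptions made purely for illustration, and the spatial-temporal masking task is omitted for brevity.

```python
# Minimal sketch (not the paper's released code): a toy 3D-CNN encoder,
# a temporal transformer, an InfoNCE-style contrastive loss over two
# augmented clips, and a frame-order pretext head. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    """Extracts per-frame spatio-temporal features from a clip (B, 3, T, H, W)."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
        )

    def forward(self, x):
        feats = self.conv(x)                                  # (B, dim, T, 1, 1)
        return feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, dim)


class TemporalTransformer(nn.Module):
    """Models motion dynamics across the T frame tokens."""
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens):          # (B, T, dim)
        return self.encoder(tokens)


def info_nce(z1, z2, temperature=0.1):
    """Contrastive objective aligning two augmented views of the same clip."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                        # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)      # matching pairs on diagonal
    return F.cross_entropy(logits, targets)


class SSLVideoModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = VideoEncoder(dim)
        self.temporal = TemporalTransformer(dim)
        self.proj = nn.Linear(dim, dim)        # projection head for the contrastive loss
        self.order_head = nn.Linear(dim, 2)    # pretext task: ordered vs. shuffled clip

    def forward(self, clip):
        tokens = self.temporal(self.encoder(clip))  # (B, T, dim)
        pooled = tokens.mean(dim=1)                 # clip-level embedding
        return self.proj(pooled), self.order_head(pooled)


if __name__ == "__main__":
    model = SSLVideoModel()
    view1 = torch.randn(4, 3, 8, 32, 32)            # two augmentations of the same clips
    view2 = torch.randn(4, 3, 8, 32, 32)
    shuffled = view1[:, :, torch.randperm(8)]       # temporally shuffled version of view1
    z1, _ = model(view1)
    z2, _ = model(view2)
    _, order_logits = model(torch.cat([view1, shuffled]))
    order_labels = torch.cat([torch.ones(4), torch.zeros(4)]).long()
    loss = info_nce(z1, z2) + F.cross_entropy(order_logits, order_labels)
    loss.backward()
    print(f"combined self-supervised loss: {loss.item():.4f}")
```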
Authors
P. Jeyaprabhavathi
Arden University, United Kingdom
Keywords
Self-Supervised Learning, Temporal Modeling, Contrastive Learning, Action Recognition, Video Representation