SELF-SUPERVISED MULTI-MODAL VIDEO FEATURE REPRESENTATION LEARNING FOR SCALABLE ACTION RECOGNITION IN UNLABELED VISUAL DATASETS

ICTACT Journal on Image and Video Processing (Volume: 16, Issue: 2)

Abstract

The rapid growth of video data across surveillance, healthcare, robotics, and entertainment applications has created demand for efficient action recognition. Traditional supervised models rely heavily on annotated datasets, which require extensive human labor and introduce annotation bias. This limitation has motivated research into learning meaningful video representations from unlabeled sequences. The present work addresses this challenge with a self-supervised learning framework that exploits temporal consistency and cross-frame feature alignment. The method combines contrastive learning, spatio-temporal masking, and proxy tasks that predict motion direction and frame order. The framework avoids manual annotation and instead learns discriminative features by enforcing relationships across multiple augmented versions of the same video. A video encoder extracts spatio-temporal cues, while a temporal transformer module preserves motion dynamics across frames. A contrastive objective aligns augmented views, and a pretext classifier predicts masked patches and shuffled segments. Experimental results indicate that the proposed framework achieved 88.1% accuracy, 86.4% precision, 85.2% recall, and an 85.8% F1-score, a significant improvement over existing approaches. Latency stabilized at 126 ms, which supports the framework's suitability for near real-time applications. These results indicate that the proposed method provides strong temporal reasoning, enhanced representation consistency, and improved downstream action recognition performance on unlabeled datasets.
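
The abstract does not state which contrastive loss is used to align augmented views. As an illustration only, the following minimal sketch assumes an InfoNCE-style objective, a common choice for this kind of alignment. The function names, the temperature of 0.1, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce_loss(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same clip.

    For each anchor in view_a, the matching row of view_b is the positive
    and every other row of view_b acts as a negative.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the positive at index i.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy data (assumption): four 8-dimensional clip embeddings and a second
# "augmented view" produced by adding small Gaussian noise.
random.seed(0)
clips = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
augmented = [[x + random.gauss(0, 0.05) for x in v] for v in clips]
mismatched = augmented[1:] + augmented[:1]  # pair each clip with the wrong view

aligned_loss = info_nce_loss(clips, augmented)
misaligned_loss = info_nce_loss(clips, mismatched)
```

In this sketch the loss is lower when each clip is paired with its own augmented view than when the pairs are rotated, which is the behavior a contrastive alignment objective is meant to reward.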

Authors

P. Jeyaprabhavathi
Arden University, United Kingdom

Keywords

Self-Supervised Learning, Temporal Modeling, Contrastive Learning, Action Recognition, Video Representation

Published By: ICTACT
Date of Publication: November 2025
Pages: 3746-3751