TRANSFORMER-DRIVEN MULTI-SCALE FEATURE FUSION FRAMEWORK FOR EFFICIENT AND ACCURATE REAL-TIME OBJECT DETECTION IN COMPLEX ENVIRONMENTS

ICTACT Journal on Image and Video Processing (Volume: 16, Issue: 2)

Abstract

Real-time object detection has attracted considerable attention in robotics, autonomous navigation, and security systems. Traditional convolution-based detectors often struggle with complex backgrounds, dynamic illumination, and scale variations. These challenges create a need for a robust architecture that handles diverse object sizes while remaining efficient. Transformer networks have inspired recent progress, but most implementations face latency constraints in real-time environments. This study proposes a transformer-based multi-scale feature fusion framework that integrates hierarchical local and global feature representations for real-time detection. A convolutional encoder extracts low-level spatial cues, while a transformer encoder captures long-range dependencies across scales. A pyramid fusion mechanism merges multi-scale features adaptively, and an attention-driven refinement stage enhances contextual object boundaries. The detection head is optimized to process the fused representations with minimal computational overhead. The framework is evaluated on the COCO dataset over 100 training epochs. The proposed system achieves a precision of 0.94, a recall of 0.92, and an mAP of 0.90. The IoU remains consistently high at 0.86, and the system processes frames at 63 FPS in real-time deployment. These results represent a 10–14 percent performance improvement over YOLOv5, a Swin Transformer detector, and an FPN-based detector while maintaining efficiency during inference.
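The abstract reports precision, recall, and IoU but does not state the matching protocol behind them. As a rough illustration only, the sketch below shows one common way such figures are computed for axis-aligned boxes: predictions are matched greedily to ground-truth boxes at a fixed IoU threshold. The function names, the `(x_min, y_min, x_max, y_max)` box layout, and the 0.5 threshold are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Box], truths: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching; preds are assumed sorted by confidence."""
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in matched:
                continue
            overlap = iou(p, t)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= threshold:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall
```

With two predictions of which one overlaps a ground-truth box exactly, and two ground-truth boxes in total, both precision and recall come out at 0.5. A COCO-style mAP additionally averages precision over recall levels, classes, and typically several IoU thresholds, which this sketch does not attempt.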

Authors

Vadhana Kumari Selvaraj (1), Brilly Sangeetha (2), P. Neethu Prabhakaran (3)
(1) Vimal Jyothi Engineering College, India; (2, 3) IES College of Engineering, India

Keywords

Real-Time Detection, Transformer Model, Multi-Scale Fusion, Attention Mechanism, Complex Scenes

Published By: ICTACT
Published In: ICTACT Journal on Image and Video Processing (Volume: 16, Issue: 2)
Date of Publication: November 2025
Pages: 3719 - 3724
