Abstract
Real-time object detection has gained considerable attention in robotics, autonomous navigation, and security systems. Traditional convolution-based detectors often struggle with complex backgrounds, dynamic illumination, and scale variation, creating a need for a robust architecture that handles diverse object sizes while remaining efficient. Transformer networks have inspired recent progress, but most implementations face latency constraints in real-time environments. This study proposes a transformer-based multi-scale feature fusion framework that integrates hierarchical local and global feature representations for real-time detection. The method incorporates a convolutional encoder to extract low-level spatial cues, while a transformer encoder captures long-range dependencies across scales. A pyramid fusion mechanism merges multi-scale features adaptively, and an attention-driven refinement stage enhances contextual object boundaries. The detection head is optimized to process the fused representations with minimal computational overhead. Experimental evaluation is performed on the COCO dataset over 100 training epochs. The proposed system achieves a precision of 0.94, a recall of 0.92, and an mAP of 0.90; the IoU remains consistently high at 0.86, and the system processes frames at 63 FPS in real-time deployment. These results demonstrate a 10–14 percent performance improvement over YOLOv5, a Swin Transformer detector, and an FPN-based detector while maintaining efficiency during inference.
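The sketch below illustrates the pipeline described above in PyTorch form: a convolutional encoder for low-level spatial cues, a transformer encoder over the coarsest feature map for long-range dependencies, pyramid-style fusion of the multi-scale features, an attention-driven refinement stage, and a lightweight detection head. It is a minimal illustration under assumed hyperparameters (channel width, head count, class count), not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiScaleFusionDetector(nn.Module):
    """Illustrative sketch of the abstract's architecture; layer sizes are assumptions."""

    def __init__(self, num_classes: int = 80, dim: int = 128):
        super().__init__()
        # Convolutional encoder: three stages producing features at 1/4, 1/8, 1/16 scale.
        self.stage1 = nn.Sequential(nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        # Transformer encoder on the coarsest map to capture long-range dependencies.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.global_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Pyramid fusion: merge upsampled coarse features with finer scales.
        self.fuse = nn.Conv2d(dim * 3, dim, 1)
        # Attention-driven refinement: channel attention over the fused map.
        self.refine = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        # Lightweight detection head: per-location class scores and box offsets.
        self.cls_head = nn.Conv2d(dim, num_classes, 1)
        self.box_head = nn.Conv2d(dim, 4, 1)

    def forward(self, x: torch.Tensor):
        c1 = self.stage1(x)   # fine scale (1/4)
        c2 = self.stage2(c1)  # medium scale (1/8)
        c3 = self.stage3(c2)  # coarse scale (1/16)
        b, d, h, w = c3.shape
        tokens = c3.flatten(2).transpose(1, 2)  # (B, HW, D) tokens for the transformer
        c3 = self.global_encoder(tokens).transpose(1, 2).reshape(b, d, h, w)
        # Upsample coarser maps to the finest resolution and fuse adaptively.
        up = lambda t: nn.functional.interpolate(t, size=c1.shape[-2:], mode="nearest")
        fused = self.fuse(torch.cat([c1, up(c2), up(c3)], dim=1))
        fused = fused * self.refine(fused)  # attention refinement of fused features
        return self.cls_head(fused), self.box_head(fused)


if __name__ == "__main__":
    model = MultiScaleFusionDetector()
    scores, boxes = model(torch.randn(1, 3, 256, 256))
    print(scores.shape, boxes.shape)  # (1, 80, 64, 64) and (1, 4, 64, 64)
```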
Authors
Vadhana Kumari Selvaraj1, Brilly Sangeetha2, P. Neethu Prabhakaran3
Vimal Jyothi Engineering College, India1, IES College of Engineering, India2,3
Keywords
Real-Time Detection, Transformer Model, Multi-Scale Fusion, Attention Mechanism, Complex Scenes