Abstract
Real-time object detection has gained considerable attention in robotics, autonomous navigation, and security systems. Traditional convolution-based detectors often struggle with complex backgrounds, dynamic illumination, and scale variation, creating a need for a robust architecture that handles diverse object sizes while remaining efficient. Transformer networks have inspired recent progress, but most implementations face latency constraints in real-time environments. This study proposes a transformer-based multi-scale feature fusion framework that integrates hierarchical local and global feature representations for real-time detection. The method incorporates a convolutional encoder to extract low-level spatial cues, while a transformer encoder captures long-range dependencies across scales. A pyramid fusion mechanism merges multi-scale features adaptively, and an attention-driven refinement stage enhances contextual object boundaries. The detection head is optimized to process the fused representations with minimal computational overhead. Experimental evaluation is performed on the COCO dataset over 100 training epochs. The proposed system achieves a precision of 0.94, a recall of 0.92, and an mAP of 0.90; the IoU remains consistently high at 0.86, and the system processes frames at 63 FPS in real-time deployment. These results demonstrate a 10–14 percent performance improvement over YOLOv5, a Swin Transformer detector, and an FPN-based detector while maintaining efficiency during inference.
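The sketch below illustrates the pipeline described above in PyTorch form: a convolutional encoder for low-level spatial cues, a transformer encoder over the coarsest feature map for long-range dependencies, pyramid-style fusion of the multi-scale features, an attention-driven refinement stage, and a lightweight detection head. It is a minimal illustration under assumed hyperparameters (channel width, head count, class count), not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiScaleFusionDetector(nn.Module):
    """Illustrative sketch of the abstract's architecture; layer sizes are assumptions."""

    def __init__(self, num_classes: int = 80, dim: int = 128):
        super().__init__()
        # Convolutional encoder: three stages producing features at 1/4, 1/8, 1/16 scale.
        self.stage1 = nn.Sequential(nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        # Transformer encoder on the coarsest map to capture long-range dependencies.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.global_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Pyramid fusion: merge upsampled coarse features with finer scales.
        self.fuse = nn.Conv2d(dim * 3, dim, 1)
        # Attention-driven refinement: channel attention over the fused map.
        self.refine = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        # Lightweight detection head: per-location class scores and box offsets.
        self.cls_head = nn.Conv2d(dim, num_classes, 1)
        self.box_head = nn.Conv2d(dim, 4, 1)

    def forward(self, x: torch.Tensor):
        c1 = self.stage1(x)   # fine scale (1/4)
        c2 = self.stage2(c1)  # medium scale (1/8)
        c3 = self.stage3(c2)  # coarse scale (1/16)
        b, d, h, w = c3.shape
        tokens = c3.flatten(2).transpose(1, 2)  # (B, HW, D) tokens for the transformer
        c3 = self.global_encoder(tokens).transpose(1, 2).reshape(b, d, h, w)
        # Upsample coarser maps to the finest resolution and fuse adaptively.
        up = lambda t: nn.functional.interpolate(t, size=c1.shape[-2:], mode="nearest")
        fused = self.fuse(torch.cat([c1, up(c2), up(c3)], dim=1))
        fused = fused * self.refine(fused)  # attention refinement of fused features
        return self.cls_head(fused), self.box_head(fused)


if __name__ == "__main__":
    model = MultiScaleFusionDetector()
    scores, boxes = model(torch.randn(1, 3, 256, 256))
    print(scores.shape, boxes.shape)  # (1, 80, 64, 64) and (1, 4, 64, 64)
```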
Authors
Vadhana Kumari Selvaraj1, Brilly Sangeetha2, P. Neethu Prabhakaran3
Vimal Jyothi Engineering College, India1, IES College of Engineering, India2,3
Keywords
Real-Time Detection, Transformer Model, Multi-Scale Fusion, Attention Mechanism, Complex Scenes