GENERATIVE ADVERSARIAL NETWORKS (GANs) IN MULTIMODAL AI: BRIDGING TEXT, IMAGE, AND AUDIO DATA FOR ENHANCED MODEL PERFORMANCE
Abstract
The integration of multimodal data is critical to advancing artificial intelligence models capable of interpreting diverse and complex inputs. While standalone models excel at processing individual data types such as text, image, or audio, they often fail to achieve comparable performance when these modalities are combined. Generative Adversarial Networks (GANs) have emerged as a transformative approach in this domain because of their ability to synthesize and learn effectively across disparate data types. This study addresses the challenge of bridging multimodal datasets to improve the generalization and performance of AI models. The proposed framework employs a novel GAN architecture that integrates textual, visual, and auditory data streams. Using a shared latent space, the system generates coherent representations for cross-modal understanding, ensuring seamless data fusion. The GAN model is trained on a benchmark dataset comprising 50,000 multimodal instances, with 25% allocated for testing. Results indicate significant improvements in multimodal synthesis and classification accuracy: the model achieves a text-to-image synthesis FID score of 14.7, an audio-to-text BLEU score of 35.2, and a cross-modal classification accuracy of 92.3%. These outcomes surpass existing models by 8-15% on comparable metrics, highlighting the GAN's effectiveness in handling data heterogeneity. The findings suggest potential applications in areas such as virtual assistants, multimedia analytics, and cross-modal content generation.
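The shared-latent-space idea described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's architecture: the feature dimensions, the latent size of 64, and the random linear encoders are all assumptions chosen for demonstration. In the actual GAN framework these encoders would be trained jointly with a discriminator; here random projections merely show how three modality-specific feature vectors end up comparable in one space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature dimensions (not taken from the paper).
DIMS = {"text": 300, "image": 512, "audio": 128}
LATENT = 64  # illustrative shared latent dimensionality

# One linear encoder per modality, projecting into the shared latent space.
# In a trained GAN these weights would be learned adversarially; here they
# are fixed random projections purely to show the data flow.
encoders = {m: rng.standard_normal((d, LATENT)) / np.sqrt(d)
            for m, d in DIMS.items()}

def encode(modality: str, features: np.ndarray) -> np.ndarray:
    """Project modality-specific features into the shared latent space."""
    z = features @ encoders[modality]
    return z / np.linalg.norm(z)  # unit-normalize for cosine comparison

# Toy inputs standing in for extracted text/image/audio features.
z_text = encode("text", rng.standard_normal(DIMS["text"]))
z_image = encode("image", rng.standard_normal(DIMS["image"]))
z_audio = encode("audio", rng.standard_normal(DIMS["audio"]))

# All three modalities now live in one 64-dimensional space, so a
# cross-modal similarity is just a dot product between unit vectors.
print(z_text.shape, z_image.shape, z_audio.shape)  # (64,) (64,) (64,)
print(float(z_text @ z_image))                     # cosine similarity
```

Aligning modalities in one space is what makes cross-modal tasks (text-to-image synthesis, audio-to-text generation, joint classification) expressible as operations on a single representation rather than on three incompatible ones.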

Authors
R. Arun Kumar¹, C. Lisa², V.R. Rashmi³, K. Sandhya⁴
¹University of South Wales, United Kingdom; ²˒³Nehru College of Engineering and Research Centre, India; ⁴Malabar College of Engineering and Technology, India

Keywords
Multimodal AI, Generative Adversarial Networks, Cross-Modal Synthesis, Text-Image-Audio Fusion, Model Performance Enhancement
Published By: ICTACT
Published In: ICTACT Journal on Soft Computing (Volume: 15, Issue: 3, Pages: 3567-3577)
Date of Publication: January 2025

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.