The integration of multimodal data is critical in advancing artificial
intelligence models capable of interpreting diverse and complex inputs.
While standalone models excel at processing individual data types such as
text, images, or audio, they often fail to achieve comparable
performance when these modalities are combined. Generative
Adversarial Networks (GANs) have emerged as a transformative
approach in this domain because they can learn from and synthesize
across disparate data types. This study addresses the
challenge of bridging multimodal datasets to improve the
generalization and performance of AI models. The proposed
framework employs a novel GAN architecture that integrates textual,
visual, and auditory data streams. Through a shared latent space, the
system generates coherent representations that support cross-modal
understanding and seamless data fusion. The GAN model is
trained on a benchmark dataset comprising 50,000 multimodal
instances, with 25% allocated for testing. Results indicate significant
improvements in multimodal synthesis and classification accuracy. The
model achieves a text-to-image synthesis FID score of 14.7, an audio-
to-text BLEU score of 35.2, and a cross-modal classification accuracy
of 92.3%. These outcomes surpass existing models by 8-15% across
comparable metrics, highlighting the GAN’s effectiveness in handling
data heterogeneity. The findings suggest potential applications in areas
such as virtual assistants, multimedia analytics, and cross-modal
content generation.
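
The abstract does not specify implementation details, but the architecture it outlines (per-modality encoders projecting into a shared latent space, with adversarial training for cross-modal synthesis) can be sketched concretely. The following is a minimal PyTorch sketch of the text-to-image direction; all module names, layer sizes, and feature dimensions are illustrative assumptions, not the authors' design.

```python
# Minimal sketch of a shared-latent-space multimodal GAN, assuming PyTorch.
# Module names, dimensions, and the fusion strategy are illustrative
# assumptions; the paper's actual architecture is not given in the abstract.
import torch
import torch.nn as nn

LATENT_DIM = 256  # assumed size of the shared latent space

class ModalityEncoder(nn.Module):
    """Projects one modality's features into the shared latent space."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, LATENT_DIM),
        )
    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Decodes a shared latent code into a target-modality representation."""
    def __init__(self, output_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, output_dim), nn.Tanh(),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores whether a target-modality sample is real or generated."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # raw logit; loss applies the sigmoid
        )
    def forward(self, x):
        return self.net(x)

# Text-to-image synthesis through the shared latent space. Feature sizes
# (768 for text, 4096 for flattened image features) are placeholders.
text_enc = ModalityEncoder(input_dim=768)
image_gen = Generator(output_dim=4096)
image_disc = Discriminator(input_dim=4096)

bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(
    list(text_enc.parameters()) + list(image_gen.parameters()), lr=2e-4)
d_opt = torch.optim.Adam(image_disc.parameters(), lr=2e-4)

def train_step(text_feats, real_image_feats):
    # Discriminator: real image features vs. features generated from text.
    fake = image_gen(text_enc(text_feats)).detach()
    d_loss = (bce(image_disc(real_image_feats),
                  torch.ones(real_image_feats.size(0), 1))
              + bce(image_disc(fake), torch.zeros(fake.size(0), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator and encoder: fool the discriminator from the shared code.
    fake = image_gen(text_enc(text_feats))
    g_loss = bce(image_disc(fake), torch.ones(fake.size(0), 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# Smoke test on random features standing in for a real multimodal batch.
d, g = train_step(torch.randn(8, 768), torch.randn(8, 4096))
print(f"d_loss={d:.3f}  g_loss={g:.3f}")
```

In this sketch the same latent code could equally be decoded by an audio or text generator, which is what makes the shared latent space the point of coupling between modalities; symmetric encoder-generator pairs for the other directions would follow the same pattern.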
R. Arun Kumar¹, C. Lisa², V.R. Rashmi³, K. Sandhya⁴
¹University of South Wales, United Kingdom; ²,³Nehru College of Engineering and Research Centre, India; ⁴Malabar College of Engineering and Technology, India
Keywords: Multimodal AI, Generative Adversarial Networks, Cross-Modal Synthesis, Text-Image-Audio Fusion, Model Performance Enhancement
Published By: ICTACT
Published In: ICTACT Journal on Soft Computing (Volume 15, Issue 3, Pages 3567-3577)
Date of Publication: January 2025