Introduction
The paper introduces a Multimodal Autoencoder (MMAE) that learns unified semantic representations across image, audio, and text without requiring large-scale paired or contrastive datasets. Trained with a joint reconstruction objective, the model discovers modality-invariant structure from moderately sized, broadcast-like data, outperforming PCA and single-modality baselines with markedly higher Silhouette, Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI) scores. Its key novelty lies in enabling cross-modal alignment, metadata automation, and semantic clustering through a lightweight, fully reproducible, data-efficient framework. Unlike CLIP-style models, the MMAE excels in domains where labeled data is scarce, offering a practical foundation for next-generation media understanding and content management.
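To make the idea concrete, the sketch below shows one plausible shape of such a model: per-modality encoders mapped into a shared latent space and per-modality decoders trained with a joint reconstruction loss. The layer sizes, feature dimensions, latent fusion by averaging, and the unweighted sum-of-MSE objective are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a multimodal autoencoder with joint reconstruction.
# All dimensions and the fusion/loss choices are assumptions for illustration.
import torch
import torch.nn as nn

class MMAE(nn.Module):
    def __init__(self, dims, latent=256):
        super().__init__()
        # One encoder and one decoder per modality, sharing a common latent space.
        self.encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, latent))
             for m, d in dims.items()})
        self.decoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(latent, 512), nn.ReLU(), nn.Linear(512, d))
             for m, d in dims.items()})

    def forward(self, inputs):
        # Encode each modality, fuse into a single shared representation
        # (here by averaging), then reconstruct every modality from it.
        latents = [self.encoders[m](x) for m, x in inputs.items()]
        z = torch.stack(latents).mean(dim=0)
        return z, {m: self.decoders[m](z) for m in inputs}

def joint_reconstruction_loss(inputs, recons):
    # Joint objective: sum of per-modality reconstruction errors (MSE).
    return sum(nn.functional.mse_loss(recons[m], inputs[m]) for m in inputs)

# Toy usage: random features stand in for precomputed image/audio/text embeddings.
dims = {"image": 2048, "audio": 128, "text": 768}  # assumed feature sizes
model = MMAE(dims)
batch = {m: torch.randn(8, d) for m, d in dims.items()}
z, recons = model(batch)
loss = joint_reconstruction_loss(batch, recons)
loss.backward()
```

Because every modality is reconstructed from the same fused vector, the latent space is pushed toward modality-invariant structure, which is what the clustering metrics above evaluate.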
Download
Click below to access the complete technical paper in PDF format.