Multimodal Unified Bidirectional Cross-Modal Audio-Visual Saliency Prediction
Date
2025-06
Authors
Publisher
Addis Ababa University
Abstract
Human attention in dynamic environments is inherently multimodal, shaped by the interplay of auditory and visual cues. Existing saliency prediction methods, however, focus predominantly on visual semantics and neglect audio as a critical modulator of gaze behavior. Recent audiovisual approaches attempt to address this gap but remain limited by temporal misalignment between modalities and inadequate retention of spatio-temporal information, which is key to resolving both the location and timing of salient events, ultimately yielding suboptimal performance. Inspired by recent breakthroughs in cross-attention transformers with convolutions for joint global-local representation learning and in conditional denoising diffusion models for progressive refinement, we introduce a novel multimodal framework for bidirectional, efficient audiovisual saliency prediction.
It employs dual-stream encoders to process video and audio independently, coupled with separate efficient cross-modal attention pathways that model mutual modality influence: one pathway aligns visual features with audio features, while the other adapts audio embeddings to visual semantics. Critically, these pathways converge into a unified latent space, ensuring coherent alignment of transient audiovisual events through iterative feature fusion. To preserve fine-grained details, residual connections propagate multiscale features across stages. For saliency generation, a conditional diffusion decoder iteratively denoises a noise-corrupted ground-truth map, conditioned at each timestep on the fused audiovisual features through a hierarchical decoder that enforces spatio-temporal coherence via multiscale refinement. Extensive experiments demonstrate that our model outperforms state-of-the-art methods, achieving improvements of up to 11.52% (CC), 20.04% (SIM), and 3.79% (NSS) over DiffSal on the AVAD dataset.
Description
Keywords
MUBiC, dual-stream encoders, audio-visual saliency prediction, AVAD dataset