Multimodal Unified Bidirectional Cross-Modal Audio-Visual Saliency Prediction
Date
2025-06
Authors
Publisher
Addis Ababa University
Abstract
Human attention in dynamic environments is inherently multimodal, shaped by the interplay of auditory and visual cues. Existing saliency prediction methods, however, focus predominantly on visual semantics and neglect audio as a critical modulator of gaze behavior. Recent audiovisual approaches attempt to address this gap but remain limited by temporal misalignment between modalities and inadequate retention of spatio-temporal information, which is key to resolving both the location and timing of salient events, ultimately yielding suboptimal performance. Inspired by recent breakthroughs in cross-attention transformers with convolutions for joint global-local representation learning and in conditional denoising diffusion models for progressive refinement, we introduce a novel multimodal framework for bidirectional, efficient audiovisual saliency prediction.
It employs dual-stream encoders to process video and audio independently, coupled with separate efficient cross-modal attention pathways that model mutual modality influence: one pathway aligns visual features with audio features, while the other adapts audio embeddings to visual semantics. Critically, these pathways converge into a unified latent space, ensuring coherent alignment of transient audiovisual events through iterative feature fusion. To preserve fine-grained details, residual connections propagate multiscale features across stages. For saliency generation, a conditional diffusion decoder iteratively denoises a noise-corrupted ground-truth map, conditioned at each timestep on the fused audiovisual features through a hierarchical decoder that enforces spatio-temporal coherence via multiscale refinement. Extensive experiments demonstrate that our model outperforms state-of-the-art methods, achieving improvements of up to 11.52% (CC), 20.04% (SIM), and 3.79% (NSS) over DiffSal on the AVAD dataset.
Description
Keywords
MUBiC, dual-stream encoders, audio-visual saliency prediction, AVAD dataset