Multimodal Contextual Transformer Augmented Fusion For Emotion Recognition
Date
2025-06
Authors
Publisher
Addis Ababa University
Abstract
As emotionally intelligent systems become increasingly integral to human-centered Artificial Intelligence (AI), the precise recognition of emotions in conversational settings remains a fundamental challenge. This difficulty arises from the context-sensitive and evolving nature of emotional expression. Although most Multimodal Emotion Recognition (MER) systems exploit speech and text features, they often overlook conversational context, such as prior dialogue exchanges, speaker identity, and interaction history, which is crucial for discerning nuanced or ambiguous emotions, particularly in dyadic and multiparty interactions. This study presents Multimodal Contextual Transformer Augmented Fusion (MCTAF), a lightweight, context-sensitive framework for MER. MCTAF explicitly represents context as a third modality, integrating the prior K utterances (dialogue history including text and audio), speaker characteristics, and turn-level temporal structure. The contextual features are processed by a Bidirectional Gated Recurrent Unit (BiGRU)-based context encoder that operates in parallel with separate BiGRU encoders for textual and acoustic features. All three modality-specific representations are integrated through a transformer-based self-attention mechanism that captures both intra- and inter-modal dependencies across conversation turns. To our knowledge, this is the first study to explicitly conceptualize conversational history as a distinct modality within a unified transformer architecture, processing it in parallel with speech and text before a dynamic, attention-driven fusion. MCTAF surpasses strong baselines when evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD) benchmarks, achieving 89.9% accuracy on IEMOCAP and an 88.3% weighted F1-score on MELD, and delivering gains of up to +4.0 percentage points in accuracy and +3.0 in F1-score over preceding state-of-the-art models. Ablation experiments further confirm the importance of context modeling, showing a 3-4 point decline in F1 when the context module is removed. In terms of efficiency, MCTAF reduces training time by 8% per epoch and uses 12% fewer parameters than comparable transformer-based baselines, with an average inference time of 26.1 ms per syllable. These findings demonstrate the potential of MCTAF for scalable and resource-efficient deployment.
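To make the described architecture concrete, the following is a minimal sketch, assuming PyTorch, of the overall structure outlined in the abstract: three parallel BiGRU encoders for text, audio, and dialogue context, followed by transformer self-attention fusion and an utterance-level emotion classifier. All layer sizes, the number of attention heads, the context window K, and the class count are illustrative assumptions, not the configuration reported in the thesis.

```python
import torch
import torch.nn as nn


class MCTAFSketch(nn.Module):
    """Illustrative sketch of the MCTAF design described in the abstract.

    Three BiGRU encoders (text, audio, dialogue context) run in parallel;
    their outputs are concatenated as token sequences and fused with a
    transformer self-attention encoder. Dimensions and hyperparameters
    are assumptions for demonstration only.
    """

    def __init__(self, text_dim=768, audio_dim=128, ctx_dim=896,
                 hidden=128, n_heads=4, n_classes=6):
        super().__init__()
        # Modality-specific BiGRU encoders (bidirectional -> 2 * hidden features)
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True, bidirectional=True)
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.ctx_enc = nn.GRU(ctx_dim, hidden, batch_first=True, bidirectional=True)
        # Self-attention fusion over the concatenated modality tokens,
        # capturing intra- and inter-modal dependencies.
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=2 * hidden, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, text_seq, audio_seq, ctx_seq):
        # Inputs: (batch, time, feature_dim); ctx_seq holds the prior K
        # utterances' combined text+audio features (a modeling assumption).
        t, _ = self.text_enc(text_seq)
        a, _ = self.audio_enc(audio_seq)
        c, _ = self.ctx_enc(ctx_seq)
        # Concatenate encoded steps from all modalities and fuse with self-attention.
        fused = self.fusion(torch.cat([t, a, c], dim=1))
        # Mean-pool the fused tokens and predict the utterance emotion.
        return self.classifier(fused.mean(dim=1))


if __name__ == "__main__":
    model = MCTAFSketch()
    text = torch.randn(2, 20, 768)   # e.g. token embeddings of the utterance
    audio = torch.randn(2, 50, 128)  # e.g. frame-level acoustic features
    ctx = torch.randn(2, 5, 896)     # prior K=5 utterances, text+audio concatenated
    print(model(text, audio, ctx).shape)  # torch.Size([2, 6])
```

The sketch treats each encoded time step as a token so that a single self-attention block can attend across both modalities and conversation turns; the actual MCTAF fusion and context representation may differ in detail.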
Keywords
Multimodal emotion recognition, Contextual transformer, Cross-modal attention, Speech-text fusion, Dialogue context, Transformer-based fusion