Ensemble Learning with Attention and Audio for Robust Video Classification

dc.contributor.advisor: Beakal Gizachew (PhD)
dc.contributor.author: Dereje Tadesse
dc.date.accessioned: 2025-10-07T09:15:55Z
dc.date.available: 2025-10-07T09:15:55Z
dc.date.issued: 2025-06
dc.description.abstract: Video scene classification is a fundamental task for many applications, such as content recommendation, indexing, and broadcast monitoring. Current methods often depend on annotation-dependent object detection models, which limits their generalizability across different types of broadcast content, particularly where visual cues such as logos or brands are poorly defined or absent. This thesis addresses these limitations with a two-stage classification framework that integrates visual and audio information to improve the accuracy and robustness of classification. The first stage uses a detection model, built on pretrained object detection models and enhanced with spatial attention, to detect physical visual markers (such as program logos or branded intro sequences) in video program content. However, visual indicators alone are sometimes not robust enough to give a confident prediction, especially in content such as situational comedies where logos do not exist. The second stage is an early fusion ensemble of convolutional neural network-based visual features and recurrent neural network-based audio features. The two modalities have complementary properties and can therefore be combined for more robust classification. Experiments were conducted on a dataset of approximately 19 hours of content from 13 TV programs across three channels, focused on intro, credit, and outro segments. The visual-only model achieved 96.83% accuracy, while the audio-only model achieved 90.91%. The proposed early fusion ensemble achieved 94.13% accuracy and was more robust in difficult cases where the visual data was of low quality or ambiguous. Ablation studies comparing different ensemble methods confirmed the greater utility of early fusion and its ability to capture cross-modal interactions.
The system is also designed to be computationally efficient, allowing it to be deployed in broadcast media settings. In addition to demonstrating systematic video classification, this work addresses a significant gap in scalable and generalizable video classification through multimodal learning, especially where the large annotation requirements of more conventional models have previously been a hurdle.
dc.identifier.uri: https://etd.aau.edu.et/handle/123456789/7473
dc.language.iso: en_US
dc.publisher: Addis Ababa University
dc.subject: Video Classification
dc.subject: Ensemble Learning
dc.subject: Attention Mechanism
dc.subject: Audio-Visual Fusion
dc.subject: Object Detection
dc.title: Ensemble Learning with Attention and Audio for Robust Video Classification
dc.type: Thesis
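
Illustrative note: the abstract describes an early fusion ensemble in which visual and audio features are combined before classification. The sketch below is not the thesis implementation; it only shows the general idea of early fusion (concatenating per-clip feature vectors and passing them to one shared classifier). The feature dimensions, the random stand-in features, the linear classifier, and all function names are assumptions made for illustration. Only the count of 13 classes is taken from the abstract (13 TV programs).

```python
import math
import random

NUM_CLASSES = 13  # from the abstract: 13 TV programs
VISUAL_DIM = 8    # hypothetical size of a CNN visual feature vector
AUDIO_DIM = 4     # hypothetical size of an RNN audio feature vector


def early_fusion(visual_feat, audio_feat):
    """Early fusion: join both modalities into one feature vector."""
    return list(visual_feat) + list(audio_feat)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias):
    """Single linear layer plus softmax over the fused vector (assumed classifier)."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Random stand-ins for features that real CNN and RNN encoders would produce.
rng = random.Random(0)
visual_feat = [rng.gauss(0, 1) for _ in range(VISUAL_DIM)]
audio_feat = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]

fused = early_fusion(visual_feat, audio_feat)

# Untrained placeholder weights: one row per class, one column per fused feature.
weights = [[rng.gauss(0, 0.1) for _ in range(len(fused))] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify(fused, weights, bias)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Because the classifier sees both modalities in one input vector, its weights can relate visual and audio dimensions to the same class score, which is the property the abstract attributes to early fusion. A late fusion design would instead classify each modality separately and merge the two predictions afterwards.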

Files

Original bundle
Name: Dereje Tadesse.pdf
Size: 3.47 MB
Format: Adobe Portable Document Format