School of Information Technology and Engineering
Browsing School of Information Technology and Engineering by Author "Dereje Tadesse"
Item: Ensemble Learning with Attention and Audio for Robust Video Classification (Addis Ababa University, 2025-06) Dereje Tadesse; Beakal Gizachew (PhD)

The classification of video scenes is a fundamental task for many applications, such as content recommendation, indexing, and broadcast monitoring. Current methods often depend on annotation-dependent object detection models, which restricts their generalizability across different types of broadcast content, particularly where visual cues such as logos or branding are poorly defined or absent. This thesis addresses these limitations with a two-stage classification framework that integrates visual and audio information to improve classification accuracy and robustness. The first stage uses a detection model built on pretrained object detectors, enhanced with spatial attention, to locate explicit visual markers (such as program logos or branded intro sequences) in broadcast content. However, visual indicators alone are sometimes not reliable enough, especially in content such as situational comedies where logos are absent. The second stage therefore applies an early-fusion ensemble that combines convolutional neural network-based visual features with recurrent neural network-based audio features. The two modalities carry complementary information, enabling more robust classification. Experiments were conducted on a dataset of approximately 19 hours of content from 13 TV programs across three channels, focused on intro, credit, and outro segments. The visual-only model achieved 96.83% accuracy and the audio-only model 90.91%, while the proposed early-fusion ensemble achieved 94.13% accuracy and showed greater robustness in difficult situations where the visual data was of low quality or ambiguous. Ablation studies comparing different ensemble methods confirmed the advantage of early fusion and its ability to capture cross-modal interactions. The system is also designed to be computationally efficient, allowing deployment in broadcast media settings. By integrating multimodal learning, this work fills a significant gap in scalable and generalizable video classification, particularly in settings where the lack of controlled annotations has hindered conventional models.
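The first stage described in the abstract pairs a pretrained object detector with spatial attention to highlight visual markers such as program logos. The snippet below is a minimal sketch of one common way to realise such a gate (a CBAM-style spatial attention applied to backbone feature maps); the module, tensor shapes, and PyTorch usage are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool across channels, then a conv + sigmoid gate."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) feature map from a pretrained detector backbone
        avg_pool = feats.mean(dim=1, keepdim=True)         # (B, 1, H, W)
        max_pool = feats.max(dim=1, keepdim=True).values   # (B, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return feats * gate  # re-weight spatial locations, e.g. regions containing a logo

# Toy usage: gate a dummy backbone feature map before passing it to a detection head.
features = torch.randn(2, 256, 32, 32)
attended = SpatialAttention()(features)
print(attended.shape)  # torch.Size([2, 256, 32, 32])
```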
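The second stage fuses CNN-based visual features with RNN-based audio features before classification. The sketch below assumes a simple early-fusion design in PyTorch: a GRU summarises frame-level audio features, the result is concatenated with a pooled visual embedding, and a shared head produces class logits. The dimensions, the GRU choice, and the 13-way output (one class per program) are illustrative assumptions rather than the thesis's reported architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Concatenate clip-level visual and audio embeddings before a shared classifier head."""
    def __init__(self, visual_dim=512, audio_feat_dim=64, audio_hidden=128, num_classes=13):
        super().__init__()
        # Audio branch: a GRU summarises a sequence of frame-level audio features (e.g. log-mel frames).
        self.audio_rnn = nn.GRU(audio_feat_dim, audio_hidden, batch_first=True)
        # Fusion head: operates on the concatenated (visual + audio) representation.
        self.head = nn.Sequential(
            nn.Linear(visual_dim + audio_hidden, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, visual_emb: torch.Tensor, audio_seq: torch.Tensor) -> torch.Tensor:
        # visual_emb: (B, visual_dim) pooled CNN features for a clip
        # audio_seq:  (B, T, audio_feat_dim) audio frame features for the same clip
        _, h_n = self.audio_rnn(audio_seq)                 # h_n: (1, B, audio_hidden)
        audio_emb = h_n.squeeze(0)                         # (B, audio_hidden)
        fused = torch.cat([visual_emb, audio_emb], dim=1)  # early fusion by concatenation
        return self.head(fused)                            # class logits

# Toy usage with random tensors standing in for real features.
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 100, 64))
print(logits.shape)  # torch.Size([4, 13])
```

Concatenating the two embeddings before the classifier lets the head learn cross-modal interactions directly, which is the property the abstract's ablation studies attribute to early fusion.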