Vertically Segmented Target Zone based Audio Fingerprinting
Date
2022-04
Publisher
Addis Ababa University
Abstract
An audio fingerprint is a set of perceptual features that uniquely identify an audio file.
Audio fingerprinting has applications in broadcast monitoring, meta-data collection,
royalty tracking, etc. Audio fingerprinting systems are sensitive to noise, compression,
and other modifications present in the audio. Pitch-shifting is one such modification.
Common real-world scenarios where pitch-shifting occurs include radio broadcasts,
DJ sets, and deliberate alterations. Since pitch-shifting scales the spectral content
of the original audio, matching pitch-shifted query audio to its original unmodified
version is challenging. This thesis proposes a Shazam-based audio fingerprinting
system resistant to pitch-shifting.
The proposed approach uses the constant-Q transform (CQT) to turn the scaling effect of
pitch-shifting into vertical translation. From the spectrogram generated by CQT, the proposed
approach selects triplets of spectral peaks to encode pitch-shifting resistant fingerprint hashes.
Vertically segmented target zones are used to organize spectral peaks into triplets.
By increasing the locality of the generated fingerprint hashes, vertically segmenting
the spectrogram minimizes the effect of pitch-shifting. A fingerprint hashing scheme
that leverages vertically segmented target zones is proposed.
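The scaling-to-translation property can be illustrated with a minimal sketch. It is not the hashing scheme of the thesis: the bin resolution `BINS_PER_OCTAVE`, the lowest frequency `F_MIN`, and the helper `triplet_key` are all hypothetical choices made for illustration. On a logarithmic frequency axis, multiplying every frequency by the same factor moves every peak by the same number of bins, so a key built from bin differences within a triplet is unchanged.

```python
import math

BINS_PER_OCTAVE = 36   # assumed CQT resolution (hypothetical)
F_MIN = 110.0          # assumed lowest analysed frequency in Hz (hypothetical)

def cqt_bin(freq_hz):
    """Fractional bin position of a frequency on a log-frequency (CQT-like) axis."""
    return BINS_PER_OCTAVE * math.log2(freq_hz / F_MIN)

def shift_in_bins(factor):
    """Vertical offset, in bins, caused by scaling all frequencies by `factor`."""
    return BINS_PER_OCTAVE * math.log2(factor)

def triplet_key(peaks):
    """Illustrative key for three (time_frame, bin) peaks.

    Only differences relative to the first peak are kept, so adding a
    constant to every bin index leaves the key unchanged.
    """
    (t0, b0), (t1, b1), (t2, b2) = peaks
    return (round(b1 - b0), round(b2 - b0), t1 - t0, t2 - t0)

# Three peaks of an original signal and of a copy pitch-shifted by +6%.
factor = 1.06
freqs = [220.0, 440.0, 660.0]
frames = [0, 5, 9]
original = [(t, cqt_bin(f)) for t, f in zip(frames, freqs)]
shifted = [(t, cqt_bin(f * factor)) for t, f in zip(frames, freqs)]

same_key = triplet_key(original) == triplet_key(shifted)
```

Here `same_key` is true because each shifted peak sits `shift_in_bins(factor)` bins above its original position. A practical system would also have to handle rounding at bin boundaries and peaks that move out of the analysed range.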
A total of 42,000 audio queries and a reference database of 3,000 freely available songs
were used to evaluate the proposed approach as well as the chosen baseline works:
Panako and Quad. The results show that the proposed approach handles
pitch-shift modifications from -11% to +12% except for modification values of -8, -3,
+3, and +9 percent. Panako identified queries with -6% to +6% pitch shifts
except for modification values of -3 and +3 percent. Quad, on the other hand, can handle
-12% to +7% pitch shifts with no such drops. The proposed approach is also robust
to linear speed modification from -6% to +12%, which is a significant improvement
over Panako, which can only handle -4% to +8% modifications. Quad showed better
robustness to linear speed modification by handling rates ranging from -16% to +12%.
However, Quad took, on average, three times longer than the proposed approach to
process a single query. Moreover, the proposed approach shows robustness to common
audio effects such as echo, tremolo, flanger, band-pass, and chorus, while Quad suffered
a significant accuracy drop for chorus, flanger, and tremolo.
Keywords
Audio Fingerprinting, Vertically Segmented Target Zone, Pitch Shifting