Vertically Segmented Target Zone based Audio Fingerprinting

Addis Ababa University


An audio fingerprint is a set of perceptual features that uniquely identifies an audio file. Audio fingerprinting has applications in broadcast monitoring, metadata collection, and royalty tracking, among others. Audio fingerprinting systems are sensitive to noise, compression, and other modifications present in the audio. Pitch shifting is one such modification; it commonly occurs in radio broadcasts, DJ sets, and deliberate alterations. Because pitch shifting scales the spectral content of the original audio, matching a pitch-shifted query to its unmodified original is challenging. This thesis proposes a Shazam-based audio fingerprinting system that is resistant to pitch shifting. The proposed approach uses the constant-Q transform (CQT) to turn the scaling effect of pitch shifting into a vertical translation in the spectrogram. From the CQT spectrogram, it selects triplets of spectral peaks and encodes them as pitch-shift-resistant fingerprint hashes. Vertically segmented target zones are used to organize spectral peaks into triplets. Segmenting the spectrogram vertically increases the locality of the generated fingerprint hashes, which minimizes the effect of pitch shifting. A fingerprint hashing scheme that leverages vertically segmented target zones is proposed. A total of 42,000 audio queries and a reference database of 3,000 freely available songs were used to evaluate the proposed approach and two baseline systems, Panako and Quad. The results show that the proposed approach handles pitch shifts from -11% to +12%, except at -8%, -3%, +3%, and +9%. Panako identified queries with pitch shifts from -6% to +6%, except at -3% and +3%. Quad, on the other hand, handles pitch shifts from -12% to +7% with no such drops.
The proposed approach is also robust to linear speed modification from -6% to +12%, a significant improvement over Panako, which handles only -4% to +8%. Quad showed better robustness to linear speed modification, handling rates from -16% to +12%. However, Quad took, on average, three times longer than the proposed approach to answer a single query. Moreover, the proposed approach is robust to common audio effects such as echo, tremolo, flanger, band-pass, and chorus, whereas Quad suffered a significant drop in accuracy for chorus, flanger, and tremolo.
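
The idea that a log-frequency (CQT) axis turns pitch-shift scaling into a vertical translation can be illustrated with a minimal sketch. It is not the hashing scheme of the thesis: the names `cqt_bin`, `triplet_hash`, and `pitch_shift`, the resolution `BINS_PER_OCTAVE`, and the lowest frequency `F_MIN` are all assumptions made for illustration, and the vertical segmentation of target zones is omitted. The sketch only shows that a hash built from bin differences within a peak triplet is unchanged when every peak frequency is multiplied by the same factor.

```python
import math

BINS_PER_OCTAVE = 36   # assumed CQT resolution, not taken from the thesis
F_MIN = 110.0          # assumed lowest CQT frequency in Hz


def cqt_bin(freq_hz):
    """Continuous (unrounded) position of a frequency on the CQT bin axis."""
    return BINS_PER_OCTAVE * math.log2(freq_hz / F_MIN)


def triplet_hash(peaks):
    """Hypothetical hash of three (time_frame, freq_hz) peaks.

    Only differences relative to the first (anchor) peak are used, so a
    constant vertical offset applied to all three peaks cancels out.
    """
    (t0, f0), (t1, f1), (t2, f2) = peaks
    b0, b1, b2 = cqt_bin(f0), cqt_bin(f1), cqt_bin(f2)
    return (round(b1 - b0), round(b2 - b0), t1 - t0, t2 - t0)


def pitch_shift(peaks, factor):
    """Scale every peak frequency by `factor`; time frames are unchanged."""
    return [(t, f * factor) for t, f in peaks]


peaks = [(10, 220.0), (14, 330.0), (19, 440.0)]
shifted = pitch_shift(peaks, 1.06)  # a +6% pitch shift

# Every peak moves by the same number of bins, whatever its frequency.
offset_low = cqt_bin(220.0 * 1.06) - cqt_bin(220.0)
offset_high = cqt_bin(440.0 * 1.06) - cqt_bin(440.0)

print(offset_low, offset_high)
print(triplet_hash(peaks), triplet_hash(shifted))
```

Under these assumptions, both offsets equal `BINS_PER_OCTAVE * log2(1.06)`, and the two hashes are identical. A linear speed change would also rescale the time differences, which this sketch does not account for.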



Audio Fingerprinting, Vertically Segmented Target Zone, Pitch Shifting