Music recognition AI

Wouldn't it be cool if a program could find the title of that song playing on the radio? Then this is your lucky day, because the algorithm I'm going to describe could come in handy.

Anyway, why are you listening to the radio while reading my article?!

From an abstract point of view, the algorithm works like this:
  1. it analyzes a collection of songs, then stores some computed information about those songs in a database
  2. you make it listen to a short sample of the song you want to find the name of
  3. if the song is in the database a title is returned

Actually, I made this project quite a while ago (roughly in 2017) as my final high school thesis. The thing is that only recently was I able to recover both the original Java source code and the \(\LaTeX\) sources of the thesis.

1. Music and physics

A sound is a vibration that propagates through the air and can be perceived by the human ear. MP3 players and computers use headphones or built-in speakers to produce vibrations and make "sounds". Music is just a kind of signal that can be represented as a combination of sinusoidal waveforms, and it must be treated as such.

1.1. Signals as sums of sines

A sine wave is mainly characterized by two parameters: its amplitude and its frequency.
Figure 1.1: A 20Hz sine wave
A generic sine wave is described by the function: \[y(t) = A \cdot \sin(2\pi f t + \varphi)\] where \(A\) is the amplitude, \(f\) is the frequency (in Hertz) and \(\varphi\) is the phase. In figure 1.1 the sine wave is described by the function: \[y(t)=1\cdot\sin(2\pi \cdot 20 t + 0)=\sin(40\pi t)\]
Figure 1.2: Three different sine waves
But, in the real world, almost any signal is a weighted sum of sines. For example, take three different sine waves with different amplitudes and frequencies.
Figure 1.3: Sum of the three sine waves
These three functions can be plotted separately, as shown in figure 1.2, or they can be summed into a single function to represent a more realistic sound (figure 1.3).
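To make the idea concrete, here is a small Java sketch (not taken from the original project; the amplitudes and frequencies are made up) that samples three sine waves and adds them into a single signal, just like in figure 1.3:

```java
public class SineSum {
    // Evaluate A * sin(2*pi*f*t + phi) at time t (in seconds).
    static double sine(double a, double f, double phi, double t) {
        return a * Math.sin(2 * Math.PI * f * t + phi);
    }

    public static void main(String[] args) {
        double sampleRate = 8000;           // samples per second
        int samples = 80;                   // 10 ms of signal
        double[] sum = new double[samples];

        for (int i = 0; i < samples; i++) {
            double t = i / sampleRate;
            // Three hypothetical sine waves, summed sample by sample.
            sum[i] = sine(1.0, 20, 0, t)
                   + sine(0.5, 100, 0, t)
                   + sine(0.25, 250, 0, t);
        }
        for (double value : sum) System.out.println(value);
    }
}
```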

1.2. Spectrogram

A spectrogram (figure 1.4) is a visual representation of how the intensities of the frequencies in a signal vary with time. Usually, a spectrogram is made up of three axes (making a 3D graph): time on the horizontal axis, frequency on the vertical axis, and intensity encoded as the color of each point.
Figure 1.4: Spectrogram of a real song

1.3. Digitalization

Nowadays the most common way to listen to music is using a digital file (such as an MP3 file or a FLAC one). However, a sound is an analog phenomenon that needs to be converted into a digital representation to be easily recorded and stored.

1.3.1. Sampling

Analog signals are continuous signals, which means that given two times \(t_1\) and \(t_2\) within the signal, the signal is defined for every \(t_3\) such that \(t_1<t_3<t_2\). But in the digital world it is not feasible to store an infinite amount of data, so the analog signal needs to be reduced to a discrete-time signal. This process is called sampling. It is quite simple: an instantaneous value of the continuous signal is taken every \(T\) seconds. \(T\) is called the sampling period, and it should be short enough so that the digital song sounds like the analog one.
The sampling rate or sampling frequency \(f_s\) is the number of samples obtained in one second, given by the formula: \[f_s=\frac{1}{T}\] The standard sampling rate for digital music is 44100 Hz. The reason behind that number lies in the Nyquist-Shannon theorem, which can be stated as:
Nyquist-Shannon theorem
Suppose the highest frequency component of a given analog signal is \(f_{max}\); then the sampling rate \(f_s\) needs to be chosen such that \(f_s \geq 2 f_{max}\).
There is more theory behind the theorem, but in essence it states that an analog signal needs at least 2 samples per cycle of its highest frequency component to be correctly reconstructed. So, since human ears can hear signals whose frequency is between 20 Hz and 20 kHz, taking the upper bound and multiplying it by 2 gives 40 kHz, which is close to the standardized 44.1 kHz (figure 1.5).
Figure 1.5: Sampling example of a 30 Hz signal

1.3.2. Quantization

With sampling alone the signal is not fully digital: the time axis has become discrete, but what about the signal amplitude? The amplitude of a signal represents its loudness, and in the analog world it is continuous, so there must be a way to make it discrete. This process is called quantization. The quantization resolution is measured in bits and is known as bit depth.
Taking a depth of 3 bits means that the "loudness" of the song can vary between \(0\) and \(2^3 - 1\), so there are just 8 quantization levels to represent the loudness of the whole song (figure 1.6).
Figure 1.6: 3-bit quantization example
The higher the bit depth, the better the amplitude is approximated. The standard quantization resolution is 16 bits. For instance, quantizing the same analog signal with 16 bits, the amplitude can assume \(2^{16}\) values, which results in much higher accuracy (figure 1.7).
Figure 1.7: 16-bit quantization example
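As a rough illustration (a sketch of mine, not the project's code), the following snippet maps an amplitude in the range \([-1, 1]\) to one of the \(2^{bits}\) quantization levels:

```java
public class Quantizer {
    // Map an amplitude in [-1.0, 1.0] to one of 2^bits discrete levels.
    static int quantize(double amplitude, int bits) {
        int levels = 1 << bits;  // 2^bits quantization levels
        // Scale [-1, 1] to [0, levels - 1] and round to the nearest level.
        int level = (int) Math.round((amplitude + 1.0) / 2.0 * (levels - 1));
        return Math.max(0, Math.min(levels - 1, level));
    }

    public static void main(String[] args) {
        double sample = 0.3;
        System.out.println("3-bit level:  " + quantize(sample, 3));   // one of 8 levels
        System.out.println("16-bit level: " + quantize(sample, 16));  // one of 65536 levels
    }
}
```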

1.4. Computing the spectrum

The previous sections should have given enough information to proceed to the real problem: how to break down a complex audio signal into pure sine waves with their own parameters.

1.4.1. Discrete Fourier Transform

The DFT (Discrete Fourier Transform) applies to discrete signals and gives a discrete spectrum. \[X(n)=\sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}} \tag{1}\] where \(x(t)\) is the sampled signal, \(N\) is the window size (the number of samples under analysis), \(n\) is the index of the frequency bin and \(i\) is the imaginary unit. The interpretation is that the vector \(x\) represents the signal level at various points in time, while the vector \(X\) represents the signal level at various frequencies. What formula \((1)\) states is that the signal level at frequency \(n\) is equal to the sum of the signal level at each time \(t\) multiplied by a complex exponential (figure 1.8).
Figure 1.8: Visual representation of FT
For example, for an audio signal window of 512 samples, this formula must be applied 512 times, once for each value of \(n\), and each output value refers to a bin of frequencies rather than to an exact frequency. A bin of frequencies is a group of frequencies between two boundaries.
The reason why the DFT computes bins of frequencies and not exact frequencies is quite simple: the DFT gives a discrete spectrum. A bin of frequencies is the smallest unit of frequencies the DFT can compute, and the size of the bin is called spectral resolution or frequency resolution, which is given by the formula: \[B_S=\frac{F_S}{N}\] where \(B_S\) is the bin size, \(F_S\) is the sampling rate and \(N\) is the window size. For instance, taking a sampling rate of 8000 Hz and a window size of 512, the bin size will be 15.6 Hz, so the first bin covers the frequencies from 0 Hz to 15.6 Hz, the second one from 15.6 Hz to 31.2 Hz, and so on. A particularity of a real-valued signal (such as an audio recording) is that only half of the bins computed by the DFT are needed, since the output of the DFT is symmetric. In this case, fewer calculations can be made by exploiting this property, which goes under the name of conjugate complex symmetry.
Conjugate complex symmetry of the DFT
If a function \(x(t)\) is real-valued then: \[X(N-n)=X^*(n) \tag{2}\] where \(X^*(n)\) denotes the complex conjugate of \(X(n)\).
Proof
Substitute \((1)\) into property \((2)\), remembering that \(e^{-i 2\pi t}=1\) for every integer \(t\) and that \(x(t)\) is real-valued: $$\begin{aligned} X(n) &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}} \\ X(N-n) &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t (N-n)}{N}} \\ &= \sum_{t=0}^{N-1} x(t)e^{-i 2\pi t}e^{i\frac{2\pi t n}{N}} \\ &= \sum_{t=0}^{N-1} x(t)e^{i\frac{2\pi t n}{N}} \\ &= \left(\sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}}\right)^* \\ &= X^*(n) \end{aligned}$$
Hence, if the window size is equal to 512, each bin \(X(512-n)\) is just the complex conjugate of \(X(n)\), so the DFT needs to be computed only for half the window size (256 times in this example).
To be accurate, most real-DFT implementations output an array of length \(N/2 + 1\), where \(N\) is the window size. Taking, as always, a sampling rate of 8000 Hz and a window size of 512 (with the bin size being 15.6 Hz), the output has 257 values: index 0 is the DC component (0 Hz), index 1 covers the frequencies around 15.6 Hz, and so on up to index 256, which corresponds to the Nyquist frequency of 4000 Hz.
Proof: DC component
Calculate the DFT \((1)\) with \(n=0\): $$\begin{aligned} X(n) |_{n=0} &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}}|_{n=0} \\ &= \sum_{t=0}^{N-1} x(t) \end{aligned}$$
The DC component is simply ignored by the algorithm implementation and, in most cases, equals 0.
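To tie formula \((1)\) and the symmetry property together, here is a naive Java sketch (quadratic in time, unlike the FFT used by real implementations) that computes only the first \(N/2+1\) magnitude bins of a real-valued window:

```java
public class NaiveDft {
    // Magnitudes of the first N/2 + 1 DFT bins of a real-valued signal x,
    // following X(n) = sum_t x(t) * e^(-i*2*pi*t*n/N).
    static double[] magnitudes(double[] x) {
        int n = x.length;
        double[] out = new double[n / 2 + 1];
        for (int bin = 0; bin <= n / 2; bin++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * t * bin / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            out[bin] = Math.sqrt(re * re + im * im);
        }
        return out;
    }

    public static void main(String[] args) {
        // 512 samples of a 1 kHz sine sampled at 8 kHz: the peak lands in bin 1000 / 15.625 = 64.
        int n = 512;
        double fs = 8000, f = 1000;
        double[] x = new double[n];
        for (int t = 0; t < n; t++) x[t] = Math.sin(2 * Math.PI * f * t / fs);
        double[] mag = magnitudes(x);
        System.out.println("Peak near bin 64: " + mag[64]);
    }
}
```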

1.4.2. Window function

Now the problem is partially solved: the DFT can be used to obtain the frequency amplitudes for (say) the first \(\frac{1}{10}\) of a second of the song, then for the second tenth, the third, and so on. This process is called signal windowing: only a part of the signal is analyzed at a time, instead of analyzing it all in one take.
The problem is that in this way a rectangular function is implicitly applied: a function that equals 1 for the song portion under analysis and 0 elsewhere (figure 1.9).
Figure 1.9: Rectangular window function example


Windowing the audio signal means multiplying it by a window function that depends on the piece of the audio signal under analysis. The usage of a window function produces spectral leakage: the power of a frequency leaks into the nearby frequencies, making the DFT less accurate.

Spectral leakage cannot be avoided but it can be controlled and reduced by choosing the right window function: there are many different window functions besides the rectangular one. Just to name a few: triangular, Blackman, Hamming, Hann.
When analyzing unknown, possibly very noisy data, the best choice is usually the Hann window function, defined by the following formula: \[w(n)=\frac{1}{2}\left(1-\cos\left(\frac{2\pi n}{N-1}\right)\right)\] where \(n\) is the index of the sample within the window and \(N\) is the window size. The aim of this window function is to decrease the amplitude of the discontinuities at the boundaries of a given piece of the audio signal (see figure 1.10).
Figure 1.10: Shaping the signal with a window function
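A minimal Java sketch of the Hann window applied sample by sample to a frame (the original project may implement it differently):

```java
public class HannWindow {
    // Hann coefficients: w(n) = 0.5 * (1 - cos(2*pi*n / (N - 1))).
    static double[] coefficients(int size) {
        double[] w = new double[size];
        for (int n = 0; n < size; n++) {
            w[n] = 0.5 * (1 - Math.cos(2 * Math.PI * n / (size - 1)));
        }
        return w;
    }

    // Multiply each sample of the frame by the corresponding window coefficient.
    static double[] apply(double[] frame) {
        double[] w = coefficients(frame.length);
        double[] shaped = new double[frame.length];
        for (int n = 0; n < frame.length; n++) shaped[n] = frame[n] * w[n];
        return shaped;
    }
}
```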

1.4.3. Downsampling and window size

Downsampling is the process of reducing the sampling rate of a signal. The window size can then be reduced accordingly, taking the smallest value that still gives an acceptable frequency resolution. For instance, resample a song from 44.1 kHz to 8 kHz and use a window size of just 512: in this way there are fewer samples to analyze and the DFT is applied fewer times.
The only difference is that the resampled song will only have frequencies from 0 to 4 kHz (see the Nyquist-Shannon theorem), but the most important part of the song is still present in this range of frequencies.

2. The actual algorithm

Now is the time to put everything together and start coding the algorithm. The aim of the software is to record a small portion of a song and find its title.
The following sections will cover a global overview of the algorithm, then the actual process involved in the song scoring.
Take into account that, at this abstraction level, some obvious parts (such as the track file reading routines) will not be described, since they are standard pieces of code.

2.1. Global overview

An audio fingerprint is a digital summary that can be used to identify an audio sample. In figure 2.1 a simplified architecture of the scoring algorithm is represented.

Figure 2.1: General overview scheme
On the server-side, every song of the collection is fingerprinted and its fingerprints are stored in a database together with the song information. On the client-side, a short recording is fingerprinted in the same way and sent to the server, which looks for the best match and returns the corresponding title. The following sections describe the algorithm step by step.

2.2. Spectrogram creation

The first step in analyzing the audio is to create the spectrogram. The process goes roughly as follows:
  1. the audio is downsampled to 8 kHz (see Downsampling and window size)
  2. the signal is split into windows of 512 samples, with consecutive windows partially overlapping (figure 2.2)
  3. each window is shaped with the Hann window function to reduce spectral leakage
  4. the DFT of each window is computed, and the resulting bin magnitudes form one column of the spectrogram
Figure 2.2: Audio windowing with overlap
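As a sketch of this step, assuming a window size of 512 samples and a hypothetical 50% overlap between consecutive windows, and reusing the `HannWindow` and `NaiveDft` helpers sketched earlier:

```java
public class Spectrogram {
    // Build a spectrogram: one row of bin magnitudes per (overlapping) window.
    static double[][] compute(double[] samples, int windowSize, int hop) {
        int windows = (samples.length - windowSize) / hop + 1;
        double[][] spectrogram = new double[windows][];
        for (int w = 0; w < windows; w++) {
            double[] frame = new double[windowSize];
            System.arraycopy(samples, w * hop, frame, 0, windowSize);
            double[] shaped = HannWindow.apply(frame);     // reduce spectral leakage
            spectrogram[w] = NaiveDft.magnitudes(shaped);  // windowSize / 2 + 1 bins
        }
        return spectrogram;
    }

    public static void main(String[] args) {
        double[] song = new double[8000];           // 1 second of (silent) audio at 8 kHz
        double[][] spec = compute(song, 512, 256);  // hop = 256 samples, i.e. 50% overlap
        System.out.println(spec.length + " windows x " + spec[0].length + " bins");
    }
}
```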

2.3. Peaks finder and fingerprinting

The spectrogram is then processed to obtain its most significant information. It is divided into a sort of grid where each cell spans \(C\) consecutive windows in time and one frequency band in frequency. Take the simplified figure 2.3, where each color represents a band.
Figure 2.3: Spectrogram division (each color is a band)
In the actual algorithm, \(C = 32\) and the bands' lengths follow a logarithmic scale (see Dividing the spectrogram into bands).
For each cell, the algorithm finds and stores the 3 most powerful frequencies in a vector (figure 2.4).
Figure 2.4: Three points per cell spectrogram
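A possible Java sketch of the per-cell peak search; the `Point` record and the method names are mine, not necessarily the original ones:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PeakFinder {
    // A spectrogram point: which window, which frequency bin, how loud.
    record Point(int window, int bin, double amplitude) {}

    // For one cell (C consecutive windows, one band of bins), keep the 3 loudest points.
    static List<Point> cellPeaks(double[][] spectrogram,
                                 int firstWindow, int c,
                                 int firstBin, int lastBin) {
        List<Point> points = new ArrayList<>();
        for (int w = firstWindow; w < firstWindow + c && w < spectrogram.length; w++) {
            for (int b = firstBin; b < lastBin && b < spectrogram[w].length; b++) {
                points.add(new Point(w, b, spectrogram[w][b]));
            }
        }
        points.sort(Comparator.comparingDouble(Point::amplitude).reversed());
        return points.subList(0, Math.min(3, points.size()));
    }
}
```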

2.3.1. Dividing the spectrogram into bands

There exist at least a few ways to divide the spectrogram into meaningful frequency ranges, but the most interesting one is the mel scale. It was experimentally discovered that the higher the frequency, the harder it is for a human ear to notice the difference between two relatively close frequencies.
In other words, humans perceive frequencies on a logarithmic scale, so the difference between 500 Hz and 1000 Hz is more noticeable than, say, the difference between 10000 Hz and 10500 Hz.
The formula to convert from the mel (logarithmic) scale back to the Hertz (linear) scale is: \[f(m)=700\left(10^{\frac{m}{2595}}-1\right) \tag{3}\] where \(m\) is a value on the mel scale and \(f(m)\) is the corresponding frequency in Hertz. The plotted formula is represented in figure 2.5.
Figure 2.5: Mel scale to Hertz mapping
The algorithm follows this approach:
  1. A starting mel value \(\alpha\) is chosen
  2. A mel band size \(\delta\) is chosen
  3. Compute the mel bands boundaries as: \[\left\{0, \alpha, \alpha + \delta, \alpha + 2\delta, ..., \alpha + k\delta \right\}\]
  4. Convert the mel band boundaries back to Hertz with formula \((3)\): \[\left\{0, f(\alpha), f(\alpha+\delta), f(\alpha+2\delta), ..., f(\alpha+k\delta)\right\}\]
Each band will contain roughly the same "amount of information" from the listener's point of view.
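A small Java sketch of this band computation using formula \((3)\); the values of \(\alpha\), \(\delta\) and \(k\) below are hypothetical, chosen only to roughly cover the 0-4 kHz range of an 8 kHz recording:

```java
public class MelBands {
    // Formula (3): convert a mel value back to Hertz.
    static double melToHertz(double mel) {
        return 700 * (Math.pow(10, mel / 2595.0) - 1);
    }

    // Band boundaries in Hertz: {0, f(alpha), f(alpha + delta), ..., f(alpha + k*delta)}.
    static double[] bandBoundaries(double alpha, double delta, int k) {
        double[] bounds = new double[k + 2];
        bounds[0] = 0;
        for (int i = 0; i <= k; i++) {
            bounds[i + 1] = melToHertz(alpha + i * delta);
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Hypothetical parameters: the top boundary f(300 + 8*230) is close to 4 kHz.
        for (double b : bandBoundaries(300, 230, 8)) {
            System.out.printf("%.1f Hz%n", b);
        }
    }
}
```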

2.4. Links structure

Now, having a way to compute a frequency summary of an audio signal, the recognition process should be straightforward: each point of the recording is compared with the points of the full song. Though this simple approach works, it requires a lot of computation time and power.
Instead of comparing each point one by one, the idea is to look for multiple points at the same time. This group of points is called a link, and it represents a relationship between a point and some others. To be sure that both the recording and the full song will generate the same links, a rule must be defined: spectrogram points must be sorted in ascending order according to the window they belong to.

So, given a point \(\alpha\) (called the address) and an initially empty set of points \(A\), the algorithm puts in \(A\) all the other points \(\beta\) that are in the same band as \(\alpha\) and whose window difference from \(\alpha\) is at least 1 and less than 3. In symbols: \[ \beta\in A\Leftrightarrow(1\leq\beta\textrm{.window}-\alpha\textrm{.window}<3)\wedge(\beta\textrm{.band}=\alpha\textrm{.band}) \] For each point \(\beta\) belonging to \(A\), a link is computed and added to a list (figure 2.6). The link structure is the following (figure 2.7): \[ \textrm{link} \left\{\begin{matrix} \textrm{.hash} = h\left ( \delta_w, \delta_f, \alpha\textrm{.frequency} \right ) \\ \textrm{.window} = \alpha\textrm{.window} \end{matrix}\right. \] where \(h\) is a hash function, \(\delta_w\) is the window difference between \(\beta\) and \(\alpha\), and \(\delta_f\) is the frequency difference between \(\beta\) and \(\alpha\). By doing so the links are quite reproducible, even in the presence of noise and codec compression. Another advantage is that in this way all the times are relative.
On the server-side each link is stored in a database along with the song information.
Figure 2.6: Links creation
Figure 2.7: Link structure
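A minimal Java sketch of the link creation rule; the field names are mine and the hash function \(h\) is just a placeholder (the original project may combine \(\delta_w\), \(\delta_f\) and the address frequency differently):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class Links {
    // A spectrogram point and the resulting link structure.
    record Point(int window, int band, int frequency) {}
    record Link(int hash, int window) {}

    // Placeholder for h: combine the window delta, the frequency delta and the address frequency.
    static int hash(int deltaWindow, int deltaFrequency, int anchorFrequency) {
        return Objects.hash(deltaWindow, deltaFrequency, anchorFrequency);
    }

    // Build the links for an address point alpha against the (window-sorted) point list.
    static List<Link> linksFor(Point alpha, List<Point> points) {
        List<Link> links = new ArrayList<>();
        for (Point beta : points) {
            int deltaWindow = beta.window() - alpha.window();
            boolean closeInTime = deltaWindow >= 1 && deltaWindow < 3;
            boolean sameBand = beta.band() == alpha.band();
            if (closeInTime && sameBand) {
                int deltaFrequency = beta.frequency() - alpha.frequency();
                links.add(new Link(hash(deltaWindow, deltaFrequency, alpha.frequency()),
                                   alpha.window()));
            }
        }
        return links;
    }
}
```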

2.5. Scoring

It is assumed that the full song links are in the database. Now a client, after recording a small piece of a song, computes the links of the recording on its own and sends them to the server. The server then has to (a code sketch follows the list):
  1. Put the recording links in a temporary in-memory table
  2. Join the full-song links table with the recording links table where they share the same hash
  3. Count the joined tuples, grouped by song id and by the time difference between the recording link and the full-song link (links of the same song share the same time difference)
  4. Sort the tuples by the count field in descending order
  5. The topmost tuple obtained contains the id of the most likely match
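On the server this matching is a database query, but the same grouping and counting logic can be sketched in plain Java, assuming link and stored-link structures similar to the previous sketch:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Scorer {
    record Link(int hash, int window) {}
    record StoredLink(int songId, int window) {}
    record Key(int songId, int timeDifference) {}

    // Return the id of the song whose links best line up with the recording links,
    // or -1 when nothing matches.
    static int bestMatch(List<Link> recording, Map<Integer, List<StoredLink>> database) {
        // (songId, timeDifference) -> number of matching links.
        Map<Key, Integer> counts = new HashMap<>();
        for (Link link : recording) {
            for (StoredLink stored : database.getOrDefault(link.hash(), List.of())) {
                int timeDifference = stored.window() - link.window();
                counts.merge(new Key(stored.songId(), timeDifference), 1, Integer::sum);
            }
        }
        // The (songId, timeDifference) pair with the highest count is the likely match.
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(e -> e.getKey().songId())
                .orElse(-1);
    }
}
```

Grouping by the time difference works because all the links of the matching song are shifted by the same amount: the offset of the recording within the full song.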

3. Show me the code

There are two implementations of this same algorithm. You can find both on GitHub:
Please keep in mind that I'm not a C++ expert, so wear your goggles before visiting my repo.