Music recognition AI

Wouldn't it be cool if a program could find the title of that song playing on the radio? Then this is your lucky day, because the algorithm I'm going to describe could come in handy.

Anyway, why are you listening to the radio while reading my article?!

From an abstract point of view, the algorithm works like this:
  1. it analyzes a collection of songs, then stores some computed information about those songs in a database
  2. you make it listen to a short sample of the song you want to find the name of
  3. if the song is in the database a title is returned

Actually, I made this project quite a while ago (roughly in 2017) as my final high school thesis. The thing is that only recently was I able to recover both the original Java source code and the \(\LaTeX\) sources of the thesis.

1. Music and physics

A sound is a vibration that propagates through the air and can be perceived by the human ear. MP3 players and computers use headphones or built-in speakers to produce vibrations and make "sounds". Music is just a kind of signal that can be represented as a combination of sinusoidal waveforms, and it must be treated as such.

1.1. Signals as sums of sines

A sine wave is mainly characterized by two parameters: its amplitude and its frequency.
Figure 1.1: A 20Hz sine wave
A generic sine wave is described by the function: \[y(t) = A \cdot \sin(2\pi f t + \varphi)\] where \(A\) is the amplitude, \(f\) is the frequency (in Hertz) and \(\varphi\) is the phase. In figure 1.1 the sine wave is described by the function: \[y(t)=1\cdot\sin(2\pi \cdot 20 t + 0)=\sin(40\pi t)\]
Figure 1.2: Three different sine waves
But, in the real world, almost any signal is a weighted sum of sines. For example, take three different sine waves with different amplitudes and frequencies.
Figure 1.3: Sum of the three sine waves
These three functions can be plotted separately, as shown in figure 1.2, or they can be summed into a single function to represent a more realistic sound (figure 1.3).
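To make the idea concrete, here is a small Java sketch (not taken from the original project; the amplitudes and frequencies are made up) that samples three sine waves and adds them into a single signal, just like in figure 1.3:

```java
public class SineSum {
    // Evaluate A * sin(2*pi*f*t + phi) at time t (in seconds).
    static double sine(double a, double f, double phi, double t) {
        return a * Math.sin(2 * Math.PI * f * t + phi);
    }

    public static void main(String[] args) {
        double sampleRate = 8000;           // samples per second
        int samples = 80;                   // 10 ms of signal
        double[] sum = new double[samples];

        for (int i = 0; i < samples; i++) {
            double t = i / sampleRate;
            // Three hypothetical sine waves, summed sample by sample.
            sum[i] = sine(1.0, 20, 0, t)
                   + sine(0.5, 100, 0, t)
                   + sine(0.25, 250, 0, t);
        }
        for (double value : sum) System.out.println(value);
    }
}
```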

1.2. Spectrogram

A spectrogram (figure 1.4) is a visual representation of how the intensities of the frequencies in a signal vary with time. Usually, a spectrogram is made up of three axes (making a 3D graph): time on the horizontal axis, frequency on the vertical axis, and intensity encoded as the color of each point.
Figure 1.4: Spectrogram of a real song

1.3. Digitalization

Nowadays the most common way to listen to music is using a digital file (such as an MP3 file or a FLAC one). However, a sound is an analog phenomenon that needs to be converted into a digital representation to be easily recorded and stored.

1.3.1. Sampling

Analog signals are continuous signals, which means that given two times \(t_1\) and \(t_2\) within the signal, the signal is defined for every \(t_3\) such that \(t_1<t_3<t_2\). But in the digital world it is not feasible to store an infinite amount of data, so the analog signal needs to be reduced to a discrete-time signal. This process is called sampling. It is quite simple: an instantaneous value of the continuous signal is taken every \(T\) seconds. \(T\) is called the sampling period, and it should be short enough so that the digital song sounds like the analog one.
The sampling rate or sampling frequency \(f_s\) is the number of samples obtained in one second, given by the formula: \[f_s=\frac{1}{T}\] The standard sampling rate for digital music is 44100 Hz. The reason behind that number lies in the Nyquist-Shannon theorem, which can be stated as:
Nyquist-Shannon theorem
Suppose the highest frequency component of a given analog signal is \(f_{max}\); then the sampling rate \(f_s\) needs to be chosen such that \(f_s \geq 2 f_{max}\).
There is more theory behind the theorem, but in essence it states that an analog signal needs at least 2 samples per cycle of its highest frequency component to be correctly reconstructed. So, since human ears can hear signals whose frequency is between 20 Hz and 20 kHz, taking the upper bound and multiplying it by 2 gives 40 kHz, which is close to the standardized 44.1 kHz (figure 1.5).
Figure 1.5: Sampling example of a 30 Hz signal

1.3.2. Quantization

With sampling alone the signal is not fully digital: the time axis has become discrete, but what about the signal amplitude? The amplitude of a signal represents its loudness, and in the analog world it is continuous, so there must be a way to make it discrete. This process is called quantization. The quantization resolution is measured in bits and is known as bit depth.
Taking a depth of 3 bits means that the "loudness" of the song can vary between \(0\) and \(2^3 - 1\), so there are just 8 quantization levels to represent the loudness of the whole song (figure 1.6).
Figure 1.6: 3-bit quantization example
The higher the bit depth, the better the amplitude is approximated. The standard quantization resolution is 16 bits. For instance, quantizing the same analog signal with 16 bits, the amplitude can assume \(2^{16}\) values, which results in much higher accuracy (figure 1.7).
Figure 1.7: 16-bit quantization example
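As a rough illustration (a sketch of mine, not the project's code), the following snippet maps an amplitude in the range \([-1, 1]\) to one of the \(2^{bits}\) quantization levels:

```java
public class Quantizer {
    // Map an amplitude in [-1.0, 1.0] to one of 2^bits discrete levels.
    static int quantize(double amplitude, int bits) {
        int levels = 1 << bits;  // 2^bits quantization levels
        // Scale [-1, 1] to [0, levels - 1] and round to the nearest level.
        int level = (int) Math.round((amplitude + 1.0) / 2.0 * (levels - 1));
        return Math.max(0, Math.min(levels - 1, level));
    }

    public static void main(String[] args) {
        double sample = 0.3;
        System.out.println("3-bit level:  " + quantize(sample, 3));   // one of 8 levels
        System.out.println("16-bit level: " + quantize(sample, 16));  // one of 65536 levels
    }
}
```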

1.4. Computing the spectrum

The previous sections should have given enough information to proceed to the real problem: how to break down a complex audio signal into pure sine waves with their own parameters.

1.4.1. Discrete Fourier Transform

The DFT (Discrete Fourier Transform) applies to discrete signals and gives a discrete spectrum. \[X(n)=\sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}} \tag{1}\] where \(x(t)\) is the sampled signal, \(N\) is the window size (the number of samples under analysis), \(n\) is the index of the frequency bin and \(i\) is the imaginary unit. The interpretation is that the vector \(x\) represents the signal level at various points in time, while the vector \(X\) represents the signal level at various frequencies. What formula \((1)\) states is that the signal level at frequency \(n\) is equal to the sum of the signal level at each time \(t\) multiplied by a complex exponential (figure 1.8).
Figure 1.8: Visual representation of FT
For example, for an audio signal window of 512 samples, this formula must be applied 512 times, once for each value of \(n\), and each output value refers to a bin of frequencies rather than to an exact frequency. A bin of frequencies is a group of frequencies between two boundaries.
The reason why the DFT computes bins of frequencies and not exact frequencies is quite simple: the DFT gives a discrete spectrum. A bin of frequencies is the smallest unit of frequencies the DFT can compute, and the size of the bin is called spectral resolution or frequency resolution, which is given by the formula: \[B_S=\frac{F_S}{N}\] where \(B_S\) is the bin size, \(F_S\) is the sampling rate and \(N\) is the window size. For instance, taking a sampling rate of 8000 Hz and a window size of 512, the bin size will be 15.6 Hz, so the first bin covers the frequencies from 0 Hz to 15.6 Hz, the second one from 15.6 Hz to 31.2 Hz, and so on. A particularity of a real-valued signal (such as an audio recording) is that only half of the bins computed by the DFT are needed, since the output of the DFT is symmetric. In this case, fewer calculations can be made by exploiting this property, which goes under the name of conjugate complex symmetry.
Conjugate complex symmetry of the DFT
If a function \(x(t)\) is real-valued then: \[X(N-n)=X^*(n) \tag{2}\] where \(X^*(n)\) denotes the complex conjugate of \(X(n)\).
Proof
Substitute \((1)\) into property \((2)\), remembering that \(e^{-i 2\pi t}=1\) for every integer \(t\) and that \(x(t)\) is real-valued: $$\begin{aligned} X(n) &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}} \\ X(N-n) &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t (N-n)}{N}} \\ &= \sum_{t=0}^{N-1} x(t)e^{-i 2\pi t}e^{i\frac{2\pi t n}{N}} \\ &= \sum_{t=0}^{N-1} x(t)e^{i\frac{2\pi t n}{N}} \\ &= \left(\sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}}\right)^* \\ &= X^*(n) \end{aligned}$$
Hence, if the window size is equal to 512, each bin \(X(512-n)\) is just the complex conjugate of \(X(n)\), so the DFT needs to be computed only for half the window size (256 times in this example).
To be accurate, most real-DFT implementations output an array of length \(N/2 + 1\), where \(N\) is the window size. Taking, as always, a sampling rate of 8000 Hz and a window size of 512 (with the bin size being 15.6 Hz), the output has 257 values: index 0 is the DC component (0 Hz), index 1 covers the frequencies around 15.6 Hz, and so on up to index 256, which corresponds to the Nyquist frequency of 4000 Hz.
Proof: DC component
Calculate the DFT \((1)\) with \(n=0\): $$\begin{aligned} X(n) |_{n=0} &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}}|_{n=0} \\ &= \sum_{t=0}^{N-1} x(t) \end{aligned}$$
The DC component is simply ignored by the algorithm implementation and, in most cases, equals 0.
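To tie formula \((1)\) and the symmetry property together, here is a naive Java sketch (quadratic in time, unlike the FFT used by real implementations) that computes only the first \(N/2+1\) magnitude bins of a real-valued window:

```java
public class NaiveDft {
    // Magnitudes of the first N/2 + 1 DFT bins of a real-valued signal x,
    // following X(n) = sum_t x(t) * e^(-i*2*pi*t*n/N).
    static double[] magnitudes(double[] x) {
        int n = x.length;
        double[] out = new double[n / 2 + 1];
        for (int bin = 0; bin <= n / 2; bin++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * t * bin / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            out[bin] = Math.sqrt(re * re + im * im);
        }
        return out;
    }

    public static void main(String[] args) {
        // 512 samples of a 1 kHz sine sampled at 8 kHz: the peak lands in bin 1000 / 15.625 = 64.
        int n = 512;
        double fs = 8000, f = 1000;
        double[] x = new double[n];
        for (int t = 0; t < n; t++) x[t] = Math.sin(2 * Math.PI * f * t / fs);
        double[] mag = magnitudes(x);
        System.out.println("Peak near bin 64: " + mag[64]);
    }
}
```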

1.4.2. Window function

Now the problem is partially solved: the DFT can be used to obtain the frequency amplitudes for (say) the first \(\frac{1}{10}\) of a second of the song, then for the second tenth, the third, and so on. This process is called signal windowing: only a part of the signal is analyzed at a time, instead of analyzing it all in one take.
The problem is that in this way a rectangular function is implicitly applied: a function that equals 1 for the song portion under analysis and 0 elsewhere (figure 1.9).
Figure 1.9: Rectangular window function example


Windowing the audio signal means multiplying it by a window function that depends on the piece of the audio signal under analysis. The usage of a window function produces spectral leakage: the power of a frequency leaks into the nearby frequencies, making the DFT less accurate.

Spectral leakage cannot be avoided but it can be controlled and reduced by choosing the right window function: there are many different window functions besides the rectangular one. Just to name a few: triangular, Blackman, Hamming, Hann.
When analyzing unknown, possibly very noisy data, the best choice is usually the Hann window function, defined by the following formula: \[w(n)=\frac{1}{2}\left(1-\cos\left(\frac{2\pi n}{N-1}\right)\right)\] where \(n\) is the index of the sample within the window and \(N\) is the window size. The aim of this window function is to decrease the amplitude of the discontinuities at the boundaries of a given piece of the audio signal (see figure 1.10).
Figure 1.10: Shaping the signal with a window function
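A minimal Java sketch of the Hann window applied sample by sample to a frame (the original project may implement it differently):

```java
public class HannWindow {
    // Hann coefficients: w(n) = 0.5 * (1 - cos(2*pi*n / (N - 1))).
    static double[] coefficients(int size) {
        double[] w = new double[size];
        for (int n = 0; n < size; n++) {
            w[n] = 0.5 * (1 - Math.cos(2 * Math.PI * n / (size - 1)));
        }
        return w;
    }

    // Multiply each sample of the frame by the corresponding window coefficient.
    static double[] apply(double[] frame) {
        double[] w = coefficients(frame.length);
        double[] shaped = new double[frame.length];
        for (int n = 0; n < frame.length; n++) shaped[n] = frame[n] * w[n];
        return shaped;
    }
}
```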

1.4.3. Downsampling and window size

Downsampling is the process of reducing the sampling rate of a signal. The window size can then be reduced accordingly, taking the smallest value that still gives an acceptable frequency resolution. For instance, resample a song from 44.1 kHz to 8 kHz and use a window size of just 512: in this way there are fewer samples to analyze and the DFT is applied fewer times.
The only difference is that the resampled song will only have frequencies from 0 to 4 kHz (see the Nyquist-Shannon theorem), but the most important part of the song is still present in this range of frequencies.

2. The actual algorithm

Now is the time to put everything together and start coding the algorithm. The aim of the software is to record a small portion of a song and find its title.
The following sections will cover a global overview of the algorithm, then the actual process involved in the song scoring.
Take into account that, at this abstraction level, some obvious parts (such as the track file reading routines) will not be described, since they are standard pieces of code.

2.1. Global overview

An audio fingerprint is a digital summary that can be used to identify an audio sample. In figure 2.1 a simplified architecture of the scoring algorithm is represented.

Figure 2.1: General overview scheme
On the server-side, every song of the collection is fingerprinted and its fingerprints are stored in a database together with the song information. On the client-side, a short recording is fingerprinted in the same way and sent to the server, which looks for the best match and returns the corresponding title. The following sections describe the algorithm step by step.

2.2. Spectrogram creation

The first step in analyzing the audio is to create the spectrogram. The process goes roughly as follows:
  1. the audio is downsampled to 8 kHz (see Downsampling and window size)
  2. the signal is split into windows of 512 samples, with consecutive windows partially overlapping (figure 2.2)
  3. each window is shaped with the Hann window function to reduce spectral leakage
  4. the DFT of each window is computed, and the resulting bin magnitudes form one column of the spectrogram
Figure 2.2: Audio windowing with overlap
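As a sketch of this step, assuming a window size of 512 samples and a hypothetical 50% overlap between consecutive windows, and reusing the `HannWindow` and `NaiveDft` helpers sketched earlier:

```java
public class Spectrogram {
    // Build a spectrogram: one row of bin magnitudes per (overlapping) window.
    static double[][] compute(double[] samples, int windowSize, int hop) {
        int windows = (samples.length - windowSize) / hop + 1;
        double[][] spectrogram = new double[windows][];
        for (int w = 0; w < windows; w++) {
            double[] frame = new double[windowSize];
            System.arraycopy(samples, w * hop, frame, 0, windowSize);
            double[] shaped = HannWindow.apply(frame);     // reduce spectral leakage
            spectrogram[w] = NaiveDft.magnitudes(shaped);  // windowSize / 2 + 1 bins
        }
        return spectrogram;
    }

    public static void main(String[] args) {
        double[] song = new double[8000];           // 1 second of (silent) audio at 8 kHz
        double[][] spec = compute(song, 512, 256);  // hop = 256 samples, i.e. 50% overlap
        System.out.println(spec.length + " windows x " + spec[0].length + " bins");
    }
}
```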

2.3. Peaks finder and fingerprinting

The spectrogram is then processed to obtain its most significant information. It is divided into a sort of grid where each cell spans \(C\) consecutive windows in time and one frequency band in frequency. Take the simplified figure 2.3, where each color represents a band.
Figure 2.3: Spectrogram division (each color is a band)
In the actual algorithm, \(C = 32\) and the bands' lengths follow a logarithmic scale (see Dividing the spectrogram into bands).
For each cell, the algorithm finds and stores the 3 most powerful frequencies in a vector (figure 2.4).
Figure 2.4: Three points per cell spectrogram
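A possible Java sketch of the per-cell peak search; the `Point` record and the method names are mine, not necessarily the original ones:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PeakFinder {
    // A spectrogram point: which window, which frequency bin, how loud.
    record Point(int window, int bin, double amplitude) {}

    // For one cell (C consecutive windows, one band of bins), keep the 3 loudest points.
    static List<Point> cellPeaks(double[][] spectrogram,
                                 int firstWindow, int c,
                                 int firstBin, int lastBin) {
        List<Point> points = new ArrayList<>();
        for (int w = firstWindow; w < firstWindow + c && w < spectrogram.length; w++) {
            for (int b = firstBin; b < lastBin && b < spectrogram[w].length; b++) {
                points.add(new Point(w, b, spectrogram[w][b]));
            }
        }
        points.sort(Comparator.comparingDouble(Point::amplitude).reversed());
        return points.subList(0, Math.min(3, points.size()));
    }
}
```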

2.3.1. Dividing the spectrogram into bands

There exist at least a few ways to divide the spectrogram into meaningful frequency ranges, but the most interesting one is the mel scale. It was experimentally discovered that the higher the frequency, the harder it is for a human ear to notice the difference between two relatively close frequencies.
In other words, humans perceive frequencies on a logarithmic scale, so the difference between 500 Hz and 1000 Hz is more noticeable than, say, the difference between 10000 Hz and 10500 Hz.
The formula to convert from the mel (logarithmic) scale back to the Hertz (linear) scale is: \[f(m)=700\left(10^{\frac{m}{2595}}-1\right) \tag{3}\] where \(m\) is a value on the mel scale and \(f(m)\) is the corresponding frequency in Hertz. The plotted formula is represented in figure 2.5.
Figure 2.5: Mel scale to Hertz mapping
The algorithm follows this approach:
  1. A starting mel value \(\alpha\) is chosen
  2. A mel band size \(\delta\) is chosen
  3. Compute the mel bands boundaries as: \[\left\{0, \alpha, \alpha + \delta, \alpha + 2\delta, ..., \alpha + k\delta \right\}\]
  4. Convert the mel band boundaries back to Hertz with formula \((3)\): \[\left\{0, f(\alpha), f(\alpha+\delta), f(\alpha+2\delta), ..., f(\alpha+k\delta)\right\}\]
Each band will contain roughly the same "amount of information" from the listener's point of view.
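A small Java sketch of this band computation using formula \((3)\); the values of \(\alpha\), \(\delta\) and \(k\) below are hypothetical, chosen only to roughly cover the 0-4 kHz range of an 8 kHz recording:

```java
public class MelBands {
    // Formula (3): convert a mel value back to Hertz.
    static double melToHertz(double mel) {
        return 700 * (Math.pow(10, mel / 2595.0) - 1);
    }

    // Band boundaries in Hertz: {0, f(alpha), f(alpha + delta), ..., f(alpha + k*delta)}.
    static double[] bandBoundaries(double alpha, double delta, int k) {
        double[] bounds = new double[k + 2];
        bounds[0] = 0;
        for (int i = 0; i <= k; i++) {
            bounds[i + 1] = melToHertz(alpha + i * delta);
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Hypothetical parameters: the top boundary f(300 + 8*230) is close to 4 kHz.
        for (double b : bandBoundaries(300, 230, 8)) {
            System.out.printf("%.1f Hz%n", b);
        }
    }
}
```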

2.4. Links structure

Now, having a way to compute a frequency summary of an audio signal, the recognition process should be straightforward: each point of the recording is compared with the points of the full song. Though this simple approach works, it requires a lot of computation time and power.
Instead of comparing each point one by one, the idea is to look for multiple points at the same time. This group of points is called a link, and it represents a relationship between a point and some others. To be sure that both the recording and the full song will generate the same links, a rule must be defined: spectrogram points must be sorted in ascending order according to the window they belong to.

So, given a point \(\alpha\) (called the address) and an initially empty set of points \(A\), the algorithm puts in \(A\) all the other points \(\beta\) that are in the same band as \(\alpha\) and whose window difference from \(\alpha\) is at least 1 and less than 3. In symbols: \[ \beta\in A\Leftrightarrow(1\leq\beta\textrm{.window}-\alpha\textrm{.window}<3)\wedge(\beta\textrm{.band}=\alpha\textrm{.band}) \] For each point \(\beta\) belonging to \(A\), a link is computed and added to a list (figure 2.6). The link structure is the following (figure 2.7): \[ \textrm{link} \left\{\begin{matrix} \textrm{.hash} = h\left ( \delta_w, \delta_f, \alpha\textrm{.frequency} \right ) \\ \textrm{.window} = \alpha\textrm{.window} \end{matrix}\right. \] where \(h\) is a hash function, \(\delta_w\) is the window difference between \(\beta\) and \(\alpha\), and \(\delta_f\) is the frequency difference between \(\beta\) and \(\alpha\). By doing so the links are quite reproducible, even in the presence of noise and codec compression. Another advantage is that in this way all the times are relative.
On the server-side each link is stored in a database along with the song information.
Figure 2.6: Links creation
Figure 2.7: Link structure
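A minimal Java sketch of the link creation rule; the field names are mine and the hash function \(h\) is just a placeholder (the original project may combine \(\delta_w\), \(\delta_f\) and the address frequency differently):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class Links {
    // A spectrogram point and the resulting link structure.
    record Point(int window, int band, int frequency) {}
    record Link(int hash, int window) {}

    // Placeholder for h: combine the window delta, the frequency delta and the address frequency.
    static int hash(int deltaWindow, int deltaFrequency, int anchorFrequency) {
        return Objects.hash(deltaWindow, deltaFrequency, anchorFrequency);
    }

    // Build the links for an address point alpha against the (window-sorted) point list.
    static List<Link> linksFor(Point alpha, List<Point> points) {
        List<Link> links = new ArrayList<>();
        for (Point beta : points) {
            int deltaWindow = beta.window() - alpha.window();
            boolean closeInTime = deltaWindow >= 1 && deltaWindow < 3;
            boolean sameBand = beta.band() == alpha.band();
            if (closeInTime && sameBand) {
                int deltaFrequency = beta.frequency() - alpha.frequency();
                links.add(new Link(hash(deltaWindow, deltaFrequency, alpha.frequency()),
                                   alpha.window()));
            }
        }
        return links;
    }
}
```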

2.5. Scoring

It is assumed that the full song links are in the database. Now a client, after recording a small piece of a song, computes the links of the recording on its own and sends them to the server. The server then has to (a code sketch follows the list):
  1. Put the recording links in a temporary in-memory table
  2. Join the full-song links table with the recording links table where they share the same hash
  3. Count the joined tuples, grouped by song id and by the time difference between the recording link and the full-song link (links of the same song share the same time difference)
  4. Sort the tuples by the count field in descending order
  5. The topmost tuple obtained contains the id of the most likely match
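On the server this matching is a database query, but the same grouping and counting logic can be sketched in plain Java, assuming link and stored-link structures similar to the previous sketch:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Scorer {
    record Link(int hash, int window) {}
    record StoredLink(int songId, int window) {}
    record Key(int songId, int timeDifference) {}

    // Return the id of the song whose links best line up with the recording links,
    // or -1 when nothing matches.
    static int bestMatch(List<Link> recording, Map<Integer, List<StoredLink>> database) {
        // (songId, timeDifference) -> number of matching links.
        Map<Key, Integer> counts = new HashMap<>();
        for (Link link : recording) {
            for (StoredLink stored : database.getOrDefault(link.hash(), List.of())) {
                int timeDifference = stored.window() - link.window();
                counts.merge(new Key(stored.songId(), timeDifference), 1, Integer::sum);
            }
        }
        // The (songId, timeDifference) pair with the highest count is the likely match.
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(e -> e.getKey().songId())
                .orElse(-1);
    }
}
```

Grouping by the time difference works because all the links of the matching song are shifted by the same amount: the offset of the recording within the full song.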

3. Show me the code

There are two implementations of this same algorithm. You can find both on GitHub:
Please keep in mind that I'm not a C++ expert, so wear your goggles before visiting my repo.