Anyway, why are you listening to the radio while reading my article?!

From an abstract point of view, the algorithm works like this:

- it analyzes a collection of songs, then stores some computed information about them in a database
- you make it listen to a short sample of the song whose name you want to find
- if the song is in the database, its title is returned

Actually, I made this project quite a while ago (roughly in 2017) as my final high school thesis; only recently was I able to recover both the original Java source code and the \(\LaTeX\) sources of the thesis.

- *Amplitude* \(\Longrightarrow\) the height of the cycle
- *Frequency* \(\Longrightarrow\) the number of cycles per unit of time, measured in Hertz
- *Phase* \(\Longrightarrow\) how much the first cycle is shifted forward or backward w.r.t. \(t=0\)

A sinusoid can be described by the formula: \[a(t) = A\sin(2\pi f t + \varphi)\] where:

- \(A\) is the amplitude
- \(f\) is the frequency
- \(\varphi\) is the phase
- \(t\) is the time

- Frequency 10 Hz and amplitude 1 \(\Longrightarrow a(t) = \sin(20\pi t)\)
- Frequency 30 Hz and amplitude 2 \(\Longrightarrow b(t) = 2\cdot\sin(60\pi t)\)
- Frequency 60 Hz and amplitude 3 \(\Longrightarrow c(t) = 3\cdot\sin(120\pi t)\)
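As a quick sketch (an illustration, not code from the thesis), the three sinusoids above can be summed sample by sample at an assumed sampling rate:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sample the sum of the three example sinusoids:
// s(t) = sin(2*pi*10*t) + 2*sin(2*pi*30*t) + 3*sin(2*pi*60*t)
std::vector<double> sample_signal(double sample_rate, std::size_t n_samples) {
    const double pi = std::acos(-1.0);
    std::vector<double> samples(n_samples);
    for (std::size_t i = 0; i < n_samples; ++i) {
        const double t = static_cast<double>(i) / sample_rate;
        samples[i] = std::sin(2.0 * pi * 10.0 * t)
                   + 2.0 * std::sin(2.0 * pi * 30.0 * t)
                   + 3.0 * std::sin(2.0 * pi * 60.0 * t);
    }
    return samples;
}
```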

- Time on the horizontal axis (\(x\))
- Frequencies on the vertical axis (\(y\))
- Intensity of a given frequency at a given time (encoded by colors)

The choice of the sampling rate is not arbitrary: it is governed by the Nyquist-Shannon theorem.

There is some theory behind the theorem, but in short it states that, given an analog signal, at least 2 samples per cycle are needed to reconstruct it correctly. So, since the human ear can hear signals whose frequency is between 20 Hz and 20 kHz, taking the upper bound and multiplying it by 2 gives 40 kHz, which is close enough to the standardized 44.1 kHz (figure 1.5).

**Nyquist-Shannon theorem**: if the highest frequency component of a given analog signal is \(f_{max}\), then the sampling rate \(f_s\) needs to be chosen such that \(f_s \geq 2 f_{max}\).

Taking a depth of 3 bits means that the "loudness" of the song can vary between \(0\) and \(2^3 - 1\), so there are just 8 quantization levels to represent the loudness of the whole song (figure 1.6). The higher the bit depth, the better the amplitude is approximated. The standard quantization resolution is 16 bits, which yields \(2^{16} = 65536\) quantization levels.
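As an illustrative sketch (not code from the thesis), quantizing a normalized sample to a given bit depth looks like this:

```cpp
#include <cassert>
#include <cmath>

// Map a sample in [-1.0, 1.0] to an integer level in [0, 2^bits - 1].
// With bits = 3 there are only 8 possible levels, as in figure 1.6.
int quantize(double sample, int bits) {
    const int max_level = (1 << bits) - 1;           // e.g. 7 for 3 bits
    const double normalized = (sample + 1.0) / 2.0;  // map [-1, 1] -> [0, 1]
    return static_cast<int>(std::lround(normalized * max_level));
}
```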

The DFT is defined as: \[X(n)=\sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}} \tag{1}\] where:

- \(N\) is the size of the **window**: the number of samples that compose the signal
- \(X(n)\) is the \(n^{th}\) bin of frequencies
- \(x(t)\) is the \(t^{th}\) sample of the audio signal

For example, given an audio signal with 512 samples, this formula must be applied 512 times:

- Once for \(n = 0\) to compute the \(0^{th}\) bin of frequencies
- Once for \(n = 1\) to compute the \(1^{st}\) bin of frequencies
- ...
- Once for \(n = 511\) to compute the \(511^{th}\) bin of frequencies
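A direct translation of formula \((1)\) might look like the following: a naive \(O(N^2)\) sketch just for illustration (a real implementation would use an FFT):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive DFT: apply formula (1) once per frequency bin n.
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> X(N);
    for (std::size_t n = 0; n < N; ++n) {       // one pass per bin
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < N; ++t)     // sum over all samples
            sum += x[t] * std::polar(1.0, -2.0 * pi * t * n / N);
        X[n] = sum;
    }
    return X;
}
```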

The reason why the DFT computes bins of frequencies, and not exact frequencies, is quite simple: the DFT has a finite frequency resolution, given by the bin size: \[B_S = \frac{F_S}{N}\] where:

- \(B_S\) is the **bin size**
- \(F_S\) is the **sampling rate**
- \(N\) is the **number of samples** or the **size of the window** (more on this later)

- The \(0^{th}\) bin contains the frequencies between 0 Hz and 15.6 Hz
- The \(1^{st}\) bin contains the frequencies between 15.6 Hz and 31.2 Hz
- And so on
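The bin bounds above come straight from the bin-size formula (the article rounds \(8000 / 512 = 15.625\) Hz to 15.6 Hz); a tiny sketch:

```cpp
#include <cassert>

// Bin size in Hz: sampling rate divided by window size (B_S = F_S / N).
double bin_size(double sampling_rate, int window_size) {
    return sampling_rate / window_size;
}

// The n-th bin covers frequencies from n * B_S up to (n + 1) * B_S.
double bin_lower_bound(int n, double sampling_rate, int window_size) {
    return n * bin_size(sampling_rate, window_size);
}
```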

Hence, if the window size is equal to 512, the DFT algorithm needs to be repeated only half as many times as the window size (256 times in this example), thanks to the following property.

**Conjugate complex symmetry of the DFT**: if a function \(x(t)\) is real-valued, then: \[X(N-n)=X^*(n) \tag{2}\] where:

- \(X(\odot)\) is the output of the DFT applied to \(x(t)\)
- \((\odot)^*\) denotes the complex conjugate
**Proof**: insert \((1)\) into property \((2)\): $$\begin{aligned} X(n) &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}} \\ X(N-n) &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t (N-n)}{N}} \\ &= \sum_{t=0}^{N-1} x(t)e^{-i 2\pi t}e^{i\frac{2\pi t n}{N}} \\ &= \sum_{t=0}^{N-1} x(t)e^{i\frac{2\pi t n}{N}} \\ &= \left(\sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}}\right)^* \\ &= X^*(n) \end{aligned}$$ where \(e^{-i2\pi t}=1\) because \(t\) is an integer, and the conjugation step holds because \(x(t)\) is real-valued.
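Property \((2)\) can also be checked numerically; a self-contained sketch that recomputes single DFT bins inline:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Check X(N - n) == conj(X(n)) for a real-valued signal x.
bool symmetry_holds(const std::vector<double>& x, std::size_t n) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    auto bin = [&](std::size_t k) {            // single DFT bin, formula (1)
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < N; ++t)
            sum += x[t] * std::polar(1.0, -2.0 * pi * t * k / N);
        return sum;
    };
    return std::abs(bin(N - n) - std::conj(bin(n))) < 1e-9;
}
```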

To be accurate, most real-DFT implementations output an array of length \(N/2 + 1\), where \(N\) is the window size. Taking, as always, a sampling rate of 8000 Hz and a window size of 512 (with the bin size being 15.6 Hz):

- The \(0^{th}\) bin contains the so-called **DC component** or **DC offset**, being the sum of each sample in the window (see below)
- The \(1^{st}\) bin contains the frequencies between 0 Hz and 15.6 Hz
- The \(2^{nd}\) bin contains the frequencies between 15.6 Hz and 31.2 Hz
- And so on

The DC component is simply ignored by the algorithm implementation and, in most cases, equals 0.

**Proof (DC component)**: calculate the DFT \((1)\) with \(n=0\): $$\begin{aligned} X(n)\big|_{n=0} &= \sum_{t=0}^{N-1} x(t)e^{-i\frac{2\pi t n}{N}}\bigg|_{n=0} \\ &= \sum_{t=0}^{N-1} x(t) \end{aligned}$$ since \(e^{0}=1\).

The problem is that, in this way, a phenomenon called **spectral leakage** appears: abruptly cutting the signal into chunks is equivalent to multiplying it by a rectangular window, which distorts the resulting spectrum.

By windowing the audio signal, each analyzed piece of the signal is multiplied by a window function. The usage of a window function produces smoother transitions at the chunk boundaries, which keeps the distortion of the spectrum under control.

Spectral leakage cannot be avoided but it can be controlled and reduced by choosing the right window function: there are many different window functions besides the rectangular one. Just to name a few:

When analyzing unknown, very noisy data, the best choice is the Hann window function, defined by the following formula: \[w(n)=\frac{1}{2}\left(1-\cos\left(\frac{2\pi n}{N-1}\right)\right)\] where:

- \(N\) is the size of the window
- \(w(n)\) is the value of the window function at \(n\)
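As a sketch, the whole window can be precomputed once (as the implementation overview later does with a 512-sized window):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Precompute a Hann window: w(n) = 0.5 * (1 - cos(2*pi*n / (N - 1))).
std::vector<double> hann_window(std::size_t N) {
    const double pi = std::acos(-1.0);
    std::vector<double> w(N);
    for (std::size_t n = 0; n < N; ++n)
        w[n] = 0.5 * (1.0 - std::cos(2.0 * pi * n / (N - 1)));
    return w;
}
```

Note how the window is 0 at both ends and close to 1 in the middle: that is exactly what smooths the chunk boundaries.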

The only difference is that the resampled song will only have frequencies from 0 to 4 kHz (see the Nyquist-Shannon theorem), but the most important part of the song is still present in this range of frequencies.

The following sections will cover a global overview of the algorithm, then the actual process involved in the song scoring.

Take into account that, at this abstraction level, some obvious parts (such as the track file reading routines) will not be described, since they are standard pieces of code.

On the **server side**:

- The algorithm computes the fingerprints of the input tracks
- The computed fingerprints are stored in the database

On the **client side**:

- The currently playing music is recorded for a couple of seconds
- The algorithm computes the fingerprints of the recording
- The fingerprints are sent to the server
- The server analyzes the fingerprints and possibly outputs a song title

- A 512-sized Hann window function is computed
- The audio signal is divided into 512-sized windows with a 50% overlap (figure 2.2)
- Each audio window is multiplied by the Hann precomputed window
- The DFT is computed for each audio window and added into a vector which represents the spectrogram
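The steps above, minus the DFT itself, can be sketched as follows (the DFT call is left out to keep the sketch short):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Split the signal into 50%-overlapping windows and multiply each one by a
// precomputed Hann window; each resulting frame would then be fed to the DFT.
std::vector<std::vector<double>> make_frames(const std::vector<double>& signal,
                                             std::size_t window_size) {
    const double pi = std::acos(-1.0);

    // Step 1: precompute the Hann window once.
    std::vector<double> hann(window_size);
    for (std::size_t n = 0; n < window_size; ++n)
        hann[n] = 0.5 * (1.0 - std::cos(2.0 * pi * n / (window_size - 1)));

    // Steps 2-3: slide over the signal with a 50% overlap and apply the window.
    const std::size_t hop = window_size / 2;
    std::vector<std::vector<double>> frames;
    for (std::size_t start = 0; start + window_size <= signal.size(); start += hop) {
        std::vector<double> frame(window_size);
        for (std::size_t n = 0; n < window_size; ++n)
            frame[n] = signal[start + n] * hann[n];
        frames.push_back(frame);
    }
    return frames;
}
```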

- Width equals \(C\)
- Height equals a range of frequencies (called a **band**)

- \(C = 4\)
- The band size is always 100 Hz

For each cell, the algorithm finds and stores the 3 most powerful frequencies in a vector (figure 2.4).
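A minimal way to pick the 3 most powerful frequencies of a cell; representing the cell as (frequency, magnitude) pairs is an assumption of this sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Keep the 3 (frequency, magnitude) pairs with the highest magnitude.
std::vector<std::pair<double, double>>
top3(std::vector<std::pair<double, double>> cell) {
    const std::size_t keep = std::min<std::size_t>(3, cell.size());
    // Sort only the first `keep` positions, by descending magnitude.
    std::partial_sort(cell.begin(), cell.begin() + keep, cell.end(),
                      [](const std::pair<double, double>& a,
                         const std::pair<double, double>& b) {
                          return a.second > b.second;
                      });
    cell.resize(keep);
    return cell;
}
```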

In other words, humans perceive frequencies on a logarithmic scale: this is what the **mel scale** captures.

The formula to convert between the Hertz (linear) scale and the mel (logarithmic) scale is defined as: \[f(m)=700\left(10^{\frac{m}{2595}}-1\right) \tag{3}\] where:

- \(m\) is the mel frequency
- \(f\) is the Hertz frequency

- A starting mel value \(\alpha\) is chosen
- A mel band size \(\delta\) is chosen
- The mel band boundaries are computed as: \[\left\{0, \alpha, \alpha + \delta, \alpha + 2\delta, ..., \alpha + k\delta \right\}\]
- The mel band boundaries are converted back to Hertz with formula \((3)\): \[\left\{0, f(\alpha), f(\alpha+\delta), f(\alpha+2\delta), ..., f(\alpha+k\delta)\right\}\]
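The steps above can be sketched as follows, with \(\alpha\), \(\delta\) and \(k\) left as free parameters:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Formula (3): convert a mel value back to Hertz.
double mel_to_hertz(double m) {
    return 700.0 * (std::pow(10.0, m / 2595.0) - 1.0);
}

// Band boundaries {0, f(alpha), f(alpha + delta), ..., f(alpha + k*delta)}.
std::vector<double> mel_band_boundaries(double alpha, double delta, int k) {
    std::vector<double> bounds{0.0};
    for (int i = 0; i <= k; ++i)
        bounds.push_back(mel_to_hertz(alpha + i * delta));
    return bounds;
}
```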

Instead of comparing each point one by one, the idea is to look for multiple points at the same time. This group of points is called a **target zone**.

So, given a point \(\alpha\) (called the address) and an initially empty set of points \(A\), the algorithm puts in \(A\) every other point \(\beta\) whose window offset from \(\alpha\) is at least 1 and less than 3 and which lies in the same band as \(\alpha\). In symbols: \[ \beta\in A\Leftrightarrow(1\leq\beta\textrm{.window}-\alpha\textrm{.window}<3)\wedge(\beta\textrm{.band}=\alpha\textrm{.band}) \] For each point \(\beta\) belonging to \(A\), a link is computed and added to a list (figure 2.6). The link structure is the following (figure 2.7): \[ \textrm{link} \left\{\begin{matrix} \textrm{.hash} = h\left ( \delta_w, \delta_f, \alpha\textrm{.frequency} \right ) \\ \textrm{.window} = \alpha\textrm{.window} \end{matrix}\right. \] Where:

- \(\delta_w = \beta\textrm{.window} - \alpha\textrm{.window}\)
- \(\delta_f = \beta\textrm{.frequency} - \alpha\textrm{.frequency}\)
- \(h\) is a generic hash function
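A sketch of the link construction; the `Point` and `Link` structs and the bit-packing hash `h` below are assumptions of this sketch, not the thesis' exact layout:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A spectrogram peak: window index, band index and frequency.
struct Point { int window; int band; int frequency; };

// A link: a hash plus the anchor point's window index.
struct Link { std::uint64_t hash; int window; };

// Hypothetical hash h(delta_w, delta_f, frequency): here the three values
// are simply packed into one 64-bit integer; any mixing function would do.
std::uint64_t h(int delta_w, int delta_f, int frequency) {
    return (static_cast<std::uint64_t>(static_cast<std::uint16_t>(delta_w)) << 32)
         | (static_cast<std::uint64_t>(static_cast<std::uint16_t>(delta_f)) << 16)
         |  static_cast<std::uint64_t>(static_cast<std::uint16_t>(frequency));
}

// Build the links for an anchor point alpha against the candidate points.
std::vector<Link> make_links(const Point& alpha, const std::vector<Point>& points) {
    std::vector<Link> links;
    for (const Point& beta : points) {
        const int dw = beta.window - alpha.window;
        if (dw >= 1 && dw < 3 && beta.band == alpha.band)  // same condition as above
            links.push_back({h(dw, beta.frequency - alpha.frequency, alpha.frequency),
                             alpha.window});
    }
    return links;
}
```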

On the server side, each link is stored in a database along with the song information. To identify a recording, the server then performs the following steps:

- Put the recording links in a temporary in-memory table
- Join the full-song links table with the recording links table wherever they share the same hash
- Count the joined tuples, grouped by song id and by the time difference between the recording link and the full-song link (links of the same song share the same time difference)
- Sort the tuples by the count field in descending order
- The topmost tuple contains the id of the most likely match
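The five steps above can be mimicked in memory with plain maps instead of SQL tables; a sketch, with the `Link` and `DbEntry` layouts being assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// A link as computed by the fingerprinting step.
struct Link { std::uint64_t hash; int window; };

// One database row: a stored link plus the id of the song it belongs to.
struct DbEntry { Link link; int song_id; };

// Score every (song id, window delta) pair and return the song id of the
// most voted one, or -1 when no hash matches: this mimics the join, the
// grouped count and the descending sort described above.
int best_match(const std::vector<DbEntry>& db, const std::vector<Link>& recording) {
    std::map<std::pair<int, int>, int> votes;  // (song id, window delta) -> count
    for (const DbEntry& e : db)
        for (const Link& r : recording)
            if (e.link.hash == r.hash)
                ++votes[{e.song_id, e.link.window - r.window}];

    int best_song = -1;
    int best_count = 0;
    for (const auto& kv : votes)
        if (kv.second > best_count) {
            best_count = kv.second;
            best_song = kv.first.first;
        }
    return best_song;
}
```

The key insight is the grouping by time difference: links of the true match line up at one constant offset, so they pile votes onto a single (song id, delta) pair.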

Please keep in mind that I'm not a C++ expert, so wear your goggles before visiting my repo.