Welcome back to the course on Audio Signal Processing for Music Applications. Last week, we discussed some applications of the spectral modeling techniques that we have been presenting in the course, applications related to sound transformations. This week, we will cover another type of application, those that relate to describing sounds. In this first lecture, we start by talking about spectral-based audio features. We'll begin by presenting what we understand by audio features. Then we'll talk about the audio features that can be extracted from a single frame of a sound, from a single spectrum of the analysis. And then we will discuss audio features that require more than one frame, a series of frames, a segment of a sound, in order to be computed.

This is a generic block diagram of the process of extracting audio features from the spectral analysis. We have seen this diagram before. We start from our signal x, window a portion of it, and then compute the spectrum of that portion using the fast Fourier transform, resulting in the magnitude and phase spectra. It is from these that we then extract the relevant features of the audio signal, features that hopefully are of relevance to a particular application.

There are many types of features that can be extracted from audio signals, and many algorithms implementing their extraction. In this class, we'll use Essentia as the software tool to extract the audio features. Essentia is an open-source library for audio analysis, developed and maintained at the Music Technology Group in Barcelona, that contains algorithms for a large set of music descriptors. In the programming lectures, we will explain how to use Essentia and you'll have the opportunity to learn how to use it. Here, we just present some of the most common descriptors.

In this slide, you see some of the descriptors available in Essentia, grouped into categories. For example, we have features that are extracted from the spectrum, what we call spectral descriptors. Here we have things like BarkBands, MelBands, the MFCCs that we'll talk about, LPC, etc. There are a lot of features based on the spectral characteristics of the sound. Another group of features is extracted directly from the time signal, from the time-domain representation of the sound. Here we can talk about the effective duration of a sound, or about descriptors like the zero-crossing rate. Another group of descriptors relates to pitch information; typically they also use spectral-based techniques. Here we could talk about PitchSalience, or about what is called the HPCP, the harmonic pitch class profile, type of analysis, or identifying the key of a sound, etc. Then there is another group of descriptors that relate to rhythm. Again, some might be implemented in the spectral domain and some in the time domain, and the aim here is to identify things like the beat, the tempo, or the onsets of a sound. Then we have grouped some descriptors into what we call sound effects, which relate to, let's say, more standard types of sound characteristics, like the attack of a sound, or things related to the energy of a particular sound that may be of use for some descriptions. And finally, there are some high-level descriptors that attempt to describe higher-level concepts of sound and music, things like danceability or the complexity of a sound. Some of these descriptors are more complex to compute; they are not simply computed directly from the spectrum.
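Before going into the individual descriptors, here is a minimal sketch of the frame-wise analysis shown in the block diagram: windowing portions of the signal and computing the magnitude and phase spectrum of each portion with the FFT. This is only an illustration using numpy and scipy, not the exact code used in the course; the window type, frame size, FFT size, and hop size are arbitrary choices.

```python
import numpy as np
from scipy.signal import get_window

def frame_spectra(x, M=1001, N=1024, H=256, window='hamming'):
    """Slice x into frames and return the magnitude and phase spectrum of every frame.
    M: window size, N: FFT size (N >= M), H: hop size. Magnitudes are linear."""
    w = get_window(window, M)
    w /= w.sum()                          # normalize the window so magnitudes are comparable
    mX, pX = [], []
    for start in range(0, len(x) - M, H):
        xw = x[start:start + M] * w       # window one portion of the signal
        X = np.fft.rfft(xw, N)            # FFT, keeping only the positive frequencies
        mX.append(np.abs(X))              # magnitude spectrum of this frame
        pX.append(np.angle(X))            # phase spectrum of this frame
    return np.array(mX), np.array(pX)
```

The single-frame descriptors discussed below all start from one row of mX, that is, from one magnitude spectrum.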
So most of the features or descriptors are computed one frame at a time, so let me now discuss a few examples of these descriptors, so you can get an idea of how to go about it. We'll first talk about descriptors that relate to energy. Then we'll talk about descriptors that relate to the spectral shape of the sound; these include the spectral centroid and what we call the mel-frequency cepstral coefficients. And then I will present two descriptors, two features, that relate to pitch information: one is called pitch salience and the other is what we call the chroma features.

The energy of an audio frame can be computed from the magnitude spectrum; it can also be computed from the time signal. In the magnitude spectrum, we do it by summing the squared magnitudes. The first equation here shows that: we sum the square of every bin of the magnitude spectrum of the sound. Another energy-related measure is the RMS, or root mean square, which is a modified version of the energy: the square root of the arithmetic mean of the squared magnitudes. So it's another way to compute, or at least to visualize, the energy of the signal. And then, finally, there is another feature or descriptor implemented in Essentia that uses Stevens' power law, a proposed measure that tries to emulate a more perceptual view, the perceived intensity, by using the sound pressure level of a 3,000 Hz tone as a perceptual normalization. This is a very simple measure of loudness, and we call it loudness loosely: strictly speaking, to compute the perceived loudness of a sound we would have to use more sophisticated algorithms. But it is a first approximation to a perceptual measure, in this case loudness.

Here we see these three energy-related features as they have been computed from the piano sound that we have heard before. We have the time information, so we are computing these measures throughout the whole sound, one frame at a time. We can see that they are very much correlated, of course; they are all energy related. But there is quite a bit of difference between the three measures, and clearly the loudness measure, being more related to perceptual issues, can result in a rather different curve. Still, this is quite useful for measuring energy- or loudness-related aspects of a complete sound.

Now let's talk about a descriptor, a feature, that tries to characterize the spectral shape of a particular sound. This is the spectral centroid, which indicates where the center of mass of the spectrum is. Perceptually, it is very much related to the impression of brightness of a sound. It is calculated as the weighted mean of the frequencies present in the signal: the equation that we see here sums over the whole spectrum, weighting each bin frequency by its magnitude and normalizing by the total of the magnitudes.
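As an illustration, here is a minimal numpy sketch of these frame descriptors, each computed from one magnitude spectrum mX of a frame (for instance one row produced by the frame_spectra sketch above), with fs the sampling rate. The loudness line is only a Stevens'-power-law style approximation as described here (energy raised to an exponent of about 0.67); the exact weighting and normalization used by Essentia may differ.

```python
import numpy as np

def frame_energy(mX):
    """Energy of a frame: sum of the squared spectral magnitudes."""
    return np.sum(mX ** 2)

def frame_rms(mX):
    """RMS: square root of the arithmetic mean of the squared magnitudes."""
    return np.sqrt(np.mean(mX ** 2))

def frame_loudness(mX, exponent=0.67):
    """Very rough loudness in the spirit of Stevens' power law:
    the energy raised to a perceptually motivated exponent (0.67 is an assumption here)."""
    return frame_energy(mX) ** exponent

def spectral_centroid(mX, fs):
    """Spectral centroid: magnitude-weighted mean of the bin frequencies, in Hz."""
    freqs = np.arange(len(mX)) * fs / (2 * (len(mX) - 1))   # bin frequencies of an rfft spectrum
    return np.sum(freqs * mX) / (np.sum(mX) + np.finfo(float).eps)
```

Applied frame by frame over a whole recording, these functions produce curves like the ones shown for the piano and speech examples.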
With the spectral centroid we can see, in this plot, how the brightness changes as the sound varies in time. This is a speech sound, a male voice that we have also heard before, and below it we see the spectral centroid. The values are frequencies, so it is the center of mass of the spectrum, the center point of the spectrum, as it changes in time. We see that for this speech sound it varies between 1,000 Hz and 7,000 Hz. The noisy parts, and in fact the silences, clearly have a high centroid, and the voiced parts of the sound have a much lower centroid. So this is a good measure that can be used to characterize quite a few aspects of a sound, and in particular this idea of brightness.

Another feature that attempts to characterize the spectral shape, in a more complex way, is what we call the mel-frequency cepstral coefficients, normally abbreviated as MFCCs. The MFCCs are a representation of the magnitude spectrum, computed by taking the cosine transform of the log magnitude spectrum on a nonlinear scale, the so-called mel scale. The equation shows that: we take every spectrum, |X_l[k]|, and multiply it by a filter bank, a set of windows whose weighting changes with frequency in a way that depends on the mel scale. The intention is to apply a more perceptually based weighting to the magnitude spectrum resulting from the FFT. Then we take the log of that, and finally we compute the discrete cosine transform; the equation of the discrete cosine transform is shown here too. We also see the block diagram of the whole process: we start from the magnitude spectrum, split it with this bank of filters, portions of the spectrum distributed according to the mel scale, then take the log10 of that, then compute the DCT, and finally we obtain these coefficients, the mel-frequency cepstral coefficients.

The mel scale approximates the frequency resolution of the auditory system. It relates the perceived frequency of a pure tone to its actual measured frequency. We humans are much better at discerning small changes in pitch at low frequencies than at high frequencies, so by incorporating this scale we make the spectral features agree much more closely with what we actually hear. In this diagram we see the mel scale as a function: on the horizontal axis is the linear frequency scale, and on the vertical axis is this new, modified frequency scale, which, as you see, puts more emphasis on the low frequencies and less emphasis on the high frequencies. This is the redistribution of the frequency components, and of their energies, that we do in the MFCC analysis.

This is a visualization of the MFCC analysis of the same speech sound that we saw before, the male speech. The resulting representation, the plot below, is not so intuitive. The issue is that every coefficient represents a different level of abstraction of the spectral shape, so it has to be clearly distinguished from what we would consider a normal spectrogram; it is quite different. Here, for example, we see the first 12 coefficients. We can choose the number of coefficients, and 12 is a quite standard choice. In fact, the zeroth coefficient is not shown: normally we do not display it because it relates to the loudness, to the energy of the sound, and we have other measures for that. So we normally show from the first coefficient up to, as I said, around 12 coefficients. The first coefficient describes the biggest picture of the spectrum, the overall shape, and as we go higher up the coefficients describe finer details, smaller changes in the spectrum. This is normally used as a vector including all these coefficients at every frame, so we have a very compact representation, just 11 or 12 values, that can capture different aspects of the spectral shape.
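As a usage illustration, here is a minimal sketch of how MFCCs might be computed frame by frame with Essentia's standard mode, assuming the Windowing, Spectrum, MFCC, and FrameGenerator algorithms with roughly their default behavior; the filename is just a placeholder, and the parameter values should be checked against the Essentia documentation.

```python
import essentia.standard as es

audio = es.MonoLoader(filename='speech-male.wav')()    # placeholder filename
window = es.Windowing(type='hamming', size=2048)
spectrum = es.Spectrum()                                # magnitude spectrum of one frame
mfcc = es.MFCC(numberCoefficients=13)                   # 13 coefficients, including the zeroth

mfccs = []
for frame in es.FrameGenerator(audio, frameSize=2048, hopSize=512):
    bands, coeffs = mfcc(spectrum(window(frame)))       # mel bands and cepstral coefficients
    mfccs.append(coeffs[1:])                            # drop the zeroth, energy-related coefficient
```

Stacking the per-frame vectors in mfccs gives the kind of 12-coefficients-per-frame representation shown in the plot.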
Let's now talk about some features, some descriptors, that relate to the pitch information of a sound. The first one is the idea of pitch salience. Pitch salience is a measure of the presence of pitched sounds in a signal. The particular implementation of pitch salience available in Essentia starts from the spectral peaks, which we know about, and from them it computes the salience of all possible pitches present. It does so by summing the weighted energies found at multiples of every particular peak: it tries to find the possible harmonics of a particular peak, sums them all, and computes this pitch salience for every candidate pitch value. Here we see the magnitude spectrum and how we find the peaks; we get the amplitude and frequency of every peak, and then we have this pitch salience function, which is quite a complex equation. Here we just see a very rough picture of the overall computation: for every peak, and for every amplitude of every peak, we apply a weighting function that looks at the multiples of the peak being considered as a fundamental frequency, and measures the energy at all those multiples. Then everything is summed into S[b], the salience at every bin frequency that we start from. So we are basically computing the salience of every possible frequency considered as a fundamental frequency.

And this is the result that we obtain if we only take the maximum salience at every particular frame. At every frame we have many salience values, but this idea of pitch salience normally relates to how much of a pitch is present at a particular frame, so taking the maximum of it is a good measure of how probable it is, let's say, that there is a clearly pitched sound at every particular frame. This is an orchestral sound, the Chinese orchestra that we have heard before, with many instruments playing together; some are pitched sounds, some are percussive sounds. So by looking at this function, this pitch salience, we can sort of visualize and estimate the presence of pitched sounds in every frame, and that can be quite useful for characterizing quite a number of sounds.
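To make the idea concrete, here is a deliberately simplified numpy sketch of a pitch salience function computed from the spectral peaks of one frame. It is not the Essentia algorithm: the harmonic tolerance and the decaying harmonic weights are arbitrary assumptions, chosen only to illustrate the "sum the weighted energies found at the multiples" idea.

```python
import numpy as np

def pitch_salience(peak_freqs, peak_mags, f0_grid, n_harmonics=10, tol=0.03, harm_weight=0.8):
    """Simplified pitch salience: for every candidate f0, sum the magnitudes of the
    spectral peaks lying close to its integer multiples, with decreasing weights.
    tol is the allowed relative deviation from an exact harmonic (an assumption)."""
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    peak_mags = np.asarray(peak_mags, dtype=float)
    salience = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        for h in range(1, n_harmonics + 1):
            target = h * f0                                  # expected frequency of harmonic h
            dev = np.abs(peak_freqs - target) / target       # relative deviation of every peak
            close = dev < tol                                # peaks that match this harmonic
            if np.any(close):
                salience[i] += (harm_weight ** (h - 1)) * peak_mags[close].max()
    return salience
```

Evaluating this on the peaks of every frame and keeping the maximum over the candidate grid gives a curve like the one shown for the Chinese orchestra example.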
Now let me talk about another type of feature that is also related to pitch information, and this is the chroma feature; in particular, we'll talk about the harmonic pitch class profile. Chroma, which is a concept used in music perception and music theory, represents the inherent circularity of pitch organization: the same pitch notes in different octaves have the same chroma, and when we talk about pitch classes we refer to all the pitches that have the same chroma. The HPCP, the harmonic pitch class profile, is a particular implementation of this idea of chroma features. It is a distribution of the signal energy across a predefined set of pitch classes. The idea, and this equation shows it, again starts from the spectral peaks, a_p. By applying a weighting function to them and summing over all the peaks, we get a measure of the different pitches that are present within a single octave. So the idea of chroma is that we fold everything into one octave, and we can divide the octave into 12 semitones or into any other frequency quantization. This equation, and this implementation, basically finds the pitches that have a particular chroma, that have a particular, let's say, note name.

This is an example of analyzing a sound with the HPCP implementation available in Essentia. It is a cello sound in which I played two notes; in fact, let's listen to that. [MUSIC] What we see here are basically the pitches, the pitch classes, that are present in this fragment. It is a fragment in which I play a double stop, two strings, one of which is very stable, the low note. In fact, the values at bin zero that we see here, the rather red horizontal line, relate to one of these very stable pitches, the A that is always present. Then we see the other pitches: there is a very strong D that is also present throughout, so we see it all along, and we also see the other notes a little bit, though it is not very clear. But it gives us an idea that there are some clear pitches, and listening to it a little, we can get quite a distinct view of the pitch classes: not the absolute frequencies of the pitches, but the pitch classes, the note names, that are present in this recording.

Okay, now let's go to multiple frames, to features that require multiple frames to be analyzed, and let me give you just three examples of things that we can do with multiple frames. One is the idea of segmenting an audio recording, for example identifying onsets. Another is finding the predominant pitch, for which we need to look at the continuity of the pitch. And finally, we can compute the statistics of the single-frame features on a larger scale, over a fragment of a sound.

The segmentation of a recording, and for example the identification of onsets, can be obtained by calculating spectral features that measure the change in frequency content. For example the spectral flux, a very common feature used in segmentation, compares two consecutive spectra and then sums all the differences; this is basically the L1 norm of those differences. It gives a measure of the spectral variation, which can be an indication of where things are changing in the sound. There are many implementations of this idea of spectral flux, and we can develop variations that focus on a particular aspect. For the particular case of identifying onsets, segmenting the sound by finding where an event or a note starts, there are a number of features we can use; in fact, the spectral flux could be used for that. But here I have put another feature, the high-frequency content. What this descriptor does is measure how much high frequency is present, and then we compare it with the previous frame. In the case of identifying onsets, clearly an onset is a part of the sound in which there is an increase of high frequencies; most attacks show a higher presence of high frequencies. So if we identify where there is an increasing presence of high frequencies, we can detect where the onsets are.
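As an illustration, here is a small numpy sketch of these two ideas, operating on a matrix mX of frame-wise magnitude spectra (one row per frame). The L1-norm flux follows the description above; the HFC weighting (bin index times squared magnitude) and the half-wave rectified increase used as an onset function are common choices, but they are assumptions here and details vary between implementations.

```python
import numpy as np

def spectral_flux(mX):
    """Spectral flux: L1 norm of the difference between consecutive magnitude spectra."""
    diff = np.diff(mX, axis=0)                 # difference between frame l and frame l-1
    return np.sum(np.abs(diff), axis=1)        # one flux value per frame transition

def hfc_onset_function(mX):
    """High-frequency content per frame and a simple onset detection function:
    the half-wave rectified increase of HFC from one frame to the next."""
    k = np.arange(mX.shape[1])                 # bin indices, used as frequency weights
    hfc = np.sum(k * mX ** 2, axis=1)          # weight the high bins more strongly
    increase = np.diff(hfc)
    return np.maximum(increase, 0)             # keep only increases, which suggest onsets
```

Normalizing and plotting these two functions over time gives curves like the ones discussed in the next example.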
Okay, and this is an example of the results of running these two feature analyses, again on the speech sound. The red line is the spectral flux and the blue line is the normalized onset detection function. Both are normalized, so we do not see their absolute values, but looking at them, and if we zoom into these functions, we would see that the detection function has higher values at the frames, at the fragments of the sound, that are transition portions: all the attacks, and in the case of the spectral flux we can also see changes at the decays, the endings of the notes.

Okay, another feature related to pitch is this idea of the predominant pitch. This one, which in fact we introduced when we talked about F0 detection in week 6, is a measure of the predominant pitch within a more complex sound. We don't have time to go into the details, but basically this feature again starts from the magnitude spectrum, finds the peaks, and then computes the pitch salience that we have seen before. From this pitch salience it tries to identify different pitch contours, different pitches that evolve within a particular sound fragment, and out of those it selects the most prominent one. In this case we have a sound that, again, we have heard before, a Carnatic piece of music in which there are accompanying instruments, and the green line is the predominant melody that has been obtained. This is an algorithm available in Essentia, and it can be used for identifying the predominant melody in complex signals.

Okay, and finally, most audio frame features can be aggregated over a complete recording or over a fragment of a sound, and we can compute statistics to get a view of the more global behavior of a particular audio feature. Normally what we do is compute the moments of that feature: the first moment is the arithmetic mean, so we get the mean of the feature; the second moment corresponds to the variance, how much the feature varies; and finally the skewness, the third moment, which goes a bit further into the shape of the distribution, how the feature deviates from a normal, Gaussian-like type of variation. With these three statistical measures we can get a pretty good view of how a particular audio feature evolves, how it behaves, over a particular fragment of a sound recording.
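A minimal sketch of this aggregation, assuming we already have a one-dimensional array of per-frame values for some feature (for example the spectral centroid computed frame by frame); scipy's skew function is used for the third moment.

```python
import numpy as np
from scipy.stats import skew

def aggregate_feature(values):
    """Summarize a per-frame feature over a fragment with its first three moments."""
    values = np.asarray(values, dtype=float)
    return {
        'mean': np.mean(values),        # first moment: average value of the feature
        'variance': np.var(values),     # second moment: how much the feature varies
        'skewness': skew(values),       # third moment: asymmetry of the distribution
    }
```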
In terms of references, there are many references on the topics discussed in this lecture, for the actual descriptors that we mentioned. Essentia is a good starting point: there is documentation on the Essentia website, which links to the articles from which the particular algorithms were obtained. On Wikipedia you can also find a lot of information about these things, and here are just a few links for the features I talked about: the spectral centroid, the MFCCs, loudness, the HPCP, the idea of detecting onsets, and the mathematical moments that we also mentioned. Finally, the code that produced the plots, which may also be a good starting point for understanding some of these things, is available in the sms-tools GitHub repository.

And that's all for this lecture. We introduced the application of sound and music description by discussing several audio features that can be extracted from audio signals and used to characterize sounds. In the next lecture, we will continue on this topic by extending the idea of characterizing a sound to the idea of characterizing collections of sounds. So I hope to see you then. Bye-bye.