I now have a full series called "Audio Signal Processing for Machine Learning", which develops the concept introduced here in greater detail. You can check it out at ua-cam.com/video/iCwMQJnKk2c/v-deo.html
Thanks :)
Brazilian CS student here, thank you for your dedication, this is exactly what I needed for my personal project.
Obrigado! (Thank you!)
It is a great series. And I would love to learn about the digital processing stuff you were talking about in the video . Please do a series on it too. Thanks again.
Thank you - stay tuned for more :)
I see you have made the course; looking forward to watching that after I finish this one!
I'm new to machine learning for audio and I've been following along with your videos, taking some notes, and I feel that I'm learning a lot.
Thanks Mr. Valerio!
Dude, you've made my life so much easier. I'm going for DL in speech processing and, frankly, the task of converting analog waves to DL features has been a mystery until now!! If you ever launch a descriptive audio/signal processing series, I would love to watch it.
I've been following along, watching your videos since the day I saw your Reddit post. I gotta say, you are doing great work explaining the theory behind Deep Learning. Keep it up! Cheers :)
Thank you for the kind words :)
So cool! I am from Belarus and starting work on my startup; these videos are so useful for my work.
Fantastic Giuliano!
I feel his tutorials should get more recognition. Thanks for the series
Thank you Rajat!
Really, man, you are doing a great job. This is the best series on audio deep learning, far beyond any other course. Hats off to you. I never used to comment, but this content forced me to. Thanks, buddy, for your efforts and for sharing your knowledge with us.
Thank you!
Taught me more than my uni lecturer, by far. You're the boss my dude
Thanks!
I really appreciate how clearly you explain these concepts.
Thank you for the feedback :)
Ugh, you are a legend for putting this info out for free. This is something I always wanted to learn and I didn't know where to start, and now I know I can just consume your content to learn more about this exciting field!!
God bless you.
I would watch more audio processing videos for sure!
YES would love a more in depth video about this topic
I was curious about what the input data format for deep learning. Now I understand. Very clear! thank you.
Glad I completed the audio signal processing playlist first; it's a quick revision for me in this video.
Thumbs up for digital audio signal processing videos ...
Excellent series. In the first video you said this is not for beginners, but I am able to follow along perfectly. Excellent explanations.
Thank you!
Classic! Enjoyed how you explained the use cases of MFCCs with DL networks. Thanks.
Thanks for the great series. I am working on TTS and STT for my local language and this channel might be very helpful. Thanks and waiting for the next one. Kudos from Munich and Sri Lanka
You helped me a lot with my undergraduate thesis. Many thanks!
Please release the series on audio digital signal processing. You're the best.
Thanks! I already have a series called "Audio Processing for ML". Check it out!
Thank you, this content is far better than what I could find in some books. I hope you keep doing it!
Thank you! Of course I will... stay tuned ;)
What an amazing explanation! Thank you. All the audio things became much clearer for me now.
Thank you. My work brought me here and you helped me a lot.
Glad I could help!
Amazing job! I've been working on a project on speech-related AI and had very little knowledge about sound and everything related, and you are a very good, cut-to-the-point teacher. Thanks!
I'm glad I could help! Stick around for more :)
I'm studying for an Automatic Speech Recognition seminar right now and this was really helpful. Thank you!
It's great you're finding this useful!
Yes, would be very interested to see videos on audio processing and MFCCs
Thank you man for the series.
You're welcome!
This has been really informative. The spaghetti I made with your recipe was off the charts! Love how you simplify the addition of instances of the algebraic space of linear functions, and specifically Fourier transforms, so neatly. Thanks for making things accessible to everyone.
I just started this lesson and the way you explain it is really simple and helps me a lot with my research paper. Thank you!
Awesome series! You have the clearest deep learning videos I've seen so far.
Thank you Omar!
Thanks, very enlightening and useful explanation
I wish I had a teacher like you in school! Thank you so much :)
Thank you Vikas!
Dude, this was really a great lecture. Can you please do a video on the mathematical aspects of Fourier transforms and MFCCs?
Thank you! I'm planning to create a whole series on audio DSP for music over the next months, where I'll delve into the mathematical details. Stay tuned :)
Can't find a better series for audio processing with DL like this! Great content as always. It would be really helpful if you can touch on concepts like audio augmentation techniques and transfer learning in the future in this series. Thanks Valerio!!!
Thank you! I'm now producing a new series "Audio Processing for ML", where I'll probably get into data augmentation for audio.
What a great video! Very easy to follow, thank you!
these are really helpful series, I hope you could make more series about audio data for DL with more details. I really liked your way of explaining things
You made this topic so easy, you are amazing sir, thank you sir 🙏
It is really an amazing series and I am happy that I found it. I wish to thank you a lot for your time. Keep up the good work. Alongside, I am also more curious to learn about MFCCs. It would be really helpful if you make another series about Audio DSP as you mentioned earlier.
Thanks again!
Really glad you like this! I'm thinking about making an Audio DSP series in the future.
Great video, thank you for the work!!
Thanks for such great videos... please make videos on signal processing for sound waves.
best series ever!! thanks brother
I'm working with TTS. Glad to see the series.
I'm happy I can help!
This is sooo cool, thanks for that Valerio... high quality content ❤❤❤❤😍😍😍
Absolutely brilliant, I just want to implement this on the Million Song Dataset.
This sounds like a great idea -- and it'll enable you to learn a lot in the process!
@@ValerioVelardoTheSoundofAI I may ask for some help from you!
@@raktimbarua6601 I'm here to help... if I can ;)
@@ValerioVelardoTheSoundofAI Would you mind checking my work and giving me feedback, please? I can share my GitHub link. Many thanks.
Great video, thank you.
It is a really interesting topic. I want to request: could you please make videos related to speech enhancement systems? How can we create a neural network model or CNN for speech enhancement? How can we remove noise from human speech using deep learning or a specific model (CNN, ANN, RNN, or LSTM)? Thanks for making amazing videos for us.
Nicely explained concepts.
Can you also make a video on how MFCCs are extracted...
I mean the use of pre-emphasis filters and Hamming windows.
Great job!!!
Thank you!
Such a good video for someone like me who isn't too deep into data like this, even though I still find it hard to understand what it all means clearly.
Interesting stuff Valerio!
By the way, can I ask what you are using to automatically give suggestions for your code? I use VS Code, but am looking for something as effective as what you use :)
Amazing! This helps me tremendously
Very good resource for newcomers like me.
The explanation is great and this is a cool video 😁👍
Thanks!
That content is amazing! Very, very clear for me! I am just a programmer interested in extracting audio from video and then transforming it into a podcast (without the advertising intervals) to listen to while I am doing the dishes :) I did the first version with just random forests and it was OK, but it's time to do some deep learning now... this series is gooold! I made a small Flask app to divide the audio into small 1-second parts and serve them as content to an app where I can easily classify my audio and use it as labels in the DL project...
I'm really glad you find this useful! Stay tuned for more interesting stuff to come ;)
Excellent course
great stuff, man... props
Thanks!
Thanks for all the videos!!!! Would be very interested in learning more about this topic and potential other resources for supplement.
Thank you for the feedback! I'm considering creating a series on audio DSP / music processing over the next months. If you're interested in the topic, you should take a look at "Fundamentals of Music Processing" www.springer.com/gp/book/9783319219448 This book is quite dense, but it'll give you a strong background in all of these topics, and way more. I'll probably use this book as the main reference for my series on music processing.
This was incredible
Thanks!
Quality content, God bless you.
Glad you like it!
I am a bit confused about the meaning of "magnitude" in the frequency-domain graph generated after doing an FFT on the time domain. Can you please explain which of the two explanations below is correct?
1) The magnitude corresponding to a particular frequency after the FFT shows the number of times that particular frequency has occurred.
2) The magnitude corresponding to a particular frequency after the FFT shows the amplitude of the sine wave with that particular frequency.
Thanks for this wonderful video 🤩
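Interpretation (2) is the correct one: the FFT magnitude at a frequency is proportional to the amplitude of the sinusoidal component at that frequency, not to a count of occurrences. A minimal numpy sketch (the sample rate and tone values are just illustrative):

```python
import numpy as np

sr = 1000                         # sample rate in Hz (illustrative)
t = np.arange(sr) / sr            # exactly 1 second of samples
amplitude, freq = 0.7, 3          # sine with known amplitude and frequency
signal = amplitude * np.sin(2 * np.pi * freq * t)

spectrum = np.fft.rfft(signal)
magnitude = np.abs(spectrum) * 2 / len(signal)  # scaled so the peak equals the amplitude
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

peak_bin = int(np.argmax(magnitude))
print(freqs[peak_bin], magnitude[peak_bin])     # peak at 3.0 Hz with magnitude ~0.7
```

The scaled peak lands at 3 Hz with magnitude 0.7, matching the amplitude of the input sine.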
Really great content 🙏
Can you explain DWT feature extraction? And can you please explain what an MFCC coefficient is in particular?
Can you please prepare a series on emotion recognition from speech?
This is gold.
Thank you!
Hi, I found this video really informative
very helpful!
It was very helpful, but please can you show us how to prepare a corpus for another language from scratch, like an under-resourced language, from broadcast news data?
Really good video 👌👌 Keep up posting such videos.
Thank you!
Awesome! Will you post a link to a GitHub repo with the Python implementations?
Thanks :) I always include a link to the Python implementations (when there is one!) in the description section. Stay tuned for next video, where I'll implement some of the topics I've covered in this video :)
This is a great series! I want to know how I can extract features about periodicity of audio data. The frequency, timbre and other MFCC features would tell me about the note or pitch at a point in time. But, to extract the rhythm signature, I would need to look at the repeating patterns over a time period.
I suggest you check out my new series (still in production), "Audio Signal Processing for Machine Learning".
This video was very helpful; I would definitely like more videos on digital signal processing. Additionally, could you also make a video on feature engineering for ML algorithms?
Thank you! When I make the audio/music DSP series, I'll definitely cover feature engineering for ML.
Why does the MFCC graph just look like a blocky version of the spectrogram?
Is the intuition that the frequencies are just split into 13 groups?
really nice
Thanks for this amazing video. I gained a lot. Can you explain more about harmonics and chromagrams?
Glad you liked this! I'll definitely cover more of that in the future :)
Hi Valerio, thanks for the great explanation. At 25:05 you explain that the ZCR feature can be fed into the ML algorithm. But when I use the librosa zero_crossing_rate function I get quite a long array, so how do I summarize this array? Is it by taking the average value? It would be a pleasure if you answered my question. Thank you!
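One common approach (an option, not the only one): `librosa.feature.zero_crossing_rate` returns one value per frame, and for classic ML features people often summarise that frame-wise array with its mean (sometimes also the standard deviation). A pure-numpy stand-in for the librosa function, with illustrative frame settings, to show the idea:

```python
import numpy as np

def zero_crossing_rate(signal, frame_size=2048, hop_length=512):
    """Fraction of sign changes per frame (rough stand-in for
    librosa.feature.zero_crossing_rate)."""
    rates = []
    for start in range(0, len(signal) - frame_size + 1, hop_length):
        frame = signal[start:start + frame_size]
        crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        rates.append(crossings / frame_size)
    return np.array(rates)

sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)   # 1 second of a 440 Hz tone

zcr = zero_crossing_rate(signal)       # one value per frame: shape (40,)
feature = zcr.mean()                   # collapse to a single scalar feature
print(zcr.shape, feature)              # roughly 2 * 440 / 22050 ≈ 0.04
```

Taking the mean gives one number per clip, which is what a classic ML feature vector expects.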
Given many of you have requested it, I've started a new in-depth series 🔥🔥 on Audio Processing for Machine Learning 🎼🤖. Check it out at: ua-cam.com/video/iCwMQJnKk2c/v-deo.html
A great explanation for understanding audio data for deep learning. It's really "new" for me.
I just want to ask: is all audio analysis using spectrogram data as the basis?
Thank you
Thank you! Not all analysis uses power spectrograms. If you're using traditional ML / audio DSP techniques you would also use other features (e.g., chromagrams, zero-crossing rate). Spectrograms and similar features are usually used in end-to-end DL approaches. I'm planning to create a series on audio/music processing where I dig deeper into the topics I only scratched in these couple of videos.
Exactly what I was looking for, thank you!
One follow-up question: Is there always exactly one possible result from a Fourier transform? Or (1) can it be impossible to decompose the sound, or (2) can there be more than one possible composition?
Would like to know how to classify sound
I never really understood why we don't just use the raw waveform as input to the neural network, as a 1D array or something, where the index represents time and the values represent amplitude. Shouldn't it have all the information we need? Any help in understanding this would be much appreciated.
This might be a stupid question, but do you use something like a sliding window on that MFCC? I thought most of the sequential data is processed with RNNs/LSTMs, but then I would guess only a value from a single time-step is processed from that MFCC
It's actually a great question. Deriving MFCCs is an elaborate process with several steps. The first is to perform an STFT, which uses a sliding window. The sliding window is characterised by two values: the window (i.e., the frame size, expressed in number of samples) and the hop length (also expressed in number of samples). When you perform the FFT, you consider a time interval equal to the frame size. Then, you shift the window forward by a number of samples equal to the hop length. The hop length is smaller than the frame size so that consecutive FFTs overlap, which preserves information at the edges of the intervals. Since MFCCs rely on the STFT, you can say that extracting MFCCs uses a sliding window.
As for the second part of your comment, you can definitely use an RNN to process MFCCs, passing the MFCC vector for a single window at a time. However, you can also process MFCCs using basic MLP or CNN architectures, treating the MFCCs as 2D data, similar to images. We'll take a look at this in the following videos. Stay tuned!
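The sliding-window framing described above can be sketched in numpy like this (the frame size and hop length values are typical defaults, not prescriptive):

```python
import numpy as np

def frame_signal(signal, frame_size=2048, hop_length=512):
    """Split a signal into overlapping frames, as the STFT does
    before applying an FFT to each frame."""
    num_frames = 1 + (len(signal) - frame_size) // hop_length
    return np.stack([signal[i * hop_length: i * hop_length + frame_size]
                     for i in range(num_frames)])

signal = np.random.randn(22050)   # 1 second of noise at 22050 Hz
frames = frame_signal(signal)
print(frames.shape)               # (40, 2048): 40 overlapping frames
```

Because the hop length (512) is smaller than the frame size (2048), consecutive frames share 1536 samples, which is the overlap mentioned above.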
@@ValerioVelardoTheSoundofAI Thank you for the explanation. I've never worked with audio (I do graphics stuff mostly), so this new domain is pretty fascinating to me. I would guess that some architecture similar to video processing would also work as you have a series of 2D time-dependent inputs. Looking forward to the next video!
@@aigen-journey your guess is right :)
Question:
In the Fourier transform, if we compute the full Fourier transform (meaning the phase + the amplitude, instead of just the amplitude), we can actually recompose the entire signal without any loss of time information. The original signal is just the inverse Fourier transform of the Fourier transform of the signal: f(n) -> F(w) -> f(n).
Why don't we just do that? Why do we need the short-time Fourier transform? Is it more efficient this way? Am I missing something? Thanks for your great work!
I suggest you check out my series on Audio Signal Processing for ML. There I spend 4+ videos on these topics ;)
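The premise of the question is right, and a quick numpy check confirms it: the full FFT is perfectly invertible. What the global spectrum lacks is an explicit time axis, so for signals whose frequency content changes over time, the STFT is used to localize frequencies in time, not to recover the waveform.

```python
import numpy as np

signal = np.random.randn(1024)
reconstructed = np.fft.ifft(np.fft.fft(signal)).real  # FFT, then inverse FFT
error = np.max(np.abs(signal - reconstructed))
print(error)  # tiny: the round trip is lossless up to float precision
```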
Many thanks :*
This is really a good video. Can you make a video about x-vectors and i-vectors? That would be really cool.
20:28 How audio is transformed into a spectrogram using the STFT... Spectrogram: the signal in the time-frequency domain.
Thanks
Hi, amazing videos... Just a question:
When creating subtitles for a video with DL, do we create a spectrogram from the video's audio and use a network with CNN layers?
Hello sir, I'm very grateful I found your videos. I'm currently preparing my thesis on musical genre classification, but I'm having problems understanding the feature extraction part.
So my question is: in music genre classification, do we only need to use MFCCs? Is that enough? Thanks!
MFCCs do a pretty good job. Mel Spectrograms are state of the art. You don't need to mix these features with others.
Did you find that using MFCCs works better than using a spectrogram? I just stumbled upon your video and I have been using the same dataset, but I extract the spectrogram and feed that into my network. I am constantly running into overfitting, and even when I use the same CNN as you do (in your later video) I only get about 50% validation accuracy, while getting 99% training accuracy. Does using MFCCs reduce overfitting?
20:39: Why is 4000 Hz blue? It should be bright red, right, since it is the frequency with the highest amplitude?
If the program has many MFCC vectors for an audio file, will it average them to get just one MFCC vector? If you understand what I mean, please explain your answer in more detail.
Hi, do you have a video explaining the role of the phase in the Fourier representation?
Yes, I have a detailed explanation of the FFT in the "Audio Signal Processing for ML" series.
You are great!
Thanks!
Hey Valerio, where can I talk to you directly?
Do you have a conversation room on Discord or Telegram or...?
You are good 👌
Thanks!
So when we pass the spectrogram as input to the NN, we represent it as a 2-D input (meaning we have to get rid of either time or magnitude) or as a 3-D input? Thanks!
The dimension depends on the type of network you're using. However, the basic idea is that you can package time, frequency, and magnitude into a 2D array, in the same way we visualise the spectrogram. The shape of the array is (# time steps, # frequency bands). The values in the array are the magnitudes for each frequency band at each time step. In the case of a CNN, however, you'll have to pass a 3D array, where the 3rd dimension indicates the depth. For audio data, depth is 1, just like in greyscale pictures. For RGB images, depth = 3. I cover this and more in the following videos. So, stay tuned :)
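The shapes described above amount to adding a depth axis of size 1 for the CNN case (the array sizes here are hypothetical):

```python
import numpy as np

# hypothetical spectrogram: 130 time steps x 128 frequency bands
spectrogram = np.random.rand(130, 128)

# a CNN expects an explicit depth axis, 1 for "greyscale" audio data
cnn_input = spectrogram[..., np.newaxis]
print(spectrogram.shape, cnn_input.shape)  # (130, 128) (130, 128, 1)
```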
Hi Valerio, nice and very informative series on understanding audio for machine learning.
One question about the MFCC spectrogram: it shows that the first MFCC coefficient always has the lowest value, represented by the blue color.
Why is that so? Thanks for your response.
The first MFCC coefficient is the least representative for an audio file, and is often dropped for audio characterisation. That's because it carries information mainly connected to loudness.
@@ValerioVelardoTheSoundofAI Thanks Valerio for your response.
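In practice, dropping that loudness-related first coefficient is a one-line slice (the MFCC matrix here is a hypothetical stand-in for one computed by, e.g., librosa):

```python
import numpy as np

# hypothetical MFCC matrix: 13 coefficients x 100 frames
mfccs = np.random.randn(13, 100)

mfccs_trimmed = mfccs[1:, :]  # discard the first (loudness-related) coefficient
print(mfccs_trimmed.shape)    # (12, 100)
```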
Hello Valerio:
Thanks for this awesome channel. Thanks a lot.
I do have a doubt: when should we use a spectrogram vs MFCCs for a deep learning problem?
Spectrograms are state-of-the-art features in DL now. MFCCs are rarely used in DL.
@@ValerioVelardoTheSoundofAI So, in that case, we should always use spectrograms in DL?