I now have a full series called "Audio Signal Processing for Machine Learning", which develops the concepts introduced here in greater detail. You can check it out at
ua-cam.com/video/iCwMQJnKk2c/v-deo.html
Sir, how do we interpret the MFCCs, i.e., which coefficients should we keep and which should we leave out?
For anyone taking this course in 2022: the function "waveplot" was renamed to "waveshow".
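A minimal sketch of the renamed call, assuming librosa 0.10+ (where waveplot was removed) and using the example file from the video as a placeholder path:

import librosa
import librosa.display
import matplotlib.pyplot as plt

signal, sr = librosa.load("blues.00000.wav", sr=22050)  # placeholder path, loaded at 22.05 kHz

plt.figure(figsize=(10, 4))
librosa.display.waveshow(signal, sr=sr)  # replaces the old librosa.display.waveplot
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.show()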
1:47 that song hits HARD
hahaha :D
Is there any sound in it, or is it just a wav file with no audio?
I'm so glad I found this series. Great quality content (Y)
Thank you!
this is so well-explained, helps me entirely for the project i'm working on! i can never thank you enough for making all these videos, you deserve the best!
Thanks!
Great video! I like how you stepped through everything and the code in the video works
Thanks Erik!
The series is awesome. And 😭at 1:50 . Love you bro!
Finally found exactly what I was looking for. Great explanations! ❤
Thanks!
This channel is a Gem!! Thank you for putting out these tutorials. Keep going!
The best instructional video I've ever seen, even better than college ❤❤❤❤
awesome series, deserves more recognition !!!
I'm writing my thesis thanks to you, I owe you a dinner! Amazing job
Good luck ;)
Thank you for the wonderful work. If you can make a series of audio signal processing, that would be great. Have a nice day!
Thank you for the feedback!
i second this!!
Yeah, it would be very helpful if you made a video series explaining these different DSP methods.
You are amazing! A real music and deep learning wizard!
Thank you Fabio :)
Hello, what Python version do you use in this tutorial?
Extremely helpful! You are the best
Thanks!
If anybody has an issue with the librosa.feature.mfcc() line (mfcc() takes 0 positional arguments but 1 positional argument was given...),
make sure you add "y=" before signal, that is:
MFCCs = librosa.feature.mfcc(y=signal, n_fft=n_fft, hop_length=hop_length, n_mfcc=13)
Hope this helps
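For context, a minimal end-to-end sketch of the keyword-argument call, assuming the frame parameters used in the video (n_fft=2048, hop_length=512) and a placeholder file path:

import librosa

signal, sr = librosa.load("blues.00000.wav", sr=22050)  # placeholder path

# Newer librosa versions require keyword arguments such as y= and sr=.
MFCCs = librosa.feature.mfcc(y=signal, sr=sr, n_fft=2048, hop_length=512, n_mfcc=13)
print(MFCCs.shape)  # (13, number_of_frames)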
Thanks for the help! Was a little confused when code wasn't working!
thanks a lot bud
thanks dude!
This is helping me in my capstone masters project. Thank you so much.
Same here, I was searching for these concepts when this channel showed up.
So glad I found this! You do an amazing job!
Thank you Aend!
Incredible channel! Please keep going!!!
Love you so much sir....No words...
Thank you so much... I'm so happy you made this video. You make my work easier.
Glad I've been useful! Have you seen my new series Audio Signal Processing for ML? It goes very deep into these and more topics in audio processing.
20:16 With this graph, how would you display the number of seconds on the x-axis and the range of frequencies on the y-axis?
The series so far was very well explained and paced, but personally I would've wanted a little more detailed explanation of MFCCs, as they're the most important thing we are going to use in the NN, right? If there are some resources you can recommend, it'd be really appreciated!
Thank you for the feedback! I get your point. But I made the choice not to get into the algorithmic/mathematical details of MFCCs because it's a quite complicated topic that would probably derail too much from the focus on deep learning. As I mentioned in the videos, if I see enough interest I may create a whole series on audio DSP. There, I'll definitely go into the nitty gritty of MFCCs and the Fourier transform. On this point, would you be interested in a series on audio digital signal processing?
As for this course, I don't think using MFCCs as a black box is going to be detrimental for DL applications.
As for extra resources on MFCCs, I suggest you take a look at this article: practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/ It's a friendly intro into the concept. Hope this helps :)
@@ValerioVelardoTheSoundofAI I would definitely be interested in a series on audio DSP; although, as a DL enthusiast, I would love it if the topics covered in that series circled back to their significance in DL somehow.
Also thank you so much for the article!
@@mohammadareebsiddiqui5739 that could be interesting...
This is so good. What do you suggest for implementing these things?
I'm very excited to read that book and learn about sound in deep learning.
Thank you so much.
Please make a tutorial on Anomaly detection on raw sound
Amazing clear content. Thanks a lot !
Thank you!
Just amazing content, you are a life saver.
Your wonderful videos are helping me in my PhD on Indian Vocal Music. But alas, no videos on Indian Classical Vocals.
That may come at some point in the future. Stay tuned!
This is what i've been looking for.
Thanks Roner!
@@ValerioVelardoTheSoundofAI Is it really hard to manage music data to make animations or reactions with it? I'm really, really new to this whole music spectrum and ML thing.
@@9b177-becc5 not really. As long as you have audio parameters (e.g., loudness, chroma, beat), you can map them to different elements of an animation.
Having an issue with librosa missing the _soundfile_data module when I try to load the song.
We couldn't hear the song, but it is a super cool video.
Amazing tutorial, thank you
Can you update the code for current versions of matplotlib and librosa?
I think I am finishing my master's degree with you 😂, thank you for your amazing job!
Bachelor's degree here, but same. I guess I'm low on time now that I see the complexity though :DD When is your due date and how far along are you? :DDD
I'm happy the videos can help :)
There's no hurry. I'm trying to build a music recommender system, and music classification is the first step. I tried several models and tried to optimize them.
@@evicluk Do you use tensorflow or pytorch?
@@rekreator9481 tensorflow
Hello, thank you very much for this tutorial. What if I have problems with numpy? My IDE tells me there is a mistake.
Hello, do you know how to classify audio features extracted with MFCCs using an SVM?
Great series! I have learned much more from this than from other courses that have cost me a lot of money. I have one question, if you could help: what is the number of features you are extracting to use in the NNs in this series? It wasn't very clear in the videos.
What version of Python are you using?
While plotting the power spectrum, we take only half of the data (the left half) after doing the FFT. Then how come we don't do the same while plotting data based on the STFT? 😮
Valerio, could you please give some code for removing silence from the whole audio file? Please guide.
Excellent course
Hi Valerio, I can't display the waveform image. Can you help me?
8:30 The lengths of 'signal', 'magnitude' and 'frequency' are the same.
Why is the frequency increasing from the beginning towards the end? It should increase and decrease at different times, not only increase over time. What am I missing?
9:29 By lower frequency I understand the left-most area of the graph. But we see the same height of energy/magnitude towards the end (right side) of the graph, yet you say that "the higher we go with frequency the less contribution they will give us". I don't understand the graph, it seems.
Thanks for these tutorials! I want to ask a question: why do you not use a notebook? I'm using a notebook in VS Code (".ipynb" file extension), it's just practical, no? Good luck.
Shouldn't we multiply the magnitude by 2 when narrowing the power spectrum plot to the Nyquist frequency?
How do I use that blues.00000.wav file to run the code? Many errors are coming up.
WOW! This lesson is really good. Are we done with the audio preprocessing with that? I want to build a speaker recognition system and need to learn how to build a model, and one of the steps is preprocessing the audio, so this video is very helpful if the preprocessing part is done. What do we do after this? And what if we want to represent the result numerically, not visually? Thank you.
Thanks man! Wonderful videos
Thank you!
Thank you for spreading the knowledge. I have a question though: if I want to make a source separation kind of application, should I use mel-scale spectrograms, or should I opt for other time-series representations like Gramian matrices and Markov transitions?
Thank you for this wonderful video lecture. I am working on lung sound analysis. Would you also show us how to implement wavelet analysis, particularly the discrete wavelet transform, like you did for the FFT, STFT, and MFCCs?
Glad you liked it Biruk! I'm planning to start a whole series on audio/music processing over the next few weeks. Stay tuned :)
Hi Valerio, thank you so much for your amazing videos. I am doing emergency vehicle siren detection with deep learning. I divided my data into emergency and non-emergency and used a band-pass filter to remove the noise. Now I have a doubt: should I apply this filter to just the emergency audio files or to all the data (emergency and non-emergency)? I would be grateful if you could guide me on this.
You should apply the same preprocessing on all the data you train on.
@@ValerioVelardoTheSoundofAI thank you so much
Nice work broo! I have a question tho. Is it alright to have negative MFCCs? Btw I am using RAVDESS dataset.
It's totally fine to get negative MFCCs. Stay tuned for my coming videos in the "Audio Processing for ML" series on MFCCs to learn more ;)
What hop_length value should be used for voice recognition?
Nice work! Keep it up.
Thank you!
Valerio, first of all, congratulations on your excellent job! I am learning so much from you!
Secondly, can you explain how to load mp3 files with librosa? From what I read in the documentation, installing ffmpeg should solve it, but it did not.
Thank you!
Thank you! Please refer to this thread: github.com/librosa/librosa/issues/945
Great video, thanks
really great series !!
Again! This is just awesome!
And thank you again :)
Hi sir, thanks for this video.
I just want to know how we can play the audio in Python and listen to it from this form: (signal, sr = l.load(file)).
Great series man, thank you. Can we differentiate human voices by using mel spectrograms? If yes, can you please tell me how? Your reply would be helpful.
Yes, you can use MFCCs for speaker identification. The process is similar to the one I've used for genre recognition in the following videos. Check those out!
Thank you so much for your videos. I have a question regarding the processing of audio. If I want to classify a bell that rings for less than a second and then stops for some time, do I have to collect the audio of the individual rings and cut out the silences, or can I use a longer audio clip of the bell ringing and stopping?
You can use the long sample. Hopefully the algorithm will figure that out!
Hello Valerio,
Your videos are very helpful for learning about audio signal processing in AI. I am learning about AI and the theory you have explained is easy to grasp. Thank you for such great lessons.
I have a doubt: as input you have been using a .wav file, which is uncompressed, so the file size will be large. Can you tell me what method can be used to process the audio file with the best quality and without losing information?
Thank you! There isn't an ideal solution to compress audio files and not lose information. WAV (lossless) is the best. Many AI music applications won't be affected negatively if you use MP3s instead.
Can anyone send the link for the music dataset of popular / hit songs please
Great content. By the way, how can I download the wav file?
It seems it can't be downloaded from the GitHub link you published. Is there another place to download it from?
Thanks! I think I used a piece classified as blues from the GTZAN dataset. You can search for the file with the same name in the dataset. I provided the link to download GTZAN in a previous video in the series.
OK bro, but what do I need to pass as input to train my neural network?
Thank you for posting this wonderful video.
I'm working on a toy project where I search for music by humming. Is it right to use a Mel spectrogram? I don't know if CQT would be more appropriate. I would appreciate your reply.
You can definitely give Mel spectrograms a try. Try to focus on intervals instead of absolute pitches, since people without absolute pitch (i.e., the overwhelming majority) can hum the intervals which make up a melody, but not necessarily in the right key. Focus only on monophonic music (i.e., a vocal melody). Generalising beyond that is a much harder problem. Hope this helps!
Amazing series.
A question: frequency and magnitude are numpy arrays of size > 661,000 each.
But while plotting, the x-axis (denoting frequency) scales itself to the sample rate, which is 22050. Why so? I'm talking about the spectrum plot here.
You help me as if God sent you to help me... I have a submission in 3 days covering everything up to preprocessing 😅 Thank you so much... Please make a video on how to build an accurate model for audio signals ❤️
This was a brilliant video. I have a query which I would like to shoot; I don't know if it's answered in the next set of videos.
Does it matter if the time span of each clip is different in the dataset?
Do the same principles applied here apply to any audio, e.g., animal sounds, scream detection?
How do you deal with noise?
1- If you're using a CNN architecture, you need all data samples to have the same duration. To obtain this, you should cut clips of different durations into equal-length segments (e.g., 10-second clips); see the sketch below.
2- Yes, you can transfer the same approach used here to other audio domains.
3- If you're using a DL approach, the network should be able to learn to deal with noise automatically.
If you'd like to learn more about these topics, I suggest you check out my series "Audio Signal Processing for ML".
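A rough sketch of that segmentation step (item 1), assuming a placeholder file path and the 10-second duration mentioned above:

import librosa

signal, sr = librosa.load("long_recording.wav", sr=22050)  # placeholder path

segment_duration = 10  # seconds, the example value from the reply above
samples_per_segment = segment_duration * sr

# Keep only full segments; a real pipeline might zero-pad the last chunk instead.
num_segments = len(signal) // samples_per_segment
segments = [signal[i * samples_per_segment:(i + 1) * samples_per_segment] for i in range(num_segments)]
print(f"{num_segments} segments of {samples_per_segment} samples each")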
@@ValerioVelardoTheSoundofAI Thanks Valerio. I'm watching the signal processing series too..! Another query I have.. There is another library called kapre. That one seems like it's built upon Keras. How do you think it compares with librosa? Kapre seemed very easy with just additions of layers to the model. I'm not sure if it can do everything that librosa can.
@@midhunsatheesan5717 Kapre is great if you want to extract spectrograms computing FT on GPU. However, it can't do many things that librosa can. So if you plan to use basic audio features used in DL go with Kapre. Otherwise, go with librosa :)
Can we use a full 10 minute wav file as an example or do we need to cut the file into pieces in preprocessing?
It depends on the application. 10' is probably too long. I would suggest segmenting the files.
@@ValerioVelardoTheSoundofAI thanks
Thanks 👍👍👍
Sir, I'm making a project on an attendance system using voice. Which Python modules should I use? Which algorithms should I use?
Hello Valerio, have you ever extracted ivector from audio clips? I am trying to find documentation on it but am struggling. Your advice would be greatly appreciated
Hi Valerio, if we have training data in mp3 format, is it important to convert the mp3 files to wav files for training? Will it improve performance?
Don't worry about mp3 files. With Librosa you can directly load them, without the need to convert them to wav files first.
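A minimal sketch of loading an mp3 directly (decoding relies on soundfile/audioread or ffmpeg, depending on your installation); "song.mp3" is a placeholder path:

import librosa

signal, sr = librosa.load("song.mp3", sr=22050)  # mp3 is decoded on load, no wav conversion needed
print(signal.shape, sr)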
How can I fix this error: "UserWarning: PySoundFile failed. Trying audioread instead.
warnings.warn("PySoundFile failed. Trying audioread instead.")"?
Hi Valerio, I've been looking for resources on how to deal with deep learning and audio for some time without too many results, so I'm really grateful to you for sharing these videos! I would like to ask you if it is possible and how to recover the original signal from the spectrogram. I tried to use the inverse functions like librosa.db_to_amplitude and librosa.core.istft, but the output signal seems very bad. I think this happens because we truncate complex numbers for the construction of the spectrogram. Can you suggest me the right way?
You're absolutely right! The issue with istft is that we ignore the phase. The audio result is somewhat problematic. Reconstructing audio from a power spectrogram is a major problem still actively researched. There isn't a simple solution I'm afraid :(
@@ValerioVelardoTheSoundofAI Yeah, I found the same answer in a research paper I was reading just now. Do you think a well-trained LSTM autoencoder could approximate a better result? I mean, if we use these corrupted istft outputs as input and the original waveforms as output, could we obtain a neural net that can reconstruct a better waveform? Or do you think it's only a waste of time? Thanks in advance for your attention!
@@massimomontanaro mmh... this is a highly dimensional problem. You'll need a MASSIVE dataset to try to get something decent. It may be worth an experiment, but I wouldn't be super confident.
I have downloaded that audio file, but it is still showing this error:
FileNotFoundError: [Errno 2] No such file or directory: 'blues.00000.wav'
Sir, a solution please?
Hello Valerio, I have 3 folders (go, yes, no) that together contain 30 .wav files; each folder has 10 wav files. How can I run this code over the 30 different wav files?
well made tutorial
Now I know why Fourier transforms were added to my degree syllabus.
Thanks for the video.
One question: Should we always use SR = 22050?
It depends on the problem. Most of the time sr = 16K is OK for sound/music classification problems.
@@ValerioVelardoTheSoundofAI Thanks for the reply. Does having a higher value increase the model accuracy?
@@AshwaniKumar04 not necessarily. If most of the patterns for classification are in the lower frequencies having a high sr can actually be counterproductive.
Thanks man, you are a hero.
Thanks!
I'm working on an audio project and your videos help me a lot.
Hello Valerio. Greetings from Colombia. I've been watching some of your videos on MFCCs but, as you'll understand, my English is a humble almost-B1 and I've been turning on subtitles for your videos; however, that wasn't the case here :C because the option doesn't appear. I would love for this video to have the subtitle option; I would be very grateful. I'd also like to know what comes after obtaining the MFCCs: what should be implemented in Python so that it finally makes the decision to classify a sound as X or Y? I'm very grateful for your help.
Hi, I'm interested in what you're saying, Jessica!
Great videos man! Is there a way to make a database from audio file metadata? Like labeling each file with BPM, key, etc., but automatically; building the database from scratch is going to take more time than the coding itself lol
Thanks! There are algorithms for extracting Key, BPM automatically. You'll then need to implement a DB and populate it with the metadata. The algorithms aren't perfect. They are also genre-dependent.
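As a rough sketch, one such off-the-shelf estimator is librosa's beat tracker; "track.wav" is a placeholder path, and the tempo estimate is genre-dependent as noted above:

import librosa

y, sr = librosa.load("track.wav", sr=22050)  # placeholder path
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)  # global tempo estimate plus beat positions
print("Estimated tempo (BPM):", tempo, "with", len(beat_frames), "beats detected")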
@@ValerioVelardoTheSoundofAI I want to automatically sort my samples, and obviously experiment with NNs and Python. What approach would you recommend?
Thanks for the great content! :)
Hello sir from where do I get the audio file that you have used here? Would you please provide me the link?
If I remember correctly, it comes from the Marsyas genre dataset (marsyas.info/downloads/datasets.html). I may have mentioned this in a previous video.
@@ValerioVelardoTheSoundofAI thank you :)
Thanks you dude!
You're welcome!
How long should the wav files used in preprocessing be?
That really depends on the problem you're working on and your dataset. Let me give you a couple of examples. In music processing, we usually use 15'' or 30'' of a song to analyse it. In keyword spotting systems, you would often have 1-second long clips.
@@ValerioVelardoTheSoundofAI I'm working on ML for voice recognition using a dataset that contains conversations in the form of wav files. Is there any suggestion you can give for a good .wav duration for my problem?
Thank you for the advice
There's no sound when you play that blues file.
Yeah... I had to remove it for copyright reasons :(
Hi there, I'm interested to know how I can clean my audio dataset (Google Speech Commands) if it contains faulty audio. For example, I should hear the word "three", but there is too much noise, or the word is cut off in the middle of pronunciation so it just says "thh..".
Any idea how to get rid of those audio files and clean my dataset without doing it manually?
Could you make a video on the inverse functions? signal->stft->istft->signal works fine, but signal->stft->amplitude_to_db->db_to_amplitude->istft->signal results in a distorted signal. Same with inverse.mfcc_to_audio.
This is a somewhat more advanced topic in DSP. I'm thinking of creating a series on audio DSP / music processing. I'll definitely cover the inverse functions in that series. Before engaging in the implementation, I'd like to dig deeper in the math behind FT/MFCC. You're totally right re the reconstruction of the signal from MFCCs. It's a long shot, and the result isn't that great.
@@ValerioVelardoTheSoundofAI I kinda get why we are losing information if we convert our spectrogram to a mel-spectrogram, but why are we already losing information when using amplitude_to_db on the stft? Isn't it "just" a log-function?
@@Gileadean Excellent question! I'm glad you've been playing around with these interesting DSP concepts :) Now, on to the answer. The STFT outputs a matrix of complex numbers. To arrive at the spectrogram, we calculate the absolute value of each complex number. This process removes the imaginary part of the complex values, which carries information about the phase of the signal. At this point you've already lost information! When you try to reconstruct the signal, the inverse STFT can't rely on phase information anymore. Hence, the somewhat distorted sound. As you correctly hinted at in your question, the conversion back and forth from amplitude to dB doesn't lose any additional vital info. I hope this helps!
@@ValerioVelardoTheSoundofAI Thanks for your quick replies! I somehow missed the np.abs(stft) and the warning message that occurs when calling amplitude_to_db on a complex input (phases will be discarded)
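A small sketch of the phase-loss point made above, assuming the usual 2048/512 frame parameters and a placeholder file path: reconstructing from the full complex STFT is near-perfect, while reconstructing from the magnitude alone (phase implicitly zero) sounds distorted.

import librosa
import numpy as np

signal, sr = librosa.load("blues.00000.wav", sr=22050)  # placeholder path

stft = librosa.stft(signal, n_fft=2048, hop_length=512)  # complex-valued: magnitude and phase
magnitude = np.abs(stft)                                 # the phase is discarded here

with_phase = librosa.istft(stft, hop_length=512)                          # near-perfect reconstruction
without_phase = librosa.istft(magnitude.astype(complex), hop_length=512)  # zero phase: distorted audio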
I think that the magnitude of the frequency in the FFT is the modulus, which is calculated by np.absolute(), not np.abs().
np.absolute() and np.abs() are completely identical. You can use either one.
@@ValerioVelardoTheSoundofAI Yeah, I see. Thanks.
Hi there,
I want to do music semantic segmentation (intro, chorus, verse, etc.). Could you please suggest how I should label my audio data, and what features I should use for that?
The task you're referring to is called "music segmentation" or "music structure analysis". I'm assuming you want to work with audio (e.g., WAV), not symbolic data (e.g., MIDI). There's a lot of literature on this topic. The techniques that work best are based on music processing algorithms which don't involve machine learning. The high-level idea is to extract a chromagram, manipulate it, and use a self-similarity matrix to identify similar parts of a song. The book "Fundamentals of Music Processing" has a chapter that discusses music segmentation in detail. Here's a slide presentation that summarises that book chapter: s3-us-west-2.amazonaws.com/musicinformationretrieval.com/slides/mueller_music_structure.pdf Hope this helps :)
Thank You :)
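A rough sketch of the chromagram + self-similarity idea described above, with "song.wav" as a placeholder path; turning the matrix into section boundaries needs further processing, as the book chapter explains:

import librosa

y, sr = librosa.load("song.wav", sr=22050)  # placeholder path
chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=512)   # 12 x frames chromagram
ssm = librosa.segment.recurrence_matrix(chroma, mode="affinity")  # frames x frames self-similarity
print(chroma.shape, ssm.shape)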
I have an audio dataset where each audio file consists of letters that are spoken all in one go. How can I prepare these audio files for machine learning? I would like to have each letter in its own audio file. If anyone has an idea, please help.
Good Job
Hi, thanks a lot for these videos, they are very useful.
I was just wondering if it would be beneficial to represent the frequency scale logarithmically, as humans interpret sound in this way (since musical intervals/harmonics are represented by multiples of a frequency rather than an absolute difference). Are deep learning algorithms not trained with this scale, since it mimics human hearing more?
Great intuition! You can take the logarithm of the spectrogram, or, apply Mel filterbanks, and arrive at the so called Mel Spectrogram. I have another series called "Audio Signal Processing for ML" that dives deep into all of these topics, if you're interested.
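A minimal sketch of that log/Mel idea: compute a Mel spectrogram and convert the power values to decibels. The 2048/512/128 parameters are common defaults rather than values from the video, and the file path is a placeholder.

import librosa
import numpy as np

signal, sr = librosa.load("blues.00000.wav", sr=22050)  # placeholder path
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log compression, closer to human loudness perception
print(log_mel.shape)  # (128, number_of_frames)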
Hi Valerio, thank you for your detailed explanation. I am sure that, like me, thousands of others are benefitting from your videos. I understood everything in the video, however I have one query: can we use the log spectrogram for deep learning instead of MFCCs? Or, in other words, why do we only use MFCCs in deep learning? One more concern: I have audio data that is recorded at 44100 Hz; can I use a sample rate of 44100 instead of 22050 (which you are using in this tutorial)? Thank you in advance.
Mel spectrograms are the feature of choice in DL. Of course, you can use an SR of 44.1 kHz.