10 - Understanding audio data for deep learning

  • Published 3 Oct 2024

COMMENTS • 180

  • @ValerioVelardoTheSoundofAI
    @ValerioVelardoTheSoundofAI  4 years ago +47

    I now have a full series called "Audio Signal Processing for Machine Learning", which develops the concepts introduced here in greater detail. You can check it out at ua-cam.com/video/iCwMQJnKk2c/v-deo.html

  • @leumas8688
    @leumas8688 11 months ago +1

    Brazilian CS student here, thank you for your dedication. This is exactly what I needed for my personal project.

  • @ramangarg881
    @ramangarg881 4 years ago +21

    It is a great series. And I would love to learn about the digital processing stuff you were talking about in the video. Please do a series on it too. Thanks again.

  • @JulioCesar-dd2ge
    @JulioCesar-dd2ge 3 years ago +2

    I'm new to machine learning for audio and I've been following along with your videos, taking some notes, and I feel that I'm learning a lot.
    Thanks Mr. Valerio!

  • @anishbhanushali
    @anishbhanushali 3 years ago +6

    Dude, you've made my life so much easier. I'm going for DL in speech processing and frankly, the task of converting analog waves to DL features has been a mystery until now!! If you do launch a descriptive audio/signal processing series, I would love to watch it.

  • @gautamj7450
    @gautamj7450 4 years ago +18

    I've been following along, watching your videos since the day I saw your Reddit post. I gotta say, you are doing great work explaining the theory behind Deep Learning. Keep it up! Cheers :)

  • @ХАЗЯЕВАССЫРОМ
    @ХАЗЯЕВАССЫРОМ 3 years ago +1

    So cool. I am from Belarus and have started working on my startup; these videos are very useful for my work.

  • @rajatkeshri9392
    @rajatkeshri9392 4 years ago +3

    I feel his tutorials should get more recognition. Thanks for the series

  • @pankajkumarchoudhary3845
    @pankajkumarchoudhary3845 3 years ago +5

    Really, man, you are doing a great job. This is the best series for audio deep learning, far beyond any other course. Hats off to you. I never usually comment, but this content forced me to. Thanks, buddy, for your efforts and for sharing your knowledge with us.

  • @croftyprojects
    @croftyprojects 3 years ago +1

    Taught me more than my uni lecturer, by far. You're the boss my dude

  • @mariantalbert8201
    @mariantalbert8201 4 years ago +1

    I really appreciate how clearly you explain these concepts.

  • @ahmedanwer6899
    @ahmedanwer6899 1 year ago

    Ugh, you are a legend for putting this info out for free. This is something I always wanted to learn and I didn't know where to start; now I know I can just consume your content to learn more about this exciting field!!
    God bless you

  • @PhilipTheDuke
    @PhilipTheDuke 3 years ago +1

    I would watch more audio processing videos for sure!

  • @Rishi-nv7bp
    @Rishi-nv7bp 4 years ago +1

    YES, I would love a more in-depth video on this topic

  • @lzdddd
    @lzdddd 2 years ago

    I was curious about what the input data format for deep learning is. Now I understand. Very clear! Thank you.

  • @Moonwalkerrabhi
    @Moonwalkerrabhi 3 years ago +1

    Glad I completed the audio signal processing playlist first, it's a quick revision for me in this video

  • @ManontheBroadcast
    @ManontheBroadcast 4 years ago +1

    Thumbs up for digital audio signal processing videos ...

  • @souvikpaul1153
    @souvikpaul1153 4 years ago

    Excellent series. In the first video you said this is not for beginners, but I am able to follow along perfectly. Excellent explanations.

  • @sathyanarayananvittal7832
    @sathyanarayananvittal7832 8 months ago

    Classic! Enjoyed how you explain the use cases of MFCCs with DL networks. Thanks

  • @malinthasandamal
    @malinthasandamal 4 years ago +2

    Thanks for the great series. I am working on TTS and STT for my local language and this channel might be very helpful. Thanks and waiting for the next one. Kudos from Munich and Sri Lanka

  • @Απο-ρ6σ
    @Απο-ρ6σ 2 years ago

    You helped me a lot with my undergraduate thesis. Many thanks!

  • @raj-nq8ke
    @raj-nq8ke 2 years ago

    Please release the series on audio digital signal processing. You're the best.

  • @Sawaedo
    @Sawaedo 3 years ago

    Thank you, this content is far better than what I could find in some books. I hope you keep it up!

  • @dishkakrauch
    @dishkakrauch 2 years ago

    What an amazing explanation! Thank you. All the audio concepts are much clearer to me now.

  • @flaminglotus11
    @flaminglotus11 3 years ago

    Thank you. My work brought me here and you helped me a lot.

  • @maddonotcare
    @maddonotcare 4 years ago +1

    Amazing job! I've been working on a project on speech-related AI and had very little knowledge about sound and everything related, and you are a very good, to-the-point teacher. Thanks!

  • @jennifer6278
    @jennifer6278 4 years ago

    I'm studying for an Automatic Speech Recognition seminar right now and this was really helpful. Thank you!

  • @alvinkariuki236
    @alvinkariuki236 2 years ago

    Yes, would be very interested to see videos on audio processing and MFCCs

  • @sunilshah300
    @sunilshah300 4 years ago +1

    Thank you man for the series.

  • @aymentlili9109
    @aymentlili9109 2 years ago

    This has been really informative. The spaghetti I made with your recipe was off the charts! Love how neatly you simplify the addition of instances from the algebraic space of linear functions, and Fourier transforms specifically. Thanks for making things accessible to everyone

  • @tentyluaysari3393
    @tentyluaysari3393 3 years ago

    I just started this lesson and the way you explain it is really simple and helps me a lot with my research paper, thank you!

  • @_Saucypasta_
    @_Saucypasta_ 4 years ago

    Awesome series! You have the clearest deep learning videos I've seen so far.

  • @user-hi2hb2ny2p
    @user-hi2hb2ny2p 1 year ago

    Thanks, very enlightening and useful explanation

  • @vikasnair8447
    @vikasnair8447 4 years ago

    I wish I had a teacher like you in school! Thank you so much :)

  • @chaithanyakumara5828
    @chaithanyakumara5828 4 years ago +8

    Dude, this was really a great lecture. Can you please do a video on the mathematical aspects of the Fourier transform and MFCCs?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +4

      Thank you! I'm planning to create a whole series on audio DSP for music over the next months, where I'll delve into the mathematical details. Stay tuned :)

  • @mishachandar3965
    @mishachandar3965 4 years ago

    Can't find a better series on audio processing with DL than this! Great content as always. It would be really helpful if you could touch on concepts like audio augmentation techniques and transfer learning later in this series. Thanks Valerio!!!

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +1

      Thank you! I'm now producing a new series "Audio Processing for ML", where I'll probably get into data augmentation for audio.

  • @harleensingh2531
    @harleensingh2531 2 years ago

    What a great video! Very easy to follow, thank you!

  • @SehamMohammed2020
    @SehamMohammed2020 4 years ago

    This is a really helpful series. I hope you can make more series about audio data for DL in more detail. I really like your way of explaining things

  • @sampadamatkar6027
    @sampadamatkar6027 3 years ago

    You made this topic so easy, you are amazing sir, thank you sir 🙏

  • @prasanthantonyraj8876
    @prasanthantonyraj8876 4 years ago

    It is really an amazing series and I am happy that I found it. I wish to thank you a lot for your time. Keep up the good work. I am also curious to learn more about MFCCs. It would be really helpful if you made another series about audio DSP, as you mentioned earlier.
    Thanks again!

  • @caiovillela3708
    @caiovillela3708 3 years ago

    Great video, thank you for the work!!

  • @islamic1007
    @islamic1007 4 years ago +1

    Thanks for such great videos.. Please make videos on signal processing for sound waves

  • @trendyimpacttv
    @trendyimpacttv 3 years ago +1

    best series ever!! thanks brother

  • @rajansaharaju1427
    @rajansaharaju1427 4 years ago

    I'm working with TTS. Glad to see the series.

  • @javadmahdavi1151
    @javadmahdavi1151 2 years ago

    this is sooo cool, thanks for that Valerio... high-quality content ❤❤❤❤😍😍😍

  • @raktimbarua6601
    @raktimbarua6601 4 years ago +1

    Absolutely brilliant, I just want to implement this on the Million Song Dataset

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      This sounds like a great idea -- and it'll enable you to learn a lot in the process!

    • @raktimbarua6601
      @raktimbarua6601 4 years ago

      @@ValerioVelardoTheSoundofAI I may ask for some help from you!

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      @@raktimbarua6601 I'm here to help... if I can ;)

    • @raktimbarua6601
      @raktimbarua6601 4 years ago

      @@ValerioVelardoTheSoundofAI Would you mind checking my work and giving me feedback, please? I can share my GitHub link. Many thanks

  • @mariameboukrim657
    @mariameboukrim657 5 months ago

    Great video, thank you.

  • @IstiakAhammed
    @IstiakAhammed 2 years ago

    It is a really interesting topic. I want to request: could you please make videos on speech enhancement systems? How can we create a neural network model or CNN for speech enhancement? How can we remove the noise from human speech using deep learning or a specific model (CNN, ANN, RNN, or LSTM)? Thanks for making amazing videos for us.

  • @sambhavgulla6730
    @sambhavgulla6730 4 years ago +1

    Nicely explained concepts.
    Can you also make a video on how MFCCs are extracted...
    I mean the use of pre-emphasis filters and Hamming windows

  • @sundeeparandara
    @sundeeparandara 4 years ago +1

    Great job!!!

  • @123siip
    @123siip 4 years ago

    Such a good video for someone like me who isn't too much into data like this, even though I still find it hard to understand what it all means clearly

  • @vincentouwendijk3746
    @vincentouwendijk3746 3 years ago

    Interesting stuff Valerio!

    • @vincentouwendijk3746
      @vincentouwendijk3746 3 years ago

      By the way, can I ask what you are using to automatically give suggestions for your code? I use VS Code, but I'm looking for something as effective as what you use :)

  • @jokkerBANG
    @jokkerBANG 3 years ago

    Amazing! This helps me tremendously

  • @artyomgevorgyan7167
    @artyomgevorgyan7167 4 years ago

    Very good resource for newcomers like me.

  • @ManusiaSetengahChiKuadrat
    @ManusiaSetengahChiKuadrat 4 years ago

    The explanation is great and this is a cool video 😁👍

  • @oroneki
    @oroneki 4 years ago

    This content is amazing! Very, very clear for me! I am just a programmer interested in extracting audio from video and then transforming it into a podcast (without the advertising intervals) to listen to while I am doing the dishes :) I did the first version with just random forests and it was OK, but it's time to do some deep learning now... this series is gold! I built a small Flask app to divide the audio into small 1-sec parts and serve them to an app where I can easily classify my audio and use the classifications as labels in the DL project...

  • @Magistrado1914
    @Magistrado1914 3 years ago

    Excellent course
    14/11/2020

  • @AM-jx3zf
    @AM-jx3zf 3 years ago

    great stuff, man... props

  • @sbraun27
    @sbraun27 4 years ago

    Thanks for all the videos!!!! Would be very interested in learning more about this topic and other potential resources to supplement it.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      Thank you for the feedback! I'm considering creating a series on audio DSP / music processing over the next months. If you're interested in the topic, you should take a look at "Fundamentals of Music Processing" www.springer.com/gp/book/9783319219448 This book is quite dense, but it'll give you a strong background in all of these topics, and way more. I'll probably use this book as the main reference for my series on music processing.

  • @pravinyadav8372
    @pravinyadav8372 3 years ago

    This was incredible

  • @ShortVine
    @ShortVine 4 years ago

    Quality content, God bless you.

  • @pushkarpadmnav
    @pushkarpadmnav 3 years ago

    I am a bit confused about the meaning of "magnitude" in the frequency-domain graph generated after doing an FFT on the time-domain signal. Can you please explain which of the two explanations below is correct?
    1) The magnitude corresponding to a particular frequency after the FFT shows the number of times that particular frequency has occurred.
    2) The magnitude corresponding to a particular frequency after the FFT shows the amplitude of the sine wave with that particular frequency.
    Thanks for this wonderful video 🤩
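    A minimal sketch (my own illustration with assumed signal values, not from the video) that checks this with NumPy: the FFT magnitude at a given frequency tracks the amplitude of the sine component at that frequency (explanation 2), not how many times the frequency "occurred".

    ```python
    import numpy as np

    sr = 22050                                   # assumed sample rate (Hz)
    t = np.arange(sr) / sr                       # 1 second of sample times
    # two known components: 440 Hz with amplitude 1.0, 880 Hz with amplitude 0.3
    signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

    spectrum = np.fft.rfft(signal)                   # one-sided FFT (complex)
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)   # frequency axis in Hz
    magnitude = np.abs(spectrum) * 2 / len(signal)   # scaled so peaks match sine amplitudes

    for f in (440, 880):
        i = np.argmin(np.abs(freqs - f))
        print(f"{f} Hz -> magnitude ~ {magnitude[i]:.2f}")   # ~1.00 and ~0.30
    ```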

  • @raghavgupta6186
    @raghavgupta6186 3 years ago

    Really great content 🙏

  • @akshaya3086
    @akshaya3086 1 year ago

    Could you explain DWT feature extraction? And can you please explain what an MFCC coefficient is in particular?

  • @nishantbarsainyan5700
    @nishantbarsainyan5700 3 years ago

    Can you please prepare a series on emotion recognition from speech?

  • @3arabs4
    @3arabs4 3 years ago

    This is gold.

  • @thiebesleeuwaert9930
    @thiebesleeuwaert9930 3 years ago

    Hi, I found this video really informative

  • @suhanikashyap839
    @suhanikashyap839 2 years ago

    very helpful!

  • @sifftube7537
    @sifftube7537 4 years ago

    It was very helpful, but please can you show us how to prepare a corpus for another language from scratch, like an under-resourced language, from broadcast news data?

  • @arnabmukherjee9939
    @arnabmukherjee9939 4 years ago

    Really good video 👌👌 Keep posting such videos.

  • @nikotuba
    @nikotuba 4 years ago +2

    Awesome! Will you post a link to a GitHub repo with the Python implementations?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      Thanks :) I always include a link to the Python implementations (when there is one!) in the description section. Stay tuned for the next video, where I'll implement some of the topics I've covered in this video :)

  • @prachichitnis
    @prachichitnis 4 years ago

    This is a great series! I want to know how I can extract features about periodicity of audio data. The frequency, timbre and other MFCC features would tell me about the note or pitch at a point in time. But, to extract the rhythm signature, I would need to look at the repeating patterns over a time period.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +1

      I suggest you check out my new series (still in production) "Audio Signal Processing for Machine Learning"

  • @meetgandhi8782
    @meetgandhi8782 4 years ago

    This video was very helpful; I would definitely like more videos on digital signal processing. Additionally, could you also make a video on feature engineering for ML algorithms?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +2

      Thank you! When I make the audio/music DSP series, I'll definitely cover feature engineering for ML.

  • @Zuke22
    @Zuke22 3 years ago

    Why does the MFCC graph just look like a blocky version of the spectrogram?
    Is the intuition that the frequencies are just split into 13 groups?

  • @AnkitSharma1063ankit
    @AnkitSharma1063ankit 2 years ago

    really nice

  • @ayoadeadeyemi01
    @ayoadeadeyemi01 4 years ago

    Thanks for this amazing video. I gained a lot. Can you explain more about harmonics and chromagrams?

  • @vigahardjanto3690
    @vigahardjanto3690 6 months ago

    Hi Valerio, thanks for the great explanation. At 25:05 you explain that the ZCR feature can be fed into the ML algorithm. But when I use the librosa zero_crossing_rate function I get quite a long array, so how do I summarize this array? Is it by taking the average value? It would be a pleasure if you answered my question, thank you
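    One common way to handle this (an assumption on my part, not the video's answer) is to collapse the frame-wise zero-crossing-rate array into summary statistics such as the mean and standard deviation; a minimal sketch:

    ```python
    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.ex("trumpet"))      # librosa example clip (downloaded on first use)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)
    print(zcr.shape)                                 # (1, n_frames): one value per frame

    zcr_mean = np.mean(zcr)   # global average ZCR
    zcr_std = np.std(zcr)     # how much it varies over time
    print(zcr_mean, zcr_std)
    ```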

  • @ValerioVelardoTheSoundofAI
    @ValerioVelardoTheSoundofAI  4 years ago

    Given many of you have requested it, I've started a new in-depth series 🔥🔥 on Audio Processing for Machine Learning 🎼🤖. Check it out at: ua-cam.com/video/iCwMQJnKk2c/v-deo.html

  • @ekoteguhwidodo2047
    @ekoteguhwidodo2047 4 years ago

    A great explanation for understanding audio data for deep learning. It's really "new" to me.
    I just want to ask: does all audio analysis use spectrogram data as the basis?
    Thank you

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +1

      Thank you! Not all analysis uses power spectrograms. If you're using traditional ML / audio DSP techniques you would also use other features (e.g., chromagrams, zero-crossing rate). Spectrograms and similar features are usually used in end-to-end DL approaches. I'm planning to create a series on audio/music processing where I dig deeper into the topics I've only scratched the surface of in these couple of videos.
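      A minimal sketch (my own example, with assumed parameters) of extracting a couple of the features mentioned above with librosa:

      ```python
      import numpy as np
      import librosa

      y, sr = librosa.load(librosa.ex("trumpet"))           # librosa example clip (downloaded on first use)

      # power spectrogram: the typical input for end-to-end DL approaches
      spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2

      # hand-crafted features often used with traditional ML / audio DSP
      chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # (12, n_frames)
      zcr = librosa.feature.zero_crossing_rate(y)           # (1, n_frames)

      print(spec.shape, chroma.shape, zcr.shape)
      ```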

  • @juliangermek4843
    @juliangermek4843 3 years ago

    Exactly what I was looking for, thank you!
    One follow-up question: Is there always exactly one possible result from a Fourier transform? Or (1) can it be impossible to decompose the sound, or (2) can there be more than one possible decomposition?

  • @isaacmwanza9162
    @isaacmwanza9162 3 years ago

    Would like to know how to classify sound

  • @Waffano
    @Waffano 2 years ago

    I never really understood why we don't just use the raw waveform as input to the neural network, as a 1D array or something, where the index represents time and the values represent amplitude. Shouldn't it have all the information we need? Any help in understanding this would be much appreciated.

  • @aigen-journey
    @aigen-journey 4 years ago +1

    This might be a stupid question, but do you use something like a sliding window on that MFCC? I thought most sequential data is processed with RNNs/LSTMs, but then I would guess only a value from a single time step is processed from that MFCC

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +2

      It's actually a great question. Deriving MFCCs is an elaborate process with several steps. The first is to perform an STFT, which uses a sliding window. The sliding window is characterised by two values: the window (i.e., the frame size, expressed in number of samples) and the hop length (also expressed in number of samples). When you perform the FFT, you consider a time interval equal to the frame size. Then, you shift the window forward by a number of samples equal to the hop length. The hop length is smaller than the frame size in order to produce overlapping FFTs, which preserve information at the edges of the intervals. Since MFCCs rely on the STFT, you can say that extracting MFCCs uses a sliding window.
      As for the second part of your comment, you can definitely use an RNN to process MFCCs, passing the MFCC vector for a single window at a time. However, you can also process MFCCs using basic MLP or CNN architectures, treating the MFCCs as 2D data, similar to images. We'll take a look at this in the following videos. Stay tuned!
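      A minimal sketch (my own code, with assumed parameters, not taken from the video) that makes the sliding-window parameters explicit when extracting MFCCs with librosa:

      ```python
      import librosa

      y, sr = librosa.load(librosa.ex("trumpet"))      # librosa example clip (downloaded on first use)

      n_fft = 2048        # frame (window) size in samples
      hop_length = 512    # hop length in samples; smaller than n_fft, so windows overlap

      mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                   n_fft=n_fft, hop_length=hop_length)
      print(mfccs.shape)  # (13, n_frames): one 13-dimensional MFCC vector per window position

      # RNN/LSTM: feed one 13-dimensional column per time step (e.g., iterate over mfccs.T)
      # MLP/CNN:  treat the whole (13, n_frames) array as 2D data, similar to an image
      ```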

    • @aigen-journey
      @aigen-journey 4 years ago

      @@ValerioVelardoTheSoundofAI Thank you for the explanation. I've never worked with audio (I do graphics stuff mostly), so this new domain is pretty fascinating to me. I would guess that some architecture similar to video processing would also work as you have a series of 2D time-dependent inputs. Looking forward to the next video!

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      @@aigen-journey your guess is right :)

  • @yannickpezeu3419
    @yannickpezeu3419 3 years ago

    Question:
    In the Fourier transform, if we compute the full Fourier transform (meaning the phase + the amplitude instead of just the amplitude) we can actually recompose the entire signal without any loss of time information. The original signal is just the inverse Fourier transform of the Fourier transform of the signal: f(n) -> F(ω) -> f(n).
    Why don't we just do that? Why do we need the short-time Fourier transform? Is it more efficient this way? Am I missing something? Thanks for your great work!
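    A quick check (my own sketch, not from the video) of the premise: with both magnitude and phase, the FFT is perfectly invertible, so nothing is lost; the STFT is used to get a localised, frame-by-frame view of how the spectrum changes over time, not because the plain FFT discards information.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    signal = rng.standard_normal(22050)          # 1 second of audio-like samples

    spectrum = np.fft.fft(signal)                # complex values: magnitude AND phase
    reconstructed = np.fft.ifft(spectrum).real   # inverse transform

    print(np.allclose(signal, reconstructed))    # True: perfect reconstruction
    ```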

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago

      I suggest you check out my series on Audio Signal Processing for ML. There I spend 4+ videos on these topics ;)

  • @amirmoradkhani7299
    @amirmoradkhani7299 4 years ago

    Many thanks :*

  • @bgsouayboss199
    @bgsouayboss199 4 years ago

    This is a really good video. Can you make a video about x-vectors and i-vectors? That would be really cool

  • @orhanors1800
    @orhanors1800 4 years ago

    20:28 How audio is transformed into a spectrogram using the STFT... Spectrogram: the signal in the frequency domain

  • @yannickpezeu3419
    @yannickpezeu3419 3 years ago

    Thanks

  • @amitbenhur3722
    @amitbenhur3722 3 years ago

    Hi, amazing videos... Just a question,
    When creating subtitles for a video with DL, do we create a spectrogram from the video's audio and use a network with CNN layers?

  • @harvajourdieyosua9529
    @harvajourdieyosua9529 3 years ago

    Hello sir, I'm very grateful I found your videos. I'm currently preparing my thesis on musical genre classification, but I'm having problems understanding the feature extraction part.
    So my question is: in music genre classification, do we only need to use MFCCs? Is that enough? Thanks!

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago

      MFCCs do a pretty good job. Mel Spectrograms are state of the art. You don't need to mix these features with others.
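      A minimal sketch (parameters are my own assumptions) of extracting a log-scaled Mel spectrogram with librosa, the kind of feature this reply refers to:

      ```python
      import numpy as np
      import librosa

      y, sr = librosa.load(librosa.ex("trumpet"))      # librosa example clip (downloaded on first use)

      mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                           hop_length=512, n_mels=128)
      log_mel = librosa.power_to_db(mel, ref=np.max)   # dB scaling is common for DL inputs
      print(log_mel.shape)                             # (128, n_frames)
      ```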

  • @matthewk6522
    @matthewk6522 1 year ago

    Did you find that using MFCCs works better than using a spectrogram? I just stumbled upon your video and I have been using the same dataset, but I extract the spectrogram and feed that into my network. I am constantly running into overfitting, and even when I use the same CNN as you do (in your later video) I only get about 50% validation accuracy, while getting 99% training accuracy. Does using MFCCs reduce overfitting?

  • @thrishulh9834
    @thrishulh9834 2 months ago

    20:39. Why is 4000 Hz blue? It should be bright red, right, since it is the frequency with the highest amplitude?

  • @hussain_sh2763
    @hussain_sh2763 4 years ago

    If the program has many MFCC vectors for an audio file, will it average those MFCCs to get just one? Please, if you understand what I mean, explain your answer in more detail

  • @yannickpezeu3419
    @yannickpezeu3419 3 years ago

    Hi, do you have a video explaining the role of the phase in the Fourier representation?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago

      Yes, I have a detailed explanation of the FFT in the "Audio Signal Processing for ML" series.

  • @alexiatopalidou4733
    @alexiatopalidou4733 4 years ago

    You are great!

  • @javadmahdavi1151
    @javadmahdavi1151 2 years ago

    Hey Valerio, where can I talk to you directly?
    Do you have a chat room on Discord or Telegram or...?

  • @shyamnand5354
    @shyamnand5354 3 years ago

    U are good 👌

  • @chryszification
    @chryszification 4 years ago +1

    So when we pass the spectrogram as input to the NN, do we represent it as a 2D input (meaning we have to get rid of either time or magnitude) or as a 3D input? Thanks!

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      The dimension depends on the type of network you're using. However, the basic idea is that you'll be able to package time, frequency, and magnitude in a 2D array, in the same way we visualise the spectrogram. The shape of the array is (# time steps, # frequency bands). The values in the array are the magnitudes for each frequency band at each time step. In the case of a CNN, however, you'll have to pass a 3D array, where the 3rd dimension indicates the depth. For audio data, the depth is 1, just like in greyscale pictures. For RGB images, depth=3. I cover this and more in the following videos. So, stay tuned :)
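      A minimal sketch (my own code, with assumed STFT parameters) of the packaging described above:

      ```python
      import numpy as np
      import librosa

      y, sr = librosa.load(librosa.ex("trumpet"))              # librosa example clip (downloaded on first use)
      stft = librosa.stft(y, n_fft=2048, hop_length=512)
      spectrogram = librosa.power_to_db(np.abs(stft) ** 2)     # (# frequency bands, # time steps)

      x_2d = spectrogram.T                                     # (# time steps, # frequency bands)
      x_3d = x_2d[..., np.newaxis]                             # add depth=1, like a greyscale image (for a CNN)
      print(x_2d.shape, x_3d.shape)
      ```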

  • @zeldisuryady1541
    @zeldisuryady1541 4 years ago

    Hi Valerio, a nice and very informative series on understanding audio for machine learning.
    One question about the MFCC plot: it shows that the first MFCC coefficient always has the lowest value, represented by the blue color.
    Why is that so? Thanks for your response.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +1

      The first MFCC value is the least representative for an audio file, and is often dropped for audio characterisation. That's because it carries information mainly connected to loudness.
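      For illustration, a minimal sketch (my own, not the video's code) of dropping that first coefficient:

      ```python
      import librosa

      y, sr = librosa.load(librosa.ex("trumpet"))      # librosa example clip (downloaded on first use)
      mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

      mfccs_no_c0 = mfccs[1:, :]                       # drop coefficient 0, which mostly tracks loudness/energy
      print(mfccs.shape, mfccs_no_c0.shape)            # (13, n_frames) -> (12, n_frames)
      ```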

    • @zeldisuryady1541
      @zeldisuryady1541 4 years ago

      @@ValerioVelardoTheSoundofAI Thanks Valerio for your response.

  • @AshwaniKumar04
    @AshwaniKumar04 3 years ago

    Hello Valerio:
    Thanks for this awesome channel. Thanks a lot.
    I do have a doubt: when should we use a spectrogram vs MFCCs for a deep learning problem?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago +1

      Spectrograms are state-of-the-art features in DL now. MFCCs are rarely used in DL.

    • @AshwaniKumar04
      @AshwaniKumar04 3 years ago

      @@ValerioVelardoTheSoundofAI So, in that case, we should always use spectrograms in DL