Use OpenAI Whisper For FREE | Best Speech to Text Model

Share
Embed
  • Published 1 Jul 2024
  • In this video, I will show you how to run the whisper v3 model on Google Colab Notebook. Enjoy :)
    Want to Follow:
    🦾 Discord: / discord
    ▶️️ Subscribe: www.youtube.com/@engineerprom...
    Want to Support:
    ☕ Buy me a Coffee: ko-fi.com/promptengineering
    |🔴 Support my work on Patreon: / promptengineering
    Need Help?
    📧 Business Contact: engineerprompt@gmail.com
    💼Consulting: calendly.com/engineerprompt/c...
    LINKS:
    Google Notebook: tinyurl.com/mryypr2p
    Github Repo: github.com/openai/whisper
    All Interesting Videos:
    Everything LangChain: • LangChain
    Everything LLM: • Large Language Models
    Everything Midjourney: • MidJourney Tutorials
    AI Image Generation: • AI Image Generation Tu...
  • Science & Technology

COMMENTS • 63

  • @engineerprompt
    @engineerprompt  7 months ago

    Want to connect?
    💼Consulting: calendly.com/engineerprompt/consulting-call
    🦾 Discord: discord.com/invite/t4eYQRUcXB
    ☕ Buy me a Coffee: ko-fi.com/promptengineering
    |🔴 Join Patreon: Patreon.com/PromptEngineering

    • @milokornblum8672
      @milokornblum8672 5 months ago

      Can you tell me the command to put the results in subtitle (.srt) format?
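
      Regarding the .srt question above: the openai-whisper CLI can write subtitles directly via `--output_format srt` (e.g. `!whisper "FILE NAME" --output_format srt`). When working with the Python pipeline instead, timestamped chunks can be converted by hand; below is a minimal sketch, assuming chunks shaped like the transformers pipeline output with `return_timestamps=True` (the helper name `chunks_to_srt` is my own):

      ```python
      def chunks_to_srt(chunks):
          """Convert [{'timestamp': (start, end), 'text': ...}] chunks into SRT text."""
          def fmt(seconds):
              # SRT timestamps look like HH:MM:SS,mmm
              h, rem = divmod(int(seconds * 1000), 3_600_000)
              m, rem = divmod(rem, 60_000)
              s, ms = divmod(rem, 1_000)
              return f"{h:02}:{m:02}:{s:02},{ms:03}"

          blocks = []
          for i, chunk in enumerate(chunks, start=1):
              start, end = chunk["timestamp"]
              blocks.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{chunk['text'].strip()}\n")
          return "\n".join(blocks)
      ```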

  • @Nihilvs
    @Nihilvs 7 months ago +4

    Thanks for the video! I've been using this model for a long while to do translation + transcription of lectures (about an hour and a half each); mostly it works like a charm. I don't know about large-v3, but large-v2 would sometimes repeat and loop one sentence for about half of the transcription.
    So it needs some tuning (some solutions clean the audio before running Whisper).

    • @user-fh2dq7vq9v
      @user-fh2dq7vq9v 7 months ago +1

      Are the small and tiny models available in v3? If yes, please give me a link.

    • @marcin8432
      @marcin8432 7 months ago

      @user-fh2dq7vq9v Laziness destroys any kind of progress, bear that in mind.

  • @thevinnnslair
    @thevinnnslair 5 months ago +1

    Very helpful, thanks for this

  • @thunderwh
    @thunderwh 7 months ago +1

    I like the idea of chatting with documents through speech

    • @user-fh2dq7vq9v
      @user-fh2dq7vq9v 7 months ago

      Hey, can you please explain why the 10 GB model is stored in only 3.9 GB? I want to download the complete model, because I want to host it on a server.
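
      A likely answer to the size question above (my inference, not from the video): the ~10 GB figure is the VRAM needed at inference time (weights plus activations and overhead), while the checkpoint on disk stores large-v3's roughly 1.55B parameters in half precision, about 2 bytes each, so the download is the complete model. Back-of-the-envelope arithmetic, with an approximate parameter count:

      ```python
      params = 1.55e9              # approximate parameter count of whisper-large-v3
      fp16_gb = params * 2 / 1e9   # fp16: 2 bytes per parameter
      fp32_gb = params * 4 / 1e9   # fp32: 4 bytes per parameter

      # Weights alone are ~3.1 GB in fp16 (close to the observed download size)
      # and ~6.2 GB in fp32; runtime VRAM needs add activations on top.
      print(f"fp16 weights: {fp16_gb:.1f} GB, fp32 weights: {fp32_gb:.1f} GB")
      ```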

  • @rccarsxdr
    @rccarsxdr 7 months ago +1

    Helpful video! Now I can run code locally. Thanks

  • @ekstrajohn
    @ekstrajohn 7 months ago +1

    I am using V2 on my Nvidia 1080 GPU. The performance difference between the base model and the large model is very small. I tried multiple sources, tones of voice, noise, etc. The base version is really fast, so I recommend that one. Even V2 is practically perfect for transcribing speech to text.

  • @RameshBaburbabu
    @RameshBaburbabu 7 months ago +3

    🎯 Key Takeaways for quick navigation:
    00:00 🎙️ *Overview of Whisper V3 Model*
    - Whisper V3 is OpenAI's latest speech-to-text model.
    - Five configurations available: tiny, base, small, medium, and large V3.
    - Memory requirements vary from 1 GB to 10 GB VRAM.
    01:25 🔄 *Comparison: Whisper V2 vs. V3*
    - V3 generally performs better with lower error rates than V2.
    - There are specific cases where V2 outperforms V3, demonstrated later.
    - Important to consider performance metrics when choosing between V2 and V3.
    03:02 ⚙️ *Setting Up Whisper V3 in Google Colab*
    - Installation of necessary packages: Transformers, Accelerate, and Datasets.
    - GPU availability check and configuration for optimal performance.
    - Loading the Whisper V3 model, setting processor, and creating the pipeline.
    05:27 🎤 *Speech-to-Text Transcription Process*
    - Creating a pipeline for automatic speech recognition using the Whisper V3 model.
    - Uploading and transcribing an audio file in a Google Colab notebook.
    - Additional options such as specifying timestamps during transcription.
    07:45 🌐 *Language Recognition and Translation*
    - V2 may be preferable when language is unknown, as it can automatically recognize it.
    - Whisper supports direct translation from one language to another.
    - Highlighting the importance of specifying the language in V3 if known.
    09:22 ⚡ *Flash Attention and Distil-Whisper*
    - Enabling Flash Attention for improved performance if the GPU supports it.
    - Introduction to Distil-Whisper, a smaller, faster version of Whisper.
    - Demonstrating how to use the Distil-Whisper Medium English model in code.
    11:54 🌐 *Future Applications and Closing*
    - Exploring potential applications, like enabling speech communication with documents.
    - Encouraging viewers to explore and experiment with the Whisper model.
    - Expressing the usefulness and versatility of the Whisper model in various applications.
    Made with HARPA AI

  • @yuzual9506
    @yuzual9506 7 months ago

    You're a god of pedagogy, and I'm French! Thanks!

  • @ardavaneghtedari
    @ardavaneghtedari 6 months ago

    Thanks!

  • @Hasi105
    @Hasi105 7 months ago

    Yesterday I was trying it out; nice that you explain it. Thanks! Can you show me how to use a microphone for direct transcription? Maybe also a nice use for TavernAI and/or MemGPT.

    • @engineerprompt
      @engineerprompt  7 months ago +2

      Sure, will create a video on it

  • @bmqww223
    @bmqww223 2 months ago

    Greetings, can this be used with WhisperX too? I tried it with the v2 model and it used to work.

  • @user-mv9ul9tz1c
    @user-mv9ul9tz1c 7 months ago

    Thank you for sharing. May I ask whether you provide Colab code for testing? Does the new version of Whisper still have a 25 MB file size limitation? Previously, I was able to split files for batch processing and then merge them. However, when batch-processing SRT files, there seems to be a timing issue.

  • @benoitmialet9842
    @benoitmialet9842 7 months ago

    Whisper is an amazing model. I use the medium version, and with fine-tuning on only 40 minutes of audio it adapts quite well to a specific domain (2 epochs are enough).
    I never tried to fine-tune large V3, but I will. Large V2 seems worse than medium for French.
    Did you have the opportunity to fine-tune large V3 and compare its performance with medium?

    • @engineerprompt
      @engineerprompt  7 months ago +1

      I haven't looked at fine-tuning v3 yet. But it seems like it will track with large v2.

    • @billcollins6894
      @billcollins6894 7 months ago

      Can fine-tuning help accuracy if the vocabulary is limited? I only want maybe 100 words to be recognized, but I want a high probability of a match. I have also tried to see how to get a return value for the probability of a word or phrase match, but I'm not clear on how to do that.

    • @benoitmialet9842
      @benoitmialet9842 7 months ago

      @billcollins6894 If you fine-tune Whisper with audio containing these words, it should normally increase model performance on those specific words.
      If you set word timestamps to True as shown in the video, the returned dictionary normally gives you the probability for each word as well. I haven't tried it with the pipeline, but it works.
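
      To make the word-probability point above concrete: with openai-whisper's `transcribe(..., word_timestamps=True)`, each segment carries a `words` list whose entries include `word`, `start`, `end`, and `probability`. A small sketch for flagging low-confidence words (the result dict in the test is hand-built for illustration; `low_confidence_words` is my own helper name):

      ```python
      def low_confidence_words(result, threshold=0.5):
          """Return (word, probability) pairs whose probability falls below threshold."""
          flagged = []
          for segment in result["segments"]:
              for w in segment.get("words", []):
                  if w["probability"] < threshold:
                      flagged.append((w["word"].strip(), w["probability"]))
          return flagged
      ```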

  • @easylife7775
    @easylife7775 5 months ago

    Hello, can I use this to translate from language A to language B, for example?

  • @farahabdulahi474
    @farahabdulahi474 6 months ago

    Is it just me, or is this not a big jump in improvement? At least for me.
    I wanted:
    1. speaker recognition / diarization
    2. a higher accuracy rate in Mandarin
    I hope they do the first ASAP, and the second will get better over time. Azure already has a speech-to-text service that includes speaker recognition and is quite good. I wonder if that could affect how they prioritise this important feature.

  • @user-qt8bb9ix1w
    @user-qt8bb9ix1w 4 months ago

    Hey. I have been trying to reduce the length of the subtitles, as the captions generated by Whisper can be overwhelming, ranging between 12 and 18 words in a single caption. I am using Google Colab, and so far there's no success. Here are the commands I have used:
    !whisper "FILE NAME" --model medium --word_timestamps True --max_line_width 40
    !whisper "FILE NAME" --model medium --word_timestamps True --max_words_per_line 5
    It works completely fine with the following command, but with a large number of words:
    !whisper "FILE NAME" --model medium
    Could you please help?
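
    On the caption-length question above: `--max_line_width` and `--max_words_per_line` only take effect when `--word_timestamps True` is set (as in the commands shown), and they were added in relatively recent openai-whisper releases, so an older preinstalled version in Colab may silently ignore them; `pip install -U openai-whisper` is worth trying first. As a fallback, captions can be re-split in plain Python; a minimal sketch (the helper name is my own):

    ```python
    def split_caption(words, max_words=5):
        """Group a flat list of word strings into captions of at most max_words each."""
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]
    ```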

  • @contractorwolf
    @contractorwolf 5 months ago

    Is there a Colab for this?

  • @JuanGea-kr2li
    @JuanGea-kr2li 7 months ago +1

    VERY interesting. I would love to know how to run it locally, I mean with a UI on a local computer, not in a Google notebook. It would be very, VERY useful to transcribe a video, translate it later with another tool or model, and then generate subtitles or an audio track to dub the video, everything locally :)

    • @engineerprompt
      @engineerprompt  7 months ago +1

      Let me see if I can put together a Streamlit-based UI for it.

    • @JuanGea-kr2li
      @JuanGea-kr2li 7 months ago

      @engineerprompt awesome, thank you!

  • @SonGoku-pc7jl
    @SonGoku-pc7jl 7 months ago

    Thanks :) Please, what is the config if the audio is in English and I want the text translated into Spanish? Does it only translate other languages into English, or is it possible to translate English into Spanish, for example? And with the distil-whisper .en models, is it possible to translate English audio into Spanish text? Thanks! :)

  • @WillyFlowerz
    @WillyFlowerz 6 months ago

    Is there still NO practical application for real-time transcribing (+/- translating the text) readily available on Android?
    I think I heard about one or two projects that wanted to do that, but still nothing concrete more than a full year after this incredible piece of technology appeared.
    Am I missing something? Is Whisper incompatible with Android? Is there no way to apply Whisper to a continuous live audio recording?
    Has nobody managed to do it?

  • @matangbaka
    @matangbaka 4 months ago

    Hi, can anyone help me? I'm having a problem following the tutorial; I encountered an error in:
    pipe = pipeline("automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    return_timestams=True,
    torch_dtype=torch_dtype,
    device=device)
    it says that:
    TypeError Traceback (most recent call last)
    in ()
    ----> 1 pipe = pipeline("automatic-speech-recognition",
    2 model=model,
    3 tokenizer=processor.tokenizer,
    4 feature_extractor=processor.feature_extractor,
    5 max_new_tokens=128,
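
    A likely cause of the TypeError above is the misspelled keyword `return_timestams`: the transformers ASR pipeline expects `return_timestamps`. Comparing the keyword sets makes the typo visible without loading any model (a sketch; the expected set below lists only the keywords used in the comment):

    ```python
    # Keywords passed in the failing call vs. the spellings the pipeline expects.
    used = {"max_new_tokens", "chunk_length_s", "batch_size",
            "return_timestams", "torch_dtype"}
    expected = {"max_new_tokens", "chunk_length_s", "batch_size",
                "return_timestamps", "torch_dtype"}

    unknown = used - expected        # keywords the pipeline will not recognize
    print(unknown)
    ```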

  • @Nawaz-lb9eq
    @Nawaz-lb9eq 7 months ago +1

    Can you do a video on multi-speaker identification and transcription using Whisper, please?

    • @mbrochh82
      @mbrochh82 7 months ago

      I don't think it is possible with Whisper.

  • @AustinStAubin
    @AustinStAubin 7 months ago

    Do you think you could show an example with diarization?

    • @engineerprompt
      @engineerprompt  7 months ago

      I haven't worked with it directly, but I think I have seen some examples. Probably not directly in Whisper, but it can be used to augment it. Here is something that seems interesting: tinyurl.com/yeysk4bz
      I will explore this further and see what I can come up with.

  • @techmoo5595
    @techmoo5595 7 months ago

    Just ran the Colab notebook and compared v3 with v2; I think v2 is better.

  • @ericneeds1285
    @ericneeds1285 2 months ago

    I'm a "scopist" and I need to edit transcripts with different speakers. I take it this does not differentiate speakers?

    • @engineerprompt
      @engineerprompt  2 months ago

      I have a video on speaker identification on the channel.

  • @user-fh2dq7vq9v
    @user-fh2dq7vq9v 7 months ago +2

    In this, are we downloading the model or using an inference API? Actually, I am new to this and confused. If it is the model, then it will be best for me to host it on a server.

  • @Vollpflock
    @Vollpflock 7 months ago

    So what's the difference from using Whisper through the API? Is this free and even better than using it through the API?

    • @engineerprompt
      @engineerprompt  7 months ago +1

      The main difference is that it's free if you can run it locally, compared to the API, which will cost you money. Performance seems to be the same.

  • @trilogen
    @trilogen 6 months ago +1

    People want to run it locally for privacy, not route it through Google.

  • @DihelsonMendonca
    @DihelsonMendonca 6 months ago

    ⚠️ I need a good free text-to-speech. It doesn't help if you can talk to a model but it can't talk back. So, what to do? Any good, free text-to-speech? 😮

    • @engineerprompt
      @engineerprompt  6 months ago

      You might want to check out github.com/suno-ai/bark

  • @weslieful
    @weslieful 6 months ago

    What should I do, I wonder?

  • @Tyrone-Ward
    @Tyrone-Ward 5 months ago +2

    This is NOT running Whisper locally. Misleading title.

    • @listentomusic8160
      @listentomusic8160 5 months ago

      My PC has 128 MB of VRAM 😅. How am I supposed to run a 16 GB VRAM model on my local machine? 😂

  • @krstoevandrus5937
    @krstoevandrus5937 4 months ago

    Hi, I am a med student and I have some medical videos to transcribe. I tried the Whisper API vs. a Colab run of large-v3, and I found the API MUCH better. The large-v3 Colab run is NOT acceptable without human editing, while the API output is fully okay on its own. Even a name like "moblitz" the API could recognize correctly, while the Colab run transcribed it as "mobits" (wrong). Mind commenting on this for me? Thanks.