Can Whisper be used for real-time streaming ASR?

  • Published Aug 6, 2024
  • Try Voice Writer: voicewriter.io
    Whisper is a robust Automatic Speech Recognition (ASR) model by OpenAI, but can it handle real-time streaming ASR, where the latency requirement is a few seconds? This turns out to be feasible using the open-source whisper-streaming project, which turns Whisper into a streaming ASR system. It works by feeding longer and longer audio buffers into the Whisper model, using the LocalAgreement algorithm to confirm output as soon as two consecutive iterations agree on it, and then scrolling the buffer forward to the start of the next sentence.
    0:00 - Introduction
    0:35 - Batch vs Streaming ASR
    1:55 - Why is this difficult?
    2:58 - Whisper-streaming demo
    3:38 - Processing consecutive audio buffers
    4:36 - Confirming tokens with LocalAgreement
    6:05 - Prompting previous context
    7:01 - Limitations vs other streaming ASR models
    References:
    github.com/ufal/whisper_strea...
    Macháček, Dominik, Raj Dabre, and Ondřej Bojar. "Turning Whisper into Real-Time Transcription System." IJCNLP-AACL 2023.
    Chen, Xie, et al. "Developing real-time streaming transformer transducer for speech recognition on large-scale dataset." ICASSP 2021.
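    The LocalAgreement step described above can be sketched as a longest-common-prefix check over two consecutive hypotheses. This is a minimal illustration of the idea, not the actual whisper_streaming code:

```python
def local_agreement(prev_hyp, curr_hyp):
    """Return the longest common prefix of two consecutive hypotheses;
    these tokens are treated as confirmed (LocalAgreement-2)."""
    confirmed = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        confirmed.append(a)
    return confirmed

# The two hypotheses agree on the first four words, so only those are
# confirmed; the unstable tail is held back until a later iteration.
print(local_agreement(
    ["the", "cat", "sat", "on", "a"],
    ["the", "cat", "sat", "on", "the", "mat"],
))  # → ['the', 'cat', 'sat', 'on']
```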
  • Science & Technology

COMMENTS • 16

  • @nmstoker
    @nmstoker 4 months ago

    Thank you - yet another beautifully explained topic 🙂

  • @wolpumba4099
    @wolpumba4099 4 months ago +1

    *Abstract*
    This video explores the potential of OpenAI's Whisper model for real-time streaming automatic speech recognition (ASR). While Whisper excels in batch ASR, its ability to handle streaming scenarios with low latency is less obvious. The video introduces the open-source whisper-streaming project, which adapts Whisper for streaming applications by processing consecutive audio buffers of increasing size and confirming output tokens using the LocalAgreement algorithm. The video also discusses the limitations of this approach compared to models specifically designed for streaming ASR.
    *Summary*
    *Introduction (**0:00**)*
    * The video investigates whether OpenAI's Whisper model can be used for real-time streaming ASR.
    * Whisper is a powerful ASR model trained on a massive multilingual dataset, known for its robustness to noise and accents.
    *Batch vs Streaming ASR (**0:35**)*
    * Batch ASR processes entire audio recordings at once, while streaming ASR produces output as the speaker talks, with minimal delay.
    * Streaming ASR is crucial for applications like live captioning, where real-time transcription is essential.
    *Why is Streaming Whisper Difficult? (**1:55**)*
    * Whisper is designed for processing fixed-length audio segments (30 seconds), making it challenging to handle longer recordings in a streaming fashion.
    * Simply splitting audio into chunks can lead to inaccurate word recognition and high latency.
    *Whisper-streaming Demo (**2:58**)*
    * The video showcases the open-source whisper-streaming project, which enables real-time transcription using Whisper.
    * The demo demonstrates the project's ability to transcribe speech with minimal delay and provide timestamps.
    *Processing Consecutive Audio Buffers (**3:38**)*
    * Whisper-streaming feeds increasingly larger audio chunks into Whisper until an end-of-sentence marker is detected.
    * This ensures that Whisper processes complete sentences, leading to better accuracy.
    *Confirming Tokens with LocalAgreement (**4:36**)*
    * The LocalAgreement algorithm confirms output tokens only after they are generated in two consecutive audio buffers.
    * This helps distinguish between confirmed and unconfirmed transcription results, allowing for real-time feedback with potential corrections.
    *Prompting Previous Context (**6:05**)*
    * Whisper-streaming uses the previous sentence as prompt tokens for the model, providing additional context and improving accuracy.
    *Limitations vs Other Streaming ASR Models (**7:01**)*
    * Whisper's design isn't optimized for streaming, leading to inefficiencies like repeatedly processing the beginning of long sentences.
    * Dedicated streaming ASR models utilize architectures that allow for efficient processing of continuous audio streams with fixed context windows.
    * Adapting Whisper for streaming requires modifying its architecture and retraining, which is currently limited by data accessibility.
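    Putting the summarized pieces together (a growing buffer, LocalAgreement confirmation, and prompting with previously confirmed text), the overall loop might look roughly like this sketch; `transcribe` is a stand-in for the real Whisper call, and all names here are illustrative, not from whisper_streaming:

```python
def common_prefix(a, b):
    """Tokens shared by two consecutive hypotheses (LocalAgreement-2)."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def stream(chunks, transcribe):
    """Grow the audio buffer chunk by chunk, re-transcribe the whole
    buffer each time (prompted with already-confirmed text), and emit
    only tokens that two consecutive hypotheses agree on."""
    buffer, prev_hyp, confirmed = [], [], []
    for chunk in chunks:
        buffer.extend(chunk)                        # buffer keeps growing
        hyp = transcribe(buffer, prompt=confirmed)  # full re-decode
        agreed = common_prefix(prev_hyp, hyp)
        confirmed.extend(agreed[len(confirmed):])   # emit new stable tokens
        prev_hyp = hyp
    return confirmed

# Stub "model": pretends each audio sample yields one word of a sentence.
WORDS = ["whisper", "is", "a", "robust", "asr", "model"]
stub = lambda buf, prompt: WORDS[:len(buf)]
print(stream([[0], [0], [0], [0]], stub))  # → ['whisper', 'is', 'a']
```

    In the real system, the buffer is also scrolled forward to the start of the next sentence once one is completed; this sketch omits that step.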
    I used Gemini 1.5 Pro for the summary.
    Token count
    4,084 / 1,048,576

  • @AmirMahmoudi-je2pu
    @AmirMahmoudi-je2pu 2 months ago

    Nice video and great Voice Writer! I have tried implementing it with the Transformers.js package and its Whisper model, but no luck yet since the processing is heavy.

    • @EfficientNLP
      @EfficientNLP  2 months ago +1

      There are a number of things you can do to speed up the Whisper model. Some backends are more optimized depending on your hardware; faster-whisper is a popular one. You can also try smaller models: "base" is a good tradeoff that sacrifices some quality for better performance.

  • @gpminsuk
    @gpminsuk 3 months ago

    Thanks for the video!! This is a great technique, and I am thinking of using it in our application. I have one question: when the words are confirmed, why don't you feed only the partial audio (excluding the confirmed part) along with the confirmed text in the initial prompt? Would that be a lot faster when a sentence is really long, or faster on smaller chips like SBCs?

    • @EfficientNLP
      @EfficientNLP  3 months ago +1

      The main issue is that Whisper is trained on audio that starts at the beginning of a sentence, so feeding it audio that starts in the middle of a sentence would be out of distribution. Your suggestion would be more efficient, but it may lead to a degradation in transcript quality.

  • @qwerty_and_azerty
    @qwerty_and_azerty 4 months ago +1

    What happens if 2 consecutive predictions continue to disagree on a specific word? Do you pick one of the options at random? Or does the sentence starting at that word never become confirmed?

    • @EfficientNLP
      @EfficientNLP  4 months ago

      Generally, the predictions change up to a certain point, after which they no longer change with additional input, and then they are confirmed. If this never happens, the implementation will need to handle the edge case in some way, such as picking one of the options, but this should not happen often.

  • @pedroprobst5230
    @pedroprobst5230 4 months ago

    Thank you. Are you using faster-whisper as your backend? I'm trying to achieve something similar but with whisper.cpp.

    • @EfficientNLP
      @EfficientNLP  3 months ago

      This method should work for any backend, but only faster-whisper is supported in the current implementation of whisper-streaming. Some modification will be required to make it work for whisper.cpp.

    • @pedroprobst5230
      @pedroprobst5230 3 months ago

      @@EfficientNLP Interesting; thank you very much.

    • @Bub_s69
      @Bub_s69 1 month ago

      Did you end up figuring it out with whisper.cpp?

  • @wolpumba4099
    @wolpumba4099 4 months ago

    I would like something like your Voice Writer, but instead of outputting text it should output speech. It should remove my grammar mistakes and accent but copy my intonation. Do you think this is possible at this time? I can't find good text-to-speech or voice cloning models.

    • @EfficientNLP
      @EfficientNLP  4 months ago

      This sounds quite different from what I'm building with Voice Writer. I've not looked at voice cloning models before, so I'm not sure of their feasibility, but it's a good and potentially useful project idea.

  • @jacemc9852
    @jacemc9852 5 days ago

    What latencies can be expected with Whisper Streaming? I'd like to know what to expect before going down that route.

    • @EfficientNLP
      @EfficientNLP  4 days ago

      Latency depends on various factors such as your hardware, model size, and options like the minimum chunk size; the paper reports latencies between 3 and 6 seconds depending on configuration.