You asked for it - and I delivered | Live speech transcription with OpenAI Whisper STT
- Published 13 Apr 2024
- Run live speech transcription on Raspberry Pi 5 with faster-whisper and WhisperLive, see the transcription results as they are processed, and send the final output to an LLM or TTS. Less finicky than SDL2, WhisperLive uses PyAudio for audio capture instead. Tested with two microphones: the ReSpeaker 2-Mics Pi HAT and the ReSpeaker USB Mic Array.
How to make whisper.cpp transcribe faster? (audio_ctx explanation)
/ how-to-make-cpp-102337630
Microphones:
www.seeedstudio.com/ReSpeaker...
www.seeedstudio.com/ReSpeaker...
Barebone single-threaded implementation of FasterWhisper transcription from the microphone
gist.github.com/AIWintermuteA...
My fork of whisper.cpp Python bindings
github.com/AIWintermuteAI/whi...
My fork of Whisper live - git clone this to follow along the video
github.com/AIWintermuteAI/Whi...
faster-whisper repository
github.com/SYSTRAN/faster-whi...
Piper TTS
github.com/rhasspy/piper
- Science & Technology
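For reference, the barebone single-threaded approach from the gist above boils down to a capture-and-transcribe loop like the following. This is my own minimal sketch, not the gist's exact code: the `tiny.en` model name, chunk size, and 3-second window are assumptions.

```python
# Minimal single-threaded mic -> faster-whisper loop (a sketch; chunk size,
# window length, and model name are illustrative, not the video's exact values).
import numpy as np

RATE = 16000          # Whisper models expect 16 kHz mono audio
CHUNK = 1024          # PyAudio frames per read
WINDOW_SECONDS = 3    # transcribe roughly every 3 s of captured audio

def int16_bytes_to_float32(raw: bytes) -> np.ndarray:
    """Convert signed 16-bit little-endian PCM bytes to float32 in [-1, 1]."""
    return np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0

def main():
    import pyaudio
    from faster_whisper import WhisperModel

    model = WhisperModel("tiny.en", device="cpu", compute_type="int8")
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = []
    try:
        while True:
            frames.append(stream.read(CHUNK, exception_on_overflow=False))
            if len(frames) * CHUNK >= RATE * WINDOW_SECONDS:
                audio = int16_bytes_to_float32(b"".join(frames))
                segments, _ = model.transcribe(audio, language="en")
                for seg in segments:
                    print(seg.text)
                frames.clear()
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()

# main()  # uncomment to run on the Pi with a microphone attached
```

WhisperLive adds buffering, VAD, and a client/server split on top of the same basic flow.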
I have been following you for a long time and you have done such a fantastic job of crafting your style and keeping your content relevant. Great job Dmitry!
Thank you for leaving this comment!
To tell the truth, I'm still refining my style. One thing I've been successful at recently (I think) is keeping my videos more to the point, with a good flow of information. Looking back, I was blabbering way too much in some of my older videos. Now I cut a lot of material in post-processing if I feel the video is overloaded.
I plan to make some more storytelling-oriented robotics content over the next half-year; stay tuned and see how it goes.
Thank you for making this follow up!
Appreciate your support
Excellent work, keep it up!!! Shared on Twitter too.
Thanks for sharing!!
I'm trying to make an AI voice assistant and would be completely lost without your videos. Thanks so much!
Glad I could help!
Thank you so much! You’ve really helped me speed up my project. I normally don’t like and subscribe but I made an exception 🙃. Keep it up!!
Thank you for your support!
Hi there, could you please advise the best and easiest way to transcribe mp3 speech recordings to text with no coding experience at all? Thank you.
Thank you for sharing your knowledge. I'm trying to do "float16" STT transcription with diarization using WhisperX on an 8GB Pi5, but "the ctranslate2 package does not compile with CUDA support." Per the whisperx readme, I tried to install pytorch v11.8 from the PyTorch pip command, and then I tried the current version, before trying to install whisperx with no joy. Apologies if this is a silly question, but is there a CUDA version that works on a Pi5 GPU (Broadcom VideoCore VII), or must I only use CPU CUDA? What do you recommend? Thanks!
This is too good... I think this could fit directly into one of my projects. Do you have any recommendations for real-time TTS?
Hopefully!
I used espeak before for other projects... it is pretty horrible by modern standards, but does its job.
For this example I used Piper TTS - much better quality, but not as fast as espeak.
Thx
My pleasure! Thanks for commenting!
Hi,
I am getting the following error while running the fork... any ideas?
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2'.
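This is NumPy's standard 2.0 ABI-break warning: the fork's compiled modules were built against NumPy 1.x. The usual workaround is to pin NumPy below 2.0 in the same environment (`pip install "numpy<2"`). A small version check like the one below (a generic sketch, not part of the fork) can fail fast with a helpful message instead of crashing later:

```python
def numpy_is_compatible(version: str) -> bool:
    """Return True if the given NumPy version predates the 2.0 ABI break."""
    return int(version.split(".")[0]) < 2

# Example startup guard (hypothetical usage):
# import numpy as np
# if not numpy_is_compatible(np.__version__):
#     raise SystemExit('Modules built for NumPy 1.x; run: pip install "numpy<2"')
```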
Hey, just wanted to say I really appreciated your last two videos. Will you please be my sensei? Thank you!!
I appreciate your appreciation! xD
I'd say that I'm already a sensei of sorts... You always can support me on Patreon for some extras, but otherwise simply stay tuned for more videos!
Like!!! Dima, awesome content! What do you think about the VOSK API, and how does it compare to Whisper? Great example of Piper TTS. Thank you!
Thanks, appreciate it!
I'll try it out and compare them - I don't think I'll make a video about it, but maybe a blog article :)
I've been trying to find a way to make end of speech flag to be more intelligent than just detecting a pause. I find it common that I may have a mental blank, or misspeak, and the delay in my speech incorrectly flags end of speech. It would be interesting if STT systems can continue listening after a pause if it detects an incomplete sentence. Any ideas?
That's a hard one. I don't think this is solved even in commercial STT engines, e.g. Google Assistant or Siri.
It would require understanding of sentence context. We might be getting somewhere with multi-modal models such as GPT-4o, but I don't think there is anything available that can run on a Raspberry Pi form-factor computer.
Also, as a shortcut, perhaps it would be possible to either run a classifier or modify the Whisper model to output the probability of a sentence being finished... It's just an idea though; finding out how well it would work is another thing entirely.
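As a rough illustration of that classifier shortcut, here is a purely heuristic "sentence finished" check. Everything in it - the regex, the filler-word list, the function name - is my assumption for illustration, not anything from the video; a real solution would use a trained model. The idea is to extend the VAD silence timeout whenever the partial transcript looks incomplete:

```python
import re

# Sentence-final punctuation that Whisper typically emits
SENTENCE_END = re.compile(r"[.!?…]\s*$")
# Words that suggest the speaker paused mid-thought (illustrative list)
FILLERS = ("um", "uh", "er", "and", "but", "so", "because")

def looks_finished(transcript: str) -> bool:
    """Heuristic stand-in for a 'sentence finished' classifier: treat the
    utterance as incomplete if it lacks sentence-final punctuation or
    trails off on a conjunction/filler word."""
    text = transcript.strip()
    if not text:
        return False
    if not SENTENCE_END.search(text):
        return False
    last_word = re.sub(r"[^\w']", "", text.split()[-1]).lower()
    return last_word not in FILLERS
```

In a pipeline, `looks_finished(partial_transcript)` returning False after a pause could trigger a longer grace period before the end-of-speech flag is raised.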
Please advise on a hardware setup for offline RAG, TTS and STT.
Hard to estimate without knowing the details.
The real implementation uses a websocket. The idea is: the app transmits raw 16 kHz PCM audio, the WS server captures those audio packets, sends them to Whisper for transcription, and returns the result to the app as JSON.
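That flow can be sketched as follows. This is a hedged sketch, not the commenter's actual server: the `websockets`-style async handler, 3-second buffering threshold, and JSON field names (`segments`, `start`, `end`, `text`) are all my assumptions.

```python
import json
import numpy as np

def pcm16_to_float32(chunk: bytes) -> np.ndarray:
    # Raw 16 kHz signed 16-bit little-endian PCM -> float32 in [-1, 1],
    # the format faster-whisper accepts directly.
    return np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0

def to_json(segments) -> str:
    # Package (start, end, text) tuples as the JSON reply sent back to the app.
    return json.dumps({"segments": [
        {"start": s, "end": e, "text": t} for s, e, t in segments]})

async def handler(ws, model):
    # Collect PCM packets from the app; once ~3 s of 16-bit mono audio has
    # accumulated, transcribe it and reply with JSON.
    buffer = b""
    async for packet in ws:
        buffer += packet
        if len(buffer) >= 16000 * 2 * 3:
            segments, _ = model.transcribe(pcm16_to_float32(buffer))
            await ws.send(to_json((s.start, s.end, s.text) for s in segments))
            buffer = b""
```

One could wire this up with e.g. `websockets.serve(functools.partial(handler, model=model), ...)`; the exact message framing between app and server is up to the implementation.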
do you use the 8gb raspberry pi?
Yes, a Raspberry Pi 5 with 8 GB - but RAM is hardly relevant here for the tiny.en model.
@Hardwareai Thinking of getting the Pi 4 with 1 GB RAM. Shouldn't be an issue to replicate, hopefully.
Can a Raspberry Pi 5 run Whisper using Python?
Yes, absolutely!
Thx
Appreciate it!