Local Low Latency Speech to Speech - Mistral 7B + OpenVoice / Whisper | Open Source AI

  • Published May 17, 2024
  • Local Low Latency Speech to Speech - Mistral 7B + OpenVoice / Whisper | Open Source AI
    👊 Become a member and get access to GitHub:
    / allaboutai
    🤖 AI Engineer Course:
    scrimba.com/?ref=allabtai
    Get a FREE 45+ ChatGPT Prompts PDF here:
    📧 Join the newsletter:
    www.allabtai.com/newsletter/
    🌐 My website:
    www.allabtai.com
    Openvoice:
    github.com/myshell-ai/OpenVoice
    LM Studio:
    lmstudio.ai
    I created a local low-latency speech-to-speech system with LM Studio, Mistral 7B, OpenVoice and Whisper. It works 100% offline and uncensored, with no dependencies like external APIs. Still working on optimizing the latency. Running on a 4080.
    00:00 Intro
    00:31 Local Low Latency Speech to Speech Flowchart
    01:32 Setup / Python Code
    05:13 Local Speech to Speech Test 1
    07:06 Local Speech to Speech Test 2
    10:06 Local Speech to Speech Simulation
    12:37 Conclusion
  • Science & Technology

COMMENTS • 160

  • @JohnSmith762A11B
    @JohnSmith762A11B 4 months ago +47

    More suggestions: add a "thought completed" detection layer that decides when the user has finished speaking based on the STT input so far (based upon context, natural pauses, and such). It will auto-submit the text to the AI backend. Then have the app immediately begin listening to the microphone at the conclusion of playback of the AI's TTS-converted response. Yes, sometimes the AI will interrupt the speaker if they hadn't entirely finished what they wanted to say, but that is how real human conversations work when one person perceives the other has finished their thought and chooses to respond. Also, if the user says "What?" or "(Could you) repeat that?" or "Please repeat?" or "Say again?" or "Sorry, I missed that," the system should simply play the last WAV file again without going for another round trip to the AI inference server and doing another TTS conversion of the text. Reserve the Ctrl-C for stopping and starting this continuous auto-voice recording and response process instead. This will shave off many precious milliseconds of latency and make the conversation much more natural and less like using a walkie-talkie.
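    The "play the last WAV again" part of this suggestion is cheap to prototype. A minimal sketch (all names here are hypothetical, not from the video's code): match the transcript against a few "repeat" phrases and replay the cached audio instead of making another LLM + TTS round trip.

    ```python
    import re

    # Phrases that should trigger a replay of the last answer instead of
    # a new LLM + TTS round trip. Matching is deliberately loose.
    REPEAT_PATTERNS = [
        r"\bwhat\?*$",
        r"\brepeat( that)?\b",
        r"\bsay (that )?again\b",
        r"\bsorry,? i missed that\b",
    ]

    def is_repeat_request(transcript: str) -> bool:
        """Return True if the user is asking to hear the last answer again."""
        text = transcript.lower().strip()
        return any(re.search(p, text) for p in REPEAT_PATTERNS)

    class SpeechLoop:
        """Caches the last TTS output so 'repeat' requests skip inference."""

        def __init__(self, llm, tts, player):
            self.llm, self.tts, self.player = llm, tts, player
            self.last_wav = None

        def handle(self, transcript: str):
            if self.last_wav is not None and is_repeat_request(transcript):
                self.player(self.last_wav)   # replay cached audio, no inference
                return
            reply = self.llm(transcript)     # full STT -> LLM -> TTS round trip
            self.last_wav = self.tts(reply)
            self.player(self.last_wav)
    ```

    The `llm`, `tts`, and `player` callables stand in for whatever backend is actually used (LM Studio, OpenVoice, an audio device); only the caching pattern is the point.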

    • @SaiyD
      @SaiyD 3 months ago +1

      Nice, let me give one suggestion on top of your suggestion: add a random choice with a 50% chance to either replay the audio or send your input to the backend.

    • @ChrizzeeB
      @ChrizzeeB 3 months ago

      So it'd be sending the STT input again and again with every new word detected, rather than just at the end of a sentence or message?

    • @deltaxcd
      @deltaxcd 3 months ago +3

      I have a better idea: feed it a partial prompt without waiting for the user to finish, and it starts generating a response at the slightest pause. If the user continues talking, more text is added to the prompt and the output is regenerated. If the user talks over the speaking AI, the AI terminates its response and continues listening.
      This will improve things twofold, because the model will have a chance to process the partial prompt, which reduces the time required to process the prompt later.
      If we combine that with not waiting for the full reply, the conversation will be completely natural.
      There is no need for any of that "say again" handling, because the AI will do that by itself if asked.
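      The "slightest pause" trigger both commenters describe usually starts as simple silence-based endpointing. A minimal sketch (thresholds are illustrative, not from the video): track per-frame audio energy and declare the turn finished after N consecutive quiet frames once speech has been heard.

      ```python
      class Endpointer:
          """Naive silence-based end-of-utterance detector.

          Feed it one audio frame at a time; it reports True once `patience`
          consecutive frames fall below the energy threshold after speech began.
          """

          def __init__(self, threshold: float = 0.01, patience: int = 20):
              self.threshold = threshold  # RMS energy below this counts as silence
              self.patience = patience    # quiet frames needed to end the turn
              self.quiet = 0
              self.heard_speech = False

          def feed(self, frame) -> bool:
              rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
              if rms >= self.threshold:
                  self.heard_speech = True
                  self.quiet = 0          # speech resets the silence counter
              elif self.heard_speech:
                  self.quiet += 1
              return self.heard_speech and self.quiet >= self.patience
      ```

      In the partial-prompt scheme above, a short patience value would trigger speculative generation, while a longer one would commit the turn.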

  • @williamjustus2654
    @williamjustus2654 4 months ago +11

    Some of the best work and fun that I have seen so far. Can't wait to try it on my own. Keep up the great work!!

  • @Canna_Science_and_Technology
    @Canna_Science_and_Technology 4 months ago +18

    Awesome! Time to replace my slow speech-to-speech code that uses OpenAI. I also added ElevenLabs for a bit of a comedic touch. Thanks for putting this together.

  • @ales240
    @ales240 4 months ago +1

    Just subscribed! Can't wait to get my hands on it, looks super cool!

  • @tommoves3385
    @tommoves3385 4 months ago +1

    Hey Kris - that is awesome. I like it very much. Great that you do this open source stuff. Very cool 😎.

  • @avi7278
    @avi7278 4 months ago +5

    In the US we have this concept: if you watch a football game, which is notorious for having a shizload of commercials (i.e. latency), and you start watching the game 30 minutes late but from the beginning, you can skip most of the commercials. If you just shift the latency to the beginning, 15 seconds of "loading" would probably be sufficient for a 5-10 minute conversation between the two chatbots. You could also avoid loops by having a third-party observer who reviews the last 5 messages, determines whether the conversation has gone "stale", and interjects a new idea into one of the interlocutors.

  • @ryanjames3907
    @ryanjames3907 4 months ago +1

    Very cool, low-latency voice. Thanks for sharing. I watch all your videos, and I look forward to the next one.

  • @deeplearningdummy
    @deeplearningdummy 3 months ago +3

    I've been trying to figure out how to do this. Great job. I want to support your work and get this up and running for myself, but is YouTube membership the only option?

  • @BruceWayne15325
    @BruceWayne15325 4 months ago +16

    Very impressive! I'd love to see them implement this in smartphones for real-time translation when visiting foreign countries / restaurants.

    • @optimyse
      @optimyse 3 months ago +1

      S24 Ultra?

    • @deltaxcd
      @deltaxcd 3 months ago

      There are models that do speech-to-speech translation.

  • @swannschilling474
    @swannschilling474 4 months ago +2

    I am still using Tortoise, but OpenVoice seems to be promising! 😊 Thanks for this video!! 🎉🎉🎉

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 3 months ago +2

    I have tried OpenVoice and Bark, but VITS by far makes the most natural-sounding voices.

  • @darcwader
    @darcwader 8 days ago

    This was more comedy show than tech, lol. Such hilarious responses from Johnny.

  • @codygaudet8071
    @codygaudet8071 2 months ago

    Just earned yourself a sub, sir!

  • @nyny
    @nyny 4 months ago +13

    That's supah cool. I actually built something almost exactly like this yesterday, and I get about the same performance. The hard part is figuring out threading / process pools / asyncio to get that latency down. I used small instead of base; I think I get about the same response or better.
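    The threading / asyncio plumbing this comment mentions typically ends up as a queue-connected pipeline so the stages can overlap instead of running strictly one after another. A minimal asyncio sketch, with stand-in coroutines in place of the real Whisper / Mistral / OpenVoice calls:

    ```python
    import asyncio

    async def pipeline(utterances, stt, llm, tts):
        """Run STT -> LLM -> TTS as overlapping stages connected by queues."""
        q_text: asyncio.Queue = asyncio.Queue()
        q_reply: asyncio.Queue = asyncio.Queue()
        out = []

        async def stt_stage():
            for audio in utterances:
                await q_text.put(await stt(audio))
            await q_text.put(None)               # sentinel: input finished

        async def llm_stage():
            while (text := await q_text.get()) is not None:
                await q_reply.put(await llm(text))
            await q_reply.put(None)              # propagate the sentinel

        async def tts_stage():
            while (reply := await q_reply.get()) is not None:
                out.append(await tts(reply))

        await asyncio.gather(stt_stage(), llm_stage(), tts_stage())
        return out
    ```

    With real stages, transcription of the next utterance can start while the previous reply is still being synthesized, which is where most of the latency win comes from.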

    • @user-rz6pp5my4t
      @user-rz6pp5my4t 3 months ago +7

      Hi! Very impressive!! Do you have a GitHub to share your code?

    • @CognitiveComputations
      @CognitiveComputations 3 months ago

      Can we see your code please?

    • @limebulls
      @limebulls 2 months ago

      I'm interested in it as well

  • @arvsito
    @arvsito 4 months ago +1

    It will be very interesting to see this in a web application

  • @aladinmovies
    @aladinmovies 3 months ago

    Good job. Interesting video

  • @researchforumonline
    @researchforumonline 3 months ago

    Wow, very cool! Thanks

  • @duffy666
    @duffy666 16 days ago

    I really like it! Is this already on GitHub for members? (I could not find it.)

  • @cmcdonough2
    @cmcdonough2 10 days ago

    This was great 😃👍

  • @denisblack9897
    @denisblack9897 4 months ago +1

    I've known about this for more than a year now and it still blows my mind. wtf

  • @user-bd8jb7ln5g
    @user-bd8jb7ln5g 4 months ago

    This is great. But personally I think speech recognition with push-to-talk or push-to-toggle talk is most useful.

  • @PhillipThomas87
    @PhillipThomas87 3 months ago +7

    I mean, this is dependent on your hardware... Are the specs anywhere for this "inference server"?

  • @SonGoku-pc7jl
    @SonGoku-pc7jl 3 months ago

    Thanks, good project. Can Whisper translate my Spanish to English, and back to Spanish, directly with a little change in the code? And do I need to change something for the TTS as well? Thanks!
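    Partly: Whisper itself can translate speech into English (its built-in `task="translate"` only targets English), but the English-to-Spanish direction has to come from elsewhere, e.g. prompting Mistral to answer in Spanish, plus a Spanish-capable TTS voice. A sketch using the openai-whisper package (model size and file path are placeholders):

    ```python
    def whisper_task(language: str) -> str:
        """Pick the Whisper task. 'translate' always targets English,
        so translation is only available *into* English."""
        return "transcribe" if language == "en" else "translate"

    def speech_to_english(model, path: str, language: str) -> str:
        """Non-English speech -> English text; English speech is just transcribed."""
        result = model.transcribe(path, language=language, task=whisper_task(language))
        return result["text"]

    if __name__ == "__main__":
        import whisper                      # pip install openai-whisper
        model = whisper.load_model("base")  # downloads weights on first use
        print(speech_to_english(model, "input_es.wav", "es"))
    ```

    For fully Spanish-in / Spanish-out, you could also skip translation entirely: transcribe with `language="es"` and let the LLM converse in Spanish, assuming the TTS supports it.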

  • @SaveTheHuman5
    @SaveTheHuman5 3 months ago +5

    Hello, can you please tell us what your CPU, GPU, RAM, etc. are?

  • @kleber1983
    @kleber1983 2 months ago +1

    Hi, I'd like to know the computer specs required to run your speech-to-speech system. I'm quite interested, but I first need to know if my computer can handle it. Thanks.

  • @yoagcur
    @yoagcur 4 months ago +1

    Fascinating. Any chance you could upgrade it so that specific voices could be used and a recording made automatically? Could make for some interesting Biden v Trump debates.

  • @fatsacktony1
    @fatsacktony1 3 months ago

    Could you get it to read information and context from a video game, like X4: Foundations, so that you could ask it, like a personal assistant, to help you manage your space empire?

  • @MelindaGreen
    @MelindaGreen 3 months ago +2

    I'm daunted by the idea of setting up these development systems just to use a model. Any chance people can bundle them into one big executable for Windows and iOS? I sure would love to just load-and-go.

  • @LFPGaming
    @LFPGaming 4 months ago +2

    Do you know of any offline/local way to do translations? I've been searching but haven't found a way to do local translations of video or audio using large language models.

    • @deltaxcd
      @deltaxcd 3 months ago +1

      There is a program, "Subtitle Edit", which can do that.

  • @arkdirfe
    @arkdirfe 3 months ago

    Interesting, this is similar to a small project I made for myself. But instead of a chatbot conversation, the Whisper output is fed into SAM (yes, the funny robot voice) and sent to an audio output. Basically, it makes SAM say whatever I say with a slight delay. I'm chopping up the speech into small segments so it can start transcribing while I speak for longer, which introduces occasional weirdness, but I'm fine with that.

  • @Embassy_of_Jupiter
    @Embassy_of_Jupiter 3 months ago +7

    This gave me an interesting idea. One could build streaming LLMs that at least partially build thoughts one word at a time (I mean the input, not the output).
    Basically, precompute most of the final embedding from an unfinished sentence, and once it has the full sentence and it's time to answer, it only has to go through just a few very low-latency, very cheap layers.
    Different but related idea: similarly, you could actually feed unfinished sentences into Mistral with a prompt that says "this is an unfinished sentence, say INTERRUPTION if you think it is an appropriate time to interrupt the speaker", to make the voice bot interrupt you, like a normal person would. It would make it feel much more natural.
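    The interruption idea can be prototyped with nothing more than a side prompt: after each partial transcript, ask the model whether now is a reasonable moment to jump in, and only take the turn when it says so. A minimal sketch (the prompt wording, the INTERRUPT token, and the `ask_llm` callable are all illustrative, and how reliably a 7B model follows this is untested):

    ```python
    INTERRUPT_PROMPT = (
        "The user is still speaking. Here is their unfinished sentence so far:\n"
        "---\n{partial}\n---\n"
        "Reply with exactly INTERRUPT if this is an appropriate moment to "
        "interrupt the speaker, or WAIT otherwise."
    )

    def should_interrupt(partial_transcript: str, ask_llm) -> bool:
        """Ask the model whether to barge in on an unfinished utterance.

        `ask_llm` is any text-in/text-out completion function, e.g. a call
        to a local model served by LM Studio.
        """
        answer = ask_llm(INTERRUPT_PROMPT.format(partial=partial_transcript))
        return answer.strip().upper().startswith("INTERRUPT")
    ```

    Because this classifier call is much shorter than a full reply, it can run on every partial transcript without adding much latency.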

    • @deltaxcd
      @deltaxcd 3 months ago

      Actually, AI can do that: you can feed it a partial prompt, let it process it, then add more and continue from where you left off. That's a huge speedup.
      But prompt processing is pretty fast anyway.
      To make it respond faster, you need to let it speak before it finishes "thinking".

  • @josephtilly258
    @josephtilly258 1 month ago

    Really interesting. A lot of it I can't understand because I don't know coding, but speech to speech could be a big thing within a few years.

  • @JohnSmith762A11B
    @JohnSmith762A11B 4 months ago +4

    I wonder if you are (or can, if not) caching the processed .mp3 voice model after the speech engine processes it and turns it into partials. That would cut out a lot of latency if it didn't need to process those 20 seconds of recorded voice audio every time. Right now it's pretty fast, but the latency still sounds more like they are using walkie-talkies than speaking on a phone.
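    Caching like this is straightforward because the reference-voice embedding depends only on the reference file: compute it once, key it by a hash of the audio bytes, and reload it on later runs. A sketch with a generic `extract_embedding` callable standing in for OpenVoice's tone-color extraction (the exact OpenVoice API is not shown here):

    ```python
    import hashlib
    import pickle
    from pathlib import Path

    def cached_embedding(audio_path: str, extract_embedding,
                         cache_dir: str = ".voice_cache"):
        """Return the voice embedding for `audio_path`, computing it at most once.

        `extract_embedding` is whatever function turns reference audio into an
        embedding; it only runs on a cache miss, so the ~20 s reference clip
        is not reprocessed on every run.
        """
        digest = hashlib.sha256(Path(audio_path).read_bytes()).hexdigest()
        cache_file = Path(cache_dir) / f"{digest}.pkl"
        if cache_file.exists():
            return pickle.loads(cache_file.read_bytes())   # cache hit
        embedding = extract_embedding(audio_path)          # expensive path
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_bytes(pickle.dumps(embedding))
        return embedding
    ```

    Keying on the content hash rather than the filename means editing the reference clip automatically invalidates the cache.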

    • @levieux1137
      @levieux1137 4 months ago +3

      It could go way further by using the native libs and dropping all the Python-based wrappers that pass data between stages using files and that copy, copy, copy and recopy data all the time. For example, llama.cpp is clearly recognizable in the lower layers; all the tunable parameters match it. I don't know about OpenVoice, for example, but the state the presenter arrived at shows that we're pretty close to reaching a DIY conversational robot, which is pretty cool.

    • @JohnSmith762A11B
      @JohnSmith762A11B 4 months ago

      @levieux1137 By native libs, do you mean the system TTS speech on, say, Windows and macOS?

    • @levieux1137
      @levieux1137 4 months ago +2

      @JohnSmith762A11B Not necessarily that; I'm speaking about the underlying components that are used here. In fact, if you look, this is essentially Python code built as a wrapper on top of other parts that already run natively. The llama.cpp server, for example, is apparently used here. And once wrapped into layers and layers, it becomes heavy to transport content from one layer to another (particularly when passing via files, but even memcpy is expensive). It might even be possible that some elements are re-loaded from scratch and re-initialized after each sentence. The Python script here appears to be mostly a wrapper around all such components, working like a shell script: recording input from the microphone to a file, sending it to OpenVoice, sending that output to a file, then loading another component with that file, etc. This is just like a shell script working with files and heavy initialization at every step. Dropping all that layering and directly using the native APIs of the various libs and components would be way more efficient. And it's very possible that past a point the author will discover that Python is not needed at all, which could suddenly offer more possibilities for lighter embedded processing.

  • @irraz1
    @irraz1 25 days ago +1

    Wow! I would love to have such an assistant to practice languages. The "python hub" code, do you plan to share it at some point?

  • @jacoballessio5706
    @jacoballessio5706 3 months ago

    I wonder if you could directly convert embeddings to speech to skip text inference

  • @musumo1908
    @musumo1908 4 months ago

    Hey, cool... any way to run this self-hosted for an online speech-to-speech setup? I want to drop this into a chatbot project. What membership level gives access to the code? Thanks

  • @squiddymute
    @squiddymute 3 months ago +1

    no API = pure genius

  • @mastershake2782
    @mastershake2782 3 months ago

    I am trying to clone a voice from a reference audio file, but despite following the standard process, the output doesn't seem to change according to the reference. When I change the reference audio to a different file, there's no noticeable change in the voice characteristics of the output. The script successfully extracts the tone color embeddings, but the conversion process doesn't seem to reflect these in the final output. I'm using the demo reference audio provided by OpenVoice (male voice), but the output synthesized speech remains in a female voice, typical of the base speaker model. I've double-checked the script, model checkpoints, and audio file paths, but the issue persists. If anyone has encountered a similar problem or has suggestions on what might be going wrong, I would greatly appreciate your insights. Thank you in advance!

  • @enriquemontero74
    @enriquemontero74 4 months ago

    Can I configure it in Spanish, so that Mistral speaks Spanish and OpenVoice works in Spanish? I would like to confirm this before joining as a member to access the GitHub and try to make it work, since my native language is Spanish. Thank you for your work, it is incredible; you deserve many more followers. Keep it up.

  • @googlenutzer3384
    @googlenutzer3384 3 months ago

    Is it also possible to adjust it to different languages?

  • @ExploreTogetherYT
    @ExploreTogetherYT 3 months ago

    How much RAM do you need to run Mistral 7B locally? Using GPU or CPU?

  • @EpicFlow
    @EpicFlow 3 months ago

    Looks interesting, but where is this community link you mentioned? :)

  • @LadyTink
    @LadyTink 3 months ago

    Kinda feels like something the "Rabbit R1" does,
    with the whole fast speech-to-speech thing

  • @JG27Korny
    @JG27Korny 4 months ago

    I run oobabooga with Silero plus Whisper, but those take forever to make voice from text, especially Silero.

  • @skullseason1
    @skullseason1 3 months ago

    How can I do this with the Apple M1? This is soooo awesome, I need to figure it out!

  • @MegaMijit
    @MegaMijit 3 months ago

    This is awesome, but the voice could use some fine-tuning to sound more realistic

  • @gabrielsandstedt
    @gabrielsandstedt 4 months ago +7

    If you are fine venturing into C# or C++, then I know how you can improve the latency and create a single .exe that includes all of your different parts here, including using local models for the Whisper voice recognition. I have done this myself using LLamaSharp for running the GGUF file, and then embedding all external Python into a batch process which it calls.

    • @matthewfuller9760
      @matthewfuller9760 1 month ago +1

      Code on GitHub?

    • @gabrielsandstedt
      @gabrielsandstedt 1 month ago +2

      @matthewfuller9760 I should put it there, actually. I have been jumping between projects lately without sharing much. Will send a link when it is up.

    • @matthewfuller9760
      @matthewfuller9760 1 month ago

      @gabrielsandstedt Cool

  • @matthewfuller9760
    @matthewfuller9760 1 month ago

    I think even at 1/3 the speed, with my RTX Titan, it would run just fine for learning a new language. Waiting 3 seconds is perfectly acceptable as a novice language learner.

  • @JohnGallie
    @JohnGallie 3 months ago +1

    Is there any way you can give the Python process 90% of system resources so it would be faster?

  • @weisland2807
    @weisland2807 3 months ago

    It would be funny if you had this in games, like the people on the streets of GTA having convos fueled by something like this. Maybe it's already happening, though; I'm not in the know. Awesomesauce!

  • @ProjCRys
    @ProjCRys 4 months ago +1

    Nice! I was about to create something like this for myself, but I still couldn't use OpenVoice because I keep failing to run it in my venv instead of conda.

    • @Zvezdan88
      @Zvezdan88 4 months ago

      How do you even install OpenVoice?

  • @_-JR01
    @_-JR01 3 months ago

    Does OpenVoice perform better than Whisper's TTS?

  • @Ms.Robot.
    @Ms.Robot. 3 months ago

    ❤❤❤🎉 nice

  • @tag_of_frank
    @tag_of_frank 2 months ago

    Why LM Studio over Oobabooga? What are the pros/cons of each? I have been using Ooba, but wonder why one might switch.

  • @64jcl
    @64jcl 3 months ago

    Surely the response time is a function of what rig you are doing this on. An RTX 4080, as you have, is no doubt a major contributor here, and I would guess you have a beast of a CPU and high-speed memory on a newer motherboard.

  • @suminlee6576
    @suminlee6576 3 months ago

    Do you have a video showing how to do this step by step? I was going to become a paid member, but I couldn't find the how-to video in your paid channel.

  • @Yossisinterests-hq2qq
    @Yossisinterests-hq2qq 3 months ago

    Hi, I don't have talk.py; is there another way of running it that I'm missing?

  • @mickelodiansurname9578
    @mickelodiansurname9578 3 months ago

    Can the LLM handle being told in a system prompt that it will be taking in sentences in small chunks, say cut up into 2-second audio chunks per transcript? Can the Mistral model do that? Anyway, if so, you might even be able to get it to 'butt in' to your prompt. Now that's low latency!

    • @deltaxcd
      @deltaxcd 3 months ago

      No, it can't be told that, but it is not necessary.
      Just feed it the chunk, and then if the user speaks before it manages to reply, restart and feed it more.

  • @fire17102
    @fire17102 4 months ago +2

    Would love to see some realtime animations to go with the voice. Could be a face, but could also be minimalistic (like the Rabbit R1).

    • @wurstelei1356
      @wurstelei1356 4 months ago

      You need a second GPU for this, let's say running Stable Diffusion. Displaying a robot face with emotions would be nice.

    • @leucome
      @leucome 4 months ago

      Try Amica AI. It has a VRM 3D/VTuber character and multiple options for the voice and the LLM backend.

    • @fire17102
      @fire17102 2 months ago

      @leucome Does it work locally in real time?

    • @fire17102
      @fire17102 2 months ago

      @wurstelei1356 Again, I think a minimalistic animation would also do the trick, or pre-rendering the images once and using them in the appropriate sequence in realtime.

    • @leucome
      @leucome 2 months ago +1

      @fire17102 Yes, it can work in real time locally as long as the GPU is fast and has enough VRAM to run the AI + voice. It can also connect to an online service if required. I uploaded a video where I play Minecraft and talk to the AI at the same time, with all the components running on a single GPU.

  • @OdikisOdikis
    @OdikisOdikis 3 months ago

    The predefined answer timing is what makes it not a real conversation. It should answer at random timings; a human can think about something and only then answer. Randomizing the timings would create more realistic conversations.

  • @witext
    @witext 2 months ago

    I look forward to an actual speech-to-speech LLM, without any speech-to-text translation layers: pure speech in and speech out. It would be revolutionary, imo.

  • @mertgundogdu211
    @mertgundogdu211 23 days ago

    How can I try this on my computer?? I couldn't find the talk.py in the GitHub code??

  • @MrScoffins
    @MrScoffins 3 months ago +2

    So if you disconnect your computer from the Internet, will it still work?

    • @jephbennett
      @jephbennett 3 months ago +1

      Yes, this code package is not calling APIs (which is why the latency is low), so it doesn't need an internet connection. The downside is it cannot access info outside of its core dataset, so no current events or anything like that.

  • @aboudezoa
    @aboudezoa 3 months ago

    Running on a 4080 🤣 makes sense; the damn thing is very fast

  • @NirmalEleQtra
    @NirmalEleQtra 5 days ago

    Where can I find the whole GitHub repo?

  • @binthem7997
    @binthem7997 3 months ago

    Great tutorial, but I wish you would share gists or your code

  • @deltaxcd
    @deltaxcd 3 months ago

    I think to decrease latency more, you need to make it speak before the AI finishes its sentence.
    Unfortunately, there is no obvious way to feed it a partial prompt, but waiting until it finishes generating the reply takes way too long.

  • @Stockholm_Syndrome
    @Stockholm_Syndrome 4 months ago

    BRUTAL! hahaha

  • @tijendersingh5363
    @tijendersingh5363 4 months ago

    Just wao

  • @microponics2695
    @microponics2695 3 months ago +1

    I have the same uncensored model, and when I ask it to list curse words it says it can't do that. ???

    • @jungen1093
      @jungen1093 3 months ago

      Lmao, that's annoying

  • @ArnaudMEURET
    @ArnaudMEURET 3 months ago

    Just to paraphrase your models: "Dude! Are you actually grabbing the gorram scrollbars to scroll down an effing window!? What is this? 1996? Ever heard of a mouse wheel? You know it's even emulated by double drag on track pads, right?" 🤘

  • @alexander191297
    @alexander191297 3 months ago +1

    I swear on my mother's grave lol... this AI is hilarious! 😂😂😂

  • @aestendrela
    @aestendrela 4 months ago +2

    It would be interesting to make a real-time translator. I think it could be very useful. The language barrier would end.

    • @deltaxcd
      @deltaxcd 3 months ago

      Meta did it already; they created a speech-to-speech translation model.

  • @JohnGallie
    @JohnGallie 3 months ago

    You need to get out more, man, lol. That was toooo much!

  • @MetaphoricMinds
    @MetaphoricMinds 3 months ago +1

    What GPU are you running?

  • @ayatawan123
    @ayatawan123 3 months ago

    This made me laugh so hard!

  • @jeffsmith9384
    @jeffsmith9384 3 months ago

    I would like to see how a chat room full of different models would problem-solve... ChatGPT + Claude + *-7B + Grok + Bard, all in a room, trying to decide what you should have for lunch

  • @Nursultan_karazhigit
    @Nursultan_karazhigit 3 months ago +1

    Thanks. Is the Whisper API free?

    • @m0nxt3r
      @m0nxt3r 15 days ago

      It's open source

  • @ajayjasperj
    @ajayjasperj 3 months ago

    We can make YouTube content with those conversations between bots 😂❤

  • @BrutalStrike2
    @BrutalStrike2 4 months ago +1

    Jumanji Alan

  • @NoLimitYou
    @NoLimitYou 3 months ago +64

    Too bad you take open source and make it closed.

    • @mblend27
      @mblend27 3 months ago +1

      Explain?

    • @NoLimitYou
      @NoLimitYou 3 months ago

      @mblend27 You take code that is openly available and ask people to become a member to receive the code of what you demo using the open-source code. The whole idea of open source is that everyone contributes without putting it behind walls.

    • @Ms.Robot.
      @Ms.Robot. 3 months ago +3

      You can in several ways.

    • @NoLimitYou
      @NoLimitYou 3 months ago +8

      You take open source, make something with it, and put it behind a wall.

    • @TheGrobe
      @TheGrobe 3 months ago

      @mblend27 You make someone pay to access something on GitHub that you composed of open-source components.

  • @TheRottweiler_Gemii
    @TheRottweiler_Gemii 11 days ago

    Has anybody gotten this working and has code or a link they can share, please?

  • @VitorioMiguel
    @VitorioMiguel 4 months ago

    Try faster-whisper. Open source and faster

  • @mickelodiansurname9578
    @mickelodiansurname9578 3 months ago +1

    AI: "We got some rich investors on board, dude, and they're willing to back us up!"
    I think this script just announced the games commencing in the 2024 US election... [not in the US, so reaches for popcorn]

  • @robertgoldbornatyout
    @robertgoldbornatyout 2 months ago

    Could make for some interesting Biden v Trump debates

  • @kritikusi-666
    @kritikusi-666 4 months ago +1

    The voices are meh... cool project, tho. You always have some fire content. You could train an LLM just off your content and be set, haha.

  • @jerryqueen6755
    @jerryqueen6755 1 month ago +1

    How can I install this on my PC? I am a member of the channel.

    • @AllAboutAI
      @AllAboutAI  1 month ago

      Did you get the GH invite?

    • @jerryqueen6755
      @jerryqueen6755 1 month ago

      @AllAboutAI Yes, thanks

    • @miaohf
      @miaohf 1 month ago

      @AllAboutAI I am a member of the channel too; how do I get the GH invite?

  • @picricket712
    @picricket712 20 days ago

    Can someone please give me that source code?

  • @wurstelei1356
    @wurstelei1356 4 months ago +4

    Sadly this video has fewer hits than it should have. I am looking forward to a more automated version of this. Hopefully the low view count won't hinder it.

  • @javedcolab
    @javedcolab 3 months ago +1

    Do not make the AI lie to your face, man. Thankfully this is local.

  • @MetaphoricMinds
    @MetaphoricMinds 3 months ago

    Dude just made a JARVIS embryo.

  • @lokiwhacker
    @lokiwhacker 3 months ago +2

    Thought this was really cool; love open source. But this really isn't open source if you're hiding it behind a paywall... smh

  • @artisalva
    @artisalva 3 months ago

    Haha, AI conversations could have their own channels

  • @calvinwayne3017
    @calvinwayne3017 3 months ago

    Now add a MetaHuman and Audio2Face :)

  • @Edward_ZS
    @Edward_ZS 4 months ago

    I don't see Dan.mp3

  • @Canna_Science_and_Technology
    @Canna_Science_and_Technology 4 months ago +1

    It is funny.

  • @laalbujhakkar
    @laalbujhakkar 1 month ago

    How is a system that goes out to OpenAI "local"????????

    • @seRko123
      @seRko123 17 days ago

      OpenAI's Whisper runs locally.

  • @smthngsmthngsmthngdarkside
    @smthngsmthngsmthngdarkside 3 months ago +2

    So where's the source code, mate?
    Or is this just a hook for your newsletter marketing and crap website?

    • @Skystunt123
      @Skystunt123 6 days ago

      Just a hook; the code is not shared.

  • @asdasdaa7063
    @asdasdaa7063 4 months ago

    It's great, but OpenVoice doesn't allow commercial use. It would be nice to do this with a model that allows it.

  • @thygrrr
    @thygrrr 3 months ago

    WORST OPSEC for a hacker. :D