Thorsten-Voice
Germany
Joined Nov 12, 2013
Guude! (hi, nice to see you) 👋,
i'm Thorsten 😊.
You like open-source, privacy-aware, locally running voice technology? Me too 😎. You'll find cooking-recipe-like tutorials on TTS, STT, voice assistants, AI, ML and way more cool stuff here. So hop on and join my amazing community 🥰.
#opensource #voice #cloning #technology #news #tutorial #local #privacy #tech #tts #stt #voiceassistant #raspberrypi #smarthome #homeassistant
* My project website: www.Thorsten-Voice.de
* Me on GitHub: github.com/thorstenMueller
F5 Text to Speech Tutorial | Hit "Refresh" on Your AI Voice!
🔥🔥🔥 Impressive voice cloning with F5 TTS! Clone your voice with a few seconds of audio data for your personal AI voice. Step-by-step tutorial.
For comparison, here are my computer specs:
* CPU: 4x Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
* RAM: 16GB
* GPU: NVIDIA GeForce GTX 1050 Ti
Based on some comments, you might want to watch it at 1.5x speed 😁.
#F5TTS #textToSpeech #AIVoiceCloning #FreeTextToSpeech
00:00 Intro
01:45 Overview of F5 TTS
04:58 Local install of F5 TTS
08:55 Using F5 TTS voice cloning
12:20 "Neutral" voice cloning
15:00 More "emotional" voice cloning
16:30 Multispeech test | dialogue with different styles
22:45 Voice chat (LLM with your personal voice)
* github.com/swivid/F5-TTS/
* huggingface.co/SWivid/F5-TTS
* huggingface.co/spaces/mrfakename/E2-F5-TTS
* arxiv.org/pdf/2410.06885
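For reference, the local install shown in the video boils down to a few commands; hedged, per the F5-TTS README at the time (package and command names may have changed since): create and activate a Python venv, run pip install f5-tts, then launch the local web UI with f5-tts_infer-gradio and open the printed URL in your browser.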
Please subscribe to my channel 😊.
ua-cam.com/users/ThorstenMueller
---
- www.Thorsten-Voice.de
- github.com/thorstenMueller/Thorsten-Voice/
Views: 2,084
Videos
3 steps to run HuggingFace 🤗 "Parler TTS" AI Voice on your local machine
4.6K views · 1 month ago
How to run "Parler TTS" from @HuggingFace on your local machine in 3 simple steps (using Python code)! Including audio samples. #python #parler #tts #huggingface
00:00 Intro
02:22 Parler TTS Github repo
03:10 Dataset basis for Parler TTS
05:40 Huggingface space to try it out
06:20 Set up Python venv for Parler TTS & Install
09:47 Using python script to synthesize audio
14:45 Synthesizing audio ...
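A minimal synthesis sketch, roughly following the Parler TTS README (the model name and the voice description are the README's examples, not necessarily what the video uses):

import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"  # the text to speak
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively."

# The description steers the voice characteristics; the prompt is the spoken text.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("parler_tts_out.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)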
Best AI Voice Generator | 2024.08
17K views · 3 months ago
Free #TTS with #Mars5 #Parler #MetaVoice #Toucan and #ChatTTS. First look and comparison video on voice cloning and more. Thanks to these great #opensource text-to-speech projects and @HuggingFace for providing cool spaces to play around with 🤗. And thank you "VB" for pointing to these cool projects on LinkedIn 👏: www.linkedin.com/posts/vaibhavs10_text-to-speech-ecosystem-has-been-booming-activit...
Automate Voice Dataset Creation Using Whisper AI
1.8K views · 4 months ago
Easy tutorial on creating a structured voice dataset from raw audio data using Python and Whisper by OpenAI for speech recognition. #ai #whisper #tts #voice #data #python
00:00 Intro
01:10 Set up python virtual environment
03:00 Working with "the magic" script :)
07:00 Run voice dataset generation with Whisper AI STT
07:58 Checking results
09:45 Outro
* github.com/thorstenMueller/Audio-to-Voice-D...
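The core step of such a script is a plain Whisper transcription call; a minimal sketch (model size, file name and language here are illustrative choices, not necessarily the video's):

import whisper

model = whisper.load_model("base")
result = model.transcribe("chunk_0001.wav", language="en")
print(result["text"].strip())  # this text becomes one metadata line of the dataset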
TTS Voice Dataset | LJSpeech | Voice Cloning
2.3K views · 4 months ago
A close look at the LJSpeech voice dataset and its structure for TTS voice cloning. The LJSpeech voice dataset is widely supported by TTS voice cloning software. The video describes the structure and how you can create it for your personal voice clone.
00:00 Intro
02:23 LJSpeech info and download
04:15 LJSpeech in research (Google Scholar)
05:17 Close look to the voice dataset file structure
06:25 ...
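For illustration, an LJSpeech-style dataset is just a folder of WAV files plus a pipe-separated metadata.csv with one line per clip (ID, raw transcription, normalized transcription); the layout below mirrors the public LJSpeech corpus, with the transcription shortened here:

wavs/LJ001-0001.wav
wavs/LJ001-0002.wav
...
metadata.csv:
LJ001-0001|Printing, in the only sense with which we are at present concerned,...|Printing, in the only sense with which we are at present concerned,...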
Unlock AI Superpowers with NVIDIA CUDA: Boost Performance in Python!
1.4K views · 5 months ago
Boost your AI performance by using NVIDIA CUDA on Windows. Step-by-step tutorial on how to use CUDA with Python / pytorch and a performance comparison with Coqui TTS. #performance #nvidia #python #ai #machinelearning #tts Please subscribe to my channel 😊. ua-cam.com/users/ThorstenMueller Thanks dear @MightyReiti for your inspiration and support on my new recording setup ❤️.
00:00 Intro
01:55 What...
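A quick sanity check that PyTorch actually sees your NVIDIA GPU; a minimal sketch, not the video's exact script:

import torch

print(torch.cuda.is_available())           # True if a CUDA-enabled PyTorch build and driver are installed
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce GTX 1050 Ti"

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(1024, 1024, device=device)  # allocate work directly on the selected device
y = x @ x                                  # the matrix multiply runs on GPU if available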
Home Assistant ❤️ Voice - Tutorial 05 - Wyoming protocol
4.1K views · 8 months ago
Home Assistant ❤️ Voice - Tutorial 04 - Piper TTS
6K views · 8 months ago
Home Assistant ❤️ Voice - Tutorial 03 - Conversation / NLP
1.4K views · 8 months ago
Home Assistant ❤️ Voice - Tutorial 02 - Text Assist
1.6K views · 8 months ago
Home Assistant ❤️ Voice - Tutorial 01 - Basic setup & demo entities
3.8K views · 8 months ago
Running a local Piper TTS server with Python on Linux
6K views · 9 months ago
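The basic idea can be sketched as a tiny HTTP wrapper around the piper CLI; this is an illustrative sketch, not the video's exact code (the model path and port are placeholders, and it assumes a downloaded piper binary that reads its input text from stdin):

import subprocess
from flask import Flask, request, send_file

app = Flask(__name__)
MODEL = "de_DE-thorsten-medium.onnx"  # placeholder: any downloaded Piper voice model

@app.route("/tts")
def tts():
    text = request.args.get("text", "")
    # piper reads the input text from stdin and writes a WAV file
    subprocess.run(["./piper", "--model", MODEL, "--output_file", "out.wav"],
                   input=text.encode("utf-8"), check=True)
    return send_file("out.wav", mimetype="audio/wav")

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)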
🔥 Voice interview Michael Hansen | HA | Raspberry | Piper | Rhasspy
1.9K views · 9 months ago
Local voice cloning with 6 seconds audio | Coqui XTTS on Windows
43K views · 11 months ago
🇩🇪 Künstliche Sprachausgabe uff Hessisch | Kostenlos und OHNE CLOUD !
1K views · 1 year ago
TEXT TO SPEECH | Piper TTS on Windows 🚀 AI voice 10x faster Realtime!
26K views · 1 year ago
XTTS FAQ | Interview with Josh Meyer from Coqui AI
2.2K views · 1 year ago
Python virtual environment / venv | Windows, Linux & Mac OS X
3K views · 1 year ago
Free voice recording for BEST voice cloning | Piper-Recording-Studio | Windows
9K views · 1 year ago
Is Mycroft Mark 2 the better Alexa?! | Private | Voice Assistant
3.6K views · 1 year ago
Create your AI digital voice clone locally with Piper TTS | Tutorial
48K views · 1 year ago
Increase Text to Speech pronunciation quality with eSpeak | Tutorial
11K views · 1 year ago
Talk locally (no ChatGPT) with your documents 😄 | PrivateGPT + Whisper + Coqui TTS
6K views · 1 year ago
Raspberry Pi | Local TTS | High Quality | Faster Realtime with Piper TTS
29K views · 1 year ago
Thorsten-Voice TTS in Windows nutzen | DDC / VITS
5K views · 1 year ago
Thorsten-Voice TTS in Linux nutzen | DDC / VITS / Piper
3.1K views · 1 year ago
Thorsten-Voice TTS in Mac OS X nutzen | DDC / VITS
1.5K views · 1 year ago
Freie "Thorsten" Stimme in HOME ASSISTANT lokal nutzen | Text-to-Speech/TTS | Tutorial
4.4K views · 1 year ago
Freie "Thorsten" Stimme in HOME ASSISTANT lokal nutzen | Text-to-Speech/TTS | Tutorial
Thorsten-Voice TTS in Raspberry Pi OS nutzen | Piper
1.5K views · 1 year ago
End of home automation/smarthome AND voiceassistant software?!
405 views · 1 year ago
I tried this out on an RTX 3600 12GB model and it's fast. Quicker than speaking, maybe 2x faster to process than to listen to. Sounds really good to me.
Thanks for your helpful comment and performance indicator on a 3600 👍🏻.
@ThorstenMueller I should have said it's paired with a Ryzen 2700. It's a pretty cheap rig now; I think you could buy both parts used for about 300 pounds on eBay: 30 pounds for the CPU and 270 for the GPU. Or wait a year and pick up a 3090 24GB for the same price, currently sitting around 500. I did pick up a 24GB Tesla (I forget the model number) from China for 300, which is good for really large LLMs. Thank you for showing me this, I have a project I can purposely upgrade now.
Can we run Piper TTS on a GPU using CUDA?
Not tried it myself. According to the Piper community there seem to be some active discussions on GPU/CUDA support. github.com/rhasspy/piper/issues?q=is%3Aissue+cuda
I'm getting this error, please help someone: "error: subprocess-exited-with-error. This error originates from a subprocess, and is likely not a problem with pip."
Hello Thorsten, thanks for your great channel. I came across these videos which show how one can train F5 with different languages: ua-cam.com/video/UO4usaOojys/v-deo.html ua-cam.com/video/RQXHKO5F9hg/v-deo.html As you are experienced with training speech models, I am wondering how many hours of material would be required to train a German-language model in good quality, and what should be considered in regards to training data. In the referenced videos the creator simply takes audiobooks. Can one expect to get a good quality model in this way?
Hi Thorsten, thank you for another excellent tutorial. I have installed F5 on a Raspberry Pi 5 and it generates very good quality output, but as expected it is very slow. I am trying to understand how F5 works: does it take a standard model and modify it in some way using the ref_text & audio before generating the desired output (gen_text)? Is there an intermediate stage that could be executed separately? Thanks, Ernie
How much time did it take to train one model? And how powerful is your system?
For my Thorsten-Voice Piper models I used an NVIDIA Jetson AGX device, and training 24/7 took around two months.
If anybody is encountering errors when installing TTS, try pip install coqui-tts (tts is deprecated as of November 2024).
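A minimal usage sketch after installing the maintained fork; the model name is one example from the Coqui model zoo, not the only option:

from TTS.api import TTS

tts = TTS("tts_models/de/thorsten/vits")  # downloads the model on first use
tts.tts_to_file(text="Hallo, das ist ein Test.", file_path="out.wav")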
I went down this rabbit hole myself last year. Alas! Had I watched this 6-minute video, I would've saved lots of disk space and time (weeks). Cheers! ✌
Thank you for your nice feedback and welcome to the "rabbit hole" 😄.
Hi, Thorsten, the community thrives because of people like you - thanks for your work!
Thank you for your very kind words 🥰
Thanks! Is it feasible to do all of that through scripted Python code?
Good point 👍🏻. I took a quick look but did not see an obvious solution for a native Python integration.
The greeting comes from Hessen in Germany, right? :D Funny intro and exactly what I am searching for :) Subscribed.
"Ei sicher" 😄 (Hessian for "sure thing"). Greetings back from Hessen and thanks for joining my community 😊.
What GPU do you have on your computer?
An NVIDIA 1050 Ti in this case.
May I ask what gpu you are using, or if it is using a gpu?
When you start Gradio the first time and the model is downloading, it shows that PyTorch is loading the models onto the CPU; I'll investigate that.
Correction: I'm running it on a 1080 Ti; it takes 16 seconds for 4 seconds of speech to synthesize. I don't know whether it's always re-analyzing the reference as well.
Okay, further investigation: I kept the output text the same but uploaded a longer reference, and it then also took longer to synthesize. So the total time comprises reference processing as well as synthesis. It would be interesting to see how much time mere synthesis would take...
If you use F5 on Hugging Face it will use whatever GPU is available at that moment. If you use it locally without CUDA (NVIDIA GPU) it will use the CPU.
For anyone coming here recently: the tts repo isn't maintained anymore according to an issue post on GitHub, and running 'pip install tts' results in an error. This fork worked for me instead: 'pip install coqui-tts'.
Thanks for that fork hint 👍🏻. Maybe it's an issue with a (too new) Python version.
Is there a way to use this with an NVIDIA GPU on Windows to speed up performance?
I didn't try it myself, but there's some discussion on CUDA (NVIDIA GPU) in their repository. Maybe you can find additional info there 😊. github.com/rhasspy/piper/issues?q=is%3Aissue+cuda
Can this be deployed and hosted on a server?
Yes, absolutely 😊.
I tried it and it works, but it did not sound like me. Nothing close to what you did. Not a fan at this time; it really should have done better. Thanks for sharing, you got my thumbs up...
Thanks for your "thumb up" and sorry to hear it didn't work for you as expected.
@ThorstenMueller Not your fault, you laid it out perfectly. It's probably the quality of my samples. Thanks again.
great stuff!
Haha, the F5 joke 😂. The progress is amazing, right? Still waiting for German support for F5... Anyway, in English it is now already easy to create synthetic voice datasets, for Piper for example. Just an idea 😊
H(ei) 👋, thanks for your nice comment 😊 and yes, progress is really impressive.
I need more Georgian voices; there is only the Natia (female) voice. How can I make them? Any tutorial?
If you have a usable voice dataset for Georgian, you can use this tutorial: ua-cam.com/video/b_we_jma220/v-deo.htmlsi=iRIGUkAKf_7gWkRF
@@ThorstenMueller thanks
Thanks for your video. F5 TTS is absolutely stunning! Let's hope they will include other languages (GERMAN) soon. ;)
An additional question: does the model "re-learn" the voice every time I want it to generate a sentence? Is there a way to learn the voice once and then use the trained model over and over again?
According to their community they are working on additional languages, including German 😊.
That was great!! Thanks for your content! I've got this running now and it is amazing!!
Thanks for your nice feedback 😊.
That whisper at the beginning really sounded like Stephan Molyneux?!!!
Great
Thank you 😊, I'm impressed by F5 too.
Try MaskGCT
You made a reference to your computer's speed. Care to elaborate on its GPU, CPU and RAM?
You're absolutely right, I forgot to add it to the description. Thanks to your hint, my computer specs are now in the description 😊.
I enjoyed the intro it made me laugh.
I'm happy you liked it 😊.
Hi Thorsten! I want to make a portfolio website where people can talk to me. I'd have a text-to-text model that knows everything about me, feeding a TTS of my own voice to speak each reply. My problem is hosting: I don't understand how the APIs of these TTS models work, or how I'd be able to host one, as most GPU hosting websites offer per-hour rates which seem very expensive... What do I do? Maybe I've got the wrong approach.
I also forgot to mention I do have a mini PC I can run 24/7, but it doesn't have a GPU.
1.5x speed is about right 😂
😆 I've heard this recently already. Maybe I should speed it up by 1.5x in post-production 😅🤔.
ChatGPT was right? 🥴
Could you please help me decide which model(s) (to use as a voiceover) sound the most natural and human for faceless YT videos?
Excellent work 👍 It's actually exactly what I was looking for. However, I'm absolutely computer-illiterate and hope a friend will help me install your voice.
I get the error message that Python cannot be installed because it is not available on the server...?
If you open a "terminal" in Mac OS X and run the command "python3 -V", what happens then?
@@ThorstenMueller Great, thanks for the reply. First message: the python3 command requires Command Line Tools. Second message: the software is currently not available on the software update server. I'm using your voice online for now; only one of the modules works there, but better than nothing. A buddy will help me soon, and maybe the missing tool can be obtained from another source. Despite the small problems (that's IT as I know it), your voice is the best I've heard so far!
Thank you very much for this excellent and very easy to follow tutorial!
Thanks for your amazing feedback 😊 and you are welcome.
I can't get good results with the Parler TTS demo whenever I let it speak a paragraph. It starts out fine but then it starts to drop more and more words, garbling up all the text into an incomprehensible mess. Very weird.
Might be an issue with the text input itself or the text length maybe. Have you tried shorter paragraphs as input text?
Love this channel 😊😊😊
Thanks a lot 😊
Really helpful video, thanks!
Thank you 😊
Thanks for the video. It finally works! A true man of honor 😀
Many thanks 😊
Very good video. Had I known you'd do the installation on Windows here, I would have saved myself two days of work :D
I'm very glad you liked the video, and I hope you don't miss the two lost days too much 😉.
@ Nah, I learned a lot along the way. But I ultimately switched to Ubuntu, since it doesn't work as well under Windows :/
Hi Thorsten, maybe the next "how to" could be training a Coqui TTS model based on Glow-TTS and the HiFiGAN vocoder?
Hi Thorsten, how many hours/steps did you spend training your DE dataset to get a usable model in Coqui TTS? I'm trying to do some model training with my dataset (35 minutes of audio), and I start hearing some voice at 10k steps, but it is far away from what I would like to get...
I used my Thorsten-Voice datasets containing over 20k recordings, and training took over two months (around 500k steps) on an NVIDIA Jetson AGX device. You might be able to hear better, more human-sounding results after 100k steps.
How can I be sure that all WAVs are used in the training process? For testing I have 650 segments (35 minutes), but I would like to prepare several hours and do not know if that is needed... do you have any suggestions?
What is the difference between a vocoder and TTS? Should I train a vocoder and then a model based on the same dataset WAVs? Will a model with a vocoder be better than one without?
First train the TTS model, then train a vocoder model for that TTS model.
Hi, would it be possible for you to create a tutorial on training a HiFi-GAN vocoder?
Hello Thorsten, could you please create a video showing the differences between tacotron2-DDC, VITS, Glow-TTS...?
I am severely disappointed, as I just wanted to install my native tongue, HEBREW, for the Firefox read-aloud add-on, and scrolled the language list up and down without finding it. As a Jewish woman born in Israel, whose family elders (may they rest in peace) survived the horrors of the WW2 Holocaust, and who grew up holding the hand of my dear aunt Paula, my nanny, with that number carved into her arm, the absence stings, and the events of 7.10.2023 have made that feeling of being targeted vivid and present again.
So I complain: no language available for my nation. No Hebrew. How come? My tongue and nation exist just like other nations that got their own. I insist HEBREW should be plentiful, with at least 5 voices to choose from. It may lack constant usage, but it is alive and kicking. Not all Israelis are Jews, and to my view we should mingle more between ourselves, but all Israelis call this imperfect collective home our home. Please make plenty of Hebrew speech engines available ASAP. We are no less than other nationalities; we try our best in a biased world.
Hello Thorsten, have you had time to implement this yet: "textCleaned": text.lower() # TODO: Add textcleaner library (multilanguage support) in your script?
No, not yet 🙃. If you have a good idea feel free to send a pull request.
@@ThorstenMueller I have no programming skills, but this is what I modified in your script to use the GPU and a Polish dataset:

from pydub import AudioSegment
from pydub.silence import split_on_silence
import pandas as pd
import os
import glob
import whisper
import torch
import cleantext  # import text cleaning library

# Initialize the Whisper model
model = whisper.load_model("large-v3")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32  # computed but not used below
model.to(device)

# Directory containing the input WAV files
input_dir = "./data_to_LJSpeech"

# Output directory
output_dir = "output"
audio_dir = os.path.join(output_dir, "audio")
if not os.path.exists(audio_dir):
    os.makedirs(audio_dir)

metadata = []

# Parameters for silence-based splitting
min_silence_len = 500  # minimum length of silence (in ms) to be used for a split
keep_silence = 200     # amount of silence (in ms) to leave at the beginning and end of each chunk

# Get the list of all WAV files in the directory
wav_files = sorted(glob.glob(os.path.join(input_dir, "*.wav")))
total_files = len(wav_files)  # Total number of files to process

for idx, wav_file in enumerate(wav_files, start=1):
    # Load audio file
    print(f"--> Processing file {idx}/{total_files}: {wav_file}")
    audio = AudioSegment.from_wav(wav_file)

    # Calculate silence threshold for the current file
    silence_thresh = audio.dBFS - 14

    # Split the audio into chunks based on silence
    audio_chunks = split_on_silence(audio, min_silence_len=min_silence_len,
                                    silence_thresh=silence_thresh, keep_silence=keep_silence)

    # Transcribe each chunk and save with metadata
    for i, chunk in enumerate(audio_chunks):
        # Export chunk as temporary wav file
        chunk_path = os.path.join(output_dir, f"chunk_{i}.wav")
        chunk.export(chunk_path, format="wav")

        # Transcribe chunk in Polish language
        result = model.transcribe(chunk_path, language="pl")  # set language to Polish

        # Get the transcribed text
        text = result['text'].strip()

        # Save chunk with unique ID
        sentence_id = f"LJ{str(len(metadata) + 1).zfill(4)}"
        sentence_path = os.path.join(audio_dir, f"{sentence_id}.wav")
        chunk.export(sentence_path, format="wav")

        # Add metadata, including cleaned text
        metadata.append({
            "ID": sentence_id,
            "text": text,
            "textCleaned": cleantext.clean(text, extra_spaces=True, lowercase=True)  # text cleaning
        })

        # Remove temporary chunk file
        os.remove(chunk_path)

# Create metadata.csv file with audio file IDs and corresponding sentences
metadata_df = pd.DataFrame(metadata)
metadata_csv_path = os.path.join(output_dir, "metadata.csv")
metadata_df.to_csv(metadata_csv_path, sep="|", header=False, index=False)

print(f"Processed {len(metadata)} sentences.")
print(f"CSV file saved at {metadata_csv_path}")

And I think it would be good to save each transcribed WAV's line to the CSV as you go, because last night my PC crashed and I lost almost 3k transcription lines :(
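On the crash-safety wish above: a hedged sketch of appending each metadata row to the CSV as soon as it is transcribed, instead of once at the end (the helper name is made up for illustration):

import csv

def append_metadata_row(csv_path, sentence_id, text, text_cleaned):
    # Append one pipe-separated, LJSpeech-style row (no header) immediately,
    # so already-finished work survives a crash; call this right after each
    # model.transcribe(...) result instead of collecting everything in memory.
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerow([sentence_id, text, text_cleaned])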
I'm also interested in voice cloning; please do share your knowledge in coming videos.
Have you already seen my Piper or Coqui voice cloning tutorials?
Hello Thorsten, does your script let you specify the model name Whisper should use, and the language for the model? Can you update it? How can I use it on my RTX card? Update: I have updated your script to use the RTX by adding this code:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model.to(device)
but first, torch with CUDA enabled needs to be installed...
What I really need is the ability to create Unreal Engine projects that do what Convai can do, but run locally within the project, so I don't have to pay a lot of money. Ideally running Gemma 2 9B and a good TTS with the possibility for voice cloning that are all part of the Unreal Engine game project.
Hi! Thanks for the video, it was really helpful for anyone looking to get into voice cloning. However, I have a question. I followed all the steps, and I noticed that after exporting, the .wav files generated in the wavs folder have a sample rate of 48000 Hz. However, for fine-tuning a pre-trained Piper model, it seems that .wav files with a sample rate of 22050 Hz are required. My question is: are the 48000 Hz audio files still acceptable for fine-tuning, or should I convert them to 22050 Hz before proceeding?
Thanks for your nice feedback. IMHO you should downsample to 22050 Hz to fine-tune an existing Piper model that uses a 22050 Hz sample rate.
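A minimal downsampling sketch using pydub (the same library as the dataset script above; file names are placeholders):

from pydub import AudioSegment

audio = AudioSegment.from_wav("recording_48k.wav")
audio = audio.set_frame_rate(22050).set_channels(1)  # Piper fine-tuning here expects 22,050 Hz mono
audio.export("recording_22k.wav", format="wav")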