It's great that you made this tutorial for the Windows community. I personally use Linux to train my models, but it's awesome that you're making an effort to strengthen the Windows open voice community.
Yes, personally I use Linux for training, too. But model training on Windows has been requested quite often.
@@user-wc2jy4jr7r Not sure if I got you right. Do you mean "SAPI" in the context of Windows' integrated TTS voices?
Thanks for the video. It finally works! A true legend 😀
Many thanks 😊
Thank you for this tutorial and your entire audio series. I started out with Tortoise, which was too slow for me. Then I found Coqui and your public voice model, which is also really good and intelligible, and with the 0.41 factor it's also super fast for me. For my use case, however, it still produced too many odd pronunciations of proper names. Thanks to this video I could finally create my own voice model that is fully adapted to the requirements of storytelling.
It still sounds a bit shaky here and there and has only 100k steps so far (with a growing amount of audio material), but it is already improving.
Due to the recording conditions and my unfortunately not-so-great narrator voice, I even get a loss of 26-36%, so there is still plenty of room for tuning.
For everyone interested in the stats, in case you want to do something similar:
Specs: RTX 2070, i7-10900K, Samsung Evo 970
Step time: 0.5-0.6 s
Batch size (you can go higher): 20
checkpoint_steps: 1000 (just because I am lazy and train during idle periods, school work etc., so I don't have to wait for 10,000)
Audio dataset:
Specs: HyperX (the RGB one, I don't remember the exact model) with pop filter, relatively big room
I can't make a general statement here; if you start with the full ("total") dataset right away you will get faster results. I trained in stages with an increasing number of audio files:
0-5k: about 230 files ~ 0.4h
5-10k: about 350 files ~ 0.6h
10-30k: about 500 files ~ 1h
30-60k: about 800 files ~ 1.6h
60k-100k: about 1200 files ~ 2h
Current total: 1200 files ~ 2h
Milestones:
from 10k: first signs of something beyond noise, but still not understandable
from 20k: first words recognizable without knowing the text
from 30-40k: understandable text (but nowhere near natural speech)
from 80k: it's okay :)
* Please note, however, that I used books and book excerpts as input, with many proper names and Denglish (German mixed with English words). This makes the training process slower in any case and generally worse (though very good in the trained areas, e.g. proper names).
Recording:
For the recordings I wrote a Python script that automatically splits the text of a text file into sentences (ignoring sentences below 5 words) and displays them one by one. Recording started automatically and stopped as soon as the sound stayed below 50 dB for one second. The audio was then trimmed so that everything below 50 dB at the front and back is dropped (to guarantee that speech starts instantly), padded with 50 ms of silence, normalized, and saved in LJSpeech format.
A delete function is included.
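If anyone wants to rebuild the trim/pad/normalize stage, here is a rough illustrative sketch (not the exact script; it assumes librosa and soundfile are installed, and note that librosa's top_db threshold is relative to peak rather than an absolute dB level):

import os
import librosa
import numpy as np
import soundfile as sf

def preprocess(in_wav, out_dir, utt_id, text, top_db=50):
    # Load at the target training samplerate (22.05 kHz here).
    y, sr = librosa.load(in_wav, sr=22050)
    # Trim leading/trailing audio quieter than top_db below peak.
    y, _ = librosa.effects.trim(y, top_db=top_db)
    # Pad 50 ms of silence at front and back.
    pad = np.zeros(int(0.05 * sr), dtype=y.dtype)
    y = np.concatenate([pad, y, pad])
    # Peak-normalize, leaving a little headroom.
    y = y / max(float(np.abs(y).max()), 1e-9) * 0.95
    os.makedirs(os.path.join(out_dir, "wavs"), exist_ok=True)
    sf.write(os.path.join(out_dir, "wavs", utt_id + ".wav"), y, sr)
    # LJSpeech metadata line: id|raw text|normalized text
    with open(os.path.join(out_dir, "metadata.csv"), "a", encoding="utf-8") as f:
        f.write(utt_id + "|" + text + "|" + text + "\n")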
Thanks for sharing your great setup and training step times 👏😊. This will help other users for sure. I agree that pronouncing foreign words is still a challenge.
How many iterations did you get with this setup? I am only getting ~80 iterations per HOUR with an RTX 3090, an AMD Ryzen 9 5950X 16-core, and 64 GB of 3200 MHz RAM, so I think something is wrong with my installation or training setup.
I am using batch_size=64.
Since most of my friends and clients use MS Win 10 or 11, I must support Windows! A new vid on macOS would also be great!
As Coqui shut down at the beginning of 2024, I am not sure whether anyone will adjust the code for newer operating systems.
Thanks for listening to us and making this video!
You're welcome. I'm always happy about feedback and suggestions from my community and try to make the right content for you 😊.
I spent ages trying to get this to work and finally ended up installing WSL, which made the setup work. You should make a video on how to create your own dataset for training!
Best regards from the USA!
So now you have another way to train a TTS model in addition to WSL. Hope you enjoyed this video 😊.
I've created a tutorial on recording and creating a voice dataset here:
ua-cam.com/video/4YT8WZT_x48/v-deo.html
@@ThorstenMueller Ah very cool, I'll have to give that a shot. I've been using OpenAI's Whisper to transcribe audio I downloaded from YouTube videos and podcasts, and it's getting close. But I think I need to do a better job cleaning up and organizing the audio I download. Any suggestions for how large the dataset should be when using VITS? I've been using about 1-3 hours of clips and it's starting to sound okay... but I'm guessing I just need more and cleaner data. Thanks again!
@@anthonyschilling7132 My voice datasets are way longer - at least 10k recordings, meaning > 10 hours of pure audio. But good phoneme coverage might be even more important.
Very good video. Had I known you'd cover the installation on Windows here, I would have saved myself two days of work :D
I'm very glad you liked the video, and I hope you don't miss the two lost days too much 😉.
@ No worries, I learned a lot along the way. In the end I switched to Ubuntu though, since it doesn't work that well on Windows :/
Thank you very much for your videos. I almost never subscribe, but I was so thankful for these that I've been liking every one, and I did subscribe. :)
Wow, that's probably some of the best feedback I've received for my work on these videos 🤩.
Yay! I've been waiting on this one. Thank you so much.
nice! giving Windows some love :D
Thanks Josh, at least a little bit 😁.
Hi Thorsten,
maybe the next "how to" could be training a Coqui TTS model based on Glow-TTS and a HiFiGAN vocoder?
Great help for figuring out all these little details you just have to know somehow. Tnx!
Came here looking for some information on Coqui, as I'm looking to do a voice clone for voice-over work. Fantastic job.
Great feedback like yours always keeps me motivated - thank you 😊.
Thank You from a new subscriber !
Thanks for joining and welcome 😊.
@@ThorstenMueller P.S. Since Coqui is 'dead', what local TTS model with personal voice cloning can we use?
@@davidtindell950 I'd go with Piper TTS for now. ua-cam.com/video/b_we_jma220/v-deo.htmlsi=aFZ-Z5nNpiQxa0Zo
Really good video man. Well explained and researched. Thanks a lot
Thanks for your nice feedback. I'm happy that you liked it 😊.
I love this, if for no other reason than it helps me learn German dialects.
So I'm your reference for a German dialect? 😆👍
Great and Unique Videos Always, Thank you for your time and efforts.
Thank you so much. Feedback like yours always keeps me motivated ☺️.
Thanks for sharing, Thorsten! Got yourself a new subscriber (y)
Thank you and welcome 🤗.
This is very well explained. Thank you, Thorsten-Voice, this video helps me continue my hobby and research.
Thank you. Nice feedback like yours always keeps me motivated to continue this journey ☺️.
I subscribed on the first video, great teacher!!!
Thanks a lot for your very nice feedback - and welcome 😊
Thank you so much for this video, really helpful!
Thank you for your nice feedback 😊.
5:36 is not very clear - where did that come from?
You mean the voice dataset in this LJSpeech file and directory structure?
Your content is amazing, really useful. Thx.
Thanks a lot for your nice feedback 😊. I'm always happy to hear if people find my content helpful.
I love your knowledge man
Thank you so much 😊
ChatGPT provided me, step by step, with all the commands needed to run Coqui TTS.
Thank you so much!
Thank you for this really nice feedback. Feedback like yours keeps me motivated 😊.
Another great video. Is there an ATI equivalent?
Thank you very much for the nice compliment 😊. I have no experience with ATI graphics cards in this context; CUDA is primarily designed for NVIDIA cards. There is (or was) an old project called "gpuocelot" that aimed to help in this area, but I can't really help you further there.
I would have appreciated a breakdown of how the audio samples should be formatted and maybe a bit more explanation of the code. Also, torchaudio does not install along with torch.
Thanks for your suggestion. I thought diving too deep into the code might be hard to follow, but I'll think about a more detailed video - which will be longer, though.
Hello, Mister @Thorsten. I wanted to know: I run the training a thousand times and yet the audio does not sound clear, but when I use your voice through the tts-server, it sounds very clear. How did you train your voice (the one on the server)? And thank you for this great effort.
Thanks for your feedback. The training in this video is just for the demo; with 3,000 steps there cannot be a clear voice. My publicly released models for tts-server have been trained for over two months with around 600,000 steps. Does this explanation help you?
@@ThorstenMueller Thank you for this useful information. The picture is now clearer
Thanks for the video. Also, can you make a video on how to run Tortoise TTS locally on your computer?
Thanks for your comment 🙂. TorToiSe TTS is already on my TODO list.
@@ThorstenMueller tyvm
Thank you for your video, it's great work.
You're very welcome. Happy it's helpful for you 😊.
I subscribed although I could only watch for a few minutes because of some health problems I'm having nowadays. If possible, I would like a cool tutorial or explanation on ways to do this without downloading anything new to my computer or going through a long process - maybe, if it's possible to do this 100% online, that would be awesome! Since technology is improving so fast nowadays, I'm sure there must be some sites where we can do this online, right?
First of all, I hope you get well soon 😊. Thanks for subscribing, and I agree: right now this is not a simple 1-2-3 process, but voice cloning is getting better, and for English voices it might (in the near future) become easier to clone your voice. Not sure how perfect the cloned voice will be with a simple process, but we'll see.
@@ThorstenMueller Thanks! I'm fluent in Japanese, and looking forward to doing this in Japanese sometime too.
Hello, I just subscribed to your channel and I have one question: does this work with foreign languages or only English?
Thank you for joining my channel 😊. This will work in other languages as well. I've created an earlier video (not Windows specific) with some more detail if that's helpful for you. ua-cam.com/video/4YT8WZT_x48/v-deo.html
@@ThorstenMueller Thank you so much, you are the best!
Thanks a lot, man!
You're very welcome 😊.
Which graphics card do you use, please? Thanks for the info.
In this video i've used an NVIDIA GTX 1050 Ti. But for my other models training i use an NVIDIA Jetson Xavier AGX.
Hello, thanks so much for the video. I'm in the process of training a custom VITS TTS model using a dataset that I've created. Around the 200,000-step mark, the average loss on my trainEpochstats/avg_loss_1 is creeping up. My dataset is fairly small, approximately 1 hour in length, but it does have good coverage of phonemes. When I tested the audio, it had the correct voice quality but the speech was nonsensical. Should I halt the training to expand my dataset, or is it typical for models to require more training steps to produce meaningful audio?
You're welcome 😊. If your dataset is nicely phonetically balanced it should produce usable results. My VITS model has been trained (I guess) for 600k steps, so there might be room for more training. But maybe you can ask this on the Coqui TTS GitHub discussions, as there are real machine learning pros there. If available, add some Tensorboard screenshots for analysis.
Hello! Thanks for the tutorial! Just finished training. My bot can't string letters into words at all. I would like to ask you what size the dataset should be, and is it possible to speed up the training with Google Colab?
You are welcome 🙂. Not sure what you mean by "letters into words"? Do you mean, for example, "TTS" vs. "T T S" pronunciation? Google Colab provides simple GPU power, which is far better than CPU, but it disconnects sessions regularly (in the free tier).
@@ThorstenMueller First, thanks for the reply! I mean my bot can't say a word; it's more like a monster roar (like "grr"). But at the same time, it can change the tone of speech, using, for example, an exclamation mark.
I asked about the dataset in my first comment because I think that's my problem: the quality of my dataset is not high enough.
Hi Thorsten, thanks for this awesome tutorial, which worked perfectly on my machine. However, I trained my model and it's great but not perfect. Is there an option to continue training from this model instead of training a new one (which would take ages just to get to the point where I am now)? I am relatively new to Python, so I am not sure if I just have to modify the training script a little, if there is a command somewhere which does this, or if it's just not possible. If you could give me a pointer, that would be great!
Thanks for your nice feedback 😊.
You're looking for restore_path and/or continue_path. I've made a special video tutorial on continuing a TTS model training from a previous step checkpoint.
ua-cam.com/video/O6KxJR95WpE/v-deo.html
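In short - these are command-line arguments of the training script, and the paths below are placeholders for your own run folder:

python train_vits_win.py --continue_path "C:/path/to/previous_run_folder/"
(resumes the full trainer state, including optimizer and step counter)

python train_vits_win.py --restore_path "C:/path/to/previous_run_folder/checkpoint_100000.pth"
(loads only the model weights from a checkpoint and starts a fresh run)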
@@ThorstenMueller Wow, I didn't see that. Sorry about that, and thanks a lot for the quick reply and help!
Thanks for the tutorial. It's really helpful. Can you also make a tutorial on how we can use Coqui TTS to fine-tune YourTTS for a low-resource language with better quality? That would be really helpful. Thanks and keep inspiring :)
Thanks for your nice feedback. So you mean a model that is fast enough for e.g. a Raspberry Pi, but with high quality?
@@ThorstenMueller By low-resource language I mean Hindi, Korean, Arabic, etc.
@@AdityaGupta-k3q Okay, sorry, I got that wrong 🤦♂. Not sure about that. Maybe you can get a good answer by asking this good/important question in the Coqui TTS community.
Thank you for this
You're welcome 😊. I hope it's been helpful for you.
Great tutorial! Thank you for all the details! I have a question though about the training process and dataset. I used 102 samples for my dataset. To record them I used Audacity with default recording settings (mono, 44100 Hz, 32-bit float). For the recipe file, I used the one you show in your video (named something like a "youtube recipe"). After 1000 epochs I checked the results by synthesizing some words and sentences using tts-server. It sounded very slow, not normal. While checking the config.json file I found out that the sample rate was set to 22050. After I changed it to 44100 and restarted the tts-server, the voice sounded closer to mine, but the quality is still really bad. Could the fact that all the samples were recorded at 44100 Hz affect the whole training, since the default sample_rate in that config.json file is 22050? Or is that irrelevant and I just need to train more? Or do I need to start over using samples recorded at 22050 Hz?
Thanks for your nice feedback on the details in my tutorial 😊. I guess you might not get great results with just 102 recordings. Did the training process run even though the samplerate did not match? I'd have thought this should abort the training process. Anyway, just changing the value after training, only for synthesis, will not work. The samplerate in the config and the samplerate of the wave files must match before starting the training process - no matter whether it's 22 or 44 kHz, as long as the config matches reality 🙃
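If the samplerate mismatch is the issue, resampling the whole dataset to 22.05 kHz before training is safer than editing the config afterwards. A minimal sketch, assuming librosa and soundfile are installed and the folder name is a placeholder:

import glob
import librosa
import soundfile as sf

for path in glob.glob("MyDataset/wavs/*.wav"):  # placeholder dataset folder
    y, _ = librosa.load(path, sr=22050)         # load and resample in one step
    sf.write(path, y, 22050)                    # overwrite with 22.05 kHz audio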
@@ThorstenMueller The training process did run even though the samplerate did not match, 1000 epochs.
Cool video! Can you do this with a Docker setup, sans Windows?
Thanks for your feedback 🙂. Do you mean training a TTS model using Coqui TTS inside a Docker container?
@@ThorstenMueller Yes, exactly. Is that even possible, or do you need a GPU? I want to use my local NAS for something more than a file store, so I was wondering if this is possible.
Yes please @@ThorstenMueller
Thanks a lot!!
You're very welcome 😊.
So I'm getting as far as running the "pip install -e ." command before it errors out with status code 1, something about wheel.
Try running "pip install setuptools wheel -U" first; maybe that helps.
Hi! Everything works fine, thanks! Except that it refuses to handle accented Hungarian characters (éáűőúöüóí). Does something need to be converted somewhere to handle these letters as well? For sentences without an accented character, it is perfect.
Do you mean you have problems training the model with these characters, or did training run fine and you're having problems synthesizing? Have you trained using phonemes or characters? Maybe you can run this script on your dataset and add any special characters to your config.
github.com/coqui-ai/TTS/blob/dev/TTS/bin/find_unique_chars.py
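A hedged sketch of where those characters would go in a character-based training recipe - the field names follow Coqui's CharactersConfig, but treat this as illustrative rather than a confirmed fix:

from TTS.tts.configs.shared_configs import CharactersConfig
from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    use_phonemes=False,  # character-based training
    characters=CharactersConfig(
        # extend with the output of find_unique_chars.py
        characters="abcdefghijklmnopqrstuvwxyzéáűőúöüóí",
        punctuations="!'(),-.:;? ",
    ),
)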
@@ThorstenMueller Yes, "abcdefgh..." - ok. "éáőúöüó..." - omits it from the speech. A new config.json is created in the new folder at every start. Where can I add the returned values to the configuration?
Hello Thorsten: I tried to figure it out by myself following the steps, but somehow it doesn't work. Can I make an appointment with you for about half an hour, so that you can give me some guidance?
You can contact me using my contact form here, but it might take some time until I can respond. www.thorsten-voice.de/en/contact/
Hey, I am trying to train a model for my language (Kazakh) following your tutorial. It's been over one day of training, but I am only getting some weird speaker noises. I didn't see you change or add any symbols, so neither did I. Do I need to add the alphabet of my language?
In general, one day is not much time for training a TTS model. Do you use phoneme- or character-based training?
@@ThorstenMueller I've used phoneme-based. Well, I was thinking maybe I would at least get something. The data contained over 12k audio samples with a lot of speakers, each speaker having 250 samples. Maybe because of that the features didn't match.
Hi, nice video! Could you tell me what you think of the new Arduino for speech recognition? -> Nicla Voice
Personally I've no experience with Arduino. Do you think it's worth checking out this topic?
@@ThorstenMueller I don't know. Arduino says this is the first time that we can recognize voice commands with a neural decision processor, ultra-low power consumption and very good recognition. I don't know if it's true or not. It's expensive, but I think I'll give it a try.
Nice, thank you!
You're very welcome 😊.
Hi Thorsten, thanks for putting the video together. When I try to run my version of your train_vits_.py script, I get an error saying ModuleNotFoundError: No module named 'TTS.tts.configs.shared_configs' - any pointers? (I tried adding the project path to my system environment variables, but no luck.)
Hi, are you in your Python venv? Does "pip list" show a TTS package?
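On the Windows command prompt you can filter the package list like this:

pip list | findstr /i tts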
Awesome tutorial, thank you... Unfortunately, it keeps getting interrupted by a multiprocessing error before the last step; I'm looking for a solution. If others have succeeded - and I see in the video that it works for you - maybe it will work for me too. :)
Could a difference between Windows versions cause this error?
Thanks for your nice feedback 😊. A different Windows version might be the reason. Which version do you use? Is an error message shown?
Can you change the tone of the voice reading the text (e.g. excited, sad, etc.)?
Emotions aren't supported in Coqui TTS models (as far as I know). Maybe SSML in Mimic 3 might be at least a little bit helpful in that context.
I followed every step up until 08:33, but when I run `pip install TTS` it tries to install every version of transformers. I would share a screenshot if I could. I've never seen a `pip install` go through all the different versions of a package.
Maybe Coqui TTS dependencies have changed in newer releases? Could you download/clone the version I've used in the video, just to check whether that works?
So, any solution to that problem?
@@shivam5648 Are you running into the problem I described when you try to install now? As I understand it, this only happened for the old release back then. The OG Coqui is pretty much deprecated now, but this error shouldn't happen anymore.
@@youngphlo It's just not installing; it takes hours, and after installing for hours there is this error. It's so frustrating.
Confused - where exactly did you put your voice files for training?
You're looking for the parameter "dataset_config" in the training recipe file. There you can set the path to your voice files (in LJSpeech format) for training.
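A hedged sketch of what that looks like inside the recipe - field names can differ slightly between Coqui versions, and the path is a placeholder for your own folder:

from TTS.tts.configs.shared_configs import BaseDatasetConfig

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",               # expects a wavs/ folder plus metadata.csv
    meta_file_train="metadata.csv",
    path="C:/TTS-Training/MyDataset/",  # placeholder: folder with your recordings
)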
Is it possible to combine two voices? And what sample rate should I use for the dataset?
What do you mean by "combining two voices"? I've trained my TTS models with a 22 kHz samplerate.
Is there a way to stop and resume training? The continue_path command does begin the process, but it then fails when generating sample sentences.
It's been some time since I last used continue/restore for a training. I guess you know my video on exactly this topic? ua-cam.com/video/O6KxJR95WpE/v-deo.html
That isn't working? Then maybe it's a bug or a changed use case in Coqui TTS.
@@ThorstenMueller Yes, that's the video I found the method in. I'm not sure if anyone else is having the same trouble, but I haven't been able to find a solution at present.
@@Hellfreezer Is there any specific error message when running continue and while generating sample sentences?
@@ThorstenMueller I tried to post the full info but it seems to have been hidden. Basically the traceback ends in TypeError: expected string or bytes-like object
@@Hellfreezer There's a closed issue on that. Maybe this is helpful for you.
github.com/coqui-ai/TTS/issues/2070
How do I fix the freeze issue? I can't find anything about it other than the resource you provided (a bug report), which was closed with the author's comment 'we don't support Windows' - when you've clearly done it on Windows! I've spent a lot of time on this and would like to figure it out; any help would be appreciated.
Never mind, I hadn't gotten to the part where you explained it!
😄, good luck :-)
I need help! Inside the TTS-Training folder there are some files, as you show in the video. How did those files get there? How do I set up the TTS-Training folder I made in exactly the same way? And when I change directory into the TTS-Training folder and type the python command, nothing happens. Please, could you help me with that? :(
I'm not sure if I understand your question right. So the training process starts and the "output_folder" is created and filled with files? Are you already trying to synthesize a voice while training? Are audio samples available in Tensorboard?
@@ThorstenMueller I don't know how the output folder was created and filled with files in your video. I followed your steps one by one: I installed Python, eSpeak-ng and Microsoft Build Tools, but where you open the command prompt I got really stuck. I created the directory as you did, but my directory doesn't contain the files that you show in the video. I typed the python commands but nothing happened. What did I do wrong? :(
@@kostas9849 Strange - the output directory, named after the training run plus a timestamp for the training start date, should be created automatically. Did cloning the Coqui TTS repo and adjusting the recipe work?
Sir, while running the last line, an error occurs: charmap codec can't decode bytes.
Please help
Is your config file in UTF-8?
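If it isn't, a hedged one-off conversion - this assumes the file is currently in Windows cp1252; adjust the filename and source encoding to your case:

with open("metadata.csv", "r", encoding="cp1252") as f:  # hypothetical source encoding
    text = f.read()
with open("metadata.csv", "w", encoding="utf-8") as f:   # rewrite as UTF-8
    f.write(text)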
I would like to train a new TTS model for a new language. Is the process the same? Can you give me some advice, please? It would really help me.
You're right. It's working the same way. Maybe you can watch this tutorial showing how to create a voice dataset for your new language model.
ua-cam.com/video/4YT8WZT_x48/v-deo.html
I am getting only around 80 iterations per hour on a setup with an RTX 3090. That's too slow, right?
Good question. But it is way faster than my NVIDIA Jetson Xavier AGX 😉
11:09 Help please, I'm stuck at this step because it gave this error: "OSError: [WinError 126] The specified module could not be found. Error loading "cudart64_110.dll" or one of its dependencies."
Seems like your CUDA installation is broken. Are you sure CUDA is installed correctly?
@@ThorstenMueller I'm not sure; I followed your steps exactly.
@@kaymat2368 Hard to say what might cause this issue. Maybe try installing a newer CUDA version.
@@ThorstenMueller OK, thanks for replying. BTW, my GPU is an NVIDIA GeForce GT 520, OS Win 7.
Hello! I just wanted to know how many audio files I need to clone a voice. I recorded about 50 wav files, but when I start the trainer the script fails with "there is no sample left".
I guess 50 is way too few. I recorded over 10k wave files for my German "Thorsten-Voice" voice clone. Maybe give it a try with 1,000 recordings.
I can't get the pip command to work, help!!
What error message are you receiving?
Hey Thorsten. Can you install Coqui with all the models and functions like on the website, so that you no longer have to type commands and can use it completely offline via the user interface?
Hi Andi, I assume you mean Coqui Studio.
As far as I know, that is not part of their open-source release, so I'd say it's not possible. Only the "tts-server" command provides a locally running web frontend, which of course can't be compared to Coqui Studio.
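For example, to start that local web frontend with my German model (assuming the Coqui TTS package is installed):

tts-server --model_name tts_models/de/thorsten/vits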
Is there other software that can be used offline once everything is set up, or at least Coqui with a few pretrained models?
@@andiratze9591 You can use all Coqui TTS models offline, just not via an interface as comfortable as Coqui Studio. Do you know this video of mine? I show it there. ua-cam.com/video/alpI-DnVlO0/v-deo.html
Ah thanks, I thought that was just a video with terminal commands, without an actual user interface. I'll reinstall my Windows later and give it a try. 🙂
I'll try to learn Python at some point. Maybe I can program my own TTS/VC. It's impossible to find free software in this area that is easy to use. For everything else - photo, video, etc. - I can find something, but for TTS it's really bad 🥴
It tells me that I might need to install a third-party phonemizer for the language "de"... Where do you get the extra files that you install and cd into at about 10:37?
Did you install espeak-ng as shown here?
ua-cam.com/video/bJjzSo_fOS8/v-deo.html
I had this problem too... A reboot seemed to fix it, but I also did a "pip install phonemizer" before, which may not have actually been necessary.
In case anyone else is wondering, got this running on Win 11, using Anaconda 2.5.1 (Python 3.11.5), CUDA 12.3.5.1, and Coqui TTS 0.21.2
I want to make a new model for the Indonesian language, but espeak-ng doesn't support that language. Is it still possible to make a new model?
Thanks for your good question. Yes, that's possible. You can set "use_phonemes" to "false" and then it will use character-based training.
Maybe this helps a bit. tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html?highlight=use_phonemes
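A minimal hedged sketch of that switch in a training recipe (field names as in Coqui's VitsConfig; "basic_cleaners" is one of the language-agnostic cleaners):

from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    use_phonemes=False,             # train directly on characters, no espeak-ng needed
    text_cleaner="basic_cleaners",  # avoids English-specific text normalization
)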
@@ThorstenMueller Still using eSpeak or not? The alphabet is the same as in English, only the spelling is different. Sorry, I ask a lot.
I started training the model, and after 8 hours, only 2 epochs were completed. Is this normal and do I need to complete all 1000?
What do you mean by "completed"? Normally the training process runs until you stop it manually. Did training end automatically?
Right, how should I go about creating the dataset though?
Hi, do you know my tutorial on Piper-Recording-Studio for doing so? ua-cam.com/video/Z1pptxLT_3I/v-deo.html
@@ThorstenMueller I started following your Mimic Recording Studio video and its instructions so I could make my own Coqui LJSpeech model, but it isn't working for some reason.
Some files don't exist anymore, and it seems mad about numpy.
@@MistakingManx Hmm, as Mimic-Recording-Studio is not actively maintained, this might stop working due to newer package versions (like numpy). I'd use Piper-Recording-Studio, as it generates an LJSpeech-like dataset too.
@@ThorstenMueller I already used mimic-recording-studio, it's what the tutorials used, and it seemingly worked fine, minus the part I had to fix.
Your script that makes the dataset was useful; I just can't get the training stuff to work at all.
I wanted to use Windows since I have a 4090 Ti on it.
Would it be possible to talk on a platform like Discord?
@@MistakingManx You can send me an email using my contact form here: www.thorsten-voice.de/en/contact/
But it might take some time to respond for me so please be a little bit patient 🙂.
You are my hero.
I probably wouldn't go that far 😉. But I'm very happy about this more-than-nice feedback 😊.
The process returns the error: "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process..." I used your code.
I've seen this error before, but I'm not absolutely sure about the reason. Is training running nevertheless, or not starting at all? Does running the command prompt as admin change the behavior?
@@ThorstenMueller The training starts, but the error occurs later in the run. I don't know how to fix it.
@@ThorstenMueller I tried modifying the root folder and the permissions of the prompt, but the error keeps returning.
Have you ever seen anything like it? Even using your "train..." script, which already contains "if __name__ == '__main__':", I get an error in training. Any idea which way I should go? 😪😥
I am also getting the same error - has any solution been found for this?
Great tutorial, but I am trying to replace my Microsoft voices with my cloned voice. Is this doable?
Thanks for your nice feedback 😊 and great question. I tried this some time ago too, but didn't find an easy solution. If this is interesting in general, I might give it a closer look. Most voices seem to come from their Microsoft Azure cloud services.
Okay 👍🏻👍🏻
How long do I need to train until it sounds good?
It depends on what you mean by "good" 😉. By step 30k you should be able to hear a voice with lots of background noise. Starting at around step 100k the voice should be clearer. Beyond that it's up to your personal expectations.
@@ThorstenMueller Thanks for the prompt response. How long does this take in hours/days/months, and approximately how much input data would I need?
@@JamesBond-ix8rn It's hard to give specific values, as it depends on the hardware you have available for training. It might be anywhere from some hours to weeks or months of training time. Ensure a good phonetic balance and add more recordings over time if you're not satisfied with the result.
I have Python 3.11 installed. Do I have to uninstall it and install 3.8? That would really suck.
According to the README, Python 3.11 should work (python >= 3.9, < 3.12).
How many samples do we need for the training?
As always - it depends 😉. With fewer than 100 the training process will not start. I recorded > 10,000 phrases for my German "Thorsten-Voice" TTS models. But phonetic coverage might be more important than the pure number of recordings.
How long does it take to train a model? Best regards
Hello 👋. For my Thorsten-Voice models, training took around 3 months of 24/7 compute time. But this depends on the hardware you have available for training.
@@ThorstenMueller Whoa, did you train it yourself? What GPU did you use? That's insanely long in these trying times of energy prices. :/
@@-.nocturna.- Absolutely. This is the usual trade-off between graphics performance and duration. I used an NVIDIA Jetson Xavier AGX, which has a relatively low power consumption.
@@ThorstenMueller That's a nice one. 30 W vs. the 320 W of my 4080 :| I think I will do it if my other projects fail :P Have a nice night :>
Is it possible to train the model to speak Portuguese?
Sure, if you have a Portuguese voice dataset ready for training.
@@ThorstenMueller Well... I have my own voice 🤣. I want to try that.
Where is TTS-Training??
It is an empty folder in which you start working. I created a new folder "TTS-Training" but you can name it whatever you want.
Is there a non-CUDA version?
Coqui has a command-line parameter called "use_cuda" which can be set to "false", but I guess training will take waaay longer than with CUDA.
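One hedged alternative is hiding CUDA from PyTorch before starting the recipe script (Windows command prompt; script name as used elsewhere in this thread):

set CUDA_VISIBLE_DEVICES=
python train_vits_win.py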
@@ThorstenMueller Thank you for the reply. I have AMD, not NVIDIA. So should I give up on this method?
@@thebluefacedbeastyangzhi Hard to say, but maybe try a Google Colab notebook with a GPU that supports CUDA. It might be an easier way for you if you don't have access to a local NVIDIA GPU card.
@@ThorstenMueller Thank you again for this information.
Waiting for Linux to get proper HQ text-to-speech.
With Coqui TTS or Piper TTS there are some pretrained and really nice sounding TTS models available for Linux in multiple languages 😊. Do you know these?
Thanks Thorsten for your endless efforts at communicating a complex subject with enthusiasm and passion to people who don't know much about Python. I see that you have linked another video about preparing recordings: ua-cam.com/video/4YT8WZT_x48/v-deo.html
You're very welcome 😊. And yes, i'm really passionate about this topic.
Coquí (Eleutherodactylus), a frog from Puerto Rico 🇵🇷
True, true 👍
Well something on my end is not working -.-!
Do you get any specific error message?
@@ThorstenMueller Well, sorry for the late response. I tried many different ways to install and use TTS, but one big problem I had was that I can't install Python 3.8 for all users - every other version I can.
And I'm not sure if that's the big problem.
@@KominoStyle Which Python version are you using then?
It looks like mining issues
Oh, the gentleman only posts scam clips - interesting, there's a lot to report here...
The shirt 😂 - damn encoding, I feel that.
Thanks a lot 😊 - it's one of my favorite shirts, too.
Hey, how's it going?! ;)
Doing great - and yourself? ;)
@@ThorstenMueller Same as always. By the way, many thanks for your content. I bought two RTX A5000s and am wondering what to do with them, since I'm not a gamer, architect or programmer (the original plan of building a rendering workstation became obsolete for various reasons), and your videos inspire quite interesting experiments. I was interested in running my own AI projects, and it seems you offer the know-how for that. Best regards from Turkey, from a Rhineland exile.
Thank you, this video has helped me get to this point. Can you help with this error? I am stuck here and can't seem to find a solution. I followed your video, but when I go to run the trainer I get the following error:
(TTS) C:\Users\7danny\Documents\CoquiTTS\TTS>python .\train_vits_win.py
Traceback (most recent call last):
File ".\train_vits_win.py", line 6, in
from TTS.tts.configs.vits_config import VitsConfig
File "C:\Users\7danny\Documents\CoquiTTS\TTS\TTS\tts\configs\vits_config.py", line 5, in
from TTS.tts.models.vits import VitsArgs, VitsAudioConfig
File "C:\Users\7danny\Documents\CoquiTTS\TTS\TTS\tts\models\vits.py", line 38, in
from TTS.vocoder.models.hifigan_generator import HifiganGenerator
File "C:\Users\7danny\Documents\CoquiTTS\TTS\TTS\vocoder\models\hifigan_generator.py", line 6, in
from torch.nn.utils.parametrizations import weight_norm
ImportError: cannot import name 'weight_norm' from 'torch.nn.utils.parametrizations' (C:\Users\7danny\Documents\CoquiTTS\TTS\lib\site-packages\torch\nn\utils\parametrizations.py)
You're welcome. Did you update all Python packages before starting the training?
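A hedged guess: torch.nn.utils.parametrizations.weight_norm only exists in newer PyTorch releases, so the torch inside the venv may simply be too old for this Coqui version. Upgrading it might resolve the import error:

pip install --upgrade torch torchaudio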
This is an awesome tutorial, thank you for doing all the trial and error that I kept running into.
I do have one problem though. I've used your modified training script and only changed the directories, but I'm still getting a permission error:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'D:/TTS/ThorstenTut/ljsAlex01-April-26-2023_05+12PM-0000000\\events.out.tfevents.1682554375.DESKTOP-IUNHJ2B'
Is there any workaround for this? It's pointing to one of the files it just generated, which means it's not being used by any other process, so it must be that multithreading problem you mentioned still being an issue somehow.
Thanks for your nice feedback 😃. I ran into that permission thing once, too. I'm not sure how I solved it; I'll check my notes for this video and think about how I solved it. When I remember, I can share it here. Running the command prompt as local admin might be a first thing to try.
@@ThorstenMueller I have the same problem. Please let me know if you have found a solution to the error. Thank you very much!
Any updates regarding this issue?
@@thefurrowzor nope, still stuck here. Not sure what to do
@@thefurrowzor Might this issue help you? For me it worked while testing for this tutorial. Hopefully it'll work for you too. If it does, I could add the link to the video description.
github.com/coqui-ai/TTS/issues/1711
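For reference, the usual Windows workarounds are guarding the entry point and disabling data-loader subprocesses in the recipe - a hedged sketch, not a confirmed fix from that issue:

from TTS.tts.configs.vits_config import VitsConfig

if __name__ == "__main__":
    config = VitsConfig(
        num_loader_workers=0,       # no worker subprocesses -> no Windows file locks
        num_eval_loader_workers=0,
    )
    # the rest of the training recipe runs inside this guard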