FREE Voice Cloning in Microsoft Windows with Coqui TTS

  • Published Nov 24, 2024

COMMENTS • 257

  • @guilherme1556
    @guilherme1556 1 year ago +15

    It's great that you brought this tutorial to the Windows community. I personally use Linux to train my models, but it's awesome that you're making an effort to make the Windows open voice community stronger.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +4

      Yes, I personally use Linux for training, too. But model training on Windows has been requested quite often.

    • @ThorstenMueller
      @ThorstenMueller  11 months ago

      @@user-wc2jy4jr7r Not sure if I got you right. Do you mean "SAPI" in the context of Windows' built-in TTS voices?

  • @luke_foxy5170
    @luke_foxy5170 25 days ago +1

    Thanks for the video. It finally works! A true gentleman 😀

  • @Vito_0912
    @Vito_0912 1 year ago +5

    Thank you for this tutorial and your entire audio series. I originally started with Tortoise, which was too slow for me. Then I found Coqui and your public voice model, which is also really good and understandable, and with a factor of 0.41 it's also super fast for me. For my use case, however, it still produced too many odd pronunciations of proper names. Thanks to this video I was finally able to create my own voice model, completely adapted to the requirements of storytelling.
    It still sounds a bit shaky here and there and has only 100k steps (with increasing audio material), but it is already improving.
    Due to the recording conditions and my unfortunately not-so-great narrator voice, I only get to a loss of 26-36%, so there is still room for proper readjustment.
    For everyone interested in the stats who wants to do something similar:
    Specs: RTX 2070, I7-10900k, Samsung Evo 970
    Step time: 0.5-0.6 s
    Batch size (you can go higher): 20
    Checkpoint_steps: 1000 (just because I'm lazy and train in between idle periods, school work, etc., so I don't have to wait for 10,000)
    Audio dataset:
    Specs: a HyperX microphone (not sure of the model, the RGB one) with a pop filter, in a relatively big room
    I can't give a general statement here, and if you start with the full total you will get results faster. I trained in stages with an increasing number of audio files:
    0-5k: about 230 files ~ 0.4h
    5-10k: about 350 files ~ 0.6h
    10-30k: about 500 files ~ 1h
    30-60k: about 800 files ~ 1.6h
    60k-100k: about 1200 files ~ 2h
    Current total: 1200 files ~ 2h
    Milestones:
    from 10k: first signs of something other than noise (but not understandable)
    from 20k: first words recognizable without knowing the text
    from 30-40k: understandable text (but not yet natural speech)
    from 80k: It's okay :)
    * Please note, however, that I used books and book excerpts as input, with many proper names and Denglish (German with some English words). In any case this makes training slower and generally worse (but in the trained areas, such as proper names, very good).
    Recording:
    For the recordings I wrote a Python script that automatically splits the contents of a text file into sentences (ignoring sentences with fewer than 5 words) and displays them. Recording then started automatically and stopped as soon as the sound stayed below 50 dB for one second. The audio was then trimmed so that everything below 50 dB at the start and end was dropped (to guarantee that speech starts instantly) and padded with 50 ms of silence. Finally, it was normalized and saved in LJSpeech format.
    A delete function is included.
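The recording pipeline described above could be sketched roughly like this. It is not the commenter's script (which wasn't shared) but a minimal standard-library sketch under stated assumptions: the audio is already a list of float samples in [-1, 1], "below 50 dB" is read as below -50 dBFS, and trimming works per sample rather than with the one-second window used to stop recording. All function names are hypothetical.

```python
import math
import re

MIN_WORDS = 5            # sentences shorter than this are skipped
THRESHOLD_DBFS = -50.0   # assumption: "below 50 dB" read as -50 dBFS
PAD_MS = 50              # silence added at both ends


def split_sentences(text: str, min_words: int = MIN_WORDS) -> list[str]:
    """Naive split after '.', '!' or '?' followed by whitespace; drop short sentences."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if len(p.split()) >= min_words]


def dbfs(sample: float) -> float:
    """Level of one sample relative to full scale (1.0)."""
    return 20 * math.log10(abs(sample)) if sample else float("-inf")


def trim_and_pad(samples, sample_rate, threshold_dbfs=THRESHOLD_DBFS, pad_ms=PAD_MS):
    """Cut leading/trailing samples quieter than the threshold, then pad with silence."""
    loud = [i for i, s in enumerate(samples) if dbfs(s) >= threshold_dbfs]
    if not loud:
        return []  # nothing but silence
    core = list(samples[loud[0]:loud[-1] + 1])
    pad = [0.0] * int(sample_rate * pad_ms / 1000)
    return pad + core + pad


def peak_normalize(samples, peak=0.95):
    """Scale so that the loudest sample reaches `peak`."""
    top = max((abs(s) for s in samples), default=0.0)
    return [s * peak / top for s in samples] if top else list(samples)


def ljspeech_line(file_id: str, text: str) -> str:
    """One metadata.csv row: id|text|normalized text (here the same text twice)."""
    return f"{file_id}|{text}|{text}"
```

Per-sample thresholding keeps the sketch short; a real script would more likely measure RMS over short frames, so that an isolated click at the edge of a clip doesn't survive the trim.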

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for sharing your great setup and training step times 👏😊. This will help other users for sure. I agree that pronouncing foreign words is still a challenge.

    • @josebo8780
      @josebo8780 2 months ago

      How many iterations did you get with this setup? I am only getting ~80 iterations per HOUR with an RTX 3090, an AMD Ryzen 9 5950X 16-core, and 64 GB of RAM at 3200 MHz, so I think something is wrong with my installation or training setup.
      I am using batch_size=64.
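Putting the two reports side by side takes only a little arithmetic. The sketch below assumes that "iteration" means one training step (one batch) and ignores that a batch of 64 holds more than three times the audio of a batch of 20, so the comparison is only rough. At 0.5-0.6 s per step, the first setup manages roughly 6,000-7,200 steps per hour (about 14-17 hours for 100k steps), while 80 steps per hour corresponds to 45 s per step. The function names are illustrative only.

```python
def steps_per_hour(step_time_s: float) -> float:
    """Training steps completed per hour at a constant step time (seconds)."""
    return 3600.0 / step_time_s


def step_time_from_rate(steps_per_h: float) -> float:
    """Seconds per step implied by a measured steps-per-hour rate."""
    return 3600.0 / steps_per_h


def hours_for(total_steps: int, step_time_s: float) -> float:
    """Wall-clock hours for total_steps, ignoring evaluation and checkpoint time."""
    return total_steps * step_time_s / 3600.0


if __name__ == "__main__":
    for t in (0.5, 0.6):  # step times reported in the comment above
        print(f"{t} s/step: {steps_per_hour(t):.0f} steps/h, "
              f"100k steps in {hours_for(100_000, t):.1f} h")
    print(f"80 steps/h: {step_time_from_rate(80):.0f} s/step")
```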

  • @davidtindell950
    @davidtindell950 2 months ago +2

    Since most of my friends and clients use MS Win 10 or 11, I must support Windows! A new video on macOS would also be great!

    • @ThorstenMueller
      @ThorstenMueller  1 month ago +1

      As Coqui shut down at the beginning of 2024, I'm not sure whether anyone will adapt the code for newer operating systems.

  • @connordissident6881
    @connordissident6881 1 year ago +1

    Thanks for listening to us and making this video!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      You're welcome. I'm always happy to get feedback and suggestions from my community and try to make the right content for you 😊.

  • @anthonyschilling7132
    @anthonyschilling7132 1 year ago +4

    I spent ages trying to get this to work and finally ended up installing WSL, which made the setup work. You should make a video on how to create your own dataset for training!
    Best regards from the USA!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +2

      So now you have another way to train a TTS model in addition to WSL. I hope you enjoyed this video 😊.
      I've created a tutorial on recording and creating a voice dataset here:
      ua-cam.com/video/4YT8WZT_x48/v-deo.html

    • @anthonyschilling7132
      @anthonyschilling7132 1 year ago

      @@ThorstenMueller Ah, very cool, I'll have to give that a shot. I've been using OpenAI's Whisper to transcribe audio I downloaded from YouTube videos and podcasts, and it's getting close. But I think I need to do a better job cleaning up and organizing the audio I download. Any suggestions for how large the dataset should be when using VITS? I've been using about 1-3 hours of clips and it's starting to sound OK... but I'm guessing I just need more and cleaner data. Thanks again!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@anthonyschilling7132 My voice datasets are way longer: at least 10k recordings, meaning > 10 hours of pure audio. But good phoneme coverage might be even more important.
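To see where a dataset stands relative to the "> 10 hours" figure above, the total duration can be read straight from the WAV headers. A minimal sketch, assuming an LJSpeech-style `wavs/` folder of uncompressed WAV files; it parses the RIFF chunks directly because the standard-library `wave` module only handles integer PCM, not 32-bit float files. The function names are hypothetical.

```python
import struct
from pathlib import Path


def wav_duration(path) -> float:
    """Duration in seconds from a RIFF/WAVE header: data chunk size / byte rate."""
    byte_rate = data_size = None
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError(f"{path}: not a RIFF/WAVE file")
        while (header := f.read(8)) and len(header) == 8:
            chunk_id, size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                # fmt layout: format tag (2), channels (2), sample rate (4), byte rate (4), ...
                byte_rate = struct.unpack("<I", f.read(size)[8:12])[0]
            else:
                if chunk_id == b"data":
                    data_size = size
                f.seek(size, 1)  # skip the chunk body
            if size % 2:  # RIFF chunks are padded to an even length
                f.seek(1, 1)
    if not byte_rate or data_size is None:
        raise ValueError(f"{path}: missing fmt or data chunk")
    return data_size / byte_rate


def dataset_hours(wav_dir) -> tuple[int, float]:
    """Number of .wav files in wav_dir and their total duration in hours."""
    files = sorted(Path(wav_dir).glob("*.wav"))
    return len(files), sum(wav_duration(p) for p in files) / 3600.0
```

Calling `dataset_hours("MyDataset/wavs")` returns the file count and the total hours; the folder name is only an example.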

  • @VitiliKo
    @VitiliKo 25 days ago

    Very good video. If I had known that you do the installation on Windows here, I would have saved myself 2 days of work :D

    • @ThorstenMueller
      @ThorstenMueller  24 days ago +1

      I'm really glad you liked the video, and I hope you don't miss those 2 lost days too much 😉.

    • @VitiliKo
      @VitiliKo 22 days ago

      @ Nah, I learned a lot along the way. But in the end I switched to Ubuntu, because it doesn't work that well on Windows :/

  • @christopherwoods3339
    @christopherwoods3339 1 year ago +2

    Thank you very much for your videos. I almost never subscribe, but I was so thankful for these that I've been liking every one and I did subscribe. :)

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Wow, that's probably some of the best feedback I've received for my work on these videos 🤩.

  • @john_blues
    @john_blues 1 year ago

    Yay! I've been waiting on this one. Thank you so much.

  • @toykotokyoto
    @toykotokyoto 1 year ago +3

    nice! giving Windows some love :D

  • @ŁukaszMadajczyk
    @ŁukaszMadajczyk 26 days ago

    Hi Thorsten,
    could the next "how to" be about training a Coqui TTS model based on Glow-TTS and the HiFiGAN vocoder?

  • @scndsky
    @scndsky 1 year ago

    Great help for figuring out all these little details you just have to know somehow. Tnx!

  • @hangtime79
    @hangtime79 1 year ago

    Came here looking for some information on Coqui, as I'm looking to do a voice clone for voice-over work. Fantastic job.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Great feedback like yours always keeps me motivated - thank you 😊.

  • @davidtindell950
    @davidtindell950 2 months ago +1

    Thank You from a new subscriber !

    • @ThorstenMueller
      @ThorstenMueller  2 months ago

      Thanks for joining and welcome 😊.

    • @davidtindell950
      @davidtindell950 2 months ago

      @@ThorstenMueller P.S. Since Coqui is "dead", what local TTS model with personal voice cloning can we use?

    • @ThorstenMueller
      @ThorstenMueller  1 month ago

      @@davidtindell950 I'd go with Piper TTS for now. ua-cam.com/video/b_we_jma220/v-deo.htmlsi=aFZ-Z5nNpiQxa0Zo

  • @manuelherrerahipnotista8586
    @manuelherrerahipnotista8586 1 year ago +1

    Really good video man. Well explained and researched. Thanks a lot

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your nice feedback. I'm happy that you liked it 😊.

  • @devinhedge
    @devinhedge 1 year ago +1

    I love this, if for no other reason than that it helps me learn German dialects.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      So I'm your reference for a German dialect? 😆👍

  • @jonnypawan4650
    @jonnypawan4650 1 year ago

    Always great and unique videos. Thank you for your time and effort.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thank you so much. Feedback like yours always keeps me motivated ☺️.

  • @CezarPopescu
    @CezarPopescu 1 year ago

    Thanks for sharing, Thorsten! Got yourself a new subscriber (y)

  • @MrArdo-branch-main
    @MrArdo-branch-main 1 year ago

    This is very well explained. Thank you, Thorsten-Voice! This video helps me continue my hobby and research.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Thank you. Nice feedback like yours always keeps me motivated to continue this journey ☺️.

  • @seansean995
    @seansean995 1 year ago

    I subscribed after the first video. Great teacher!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks a lot for your very nice feedback - and welcome 😊

  • @prakharpaw-de7vh
    @prakharpaw-de7vh 1 year ago

    Thank you so much for this video, really helpful!

  • @techterry5299
    @techterry5299 10 months ago

    At 5:36 it's not very clear: where did that come from?

    • @ThorstenMueller
      @ThorstenMueller  10 months ago

      You mean the voice dataset in this LJSpeech file and directory structure?

  • @loiclacaille8683
    @loiclacaille8683 1 year ago

    Your content is amazing, really useful. Thx.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks a lot for your nice feedback 😊. I'm always happy to hear if people find my content helpful.

  • @belalgaber555
    @belalgaber555 1 year ago

    I love your knowledge man

  • @impishsquirrel
    @impishsquirrel 1 month ago

    ChatGPT walked me step by step through all the code needed to run Coqui TTS.

  • @RichardCastuera-d8l
    @RichardCastuera-d8l 1 year ago +1

    Thank you so much!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thank you for this really nice feedback. Feedback like yours keeps me motivated 😊.

  • @der-putz
    @der-putz 1 year ago

    Another great video. Is there an ATI equivalent?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thank you very much for the kind compliment 😊. I have no experience with ATI graphics cards in this context. CUDA is designed primarily for NVIDIA cards. There apparently is/was an old project called "gpuocelot" that aimed to help in this area. But I can't really help you there.

  • @anaveragegoogleaccountname
    @anaveragegoogleaccountname 1 year ago

    I would have appreciated you breaking down how the audio samples should be formatted and maybe a bit more explanation of the code. Also, torchaudio doesn't get installed along with torch.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your suggestion. I thought diving too deep into the code might be hard to follow, but I'll think about a more detailed video, which will be longer, though.

  • @amaarboss2115
    @amaarboss2115 1 year ago +1

    Hello Mr. @Thorsten, I wanted to ask: I run the training for thousands of steps, and yet the sound is not clear, but when I use your voice through tts-server, the sound is very clear. How did you train your voice (the one on the server)? Thank you for this great effort.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +6

      Thanks for your feedback. The training in this video is just for the demo. With 3,000 steps the voice cannot be clear. My publicly released models for tts-server were trained for over 2 months, for around 600,000 steps. Does this explanation help you?

    • @amaarboss2115
      @amaarboss2115 1 year ago +1

      @@ThorstenMueller Thank you for this useful information. The picture is now clearer

  • @Supratim-jc9kz
    @Supratim-jc9kz 1 year ago

    Thanks for the video. Also, can you make a video on how to run Tortoise TTS locally on your computer?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your comment 🙂. I already have TorToiSe TTS on my TODO list.

    • @Supratim-jc9kz
      @Supratim-jc9kz 1 year ago

      @@ThorstenMueller tyvm

  • @feixym
    @feixym 1 year ago

    Thank you for your video, it's great work.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      You're very welcome. Happy it's helpful for you 😊.

  • @phen-themoogle7651
    @phen-themoogle7651 1 year ago +1

    I subscribed, although I could only watch for a few minutes because of some health problems I'm having these days. If possible, I would like a tutorial or explanation of ways to do this without downloading anything new to my computer or going through a long process; if it's possible to do this 100% online, that would be awesome! Since technology is improving so fast, I'm sure there must be some sites where we can do this online, right?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      First of all, I hope you get well soon 😊. Thanks for subscribing, and I agree: right now it's not a simple 1-2-3 process, but voice cloning is getting better, and for English voices it might become possible (in the near future) to clone your voice more easily. Not sure how perfect the cloned voice will be with a simple process, but we'll see.

    • @phen-themoogle7651
      @phen-themoogle7651 1 year ago

      @@ThorstenMueller Thanks! I'm fluent in Japanese, and looking forward to doing this in Japanese sometime too.

  • @kostas9849
    @kostas9849 1 year ago +1

    Hello, I just subscribed to your channel and I have one question: does this work with foreign languages or only English?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Thank you for joining my channel 😊. This will work in other languages as well. I've created an earlier video (not Windows specific) with some more detail if that's helpful for you. ua-cam.com/video/4YT8WZT_x48/v-deo.html

    • @kostas9849
      @kostas9849 1 year ago

      @@ThorstenMueller Thank you so much, you are the best!

  • @omarharbah6972
    @omarharbah6972 11 months ago

    A lot of thanks man !

  • @pocketsfullofdynamite
    @pocketsfullofdynamite 1 year ago

    Which graphics card do you use, please? Thanks for the info.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +2

      In this video I used an NVIDIA GTX 1050 Ti. But for training my other models I use an NVIDIA Jetson Xavier AGX.

  • @mementomori-l2l
    @mementomori-l2l 1 year ago

    Hello, thanks so much for the video. I'm in the process of training a custom VITS TTS model using a dataset that I've created. Around the 200,000-step mark, the average loss on my trainEpochstats/avg_loss_1 is creeping up. My dataset is fairly small, approximately 1 hour in length, but it does have good phoneme coverage. When I tested the audio, it had the correct voice quality but the speech was nonsensical. Should I halt the training to expand my dataset, or is it typical for models to require more training steps to produce meaningful audio?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      You're welcome 😊. If your dataset is nicely phonetically balanced, it should produce usable results. My VITS model has been trained (I guess) for 600k steps, so there might be room for more training. But maybe you can ask this in the Coqui TTS GitHub discussions, where there are real pros in machine learning. If available, add some TensorBoard screenshots for analysis.

  • @MrAngryWh1te
    @MrAngryWh1te 1 year ago

    Hello! Thanks for the tutorial! I just finished training. My bot can't string letters into words at all. I would like to ask you what scale the dataset should be, and is it possible to speed up training with Google Colab?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      You are welcome 🙂. Not sure what you mean by "letters into words". Do you mean, for example, the pronunciation of "TTS" vs. "T T S"? Google Colab provides GPU power, which is far better than CPU, but it disconnects sessions regularly (in the free edition).

    • @MrAngryWh1te
      @MrAngryWh1te 1 year ago

      @@ThorstenMueller First, thanks for the reply! I mean my bot can't say a word; it's more like a monster roar (like "grr"). But at the same time, it can change the tone of speech, for example when there's an exclamation mark.
      I asked about the dataset in my first comment because I think that's my problem: the quality of my dataset is not high enough.

  • @TheCeratius
    @TheCeratius 1 year ago

    Hi Thorsten, thanks for this awesome tutorial, which worked perfectly on my machine. I trained my model and it's great, but not perfect. Is there an option to continue training this model instead of training a new one (which would take ages just to get to the point where I am now)? I am relatively new to Python, so I am not sure if I just have to modify the training script a little, if there is a command somewhere that does this, or if it's just not possible. If you could give me a pointer, that would be great!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your nice feedback 😊.
      You're looking for restore_path and/or continue_path. I've made a special video tutorial on continuing a TTS model training from a previous step checkpoint.
      ua-cam.com/video/O6KxJR95WpE/v-deo.html
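Going by how the Coqui trainer is usually described (not verified against every version): `continue_path` takes an existing run folder and resumes that run, while `restore_path` takes a single `.pth` checkpoint and starts a new run from its weights. Picking that file by hand is error-prone, because sorting `checkpoint_3000.pth` and `checkpoint_20000.pth` as plain strings puts the older one last. The sketch below assumes checkpoints are named `checkpoint_<step>.pth`; the helper is hypothetical and not part of Coqui TTS.

```python
import re
from pathlib import Path

# Assumed naming scheme inside a training run folder: checkpoint_<step>.pth
CHECKPOINT_RE = re.compile(r"^checkpoint_(\d+)\.pth$")


def latest_checkpoint(run_dir):
    """Return the checkpoint file with the highest step number, or None if there is none."""
    best_step, best_path = -1, None
    for path in Path(run_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        # Compare step numbers as integers, not as strings.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The returned path is what would go into `restore_path`; how that value reaches the trainer (command line or the recipe's trainer arguments) depends on the recipe and the Coqui TTS version, so the video linked above is the better guide there.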

    • @TheCeratius
      @TheCeratius 1 year ago

      @@ThorstenMueller Wow, I didn't see that. Sorry about that, and thanks a lot for the quick reply and help!

  • @AdityaGupta-k3q
    @AdityaGupta-k3q 1 year ago

    Thanks for the tutorial. It's really helpful. Can you also make a tutorial on how we can use Coqui TTS to fine-tune YourTTS for a low-resource language with better quality? That would be really helpful. Thanks and keep inspiring :)

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your nice feedback. So you mean a model that is fast enough for e.g. a Raspberry Pi but with a high quality?

    • @AdityaGupta-k3q
      @AdityaGupta-k3q 1 year ago

      @@ThorstenMueller By low-resource language I mean Hindi, Korean, Arabic, etc.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@AdityaGupta-k3q Okay, sorry, I got that wrong 🤦‍♂. Not sure about that. Maybe you can get a good answer by asking this good and important question in the Coqui TTS community.

  • @nestboxcam-Surabaya
    @nestboxcam-Surabaya 1 year ago

    Thank you for this

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      You're welcome 😊. I hope it's been helpful for you.

  • @vadzimyesman7693
    @vadzimyesman7693 1 year ago

    Great tutorial! Thank you for all the details! I have a question, though, about the training process and the dataset. I used 102 samples for my dataset. To record them I used Audacity with the default recording settings (mono, 44100 Hz, 32-bit float). For the recipe file, I used the one you show in your video (named something like a "youtube recipe"). After 1000 epochs I checked the results by synthesizing some words and sentences using tts-server. It sounded very slow, not normal. While checking the config.json file I found out that the sample rate was set to 22050. After I changed it to 44100 and restarted tts-server, the voice sounded closer to mine, but the quality is still really bad. Could the fact that all the samples were recorded at 44100 Hz affect the whole training, since the default sample_rate in that config.json file is 22050, or is it irrelevant and I just need to train longer? Or do I need to start over using samples recorded at 22050 Hz?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Thanks for your nice feedback on the details in my tutorial 😊. I guess you might not get great results with just 102 recordings. Did the training process run even though the sample rate did not match? I'd have thought that would abort the training process. However, just changing the value after training, only for synthesis, will not work. The sample rate in the config and the WAV sample rate must match before the training process starts; it doesn't matter whether it's 22 or 44 kHz, as long as the config matches reality 🙃
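Because the WAV files and the config have to agree before training starts, a quick pre-flight check can compare them. A minimal sketch, assuming the config stores the rate under `audio` → `sample_rate` (as the config.json discussed in this thread appears to) and that the files are uncompressed WAVs; it reads the header directly, so 32-bit float recordings like the ones described above are handled too. Function names are hypothetical.

```python
import json
import struct
from pathlib import Path


def wav_sample_rate(path) -> int:
    """Sample rate from the fmt chunk of a RIFF/WAVE file (integer PCM or float)."""
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError(f"{path}: not a RIFF/WAVE file")
        while len(header := f.read(8)) == 8:
            chunk_id, size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                # fmt layout: format tag (2 bytes), channels (2), sample rate (4), ...
                return struct.unpack("<I", f.read(8)[4:8])[0]
            f.seek(size + size % 2, 1)  # skip the chunk plus its padding byte
    raise ValueError(f"{path}: no fmt chunk found")


def find_rate_mismatches(config_path, wav_dir):
    """Return (expected_rate, {file_name: actual_rate}) for files that differ from the config."""
    with open(config_path, encoding="utf-8") as f:
        expected = json.load(f)["audio"]["sample_rate"]
    mismatches = {}
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        rate = wav_sample_rate(wav)
        if rate != expected:
            mismatches[wav.name] = rate
    return expected, mismatches
```

If any files are flagged, either resample them or set the config to the recording rate before training; editing the value only after training is exactly what the reply above says does not work.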

    • @vadzimyesman7693
      @vadzimyesman7693 1 year ago

      @@ThorstenMueller The training process did run for 1000 epochs even though the sample rate did not match.

  • @qodeninja
    @qodeninja 1 year ago

    Cool video, can you do this with a Docker setup, without Windows?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your feedback 🙂. Do you mean training a TTS model using Coqui TTS inside a Docker container?

    • @qodeninja
      @qodeninja 1 year ago

      @@ThorstenMueller Yes, exactly. Is that even possible, or do you need a GPU? I want to be able to use my local NAS for something more than file storage, so I was wondering if this was possible.

    • @qodeninja
      @qodeninja 1 year ago

      @@ThorstenMueller Yes, please.

  • @masamiakita993
    @masamiakita993 1 year ago

    Thanks a lot!!

  • @RogueMandoGaming
    @RogueMandoGaming 11 months ago

    So I get as far as running the "pip install -e ." command before it errors out with status code 1, something about wheel.

    • @ThorstenMueller
      @ThorstenMueller  11 months ago

      Try running "pip install setuptools wheel -U" first; maybe this helps.

  • @EzmiTV
    @EzmiTV 1 year ago

    Hi! Everything works fine, thanks! Except that it refuses to handle accented Hungarian characters (éáűőúöüóí). Do they need to be converted somewhere so these letters are handled as well? For sentences without an accented character, it is perfect.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Do you mean you have problems training the model with these characters, or did training run fine and you're having problems synthesizing? Did you train using phonemes or characters? Maybe you can run this script on your dataset and add any special characters to your config.
      github.com/coqui-ai/TTS/blob/dev/TTS/bin/find_unique_chars.py
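The idea behind `find_unique_chars.py` can be sketched with the standard library alone: collect every character in the transcripts and compare them with the character set the model is configured for. This is a minimal sketch, not the Coqui script; it assumes an LJSpeech-style `metadata.csv` (id|text|normalized text, UTF-8) and that the normalized text in the third column is what gets trained on. Function names are hypothetical.

```python
from collections import Counter


def char_counts(metadata_path, column: int = 2) -> Counter:
    """Count every character in one column of a '|'-separated metadata file."""
    counts = Counter()
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\r\n").split("|")
            if len(cols) > column:
                counts.update(cols[column])  # updates per character
    return counts


def missing_chars(counts: Counter, configured: str) -> list[str]:
    """Characters that occur in the data but not in the configured character set."""
    return sorted(ch for ch in counts if ch not in configured)
```

Anything reported as missing would need to be added to the character set in the training recipe (a `characters` section in recent Coqui TTS configs, assuming your version uses one) before training starts, because the config.json in each run folder is written fresh at every start.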

    • @EzmiTV
      @EzmiTV 1 year ago

      @@ThorstenMueller Yes, "abcdefgh..." works. "éáőúöüó..." are omitted from the speech. A new config.json is created in the new folder at every start. Where can I add the returned values to the configuration?

  • @jaylee6488
    @jaylee6488 5 months ago

    Hello Thorsten: I tried to figure it out by myself by following the steps, but somehow it doesn't work. Can I make an appointment with you for about half an hour, so that you can give me some guidance?

    • @ThorstenMueller
      @ThorstenMueller  5 months ago

      You can contact me using my contact form here, but it might take some time until I can respond. www.thorsten-voice.de/en/contact/

  • @mukhamejantalap4526
    @mukhamejantalap4526 8 months ago

    Hey, I am trying to train a model in my language (Kazakh) using your tutorial. It's been training for over a day, but I'm getting weird speaker noises. I didn't see you change or add any symbols, so I didn't either. Do I need to add my language's alphabet?

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      In general one day is not much time for training a tts model. Do you use phoneme or character based training?

    • @mukhamejantalap4526
      @mukhamejantalap4526 8 months ago

      @@ThorstenMueller I used phoneme-based training. Well, I was thinking I would at least get something. The data contained over 12k audio samples from many speakers, with 250 samples per speaker. Maybe that's why it didn't match.

  • @justelesnews
    @justelesnews 1 year ago

    Hi, nice video! Could you tell me what you think of the new Arduino for speech recognition, the Nicla Voice?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Personally, I have no experience with Arduino. Do you think this topic is worth checking out?

    • @justelesnews
      @justelesnews 1 year ago

      @@ThorstenMueller I don't know. Arduino says this is the first time we can recognize voice commands with a neural decision processor, ultra-low power consumption, and very good recognition. I don't know whether that's true. It's expensive, but I think I'll give it a try.

  • @pink_kniteu
    @pink_kniteu 1 year ago

    Nice thank youu

  • @mi16chap
    @mi16chap 1 year ago

    Hi Thorsten, thanks for putting the video together. When I try to run my version of your train_vits_.py script, I get the error "ModuleNotFoundError: No module named 'TTS.tts.configs.shared_configs'". Any pointers? (I tried adding the project path to my system environment variables, but no luck.)

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Hi, are you in your Python venv? Does "pip list" show a TTS package?
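Both questions can be checked from inside Python itself, which also reveals which interpreter is running the training script, a possible culprit when several Python installations exist on Windows. A minimal stdlib sketch: `sys.prefix != sys.base_prefix` is the standard way to detect an active `venv`, and the dotted module name checked is the one from the error message. The helper names are hypothetical.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """True if `name` (dotted names allowed) can be found by this interpreter."""
    try:
        # For dotted names, find_spec imports the parent packages first.
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised e.g. as ModuleNotFoundError when a parent package such as "TTS" is missing.
        return False


def environment_report() -> dict:
    return {
        "python": sys.executable,
        "in_venv": sys.prefix != sys.base_prefix,
        "TTS": module_available("TTS"),
        "TTS.tts.configs.shared_configs": module_available("TTS.tts.configs.shared_configs"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it with the same command used for training shows whether that interpreter is the venv one and whether it can see the TTS package.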

  • @zsoltvastagh7023
    @zsoltvastagh7023 1 year ago

    Awesome tutorial, thank you... Unfortunately, it keeps getting interrupted by a multiprocessing error before the last step, and I'm looking for a way to solve it. If others have succeeded, and I can see in the video that it works for you, maybe it will work for me too. :)
    Could a difference between Windows versions cause this error?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your nice feedback 😊. A different Windows version might be the reason. Which version do you use? Is an error message shown?

  • @deeber35
    @deeber35 1 year ago

    Can you change the tone of the voice reading text {e.g. excited, sad, etc}?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Emotions aren't supported in Coqui TTS models (as far as I know). Maybe SSML in Mimic 3 might be at least a little helpful in that context.

  • @youngphlo
    @youngphlo 1 year ago

    I followed every step up until 08:33, but when I run `pip install TTS` it tries to install every version of transformers. I would share a screenshot if I could. I've never seen a `pip install` go through all the different versions of a package.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Maybe the Coqui TTS dependencies have changed in newer releases? Could you download/clone the version I used in the video, just to check whether this works?

    • @shivam5648
      @shivam5648 13 days ago +1

      So, any solution to that problem?

    • @youngphlo
      @youngphlo 12 days ago

      @@shivam5648 Are you running into the problem I described when you try to install now? As I understand it, this only happened with the old release back then. The original Coqui is pretty much deprecated now, but this error shouldn't happen anymore.

    • @shivam5648
      @shivam5648 12 days ago

      @@youngphlo It's just not installing. It takes hours, and after installing for hours there is this error. It's so frustrating.

  • @ThugLife-is1yo
    @ThugLife-is1yo 1 year ago

    I'm confused: where exactly did you put your voice files for training?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      You're looking for the parameter "dataset_config" in the training recipe file. There you can write the file location to your voice files (in LJSpeech format) for training.
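The folder that `dataset_config` points to is expected to follow the LJSpeech layout: a `metadata.csv` with one `id|text|normalized text` row per clip, and the audio in a `wavs/` subfolder as `<id>.wav`. In recent recipes this folder is passed as `path` in something like `BaseDatasetConfig(formatter="ljspeech", meta_file_train="metadata.csv", path=...)`, but parameter names have changed between versions, so treat that as an assumption. Below is a minimal stdlib check for the layout, assuming UTF-8 metadata with three columns as in the original LJSpeech dataset; the function is hypothetical, not part of Coqui TTS.

```python
from pathlib import Path


def check_ljspeech(dataset_dir) -> list[str]:
    """Return a list of problems found in an LJSpeech-style dataset folder."""
    root = Path(dataset_dir)
    metadata = root / "metadata.csv"
    if not metadata.is_file():
        return [f"missing {metadata}"]
    problems = []
    with open(metadata, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines
            cols = line.split("|")
            if len(cols) != 3:
                problems.append(f"line {line_no}: expected 3 columns, found {len(cols)}")
                continue
            wav = root / "wavs" / f"{cols[0]}.wav"
            if not wav.is_file():
                problems.append(f"line {line_no}: {wav.name} not found in wavs/")
    return problems
```

An empty list means every metadata row has a matching WAV file; it says nothing about audio quality or sample rate.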

  • @RobinLorenczat
    @RobinLorenczat 1 year ago

    Is it possible to combine two voices? And what sample rate should I use for the dataset?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      What do you mean by "combining two voices"? I've trained my TTS models with a 22 kHz sample rate.

  • @Hellfreezer
    @Hellfreezer 1 year ago

    Is there a way to stop and resume training? The continue path command does begin the process but it then fails when generating sample sentences.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      It's been some time since I last continued/restored a training run. I guess you know my video on exactly this topic? ua-cam.com/video/O6KxJR95WpE/v-deo.html
      Isn't that working? Then maybe it's a bug or a changed use case in Coqui TTS.

    • @Hellfreezer
      @Hellfreezer 1 year ago

      @@ThorstenMueller Yes, that's the video I found the method in. I'm not sure if anyone else is having the same trouble, but I haven't been able to find a solution at present.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@Hellfreezer Is there any specific error message when running continue and while generating sample sentences?

    • @Hellfreezer
      @Hellfreezer 1 year ago

      @@ThorstenMueller I tried to post the full info but it seems to have been hidden. Basically the traceback ends in TypeError: expected string or bytes-like object

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@Hellfreezer There's a closed issue on that. Maybe this is helpful for you.
      github.com/coqui-ai/TTS/issues/2070

  • @boogeyman8099
    @boogeyman8099 1 year ago

    How do I fix the freeze issue? I can't find anything about it other than the resource you provided (the bug report), which was closed with the author's comment "we don't support Windows", when you've clearly done it on Windows! I've spent a lot of time on this and would like to figure it out; any help would be appreciated.

  • @kostas9849
    @kostas9849 1 year ago

    I need help! Inside the "TTS - training" folder there are some files, as you show in the video. How did these files get there? How do I get exactly the same files into the "TTS - Training" folder I made? And when I change into the TTS - Training folder and type the python command, nothing happens. Could you please help me with that? :(

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      I'm not sure I understand your question correctly. So the training process starts and the "output_folder" is created and filled with files? Are you already trying to synthesize a voice while training? Are audio samples available in TensorBoard?

    • @kostas9849
      @kostas9849 1 year ago

      @@ThorstenMueller I don't know how the output folder in your video was created and filled with files. I followed your steps one by one: I installed Python, eSpeak-ng, and Microsoft Build Tools, and I got stuck at the point where you open the command prompt. I created the directory as you did, but my directory doesn't contain the files you show in the video. I typed the python commands, but nothing happened. What did I do wrong? :(

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@kostas9849 Strange. The output directory, named after the training run with a timestamp of the training start, is created automatically. Did cloning the Coqui TTS repo and adjusting the recipe work?

  • @Live_draw_today
    @Live_draw_today 1 year ago

    Sir, while running the last line, an error occurs: "charmap codec can't decode bytes".
    Please help.
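The full traceback isn't shown, so this is an assumption, but on Windows a "'charmap' codec can't decode" error usually means a UTF-8 file (for example a metadata.csv containing non-ASCII characters) was opened without an explicit encoding, so Python fell back to the system code page. Two standard remedies are passing `encoding="utf-8"` to `open()` or running Python in UTF-8 mode (`set PYTHONUTF8=1` in cmd, or `python -X utf8 ...`). The sketch shows both the check and an explicit UTF-8 read; the helper names are hypothetical.

```python
import locale
import sys


def encoding_report() -> dict:
    """What Python uses when open() is called without an explicit encoding."""
    return {
        "preferred_encoding": locale.getpreferredencoding(False),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


def read_utf8_lines(path) -> list[str]:
    """Read a text file as UTF-8 regardless of the Windows code page."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\r\n") for line in f]


if __name__ == "__main__":
    print(encoding_report())
```

If `utf8_mode` is False and the preferred encoding isn't UTF-8, enabling UTF-8 mode avoids touching the training code at all; the explicit `encoding="utf-8"` is the fix when you control the code that opens the file.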

  • @pink_kniteu
    @pink_kniteu 1 year ago

    I would like to train a new TTS model for a new language. Is it done the same way? Could you give me some advice, please? It would really help me.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      You're right. It's working the same way. Maybe you can watch this tutorial showing how to create a voice dataset for your new language model.
      ua-cam.com/video/4YT8WZT_x48/v-deo.html

  • @josebo8780
    @josebo8780 2 months ago

    I am getting only around 80 iterations per hour in a setup with a rtx 3090. Is tooooo slow right?

    • @ThorstenMueller
      @ThorstenMueller  2 місяці тому +1

      Good question. But it is way faster than my NVIDIA Jetson Xavier AGX 😉

  • @kaymat2368
    @kaymat2368 1 year ago

    11:09 Help please, I'm stuck at this step because it gave this error: "OSError: [WinError 126] The specified module could not be found. Error loading "cudart64_110.dll" or one of its dependencies."

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Seems like your CUDA installation is broken. Are you sure CUDA is installed correctly?

    • @kaymat2368
      @kaymat2368 1 year ago

      @@ThorstenMueller I'm not sure, I followed your steps carefully.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@kaymat2368 Hard to say what might cause this issue. Maybe try installing a newer CUDA version.

    • @kaymat2368
      @kaymat2368 1 year ago

      @@ThorstenMueller OK, thanks for replying. By the way, my GPU is an NVIDIA GeForce GT 520 and my OS is Windows 7.
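To narrow down errors like the `cudart64_110.dll` one, it helps to ask PyTorch directly what it sees. The sketch below uses only standard `torch` calls (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`, `torch.cuda.get_device_name()`, `torch.cuda.get_device_capability()`) and returns a report instead of crashing when `torch` is missing or fails to load. As an aside that goes beyond this thread: the GeForce GT 520 is a Fermi-generation card (compute capability 2.1), which current CUDA toolkits and PyTorch builds no longer support, so GPU training on that card is unlikely to work even with a clean CUDA installation.

```python
def describe_cuda():
    """Summarize what PyTorch reports about CUDA; never raises if torch can't be loaded."""
    report = {"torch_version": None, "built_with_cuda": None,
              "cuda_available": False, "devices": [], "import_error": None}
    try:
        import torch
    except (ImportError, OSError) as exc:
        # OSError covers failing DLL loads such as "Error loading cudart64_110.dll"
        report["import_error"] = repr(exc)
        return report
    report["torch_version"] = str(torch.__version__)
    # torch.version.cuda is None for CPU-only PyTorch builds
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            report["devices"].append({
                "name": torch.cuda.get_device_name(i),
                "compute_capability": torch.cuda.get_device_capability(i),
            })
    return report


if __name__ == "__main__":
    for key, value in describe_cuda().items():
        print(f"{key}: {value}")
```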

  • @MatyssMatyss
    @MatyssMatyss 9 months ago

    Hello! I just wanted to know how many audio files I need to clone a voice. I recorded about 50 WAV files, but when I start the trainer, the script fails with "there is no sample left".

    • @ThorstenMueller
      @ThorstenMueller  9 months ago

      I guess 50 is way too few. I recorded over 10k WAV files for my German "Thorsten-Voice" voice clone. Maybe give it a try with 1,000 recordings.
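Before starting a run, it can help to count how many usable clips the dataset really contains. A minimal sketch, assuming an LJSpeech-style layout (a `metadata.csv` whose `|`-separated lines start with the clip ID, plus a `wavs/` folder); the function names are hypothetical, and the threshold of 100 comes from Thorsten's remark later in this thread that training won't start with fewer, not from Coqui's code:

```python
from pathlib import Path


def count_usable_clips(dataset_dir, metadata_name="metadata.csv"):
    """Count metadata rows whose WAV file exists; also return IDs of missing files."""
    root = Path(dataset_dir)
    usable, missing = 0, []
    with open(root / metadata_name, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            clip_id = line.split("|", 1)[0]  # first field is the clip ID
            if (root / "wavs" / f"{clip_id}.wav").is_file():
                usable += 1
            else:
                missing.append(clip_id)
    return usable, missing


def check_dataset(dataset_dir, minimum=100):
    """Print warnings for missing files and very small datasets; return the usable count."""
    usable, missing = count_usable_clips(dataset_dir)
    if missing:
        print(f"{len(missing)} rows point to missing WAV files, e.g. {missing[:3]}")
    if usable < minimum:
        print(f"Only {usable} usable clips; probably too few to start training.")
    return usable
```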

  • @dayteimyasuki
    @dayteimyasuki 1 year ago

    I can't get the pip command to work, help!!

  • @andiratze9591
    @andiratze9591 1 year ago

    Hey Thorsten. Is it possible to install Coqui with all models and features, like on the website, so that you no longer have to type commands and can use it completely offline through the user interface?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Hi Andi, I assume you mean Coqui Studio.
      As far as I know, it isn't part of their open-source release, so I'd say that's not possible. Only the "tts-server" command gives you a locally running web frontend, but of course that can't be compared with Coqui Studio.

    • @andiratze9591
      @andiratze9591 1 year ago

      Is there other software that can be used offline once everything is set up, or at least Coqui with a few pretrained models?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@andiratze9591 You can use all Coqui TTS models offline, just not through a comfortable interface like Coqui Studio. Do you know this video of mine? I show it there. ua-cam.com/video/alpI-DnVlO0/v-deo.html
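For the command-line route, the `tts` command that comes with Coqui TTS synthesizes a WAV file locally; after a model has been downloaded once it is cached, so later runs work without an internet connection. A small Python wrapper, assuming the `tts` executable is on PATH; `tts_models/de/thorsten/vits` is used as an example model name and should be checked against the output of `tts --list_models`:

```python
import shutil
import subprocess


def build_tts_command(text, out_path, model_name="tts_models/de/thorsten/vits"):
    """Assemble the argument list for Coqui's `tts` CLI (a list avoids shell quoting issues)."""
    return ["tts", "--text", text, "--model_name", model_name, "--out_path", out_path]


def synthesize(text, out_path, model_name="tts_models/de/thorsten/vits"):
    """Run the `tts` CLI once and return the path of the written WAV file."""
    if shutil.which("tts") is None:
        raise RuntimeError("`tts` not found on PATH; is Coqui TTS installed in this environment?")
    # check=True raises CalledProcessError if synthesis fails
    subprocess.run(build_tts_command(text, out_path, model_name), check=True)
    return out_path
```

For example, `synthesize("Hallo Welt", "hallo.wav")` writes the audio to hallo.wav in the current folder.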

    • @andiratze9591
      @andiratze9591 1 year ago

      Ah, thanks, I thought it was just a video with terminal commands and no user interface. I'll reinstall Windows later and give it a try. 🙂

    • @andiratze9591
      @andiratze9591 1 year ago

      I'll try to learn Python later. Maybe I can program my own TTS/VC. It's impossible to find free software in this area that is easy to use. For everything else I can find something: photo, video, etc., but TTS is really bad 🥴

  • @psyk0l0ge
    @psyk0l0ge 1 year ago

    It tells me that I might need to install a third-party phonemizer for the language "de"... Where do you get the extra files from that you have installed and cd into at about 10:37?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Did you install espeak-ng as shown here?
      ua-cam.com/video/bJjzSo_fOS8/v-deo.html

    • @captainlavenderVHS
      @captainlavenderVHS 11 months ago

      I had this problem too... A reboot seemed to fix it, but I also did a "pip install phonemizer" before, which may not have actually been necessary.
      In case anyone else is wondering, got this running on Win 11, using Anaconda 2.5.1 (Python 3.11.5), CUDA 12.3.5.1, and Coqui TTS 0.21.2
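The "third party phonemizer" message most likely means Coqui could not find an eSpeak executable for phonemizing German. A quick check of whether `espeak-ng` (or the older `espeak`) is visible on PATH from the current prompt; this is a plain `shutil.which` lookup for illustration, not Coqui's own detection code:

```python
import shutil


def find_espeak():
    """Return the full path of espeak-ng or espeak if either is on PATH, else None."""
    for name in ("espeak-ng", "espeak"):
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_espeak() or "No espeak-ng/espeak executable found on PATH")
```

If it finds nothing right after installing espeak-ng, opening a new prompt (or rebooting, as described above) lets the updated PATH take effect.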

  • @muhammadalfahrezi1745
    @muhammadalfahrezi1745 1 year ago

    I want to make a new model for the Indonesian language, but espeak-ng doesn't support that language. Is it still possible to make a new model?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your good question. Yes, that's possible. You can set "use_phonemes" to "false", and then it will use character-based training.
      Maybe this helps a bit. tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html?highlight=use_phonemes

    • @muhammadalfahrezi1745
      @muhammadalfahrezi1745 1 year ago

      @@ThorstenMueller Still using espeak or not? The alphabet is the same as in English; only the spelling is different. Sorry, I ask a lot.
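With "use_phonemes" set to false, the model learns directly from characters, so eSpeak should not be needed for that language. A rough illustration of what character-based input looks like; `basic_clean` is a hypothetical stand-in written in the spirit of Coqui's `basic_cleaners` (lowercasing and whitespace collapsing), not Coqui's actual tokenizer:

```python
import re


def basic_clean(text):
    """Lowercase and collapse whitespace (hypothetical stand-in for a text cleaner)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def to_characters(text):
    """Character-based input: each symbol of the cleaned text becomes one token."""
    return list(basic_clean(text))


if __name__ == "__main__":
    # Indonesian example; no phonemizer is involved at any point.
    print(to_characters("Selamat  Pagi"))
```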

  • @cmyk8964
    @cmyk8964 1 year ago

    I started training the model, and after 8 hours only 2 epochs were completed. Is this normal, and do I need to complete all 1000?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      What do you mean by "completed"? Normally the training process runs until you stop it manually. Did training end automatically?
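Putting the numbers from the question into perspective: 2 epochs in 8 hours means about 4 hours per epoch, so all 1000 epochs would take roughly 4,000 hours. A small helper for that estimate, assuming a constant time per epoch:

```python
def estimate_remaining(hours_elapsed, epochs_done, epochs_total):
    """Return (hours_per_epoch, remaining_hours, remaining_days) at a constant pace."""
    if epochs_done <= 0:
        raise ValueError("epochs_done must be positive")
    per_epoch = hours_elapsed / epochs_done
    remaining_hours = per_epoch * (epochs_total - epochs_done)
    return per_epoch, remaining_hours, remaining_hours / 24
```

With the values from the question, `estimate_remaining(8, 2, 1000)` gives 4 hours per epoch and 3,992 remaining hours (about 166 days), which fits Thorsten's point that training normally runs until you stop it manually.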

  • @MistakingManx
    @MistakingManx 7 months ago

    Right, how should I go about creating the dataset, though?

    • @ThorstenMueller
      @ThorstenMueller  7 months ago

      Hi, do you know my tutorial on Piper-Recording-Studio for doing so? ua-cam.com/video/Z1pptxLT_3I/v-deo.html

    • @MistakingManx
      @MistakingManx 7 months ago

      @@ThorstenMueller I started following your Mimic-Recording-Studio tutorial and its instructions so I could make my own Coqui LJSpeech model, but it isn't working for some reason.
      Some files don't exist anymore, and it seems to complain about numpy.

    • @ThorstenMueller
      @ThorstenMueller  6 months ago

      @@MistakingManx Hmm, as Mimic-Recording-Studio is not actively maintained, it might stop working due to newer package versions (like numpy). I'd use Piper-Recording-Studio, as it generates an LJSpeech-like dataset too.

    • @MistakingManx
      @MistakingManx 6 months ago

      @@ThorstenMueller I already used Mimic-Recording-Studio, since it's what the tutorials used, and it seemingly worked fine, apart from the part I had to fix.
      Your script that creates the dataset was useful; I just can't get the training to work at all.
      I wanted to use Windows since I have a 4090 Ti on it.
      Would it be possible to talk on a platform like Discord?

    • @ThorstenMueller
      @ThorstenMueller  6 months ago

      @@MistakingManx You can send me an email using my contact form here: www.thorsten-voice.de/en/contact/
      But it might take me some time to respond, so please be a little patient 🙂.

  • @peethaer
    @peethaer 1 year ago

    You're my hero.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      I probably wouldn't go that far 😉. But I'm very happy about this more-than-kind feedback 😊.

  • @recrieprodutora
    @recrieprodutora 1 year ago

    The process returns the error: "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process..." I used your code.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      I've seen this error before, but I'm not absolutely sure about the reason. Does training run anyway, or does it not start at all? Does running the command prompt as admin change the behavior?

    • @recrieprodutora
      @recrieprodutora 1 year ago

      @@ThorstenMueller The training starts, but then the error occurs. I don't know how to fix it.

    • @recrieprodutora
      @recrieprodutora 1 year ago

      @@ThorstenMueller I tried changing the folder root and the prompt's permissions, but the error keeps coming back.
      Have you ever seen anything like it? Even your "train..." script, which already contains "if __name__ == '__main__':", gives me an error during training. Do you have any idea which way I should go? 😪😥

    • @shadaaan
      @shadaaan 1 year ago

      I'm getting the same error too. Has anyone found a solution?

  • @michaelb1099
    @michaelb1099 1 year ago

    Great tutorial, but I'm trying to replace my Microsoft voices with my cloned voice. Is this doable?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +2

      Thanks for your nice feedback 😊 and great question. I tried this some time ago too, but didn't find an easy solution. But if there's general interest, I might take a closer look. Most voices seem to come from their Microsoft Azure cloud services.

  • @shazams461
    @shazams461 1 year ago

    Okay 👍🏻👍🏻

  • @JamesBond-ix8rn
    @JamesBond-ix8rn 1 year ago

    How long does training take until it sounds good?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      It depends on what you mean by "good" 😉. By step 30k you should be able to hear a voice with lots of background noise. From around step 100k on, the voice should be clearer. After that, it's up to your personal expectations.

    • @JamesBond-ix8rn
      @JamesBond-ix8rn 1 year ago

      @@ThorstenMueller Thanks for the prompt response. How long does this take in hours/days/months, and approximately how much input data would I need?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@JamesBond-ix8rn It's hard to give specific values, as it depends on the hardware you have available for training. Training time might be anywhere from a few hours to weeks or months. Ensure good phonetic balance, and add more recordings over time if you're not satisfied with the result.
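The milestones mentioned above (around 30k steps for a first, noisy voice and 100k for a clearer one) can be turned into rough wall-clock estimates once you know your time per step from the training log. A sketch of that arithmetic, assuming a constant rate; the 80 iterations per hour in the example is the rate reported on an RTX 3090 earlier in this thread:

```python
def seconds_per_step_from_rate(iterations_per_hour):
    """Convert a throughput figure into seconds per training step."""
    return 3600 / iterations_per_hour


def hours_to_reach(step_target, seconds_per_step):
    """Wall-clock hours to reach step_target at a constant seconds_per_step."""
    return step_target * seconds_per_step / 3600


if __name__ == "__main__":
    sps = seconds_per_step_from_rate(80)  # 45 s per step
    for target in (30_000, 100_000):
        h = hours_to_reach(target, sps)
        print(f"{target:>7} steps: {h:,.0f} h (~{h / 24:.0f} days)")
```

At 80 iterations per hour (45 s per step), 30k steps come to 375 hours and 100k steps to 1,250 hours, i.e. roughly 52 days.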

  • @IngridUterus
    @IngridUterus 10 months ago

    I have Python 3.11 installed. Do I have to uninstall it and install 3.8? That would really suck.

    • @ThorstenMueller
      @ThorstenMueller  10 months ago

      According to the README, Python 3.11 should work (python >= 3.9, < 3.12).
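A quick check of the running interpreter against the range quoted from the README. The bounds are copied from the comment above and may differ for other Coqui TTS releases:

```python
import sys

MIN_VERSION = (3, 9)   # inclusive, as quoted above
MAX_VERSION = (3, 12)  # exclusive


def python_supported(version=None):
    """True if (major, minor) lies in [MIN_VERSION, MAX_VERSION)."""
    if version is None:
        version = sys.version_info[:2]
    return MIN_VERSION <= tuple(version) < MAX_VERSION


if __name__ == "__main__":
    v = ".".join(map(str, sys.version_info[:3]))
    print(f"Python {v}: {'OK' if python_supported() else 'outside the supported range'}")
```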

  • @mungamurisairamiiitdharwad7451

    How many samples do we need for the training?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      As always, it depends 😉. With fewer than 100, the training process will not start. I recorded more than 10,000 phrases for my German "Thorsten-Voice" TTS models. But phonetic coverage might be more important than the sheer number of recordings.
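Phonetic coverage is best judged on phonemes, but as a crude first look you can count how often each pair of adjacent letters occurs in your recording script and flag the rare ones as candidates for additional sentences. This is a spelling-based proxy chosen for illustration, not how the Thorsten-Voice dataset was checked, and for languages like German or English spelling only approximates pronunciation:

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent letter pairs (within words, lowercased) across all sentences."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            letters = [c for c in word if c.isalpha()]
            counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


def rare_bigrams(sentences, max_count=1):
    """Bigrams seen at most max_count times: candidates for extra recordings."""
    return sorted(bg for bg, n in bigram_counts(sentences).items() if n <= max_count)
```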

  • @-.nocturna.-
    @-.nocturna.- 1 year ago

    How long does it take to train a model? Regards

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Hello 👋. For my Thorsten-Voice models, training took around 3 months of 24/7 compute time. But this depends on the hardware you have available for training.

    • @-.nocturna.-
      @-.nocturna.- 1 year ago

      @@ThorstenMueller Whoa, did you train it yourself? What GPU did you use? That's insanely long in these trying times of high energy prices. :/

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      @@-.nocturna.- Absolutely. This is the usual trade-off between graphics performance and duration. I used an NVIDIA Jetson Xavier AGX, which has relatively low power consumption.

    • @-.nocturna.-
      @-.nocturna.- 1 year ago

      @@ThorstenMueller That's a nice one. 30 W vs. the 320 W of my 4080 :| I think I'll do it if my other projects fail :P Have a nice night :>

  • @BaDHamisteR
    @BaDHamisteR 1 year ago

    Is it possible to train the model to speak Portuguese?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Sure, if you have a Portuguese voice dataset ready for training.

    • @BaDHamisteR
      @BaDHamisteR 1 year ago

      @@ThorstenMueller Well... I have my own voice 🤣. I want to try that.

  • @azer0013
    @azer0013 1 year ago

    Where is TTS-Training??

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      It is an empty folder in which you start working. I created a new folder "TTS-Training" but you can name it whatever you want.

  • @thebluefacedbeastyangzhi
    @thebluefacedbeastyangzhi 1 year ago

    Is there a non-CUDA version?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Coqui has a command line parameter called "use_cuda" which can be set to "false", but I guess training will take waaay longer than with CUDA.

    • @thebluefacedbeastyangzhi
      @thebluefacedbeastyangzhi 1 year ago

      @@ThorstenMueller Thank you for the reply. I have AMD, not NVIDIA. So should I give up on this method?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      @@thebluefacedbeastyangzhi Hard to say, but maybe try a Google Colab notebook with a GPU that supports CUDA. That might be an easier way for you if you don't have access to a local NVIDIA GPU.

    • @thebluefacedbeastyangzhi
      @thebluefacedbeastyangzhi 1 year ago

      @@ThorstenMueller Thank you again for this information.

  • @JoeLinux2000
    @JoeLinux2000 1 year ago

    Waiting for Linux to get proper high-quality text-to-speech.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      With Coqui TTS or Piper TTS there are some pretrained and really nice sounding TTS models available for Linux in multiple languages 😊. Do you know these?

  • @magenta6
    @magenta6 1 year ago

    Thanks, Thorsten, for your endless efforts at communicating a complex subject with enthusiasm and passion to people who don't know much about Python. I see that you have linked another video about preparing recordings: ua-cam.com/video/4YT8WZT_x48/v-deo.html

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      You're very welcome 😊. And yes, I'm really passionate about this topic.

  • @tesitest378
    @tesitest378 1 year ago

    Coqui (Eleutherodactylus): a frog from Puerto Rico 🇵🇷

  • @KominoStyle
    @KominoStyle 1 year ago

    Well, something on my end is not working -.-!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Do you get any specific error message?

    • @KominoStyle
      @KominoStyle 1 year ago

      @@ThorstenMueller Sorry for the late response. I tried many different ways to install and use TTS, but one big problem I had was that I can't install Python 3.8 for all users, while every other version installs fine,
      and I'm not sure if that's the main problem.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago
      @ThorstenMueller  Рік тому

      @@KominoStyle Which Python version are you using then?

  • @tarekhassan6958
    @tarekhassan6958 1 year ago

    It looks like mining issues

  • @Hinterfrage
    @Hinterfrage 1 year ago +1

    Oh, the gentleman only posts scam clips, interesting, there's a lot to report here...

  • @nobudy_left
    @nobudy_left 3 months ago

    That shirt 😂 "damn encoding", I feel that

    • @ThorstenMueller
      @ThorstenMueller  3 months ago +1

      Thanks a lot 😊, it's one of my favorite shirts, too.

  • @a.tevetoglu3366
    @a.tevetoglu3366 1 year ago

    Hey, how's it going?! ;)

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Great, and you? ;)

    • @a.tevetoglu3366
      @a.tevetoglu3366 1 year ago

      @@ThorstenMueller Same as always. By the way, many thanks for your content. I bought two RTX A5000s and I'm wondering what I can do with them, since I'm not a gamer, architect or programmer (the original plan to build a rendering workstation became obsolete for various reasons), and your videos inspire some really interesting experiments. I was interested in running my own AI projects, and it seems you offer the know-how for that. Best regards from Turkey, from a Rhineland expat.

  • @OurSouthernLife
    @OurSouthernLife 10 months ago

    Thank you, this video has helped me get to this point. Can you help with this error? I am stuck here and can't seem to find a solution. I followed your video, but when I run the trainer I get the following error:
    (TTS) C:\Users\7danny\Documents\CoquiTTS\TTS>python .\train_vits_win.py
    Traceback (most recent call last):
      File ".\train_vits_win.py", line 6, in <module>
        from TTS.tts.configs.vits_config import VitsConfig
      File "C:\Users\7danny\Documents\CoquiTTS\TTS\TTS\tts\configs\vits_config.py", line 5, in <module>
        from TTS.tts.models.vits import VitsArgs, VitsAudioConfig
      File "C:\Users\7danny\Documents\CoquiTTS\TTS\TTS\tts\models\vits.py", line 38, in <module>
        from TTS.vocoder.models.hifigan_generator import HifiganGenerator
      File "C:\Users\7danny\Documents\CoquiTTS\TTS\TTS\vocoder\models\hifigan_generator.py", line 6, in <module>
        from torch.nn.utils.parametrizations import weight_norm
    ImportError: cannot import name 'weight_norm' from 'torch.nn.utils.parametrizations' (C:\Users\7danny\Documents\CoquiTTS\TTS\lib\site-packages\torch\nn\utils\parametrizations.py)

    • @ThorstenMueller
      @ThorstenMueller  10 months ago

      You're welcome. Did you update all Python packages before starting the training?
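Some background on this error, stated as an assumption to verify: `torch.nn.utils.parametrizations.weight_norm` only exists in newer PyTorch releases (it appeared around PyTorch 2.1), so an older `torch` inside the virtual environment would raise exactly this ImportError, which fits the advice to update the packages. A stdlib-only sketch that reads the installed version with `importlib.metadata` and compares it to that assumed minimum:

```python
import re
from importlib import metadata

ASSUMED_MIN = (2, 1)  # assumption: first release with parametrizations.weight_norm


def parse_version(version):
    """'2.0.1+cu118' -> (2, 0); only the numeric major.minor part is used."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_new_enough(version=None):
    """Compare an explicit version string, or the installed torch, with ASSUMED_MIN."""
    if version is None:
        try:
            version = metadata.version("torch")
        except metadata.PackageNotFoundError:
            return False  # torch is not installed in this environment
    return parse_version(version) >= ASSUMED_MIN
```

If `torch_new_enough()` returns False, upgrading PyTorch in that environment (picking the build that matches your CUDA version) is the first thing to try.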

  • @OmriDaxia
    @OmriDaxia 1 year ago

    This is an awesome tutorial, thank you for doing all the trial and error that I kept running into.
    I do have one problem, though. I've used your modified training script and only changed the directories, but I'm still getting a permission error:
    PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'D:/TTS/ThorstenTut/ljsAlex01-April-26-2023_05+12PM-0000000\\events.out.tfevents.1682554375.DESKTOP-IUNHJ2B'
    Is there any workaround for this? It's pointing to one of the files it just generated, which means it's not being used by any other process, so it must be the multithreading problem you mentioned still causing trouble somehow.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your nice feedback 😃. I ran into that permission issue once, too. I'm not sure how I solved it. I'll check my notes for this video and try to remember how I solved it. If I remember, I'll share it here. As a first step, maybe try running the command prompt as local admin.

    • @zsoltvastagh7023
      @zsoltvastagh7023 1 year ago

      @@ThorstenMueller I have the same problem. Please let me know if you have found a solution to the error. Thank you very much!

    • @thefurrowzor
      @thefurrowzor 1 year ago

      Any updates regarding this issue?

    • @OmriDaxia
      @OmriDaxia 1 year ago

      @@thefurrowzor Nope, still stuck here. Not sure what to do.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@thefurrowzor Might this issue help you? For me it worked while testing for this tutorial. Hopefully it'll work for you too. If so, I could add the link to the video description.
      github.com/coqui-ai/TTS/issues/1711
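Background on the `if __name__ == '__main__':` guard mentioned in this thread: on Windows, Python starts worker processes with the "spawn" method, which re-imports the training script in each worker, so any code that launches workers has to sit behind that guard. A minimal, generic sketch of the pattern using the standard `multiprocessing` module (not Coqui's trainer). The `num_workers=0` branch mirrors the idea of setting the recipe's `num_loader_workers` and `num_eval_loader_workers` to 0; whether that avoids the WinError 32 problem is an untested assumption, not a confirmed fix:

```python
import multiprocessing as mp


def process_item(x):
    # Stand-in for per-sample work such as loading and preprocessing a clip.
    return x * x


def run(items, num_workers=2):
    """Process items in a worker pool, or in the main process when num_workers is 0."""
    if num_workers == 0:
        # Everything stays in the main process; no child processes are started.
        return [process_item(x) for x in items]
    with mp.Pool(num_workers) as pool:
        return pool.map(process_item, items)


if __name__ == "__main__":
    # Without this guard, each spawned worker on Windows would re-execute the
    # top-level code, including the Pool creation, and Python would raise a RuntimeError.
    mp.freeze_support()  # only has an effect in frozen Windows executables
    print(run(range(5), num_workers=2))
```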