Whenever I see a notice from your channel, I am excited to see the video. Thank you for making our lives easier❤
Wow 😊, this is absolutely at the top of my list of the most amazing feedback I've gotten so far 🤩. If there's a special topic you would like to see, please let me know. Thank you 😊
This has been very helpful. Thanks for all you do.
Looks a lot like mimic-recording-studio re-branded as Piper. Even so, excellent work with the tutorial. For anyone using this tool to create their own voice training data, it is important to take a break from recording every half hour or so to ensure high-quality voice samples, avoid straining your voice, and give your voice time to rest.
Thanks for your nice feedback on my tutorial 😊. I agree, it has lots of similarities with MRS, but it already provides text phrases to record for multiple languages and makes it easy to create a usable LJSpeech structure. And yes, taking regular breaks from recording is really important for quality 👍.
Great tutorial, thank you Thorsten!
Also funny to hear the voice which is talking from my home assistant instance to us😂😉
😂
I would love a tutorial on how to train your own wake word with openwakeword (with your own voice/samples)
"Wakeword" is on my TOP 5 video topic TODO list 😊. So, yes it's coming.
Thank you for the guide, some generated examples from the cloned voice would have been helpful to assess the quality.
Thanks for your feedback and suggestion 😊. In upcoming videos I'll try to show the final results closer to the beginning, so that people get an idea of what to expect.
another cool tuto ... thanks sir ... all respect
Thanks for your nice feedback 😊
It would be super awesome to make a simple python script that uses Whisper to parse a folder of wavs into the ljspeech csv format. That way you could use audio samples from an existing corpus without having to manually record your voice.
Sounds like a nice use case. Whisper is still on my todo list 😊.
Yes, that would be extremely awesome. Having wav files in a folder and parsing that folder to create the LJSpeech csv format, or what they call a dataset.
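The idea in this thread could be sketched roughly as below. This is only a minimal sketch, not a tested tool: it assumes the openai-whisper package is installed (`pip install openai-whisper`), that the clips are already split into short wav files in a `wavs` folder, and that the target is a `metadata.csv` with one `file_id|text` row per clip. The function names `whisper_transcriber` and `build_metadata` are made up for illustration.

```python
from pathlib import Path
from typing import Callable


def whisper_transcriber(model_name: str = "base") -> Callable[[Path], str]:
    """Return a function that transcribes one wav file with Whisper."""
    # Assumption: the openai-whisper package is installed.
    import whisper

    model = whisper.load_model(model_name)

    def transcribe(wav_path: Path) -> str:
        return model.transcribe(str(wav_path))["text"].strip()

    return transcribe


def build_metadata(wav_dir: Path, transcribe: Callable[[Path], str]) -> Path:
    """Write metadata.csv next to wav_dir with one `file_id|text` row per wav."""
    metadata_path = wav_dir.parent / "metadata.csv"
    lines = []
    for wav_path in sorted(wav_dir.glob("*.wav")):
        # "|" is the column separator, so it must not appear in the text.
        text = transcribe(wav_path).replace("|", " ")
        lines.append(f"{wav_path.stem}|{text}\n")
    metadata_path.write_text("".join(lines), encoding="utf-8")
    return metadata_path
```

Usage would look like `build_metadata(Path("my_dataset/wavs"), whisper_transcriber("base"))`. Automatic transcripts contain mistakes, so the resulting csv should be proofread before training.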
Hi! Thanks for the video, it was really helpful for anyone looking to get into voice cloning. However, I have a question.
I followed all the steps, and I noticed that after exporting, the .wav files generated in the wavs folder have a sample rate of 48000 Hz. However, for fine-tuning a pre-trained Piper model, it seems that .wav files with a sample rate of 22050 Hz are required.
My question is: are the 48000 Hz audio files still acceptable for fine-tuning, or should I convert them to 22050 Hz before proceeding?
Thanks for your nice feedback. IMHO you should downsample to 22050 Hz to fine-tune an existing Piper model that uses a 22050 Hz sample rate.
Hey! Really great job on the tutorial!
I was wondering, can we use Piper in our python code? Call some method to perform TTS on some string of text and don't save the resulting sound file but just play it
Thanks for your nice feedback 😊. Using Piper natively in Python would be great, but I'm not sure how far development on this has come. I found this (github.com/rhasspy/piper#running-in-python), but it's not as native as I would hope for. I guess this will be possible in the near future.
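Until there is a more native way, one possible workaround is to call the Piper command line from Python and pipe the raw audio straight into a player, so no sound file is saved. The sketch below assumes a Linux system with the `piper` and `aplay` executables on the PATH and a voice model trained at 22050 Hz. The helper names and the model path are placeholders, not part of Piper itself.

```python
import subprocess


def piper_command(model_path: str) -> list[str]:
    """Piper CLI call that writes raw 16-bit PCM audio to stdout."""
    return ["piper", "--model", model_path, "--output-raw"]


def aplay_command(sample_rate: int = 22050) -> list[str]:
    """ALSA player call that reads raw 16-bit mono PCM from stdin."""
    return ["aplay", "-r", str(sample_rate), "-f", "S16_LE", "-t", "raw", "-"]


def speak(text: str, model_path: str, sample_rate: int = 22050) -> None:
    """Synthesize `text` with Piper and play it without saving a file."""
    piper = subprocess.Popen(
        piper_command(model_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    player = subprocess.Popen(aplay_command(sample_rate), stdin=piper.stdout)
    piper.stdout.close()  # the player now owns the read end of the pipe
    piper.stdin.write(text.encode("utf-8") + b"\n")
    piper.stdin.close()  # signals end of input so Piper can finish
    player.wait()
    piper.wait()
```

A call would then be `speak("Hello from Piper", "en_US-lessac-medium.onnx")`, with the sample rate adjusted to whatever the chosen model uses.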
Hi, excellent video, thank you for your time! I just have one question about the audio quality used by default with Piper recording... My instance (installed manually like in your video) is outputting 48 kHz recordings by default. I saw that this quality is too high for model training purposes. Do you know if it is possible to reduce the quality of the .webm during recording, or will downsampling need to be done after the .wav conversion with ffmpeg? Thank you again for your time on this subject.
Thanks for your kind feedback 😊. I guess the easiest way is, as you mentioned, to downsample later using tools like ffmpeg. At least I'm not aware of a way to set a lower rate for the actual recording session.
@@ThorstenMueller thank you for the reply. Yes, I confirm that the best solution was to manually launch an ffmpeg batch on all the .wav files to have 22khz samples 👍
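For anyone who wants to script that ffmpeg batch, here is a minimal sketch. It assumes ffmpeg is on the PATH, that the exported files sit in one folder, and that mono 22050 Hz output is wanted. The folder names and function names are placeholders.

```python
import subprocess
from pathlib import Path


def resample_command(src: Path, dst: Path, rate: int = 22050) -> list[str]:
    """Build the ffmpeg call for one file (-ar sets the rate, -ac 1 forces mono)."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(rate), "-ac", "1", str(dst)]


def resample_folder(src_dir: Path, dst_dir: Path, rate: int = 22050) -> int:
    """Resample every wav in src_dir into dst_dir and return the file count."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(src_dir.glob("*.wav")):
        # check=True stops the batch on the first file ffmpeg cannot convert.
        subprocess.run(resample_command(src, dst_dir / src.name, rate), check=True)
        count += 1
    return count
```

Writing into a separate output folder, e.g. `resample_folder(Path("wavs"), Path("wavs_22050"))`, keeps the original 48 kHz recordings untouched in case they are needed again.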
Great tutorial @ThorstenMueller! One of the best channels for stepping into the AI voice world. Since my native language is Slovenian, which is not (yet) supported by Rhasspy, and since I'm quite a DIY enthusiast, I would love to contribute to the open-source community or at least start with recordings for testing with Home Assistant (HA). I've put some work into translating intents in HA and prepared code for the Slovenian language. Now, I need to proceed with TTS for Slovenian. I have some questions if I may:
I checked 'how to contribute' in Rhasspy docs, and things are quite straightforward for supported languages. Is it the same for unsupported languages?
What kind of equipment (mics) for recording Slovenian voice is recommended in terms of quality standards and minimum requirements?
Thanks for your understanding, and I look forward to your guidance!
Thanks for your really nice feedback - that my channel "is one of the best for getting in AI voice world" is a huge compliment - so thank you 🥰.
If you would like to contribute your voice to the open voice community, you can choose whether to contribute your voice recordings and/or a pretrained Piper TTS model with your voice.
In general, use a good microphone, listen to your recordings at full volume (to hear random noise in the background) and record at a consistent speed.
More of my lessons learned are here: ua-cam.com/video/Z1pptxLT_3I/v-deo.html
Thanks for the video. This is the software I used to record my voice a few months back, except I used the docker version. This all went well for me. It was the training I couldn't manage to figure out / get working. I think the last video you produced was on training, if I remember correctly. For high-quality voice training, do you need more samples than Piper Recording Studio records? And if you do, what do you recommend for doing the additional recordings? Thanks for taking the time and sharing the information.
Joe
Hi Joe, thanks for your comment and nice feedback on my video 😊. For my German "Thorsten-Voice" TTS models I recorded over 30k wave files 😉. I used Mimic-Recording-Studio in the past but would now use Piper-Recording-Studio as it is better maintained.
There are 1000+ text lines to prepare the training data. Where can I get those text lines so that I can practice before recording?
You mean reading the actual text before starting the recording session? Here's one text file for English (as an example): github.com/rhasspy/piper-recording-studio/blob/master/prompts/English%20(United%20States)_en-US/0000000001_0300000050_General.txt
Is this what you're looking for?
@@ThorstenMueller Thank you very much 🙏
Thanks for this howto!
Once you have the LJSpeech dataset, how do you use it with Piper TTS?
You're welcome and hopefully you found it helpful 😊. After recording and having this LJSpeech structure you can follow this tutorial for the next steps. ua-cam.com/video/b_we_jma220/v-deo.html
Thanks for the nice tutorial! Let's say I don't want to use Piper Recording Studio for recording but still want to use it for exporting my dataset, to avoid creating this tedious LJSpeech format by hand. This should be possible as long as I follow Piper Recording Studio's file naming convention, right? Am I able to choose the file names freely (with the txt and audio still matching) or do they have to be numbered in some sense? Also, do the individual sentences have to be split into separate files or doesn't it make a difference? The reason I'm asking is that I'm capable of doing more professional voice recordings, and to me it would be way more convenient if I recorded in my DAW (I would also like to play around with audio postprocessing). It would also be cool since you could simply take any existing recording and automatically transcribe it into a dataset using a speech-to-text AI like Whisper. However, splitting it up requires a lot of manual work which I would like to avoid. So is it fine to just have a single text file matching an hour-long recording or so?
Thanks for your nice comment and questions.
I'm not sure if you can "inject" existing recordings into PRS just for the export process. But as the LJSpeech structure isn't too complex maybe just create a script to create this structure. This could be a helpful start: github.com/thorstenMueller/Thorsten-Voice/blob/master/helperScripts/MRS2LJSpeech.py
There's no special numbering for recordings, but the filenames have to be unique. I'd split the recordings into sentences. IMHO, having e.g. one recording with a duration of one hour will not produce great results, especially if you would like to synthesize shorter phrases.
@@ThorstenMueller thanks for the reply. Makes sense. I had a further look into the ljspeech structure. I guess you're right that writing my own script might be easier than forcing an injection into PRS.
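Such a script could look roughly like the sketch below. It is not the MRS2LJSpeech.py helper linked above, just a minimal illustration under a few assumptions: the dataset folder contains a `wavs` subfolder, every sentence is its own wav file with a same-named .txt file next to it, and the target is a `metadata.csv` with `file_id|text` rows. The function name is made up.

```python
from pathlib import Path


def write_ljspeech_metadata(dataset_dir: Path) -> list[str]:
    """Collect wav/txt pairs from dataset_dir/wavs into dataset_dir/metadata.csv."""
    rows = []
    for wav_path in sorted((dataset_dir / "wavs").glob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.exists():
            continue  # skip recordings that have no transcript
        # Collapse line breaks and repeated spaces into single spaces.
        text = " ".join(txt_path.read_text(encoding="utf-8").split())
        # "|" separates the columns, so it must not appear inside the text.
        rows.append(f"{wav_path.stem}|{text.replace('|', ' ')}")
    metadata_path = dataset_dir / "metadata.csv"
    metadata_path.write_text("".join(row + "\n" for row in rows), encoding="utf-8")
    return rows
```

Because the file stem becomes the id in the csv, the only naming requirement here is the one mentioned above: the filenames have to be unique.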
I like your videos. I have one question: can a TTS voice model be created just from voice recordings of a person who has passed?
Thanks for your kind feedback 😊. This is primarily an ethical and legal question, which is why I can't say much about it. Technically this should work in principle. Of course, it depends on the quality and quantity of audio recordings available.
Hello Thorsten, thanks for your great videos. The fundamental question for me, and I think for many others too, is how to get your own voice into Home Assistant. I think that would be worth its own playlist, and above all you would reach an incredibly large community with it... because many people want to integrate their own voice here. I'd be glad to get feedback or links to a guide, in case something already exists and I just haven't found it. Many thanks
Thank you very much for your nice feedback and the good topic suggestion 😊. I've put your idea on my TODO list. It's certainly an exciting one.
I have a suggestion for you. Next time, you could move your image to the top right. Thanks for the videos.
That is a good idea, thank you 😊. Something always seems to be in the wrong position, hiding important parts of the screen - a logo, camera image, some overlay. And when all of that works, the font seems too small 😆. But I try my best to improve quality with every recording.
Hello. Would it be possible to clone your voice first using ElevenLabs, then have it speak all the relevant prompts to wavs using the API, thereby creating the LJSpeech package in the same Python script, or would there be a rate limit, even with paid accounts? It would save so much recording time. OK, we'd have to spend some dollars for the wav files. My question would be: how much for a well-covered German dataset to train your Piper TTS and then break free from the cloud? Thanks, Attila.
Hi Attila, I've no practical experience with ElevenLabs, but from what I've heard the quality is impressive. I don't know anything about the costs, but you should be able to create several wave files using ElevenLabs, pack them into an LJSpeech dataset and train Piper TTS with that. As all input files have already been synthesized, it might be okay to use less input data for training, as the phonetic pronunciation should be identical.
Thank you for the great tutorial. I could follow along until this point:
(.venv) D:\piper\piper-recording-studio>python -m pip install -r requirements_export.txt
There, I came across this error:
ERROR: Could not find a version that satisfies the requirement onnxruntime=1.11.0 (from versions: none)
ERROR: No matching distribution found for onnxruntime=1.11.0
What could I have done wrong?
Thanks for your nice comment 😊.
Maybe try running "pip install onnxruntime==1.16.3". This should install a supported version.
I finally downgraded to Python 3.11.7. Now everything works fine. Thank you for your kind help anyway 👍
@@DoctorVolt Happy you got it working 😊.
Hi, which is better, Piper or Coqui? Which gives better similarity, quality when cloning a voice?
Hi, in terms of quality, in my opinion Coqui TTS offers better model training options, but Piper is way more performant, runs way faster on e.g. a Raspberry Pi, and offers nice quality too.
@@ThorstenMueller One more question, if I may. Do I understand correctly that it is pointless to use Tortoise TTS as an API?
@@neurofoxo Sure you can ask additional questions, but i can't answer this because i've no practical experience with Tortoise TTS yet 😉.
Does this send my audio somewhere to the internet? I'm confused by the "By clicking Submit, you agree to dedicate your recorded audio to the public domain (CC0)" message.
I was as confused by this message as you were while preparing the video, so I contacted Mike, the main contributor, and in short - no, it all stays local 😊. But I pinged Mike, so maybe he can add more details to my comment.
No, it doesn't send the audio anywhere 🙂 I have this message in there because Piper Recording Studio is what's used for the Rhasspy voice contribution website which does upload back to the same server.
@@synesthesiam thanks!!
too bad it doesnt let you UPLOAD a file
What type of file would you like to upload?
Thank you, everything is OK, but I need Nepali and there is no Nepali language. How can I add Nepali? Please tell me how to get Nepali.
You can use your voice to create a dataset and train a Nepali TTS model, if this is what you are looking for. Do you know my tutorial on that? ua-cam.com/video/b_we_jma220/v-deo.htmlsi=V_eUArukdne7zjUi
A pretrained Nepali TTS model for Piper does not exist, imho.
Tape recorders have been able to copy ("clone") your voice for decades. The actual term for it is "recording."
1100 recordings, too much. Haha
I have recorded over 30k recordings for voice models already 😉😅.
When attempting to start the Piper Recording Studio:
python3 -m piper_recording_studio
I received the following error: ImportError: cannot import name 'url_quote' from 'werkzeug.urls'
The Werkzeug package released a new version on September 30th 2023, and appears to have removed a needed dependency. I decided to downgrade the Werkzeug package from 3.0.0 to 2.3.7 and it appears to now work.
pip uninstall Werkzeug
pip install --upgrade Werkzeug==2.3.7
Thanks for sharing the solution for anyone struggling with the Werkzeug dependency 👏. I searched the GitHub repo issues and found this one, which was probably opened by you, right 😉?
github.com/rhasspy/piper-recording-studio/issues/11
Would it be okay for you if I add the link to the video description, just in case other people struggle with this too?
Hi Thorsten, thanks for your video. Unfortunately I'm running into an error while executing the last step:
ERROR:__main__:export_audio
Traceback (most recent call last):
File "C:\Users\myuser\Documents\GitHub\piper-recording-studio\export_dataset\__main__.py", line 107, in __call__
audio_16khz_bytes = subprocess.check_output(
File "C:\Python310\lib\subprocess.py", line 420, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\Python310\lib\subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ffmpeg', '-i', 'output\\de-DE\\0000000001_0300000050_General\\0000000009.webm', '-f', 's16le', '-acodec', 'pcm_s16le', '-ac', '1', '-ar', '16000', 'pipe:']' returned non-zero exit status 3221225781.
Can somebody provide support? The ffmpeg.exe is stored in the base folder and all dependencies are installed as well.
fixed, you need the whole bin content not just the ffmpeg.exe
Okay, thanks. Can I add your tip to the video description?
Sure :)@@ThorstenMueller
@@drachenweisheit It's added to the description 😊
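A side note on why copying the whole bin folder helps: the exit status 3221225781 from the traceback is 0xC0000135 in hex, which is the Windows status code STATUS_DLL_NOT_FOUND. That suggests this ffmpeg.exe depended on DLLs that ship next to it in the bin folder. A small sketch for decoding such a status and checking whether ffmpeg is reachable at all (the function names are made up for illustration):

```python
import shutil

# Windows NTSTATUS value reported when a program cannot load a required DLL.
STATUS_DLL_NOT_FOUND = 0xC0000135


def describe_exit_status(code: int) -> str:
    """Render a process exit status as hex and flag the missing-DLL case."""
    unsigned = code & 0xFFFFFFFF  # also handles the signed form of the code
    if unsigned == STATUS_DLL_NOT_FOUND:
        return "0xC0000135: a required DLL was not found"
    return f"0x{unsigned:08X}"


def ffmpeg_on_path() -> bool:
    """True if an ffmpeg executable can be found via the PATH."""
    return shutil.which("ffmpeg") is not None
```

Calling `describe_exit_status(3221225781)` on the value from the error above points at the missing DLLs rather than at the recording or the export script.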
How do I create my own model .onnx and .onnx.json files for the dataset I exported?
I've responded to your other comment and hope my video suggestion has been helpful. ua-cam.com/video/b_we_jma220/v-deo.htmlsi=yjdXJIJQ1p693jRy
@@ThorstenMueller made some comments