The end of the video got cut off -_-. I only had like 10 seconds left, so when I get the chance, I'm just going to link a Short so that you guys can see the rest of the video lol
Wow, I am amazed by this channel. A few weeks ago I was searching for diarization of voices but had no good luck finding a good fit. Not only do you have a very good tutorial, you seem to be knowledgeable and up to date with everything (as up to date as one can be when things are moving this quickly).
Just found your channel last night, and your workflows are so clear and to the point. Quickly becoming my go-to for voice2voice workflows. Thank you for your work.
Just found your channel, and I want to say I'm so deep into the rabbit hole that I instantly recognized all the voices you use for conversion at the start 😂
Your channel and the AI Hub have helped me a lot in getting started. I just trained a model with 2 hours of audio from Fauna's last stream in RVC v2 for 1000 epochs, and it came out very well.
For those receiving an error with the "split_audio" script not creating the .srt file as per the above tutorial, run this in an Anaconda or Python prompt, let it download the required dependencies, and it will work as you need. Thank you for a great tutorial!
Copied from the issues section; this worked for me. Running split_audio.py threw this error:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'D:\ai\programs\audiosplitter_whisper\data\output\1.srt'
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
    subs = pysrt.open(srt_file)
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
    extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 180, in main
    process_audio_files(input_folder, settings)
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 183
    main()
FileNotFoundError: [Errno 2] No such file or directory: 'D:\ai\programs\audiosplitter_whisper\data\output\1.srt'

Additionally, the terminal was saying something about not having or not finding cublas64_12 (I can't remember exactly what it said). The error is thrown because the program can't find the .srt file, because it can't make the .srt file, and this is caused by a mismatch of CUDA versions: Torch (or something) ships with CUDA 11, but the script (or whatever) needs CUDA 12. I'm not a programmer and don't know exactly what is what; all I know is that this fixed it:

1. Download and install CUDA 12: developer.nvidia.com/cuda-12-0-0-download-archive
2. Navigate to "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\bin"
3. Copy cublas64_12.dll, cublasLt64_12.dll, and cudart64_12.dll
4. Navigate to "...\audiosplitter_whisper\venv\Lib\site-packages\torch\lib"
5. Paste the DLLs into this folder

Now when you run split_audio.py, it will be able to create the .srt file, fixing the "file not found" error.
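For anyone who would rather script the copy step above, here is a minimal sketch. The helper name is mine, not part of the project; the DLL names and folders are the ones from this comment, so adjust the paths to your install:

```python
import shutil
from pathlib import Path

# DLL names from the fix above (CUDA 12 runtime libraries)
CUDA12_DLLS = ("cublas64_12.dll", "cublasLt64_12.dll", "cudart64_12.dll")

def copy_cuda_dlls(src_dir, dst_dir, names=CUDA12_DLLS):
    """Copy the CUDA 12 runtime DLLs into torch's lib folder.

    src_dir: e.g. C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.0\\bin
    dst_dir: e.g. audiosplitter_whisper\\venv\\Lib\\site-packages\\torch\\lib
    Returns the list of files that were actually copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    copied = []
    for name in names:
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)
            copied.append(name)
    return copied
```

Skipping missing files (instead of raising) makes it safe to rerun if some DLLs were already copied by hand.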
Hey, thanks for your awesome series of tutorials! As someone who is pretty new to this, it really helps out a ton. Would it be possible if you could make a tutorial on how to train a RVC 2 Voices with the dataset I just created? Thanks again and keep up the great work!
Hello, me again with two small questions: 1. The file format of choice is of course WAV, but what should it be for the best quality? 44.1 kHz or 48 kHz? Mono or stereo? (My recordings are in mono, but I could duplicate the channels and create a "pseudo-stereo" track if that produces better results.) 2. Your audiosplitter_whisper is good for my spoken sound files, but what is the best way to split the sung recordings? I think that because of the continuous singing there isn't always a silence every 10 seconds (or less). What could you recommend? Or do you know a current, nice how-to that describes everything in detail for achieving the best quality? (These are really my last questions :) )
I am still working on it. I have decided to do this on the worst quad-core CPU there is: the 1.3 GHz, no-turbo, 4-core, 4-thread AMD Sempron 3850. I spent a bit over a week getting clean audio saved out of Ultimate Vocal Remover. I am using 12 hours of talking.
@@Jarods_Journey Thanks! It's going somewhat smoothly; I got 5 errors in the CPU part in Visual Studio Code, but I am just going to pretend they don't exist and move on with it. Lol.
I had this same error, but I ended up having the file from a previous installation of alltalk_tts. I'm sure you could find it elsewhere though. I ended up placing it in "audiosplitter_whisper\venv\Lib\site-packages\torch\lib" and everything worked as it did in his video.
Thank you so much! It's a clear video, and we can see that you know what you're doing! I have a small question regarding the .wav files of the dataset: is it better to encode them in stereo or in mono? Or does it make no difference to the program?
I don't think it makes a difference, but I read somewhere that it should be done in stereo. I believe it flattens them during processing though, so it doesn't really matter after that.
@@Jarods_Journey Thank you very much ! One last question : Is it better to segment the sounds into files of 10 seconds each, or to cut in the form of complete sentences (and therefore to have files of very variable duration)? Thx for your work !
@@pilpinpin322 :), complete sentences works best so you don't get weird clippings, but if you run out of VRAM, you'll need to split into smaller segments.
Hey, another banger video mate! Do you reckon it's wise to keep the sound of breaths, such as when they inhale or exhale? Or do I only need the parts where the source voice talks or sings? Let me know your thoughts and keep up the cool vids!
Whatever is included in the split audio should be fine. It may cut out some of the breathing at the end or beginning of a sentence, but everything else in between is fine to keep :)!
Hello. Thanks for your great videos. One question: I am from Germany and have WAV files spoken or sung in English AND in German. For your tool / whisperx I can handle them separately by changing the language. But my question is about RVC: for training a new model, can I mix those different languages together? I always did that, and now I realize that maybe this wasn't a good idea? Or does that not matter for RVC? Thanks in advance ;)
This is fine; RVC doesn't look at text to train. It's strictly extracting features from the audio provided. The only thing is it may sound accented: for example, if I train a model on Japanese audio and then use it to convert English speech, it may not sound 100% native English.
Hey there! Thank you for all those videos! I hadn't realized UVR5 had advanced options, lol. Hey, I have a question that may look silly but is serious: is it really required to train for _hundreds_ of epochs? I have had absolutely great results with only 50 epochs. What exactly do more epochs bring? Meanwhile, the issues I have also happen with models trained for hundreds or thousands of epochs, because most of my problems come from the way I clean the audio I want to clone. I also noticed my feminine voices tend to break at growls. Is it required to have growling audio in the dataset used for training? Or is there a secret sauce to make any voice able to growl?
Appreciate it! A finished epoch means the model has seen every sample once; increasing the epoch count just repeats that process X times. It's all data dependent, as you don't always need more epochs for a good model. As for growls, in general they seem to be harder for the models to infer, and my anecdotal experience is that all models kinda struggle with them. I have yet to try training with growls, but I want to try a similar experiment with laughing, because often laughing just sounds weird 😂
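To make the epoch definition concrete, here is a toy sketch of the bookkeeping (the function is mine, not RVC's code): one epoch means one full pass over every sample, so total training steps scale linearly with the epoch count.

```python
import math

def training_steps(num_samples: int, epochs: int, batch_size: int = 1) -> int:
    """Toy model: one epoch = every sample seen once, processed in batches."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs

# e.g. 600 ten-second clips at batch size 8 is 75 steps per epoch,
# so 300 epochs just repeats those same 75 steps 300 times over.
```

This is also why "more epochs" is not automatically better: past a point you are only re-showing the model the same data, which risks overfitting instead of improving quality.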
For some reason I keep getting an error where it cannot open the vocals.srt file. Did I miss a step? There is no vocals.srt file generated in the output folder for audiosplitter.
I have 3 minutes of studio-quality lossless vocals I would like to use to train. Is that sufficient? Additionally, there are some interviews on YouTube of the same artist speaking at length, but I was concerned whether the lower-quality mp3 stuff should be avoided for these purposes. Thanks for your video! Very informative.
Muffled audio should be excluded, but if the voice sounds good enough you can include it. 3 minutes may be okay, but idk, you just gotta try it out mate 🤟. 10 minutes or more is recommended, but you can sometimes use less and it'll be fine.
Jarod, do you know if there is software, or a website, or whatever, that lets you make a new voice out of other voices? Like blend them into a new voice? Especially RVC-type voices (since I know those best), but I'd be curious about others too.
I have audiosplitter_whisper installed and VS Code opened, trying to run debugging as per 12:00 in the video, and am getting the following error: "configuration 'python: file' is missing in 'launch.json'". Any idea what might be going on? BTW: it appears to work if I run "python split_audio.py" in PowerShell.
Hi! I followed your tutorial and managed to set everything up and run the script without getting any errors, but the problem is that I didn't get the expected amount of segments. I tried the script with three different audios. The first one, of about 4 minutes, got me an output of 35 seconds' worth of segments; the second one, also about 4 minutes, got an output of 1 min 36 sec total; and the third, a bit over 2 minutes, got 55 seconds. Do you know what the issue could be? Also, I tested speaker diarization with another audio, but it didn't go very well. It had 4 different speakers, which it separated into only 2, and all 4 speakers were in both folders.
They have been updated and now it is not possible to sort files by speakers. Can you look at the new version and tell me what can be done? Is it possible to use the old version somehow?
ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu118 (from versions: none) ERROR: No matching distribution found for torch==2.0.0+cu118
Ultimate Vocal Remover is struggling with some tracks; I can still hear the instrumental in the background with Kim Vocal 1. Is there a model where the vocals come out perfect? Great vid!
The vocal removers are really good, but they're not 100%, unfortunately. That's very hard to achieve, and I'm sure there are brilliant minds working towards it. But it doesn't exist ATM. You may be able to get better results with ensemble mode, but you'll have to research a bit on the best combos: github.com/Anjok07/ultimatevocalremovergui/issues/344
ERROR: Error [WinError 2] The system cannot find the file specified while executing command git version ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH? lol
Hi! I'm from another country and I don't really understand English, but this topic is very interesting! How can I teach a model to speak my language better?
Just have a question: how high is your batch size when you train? Is it the kind of thing where, if you set it too high, you get an imprecise model? If I have a dataset of one hour, what should my batch size be?
Is there some kind of Vocaloid-like interface so that I have some control over how certain words sound? It would be cool to have a TTS that could run the trained RVC voices.
All of this points to using Windows. Or am I missing something? I am on macOS and all the stuff is .bat and .exe, or Google Colab sandboxes running things. Is there no UI to this date that also runs on macOS? Have I missed it, perhaps?
Thank you for your amazing videos; they really help me understand how everything works. Just one question: I'm having some problems when running the "split_audio" script. It seems it isn't creating the .srt file for the audio, and when it tries to open the file it runs into an error. Do you know what it could be?
WhisperX may not have been downloaded correctly. I would try rerunning the setup file to get this going. One other thing you can do is type and enter "whisperx" into the console after activating the venv, to see if it got installed.
I think I had the same problem using the CUDA installation. If your debugger tells you that it can't find the .srt file when running the split_audio script, then check your terminal logs. If you have an error like this: "ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.", then it means that your GPU does not support FP16 execution. To fix it, go to line 26 in the split_audio script, which should be: return 'cuda', "float16" and replace "float16" with "float32" or "int8".
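The fix above boils down to choosing a compute type the hardware actually supports. A hedged sketch of that selection logic (the function and argument names are mine; the actual script hard-codes the return value around line 26):

```python
def pick_compute_type(cuda_available: bool, supports_fp16: bool):
    """Return a (device, compute_type) pair for whisperx.

    Falls back from float16 when the GPU lacks efficient FP16 support
    (the "Requested float16 compute type..." error), and to CPU/int8
    when CUDA is unavailable at all.
    """
    if not cuda_available:
        return "cpu", "int8"      # CPU path: int8 keeps memory and time down
    if supports_fp16:
        return "cuda", "float16"  # fast path on newer GPUs
    return "cuda", "float32"      # older GPUs, e.g. many GTX 10-series cards
```

With something like this in place, the script could degrade gracefully instead of crashing later with a misleading "missing .srt file" error.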
The code does not generate an .srt file for me from a single WAV, and I get a FileNotFoundError: No such file or directory: 'D:vocalsplittest/data\\output\\song.srt'
Maybe it's too late, but I solved it with "pip install -r requirements-cuda.txt". In my case I have an Nvidia graphics card; if you use CPU, replace it with "requirements-cpu.txt". For some reason there is a missing package that is not installed when running "setup-cuda.py". Always run the command within the virtual environment created previously with "venv".
I'm getting a FileNotFoundError in Visual Studio Code, where it cannot find srt_file. I followed your tutorial step by step, but I'm sure I did something wrong, since I don't get the same results when I run the program. Since I have no Python experience, I'm not sure what I did wrong here.
Just a maybe random question: I was having issues installing the audio splitter and thought it was because I hadn't installed NVIDIA's CUDA toolkit, so I ended up installing it, but something else was causing the error. So my question is: should I uninstall this CUDA toolkit? I don't know what it does exactly, or whether it could harm my configuration or GPU in the future.
I'm trying to do my own voice and got some decent results, but it can't handle higher pitches. Should I add more samples with my voice in a higher pitch, or give it more samples with my normal voice and train it for longer? I have it trained using the Harvard Sentences from a previous video and I did 300 epochs.
You can try adding samples at a higher pitch. The model is mainly going to be good at speaking in the pitch and timbre of the voice you train it with, so if your voice is naturally deeper, it's not going to know how to handle you suddenly speaking high.
Hey, at 12:22 I get a similar error, but .\venv\Scripts\activate doesn't seem to fix it. Are there any other solutions? It's giving me a FileNotFoundError, highlighting "subs = pysrt.open(srt_file)". Here's most of the error (there's more, just basically the same thing):

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'C:Users/myuser/OneDrive/Desktop/deleteme/audiosplitter_whisper/data\\output\\MyDataSet.srt'
  File "C:Users\myuser\OneDrive\Desktop\deleteme\audiosplitter_whisper\split_audio.py", line 101, in extract_audio_with_srt
    subs = pysrt.open(srt_file)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Also, great video so far!
Something happened when trying to make the .srt file. Make sure that whisperx downloaded and the setup ran without issue. You may also have to run VS Code in admin mode.
I had to manually go through the pain of figuring this out. Basically: make sure you're not in the virtual environment (type "deactivate" to be sure). Then, for everything that isn't installed or says the module name isn't found, find the install command online and add "--use-pep517" after it. For example, try "pip install PyYAML --use-pep517" for yaml.
Thanks for trying, but this thing has failed for me multiple times and I'm tired of trying to troubleshoot it. Is it that hard to just make an executable for people to use? I don't know jack shit about code and can't fix it when it doesn't do the same shit your computer does, even when following all the steps.
What if my voice doesn't speak any of the default languages? I have found a phoneme-based ASR model that suits me, but how do I use it in your code? Anyway, great tutorial!
Ah... I haven't dabbled in that area yet and don't know how it works for other, non-supported languages. I would test it as a command-line script first to see if you can get it working that way. I believe the --align_model argument would need to be used.
When running it like at 13:00, it says `failed to align segment ("!!!!!!!!!!"): no characters in this segment found in model dictionary, resorting to original...` multiple times, and once it finished, the folder had no segmented audio and was just empty. How do I fix this?
I think this is a language issue. If your audio files have multiple languages in them, that causes issues with whisperx, as does an unsupported language. Beyond that, please reference the whisperx GitHub issues page for more details, as I'm not sure what else causes this.
Hey Jarods, much appreciation for your tutorials. I'm facing an issue when running split_audio.py. I'm using a Spanish dataset, followed all your steps, and changed conf.yaml to language: "es". But when I run the split_audio.py script, I get this:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'D:\\Documentos\\VoiceCloning - AudioSplitter\\audiosplitter_whisper\\data\\Vocals\\output\\100_Salmo 53_(Vocals).srt'
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
    subs = pysrt.open(srt_file)
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
    extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 180, in main
    process_audio_files(input_folder, settings)
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 183
    main()

Can you help me out?
Can you make a video on how to keep the emotions from the original source voice? I have everything beautifully working for a clean and perfect voice clone, but my source audio has some strong emotional acting (anger/fear/happiness etc.) that is not represented in the cloned audio. Thanks.
Can I use talking + singing audio to create my model, or should it be split into two separate models: one for the singing voice and one for the talking voice? I am having trouble finding clean singing audio for my model and am considering using talking audio from interviews etc.
Thanks for the vid, although I'm confused. I understand the UVR step to isolate vocals; I would generally then use that as the dataset. What is the benefit of the next step of splitting the file up? Is that all it does? What else is happening that I don't know about? I've generally just used longer clean audio files for training. Thanks for enlightening me :)
By splitting it, we solve the biggest issue, CUDA out of memory, as I don't believe RVC splits larger audio files into more digestible chunks. Splitting lets us control this, and additionally get rid of any silence in the audio samples. There's also the fact that you can easily remove any bad data from the audio that you may not want in the training set. If you're running it just fine with UVR without the out-of-memory issue, you should be good to go, but splitting gives you a bit more freedom with the data.
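The memory argument above can be sketched as simple arithmetic: training has to hold a whole clip in VRAM at once, so capping the clip length caps the peak usage. A toy illustration of the chunking (names are mine, not the splitter's actual code):

```python
def chunk_bounds(total_seconds: float, max_chunk_seconds: float = 10.0):
    """Return (start, end) offsets in seconds that cut a long recording
    into chunks no longer than max_chunk_seconds each."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# A 25 s file with 10 s chunks yields three pieces:
# [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

The real splitter cuts on the .srt subtitle timings instead of fixed offsets, which is why complete sentences come out cleaner, but the VRAM benefit is the same either way.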
hello @jarod, I got this error while it's creating the output and vocal audio sets:

CUDA is available. Running on GPU.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.6. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file C:\Users\kit\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
Traceback (most recent call last):
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\Scripts\whisperx-script.py", line 33, in <module>
    sys.exit(load_entry_point('whisperx==3.1.1', 'console_scripts', 'whisperx')())
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\transcribe.py", line 159, in cli
    result = model.transcribe(audio, batch_size=batch_size)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 288, in transcribe
    for idx, out in enumerate(self.__call__(data(audio, vad_segments), batch_size=batch_size, num_workers=num_workers)):
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\base.py", line 1028, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 228, in _forward
    outputs = self.model.generate_segment_batched(model_inputs['inputs'], self.tokenizer, self.options)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 138, in generate_segment_batched
    result = self.model.generate(
RuntimeError: CUDA failed with error out of memory
Hey, is it bad if there are low sounds of people slamming doors or making pop-like noises in the background? (They get loud on purpose every time I sing.) I can't get rid of those, or of plosives from breathing, but you can still hear my voice :/
Make space cakes and give them out, then start recording an hour later; you should be good for a few hours whilst they are all monging on the sofa ;-) I feel for your situation, that the people around you can't be reasonable with you for ten or so minutes. Maybe show them some videos of what you are looking to do, and offer to make them a voice, on the proviso that they shut up for 10 minutes whilst you do yours? Good luck
Hi, I have a question about RVC. I am trying to train a model where I have chosen "no pitch", and it sounds autotuned. How can I fix that? How does the learning rate work? What is batch size?
I have a source track with background noises. Of course I can fix that using UVR5 or other voice-isolation VSTs, but there are also segments with a lot of voice reverb, and when I reduce that reverb it cuts low-mid frequencies from the voice. What should I do in such a situation? Maybe I need to find a reference with good EQ and try to improve the target data using EQ matching?
In this case, you're in a tough spot, because if you can't clean the data, there may be some murkiness in the final output. As much as you can, you want to get your audio as clean as possible before training.
I do not think I have CUDA, just CPU, but I got the error: ERROR: Could not find a version that satisfies the requirement torch (from versions: none) ERROR: No matching distribution found for torch
Hello, I would like your help regarding voice reproduction via Google Colab. Should the data be uploaded in WAV mono or stereo format, and should it be 16-bit or 24-bit?
1) What/how can I change to have multiple data directories (if I want to tweak/add on a later retry, and as a way of keeping things organized)? I presume I can make a subdirectory like the "vocal" ones for each unique dataset? 2) Can I bypass the audio-split step if I've exported my dataset in
1. Each file you put in the data folder will be exported to its own segmented folder in the output folder. Once finished here, I recommend moving the finished files somewhere else on your PC. 2. Yes, no need. 3. The exported files (segmented pieces) are coded in by me and organized to export to the folder you chose at the start. Means unlimited freedom if you wanted to modify the code. 4. It sort of is a batch process; what additional feature are you looking for? From the question, I'm assuming you just want to choose an input and an output folder, right? Since it makes a folder per file name, I can see it being a bit cumbersome to manually move them into one directory, but this is for sorting reasons. A 3060 is good as it can utilize CUDA. IMO the 3060 gives more flexibility due to its 12 GB of VRAM, so it would be the cheaper option compared to something like a 3070 or 3060 Ti.
@@Jarods_Journey 1) ok 2) ok 3) ah; following along without actually doing it makes it easy to discount where you started at, ahrgh, sorry 4) by batch, effectively automating starting Visual Studio, getting to the point where training ui begins... or in essence, an actual app ala UVC that does the environment setup, python behind the scenes. I want to copy my dataset over, then jump to a ui to start training.... and ideally the same ui to manage models, inference. Installing python, visual studio etc. are one time things I don't mind - I'm thankful you've done these tutorials, but the steps, steps, steps, steps, steps just to get to starting training seems automatable? My interest is in music, singing replacement; and what happens by tweaking the dataset, getting to what I hear in my head. Which I want bad enough to jump through hoops (and buy a new pc I previously didn't need, lol) but.... gahhh... it's like being a kid again, configuring AUTOEXEC.BAT and CONFIG.SYS for hours, only to be burned out by the time you get Wolfenstein to run in SVGA with a hand-me-down SoundBlaster 16 card....
@@TheChipMcDonald Gotcha! The RVC web-UI is actually pretty close, it's literally just missing the data curation side of things as it comes in a downloadable release too. A few more quality of life things later like file browsers instead of paths, etc. and I think we're looking at a very robust and easy to follow workflow. I'll definitely keep the channel updated WHEN someone comes out with something that has all of the puzzle pieces put together. 🙏
Anything that is an Nvidia 3060 12 GB or above should be fine; even 20-series cards still work too. Anything that is not Nvidia often has issues, so I don't recommend those.
Hello! I've followed the tutorial closely three times, but I keep getting one error at line 101: "Exception has occurred: FileNotFoundError". It seems to be looking for an .srt file? Also, the terminal says "Requested float16 compute type, but the target device or backend do not support efficient float16 computation."
That means no .srt file was generated by whisperx. Try redownloading with setup-cpu.py, as your GPU probably doesn't support float16. That, or in the code, you can change float16 to int8 wherever it appears. I'll need to work on a fix for this.
Hi Jarod, I still get an error even after I do .\venv\Scripts\activate:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'C:\\Users\\Mah\\Desktop\\AudioSplitter_Whisper\\audiosplitter_whisper\\data\\output\\Vocals.srt'
  File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
    subs = pysrt.open(srt_file)
  File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
    extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
  File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 180, in main
    process_audio_files(input_folder, settings)
  File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 183
    main()
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Mah\\Desktop\\AudioSplitter_Whisper\\audiosplitter_whisper\\data\\output\\Vocals.srt'

Does it matter that I'm running OS encryption with VeraCrypt?
I get an error while installing ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu118 (from versions: 2.2.0, 2.2.0+cpu, 2.2.0+cu118, 2.2.0+cu121, 2.2.1, 2.2.1+cpu, 2.2.1+cu118, 2.2.1+cu121) ERROR: No matching distribution found for torch==2.0.0+cu118
Hello, I have a few questions. I have access to 6-channel audio with the voice I want to clone, and I'm extracting it all manually using Adobe Audition. 1. Using UVR helps remove any lingering background noise, but sometimes a little noise remains. It is not that noticeable, so is it okay to have a little noise, or will that affect the model? 2. I know to remove long silences, but what about the small gaps while the character is actually speaking? Should I remove those too, so it is just a continuous stream of talking without even 0.5-second breaks? And what about the sounds when a character isn't actually speaking, e.g. growls or hums, or breathy sounds like laughing, that naturally have some silence in them?
My observation is that a little bit of noise is OK; it shouldn't be that noticeable. Though in one case, one of my models does let the background noise that wasn't removed show through in the output. Hard to get it perfect, though. 2. The little gaps are fine. As for growls and whatnot, I'd say cut those out, but I haven't actually tried, so I can't say for certain.
One more question... for now... if that's okay? Say I wanted to be excessive and get the cleanest, most accurate, almost perfect result possible on the first training run, and I had 1.5 or even 2 hours max of audio data, and my PC could probably handle it (for context, I have an NVIDIA GeForce RTX 3060 graphics card and 32 GB of RAM). What is the max number of epochs you'd recommend I train for?
Dunno, the big answer is "it depends". Just try training for 10 epochs and hear how it sounds, then train for other epoch counts and try those as well. You're looking for the lowest epoch count that sounds good.
@@Jarods_Journey Is the artifacting you mean the octave-shift/cracking/falsetto effects? That's been a problem with some of the voice models I've made, and some that I've downloaded and tried using.
Thanks for the helpful video. I have a GTX 1660 Ti with 6 GB of VRAM, and CUDA says I am out of memory. Is there a low-VRAM option like in Stable Diffusion, or am I stuck with using the CPU?
There are some low-VRAM options built into whisperx that have to be passed; you would have to modify the script to do that. I'll get around to adding it when I get the chance.
That was so complicated it was ridiculous. Why don't you actually write a program that just does all that by clicking "split"? What about slicer-gui-windows-v1.2.1? Will that do the same thing?
Hello, I'm asking anyone right now because I got a bit lost. I'm trying to stop the AI voice glitching out whenever I do long vowels, so it doesn't mash them all together into a mess. So far I thought you have to train the models to sound better, but I think that's not the case. Can someone explain what I have to do to achieve this?
Just found your channel and wanted to ask if you know any way to follow these steps on Mac. As a student, the only computer I have is my MacBook Air M1. I watched your video where you show how to use RVC on Colab, and I want to learn how I can create my own dataset and remove vocals from songs.
You can run this on CPU using setup-cpu, though I haven't tried it myself since I don't have a Mac. You could technically do all this in Colab as well, but you'll have to set that up yourself.
Sorry mate, I haven't looked into this area and don't know quite how to do it either. You have to tell whisperx the location of the alignment model you're using, but that's as far as I know.
hi when I tried this I got this message: Failed to align segment ("!!!! Ш!!!!!!!!!!!!!!!!"): no characters in this segment found in model dictionary, resorting to original...
I also got this message: Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x. Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
Both error messages are fine, and you should still be getting an output file at the end. If you're not, I believe that's a whisperx limitation where it can't align some words.
Hi Jarods, I'm currently working on an audiobook project using a cloned voice, where I will be the voice. How good will the training be if I have an i5 and a GTX 1060 6GB? Is this enough?
The end of the video got cut off -_-. I only had like 10 seconds left so when I get the chance, I'm just going to link a shorts so that you guys can see the rest of the video lol
Finishing the Data Curation Video...
@@Jarods_Journey Your audiosplitter code exports 44.1kHz audio. How do I make it export 48kHz? I am losing quality with this code!
Wow, I am amazed by this channel. A few weeks ago I was searching for Diarization of voices but had no good luck finding a good fit.
Not only do you have a very good tutorial, you seem to be knowledgeable and up to date with everything (as up to date as one can when things are moving this quick).
Too many things, too fast. Appreciate it :D, this is the realm of open source.
@Jarods_Journey Love you, bro! Thanks a ton. I didn't even know this existed!
Just found your channel last night, and your workflows are so clear and to the point. Quickly becoming my go-to for voice2voice workflows. Thank you for your work.
Appreciate it 🙏
Thank you Jarod.
If people don't want to use Git, they can just download the zip and unpack it at the preferred location. 😉
Solid tip, thanks Luz! Totally skipped my mind.
How do you combine voices to create a totally unique one?
OMG Jarod! Your video tutorials are getting better and better. I love seeing a new release from you! Thanks for all your hard work!
Jarod managed to help me figure out a strange problem that I was not able to figure out at all. He's got my sub. Thanking you kindly!
Just found your channel, and I want to say I'm so deep into the rabbit hole that I instantly recognized all the voices you use for conversion at the start 😂
Your channel and the AI Hub have helped me a lot in getting started. I just trained a model on 2 hours of audio from Fauna's last stream in RVC v2 for 1000 epochs and it came out very well.
Haha awesome, glad to hear!
Is there a way I can get a copy of it?
how much better is that than 300? does that prevent static sounds if you don't use pretrained generators?
Bro. This channel is amazing. I've been around and you are needed by many. Welcome.
Hey bruh, I'm getting some errors while converting trained data to output: an ffmpeg error plus a dtype TypeError... (ffmpeg is already installed.)
For those receiving an error with the "split_audio" script not creating the .srt file as per the above tutorial, run this in an Anaconda or Python prompt, let it download the required dependencies, and it will work as you need.
Thank you for a great tutorial!
How does it work? Because i got the exact problem.
copied from the issues section, worked for me.
Running split_audio.py threw this error
Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'D:\ai\programs\audiosplitter_whisper\data\output\1.srt'
File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
subs = pysrt.open(srt_file)
File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 180, in main
process_audio_files(input_folder, settings)
File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 183, in <module>
main()
FileNotFoundError: [Errno 2] No such file or directory: 'D:\ai\programs\audiosplitter_whisper\data\output\1.srt'
Additionally, the terminal was saying something about not having or not finding cublas64_12 (I can't remember exactly what it said)
The error is thrown because the program can't find the .srt file, because it can't make the .srt file, and this is caused by a mismatch of CUDA versions: Torch (or something) ships CUDA 11, but the script (or whatever) needs CUDA 12. I'm not a programmer and I don't know exactly what is what. All I know is that I fixed it.
To fix this, do the following.
Download and install CUDA 12 developer.nvidia.com/cuda-12-0-0-download-archive
Navigate to "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\bin"
Copy cublas64_12.dll, cublasLt64_12.dll, cudart64_12.dll
Navigate to "...\audiosplitter_whisper\venv\Lib\site-packages\torch\lib"
Paste the dlls into this folder
Now when you run split_audio.py, it will be able to create the srt file, fixing the issue with not being able to find said file.
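If you'd rather script the copy than do it by hand, here's a minimal Python sketch of the same steps. The `copy_cuda_dlls` helper name and the example paths are just illustrations of the fix above, not code from the repo; adjust them to your own install locations.

```python
import shutil
from pathlib import Path

# The three CUDA 12 runtime DLLs named in the fix above.
CUDA_DLLS = ["cublas64_12.dll", "cublasLt64_12.dll", "cudart64_12.dll"]

def copy_cuda_dlls(cuda_bin, torch_lib, names=CUDA_DLLS):
    """Copy each named DLL from the CUDA bin dir into torch's lib dir.

    Returns the list of DLLs that were actually found and copied, so you
    can see at a glance if one was missing from the CUDA install.
    """
    cuda_bin, torch_lib = Path(cuda_bin), Path(torch_lib)
    copied = []
    for name in names:
        src = cuda_bin / name
        if src.is_file():
            shutil.copy2(src, torch_lib / name)
            copied.append(name)
    return copied

# Example with the default Windows paths from the comment above:
# copy_cuda_dlls(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\bin",
#                r".\venv\Lib\site-packages\torch\lib")
```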
Your channel is amazing. I was looking for this for a long time.
Hey, thanks for your awesome series of tutorials! As someone who is pretty new to this, it really helps out a ton. Would it be possible for you to make a tutorial on how to train an RVC v2 voice with the dataset I just created? Thanks again and keep up the great work!
Appreciate it! Respective tutorials already exist, so I'd go check those out! ua-cam.com/play/PLknlHTKYxuNshtQQQ0uyfulwfWYRA6TGn.html
Hello, me again with two small questions:
1. The file format of choice is of course WAV, but what should it be for the best quality? 44.1kHz or 48kHz? Mono or stereo? (My recordings are in mono, but I could duplicate the channel and create a "pseudo-stereo" track if that produces better results.)
2. Your Audiosplitter_Whisper is good for my spoken sound files, but what is the best way to split the sung recordings? Because of the continuous singing, there isn't always a silence every 10 seconds (or less). What would you recommend? Or do you know a current, nice how-to that describes everything in detail for achieving the best quality? (These are really my last questions :) )
I'm from the future: Don't install Python 3.12, use 3.10.
True story 🤬
Exactly what I needed to know
I am still working on it. I have decided to do this on the worst quad-core CPU there is: the 1.3 GHz, no-turbo, 4-core, 4-thread AMD Sempron 3850. I spent a bit over a week getting clean audio saved with Ultimate Vocal Remover. I am using 12 hours of talking.
There is probably a way to do this on Colab, but ATM, Colab is a hassle I don't wanna have to deal with :(. Good luck with it 🫡
@@Jarods_Journey Thanks! It's going somewhat smoothly, got 5 errors in the CPU part of Visual Studio Code, but I am just going to pretend they don't exist, and move on with it. Lol.
Thank you so much. King!
Great tutorial!!
This is an error I got: RuntimeError: Library cublas64_12.dll is not found or cannot be loaded
SAME! :(
I had this same error, but I happened to have the file from a previous installation of alltalk_tts. I'm sure you could find it elsewhere, though. I placed it in "audiosplitter_whisper\venv\Lib\site-packages\torch\lib" and everything worked as it did in his video.
Great, thank you for this video!
Thank you so much! It's a clear video and we can see that you know what you're doing! I have a small question regarding the .wav files of the dataset: is it better to encode them in stereo or in mono? Or does the program make no difference?
I don't think it makes a difference, but I read somewhere that it should be done in stereo. I believe it flattens them anyway, so it doesn't really matter after it's been processed.
@@Jarods_Journey Thank you very much! One last question: is it better to segment the sounds into files of 10 seconds each, or to cut them into complete sentences (and therefore have files of very variable duration)? Thx for your work!
@@pilpinpin322 :), complete sentences work best so you don't get weird clippings, but if you run out of VRAM, you'll need to split into smaller segments.
@@Jarods_Journey Thx for the fast response! Even if there are very small sentences of 1 sec, like "Yes, I agree!"?
Hey another banger video mate!
Do you reckon it's wise to keep the sound of breaths, such as when they inhale or exhale? Or do I only need the parts where the source voice talks or sings? Let me know your thoughts and keep up the cool vids!
Whatever is included in the split audio should be fine. It may cut out some of the breathing at the end or beginning of a sentence, but everything in between is fine to keep :)!
Hello. Thanks for your great videos. One question: I am from Germany and have WAV files spoken or sung in English AND in German. For your tool / whisperx I can handle them separately by changing the language. But my question is about RVC: for training a new model, can I mix those different languages together? I always did that, and now I realize that maybe it wasn't a good idea? Or does that not matter for RVC? Thanks in advance ;)
This is fine. RVC doesn't look at text to train; it strictly extracts features from the audio provided. The only thing is it may sound accented: for example, if I train a model on Japanese audio and use it to convert English speech, it may not sound 100% like a native English speaker.
@@Jarods_Journey Cool. Thank You for your quick reply ;)
Hey there! Thank you for all those videos! I hadn't realized UVR5 had advanced options, lol.
Hey, I have a question that may look silly but is serious: is it really required to train for _hundreds_ of epochs? I have had absolutely great results with only 50 epochs. What exactly do more epochs bring?
Meanwhile, the issues I have also happen with models trained for hundreds or thousands of epochs, because most of my problems come from the way I clean the audio I want to clone.
I also noticed my feminine voices tend to break at growls. Is it required to have growling audio in the dataset used for training? Or is there a secret sauce to make any voice do growls?
Appreciate it! A finished epoch indicates that the model has seen every sample once; increasing epochs just repeats this process X times. It's all data dependent, as you don't always need more epochs for a good model.
As for growls, in general they seem to be harder for the models to infer, and my anecdotal experience is that all models kinda struggle with them. I have yet to try training with growls, but I want to try a similar experiment with laughing, because often the laughing just sounds weird 😂
Do I need to sing in the audio for the dataset, or is talking enough (like reading something from the web)? Thx; apart from that, great tutorial. ^^
If you get an error when running whisperx, make sure you have version 12 of the NVIDIA CUDA toolkit installed.
For some reason I keep getting an error where it cannot open the vocals.srt file. Did I miss a step? There is no vocals.srt file generated in the output folder for audiosplitter.
I'm having the same problem. Did you manage to sort this out?
Don't forget to change the execution policy back to default when you are done with this.
Thx for your content! Why would I use WhisperX, tho? Is it just for data management, or does it actually help RVC train?
For curating better data: by using sub timing, there's less chance of audio samples being empty noise.
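To make the sub-timing idea concrete, here's a toy sketch (hypothetical helpers, not the splitter's actual code) of how an SRT cue's timestamps map onto a slice of the waveform; audio between cues, where nothing was transcribed, never becomes a training sample:

```python
def srt_time_to_ms(hours, minutes, seconds, millis):
    """Convert an SRT timestamp (HH:MM:SS,mmm) into milliseconds."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis

def cue_to_sample_range(start_ms, end_ms, sample_rate=44100):
    """Map a cue's start/end (ms) onto sample indices for slicing a waveform."""
    return (start_ms * sample_rate // 1000, end_ms * sample_rate // 1000)

# A cue from 00:00:01,500 to 00:00:03,000 at 44.1 kHz covers these samples:
start = srt_time_to_ms(0, 0, 1, 500)      # 1500 ms
end = srt_time_to_ms(0, 0, 3, 0)          # 3000 ms
lo, hi = cue_to_sample_range(start, end)  # (66150, 132300)
```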
I have 3 minutes of studio quality lossless vocals I would like to use to train. Is that sufficient?
Additionally, there are some interviews on YouTube of the same artist speaking at length, but I was concerned whether the lower-quality mp3 material should be avoided for these purposes. Thanks for your video! Very informative.
Muffled audio should be excluded but if the voice sounds good enough you can include it. 3 minutes may be okay, but idk, you just gotta try it out mate 🤟.
10 minutes or more is recommended but you can use less sometimes and it'll be fine.
Jarod, do you know if there is software or websites that let you make a new voice out of other voices? Like blend them into a new one? Especially RVC-type voices (since I know those best), but I'd be curious about others too.
I recommend using software like Audacity for post-processing the audio; it helps with clarity and with random noise.
I have audiosplitter_whisper installed and VS Code opened; trying to run debugging as per 12:00 in the video, I get the following error: "configuration 'python: file' is missing in 'launch.json'". Any idea what might be going on? BTW: it appears to work if I run "python split_audio.py" in PowerShell.
Hi! I followed your tutorial and managed to set everything up and run the script without getting any errors, but the problem is that I didn't get the expected amount of segments. I tried the script with three different audios. The first one, of about 4 minutes, got me an output of 35 seconds worth of segments; the second one, also about 4 minutes, got an output of 1min 36sec total; and the third, a bit over 2 minutes, got 55 seconds. Do you know what could be the issue? Also, I tested speaker diarization with another audio but it didn't go very well. It had 4 different speakers, which it separated into only 2, and all 4 speakers were in both folders.
They have updated it, and now it is not possible to sort files by speaker. Can you look at the new version and tell me what can be done? Is it possible to use the old version somehow?
The whisper repo no longer has setup-cpu and setup-cuda. Do I just download later versions, or is there a newer tutorial?
11:01 I don't know why, but it keeps giving me the same error (No module named 'pysrt'), even though 'pysrt' is already installed.
ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu118 (from versions: none)
ERROR: No matching distribution found for torch==2.0.0+cu118
Ultimate Vocal Remover is struggling with some tracks; I can hear the instrumental in the background with Kim Vocal 1. Is there a model where the vocals come out perfect? Great vid!
The vocal removers are really good, but unfortunately they're not 100%. That's very hard to achieve, and I'm sure there are brilliant minds working towards it, but it doesn't exist ATM. You may be able to get better results with ensemble mode, but you'll have to research the best combos a bit: github.com/Anjok07/ultimatevocalremovergui/issues/344
Hi Jarods, can I use large-v3 model instead of large-v2?
ERROR: Error [WinError 2] The system cannot find the file specified while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH? lol
Hi! I'm from another country and I don't really understand English, but this topic is very interesting! How can I teach a model to speak my language better?
I thought this included both the separation and training, but all those GB of programs are only for isolating voice, daym !
mine says Failed to create virtual environment. Error: [Errno 13] Permission denied
did you solve that?
Same error for me. Unable to get pass this.
Fixed it. Uninstall python and reinstall no later than 3.9
Hello, please help me with this error:
(Requested float16 compute type, but the target device or backend do not support efficient float16 computation.)
I am having similar issues, did you ever figure it out?
Just have a question: how high is your batch size when you train? Is it that if you set it too high, you get an imprecise model? If I have a dataset of one hour, what should my batch size be?
How do you use the dataset created following this tutorial with AI voice cloning 3.0?
You don't explain how to use them.
Can you make a video?
Is there some kind of Vocaloid-like interface so that I have some control over how certain words sound? Would be cool to have a TTS that could run the trained RVC voices.
ATM, I don't know of any that use RVC voice, though I'm bound to see it happening someday
It all points to using Windows for any of these. Or am I missing something? I am on macOS and everything is .bat and .exe, or Google Colab sandboxes running things. Is there no UI to date that also runs on macOS? Have I failed to locate it, perhaps?
Thank you for your amazing videos, they really help me understand how everything works. Just one question: I'm having some problems when running the "split_audio" script. It seems it isn't creating the .srt file of the audio, and when it tries to open that file it runs into an error. Do you know what it could be?
Whisperx may not be downloading correctly. I would try rerunning the setup file to get this going. One other thing you can do is type whisperx into the console after activating the venv, to see if it got installed.
@@Jarods_Journey Thanks! I'll try uninstalling everything and installing again because now the set-up is showing error when previously it didn't
@@davidmaldonado9254 Did you manage to solve it? I have the same problem.
Run VS code as admin.
I think I had the same problem using the CUDA installation. If your debugger tells you that it can't find the .srt file when running the split_audio script, then check your terminal logs. If you have an error like this:
"ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation."
Then it means that your GPU does not support FP16 execution.
To fix it, go to line 26 in the split_audio script, which should read: return 'cuda', "float16". Replace "float16" with "float32" or "int8".
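In other words, that line just returns a (device, compute type) pair that gets handed to whisperx, and the fix is falling back when FP16 isn't supported. A standalone sketch of the decision (a hypothetical helper, not the script's exact code):

```python
def pick_compute_type(device, fp16_supported):
    """Choose the compute-type string to hand to whisperx.

    device: "cuda" or "cpu".
    fp16_supported: whether your GPU runs FP16 without the
    "Requested float16 compute type..." error.
    """
    if device == "cpu":
        return "int8"      # what the CPU setup falls back to
    return "float16" if fp16_supported else "float32"
```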
The code does not generate an .srt file for me from a single WAV, and I get a FileNotFoundError: No such file or directory: 'D:vocalsplittest/data\\output\\song.srt'
Apparently this is an issue with whisperx; some devices like mine do not support this float type, making this code unusable :(
same problem
Yeah, have the same problem. Hopefully it will be fixed soon
Can you try setting it up with setup-cpu? My laptop has an i7-8650U and works with this setup; it switches over to int8 instead of float16.
Maybe it's too late, but I solved it with "pip install -r requirements-cuda.txt" (I have an Nvidia graphics card; if you use CPU, replace it with "requirements-cpu.txt"). For some reason there is a missing package that doesn't get installed when running "setup-cuda.py". Always run the command within the virtual environment created previously with "venv".
I'm getting a FileNotFoundError in Visual Studio Code, where it cannot find srt_file. I followed your tutorial step by step, but I'm sure I did something wrong since I don't get the same results when I run the program. Since I have no Python experience, I'm not sure what I did wrong here.
Some people have reported that it'll work if you try running vscode in admin mode
@@Jarods_Journey Thank you for responding! I will try that.
Maybe it's too late, but I solved it with "pip install -r requirements-cuda.txt" (I have an Nvidia graphics card; if you use CPU, replace it with "requirements-cpu.txt"). For some reason there is a missing package that doesn't get installed when running "setup-cuda.py". Always run the command within the virtual environment created previously with "venv".
Just a maybe random question: I was having issues installing the audio splitter and thought it was because I hadn't installed NVIDIA's CUDA toolkit, so I ended up installing it, but something else was causing the error. So my question is: should I uninstall this CUDA toolkit? I don't know exactly what it does, or whether it could harm my configuration or GPU in the future.
I'm trying to do my own voice and got some decent results, but it can't handle higher pitches. Should I add more samples with my voice in a higher pitch, or give it more samples with my normal voice and train it for longer? I have it trained using the Harvard Sentences from a previous video and I did 300 epochs.
You can try adding samples of higher pitch. It's mainly going to be good at speaking in the pitch and timbre of the voice you train it with, so if your voice is naturally deeper, it's not going to know how to handle it if you try to speak high all of a sudden.
I have 954 audio files in my training folder; is that a bit too much for RVC to train on?
Hey, at 12:22 I get a similar error, but .\venv\Scripts\activate doesn't seem to fix it. Are there any other solutions? It's giving me a "FileNotFoundError", highlighting "subs = pysrt.open(srt_file)".
Here's most of the error (there's more, just basically the same thing)
"Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'C:Users/myuser/OneDrive/Desktop/deleteme/audiosplitter_whisper/data\\output\\MyDataSet.srt'
File "C:Users\myuser\OneDrive\Desktop\deleteme\audiosplitter_whisper\split_audio.py", line 101, in extract_audio_with_srt
subs = pysrt.open(srt_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^"
Also great video so far!
Something happened when trying to make the .srt file; make sure that whisperx downloaded and the setup ran without issue.
You may also have to run vscode in admin mode
Maybe it's too late, but I solved it with "pip install -r requirements-cuda.txt" (I have an Nvidia graphics card; if you use CPU, replace it with "requirements-cpu.txt"). For some reason there is a missing package that doesn't get installed when running "setup-cuda.py". Always run the command within the virtual environment created previously with "venv".
Unfortunately, I don't see any CUDA setup file in the cloned directory. Any help?
Does it matter if my source audio is chopped up? For example incomplete words/sentences etc..
When I run the script, this is my error: No module named 'yaml'
I had to manually go through the pain of finding this out. Basically, first make sure you're not in the virtual environment (type "deactivate" to check). Then, for everything that isn't installed or says the module can't be found, look up the install command online and add "--use-pep517" to it. For yaml, try "pip install PyYAML --use-pep517".
Can this be done for so-vits? Because RVC loses the human element in my voice when I try making cover songs.
Thanks for trying, but this thing has failed for me multiple times and I'm tired of trying to troubleshoot it. Is it that hard to just make an executable for people to use? I don't know jack shit about code and can't fix it when it doesn't do the same shit your computer does, even when following all the steps.
What if my voice doesn't speak any of the default languages? I have found a phoneme-based ASR model that suits me, but how do I use it in your code? Anyway, great tutorial!
Ah... I haven't dabbled in that area yet and don't know how it works for other, unsupported languages. I would test it as a command-line script first to see if you can get it working that way. I believe the --align_model argument would need to be used.
When running it like at 13:00, it says `failed to align segment ("!!!!!!!!!!"): no characters in this segment found in model dictionary, resorting to original...` multiple times, and once it finished, the folder had no segmented audio and was just empty. How do I fix this?
I think this is a language issue: if your audio files have multiple languages in them, that causes issues with whisperx, as does an unsupported language. Beyond that, please reference the whisperx GitHub issues page for details, as I'm not sure what else causes this.
Hey Jarods, much appreciation for your tutorials. I'm facing an issue when running split_audio.py. I'm using a Spanish dataset, followed all your steps, and changed conf.yaml to language: "es". But when I run the split_audio.py script, I face this issue: Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'D:\\Documentos\\VoiceCloning - AudioSplitter\\audiosplitter_whisper\\data\\Vocals\\output\\100_Salmo 53_(Vocals).srt'
File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
subs = pysrt.open(srt_file)
File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 180, in main
process_audio_files(input_folder, settings)
File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 183, in <module>
main()
Can you help me out?
did you fix the issue ??
Which Video Player do you use?
Can you make a video on how to keep the emotions from the original source voice? I have everything working beautifully for a clean and perfect voice clone, but my source audio has some strong emotional acting (anger/fear/happiness etc.) that is not represented in the cloned audio. Thanks.
Can I use talking + singing audio to create my model, or should it be split into two separate models, one for the singing voice and one for the talking voice? I am having trouble finding clean singing audio for my model and am considering using talking audio from interviews etc.
You can use both. As long as it's the same voice, it should be fine
Thanks for the vid, although I'm confused. I understand the UVR step to isolate vocals; I would generally then use that as the dataset. What is the benefit of the next step of splitting the file up? Is that all it does? What else is happening that I don't know about? I've generally just used longer clean audio files for training. Thanks for enlightening me :)
By splitting it, we solve the biggest issue, CUDA running out of memory, as I don't believe RVC splits larger audio files into more digestible chunks. Splitting lets us control this, and additionally get rid of any silence in the audio samples. There's also the fact that you can easily remove any bad data from the audio file that you may not want in the training set.
If you're running it just fine with UVR without the out-of-memory issue, you should be good to go, but splitting gives you a bit more freedom with the data.
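As a toy illustration of why splitting helps (a sketch, not RVC's actual pipeline): fixed-length chunks put a ceiling on the memory any one sample needs, and near-silent chunks can be dropped before training.

```python
def split_into_chunks(samples, sample_rate, max_seconds=10.0):
    """Split a 1-D sequence of samples into chunks of at most max_seconds."""
    chunk_len = int(sample_rate * max_seconds)
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

def drop_silent(chunks, threshold=0.01):
    """Discard chunks whose peak amplitude never exceeds the threshold."""
    return [c for c in chunks if max(abs(x) for x in c) > threshold]

# 25 s of audio at 8 kHz becomes three chunks (10 s, 10 s, 5 s),
# each small enough to fit comfortably in VRAM during training.
```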
Hey Jarod. Slight issue when cloning audiosplitter_whisper: I don't get the .git folder at the top, just the rest of the files. How do I fix that?
Hello @jarod, I got this error while it was creating the output and vocal audio sets:
CUDA is available. Running on GPU.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.6. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file C:\Users\kit\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
Traceback (most recent call last):
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\Scripts\whisperx-script.py", line 33, in <module>
sys.exit(load_entry_point('whisperx==3.1.1', 'console_scripts', 'whisperx')())
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\lib\site-packages\whisperx\transcribe.py", line 159, in cli
result = model.transcribe(audio, batch_size=batch_size)
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 288, in transcribe
for idx, out in enumerate(self.__call__(data(audio, vad_segments), batch_size=batch_size, num_workers=num_workers)):
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
item = next(self.iterator)
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\base.py", line 1028, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 228, in _forward
outputs = self.model.generate_segment_batched(model_inputs['inputs'], self.tokenizer, self.options)
File "C:\Users\kit\Desktop\rvc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 138, in generate_segment_batched
result = self.model.generate(
RuntimeError: CUDA failed with error out of memory
Hey, is it bad if there are low sounds of people slamming doors or making pop-like noises in the background? (They get loud on purpose every time I sing.)
I can't get rid of those, or of plosives from breathing. But you can still hear my voice :/
Make space cakes and give them out; start recording an hour later. You should be good for a few hours whilst they are all monging out on the sofa ;-)
I feel for your situation, that the people around you can't be reasonable with you for ten or so minutes.
Maybe show them some videos of what you are looking to do, and offer to make them a voice, on the proviso that they shut up for 10 minutes whilst you do yours?
Good luck
Just a question: does it remove the background voice of another speaker, if there is another speaker talking behind the target speaker?
Unfortunately it does not; overlapping speech and disentanglement is still a research-in-progress field.
@@Jarods_Journey One last question: what does speaker diarization do? Like cut out each speaker? Nvm, you explained it in the video.
Hi, I have a question about RVC. I am trying to train a model where I have chosen no pitch, and it sounds autotuned. How can I fix it? How does learning rate work? What is batch size?
Not too sure about this unfortunately
Do you have any voice modifications like the ones in the video that run in real time, to use in Discord for example, like Voicemod/Clownfish?
I have a source track with background noises, and of course I can solve that using UVR5 or other voice-isolation VSTs, but there are also segments with a lot of voice reverb, and when I decrease that reverb it cuts low-mid frequencies from the voice. What should I do in this situation? Maybe I need to find a reference with good EQ and try to improve the target data using EQ matching?
In this case, you're in a tough spot, because if you can't clean the data, the final output may have some murkiness. As much as you can, you want to get your audio as clean as possible before training.
I do not think I have CUDA, just CPU, but I got the error: ERROR: Could not find a version that satisfies the requirement torch (from versions: none)
ERROR: No matching distribution found for torch
Hello, I would like your help regarding voice work via Google Colab. Should the data be uploaded as mono or stereo WAV, and is it 16-bit or 24-bit?
1) What/how can I change this to have multiple data directories (if I want to tweak/add on a later retry, and as a way of keeping things organized)? I presume I can make a subdirectory like the "vocal" ones for each unique dataset?
2) can I bypass the audio split step if I've exported my dataset in
1. Each file you put in the data folder will be exported to its own segmented folder in the output folder. Once finished here, I recommend moving the finished files somewhere else on your PC.
2. Yes, no need
3. The exported files (segmented pieces) are organized by the code, which I wrote, to export to the folder you chose at the start. That means unlimited freedom if you want to modify the code.
4. It sort of is a batch process; what additional feature are you looking for? From the question, I'm assuming you just want to choose an input and an output folder, right? Since it makes a folder per file name, I can see it being a bit cumbersome to manually move them into one directory, but this is for sorting reasons.
A 3060 is good as it can utilize CUDA. Imo, the 3060 gives more flexibility due to its 12 GB of VRAM, so this would be the cheaper option to go with compared to, say, a 3070 or 3060 Ti.
@@Jarods_Journey 1) OK 2) OK 3) Ah; following along without actually doing it makes it easy to discount where you started, argh, sorry. 4) By batch, I mean effectively automating starting Visual Studio and getting to the point where the training UI begins... in essence, an actual app à la UVC that does the environment setup and Python behind the scenes. I want to copy my dataset over, then jump to a UI to start training, and ideally use the same UI to manage models and inference. Installing Python, Visual Studio, etc. are one-time things I don't mind - I'm thankful you've done these tutorials, but the steps, steps, steps, steps, steps just to get to starting training seem automatable?
My interest is in music, singing replacement; and what happens by tweaking the dataset, getting to what I hear in my head. Which I want bad enough to jump through hoops (and buy a new pc I previously didn't need, lol) but.... gahhh... it's like being a kid again, configuring AUTOEXEC.BAT and CONFIG.SYS for hours, only to be burned out by the time you get Wolfenstein to run in SVGA with a hand-me-down SoundBlaster 16 card....
@@Jarods_Journey Thanks
@@TheChipMcDonald Gotcha! The RVC web UI is actually pretty close; it's literally just missing the data curation side of things, and it comes in a downloadable release too.
A few more quality of life things later like file browsers instead of paths, etc. and I think we're looking at a very robust and easy to follow workflow. I'll definitely keep the channel updated WHEN someone comes out with something that has all of the puzzle pieces put together. 🙏
Do we need a high-end GPU to do the things you showed in the video?
Anything that is an Nvidia 3060 12 GB or above should be fine; even 20-series cards still work too. Anything that is not Nvidia often has issues, so I don't recommend those.
Hello! I've followed the tutorial closely three times, but I keep getting the same error at line 101: "Exception has occurred: FileNotFoundError". It seems to be looking for an .srt file? Also, the terminal says "Requested float16 compute type, but the target device or backend do not support efficient float16 computation."
That means no .srt file was generated by whisperx. Try re-running setup with setup-cpu.py, as your GPU probably doesn't support float16. Alternatively, in the code, you can change float16 to int8 wherever it appears. I'll need to work on a fix for this.
@@Jarods_Journey Thank you! I'll try that.
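For reference, the float16-to-int8 change described in the reply amounts to passing a different compute type when loading the whisperx model. A minimal sketch of that fallback logic (the `pick_compute_type` helper and `supports_float16` flag are illustrative names, not part of the actual script; the whisperx call is shown only as a comment since it needs the library and a GPU):

```python
def pick_compute_type(supports_float16: bool) -> str:
    """Return a compute type the whisperx/CTranslate2 backend accepts.

    Older or non-RTX GPUs (and CPU-only setups) typically lack efficient
    float16 support, which triggers the warning quoted above; int8 is the
    usual fallback.
    """
    return "float16" if supports_float16 else "int8"

# Hypothetical usage inside split_audio.py (commented out - requires whisperx):
# model = whisperx.load_model("large-v2", device,
#                             compute_type=pick_compute_type(False))
```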
Hi Jarod, I still get an error even after I do .\venv\Scripts\activate
Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'C:\\Users\\Mah\\Desktop\\AudioSplitter_Whisper\\audiosplitter_whisper\\data\\output\\Vocals.srt'
File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
subs = pysrt.open(srt_file)
File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 180, in main
process_audio_files(input_folder, settings)
File "C:\Users\Mah\Desktop\AudioSplitter_Whisper\audiosplitter_whisper\split_audio.py", line 183, in
main()
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Mah\\Desktop\\AudioSplitter_Whisper\\audiosplitter_whisper\\data\\output\\Vocals.srt'
Does it matter that I'm running OS encryption with VeraCrypt?
I get an error while installing
ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu118 (from versions: 2.2.0, 2.2.0+cpu, 2.2.0+cu118, 2.2.0+cu121, 2.2.1, 2.2.1+cpu, 2.2.1+cu118, 2.2.1+cu121)
ERROR: No matching distribution found for torch==2.0.0+cu118
Use Python 3.10.11, that's what Jarod did.
It definitely worked for me after I used that specific version.
Hey, can you do an update on this video?
Any new tools and methodologies that replace what is outlined in this video?
Hello, I got a few questions..
So I have access to 6ch audio with the voice I want to clone, and I'm extracting it all manually using Adobe Audition.
1. Using UVR helps remove any lingering background noise, but sometimes a little noise remains. It is not that noticeable, so is it okay to have a little noise, or will that affect the model?
2. I know to remove long silences, but what about the small gaps between when the character is actually speaking? Should I remove those too, so it is just a continuous stream of talking without even 0.5-second breaks? And what about the sounds when a character isn't actually speaking, e.g. growls or hums, or breathy sounds like laughing, that naturally have some silence in them?
My observation is that a little bit of noise is OK; it shouldn't be that noticeable. In one case though, with one of my models, I can hear the background noise that wasn't removed in the output. It's hard to get it perfect.
2. The little gaps are fine. As for growls and whatnot, I'd say to cut those out, but I haven't actually tried, so I can't say for certain.
@@Jarods_Journey Thank You!
One more question.. for now.. if that’s okay?
Say I wanted to be excessive and get the cleanest, most accurate, near-perfect result possible on the first train, and I had 1.5 or even 2 hours max of audio data, and my PC could probably handle it (for context, I have an NVIDIA GeForce RTX 3060 and 32 GB of RAM). What is the max number of epochs you would recommend I train for?
Dunno, the big answer is "it depends". Just try training for 10 epochs and hear how it sounds. Train at other epoch counts and try those as well. You're looking for the lowest epoch count that sounds good.
@@Jarods_Journey Oh okay then 🤔 Thank you a lot! I really appreciate you taking the time to answer
How important is it to remove silence between the speaker's words or does it matter at all?
It may help reduce some artifacting, but oftentimes you can leave some silence in there and it'll be fine.
@@Jarods_Journey Does the artifacting sound like octave-shift/cracking/falsetto effects? That's been a problem with some of the voice models I've made, and some that I've downloaded and tried using.
Is this only for Nvidia?
🔥🔥
Thanks for the helpful video. I have a GTX 1660 Ti with 6 GB of VRAM, and CUDA says I am out of memory. Is there a low-VRAM option like in Stable Diffusion, or am I stuck with using the CPU?
There are some low-VRAM options built into whisperx that have to be passed; you would have to modify the script to do that. I'll get around to adding it when I get the chance.
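Until that fix lands, the usual low-VRAM levers in whisperx are a smaller model size, int8 compute, and a smaller batch size. A sketch of how those might be chosen (the `low_vram_settings` helper and the 8 GB cutoff are my own assumptions, not an official whisperx recommendation; the whisperx call is commented out since it needs the library):

```python
def low_vram_settings(vram_gb: float) -> dict:
    """Pick rough whisperx options for a given amount of GPU memory.

    The 8 GB threshold is a guess: below it, fall back to a smaller
    model, int8 quantization, and a reduced batch size.
    """
    if vram_gb >= 8:
        return {"model": "large-v2", "compute_type": "float16", "batch_size": 16}
    return {"model": "medium", "compute_type": "int8", "batch_size": 4}

# e.g. for the 6 GB GTX 1660 Ti mentioned above (commented - needs whisperx):
# opts = low_vram_settings(6)
# model = whisperx.load_model(opts["model"], "cuda",
#                             compute_type=opts["compute_type"])
```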
ModuleNotFoundError: No module named 'yaml' - how do I fix it??
That was so complicated it was ridiculous. Why don't you write a program that just does all of that when you click "split"? What about slicer-gui-windows-v1.2.1? Will that do the same thing?
Hello, I'm asking anyone right now because I got a bit lost. I'm trying to make the AI voice not glitch out whenever I'm doing long vowels, so it doesn't process all of them at once and sound like a mess. So far I thought you have to train them to sound better, but I think that's not the case. Can someone explain what I have to do to achieve this?
Just found your channel and wanted to ask if you know of any way to follow these steps on a Mac. As a student, the only computer I have is my MacBook Air M1. I watched your video showing how to use RVC on Colab, and I want to learn how to create my own dataset and remove vocals from songs.
You can run this on CPU using setup-cpu, though I haven't tried it myself since I don't have a Mac. You could technically do all of this in Colab as well, but you'll have to set that up yourself.
@@Jarods_Journey I will spend some time on it, and if I find a way, I will post it here for others.
Thanks for the tutorial. Could you please explain how to replace the Whisper model with one that was trained on my native language?
BTW, I already found the model, but it's still a mystery how to use it with your script.
Sorry mate, I haven't looked into this area and don't quite know how to do it either. You have to tell whisperx the location of the alignment model you're using, but that's as far as I know.
Hi, when I tried this I got this message:
Failed to align segment ("!!!!!!!!!!!!!!!!!!"): no characters in this segment found in model dictionary, resorting to original...
I also got this message too
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
Both error messages are fine, and you should still get an output file at the end. If you're not, I believe that's a whisperx limitation where it can't align some words.
@@Jarods_Journey I have the WAV files, but I don't have the vocals with the small audio files.
Is there a way to fix that?
Hi Jarod, I'm currently working on a project to make an audiobook using a cloned voice, where I will be the voice.
How good will the training be if I have an i5 and a GTX 1060 6 GB? Is this enough?
That GPU might be rough... you might wanna train on Google Colab. The training quality should be the same; just the training time will be different.
@@Jarods_Journey Thanks for the tips
In my recording, the script started using phrases from the recording instead of SPEAKER_00 and SPEAKER_01. What can cause that problem?
Nvm, it seems like some phrases don't have a speaker assigned, so I just modified the script a little bit.
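For anyone hitting the same thing: diarization can leave some segments without a speaker label, and reading it unconditionally then picks up whatever text is nearby. A defensive version of the lookup (a sketch; the `speaker_label` helper and the segment layout shown are illustrative, not the actual split_audio.py code):

```python
def speaker_label(segment: dict, default: str = "SPEAKER_UNKNOWN") -> str:
    """Return the diarized speaker for a segment, or a fallback label.

    Some segments come back without a 'speaker' key when diarization
    could not attribute them, so .get() avoids a KeyError.
    """
    return segment.get("speaker", default)

segments = [
    {"text": "hello there", "speaker": "SPEAKER_00"},
    {"text": "unattributed mumble"},  # no speaker assigned
]
labels = [speaker_label(s) for s in segments]
# labels -> ["SPEAKER_00", "SPEAKER_UNKNOWN"]
```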