This is the best series I’ve seen on this topic
Awesome series!! Respect for teaching all this stuff. I like to see people like you making this information more accessible.
Amazing tutorial!
It’s worth noting that Antares Auto-Tune actually operates in the time domain using the TD-PSOLA algorithm for pitch shifting.
The k prime calculated in synthesis step 2 (around @38:45) does not contribute to the output synthesis phase, since it will be cancelled out in step 4 by the phi_rs calculated in step 3. I also checked the code and came to the same conclusion. Maybe I missed something? Thanks!
Woah! THIS IS NEW STUFF! I'll check it out tomorrow ^^
Is it possible to create a video for changing the voice entirely?
Hello, great tutorial. Would it be possible for you to explain how to make a timestretch algorithm using the phase vocoder? Thanks in advance
Timestretching can be achieved by resampling the output of a pitch shifter. E.g.: to double the duration do the following:
- pitch up by one octave (2x frequency)
- upsample the output by 2x
- play back the upsampled output at the input sampling rate
You can achieve arbitrary timestretching ratios by using fractional ratios in the pitch shifter and resampler.
Keep in mind that constant-ratio timestretching in real time is essentially impossible, or at least unusable, because you either run out of input data if you are speeding up, or build up an ever-increasing delay if you are slowing down.
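The ratio bookkeeping behind the recipe above can be sketched like this (a minimal illustration; the helper names are mine, not from the video's code):

```python
# Sketch of the ratio arithmetic for timestretching via a pitch
# shifter plus resampler. Helper names are illustrative only.

def timestretch_ratios(stretch_factor):
    """Ratios for a pure timestretch by stretch_factor: pitch-shift by
    stretch_factor, upsample by stretch_factor, play at the input rate."""
    pitch_ratio = stretch_factor       # e.g. 2.0 = up one octave
    resample_factor = stretch_factor   # e.g. 2.0 = twice as many samples
    return pitch_ratio, resample_factor

def net_effect(pitch_ratio, resample_factor):
    """Net result of pitch-shifting by pitch_ratio, upsampling by
    resample_factor, then playing back at the original sample rate.
    Playing resample_factor times as many samples at the same rate
    multiplies duration by resample_factor and divides pitch by it."""
    duration_ratio = resample_factor
    net_pitch_ratio = pitch_ratio / resample_factor
    return duration_ratio, net_pitch_ratio

# Doubling the duration: up one octave, upsample 2x -> pitch unchanged.
assert net_effect(*timestretch_ratios(2.0)) == (2.0, 1.0)
```

The final assertion encodes the cancellation described above: the slower playback pulls the pitch back down by exactly the amount the pitch shifter raised it.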
loved the series, thank you so much!
Hello, many thanks for the video, but I have an important question.
When you implemented the slider for changing the hop size, something about the relationship between hop size, window size and FFT size confused me.
When the slider is set to its minimum value of 64, for example, the window length is 64*4 = 256. But in process_fft, when copying into the unwrapped buffer, you loop over the full FFT size and index the window table across the whole FFT range, which is 1024. So it uses values from outside the window range. How does that work? Is it appropriate?
Many thanks in advance for any help. Best regards.
If you're talking about the robotisation effect then yes you do end up with some funny combinations of parameters you wouldn't otherwise encounter in the phase vocoder. That's because the drone-like distortion of those settings is precisely the effect we're trying to create.
But as a more general point, it's not unusual to have an FFT that is longer than the window. For example, you could have a window whose length wasn't a power of 2, and the FFT would usually be the next power of 2 above that (for efficiency reasons). You can still reconstruct the original (shorter) window with more FFT bins than you strictly need. The real problem is the other way around, when your FFT is shorter than the window, and that's where the bad distortion in the robotisation effect starts to creep in.
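One clean way to handle an FFT longer than the window is to apply the window only over its own length and zero-pad the rest of the frame, rather than indexing the window table out of range. A minimal sketch (illustrative names, not the video's code):

```python
import math

def make_hann(win_size):
    """Periodic Hann window of length win_size."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / win_size))
            for n in range(win_size)]

def windowed_frame(samples, fft_size, win_size):
    """Apply a win_size-point window to the first win_size samples and
    zero-pad up to fft_size, so no stale window values are ever read."""
    window = make_hann(win_size)
    frame = [samples[n] * window[n] for n in range(win_size)]
    frame += [0.0] * (fft_size - win_size)   # explicit zero padding
    return frame

# A 512-point window inside a 1024-point FFT frame.
frame = windowed_frame([1.0] * 512, fft_size=1024, win_size=512)
```

Zero-padding like this just interpolates the spectrum; the reconstructed time-domain frame is still the original (shorter) windowed segment.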
@@apm414 Many thanks for your answer. Sorry, I forgot to mention that my question concerned the whisperisation effect, but it is also relevant to robotisation. And yes, I understand the problem when the FFT is shorter than the window. But my question concerns exactly the situation when the FFT is longer than the window, and I am still not sure I understand it.

Let's say we start with the "ideal" sizes, where the FFT is the same length as the window and the hop length is the FFT length divided by 4 (say fftSize = winSize = 1024 and hopSize = 256). So we have a precalculated window vector of size 1024, with every member filled with a relevant value. Now let's shrink hopSize to, say, 128, so we need to update winSize to hopSize*4 = 512. We then recalculate the data in the window vector, but only its first 512 members; the rest still hold data from the window of size 1024. And in your code we still use that old data, which is now out of range. Shouldn't we avoid that in some way?

I am asking because I am writing my own code and I want to avoid using an additional thread for the FFT, so for efficiency I want the data to be collected in unwrappedBuffer all the time, to avoid an additional fftSize loop in the FFT procedure. But I have run into some problems with windowing, and that is one issue among several others.
Hello! I have one more question; hopefully somebody can help me. The phase vocoder implementation in this video works perfectly, but there are some phase issues in my case. When I change the pitch slider everything seems to be OK, but when I slide back to the default position (pitch ratio = 1), the sound is no longer the same as the original. I'm not sure how to explain it: the sound has the original pitch, but it sounds like there are some phase cancellations or some filter applied. I found that a solution is to zero all lastInputPhases and lastOutputPhases every time I change the pitch slider position. Now, every time I set the pitch slider back to its default value, the sound seems to be the same as the original. But the problem with my solution is that while moving the pitch slider the sound is unpleasant. So I suppose I should update the last input and output phases in some way, but not simply with zero; I just can't figure out how to do that properly. Many thanks in advance for any help. Best regards
That is not surprising: from the moment the effect starts to run, you are only updating the phases as an estimate of the real ones, so over time they will drift. If you jump suddenly to a different setting, all bets are off on the phase, so it will probably sound smeared. This is why, to get the most out of the phase vocoder, you have to pick appropriate times to reset the phase back to what it is in the original signal, rather than keep updating it in the running state. For the pitch shifter there isn't a really good rule of thumb for when to resync the phase, which is why phase vocoder pitch shifting always sounds a bit smeared.
Question, for anyone really. I notice that with my pitch shifter, the fundamental frequency of the FFT size (mine is 1024, it's close to an F) becomes audible in the output. Does anyone have any idea why? Every note I play sounds like some frequency modulated note with F, even simple sine waves.
You're probably hearing the period of the hop size, rather than the FFT size, though you could check this by changing the hop size to see if the effect changes. There are a lot of reasons that could happen, but basically, something is probably not right in either the phase calculations or the windowing, leading to an effect similar to robotisation where the phase is deliberately reset each hop.
In the example, the windowing is done for you, so probably it has something to do with the phase reconstruction. Check that you have a static (or global) array of type float to hold the output phases and that you're updating it properly each hop, following the solution in the video. Then check the implementation of the specific equations. You might see whether the pitch is present when there should be no pitch shift, or only once you try changing the pitch.
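For reference, the per-bin phase bookkeeping being described can be sketched as below. This is a simplified version of the standard phase vocoder update (it omits the bin-shifting k' step used for actual pitch shifting, and all names are illustrative, not the video's exact code): estimate each bin's true frequency from its phase advance since the last hop, scale by the pitch ratio, and accumulate into a persistent output-phase array.

```python
import math

def wrap_phase(p):
    """Wrap a phase value into [-pi, pi)."""
    return (p + math.pi) % (2.0 * math.pi) - math.pi

def update_output_phases(in_phases, last_in, last_out,
                         fft_size, hop_size, pitch_ratio):
    """One hop of a simplified phase vocoder phase update. last_in and
    last_out must persist between hops (the static arrays mentioned
    above); here they are passed in explicitly for clarity."""
    two_pi = 2.0 * math.pi
    out_phases = []
    for k in range(len(in_phases)):
        # Phase advance a bin-centre sinusoid would have over one hop.
        expected = two_pi * k * hop_size / fft_size
        # Deviation from that tells us the true frequency within the bin.
        deviation = wrap_phase(in_phases[k] - last_in[k] - expected)
        true_freq = (expected + deviation) / hop_size  # radians/sample
        # Accumulate scaled phase into the running output phase.
        out_phases.append(wrap_phase(last_out[k]
                                     + true_freq * hop_size * pitch_ratio))
    return out_phases
```

With pitch_ratio = 1 and bin-centred input, the output phases simply track the input phases; failing that sanity check is a good sign the persistent arrays are not being carried over correctly between hops.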
@@apm414 Thank you for the reply! I'm still stuck, but you were right that I was hearing the frequency of the hop size - changing it changed the harmonic.
So, as you suggested, I revisited my code, and it matches the solution pretty closely. The reason I say this is that I am using Visual Studio (not Bela) in a VST setup, so there are some small differences, but I think it should all be pretty much the same. For instance, I cannot use the auxiliary task feature, and I may be incrementing the write-out pointer later than I should (right now I increment it by one hop size after I copy from the FFT buffer to the output buffer).
To answer your last question, the pitch is present when there should be no pitch shift, and stays constant regardless of the input (meaning the input is pitched but the bad pitch remains). If I remove the final window (I know I shouldn't) there is no bad pitch present until I touch the pitch knob, and then it sounds worse than before with or without the knob.
Does this still sound like a phase problem to you, or more like a window one? I appreciate your time.
I found that the "solution" to my problem was that my hop size was half the size of the FFT size. My FFT size was 512 and my hop size was 256. When I changed the FFT size to 1024 and hop size to 128 like in the video, the sound became much clearer, at the expense of performance. Thanks for the ideas, Andrew.
@@sqfx744 Hard to say since the underlying platform (and therefore the threading structure) is different, but what I would do is try doing a straight passthrough in the frequency domain (no pitch shift code at all, just FFT --> IFFT), but with analysis and synthesis windows. That will trace the problem to either the overlap-add code or the phase vocoder code.
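That passthrough test can be sketched in a few lines. The version below uses a naive DFT purely for illustration (any real FFT library would replace it) and Hann analysis and synthesis windows at 75% overlap; with no spectral processing, the overlap-added output should be the input scaled by the constant 1.5 (the hann-squared overlap-add sum at hop N/4), away from the start-up and tail regions. Names are mine, not from either codebase.

```python
import cmath, math

def dft(x):
    """Naive O(N^2) forward DFT; stands in for a real FFT here."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    """Naive inverse DFT, returning the real part."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def passthrough(signal, fft_size=16, hop=4):
    """FFT -> IFFT with Hann analysis and synthesis windows, overlap-
    added at 75% overlap, and no processing in between."""
    win = [0.5 * (1 - math.cos(2 * math.pi * n / fft_size))
           for n in range(fft_size)]
    out = [0.0] * (len(signal) + fft_size)
    for start in range(0, len(signal) - fft_size + 1, hop):
        frame = [signal[start + n] * win[n] for n in range(fft_size)]
        frame = idft(dft(frame))                 # passthrough: no change
        for n in range(fft_size):
            out[start + n] += frame[n] * win[n]  # synthesis window + OLA
    return out
```

If this reconstructs cleanly but the full effect does not, the bug is in the phase vocoder code; if even this buzzes at the hop rate, the bug is in the overlap-add or windowing.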
@@apm414 Thanks for your reply. I figured it out: my FFT and hop sizes needed to be adjusted, and after that it sounded decent. I have realized that this technique is too computationally expensive and too phase-y sounding for me, so currently I'm working on a time-domain one. I'm trying out a method where I find the period by collecting the samples that are approximately zero. Then I change the "size" of the buffer so that when I reach the end I loop back continuously to the beginning, using some trig. I think my problem is that when I get a new buffer I get clicks (I think the phase is off).
Anyway, thanks for the help.
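The period-finding step described above (locating near-zero samples) is essentially zero-crossing period estimation. A minimal sketch, my own illustrative code rather than the commenter's:

```python
import math

def estimate_period(samples):
    """Estimate the period of a roughly periodic signal by averaging
    the spacing between successive positive-going zero crossings.
    Returns None if fewer than two crossings are found."""
    crossings = [n for n in range(1, len(samples))
                 if samples[n - 1] < 0.0 <= samples[n]]
    if len(crossings) < 2:
        return None
    gaps = [b - a for a, b in zip(crossings, crossings[1:])]
    return sum(gaps) / len(gaps)

# A sine with a 32-sample period.
sine = [math.sin(2 * math.pi * n / 32) for n in range(256)]
```

Splicing loop boundaries at these zero crossings (or better, at matching points of the waveform) is also the usual way to avoid the clicks mentioned, since the joined samples then agree in both value and slope direction.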
Thank you for this. I still can't get my head around why so many other sources require an interpolation step but we don't need one here.
There are effectively two ways of pitch shifting with the phase vocoder. One is to manipulate the frequency components directly like this example does. The other is to time stretch the signal, keeping its pitch constant, then resample it to get back to the original speed (but with a different pitch). That approach requires interpolation.
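The interpolation in that second approach is typically just a fractional read of the time-stretched signal; linear interpolation is the simplest choice. A sketch (illustrative names, not code from the video):

```python
def resample_linear(signal, ratio):
    """Resample by reading the input at fractional positions spaced
    `ratio` apart, linearly interpolating between neighbours.
    ratio < 1 yields more output samples, ratio > 1 fewer."""
    out = []
    pos = 0.0
    while pos < len(signal) - 1:
        i = int(pos)
        frac = pos - i
        out.append(signal[i] * (1.0 - frac) + signal[i + 1] * frac)
        pos += ratio
    return out

ramp = [float(n) for n in range(10)]
half = resample_linear(ramp, 0.5)   # roughly twice as many samples
```

Higher-quality resamplers use cubic or windowed-sinc interpolation instead, which is one reason the two pitch-shifting approaches can differ in quality.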
@@apm414 Thank you for your answer. I am wondering why anyone would do the interpolation version; this one here seems quite a bit easier to me. Is there any difference in quality?
Does this library have autotune?