Have I helped you with this video? If yes, please consider buying me a ☕ coffee at www.buymeacoffee.com/janwilczek
Thanks! 🙂
Thanks, this was a great intro to this topic. I wanted to get started with SIMD and this will put me on the right path.
Thanks for your great introduction and lively demo! I really like your pace!
If you want to make sure you compile using SIMD instructions specific to the host CPU, you can use LLVM bindings for the language of your choice and then compile through LLVM. Interesting vid!
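In case it helps anyone: a simpler, related route is to let an LLVM-based compiler such as clang auto-vectorize for the CPU it is building on. A minimal sketch (the file and function names are just examples, and whether the loop actually gets vectorized depends on the compiler and flags):

// add.c: a plain scalar loop. With -O3 -march=native, an LLVM-based compiler
// such as clang may auto-vectorize it using the SIMD instruction set of the
// host CPU, e.g.:
//
//   clang -O3 -march=native -c add.c
void add_arrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}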
Hey, this is a difficult topic, you can't find anything about it on the internet, so thank you, Jan!
I'm very glad, thank you too!
Thank you Jan!
Thanks for commenting! :)
Great job explaining, and demonstrating. Thank you.
Great video! Looking forward to the next one. Next time, could you include more on the ARM and RISC-V case?
You made the concept easy to understand, thank you. I would also like to see some C examples, if possible.
Line 13 is killing me lol
Nice video, nicely paced and clear. Just what I needed to understand this topic a bit better. I just need some more examples of calculations actually handled by the SIMD extension sets, and perhaps some alternative SIMD/FFT libraries with info about what does what and how; that would be epic. Not many people teach this in audio with such good phrasing! Keep up the great work! 👍
I hadn't read the article you wrote about this topic before. It is great; there is much more info there, giving more depth, thanks!
Many DSP algorithms contain single-sample feedback. Can anything be done to vectorize these algorithms? It seems like the feedback complicates any attempt to use block processing to vectorize them.
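Just to make the question concrete, here is a minimal sketch of such single-sample feedback (a hypothetical one-pole lowpass, not code from the video); every output sample depends on the previous output, so eight consecutive samples cannot simply be computed in one register:

// Hypothetical one-pole lowpass: y[n] = y[n-1] + a * (x[n] - y[n-1]).
// The dependence on the previous output is what blocks straightforward
// block-based vectorization over time.
void one_pole_lowpass(const float* x, float* y, int n, float a) {
    float state = 0.0f;
    for (int i = 0; i < n; ++i) {
        state = state + a * (x[i] - state);
        y[i] = state;
    }
}

As far as I know, the usual workaround is to vectorize across independent channels or voices rather than across time; rewriting the recurrence itself is possible for some linear filters, but that is a much bigger topic.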
Very helpful video. I was working on a particle system/simulation, and I use GL to draw the particles. I was wondering, with SIMD and GL, how can I draw multiple particles at once? Or is this something that has more to do with GL buffers?
Yes, you helped a lot ^_^
That's great ;)
I can understand the concept of SIMD. But in the code I can see that you are adding each value as it is added to the register, which looks equivalent to scalar addition; I think that's to avoid one more for loop to store the sums into the result array, which makes sense. This leads me to ask: does the intrinsic function perform the addition only when all 256 bits are filled with values, or can it also perform it otherwise?
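For anyone else wondering: as far as I understand, the 256-bit add intrinsic always operates on all 8 float lanes at once; there is no partially filled mode. A minimal self-contained sketch (not the code from the video, just the standard AVX intrinsics):

#include <immintrin.h>
#include <stdio.h>

// Compile with AVX enabled, e.g. with -mavx.
int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float r[8];

    // _mm256_add_ps always adds all 8 lanes (the full 256 bits) in one
    // instruction, regardless of how many of them hold "meaningful" data.
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(r, _mm256_add_ps(va, vb));

    for (int i = 0; i < 8; ++i) {
        printf("%f\n", r[i]);
    }
    return 0;
}

If fewer than 8 meaningful values remain, the usual options are a scalar loop for the remainder (as discussed in the thread below) or masked loads/stores.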
Great job,
but didn't the second for-loop kill the entire point of using SIMD?
Late reply, you probably already have figured it out by now. Responding anyway for others with the same question.
That's like saying planes are pointless for traveling large distances, because you still need to walk the short distance to your destination from the airport.
SIMD will do a large portion of the work (in this case, in multiples of 8), and the regular loop will finish the remaining elements.
So for the normal loop you are looking at:
N * scalar
while for SIMD you are getting:
floor(N/8) * SIMD + (N % 8) * scalar
Since by design 1*SIMD is faster than 8*scalar, for sizes greater than or equal to 8 the second algorithm will be faster than just doing the first loop. Otherwise, for sizes smaller than 8, it will be the same as the first loop plus some overhead from the division by 8.
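For completeness, the pattern described above looks roughly like this with AVX intrinsics in C (a sketch; the function name is made up and it assumes plain, possibly unaligned float arrays, not the exact code from the video):

#include <immintrin.h>
#include <stddef.h>

// floor(n/8) vector additions plus (n % 8) scalar additions.
void add_arrays_avx(const float* a, const float* b, float* out, size_t n) {
    size_t i = 0;

    // Main SIMD loop: 8 floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }

    // Scalar tail: at most 7 remaining elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}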