I will also add that branching can stall a CPU, particularly as processors attempt to "guess" which bit of code will be executed next. If it guesses wrong, it has to effectively "go back", so removing branching is a good strategy for optimisation.
@javidx9 ... hmm, did you ever calculate _c in your code example? Well, it is likely in the git repository ;) but in your video I think that part is missing (e.g. at 54:36). I just see comments on what you want to achieve... or did I overlook that part? Nevertheless a cool video; a long time ago I programmed in assembler, but nowadays I'm relying on the C# compiler ;).
C++20 brings us [[likely]] and [[unlikely]], which may help with branching. See Jason Turner on this topic at ua-cam.com/video/ew3wt0g99kg/v-deo.html Thank you for the educational video, javidx9. Stay safe. P.S.: Isn't it nice that meat-bags (humans) are still useful for optimization work and making videos? :)
@@dieSpinnt It's worth noting that the [[likely]] and [[unlikely]] attributes (or the equivalent compiler-specific markup you would have used prior to C++20, such as __builtin_expect) can't really directly help the CPU predict branches better; they mainly help make correctly-predicted branches perform better, by hinting to the compiler to, for example, reorder branches to reduce the overall number of jumps in the expected code path, improve the cache locality of the expected code path by laying it out contiguously in memory, or decide whether to use a branch vs. a cmov.
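For illustration, the attribute syntax is just a per-statement annotation; a minimal sketch (hypothetical function name, C++20):

```cpp
// Branch-hinted clamp: [[likely]]/[[unlikely]] influence code layout
// (which path becomes the fall-through, hot path), not the CPU's
// runtime branch predictor.
inline long clamp_to(long x, long limit)
{
    if (x < limit) [[likely]]
        return x;       // expected common case stays on the fall-through path
    else [[unlikely]]
        return limit;   // rare case can be laid out out-of-line
}
```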
Why is it so hard to find resources about the incredible branch prediction in processors? I only saw it mentioned in a talk by former Intel/Tesla chief processor architect Jim Keller. It's not just predicting what will be used next, but parallelising automatically by finding independent pieces of work - for example, initialising variables before the function is even called! It seems that there is a processor inside the processor doing those predictions live, depending on the current run time and on threads from other programs. The firmware can optimise machine code live, not just following the .exe's machine code. And Intel wants to use neural networks to predict branching. That's how they manage to make code run faster without increasing clock speed. Also, there are professional-grade Intel compilers with licence prices higher than consumer processors that perform much more advanced optimisation than generic GCC. It seems such a fascinating topic, but surprisingly secret.
@@Bvic3 If you're interested in how branch prediction works, you might want to read the Wikipedia article en.m.wikipedia.org/wiki/Branch_predictor. If you look at the reference section you'll find that much of the theory is freely available; what's secret isn't usually how things work, but rather all the work that goes into implementing something that can actually put it into practice and be competitive. As to the Intel compiler vs GCC, a lot of that is marketing. Sometimes Intel does better, sometimes GCC does; it really depends on specifics (i.e. what code is being compiled, how performance is being measured, what system it's running on, what version of the compiler, what compiler options were selected, etc.). Naturally it's easy for Intel marketing to cherry-pick scenarios that put their compiler in the best light, so it's important to understand that your results will vary.
*head explodes* - I see a lot of basic programming videos online with all the usual fare, and they are very nice. But it's refreshing to see more advanced topics like this covered, and covered so well.
I was watching this when I couldn't get to sleep. It is so fascinating that I kept watching and watching. It didn't help me get to sleep at all ;-). Thanks for a great lesson.
Interesting. I watched the video and thought: "Wow, what an amazing teacher, full of content", then I subscribed, checked the channel's videos, and realized that back at the beginning of my degree I visited this same channel for starter-level content, and now, almost finishing the course, here I am, seeing something more complex. Moral of the story: the channel and its creator are both incredible. Thank you !!! Sorry for my poor English....
Please keep this series explaining the parts used in C++ SIMD going; your way of explaining is awesome. Thanks for putting such high-quality content out in public.
A bit late to the party, but here goes... This video has inspired me to try SIMD programming. I have long been a fan of Mandelbrots and many years ago wrote a program to plot and explore them. Eventually I got myself a PC with an i7 processor and explored making my Mandelbrot program multi-threaded, which worked well. Now is the time to upgrade it again with SIMD. My CPU is still the same i7, which does not support anything past SSE4, but my compiler of choice is Delphi 6 (don't judge), which does not support intrinsic functions at all. However, it does have a built-in assembler which supports up to SSE2. So my task has been to translate all this C++ code into Pascal/assembler. I eventually got this to work - a few radical changes were required, e.g. I only have 8 x 128-bit XMM registers to play with, so only 2 pixels at a time - but the speed-up is amazing. My program renders full-screen images in just a few hundred milliseconds (sometimes much less), where it was taking multiple seconds before. The most complex image so far has taken only slightly over a second to process. Thank you so much for explaining this in such simple terms that I was able to do this and learn about SIMD.
Hey that's great Bobby! SSE4 is no slouch, and I'm pleased you got it working to your expectations. I must confess I'd not considered the availability of intrinsics in other languages before, so this is quite interesting.
@@javidx9 C# recently added intrinsics via the System.Numerics namespace and they work pretty well. See: devblogs.microsoft.com/dotnet/hardware-intrinsics-in-net-core/ and: docs.microsoft.com/en-us/dotnet/api/system.numerics?view=netcore-3.1
Thank you for this video. I took a CS course in parallel computing this semester, and it demystified a lot of what makes high-performance code tick. This video helped me to connect what I've learned with what is going on in an IDE like Visual Studio.
Incredibly informative, the most hardcore but so enjoyable. Personally, I found masking not the hardest thing in it, instead it was the x positions and offsets, especially 52:04 - 52:12, what a... Fantastic stuff, thanks to your channel I really got more and more interested in more low level coding. The way you present it makes it much less scary, even the assembly code :)
Recently I had some C# code that would take about 50 minutes to execute and calculate. Running it in parallel got it to about 5 minutes. Using OpenCL (kinda like CUDA) got it a little under 10 seconds xd Edit: And yes, I did run the code for 50 minutes xd
@@guiorgy I wish I had the time to bring myself up to speed with CUDA or OpenCL, but besides a little bit of programming Arduinos I'm not a C programmer, and I struggle with basic concepts like const char * const etc. I have a project regarding a gamified genetic algorithm which I did in Java years ago, and sometime I'll have to recode it with C, GPU computing and a powerful graphics engine.
When I first looked at the intrinsic code, I thought how complicated it was... But you explained it perfectly, something clicked, and I realized how easy it really is. Thanks to AVX, I'm getting double the performance on my Mandelbrot set renderer. The best thing is, it even works on multiple cores with an OpenMP directive. The performance on the CPU is as good if not better than on the GPU.
I will have a Computer Architecture exam next week and a significant chunk of the material is about SIMD extensions but since it's a university course it's all theory, so it's nice to see it in action.
Another great video and a little trip down memory lane. A few years ago, I had to do image processing on older hardware which did not have any GPU acceleration, and some algorithms had to be written with SIMD. After getting my mind wrapped around working in vector-oriented mode, the project was surprisingly pleasant to code.
I showed one of your videos to my father to make him believe you were me - we look exactly alike - and it took him a good minute to realise it wasn't me!!! We laughed so hard, keep up the good work =)
44:34 ++++ You don't need to use the comparison mask to select/blend between '0' and '1'. Since 'all ones' is the two's complement representation of '-1', you can simply subtract the mask from your iteration counter (x + 1 = x - (-1), and x + 0 = x - 0). You could've explained the blend intrinsic with this code segment, going from where you were with your AND equivalent, but also showing off the trick I mentioned afterwards.
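A scalar model of that subtraction trick (the SIMD version does the same thing independently in each 64-bit lane; the function name is illustrative):

```cpp
#include <cstdint>

// A SIMD comparison produces a lane that is either all ones (the two's
// complement representation of -1) or all zeros. Subtracting the mask
// from the counter therefore adds 1 exactly where the comparison was
// true, with no blend and no extra '1' constant needed.
inline int64_t advance(int64_t counter, bool lane_still_iterating)
{
    int64_t mask = lane_still_iterating ? -1 : 0;  // what a compare yields per lane
    return counter - mask;                          // +1 where true, unchanged where false
}
```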
I'm not quite sure why you talked about cache locality when you did, as it's unrelated to the loop unrolling optimization. The cache behavior of the loop is the same either way - the reason it gets unrolled is just to reduce loop overhead (fewer compare and branch instructions per iteration). Other than that, this seems like a great video for introducing people to SIMD programming. Your explanations of lanes vs. register width, masking, and the utility of intrinsics in general, are all very clear, concise, and thorough. Good stuff!
Thanks Not Null - I kind of agree with you. I wanted to fit locality in somewhere, and there is some truth to unrolling being advantageous to cache usage, for the reason you describe in fact - aside from branching having its own overhead which you want to reduce, and of course branch prediction being a factor, the branch test itself could potentially pollute the cache. SIMD stuff works best when streamed, and there are in fact cache-organisation intrinsic functions to hint where the data should be moved to before the next set of instructions. Streaming of course works best with contiguous data in memory, and typically such memory is moved around "together". Once that extension pipeline is fired up, you want to cram as much data through it as possible, so I don't agree that it's unrelated, but I do concede it is secondary to the chaos branching can cause.
@@javidx9 Doesn't the loop condition (at least in this case) just come down to a compare instruction and a conditional jump on the relevant flag bit? I don't see how that would pollute the cache, but I might be missing something.
@@notnullnotvoid On powerful processors such as desktop ones, it's not quite that simple. Yes, the condition is based off a single bit, but two things: firstly, the pipelined nature of the processor requires branch prediction, and flushing out the pipeline is undesirable for performance; secondly, the arguments for the condition itself may require memory to be read, thus potentially polluting the cache.
By far the best video about this topic on YouTube overall. I only found videos that were much less detailed, or way too detailed on some specific parts. Cheers :)
Just watched Linus tech tips where Anthony mentioned "AVX 512" support on a new macbook and since I recently watched this video I could say "Oh yeah... I understand that... in depth." :)
Javidx9, my hero. Why? He reads every single comment I wrote on this channel, and I'm sure the same applies to everyone else. If I become a successful person one day, the reason must be your videos. They are very well made, and he explains every single step he makes in his videos. I can't help with the financial part right now, but I will make sure to pay back what you did for me in the future after I get a job. You are a very cool man (I can't even describe it with words). And thinking about what you did for me makes me so emotional.
Great video and amazing channel. I just want to point out a small note at 41:57, which is that n < iterations is not the same as iterations > n because of the case where n = iterations
This is a good point Mohamed - combined with the way the loop is structured now, I think this approach always does one further iteration compared with the reference function.
(1) n < iterations
(2) iterations > n
If n = iterations, both expressions are false, since both comparisons exclude equality. They are in fact the same.
Ha! And here it is: the SIMD video I asked for earlier today, along with plenty of others who asked before that, because I didn't check the post date on the brute forcing video. Great stuff!
So this replaces the old-school approach of having an __asm {} block to hand-optimize logic the compiler couldn't manage - like we find in some older open-sourced game engines - with organized intrinsic functions exposing modern CPU instructions via modern compilers. Nice.
Is there a space where we can give ideas for new videos (so we have a list) and then vote on which subject is most wanted? ...Obviously this is your channel and your vote is final, but one thing is certain: you have a gift, and your delivery and your voice are in perfect balance: a very, very good teacher. We are very lucky you have time for these videos.
Hi Marcin - kind of, but mostly no. On the Discord we have a requests board, though fundamentally it requires that I feel confident enough about the subject matter to demonstrate it. I simply won't make videos about subjects I don't have a good understanding of or experience with; they wouldn't help anybody! Also, I often disappoint people with the timing of videos - since this is a hobby for me, it helps if the video I'm making is related to some project I'm working on at the time. In the case of intrinsics for example, I've been using them a lot in a different project which isn't a video, so it's fresh in my mind. But always happy to see a comment from your good self, a long-time supporter, and I thank you for that!
At 49:40, it is also possible to use _mm256_extract_epi64 to get simple types out of a register again, which would get rid of the ifdef. Having done some intrinsic programming before, I think that your video is an amazing resource on how you program with its quirks in mind. Well, all of your videos are an amazing resource - keep up the good work! :)
Cheers buddy! The problem I find with intrinsics is there are so many functions, but I've not found a sensible "high level" list of function categories XD so thanks!
!!!Beware!!! Don't ship into dangerous waters. Rule 16 Do not use identifiers which begin with one or two underscores (`_' or `__'). > The use of two underscores (`__') in identifiers is reserved for the compiler's internal use according to the ANSI-C standard. > Underscores (`_') are often used in names of library functions (such as "_main" and "_exit"). In order to avoid collisions, do not begin an identifier with an underscore. via www.doc.ic.ac.uk/lab/cplus/c++.rules/chap5.html Just my two nit-picky cents:P
Fractals and similar iterations sound like a close second to me. There's very little to be transferred into the GPU, and very little back out. Moving the heavy lifting into the GPU could be very profitable, even more so since modern GPUs tend to have 100s of cores, even the better consumer-grade models. Not exactly your everyday algorithm, but even if you want to save the data to disk, it looks very promising. If you don't, real-time animation in full HD is definitely on the horizon thanks to CUDA. For other stuff, it can be the other way around. Instead of freeing CPU cores, it could tie cores down with management duties (or even worse: tie up ONE core and block the others out), which is probably a workload for which most modern OSes are not optimized (unlike processing in the CPU or pure output generation in the GPU).
It might be instructional to add the benefit of using intrinsics by showing a side-by-side video of the fractal generations. IOW, a "what is there to gain" for all your extra coding efforts. Well done, I have recommended this on IDZ.
Hi Zubble and thanks - in principle this was a follow-up to the previous video, which did show the difference with/without intrinsics; it's just that one did not show the intrinsic code in detail.
I think I found a really cool problem that could use intrinsics, so I'm excited. A couple of other optimizations and I'm aiming to solve out to 1 million instead of grinding it out to 50 thousand or so. Great video!
Great video as always! For those who want a more detailed look at the difference in timings for cache vs. memory vs. hard drive, I recommend the talk "Going Nowhere Faster" by Chandler Carruth at CppCon 2017.
Love your vids. Even if I don't understand them the first time I watch because I'm just a simple web developer (PHP, NodeJS) but the way you explain helps me to understand more of our computers and the way programs work (and I hope they make me a better programmer - even on simpler stuff ^^)
Hello javidx9, referring to 1:58, I'm a little annoyed and would like to complain about the fact that you unnecessarily free up the heap-allocated arrays before the program terminates, as the memory will be freed anyway when the program exits. It will just make exiting the program slower.
While technically this is true, I would argue that the benefits of consistently freeing resources after use (so it becomes habit) far outweigh the cost of freeing some memory, even in a small demo like this. Explicitly freeing all allocated memory can also make it easier to catch subtle bugs (e.g. segfaults) and will keep your static/dynamic analysis tools happy (if you're using any, that is).
@@sbfk6799 This thought process eventually leads to RAII, because things become too complex. Applications generally handle data in batches: allocate everything at the start, throw it all away at the end. If you wanted to be pedantic you could argue that he should have allocated everything in some kind of arena and then freed that at the end. But telling people to delete 3 arrays at the end of a 100-line demo program is bad, dogmatic advice.
Am I missing something or why do you assume that the allocated memory is freed anyway when the program exits? Whilst that is practically certainly true because a modern OS is smarter than most programmers, it is still implementation-defined (or OS-specific) behaviour and therefore non-standard. Unfreed memory, per standard, is leaked memory which is why him explicitly freeing (or deleting) it is absolutely correct and fine. Considering there's much more the OS has to do anyway whenever a program exits, the slowdown induced by a couple of frees is entirely negligible. I argue it's not even measurable. Also, for C++, not deleting a non-trivial (i.e. non-pod) object (or array of objects) means that the destructor(s) aren't called either. And that's undefined behaviour.
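For completeness, a sketch of the RAII alternative mentioned in this thread (illustrative only; the video itself uses raw new[]/delete[]):

```cpp
#include <vector>
#include <cstddef>

// Owning the buffers with std::vector sidesteps the debate entirely:
// the memory is released deterministically when the object goes out of
// scope, with no explicit delete[] and no reliance on OS cleanup at exit.
std::vector<float> make_buffer(std::size_t n)
{
    return std::vector<float>(n, 0.0f);  // zero-initialised, freed automatically
}
```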
22:24 Just letting you know, the compiler is actually smarter than it seems. In this example you will get the desired behavior by using a different "branching" technique: for(...) arr3[i] = arr1[i] + arr2[i] * (arr1[i] != 23); This way the compiler uses ymm registers and moves them 8 at a time. Can't say how wasteful it may be, as it fills the whole of ymm6 with 0x17s (23s) for parallel comparisons, but still, impressive imo.
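The same pattern in scalar form, to show why it vectorises (function name illustrative):

```cpp
// Branchless select: (a != 23) evaluates to 0 or 1, so the addend is
// either kept or zeroed without a conditional jump - exactly the shape
// an autovectoriser can map onto a SIMD compare-and-mask.
inline int select_add(int a, int b)
{
    return a + b * (a != 23);
}
```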
I think what there needs to be is a nice API that lets you "agnostically" use the SIMD extensions without needing to know which ones your CPU supports. The API would provide a way to manually leverage the registers in a more human-readable way, without having to pay attention to choosing the right set for your CPU; it would just generate the right intrinsics for your system. You wouldn't need to think about whether the registers are 128, 256, or 512 bits wide. The system would pack in the data automatically, and it's up to you to use it to process data in bulk.
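A toy sketch of that idea (names hypothetical; libraries such as std::experimental::simd from the Parallelism TS explore exactly this direction):

```cpp
#include <cstddef>

// A fixed-width vector type whose lane count can be chosen per target.
// Operations are plain loops, which the compiler's autovectoriser is
// free to map onto whatever registers the CPU has (SSE, AVX, ...),
// keeping user code width-agnostic.
template <typename T, std::size_t N>
struct Vec
{
    T lane[N];

    Vec operator+(const Vec& rhs) const
    {
        Vec out{};
        for (std::size_t i = 0; i < N; ++i)
            out.lane[i] = lane[i] + rhs.lane[i];
        return out;
    }
};
```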
An excellent video! Thank you for taking the time to respond to user feedback. Appreciate the details about masks and how to use them to perform logical operations. I've been learning x64 programming via "Beginning x64 Assembly Programming: From Novice to AVX Professional" by Jo Van Hoey, and this is an incredible supplement to the C/C++ side of things.
One of the biggest issues today is that CPU % meters do not show stall time. So you can have a horrifically inefficient data layout and be running at 5% of CPU speed, but the CPU meter will show 100%. I am amazed that there is still no way in perfmon, VS, ... to see the real CPU load. I did not realize how truly huge the impact was.
@ Not that I know how javidx9 learned cpp, but I wanted to highlight a bit of a logical fallacy you're falling into. The original comment asks how javidx9 (he in particular) learned cpp. Bolt Strikes said that he likely went to college (which I guess is likely). You are saying that going to college is not a requirement for good programming skills, which I definitely agree with, but that's beside the point. The question here is "how did javidx9 learn cpp", not "is it possible to learn cpp without going to college".
One big thing I noticed was that _c was not actually set in the video, but was initialized on GitHub. This confused me greatly: _c = _mm256_and_si256(_one, _mask2);
The function refers to the logical AND operation. _mask2 contains numbers which are either all 1s or all 0s. If you AND something with all 1s, it doesn't change its value, while if you AND something with all 0s, it returns 0. _one contains four constant 1s. This line stores in _c the result of ANDing each value in _mask2 with 1. Essentially, this line converts each value in the mask from being either 64 ones or 64 zeroes to just one or zero.
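A scalar model of what that line does in each 64-bit lane (function name illustrative):

```cpp
#include <cstdint>

// _mm256_and_si256(_one, _mask2), per lane: the mask lane is all ones
// or all zeros, so ANDing with the constant 1 collapses it to exactly
// 1 or 0, ready to be added to the iteration counter.
inline uint64_t mask_to_increment(uint64_t mask_lane)
{
    return mask_lane & uint64_t(1);
}
```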
16:50 So it could have just gotten rid of the loop altogether, making sixteen-something-thousand instruction repetitions, but decided not to because the program size would be too large - am I correct? And if it actually did this, would it increase performance even more?
Hi Javid! Nice video, I really like this stuff! I have a question. Would these functions fail to compile on CPUs that don't support the required extensions or would they compile and cause erratic behavior at runtime?
Thanks Gianluca! They would compile just fine, but under normal circumstances you would not be able to launch the executable program, as a quick check is made at the start to see if it contains unsupported instructions.
19:54 "Yes it means that the executable is bigger". I often see that in release mode the executable is actually smaller, and the more "high-level" the language, the bigger the difference. Why so?
That's because not all optimizations are like loop unrolling. Many reduce the number of instructions executed by reducing the number that are in the final program. For instance, if you check for a condition that the compiler knows can't happen, it will just remove that whole branch of code. Or if you write a bunch of calculations that don't change with runtime input, the optimizer can reduce those calculations to a single number (or whatever the resulting type is). Also, debug mode tends to store all local variables on the stack, in main memory, and will insert instructions to move them to and from registers as needed, whereas release mode tries to keep everything in a register as much as possible and doesn't need those extra instructions. Roughly speaking, an optimizing compiler tries to minimize the number of instructions that run, and that usually means removing instructions from the binary. But in some cases, like with loop unrolling, it actually helps to write more instructions.
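A concrete example of that kind of shrink: everything below is known at compile time, so an optimising build typically folds the whole loop away (a sketch):

```cpp
// With optimisation on, this loop is usually evaluated at compile time
// and the function body reduces to "return 5050;" - fewer instructions
// in the binary, not more.
constexpr int sum_to_100()
{
    int sum = 0;
    for (int i = 1; i <= 100; ++i)
        sum += i;
    return sum;
}
```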
@@jeremydavis3631 Actually it usually does the opposite and bloats the assembly. There is a special -Os for GCC to optimise for space. -O2 does not bloat the assembly, which is why it is used for the Linux kernel rather than -O3.
There are LOTS of improvements which help with both size and time. And small size helps time too, since you can load data faster if you have to load fewer instructions. If you look at 04:31, you see LOTS of unnecessary instructions. For example, the first *mov* after "bigArray3[i] = blahblahblah" loads i into the register eax. The *fld* instruction uses that to load from address [ecx+eax*4], i.e. 4*i bytes from the origin of bigArray3. Then, the next *mov* instruction loads the same i into edx, _although i hasn't changed in the meantime_. That's an instruction which can be removed if you change the register of the next *mov* to edx (so that it doesn't overwrite eax) and the address of the *fadd* to [edx+eax*4] (since eax and edx switched roles). That's far less than the things the compiler actually does in Release, but if you set the compiler to "Release / Optimize size", it will optimize without unrolling aggressively, saving more bytes.
Rather than the bitwise and of _mask and _one and adding that to the counter, couldn't you just subtract the mask from the counter? Wouldn't it be simpler to track the iteration limit in a separate traditional integer register?
I have some questions... What if the x-axis width is not divisible by 4? Does the compiler know how to get the next value from the next line? Also, I missed a demonstration of the assembly code at the end... I totally believe you that it is probably the way it should be, but I would have liked to see it. Other than that, nice video, very advanced.
Thanks Matheus! On the whole, if you are using these extensions, it's better to work with data that is a multiple of the register size you are using - it just makes life easier, so waste a few bytes of memory if required. It doesn't really matter in the case of the fractal example if the x-axis width is not a multiple of 4; it would just wrap around and do a few pixels on the next line. If you really can't clamp your array to a multiple, then it can get fiddly, and you may be better off recognizing that as a special case and just hand-processing those remaining elements.
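The usual way to do that padding is to round the width up to the lane count (a small sketch, assuming you control the allocation; function name illustrative):

```cpp
#include <cstddef>

// Round a width up to a multiple of the SIMD lane count, so every row
// is processed in whole registers and the ragged tail disappears. A few
// wasted elements per row is usually cheaper than a scalar cleanup loop.
inline std::size_t padded_width(std::size_t width, std::size_t lanes)
{
    return (width + lanes - 1) / lanes * lanes;
}
```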
While we're on the optimisation train is there any chance of you doing an easy (or as easy as is reasonable) to understand video about things like Data Oriented Design, SoA vs AoS and when to choose which?
Hey Gareth, it's interesting stuff, but maybe a little difficult to demonstrate easily in a video. I don't mean the concept in this instance, but its efficacy without resorting to deliberately contrived scenarios. I'll add it to me "thonk" list XD
Are there any fast c++ pow(a,b) functions where b is any real number? What about intrinsic type power functions (couldn't find any)? Substituting pow(z,r) for z*z makes the program run like molasses. I am incrementing r +=/-= 0.1 and getting nice shapes.
That was a very interesting video - you have a rare knack for hitting a happy medium in the conflict between "informative" and "comprehensible". Your vids are usually pretty entertaining too, so double thanks. I was wondering - having taken the fractal rendering so far with intrinsics and multithreading, Could yoU Devise An even more hardcore strategy to get even higher performance?
Hi javidx9, thanks for all your videos, it's great work and always so pleasant to watch! A little question here: how do you manage to show a preview of your current file on the right side of your screen? Is that an extension, or a feature I'm not able to find in the endless universe of visual studio's options? The split functionality only splits horizontally, and doesn't allow a different zoom value for the 2 resulting windows, so I guess it's a completely different thing...
If the CPU guesses wrong it has to go back, and that's why heavily branched programs should be implemented in C-squared and executed from end to start. That way each miss will actually improve performance - since going backward will be closer to the end of the program.
The most handsome C++ guy that ever walked this planet
if u like javidx9 check out ChiliTomatoNoodle too, for some more sweet c++ 😃
Javidx9 and ChiliTomatoNoodle are surely the best C++ teachers I ever had. :)
The cherno is also a great guy
There's also the Chinese guy, I don't remember the name, and Jason
leo carvalho Thomas Kim or Bo qian?
@@esepecesito bisqwit
Fr
OMG HE HEARD OUR BEGGING FOR MORE SIMD COVERAGE! Blessed shall you be, you immortal being :D
Both your C++ and your teaching skills are absolutely excellent! They should give you a Bjarne Stroustrup Award.
Quite the intrinsic video! I haven't even watched the video long enough to know what it means, but I wanted to use that adjective! :)
Good one lol
That's a perfectly cromulent adjective!
I was watching this when I couldn't get to sleep. It is so fascinating that I kept watching and watching. It didn't help me get to sleep at all ;-). Thanks for a great lesson.
Interesting. I watched the video and thought: "Wow, what an amazing teacher, full of content", then I subscribed, checked the channel's videos and realized that back at the beginning of my degree I visited this same channel for beginner-level content, and now, almost finishing the course, here I am, watching something far more complex. Moral of the story: the channel and its creator are both incredible.
Thank you !!!
Sorry for my poor english....
Please keep this series of explaining parts used in C++ SIMD to continue, your way of explaining its awesome, Thanks for putting such a high quality content out in public.
A bit late to the party, but here goes... This video has inspired me to try SIMD programming. I have long been a fan of Mandelbrots and many years ago wrote a program to plot and explore them. Eventually I got myself a PC with an i7 processor and explored making my Mandelbrot program multi-threaded, which worked well. Now is the time to upgrade it again with SIMD. Now my CPU is still the same i7, which does not support anything past SSE4, but my compiler of choice is Delphi 6 (don't judge), which completely does not support intrinsic functions at all. However it does have an in-built assembler which supports up to SSE2. So my task has been to translate all this C++ code into Pascal/assembler. I have eventually got this to work - a few radical changes were required - e.g. I only have 8 x 128 bit xmm registers to play with, so only 2 pixels at a time, but the speed-up is amazing. My program is rendering full screen images in just a few hundreds of milliseconds (sometimes much less) where it was taking multiple seconds before. The most complex image so far has only taken slightly over a second to process. Thank you so much for explaining this in such simple terms that I was able to do this and learn about SIMD.
Hey that's great Bobby! SSE4 is no slouch, and I'm pleased you got it working to your expectations. I must confess I'd not considered the availability of intrinsics in other languages before, so this is quite interesting.
@@javidx9 C# Recently added Intrinsics via the System.Numberics namespace and they work pretty well. See: devblogs.microsoft.com/dotnet/hardware-intrinsics-in-net-core/
and: docs.microsoft.com/en-us/dotnet/api/system.numerics?view=netcore-3.1
Thank you for this video. I took a CS course in parallel computing this semester, and it demystified a lot of what makes high-performance code tick. This video helped me to connect what I've learned with what is going on in an IDE like Visual Studio.
Exactly when I needed it! The timing couldn't be more perfect
Incredibly informative, the most hardcore but so enjoyable. Personally, I found masking not the hardest thing in it, instead it was the x positions and offsets, especially 52:04 - 52:12, what a... Fantastic stuff, thanks to your channel I really got more and more interested in more low level coding. The way you present it makes it much less scary, even the assembly code :)
That was great - and now CUDA ;-)
Then PTX assembly.
Recently I had some C# code that would take about 50 minutes to execute and calculate. Running it in parallel got it to about 5 minutes. Using OpenCL (kinda like CUDA) got it a little under 10 seconds xd
Edit: And yes, I did run the code for 50 minutes xd
@@guiorgy I wish I had the time to bring me up to speed with Cuda or Open CL, but besides a little bit of programming Arduinos I'm not a C programmer, and I struggle with basic concepts like 'const * char const' etc.
I have a project involving a gamified genetic algorithm which I did in Java years ago, and sometime I'll have to recode it in C, with GPU computing and a powerful graphics engine
@@Schwuuuuup write it like this: const char * data; this will make more sense
@@aamirpashah7159 dude, my post is over 3 years old
When I first looked on to the intrinsic code, I thought how complicated it was...
But you explained perfectly, and something clicked, and I realized how easy it really is.
Thanks to AVX, I'm getting double the performance on my Mandelbrot set renderer. The best thing is, it even works on multiple cores with an OpenMP directive. The performance on the CPU is as good if not better than on the GPU.
I will have a Computer Architecture exam next week and a significant chunk of the material is about SIMD extensions but since it's a university course it's all theory, so it's nice to see it in action.
Hell yeah! I tried those functions myself. Amazing tutorial. The speed gain is insane combined with using threads :) Thank you!
Another great video and a little trip down memory lane. A few years ago, I had to do image processing on older hardware which did not have any GPU acceleration, and some algorithms had to be written with SIMD. After getting my mind wrapped around working in vector-oriented mode, the project was surprisingly pleasant to code.
You sir are a beast! I'm a senior developer who's been coding for 10 years; your knowledge is serious :)
I showed one of your videos to my father to make him believe you were me; we look exactly alike, and it took him a good minute to realise it wasn't me!!! We laughed so hard, keep up the good work =)
a doppelganger eh?
44:34 ++++
You don't need to use the comparison mask to select/blend between '0' and '1'. Since 'all ones' is the two's complement representation of '-1', you can simply subtract the mask from your iteration counter (x + 1 = x - (-1), and x + 0 = x - 0).
You could've explained the blend intrinsic with this code segment, going from where you were with your AND equivalent, but also showing off the trick I mentioned afterwards.
Amazing as usual!! I am simply amazed by the quality of your videos, topics and explanations.
Thanks wowLinh - It always pleases me when I see you comment - you've been around a loooong time now XD
I'm not quite sure why you talked about cache locality when you did, as it's unrelated to the loop unrolling optimization. The cache behavior of the loop is the same either way - the reason it gets unrolled is just to reduce loop overhead (fewer compare and branch instructions per iteration). Other than that, this seems like a great video for introducing people to SIMD programming. Your explanations of lanes vs. register width, masking, and the utility of intrinsics in general, are all very clear, concise, and thorough. Good stuff!
Thanks Not Null - I kind of agree with you, I wanted to fit in locality somewhere, and there is some truth to unrolling being advantageous to cache usage, for the reason you describe in fact - aside from branching having its own overhead which you want to reduce, and of course branch prediction being a factor, the branch test itself could potentially pollute the cache. SIMD stuff works best when streamed, and there are in fact cache organisation intrinsic functions to hint where the data should be moved to before the next set of instructions. Streaming of course works best with contiguous data in memory, and typically such memory is moved around "together". Once that extension pipeline is fired up, you want to cram as much data through it as possible, so I dont agree that its unrelated, but I do concede it is secondary to chaos branching can cause.
@@javidx9 Doesn't the loop condition (at least in this case) just come down to a compare instruction and a conditional jump on the relevant flag bit? I don't see how that would pollute the cache, but I might be missing something.
@@notnullnotvoid On powerful processors such as desktop ones, its not quite that simple. Yes, the condition is based off a single bit, but 2 things, firstly the pipelined nature of the processor requires branch prediction, and flushing out the pipeline is undesirable for performance, secondly the arguments for the condition itself may require memory to be read, thus potentially polluting the cache.
By far the best video about this topic on youtube overall.
I only found much less detailed videos or way too detailed only on some specific parts.
Cheers :)
Just watched Linus tech tips where Anthony mentioned "AVX 512" support on a new macbook and since I recently watched this video I could say "Oh yeah... I understand that... in depth." :)
Yeah thing is... AVX512 takes a way bigger CPU hit, not worth it.
Great video, it went from totally cryptic gibberish code to understandable logical code thanks to your elite explaining. Thank you !
XD err thanks John!
Javidx9, my hero. Why? He reads every single comment I write on this channel, and I'm sure the same applies to everyone else. If I become a successful person one day, the reason will be your videos. They are very well made, and he explains every single step he makes in his videos. I can't help with the financial part right now, but I will make sure to pay back what you did for me in the future, after I get a job. You are a very cool man (I can't even describe it with words). And thinking about what you did for me makes me so emotional.
lol, thank you Dode XD
Great video and amazing channel. I just want to point out a small note at 41:57 which is n < iterations is not the same as iterations > n because of the case where n = iterations
This is a good point Mohamed - combined with the way the loop is structured now, I think this approach always does one further iteration compared with the reference function.
(1) n < iterations
(2) iterations > n
If n = iterations , both expressions are false, since both comparators exclude equality. They are in fact the same.
Ha! And here it is: the SIMD video I asked for earlier today, along with plenty of others who asked before that, because I didn't check the post date on the brute forcing video. Great stuff!
lol thanks arcade, I was gonna say something earlier, but I figured you'd find it! XD
Cant get enough of your videos javidx9! Love your videos man
I used to heavily use SSE2. Excited to see you cover AVX256
ohhhh.... that's why its called avx2 for short...
@@truboxlnow it makes sense to me as well even tho i wouldn't shorten something already short
So this replaces the old-school approach of having an __asm {} block to hand-optimize logic the compiler could not, like we find in some older open-sourced game engines, with organized intrinsic functions that expose modern CPU instructions via modern compilers. Nice.
I'm tired and I need sleep. Oh! A new javidx video.
Probably the most valuable video on UA-cam so far!
Is there a space where we can suggest ideas for new videos (so we have a list) and then vote on which subject is most popular? ...obviously this is your channel and your vote is final, but one thing is certain: you have a gift, and your delivery and your voice are in perfect balance: a very, very good teacher. We are very lucky you have time for these videos.
Hi Marcin - kind of, but mostly no - On the discord we have a requests board, though fundamentally it requires that I feel confident enough about the subject matter to demonstrate it. I simply wont make videos about subjects I dont have a good understanding/experience of, they wouldn't help anybody! Also, I often disappoint people with timing of videos, since this is a hobby for me, it helps if the video i'm making is related to some project I'm working on at the time. In the case of intrinsics for example, I've been using them a lot in a different project which isnt a video, so its fresh in my mind. But always happy to see a comment from your good self, a long time supporter and I thank you for that!
Thanks for what you are doing for the community javid.
At 49:40, it is also possible to use _mm256_extract_epi64 to get simple types out of a register again, which would get rid of the ifdef
Having done some intrinsic programming before, I think that your video is an amazing resource on how to program with its quirks in mind
Well, all of your videos are an amazing resource - keep up the good work! :)
Cheers buddy, The problem I find with intrinsics is there are so many functions, but Ive not found a sensible "high level" list of function catagories XD so thanks!
31:15 just a detail, but I think 'p' in '_mm256_mul_pd' stands for 'packed' not 'parallel'
What does epi stand for? I assume 'something packed integer'
@@axelanderson2030 "extended packed integer", for 128+ bit registers, because _pi* was already taken for MMX.
@@orbik_fin thanks
You are very gifted at explaining things.
I believe "pd" stands for "packed double" and not "parallel double"
!!!Beware!!! Don't sail into dangerous waters.
Rule 16
Do not use identifiers which begin with one or two underscores (`_' or `__').
> The use of two underscores (`__') in identifiers is reserved for the compiler's internal use according to the ANSI-C standard.
> Underscores (`_') are often used in names of library functions (such as "_main" and "_exit"). In order to avoid collisions, do not begin an identifier with an underscore.
via www.doc.ic.ac.uk/lab/cplus/c++.rules/chap5.html
Just my two nit-picky cents:P
So that is why these AVX flags are used in GCC! Thank you for the explanation :)
Your ability to simplify complicated stuff borders on the divine - Thank You!
It would be very interesting to see this programmed for CUDA processing.
gpu optimization information is rare and valuable. i dont know if hed be willing to expose such secrets of the dark arts.
SIMD is better than CUDA in some cases. It doesn't need to transfer the data to the GPU, and the loop is faster with SIMD (it's complicated to explain why)
@@michelefaedi Oh but you do need to transfer data to the GPU anyways. GPU is the one that actually does the drawing, isn't it?
@@karma6746 only if you consider graphics calculations. CUDA can run any algorithm you want, even ones that don't involve video directly
Fractals and similar iterations sound like a close second to me.
There's very little to be transferred into the GPU, and very little back out. Moving the heavy lifting into the GPU could be very profitable, even more so since modern GPUs tend to have 100s of cores, even the better consumer-grade models.
Not exactly your everyday algorithm, but even if you want to save the data to disk, it looks very promising. If you don't, real-time animation in full HD is definitely on the horizon thanks to Cuda.
For other stuff, it can be the other way around. Instead of freeing CPU cores, it could tie cores down with management duties (or even worse: tie ONE core and block the others out), which is probably a workload for which most modern OSes are not optimized (unlike processing in the CPU or pure output generation in the GPU).
It might be instructional to add the benefit of using intrinsics by showing a side-by-side video of the fractal generations. IOW a "what is there to gain" for all your extra coding efforts. Well done, I have recommended this on IDZ
Hi Zubble and thanks - In principle this was a follow up to the previous video, which did show the difference with/without intrinsics, it's just that one did not show the intrinsic code in detail.
You are an absolute legend. I was wondering how much experience one needs to have to get that good
Spending my summer break learning more about coding, but what can I say, these videos are too good!
Thank you.
Thank you for the insights to programming with intrinsics!
Are you a wizard? I was trying to learn about this just recently, and then your video comes out. Thank You
As always, the best content one can find on UA-cam!
It is nice to see more complicated stuff like this on UA-cam!
Bow to you, Sir, for the quality of your explanation. I love how your mind works.
I think I found a really cool problem that could use intrinsics, so I'm excited. A couple of other optimizations and I'm aiming to solve out to 1 million instead of grinding it out to 50 thousand or so. Great video!
Great video as always!
For those who want a more detailed look at the difference in timings for cache vs Memory vs Hard drive I recommend the talk "Getting Nowhere Faster" By Chandler Carruth at Cppcon 2017.
Thank you so much for this video, I learned so much! You truly are a blessing for the C++ UA-cam community :-)
Thanks for your videos! I love the way you explain things!
Love your vids. Even if I don't understand them the first time I watch because I'm just a simple web developer (PHP, NodeJS) but the way you explain helps me to understand more of our computers and the way programs work (and I hope they make me a better programmer - even on simpler stuff ^^)
I will need to watch this again in the future.
You are just awesome. I have learned many things from your videos. 😍😀 thank you so much 😊.
Thanks Jayasri!
Hello javidx9, referring to 1:58, I'm a little annoyed and would like to complain about the fact that you unnecessarily free up the heap allocated arrays before the program terminates, as the memory will be freed anyway when the program exits. It will just make exiting the program slower.
While technically this is true, I would argue that the benefits of consistently freeing resources after use (so it becomes habit) far outweigh the cost of freeing some memory, even in a small demo like this. Explicitly freeing all allocated memory can also make it easier to catch subtle bugs (e.g. segfaults) and will keep your static/dynamic analysis tools happy (if you're using any, that is).
@@sbfk6799 Relax bro, he's trolling :)
@@sbfk6799 This thought process eventually leads to RAII because things become too complex. Applications generally handle data in batches, allocate everything at the start, throw it all away at the end.
If you wanted to be pedantic you could argue that he should have allocated everything in some kind of arena and then freed that at the end. But telling people to delete 3 arrays at the end of a 100 line demo program is bad dogmatic advice.
Am I missing something or why do you assume that the allocated memory is freed anyway when the program exits? Whilst that is practically certainly true because a modern OS is smarter than most programmers, it is still implementation-defined (or OS-specific) behaviour and therefore non-standard. Unfreed memory, per standard, is leaked memory which is why him explicitly freeing (or deleting) it is absolutely correct and fine. Considering there's much more the OS has to do anyway whenever a program exits, the slowdown induced by a couple of frees is entirely negligible. I argue it's not even measurable.
Also, for C++, not deleting a non-trivial (i.e. non-pod) object (or array of objects) means that the destructor(s) aren't called either. And that's undefined behaviour.
@@srccde I was mostly just trolling -_- But yes, that's a very good point you make.
Loved it! Simple, easy to understand yet complete. Thank you!
Mind blown! 🤯 Excellent video!
need more of this
22:24
Just letting you know, the compiler is actually smarter than it seems. In this example you will get the desired behavior by using a different "branching" technique:
for(...)
arr3[i] = arr1[i] + arr2[i] * (arr1[i]!=23);
This way the compiler uses ymm and moves them 8 at a time.
Can't say how wasteful it may be, as it fills the whole ymm6 with 0x17's (23's) for parallel comparisons, but still, impressive imo.
I think what there needs to be is a nice API that lets you use the SIMD extensions "agnostically", without needing to know which ones your CPU supports. The API would provide a way to manually leverage the registers in a more human-readable way without having to pay attention to choosing the right instruction set for your CPU; it would just generate the right intrinsics for your system. You wouldn't need to think about whether the registers are 128, 256, or 512 bits wide: the system would pack in the data automatically, and it would be up to you to use it to process data in bulk.
Excellent explanation
I'd love more technical videos like this in the future. It's hard to get tutorials for this type of stuff.
Finally, a video on UA-cam i can relate to 😆
Premature optimization might be the root of all evil, but watching this video is pre-premature optimization and is very enjoyable!
Jesus Christ, you've outdone yourself. But thank you, I like videos where I learn something new and this certainly exceeded that by a long shot.
_in be4_ "Jesus take the mousewheel"
An excellent video! Thank you for taking the time to respond to user feedback. Appreciate the details about masks and how to use them to perform logical operations. I've been learning x64 programming via "Beginning x64 Assembly Programming: From Novice to AVX Professional" by Jo Van Hoey, and this is an incredible supplement to the C/C++ side of things.
Loved it! Clear explanations, awesome video
i like how VS shows a small "
Thank you very much for such a detailed explanation 👍
one of the biggest issues today is that cpu % meters do not show stall time. So you can have a horrifically inefficient data layout and be running at 5% of the CPU's potential speed, but the cpu meter will show 100%. I am amazed that there is still no way in perfmon, VS, ... to see the real cpu load. I did not realize how truly huge the impact was
I would really like to know how you discovered/learned all that stuff. What/who did "bring this to you"?
for an actual answer: most likely went to college
@ Not that I know how javidx9 learned cpp, but I wanted to highlight a bit of a logical fallacy you're falling into. The original comment asks how javidx9 (he in particular) learned cpp. Bolt Strikes said that he likely went to college (which I guess is likely). You are saying that going to college is not a requirement for good programming skills, which I definitely agree with, but that's beside the point. The question here is "how did javidx9 learn cpp", not "is it possible to learn cpp without going to college".
Intrinsics look scary. They're not as scary now that I've seen this video!
One big thing I noticed was that _c was not actually set in the video, but was initialized on GitHub. This confused me greatly:
_c = _mm256_and_si256(_one, _mask2);
The function refers to the logical and operation. _mask2 contains numbers which are either all 1s or all 0s. If you and something with all 1s, it doesn't change its value, while if you and something with all 0s, it returns 0. _one contains four constant 1s. This line stores in c the result of logical anding each value in _mask2 with 1. Essentially, this line converts each value in the mask from being either 64 ones or 64 zeroes to just one or zero.
16:50 so it could have just gotten rid of the loop altogether, just making 16 something thousand instruction repetitions, but it decided not to do so because the program size would be too large, am I correct? And if it actually did this, would it increase the performance even more?
Hi Javid! Nice video, I really like this stuff! I have a question. Would these functions fail to compile on CPUs that don't support the required extensions or would they compile and cause erratic behavior at runtime?
Thanks Gianluca! They would compile just fine, but under normal circumstances you would not be able to launch the executable program, as a quick check is made at the start to see if it contains unsupported instructions.
@@javidx9 Awesome, thanks!
wonderful thank you!
An ALU "It's the thong that does the stuff" - OLC 2020 (7:20)
I love it
"thong" ? hahah
Anyway, yeah that's a good quote; should be on a t-shirt:
"It's the thing that does the stuff" - Programming Bible, Javi 7:20
19:54 "Yes it means that's the executable is bigger".
I Often see that in release mode, the executable is actually smaller, and the more language is "high-leveled" the more is the difference
Why so?
That's because not all optimizations are like loop unrolling. Many reduce the number of instructions executed by reducing the number that are in the final program. For instance, if you check for a condition that the compiler knows can't happen, it will just remove that whole branch of code. Or if you write a bunch of calculations that don't change with runtime input, the optimizer can reduce those calculations to a single number (or whatever the resulting type is). Also, debug mode tends to store all local variables on the stack, in main memory, and will insert instructions to move them to and from registers as needed, whereas release mode tries to keep everything in a register as much as possible and doesn't need those extra instructions.
Roughly speaking, an optimizing compiler tries to minimize the number of instructions that run, and that usually means removing instructions from the binary. But in some cases, like with loop unrolling, it actually helps to write more instructions.
@@jeremydavis3631 There is also conditional move that can help reduce if/else to only a few instructions in special cases imgur.com/a/DCQ3rxh
@@jeremydavis3631 actually usually it does the opposite and bloats the assembly. There is a special -Os for gcc to optimise for space. -O2 is not bloating assembly, that is why it is used for linux kernel and not -O3.
Aside embedded debug symbols, sometimes compiler finds that some code has no effect for all inputs, so it will not generate it at all
There are LOTS of improvements which help with both, size and time. And small size helps time, too, since you can load data faster if you have to load fewer instructions.
If you look at 04:31, you see LOTS of unnecessary instructions. For example, the first *mov* after "bigArray3[i] = blahblahblah" loads i into the register eax. The *fld* instruction uses that to load from address [ecx+eax*4], i.e. 4*i bytes from the origin of bigArray3. Then, the next *mov* instruction loads the same i into edx _although i hasn't changed in the meantime_ . That's an instruction which can be removed if you change the register of the next *mov* to edx (so that it doesn't overwrite eax) and the address of the *fadd* to [edx+eax*4] (since eax and edx switched rôles).
That's far less than the things the compiler actually does in Release, but if you set the compiler to "Release / Optimize size", it will optimize without unrolling aggressively, saving more bytes.
thanks! would be cool if u compared the execution time with Intrinsic and without
You’re such a smart dude.
Rather than the bitwise and of _mask and _one and adding that to the counter, couldn't you just subtract the mask from the counter?
Wouldn't it be simpler to track the iteration limit in a separate traditional integer register?
i have some questions... what if the x axis width is not divisible by 4? does the compiler know how to get the next value from the next line? also i missed a demonstration of the assembly code at the end.. i totally believe you that it is probably the way it should be, but i would like to have seen it. other than that, nice video, very advanced
Thanks Matheus! On the whole, if you are using these extensions, its better to work with data that is multiples of the register size you are using - it just makes life easier, so waste a few bytes of memory if required. It doesnt really matter in the case of the fractal example if the x-axis width is not a multiple of 4, it would just wrap around and do a few pixels on the next line. If you really cant clamp your array to a multiple, then it can get fiddly, and you may be better off recognizing that as a special case, and just hand processing those remaining elements.
So THIS is an actual practical use of bit masks. Very good to know, thank you!
This is so well explained!!
Thank you!
While we're on the optimisation train is there any chance of you doing an easy (or as easy as is reasonable) to understand video about things like Data Oriented Design, SoA vs AoS and when to choose which?
And yes: I know that DoD is not just using SoA.
Hey Gareth, it's interesting stuff, but maybe a little difficult to demonstrate easily in a video. I don't mean the concept in this instance, but its efficacy without resorting to deliberately contrived scenarios. I'll add it to me "thonk" list XD
Are there any fast c++ pow(a,b) functions where b is any real number? What about intrinsic type power functions (couldn't find any)? Substituting pow(z,r) for z*z makes the program run like molasses. I am incrementing r +=/-= 0.1 and getting nice shapes.
34:47 it is Fused multiply add, (on) packed doubles
Yes! This is great. More advanced topics please!!
That was a very interesting video - you have a rare knack for hitting a happy medium in the conflict between "informative" and "comprehensible". Your vids are usually pretty entertaining too, so double thanks.
I was wondering - having taken the fractal rendering so far with intrinsics and multithreading, Could yoU Devise An even more hardcore strategy to get even higher performance?
Thanks Tim, the next stage would realistically be GPU computation, SIMD devices can do this sort of thing orders of magnitude more performant.
Part 1 is here: ua-cam.com/video/PBvLs88hvJ8/v-deo.html
"Brute Force Processing"
javidx9, it would be nice if this link were in the video description
The way to outsmart the compiler is to become the compiler!
Hi javidx9, thanks for all your videos, it's great work and always so pleasant to watch! A little question here: how do you manage to show a preview of your current file on the right side of your screen? Is that an extension, or a feature I'm not able to find in the endless universe of visual studio's options? The split functionality only splits horizontally, and doesn't allow a different zoom value for the 2 resulting windows, so I guess it's a completely different thing...
still the best c++ videos
When you called the goto function, I was like: Wait a second, isn't this just assembly then? And indeed it was 😅
If the CPU guess wrong it has to go back, and thats why heavily branched programs should be implemented in C-squared and execute from end to start. That way each miss will actually improve performance - since going backward will be closer to the end of the program.