Keep up the good content boss !
@JM thank you very much sir! Will do! :D
Excellent tutorial. One minor thing I would have mentioned in your video is that copying between device and host (or host and device) is a relatively expensive operation, since you are moving data between the CPU and the GPU over the PCI Express bus, which, no matter how fast or modern your system is, is still a bottleneck compared to transfers between the CPU and its memory or the GPU and its DRAM. So the performance advantage is only noticeable when the time spent copying data is relatively short compared to the task execution time.
Hm.. yes, but only if your data is of significant size as well. Also, the bus speed is fixed by the platform; it's only a concern if your GPU is significantly faster than it can fetch new data. Otherwise yes, agreed. You always have to test everything. The best example is Unreal Engine 5, where after testing it turned out software rasterizing is faster than doing it on the GPU for some reason 😂 Always test whether what you do would actually benefit from switching the compute device and dealing with copying data, etc.
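To make that copy-versus-compute tradeoff measurable, here is a minimal hedged sketch (the sizes, names and vectorAdd kernel are my own, not the video's code) that times the PCIe copies and the kernel separately with CUDA events:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];                          // one element per thread
}

int main(void) {
    const int n = 1 << 24;                                  // assumed size: ~16M ints
    const size_t bytes = n * sizeof(int);
    int *h = (int *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = i;

    int *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d_a, h, bytes, cudaMemcpyHostToDevice);      // host -> device over PCIe
    cudaMemcpy(d_b, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // the actual work
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copyMs, kernelMs;
    cudaEventElapsedTime(&copyMs, t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    printf("copy: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); free(h);
    return 0;
}

For a trivial kernel like vector add, the copies usually dominate; the GPU only pays off once the kernel does enough work per byte transferred.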
Yeah that part hit me in the face when I was writing a 3D engine.
Starting at a few hundred to a thousand objects, it is not so much the complexity of shading each object, but the number of separate draw calls to the GPU that slows things down to a crawl. In that case it is the latency of communication between the CPU and GPU, rather than the bandwidth, that causes problems, but the fundamental issue is the same: sending data between the two is slow.
I had found this cool technique that would speed up deferred shading a lot more by doing additional checks for what area would actually be hit by light sources. The problem was that it meant 2 draw calls per light source instead of 1. Even though this saved the GPU itself a lot of work, it ended up dramatically decreasing performance, since it was the draw calls that bottlenecked me.
For the mentioned scenario, the proper solution is batched draw calls, where a single call to the GPU can render many objects at once (particularly identical ones that use the same shader and ideally the same base mesh).
The more VRAM you have, the larger the training datasets you can use. For certain tasks cards with low VRAM are perfectly usable, for others not.
@@gonda8365 Sometimes low VRAM also just doesn't work at all. Like Blender CUDA rendering: if the scene doesn't fit in VRAM, it won't render, not even in a million hours.
I totally agree. I wanted to write a very similar remark, just noticed yours.
This was super insightful, never would have thought it'd be that easy... I need to look more into cuda programming now
It’s definitely not but by now you have realized that 😮😅
This fight at @7:30 with "*" placement was hilarious. I laughed so hard when you gave up :)
You could make a series out of this - basics of CUDA are trivial, but there are many, many performance traps in gpgpu
Especially once you get into cuBLAS and Thrust territory, things get complicated really quickly.
@@VulpeculaJoy You're not joking!
Me: "Throws hands in the air in frustration."
Back when I tried GPGPU, the most astonishing performance trap was just memory handling. Selecting what data to put into what kind of memory and utilizing them was very hard, but when you did it right the thing performed 10x better.
@@andrebrait Please, I need some help with that; if you can help a bit and guide me it would be much appreciated 👍🙏
@@andrebrait seconding the comment below yours. It'd be so nice of you to share any sort of source.
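One concrete piece of what "what kind of memory" means above is staging data from slow global memory into fast on-chip __shared__ memory before working on it. A minimal hedged sketch (my own, with the block size assumed to be 256):

__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int tile[256];                          // on-chip shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0;           // stage global memory into shared memory
    __syncthreads();                                   // wait until the whole block has loaded its tile
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // tree reduction entirely in shared memory
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];   // one partial sum per block
}
// launch: blockSum<<<(n + 255) / 256, 256>>>(d_in, d_partials, n);
// where d_partials has one int per block; the partials are then summed on the host or by a second pass.

The same loop written against global memory alone is typically several times slower, which is the kind of 10x gap being described.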
Amazing video! I love the way you explain things thoroughly enough that a beginner can easily understand it without explaining *too* much and droning on. Thorough yet concise, great job :)
You're very welcome!
Finally I can use my RTX 3060 Ti to do something useful...
Nice, now you can bubble sort an array
Bogo sort
Bogo sort looks like a gamble with fate.
I use my gpu to play league
I do love grid computing and parallelism. I really want to learn how to program my new eGPU (RTX 3080).
Amazing intro to CUDA man! For those interested in GPU programming, I'd also recommend learning OpenACC. Not as powerful as CUDA, but it gives you a nice "first working" GPU program to get an idea before suffering with low-level optimization hehe. Would be nice to see a follow up to this using both MPI and CUDA to work with multiple GPUs :D
I discovered your channel recently and so far I am loving it.
Glad you enjoy it!
No dislikes, no wonder why :)
I finally found a comprehensive tutorial, because most of them fail to explain the basic mindset behind CUDA programming.
There are 5 now, probably people who didn't like him personally, or trolls...
@@widrolo or people who don't like multithreading for some weird reason
or people maybe who know some different framework for this and get annoyed he showed this one
idk it can be anything
You can see dislikes?
@@widrolo Its AMD engineers
@@balern4 There's an extension on the chrome web store that adds them back
As someone who doesn't have Nvidia, you should do an OpenCL or OpenGL series, which everyone can use! Unless there's something special about CUDA, I never see the cross-platform ones on YouTube…
Look at Intel's oneAPI Base Toolkit, which includes a DPC++ SYCL compiler. It may hide all this low-level stuff, which is too hard to do efficiently. By default it works best, Intel being Intel, with OpenCL (3.0 for sure, not so much 2.2; doesn't with 2.1), but there is already experimental support for CUDA out of the box. SYCL is an open, GPU-agnostic (ahem, supposed to be) standard. CUDA code looks like C++, but in fact you think about hardware all the time; it's harder than assembly, in fact. OpenCL is no simpler. Looks are deceptive. This is why I believe a good compiler will eventually beat low-level CUDA/OpenCL coding. Who would hand-optimize Intel CPU code these days and beat the optimizer? High-level distributed/parallel C++ (DPC++) is da way to look into the future.
BTW, OpenCL is for compute, OpenGL is for 3D drawing/rendering, it's not "or." Entirely different APIs. OpenCL takes the same task as CUDA. OpenGL is xplat and oldsy (ah, that 80's feel!); for Windows-only, DirectX is preferable.
If you take the oneAPI route, one piece of advice is to choose which components to install. The full thing takes 40GB installed or so, and takes an awfully long time to install and upgrade, even on a near-top-end Gen12 CPU and a very fast PCIe SSD. And you hardly need the data analytics or video compression libraries.
Compute shaders! They run on all gpus.
Is vulkan worth learning?
@@ben_jammin242 I was just about to mention that vulkan is the future while openGL lags far behind in terms of being adopted by the masses
@@PutsOnSneakers cadum, cadum
This is a lot more straightforward than I thought it would be. Basically, replace all allocation operations and pointer operations with CUDA framework types and functions. 😅
Hey this is super useful! I elected High Performance Computing and Microprocessors and Embedded Systems modules for my degree, and this channel has become my go-to guide.
That's probably the degree I'm gonna go for as well.
This channel is amazing xP
Your channel is amazing! Just found it and I must tell you, you have a great way of teaching. Kudos for that and congrats on the amazing content
Thanks so much!
Good video. It would be interesting to make the vectors huge and run some benchmarks comparing the cuda function to the cpu version.
I think, armed with this video, it's something you could do yourself :) The best students are the ones that use what was taught.
I would imagine the CPU implementation would win performance-wise when it comes to simple addition, since copying memory to and from the GPU is a pretty expensive operation. Especially if we make the benchmarking fair and utilize threads in the CPU implementation.
Thanks
Useful, but the discussion about the block size and grid size was avoided. I think there should be a video focused only on this topic as it's not easy to digest, especially for new CUDA programmers. A comparison with OpenCL would be even better :)
Easier than I thought! Would love to see you do this in OpenCL!
Great suggestion!
@@LowLevelTV Yes, definitely give OpenCL content, there's not enough of it
Thank you for the video, it was good to see an easy example of how it works.
I was recently watching a video about the MMX instruction set of the first Pentium CPUs (around 1997), and it was mentioned that the main usage of that new feature was, for example, changing the brightness of a photo (probably a bitmap file), where a lot of mathematical manipulation is needed on a huge file and the same mathematical function repeats for every pixel. The idea behind MMX was that multiple registers were loaded with values, then the CPU executed one instruction, and some clock cycles later all the output registers were filled. I think it was called "single instruction, multiple data".
I have this feeling now that the GPU's CUDA cores could do all the mathematical manipulation of a bitmap picture; we only have to load the picture into GPU memory along with the mathematical manipulation pattern(s), and execute the transformation. It's probably not worth it to transform only one picture, since we lose time on all the preparation, but if we have many different pictures (for example a video), maybe it makes sense to use the power of the GPU.
I don't know if I'm correct, but when you render video in Blender for example it can use the GPU, and you can do things like manipulating color. Don't know if it has any relevance, just my thoughts.
Yeah, Photoshop does a few things on the GPU, and good video editing algorithms run on the GPU as well. It's exactly like you said. And SIMD instructions are also used quite a lot, but from what I've seen, they seem more of a middle ground: for when the CPU is too slow, but the GPU is not really worth it due to latency or complexity.
I guess it might be similar to the SIMD instructions on Arm Cortex. Basically there's a coprocessor dedicated to executing instructions that operate on multiple registers at the same time.
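To make the brightness example from a few comments up concrete, here is a hedged CUDA sketch (one thread per byte of the image; the layout, names and numbers are just assumptions, not anything from the video):

__global__ void brighten(unsigned char *pixels, int n, int delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = pixels[i] + delta;                                      // same math for every pixel, SIMD-style
        pixels[i] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));   // clamp to 0..255
    }
}
// copy the image to the GPU once, then:
//   brighten<<<(numBytes + 255) / 256, 256>>>(d_pixels, numBytes, 40);
// ...and copy it back. As noted above, it's only worth it if the image (or a whole batch of
// frames) is large enough that the kernel time outweighs the two PCIe copies.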
This was a super cool video. I'm currently learning assembly so seeing how to operate at a pretty low level was very interesting to me.
Pretty straightforward tutorial. What do you think would be the next step? Vector multiplication?
Thanks
Hi, excellent tutorial!
Can you do the same tutorial, but whereby one of the nodes, say node A, has a strength value in it?
I like the fact that you write in C, which I do at school, so I understand what you are coding
Very good video (pictures, code and explanation) on this subject.
GPUs are for complex and fast graphics calculations. Modern video games need very fast graphics rendering, therefore advanced GPUs are required. NVIDIA (CUDA programming) lets us utilize the power of parallelism of the GPU. The NVIDIA APIs (CUDA APIs) provide the magic of CPU + GPU to accomplish the fast processing and complex calculations required in ML, AI, generative AI, DL, streaming, etc. CPU + GPU will give the best performance when the data copying between them is optimized.
I was amazed seeing the power of the GPU vs the CPU. For one lengthy task the CPU took about 5 seconds but the GPU took only 2-3 milliseconds. All the magic of true parallelism from thousands of cores in the GPU.
You explained it so well, thanks a lot
That was an excellent beginner friendly overview. Almost a hello world type of intro to get your feet wet. Definitely looking forward to more videos from you.
Thanks for the video!
CC: When the narrator follows information that is new, or not immediately obvious to a newcomer, with "right?", I feel really lost and a little stressed, thinking I can't even understand this basic information!!
Are there any guides explaining how the code segments are actually sent to the GPU and how the API and firmware handle operations?
Just understanding the coding portion isn't enough until you understand the hardware architecture and low-level ops.
Have you ever looked at GPU assembly and decompiling shader code?
Not sure if you can export from RenderDoc to Ghidra, but would be fun to look at that
Cool intro, thanks! In the year 2021, though, I'd rather use the even simpler, modern cudaMallocManaged() UVM call. One may get faster code by manually controlling memory transfers in multiple streams and synchronization; this is what I have seen in code written by an NVIDIA Sr. SWE, but I could never really fully grok it. For the rest of us, there's UVM: you just allocate memory accessible to both the CPU and the device, and it's synchronized and moved in the right direction at the driver level.
It does allow writing stupidly inefficient code, but this is not too easy, really :) For a GPU starter, it simplifies memory tracking a lot.
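For anyone curious, a minimal hedged sketch of what the managed-memory version looks like (my own rewrite of the vector-add idea, not code from the video):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    int *a, *b, *c;
    cudaMallocManaged((void **)&a, n * sizeof(int));   // one allocation, visible to both CPU and GPU
    cudaMallocManaged((void **)&b, n * sizeof(int));
    cudaMallocManaged((void **)&c, n * sizeof(int));
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }   // plain CPU writes, no cudaMemcpy
    vectorAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();                           // wait for the GPU before the CPU reads c
    printf("c[123] = %d\n", c[123]);                   // the driver migrates pages back as needed
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Same kernel, but all the explicit cudaMemcpy calls disappear; the driver moves pages on demand.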
Super nice starting video for someone like me who was too afraid to try it blind :D
Thank you very much! I couldn't get a simple piece of code to run in VS Code, and after watching this video it worked on the first try. It would be great if you made a video on how to compile with VS Code. Cheers
Never expected to hear that ending song. It's a really good song. It's Run by Hectorino Martinez.
Great video! Short and to the point, just enough to get me started!
Bro actually showed both results, which came in at ~1 nanosecond and 0.3 nanoseconds, and thought we would notice.
jk, your explanation is amazing
I'd really love to see more videos like these
Thanks, that was a super clear example. Amused that you called it a register, guess you can't turn off thinking in assembly code :D
The vector thing in the beginning could be done multicore too, I think; with 3 vectors you can just do each one on a different core at the same time.
I was busy trying to build a GPU on a breadboard like a weirdo when I found this. Much better.
Building a GPU on a breadboard is really cool, why should this video be better? It's just a different topic. Or were you trying to build an ASIC on a breadboard, and realized now that you can just use CUDA? ;D
Damn, was this interesting. So basically every time I have big for loops, or even nested for loops, my graphics card could calculate it way faster.
Thanks man this was interesting
Nice presentation, but you should speak about OpenCL; even if it's not well supported on Nvidia cards, at least you can target multiple parallel devices (at the same time). And the core concepts of grids, blocks and threads are quite the same (with different names, but the same cache segregation principle).
Yeah, OpenCL is the way to go for using several GPUs or different types of GPUs (like Nvidia and AMD)
Channel is just the sickest ty ty
How can you dynamically manage and display your available GPU memory based on load and display it as a bar graph? Such as when you're choosing LoD or texture and geometry complexity and want to estimate if it's going to throttle the gpu. Many thanks! Happy to be pointed to a resource if it's not something you've covered as yet :-)
I believe the Nsight debugging tools should give you everything you need for this
Maybe you don't read this because the video is 2 years old now, but could you make a video about how graphics programming works on a computer? 2d and or 3d. You are so good at explaining stuff, it would be really amazing imo
I have some knowledge of 2D and 3D graphics when it comes to the CPU, however I do not know the GPU side so much. If you want to render on the CPU, assuming you are only working at a per-pixel level, this is how you'll want to do it. For a simple line you'll just use Bresenham's algorithm. For a triangle, on the other hand, you'll want to convert the triangle into a bunch of horizontal lines. You can do this by splitting the triangle into two triangles that are easier to render, because each of them then has one horizontal side. To do the split, find the point of the triangle that is in between the other two points on the Y axis, then use linear algebra to create a corresponding point on the opposite side at the same Y coordinate. This gives you 4 points instead of three, from which you create two new triangles that share two points. Finally, iterate along the two non-horizontal edges of each triangle, use linear algebra to find the points corresponding to each Y position, and draw horizontal lines between those points; as a result you will have drawn a triangle.
Interestingly, it is actually easier to draw a circle than a triangle. Starting from the top of the circle, iterate down to the bottom, drawing horizontal lines all the way, each centered at the circle's X position. The width of each line comes from some simple trigonometry: first find the difference between the Y coordinate of the line and the Y coordinate of the circle's center, then rearrange Pythagoras' theorem. Instead of computing a distance via D = sqrt(X*X + Y*Y), solve for X: D*D = X*X + Y*Y, so D*D - Y*Y = X*X, so X = sqrt(D*D - Y*Y). Plug the previously calculated Y offset into this equation to find X, which is the half-width of that horizontal line; draw the lines and you're done with your circle.
This has all been 2D so far, but with those building blocks it is actually pretty easy to move on to 3D graphics; a sphere in 3D can be projected into a circle. For 3D graphics you first need to understand the basics of perspective. You can assume that most players will be viewing the game pretty centered on their computer screen; in other words, the eyes of the player are at X and Y of 0.
However, the distance of the player to the computer screen is not always the same; some will be viewing it from much further away than others (for a TV screen you can expect them to be farther away). This kind of distance is represented as a Z position, like the X and Y positions but for the third axis; in the particular case of the distance from the user to the screen it is usually called FOV and is often adjustable. With this in mind, you need to know how to convert a 3D point in your game to a 2D point on the computer screen, which takes some more linear algebra. Make a line from the player, who is located at (0, 0, -FOV), to the 3D point in your game (X, Y, Z), then find the intersection of this line with the computer screen. Luckily we know the intersection will always be of the form (X, Y, 0), which is good because it removes the Z component, the 3D element. The resulting math for a 3D point (X, Y, Z) is simply (FOV*X/Z, FOV*Y/Z) as the 2D point.
With that we are almost done with the basics of 3D graphics. A simple rendering method called billboarding can now be used to render spheres. If a sphere should be centered at (170, 290, 400), first find the corresponding 2D point, (FOV*170/400, FOV*290/400); that is where you will render the sphere. Second, you need the size of the sphere, since we are billboarding. With D as the diameter of the sphere, the 2D representation's width is FOV*D/Z, meaning that if the sphere is 100 wide it will be drawn with a width of FOV*100/400. Finally, plug these numbers into the circle drawing routine from before to draw the 3D sphere.
Next up is rendering 3D triangles, which is more powerful than sphere drawing. To render a 3D triangle with the points (X1, Y1, Z1), (X2, Y2, Z2), and (X3, Y3, Z3), first convert each 3D point into its 2D version, e.g. (FOV*X1/Z1, FOV*Y1/Z1), then plug these points into the 2D triangle drawing function and you are already done with the basics of 3D triangle rendering. There is an interesting problem, though: while with the sphere we can simply not draw it if it is off screen, it is a little trickier with triangles, because one point of the triangle could be at a negative Z position, which would break the projection formulas above. To solve this we need to clip the 3D triangles at a Z position slightly in front of the screen, for instance Z = 0.001. If two of the points of the triangle are behind that plane, we can simply find their intersections with the plane Z = 0.001, which clips them quite quickly with little issue. However, if only one of the points is behind the plane, we have a problem where what was once one triangle must be split into two, because the clipped shape is actually a quadrilateral.
Additionally, layering these 3D triangles is quite tricky to do effectively on the CPU, whereas in the GPU approach you can just use a Z-buffer. And later on you will probably want better looking triangles than single-color fills; that's a whole different can of worms. Good luck!
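To put the projection part of that into code, here is a small sketch of the formulas described above (the struct and function names are mine, not from any particular engine):

struct Vec3 { float x, y, z; };
struct Vec2 { float x, y; };

// The mapping described above: a 3D point (X, Y, Z) lands on screen at (FOV*X/Z, FOV*Y/Z).
Vec2 project(Vec3 p, float fov) {
    Vec2 s;
    s.x = fov * p.x / p.z;
    s.y = fov * p.y / p.z;
    return s;
}

// Billboarded sphere from the example: center (170, 290, 400) and diameter 100 draw as a
// circle centered at project({170, 290, 400}, fov) with an on-screen width of fov * 100 / 400.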
@LowLevelLearning Could be very cool to see a bit more complex & lengthy setup to show difference in time on GPU vs CPU for different use cases.
Very nice tutorial. I really liked it. It's brief, to the point and very clear. Thanks. Could you please make a video for the same example but in Linux?
New to your channel. Liked and subbed!
Edit: what is "sizeof(a)/sizeof(int)" computing? I thought size of c would be N if a and b are both N
sizeof(a) will return the size of the array in bytes. sizeof(int) will return the size of an int in bytes (typically 4, but it can be 2 or 8 depending on the compiler and platform). The division then gives you the number of elements in the array. A useful helper #define is:
#define ArrayLength(a) (sizeof(a)/sizeof((a)[0]))
This will also work if you have an array of structures.
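For example, using the ArrayLength macro above (assuming a 4-byte int; the names here are just an illustration):

#include <stdio.h>
struct Point { int x, y; };

static void demo(void) {
    int a[10];
    struct Point pts[25];
    printf("%zu %zu\n", ArrayLength(a), ArrayLength(pts));  // prints "10 25"
    // Caveat: this only works on a real array. On a pointer, sizeof gives the size of the
    // pointer itself, not the size of the buffer it points to.
}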
Thank you for your crystal clear explanation
You are welcome!
Man, the Nvidia docs are OK, but this is so well made, very nice :D
That is VERY impressive how relatively SIMPLE and CLEAR you showed that! Wow, thank you! Question: There is SOME sort of parallel or vector operation also possible on the modern CPUs, right? Could you show how THAT would be done in this example?
Hi ! A classical C++ question, zero knowledge of GPU programming ! After cudaMalloc, shouldn't there be a sort of cudaFree ? What happens with the GPU memory ? Thank you for the comments and for the video !
I wonder if the CUDA framework is available for use in C#? I don't know C++ and I really don't want to spend years learning how to "properly" create C++ apps.
After writing 400 LOC for initializing OpenCL and finally giving up, this seems so easy!
Great explanation!
Absolutely awesome
The clicky keyboard sounds was oddly satisfying to me. It's like a little white noise to me. It's so peaceful
Thanks. Nicely done.
This was insightful
12:02 So you use GRID_SIZE to define blocks and BLOCK_SIZE to define threads? I don't get it. Shouldn't there be 3 parameters? What does it mean when there are 2?
He didn't explain what a kernel is. A "kernel" is a function that runs on the GPU; that's what CUDA calls them. In this example, "vectorAdd" is a kernel, and you run them with the <<<GRID_SIZE, BLOCK_SIZE>>> launch syntax.
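And those two numbers are exactly blocks and threads-per-block. A hedged sketch using the video's constant names (the values and the kernel body here are my own assumptions):

#define N          1024
#define BLOCK_SIZE 256                                    // threads per block
#define GRID_SIZE  ((N + BLOCK_SIZE - 1) / BLOCK_SIZE)    // blocks per grid, rounded up

__global__ void vectorAdd(const int *a, const int *b, int *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;        // which element this thread owns
    if (i < N) c[i] = a[i] + b[i];
}

// Launch: the first number is blocks per grid, the second is threads per block, so the total
// thread count is GRID_SIZE * BLOCK_SIZE. There are optional 3rd and 4th launch arguments
// (dynamic shared-memory bytes and a stream), which is why you sometimes see more than two.
//   vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(d_a, d_b, d_c);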
Super interesting, thanks a lot!
Super! Mark Duper!
Can someone explain how threadIdx.x works? Like, how does it know where the vectors are and how it indexes into them? The relevant time stamp is 11:01
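The kernel doesn't "know" where the vectors are beyond the device pointers you pass in as arguments; each thread only computes which element is its own from its built-in coordinates. A hedged little illustration (the launch numbers are just an example, not the video's):

#include <cstdio>

__global__ void whoAmI(void) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;        // global index of this thread
    printf("block %d, thread %d -> element %d\n", blockIdx.x, threadIdx.x, i);
}

// whoAmI<<<2, 4>>>(); followed by cudaDeviceSynchronize() prints 8 lines, e.g.
// "block 1, thread 2 -> element 6" (1*4 + 2). In a vector add, that same i is then used as
// c[i] = a[i] + b[i], so each thread adds exactly one element of the arrays whose device
// addresses were passed in after the cudaMemcpy.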
could you present some gpu card?
This is absolutely mental 😎
Great video, really interesting stuff. Looks like I need an Nvidia gpu now
When you write "sizeof(int)", what is this referencing? I'm not at all familiar with this language, but I am assuming it references something that's been determined previously?
The compiler changes it to the actual value when it compiles. The value in this case is the size in bytes of an int, which is one of the basic built-in types in C.
Is there any difference between compute shader and this cuda programming?
I really would like to learn how to program my GPU. Off-the-shelf software is so boring, and I am not a gamer. I just bought a used, but nearly new, RTX 3080 eGPU with circa 8,000 cores for CUDA and OpenCL learning. Unfortunately I can't connect it to my Mac Studio…😢
Super interesting! Thanks
Me: "but how do I stop doing all this low level memory management like it's 1967?"
GPU: "That's the neat part. You don't"
Very helpful. Thank you for sharing!
Loved the video! Had to like and subscribe! Can't wait to see the rest of the project as well as what other projects you work on!
How do I see the view that he sees when checking the results of the first lines of code?
thanks a lot, great tutorial
Thanks, can you recommend resources for learning this specific type of programming? or from where to get this kind of knowledge?
I like the intro: "we are mining bitcoin"
Thanks, you are a legend, brother
What, no scaled up run with speed test?
What does the size of array / size of int do? I've seen it in C++ demonstrations where an array is referenced as a whole pointer instead of slot by slot.
It's basically a compile-time operator that tells the compiler to calculate the size of some variable or data type in bytes, then replace the expression with that size. `sizeof(int) * 10` will have the compiler figure out the size of the `int` data type in bytes, which we'll say is 4 bytes (C ints can have different sizes between compilers, with some using 2 bytes, others 4 and some even 8), then replace the `sizeof(int)` with `4`, leaving you with `4 * 10`. That is further evaluated to `40`, which is the size of an array of 10 ints in bytes, i.e. 40 bytes.
I like CUDA, but considering how many CUDA tutorials there are, I would like an OpenCL tutorial, because there are only really advanced examples out there and you have to start with the basics - which I couldn't find on YouTube.
I've been trying to use unified memory and it's simply not working properly 💀
I'm doing something wrong, but I have all the steps required and I tested them in a separate program that works just fine. The only reason I wanted to use unified memory is that you don't have to have separate pointers for the CPU and CUDA, which makes it much easier to read.
Would have been cool to compare the timing on some larger data sets, like 1 million elements or so.
Very good thanks.
The documentation says a call to a __global__ function is asynchronous. What happens if the vectorAdd function already returned and we are trying to cudaMemcpy before the GPU has finished the operation?
IIRC, it implicitly synchronizes at the cudaMemcpy. I believe you have to use the Cuda stream API or the task graph API if you want asynchronous memory and compute.
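Roughly like this, as a hedged sketch (not the video's exact code; the variable names are assumed to be set up the same way):

// assuming d_a, d_b, d_c, h_c, bytes, n, GRID_SIZE and BLOCK_SIZE are set up as in the video
vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(d_a, d_b, d_c, n);   // returns to the CPU immediately
cudaError_t err = cudaGetLastError();                     // catches bad launch configurations

// Option 1: a plain cudaMemcpy on the default stream is ordered after the kernel, so it does
// not start reading d_c until the kernel has finished:
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

// Option 2: wait explicitly (this also surfaces errors raised while the kernel was running):
cudaDeviceSynchronize();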
Under what conditions would you use more than one grid?
I was messing with this, and I changed the size of the arrays (a, b, c). If I used arrays of size 1024 (2^10) elements, it works fine (c outputs the correct result), but if I increase it further, to say 1025, c becomes a 0 array. does anyone know why 2^10 is the limit?
If it's a hardware problem, I use an RTX 3060 (laptop).
Try increasing the number of blocks (the grid size). A single block is limited to 1024 threads, so if you launch all 1025 elements as threads of one block, the kernel launch fails with an invalid-configuration error and c never gets written.
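Something like this, as a hedged sketch (assuming the buffers and kernel from the video):

// assuming d_a, d_b, d_c were allocated with cudaMalloc as in the video
#define BLOCK_SIZE 256
int n = 1025;
int gridSize = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;          // 5 blocks, so every element is covered
vectorAdd<<<gridSize, BLOCK_SIZE>>>(d_a, d_b, d_c, n);
printf("%s\n", cudaGetErrorString(cudaGetLastError()));    // <<<1, 1025>>> would report an invalid configuration here
// and the kernel needs a bounds check, since 5 * 256 = 1280 threads > 1025 elements:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < n) c[i] = a[i] + b[i];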
An amazing tutorial, but one question: after running, does it free the allocated memory itself, or do we free it afterwards?
Thanks a lot for the video
This video was great, never thought it would be so simple. Do you mind digging deeper into this? Maybe some filters, coordinate transformations or other basic math stuff?
love this
Could you also do the same video for ROCM?
wow I really hate this, do we really have to allocate memory in the gpu for each individual variable? is there any higher level syntax than this?
Do you think you could do a video of using GPU to solve for subset sum?
__global__ is not a keyword, but some CUDA library definition, right? Why does it look like a keyword (the same colour as "return")? Why isn't it in a form like CUDA_RUN_GLOBAL?
Would have loved to see the difference in time by running it on GPU vs CPU
Not familiar with C. Could someone explain the sizeof(a) / sizeof(int) thing for me please?
Dividing the size of an array by the size of the elements in that array gets you the number of elements in that array. Similar to “if I line up 100ft of cars, and each car is 5ft long, then I have 20 cars”.
@@LowLevelTV thanks! That's interesting. I guess this is necessary because there isn't a .count method like in C#?
Can I use this on a PC with an AMD graphics card???
Very helpful tutorial!
I prefer OpenCL though :)
A Khronos Group junkie, huh? Well, at least OpenCL has an SDK and not just an API. Most people don't program bare metal; scripters, if they get near this, usually use Python.
What is the variance between the two in terms of accuracy, without stressing the system, and time not being a factor? Are results different?
How about writing code that uses the GPU for a dictionary attack to decrypt a file, is that possible?