Thank you so much for responding to my request for a CUDA programming tutorial. I have donated 0.1 BTC to your account as a way to thank you. My professor has spent so many hours trying to explain CUDA and none of my classmates really understood it. I just cannot believe that you do all this for free, and that is why my classmates and I decided to collect some funds to donate to you.
Thanks for all that you do and please keep going.
Thank you for the donation, it really means a lot!
@@AhmadBazzi No, thank you!
Wow amazing
You just opened my eyes to parallel programming. Thanks for the quick overview.
Too hard to find high-quality content like this these days. Thank you so much
That was very well explained. I have only taken one course, and you made it clearer than my professor or fellow students ever did.
12:36 This guy is a god!
very nice
So beautiful
Thank you so much. Probably the best introduction to CUDA with Python. The example you use, while very basic, touches on the usage of blocks, which is usually omitted in other introduction-level tutorials. Great stuff! Hope you return with some more videos. I have subscribed!
Excellent
this was such an excellent video
Just did my research and this guy is at one of the most prestigious universities in the world! No wonder his lectures come out so neat!
As a data scientist with 2+ years of experience, I ALWAYS learn something new from your content! Please Nich, never stop doing these things, and also, never lose the smile on your face, even when you're hitting bugs!!
I have been looking into GPU programming using Numba and Python for a while, and this seems to be the best tutorial I have been able to find so far. Thank you.
Love the channel Nicholas! I recently graduated from an NLP Master's degree, and seeing you explain stuff in a simpler way, along with your coding challenges, is really helping me connect with the material I've learned! Keep it up and I'll keep watching!
Hey this is super useful! I elected High Performance Computing and Microprocessors and Embedded Systems modules for my degree, and this channel has become my go-to guide.
Wanted to comment that the information in this presentation is very well structured and the flow is excellent.
Too hard to find high-quality content like this these days. ⚡
Thank you so much for this series! It's so clear and easy to follow
the essence of Deep learning in a few lines of code... awesome
I feel like Cuda has been demystified. Very glad I found your series.
Ayyyy, so glad you like it @Patrick. For the last two weeks I've just been making videos on stuff I find hard or want to get my head around. I figure it's not just me staring at some of these concepts like huh?!? Thanks for checking it out!!
Interesting, but two remarks:
Example 1: on my setup (3080Ti, i7-8700K, running in WSL2 with Ubuntu 22.04) vector multiplication runs actually *faster* on CPU (if you either use the vectorized formulation in MultiplyMyVectors with target "cpu" or, simply, a*b instead of the unnecessary for loop in the CPU code). IMO that is mostly due to the overhead of copying the data to the GPU memory.
Example 2: to get a fair comparison, you should also use the JIT for FillArrayWithouGPU, decorating with @jit(target_backend="cpu"). Then, GPU array filling is still faster, but only by a factor of 2.
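For anyone who wants to try this comparison themselves, here is a minimal sketch of the fairer setup described above. It assumes the video's general structure, but the function and array names here are illustrative rather than the video's actual code; it requires numba and, for the 'cuda' target, a CUDA-capable GPU. Both paths are JIT-compiled, and the GPU timing necessarily includes host-to-device copies, which is exactly the overhead being pointed out.

import numpy as np
from timeit import default_timer as timer
from numba import vectorize, njit

N = 64_000_000

# Same elementwise multiply, compiled for the CPU and for the CUDA target.
@vectorize(['float32(float32, float32)'], target='cpu')
def multiply_cpu(a, b):
    return a * b

@vectorize(['float32(float32, float32)'], target='cuda')
def multiply_gpu(a, b):
    return a * b

# JIT-compile the plain-Python array fill as well, so the CPU side is not
# penalized by interpreter overhead in the comparison.
@njit
def fill_array_cpu(arr):
    for i in range(arr.size):
        arr[i] = i

a = np.ones(N, dtype=np.float32)
b = np.full(N, 2.0, dtype=np.float32)

# Warm up both so JIT compilation time is excluded from the measurement.
multiply_cpu(a, b); multiply_gpu(a, b)

start = timer(); multiply_cpu(a, b); print('CPU vectorize:', timer() - start)
start = timer(); multiply_gpu(a, b); print('CUDA vectorize (includes host<->device copies):', timer() - start)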
This is the best introduction to CUDA I've seen, thanks a lot!
This was by far one of the most enlightening videos you have put up on your channel. Thanks and keep up the good work!!
Ahmad, thanks for taking time to create these videos. It is unfortunate that people view your videos and then feel inspired to complain about a free gift. Folks could just keep it moving or add helpful insights.
Fantastic tutorials on CUDA. You deserve more followers.
what a passionate tutorial! I wish you were my professor for my parallel programming course. Well done!
Holy shit, I was looking into this to speed up my Mandelbrot zooms and they are what you use as an example! This is a dream come true!
You saved me, I had to read the PointNet2 implementation for my BCS thesis. This made the job much easier!
LOL. Loved the graphic at 6:23! Brought tears to my eyes.
And that's what I call a great tutorial. Thank you sir. I wish you would make more tutorials.
Woah congrats @Ally 🎊 🎉 glad you’re enjoying the challenges, plenty more to come!!
Excellent example of vector addition using a for loop versus using CUDA
I have no idea what kind of videos I am watching ... but I sure will learn
Ohh yes, thank you, and the documentation about CUDA on the NVIDIA site is very professionally written. Thank you.
Oh Ahmad, your tutorials are incredible and inspiring....
Great video, I like this kind of video where you code some AI task against the clock; you teach us the concepts and show us the reality of implementing it👏
Thank you for this great introduction to numba and more specifically numba+cuda.
I'm doing an internship in a research lab and I'll have to program some kernels to implement BLAS primitives, so this video really helps :)
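For readers in the same situation, here is a minimal sketch of one such BLAS-style primitive (SAXPY: y = a*x + y) written as a Numba CUDA kernel, in the spirit of the kernels the video builds. The kernel name, block size, and array sizes are illustrative, and it assumes numba plus a CUDA-capable GPU.

import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y):
    i = cuda.grid(1)            # global thread index
    if i < x.size:              # guard threads past the end of the array
        y[i] = a * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

d_x = cuda.to_device(x)         # copy inputs to the GPU once
d_y = cuda.to_device(y)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](np.float32(2.0), d_x, d_y)

result = d_y.copy_to_host()     # copy the result back to the host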
Wow, it is really awesome! It is much better than a tutorial from university! Thanks!
You are a lifesaver @Spencer, will do it next time I'm on the streaming rig!
This was oddly intense. Great job Nicholas! Even though you ran out of time, this video is still a win to me. 😉
Thank you so very much. This is the exact kind of material I was looking for on this very specific subject. Kudos.
Hey Ahmad, I love watching your videos because of the way you tell the story. Great graphics mate. Love the reference to Rocket Man too... lol keep up the good work.
OHHHH MANNN, I thought about doing that but I was debating whether I'd hit the 15 minute deadline already. Good suggestion @Julian!
Thank you so much for this video. It has helped me massively to prepare for my computer science exam.
This is very helpful. Most people don't realize the overhead and code refactoring necessary to take advantage of GPUs. I am going to refactor a simple MNIST training program I have which currently uses only NumPy, and see if I can get meaningful improvements in training time.
Very well explained. The best CUDA explanation I have come across up till now 😊😊. Keep up the spirits sir.👍👍
This is extremely helpful. You did an amazing job explaining the foundations.
Thanks for making all these topics very approachable!
God give you strength, brother Ahmad.
I have a simple request: could you make the same course in Arabic? I know it won't get many views, but
your brothers need you more than the foreigners do.
I understand you fine, but there are others who love this field and would love to learn it in their own language.
If you don't have the time, allow me to translate the video and explain it on my channel, with a like from you to show you agree.
That's mostly how it works. It's more like sorting the stones by their color and pattern and counting each variety. Using the CPU way, you would need to count each variety separately. If you have 100 different colors and patterns, that would take a long time to count (even if you could count extremely accurately and fast, which is similar to how the CPU makes up for its lack of parallelism). The GPU way lets many people count them: given 100 people (like the GPU), each person would count one variety, all at the same time.
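To make the analogy concrete, here is a minimal sketch of the "100 people counting stones" idea as a parallel count-by-variety (a histogram) in Numba CUDA: each GPU thread classifies one stone and atomically bumps the counter for its variety. The names and data are illustrative, and it assumes numba plus a CUDA-capable GPU.

import numpy as np
from numba import cuda

@cuda.jit
def count_varieties(stones, counts):
    i = cuda.grid(1)                          # one thread per stone
    if i < stones.size:
        cuda.atomic.add(counts, stones[i], 1) # safely bump that variety's count

n_stones, n_varieties = 1_000_000, 100
stones = np.random.randint(0, n_varieties, n_stones).astype(np.int32)

d_stones = cuda.to_device(stones)
d_counts = cuda.to_device(np.zeros(n_varieties, dtype=np.int32))

threads = 256
blocks = (n_stones + threads - 1) // threads
count_varieties[blocks, threads](d_stones, d_counts)

counts = d_counts.copy_to_host()              # per-variety totals, computed in parallel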
The video was very helpful for me. Many thanks to the author for developing his audience with interesting and useful content.
The Knowledge of Ahmad knows no bounds.
It is effectively a very easy approach to harness the power of CUDA in simple Python scripts.
It's very informative and a good intro to CUDA programming. Thanks very much!
Awesome video!! It's pretty cool to see such theoretical concepts coded and explained like this. Keep going Nich!!
Thanks for the video, I found the first half and the wrap up really excellent.
Yes, you could do this by hand, which would be a great distributed computing challenge to code yourself. Another option is to use a framework/platform like AWS SageMaker to do distributed k-means. Most organizations will do this.
Amazing! I'm learning so much watching you code. Thank you for sharing.
Well, just built a new rig with a 980 Ti and a 4790K, so I'm gonna put that to the test. Thank you for your wonderful explanation :D
It works on both AMD and NVIDIA. If you have CUDA code, you can convert it to HIP with their automated tool; there is very little CUDA-specific code that can't just be translated over.
PS. I was really moved by your stock price episode. Thank you so, so much.
opened my eyes to parallel programming
Perfect video! It was really revealing and helped me understand how it works. Thank you! I am a new subscriber to your channel. Regards from Buenos Aires, Argentina.
I like how you did the website for documenting the video notes for reference later
Once you initialized lr to 0.0, I knew you were going to forget to change it lol. Love the challenges tho, keep doing them, I think it would be cool to see how you implement a neural network from scratch
This is an academic example that shows the process of copying data to the GPU, doing a vectorized operation, then showing the results. What actually makes sense to run on the GPU vs the CPU is something I didn't cover, and I'm hoping others can figure out some cool ideas.
This reminds me a lot of the computer tutorial tapes from the 90s
An insanely underrated series!!!
What makes the CPU better than the GPU is that each core is clocked at a higher speed and has many built-in instruction extensions like SSE, allowing data to be processed faster. This provides a tremendous benefit to programs that only run on one core. In rendering, where multiple cores can be used, you would need the CPU to process pixels about 5+ times faster to match the GPU's performance.
Sir, please make more detailed sessions on CUDA; your explanation is great.
YESSSS, right?! Glad you liked it Miguel!
Also, the CT5 simulator from 1981 may not count as being from the '70s or '60s, but from what I understand, the CT5 was capable of realtime, rasterized, 3D polygonal rendering and cost $20 million at the time. It used Gouraud shading, if memory serves. There were several other CT (continuous tone) simulators developed by E&S in the '70s that did something similar, or of much lower capability than the CT5 of '81. There were also the Digistar planetariums that date back to the early '80s, and the Picture System goes back to at least the early '80s. Might be vector or raster, not entirely sure myself, though.
This is amazing! Thank you for taking effort to make it!
Technically, yes. However, CUDA isn't designed to give you an extra processor to use; it just gives you the option of using a different type of processor to do your work. GPUs have lots of processing cores (100-1000+), which helps a lot with rendering. Each core can process one pixel, allowing 100+ pixels to be processed at once. CPUs have a small number of cores (2-18 in the Xeons), so only 2-18 pixels can be processed at once. Hyper-Threading technology can double that number, but 36 is still small compared to 100.
So stoked you liked it 🙏
On the PC side, Matrox was the first company to introduce GPUs. This was followed by ATI. NVIDIA came onto the scene after the success of these two Canadian companies. Matrox's original 3D product was a three-board set with custom ASICs. I believe AMD actually acquired ATI. So yes, NVIDIA was not the first, but they are the biggest in the space now. Matrox is still around but more involved in industrial and niche markets.
This is really helpful for my computing. Thank you.
Hey, thanks for explanation! Very well done 👍 I am downloading CUDA 💪
I was needing this!!! Thanks a lot, Sir!!!!
Many thanks for the lucid explanation.
Love your videos. Please don't stop!
glad to see you take it as a feedback and not as a hate comment
Can't wait to see Juan's better tutorial that he's definitely going to release :') lmao. Great video Ahmad.
Thank you very much for this tutorial. I would love to have the code available, because typing it in myself from the video is a bit hard, especially with the autocomplete on all the time. Keep up the good work.
You are bloody watching a master at work xD
This was a great video to me, I have very limited C++ experience and was looking for an explanation of CUDA. Another video like this could easily have been 70-80% over my head. This one was only about 15% whoosh. And now I actually find C++ interesting again!
I need to say this: you are the game changer here!!
I disagree, I think he did a great job explaining everything, especially the code.
Would love to see a video on a few CUDA programming challenges
@nvidia I personally think the way you did the demonstration was perfectly sufficient. IMO, fancy graphics are unnecessary. Good job.
This guy is so underrated.
Nice demo - I am getting into CUDA GPU programming and have a workstation build with a 1950X 16-core CPU and two RTX 2080 Ti GPUs, and I would like to run this demo on that machine and observe the results without using Colab; I will definitely check this out today. By the way, with a notebook Python 3 environment, do I need to use pip to install the numba library as shown, or do I have to create a new virtual environment? I am curious about that. Thank you
CUDA also comes in the form of an API (i.e. using NVIDIA's CUDA library in C) to abstract parallel computation tasks away to the GPU - but yes, it's both: the API is the software side, but the GPU must be CUDA compatible (have CUDA cores) to take advantage of this.
It's a Mandelbrot set explorer that uses both CUDA and C extensions to calculate the iterations. The multithreaded C implementation is definitely no slouch, but when you start doing over 10,000 iterations per pixel, the CUDA implementation becomes significantly faster. In contrast, a pure Python implementation gets frustratingly slow already at around 1,000 iterations, so it wasn't even worth adding to the comparison.
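For readers curious what such a kernel can look like, here is a minimal sketch of a Mandelbrot escape-time kernel written with Numba CUDA rather than a C extension; the bounds, resolution, and iteration count are illustrative and this is not the explorer's actual code. One thread owns one pixel, which is why very high per-pixel iteration counts stay tractable on the GPU.

import numpy as np
from numba import cuda

@cuda.jit
def mandelbrot_kernel(image, x_min, x_max, y_min, y_max, max_iters):
    px, py = cuda.grid(2)                      # 2D grid: one thread per pixel
    height, width = image.shape
    if px < width and py < height:
        cr = x_min + px * (x_max - x_min) / width
        ci = y_min + py * (y_max - y_min) / height
        zr = zi = 0.0
        count = 0
        while zr * zr + zi * zi <= 4.0 and count < max_iters:
            zr_new = zr * zr - zi * zi + cr
            zi = 2.0 * zr * zi + ci
            zr = zr_new
            count += 1
        image[py, px] = count                  # escape-time value for this pixel

image = np.zeros((1024, 1536), dtype=np.int32)
d_image = cuda.to_device(image)
threads = (16, 16)
blocks = ((image.shape[1] + 15) // 16, (image.shape[0] + 15) // 16)
mandelbrot_kernel[blocks, threads](d_image, -2.0, 1.0, -1.0, 1.0, 10_000)
image = d_image.copy_to_host()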
Thanks a million @Lakshman!! I try to keep it pretty tight so it's a good challenge; otherwise I know I'll just talk for 22 minutes anyway 😅
It can be found in O(1). As far as I remember, the closed-form formula is derived using an LDU decomposition or by diagonalising the matrix used in matrix exponentiation.
Dear Ahmad, you are 30 years old and only doing a post-doc? I'm sorry, but to me this sounds very underrated. Postdocs are not always well compensated for their work but spend a lot of time working and doing research. If I were you, I'd invest more time in my YouTube channel rather than doing something that does not compensate well.
Great talk, thank you! Well structured and clear.
Great explanation! Fascinatingly clear
Thanks for the video, subscribed! A suggestion: this small change to your code would demonstrate a real-world gradient descent solution for linear regression with noisy data. E.g.:
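The commenter's actual snippet was not included, so here is a hedged sketch of the kind of change being suggested: fit y = w*x + b to noisy synthetic data with plain gradient descent. All names, data, and hyperparameters are illustrative, not the commenter's code.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.1, x.size)   # noisy linear data

w, b = 0.0, 0.0
lr = 0.1                                           # a non-zero learning rate
for _ in range(500):
    error = (w * x + b) - y
    w -= lr * 2.0 * np.mean(error * x)             # gradient of MSE w.r.t. w
    b -= lr * 2.0 * np.mean(error)                 # gradient of MSE w.r.t. b

print(f"w = {w:.3f}, b = {b:.3f}")                 # should approach 3.0 and 0.5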
This was really good. Thanks for posting this!