0Mean1Sigma
Must Know Technique in GPU Computing | Episode 4: Tiled Matrix Multiplication in CUDA C
Tiled (general) Matrix Multiplication from scratch in CUDA C.
Code Repo: github.com/tgautam03/CUDA-C/tree/master/05_tiled_mat_mul
Notes: 0mean1sigma.com/chapter-4-memory-coalescing-and-tiled-matrix-multiplication/
Animations: github.com/tgautam03/0Mean1Sigma/tree/master/CUDA_04
00:00 Introduction
00:41 Standard Matrix Multiplication
01:41 Tiled Matrix Multiplication Algorithm
03:24 Tiled Matrix Multiplication Code
05:53 General (Tiled) Matrix Multiplication
08:11 Demo
08:26 Next Video: Tensor Cores!
Views: 924
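
For reference, a minimal sketch of the kernel the episode builds up, assuming square N x N matrices with N a multiple of the tile width (names are illustrative; the general, boundary-checked version is in the repo above):

#define TILE_WIDTH 16

// C = A * B, all N x N, N a multiple of TILE_WIDTH.
// Launch: dim3 block(TILE_WIDTH, TILE_WIDTH), grid(N / TILE_WIDTH, N / TILE_WIDTH).
__global__ void tiled_mat_mul(const float *A, const float *B, float *C, int N)
{
    __shared__ float sh_A[TILE_WIDTH][TILE_WIDTH]; // tile of A staged in shared memory
    __shared__ float sh_B[TILE_WIDTH][TILE_WIDTH]; // tile of B staged in shared memory

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float acc = 0.0f;
    for (int phase = 0; phase < N / TILE_WIDTH; phase++) // one "phase" per tile pair
    {
        // Each thread stages one element of A's tile and one of B's tile.
        sh_A[threadIdx.y][threadIdx.x] = A[row * N + phase * TILE_WIDTH + threadIdx.x];
        sh_B[threadIdx.y][threadIdx.x] = B[(phase * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads(); // both tiles must be fully staged before use

        for (int k = 0; k < TILE_WIDTH; k++) // partial dot product out of shared memory
            acc += sh_A[threadIdx.y][k] * sh_B[k][threadIdx.x];
        __syncthreads(); // don't overwrite tiles while neighbours still read them
    }
    C[row * N + col] = acc;
}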

Videos

4.5x Faster CUDA C with just Two Variable Changes || Episode 3: Memory Coalescing
504 views · 1 month ago
Memory Coalescing for efficient global memory transfers in CUDA C. Video Notes: 0mean1sigma.com/chapter-4-memory-coalescing-and-tiled-matrix-multiplication/ Code Repository: github.com/tgautam03/CUDA-C/tree/master/04_sq_mat_mul Animations: github.com/tgautam03/0Mean1Sigma/tree/master/CUDA_03 00:00 - Introduction 00:52 - Global Memory in GPUs 02:00 - Coalesced Memory Access 03:07 - Uncoalesced M...
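
The gist of the fix, sketched on a plain copy kernel (illustrative, not necessarily the repo's exact two variables): the 32 threads of a warp have consecutive threadIdx.x, so threadIdx.x should drive the fastest-varying array index.

// Coalesced mapping: neighbouring threads touch neighbouring addresses, so a
// warp's 32 loads collapse into a few wide memory transactions.
__global__ void copy_matrix(const float *in, float *out, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x; // threadIdx.x -> fastest index
    if (row < N && col < N)
        out[row * N + col] = in[row * N + col];
    // The uncoalesced "before": swap the mappings (row from threadIdx.x, col from
    // threadIdx.y); each warp then touches addresses N floats apart.
}
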
Understanding NVIDIA GPU Hardware as a CUDA C Programmer | Episode 2: GPU Compute Architecture
1.3K views · 2 months ago
NVIDIA GPU hardware from the CUDA C programmer's point of view. Video Notes: 0mean1sigma.com/chapter-3-gpu-compute-and-memory-architecture/ Code Repository: github.com/tgautam03/CUDA-C Animations: github.com/tgautam03/0Mean1Sigma/tree/master/CUDA_02 00:00 - Introduction 00:50 - GPU Hardware 02:50 - Warps 04:55 - Latency Tolerance 07:01 - Conclusion
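
To tie the episode's vocabulary to code, a tiny hypothetical kernel (not from the video) that reports how a block splits into warps:

#include <stdio.h>

// A block's threads execute in groups of 32 ("warps"); warp and lane indices
// follow directly from threadIdx.
__global__ void whoami()
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
    int warp = threadIdx.x / warpSize;                // warp index within this block
    int lane = threadIdx.x % warpSize;                // position within the warp
    if (lane == 0)                                    // one line per warp
        printf("block %d, warp %d starts at global thread %d\n", blockIdx.x, warp, gid);
}
// Launch: whoami<<<2, 64>>>(); -> 2 blocks x 2 warps each.
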
2678x Faster with CUDA C: Simple Matrix Multiplication on a GPU | Episode 1: Introduction to GPGPU
15K views · 2 months ago
Parallel Matrix Multiplication on a GPU using CUDA C. Video Notes: 0mean1sigma.com/2678x-faster-how-gpus-supercharge-matrix-multiplication/ Code Repository: github.com/tgautam03/CUDA-C Animations: github.com/tgautam03/0Mean1Sigma/tree/master/CUDA_01 00:00 - Introduction 01:00 - Matrix Multiplication 01:52 - Sequential Matrix Multiplication in C 03:23 - Why use a GPU for this problem 04:01 - CPU...
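
The episode's core move, sketched (names are illustrative; the full program is in the repo): the sequential triple loop becomes one GPU thread per output element.

// CPU baseline: O(N^3) work on a single core.
void seq_mat_mul(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

// GPU version: the two outer loops vanish; each thread owns one (row, col) pair.
__global__ void mat_mul_kernel(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
    {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
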
Wave Simulation from scratch using finite difference method
355 views · 3 months ago
CORRECTION at 6:16: the equation should end with dt^2 * s(x,t). CORRECTION at 3:48: the delta function approximation should be d(x-a) = 1/dx when x is greater than a-dx and less than a+dx, and 0 elsewhere. WaveSim code repository: github.com/tgautam03/WaveSim Animations were generated using Manim, and the code can be found here: github.com/tgautam03/0Mean1Sigma/tree/master/WaveSim References:...
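
For context, a minimal 1D sketch of the corrected update rule (illustrative names; the real WaveSim code is in the repo above):

// One time step of u_tt = c^2 u_xx + s(x,t), central differences in space and time.
// Note the corrected source term: dt^2 * s(x,t).
void step_wave_1d(const float *u_prev, const float *u, float *u_next,
                  const float *s, int nx, float c, float dx, float dt)
{
    float r2 = (c * dt / dx) * (c * dt / dx); // (c*dt/dx)^2; keep <= 1 for stability
    for (int i = 1; i < nx - 1; i++)
        u_next[i] = 2.0f * u[i] - u_prev[i]
                  + r2 * (u[i + 1] - 2.0f * u[i] + u[i - 1])
                  + dt * dt * s[i];
    u_next[0] = 0.0f;       // fixed (Dirichlet) boundaries, for the sketch
    u_next[nx - 1] = 0.0f;
}
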
Transformer Neural Network: Visually Explained
9K views · 7 months ago
Transformers Neural Network explained and implemented using PyTorch. Code Repository: github.com/tgautam03/Transformers References: The blog post by Peter Bloem (peterbloem.nl/blog/transformers) is where I started and is one of the best resources for learning about Transformers out there. Music Credits: Moonlight by Kris Keypovsky (freemusicarchive.org/music/kris-keypovsky/single/moonlight/) 00...
Automatic Differentiation Engine from scratch
615 views · 11 months ago
I was introduced to the field of Scientific Machine Learning over 5 years ago, and Automatic Differentiation has intrigued me since day 1. So I finally decided to explore the high seas of AutoDiff and write a basic AutoDiff library from scratch. Repo: github.com/tgautam03/jac Credits: this blog post helped me a lot and is very nicely written: sidsite.com/posts/autodiff/ Music: Moonlight by Kris...
Understanding Heat Equation | From Derivation to Solution
394 views · 11 months ago
Heat Equation is one of the most fundamental partial differential equations. In this video, I've derived the Heat Equation from 1st principles and then used a special case known as the Steady State Equation to explain what Boundary Conditions are and why they're so important when it comes to finding a unique solution to the PDE. Credits: Steve Brunton's video on the Heat Equation: ua-cam.com/video...
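
A hypothetical 1D sketch of that point (not the video's code): in the steady state u''(x) = 0, the boundary values alone determine the unique solution.

// Relaxation solve of u'' = 0: each interior point drifts toward the average of
// its neighbours, while the boundary conditions u[0] and u[n-1] stay fixed.
void steady_state_1d(float *u, int n, int iters)
{
    for (int it = 0; it < iters; it++)
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5f * (u[i - 1] + u[i + 1]);
}
// With u[0] = 0 and u[n-1] = 1 this converges to the straight line between them;
// change either boundary value and you get a different (still unique) solution.
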
The Problem with Gradient Descent #SoME3
2K views · 1 year ago
Gradient Descent is the backbone of modern Machine Learning. However, it's far from perfect and has a major problem that prevents it from being used in many real-life applications. In this video, I'll start with the very basics of Mathematical Modeling, and then use Linear Regression to explain Gradient Descent. I'll also show the problem with using Gradient Descent and then explain a qu...
Visualizing Neural Network Training and Predictions: A Universal Function Approximator
12K views · 1 year ago
In this video I've given a visual demonstration of the training and predictions of a neural network. 00:00 - Introduction 00:40 - Overview 01:11 - Linear Models vs Neural Networks 02:12 - Maths behind Neural Network 03:16 - Neural Network Training 04:36 - Neural Network Predictions 06:33 - Conclusions
Automatic Differentiation: Differentiate (almost) any function
6K views · 1 year ago
Automatic Differentiation is the backbone of every Deep Learning Library. GitHub: github.com/tgautam03/jac Music: No One Is Perfect by HoliznaCC0 (freemusicarchive.org/music/holiznacc0/be-happy-with-who-you-are/no-one-is-perfect/) 00:00 - Recap 00:30 - Topics Overview 01:00 - Finite Differences 02:40 - Automatic Differentiation (Forward Pass) 04:28 - Local Gradients 05:30 - Backward Pass 07:38 ...
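
A toy contrast of two of the chapter topics, under made-up assumptions (the function f and its two-node graph are invented for the example): a finite difference only approximates f'(x), while multiplying local gradients along the graph, which is what the backward pass automates, recovers it exactly.

#include <stdio.h>

double f(double x) { return (x * x + 1.0) * x; } // y = u * x with u = x^2 + 1, i.e. x^3 + x

int main(void)
{
    double x = 2.0, h = 1e-5;
    // Finite difference: two extra evaluations of f and a step-size trade-off in h.
    double approx = (f(x + h) - f(x - h)) / (2.0 * h);
    // Local gradients of y = u * x: dy/du = x, dy/dx (direct) = u, du/dx = 2x.
    double u = x * x + 1.0;
    double exact = x * (2.0 * x) + u; // chain rule: 3x^2 + 1 = 13 at x = 2
    printf("finite difference: %.8f, local-gradient exact: %.1f\n", approx, exact);
    return 0;
}
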
Gradient Descent Algorithm: How Machines Learn
1.7K views · 1 year ago
A visual explanation of Linear Regression using Gradient Descent. GitHub: github.com/tgautam03/jac Background Music: In Her Name by Marco Castelli (freemusicarchive.org/music/Marco_Castelli/Malessere_Fiorentino/In_Her_Name/) 00:00 - Introduction 01:26 - Topics Overview 02:21 - Linear Regression Model 04:07 - Cost Function 05:54 - Gradient Descent 07:36 - Conclusions #machinelearning #3blue1brow...
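
For the curious, a self-contained sketch of that pipeline (the data, learning rate and step count are made up for the example): fit y = w*x + b by descending the mean squared error.

#include <stdio.h>

int main(void)
{
    double x[4] = {0, 1, 2, 3}, y[4] = {1, 3, 5, 7}; // generated by y = 2x + 1
    double w = 0.0, b = 0.0, lr = 0.1;

    for (int step = 0; step < 1000; step++)
    {
        double dw = 0.0, db = 0.0;
        for (int i = 0; i < 4; i++)
        {
            double err = (w * x[i] + b) - y[i]; // residual of the linear model
            dw += 2.0 * err * x[i] / 4.0;       // d(MSE)/dw
            db += 2.0 * err / 4.0;              // d(MSE)/db
        }
        w -= lr * dw; // step against the gradient
        b -= lr * db;
    }
    printf("w = %.3f, b = %.3f\n", w, b); // approaches w = 2, b = 1
    return 0;
}
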
Basic Automatic Differentiation Theory
1.1K views · 1 year ago
Topics discussed: - Why care about differentiation? - Different ways to differentiate? - Why Automatic Differentiation is best suited for ML? - What is Automatic Differentiation? Music: No one is perfect by HoliznaCC0 (downloaded from freemusicarchive.org) References: Small Pebble (can be found on GitHub) #machinelearning

COMMENTS

  • @ITDedra
    @ITDedra 4 hours ago

    🙂

  • @attiladren6990
    @attiladren6990 3 days ago

    Thank you for your fantastic instructional video. May I ask what software you used to create it?

  • @zijiali8349
    @zijiali8349 3 days ago

    This is the best material so far. All the other videos failed to explain the concept of a "phase": in each phase, two tiles, subA and subB (each with the same dimensions as the block), are copied from A and B into shared memory. This copying step costs extra time, but the subsequent calculations can take advantage of shared memory. Looking forward to your future videos!

  • @manikant1990
    @manikant1990 4 days ago

    Very well made 👍👍

  • @ViliamF.
    @ViliamF. 5 days ago

    0:36 2000 percent (unless I misunderstood what you said) is only a factor of 20, because it's "per cent", so 2000 per hundred, i.e. 2000/100 = 20.

    • @0mean1sigma
      @0mean1sigma 5 days ago

      No, actually it is 1680/0.63 ≈ 2667x

  • @exodus8213
    @exodus8213 10 days ago

    Can you help us in developing some code please? We need CUDA code to do bigint arithmetic and we've hit a problem…Willing to pay you bro…please help us

    • @0mean1sigma
      @0mean1sigma 9 days ago

      Glad you liked the content. Unfortunately I can't help you out. Thanks a lot for leaving the comment and good luck with your project 😃

  • @GlortMusic
    @GlortMusic 12 days ago

    Interesting video! I also love the fact that you put a link to the source code in the description, so beginners in Manim like me can check it out and learn how to make this type of video. It helps a lot!

    • @0mean1sigma
      @0mean1sigma 12 days ago

      Thanks a lot. My source code will always be open source 😃

  • @lab5184
    @lab5184 20 days ago

    Beautiful explanation

  • @dipi71
    @dipi71 24 days ago

    What about ARM, AMD, Gallium, IBM, Intel, Texas Instruments or POCL? Your CUDA example only runs on Nvidia hardware. Use OpenCL.

    • @lizardking640
      @lizardking640 7 days ago

      Well, that's because Nvidia's hardware is the most used for practical applications. Do you suggest he make a 4-hour-long video covering all these products?

    • @dipi71
      @dipi71 7 days ago

      @@lizardking640 I suggested using OpenCL.

  • @gs1987100
    @gs1987100 26 days ago

    So clearly explained, the most valuable video on the topic ever made... wow...

    • @0mean1sigma
      @0mean1sigma 26 days ago

      Thanks a lot. Glad you liked it 😃

  • @karthikm1558
    @karthikm1558 27 days ago

    Excellent 👌👍🎉

  • @dandan1364
    @dandan1364 1 month ago

    I'm super curious why, in this code, you never use any of the function parameters and you use variables that aren't declared in the function.

    • @0mean1sigma
      @0mean1sigma 1 month ago

      There's a small typo. I forgot to change the A inside the function to d_A. Thanks a lot for catching that, I completely missed that while writing the animation code. However, in the code repo, it's correct.

  • @dandan1364
    @dandan1364 1 month ago

    Super High quality content. Thank you.

  • @marcosd3976
    @marcosd3976 1 month ago

    Excellent, very good....

  • @Coolmd-it4ck
    @Coolmd-it4ck 1 month ago

    This was what I was waiting for!!! Thank you as always😊

    • @0mean1sigma
      @0mean1sigma 1 month ago

      Thanks a lot. Glad you found it useful 😃

  • @fractergiftogod3226
    @fractergiftogod3226 1 month ago

    What a well-made video! Having a code example along with a visual representation is awfully pleasant, and I can't believe more people aren't doing the same. Combine that with the fact that your pacing, pronunciation, clarity and presentation are all great, all while feeling genuine and expressive. I love to see this kind of thing. Just saying this to hopefully encourage you to make more! Looking forward to seeing what else you do in the future!

    • @0mean1sigma
      @0mean1sigma 1 month ago

      Thanks a lot for the comment 😃 I'm in this for the long run.

    • @fractergiftogod3226
      @fractergiftogod3226 1 month ago

      @@0mean1sigma very glad to hear that! I'll be one of your long-time fans, haha

  • @ProjectPhysX
    @ProjectPhysX 1 month ago

    GPU go brrrr!!

  • @abdulamaan4784
    @abdulamaan4784 1 month ago

    nice video

  • @jakeaustria5445
    @jakeaustria5445 1 month ago

    Thank you

  • @ProjectPhysX
    @ProjectPhysX 1 month ago

    Yes yes yes, more GPU programming videos!! Fantastic! Memory coalescence is one of the magic tricks that make GPU software lightning fast. When I first experienced this ~4x speedup from a one-line change, it blew me away. Unfortunately, for many GPU kernels the optimization mostly ends here, at the global memory bandwidth limit. Only special cases like matrix multiply or n-body can get another 10x from shared/local memory, and beyond that there are still warp operations through inline assembly. Looking forward to the next episode!

    • @0mean1sigma
      @0mean1sigma 1 month ago

      Thanks a lot 😀…. Next video is on tiling! I’m more excited to share that one. When I first learned tiling (~3 years ago), it was confusing and took me a long time to get the hang of it. I’ve always felt that HPC concepts go well with animations so I’m trying to do that using this channel.

  • @surajsamal4161
    @surajsamal4161 1 month ago

    another day, another banger

  • @theevilcottonball
    @theevilcottonball 2 months ago

    I do not think your comparison between CPU and GPU is fair; your CPU implementation does not seem to be optimised at all. Modern CPUs support vector instructions (e.g. mine even supports AVX2) and have multiple cores/virtual cores. Also, on the CPU you can use tiling and transpose one of the matrices to reduce cache misses, speeding up the calculation. I don't know much about this, but you could compare against a somewhat optimised CPU matrix multiplication like OpenBLAS for a fairer performance comparison.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Yes, you're right in the sense that the sequential program is not optimised. However, I haven't optimised the GPU program either. Also, what I often say is that we need to use both the CPU and the GPU. In my last video (as well as my blog posts), I've mentioned several times that the GPU is good at very specific tasks and it's our job to figure that out (and use it to our advantage). My comparison between CPU and GPU is intended as motivation to look beyond traditional CPU programming. Hope this clears things up. Thanks a lot for your comment, as you raised some really good points here. 😃
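
      For the curious, a sketch of the transposition idea from the comment above (illustrative, not from either repo): transposing B makes the inner loop stride-1 over both operands, often a solid CPU win before even touching AVX or threads.

      #include <stdlib.h>

      void mat_mul_bt(const float *A, const float *B, float *C, int N)
      {
          float *Bt = (float *)malloc((size_t)N * N * sizeof(float));
          for (int i = 0; i < N; i++)         // Bt = transpose of B
              for (int j = 0; j < N; j++)
                  Bt[j * N + i] = B[i * N + j];

          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
              {
                  float acc = 0.0f;
                  for (int k = 0; k < N; k++) // both reads are now stride-1
                      acc += A[i * N + k] * Bt[j * N + k];
                  C[i * N + j] = acc;
              }
          free(Bt);
      }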

  • @fundoo203
    @fundoo203 2 months ago

    Man, this is awesome. I have solved the time-independent Schrödinger equation using the finite difference method in 1D and 2D. I got good results and understood a lot about quantum mechanics and waves. I hope to extend that to 3D, solve more advanced potentials like multi-particle systems, and use GPUs as well. Please keep doing videos like this. It will be really helpful to people like me.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Thanks a lot for your comment 😃. Right now I'm working on GPU programming and will probably use that knowledge to solve the 3D wave equation (or some other PDE) in parallel on a GPU in some future video. I also have a website where I post blogs a little early. If you're interested, you can sign up there to get early access to some of the content and give me feedback (which I can use for the YouTube video). Thanks a lot again 😃

  • @surajsamal4161
    @surajsamal4161 2 months ago

    bro you're making my life easier, thank you

    • @0mean1sigma
      @0mean1sigma 2 months ago

      I'm glad you found the video useful 😃

  • @Coolmd-it4ck
    @Coolmd-it4ck 2 months ago

    definitely one of the most underrated youtubers

  • @Sidmanale643
    @Sidmanale643 2 months ago

    insane quality !

  • @Enko97
    @Enko97 2 months ago

    I f*cking love this video series! Congrats for the quality of the videos! ✨

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Glad you're enjoying the content! I've got at least 4-5 more videos planned on CUDA C 😃

    • @Enko97
      @Enko97 2 months ago

      @@0mean1sigma Excellent! That's great news! 😌 Tbh, I just started learning about NLA, and I've kinda taken an interest in the internals of the CUDA C library. Seriously, your explanations are just amazing. Thank you so much for the videos :)

  • @illustrationvaz
    @illustrationvaz 2 months ago

    Thank you for this video!! Great content and nice animations

  • @ProjectPhysX
    @ProjectPhysX 2 months ago

    The real magic starts with cache tiling and shared memory optimization. Hope to see this in Episode 2!

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Yup, that's ep 2 and 3. 😃

  • @plutoz1152
    @plutoz1152 2 months ago

    Crisp and clean explanation! I wondered, could you do a video on warps, thread tiling, and different types of kernel reduction and fusion, with a simple application-based example?

    • @0mean1sigma
      @0mean1sigma 2 months ago

      The next video is on Warps and details related to the GPU memory (shared memory, registers, etc.)! After that video, I'll make another one on tiled matrix multiplication. If you're interested please sign up on my website where I post the video notes. That way you can access the detailed content and post your thoughts in the discussion section. Thanks a lot for the comment 😃

  • @finmat95
    @finmat95 2 months ago

    Just two questions: 1. What if you want to use the GPU's power and efficiency without relying on CUDA, writing general code that runs on any GPU (for AMD users, for example)? What code do you have to write? 2. Would the performance be the same?

    • @0mean1sigma
      @0mean1sigma 2 months ago

      AMD has ROCm (that's their CUDA). However, I'm not sure how well it works. I tried ROCm some 4-5 years ago and it was not a pleasant experience back then. I moved to Nvidia after that, because with Nvidia things just work. There's also OpenCL, but again it has installation and performance issues (in some cases). But the concepts of parallel programming that I focus on are the same everywhere; only the syntax changes. Hope this answers your questions. 😃

    • @MrHaggyy
      @MrHaggyy 1 month ago

      In Python you have modules like CuPy to use the GPU. TensorFlow and PyTorch will also use it under the hood. The performance will be slightly worse, as you have some overhead for abstraction, but chasing that overhead is a point of diminishing returns in almost every case.

    • @finmat95
      @finmat95 1 month ago

      @@MrHaggyy That's in Python; what about in C?

    • @MrHaggyy
      @MrHaggyy 1 month ago

      @@finmat95 I'm not aware of C/C++ modules that hide hardware abstraction at the level Python does. But those Python modules use a lot of C/C++ themselves, so you could look into their code. Or use a framework like Chlorine or Kompute. Kompute is nice because it has the same examples in C++ and Python.

  • @finmat95
    @finmat95 2 months ago

    Simple and clear. Awesome.

  • @dan_pal
    @dan_pal 2 months ago

    This was an amazing explanation, thanks for sharing.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Thanks a lot. Glad you liked it 😃

  • @sharrehabibi
    @sharrehabibi 2 months ago

    🙌👏

  • @dtamien
    @dtamien 2 months ago

    I loved this video. I wish it had kept going.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Thanks a lot 😃 I've a few more videos on GPU programming coming up...

  • @empatikokumalar8202
    @empatikokumalar8202 2 months ago

    In the matrix multiplications used at 2:00, are the numbers of rows and columns in the matrices variable or fixed? If variable, over what range of values; if fixed, at what value? Also, how many bit operations do these multiplications use?

    • @0mean1sigma
      @0mean1sigma 2 months ago

      In the real world, matrix size is set by the data so for different problems the number of rows/columns will be different (so it's a variable in that sense). However, once the execution begins, the matrix size does not change. The range of values can be anything (there's no limit on that). However, for very large and small matrices, parallelization techniques will change. I'll cover a few of those in my next video (like: using shared memory, memory coalescing, thread coarsening), so please keep an eye on that (you can also sign up on my website where I publish detailed blog posts and you'll get notified when I publish something). As far as the operations are concerned, I have done a detailed analysis of computation cost in the video notes (link in the description), in short, matrix multiplication is of O(N^3) complexity. I appreciate you watching the video, and if you've more questions after reading the notes, I'm happy to answer those in the discussion section of my website. 😃

    • @empatikokumalar8202
      @empatikokumalar8202 2 months ago

      @@0mean1sigma I would be very happy if you could make a video about H100s sometime.

    • @empatikokumalar8202
      @empatikokumalar8202 2 months ago

      @@0mean1sigma So I developed a different processor method. It is not transistor-based. Therefore, it has features that are faster and consume less energy than you can imagine. But I can't find a way to use it and make money from it. I am no longer sure that companies and states are really looking into this.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      I'm not sure what I can say about H100s. My focus is on writing fast (enough) + easy to understand code by understanding the general hardware components (not specific to a GPU/CPU model), so that it scales well with new hardware generations. In any case, I'm not influential enough to get access to H100s so I don't think I would be able to code by keeping H100 specs in mind (at least at this point in time).

  • @warpdrive9229
    @warpdrive9229 2 months ago

    Namaste Tushar bhai! How are you?

    • @0mean1sigma
      @0mean1sigma 2 months ago

      All good! Hope you're doing well 😃

    • @warpdrive9229
      @warpdrive9229 2 months ago

      @@0mean1sigma I too have enrolled myself in a PhD program in Machine Learning like you. Pretty anxious. Hope things turn out well XD

    • @0mean1sigma
      @0mean1sigma 2 months ago

      All the best 😃

  • @sehbanomer8151
    @sehbanomer8151 2 months ago

    Great introduction. One thing to add: each thread can also compute a small block of output elements rather than a single one.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      You're right! But that's only needed if the matrix is very large; otherwise you'll be looping over elements sequentially, defeating the purpose of parallelization. There are several optimizations yet to be done, and I'm working on a video right now where I'll explain how we can use the GPU hardware smartly (especially the different memory components) to speed up the computations even more. If you're interested in an early discussion of that topic, please sign up for the blog posts on my website; I'd love to have technical discussions in the comment section there (I generally post early there and you'll get notified by mail). BTW, thanks a lot for watching the video 😃
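
      A sketch of the commenter's idea, often called thread coarsening (COARSEN and the row-wise split are assumptions for illustration): each thread computes a short column of outputs instead of a single element.

      #define COARSEN 4

      // Launch with grid.y = ceil(N / (blockDim.y * COARSEN)); each thread owns
      // COARSEN consecutive rows of one output column.
      __global__ void coarsened_mat_mul(const float *A, const float *B, float *C, int N)
      {
          int col  = blockIdx.x * blockDim.x + threadIdx.x;
          int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * COARSEN;

          for (int r = 0; r < COARSEN; r++)
          {
              int row = row0 + r;
              if (row < N && col < N)
              {
                  float acc = 0.0f;
                  for (int k = 0; k < N; k++)
                      acc += A[row * N + k] * B[k * N + col];
                  C[row * N + col] = acc;
              }
          }
      }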

  • @jakeaustria5445
    @jakeaustria5445 2 months ago

    Hi, Standard Normal, thanks for the great vid!😊

  • @cariyaputta
    @cariyaputta 2 months ago

    Nice channel.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Thanks. Glad you liked the content 😃

  • @vigneshs.666
    @vigneshs.666 2 months ago

    amazing video!

  • @divyamxdeep
    @divyamxdeep 2 months ago

    While clicking on this video, never in a million years could I have imagined that you'd explain all of this stuff in such a simple and comprehensive manner. Great work.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Glad you liked it. I would appreciate your feedback on my blog posts as well (link in the description). I started writing early this month and I'm keen on improving there too. Thanks a lot again 😃

    • @divyamxdeep
      @divyamxdeep 2 months ago

      @@0mean1sigma I’ll sure take a look

    • @xl000
      @xl000 18 days ago

      This is the first part of chapter 1 of every CUDA programming book.

  • @bilal_ali
    @bilal_ali 2 months ago

    How did you make this animation like 3blue1brown's? BTW, your name 0mean1sigma is quite standardized.

    • @0mean1sigma
      @0mean1sigma 2 months ago

      Manim is open source. All my work is open source as well and I've provided a link to the animation code for my videos. Thanks a lot for watching... 😃

    • @deepak_nigwal
      @deepak_nigwal 2 months ago

      For a moment, I literally thought it was a 3blue1brown video 😅

  • @korigamik
    @korigamik 5 months ago

    Man I really like this video! Can you share the code for the animations that you used in the video with us?

    • @0mean1sigma
      @0mean1sigma 5 months ago

      Glad you liked the video. Unfortunately, the complete code for the animations got deleted (accidentally), as I didn't have any kind of workflow back then (but the animation of the network learning is available on GitHub, with a link in the video description). However, I'm improving on that and have uploaded the animation code for my latest video. I'm really sorry about this again.

  • @niks1632
    @niks1632 5 months ago

    I didn't like it. The equation transitions are too fast. Please do something to make them understandable.

    • @0mean1sigma
      @0mean1sigma 5 months ago

      I'm really sorry about this. I'll keep this in mind next time. Thanks a lot for the feedback.

  • @Тима-щ2ю
    @Тима-щ2ю 5 months ago

    Great work!! I finished my course (methods of mathematical physics), but I was never interested in what these equations actually are and why they look like this! Your video captured my attention, and now I understand the Heat Equation. Keep going!!!

  • @Тима-щ2ю
    @Тима-щ2ю 5 months ago

    Keep going, good work, thanks!!!

    • @0mean1sigma
      @0mean1sigma 5 months ago

      Glad you found it useful.

  • @MLSCLUB-t2y
    @MLSCLUB-t2y 6 months ago

    NICE WORK MANN

  • @BlueBirdgg
    @BlueBirdgg 6 months ago

    Thank you for the video!

  • @dimitrisspiridonidis3284
    @dimitrisspiridonidis3284 6 months ago

    Amazing work