Writing Code That Runs FAST on a GPU

  • Published 5 Jun 2024
  • In this video, we talk about why GPUs are better suited for parallelized tasks. We go into how a GPU is better than a CPU at certain tasks. Finally, we set up the NVIDIA CUDA programming packages to use the CUDA API in Visual Studio.
    GPUs are a great platform to execute code that can take advantage of massive parallelization. For example, in this video we show the difference between adding vectors on a CPU versus adding vectors on a GPU. By taking advantage of the CUDA parallelization framework, we can do mass addition in parallel.
    🏫 COURSES 🏫 Check out my new courses at lowlevel.academy
    🙌 SUPPORT THE CHANNEL 🙌 Become a Low Level Associate and support the channel at / lowlevellearning
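For readers who want to try the vector addition described above, here is a minimal sketch. It is not the code from the video: the names `vector_add` and `gpu_vector_add` are made up for illustration, and it assumes the whole vector fits in a single block (at most 1024 threads on current NVIDIA GPUs).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements.
__global__ void vector_add(const int *a, const int *b, int *c, int n) {
    int i = threadIdx.x;              // thread index within the single block
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper (not from the video): computes c = a + b on the GPU.
// Returns false if any CUDA call fails. Assumes n fits in one block.
bool gpu_vector_add(const int *a, const int *b, int *c, int n) {
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    const size_t bytes = n * sizeof(int);

    // Allocate device buffers and copy the inputs host -> device.
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess &&
              cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        vector_add<<<1, n>>>(d_a, d_b, d_c, n);   // 1 block, n threads
        // Copy the result device -> host; the blocking copy waits for the kernel.
        ok = cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);  // freeing nullptr is a no-op
    return ok;
}

int main() {
    int a[] = {1, 2, 3}, b[] = {4, 5, 6}, c[3] = {0};
    if (!gpu_vector_add(a, b, c, 3)) { printf("CUDA error\n"); return 1; }
    printf("%d %d %d\n", c[0], c[1], c[2]);
    return 0;
}
```

Built with something like `nvcc vector_add.cu -o vector_add`, this should print the element-wise sums on a machine with a working CUDA device.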

COMMENTS • 309

  • @empireempire3545
    @empireempire3545 1 year ago +781

    You could make a series out of this - basics of CUDA are trivial, but there are many, many performance traps in gpgpu

    • @VulpeculaJoy
      @VulpeculaJoy 1 year ago +17

      Especially once you get into cuBLAS and Thrust territory, things get complicated really quickly.

    • @Freakinkat
      @Freakinkat 1 year ago +4

      @@VulpeculaJoy You're not joking!
      Me: "Throws hands in the air in frustration."

    • @andrebrait
      @andrebrait 1 year ago +41

      Back when I tried GPGPU, the most astonishing performance trap was just memory handling. Selecting what data to put into what kind of memory and utilizing them was very hard, but when you did it right the thing performed 10x better.

    • @deadvoicegame
      @deadvoicegame 9 days ago

      @@andrebrait please, I need some help with that. If you can help a bit and guide me, it would be much appreciated 👍🙏

  • @shanebenning3846
    @shanebenning3846 2 years ago +515

    This was super insightful, never would have thought it'd be that easy... I need to look more into cuda programming now

    • @cedricvillani8502
      @cedricvillani8502 2 years ago +23

      It’s definitely not but by now you have realized that 😮😅

  • @peterbulyaki
    @peterbulyaki 1 year ago +306

    Excellent tutorial. One minor thing I would have mentioned in your video is that copying between host and device is a relatively expensive operation, since you are moving data between the CPU and the GPU over the PCI Express bus. No matter how fast or modern your system is, that bus is still a bottleneck compared to data transfer between the CPU and system memory or between the GPU and its own DRAM. So the performance advantage is only noticeable when the time spent copying data is short relative to the task execution time.

    • @MrSofazocker
      @MrSofazocker 1 year ago

      Hm.. yes, but only if your data is of significant size as well. Also the bus speed is fixed by the platform. It's only a concern if your GPU processes data significantly faster than it can fetch new data. Otherwise yes, agree. You always have to test everything. Best example is Unreal Engine 5, where after testing it turns out software rasterizing is faster than doing it on the GPU for some reason 😂 Always test whether what you do would actually benefit from switching the compute device and dealing with copying data, etc.

    • @T33K3SS3LCH3N
      @T33K3SS3LCH3N 11 months ago +31

      Yeah that part hit me in the face when I was writing a 3D engine.
      Starting at a few hundred to a thousand objects, it is not so much the complexity of shading each object but the number of separate draw calls to the GPU that slows things down to a crawl. In that case it is the latency of communication between the CPU and GPU, rather than the bandwidth, that causes problems, but the fundamental issue is the same: sending data between the two is slow.
      I had found this cool technique that would speed up deferred shading a lot more by doing additional checks for what area would actually be hit by light sources. The problem with this was that it meant 2 draw calls per light source instead of 1. Even though this saved the GPU itself a lot of work, it ended up dramatically decreasing performance, since it was the draw calls that bottlenecked me.
      For the mentioned scenario, the proper solution is batched calls, where a single call to the GPU can render many objects at once (particularly identical ones that use the same shader and ideally the same base mesh).

    • @peterbulyaki
      @peterbulyaki 9 months ago +1

      The more vram you have the larger training datasets you can use. For certain tasks cards with low vram are perfectly usable, for others not.

    • @LuLeBe
      @LuLeBe 7 months ago +4

      @@gonda8365 sometimes low vram also just doesn’t work at all. Like blender cuda rendering, if the scene doesn’t fit in vram, it won’t render, not even in a million hours.

    • @perinoid
      @perinoid 6 months ago

      I totally agree. I wanted to write a very similar remark, just noticed yours.
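To see the copy cost discussed in this thread on your own machine, one option is to time the upload, the kernel, and the download separately with CUDA events. The sketch below is only an illustration: the helper name `time_round_trip` and the 16M-element size are arbitrary choices, and the numbers depend heavily on the GPU, the PCIe link, and whether the host memory is pinned.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical helper: times upload, kernel, and download separately (in ms).
bool time_round_trip(int n, float *h2d_ms, float *kernel_ms, float *d2h_ms) {
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    add_one<<<(n + 255) / 256, 256>>>(dev, n);
    cudaEventRecord(t2);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);            // wait until everything has finished

    cudaEventElapsedTime(h2d_ms, t0, t1);
    cudaEventElapsedTime(kernel_ms, t1, t2);
    cudaEventElapsedTime(d2h_ms, t2, t3);

    bool ok = cudaGetLastError() == cudaSuccess && host[0] == 2.0f;
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaEventDestroy(t2); cudaEventDestroy(t3);
    cudaFree(dev);
    return ok;
}

int main() {
    float h2d = 0, kernel = 0, d2h = 0;
    if (!time_round_trip(1 << 24, &h2d, &kernel, &d2h)) return 1;
    printf("upload %.3f ms, kernel %.3f ms, download %.3f ms\n", h2d, kernel, d2h);
    return 0;
}
```

For a kernel this trivial, the two copies will usually dominate, which is the point made above.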

  • @dominikkruk5235
    @dominikkruk5235 2 years ago +514

    Finally I can use my RTX 3060 Ti to do something useful...

    • @ojoaoprocopio
      @ojoaoprocopio 11 months ago +64

      Nice, now you can bubble sort an array

    • @FuzeEdits
      @FuzeEdits 11 months ago +25

      Bogo sort

    • @vuhuy8952
      @vuhuy8952 11 months ago +11

      bogo sort looks like gambling with fate.

    • @hikari1690
      @hikari1690 6 months ago +5

      I use my gpu to play league

    • @LeicaM11
      @LeicaM11 4 months ago +1

      I do love grid computing and parallelism. I really want to learn how to program my new eGPU (RTX 3080).

  • @Borszczuk
    @Borszczuk 1 year ago +23

    This fight at @7:30 with "*" placement was hilarious. I laughed so hard when you gave up :)

  • @BenjaminWheeler0510
    @BenjaminWheeler0510 1 year ago +508

    As someone who doesn’t have an NVIDIA card, I think you should do an OpenCL or OpenGL series, which everyone can use! Unless there’s something special about CUDA, I never see the cross-platform ones on YouTube…

    • @cykkm
      @cykkm 1 year ago +56

      Look at Intel's oneAPI Base Toolkit, which includes a DPC++ SYCL compiler. It may hide all this low-level stuff, which is too hard to do efficiently. By default it works best, Intel being Intel, with OpenCL (3.0 sure, not so much about 2.2; doesn't with 2.1), but there is already experimental support for CUDA out of the box. SYCL is an open, GPU-agnostic (ahem, supposed to be) standard. CUDA code looks like C++, but in fact you think about hardware all the time; it's harder than assembly, in fact. OpenCL is no simpler. Looks are deceptive. This is why I believe a good compiler will eventually beat low-level CUDA/OpenCL coding. Who would hand-optimize Intel CPU code these days, and beat the optimizer? High-level distributed/parallel C++ (DPC++) is da way to look into the future.
      BTW, OpenCL is for compute, OpenGL is for 3D drawing/rendering, it's not "or." Entirely different APIs. OpenCL takes on the same task as CUDA. OpenGL is xplat and oldsy (ah, that 80's feel!); for Windows-only, DirectX is preferable.
      If you take the oneAPI route, one piece of advice is to choose which components to install. The full thing takes 40GB installed or so, and takes an awful lot of time to install and upgrade, even on a second-high-end Gen12 CPU and a very fast PCIe SSD. And you hardly need the data analytics or video compression libraries.

    • @LRTOTAL
      @LRTOTAL 1 year ago +35

      Compute shaders! They run on all gpus.

    • @ben_jammin242
      @ben_jammin242 1 year ago +14

      Is vulkan worth learning?

    • @PutsOnSneakers
      @PutsOnSneakers 1 year ago +16

      @@ben_jammin242 I was just about to mention that vulkan is the future while openGL lags far behind in terms of being adopted by the masses

    • @whannabi
      @whannabi 1 year ago

      @@PutsOnSneakers cadum, cadum

  • @rezq2883
    @rezq2883 11 months ago +11

    Amazing video! I love the way you explain things thoroughly enough that a beginner can easily understand it without explaining *too* much and droning on. Thorough yet concise, great job :)

  • @lucasgasparino6141
    @lucasgasparino6141 1 year ago +94

    Amazing intro to CUDA man! For those interested in gpu programming, I'd also recommend learning openACC. Not as powerful as CUDA, but gives you a nice "first working" gpu program to have an idea before suffering with low level optimization hehe. Would be nice to see a follow up to this using both MPI and CUDA to work with multiple GPUs :D

  • @0xggbrnr
    @0xggbrnr 11 months ago +5

    This is a lot more straightforward than I thought it would be. Basically, replace all allocation operations and pointer operations with CUDA framework types and functions. 😅

  • @mrmplatt
    @mrmplatt 10 months ago +2

    This was a super cool video. I'm currently learning assembly so seeing how to operate at a pretty low level was very interesting to me.

  • @herrxerex8484
    @herrxerex8484 2 years ago +7

    i discovered your channel recently and so far I am loving it.

  • @Shamysoza92
    @Shamysoza92 2 years ago +6

    Your channel is amazing! Just found it and I must say you have a great way of teaching. Kudos for that, and congrats on the amazing content

  • @Antagon666
    @Antagon666 2 years ago +47

    No dislikes, no wonder why :)
    I finally found a comprehensive tutorial, because most of them fail to explain the basic mindset behind CUDA programming.

    • @widrolo
      @widrolo 2 years ago +3

      there are 5 now, probably people who didn't like him personally or trolls...

    • @nomadshiba
      @nomadshiba 2 years ago +9

      @@widrolo or people who dont like multi threading for some weird reason
      or people maybe who know some different framework for this and get annoyed he showed this one
      idk it can be anything

    • @balern4
      @balern4 1 year ago +3

      You can see dislikes?

    • @SGIMartin
      @SGIMartin 1 year ago +8

      @@widrolo It's AMD engineers

    • @NuLuumo
      @NuLuumo 11 days ago

      @@balern4 There's an extension on the chrome web store that adds them back

  • @psevekar
    @psevekar 2 years ago +6

    You explained it so well, thanks a lot

  • @bogdandumitrescu8987
    @bogdandumitrescu8987 2 years ago +57

    Useful, but the discussion about the block size and grid size was avoided. I think there should be a video focused only on this topic as it's not easy to digest, especially for new CUDA programmers. A comparison with OpenCL would be even better :)
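On the block size and grid size question, a common pattern (sketched here under my own naming, not taken from the video) is to pick a fixed number of threads per block, round the number of blocks up so that every element is covered, and guard the kernel against the extra threads in the last block. The value 256 is just a conventional starting point, not a tuned choice.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) c[i] = a[i] + b[i];                  // last block may be partly idle
}

// Number of blocks needed so that blocks * threads_per_block >= n.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

int main() {
    const int n = 1000, threads = 256;
    const int blocks = blocks_for(n, threads);      // 4 blocks cover 1000 elements
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n, 0.0f);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    printf("%d blocks x %d threads, c[%d] = %.1f\n", blocks, threads, n - 1, h_c[n - 1]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Without the `if (i < n)` guard, the 24 surplus threads in the fourth block would write past the end of the arrays.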

  • @rampage_sl
    @rampage_sl 1 year ago +12

    Hey this is super useful! I elected High Performance Computing and Microprocessors and Embedded Systems modules for my degree, and this channel has become my go-to guide.

    • @MCgranat999
      @MCgranat999 1 year ago

      That's probably the degree I'm gonna go for as well.
      This channel is amazing xP

  • @thorasmund
    @thorasmund 1 year ago

    Great video! Short and to the point, just enough to get me started!

  • @a1nd23
    @a1nd23 1 year ago +47

    Good video. It would be interesting to make the vectors huge and run some benchmarks comparing the cuda function to the cpu version.

    • @alzeNL
      @alzeNL 6 months ago +5

      i think armed with this video, its something you could do yourself :) the best students are the ones that use what was taught.

    • @Ellefsen97
      @Ellefsen97 6 months ago +5

      I would imagine the CPU implementation would win performance wise when it comes to simple addition, since copying memory to and from the GPU is a pretty expensive operation. Especially if we make the benchmarking fair and utilize threads on the CPU implementation.
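For anyone who wants to run the benchmark suggested in this thread, here is one rough way to set it up. The helper names are invented for the sketch, the CPU side is a plain single-threaded loop, and the GPU timing deliberately includes allocation and both copies, since that is the cost being debated. Treat the output as a rough indication only; it says nothing general about CPUs versus GPUs.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add_kernel(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

using clock_type = std::chrono::steady_clock;

static double ms_since(clock_type::time_point start) {
    return std::chrono::duration<double, std::milli>(clock_type::now() - start).count();
}

// Single-threaded CPU addition; returns elapsed milliseconds.
double cpu_add_ms(const std::vector<int> &a, const std::vector<int> &b, std::vector<int> &c) {
    auto start = clock_type::now();
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return ms_since(start);
}

// GPU addition; the timing includes allocation and both copies.
double gpu_add_ms(const std::vector<int> &a, const std::vector<int> &b, std::vector<int> &c) {
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(int);
    int *d_a, *d_b, *d_c;
    auto start = clock_type::now();
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    add_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    double ms = ms_since(start);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return ms;
}

int main() {
    const int n = 1 << 20;                     // about one million elements
    std::vector<int> a(n, 1), b(n, 2), c_cpu(n), c_gpu(n);
    cudaFree(0);                               // create the CUDA context before timing
    printf("CPU: %.3f ms\n", cpu_add_ms(a, b, c_cpu));
    printf("GPU incl. copies: %.3f ms\n", gpu_add_ms(a, b, c_gpu));
    printf("results match: %s\n", c_cpu == c_gpu ? "yes" : "no");
    return 0;
}
```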

  • @JM-fo3yb
    @JM-fo3yb 2 years ago +46

    Keep up the good content boss !

  • @ramezanifard
    @ramezanifard 1 year ago +5

    Very nice tutorial. I really liked it. It's brief, to the point and very clear. Thanks. Could you please make a video for the same example but in Linux?

  • @johnhajdu4276
    @johnhajdu4276 1 year ago +12

    Thank you for the video, it was good to see an easy example of how it works.
    I was recently watching a video about the MMX instruction set of the first Pentium CPUs (around 1997), and it was mentioned that the main usage of that new feature was, for example, changing the brightness of a photo (probably a bitmap file), where a lot of mathematical manipulation is needed on a huge file and the same mathematical function is repeated for every pixel. The idea behind MMX was that multiple registers were loaded with values, then the CPU executed one instruction, and some clock cycles later all the output registers were filled. I think it was called "single instruction, multiple data".
    I have this feeling now that the GPU CUDA cores could do all the mathematical manipulation of a bitmap picture; we only have to load the picture into GPU memory, along with the mathematical manipulation pattern(s), and execute the transformation. It's probably not worth transforming only one picture, since we lose time with all the preparation, but if we have many different pictures (for example a video), maybe it makes sense to use the power of the GPU.

    • @tormodhag6824
      @tormodhag6824 1 year ago +1

      I don't know if I'm correct, but when you render video in Blender, for example, it can use the GPU, and you can do things like manipulating color. Don't know if it has any relevance, just my thoughts

    • @LuLeBe
      @LuLeBe 7 months ago

      Yeah photoshop does a few things on the gpu, and good video editing algorithms run on the gpu as well. It’s exactly like you said. And SIMD instructions are also used quite a lot, but from what I’ve seen, they seem more of a middle ground. If CPU is too slow, but gpu not really worth it due to latency or complexity.

    • @simodefa12
      @simodefa12 5 months ago

      I guess it might be similar to the SIMD instructions on arm cortex. Basically there's a coprocessor dedicated to executing instructions that operate on multiple registers at the same time.

  • @Frost_Byte_Tech
    @Frost_Byte_Tech 9 months ago

    I'd really love to see more videos like these

  • @bluustreak6578
    @bluustreak6578 1 year ago

    Super nice starting video for someone like me who was too afraid to try it blind :D

  • @WistrelChianti
    @WistrelChianti 1 year ago +1

    Thanks, that was a super clear example. Amused that you called it a register, guess you can't turn off thinking in assembly code :D

  • @miketony2069
    @miketony2069 1 year ago

    That was an excellent beginner friendly overview. Almost a hello world type of intro to get your feet wet. Definitely looking forward to more videos from you.

  • @Rottingflare
    @Rottingflare 2 years ago

    Loved the video! Had to like and subscribe! Can't wait to see the rest of the project as well as what other projects you work on!

  • @zrodger2296
    @zrodger2296 2 years ago +21

    Easier than I thought! Would love to see you do this in OpenCL!

    • @LowLevelLearning
      @LowLevelLearning 2 years ago +4

      Great suggestion!

    • @NoorquackerInd
      @NoorquackerInd 2 years ago +3

      @@LowLevelLearning Yes, definitely give OpenCL content, there's not enough of it

  • @murdomeiring2934
    @murdomeiring2934 1 year ago +3

    @LowLevelLearning Could be very cool to see a bit more complex & lengthy setup to show difference in time on GPU vs CPU for different use cases.

  • @iyadahmed3773
    @iyadahmed3773 1 year ago

    Thanks a ton, very clear explanation 🙏

  • @hstrinzel
    @hstrinzel 1 year ago +5

    That is VERY impressive how relatively SIMPLE and CLEAR you showed that! Wow, thank you! Question: There is SOME sort of parallel or vector operation also possible on the modern CPUs, right? Could you show how THAT would be done in this example?

  • @skylo706
    @skylo706 6 months ago

    Maybe you don't read this because the video is 2 years old now, but could you make a video about how graphics programming works on a computer? 2d and or 3d. You are so good at explaining stuff, it would be really amazing imo

  • @cykkm
    @cykkm 1 year ago +3

    Cool intro, thanks! In the year 2021, tho, I'd rather use the even simpler, modern cudaMallocManaged() UVM call. One may get faster code by manually controlling memory transfers in multiple streams and synchronization; this is what I have seen in code written by an NVIDIA Sr. SWE, but I could never really fully grok it. For the rest of us, there's UVM: you just allocate memory accessible to both the CPU and the device, and it's synchronized and moved in the right direction at the driver level.
    It does allow writing stupidly inefficient code, but this is not too easy, really :) For a GPU starter, it simplifies memory tracking a lot.
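A minimal sketch of the `cudaMallocManaged()` approach described above, with the helper name `managed_vector_add` invented for illustration. One pointer is valid on both host and device, so the explicit `cudaMemcpy` calls disappear, but the host has to wait for the kernel before reading the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two managed arrays and checks the result on the host.
bool managed_vector_add(int n) {
    int *a = nullptr, *b = nullptr, *c = nullptr;
    const size_t bytes = n * sizeof(int);
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(c);
        return false;
    }
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }  // plain host writes

    vector_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    // Kernel launches are asynchronous: wait before the host touches c.
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (c[i] == 3 * i);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return ok;
}

int main() {
    printf("managed vector add: %s\n", managed_vector_add(1000) ? "ok" : "failed");
    return 0;
}
```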

  • @illosuth
    @illosuth 4 months ago

    Pretty straight forward tutorial. What do you think would be the next step? vector multiplication?

  • @nathanaelmccooeye3204
    @nathanaelmccooeye3204 1 year ago

    Thanks for the video!
    CC: When the narrator follows new, or not immediately obvious to a newcomer information with, “right?” I feel really lost and a little stressed thinking I can’t even understand this basic information!!

  • @dominikschroder3784
    @dominikschroder3784 1 month ago

    Great explanation!

  • @MrHaggyy
    @MrHaggyy 1 year ago

    This video was great never thought it would be so simple. Do you mind digging deeper into this. Maybe some filter, coordinate transformation or other basic math stuff?

  • @olivalle
    @olivalle 2 years ago +1

    Thank you for your crystal clear explanation

  • @nefasto_
    @nefasto_ 1 year ago

    i like the fact that u write in c that i do at school and i understand what you are coding

  • @typeer
    @typeer 2 years ago

    Channel is just the sickest ty ty

  • @bean_mhm
    @bean_mhm 1 year ago

    Super interesting, thanks a lot!

  • @gabrielgraf2521
    @gabrielgraf2521 1 year ago +2

    Damn was this interesting. So basically every time I have big for loops or even nested for loops, my graphics card could calculate it way faster.
    Thanks man this was interesting

  • @ankk98
    @ankk98 5 months ago

    This was insightful

  • @GeorgesChannel
    @GeorgesChannel 1 year ago

    Very helpful. Thank you for sharing!

  • @EnderMega
    @EnderMega 2 years ago

    Man, the NVIDIA docs are OK, but this is so well made, very nice :D

  • @Dedi369
    @Dedi369 2 months ago

    Super interesting! Thanks

  • @mutt8553
    @mutt8553 11 months ago

    Great video, really interesting stuff. Looks like I need an Nvidia gpu now

  • @d_shepperd
    @d_shepperd 1 year ago

    Thanks. Nicely done.

  • @UnrealOG137
    @UnrealOG137 1 year ago

    Never expected to hear that ending song. It's a really good song. It's Run by Hectorino Martinez.

  • @lohphat
    @lohphat 11 months ago +3

    Are there any guides explaining how the code segments are actually sent to the GPU and how the API and firmware handle operations?
    Just understanding the coding portion isn't enough until you understand the hardware architecture and low-level ops.

  • @ronensuperexplainer
    @ronensuperexplainer 1 year ago +7

    After writing 400 LOC for initializing OpenCL and finally giving up, this seems so easy!

  • @ben_jammin242
    @ben_jammin242 1 year ago +5

    How can you dynamically manage and display your available GPU memory based on load and display it as a bar graph? Such as when you're choosing LoD or texture and geometry complexity and want to estimate if it's going to throttle the gpu. Many thanks! Happy to be pointed to a resource if it's not something you've covered as yet :-)

    • @_RMSG_
      @_RMSG_ 1 year ago +1

      I believe the Nsight debugging tools should give you everything you need for this

  • @TheFuture36520
    @TheFuture36520 5 months ago

    This is absolutely mental 😎

  • @arbiter7234
    @arbiter7234 8 months ago

    thanks a lot, great tutorial

  • @gregoryfenn1462
    @gregoryfenn1462 11 months ago

    Thanks for this! As this channel is Low Level Programming, can we look into making our own GPU driver code (the GPU malloc and parallel function call interface)? Just calling CUDA APIs is really high-level programming with all the technical details abstracted away.

  • @yah3136
    @yah3136 2 years ago +62

    Nice presentation, but you should talk about OpenCL. Even if it's not well supported on NVIDIA cards, at least you can target multiple parallel devices (at the same time). And the core concepts of grid, block and threads are much the same (with different names, but the same cache segregation principle).

    • @jan-lukas
      @jan-lukas 2 years ago +11

      Yeah OpenCL is the way to go for using several gpus or different types of gpus (like Nvidia and amd)

  • @KogDrum
    @KogDrum 2 years ago +1

    Thanks, can you recommend resources for learning this specific type of programming? or from where to get this kind of knowledge?

  • @muhammedthahirm423
    @muhammedthahirm423 9 months ago +1

    Bro actually Showed both result that came in ~1nanoSecond and 0.3nanoSecond and thought we would notice.
    jk Your Explanation is Amazing

  • @danielniedzwiecki638
    @danielniedzwiecki638 1 year ago

    thnx you are a legend brothers

  • @ben_jammin242
    @ben_jammin242 1 year ago +2

    New to your channel. Liked and subbed!
    Edit: what is "sizeof(a)/sizeof(int)" computing? I thought size of c would be N if a and b are both N

    • @carljacobs1287
      @carljacobs1287 1 year ago +3

      sizeof(a) will return the size of the array in bytes. sizeof(int) will return the size of an int in bytes (typically 4 on mainstream 32-bit and 64-bit platforms, though it can be 2 on some 16-bit targets). The division then gives you the number of elements in the array. A useful helper #define is:
      #define ArrayLength(a) (sizeof(a) / sizeof((a)[0]))
      This will also work if you have an array of structures.
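A short host-only sketch of the `sizeof` idiom explained above. The macro is renamed `ARRAY_LENGTH` here and the `Point` struct and `size_seen_through_pointer` function are made up for the example. The important caveat is that the idiom only works where the compiler can still see a real array; once the array has been passed to a function as a pointer, `sizeof` reports the pointer size instead.

```cuda
#include <cstdio>

// Number of elements in a true array (not a pointer).
#define ARRAY_LENGTH(a) (sizeof(a) / sizeof((a)[0]))

struct Point { float x, y, z; };   // hypothetical struct, to show it works for any type

// Inside this function p is a pointer, so sizeof(p) is the pointer size,
// not the size of the caller's array.
size_t size_seen_through_pointer(const int *p) {
    return sizeof(p);
}

int main() {
    int a[] = {1, 2, 3};
    Point pts[10] = {};

    printf("sizeof(a) = %zu bytes, sizeof(int) = %zu bytes\n", sizeof(a), sizeof(int));
    printf("ARRAY_LENGTH(a)   = %zu\n", ARRAY_LENGTH(a));
    printf("ARRAY_LENGTH(pts) = %zu\n", ARRAY_LENGTH(pts));
    printf("through a pointer: %zu bytes\n", size_seen_through_pointer(a));
    return 0;
}
```

This is why functions that take a pointer, including CUDA kernels, normally receive the element count as a separate argument.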

  • @Freakinkat
    @Freakinkat 1 year ago

    The clicky keyboard sounds was oddly satisfying to me. It's like a little white noise to me. It's so peaceful

  • @ProjectPhysX
    @ProjectPhysX 2 years ago +1

    Very helpful tutorial!
    I prefer OpenCL though :)

    • @cedricvillani8502
      @cedricvillani8502 2 years ago

      A Khronos Group junkie huh? Well at least OpenCL has an SDK and not just an API. Most people don’t program bare metal; scripters, if they get near this, usually use Python

  • @jarsal_firahel
    @jarsal_firahel 11 months ago

    Absolutely awesome

  • @philtoa334
    @philtoa334 1 year ago

    Very good thanks.

  • @kreinraan6558
    @kreinraan6558 1 year ago +1

    Nice video! Sorry if I missed it, but is there a reason why you did not use std::vector as arrays?

    • @GregMoress
      @GregMoress 10 months ago

      Probably because the hardware doesn't use it.

  • @kaiperdaens7670
    @kaiperdaens7670 5 months ago +1

    The vector thing in the beginning could be done multicore too I think, so with 3 vectors you can just do each one on a different core at the same time.

  • @gat0tsu
    @gat0tsu 11 months ago

    thanks alot for the video

  • @SauvikRoy
    @SauvikRoy 1 year ago

    +1 for using light theme for demonstration!
    Nice tutorial.

  • @jonathanmoore5619
    @jonathanmoore5619 2 years ago +2

    Super! Mark Duper!

  • @AvgDan
    @AvgDan 1 year ago

    Do you think you could do a video of using GPU to solve for subset sum?

  • @bimDe2024
    @bimDe2024 1 year ago

    are u going to make a series on cuda?

  • @thoufeekbaber8597
    @thoufeekbaber8597 1 year ago

    i like the intro "we are mining bit coin"

  • @jesseparrish1993
    @jesseparrish1993 1 year ago +1

    I was busy trying to build a GPU on a breadboard like a weirdo when I found this. Much better.

    • @altayakkus4611
      @altayakkus4611 1 year ago +1

      Building a GPU on a breadboard is really cool, why should this video be better, its just a different topic. Or were you trying to build an ASIC on a breadboard, and realized now that you can just use CUDA? ;D

  • @LightStrikerQc
    @LightStrikerQc 1 year ago

    I like that VS doesn't like you adding space between a type and the * :p

  • @Rejnols
    @Rejnols 2 years ago

    Well well well, i see you changed the thumbnail 😎

  • @kyuthefox
    @kyuthefox 1 year ago +1

    I like CUDA, but considering how many CUDA tutorials there are, I would like an OpenCL tutorial, because there are only really advanced examples out there and you have to start with the basics, which I couldn't find on YouTube

  • @dj-maxus
    @dj-maxus 1 year ago +1

    After discovering your channel and this video, I became interested in how GPU programming in Rust looks like. May I wonder if there's any chance such a video appears?

  • @testuser6429
    @testuser6429 6 months ago

    I was waiting for you to add the timing APIs and benchmark the code for CPU and GPU runtimes 😢😢 but great tutorial anyway

  • @petrosros
    @petrosros 8 months ago

    What is the variance between the two in terms of accuracy, without stressing the system, and time not being a factor? Are results different?

  • @godknowsgatsi9246
    @godknowsgatsi9246 11 months ago

    An amazing tutorial but ,, 1 question,, after running does it free the allocated memory itself or we free afterwards?

  • @fabiosuriano1266
    @fabiosuriano1266 1 month ago

    you forgot to sync after the call to the kernel! Wait the GPU to finish before to copy from device to host the result :)
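A note on the synchronization point, as far as I understand the CUDA runtime: a plain blocking `cudaMemcpy` issued on the default stream is ordered after a kernel launched earlier on that stream, so the result copy in a simple program like the one in the video is normally safe without an extra sync. An explicit `cudaDeviceSynchronize()` is still useful for catching kernel errors, for timing, and when the host reads managed memory directly. The sketch below shows the usual check after a launch; `fill` and `launch_checked` are names made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int *x, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = value;
}

// Hypothetical helper: launches with the given block size and reports any error.
bool launch_checked(int threads_per_block) {
    const int n = 4096;
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return false;

    int blocks = (n + threads_per_block - 1) / threads_per_block;
    fill<<<blocks, threads_per_block>>>(d, n, 7);

    cudaError_t err = cudaGetLastError();                   // errors from the launch itself
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // errors raised while running
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess;
}

int main() {
    printf("256 threads per block: %s\n", launch_checked(256) ? "ok" : "error");
    // Far above the per-block thread limit, so this launch is expected to be rejected.
    printf("1M threads per block: %s\n", launch_checked(1 << 20) ? "ok" : "error");
    return 0;
}
```

Without the check, the second launch would fail silently and the program would carry on with whatever happened to be in the buffer.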

  • @Yupppi
    @Yupppi 7 months ago

    What does the size of array/size of int do? I've seen it in C++ demonstration of array being referenced to as a whole pointer instead of slot per slot.

  • @Chriss120
    @Chriss120 7 months ago

    would have been cool to compare the timing on some larger data sets. like 1 million or so.

  • @donutwarior200
    @donutwarior200 7 months ago

    goated opening

  • @BradyGiacopelli
    @BradyGiacopelli 6 months ago

    would have loved to see the difference in time by running it on GPU vs CPU

  • @Adrian.Rengle
    @Adrian.Rengle 1 year ago

    Hi ! A classical C++ question, zero knowledge of GPU programming ! After cudaMalloc, shouldn't there be a sort of cudaFree ? What happens with the GPU memory ? Thank you for the comments and for the video !
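On the `cudaFree` question: memory obtained with `cudaMalloc` stays allocated until it is released with `cudaFree` or the process exits, at which point the driver reclaims it. A short demo can get away without freeing, but a long-running program should pair every allocation with a free. The sketch below uses `cudaMemGetInfo` to watch the free memory change; the helper names and the 256 MiB size are arbitrary choices for illustration, and the reported numbers can shift if other programs are using the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the free device memory in bytes (0 on error).
size_t free_device_bytes() {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) return 0;
    return free_b;
}

// Allocates the given number of bytes on the device, then releases them again.
bool alloc_then_free(size_t bytes) {
    void *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    printf("while allocated: %zu MiB free\n", free_device_bytes() >> 20);
    return cudaFree(d) == cudaSuccess;   // hand the memory back to the driver
}

int main() {
    printf("before:          %zu MiB free\n", free_device_bytes() >> 20);
    if (!alloc_then_free(size_t(256) << 20)) {
        printf("allocation or free failed\n");
        return 1;
    }
    printf("after cudaFree:  %zu MiB free\n", free_device_bytes() >> 20);
    return 0;
}
```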

  • @50Kvful
    @50Kvful 8 months ago

    Inspiring

  • @unrealpaulo7857
    @unrealpaulo7857 1 year ago

    thanks !

  • @robertpearson8546
    @robertpearson8546 4 months ago

    I am fascinated that the two functions cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are invoked by the same function cudaMemcpy with a parameter to determine which function to use instead of directly calling the function. AWKWARD!

  • @user-oh4wd9vk4s
    @user-oh4wd9vk4s 10 months ago

    love this

  • @ragtop63
    @ragtop63 4 months ago

    I wonder if the CUDA framework is available for use in C#? I don't know C++ and I really don't want to spend years learning how to "properly" create C++ apps.

  • @tansakdevilhunter9462
    @tansakdevilhunter9462 1 year ago +2

    Is there any difference between compute shader and this cuda programming?

  • @holthuizenoemoet591
    @holthuizenoemoet591 1 year ago

    Could you also do the same video for ROCM?

  • @richardcasey4439
    @richardcasey4439 11 months ago

    Within blocks, there are also warps: 32 threads per warp

  • @kakyoindonut3213
    @kakyoindonut3213 1 year ago

    what a cool technology that I could use but couldn't afford

  • @maximinmaster7511
    @maximinmaster7511 2 years ago +1

    Hello, thank you for this video. Question : what is the limit of thread in block 1 ?
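The per-block thread limit (and the warp size mentioned in another comment) can be read from the device at runtime rather than memorized. On current NVIDIA GPUs the values are typically 1024 threads per block and 32 threads per warp, but the sketch below queries them instead of assuming; `query_limits` is a made-up wrapper around `cudaGetDeviceProperties`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fills prop with the properties of the given device.
bool query_limits(int device, cudaDeviceProp *prop) {
    return cudaGetDeviceProperties(prop, device) == cudaSuccess;
}

int main() {
    cudaDeviceProp prop;
    if (!query_limits(0, &prop)) {
        printf("no usable CUDA device found\n");
        return 1;
    }
    printf("device:                %s\n", prop.name);
    printf("warp size:             %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

A launch such as `kernel<<<1, N>>>` only works while N stays at or below the reported maximum threads per block; beyond that, the work has to be spread over several blocks.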

  • @AlessandroContrino
    @AlessandroContrino 1 year ago

    Thanks

  • @Vextrove
    @Vextrove 1 year ago

    I have no clue what he is talking about but it is interesting

  • @jooseptavits9456
    @jooseptavits9456 11 months ago

    Under what conditions would you use more than one grid?

  • @MightyElemental
    @MightyElemental 1 year ago

    I've been trying to use unified memory and it's simply not working properly 💀
    I'm doing something wrong but I have all the steps required and I tested them on a separate program that works just fine. Only reason I wanted to use unified memory is that you don't have to have separate pointers for cpu and CUDA which makes it much easier to read.

  • @peter.b
    @peter.b 4 months ago

    When you write “sizeof(int)” what is this referencing? I’m not at all familiar with this language, but I am assuming that’s references something that’s been determined previously?

    • @1kvolt1978
      @1kvolt1978 19 days ago

      The compiler replaces it with the actual value at compile time. The value in this case is the size in bytes of the int type, which is one of the basic built-in types in C.