FASTER Ray Tracing with Multithreading // Ray Tracing series

  • Published 24 Jan 2025

COMMENTS • 146

  • @TheCherno
    @TheCherno  2 years ago +13

    Thank you all for watching! If you want to contribute to the optimization discussion, check out the GitHub issue here ► github.com/TheCherno/RayTracing/issues/6
    Also check out Brilliant to learn all the math you need for this series! Get started for free, and hurry: the first 200 people get 20% off an annual premium subscription ► brilliant.org/TheCherno/

    • @Theawesomeking4444
      @Theawesomeking4444 2 years ago +1

      Can you please do a Morton Z order in C++ tutorial next? I feel that would be nice to learn considering graphics use it a lot.

    • @ivanivenskii6942
      @ivanivenskii6942 2 years ago

      Hello, do you know Russian?

    • @ivanivenskii6942
      @ivanivenskii6942 2 years ago +1

      @@dav1dsm1th Glory to the hero

    • @nathans_codes
      @nathans_codes 2 years ago

      can you take a look at the issues and PR's on the walnut repo?
      It has some serious problems right now

  • @FabricioSTH
    @FabricioSTH 2 years ago +16

    Maybe a Matt Parker phenomenon erupts from the internets and we get a 40,832,277,770% improvement. Or maybe not, 'cause we are not starting with Python.

    • @matthewparker9276
      @matthewparker9276 2 years ago +1

      Probably not that much. It's not like the baseline was 1 month to render a frame.

  • @blackbriarmead1966
    @blackbriarmead1966 2 years ago +40

    This video seems made for me. I was on a huge time crunch so I had to implement a ray tracer with reflections, BVH, etc., in about 36 hours total. It took a lot of coffee but I got it done. It's reasonably performant, but I rendered a similar scene using Cycles in Blender and it is simply so much faster. What takes Blender seconds takes me minutes, even with multithreading, and I don't have "fancy" features such as texture mapping running yet.

    • @blackbriarmead1966
      @blackbriarmead1966 2 years ago +1

      the way I'm currently doing it is by using a library called CTPL, in which I push all of my future operations. I give each thread an nxn block, just like blender, and as the tasks complete ctpl deals with joining the threads and starting new threads and all of that. I have them all write to the same framebuffer which I display on the screen so you can keep track of the progress of the render

    • @blackbriarmead1966
      @blackbriarmead1966 2 years ago +4

      update: minimized size of bounding boxes in BVH by using surface area heuristic, made it 50% faster

    • @Fragtex_CN
      @Fragtex_CN 2 years ago +1

      Hey bro. If it's possible, may I have a link to your repository to learn something from it? or2

    • @Alkanen
      @Alkanen 2 years ago

      @@blackbriarmead1966 simply picking the two objects that create the bounding box with the smallest surface area to combine?
      Do you loop through all your objects to find the absolute smallest surface area, or do you take a more stochastic approach by sampling the objects and picking the smallest area from the objects in the sample to speed up BVH creation?

    • @blackbriarmead1966
      @blackbriarmead1966 2 years ago +1

      @@Alkanen the way I do it currently is I sort the objects by their centroids along the x, y, or z axis, depending on the depth of the bounding box. I create two bounding boxes, one starting at the triangle with the smallest value and the other starting at the triangle with the largest value, and I add triangles smaller to bigger and bigger to smaller respectively. I store the surface area of all of these potential bounding boxes, and I choose the pair of bounding boxes which minimizes the surface area heuristic. The surface area heuristic is the surface area of the bounding box times the number of children it has. The lower this heuristic, the more optimized the BVH. So you would choose the candidates that minimize this, or choose not to split the parent if none of the candidates is better than the parent itself for some reason. I use axis-aligned bounding boxes, which allow for faster intersection calculations than some other methods.
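
      The sweep described above could be sketched roughly as follows. This is not the commenter's code: the `AABB` type, `MakeBox`, and `FindSahSplit` are hypothetical names, and the sketch assumes the cost of a candidate pair is the sum of (surface area × primitive count) over the two boxes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical minimal axis-aligned bounding box for this sketch.
struct AABB {
    float min[3];
    float max[3];

    void Grow(const AABB& o) {
        for (int a = 0; a < 3; ++a) {
            min[a] = std::min(min[a], o.min[a]);
            max[a] = std::max(max[a], o.max[a]);
        }
    }
    float SurfaceArea() const {
        const float dx = max[0] - min[0], dy = max[1] - min[1], dz = max[2] - min[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
    float Centroid(int axis) const { return 0.5f * (min[axis] + max[axis]); }
};

AABB MakeBox(float x0, float y0, float z0, float x1, float y1, float z1) {
    return AABB{{x0, y0, z0}, {x1, y1, z1}};
}

// Sorts `prims` by centroid along `axis` and returns how many primitives go
// into the left child. Returns 0 when no split is cheaper than the parent.
std::size_t FindSahSplit(std::vector<AABB>& prims, int axis) {
    assert(axis >= 0 && axis < 3);
    const std::size_t n = prims.size();
    if (n < 2) return 0;

    std::sort(prims.begin(), prims.end(), [axis](const AABB& a, const AABB& b) {
        return a.Centroid(axis) < b.Centroid(axis);
    });

    // Sweep from the largest centroid down: rightArea[i] bounds prims[i..n-1].
    std::vector<float> rightArea(n);
    AABB right = prims[n - 1];
    for (std::size_t i = n; i-- > 0;) {
        right.Grow(prims[i]);
        rightArea[i] = right.SurfaceArea();
    }

    // Cost of not splitting: area of the box around everything times n.
    float bestCost = rightArea[0] * static_cast<float>(n);
    std::size_t bestSplit = 0;

    // Sweep from the smallest centroid up, growing the left box one primitive at a time.
    AABB left = prims[0];
    for (std::size_t i = 1; i < n; ++i) {
        left.Grow(prims[i - 1]);
        const float cost = left.SurfaceArea() * static_cast<float>(i)
                         + rightArea[i] * static_cast<float>(n - i);
        if (cost < bestCost) {
            bestCost = cost;
            bestSplit = i;
        }
    }
    return bestSplit;
}
```

      With two tight clusters of boxes far apart along x, the cheapest split falls between the clusters; with a single primitive the function reports that no split is worthwhile.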

  • @Theawesomeking4444
    @Theawesomeking4444 1 year ago +2

    4:10 That's actually wrong: GPUs don't have thousands of cores. What they have is bigger SIMD widths, usually 64-256. CPUs also have SIMD widths of 8-16, so you can actually turn your CPU into a GPU if you are willing to vectorize or use intrinsics.

  • @bishboria
    @bishboria 2 years ago +10

    In my own version of this, I initially tried grouping a chunk of rows per thread and got good improvements. But then I noticed that certain blocks would take longer to run if there was a lot going on in that part of the image, so you'd have 1 thread working alone when all the others were finished. I ended up using a thread pool and allocating each thread in the pool to work on one pixel; once that pixel was calculated, the thread would go back in the pool and pick up the next pixel to work on. This worked very well and keeps maxing out the CPU until there are fewer pixels left to calculate than cores available to work on them.
    I'd love to change the code to work on the GPU, and I did try for a while to get Metal to work but just couldn't work it out…
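
    A minimal sketch of that "grab the next pixel when you are free" idea, assuming a fixed set of `std::thread` workers and a shared atomic counter rather than the commenter's actual thread pool. `RenderDynamic` and the `shade` callback are hypothetical names standing in for the real renderer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Shades a width x height image with `workerCount` threads. Each worker
// repeatedly claims the next unprocessed pixel index from a shared counter,
// so expensive pixels do not leave other workers idle.
std::vector<std::uint32_t> RenderDynamic(
    std::size_t width, std::size_t height, unsigned workerCount,
    const std::function<std::uint32_t(std::size_t, std::size_t)>& shade)
{
    std::vector<std::uint32_t> image(width * height);
    std::atomic<std::size_t> next{0};
    if (workerCount == 0) workerCount = 1;

    auto worker = [&]() {
        for (;;) {
            // fetch_add hands every index to exactly one worker.
            const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= image.size()) break;
            image[i] = shade(i % width, i / width); // each pixel is written once
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < workerCount; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join(); // join also publishes the writes to the caller
    return image;
}
```

    Claiming one pixel at a time keeps the load balanced but touches the shared counter once per pixel; claiming a row or a small tile per `fetch_add` is a common compromise if that counter shows up in a profile.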

    • @bunpasi
      @bunpasi 2 years ago +1

      Good point. Have you tried interlacing the rows? So if you have 8 hyperthreads, you skip 7 rows. It's probably going to be better divided.

    • @bishboria
      @bishboria 2 years ago

      @@bunpasi I think you’d still have a similar problem as with chunking: one thread will be running the final row when all the others have finished and are now idle. The whole cpu won’t be maxed out.

    • @bunpasi
      @bunpasi 2 years ago

      @@bishboria We can simplify the problem by using an image with 3 regions. The 2 upper regions are primarily sky and take 1 ms each to process, whereas the bottom region has a lot of objects and takes 7 ms to process. With 1 thread, the image will take 9 ms. Now we look at 3 threads, so ideally it will take 9/3 = 3 ms.
      Scenario 1:
      We use chunks. Threads 1 and 2 will be done in 1 ms, but thread 3 will take 7 ms. In total it will take 7 ms.
      Scenario 2:
      We skip rows. Each thread handles a third of each region: 1/3 + 1/3 + 7/3 = 3 ms. And yes, one thread might lag a few rows behind, but with a height of 1080 px the cost is small: each row in the expensive region takes 7 / (1080 / 3) ≈ 0.02 ms, so even a thread that is 10 rows behind only adds about 0.2 ms.
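
      The two scenarios can be checked with a small cost model. This is only a sketch: `ChunkedTime`, `InterleavedTime`, and `ExampleRowCosts` are hypothetical helpers, and the model assumes every row inside a region costs the same and that threads never wait on each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Finish time (ms) when each thread gets one contiguous block of rows.
double ChunkedTime(const std::vector<double>& rowCost, std::size_t threads) {
    assert(threads > 0);
    std::vector<double> busy(threads, 0.0);
    const std::size_t rowsPerThread = (rowCost.size() + threads - 1) / threads;
    for (std::size_t row = 0; row < rowCost.size(); ++row)
        busy[row / rowsPerThread] += rowCost[row];
    return *std::max_element(busy.begin(), busy.end());
}

// Finish time (ms) when thread t handles rows t, t + threads, t + 2 * threads, ...
double InterleavedTime(const std::vector<double>& rowCost, std::size_t threads) {
    assert(threads > 0);
    std::vector<double> busy(threads, 0.0);
    for (std::size_t row = 0; row < rowCost.size(); ++row)
        busy[row % threads] += rowCost[row];
    return *std::max_element(busy.begin(), busy.end());
}

// 1080 rows: two cheap regions of 360 rows costing 1 ms each in total,
// followed by one expensive region of 360 rows costing 7 ms in total.
std::vector<double> ExampleRowCosts() {
    std::vector<double> cost;
    cost.insert(cost.end(), 720, 1.0 / 360.0);
    cost.insert(cost.end(), 360, 7.0 / 360.0);
    return cost;
}
```

      For the example image and 3 threads, the chunked split finishes when the slowest thread does, at about 7 ms, while the interleaved split gives every thread about 3 ms of work.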

    • @bishboria
      @bishboria 2 years ago

      @@bunpasi yes I understood you originally. If you prefer to do it that way go ahead. For now, while I still need to work out how to convert to gpu based computation, I prefer the thread pool as I want as much of the cpu maxed out as I can for as long as possible.

    • @bunpasi
      @bunpasi 2 years ago

      @@bishboria Because in a gaming engine there are a lot more things you might want to do simultaneously, a thread pool (with event queue) might be the best solution indeed. Good luck!

  • @Kazyek
    @Kazyek 2 years ago +4

    Isn't `std::execution::par` enforcing sequenced execution within each thread, which is not required here? I believe simply switching to `std::execution::par_unseq` would be an instant speedup.
    But ultimately, thread creation has an overhead, and creating exactly as many threads as there are logical cores and distributing the work would be faster.
    But then again, not all threads would have the same amount of work since some pixels take longer than others, so to fully saturate all threads for the whole frame it would be better to use a work-stealing thread pool.
    However, exactly N threads (N: number of logical cores) might still be faster even if not perfectly balanced, if you have distinct tiles with thread-local data for them for better cache locality...

  • @srisairayapudi6074
    @srisairayapudi6074 2 years ago +1

    YO BELATED HAPPY BDAY MAN! Wish i came sooner, would have wished you on the day :( HAVE A GOOD ONE EVERYDAY

  • @ChrisM541
    @ChrisM541 2 years ago +2

    Excellent challenge, cheers Cherno! Loving this series.
    There's a lot of optimisation possible here - 2x faster (around 60ms/16.6fps to 30ms/33.3fps) is some way below what we'd expect from fully independent worker units (check: are they? include worker timer and look for normal/abnormal timing distribution), all this assuming maximum threads isn't set to 2, of course ;)
    I'd also be checking the thread allocation process (hint: another, more 'direct' way?), and making sure the work is 100% optimally split up, and 100% optimally allocated to the maximum threads returned from hardware_concurrency() (though historically not 100% guaranteed to work (return 0), don't know if it's now fixed...been a while for me).
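
    On the `hardware_concurrency()` point: the standard describes the returned value as a hint and allows it to be 0 when the count cannot be determined, so a guard is cheap insurance. A minimal sketch, where `WorkerCount` and its `fallback` default of 4 are hypothetical choices rather than anything from the video:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Picks how many worker threads to start for `taskCount` independent tasks.
// Never returns 0 and never returns more workers than there are tasks.
unsigned WorkerCount(std::size_t taskCount, unsigned fallback = 4) {
    unsigned hw = std::thread::hardware_concurrency(); // hint only; 0 means "unknown"
    if (hw == 0) hw = fallback;
    if (hw == 0) hw = 1;            // guard against a caller passing fallback = 0
    if (taskCount == 0) return 1;   // keep one worker so callers need no special case
    return static_cast<unsigned>(std::min<std::size_t>(hw, taskCount));
}
```

    Calling `WorkerCount(imageHeight)` once at startup and reusing the result avoids asking for more threads than the machine, or the image, can keep busy.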

    • @zvxcvxcz
      @zvxcvxcz 2 years ago +1

      Might be from using so many more threads than there are cores. Probably should really restrict to the number of threads the hardware can actually use and have a proper thread queue.

  • @marcotroster8247
    @marcotroster8247 2 years ago +10

    Superior performance can also be achieved with techniques other than multi-threading. In fact, threading can actually be slower when the synchronization effort outweighs the performance gains (see Amdahl's law).
    First, notice that CPUs already have parallelism built into their instruction pipeline. Fetching / decoding / executing / writing back results can be performed in parallel for successive instructions if they don't depend on each other's results. Rearranging instructions in assembly can have crazy gains (but with C/C++ we usually don't dig that deep).
    Second, there are dedicated SIMD instruction sets on modern CPUs that can perform the same operation on multiple inputs at once (256 / 512 bit wide registers) to increase data throughput (e.g. 8 or 16 float ops at once).
    Third, avoiding allocation can save lots of compute, too. Preprocessing data only once upfront is very nice. And having smaller stack frames to allocate / destroy is also important. Using some static, rewritable cache memory that's owned by one thread can really help performance because the stack frames get smaller (at the cost of non-thread-safe code).
    And last, there are different CPU cache levels with far lower access latency than main memory (on the order of 100x for the fastest level). So fitting all the memory in a faster cache and constantly reusing it will skyrocket the performance. CPUs have great latency once the data is loaded into a register. Small and simple is fast.
    Maybe this inspires some devs here to write faster programs. Cheers, have fun optimizing 🤓👨🏻‍💻🏎️

  • @jfgh900
    @jfgh900 2 years ago +8

    I really appreciate this! I've always wondered how multithreading is implemented but always got stuck in the syntax. Are there any plans on showing how to set up rendering on a graphics card?

  • @manuntn08
    @manuntn08 2 years ago +2

    Thank you very much for the effort you put into this video. I've learnt a lot from your tips.
    Could you please make some videos about how to optimize the case where the computation for one pixel depends on the pixels around it (for example: convolution, Gaussian filtering...)?
    Once again, thank you and have a nice year!

  • @theonetribble5867
    @theonetribble5867 2 years ago +5

    Hey, thanks for the series. The first video kick-started my learning process about path tracing. In my opinion the series was a little slow and I was eager to outpace it. So I wrote a Vulkan path tracer in Rust and learned most things by doing them. Now I'm writing my bachelor's thesis about differentiable path tracing. Btw, Mitsuba3 is a great tool for learning about path tracing as well, especially if you don't want to deal with C++. Anyway, thanks for the inspiration.

    • @edu_rinaldi
      @edu_rinaldi 2 years ago

      Any suggested source for learning Vulkan raytracing extension ? (and maybe also Vulkan in general) Thanks in advance :)

    • @Pedro-jj7gp
      @Pedro-jj7gp 2 years ago

      I'm also interested in hearing about resources to learn Vulkan and path tracing. I might even try and learn Rust while I'm at it! :)

    • @theonetribble5867
      @theonetribble5867 2 years ago

      @@edu_rinaldi Hi, I replied to @Pedro. I hope you get the notification.

    • @theonetribble5867
      @theonetribble5867 2 years ago +2

      @@Pedro-jj7gp Hi, sorry for taking so long to reply. It seems that YouTube doesn't allow me to paste links but didn't warn me (if you can't find the resources, contact me directly if that's possible on YT). There are some resources I used to learn Vulkan, though I still don't quite understand it (I used screen-13, a Vulkan abstraction layer in Rust). First of all, there is the Vulkan tutorial, which helped a lot.
      I can also recommend the Vulkan lecture series from "Computer Graphics at TU Wien".
      Specifically for ray tracing, there are some blog entries from the Khronos Group explaining the high-level layout. For more detail there is a tutorial from NVIDIA which uses the KHR extension (note there are, I think, two extensions for Vulkan ray tracing: KHR and one from NVIDIA; the KHR extension also works on AMD GPUs). If you want to learn more about path tracing in general, there is also the Rendering lecture from CG at TU Wien (that's where I learned the most about path tracing). In general, if you want to know things about such topics, I can recommend looking at lectures from universities. Many European universities put their lectures online, but MIT also has some stuff under "Open Course Ware". I can also highly recommend the paper from Eric Veach if you want more mathematical background, but it's a very long paper and I mostly use it for reference.

    • @edu_rinaldi
      @edu_rinaldi 2 years ago

      @@theonetribble5867 Thank you so much! ❤️

  • @Alkanen
    @Alkanen 2 years ago +4

    Messing around trying to optimise the code a bit, I noticed that your implementation of Random::InUnitSphere() is wrong. It's biased towards values in the directions of the corners of the unit box surrounding the sphere (because it draws a sample from the unit box and then normalizes that sample to fit on the surface of a sphere).
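
    One common way to avoid that bias is rejection sampling: draw a point in the cube, throw it away if it falls outside the unit sphere, and only then normalize. The sketch below is not the repository's code; `Vec3` and `RandomUnitVector` are hypothetical names, and it uses `std::mt19937` instead of the project's own random class.

```cpp
#include <array>
#include <cmath>
#include <random>

using Vec3 = std::array<float, 3>;

// Returns a direction uniformly distributed on the unit sphere.
// Points outside the sphere (the cube's corners) are rejected before
// normalizing, so no direction is over-represented.
Vec3 RandomUnitVector(std::mt19937& rng) {
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    for (;;) {
        const Vec3 p{dist(rng), dist(rng), dist(rng)};
        const float lenSq = p[0] * p[0] + p[1] * p[1] + p[2] * p[2];
        // Reject points outside the sphere and points too close to the
        // origin to normalize safely.
        if (lenSq > 1.0f || lenSq < 1e-12f) continue;
        const float inv = 1.0f / std::sqrt(lenSq);
        return Vec3{p[0] * inv, p[1] * inv, p[2] * inv};
    }
}
```

    Roughly half of the candidate points are rejected (the sphere fills about 52% of the cube), so the loop runs about twice per sample on average.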

  • @peezieforestem5078
    @peezieforestem5078 2 years ago +1

    Would you please do more episodes on various methods of multithreading? C++17 exclusive thing is nice, but I'd like to know the broadest applicable method, a method that works for C, the most optimal method, etc.

    • @Alkanen
      @Alkanen 2 years ago +3

      I suspect the most widely supported variant might be using pthreads. It's originally Unix (well, POSIX), but there are Windows-compatible implementations available if you google for a couple of minutes, and then you'll have code that works on all POSIX-compatible systems, which is pretty nice. And it's in C.
      Not too bad to work with either, if I remember correctly, but it's been a few decades (jesus, I'm getting old) since I wrote my wrapper around it, so I might be misremembering :)

    • @peezieforestem5078
      @peezieforestem5078 2 years ago +1

      @@Alkanen Thank you, mate!

  • @jeofthevirtuoussand
    @jeofthevirtuoussand 2 years ago +2

    I am not a programmer or a developer, but I am actually curious.
    Would it be possible to say to the hardware:
    "Hey, can you run ray tracing in parallel on 3 cores, but only use 60% of the cores and assign the remaining 40% to enemy AI calculations?"

  • @jumponblocker
    @jumponblocker 2 years ago

    I actually had an assignment where we made a raytracer recently. Kind of funny that I also used std::for_each which I had not heard of before. The only difference was that I just looped over 1 vector containing each pixel index rather than an inner and outer loop.

  • @ezpzgamez
    @ezpzgamez 2 years ago

    I have been following along with this series while writing in Rust instead of C++ to see how things compare. Until this episode, everything on the Rust side has been matching the C++ performance, if not somewhat better. (In comparison to the laptop, my desktop PC with an i9-9900k gets about 15 ms where the laptop gets about 60 ms for single-threaded.)
    One thing Rust suffers from here is being able to mutate simple structures in an async context. A mutex or rwlock is required to be able to do what is asked of the multithreading, unless you allocate temporary buffers (one each for the image data and the accumulation data). In an unsafe context it would be a lot easier, but unfortunately Rust lacks a lot of things for async, including some unsafe items. SyncUnsafeCell has yet to be stabilized.
    So from here on out I guess I'll stick with the single-threaded version and see how the performance goes. I would rather do that than clone two large vectors on every iteration. Just my two cents from outside of C++ :)

  • @1ups_15
    @1ups_15 9 months ago +1

    Hello, thank you for your video, it looks very useful. However, I have a problem: I have noticed that my raytracer doesn't gain any performance from applying your changes, it even gets slightly worse, and when I look at my processor usage using htop, only one of my cores is being used. I am using Linux and compiling with g++ through CMake. Are there some flags I could use to actually make it multithreaded?

  • @anime_erotika585
    @anime_erotika585 1 year ago

    7:07 I want multithreading, at my table, until tomorrow!

  • @lithium
    @lithium 2 years ago +1

    std::iota is the "fancy function" you're avoiding to generate sequences, fyi ;)
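
    For reference, a minimal sketch of filling an index vector with `std::iota` (from `<numeric>`). `MakeIndexRange` is a hypothetical helper name, and the element type is an assumption rather than whatever the video's iterator vectors use.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Builds {0, 1, 2, ..., count - 1}, e.g. one entry per row or per column.
std::vector<std::uint32_t> MakeIndexRange(std::uint32_t count) {
    std::vector<std::uint32_t> indices(count);
    std::iota(indices.begin(), indices.end(), 0u); // writes 0, 1, 2, ... in order
    return indices;
}
```

    The vector only needs to be rebuilt when the image is resized, for example `rows = MakeIndexRange(height);`.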

  • @Iuigi_t
    @Iuigi_t 1 year ago +1

    Where are the triangles?

  • @alessandrocaviola1575
    @alessandrocaviola1575 2 years ago +2

    On my raytracer I got almost perfect scaling in performance: 4x the speed the moment I multithreaded it on a 4-core CPU, so there is definitely room for improvement there.

    • @Theodorlei1
      @Theodorlei1 1 year ago

      Yeah, he got a 2.5x speedup on an 8-core machine on a parallel problem. At least 8x should be possible for him.

  • @Kaldrax
    @Kaldrax 2 years ago +8

    Interesting, I didn’t know about this one. I attended a lecture called high performance computing last semester in which we did similar things, starting with OpenMPI, then threads and in the end OpenMP. I absolutely cannot recommend OpenMPI since it’s a total nightmare. OpenMP on the other hand would simplify this code. You don’t need the iterators and I believe you can just write #pragma omp parallel for collapse(2) above the nested loops and it will achieve the same performance. 🙂
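
    A minimal sketch of that OpenMP variant, assuming a hypothetical `ShadePixel(x, y)` in place of the real per-pixel ray tracing and a flat `uint32_t` image buffer. It needs to be compiled with OpenMP enabled (for example `-fopenmp` on GCC and Clang); without it the pragma is ignored and the loops simply run serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-pixel ray tracing work.
std::uint32_t ShadePixel(int x, int y) {
    return static_cast<std::uint32_t>(x) * 31u + static_cast<std::uint32_t>(y);
}

std::vector<std::uint32_t> RenderOpenMP(int width, int height) {
    std::vector<std::uint32_t> image(static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    // collapse(2) merges both loops into one iteration space of width * height
    // pixels, which OpenMP then divides among its worker threads.
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            image[static_cast<std::size_t>(y) * width + x] = ShadePixel(x, y);

    return image;
}
```

    Every iteration writes a different element, so no locking is needed. If some pixels are much more expensive than others, adding a `schedule(dynamic)` clause is worth trying.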

    • @unknownunknown6531
      @unknownunknown6531 2 years ago +3

      OpenMPI does not address the same problem, it is used to distribute a task on multiple computers (a cluster) rather than only one, hence the additional complexity :). OpenMP is the tool to use in this case indeed !

    • @psychoinferno4227
      @psychoinferno4227 2 years ago

      As an exercise, you should run a profiler and understand why it's only 2x faster on an 8 core machine.

    • @peezieforestem5078
      @peezieforestem5078 2 years ago

      I did some testing with OpenMP and my code started working slower... not sure why this happens, I made sure to parallelize the independent loops.

    • @zvxcvxcz
      @zvxcvxcz 2 years ago +1

      Yup, iterators are gross, OpenMP is way nicer (suck it C++ committee).

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      @@peezieforestem5078 Slower than what was done in the video or slower than the code was before? You shouldn't really do even what he did in the video. In either case, creating way more threads than you actually have the hardware for can cause a lot of contention and cache misses and actually slow things down sometimes. He has 8x the hardware threads and was only getting like 2x the performance... not exactly ideal. What you should really do is create just 8-16 threads when you have 8 physical cores and have a thread queue so they pick up a new task each time they finish a pixel until there are no pixels left.

  • @CreativeOven
    @CreativeOven 2 years ago +1

    Dude make us a chapter someday showing you programming in Cpp to get at your level .. ( idea ) , because some of us we are super in stone age in cpp

  • @ovi1326
    @ovi1326 2 years ago +1

    Allocating a vector of numbers going from 0 to width and height made me very sad, although I get that this is for the sake of simplicity.
    For anyone interested though, here are some tips:
    A more proper way to go about this would be to either implement a custom range iterator (look up legacy iterator on cppreference) or use std::ranges::iota_view, which is roughly equivalent to Python's `range()` or Rust's `x..y` thingy.
    You can also just avoid using parallel for_each and instead split the work across multiple threads by giving each of them responsibility for an equally sized range of scanlines. This is pretty straightforward to implement and should yield good enough performance.
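
    The scanline-splitting idea could be sketched like this, assuming plain `std::thread` workers that are created for the call and joined at the end. `ForEachRowStatic` and the `renderRow` callback are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Calls renderRow(y) for every y in [0, height), giving each thread one
// contiguous, (almost) equally sized block of scanlines.
template <typename RowFn>
void ForEachRowStatic(std::size_t height, unsigned threadCount, RowFn renderRow) {
    threadCount = std::max(1u, threadCount);
    const std::size_t rowsPerThread = (height + threadCount - 1) / threadCount; // round up

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(height, t * rowsPerThread);
        const std::size_t end = std::min(height, begin + rowsPerThread);
        if (begin == end) break; // more threads than rows: nothing left to hand out
        // renderRow is captured by reference; it outlives the threads because
        // they are all joined before this function returns.
        threads.emplace_back([begin, end, &renderRow]() {
            for (std::size_t y = begin; y < end; ++y) renderRow(y);
        });
    }
    for (auto& th : threads) th.join();
}
```

    The callback has to be safe to call from several threads at once, which is the case when each row only writes its own pixels.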

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      Not just "good enough," but better because there will likely be less cache contention and less thread creation overhead.

    • @ovi1326
      @ovi1326 2 years ago

      @@zvxcvxcz I meant that there are better methods than simply splitting work by rows, ie. someone in the comments mentioned using a thread pool to saturate the cpu which sounds kinda cool

  • @ivansanz4029
    @ivansanz4029 2 years ago

    If instead of having each thread do a row you make them do a column, the performance is even better as the "sky" is very cheap to process and the real complex part (the "ground") is distributed better across threads.

    • @ZeroUm_
      @ZeroUm_ 2 years ago

      It probably won't do much, if 20% of a scene is sky, with 1080 lines you still have 216 lines to go divided by a much smaller number of threads. With 8 threads, that's still 27 passes, enough to saturate them equally.

    • @ivansanz4029
      @ivansanz4029 2 years ago

      @@ZeroUm_ Yeah I was forward-thinking to when he will use the GPU cores :D

  • @thebasicmaterialsproject1892
    @thebasicmaterialsproject1892 2 years ago

    go on the cherno still killing it

  • @helmuthpetelin4613
    @helmuthpetelin4613 2 years ago

    Hey, do you have plans to show how to move the ray tracing to the GPU?

  • @sshawarma
    @sshawarma 2 years ago +1

    Awesome video as always!
    Why was the program not running 8x faster? Only thing I can think of is an IO bottleneck.

    • @psychoinferno4227
      @psychoinferno4227 2 years ago +1

      Run a profiler and you'll find a different answer. If you want to spoil the fun see the responses in the Github discussion.

  • @eduardoassis2826
    @eduardoassis2826 1 year ago

    Hey, how do you draw over your current window during explanations? I've been curious for a long time now and can't help but ask :).

  • @thomasavino3450
    @thomasavino3450 2 years ago

    What theme/color scheme are you using? (the default visual assist is not like this)

  • @gustavbw
    @gustavbw 1 year ago

    Wouldn't allocating the threads on every std::for_each() be highly inefficient compared to pre-allocating the pool when the program starts?

    • @dmitrysapelnikov
      @dmitrysapelnikov 1 year ago

      In fact the c++ runtime uses an internal thread pool for parallel for_each(). But AFAIK there is no way for the user to explicitly control this pool.

  • @kelvinpoetra
    @kelvinpoetra 2 years ago

    Hello Cherno, I want to ask how to make graphics software and software such as Microsoft Word. Are the basic stages of making software the same for all of them?

  • @vasile2321
    @vasile2321 1 year ago

    What RTX do you have on your pc? Thx

  • @HandsomeLukeMan
    @HandsomeLukeMan 2 years ago

    Love the red you've done with your syntax highlighting. How did you do this? I don't see an option for keywords like const and for and if in VA settings? Curious what value of red that is.

    • @davidrobinson8523
      @davidrobinson8523 2 years ago

      It's from a third-party paid extension, Visual Assist. And yes, it is so much better than the defaults.

    • @HandsomeLukeMan
      @HandsomeLukeMan 2 years ago

      @@davidrobinson8523 Yeah, I've got VA but curious what he did to modify his theme. I do not know how I would live without VA now that I've used it for so long.

  • @CreativeOven
    @CreativeOven 2 years ago

    Comment 10 10 out of 10 : D, How is hazel ? I see it is not all about drawing that open GL 3d lines right for those vertices? : P

  • @ChaoticFlounder
    @ChaoticFlounder 2 years ago

    how difficult would it be to implement the RayTracing calculations on the integrated graphics on your cpu?

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      "It depends," is the unfortunate answer there. It depends on just what types you're using, what the driver for that GPU exposes, whether it supports the necessary extensions, etc... Maybe you can drop it on there with CUDA or OpenCL, or maybe you can even wrangle the regular display part of the driver into giving you what you need with OpenGL or DirectX, etc... Often laptop manufacturers have not been great about switching these GPUs (sometimes if you're primarily on the discrete card, the integrated one can be almost totally deactivated, or vice versa). Sometimes that is seen as a plus, since it dealt with battery concerns.

  • @andrewporter1868
    @andrewporter1868 2 years ago

    Multi-threading is also a mistake. It's a failure to defer parallel computing to the programmer. Instead of providing an asynchronous master-slave universal scheduler system and then on top of that the ability to do cheap software scheduling by providing a simple custom scheduler that can use the exact same code (it's asynchronous, so you just insert the scheduler code at some point in the future on one of your existing execution pathways), we got this pile of garbage that requires us to add all this overhead by synchronizing everything and it's just this massive headache where you can't just write parallel code but you have to think about synchronization too, and if you think too hard, you get a synchronization bug that you spend the afternoon fixing instead of fixing your actual code that's supposed to be part of the design that you're implementing, not a standard library feature that's missing from every language and imposed on us by all major operating systems.

  • @mackerel987
    @mackerel987 2 years ago

    Hey guys. Does anyone get the "no instance of overloaded function:"std::for_each" matches the arguments list " error? Afaik we only need to include the execution header for it to work. Am I missing something?

    • @simonmaracine4721
      @simonmaracine4721 2 years ago +3

      Make sure you compile with C++17 flag or newer, and your compiler supports C++17.

    • @mackerel987
      @mackerel987 2 years ago

      @@simonmaracine4721 exactly what was wrong. thank you.

  • @gabrieldesimone4644
    @gabrieldesimone4644 2 years ago

    Hey there, I'm not familiar with C# or game making stuff but I was wondering that code is running on CPU cores, how do you make it use GPU cores instead?

    • @Alkanen
      @Alkanen 2 years ago +1

      That's coming in a future episode

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      3 main options 1) wrangle your GPU into doing so by sort of telling it that it is doing normal math for output using OpenGL/DirectX/etc... 2) use OpenCL 3) use CUDA.

  • @steellung
    @steellung 2 years ago

    Does anyone know which software he uses for drawing on the screen on the fly?

    • @rastaarmando7058
      @rastaarmando7058 2 years ago +1

      It looks very similar to gInk.

    • @steellung
      @steellung 2 years ago

      @@rastaarmando7058 cool, didn't know this one. Thanks

    • @erikrl2
      @erikrl2 2 years ago +1

      He uses ZoomIt

  • @rckeet
    @rckeet 2 years ago +1

    oh yesssssss!!😎

  • @CP-sr6ml
    @CP-sr6ml 1 year ago

    Don't get me wrong, your content is great, but... Why are we bothering with multithreading if we could just move to the GPU? I don't understand why you keep building and even optimizing like this on the CPU side now. Won't that just make it harder/more work to move to the GPU?

  • @ng.h9315
    @ng.h9315 2 years ago

    Wonderful courses👌, but please continue the "Create Game engine in cpp" course add 3d game development option build for Android , ios ,,,
    Please teach us how to create a game engine like unreal Engine 😀.
    Im waiting for your answer......
    Thanks for all of things Cherno ♥️

  • @larryfulkerson4505
    @larryfulkerson4505 4 months ago

    I like to write code by the principle of least astonishment.

  • @MorebitsUK
    @MorebitsUK 2 years ago +1

    Nice!! Always good content Cherno. Any Idea on how to use IntStream in Java to parallelize stuff.
    FYI I'm using `map`; not `for_each`.

    • @wuangg
      @wuangg 2 years ago

      Use IntStream.parallel() to get a parallel IntStream, and after that use forEach() to perform an action on each element of the stream in parallel. It will use all available processors to do the job.
      For example:
      IntStream stream = IntStream.range(1, 10); // sequential ordered IntStream of 1 to 9 (the upper bound is exclusive)
      stream.parallel().forEach(i -> {
          // do stuff to element 'i' here
      }); // performs the action on each element, spread across multiple threads
      This is equivalent to C++ std::for_each with a parallel execution policy, which is what is shown in this video.

    • @MorebitsUK
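      @MorebitsUK 2 years ago

      @@wuangg Thanks for the reply, but I'm using map, not forEach.
      I just need to return something from the map.
      // needs: import java.util.stream.Collectors; import java.util.stream.IntStream;
      // IntStream.map must return an int, so mapToObj is used to produce Strings.
      String[] results = IntStream.range(0, imageHeight).parallel().mapToObj(i -> { // y value
          return IntStream.range(0, imageWidth).mapToObj(j -> { // x value
              Vec3 pixelColour = new Vec3(0, 0, 0);
              // u follows the x index (j) and v follows the y index (i)
              float u = (j + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageWidth - 1);
              float v = (i + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageHeight - 1);
              final Ray rayP = camera.getRay(u, v);
              pixelColour.addEquals(rayColor(rayP, finalWorld, maxDepth));
              return PPM.vectorToRGB(pixelColour, 1); // one pixel as text
          }).collect(Collectors.joining(System.lineSeparator())); // one row as text
      }).toArray(String[]::new); // rows stay in order even though they run in parallel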

  • @stinkybeam
    @stinkybeam 1 year ago

    I know nothing of programing and coding, watch this video remind me of high school math class. I think I understand but actually I don't

  • @hymen0callis
    @hymen0callis 1 year ago

    Unfortunately, std::for_each() is not very "efficient". Apparently, you got a speedup of only about 2, while I (using the exact same parallelization scheme) got a speedup of 5.5 (I only have 8 logical cores) by using PPL's Concurrency::parallel_for() instead. It's not portable code, but if it is almost 3 times faster, I'll go with Microsoft's PPL.
    Edit: just watched the next video where you fixed your global RNG. In my code, the RNG was already thread_local, which explains the much higher speedup in my example. So, I guess std::for_each() isn't that slow after all.
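
    For readers wondering what a `thread_local` RNG looks like, here is a minimal sketch. It is neither the commenter's code nor the fix from the next video: `RandomFloat` is a hypothetical helper, and seeding from `std::random_device` is an assumption.

```cpp
#include <random>

// Returns a random float in [0, 1]. Each thread lazily creates its own engine
// and distribution on first use, so threads never share or lock RNG state.
float RandomFloat() {
    thread_local std::mt19937 engine{std::random_device{}()};
    thread_local std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    return dist(engine);
}
```

    With a single global engine, every thread either races on the same state or has to wait for the others; a per-thread engine removes that shared state.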

  • @AnalogFoundry
    @AnalogFoundry 2 years ago +2

    I wish the team at Striking Distance Studios would take notes and improve ray-tracing performance in their game called The Callisto Protocol. At the moment their CPU utilization with RT is abysmal.

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      Are they not doing their raytracing on the GPU though? Recent GPUs have hardware accelerated raytracing. I'm not at all familiar with the game or what they've done other than that it is supposed to be like a AAA title? I would expect any AAA to be using the GPU features on this (whether or not they should be).

    • @AnalogFoundry
      @AnalogFoundry 2 years ago

      @@zvxcvxcz - they are doing RT on the GPU using dedicated RT cores of AMD and NVIDIA, but building BVH and stuff is handled on the CPU. Thus RT can be very taxing even on the CPU. The problem with Callisto Protocol is that it uses very little of the CPU (i.e. not well multithreaded) even with the latest greatest multi-core CPUs which causes huge fps issues.

  • @ricbattaglia6976
    @ricbattaglia6976 1 year ago

    Isn't GPU rendering faster? Thanks

  • @JATmatic
    @JATmatic 2 years ago

    I made it much faster than the MT version here by fixing the wonky Walnut::Random code
    and removing branches from the Renderer::TraceRay() loop.
    The render runs in about 11 ms on an 8-core Ryzen 2700.

  • @stephenkamenar
    @stephenkamenar 2 years ago

    GPUs don't have 8,000 cores. they have very wide instructions. like SIMD but on massive data at the same time.
    same difference tho

  • @nenomius1148
    @nenomius1148 2 years ago

    8:50 running around updating two vectors on each window resize is much simpler than that stinky over-engineered std::views::iota from "modern" C++

    • @ovi1326
      @ovi1326 2 years ago

      yeah but like think of the cache friendliness of accessing a buffer of memory just to get the next consecutive number

    • @nenomius1148
      @nenomius1148 2 years ago

      @@ovi1326 Yeah, reading consecutive numbers from memory is much more cache-friendly than generating them on CPU in registers
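
For context on the trade-off discussed in this thread, here is a rough sketch of the "two vectors updated on resize" approach. The struct name `PixelIndexCache` and its members are hypothetical, not taken from the video's code. The vectors hold nothing but consecutive numbers; their only purpose is to give an iterator-based algorithm such as std::for_each something to walk over.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical cache of column and row indices, rebuilt only on resize.
struct PixelIndexCache
{
    std::vector<std::uint32_t> columns; // 0 .. width-1
    std::vector<std::uint32_t> rows;    // 0 .. height-1

    void Resize(std::uint32_t width, std::uint32_t height)
    {
        assert(width > 0 && height > 0);
        columns.resize(width);
        rows.resize(height);
        // Fill each vector with consecutive values starting at 0.
        std::iota(columns.begin(), columns.end(), 0u);
        std::iota(rows.begin(), rows.end(), 0u);
    }
};
```

A counting range such as std::views::iota produces the same numbers without storing them, at the cost of depending on C++20 ranges support.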

  • @Alkanen
    @Alkanen 2 years ago

    Wohoo!

  • @TheApsiiik
    @TheApsiiik 1 year ago

    It's been 2 months... where is the next episode!!!

  • @Notsorandomnumbers
    @Notsorandomnumbers 2 years ago

    anyone know of a channel similar to this but like 1 degree more amateur? I find myself having difficulty keeping up at points

  • @fanisdeli
    @fanisdeli 2 years ago +1

    Complete assumption because I'm too lazy to search it: I would think that std::for_each would be smarter than creating a thread for every single item in your iterator.
    Creating the threads would be much slower than actually running on one thread. My assumption is that it creates a few threads, depending on your hardware, and it reuses them. When one iteration is done, the same thread is used for a future iteration. That would also explain why using nested std::for_each made no difference in performance.

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      You think it's only getting 2x rather than at least ballpark 8x if it is being smart? I think the nested for makes no difference because the single loop is already that bad for resource contention (several thousand threads on 8 hardware cores... ) and thread creation that it doesn't get any worse than that.

    • @fanisdeli
      @fanisdeli 2 years ago

      @@zvxcvxcz I don't think it could possibly be creating millions (1920*1080) of threads 60+ times a second.
      Also, in programming there's no such thing as "it can't get worse" lol. If it were a thread per iteration, then without nesting you'd have 1920 threads, and with nesting you'd have over 2 million. So, yeah, that would be WAY worse for sure. Like "freeze the entire OS and blue screen" type of stuff.

  • @Jkauppa
    @Jkauppa 2 years ago

    multicore avx-512 on cpu

    • @Jkauppa
      @Jkauppa 2 years ago

      screen space dynamic baking surface light map caching

    • @Jkauppa
      @Jkauppa 2 years ago

      update the dynamically baked surface light map only when needed (when it is new or an update is required), e.g. every 4th frame: at 240fps, update the lighting at only 60fps

    • @Jkauppa
      @Jkauppa 2 years ago

      pseudo-coding is a must, so that you are not tied to a language

    • @Jkauppa
      @Jkauppa 2 years ago

      focus on programming pipeline or in the pseudo-algorithm methods

    • @Jkauppa
      @Jkauppa 2 years ago

      language specifics are so 80's :)

  • @MrMirbat
    @MrMirbat 2 years ago

    Thanks for sharing knowledge. Can you do a tutorial on how to make casino games like slot machines (Book of Ra), Texas Hold'em poker, or roulette? Thanks in advance.

  • @IshanChaudharii
    @IshanChaudharii 2 years ago

    Oh my goodness finally!!!! ❤️🥲🎉

  • @closingtheloop2593
    @closingtheloop2593 2 years ago

    Why aren't you doing this in CUDA? Or in an OpenGL fragment shader?

  • @mr.mirror1213
    @mr.mirror1213 2 years ago

    lesss gooo

  • @zvxcvxcz
    @zvxcvxcz 2 years ago

    Iterators are gross... I would rather use OpenMP.

  • @anlcangulkaya6244
    @anlcangulkaya6244 2 years ago +9

    #pragma omp parallel for

    • @psychoinferno4227
      @psychoinferno4227 2 years ago

      The performance was nearly identical to the for_each with a parallel execution policy.
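
As a rough illustration of the `#pragma omp parallel for` suggestion in this thread, the sketch below parallelizes a row loop over a pixel buffer. `ShadePixel` is a hypothetical stand-in for the per-pixel ray tracing work, and `RenderOmp` is an assumed name, not part of the video's renderer. OpenMP has to be enabled in the compiler (for example `-fopenmp` on GCC and Clang, `/openmp` on MSVC); without it the pragma is ignored and the loop simply runs on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the per-pixel work; a real renderer would trace a ray here.
inline std::uint32_t ShadePixel(int x, int y)
{
    return static_cast<std::uint32_t>(x) ^ static_cast<std::uint32_t>(y);
}

inline std::vector<std::uint32_t> RenderOmp(int width, int height)
{
    assert(width > 0 && height > 0);
    std::vector<std::uint32_t> image(static_cast<std::size_t>(width) * height);

    // Rows are handed out to the OpenMP thread team; each row is written by
    // exactly one thread, so no synchronization is needed on the buffer.
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            image[static_cast<std::size_t>(y) * width + x] = ShadePixel(x, y);

    return image;
}
```

No index vectors or iterators are needed here, since the pragma works directly on the plain integer loop.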

    • @peezieforestem5078
      @peezieforestem5078 2 years ago

      Hey, I tried OpenMP once and my code got slower. I'm not sure why, do you have any ideas?

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      @@psychoinferno4227 Yes, but with OpenMP you don't need those silly ranges; that's the advantage there. I would expect the performance to be about the same as in the video if it is done the same way. Creating thousands of threads on a machine with 8 physical cores is begging for 1) overhead due to thread creation and 2) resource contention as all those threads compete to get their task executed, so expect an increase in cache misses. The proper way to do it is to create a properly sized thread pool (most likely somewhere between 8 and 16 threads if you have 8 hardware cores) and a task queue where each pixel's processing is a task. Each thread picks up a new task when it finishes one, until there are no tasks left. I would expect something like a 5x-7.8x improvement rather than 2x. I might be wrong, but that's my naive expectation without knowing too many details about the raytracing algorithm itself. Offhand, I don't think we're being memory-bottlenecked in terms of throughput in this case, just perhaps by cache misses as the threads swap.
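
The fixed-size worker scheme described in the comment above could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the video's implementation: `RunOnWorkers` is a hypothetical name, the threads are created per call rather than kept alive in a persistent pool, and an atomic counter stands in for the task queue.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Runs task(i) for every i in [0, taskCount) on a fixed number of worker threads.
// The atomic counter plays the role of the task queue: each worker claims the
// next unclaimed index until none remain.
template <typename Task>
void RunOnWorkers(std::size_t taskCount, unsigned workerCount, Task task)
{
    assert(workerCount > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (unsigned w = 0; w < workerCount; ++w)
    {
        workers.emplace_back([&] {
            // fetch_add returns the previous value, so every index is claimed once.
            for (std::size_t i = next.fetch_add(1); i < taskCount; i = next.fetch_add(1))
                task(i);
        });
    }
    for (auto& worker : workers)
        worker.join(); // all tasks are done once every worker has exited its loop
}
```

A worker count could be taken from `std::thread::hardware_concurrency()`, keeping in mind that it may return 0 when the value cannot be determined. Whether one task per pixel or one task per row performs better depends on how expensive each pixel is relative to the cost of claiming a task.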

    • @zvxcvxcz
      @zvxcvxcz 2 years ago

      I use a sort of implied threading in Bash too, with the same model. You can't have an ancient Bash, though, because the feature to wait for any task to finish wasn't added until 4.something. So now you can start 8 commands while hundreds more wait, and each time one finishes the next starts; it's pretty sweet. Before that version of Bash you could only wait for all tasks to finish, or you had to know the exact task you were waiting for (and of course, in most cases, you can't know ahead of time what order they will finish in).

  • @irfanjames6551
    @irfanjames6551 2 years ago

    Thanks a lot
    I was really waiting for the optimisations, especially multi-threading.