Brute Force Processing

  • Published 29 Sep 2024

COMMENTS • 570

  • @dartstone238 4 years ago +420

    The SIMD conversion would be super interesting, I think. I do like the low-level part of things.

    • @Agatio17x 4 years ago +5

      I would love to watch a SIMD video too!

    • @ElGuAcA11 4 years ago +2

      Me too, I'd love to see it!

    • @byrontheobald6663 4 years ago +2

      From my own testing with other problems (N-body), you don't actually need to do much.
      If you look here: gcc.gnu.org/projects/tree-ssa/vectorization.html
      You'll see that GCC will automatically emit SIMD instructions if you compile above -O2; just make sure your data is in a vector/array or otherwise contiguous in memory. I'd imagine MSVC will do something very similar.
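
      For illustration, a minimal sketch of the kind of loop GCC's vectorizer handles well (hypothetical file and flags; -fopt-info-vec makes GCC report which loops it vectorized):

        // autovec.cpp -- build with: g++ -O3 -march=native -fopt-info-vec autovec.cpp
        #include <cstdio>
        #include <vector>

        int main()
        {
            std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);

            // A contiguous, dependency-free loop like this is a prime candidate
            // for auto-vectorization at -O3 (or with -O2 -ftree-vectorize).
            for (std::size_t i = 0; i < c.size(); i++)
                c[i] = a[i] * b[i] + 1.0f;

            std::printf("%f\n", c[0]);
        }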

    • @mileswilliams527 4 years ago +2

      Yes PLEASE!!!
      I'm at the point in my coding journey where I need to start getting the most performance I can, and I have started looking at parallelism, so this would be very interesting.

    • @ScottFCote 4 years ago +2

      I also would very much like to see this.

  • @atimholt 4 years ago

    Sounds like that thread pool at the end is crying out for coroutines.

  • @MrWorshipMe 4 years ago

    Instead of using abs on the complex number, you could have used norm to avoid that sqrt.
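
    For anyone following along, a minimal sketch of the difference (assuming the usual escape-radius-of-2 test):

      #include <complex>

      // abs() computes a square root every iteration...
      bool escaped_abs(const std::complex<double>& z)
      {
          return std::abs(z) > 2.0;
      }

      // ...while norm() is the squared magnitude, so comparing it against
      // the squared radius gives the same answer with no sqrt.
      bool escaped_norm(const std::complex<double>& z)
      {
          return std::norm(z) > 4.0;
      }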

  • @masoneyler7734 4 years ago +1

    Is AVX similar to a graphics card? Or is it the same thing? (This might be a dumb question, I'm new to programming.)

    • @javidx9 4 years ago

      Not quite Mason, AVX is built into the CPU. It's similar to a GPU but on a much, much smaller scale. With AVX you can compute a handful of things in parallel; a GPU can compute thousands (sort of).

  • @wessmall7957 4 years ago +379

    Armchair fractal geometry expert here. Was expecting to be bored, but was actually thrilled to see how your optimizations would perform. However, I'm disappointed you didn't go on a 45-minute tangent about the existential implications of the fractal nature of life, the universe, and everything.

    • @daedreaming6267 4 years ago +30

      Well you see... What you forgot to take into account is that the fractal nature of things? Really boils down into one thing. And that thing? Is 42.

    • @SpaghettiToaster 3 years ago +4

      @@daedreaming6267 What he actually forgot to take into account is that there is no such thing as the fractal nature of things, because things are not fractal in nature.

    • @Capris4EveR 3 years ago

      @@SpaghettiToaster In my point of view everything in nature is parallel (mirrored to infinity). We can't find the start since it's infinite, but we will be able to predict the future and change the lane of our evolution. In my mind a higher being developed us for God knows what purpose, and in parallel with that, someone else created them too; when our turn comes we will create others, if we didn't already (animals).

    • @Kenjuudo 3 years ago

      @@emjizone I'm pretty sure that if you were to iterate through every single color component of every single pixel in every single frame in every single video at every single video resolution as they are stored on the Google servers, you'd pretty much find the number 42 in all of them. And on the off-chance a video's video stream doesn't contain that number, there are also the audio streams to sift through... :)

    • @austinwoodall5423 2 years ago

      Fractals are so named for their FRACTionAL dimensionality. For instance, the Mandelbrot set is neither 1D nor 2D but 1.72-dimensional. Fractals are characterized by their infinite perimeter but zero area (or other applicable measure).

  • @fucku2b 4 years ago +394

    34:00 there's enough interest.

    • @Pogosoke 4 years ago +8

      Yes please!

    • @3Triskellion3 4 years ago +5

      Yes, please (or at least a blog post?)

    • @Gustvoasbezerra 4 years ago +4

      Interest there is

    • @EIYEI 4 years ago +4

      Yes please!!!!!

    • @p003872 4 years ago +3

      Ohh yes there is!

  • @dragan38765 4 years ago +174

    More SIMD/AVX videos? Yes, please. I'd love to learn more about those, I only know the basics sadly. I pray to the compiler gods that they have mercy upon my soul and vectorize it automatically.

    • @douggale5962 4 years ago +10

      In most cases, they can't auto-vectorize. Usually it's because the compiler can't be certain your pointers aren't aliasing: it thinks your stores may be changing the values of subsequent loads, so it's afraid to use SIMD, since that would change the results. You can help a lot with "restrict" pointers (__restrict in g++). The compiler can be absolutely sure there is no aliasing when you declare them restrict. Putting restrict on the signature's pointers means the function cannot be called with those pointers referring to the same memory. That restriction sets the compiler free to assume its stores won't be feeding back into loads, allowing vectorization.
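
      A minimal sketch of what that looks like (hypothetical function, not code from the video):

        // With __restrict (the g++/MSVC spelling), the compiler may assume dst
        // and src never alias, so stores to dst cannot feed back into loads
        // from src, and the loop becomes safe to vectorize.
        void scale(float* __restrict dst, const float* __restrict src,
                   int n, float k)
        {
            for (int i = 0; i < n; i++)
                dst[i] = src[i] * k;   // no aliasing -> vectorizable
        }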

    • @Bravo-oo9vd 4 years ago +1

      Instead of praying, you can profile and inspect the compiler output. It's hard to speed up your code when you don't know where the bottlenecks are, and you can help the compiler with autovectorising if you provide it with some hints. Writing code with intrinsics directly is not very maintainable; a better thing to do would be to look into parallel processing libraries that manage this complexity for you.
      Also, by looking at the output assembly you're able to confirm that other optimisations are working, such as instruction-level parallelism.

    • @GregoryTheGr8ster 4 years ago +4

      I hate to have to be the one to bring you the news, but there are no compiler gods. I SWEAR THAT I AM NOT MAKING THIS UP!

  • @glitched_code 4 years ago +86

    Very interesting, I'm rooting for the SIMD conversion video to come out :)

    • @javidx9 4 years ago +20

      thanks and noted!

  • @daklabs 4 years ago +251

    The AVX sounds interesting, I'd like to see more on it!

    • @gideonmaxmerling204 4 years ago +1

      there is a youtuber named "what's a creel" that has some AVX in asm vids and also some VCL vids which is a vector library for c++

  • @taraxacum2633 4 years ago +63

    Maybe real-time ray tracing??? 😄 I am trying to do it rn but I'd love your explanation and tricks!

    • @javidx9 4 years ago +42

      It's a popular request for sure!

  • @Omnituens 4 years ago +48

    I'm in for a dedicated video on the SIMD conversion.

  • @geotale95 4 years ago +164

    I'm a simple man. I see brute forcing and the Mandelbrot set, and I click on the video.

    • @kepeb1 4 years ago +9

      I'm a simpleton, I see overused formulaic cliche comments, and I join in.

    • @geotale95 4 years ago +1

      @@kepeb1 Yeah, sorry, I didn't see any other comments and wanted to comment quickly, and that's the best I could do :p

    • @luisgeniole369 4 years ago +2

      I think the Mandelbrot Set and fractals in general are so engaging because they're the closest thing we have to visualizing eternal cosmic inflation, and that pokes at a very powerful memory of our collective consciousness. Also the pretty colors.

    • @RogerBarraud 3 years ago

      One does not simply brute-force into Mandelbrot...

  • @antipainK 4 years ago +29

    Could you make a video about dynamically resizing variables (ones that double their length when they run out of precision)?
    That would of course require defining all the calculations and would make the program slower, but it would enable us to zoom "indefinitely".

  • @unknownuser8044 4 years ago +77

    And now, Implemented on the GPU? :)

    • @javidx9 4 years ago +54

      Well that is indeed the next phase I suppose, though I reckon there is a lot to squeeze out of the CPU yet!

    • @Gelikafkal 3 years ago +1

      @@javidx9 I think the communication between CPU and GPU over the PCI bus will eat up most of the performance a GPU implementation would give you, as long as you don't have a way to hand a device pointer directly to the graphics library/engine.

    • @mariokart6309 3 years ago +1

      @@Gelikafkal GPU compute is very much viable for anything that can take advantage of mass parallelisation, otherwise why would that functionality even exist?

    • @lucasgasparino6141 3 years ago +1

      To anyone interested: the CUDA C book has a bare bones implementation of the Julia/Mandelbrot sets, written in C. It's very intuitive to use, and the difference is GIGANTIC if your GPU is reasonably modern.

    • @TenshiSeraph7 3 years ago

      Yup, I reimplemented the Mandelbrot set using recursive CUDA, although it was not a live-updating viewer (following a tutorial).
      Indeed the transfer of data between host and device (GPU) is usually costly, but it should be possible to move fractal data directly to the frame buffer without transferring everything back to the CPU.
      There is also an optimization for parallel implementations of the Mandelbrot set which takes advantage of the fact that all points in the set are connected. You can divide the image into rectangles which are computed by different threads. First you compute only the pixels on the border of each rectangle. If none of the border pixels contain any part of the fractal, then you can skip the full calculation of the pixels inside the rectangle (assuming you are not zoomed out too far).
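
      A sketch of that rectangle check (often called Mariani-Silver; hypothetical helper names, and an approximation in practice, since the border is only sampled at pixel centres):

        #include <functional>
        #include <vector>

        // iterate(x, y) returns the escape iteration count for a pixel.
        // Because the set is connected, a rectangle whose entire border has
        // one iteration count can be filled with that count without
        // computing its interior.
        void renderRect(int x0, int y0, int x1, int y1,
                        const std::function<int(int, int)>& iterate,
                        std::vector<int>& out, int width)
        {
            int first = iterate(x0, y0);
            bool uniform = true;
            auto visit = [&](int x, int y) {
                int n = iterate(x, y);
                out[y * width + x] = n;
                if (n != first) uniform = false;
            };

            // Pass 1: border pixels only (corners get recomputed; fine for a sketch).
            for (int x = x0; x < x1; x++) { visit(x, y0); visit(x, y1 - 1); }
            for (int y = y0; y < y1; y++) { visit(x0, y); visit(x1 - 1, y); }

            // Pass 2: fill the interior cheaply, or compute it normally.
            for (int y = y0 + 1; y < y1 - 1; y++)
                for (int x = x0 + 1; x < x1 - 1; x++)
                    out[y * width + x] = uniform ? first : iterate(x, y);
        }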

  • @ScramblerUSA 4 years ago +32

    30:02 - should be "bit 255".

  • @antipainK 4 years ago +18

    I would love to see how to do the conversion to SIMD! :)
    Btw, my first C++ project was a Mandelbrot set generator, and I made most of the mistakes that you pinpointed here. Now I'm gonna rewrite it after final exams. :D

  • @TimJSwan 4 years ago +27

    For millennia, every kid:
    "meh, math seems like boring black and whit..."
    BAM
    Mandelbrot entered the chat.

  • @atrumluminarium 4 years ago +18

    Yes please, I would be interested in seeing more about the intrinsics and performance advantages. I've been programming for years and had no idea that was even a thing you can do in C/C++.

  • @Hacker-at-Large 4 years ago +13

    I’d also love to see this extended into using the GPU.

  • @undersquire 4 years ago +28

    luv ur vids very informational :) i've learned a lot
    i would love to see a series on possibly making your own basic coding language in C++

    • @javidx9 4 years ago +15

      It's certainly an interesting idea! If I find a need for such a language then it may make a good series

  • @matthewpalermo4957 4 years ago +14

    Loved the demonstration of how good the compiler optimization is! Question: Is there an abs_squared method for std complex values? That could be used instead of your implementation and still avoid the square root.

    • @johngreen1060 4 years ago +4

      That would be a dot product of the complex number against itself or multiplication of the number by a complex conjugate of itself.

    • @fredg8328 4 years ago +1

      Compilers are not always good at optimizing. They are smart about small optimizations, but generally you can do better with higher-level ones. And there are a lot of differences between compilers; the Visual Studio compiler is known to be one of the worst at optimizing. But that's the same story as whether you should use the STL or not: Visual Studio follows the standards more tightly and has to add a lot of checks to the generated code.

    • @wtk300 4 years ago +2

      std::norm(std::complex)

  • @DiamondShocked 4 years ago +50

    This is exactly the type of computation that GPUs are designed for, and you would likely see a huge performance improvement there.
    In addition, while this approach generalizes fractal computation somewhat well, there are many techniques to be explored to optimize the rendering of the Mandelbrot set specifically. For example, there is no need for the computation to run to the max iteration count for pixels which are clearly in the middle region and would iterate forever if allowed.
    There are also ways to scale the values of the floating points based on the zoom so that the issue of precision is not a limitation. This would allow you to zoom indefinitely.
    Next, it does not make sense that 32 threads should have improved performance over 8 threads on your machine, since your 4 core processor can only effectively execute instructions from 8 threads simultaneously (two instruction streams per CPU core).
    I would also say avoid busy-waiting for threads in the thread pool. Simply deschedule those threads if they do not have work to do, and make them runnable when they should run. It is perhaps possible that this is handled somehow by the OS/compiler (if a loop is seen to jump to the previous instruction and the condition is not met, the processor might dynamically see that no useful progress can be made and yield the thread to another?), although that is unlikely. While it is a rule of thumb to avoid busy-waiting at all costs on a single-core machine, there are reasons why busy-waiting may be an appropriate solution on a multi-core system: conditions can change while a thread is busy-waiting due to the behavior of other threads, and the system call to make the thread runnable can be avoided. So the tradeoffs between these options should be weighed. Also, your solution to thread synchronization (the atomic counter) can be replaced more elegantly with a barrier or another synchronization primitive.
    An interesting experiment for the approach you have taken would be to disable the optimization flags of the compiler, so that simple code optimizations such as loop unrolling or strength reduction can actually be measured.
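
    On the synchronization point, a minimal sketch of the barrier idea (using C++20's std::barrier; an assumption on my part that a per-frame rendezvous fits the video's loop):

      #include <barrier>
      #include <thread>
      #include <vector>

      int main()
      {
          const int workers = 4;
          std::barrier frameDone(workers + 1);   // the workers plus the main thread

          std::vector<std::jthread> pool;
          for (int t = 0; t < workers; t++)
              pool.emplace_back([&] {
                  // ...render this worker's section of the frame here...
                  frameDone.arrive_and_wait();   // block until everyone arrives
              });

          frameDone.arrive_and_wait();           // frame is now complete
          // A barrier resets itself after each phase, so the same object
          // could be reused frame after frame.
      }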

    • @AusSkiller 4 years ago +16

      With the way he was doing things, it absolutely makes sense that 32 threads would improve performance over 8. What is probably tripping you up is that the number of threads isn't the reason it's faster (more threads doesn't really help, as you correctly concluded); rather, it's faster because he divides the area up based on the number of threads, which reduces the time it takes to process the most computationally expensive section - and with only 32 sections, that is likely still a major bottleneck. With just 8 threads and 8 sections, whenever a thread completes its section, the hardware thread sits idle until the others are all complete. With 32 threads and 32 sections, whenever a section is completed, another software thread can utilise the hardware thread that was freed up, and the maximum time to complete a section drops to a quarter of what it would take with 8 sections, allowing it to be up to nearly 4 times faster in the worst-case scenario for 8 threads. However, it is entirely dependent on the complexity of the sections and how well balanced they are: if all 8 sections are equally complex, there will be no performance benefit to running 32 threads/sections, and 32 may even be marginally slower in that case.
      Basically, the more sections there are, the better it can load-balance and keep the CPU fully utilised, at least up to the point where the overhead of more threads starts negating the better utilisation.

    • @luisendymion9080 4 years ago +2

      AusSkiller is right. You're speaking of a better but far more complex implementation. Javi's goal is to use a minimalist implementation that still makes good (although not optimal) use of the cores. For his algorithm it makes sense to use more threads, because it reduces the overall time to join them.

    • @DiamondShocked 4 years ago +1

      @@AusSkiller I see your point. The load balancing is the key issue here.
      For each thread to have somewhat equal work, each pixel should have an estimated cost and the total cost of the pixels for each thread should be similar.
      At least that's one way you could do it off the top of my head.
      Also the pixels chosen for each thread should be done so with memory access locality in mind.

    • @Splatpope 4 years ago

      Yep, got 60fps all the way to the float precision limit on my OpenGL implementation of the Julia set.

    • @Splatpope 4 years ago

      Can you tell me how to overcome the float precision limit?

  • @marcodebruin5370 2 years ago +2

    One way to avoid the "waiting for the slowest thread" problem and keep every thread busy till the end (and thus shorten the total time needed): have the workers calculate the iterations for one pixel, then ask for the next pixel not yet started, feeding pixels to the threads until you've run out of pixels to do. All threads will be kept busy until the last few pixels.
    That should still give a significant improvement (especially on screen areas that are unbalanced).
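
    A minimal sketch of that idea, using an atomic counter as the "next job" dispenser (handing out whole rows rather than single pixels to keep the overhead down; renderRow is a hypothetical stand-in for the real per-row computation):

      #include <atomic>
      #include <thread>
      #include <vector>

      static void renderRow(int y) { (void)y; /* compute iterations for row y */ }

      int main()
      {
          const int height = 480;
          const int numThreads = (int)std::thread::hardware_concurrency();

          std::atomic<int> nextRow{0};
          std::vector<std::thread> pool;

          for (int t = 0; t < numThreads; t++)
              pool.emplace_back([&] {
                  // fetch_add hands each worker the next unclaimed row,
                  // so nobody sits idle while work remains.
                  for (int y = nextRow++; y < height; y = nextRow++)
                      renderRow(y);
              });

          for (auto& th : pool) th.join();
      }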

  • @jordanclarke7283 4 years ago +5

    I wish to register interest in the SIMD code. Also I wish to acknowledge that the part where you talked about running in release mode versus debug mode was for my benefit after the (stupid) question I asked in your last video regarding the speed of decal rendering. Keep up the good work! Awesome channel! 👍🏻

  • @BuxleyHall 4 years ago +2

    Thanks for the video! Add my name to those interested in learning more about AVX and SIMD. On a different topic, but still related to rendering the Mandelbrot set, I'd love to see how to get past the resolution limit of 64-bit doubles so one could explore even deeper into the set. Thanks again!

  • @cocorinow 4 years ago +5

    javid you angel! Just as I'm starting my master's thesis, having to write a program that will crunch through a lot of complex numbers, you post this. Thanks!

  • @ryanlyle9201 4 years ago +2

    "Quite advanced for this channel". My god, he's holding back his full power.

  • @tacticalcenter8658 4 years ago +3

    Frame time is a better indicator of performance than fps, but because fps has been ingrained in people's minds for years, it's hard to change that, given the wide array of applications and games using fps. A graph plotting frame time (versus a single number for max fps) is better, but hard for people to understand. Well... the normal, non-tech person.

    • @javidx9 4 years ago +4

      I think it's also a sexier marketing number - it gets bigger when things get better

    • @tacticalcenter8658 4 years ago

      @@javidx9 indeed, probably the number one reason hardware manufacturers don't advertise frame times.

  • @ghollisjr 3 years ago +1

    I had a lot of fun making a Mandelbrot fractal Web app with WebGL by putting the fractal calculations in a fragment shader. Highly recommend it as a learning project.

  • @rachelmaxwell4936 4 years ago +6

    I enjoy and appreciate all of your content. I would really like a video on intrinsics.

  • @BlackLinerer 4 years ago +1

    Shouldn't your compiler do the vectorization if you add -march=native as well?

  • @oj0024 4 years ago +6

    It would have been quite interesting to compare different compiler optimization options, like automatic SIMD, fast-math and -O3, or the MSVC equivalents.

    • @javidx9 4 years ago +6

      I agree, the source is available, hint hint XD

    • @michaelmahn4373 4 years ago

      I was also wondering how automatic SIMD would perform in comparison. But I think the program would then crash on an old CPU without AVX, whereas it can fall back on a slower method if you use intrinsics, afaik.

    • @obinator9065 4 years ago

      Michael Mahn Compilers are so advanced that there really is no need to go through that. O3 fills everything with AVX and SSE.

  • @dylansanderson3663 4 years ago +1

    Great content as usual. @javidx9, could you possibly do a video about hashing? Lots of videos about the concept of hashing, but not much coding/implementation.

  • @thecprogrammer3908 2 years ago +1

    "Harness all of the power of my PC to solve a problem"
    OpenCL heard that.

  • @tarikeljabiri 4 years ago +3

    As always, amazing things. That's why C/C++ is called low level. Thanks. Amazing and complex.

  • @MrRobbyvent 4 years ago +1

    Downloaded the binaries, but the window closes and/or crashes as soon as I start the program.

  • @nonchip 4 years ago +1

    34:00 Since "goto is evil" (as in "not as intuitive to see the loop at all times"), couldn't that "label: ....... if(...) goto label;" just be literally swapped out for "do{......}while(...)" to improve readability? Or is there something weird about the mm256 stuff preventing that? Afaik "do while" is pretty much the same as "label if goto" internally (just without a "{block}").
    46:30 "I just copied it in verbatim" - couldn't you just call the one you've already implemented instead?

  • @Saxie81 4 years ago +4

    @32:54 I think it would be neat to look at the conversion process!

    • @javidx9 4 years ago +3

      yeah, me too! noted!

  • @sanyasanders9580 2 years ago +1

    Wow! It's incredible! Thanks for this video.

  • @oscill8ocelot 4 years ago +4

    Consider this my shout-out for more vector processing instruction videos :3

  • @sachinambetkar3637 4 years ago +4

    Thank you so much sir ... I always follow your tutorials .. they have helped me a lot ... 🙂

    • @javidx9 4 years ago

      Glad to hear that Sachin!

  • @GrayBlood1331 4 years ago +1

    I come to this channel whenever I want to feel stupid.

  • @portlyoldman 4 years ago +1

    Blimey. Really enjoyed that 😀 I'm not even a C++ programmer, only comfortable with C#, but I learned a lot from this. Also enjoyed the style and presentation. Thanks. Subbed 😁

    • @javidx9 4 years ago

      Thank you Jim, much appreciated!

  • @level3143 4 years ago +1

    Very nice. I actually did a brute-force algo video for my new channel on May the 4th. I'm taking users through the steps to write a program which solves all possible configurations of the board game Genius Square.
    Check it out if you're interested. (It's video 4 in the series; I'll post a direct link in a sub-comment.)
    The code that has been covered so far is in Python, but the final version is multi-threaded C++.

    • @level3143 4 years ago

      Link to the video for those who are interested:
      ua-cam.com/video/QzJo-Oj0X3A/v-deo.html

  • @jsflood 4 years ago +2

    Amazing as always.
    This is great! Really interesting. I would love to see a video explaining the vector co-processing (immintrin/SIMD/AVX) further. Thank you. Great work :-)

  • @FunctionGermany 3 years ago +1

    Could we offload the hard work to the GPU? Could we even figure out a way to allocate groups of calculations to the GPU and CPU, so that we get potentially the most Mandelbrot performance our computer can support?

    • @javidx9 3 years ago

      Sure, the GPU is excellent at rendering fractals. It's highly parallel and requires no cross 'thread' communication.

  • @stephenkamenar 4 years ago +1

    40:08 I was working on a similar problem and found just throwing 1000 threads at it was actually faster than trying to smartly divide the work between threads (I was using "green threads").

  • @dr_ned_flanders 4 years ago +1

    Wonderful video. How about when we use CUDA cores in the GPU? Would that work or is precision an obstacle?

    • @saultube44 4 years ago

      GPU stream processors/shaders number in the 1,000s. The amount of organization you need to use them, even knowing the shader language (OpenGL, WebGL, Vulkan, etc.), is a lot, but of course each of them is a 64-bit FPU, so you would accelerate it greatly, especially because a GPU has hardware-accelerated functions and procedures that CPUs don't have. The trade-off is that it's a lot of work to implement, unless you're an expert and have all your tools and programming environments ready, so it's easy for you.

  • @toymonsterprankcompilation6949 4 years ago +1

    When you prepare a video, do you just copy existing information from some online tutorials, or does it come from your own experience and personal interpretation? I'm not criticizing your awesome content and channel by any means; I just wonder because I would like to make online tutorials but I don't feel "good enough".

    • @javidx9 4 years ago +1

      Thanks! Programming is my hobby, so when I see something cool I like to work it out rather than follow tutorials, so I guess most of the videos are based on experience and interpretation - I've been doing this for a loooong time. Also, now educational content is so prevalent, I actively avoid tutorials so I can enjoy working things out for myself. I don't approach my videos as tutorials in the first place - they are simply me talking through what I'm working on at the time. Whilst I acknowledge that may sound like pretentious gibberish, it is true :D

  • @trolledwoods377 4 years ago +5

    You should have interviewed eriksonn for this video actually, that would have been great hahaha

    • @javidx9 4 years ago +9

      Can't be shown up by someone that actually knows what they are talking about XD

    • @luisendymion9080 4 years ago

      @@javidx9 a wise man knows his limits xDD

  • @TheCaveGamingPodcast 4 years ago +1

    Mike from Become the Knight thinks he is slick

  • @gilleswalther5964 4 years ago +2

    Great performance! Really impressive what you can get out of the CPU.

    • @DFPercush 4 years ago +1

      +1 just for the profile pic. Very apropos.

  • @Kenjuudo 3 years ago +1

    Awesome video. It seems I've struck gold by finding your channel. I can already tell I will never regret subscribing to you!
    Don't ever change your format! You hear???

  • @anakimluke 4 years ago +2

    THE INTRO SOUND is finally better!! Thanks! hahah :)

  • @Dave_thenerd 3 years ago +1

    19:05 You could use std::fma(z, z, c), and there's a chance it will use the fma instruction, which should be faster if available in hardware. Although the compiler may have already figured this out.
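
    A sketch of what that could look like for the scalar inner loop (the variable names are my assumption, not the video's; note std::fma can actually be slower if the target has no FMA unit and the library must emulate the correctly rounded result):

      #include <cmath>

      // std::fma(a, b, c) computes a*b + c in one rounding step and can
      // compile to a hardware FMA instruction (e.g. with -march=haswell).
      inline void mandelStep(double& zr, double& zi, double cr, double ci)
      {
          double newZr = std::fma(zr, zr, std::fma(-zi, zi, cr)); // zr*zr - zi*zi + cr
          double newZi = std::fma(2.0 * zr, zi, ci);              // 2*zr*zi + ci
          zr = newZr;
          zi = newZi;
      }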

  • @johngreen1060 4 years ago +2

    I would like to know how important aligning to cache lines is in this case. Or how bad it would get if we ran into cache associativity issues. People often talk about how important it is, but rarely provide numbers.

    • @javidx9 4 years ago +1

      Hi John, usually with vector processors managing cache and memory alignment is actually quite important - but this algorithm never reads anything from memory, so it's not applicable.

  • @ogr2 4 years ago +2

    Awesome video man. Actually I was watching another video about the "Mandelbrot Conjunction" and I found it quite interesting to analyze.

    • @javidx9 4 years ago

      Cheers Oscar, they just never get boring - I really find Mandelbulbs quite interesting

  • @CrystalMusicProductions 4 years ago +3

    Wow this video and in general your channel are a great motivation to keep programming and exploring the field :)

    • @javidx9 4 years ago +1

      Thank you Crystal Music!

  • @FuriousGuineaPig 4 years ago +1

    Question: would dividing the screen into squares, not columns, speed up computations in this case? Maybe it's worth a try? :)

  • @oblivionronin 4 years ago +1

    Awesome long video, love it! Fractals are indeed hypnotizing! Also, yes, I definitely want to see that SIMD optimisation.
    I was thinking, a couple of things we could do to optimise the display of this would be:
    1. Cache the world position, zoom level and iteration count; if those haven't changed, we can just re-render the frame that's already baked in.
    2. Once those are calculated once, you can use that data (time for each thread to finish) to guess where and how you should divide up your thread grid. Maybe some areas (like that big half-screen with very little calculation) should be handled by only one thread, and that one sliver of fractal should be calculated by the other 31 threads. (I.e., you hold the finish time for each thread, and an average of them all. On each frame or iteration, if a thread's finish time is shorter than the average, it should take some screen space from other threads (vertically or horizontally); if it's higher, it should give some up. That way, threads dynamically exchange their screen space depending on their own finish time and the average.)
    Sorry for the long comment, just had to put that out there! :P Cheers!

    • @antipainK 4 years ago

      Yeah, dividing the screen among threads by bitmasks should do the trick. Didn't think of it. Thanks, gonna implement it in my own Mandelbrot generator :)

  • @LaurentLaborde 4 years ago +1

    Micro(?) optimisation: separate threads using lines (rows), not columns :)

    • @javidx9 4 years ago +1

      Yeah, a common response, this. I don't think it will improve things much, simply because nothing is read from memory. The algorithm is purely generative.

  • @maybenexttime37 4 years ago +1

    Great video! An improvement would be to divide the screen into many thin rows instead. Rows are better than columns for two reasons: 1) better cache locality for threads when they're working on a particular job; 2) you can increase the number of rows arbitrarily, whereas the number of columns has to be tuned so that you're as close to a multiple of 8 pixels per column as possible (otherwise you give up potential AVX benefits). This is important because, in order to reduce bubbling, you want to increase the number of jobs so that the workload is more balanced among threads. Of course, to do that you need to decrease the size of each row/column - maybe even down to a single pixel row per job, because at the given width (1280), thread signaling overhead will likely be negligible even then.

  • @szokelorand685 4 years ago +1

    @javidx9 Could you do a part two where you "fake" the double precision limit and make the Mandelbrot set pseudo-infinite? Btw, thanks for the videos; I'm actually missing classes to watch your videos. I learn much more from these than from my teachers. Really cool.

    • @javidx9 4 years ago +1

      If you create your own number types, in principle you can go very deep, but it would start to become very slow early on. Lol. [Responsible adult mode on] Don't skip school! [off]

    • @szokelorand685 4 years ago

      @@javidx9 Could this also be solved by some algorithm scaling back and moving the camera just enough to go back one iteration seamlessly? (I figure it's kinda hard, but it could be done?..) (I haven't played around with the code, truth be told, but I most likely will.) [I wish you could see how terrible the underpaid high-school teachers are in my country :)), I'd bet they haven't even seen actual code]

  • @rubenbezuidenhout1493 4 years ago +1

    I'd also really like to see a more in-depth video on SIMD.

  • @stenzenneznets 4 years ago +1

    Hello, very nice video, thank you!
    A naive question here: how is it possible to go beyond double precision? There are many videos in which they zoom much, much more. Is it a matter of hardware, or are there indeed more sophisticated ways to approach the problem? Thanks to everyone who will answer the question :)

    • @javidx9 4 years ago +2

      Thanks Stefano. Fundamentally there are two approaches. You can analyse the mathematics and try to keep track of error; this becomes complex and will only get you so far. The alternative is to remember it's all just bits. "Double" is just a collection of them, with dedicated hardware that understands their arrangement. There's nothing stopping you creating your own types with any number of bits and processing them manually in code. Many of these deep-zoom videos are not real time; you just take an image at every step and composite them into a video sequence.

    • @stenzenneznets 4 years ago

      @@javidx9 Thank you! Of course, I did not think about the real-time thing. With dedicated powerful hardware, their own heavyweight types and, I don't know, maybe half a second, possibly 10 seconds per frame, they can edit a video which seems real time but is not, and reach crazy deep zooms. Now I understand how it's done, thank you very much :)

  • @Najvalsa 4 years ago +2

    *OUT!*

  • @MrRobbyvent 3 years ago

    Can't compile with Code::Blocks under Windows - can anyone help? The compiler's first error:
    C:\Code_Block_Projects\OLC_Mandelbrot\OneLoneCoder_PGE_Mandelbrot.cpp||In member function 'void olcFractalExplorer::CreateFractalIntrinsics(const vi2d&, const vi2d&, const vd2d&, const vd2d&, int)':|
    C:\Code_Block_Projects\OLC_Mandelbrot\OneLoneCoder_PGE_Mandelbrot.cpp|291|error: request for member 'm256i_i64' in '_n', which is of non-class type '__m256i'|

  • @Kilohercas 3 years ago

    When I was trying to get the most speed, I noticed that uint32 is 4x slower than int32. Don't ask why.
    Also, you don't need loop checks; you can use try/catch to see when a variable gets into memory it should not get to. This check is done somewhere anyway, so we can use it :D That usually means your while/for loop is over :)

  • @w3w3w3 4 years ago

    Mandelbrot, I knew it! As soon as I saw it I knew it, kek

  • @Dayta 3 years ago

    It's kinda funny, it still amazes me to this day how far computing power has actually come. I remember very clearly, at a 3x zoom on the Mandelbrot it took my CPU, which ran at about 7 MHz more or less, several minutes to complete one image render - at maximum resolution of course, which was I believe 320x240 at the time :P At some pixels I almost felt like I could beat the computer to it by doing the same calculation with pen and paper... there it was... another pixel... NICE! That was a great time, and now... look at what computers can do today. Sometimes I think efficient programming is kind of a lost art these days; it would be nice to combine the art of efficient programming with the speed that is available now. One thing I'm sure about: we can do better :)

  • @stickfigure31 4 months ago

    @24:36 I'm sure in most cases using the standard libraries is the preferred way, but tracing this back down to the square root function cutting performance in half did give me an appreciation for why id Software made their own faster (but less accurate) square root algorithm for the Quake engine.

  • @thewelder3538 8 months ago

    This is one of the reasons why I particularly dislike std::thread, _beginthreadex, pthreads, etc. You end up having to almost busy-loop your threads, when what you actually want to do is create a thread in a suspended/waiting state. One of the great things about the Win32 API is that you can CreateThread() suspended. Then it's not occupying any CPU time until you trigger an Event. You could use something like condition variables to achieve much the same thing, but then you have to deal with spurious wakeups and other gotchas.
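
    For comparison, the portable way to park a worker without busy-looping is a condition variable; the predicate overload of wait is what deals with the spurious wakeups mentioned above. A minimal sketch:

      #include <condition_variable>
      #include <mutex>
      #include <thread>

      int main()
      {
          std::mutex mtx;
          std::condition_variable cv;
          bool hasWork = false, quit = false;

          std::thread worker([&] {
              std::unique_lock<std::mutex> lock(mtx);
              while (!quit)
              {
                  // The predicate is re-checked on every wakeup, so a
                  // spurious wakeup just goes straight back to sleep.
                  cv.wait(lock, [&] { return hasWork || quit; });
                  if (hasWork)
                  {
                      hasWork = false;
                      // ...do the work (unlock first if it is long-running)...
                  }
              }
          });

          { std::lock_guard<std::mutex> g(mtx); hasWork = true; }
          cv.notify_one();

          { std::lock_guard<std::mutex> g(mtx); quit = true; }
          cv.notify_one();
          worker.join();
      }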

  • @d97x17 4 years ago +1

    Very nice explanation. I would be interested in the conversion process :) If possible, it would be awesome if it also included a small introduction to assembly itself.
    I have a question regarding the parallelisation: would you recommend using the thread functionality from the standard library (as you did in this video), or would you advise using the OpenMP API?

  • @AlexxxMurkin 4 years ago

    Real optimizations are there: www.iquilezles.org/www/index.htm, "fractals/complex dynamics" section.

  • @rustycherkas8229 2 years ago

    Processor envy!
    I explored Mandelbrot on my '286 (without a coprocessor). Filling a single EGA screen at higher zooms could take hours! Those were the days...
    Yours is running plenty fast for us consumers, but, seeking to 'optimise' my slow version, I recall using the log() of the zoom (somehow) to 'dynamically' adjust the iteration threshold... It was all long, long ago... It probably should have been "inverse square" functionality, but log() performed well enough... I'm no mathematician...

  • @DAVIDGREGORYKERR 3 years ago

    Fractint.cpp is worth downloading and compiling to see what it does, if you have a computer running an AMD 3990X in an AORUS ZENITH EXTREME X399 TRX40 motherboard with 1TB of Threadripper-compatible DDR4 RAM. What about using long double rather than double?

  • @christianantfeld3827 4 years ago +1

    I'm shocked by how smart you and the other commenters are. Absolutely incredible if anyone else can follow along.

  • @benjackson9736 3 years ago

    "The Mandelbrot set is the set of values of c in the complex plane for which the orbit of 0 under iteration of the quadratic map z_n+1 = z_n^2 + c remains bounded."
    "Indeed."

  • @AJMansfield1 2 years ago

    You could probably increase the performance by another 20% over the SIMD version by taking advantage of both the SIMD and FPU execution units on your CPU simultaneously. Essentially, if you do the operations for a fifth pixel with standard FPU instructions and alternate between the FPU instructions for that pixel and the SIMD instructions for the other four, the CPU pipeline can dispatch those FPU operations while it's waiting for the vector unit to complete, likely taking no extra CPU time over just using SIMD.

  • @chrismingay6005 4 years ago +1

    Great video, thank you. I really liked the pacing (as always), and while I'm not a C++ programmer in any capacity, I learned quite a lot as well.

  • @farkhodpulatov6366 4 years ago

    Why not plain C? (Sorry for my English, level - THE LONDON IS THE CAPITAL OF THE GREAT BRITAIN)

  • @grproteus 3 years ago

    Intrinsics and SIMD are why ARM is getting ahead of the game. They could easily add more cores and end up with a better, more compiler-optimizable system that plays across platforms. Ryzen does not support AVX-512, while Intel does. This means nobody is going to be using AVX-512. Such a waste...

  • @lukasz-mf5ri 4 years ago

    Please do a video about it; in particular, I want to understand whether I could implement it for, say, Boost's 128-bit floating-point type (I know it would only be "two times faster", not "four times", compared to the basic Boost implementation of the code, but I really want to achieve really nice precision and be able to zoom quite far).
    Also I have a question... Do all modern CPUs support AVX-256? And does something like AVX-512 exist? My CPU is an AMD Ryzen 5 1600; I'm not sure if that's important. I'm not really into this topic of assembly instructions etc., but some time ago (maybe two years...) I think I heard some noise that some CPUs had AVX-256 instructions and some did not, but I'm not sure about it.
    Great video by the way!
    PS.
    Maybe you could also make a video about optimizing this fractal with CUDA?

  • @SamClayMold 4 years ago +1

    Hey javidx9, love this video and your channel. Commenting here to say I am definitely interested in that video about the 256-bit registers and parallelism.

  • @Kattemageren 3 years ago

    TL;DR Bisqwit also did this, search up "bisqwit parallelism in c++"
    I know you already covered SIMD I would just like to point out that another one of my favorite youtubers also did a great series on this subject. Search up "bisqwit parallelism in c++". It is a great series that you could probably draw some inspiration from!
    Great video as always, thanks!

  • @CorryDMG 3 years ago

    I made a similar Mandelbrot renderer back in 2002 with Visual Basic 6.0. I reached something like 1 frame per minute. Time for an update, it seems...

  • @Cozee2 3 years ago

    This is the second video of yours I'm watching... When you are counting from 0 to 256 you should end at 255 (the first one was forbidden C++, when you said rand()%10 gets you a number between 0 and 10, when it's actually 0 to 9 :P), and it should be 0, 31, 63, 127, etc.

  • @cameraforchristmas 4 years ago

    What a terrific find! I'm working on a project based on the Mandelbrot set. I fell in love with it when it was an assignment in college many years ago. What's crazy is that I came across your channel not for the Mandelbrot part, but when trying to find a good C++ development environment for the Raspberry Pi. I chose wxWidgets, Code::Blocks and MinGW. So I found the channel video on wxWidgets setup. Now I've run into you twice on the same project. How cool is that?
    I've already compiled my Mandelbrot starting point in C#, my favorite environment.
    My end point is a Raspberry Pi and a 7" touchscreen in a permanent case. It should allow the user to zoom in (and out). I want it to behave as a purpose-built, single-function appliance. No desktop, no multiple windows, no choosing the app to run; just plug it into the wall power.
    For extra credit, I thought I'd see if I could output the display on the HDMI connector at the same time so someone can enjoy it on a monitor or the living room TV. This would be the Mandelbrot-set-top-box. Haha.
    So, thanks twice for the great content!
    Dave
    San Francisco

  • @tagKnife 4 years ago

    javidx, you didn't quite harness all the power of your PC; maybe you could make a part 2 about offloading work to the GPU using OpenACC or OpenCL.

  • @SteinGauslaaStrindhaug 3 years ago

    Wow, I just happened to be thinking about your beard just before you addressed it. How did you know?!
    (Well, mostly I was reminded I should probably do something about my own overgrown face..)

  • @yaroslavpanych2067 4 years ago

    Well, 6 must be OpenMP. No topic about parallelism is complete without covering/mentioning OpenMP!

  • @funmaster5249 4 years ago

    I may be wrong, but I think you squish the Mandelbrot into a ratio that's not as it should be? The Mandelbrot set goes from x(-2, 1) and y(-1, 1). I believe you may have squished it into a 480x360 resolution - a 4:3 ratio instead of the proper 3:2 - so some of the geometry will be distorted/squished.

  • @RogerBarraud 3 years ago

    You could perhaps extend (sorta) simply to double extended - the 't' type, 80 bits, a la the x87 stack's internal representation - which can be loaded and stored from/to memory.

  • @john_critchley 4 years ago

    AVX video pls! I did a very quick one-shot Mandelbrot sixel generator last summer (www.critchley.biz/mandel.c); it was after watching a Hannah Fry youtube video about Mandelbrot sets, and I had been playing with sixel terminal emulators, so I just did the minimum to produce some output. I have just re-learned C++ (learned it at uni in the early '90s) and was looking to do my own ANN code, but I need fast vector multiplies, so I had been looking at the Eigen template library; I would be interested in comparing that to using intrinsics... If you did such a video, then I'd add intrinsic calls to my mandel.c and make it able to zoom...

  • @morphx666 4 years ago +1

    Absolutely! I'd love to see a video explaining the conversion to SIMD data types.

  • @davidsullivan7455 4 years ago +1

    Add me to the list of folks wanting to see the SIMD conversion. Would also be interested in getting past the floating point limit!

  • @jayuliana9697 3 years ago

    I did the same with Java, using multiple threads and volatile image processing on a bare-bones game engine. I'm not even sure where to begin with trying to break the 64-bit double barrier, however; some solutions I have tried involve using BigDouble and double-double (two 64-bit doubles representing the LSB/MSB halves), but both of these significantly increase processing time.
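
    The core trick in those double-double types is error-free addition. A sketch of Knuth's two-sum in C++ (a starting point only, not a full library; note that aggressive flags like -ffast-math will break it by re-associating the arithmetic):

      #include <cstdio>

      // TwoSum: s is the rounded sum, err is exactly what rounding
      // discarded, so (s, err) together carry roughly double the precision.
      struct TwoSum { double s, err; };

      TwoSum twoSum(double a, double b)
      {
          double s   = a + b;
          double bv  = s - a;                       // the part of s contributed by b
          double err = (a - (s - bv)) + (b - bv);   // what was lost to rounding
          return { s, err };
      }

      int main()
      {
          TwoSum r = twoSum(1.0, 1e-20);            // 1e-20 vanishes in a plain add
          std::printf("s=%g, recovered err=%g\n", r.s, r.err);
      }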

  • @GregEwing 4 years ago

    So, having written asm/SIMD myself in the past (the very distant past), I would have assumed that compilers with the right combo of -O3 (unroll loops etc.) flags would be able to do a lot of this automagically - i.e. use a double[] and let the compiler sort it out. It's been a while since I've done high-performance C++, and last time it was mostly CUDA stuff, but I would have hoped that would work.
    I mean, that intrinsic function is about as unmaintainable as you can get, and it's a good example of a fairly simple algo that has none of the complex prefetch/conditions or out-of-order shenanigans a lot of real problems have. Of course, the few real-world problems that do have nice simple SIMD solutions are typically already in libs - like FFT/DCT, as an example off the top of my head.

  • @dylanhooper1909 3 years ago

    For the lolz I ran this at 2048 iterations and it ate up 34% of my CPU (3950X), but the memory consumption was quite low comparatively.

  • @MrWorshipMe 4 years ago

    I've used OpenMP directives for my threading; it's just adding 2 lines to the code for a 4x performance boost on my 4-core, 8-thread CPU. But I'm still far behind your level of performance due to not using AVX.