Adding Nested Loops Makes this Algorithm 120x FASTER?

DepthBuffer

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 6 лис 2024

КОМЕНТАРІ • 184

@bernardcrnkovic3769 Рік тому ⁺⁴⁶⁰
dude, this video is so beautifully animated. i can't wrap my head around how much time you spent on this.
@mhavock Рік тому ⁺⁷
yeah, cache you l8ter! 🤣
@minhuang8848 Рік тому ⁺⁶
The animation is sweet, but do people realize that this is a native ultrawide vid? IMMEDIATE subscription for choosing by far the best aspect ratio... y'know, the one we collectively failed to adopt. Seriously, not opting for 21:9 is the big tech fumble of our time, it's wild how narrow and tight 16:9 looks these days if you have some UW experience.
@gordonlawrenz7635 Рік тому ⁺⁶
@@minhuang8848 I like horizontal space. Give me a 32-40 inch 4:3. 4:3 is crazy good to work with.
@ImSidgr Рік тому ⁺²
Good ol’ Canva
@itellyouforfree7238 Рік тому
@@minhuang8848 21:9 is one of the worst inventions of out time
@Zullfix Рік тому ⁺¹²⁶
The fact that optimizing for cache hits and enabling SIMD was able to bring a given matrix multiplication operation from 2000ms to 15ms is wild.
@okie9025 Рік тому ⁺²⁵
It reminds me of when Matt Parker made a unique wordle combination finding algorithm that took a month to complete, and then his viewers managed to optimize it so it only takes a few ms.
@amirnuriev9092 Рік тому ⁺⁴
@@okie9025this is so not the same, both algorithms here are o(n^3), but with constant optimizations you reduce the runtime significantly
in matt parkers video he basically wrote the most naive bruteforcey approach
@MichaelPohoreski Рік тому
@@okie9025Matt Parker *isn’t a programmer* so he implemented a naive solution in a dog-SLOW language Python.
It took me a one day to write a C solution; this took 20 mins to run. The next morning I optimized it down to 20 seconds. In the evening I had multithreading working on my 24 core/48 thread Threadripper. It took ~2 seconds to find all 538 solutions.
Understanding not only data access but calculations is paramount to getting high performance.
@BosonCollider 8 місяців тому
@@amirnuriev9092 Right, this is the kind of speedup that programmers with only python experience would assume to be impossible (multiple order of magnitude improvement despite having the same asymptotic complexity, using the same language, and not even using any stupid premature abstractions in the initial version that could slow things down)
@Czeckie Місяць тому
multithreading was the mvp
@RoyerAdames Рік тому ⁺⁷
I saw your video in the ThePrimeTime, and your video is epic. Very well explain for such high level topics.
@just-a-hriday 7 місяців тому ⁺²
I believe it's actually rather low level.
@Affax Рік тому ⁺⁵⁵
This video flew entirely above my head (still getting into this math in Uni haha), but the presentation is stunning, and I still watched through all of it because these optimizations are weirdly beautiful to me.
Awesome job!
@stxnw Рік тому ⁺¹
u only need basic algebra bro..
@shimadabr Рік тому ⁺³⁴
That's awesome! I've been studying parallel programming and a lot of these strategies i had no idea were possible. I wish my university had courses on HPC/Parallel programming like yours. This course you mentioned at the end of the video seems great.
@trustytrojan Рік тому ⁺³³
great animations and visual style! you've found yourself a great side hobby :)
@erikjohnson9112 Рік тому ⁺¹
I was thinking the same. The matrix multiplication was very effective. I remembered the first matrix row gets a dot product with 2nd matrix's first column, so I could see that it was lower left being the first matrix and upper right being the second (lower right obviously being the result).
@foolriver Рік тому ⁺¹²⁰
You can get better performance by unrolling the loops(do it aggresively, for example, unroll the whole loop by block size(in GCC, #pragma GCC unroll 8)). However, it would still be 3 times slower than intel mkl.
@axe863 Рік тому ⁺²
Why wouldnt the compiler pick that up? Make the compiler more aggressive???
@blipman17 Рік тому ⁺¹⁰
@@axe863GCC in particular doesn't like unrolling loops and vectorizing them as much as LLVM does in my experience. Although they seem to have caught up a lot with the latest few versions.
@axe863 Рік тому
@blipman17 Thanks for the info.
@davidr2421 Рік тому ⁺¹
@@blipman17 Yeah this is some info I was looking for. Based on what I've seen, LLVM would just automatically do these SIMD things without needing to be explicitly told, right?
@blipman17 Рік тому
@@davidr2421 errr… yes and no. It really depends. Moving data to SIMD-registers and back has overhead compared to regular single instruction, single data (SISD) code that we normally program. Some SISD algorithm implementations in assembly can just be faster than the equivalent SIMD algorithms depending on the instruction set. Also, heat. All instructions generate some amounth of heat, which will affect the clock speed. So while an algorithm might be faster on SIMD, on a given processor it might be slower due to the CPU thermal-throttling. This is/was extreme apparent on AVX-512 SIMD instructions on Intel. That’s also why GCC used to be or is hesitant about SIMD instructions. Furtheron, any consumer/server cpu of the latest 20 years can run instructions out-of-order and in paralel. This is extremely apparent on for-loops where for example the 6’th iteration of the loop is started before the first iteration is even finished! That’s parallelism you get for free.
And this all changes drastically per specific model of CPU + RAM and sometimes even between different cooling solutions. So… benchmark before making any such statements. This is extremely difficult.
@1000percent1000 Рік тому ⁺¹⁵
I've been teaching myself how to use SIMD for the past 3 weeks and I can't even tell you how helpful this video was. I was baffled when my serial implementation of some image processing code was 10x faster than my naive SIMD implementation. Took me quite a while to understand how that was possible. This video has made my greatly appreciate the simplicity of Rust's (experimental) portable SIMD library. Also, did not know what OpenMP was and it seems somewhat similar to Rust's library. Absolutely incredible video!
@fenril6685 Рік тому ⁺⁸
Subscribed. I can't believe you don't have more subscribers already. Any software engineer dealing with matrix math should watch this video.
@cefcephatus Рік тому ⁺⁸
I love how in depth this video makes me feel smart. I know that only a few people could make sense of such content like this. But, you make it feel like even more people can get close to it.
@giordano7703 Рік тому ⁺⁹
I love it when videos like these give me inspiration to delve into a topic I'm not familiar with. I admire people like you for coming up with such elegant and beautiful ways to communicate these concepts.
@Val-vm6qu Рік тому ⁺⁶
The final part where you say "Oh but there is that library that does everything for you for that case" just made me laugh my head off, great vid full of deep commentary, really great job!
@camofelix Рік тому
Very well done! I make a living optimizing BLAS routines, this will probably become my default “what do you do” to send people
@Dayal-Kumar Рік тому ⁺¹⁴
I did this exact thing in my intern project this summer at TI. Their processor even had a special instruction to do the matrix multiplication of two small blocks. I was able to achieve around 80% the rated capacity of the processor.
@naj0916 Рік тому
Very good. I haven't thought about optimization for a long time since I was doing Assembly in the 90's. This brings back memories and the feeling. Very good!
@joshpauline Рік тому
probably one of the best programming videos I have ever seen, as a more senior developer there is a lack of content on this level of production quality when explaining complex ideas
@brucea9871 Рік тому ⁺⁴
Until now I only considered the efficiency of the algorithm for optimizing program execution. Although I admit I didn't understand everything in this video it demonstrates that knowledge of the computer's hardware and taking advantage of it can significantly speed up execution.
@LimeTree-m3j Рік тому
This is so beautiful I nearly cried, and at the end when you reviled that the library had a 1000x optimisation I died and came back to life a better person.
@amadzarak7746 Рік тому
GEMM is the perfect example to demonstrate these concepts. Wonderful video my friend. You earned a subscriber
@EricGT Рік тому
this is a well done video and explains the idea of leveraging hardware and machine code to optimize cache lookups super well. But I also want to shout out what may be the best animation of a matrix dot product I’ve ever seen. this feels like the first time I watched a video that got me to understand what monads are
@tansanDOTeth Рік тому ⁺¹
Visualizations were freaking amazing! Loved those! could you make a video of your process for editing these videos?
@megaxlrful Рік тому ⁺⁵⁹
Great video! But I feel like I will never get to do this kind of work during my job, since we use a scripting language, and all bets are off when you store everything on the heap.
@wiktorwektor123 Рік тому ⁺¹⁸
While scripting langauge is a huge problem here memory isn't. Stack and heap are just different names of how memory is allocated by OS kernel. All those types of memory are still sitting on the same RAM chip, so there is no difference in access speed and locality of that memory to CPU cache. While OS will almost always store stack memory in CPU cache, every time you access heap address it pull to cache and couple of more addresses. It's not that CPU loads only one value you request, it's value you request + some block of data after this. If you have an array of numbers and you want element zero, CPU is loading not only that but propably 1k more elements after.
@MahdeenSky Рік тому ⁺²
I loved the presentation and the animations really helped me collect the concepts.
@MohammedAbdulatef Рік тому ⁺¹
The most captivating and insightful presentation I have ever witnessed.
@LV-ii7bi Рік тому
This seems really cool. Good work, you're on the right path.
@Antonio-yy2ec Рік тому ⁺³
Pure gold! Thanks for the video!
@Otomega1 Рік тому ⁺⁸
love this stuff keep the great work !
@danh9002 Рік тому
Awesome video! Thank you so much for sharing this!
@U2VidWVz Рік тому
Cool videos. Interesting information and great graphics/animations. Thanks!
@nordicus666 Рік тому
Great videos, you certainly know what you are talking about and you can share it while keeping it interesting, keep it up, you deserve more subscribers
@woolfel Рік тому ⁺²
I could be wrong, but NVidia's linear algebra library does these types of optimizations to optimize utilization. The cublas library does a bunch of re-ordering and transposing by default.
@gara8142 Рік тому
This was a great watch, thanks you.
That library you used, MKL, can you make a video about it? It sparked my interest
@Proprogrammer001 Рік тому
This has so fucking deeply re-ignited my passion for computer science. Gosh I am on fire right now
@dirty-kebab Рік тому
I love this. As a Snr.Sweng, looking to learn more about these god tier optimisations, where can I start? 😮😮
@Malte133 Рік тому
Great work and its true optimizing the code by splitting up commands in assembler is a pain. saw videos on that too.
@xdevs23 Рік тому
This is so well explained and the animation are SO good!
Also it would be interesting to see how clang performs compared to gcc.
@stavgafny Рік тому ⁺³
Just seen ThePrimeagen react to your last video lol
@Gefionius 8 місяців тому
Incredible video. Feel like I am getting a PhD in Optimization just by watching and I haven’t really coded for 20ys 😂
@codeChuck 9 місяців тому
This lvl of perf is amazing! The depth of knowledge and understanding is stunning! Is it possible to do something like this in Typescript / React / Next / tRPC?
@samuelwaller4924 8 місяців тому
If you mean in JS, then realistically no. You can maybe do some of these optimizations using some of the newer arraybuffers or whatever they are called to use "real" arrays, but at the end of the day you can't guarantee that the runtime or interpreter won't mess something up. It just wouldn't be worth it, just use a library
@homematvej Рік тому ⁺²
We need to go deeper!
@simian3455 Рік тому ⁺⁴
I cant lie I do not program by any stretch of the imagination... but your videos are something else, the clean and clear explanations are just amazing.
@jacob_90s 7 місяців тому
Transposition works really well if you can combine it as a result of a previous operation. For instance if the matrix that's about to be multiplied is the result of adding together two matrices, when you're adding the elements of the matrix, reverse the order so that the result is the transposition of actual answer.
@sawinjer Рік тому ⁺¹
Software engineers can have two mindsets - one of being the best and the other of being the worst. After watching this video, I quickly fell into the second. 10 out of 10
@flightman2870 Рік тому
hats off to the effort and thanks to @primetime for showing us this gem
@ChaoticNeutralMatt Рік тому
Once you reminded me of the context of the last video, I believe i know where this is going
@saltyowl3229 Рік тому ⁺¹
Please tell me the next one will only be a month or two. I love learning this level of optimization and it is conveyed so well. I will openly admit I’m being selfish cause I don’t wanna wait XD, though I do understand if that can’t be the case, hardly got a lick of free time with my own courses as well
@karim-gb5nx Рік тому ⁺²
Interesting subject, well presented, thank you.
@laughingvampire7555 7 місяців тому
man, thank you for work, is amazing.
@rabbitcreative Рік тому
I have a real-world example that feels-to-me similar what is described here. It involves a Postgres install on a machine with a RAID-5 setup. Benchmarking showed peak ~15k TPS, for about 30-seconds, then the TPS would degrade to ~2k TPS, and widly fluctuate up and down, for about 30-seconds, and then resume at ~15k TPS for another 30-seconds. This behavior was steady over longer benchmarks (10 minutes).
I used Postgres's table partitioning ability to solve this. First I used 10 partitions, which made no difference. Then I used 100 partitions, and while the peak throughput capped at ~13k TPS, it was steady for the entire 10-minute benchmark.
Another, more human example, is writing the same sentence over and over on a chalkboard. Easier to write "I, I, I, I, will, will, will, will, never, never, never, never, chew, chew, chew, chew, gum, gum, gum, gum." Instead of "I will never chew gum", etc.
It's all related.
@mNikke 10 місяців тому
It would be cool if you would follow up on this using strassen-algorithm with z-ordering (morton ordering), and maybe even avx512 instructions. I think you would get pretty close for large matrices.
@DeathSugar Рік тому ⁺¹
there's also computational optimization similar to karatsuba multiplication but for matrices. probably proprietary lib uses it
@ElZafro_ Рік тому ⁺⁵
Impressive! Never heard of these optimizations druing my computer engineering grade
@joeysoroka2082 Місяць тому
it looks like the guthub repo you linked doesn't contain all the code you showcased in the video, im trying to go through your code a bit more thoroughly and was wondering where I could find all the examples shown in the video.
@NIghtmare8791 Рік тому ⁺¹
Its a great videof or optimizing a matrix multiplication from a purely technical/computing science standpoint. But i think improvement you still could make is to multiply the matrices with something like the Strassen algorithmus ( nowdays there even faster algorithm), but a matrix multiplication doesnt need to have a cubic ^3 runtime. You can actually do matrix multiplications in slightly less for example ^2.8, which should give significant improvements when multiplying big matrices
@rvoros Рік тому
Nice video.
I love Cities XL music too. ;)
@Mefistofy Рік тому ⁺¹
Beautiful video. Understandable even for a EE grunt like me. The take home message at the end was a little disheartening though. AFAIK only Intel and Nvidia churn out hyper optimized libraries. Binding you to their tools. I'm not sure about the impact of using Python for ML but I guess all the nice libraries do something similar or use MKL directly (numpy does this with specific versions).
Just a user of all the shiny tools to build new shiny tools.
@shakibrahman Рік тому ⁺²¹
anybody got any online course/book similar to the course DepthBuffer mentions he took at university for highly optimized code?
@simonl1938 Рік тому ⁺²
I'm also interested in that :)
@winda1234567 Рік тому
same @@simonl1938
@leshommesdupilly Рік тому ⁺²
also interested :D
@depth-buffer Рік тому ⁺²³
We are not using any textbooks during the course, and the course is also not recorded. But the slides are published here: pages.cs.wisc.edu/~sifakis/courses/cs639-s20/
@shakibrahman Рік тому ⁺²
@@depth-buffer awesome! thank you
@jupiterbjy Рік тому
kinda wonder why our university barely have such lectures, amazing video
@69k_gold Рік тому ⁺¹
DSA pro: You get 1900ms, take it or leave it
Guy with an intel chil and an ASM course: Hold my cache
@manuel-3500 Рік тому
just bruh, what's this video, why you have only 8k sub, why the animation are so precise, why I'm here
@diosupremo4928 Місяць тому
Beautiful!
@dakata2416 Рік тому
Great video!
@marketsmoto3180 Рік тому
this dude's channel bout to get huge now
@HadiAriakia Рік тому
Love the video,
Where have you been mate until now?😂
@evgenytsykunov8903 25 днів тому
For those interested in how to get 380 size threshold (2:54)
Arithmetic Intensity = # ops / # bytes, algorithm property, measures how many operations will be done per moved byte.
ops:byte ratio = BW_math / BW_mem, hardware property
Math limited: arithmetic intensity > ops:byte ratio
Mem limited: arithmetic intensity < ops:byte ratio
transition from memory to compute: Arithmetic Intensity = ops:byte ratio
(2N^3) / (2 * 3*N^2) = (640 GFLOP/s) / (5.25 GFLOP/s)
N = 364
# bytes = 2 * 3*N^2, assuming FP16, 2 bytes
# ops = 2*N^3
Any ideas how author got 380?
@hansadler6716 Місяць тому
really good video!
@Kotfluegel Рік тому ⁺²
There is an error in how multiplication of the block matrices is animated. As the current active result block is being calculated the blocks of the input matrices should iterate through rows and columns of blocks just like with regular matrix matrix multiplication.
@ericraio7145 Рік тому ⁺¹
Came here from primeagen! :D
@thanatosor Рік тому ⁺²
Memory access now is the bottleneck for CPU
@frango_molhado Рік тому
man what a great video
@__samuel-cavalcanti__ Рік тому
@depthBuffer can you share the book or materials that you have read to understand this topic ?
btw, blazing beautiful video
@groff8657 Рік тому
This is the first time I've ever heard the term 'giga-flops', and I quite like it.
@minhuang8848 Рік тому ⁺⁵
If you're new to the field, which seems to be a fair assumption... I highly recommend checking out machine learning papers and their ridiculous names. A tiny selection:
- We used Neural Networks to Detect Clickbaits: You won't believe what happened Next!
- Learning to learn by gradient descent by gradient descent
- BERT has a mouth, and it must speak
- Gotta Learn Fast: A New Benchmark for Generalization in RL
- Everything involving BERT and ERNIE and the full range of Sesame Street characters
Yeah, there's lots of whimsical stuff going on in that particular space, and it often is pretty relevant to the actual topic at hand, too.
@TylerHallHiveTech Рік тому
Great video! Do you mind sharing the source for your motion canvas? I like the implementation of the graph animations. Just getting used to motion and was inspired how you used it. Thanks!
@depth-buffer Рік тому ⁺¹
Currently, my repository contains source code for my next video so I can’t make it public. But I’m using my own custom fork of motion canvas and some effects in the video require those custom features. You can check it out if you want: github.com/fangjunzhou/motion-canvas
@wChris_ Місяць тому
Always Nester? This seems like a direct attack against the Never Nester!
@itellyouforfree7238 Рік тому
very nicely done
@huanxu5140 Рік тому ⁺¹
Maybe Intel used AVX vector intrinsics under the hood if your CPU supports it and that leads to perf difference.
@birdbrid9391 Рік тому
best visualizations i've seen gj
@axe863 Рік тому
Beautiful.
@geckoo9190 Рік тому
The other day I was watching a video that was implying that using less conditions makes the code faster, but then I ran some tests in assembly and saw that without the conditions, the code was longer
@Erlisch1337 Рік тому ⁺²
that would depend entirely on what you are doing
@vbregier Рік тому ⁺¹
Please, if you mention MKL, do mention that this is a non-portable intel-only library !
It won't work at all on any non-x86 compatible architecture, such as arm, and has poor performance on AMD cpu.
There are some open-source portable alternatives to MKL (blis, libflame).
Don't trade portability for performance on a single CPU family !
@stacksmasherninja7266 Рік тому
Isnt there an algorithm that uses 7 computations instead of 8 for the 2x2 GEMM? Cant we exploit that to always work in 2x2 blocks?
Also, sweet animations dude! What tools do you use for these? Is that just manim?
@depth-buffer Рік тому ⁺¹
I’m using motion canvas
@HyperFocusMarshmallow Рік тому
Great video!
@ujjwalaggarwal7065 Рік тому
just amazing
@insu_na Рік тому
great video
@Rudxain Рік тому
6:24 That's like constant/static recursion, which is much faster than dynamic recursion (specially if there's no TCO).
11:16 Can NUMA help with this?
@snoozy355 Рік тому
What's your IDE at 13:55 ?
@thanatosor Рік тому
Imagine how much underlying low level mechanism was hidden from programmers.
They aren’t supposed to know if they don’t have to optimize to the limit.
@ДімаКрасько-с7м Рік тому
would it even be possible in higher level languages as javascript and python?
@wiipronhi Рік тому
OMFG I love these videos
@sammtanX Рік тому
Wait. What is your CPU? How do you know how much GFLOPS your CPU have?
@abbasballout4441 11 місяців тому
Wait what's the music?
@machinelearningTR 9 місяців тому
Anybody knows how can i make animation like this video? Which app should i use?
@MNNoxMortem Рік тому
Amazing.
@Ximaz- Рік тому
What's your CPU ?
@AREACREWBMX Рік тому
Genius
@alfredaquino3774 2 місяці тому
hit that cache like it owes you money
@colonthree Рік тому
Shoutouts to CinemaScope. :3
@yairlevi8469 Місяць тому
what is this outro?
@skope2055 Рік тому ⁺¹
Great!
@Vaaaaadim Рік тому ⁺¹
Maybe a factor in why MKL is 8x faster than your implementation is that it uses something like the Strassen algorithm
when it recognizes that you are just doing matrix-matrix multiplication of matrices with real values?
@dmitrykargin4060 Рік тому
You need larger size for strassen to bring such advantage
@Vaaaaadim Рік тому
@@dmitrykargin4060 I only said it might be a factor, a few recursions could be applied whilst still using the typical algorithm.
Quoting from en.wikipedia.org/wiki/Strassen_algorithm "In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for matrices as small as 500x500". And yeah wikipedia is not necessarily always right, but looking at the paper cited for this statement, it does seem to support the claim.
@dmitrykargin4060 Рік тому ⁺¹
@@Vaaaaadimit is not 3 times difference for this size. You need far larger matrices to get that. I bet it was avx512 with smart reductions to get that 3x boost
@theexplosionist2019 Рік тому
No. Strassen accumulates errors.

Наступне

Автоматичне відтворення