One cool thing from the paper that didn't make it into the video: instead of using the number of multiplications as the cost function, you can use actual timings from running each algorithm. This makes it possible to find a fast multiplication algorithm tailored to the device it runs on. Interestingly, they showed that different algorithms did indeed perform best on different hardware platforms.
I don't know the details, but there are now some logic designs where addition is significantly slower than multiplication! If I had to guess, it's because multiplication of two n-bit numbers always fits within 2n bits, and in many common scenarios you'll use all 2n bits of the result. In contrast, adding two n-bit numbers plus a carry bit gives you, at most, an n-bit number plus a carry bit. (Ignoring 2's complement and that it would be a 'borrow' bit.) That's a tremendous waste if you're working with 64-bit words, much less 128- or even 256-bit words. I know: dedicated hardware can use 65-bit words internally. Or you could write a library that quietly converts the data to 63-bit words plus a carry for calculations and then converts it back to the standard representation at the end. You'll lose a little range, but it might be an acceptable tradeoff. (With floating-point numbers you would perform your shift, so you would just lose a little precision.) It's some work upfront, but it might be worth it if you can keep everything in the innermost cache until you've finished your calculations. But that's all just guesswork based on classes that are decades out of date...
I think that is just incidental. Consider, for example, that the elements of the matrices are objects defined by a C++ class, say. Then you have no access to architectural bias; the complexity of the object itself overrides any platform consideration.
More likely, it's because platforms have different performance profiles for addition, multiplication, and memory accesses, and in the case of different ISAs maybe even different numbers of registers to accumulate dot products into. For example, if computer A has memory access that's four times faster than computer B's, it's probably worth spending more time on memory-intensive algorithms that use fewer multiplication/addition operations.
For matrix multiplication, Strassen is the best overall: it works for any size and any form of matrix. But there are of course faster methods for specific cases. There are algorithms that are faster for large matrices, ones that are better for diagonalised matrices, etc. They found an algorithm for matrices containing only 1s and 0s that is slightly faster (single-digit percentage). It is an achievement and does help, but the bigger benefit might be that specialised hardware can use fewer multiplication units, and thus less power.
So based on the video, the discovery from AlphaTensor doesn't work for general matrix multiplication because it's only for 0s and 1s. But in 3D games, for rotating or translating an object, you need to work with numbers other than zeros and ones when updating in a game loop.
Multiplication by the Fourier matrix needs only N·log N operations (that's the FFT), so for some specific matrices it is definitely possible to reduce the complexity. But maybe it doesn't generalize.
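A quick sanity check of that claim, as a toy example (assuming numpy): multiplying by the dense Fourier (DFT) matrix costs O(N²) per product, while np.fft.fft returns the same vector in O(N log N).

```python
import numpy as np

# Build the N x N Fourier (DFT) matrix explicitly: F[j, k] = exp(-2*pi*i*j*k/N).
N = 8
j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
F = np.exp(-2j * np.pi * j * k / N)

x = np.random.default_rng(0).standard_normal(N)

y_dense = F @ x        # O(N^2): multiply by the dense matrix
y_fft = np.fft.fft(x)  # O(N log N): the same product via the FFT

assert np.allclose(y_dense, y_fft)
```

Same answer, asymptotically far fewer operations, but only because this particular matrix has so much structure.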
It kind of is pretty dumb. Multiplication of binary numbers is very fast; it's not like that algorithm is really needed there anyway. The general case is way more interesting.
@@webgpu "best overall? is there anything "particularly best" ? 🤣" I mean, no, why would there be? Why would any sane person assume that a solution that is the best in general would be the best for any specific scenario? Modern desktop CPUs are the best for general computation. Are they the best at matrix multiplication? No. The best at encryption? No. Best at image recognition? No. Best at network traffic? No. There are specialisations for when you need those. Hopefully that was simple enough for you to comprehend.
I devised an algorithm in 1977 that utilised 24 multiplications and 69 additions for the 3x3 case but didn't get much further as it was equivalent to Strassen's in the 2x2 case. I did it with pencil and paper. After seeing this video I don't feel so bad now about my failure :-)
I'll be totally honest and say that I took Strassen's lead and tried it out in the 3x3 case... hence it was inevitable that it would reduce to his case. There was no spark of a new idea, just a plodding through the algebra; it took me about 12 hours and a few hundred feet of line printer paper. When I opened my door the next morning, Mike's (Atkinson) room opposite seemed open, so I walked in and asked him to look at 'this' and he was overjoyed. He smiled so much I felt that I had done something useful. He asked if he could use it in his class and I said 'Yes, of course' and then I left Cardiff. Henry and I had shared our office... I am afraid he was probably quite unhappy that I had left the paper all over the place. He was a good mate though :-)
I mean _technically_ we've had "faster" algorithms than the Strassen one for quite a while. We're down from an exponent of ~2.81 to ~2.37 for very large matrices. Although an actually usable algorithm that's faster even in relatively special cases is always nice to see.
They are not talking about algorithmic complexity for large matrices. This is about the fastest way to compute a 2x2, 3x3, 4x4 and 5x5. Algorithms with "faster" complexity end up having larger overhead and performing worse than Strassen on smaller matrices. With this new algorithm for 4x4, you can use your favorite low-complexity algorithm to break the multiplication down to 4x4 matrices, and then use the decomposition found by AlphaTensor.
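For anyone curious how those pieces fit together, here's a textbook sketch of the recursive structure (assuming numpy; not anyone's production code). The base case is exactly where a small-matrix scheme like the 47-step 4x4 decomposition would be plugged in:

```python
import numpy as np

def strassen(A, B, leaf=2):
    """Strassen's algorithm on n x n matrices, n a power of two.
    Below `leaf` we fall back to ordinary multiplication; in practice the
    base case is where a hand-tuned small-matrix scheme would slot in."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products (instead of eight block products).
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.default_rng(1).integers(-5, 5, (8, 8))
B = np.random.default_rng(2).integers(-5, 5, (8, 8))
assert np.array_equal(strassen(A, B), A @ B)
```

The blocks here are integer matrices, so the comparison with `A @ B` is exact.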
@@tunafllsh Yes, I'm aware. Due to the ambiguity of their wording, they said that Strassen's algorithm was unbeaten, which is technically not true, since there are algorithms with much better asymptotic time, albeit not practical ones. A small nitpick for my entertainment, and an opportunity to share some trivia for those who might not know.
I finished university 4 years ago and I still get nightmares about my diploma being revoked because they’ve discovered that I failed linear algebra exam. Doing matrix multiplication by hand is a literal nightmare for me.
@@Penrose707 I find the best way to do matrix multiplication is to write some python code. Why would I ever do it by hand outside of that dreadful exam? It’s not the 17th century…
I read the corresponding papers and this is a pretty good recap video on the topic. Kudos for visualizing the algorithms in an understandable way and even giving some background information from the researchers! 🧑🎓
The way you described the tensor decomposition problem as a game of removing tensors from the original to reduce it to all zeros as fast as possible, and how you showed it with the Strassen example, reminds me a lot of the Lights Out game where you press a light to flip it and all adjacent to make all the lights dark. The moves in tensor decomposition are a lot more complex of course, but seeing it as a game makes a lot of sense.
For the sake of completeness: in practical applications one should keep in mind that reducing the number of basic operations is not the main bottleneck for matrix multiplication efficiency; moving data is (bus at ~500 MHz, CPU at 1-2 GHz). However, without new ideas on this last not-well-understood basic operation (polynomial and integer multiplication are mastered), brute computer force (cleverly used in the present case) is welcome.
@@btf_flotsam478 not necessarily. On hardware, you don't access single numbers; you always access multiple adjacent bytes at once (e.g. if the data bus is 128 bits wide, you access 2 or 4 floating-point numbers at once, depending on precision). A naive algorithm that uses matrix elements in order can basically halve or even quarter the number of memory accesses. On the other hand, if the optimized algorithm makes heavy use of out-of-order access to the matrix (the technical term is random access), it might "waste" some of the transferred numbers, just because the adjacent numbers are only needed at a later step of the algorithm. An asymptotically fastest algorithm might not be the fastest algorithm in practice. That's especially true for operations that are already ridiculously optimized for specific hardware, such as matrix multiplication.
@@btf_flotsam478 Not always. Bott periodicity in 8 dimensions has values 16 times as big. Even though lines of code can be n strings short, the amount of complexity embedded in a single character can be infinitely large.
The caching behavior of nearly every processor will determine the absolute speed limit for MM. One reason Strassen-style partitioning works is that it continuously tessellates the large matrices, which brings down the working-set size and reduces losses from cache overflow. Similar considerations apply to parallel implementations.
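The tiling idea behind that is easy to sketch. This toy version (assuming numpy) just reorders the same arithmetic into block-sized chunks; in a compiled implementation the block size would be tuned so three tiles fit in cache at once:

```python
import numpy as np

def blocked_matmul(A, B, bs=32):
    """Multiply n x n matrices in bs x bs tiles (bs dividing n, for brevity).
    Same arithmetic as the naive algorithm, just reordered so each inner
    update touches only three small tiles, i.e. a small working set."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i0 in range(0, n, bs):
        for k0 in range(0, n, bs):
            for j0 in range(0, n, bs):
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, k0:k0+bs] @ B[k0:k0+bs, j0:j0+bs]
                )
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Pure Python won't show the cache win itself; the point is that the reordering computes exactly the same product.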
I didn't follow along with some of the more detailed parts but overall found it really interesting. Nice work to everyone involved in this, directly or otherwise.
They managed to put AI to work juggling complex algebraic compositions/decompositions and applied the results to optimizing computer calculation tasks. Nothing mind-blowing, but an important step towards self-improving algorithms.
Matrix math is just one of those things where, while more efficient equations might be good on the theory side, there are far greater gains from specialized hardware for that kind of math. Whether something needs 8 vs 7 multiplications might matter if you are forced to do them consecutively; when you have hardware that does it all at once, one more "lane" of computation for that 8th product is a moot point. Back in the day there used to be math co-processors, and we've come back full swing: now the GPU is the math co-processor once more, with things like CUDA and tensor cores specifically designed for that kind of math. The big question is what other kind of specialized hardware would be of high value, and whether to implement that rather than pushing the limits of math theory to find efficiency.
If you have to deal with NxN matrices where N grows arbitrarily large and a multiplication takes N^k operations, then faster hardware can't compete with an algorithmic improvement that lowers the exponent k. The point of the 8->7 trick in the Strassen algorithm is that it lowers the exponent k from 3 to 2.808 for larger matrices. However, if N is fixed and the improved algorithm is very complicated, then it may well be faster to use a simpler algorithm. According to a 2016 paper, on then modern hardware the Strassen algorithm was 20% faster for N=512 (and for larger N the gain will be even larger).
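To put rough numbers on the exponent argument, here's a back-of-the-envelope multiplication count. It deliberately counts only scalar multiplications and ignores the extra additions and memory traffic, which is exactly why real crossover points have to be measured rather than derived:

```python
import math

def naive_mults(n):
    # Scalar multiplications in the schoolbook algorithm.
    return n ** 3

def strassen_mults(n):
    # Strassen recursion T(n) = 7 * T(n/2), T(1) = 1, taken all the way
    # down, for n a power of two.
    return 7 ** int(math.log2(n))

# At n = 512 the count is already down to about 30% of naive, which is
# consistent with Strassen winning in practice around that size.
ratio = strassen_mults(512) / naive_mults(512)
print(f"512x512: Strassen uses {ratio:.0%} of the naive multiplications")
```

In practice nobody recurses all the way to 1x1; the recursion bottoms out in a tuned kernel, which pushes the real advantage down a bit.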
@@ronald3836 Can you mention a real-life application where we'd need a 512 x 512 matrix? Edit: I just remembered that photos are basically matrices of pixels. But I would still love to hear about other applications.
@@ronald3836 Can you cite the paper? Not calling you out or anything, but no modern BLAS implementation makes use of Strassen's. I'm sure OP understands algorithmic complexity, and they're still right: hardware is the deciding factor for all practical implementations of matrix mult.
To explain the AlphaTensor result it would be better to show and explain the Karatsuba algorithm, as AlphaTensor found something for 4x4 that is similar to what Karatsuba does for 2x2. Sure, Strassen beats Karatsuba on 2x2 so it should be mentioned, but for the didactics of the solution, Karatsuba gives more insight into the result.
These are completely different problems. Karatsuba and its extensions like Toom-3 are for polynomial (or big-int) multiplication, whereas this is about matrix multiplication. One source of the mistake might be that after Karatsuba's discovery there was also an algorithm by Strassen (and Schönhage) performing polynomial multiplication (or convolution) using the Number Theoretic Transform recursively.
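For reference, the Karatsuba trick on integers looks like this: three recursive products instead of four, the same "trade multiplications for additions" flavor as Strassen. A minimal sketch:

```python
def karatsuba(x, y):
    """Karatsuba multiplication of non-negative integers: 3 recursive
    products instead of 4, giving O(n^1.585) digit operations
    instead of the schoolbook O(n^2)."""
    if x < 10 or y < 10:
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    p = 10 ** m
    xh, xl = divmod(x, p)
    yh, yl = divmod(y, p)
    a = karatsuba(xh, yh)                    # high * high
    b = karatsuba(xl, yl)                    # low * low
    c = karatsuba(xh + xl, yh + yl) - a - b  # both cross terms, one product
    return a * p * p + c * p + b

assert karatsuba(1234567, 7654321) == 1234567 * 7654321
```

The decimal split is just for readability; real implementations split on binary limbs.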
Have any of these researchers looked at the impact of this new algorithm on specialized matrix libraries such as TensorFlow, and on GPUs? It's possible that the 50-year-old algorithm will execute faster than the new DM algorithm if, for example, there's a significant advantage to using specialized functions tuned for GPU instruction sets. It would be important to know: #1 - the answer; #2 - where the cross-over occurs; #3 - which existing libraries will adopt this new algorithm, and when.
I have a better algorithm requiring new tensor operations on the register set of a CPU (or GPU). I have something else to do right now, but the key is: new tensor operations on the register set. Those who have resources for hardware algorithms will easily solve it as well, and also produce the CPU.
For very large matrices you can get an easy 40% speed-up over the naive algorithm if you order the multiplication and addition operations in a cache-coherent order. I'm skeptical it would be faster than Strassen's though.
Agreed, but these are considered implementation optimizations, not a different and superior algorithm. The criterion for a better type of algorithm is a reduction in the overall power to which the matrix dimension is raised; that is, the "big O" [order of complexity] of the algorithm.
Algorithms that reorder operations to achieve better computational locality are not simple implementation optimizations. They are genuinely different algorithms. For instance, one can use them to decompose matrices into sub-blocks, which allows for better parallelism. Or they can let you break your original problem up in a different way, which can give you speed-ups orders of magnitude better than a purely better multiplication algorithm. Another thing is that algorithms with better O may be much more difficult to implement in a "computer-friendly" way. For instance, they make it difficult to take advantage of long cache-line memory fetches, or their poor locality leads to cache thrashing. So, OK, they are more satisfying on paper but worse in real life.
@@YouRich92 A small follow-up. Suppose we're multiplying two large square matrices (A x B = C). First flip B about the main diagonal; then both A and B can easily be traversed in a cache-coherent order, instead of only A. I'll have to experiment with this to see how well it works.
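A toy version of that flip, in plain Python. Python lists won't show the actual cache effect; this only demonstrates that the transposed traversal computes the same product while reading both operands row by row:

```python
def matmul_naive(A, B):
    n = len(A)
    # Inner loop walks B down a column: strided access, cache-unfriendly
    # in a row-major layout.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_transposed(A, B):
    n = len(A)
    # Flip B about the main diagonal first; now both operands are read
    # row by row, i.e. in contiguous order.
    Bt = [[B[k][j] for k in range(n)] for j in range(n)]
    return [[sum(a * b for a, b in zip(A[i], Bt[j])) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_naive(A, B) == matmul_transposed(A, B) == [[19, 22], [43, 50]]
```

The one-time O(n²) transpose is quickly amortized against the O(n³) multiply.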
@@jimwinchester339 Many theoretically superior algorithms run much slower than optimized "inferior" implementations. Any linked-list case being the poster-child. Cache-coherency is King.
2:23 Quanta Magazine: Going from eight down to seven multiplications may seem trivial. me thinking: for 2x2 sure but for something like 8x8 it is significant. 2:29 *shows 8x8* me: lol
I am extremely impressed that they managed to turn this abstract task into an ML problem. I am really surprised at how computationally expensive brute force would be. A 4x4 matrix multiplication corresponds to a tensor with (4^2)^3 = 4096 cells. Since the entries are in F_2, there are already 2^4096 possible tensors, and the space of candidate rank-47 decompositions to search through is astronomically larger still. Insane
You should make a video on constraint-based solvers; it's just amazing that you can bypass all the physics and still do real physical simulation with just a few linear equations, which in turn just use matrix multiplication.
Okay but do we know yet if this type of multiplication algorithm will work in any context other than Z/2Z? It is not surprising to think that there are extra tricks to multiplying matrices in Z/2Z that one could use which are useless over the reals.
Good point. It seems like there are some complexity-theoretic implications. If matrices with only binary elements trouble the Google DeepMind team, then doing this with matrices over non-binary element sets would probably be far more convoluted.
The records of ℤ/2ℤ are of course better than ℂ, where ab=ba, although this is not as important as the records where ab≠ba, like matrices. But the records are really close to each other.
No, it seems one can apply improvements to any floating-point number (or complex number, for that matter). The article says that for ℤ/2ℤ, however: "In a few cases, AlphaTensor even beat existing records. Its most surprising discoveries happened in modulo 2 arithmetic, where it found a new algorithm for multiplying 4-by-4 matrices in 47 multiplication steps, an improvement over the 49 steps required for two iterations of Strassen’s algorithm. It also beat the best-known algorithm for 5-by-5 modulo 2 matrices, reducing the number of required multiplications from the previous record of 98 to 96. (But this new record still lags behind the 91 steps that would be required to beat Strassen’s algorithm using 5-by-5 matrices.)"
It seems very likely that 49 is optimal over the reals. One way a 47 algorithm could arise is if two rank-one tensors which are distinct over the reals become equal over ℤ/2ℤ and cancel (since addition and subtraction are the same in ℤ/2ℤ). Also, for some other matrix formats where better algorithms are known for ℤ/2ℤ, those beat the best known algorithms for the real case by exactly 2 multiplication steps.
*FYI: THIS ISN'T "NEW," IT'S LITERALLY REPLICATING A STACK OF WINDOWED PAPER....WHICH IS HOW MATHEMATICIANS ORIGINALLY SOLVED MATRIX PROBLEMS MANUALLY. THE AI JUST RE-DISCOVERED AN ORIGINAL CONCEPT BUT IT DIDN'T FIND ANYTHING "NEW". THIS VIDEO IS INCREDIBLY MISLEADING.*
Um .. 4x4 of ones and zeros ... I would just pre-store a cache of combinations of the row results ... 4 bits per row, 16 bits for the second matrix, 4 bit result row = 512KB of nicely cached memory. 4 lookups, done. ;)
Error codes, multiplications: everything is done faster by splitting it up. One thing I noticed back in the day: when I used everyday D6 dice for a few moments a day just to sharpen my addition in a trivial way, my arithmetical tasks went smoother. I split everything I could into smaller chunks in my mind. The universe is a weird fractal :)
As long as people do not start calling computer programs 'proofs' with no formal error analysis, I'm fine with machines helping out. What I do not like is calling something a theorem or a proof when it is clearly not. And it is already happening in many areas.
The caveat is that it works only with binary elements. What's interesting, though, is the loss of rigor that comes with an experimental approach compared to mathematical proof (try and see if it works; if not, try something else, but make sure you don't try anything close to what didn't work before). That is why the mathematicians were able to find a better solution that the machine learning algorithm missed. These two caveats suggest that the method (of the machine learning algorithm) is not a "proof" in the purest sense, but rather a "proof" in the pragmatic sense ("the proof's in the pudding"). It's typically what engineers do. I remember when I studied digital signal processing in the early days, prior to the advancement of discrete mathematics. Mathematicians claimed that the sampling theorem was wrong, and they were right in the purist sense, but that doesn't mean it can't be used in practice, because the ear cannot perceive the errors if you sample beyond the Nyquist rate. At some point, the mathematics journals will come to accept machine learning experiments as "proof", and the meaning of the word, and mathematics itself (as a language), will lose the precision it's known for, which comes from the rigor it has traditionally enforced.
The narration says (2:55) that Strassen's method requires "a third less" computation while the video contradicts this by saying "67% less". 343 is 67% of 512 so it is 33% (a third) less than 512.
They meant 67% of the original value, but yes, if one is being pedantic then you are correct, and in this world one can never be too pedantic or descriptive, in my opinion.
The day will probably come, when AI mathematicians can go over the possibilities so much faster and so much more creatively, it'd be like asking a five year-old to advance the state of mathematics now. Wolfram has an AI that has been coming up with new theorems for some time. I'm not fooled, when ppl say "It won't replace ...". We'll have significant, useful input to add for a while. The time when we won't, I think, isn't nearly as far off as we think. Now, imagine Malthusian elites directing AI, while they still can ...
If the matrices only contain 0 or 1, you can build the whole naive algorithm in hardware and multiply two matrices in a few ticks (I don't know how many layers that circuit has... should be around 1 to 4; multiplying and adding mod 2 are trivial bit operations which can be executed in parallel). There are probably ASICs for exactly that situation, where you have "zero" multiplications per 4x4 matmul mod 2.
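That mod-2 inner product really is just a bitwise AND followed by an XOR-reduction (parity), which is why the hardware version is so shallow. A bit-packed Python illustration of the same circuit:

```python
def gf2_matmul(A_rows, B_cols):
    # Each output entry is the parity (XOR-reduction) of a bitwise AND:
    # an "AND then XOR-tree" circuit, a couple of gate levels in hardware,
    # with zero actual multiplier units.
    C = []
    for a in A_rows:
        row = 0
        for j, b in enumerate(B_cols):
            row |= (bin(a & b).count("1") & 1) << j  # dot product mod 2
        C.append(row)
    return C

# Rows of A and columns of B, each packed into a 4-bit integer.
A_rows = [0b0011, 0b1111, 0b0100, 0b1001]
B_cols = [0b1010, 0b0110, 0b0001, 0b1100]
C_rows = gf2_matmul(A_rows, B_cols)

# Cross-check against the entry-by-entry definition.
bits = lambda x: [(x >> k) & 1 for k in range(4)]
A, Bt = [bits(a) for a in A_rows], [bits(b) for b in B_cols]
expected = [[sum(A[i][k] * Bt[j][k] for k in range(4)) % 2 for j in range(4)]
            for i in range(4)]
assert [bits(c) for c in C_rows] == expected
```

All sixteen AND/parity units can run in parallel, so the whole 4x4 product is a handful of gate delays.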
Strassen's algorithm was the work of a genius because he paved the way and showed it was possible. Without that breakthrough nobody would have tried. I have no idea how he came up with his method, in particular because the additions needed before the multiplications are done are not at all obvious. The graphics here showed nicely how the 8 3d tensors could be used to represent the operations needed for the matrix multiplication of two 2 x 2 matrices in the standard algorithm and similarly for Strassen's algorithm.
It's mind-boggling that Strassen's method wasn't discovered a century ago or more. Mathematicians seem to have been expanding and reducing and playing with equations since antiquity. I would never have assumed that multiplying two 2x2 matrices cannot be done with fewer than 8 multiplications, but simply assumed the standard way had been found by consensus to be objectively or subjectively the 'best' way.
It's always a case of what's more intuitive to teach. The "classic" method is neat and understandable, and efficient for teaching purposes (no exam will ask a 5x5 * 5x5 matrix multiplication). Machines don't have an instinct, they execute code so what matters is only the code itself, not how understandable it is to a human. The strength of this method of finding things is that they can just tune the parameters to whatever specific case they need and the model will, hopefully, eventually converge to a solution or to a path to the solution, which is incredibly fascinating.
Also, most of the mathematicians of past centuries were often extremely good at calculating things in their heads, probably more often than not savant-like. They had to do all kinds of calculations either in their heads or by hand, with close to no support from machines. And in that environment, there is not much to gain from reducing multiplications and having more additions instead. For smaller numbers, it's close to the same effort for (very skilled) humans, and for bigger numbers they'd usually just approximate the calculation with precomputed logarithm tables, and then it's just an addition again. It's computers that can do additions so much faster than multiplications or lookups in (big) precomputed logarithm tables. What I miss in the video is why the results for matrices with entries in {-1, 0, 1} matter. Shouldn't a multiplication by one of these 3 numbers take about the same time as an addition? I guess we can use the algorithm to adapt to R^n spaces, but that's unmentioned. Or maybe I don't understand it.
There was never a use for Strassen's method before the use of digital computing. If you have to multiply matrices by hand (or abacus) it's more important to not make mistakes and to not get ridiculous intermediate values. Strassen is also not as numerically stable.
Not a criticism of what looks like a great systematic approach to a necessarily exhaustive methodology, to which we just add that the Universe of Logarithmic Time Duration Timing is obviously self-defining probability, and will evolve such solutions as demonstrated by Euler's pure-math function, ie it's QM-TIME Completeness cause-effect holography in the rough. (But a proper Mathemagical analysis will deduce the best smoothness choice of processes, fun to imagine for Mathematicians)
This is amazing stuff. Just one point of curiosity: why are modulo 2 matrices useful? Is there some real world application that uses just 0 and 1 in a matrix, which can benefit from a speed increase?
Everything computer-related can pretty much be broken down into binary computations as far as I know: complex boolean logic algorithms. I hope somebody comments with a real-world application; I'm interested too.
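One classic real-world use of GF(2) matrices is error-correcting codes. Encoding with a Hamming(7,4) code is literally a vector-matrix product modulo 2 (sketch assuming numpy; G and H below are a standard generator/parity-check pair):

```python
import numpy as np

# Generator matrix of the Hamming(7,4) code: encoding a 4-bit message is
# a vector-matrix product over GF(2), i.e. modulo 2.
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

# Matching parity-check matrix: H @ c = 0 (mod 2) for every valid codeword.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

msg = np.array([1, 0, 1, 1])
codeword = msg @ G % 2
assert not (H @ codeword % 2).any()   # zero syndrome: no errors

# Flip one bit and the syndrome becomes nonzero, which is how single-bit
# errors get detected (and located).
corrupted = codeword.copy()
corrupted[2] ^= 1
assert (H @ corrupted % 2).any()
```

Decoders for long codes do this kind of GF(2) linear algebra constantly, so faster binary matrix arithmetic is genuinely useful there.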
Now I see how important it is for researchers to create YouTube videos about their research. It feels very daunting to read research papers, and videos give the work more reach and readability, which will increase its citations.
@@myxail0 First, I wasn't being picky. I just explained in a few words that real-number matrices are not used by computers, only matrices over finite fields, and hence these results are helpful. And second, floating-point arithmetic allows you to represent real numbers while working with finite fields. What I said is as "picky" as you can be. What you said adds extra info, but does not correct me, because there is nothing to be corrected in what I said.
@@nickfaire What you said wasn't helpful at all. There's a high chance the commenter used "real numbers" to mean floating-point numbers. That's not outlandish; it's what they're for, modeling real numbers. On the other hand, going from binary matrices to floating-point matrices is a huge algorithmic step that is not trivial at all. Hence why the question you answered is interesting, and your answer isn't. In my opinion.
@@myxail0 I think you're both correct, but from different perspectives. Being based on binary (even floating points!), actual efficient and fast algorithms used are binary or a power of 2. I realize you wanted to know from a practical perspective in terms of real world applications? The situation is, it can wildly vary between processor architectures. But discounting that, the real bottleneck is memory and bandwidth. Maybe this could help in theory, but I don't think at all, as long as the input and output sizes of the multiplication don't change.
So nice that google invests money in such fundamental research. Still, I think the video title is kind of click-bait because the method is only applicable to very specific matrices.
Yeah, I learned about this a few days after it happened. This, along with AI speeding up the design process for the lenses that print silicon chips, as well as the improvements it brings to transistor layout... I think we can already say it is improving itself rapidly. These improvements will then lead to other improvements shortly. Here we go, singularity!
Circuit analysis using SPICE involves converting nodes in electronic circuits into values in a matrix. The node values change with frequency. While this sounds good, limiting a matrix to values of zero and one would impact the ability to model frequency response.
I was going to say exactly what @tolkeinfan1972 did... This only functions with entries in F_2 or GF(2) (same thing, different convention: F_2 means the unique finite field of two elements up to isomorphism, whereas GF(2) means the Galois field of two elements, and they are the same). In neural networks, the matrices used don't have entries from F_2: they have entries that are floating point numbers, and thus this improvement isn't applicable to neural nets.
@@AndreiFierbinteanu Theranos: We used only one drop of your blood ... But here, look at this nice and colorful presentation about matrix multiplication. ... oh and btw. we also burned all of your investment capital for that. Good bye!
So what you should have a look at is how compilers compile earlier versions of themselves, and the tricks they have to use given the feature set of the first compilers. This is too long to explain in a comment, and it's well documented in the Dragon book (a book on compilers).
Quaternions are used for 3D rotation: 16 multiplies versus 27. The problem with matrices is floating point; composing matrices increases the level of imprecision quickly.
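For the curious, here's a sketch of that count: the Hamilton product below uses 16 scalar multiplications, while composing the two 3x3 rotation matrices directly takes 27, and both describe the same composed rotation:

```python
import math

def quat_mul(q, r):
    """Hamilton product of quaternions (w, x, y, z): 16 scalar
    multiplications, vs 27 for multiplying two 3x3 rotation matrices."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def quat_to_matrix(q):
    # Standard rotation matrix of a unit quaternion (Hamilton convention).
    w, x, y, z = q
    return [[1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
            [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
            [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)]]

def matmul3(A, B):
    return [[sum(A[i][k]*B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Two unit quaternions: 90-degree rotations about z and about x.
qz = (math.cos(math.pi/4), 0, 0, math.sin(math.pi/4))
qx = (math.cos(math.pi/4), math.sin(math.pi/4), 0, 0)

# Composing via quaternions gives the same rotation as composing matrices.
R_composed = quat_to_matrix(quat_mul(qz, qx))
R_separate = matmul3(quat_to_matrix(qz), quat_to_matrix(qx))
for ra, rb in zip(R_composed, R_separate):
    for a, b in zip(ra, rb):
        assert abs(a - b) < 1e-12
```

Quaternions also renormalize cheaply after many compositions, which is the numerical-drift point.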
Correct me if I'm wrong, but I've always considered society to be pushed forward by a few brilliant people, with a lot of very intelligent people then having their minds 'opened', let's say, by a breakthrough that the likes of Da Vinci or Einstein developed. As a very basic example, many secondary school pupils can get a good understanding of integration, but without Newton none of them could ever have performed such math. In this way, a 'normal' or 'bright' person is accelerated by the once-in-a-century geniuses that pop up every now and then. Could AI act as a faster replacement for these geniuses, providing the new ideas and breakthroughs that accelerate what a 'bright' person can do? Are we replacing our geniuses? (Not that this is necessarily bad; just, are we?)
@@jondo7680 Really? Oh, thanks for the explanation, I didn't know that; I thought it was a Skynet prototype. What would we have done without your crystal-clear and precise analysis. Please delete your stupid comment and I will delete this one, which I would appreciate. Maybe you can stay away from the Internet for a couple of days as well.
The problem is that almost all matrices are NOT just 0s and 1s! So you're stuck doing the full matrix multiplications. But the good news is that after you have done the full matrix multiplications, those can be used as a lookup table to solve the answer you are working on.
I was a maths savant in school; can't remember a single bit of it at 52. I would have no idea how to sum a series with a sigma. I might be able to do some basic algebra and work stuff out using common sense, but it's all gone... like the sands of time.
One can only imagine the success of Srinivasa Ramanujan if he had had access to these types of computers. I wish he were still alive to contribute to our society. 😢
This specific kind of technique isn't going to replace mathematicians, but other techniques, some that already work and just need to be scaled up, almost definitely will.
Nice work, DeepMind, and the humans who collaborated with it. While matrices of 1s and 0s are common in many situations, did you ask DM to work on a general solution for all matrices? I'm hoping you'll put DM on the Kronecker product next. Its resultant matrix grows much faster than the ordinary matrix product's. A [4 x 4] x [4 x 4] ordinary product only results in a [4 x 4] matrix, whereas the Kronecker product of two [4 x 4] matrices results in a [16 x 16] matrix! We use this a lot in physics.
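For anyone unfamiliar with the size blow-up being described (assuming numpy; np.kron is the Kronecker product):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

assert (A @ B).shape == (4, 4)          # ordinary product: size stays 4x4
assert np.kron(A, B).shape == (16, 16)  # Kronecker product: sizes multiply

# The mixed-product property often lets you avoid forming the big matrix:
# (A kron B) @ (C kron D) == (A @ C) kron (B @ D)
C = rng.standard_normal((4, 4))
D = rng.standard_normal((4, 4))
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
```

That last identity is the usual escape hatch in physics codes: you multiply the small factors and only expand at the very end, if at all.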
Since I do not use these methods because of the obvious bottlenecks, this presentation shows me a root aspect of the tunnel vision plaguing industry. This motivates me to work out an explanation for a presentation of my methods.
Wow, this video is truly mind-blowing! Training the AI system AlphaTensor to explore faster algorithms for matrix multiplication is a remarkable feat with far-reaching implications. The potential impact on various disciplines is truly exciting to envision. Thank you for sharing this fascinating glimpse into the future of mathematical research and its potential real-world applications!
The algorithm for quaternion multiplication gets its compactness in a way that would be available to matrix multiplication if it followed the same path of projecting the results onto the hypercube and back to the 3D cube in DeepMind's AlphaTensor. I bet a 6-pack of beer that it would improve on the current best result.
It's funny because it surely adds more additions than the multiplications it saves. Shouldn't that make it slower than the regular algorithm?! (For large matrices it doesn't: each saved multiplication is a whole recursive sub-multiplication, while the extra additions are cheap operations on the matrix entries.)
O(n^3) is still polynomial complexity; imagine if you had exponential or factorial. About the spelling: in German, V is read as F, their W is read as your V, and St is read as Sht.
You can read more about this work on Quanta's website: www.quantamagazine.org/ai-reveals-new-possibilities-in-matrix-multiplication-20221123/ Correction: At 2:53 in the video, the text previously read "67% less" but has been changed to "67%" for accuracy.
@@LucenProject It's the initials of the team that published the paper "Discovering faster matrix multiplication algorithms with reinforcement learning".
3:47 Why is the matrix broken into four 4x4 matrices instead of breaking it into 16 2x2 matrices? When I perform the new algorithm recursively on an arbitrarily large (N x N) matrix, I'd break it into 4*4=16 matrices of dimension (N/4)x(N/4). These "blocks" are then combined exactly the way the scalars in a 4x4 matrix are combined with 47 multiplications, only here the multiplications are performed recursively. This means that the input size N is divided by four in each step of the recursion and the runtime is N^{log_4(47)} ~ N^{2.777} by the Master theorem. Doing it the other way around as shown in the video would be performing Strassen on the topmost call (7 multiplications of problem size N/2), am I wrong?
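The blockwise recursion described in this comment can be sketched concretely. Below is a minimal pure-Python Strassen (helper names are mine, power-of-two sizes only, for illustration); the AlphaTensor 4x4 scheme would recurse the same way but split into sixteen (N/4)x(N/4) blocks and perform 47 recursive multiplications per step:

```python
def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(M):
    # Split an n x n matrix (n even) into four (n/2) x (n/2) blocks.
    n = len(M) // 2
    return ([r[:n] for r in M[:n]], [r[n:] for r in M[:n]],
            [r[:n] for r in M[n:]], [r[n:] for r in M[n:]])

def strassen(A, B):
    """Multiply two n x n matrices, n a power of two, with 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    a, b, c, d = split(A)
    e, f, g, h = split(B)
    p1 = strassen(a, sub(f, h))          # a(f-h)
    p2 = strassen(add(a, b), h)          # (a+b)h
    p3 = strassen(add(c, d), e)          # (c+d)e
    p4 = strassen(d, sub(g, e))          # d(g-e)
    p5 = strassen(add(a, d), add(e, h))  # (a+d)(e+h)
    p6 = strassen(sub(b, d), add(g, h))  # (b-d)(g+h)
    p7 = strassen(sub(a, c), add(e, f))  # (a-c)(e+f)
    top_left  = add(sub(add(p5, p4), p2), p6)
    top_right = add(p1, p2)
    bot_left  = add(p3, p4)
    bot_right = sub(sub(add(p1, p5), p3), p7)
    return ([l + r for l, r in zip(top_left, top_right)] +
            [l + r for l, r in zip(bot_left, bot_right)])
```

Each call replaces 8 half-size products with 7, which is exactly where the N^{log_2(7)} exponent comes from; a 47-product 4x4 base case recursed the same way gives N^{log_4(47)} instead.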
A 3d tensor is used because it's multiplication of 2 2d tensors. So for higher dimensions you need to multiply higher order objects, or do something else. 4d space is naturally suited for 3d rotations for example..
I hate the whole "There are more options than atoms in the universe..." thing. Because whenever that is brought up, it tends to be that basically everything but a few options gets eliminated very quickly once basic rules are introduced into the system. If you've got a huge floor and need to make a mosaic on it, your possibilities are absurdly huge if you just fill it with random tiles, but as soon as you set any kind of rules or boundaries, the possibilities start to reduce quickly, until a limit between "there is one solution" and "there is no solution" is reached.
If it only works for some cases, it may not be really useful in applications. It's like adding an "if" condition to check whether a given case should use the old approach or the new approach to get a relatively faster result; that checking logic is also time-consuming, and the time consumed could in fact simply be spent on solving the calculation directly. Also, this only adds complexity to the straightforward calculation solution.
If one is just multiplying two 2x2 matrices with ones or zeros, there are only 16 possibilities for each matrix, and so you can just use a 256-entry, 4-bit lookup table and get the answer in constant time per block. Then you just divide into blocks. It doesn't seem like the case of elements being zero or one is that interesting.
Ah, I'm a bit confused... doing matrix multiplication when you have only 0 or 1 can be done without any multiplication; you just replace the multiplication operation with an AND gate. You know: 0x0=0, 0x1=0, 1x0=0, 1x1=1...
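To illustrate the point (a sketch with bitmask conventions of my own choosing, not anything from the video): mod-2 matrix multiplication needs only AND and XOR, if you pack rows of A and columns of B into bitmasks:

```python
def parity(x):
    # popcount mod 2, i.e. an XOR-fold of the bits of x
    return bin(x).count("1") & 1

def matmul_gf2(A_rows, B_cols, n):
    """A_rows[i] / B_cols[j] are n-bit masks of row i of A / column j of B."""
    return [[parity(A_rows[i] & B_cols[j])  # AND plays the role of multiply
             for j in range(n)]
            for i in range(n)]
```

For example, with A = [[1,1],[0,1]] and B = [[1,0],[1,1]] packed so that bit k is entry k, `matmul_gf2([0b11, 0b10], [0b11, 0b10], 2)` gives the mod-2 product [[0, 1], [1, 1]].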
Then you'd still need mathematicians who are able to understand both the algorithmically generated proofs as well as the algorithms themselves. AI will always automate some tasks while at the same time creating new tasks. The outcome won't be less human work, but higher efficiency for the same or even a higher amount of human work.
Nothing new. Only top-notch people bother to study enough of the cutting-edge to be empowered by it, the rest don't. Tale as old as time. Nothing has changed, despite easy access to massive amount of information.
I don’t see why we want to reduce the number of multiplications for binary numbers. Computationally, multiplying 2 bits should be the same difficulty, if not, easier, than adding 2 bits.
That's probably the reason why nobody found this genius algorithm in 50 years; no one in their right mind would search for an algorithm that replaces binary multiplication with several binary additions to slow down their program.
So to make matrix multiplication more efficient, they used machine learning, which relies heavily on matrix multiplication. Very cool!
The earth is flat.
@@aritroshome8480 Your brain is flat, my friend.
dont read into it chief
this video is pure hyperbole. They said seasoned mathematicians haven't mastered matrix multiplication. So fcking stupid
To launch a rocket stage filled with fuel even further you indeed need another stage that also contains fuel in the first place.
One cool thing from the paper which didn't make it into the video was that instead of using the number of multiplications as your cost function, you can use the actual timings from running each algorithm. This lets you find a fast multiplication algorithm tailored to the device it's running on. Interestingly, they showed that for different hardware platforms, different algorithms indeed performed better.
Naturally, one might ask what platform would provide the fastest algorithm and why.
I don't know the details but there's now some logic designs where addition is significantly slower than multiplication!
If I had to guess, it's because multiplication of 2 n-bit numbers will always fit within 2n bits, and in many common scenarios you'll use all 2n bits of the result.
In contrast, with addition of 2 n-bit numbers plus a carry bit you'll get, at most, an n-bit number plus a carry bit. (Ignoring two's complement and that it would be a 'borrow' bit.) That's a tremendous waste if you're working with 64-bit words, much less 128- or even 256-bit words.
I know - dedicated hardware can use 65-bit words internally. Or you could write a library that quietly converts the data to 63-bit words, plus a carry, for calculations and then converts it back to the standard notation at the end. You'll lose a little bit of range, but it might be an acceptable tradeoff. (With floating-point numbers you would perform your shift, so you would just lose a little bit of precision.) It's some work upfront, but it might be worth it if you can keep everything in the innermost cache until you've finished your calculations.
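The 63-bit-word idea above can be sketched in a couple of lines (purely a toy illustration of the commenter's guess, not how any real library works): keep one bit of headroom per word so a sum can never overflow the word, and propagate the carry explicitly.

```python
MASK63 = (1 << 63) - 1  # values live in 63 bits, leaving 1 bit of headroom

def add63(a, b, carry_in=0):
    """Add two 63-bit limbs plus a carry; the raw sum always fits in 64 bits."""
    s = a + b + carry_in
    return s & MASK63, s >> 63  # (result limb, carry out)
```

Chaining `add63` over arrays of limbs would give the multi-word addition the comment describes, at the cost of one bit of range per word.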
But that's all just guesswork based on some classes that are all decades out of date....
I think that is just incidental dust. Consider, for example, that the elements of the matrices are objects defined by a C++ class, say. Then you have no access to architectural bias; the complexity of the object itself overrides any platform consideration.
@@AleXander-eo3iz ASIC
More likely, it's because platforms have different performance profiles for addition, multiplication, and memory accesses, and in the case of different ISAs maybe even the number of registers to accumulate dot products into. For example, if computer A has memory access that's four times faster than computer B's, it's probably worth it to spend more time on memory-intensive algorithms which use fewer multiplication/addition operations.
For matrix multiplication, Strassen is the best overall: it works for any size and any form of matrix. But there are of course faster methods for specific cases. There are algorithms that are faster for large matrices, ones that are better for diagonalised matrices, etc.
They found an algorithm for matrices only containing 1s and 0s that is slightly faster - single-digit percentages faster. It is an achievement and does help, but the bigger benefit might just be that with specialised hardware it can use fewer multiplication units - thus less power.
So based on the video, the discovery from AlphaTensor doesn't work for real matrix multiplication because it's only for 0s and 1s. But in 3D games, for rotating or translating an object, you need to use numbers other than zeros and ones when updating in a game loop.
Multiplication by the Fourier matrix only needs N*logN, so for some specific matrices it is definitely possible to reduce the complexity. But maybe it is not generic.
It kind of is pretty dumb. Multiplication of binary numbers is very fast; it's not like that algorithm is really needed there anyway. The general case is way more interesting.
best overall? is there anything "particularly best" ? 🤣
@@webgpu "best overall? is there anything "particularly best" ? 🤣"
I mean - no - why would there be? Why would any sane person assume that a solution that is the best in general would be the best for any specific scenario?
Modern desktop CPUs are the best for general computation. Are they the best at matrix multiplication? No. The best at encryption? No. Best at image recognition? No. Best at handling network traffic? No. There are specialisations when you need those.
Hopefully that was simple enough for you to comprehend.
I devised an algorithm in 1977 that utilised 24 multiplications and 69 additions for the 3x3 case but didn't get much further as it was equivalent to Strassen's in the 2x2 case. I did it with pencil and paper. After seeing this video I don't feel so bad now about my failure :-)
I'd like to learn from you, sir.
I'll be totally honest and say that I took Strassen's lead and tried it out in the 3x3 case... hence it was inevitable that it would reduce to his case. There was no spark of a new idea, just a plodding through the algebra; it took me about 12 hours and a few hundred feet of line printer paper. When I opened my door the next morning, Mike's (Atkinson) room opposite seemed open, so I walked in and asked him to look at 'this', and he was most joyed. He smiled so much I felt that I had done something useful. He asked if he could use it in his class and I said 'Yes! Of course', and then I left Cardiff. Henry and I had shared our office... I am afraid that he was probably quite unhappy that I had left the paper all over the place. He was a good mate though :-)
You 'just' had to continue a 'little' bit 😬
That's AWESOME
That's amazing nonetheless!
I mean, _technically_ we've had "faster" algorithms than the Strassen one for quite a bit. We're down from an exponent of ~2.8 to ~2.37 for very large matrices. Although an actually usable algorithm that's faster even in the relatively special cases is always nice to see.
even recursive strassen-like algos had better versions than 2x2
with rectangular (3x5, iirc) still scaling better than deepmind's 4x4
technically, your comment is useless and provides nothing to the conversation.
They are not talking about algorithm complexity for large matrices. This is about the fastest way to compute a 2x2, 3x3, 4x4 and 5x5.
Algorithms with "faster" complexity end up having larger overhead and perform worse than strassen on smaller matrices.
With this new algorithm for 4x4, you can use your favorite low complexity algorithm to break down the multiplication to 4x4 matrices. And then use the decomposition found by AlphaTensor.
@@tunafllsh Yes, im aware. Due to ambiguity of their wording they said that Strassens algorithm was unbeaten which is technically not true since there are algorithms with much better asymptotic time, albeit not practical ones. A small nitpick for my entertainment and an opportunity to share some trivia for those who might not know
@@RepChris did you get a kick out of that? How fast is your algorithm?
I finished university 4 years ago and I still get nightmares about my diploma being revoked because they’ve discovered that I failed linear algebra exam. Doing matrix multiplication by hand is a literal nightmare for me.
If the mnemonic is hard enough for you, try by definition 🤗
The sigma-sum notation makes the method a lot easier to memorize
For a lot of cases there are arguably better alternatives. But numerical computations are better left for computers anyway..
@@francogonz I’m just really bad at mental arithmetic. And even so, I didn’t actually fail it…
@@Wooksley I personally find matrix multiplaction by cofactor expansion to be the easiest method to compute by hand
@@Penrose707 I find the best way to do matrix multiplication is to write some python code. Why would I ever do it by hand outside of that dreadful exam? It’s not the 17th century…
I read the corresponding papers and this is a pretty good recap video on the topic. Kudos for visualizing the algorithms in an understandable way and even giving some background information from the researchers! 🧑🎓
Tbh I didn't understand shit from the visuals... yes, of course it seemed legitimate and got the vibe going... but it wasn't understandable as a highschooler
@@missionaryrav628 git gud kid
The way you described the tensor decomposition problem as a game of removing tensors from the original to reduce it to all zeros as fast as possible, and how you showed it with the Strassen example, reminds me a lot of the Lights Out game where you press a light to flip it and all adjacent to make all the lights dark. The moves in tensor decomposition are a lot more complex of course, but seeing it as a game makes a lot of sense.
Life is a game.
Those who see their work as games do very well!
The power of AI isn't intelligence, it is a feature of Iteration.
Linear Algebra mob want you to think it's intelligence.
What is biological intelligence other than millions and millions of years of evolutionary iteration?
For the sake of completeness: in practical applications one should keep in mind that reducing the number of basic operations is not the main bottleneck for matrix multiplication efficiency; moving data is (bus ~500 MHz, CPU 1-2 GHz). However, without new ideas on this (last not understood) basic operation (polynomial and integer multiplication are mastered), brute computer force (cleverly used in the present case) is welcome.
Surely, fewer calculations means fewer movements of data?
@@btf_flotsam478 not necessarily. On hardware, you don't access single numbers; you always access multiple bytes that happen to be adjacent in memory at once (e.g. if the data bus is 128 bits wide, you access 2 or 4 floating point numbers at once, depending on precision). A naive algorithm that uses matrix elements in order can basically halve or even quarter the number of memory accesses. On the other hand, if the optimized algorithm makes heavy use of out-of-order access to the matrix (the technical term is random access), it might "waste" some of the transferred numbers, just because the adjacent numbers are needed at a later step of the algorithm.
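A concrete version of the in-order-access point (a plain-Python sketch; the cache effect only really shows up in compiled code over contiguous arrays): reordering the classic i-j-k loops to i-k-j makes the innermost loop stream over rows of B and C instead of walking a column of B.

```python
def matmul_ikj(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]
            # Innermost loop touches B[k][...] and C[i][...] left to right,
            # i.e. in memory order for row-major storage - no column walks.
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C
```

Same arithmetic, same operation count as the textbook loop; only the access pattern changes.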
An asymptotically fastest algorithm might not be the fastest algorithm in practice. That's especially true for applications that are already ridiculously optimized to specific hardware, such as the matrix multiplication.
@@btf_flotsam478
Not always.
Bott periodicity in 8 dimensions has values 16 times as big.
Even though lines of code can be only n strings short, the amount of complexity embedded in a single character can be infinitely large.
The caching behavior on nearly every processor will determine the absolute speed limit for MM. One reason Strassen-style partitioning works is that it continuously tesselates the large matrices which brings down the workingset size and reduces cache overflow lossage.
Similar considerations apply to parallel implementations.
I didn't follow along with some of the more detailed parts but overall found it to be really interesting. Nice work to everyone involved in this, directly or otherwise.
They managed to put AI to work juggling complex algebraic compositions/decompositions and applied the results to optimizing computer calculation tasks. Nothing mind-blowing, but an important step towards self-improving algorithms.
Matrix math is just one of those things where more efficient equations might be good on the theory side, but there are far greater gains from specialized hardware for that kind of math. Whether something needs 8 vs 7 multiplications might be a factor if you are forced to do them consecutively; when you have hardware that does it all at once, one more "lane" of computation for that 8th figure is a moot point. Back in the day there used to be math co-processors, and we've come back full swing: now the GPU IS the math co-processor once more, with things like CUDA and tensor cores that are specifically designed for that kind of math. The big question is: what other kind of specialized hardware would be of high value, and should we implement that rather than pushing the limits of math theory to find efficiency?
If you have to deal with NxN matrices where N grows arbitrarily large and a multiplication takes N^k operations, then faster hardware can't compete with an algorithmic improvement that lowers the exponent k. The point of the 8->7 trick in the Strassen algorithm is that it lowers the exponent k from 3 to 2.808 for larger matrices.
However, if N is fixed and the improved algorithm is very complicated, then it may well be faster to use a simpler algorithm.
According to a 2016 paper, on then modern hardware the Strassen algorithm was 20% faster for N=512 (and for larger N the gain will be even larger).
@@ronald3836 my man here with that DATA. Dayum
Wow you’re short sighted, like you have no idea what you’re talking about it’s so wrong
@@ronald3836 can u mention a real life application where we'd need a 512 * 512 matrix?
Edit: I just remembered photos are basically matrices of pixels. But would still love to hear about any other applications?
@@ronald3836 can you cite the paper? Not calling you out or anything, but no modern BLAS implementation makes use of Strassen's. I'm sure OP understands algorithmic complexity, and they're still right: hardware is the deciding factor for all practical implementations of matrix mult.
Never explained how AT achieved the matrix calculations with fewer steps than previous (49 down to 47, or 98 down to 96).
Indeed! Left me gravely unsatisfied.
After doing electrical engineering and being traumatized by Matrices, it blows my mind how easily you explained it.
To explain the AlphaTensor result it would be better to show and explain the Karatsuba algorithm, as AlphaTensor found something for 4x4 similar to Karatsuba for 2x2. Sure, Strassen beats Karatsuba on 2x2, so it should be mentioned, but for the didactics of the solution, Karatsuba gives more insight into the result.
@Techmagus76
You are on the right track.
Karatsuba discovered n ^ 1.58.
The improvement was n * log n.
These are completely different problems. Karatsuba and its extensions like Toom-3 are for polynomial (or big-int) multiplication, whereas this is about matrix multiplication. One source of the mistake might be that after Karatsuba's discovery there was also an algorithm by Strassen (and Schönhage) performing the polynomial multiplication (or convolution) using the Number Theoretic Transform recursively.
Have any of these researchers looked at the impact of this new algorithm on specialized matrix libraries such as TensorFlow, and on GPUs? It's possible that the 50-year-old algorithm will execute faster than the new DM algorithm if, for example, there's a significant advantage to using specialized functions tuned for GPU instruction sets. It would be important to #1 - know the answer, #2 - know where the cross-over occurs, and #3 - know which existing libraries will adopt this new algorithm, and when.
I guess - none, because it can only do 1 bit computations :> 10:45
I have a better algorithm requiring new tensor operations on a register set of a CPU (or GPU).
I have something else to do right now, but the key is: new tensor operations on the register set.
Those who have the resources for hardware algorithms will easily solve it as well, and also produce the CPU
For very large matrices you can get an easy 40% speed-up over the naive algorithm if you order the multiplication and addition operations in a cache-coherent order. I'm skeptical it would be faster than Strassen's, though.
Agreed, but these are considered implementation optimizations, not a different and superior algorithm. The criterion used for a better type of algorithm is a reduction of the power to which the matrix dimension is raised; that is, the "big O" [order of complexity] of the algorithm.
@@jimwinchester339 Yup, I took that class decades ago. ;)
Algorithms that reorder operations to achieve better computation locality are not simple implementation optimizations.
They are genuinely different algorithm. For instance one can use them to decompose matrices in sub-blocks.
This allows for instance better parallelism. Or it can enable you to break your original problem in different way, which can give you speed up orders of magnitude better than a pure better multiplication algorithm.
Another thing is that algorithms with better O may be much more difficult to implement in a "computer friendly" way.
For instance, they make it difficult to take advantage of long cache-line memory fetches, or their poor locality leads to cache thrashing.
So, OK, they are more satisfying on paper but worse in real life.
@@YouRich92A small follow-up. Suppose we're multiplying two large square matrices (A X B = C). First flip B about the main diagonal, then both A and B can easily be traversed in a cache coherent order, instead of only B. I'll have to experiment with this to see how well this works.
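The flip-B idea in code (a sketch; the function name is mine): transpose B once up front, after which every dot product reads two rows left to right, i.e. in memory order for both operands:

```python
def matmul_transposed(A, B):
    # One-time transpose of B; afterwards row i of A and row j of Bt
    # are both traversed sequentially during each dot product.
    Bt = [list(col) for col in zip(*B)]
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]
```

The O(n^2) transpose cost is amortized over the O(n^3) multiplication work, which is why this is usually a win for large matrices.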
@@jimwinchester339 Many theoretically superior algorithms run much slower than optimized "inferior" implementations. Any linked-list case being the poster-child.
Cache-coherency is King.
The first Multimedia Extensions for CPUs that Intel made, the MMX instruction set, were basically matrix operations, mostly multiplications.
Minor error at 2:53. Shouldn't it just say 67% rather than 67% less?
Good catch! We'll add a correction in the YouTube description.
I was confused
heh they blurred the word "less"
@@viliml2763 good enough :D
@@kamel3d haha
2:23
Quanta Magazine: Going from eight down to seven multiplications may seem trivial.
me thinking: for 2x2 sure but for something like 8x8 it is significant.
2:29 *shows 8x8*
me: lol
Math just goes over my head, but it is cool that we can automate the trial-and-error aspect of mathematical models.
It's better than trial and error, but still kind of mindless. It's like evolution's survival of the fittest.
yeah its greedy with extra steps
I am extremely impressed that they managed to turn this abstract task into an ML problem. I am really surprised at how computationally expensive this can be by brute force. A 4x4 matrix multiplication corresponds to a tensor of 16x16x16 = 4096 cells. Since the entries are in F_2, a single rank-one term of a decomposition already takes 16+16+16 = 48 bits to specify, so brute-forcing all 47-term decompositions means on the order of 2^(47*48) ~ 10^679 candidates. Insane
You should make a video on constraint-based solvers; it's just amazing that you can bypass all the physics and still do real physical simulation with just a few linear equations, which in turn just use matrix multiplication..
It's not really physics it's more of geometry
I so respect guys like Fawzi. Amazing. Imagine what is going to happen within the next 30 years?
Okay but do we know yet if this type of multiplication algorithm will work in any context other than Z/2Z? It is not surprising to think that there are extra tricks to multiplying matrices in Z/2Z that one could use which are useless over the reals.
Good point. It seems like there are some P vs NP implications. If matrices with only binary elements trouble the Google DeepMind team, then doing this with matrices over non-binary element sets would probably be too convoluted.
The records of ℤ/2ℤ are of course better than ℂ, where ab=ba, although this is not as important as the records where ab≠ba, like matrices.
But the records are really close to each other.
@@davidcotham1939 My dear colleague, I fail to see how this has anything to do with P vs NP - this is not search nor is the complexity exponential.
No it seems one can apply improvs to any floating point number (or complex number for that matter). The article says that for ℤ/2ℤ however: "In a few cases, AlphaTensor even beat existing records. Its most surprising discoveries happened in modulo 2 arithmetic, where it found a new algorithm for multiplying 4-by-4 matrices in 47 multiplication steps, an improvement over the 49 steps required for two iterations of Strassen’s algorithm. It also beat the best-known algorithm for 5-by-5 modulo 2 matrices, reducing the number of required multiplications from the previous record of 98 to 96. (But this new record still lags behind the 91 steps that would be required to beat Strassen’s algorithm using 5-by-5 matrices.)"
It seems very likely that 49 is optimal over the reals. One way a 47 algorithm could arise is if two rank one tensors which are distinct over the reals become equal over Z/Z2 and cancel (since addition and subtraction are the same in Z/Z2). Also for some other matrix formats where better algorithms are known for Z/Z2, those beat the best known algorithms for the real case by exactly 2 multiplication steps.
*FYI: THIS ISN'T "NEW," IT'S LITERALLY REPLICATING A STACK OF WINDOWED PAPER....WHICH IS HOW MATHEMATICIANS ORIGINALLY SOLVED MATRIX PROBLEMS MANUALLY. THE AI JUST RE-DISCOVERED AN ORIGINAL CONCEPT BUT IT DIDN'T FIND ANYTHING "NEW". THIS VIDEO IS INCREDIBLY MISLEADING.*
Um.. 4x4 of ones and zeros... I would just pre-store a cache of combinations of the row results: 4 bits per row, 16 bits for the second matrix, 4-bit result row = 512 KB of nicely cached memory. 4 lookups, done. ;)
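A scaled-down sketch of that lookup scheme (encoding choices are mine; the full version would index by the 16-bit second matrix as well, giving the 2^20-entry, ~512 KB table): for a fixed B, precompute all 16 possible row products mod 2, and multiplying any A by B becomes four lookups.

```python
def row_times_B(row_bits, B):
    # row_bits: 4-bit mask, bit k = entry k of the row; result row mod 2.
    out = 0
    for j in range(4):
        s = 0
        for k in range(4):
            s ^= ((row_bits >> k) & 1) & B[k][j]  # AND then XOR-accumulate
        out |= s << j
    return out

def matmul_lookup(A, B):
    table = [row_times_B(r, B) for r in range(16)]  # all 16 possible A-rows
    C = []
    for row in A:
        bits = sum(v << k for k, v in enumerate(row))
        res = table[bits]  # one lookup per row of A
        C.append([(res >> j) & 1 for j in range(4)])
    return C
```

With B folded into the index as well (table[row_bits | B_bits << 4]) you get exactly the 2^20-entry table the comment describes.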
Already knew about the AI, but didn't know that we managed another reduction! Cool stuff!
Truly amazing solution to centuries-old problem
Error codes, multiplications: everything is done faster by splitting it up. One thing I noticed back in the day: when I used everyday D6 dice for a few moments a day just to sharpen my addition in a trivial way, my arithmetical tasks went smoother. I split everything I could into smaller chunks in my mind. The Universe is a weird fractal :)
As long as people do not start calling computer programs 'proofs' with no formal error analysis, I'm fine with machines helping out. What I do not like is calling something a theorem or a proof when it is clearly not. And it is already happening in many areas.
Holy crap, I just understood matrices at a much deeper level than what I was taught in university.
The caveat is that it works only with binary elements. What's interesting, though, is the fact that there's a loss of the rigor that comes from mathematical proof whenever you use an experimental approach (try and see if it works; if not, try something else but make sure you don't try anything close to what didn't work before). That is why the mathematicians were able to find a better solution that the machine learning algorithm missed. These two caveats suggest that the method (of the machine learning algorithm) is not a "proof" in the purest sense, but rather a "proof" in the pragmatic sense ("the proof's in the pudding"). It's typically what engineers do. I remember when I studied digital signal processing in the early days, prior to the advancement of discrete mathematics. Mathematicians claimed that the sampling theorem was wrong, and they were right in the purist sense, but that doesn't mean it can't be used in practice, because the ear cannot perceive the errors if you sample beyond the Nyquist rate. At some point, the mathematics journals will come to accept machine learning experiments as "proof", and the meaning of the word and mathematics itself (as a language) will lose the precision it's known for that comes from the rigor it has traditionally enforced.
The narration says (2:55) that Strassen's method requires "a third less" computation while the video contradicts this by saying "67% less". 343 is 67% of 512 so it is 33% (a third) less than 512.
They meant 67% of the original value but yes if one is being pedantic then you are correct and in this world one can never be too pedantic or descriptive in my opinion
The day will probably come, when AI mathematicians can go over the possibilities so much faster and so much more creatively, it'd be like asking a five year-old to advance the state of mathematics now. Wolfram has an AI that has been coming up with new theorems for some time. I'm not fooled, when ppl say "It won't replace ...". We'll have significant, useful input to add for a while. The time when we won't, I think, isn't nearly as far off as we think. Now, imagine Malthusian elites directing AI, while they still can ...
They started in Characteristic 2 and extrapolated the algorithm to all numbers. Brilliant strategy!
Love the recursive nature of computer science
Love the recursive nature of the recursive nature of computer science
Thank you Dr Simon for funding this
If the matrices only contain 0 or 1, you can build the whole naive algorithm in hardware, where you multiply 2 matrices in a few ticks (don't know how many layers that algo has... should be around 1 to 4... multiplying and adding mod 2 are trivial bit operations which can be executed in parallel). There are probably ASICs for exactly that situation, where you have "zero" multiplications per 4x4 matmul mod 2.
Do you really think no one thought of that? Of course there are companies that have been building such tensor accelerators for deep learning for years. Google "TPU".
Strassen's algorithm was the work of a genius because he paved the way and showed it was possible. Without that breakthrough nobody would have tried. I have no idea how he came up with his method, in particular because the additions needed before the multiplications are done are not at all obvious. The graphics here showed nicely how the 8 3d tensors could be used to represent the operations needed for the matrix multiplication of two 2 x 2 matrices in the standard algorithm and similarly for Strassen's algorithm.
It's mind-boggling that Strassen's method wasn't discovered a century ago or more. Mathematicians seem to have been expanding and reducing and playing with equations since antiquity. I would never have assumed that multiplying two 2x2 matrices cannot be done with fewer than 8 multiplications, but simply assumed that it had been found by consensus to be objectively or subjectively the 'best' way.
It's always a case of what's more intuitive to teach. The "classic" method is neat and understandable, and efficient for teaching purposes (no exam will ask a 5x5 * 5x5 matrix multiplication). Machines don't have an instinct, they execute code so what matters is only the code itself, not how understandable it is to a human.
The strength of this method of finding things is that they can just tune the parameters to whatever specific case they need and the model will, hopefully, eventually converge to a solution or to a path to the solution, which is incredibly fascinating.
Also, most of the mathematicians of past centuries were extremely good at calculating things in their head, probably more often than not savant-like. They had to do all kinds of calculations either in their head or by hand, with close to no support from machines. And in this environment, there is not much to gain from reducing multiplications and having more additions instead. For smaller numbers it's close to the same effort for (very skilled) humans, and for bigger numbers they'd usually just approximate the calculation with pre-calculated logarithmic tables, and then it's just an addition again.
It's just that computers can do additions much faster than multiplications or lookups in (big) precomputed logarithmic tables.
What I miss here in the video is why the results for matrices with entries in {-1, 0, 1} matter. Shouldn't a multiplication of these 3 numbers take the same computation time as an addition? I guess we can use the algorithm to adapt it for R^n spaces, but that's somehow unmentioned. Or maybe I don't understand it.
AlphaGo found new strategies in a game that has been played by humans for 4500 years. The human brain has its limits.
There was never a use for Strassen's method before the use of digital computing. If you have to multiply matrices by hand (or abacus) it's more important to not make mistakes and to not get ridiculous intermediate values. Strassen is also not as numerically stable.
@@gehirndoper It is as numerically stable as the normal algorithm. You are right about the digital computer and Abacus though :-)
Not a criticism of what looks like a great systematic approach to a necessarily exhaustive methodology, to which we just add that the Universe of Logarithmic Time Duration Timing is obviously self-defining probability, and will evolve such solutions as demonstrated by Euler's pure-math function, ie it's QM-TIME Completeness cause-effect holography in the rough.
(But a proper Mathemagical analysis will deduce the best smoothness choice of processes, fun to imagine for Mathematicians)
Bruh ai literally made itself faster 💀💀💀
14:56
@@raheelrehmtbro really linked to a time after the ending
thats how it all starts!! terminator...
AI doesn't operate with modulo-2 matrices, though! It's not actually relevant to AI computational speeds.
"The right way to look at this ..." seems like a very short sighted / unimaginative statement
This is amazing stuff. Just one point of curiosity: why are modulo 2 matrices useful? Is there some real world application that uses just 0 and 1 in a matrix, which can benefit from a speed increase?
Everything computer-related can pretty much be broken down into binary computations as far as I know: complex boolean logic algorithms. I hope somebody comments with a real-world application; I'm interested too.
Now I understand how important it is for researchers to create UA-cam videos of their research: reading research papers feels very daunting, and videos give the work more reach and readability, which will increase its citations.
“Don’t praise the machine.”
The Simpsons, S05E17
I hear a giant “for now” after each reassurance that we are not being replaced by this software.
Quanta 👌🏻👌🏻 Always making these awesome and informative videos. Some enlightenment every time!
Punit paaal Singh 😅
Awesome! But do these results extend to real number matrices? I believe most applications use real number matrices rather than mod 2
No computer works with real numbers. Computers work with finite fields, such as F2 (which is the same as working mod 2).
@@nickfaire floating point numbers if you wanna be that picky
@@myxail0 First, I wasn't picky. I just explained in a few words that real-number matrices are not used by computers, but rather matrices over finite fields, and hence these results are helpful. And second, floating-point arithmetic lets you represent real numbers while working with finite fields. What I said is as "picky" as you can be. What you said adds extra info, but does not correct me, because there is nothing to be corrected in what I said.
@@nickfaire What you said wasn't helpful at all. There's a high chance the commenter used "real numbers" to mean floating-point numbers. That's nothing outlandish; it's what they are for, modeling real numbers. On the other hand, it's a huge algorithmic step from binary matrices to floating-point matrices, and it's not trivial at all. Hence why the question you answered is interesting, and your answer isn't. In my opinion.
@@myxail0 I think you're both correct, but from different perspectives.
Being based on binary (even floating point!), the efficient, fast algorithms actually used are binary or work in powers of 2.
I realize you wanted to know from a practical perspective, in terms of real-world applications? The situation is, it can vary wildly between processor architectures. But discounting that, the real bottleneck is memory and bandwidth. Maybe this could help in theory, but I don't think it will as long as the input and output sizes of the multiplication don't change.
So nice that google invests money in such fundamental research. Still, I think the video title is kind of click-bait because the method is only applicable to very specific matrices.
Reminds me of the famous Seven Queens problem in Chess.
Eight Queens problem has a solution.
I think I've found a channel to binge watch.
Yeah, I learned about this a few days after it happened. This, along with AI speeding up the design of the lenses that print silicon chips, as well as the improvements it makes to transistor layouts... I'm thinking we can already say it is improving itself rapidly. These improvements will then lead to other improvements shortly. Here we go, singularity!
Circuit analysis using SPICE involves converting nodes in electronic circuits into values in a matrix. The node values change with frequency.
While this sounds good, limiting a matrix to values of zero and one would impact the ability to model frequency response.
It blew my mind when Deepmind discovered Strassen's Algorithm in just minutes.
@MorTobXD Good for you sir!
Love this channel
12:41 I don’t think people can be made irrelevant by any of this kind of work; it just empowers people to do more.
That’s the right spirit, amen.
Remember that matrix multiplication is the basis of neural nets, so AI is actually improving itself already!
Not yet. The multiply it improved is over GF2. The multiplies used in AI are finite-precision "reals", aka floating point. 😁
Thanos: I used matrix multiplication to improve matrix multiplication
I was going to say exactly what @tolkeinfan1972 did... This only functions with entries in F_2 or GF(2) (same thing, different convention: F_2 means the unique finite field of two elements up to isomorphism, whereas GF(2) means the Galois field of two elements, and they are the same). In neural networks, the matrices used don't have entries from F_2: they have entries that are floating point numbers, and thus this improvement isn't applicable to neural nets.
@@AndreiFierbinteanu
Theranos: We used only one drop of your blood ...
But here, look at this nice and colorful presentation about matrix multiplication.
... oh and btw. we also burned all of your investment capital for that. Good bye!
so what
You should have a look at how compilers compile earlier versions of themselves, and the tricks they have to use in terms of the feature set of the first compilers.
This is too long to explain in a comment, and it's well documented in the Dragon book (a book on compilers).
Quaternions are used for 3D rotation: 16 multiplies versus 27. The problem with matrices is floating point; composition of matrices increases the level of imprecision quickly.
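For the curious, the 16-multiply count comes from the Hamilton product, versus 27 scalar multiplies for composing two 3x3 rotation matrices. A quick Python sketch of the Hamilton product (the function name is just illustrative):

```python
def quat_mul(q1, q2):
    """Hamilton product of two quaternions (w, x, y, z).
    Exactly 16 scalar multiplications, versus 27 for multiplying
    two 3x3 rotation matrices entry by entry."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)
```

For example, the unit quaternions obey i*j = k: `quat_mul((0,1,0,0), (0,0,1,0))` gives `(0, 0, 0, 1)`.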
Correct me if I'm wrong, but I've always considered society to be pushed on by a few brilliant people, and then a lot of very intelligent people having their minds 'opened' let's say by a breakthrough that the likes of Da Vinci or Einstein developed. For example, as a very basic example, many secondary school pupils can get a good understanding of integration, but without Newton, none of them could ever have been able to perform such math.
in this way, a 'normal' or 'bright' person is accelerated by the one-in-a-century geniuses that pop-up every now and then.
Could AI act as a faster replacement for these geniuses, providing these new ideas and breakthroughs that accelerates what a 'bright' person could do?
Are we replacing our geniuses? (not that this is necessarily bad, just are we)?
AI is still just a tool and it can't be compared to the human mind. Not yet. A lot of work was done by the humans who stand behind this discovery.
dude without farmers and food producers nobody could make any ai
As long as this operation is numerically stable. Lots of algorithms lead to huge accumulation of operation-to-operation numerical errors.
"The biggest threat that arises from AI is that some people will conclude too early that they have fully understood it."
You are probably talking about general AI; this thing can do just one thing: find better algorithms to do the multiplication.
@@jondo7680 Really? Oh, thanks for the explanation, I didn't know that, because I thought it was a Skynet prototype. What would we have done without your crystal-clear and precise analysis. Please delete your stupid comment and I will delete this one here, which I would appreciate. Maybe you can stay away from the Internet for a couple of days as well.
@@homerp.hendelbergenheinzel6649 what the hell did he do to you, damn, you really have nothing better to do than diss internet strangers
@@homerp.hendelbergenheinzel6649 the only thing stupid here is your 🧠 my friend.
The problem is, almost ALL matrices are NOT JUST 0 or 1! So you're stuck with doing the FULL matrix multiplications. But good news: AFTER you have done the full matrix multiplications, those can be used as a lookup table to solve the answer you are working on.
Me, who hasn't even graduated high school, watching the video and reading all the comments/arguments like I know better
Both this result and this presentation of it are absolutely spectacular
The final comment on mathematicians being replaced by computers was very cliché
it was a cliche question
There once was a job called "Computer".
I was a maths savant in school; can't remember a single bit of it at 52. Would have no idea how to sigma an equation. I might be able to do some basic algebra and work stuff out using common sense, but it's all gone... like the sands of time.
Ain't no way a 52 year old listening to sgp
One can only imagine the success of Srinivasa Ramanujan if he had had access to these types of computers. I wish he were still alive to contribute to our society. 😢
02:56 a third fewer is 33% less, not 67% less
This specific kind of technique isn't going to replace mathematicians, but other techniques, some that already work and just need to be scaled up, almost definitely will.
I remember this technique being taught to us when we did our math major in the mid-90s.
Nice work DeepMind, and the humans who collaborated with he/her. While matrices of 1s and 0s are common in many situations, did you ask DM to work on a general solution for all matrices? I'm hoping you'll put DM on solving Kronecker matrix multiplication next. Its resultant tensor grows much faster than the matrix cross-product's (linear vs. square). Hence, a [4 x 4] x [4 x 4] cross product only results in a [4 x 4] resultant matrix, whereas a [4 x 4] x [4 x 4] Kronecker results in a [16 x 16] resultant matrix! We use this a lot in physics.
Deep mind is a company and isn't gendered
It's either "them" (the epicene) or "it" (the neuter)
thanks for the accessible yet detailed explanation. really cool
Since I do not use these methods because of the obvious bottlenecks, this presentation shows me a root aspect of the tunnel vision plaguing industry. It motivates me to work out an explanation for a presentation of my methods.
You can tell right away that you're professionals.
Thank you for your work! Best math and tech channel ❤
this video finally helped me understand tensors. also super cool
Wow, this video is truly mind-blowing! Training the AI system AlphaTensor to explore faster algorithms for matrix multiplication is a remarkable feat with far-reaching implications. The potential impact on various disciplines is truly exciting to envision. Thank you for sharing this fascinating glimpse into the future of mathematical research and its potential real-world applications!
get out of here with this goofy ass chatgpt comment
The algorithm for quaternion multiplication gets its compactness from projecting the results onto the hypercube and back to the 3D cube; that compactness would be available to matrix multiplication if DeepMind's AlphaTensor used the same path. I bet a 6-pack of beer that it would improve on the current best result.
MMULT function is matrix multiplication.... it's discussed in LINEST multiple linear regression completely explained in Google sheets.
«Wow! This algorithm uses fewer multiplications!» - «Aha, and it works only when multipliers are zeros or ones.» 😉
It's funny cause it for sure adds more additions than the amount of multiplications it saves. This should make it slower than the regular algorithm?!?!
@@Last_Resort991addition is less computationally expensive than multiplication
@@tone618 Not if you multiply by zero or one.
O(n^3) is still polynomial complexity; imagine if you had exponential or factorial.
About the spelling: in German, V is read as F and their W is read as your V.
St is read as Sht.
This is pure hyperbole; seasoned mathematicians haven't mastered matrix multiplication? Pure absurdity
Maybe what the author means to say is that they have yet to find the most effective method of multiplying matrices.
This could potentially do wonders for video codecs
Your videos are great! Lots of science videos on YouTube are bullshit filled with stock photos that don't even explain the concept the title promises.
I'd say it's more impressive what Strassen did, but kudos to the team over at Google. ;)
You can read more about this work on Quanta's website: www.quantamagazine.org/ai-reveals-new-possibilities-in-matrix-multiplication-20221123/
Correction: At 2:53 in the video, the text previously read "67% less" but has been changed to "67%" for accuracy.
Error in the percentage of reduction at ca. 3 minutes. You meant a 33% reduction for 8 rows.
11:50 What is the second word on the screen?
@@LucenProject It's the initials of the team that published the paper "Discovering faster matrix multiplication algorithms with reinforcement learning".
3:47 Why is the matrix broken into four 4x4 matrices instead of breaking it into 16 2x2 matrices?
When I perform the new algorithm recursively on an arbitrarily large (N x N) matrix, I'd break it into 4*4=16 matrices of dimension (N/4)x(N/4). These "blocks" are then combined exactly the way the scalars in a 4x4 matrix are combined with 47 multiplications, only here the multiplications are performed recursively. This means that the input size N is divided by four in each step of the recursion, and the runtime is N^{log_4(47)} ≈ N^{2.777} by the Master theorem.
Doing it the other way around as shown in the video would be performing Strassen on the topmost call (7 multiplications of problem size N/2), am I wrong?
@@Flovus thanks!
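The exponents in this thread are easy to check numerically. A small Python sketch (the constants 8, 7, and 47 are the multiplication counts discussed above):

```python
import math

# Master theorem: splitting an N x N multiply into blocks of size N/b and
# doing `a` recursive block multiplications gives a runtime of N^(log_b a).
naive_exp       = math.log(8, 2)   # classic 2x2 split, 8 multiplications -> exponent 3
strassen_exp    = math.log(7, 2)   # Strassen's 2x2 scheme, 7 mults       -> ~2.807
alphatensor_exp = math.log(47, 4)  # AlphaTensor's 4x4 mod-2 scheme, 47   -> ~2.777

print(naive_exp, strassen_exp, alphatensor_exp)
```

So 47 multiplications on a 4x4 split does beat Strassen's exponent, but only slightly.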
(Title of ) video starts @3:55
Very interesting ! It went from 2-D to 3-D. Is it possible to use a 4-D space or higher, to reduce the number of steps even more?
A 3d tensor is used because it's multiplication of 2 2d tensors. So for higher dimensions you need to multiply higher order objects, or do something else. 4d space is naturally suited for 3d rotations for example..
You might be interested in geometric algebra and projective geometry too
@@shoam2103 I’m a bit confused. Are you saying when you multiply 2 matrices in R^2 you get a matrix in R^3? I don’t believe that’s correct.
It stands to reason that this discovery should be applied to the study of file compression.
I hate the whole "there are more options than atoms in the universe" thing, because whenever it's brought up, basically everything but a few options gets eliminated very quickly once basic rules are introduced into the system. If you've got a huge floor and need to make a mosaic on it, your possibilities are absurdly huge if you just fill it with random tiles, but as soon as you set any kind of rules or boundaries the possibilities start to shrink quickly, until you reach the limit between "there is one solution" and "there is no solution".
If it only works for some cases, it may not be really useful in application. It's like adding an "if" condition to check whether a given case should use the old approach or the new one to get a relatively faster result. That check may itself be time-consuming, and the time consumed could in fact simply be spent on solving the calculation directly; also, this only adds complexity to the straightforward solution.
Correction: the Strassen algorithm has 33% fewer operations, not 67%, at 2:57. If 512 is 100%, then 343 is 67%, so it's 33% less.
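The arithmetic behind this correction, as a quick Python check (512 = 8^3 scalar multiplications for the naive 8x8 product; 343 = 7^3 with Strassen applied recursively through three levels of 2x2 splits):

```python
naive, strassen = 8 ** 3, 7 ** 3   # 512 vs. 343 scalar multiplications

print(strassen / naive)       # fraction of the naive work Strassen still does (~0.67)
print(1 - strassen / naive)   # fraction saved (~0.33), i.e. "a third fewer"
```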
Strassen's algorithm is still far from obsolete. This method is only good for limited matrices with 1s and 0s
If one is just multiplying two 2x2 matrices of ones and zeros, there are only 16 possibilities for each matrix, so you can just use a 256-entry, 4-bit lookup table and get the answer in O(n^2) time. Then you just divide into blocks. It doesn't seem like the case of elements being zero or one is that interesting.
You don't need lookup tables; this can be implemented directly in hardware, and most ALUs (CPUs) should have it.
Ah, I'm a bit confused... doing matrix multiplication when you have only 0 or 1 can be done without any multiplication; you just replace the multiplication operation with an 'AND' gate, you know: 0x0=0, 0x1=0, 1x0=0, 1x1=1...
Yep, can even be implemented directly into hardware.
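The AND idea above extends to a full mod-2 matrix multiply: multiplication becomes AND and addition becomes XOR (addition mod 2), so packing rows into integers lets one AND cover a whole row of "multiplications". A Python sketch (the function name and packing scheme are just illustrative):

```python
def gf2_matmul(A, B):
    """Multiply two n x n matrices over GF(2), i.e. entries 0/1 with
    arithmetic mod 2. Each row/column is packed into a Python int so
    one `&` does a whole row of AND-multiplications, and the parity of
    the set bits (popcount mod 2) is the XOR-sum of the products."""
    n = len(B)
    # Column j of B packed as an int: bit i holds B[i][j].
    cols = [sum(B[i][j] << i for i in range(n)) for j in range(n)]
    # Row i of A packed as an int: bit j holds A[i][j].
    rows = [sum(A[i][j] << j for j in range(n)) for i in range(len(A))]
    return [[bin(r & c).count("1") & 1 for c in cols] for r in rows]
```

This is why multiplication counts over GF(2) are a slightly odd cost model: each scalar "multiplication" is a single AND gate in hardware.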
Will a computer make mathematicians obsolete? I'll believe it when an AI resolves the Riemann Hypothesis.
Then you'd still need mathematicians who are able to understand both the algorithmically generated proofs as well as the algorithms themselves. AI will always automate some tasks while at the same time creating new tasks. The outcome won't be less human work, but higher efficiency for the same or even higher amount of human work.
The smile at 11:20 is devious.
It empowers top-notch people to do more. Everyone else is kind of screwed
Nothing new. Only top-notch people bother to study enough of the cutting-edge to be empowered by it, the rest don't. Tale as old as time. Nothing has changed, despite easy access to massive amount of information.
I don’t see why we want to reduce the number of multiplications for binary numbers. Computationally, multiplying 2 bits should be the same difficulty, if not, easier, than adding 2 bits.
That's probably the reason why nobody found this genius algorithm in 50 years: no one in their right mind would search for an algorithm that replaces binary multiplication with several binary additions to slow down their program.