Scalable MatMul-free Language Modeling (Paper Explained)

  • Published 21 Aug 2024

COMMENTS • 113

  • @user-vw5pg5vr3g
    @user-vw5pg5vr3g 1 month ago +42

    Loved that the references for BitNet are 10 and 11

  • @eoghanf
    @eoghanf 1 month ago +19

    Your point about estimating whether non-straight lines cross based on three datapoints is a very good one. HOWEVER, the reason for giving them the benefit of the doubt on the training dynamics side is that the *inference* time power efficiency gain (which you don't spend any time on!) is massive. From the abstract "We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency". That's pretty amazing.

  • @wolpumba4099
    @wolpumba4099 1 month ago +51

    *Summary*
    *Problem:*
    * *(**2:30**)* Matrix multiplications (MatMuls) are the core of modern machine learning, but they are resource-intensive and require specialized hardware like GPUs.
    *Proposed Solution:*
    * *(**0:00**)* This paper proposes eliminating MatMuls entirely from large language models (LLMs) while maintaining competitive performance.
    * *(**16:35**)* The architecture replaces:
    * *(**16:35**)* *Attention layers* with parallelizable recurrent layers inspired by GRUs.
    * *(**5:55**)* *Dense layers* with "ternary accumulation," using quantized weights limited to -1, 0, and 1. This replaces multiplication with simpler selection and addition operations.
    *Key Findings:*
    * *(**38:30**)* *Performance:* The MatMul-free models perform on par with state-of-the-art Transformers at scales up to 2.7 billion parameters.
    * *(**38:30**)* *Scaling Laws:* The performance gap between MatMul-free models and traditional Transformers seems to decrease with increasing model size, suggesting a potential crossover point where MatMul-free models become more efficient. However, the video author expresses skepticism about this extrapolation.
    * *(**45:00**)* *Hardware Efficiency:* The proposed architecture significantly reduces memory usage and latency. Implementing it on custom hardware like FPGAs, optimized for ternary operations, could lead to even greater efficiency gains.
    *Author's Opinion (Yannic Kilcher):*
    * *(**48:20**)* The research is exciting and promising for edge computing and energy-efficient AI.
    * *(**48:20**)* He remains skeptical about:
    * Whether MatMul-free models can truly surpass traditional Transformers in performance, especially for complex tasks.
    * The validity of extrapolating scaling laws based on limited data points.
    * Whether the simplification trade-offs (like removing state-dependent hidden state updates) limit the architecture's ultimate capabilities.
    *Overall:*
    The paper offers a compelling alternative to traditional MatMul-heavy LLMs, with potential for improved hardware efficiency. While challenges and open questions remain, it presents a promising direction for future research and development.
    I used Gemini 1.5 Pro to summarize the transcript

    • @interstellarsurfer
      @interstellarsurfer 1 month ago +5

      I guess Gemini isn't completely useless. 🤷‍♂️

    • @theupsider
      @theupsider 1 month ago

      That's what LLMs are for. Thanks

  • @ttul
    @ttul 1 month ago +27

    The FPGA angle is what's interesting about this research. The paper proposes replacing all feed-forward operations in large language models with more computationally efficient operations, mostly by using ternary weights (i.e. -1, 0, and 1 are the only allowed values). Ternary weights basically act as a simple logic gate with only three permitted operations:
    a) Change the sign of the input (i.e. flip the sign bit and copy the rest)
    b) Output zero
    c) Copy the input to the output
    If your goal is to make a neural network scream on hardware, having only three simple operations to choose from means you can use simple logic gates. The researchers tried this out on FPGAs, and this is a promising area of research. From FPGAs it's not a big leap to ASICs, which nets the most power-efficient computation theoretically possible. So if ternary gate networks can be made to scale, everyone should be excited.
    Caveats:
    1. The attention mechanism is replaced with a parallelizable form of recurrent neural network because applying ternary operations to attention does not train.
    2. A linearized Gated Recurrent Unit (GRU) architecture allows for parallel computation; this is a neat trick.
    3. The channel mixer (a feed-forward equivalent) uses dense layers with ternary accumulation operators.
    Results show performance comparable to traditional Transformers, with better scaling properties at larger model sizes.
    Yannic expresses some skepticism about the projected crossover point where this architecture would outperform traditional Transformers.
    But I think the really interesting thing about this is the FPGA/ASIC aspect.
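
    A minimal NumPy sketch of the three-operation idea above (an illustration, not the authors' FPGA implementation; ternary_matvec and the toy sizes are made up): each ternary weight either adds the input, subtracts it, or skips it, so the whole product needs no multiplications.

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Compute W @ x for W with entries in {-1, 0, +1} using only
    selection and addition: +1 adds x_j, -1 subtracts x_j, 0 skips it."""
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Sanity check against an ordinary matmul
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8)).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
assert np.allclose(ternary_matvec(W, x), W @ x)
```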

    • @robmacl7
      @robmacl7 1 month ago +1

      You could also reduce some work by pre-processing the weights to just drop the zero entries, but this would be somewhat of a nuisance for a hardware realization because the work needed would vary by output element.

    • @hjups
      @hjups 1 month ago +1

      @@robmacl7 Why would variable work be an issue? You replace a deterministic sequence with signal barriers that only occur at synchronization points in the compute graph.
      The bigger issue with dropping zero entries would be the extra step needed for decompression into a dense operation (e.g. stored as RLE or a Sparse format), and then aligning fetches to DRAM bursts.

  • @philiptren2792
    @philiptren2792 1 month ago +7

    19:15 I think the model will learn to be more efficient with the extra capacity. We can increase the length of the vector, and the model will learn to use higher accuracy for the important values and lower accuracy for the ones where precision doesn't matter as much, saving unnecessary precision. It's like quantizing each and every weight of the model independently, by exactly the right amount.

  • @KevinHorecka
    @KevinHorecka 1 month ago +15

    "stay hydrated" was a shockingly helpful reminder that I haven't drunk any water today. Thanks!

  • @Mordenor
    @Mordenor 1 month ago +2

    Thank you Mr Yannic for explaining MatMul free Language Modelling to your viewers!

  • @HansKonrad-ln1cg
    @HansKonrad-ln1cg 1 month ago +1

    I have heard that after training you can basically throw away 90% of a network without changing the behaviour too much. That is because most of the weights are near zero, which basically means a non-existent connection between the neurons. So if you omit the calculation right away by treating them as exactly zero with the ternary values, you save a lot of time that would otherwise have been spent multiplying by zero for no reason.

  • @clray123
    @clray123 1 month ago +1

    What I missed in the video and in the paper is an interpretation of replacing the weights with -1, 0, 1. And that would be: matrix multiplication xW is just calculation of n vector dot products - one dot product between x and each row of W. A dot product of two vectors is max when the vectors point in the same direction, min when the vectors point in the opposite direction, 0 if they are orthogonal. So it's basically deciding "let's glue all the KQV vectors, whose direction we compare with x, to the base axes (of the coordinate system), rather than allow them to point in any direction". I think that's what they call "privileged bases" in interpretability research. But given that you can only fit so many orthogonal vectors in n dimensions (and a lot more "almost" orthogonal vectors), it feels like it should impact the ability of the model to uniquely represent inputs.

  • @pauldruhg2992
    @pauldruhg2992 1 month ago +6

    Why stop at ternary? Go for powers of two and bit shifting. Speed and precision win-win.

    • @WalterSamuels
      @WalterSamuels 1 month ago

      Can you elaborate?

    • @danielg3857
      @danielg3857 1 month ago +1

      @@WalterSamuels He means replacing ternary logic gates with three possible outputs (1, 0, -1) with binary logic gates/functions, to benefit from even better math hacks, so to speak; you can do neat tricks with binary numbers/functions. Haven't even watched most of the video, mind you, just reading the abstract and comments so far.

    • @pauldruhg2992
      @pauldruhg2992 1 month ago

      @@WalterSamuels multiplication and division by powers of two can be replaced with bit-shifting, which is faster
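
      To illustrate the bit-shifting point with a toy integer example (a hedged sketch; the function name and values are made up): restricting weights to ±2^k turns each multiply into a shift plus an optional negation.

```python
def apply_pow2_weight(x: int, sign: int, k: int) -> int:
    """Multiply an integer activation by sign * 2**k using only a shift."""
    shifted = x << k            # same as x * 2**k
    return shifted if sign >= 0 else -shifted

assert apply_pow2_weight(5, +1, 3) == 5 * 8
assert apply_pow2_weight(5, -1, 2) == -(5 * 4)
```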

  • @unvergebeneid
    @unvergebeneid 1 month ago +1

    Anything that uses balanced ternary is already a superior method in my book :D

  • @FryGuy1013
    @FryGuy1013 1 month ago +2

    As someone who has written CUDA code, this is relatively straightforward to do on GPUs. So your concern that it will be basically the same performance as full floating-point multiplication seems kind of unfounded.

    • @Noxeus1996
      @Noxeus1996 1 month ago +6

      As someone who has written most of the llama.cpp CUDA code, matrix multiplications on GPUs are only so fast due to specialized hardware, i.e. tensor cores. Without specialized instructions for Bitnet or whatever I doubt that the performance will be (much) better than just doing dense 16 bit matrix multiplications unless you also quantize the activations to 4/8 bits.

  • @RPG_Guy-fx8ns
    @RPG_Guy-fx8ns 1 month ago +1

    If you have a layer of 64 neurons, the weights would be 16 bytes per neuron. You can use a look-up table with 256 entries instead of summing the binary digits. That way, most of the math is just turned into jumps into that table, finding 2 sums to subtract. It's 16 boolean AND operations to compare the previous layer's output with this neuron's weights, 16 array lookups, adding them up as 2 totals, then subtracting the 2 bytes. That would be extremely fast compared to other neural networks, but I wonder if it can match the quality of other solutions.
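
    A rough sketch of the bitmask-and-count idea described above, assuming binary activations and a ternary weight vector stored as two masks (one marking the +1 positions, one the -1 positions); all names and sizes are hypothetical, and int.bit_count() needs Python 3.10+.

```python
def packed_ternary_dot(x_bits: int, plus_mask: int, minus_mask: int) -> int:
    """Dot product of a packed binary activation vector with a ternary
    weight vector: count the +1 hits, count the -1 hits, subtract."""
    return (x_bits & plus_mask).bit_count() - (x_bits & minus_mask).bit_count()

# 8-dim example (bit i = element i): x = [1,0,1,1,0,0,1,0]
x_bits     = 0b01001101
plus_mask  = 0b00001001   # weights +1 at positions 0 and 3
minus_mask = 0b01000100   # weights -1 at positions 2 and 6
print(packed_ternary_dot(x_bits, plus_mask, minus_mask))  # 2 - 2 = 0
```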

  • @ronhightower6549
    @ronhightower6549 1 month ago +74

    Hopefully the research community gets these fundamental improvements figured out before Sam Altman spends a trillion dollars on data centers running Nvidia MatMul devices.

    • @danielmewes
      @danielmewes 1 month ago +2

      Might still need it for training?

    • @TheNerd484
      @TheNerd484 1 month ago +7

      It would be funny if this happens like a month after he buys them. It would also mean we get a lot of cheap compute cards

    • @eadweard.
      @eadweard. 1 month ago +5

      @@TheNerd484 Resentment-powered compute.

    • @clray123
      @clray123 1 month ago +1

      Too late. Also, Anthropic spends substantial resources on interpretability of transformer-based models. As far as I'm aware, these interpretability gains do not translate easily into other architectures.

    • @jswew12
      @jswew12 1 month ago +1

      @@danielmewes Correct me if I am wrong, but isn't training also possible on the FPGA they introduce? It's been a couple of weeks since I read the paper and I haven't finished this video, but I could have sworn that all the operations they need for training are programmed into the FPGA and are shown to be better than GPU equivalents. Could it be a problem of scale, maybe?

  • @josehugoelsas8699
    @josehugoelsas8699 1 month ago +1

    One important thing to notice is that this approach trades off very regular, very high arithmetic-intensity matmuls for very sparse, very memory-irregular filtering operations to do the ternary if-statements.
    For me it is not clear whether this will yield any improvement over present GPU or other accelerator architectures.
    Also, it relies heavily on quantization, which can be fragile depending on the situation. It is not much of a problem for inference, but can be a problem for training.
    Multiplying floats, especially dense matrices, is cheap; what is expensive is moving data, and I don't see how this paper improves on that front.

  • @eruiluvatar236
    @eruiluvatar236 1 month ago +1

    I believe that you could still implement a fast "ternary multiplication" on a current GPU by using logic gates operating on multiple weights per register. Matmuls are crazy fast on GPUs, but by squeezing multiple weights together into a single register it might end up being faster.

  • @hjups
    @hjups 1 month ago +1

    With usefulness, there's still an underlying assumption that 1) the comparable performance will hold with increased scale / specialized models, and 2) properties required for improved reliability in transformers also translate to this architecture.
    My guess is that (1) depends on the task / benchmark, and (2) is unlikely to occur (SSMs are missing some of these properties), which will set an upper bound on the model size and usability. That said, this approach is probably applicable for more classical NLP tasks which are easier than generative AI, and maybe some sort of low-effort HCI (e.g. take this JSON packet and convert it into a human understandable response).

  • @eoghanf
    @eoghanf 1 month ago +2

    I would really be interested in knowing more about how the Straight-Through Estimator allows these things to train. That's the big mystery to me.
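
    For what it's worth, here is a minimal PyTorch sketch of a straight-through estimator in the spirit of BitNet-style absmean quantization (an illustration, not the paper's code): the forward pass sees ternary weights, while the backward pass pretends the rounding was the identity so gradients reach the full-precision weights.

```python
import torch

class TernarySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        # Quantize to {-1, 0, +1} times a per-tensor scale (absmean-style).
        scale = w.abs().mean().clamp(min=1e-8)
        return torch.clamp(torch.round(w / scale), -1, 1) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: the quantizer's gradient is treated as identity.
        return grad_output

w = torch.randn(8, requires_grad=True)
TernarySTE.apply(w).sum().backward()
print(w.grad)   # all ones: the gradient flowed through the rounding step
```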

  • @jimbo8853
    @jimbo8853 1 month ago +17

    Devs learning linear algebra to upskill for AI in shambles

    • @Decocoa
      @Decocoa 1 month ago +2

      Joking aside, mate, why would devs need linear algebra for AI? Surely the basics from high school should be sufficient? You abstract away the layers and optimisers with TF?

    • @jamescunningham8092
      @jamescunningham8092 1 month ago +23

      @@Decocoa To be truly effective in an environment where the state of the art changes all the time, you need at least a little understanding of how things work. Without any understanding of linear algebra you'd be at a big disadvantage.

    • @coversine479
      @coversine479 1 month ago +4

      @@Decocoa If you don't know LA and calculus you can't understand AI papers. Period. But if you are just an application developer using someone else's AI API, obviously you don't need to know how it works internally to use it.

  • @adamrak7560
    @adamrak7560 1 month ago +4

    Dot-product in-memory architectures would be extremely fast and efficient for inference. Less so for training.
    So _if_ we change the architecture, there are relatively simple ways we could add a few orders of magnitude to the inference performance.

    • @Balorng
      @Balorng 1 month ago

      Inference speed equals model performance because, currently, algorithms like "Graph of Thoughts", extensive multi-agent systems, "smart RAG" and, most importantly, metacognition in general are extremely inference-heavy (you can generate orders of magnitude more "subconscious" tokens for each one shown to the user). So is generating oodles of very high-quality training data to create "leaner" yet more performant models that need much less data by eliminating junk. I particularly liked the idea of creating multiple "interlocking" variants of data designed to combat the LLM flaw of A = B, B =/= A (the "reversal curse") and otherwise their inability to truly generalize.
      My pet "internal model of LLM attention" is actually DNA sequencing: a huge pattern is broken apart into small chunks and then pieced together into new patterns by having them mesh with each other using semantic-distance similarity, which explains both the strong and weak points of LLMs. While I think that using graph RAG and symbolic-logic metacognitive systems is still a must to make LLMs truly useful, simply having more patterns that are "rotated/translated" this way and that should create a better "illusion of general intelligence" at the very least...

    • @hjups
      @hjups 1 month ago

      "Extremely fast and efficient" is relative. Samsung and SK Hynix already do that with their HBM-PIM, but are only able to get a 2x-3x improvement. That's at most 2 orders of magnitude (in base 2). That 2x is still valuable, but it's limited by communication depth (sum trees can't be faster than log2 N), and the technology nodes used by DRAM are relatively slow compared to CMOS.

    • @adamrak7560
      @adamrak7560 1 month ago

      @@hjups HBM-PIM is a generic processor near each pair of DRAM banks, with a quite underpowered FPU. It is not a highly parallel, dot-product-specific engine. So for AI inferencing it is unsurprisingly very weak. For AI inferencing we only need a dot-product engine and very little control circuitry or registers.

    • @hjups
      @hjups 1 month ago

      @@adamrak7560 That's incorrect. The HBM-PIM implementations are a special-function SIMD ALU near each bank (they have an ISA of 16 instructions or something small like that), one of which has a dot-product sum tree (I can't recall which one it was).
      And you do need more than just a dot-product engine for efficient inference. You also need the ability to perform element-wise addition, multiplication, and some movement operations for transpose.

  • @adeeelh
    @adeeelh 1 month ago

    +100 to the rant at 25:32 about researchers relying on tricks instead of the main idea of the paper. It's my biggest pet peeve with deep learning papers.

  • @alan2here
    @alan2here 1 month ago +1

    Evolution: the models are the species, we cause the mutations and are also the environment, and speciation is common.

  • @sentinelav
    @sentinelav 1 month ago +2

    40:25 "More bang for your flop" 💀

  • @VincentKun
    @VincentKun 1 month ago +2

    About data dependency: did you see the Illusion of State in State-Space Models paper?
    Every time they try to get to something recurrent they lose parallelization, and state dependency is one of those cases.

  • @ssssssstssssssss
    @ssssssstssssssss 1 month ago +1

    I saw this the other day and really liked how they claim not to be doing matrix multiplication while still doing matrix multiplication. It's just an efficient implementation of a special case. It makes me feel a bit disappointed despite the contribution of the paper looking to be quite solid.

  • @AleksandrUmnov
    @AleksandrUmnov 1 month ago +1

    6:24 the pigeon moment

  • @jmirodg7094
    @jmirodg7094 1 month ago +1

    It is only a first attempt; I'm keen to see the follow-up papers...

  • @serhanciftlikci3651
    @serhanciftlikci3651 1 month ago

    I think it all boils down to the classical bias-variance tradeoff. Using ternary weights results in a biased model (hence the big loss gap compared to the transformer at the start). They can add more weights, but that would remove all the gains at inference. If they can also find a component to increase the variance of the system, it may be the new way to train LLMs in the future.

  • @VladMysla
    @VladMysla 1 month ago

    30:26 In the hidden state it actually depends on the previous state to select what to forget.

  • @alan2here
    @alan2here 1 month ago +1

    PCs today can fit a ternary value into 2 bits, utilising 75% of the space, and compute with it fairly efficiently. Maybe not so practical to compute with, but 3 ternary values also fit into 5 bits, giving 84%, and 10 ternary values fit in 16 bits (2 bytes), utilising 90%. 😮

    • @alan2here
      @alan2here 1 month ago

      Unfortunately 2^m is never equal to 3^n for any integers other than 0, so the packing never comes out exact.
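
      A small sketch of the packing discussed in this thread (the function names are made up): ten base-3 digits occupy 3^10 = 59049 of the 65536 values of a 16-bit word, about 90% utilisation, and since no power of 2 equals a power of 3 the fit is never exact.

```python
def pack_trits(trits):
    """Pack ternary digits (-1, 0, +1) into one integer, base 3."""
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)     # map {-1,0,1} -> {0,1,2}
    return value

def unpack_trits(value, n):
    """Inverse of pack_trits for n digits."""
    out = []
    for _ in range(n):
        value, digit = divmod(value, 3)
        out.append(digit - 1)
    return out

w = [1, -1, 0, 0, 1, 1, -1, 0, 1, -1]   # 10 ternary weights
packed = pack_trits(w)
assert packed < 2**16                   # 3**10 = 59049 fits in 16 bits
assert unpack_trits(packed, 10) == w
```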

  • @abdulshabazz8597
    @abdulshabazz8597 27 days ago

    This algorithm can be further adapted to arbitrary, non-binary bit-arrays to further improve their performance by first factoring the RHS matrices into primes, which are then essentially viewed as unary values, and summing each tensor of primes and their products in parallel...

  • @clray123
    @clray123 1 month ago

    I don't understand putting this linearized architecture in the same basket as state-space models at 30:22. The (selective) "accumulation of the past" in state-space models (specifically Mamba) makes the next state data-dependent (namely, on all the selectively accumulated past data), not just on the next token. Or are you saying that, because of the selectivity, newer tokens may have no chance of using information from older tokens that have been rejected by selection (but that is kind of the tradeoff for not having to maintain a KV cache of indefinite length)?
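
    On the parallelization point raised in this comment and in the video: a tiny sketch (not the paper's code; all names invented) of why a recurrence whose gates depend only on the input, h_t = a_t * h_{t-1} + b_t, can be evaluated with an associative scan, whereas a gate that depended on h_{t-1} itself could not be composed ahead of time.

```python
import numpy as np

def combine(left, right):
    """Compose two affine maps h -> a*h + b. Because this is associative,
    all prefixes can be computed with a parallel scan
    (e.g. jax.lax.associative_scan) instead of a sequential loop."""
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

rng = np.random.default_rng(0)
T = 6
a = rng.uniform(0.5, 1.0, size=T)   # input-dependent gates (not state-dependent)
b = rng.standard_normal(T)

# Sequential reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0
h = 0.0
for t in range(T):
    h = a[t] * h + b[t]

# Same final state via composed affine maps (what the scan computes in parallel)
acc = (1.0, 0.0)
for t in range(T):
    acc = combine(acc, (a[t], b[t]))
assert np.isclose(acc[0] * 0.0 + acc[1], h)
```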

  • @WalterSamuels
    @WalterSamuels 1 month ago

    Look into VSA (hyperdimensional computing), and balanced ternary notation.

  • @norlesh
    @norlesh 1 month ago +1

    How does this affect the GPU-poor such as myself (humble RTX 2080)? I'm wondering how this would perform implemented as something like llama.cpp, tailored to run on CPU and system RAM with the GPU just for icing when available.

    • @JBoy340a
      @JBoy340a 28 days ago

      Yes. As a fellow 2080 owner I often run into issues with resources. It would be nice to see these sorts of issues go away.

  • @JBoy340a
    @JBoy340a 28 days ago

    The FPGA is interesting. It would be interesting to see what this means for portable real-time devices.

  • @hasko_not_the_pirate
    @hasko_not_the_pirate 1 month ago

    19:20 Isn't the essential trade-off that they encode learned models in a 1.6-bit "ternary" data type rather than an 8-bit, 16-bit, or 32-bit float data type for the weight matrix? It seems likely that you would need roughly 20 times as many weights to encode the same information as a float32 weight matrix, which would then increase compute complexity accordingly.
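
    The "roughly 20 times" figure follows from a quick information count (a back-of-the-envelope check, not a claim about what the models actually need):

```python
import math

bits_per_trit = math.log2(3)   # ≈ 1.585 bits of information per ternary weight
print(32 / bits_per_trit)      # ≈ 20.2 ternary weights per float32's worth of bits
```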

  • @evilby
    @evilby 1 month ago +1

    TTT on the way?

  • @pavalep
    @pavalep 1 month ago

    thanks for the informative vid :)

  • @albinoameise
    @albinoameise 1 month ago

    But your idea of simply repeating the input tokens for attention does not necessarily result in too many tokens, because you can use this np.where operation once, in a step beforehand, to thin out the input tokens with a ternary thinning matrix and then replicate and 'attend' to only those with values > 0.
    So I find your idea at least worth trying!

  • @cherubin7th
    @cherubin7th 1 month ago +5

    Nvidia is cooked

  • @ekstrapolatoraproksymujacy412
    @ekstrapolatoraproksymujacy412 1 month ago

    The attention layer is needed for in-context learning, and in-context learning capability is strongly correlated with intelligence; architectures like RWKV struggle with this. Looking at the loss and most of the current benchmarks is very misleading regarding actual performance; those things mostly measure how much the model remembered, not how well it generalizes. That's why nobody really uses those "modern RNN" thingies; they only look good on paper, not in practice.

  • @clray123
    @clray123 1 month ago

    I have a nagging suspicion that the attention complication they do after the ternary quantizing of the QKV weights is there to recover (as in "store elsewhere") the same weights that they claim to have dropped...

  • @aneeshprasobhan
    @aneeshprasobhan 1 month ago +10

    NVIDIA's shares rely on this paper not getting too much attention xD

    • @tarumath319
      @tarumath319 1 month ago +1

      They would just need to add ternary accelerators and maybe more int8 ones.

    • @eadweard.
      @eadweard. 1 month ago +4

      Is that a pun?

    • @aneeshprasobhan
      @aneeshprasobhan 1 month ago

      @@eadweard. I tried xD

    • @kazedcat
      @kazedcat 27 days ago

      Nvidia could just add a ternary operation to their GPUs. It is super simple hardware: "copy if 1, zero out if 0, and negate if -1". They only need to add a single new instruction, VTerAcc ("Vector Ternary Accumulate").

  • @FredericoKlein
    @FredericoKlein 1 month ago

    A multiplication by 2 is just a bit shift in binary (in floating point, it's just adding 1 to the exponent, isn't it?)
    So they could have done 2, 4, 8, ... and -2, -4, -8..., couldn't they?
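
    A quick check of the floating-point remark (standard library only; nothing here is specific to the paper): scaling by 2**k is exact and just moves the exponent field.

```python
import math, struct

x = 3.14159
assert math.ldexp(x, 3) == x * 8.0          # multiply by 2**3 via exponent adjustment

def exponent_bits(f: float) -> int:
    """Extract the 11 IEEE-754 exponent bits of a double."""
    return (struct.unpack(">Q", struct.pack(">d", f))[0] >> 52) & 0x7FF

print(exponent_bits(x), exponent_bits(2 * x))   # second value is larger by exactly 1
```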

  • @rockapedra1130
    @rockapedra1130 1 month ago +1

    18:16 I like the duplication hack. I wonder if brains use that. Synapses would be +1 = excitatory synapse, -1 = inhibitory synapse, 0 = no synapse, other numbers = multiple synapses. Maybe. Who knows. LOL

    • @LuizFernando-hv1td
      @LuizFernando-hv1td 1 month ago

      I think you would be interested in looking into SNNs! From what I understand, when you include the time dimension, something like this happens in the form of spike frequency.

    • @rockapedra1130
      @rockapedra1130 1 month ago

      @@LuizFernando-hv1td Hey, that's pretty cool! If we add spiking frequency and an "integration window" to the mix then it works even better! Then we can do: spike freq * int window * (num exc synapses - num inh synapses) = value! That allows arbitrary precision with ternary synapses. If I were a brain engineer, I'd do that! Probably everybody does already ... Lol.

  • @TheNerd484
    @TheNerd484 1 month ago

    IMO, if any architecture will yield actually intelligent AIs, it would look very similar to this. I think training would be the main hard part.
    I'm of the opinion that if this model were trained such that it does not have to output a token on every iteration, you would see significant performance improvement basically for free.

  • @TheTruthOfAI
    @TheTruthOfAI 1 month ago

    This paper is wild as hell... even coming out with an FPGA solution. To be honest, it's one of those papers that I don't fully, entirely, 101% grasp. I did try some of this ternary multiplication approach. According to the "book", its numerical floating-point precision, for example on 13 operators, reaches 100% of float16 precision. Truth is, on the battlefield it doesn't perform well in my experiments.

  • @aitarun
    @aitarun 1 month ago

    The 1-bit and 1.58-bit LLM papers came out a while back. I wonder why these models aren't available yet. There are quantized models, but no model is available that was trained at 1 or 1.58 bits. Seems like accuracy-related issues keep them from being as worthy as their full-precision counterparts.

  • @hannesstark5024
    @hannesstark5024 15 days ago

    Using the straight-through estimator sounds to me like, for both the forward and backward pass, we still need to compute everything in floating point and then quantize the gradients to the level of our weights. So we would have no compute-efficiency benefits. Does someone know what I am missing here?

  • @fiNitEarth
    @fiNitEarth 1 month ago

    Well, didn't they compare their model to Transformer++, which also quantizes its weights to ternary?

  • @MrBioloidboy
    @MrBioloidboy 1 month ago

    Sentient AI is here! Can I try brain-tech data science integrations now?

  • @mrpocock
    @mrpocock 1 month ago +2

    Is this not an opinionated ReLU?

  • @tarumath319
    @tarumath319 1 month ago

    A lot of people talk about BitNet and this improvement over it, but the big guys in AI like OpenAI seem not to care about it.

    • @clray123
      @clray123 1 month ago +1

      Sunk cost fallacy. The hardware they've already paid for needs to be amortized first. It's very difficult to admit to investors they've burnt so much money by committing to an unripe architecture.

  • @bjarke7886
    @bjarke7886 1 month ago

    ESM3 ESM3 ESM3 ESM3 ESM3 ESM3

  • @christospapadopoulos7894
    @christospapadopoulos7894 1 month ago +1

    Eight authors for a scientific paper is absurd; at this point, who even is the main one?

  • @JoeTaber
    @JoeTaber 1 month ago

    I wonder if a Tenstorrent device would be able to process these operations efficiently.

  • @g_glop
    @g_glop 1 month ago +1

    MatMul? I'm allergic

  • @hermannschmidt9788
    @hermannschmidt9788 1 month ago

    Bitcoin mining was first run on GPUs. Then came the FPGAs, followed by ASICs. I wonder if this progression will apply to transformer networks as well. That would put Nvidia out of business. Calculating a hash value is a simpler task, however.

    • @clray123
      @clray123 1 month ago

      Why do you think Nvidia would be incapable of manufacturing (and foremost patenting) these other circuits?

    • @hermannschmidt9788
      @hermannschmidt9788 1 month ago

      @@clray123 I just followed the mining analogy. They stayed with the GPUs, which is their core competence, and gave away this business.

  • @erickmarin6147
    @erickmarin6147 1 month ago

    Been trying to verilog something like that myself for a while

  • @khaledbouzaiene3959
    @khaledbouzaiene3959 1 month ago

    I wish you would explain the FPGA or ASIC part: how this is done using addition or element-wise operations instead of matrix multiplication.

    • @hjups
      @hjups 1 month ago +1

      The authors don't go into detail nor is the RTL code in their repo. From their description and diagram, it's a stand-alone DMA unit, which takes in the address of the ternary matrix, the address of the activation matrix (most likely), and the address of the destination matrix (most likely). Then it fetches a column of the transposed ternary matrix to store in a local buffer, and streams the rows of the activation matrix into an accumulator, which then gets written back to the destination address.

  • @kop-lg7lo
    @kop-lg7lo 1 month ago

    Kinda cool, but surely we're not ready for this type of architecture.

  • @eaglefacts990
    @eaglefacts990 1 month ago

    What PDF editor do you use?

  • @charstringetje
    @charstringetje 1 month ago

    Am I the first to see that Q=K=V, and that we can reduce all MatMul to ⅓ the current operations without introducing other operations? 🙃 3:44

    • @charstringetje
      @charstringetje 1 month ago

      Oh, I spoke too soon... Handwaving follows.

    • @clray123
      @clray123 1 month ago

      The weight matrices are "obviously" supposed to be different, but in some cases the same K and V submatrices are reused for subsets of Q (or for all Q), indeed leading to memory savings (although not to 1/3). See papers on multi-query attention (MQA -> all Qs share same KV) and grouped-query attention (GQA -> some Qs share same KV).
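
      A toy shape-level sketch of the KV sharing described above (hypothetical sizes, NumPy only, ignoring scaling and softmax): eight query heads share two K/V heads, so K and V storage shrinks by 4x while the score shapes stay the same.

```python
import numpy as np

n_q_heads, n_kv_heads, seq, d = 8, 2, 10, 64
group = n_q_heads // n_kv_heads             # 4 query heads per KV head

q = np.random.randn(n_q_heads, seq, d)
k = np.random.randn(n_kv_heads, seq, d)     # 4x smaller than a full per-head K

k_expanded = np.repeat(k, group, axis=0)    # each group reuses the same K head
scores = q @ k_expanded.transpose(0, 2, 1)  # shape (8, 10, 10), as with full K
print(scores.shape)
```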

  • @adityashukla9840
    @adityashukla9840 1 month ago

    Can you please make a video on DUCK net?

  • @seanreynoldscs
    @seanreynoldscs 1 month ago +2

    I'm calling BS. They are approximating the floating-point weights by having overly large weight matrices. This paper could also be called "a smaller network sometimes outperforms a larger network for small datasets."

  • @Navhkrin
    @Navhkrin 1 month ago +2

    Big doubt this approach scales. It's giving me vibes of the kind of research that works for that one specific, tailor-engineered scenario and sucks for everything else. Otherwise we would have seen a significantly higher number of experiments in various settings.

    • @clray123
      @clray123 1 month ago

      That is one stupid argument to make; with that approach you can disqualify any new idea ("the idea must obviously be bad, otherwise we would have seen it before").

    • @deltamico
      @deltamico 1 month ago +2

      It's more like "the idea must be bad because otherwise the authors would be willing to explore its capabilities in different settings", which is not always true but absolutely has grounds.

    • @clray123
      @clray123 1 month ago +1

      @@deltamico But this whole "but does it scale" argument assumes the researchers have infinite money to burn on hardware. They obviously don't; that's why they explore new ideas with smaller models.

    • @Jononor
      @Jononor 1 month ago +1

      Integer quantization is standard practice in edge/mobile/TinyML. Sub-byte quantization and even binary networks have seen considerable research in the last decade. Most of it has been on CNNs; Transformers and LLMs have not seen as much research yet, but it is coming. No one knows whether ternary or no-matmul will be the best representation, though...

  • @naromsky
    @naromsky 1 month ago

    That's one boring paper.