Finally: Grokking Solved - It's Not What You Think

  • Published 27 Jan 2025

COMMENTS • 112

  • @code4AI
    @code4AI  13 days ago +5

    With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language.
    To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.

    • @d3fau1thmph
      @d3fau1thmph 13 days ago +1

      I would prefer synthetic English. With good pronunciation.

    • @kirill9064
      @kirill9064 11 days ago

      It doesn't work for me.

    • @HUEHUEUHEPony
      @HUEHUEUHEPony 5 days ago

      There is no regional dub in Spanish. Can you provide the dub in Spanish?

  • @be1tube
    @be1tube 13 days ago +52

    This did not explain grokking to me because my question is not "why does it take so long to happen" but why does it happen at all after memorization? Why is there still a training signal after the network has memorized the training examples?

    • @fackarov9412
      @fackarov9412 13 days ago +16

      The model first memorizes the training data using "primitive" structures. As training continues, the model keeps evolving, and if better structures emerge to represent the data, it exploits them; if it finds the "exact" data structure, grokking kicks in.
      Once a better structure is found, it opens up an exponential space of "good" candidate structures, so the model gets better and better.

    • @mircorichter1375
      @mircorichter1375 13 days ago +8

      People use the term "grokking" differently. Here and in most research 'grokking' is not the generalization AFTER the delay but the delay before the expected generalization. So a model that generalizes on par or soon after training memorization does not grok, while a model groks if there is a delay...

    • @mircorichter1375
      @mircorichter1375 13 days ago +2

      That is arguably counterintuitive due to the meaning of the word 'grokking'

    • @daPawlak
      @daPawlak 13 days ago +1

      @@mircorichter1375 So is it just a problem of getting a result that could be achieved without that delay, or is there additional functionality provided by that additional compute time?

    • @be1tube
      @be1tube 13 days ago +3

      @@fackarov9412 What was confusing me is the 100% on training. However, I have a hypothesis now that the graph shown in these situations plots training accuracy, not training loss. So there is a signal (the difference between an output of (.2,.2,.6) and (0,0,1)), it's just not shown in the graph.
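
The accuracy-versus-loss point above can be checked with a few lines. This is a minimal sketch, not anything from the video or the paper: the two probability vectors are made-up examples (the first is the (.2,.2,.6) output from the comment), and `cross_entropy` and `is_correct` are hypothetical helper names.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def is_correct(probs, target):
    """Accuracy only checks whether the argmax matches the target."""
    return max(range(len(probs)), key=probs.__getitem__) == target

target = 2
soft = [0.2, 0.2, 0.6]      # correct, but not confident
sharp = [0.01, 0.01, 0.98]  # correct and confident

for name, probs in [("soft", soft), ("sharp", sharp)]:
    print(name, "correct:", is_correct(probs, target),
          "loss:", round(cross_entropy(probs, target), 4))
```

Both outputs count as correct, so an accuracy curve sits at 100% for either one, while the cross-entropy still differs and can keep supplying a gradient.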

  • @jabowery
    @jabowery 13 days ago +23

    For those new to the subject everything up to 18:22 is pretty much a diagnosis of the disease not the cure. Don't be discouraged by the long introduction. This video is very important.

  • @NaveenReddy-p5j
    @NaveenReddy-p5j 14 days ago

    Great progress on grokking! This could be the turning point for more efficient AI training methods. Hats off to the Imperial College London team!

  • @thielt01
    @thielt01 14 days ago +13

    I've been waiting for such a paper!
    I'm more of a hobbyist in the space rather than an actual researcher.

    • @tescOne
      @tescOne 13 days ago +1

      same :) Grokking blew my mind some months ago

    • @RickySupriyadi
      @RickySupriyadi 13 days ago

      Which one? I can't find it.

    • @tescOne
      @tescOne 12 days ago +1

      @@RickySupriyadi "Grokking at the Edge of Numerical Stability"

    • @RickySupriyadi
      @RickySupriyadi 12 days ago

      @@tescOne thanks

  • @asfaust312
    @asfaust312 13 days ago +6

    link your papers in the description please.

  • @hjups
    @hjups 13 days ago +18

    I think I missed something in your explanation. If the model gets stuck due to softmax collapse, then how does Grokking occur? Presumably, this would require the model to eventually become "unstuck", but that's not possible with a zero gradient. It's also unlikely to occur from floating-point behavior as an underflow will be truncated and an overflow will become inf/NaN. Or is the idea that a high lr avoids this collapse by amplifying small gradients?
    Additionally, if SC is attributed to logit scaling prior to softmax, then does that not suggest that using non-affine scaled normalization prior to the softmax may prevent this issue? Norm(c*f(x)) ~ f(x) / || f(x) ||.
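
To make the two points in this comment concrete, here is a small sketch under stated assumptions: float32 is emulated by rounding through `struct`, the logits are made-up numbers, and `softmax_f32`, `ce_grad_wrt_logits`, and `l2_normalize` are hypothetical helpers. It shows that once the logit gap is large enough, the softmax output for the correct class rounds to exactly 1 and the cross-entropy gradient on that class is exactly 0, and that an L2-normalized logit vector is unchanged by a positive scale factor c.

```python
import math
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def softmax_f32(logits):
    """Softmax with intermediates rounded to float32."""
    m = max(logits)
    exps = [f32(math.exp(f32(z - m))) for z in logits]
    total = 0.0
    for e in exps:
        total = f32(total + e)  # small terms can be absorbed here
    return [f32(e / total) for e in exps]

def ce_grad_wrt_logits(logits, target):
    """Gradient of cross-entropy w.r.t. the logits: softmax(z) - onehot."""
    probs = softmax_f32(logits)
    return [f32(p - (1.0 if i == target else 0.0)) for i, p in enumerate(probs)]

def l2_normalize(v):
    """Divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

logits = [2.0, 0.5, -1.0]            # class 0 is the correct one
scaled = [20.0 * z for z in logits]  # same ranking, 20x larger

print("grad (original):", ce_grad_wrt_logits(logits, 0))
print("grad (scaled):  ", ce_grad_wrt_logits(scaled, 0))
print("normalized equal:", all(
    abs(a - b) < 1e-12
    for a, b in zip(l2_normalize(logits), l2_normalize(scaled))))
```

In this toy case the gradients on the incorrect classes are tiny but not exactly zero, so "zero gradient" here refers to the correct-class entry; whether that matches the paper's exact setting is something to verify there.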

    • @malchemis
      @malchemis 13 days ago +4

      +1 This does not explain the "suddenly". I may have missed something in the video, I need to take a look at the paper. Is the model getting unstuck simply due to noise (repeated perturbation) ? The reasoning would be that a global minimum is stable and more resistant to perturbation than a local one

    • @mucabi
      @mucabi 13 days ago +6

      I only skimmed the paper but if I remember correctly for grokking consistently to occur you somewhat need weight decay. So over time the weights get decayed enough to get meaningful gradients again.
      There are also pretty pictures in the paper, take a look :)

    • @devinbae9914
      @devinbae9914 13 days ago +1

      @malchemis You should look into spontaneous replica symmetry breaking in diffusion models. Grokking is a similar phenomenon where the state space undergoes a phase transition... it is clear that the energy in conjunction with some other parameters must reach some sort of threshold (which itself can be sequestered) before finally settling in a NEW attractor basin which might be called the "true basin" in contrast to the "false basin" (which is the initial training plateau). Hope this makes a little bit more sense?

    • @hjups
      @hjups 13 days ago +2

      @devinbae9914 I suspect this is an accurate description, but does not address the issue posed. What exactly causes the phase transition? It can't simply be stochastic chance, since it was established that the gradients are 0 due to softmax collapse, meaning the model is unable to update.
      So either, the zero gradient assumption is false, weight decay helps recover from SC, or the scenario description in the video is flawed (i.e. maybe only models which do not exhibit SC are able to Grok based on random seeds). Or maybe there's some other factor that we're missing?
      Understanding the mathematical mechanism is crucial to backing the theoretical model.

    • @devinbae9914
      @devinbae9914 13 days ago

      @@hjups Yes very true, I'd wager it's because there is a parameter which causes the phase transition. The weight decay explanation seems pretty reasonable and it's actually cited as the primary explanation in the paper... otherwise factors like the loss function are also involved (they note MSE loss on shallow networks can induce Grokking) and weight scaling.

  • @Anonymous-lw1zy
    @Anonymous-lw1zy 13 days ago +4

    Why not just run softmax at far higher precision (double: 64 bit float (53 bit significand)? quadruple: 128 bit float (113 bit significand)?) so you don't get precisely 1,0,0,... from the softmax?
    Or rather than compute softmax with exp(), use an adjustable steepness function, adjusting it over time, much like learning rates are adjusted over time. Put in a control loop to keep it away from getting pinned to 1,0,0,...
    ---
    OK, I read the paper. These are suggested solutions.

    • @tbirdal
      @tbirdal 13 days ago +2

      Good point and we did experiment with these in the paper. While it helps, you cannot get rid of the intrinsic problem by simply using higher precision.
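
A rough illustration of why higher precision only postpones the problem. This is my own sketch, not the authors' experiment: it emulates IEEE half, single, and double precision by rounding through `struct`, and `collapse_gap` is a hypothetical helper that finds the smallest integer gap between the top logit and a competitor at which 1 + exp(-gap) rounds to exactly 1.

```python
import math
import struct

def round_to(fmt, x):
    """Round a float64 to IEEE half ('e'), single ('f') or double ('d')."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def collapse_gap(fmt):
    """Smallest integer logit gap where 1 + exp(-gap) rounds to exactly 1."""
    gap = 0
    while round_to(fmt, 1.0 + round_to(fmt, math.exp(-gap))) != 1.0:
        gap += 1
    return gap

formats = [("float16", "e"), ("float32", "f"), ("float64", "d")]
gaps = {name: collapse_gap(fmt) for name, fmt in formats}

for name, gap in gaps.items():
    print(name, "collapses at a logit gap of about", gap)
```

The gap grows with the number of significand bits but stays finite, so a model whose logits keep growing will eventually cross it in any fixed-width format.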

    • @robtaylor1444
      @robtaylor1444 12 days ago

      @@tbirdal have you looked at posits?

  • @be1tube
    @be1tube 13 days ago +1

    There was a paper many months ago showing that increasing certain frequency components of the gradients (I think it was a low-pass filter but it could have been the opposite) skipped most of the delay for grokking.

  • @TheDarkhawk243
    @TheDarkhawk243 13 days ago +5

    Why do you never link papers?

  • @mrpocock
    @mrpocock 14 days ago +13

    Ok, so can you just slap an l1 loss in parallel to the logit layers, that's proportional to the inverse entropy of the logit?
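
One possible reading of this suggestion, sketched with made-up names and constants: add a penalty that grows as the entropy of the softmax output shrinks. The weight `lam`, the stabilizer `eps`, and the function `penalized_loss` are all assumptions for illustration, and this is a regularizer, not the method from the paper.

```python
import math

def softmax(logits):
    """Numerically shifted softmax in float64."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def penalized_loss(logits, target, lam=0.01, eps=1e-6):
    """Cross-entropy plus a penalty proportional to 1 / entropy."""
    probs = softmax(logits)
    ce = -math.log(probs[target])
    return ce + lam / (entropy(probs) + eps)

moderate = [2.0, 0.0, 0.0]   # confident but not saturated
extreme = [40.0, 0.0, 0.0]   # softmax output is effectively one-hot

print("moderate:", penalized_loss(moderate, 0))
print("extreme: ", penalized_loss(extreme, 0))
```

With plain cross-entropy the extreme logits would score better; with the penalty they score far worse, which is the push away from saturated outputs that the comment is after.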

    • @therainman7777
      @therainman7777 13 days ago +2

      I was just thinking the exact same thing.

    • @therainman7777
      @therainman7777 13 days ago +3

      Ah, now I just got to 19:00 in the video, where the paper’s authors say that they have a method of preventing softmax collapse _without_ regularization (your suggestion is a form of regularization). So maybe they have a better method, I’ll need to finish the video and read the paper to find out 😅

    • @mrpocock
      @mrpocock 13 days ago +1

      @@therainman7777 there may be a numerical methods trick, but if so, it is beyond my level of understanding.

    • @therainman7777
      @therainman7777 13 days ago +1

      @@mrpocock Yeah, I finished the video and there was no indication yet of what, if anything, the authors have already figured out. But I believe the poster of the video said he’ll be making a part two, so there may be more info to come. I’m going to read the paper in the meantime.

    • @jakobrees4973
      @jakobrees4973 13 days ago +5

      @@therainman7777 The trick they use is subtracting the gradient projection onto the weights from the gradient before using it to apply a step to the model. The idea is that the NLM direction will become apparent in the model parameters: we don't want to continue in the NLM direction so remove it from the gradient.
      From my (very quick) testing of the authors' method, this actually does not necessarily always seem to work that well, but specifically only is useful once we start to overfit drastically.
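
The projection step described above can be sketched as follows. This is a plain-Python illustration of "subtract the component of the gradient that points along the weights", based only on the description in this comment; `remove_weight_direction` is a hypothetical name, and whether the projection is applied per tensor or over all parameters at once is an assumption to check against the paper.

```python
def remove_weight_direction(grad, weights):
    """Return grad minus its projection onto the weight vector."""
    dot_gw = sum(g * w for g, w in zip(grad, weights))
    dot_ww = sum(w * w for w in weights)
    if dot_ww == 0.0:
        return list(grad)  # nothing to project onto
    coef = dot_gw / dot_ww
    return [g - coef * w for g, w in zip(grad, weights)]

weights = [1.0, 0.0, 0.0]
grad = [1.0, 2.0, 3.0]

# Only the part of the gradient orthogonal to the weights survives.
print(remove_weight_direction(grad, weights))

# A gradient that merely rescales the weights is removed entirely.
print(remove_weight_direction([2.0, 4.0], [1.0, 2.0]))
```

The resulting update cannot grow or shrink the weight vector along its own direction to first order, so the optimizer has to reduce the loss some other way than by scaling.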

  • @drowningpenguin1588
    @drowningpenguin1588 13 days ago +1

    Really appreciate the quality of your videos!

  • @macchiato_1881
    @macchiato_1881 10 days ago

    This strongly aligns with my doubts and problems with softmax for a while now. In my analysis on dynamics on neural networks, I noticed that classifiers tend to have really high euclidean norms for both weights and output logits due to how softmax works. This holds true even with explicit and implicit regularization techniques applied. This was a concern for me as I was focusing on recurrent setups where these unconstrained representations either blow the gradients out of proportion or make them vanish entirely. What I didn't know was that grokking was involved in this as well for non-recurrent setups. Interesting.

  • @awesomedata8973
    @awesomedata8973 13 days ago

    I'm not a math guy (at least not in the traditional sense), but I love your channel's exploration of the bleeding edge of those progressing (and trying to understand) this technology - but mostly how, despite the complexity of the subject-matter, you make it as accessible as you can to the average person (like me), who doesn't delve too deeply into mathematical rigor.
    I feel like more physical (and less abstract) explanations of phenomenon that are easy to visualize (i.e. in story form?), wherever possible, generally helps me to visualize this stuff more, but you at least take the time to go slowly over the material and explain everything that's necessary from a beginner's level without leaving anything out. I can generally still keep up with this because it's clear you're passionate about this stuff and it makes me want to learn more than I already want to (allowing me to pay closer attention). Your channel doesn't have to be flashy to get it right. Great job! - Keep up the amazing work!

  • @trucid2
    @trucid2 13 days ago +1

    How do models overcome the softmax collapse and start learning representations that generalize?

  • @scottmiller2591
    @scottmiller2591 6 days ago

    One quibble: the authors are using a non-standard definition of numerical instability. However, given that, it's a good paper. It's interesting that all of these issues come down to the use/misuse of one layer, and it's the output layer, which means that fixes will be applied quickly on many models, leading to rapid improvements. It does give one pause, however, about what dragons are living deeper and more complexly in the deep networks that aren't as easily analyzed.

  • @andikunar7183
    @andikunar7183 13 days ago +1

    Thanks, AMAZING content, WOW! Does this mean in non-mathematical/laymen‘s terms, that good, smaller context, knowledge-samples (decreased dimensionality) during training help with grokking?

    • @polyscopes
      @polyscopes 13 days ago +3

      I think he was saying that the decreased dimensionality helped prevent memorization forcing it to generalize sooner instead of memorizing the training data first and then learning to generalize.

  • @ibgib
    @ibgib 14 days ago +4

    How does the most important AI channel only have 50k subs. Great video ty

    • @Pure_Science_and_Technology
      @Pure_Science_and_Technology 13 days ago +1

      I totally agree. I think I’ve watched every video and it’s paid dividends in many areas.

    • @ibgib
      @ibgib 13 days ago +1

      @Pure_Science_and_Technology I've seen a couple dozen to some degree. It's hard to keep up with such good content, so i have to pick my battles. Can't play this channel at 2x!

  • @ChaseFreedomMusician
    @ChaseFreedomMusician 13 days ago

    I'm really looking forward to part 2!!

  • @cmilkau
    @cmilkau 12 days ago

    A more direct approach would be reducing the NLM component of the gradient during update, forcing the optimizer to find other ways of reducing loss. I wonder why they didn't try that. My guess is it'll just kill the gradient even faster.

  • @KilgoreTroutAsf
    @KilgoreTroutAsf 13 days ago +6

    This is the vanishing gradient problem all over again.

    • @yeetdeets
      @yeetdeets 13 days ago +2

      Exploding in this case though, right?

  • @TimoGraw
    @TimoGraw 13 days ago +2

    Am I correct in my understanding that this is about a significant reduction in transformer model training time?

    • @GodbornNoven
      @GodbornNoven 13 days ago +3

      Not exactly but it does save on costs and a bit on compute because we no longer need to do any regularization to achieve generalization as it can be achieved thru stablemax and adamW. But it does help our understanding of AI and generalization which could theoretically allow us to develop better architectures for learning.
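
For readers who have not seen StableMax, here is a sketch of it next to ordinary softmax. The piecewise function below (x + 1 for x ≥ 0, 1 / (1 − x) otherwise) is the definition as I recall it from the paper, so treat it as an assumption and check the paper before relying on it; the logits are made-up numbers.

```python
import math

def softmax(logits):
    """Ordinary softmax in float64."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def s(x):
    """Assumed StableMax transform: linear growth instead of exponential."""
    return x + 1.0 if x >= 0.0 else 1.0 / (1.0 - x)

def stablemax(logits):
    """Normalize s(logit) values so they sum to 1."""
    vals = [s(z) for z in logits]
    total = sum(vals)
    return [v / total for v in vals]

logits = [40.0, 0.0, 0.0]
print("softmax:  ", softmax(logits))    # top probability saturates
print("stablemax:", stablemax(logits))  # top probability stays below 1
```

Because s grows linearly rather than exponentially, the same logit gap that saturates softmax leaves StableMax with a probability strictly below 1, so the loss does not hit exactly zero.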

  • @irbsurfer1585
    @irbsurfer1585 13 days ago +1

    Great video!

  • @야옹-m7h
    @야옹-m7h 1 day ago

    Can I know the reference papers you are referencing from?

  • @user-qw1rx1dq6n
    @user-qw1rx1dq6n 11 days ago

    I wrote something a while ago where, in a single attention layer, some heads got a softmax activation and some got mish. This showed improved learning; maybe it had something to do with this.

  • @LudicrousTachyon
    @LudicrousTachyon 13 days ago

    This is pretty reasonable. If we always give all the data, the system doesn't need to generalize because it has all the data. If you take away some of the data, the system now needs to start assuming the portions that are missing or guessing what the possibly missing data is, and thus needs to train to generalize from the data it does have to what is possible. This is what we do with children. We throw tons of input at them, then throw the kid into a situation, which may or may not have all the same inputs. Finally, we correct or promote their behavior. Thus the child learns to deal with variable situations.

  • @mohl-bodell2948
    @mohl-bodell2948 13 days ago +5

    What a cliff hanger...

    • @polyscopes
      @polyscopes 13 days ago

      For real haha llm training getting 2x + cheaper overnight

    • @hdot2613
      @hdot2613 13 days ago

      Such a tease 😂

  • @tiagotiagot
    @tiagotiagot 13 days ago

    Would the solution be to always introduce noise that is just above the scale of the rounding error?

  • @timealchemist7508
    @timealchemist7508 13 days ago +2

    Ugh… Cliffhanger. 😂 Looking forward to tomorrow’s video!

  • @bernardoramos9409
    @bernardoramos9409 12 days ago

    this may explain why dropout is useful sometimes. it would modify the output, and then there would be a gradient again

  • @luke.perkin.online
    @luke.perkin.online 13 days ago +5

    Link the paper in the description please!

    • @k.c.sunshine1934
      @k.c.sunshine1934 13 days ago +2

      18:40 after learning from my mistakes, I realize that YouTube is not a fan of direct links (over-protection of copyright, I guess). YouTube is a social media tool rather than an academic system. You can find the link starting at my reference mark.

  • @desmur36
    @desmur36 13 days ago +11

    This is the most important video on the internet for our advancement of AI!

  • @Sirus20x6
    @Sirus20x6 13 days ago

    so train a low rank until you run out of space to fit more learning, and slowly up the quantization?

  • @dzhukov
    @dzhukov 13 days ago

    Instant subscribe!

  • @En1Gm4A
    @En1Gm4A 14 days ago

    Interesting stuff. I ran into such an issue during my master's thesis.

  • @mohammedbenaissa1278
    @mohammedbenaissa1278 13 days ago +2

    Why do we have to wait for tomorrow

    • @johncolbourne7789
      @johncolbourne7789 13 days ago +5

      😂Where is my AI paper explainer video, YouTuber slave?

    • @fkxfkx
      @fkxfkx 13 days ago +1

      because it is not here yet

  • @HoldMyData
    @HoldMyData 13 days ago

    So now, what am I going to waste my time waiting on training runs? @24:35 😅 Great explanation, thanks. I remember Nous or someone when all of this first started with "Ok so we just keep going and going and eventually it works". This makes it understandable.

  • @breakablec
    @breakablec 13 days ago

    Sounds like:
    1. once the error becomes low you want to stop the maximisation of optimal parameters to increase gradient
    2. you want to use decimals with large integer part and small fractional part to increase precision

  • @mloewen248
    @mloewen248 13 days ago

    Wouldn't simply adding noise to the input data then solve this issue after repeated epochs?

  • @firefly618
    @firefly618 10 days ago

    At 13:00 why are you drawing the "you are here" point somewhere on the slope? It should be at the bottom. When you reach 100% accuracy, you have reached the minimum of the optimization problem. That's why your gradients are zero. That's why learning stops. When the learning algorithm has memorized the training data, it has nothing more to learn from it. It is working as intended.
    In fact, the first half of the video is looking at the wrong problem. The NLM direction paper is just a fancy mathematical way to say something we always knew: if you allow your model (or your human student) to reach 100% accuracy on the training data but not on test data, it means it/he/she has simply memorized the training data. That's what overfitting is. At that point, you can keep reviewing the same data and you will memorize it better and better, meaning overfit it more and more, but there is generally nowhere to go from there.
    In fact, the surprising thing is that after overfitting for 10,000s more steps, something different happens at all. SC collapse happens during the overfitting phase and, by itself, is not the cause of any "grokking." My bet is that it's some type of numerical instability or some other boundary that, when reached, breaks the pattern and turns it into a different type of learning algorithm. That would be interesting to investigate. (I'll go watch part 2, you probably address it there.)
    The slide at 22:15 is indeed a valuable insight, although nothing distinctly new. We already know (from last year's brilliant paper) that vectors in a high dimensional space, like those commonly used in LLMs, effectively behave as if they were projections from a much higher or even infinite dimensional space, that of all possible knowable concepts, and that's why LLMs can "grok" or generalize anything in the first place. The observation that increasing the input vector size makes memorization easier, and hence prevents generalization, just proves once again that neural networks are a delicate balance between training set size, input size, and hidden layer sizes, alongside a plethora of practical techniques to keep overfitting (memorization) under control. In a way, quantizing the vectors is just another form of regularization. As is probably the case for whatever numerical instability or boundary happens after the logits explode.

  • @tikendraw
    @tikendraw 13 days ago

    Let's say a company hires you as a data science researcher. With your current knowledge and the same limitations as the tech giants, how far could you take an LLM if it were designed and trained from scratch?

  • @mathematicalninja2756
    @mathematicalninja2756 10 days ago

    Multi-precision scale gradient descent will solve grokking, thank you very much

  • @АлексейТучак-м4ч
    @АлексейТучак-м4ч 13 days ago

    Well, the first thing that comes to mind is to use something slower-growing than exponentials in the normalization layer

  • @alaneric1618
    @alaneric1618 13 days ago

    I've often wondered if using floating point numbers with an extremely large number of bits (like thousands) if any novel learning would happen in the lowest significant digits. I'm surprised this is only now being talked about. Most comp sci people are very familiar with floating point. Many monetary systems have numerical scaling issues that most engineers assume don't exist because all the small numbers are probably just a wash. I once wrote a paper showing how that is not always the case and people are leaving money on the table.

  • @mariusj.2192
    @mariusj.2192 13 days ago

    I don't understand why scaling the logits wouldn't be a helpful learning objective if the logit for the correct class already has the largest logit value.
    The whole network before it would be incentivised to increase whatever contributed to the positive logits to make them more positive and to make the contributions to the negative logits stronger to make them more negative - scaling the output of the LM-head is not limited to the adjustment of its own weights.
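
A short derivation may help with the scaling question. It is a sketch in my own notation, not taken from the video: $z$ are the logits for one example, $y$ is its correct class, and $c > 0$ is a hypothetical factor multiplying all logits.

```latex
\mathcal{L}(c) = -\log \frac{e^{c z_y}}{\sum_k e^{c z_k}}
             = \log\Bigl(1 + \sum_{k \neq y} e^{-c\,(z_y - z_k)}\Bigr),
\qquad
\frac{d\mathcal{L}}{dc}
  = -\frac{\sum_{k \neq y} (z_y - z_k)\, e^{-c\,(z_y - z_k)}}
          {1 + \sum_{k \neq y} e^{-c\,(z_y - z_k)}} < 0
\quad \text{whenever } z_y > z_k \text{ for all } k \neq y .
```

So once the correct logit is already the largest, increasing $c$ always lowers the cross-entropy without changing a single prediction. The optimizer is therefore rewarded for scaling, which agrees with the incentive described above; the catch is that the loss approaches 0 only as $c \to \infty$, and in floating point it reaches exactly 0 at a finite scale.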

  • @drdca8263
    @drdca8263 13 days ago

    22:30 : hm, is it even grokking at that point? This sounds like maybe grokking is just, “the NN after being stuck in a rut, eventually gets lucky and stops being stuck”?
    Ah. Cliffhanger.

  • @yakmage8085
    @yakmage8085 13 days ago

    Love it

  • @oliverdaniels4961
    @oliverdaniels4961 13 days ago

    This could be as big a jump as transformers, or overcoming the gradient explosion in DNNs

  • @mirijanyavo6532
    @mirijanyavo6532 9 days ago

    So, the LLM finally "got it"?
    "it clicked" for the LLM?

  • @fontenbleau
    @fontenbleau 13 days ago

    Why are they using such a dorky word? What about calling that "acceleration" or self-tuning?

  • @aayatrubab
    @aayatrubab 14 days ago

    nice thanks :)

  • @geldoku
    @geldoku 13 days ago

    this video must be important but I don't understand a word of it.

  • @letsRegulateSociopaths
    @letsRegulateSociopaths 12 days ago

    Musk does grokkking

  • @sveindanielsolvenus
    @sveindanielsolvenus 12 days ago

    You are repeating yourself a lot in this video. The topic is very nice, but why say the same thing over and over and over again?

  • @грешилов
    @грешилов 14 days ago

  • @wwkk4964
    @wwkk4964 14 days ago

    Kroggink

  • @rubncarmona
    @rubncarmona 13 days ago +2

    This makes me think of nGPT from Nvidia a bunch

  • @seanharbinger
    @seanharbinger 14 days ago +3

    Grokking is when a hipster in Silicon Valley attempts to summarize a new subject by fluttering their eyelids like an android in download mode - only to emit imbecilic irony.