A Universal Law of Robustness

  • Published Aug 23, 2021
  • I give a tentative theoretical justification for why large overparametrization is important in neural networks.
    Primarily based on "A Universal Law of Robustness via Isoperimetry" by S.B. and Mark Sellke.
    arxiv.org/abs/2105.12806

COMMENTS • 21

  • @JonathanGraehl
    @JonathanGraehl 2 years ago +7

    Summary: Microsoft's Sébastien Bubeck argues that seemingly overparameterized neural models are necessary for learning (due to label noise?). Validation 'early stopping' of training duration or size scaling is a mistake: once you're past the initial hump that would trigger validation early stopping, overfitting is 'benign' [already known, dubbed 'double descent']. Once you can defeat adversarial attacks, you're probably using enough parameters. He (with Mark Sellke) proves that to perfectly memorize the label-noised data set such that small perturbations don't change the output, you need a parameter count much larger than the data set size (perfectly memorizing the training data alone should be possible within some constant factor of its size). He predicts that ImageNet (an image labeling task) could benefit from 10-100 billion parameters instead of the current sub-1-billion.
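The parameter-count claim in the summary can be checked with back-of-the-envelope arithmetic. This is a rough sketch: the constant in the bound is dropped, and the sample count and effective dimension below are illustrative assumptions of mine, not figures from the talk.

```python
def min_params_for_smooth_fit(n, d, lipschitz=1.0):
    """Law-of-robustness heuristic, constants dropped: fitting n noisy
    samples of effective dimension d with Lipschitz constant L needs
    roughly p >= n * d / L**2 parameters."""
    return n * d / lipschitz ** 2

# Illustrative ImageNet-scale numbers (assumed): n ~ 1e6 images and an
# effective dimension d ~ 1e4 put the smooth-fit budget at ~1e10,
# i.e. the tens-of-billions range the summary mentions.
print(f"{min_params_for_smooth_fit(1e6, 1e4):.1e}")  # 1.0e+10
```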

  • @daniel-mika
    @daniel-mika 1 year ago

    Really amazing talk. Thank you for your contributions

  • @Extys
    @Extys 2 years ago +1

    Amazing work!

  • @chatuly
    @chatuly 2 years ago +1

    Really beautiful talk! Thanks, Sebastien! (Chatuli is Shai BD)

  • @jonathanbaxter5821
    @jonathanbaxter5821 1 year ago

    Nice talk, thanks for posting, and a very nice result. That said, it doesn't seem particularly surprising that you'd get effective-dimension dependence like this, given the Lipschitz bounds on your class (which are effectively bounds on the weights and depth in the case of neural nets). Given these lower bounds on parameter dimension for smoothness in training, the most interesting question (to me) is: does smoothness explain generalization? If you're rote-learning the data by sticking bumps in the input space (which is well-behaved Lipschitz, as you explain), it has no right to, so this kind of result by itself doesn't seem to be enough to get there. What extra is needed? Is implicit Bayes, like you get with dropout, also critical? An assumption on the target function, e.g. that it's close to the zero function (in L2)?

  • @grantsmith3653
    @grantsmith3653 1 year ago

    At 14:10, I'm having a hard time understanding why we need label noise. I understand that it's a proxy for the difficulty of the dataset, but I don't understand why it's necessary for the mathematical theory. I think Sébastien is saying something about it being in or not being in the model class. Maybe it's just a lower bound on the difficulty of learning the function... otherwise all the labels could just be 0 or something and the function would be too easy to learn?
    Love the talk, and all the other ones! I'm going through all of these. This is exactly what I'm looking for: theoretical deep learning. Awesome stuff.

  • @ramsever5087
    @ramsever5087 2 years ago +1

    Take ImageNet as an example:
    The training dataset is around 1M images (n), and the dimensionality of the input images is on the order of 50k (d).
    SOTA models with 300M parameters achieve over 90% top-1 accuracy.
    In this case p is much smaller than nd, but we still get good generalization from the network.
    How is the Law of Robustness applicable here? It looks like nd is a very loose bound, right?

    • @SebastienBubeck
      @SebastienBubeck  2 years ago +2

      Right, so the law of robustness does NOT talk about generalization, but rather about the tradeoff between size and smoothness conditional on low training error. So in your example it does strongly suggest that the 300M-parameter network will NOT be smooth (i.e., small perturbations of the input can have a large effect on the output).

    • @marouanemaachou7875
      @marouanemaachou7875 2 years ago +2

      @@SebastienBubeck So could this explain adversarial examples for these ImageNet-trained models?
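Plugging the thread's numbers into the same size/smoothness tradeoff gives a quick sanity check. This is a rough sketch with constants dropped; the precise statement is in the paper.

```python
import math

# Orders of magnitude from the thread above.
n = 1e6  # training images
d = 5e4  # input dimensionality
p = 3e8  # model parameters

# Law-of-robustness heuristic, constants dropped: L >= sqrt(n * d / p).
L = math.sqrt(n * d / p)
print(round(L, 1))  # 12.9: far from O(1)-smooth, consistent with the reply
```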

  • @stackoverflow8260
    @stackoverflow8260 2 years ago

    Can you please give an example of a Kolmogorov-Arnold type network?

    • @SebastienBubeck
      @SebastienBubeck  2 years ago +1

      The Wikipedia entry is quite good: en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_representation_theorem . You can also take a look at the corresponding section in the universal law paper.

  • @akram2s
    @akram2s 2 years ago

    How do you get the effective dimension of a given dataset? Maybe find the smallest dimension of the compressed representation in an autoencoder that results in an acceptable reconstruction? Or is there a better way of doing it?

    • @anvarkurmukov2438
      @anvarkurmukov2438 2 years ago +1

      Curiously, for both MNIST and ImageNet, Sebastien ends up with an effective dimension on the order of the number of classes. This also coincides with my intuition: if you have a “good” definition of separate classes, you should still have “some” variation inside each class, but not too much.
      But I also like your practical idea; maybe, as a first approximation, something like the number of PCA components explaining a large amount of variation will do the job.
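The PCA suggestion above can be sketched as follows. A toy check on synthetic data; the 95% explained-variance threshold is an arbitrary choice of mine, not something from the thread.

```python
import numpy as np

def pca_effective_dim(X, var_threshold=0.95):
    """Smallest number of principal components whose cumulative explained
    variance reaches `var_threshold` (a crude effective-dimension proxy)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values
    ratios = s ** 2 / np.sum(s ** 2)          # explained-variance ratios
    return int(np.searchsorted(np.cumsum(ratios), var_threshold) + 1)

# Toy data: a 3-dimensional subspace isometrically embedded in 10 dimensions.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((10, 3)))  # orthonormal 10x3
X = rng.standard_normal((500, 3)) @ basis.T
print(pca_effective_dim(X))  # 3
```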

    • @SebastienBubeck
      @SebastienBubeck  2 years ago +5

      Both of your comments are quite interesting! I don't know what the "correct" way to estimate the dimension is; it's really a fantastic question.

    • @pt3931
      @pt3931 2 years ago +1

      @@SebastienBubeck: Maybe some function F(rank_PCA)? (It may work for some datasets...)

    • @VishnuNareshBoddeti
      @VishnuNareshBoddeti 2 years ago

      There are ways to estimate the effective/intrinsic dimension of data. We have estimated this for ImageNet features rather than the images themselves, and found that the intrinsic dimension is around 20-30. Estimating it directly from images is challenging due to their high dimensionality.
      Setting the effective dimension to the number of classes is only an upper bound, since that comes from the requirement of linear separability. For non-linear separability it would be lower, perhaps something like log(C).
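For readers curious how such intrinsic-dimension figures are obtained: one standard method (an illustrative choice of mine; the comment above doesn't say which estimator produced the 20-30 figure) is the Levina-Bickel maximum-likelihood estimator based on nearest-neighbor distance ratios. A minimal sketch:

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension.
    Per point: mean log ratio of the k-th nearest-neighbor distance to the
    closer neighbors estimates 1/d; these are averaged and inverted."""
    # Full pairwise distance matrix (fine for small n; use a KD-tree at scale).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sort each row; drop column 0 (a point's zero distance to itself).
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    inv_dim = np.log(knn[:, -1:] / knn[:, :-1]).mean(axis=1)  # per-point 1/d
    return float(1.0 / inv_dim.mean())

# Toy check: points on a 2-dimensional subspace embedded in 10 dimensions.
rng = np.random.default_rng(0)
plane = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 10))
print(mle_intrinsic_dim(plane))  # close to 2
```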