#61: Prof. YANN LECUN: Interpolation, Extrapolation and Linearisation (w/ Dr. Randall Balestriero)

  • Published 29 Apr 2024
  • We are now sponsored by Weights and Biases! Please visit our sponsor link: wandb.me/MLST
    Patreon: / mlst
    Discord: / discord
    Yann LeCun thinks that it's specious to say neural network models are interpolating, because in high dimensions, everything is extrapolation. Recently Dr. Randall Balestriero, Dr. Jerome Pesenti and Prof. Yann LeCun released their paper "Learning in High Dimension Always Amounts to Extrapolation". This discussion has completely changed how we think about neural networks and their behaviour.
    [00:00:00] Pre-intro
    [00:11:58] Intro Part 1: On linearisation in NNs
    [00:28:17] Intro Part 2: On interpolation in NNs
    [00:47:45] Intro Part 3: On the curse
    [00:57:41] LeCun intro
    [00:58:18] Why is it important to distinguish between interpolation and extrapolation?
    [01:03:18] Can DL models reason?
    [01:06:23] The ability to change your mind
    [01:07:59] Interpolation - LeCun steelman argument against NNs
    [01:14:11] Should extrapolation be over all dimensions
    [01:18:54] On the morphing of MNIST digits, is that interpolation?
    [01:20:11] Self-supervised learning
    [01:26:06] View on data augmentation
    [01:27:42] TangentProp paper with Patrice Simard
    [01:29:19] LeCun has no doubt that NNs will be able to perform discrete reasoning
    [01:38:44] Discrete vs continuous problems?
    [01:50:13] Randall introduction
    [01:50:13] Are the interpolation people barking up the wrong tree?
    [01:53:48] Could you steel man the interpolation argument?
    [01:56:40] The definition of interpolation
    [01:58:33] What if extrapolation was being outside the sample range on every dimension?
    [02:01:18] On spurious dimensions and correlations don't an extrapolation make
    [02:04:13] Making clock faces interpolative and why DL works at all?
    [02:06:59] We discount all the human engineering which has gone into machine learning
    [02:08:01] Given the curse, NNs still seem to work remarkably well
    [02:10:09] Interpolation doesn't have to be linear though
    [02:12:21] Does this invalidate the manifold hypothesis?
    [02:14:41] Are NNs basically compositions of piecewise linear functions?
    [02:17:54] How does the predictive architecture affect the structure of the latent?
    [02:23:54] Spline theory of deep learning, and the view of NNs as piecewise linear decompositions
    [02:29:30] Neural Decision Trees
    [02:30:59] Continuous vs discrete (Keith's favourite question!)
    [02:36:20] MNIST is, in some sense, a harder problem than ImageNet!
    [02:45:26] Randall debrief
    [02:49:18] LeCun debrief
    Pod version: anchor.fm/machinelearningstre...
    Our special thanks to;
    - Francois Chollet (buy his book! www.manning.com/books/deep-le...)
    - Alexander Mattick (Zickzack)
    - Rob Lange
    - Stella Biderman
    References:
    Learning in High Dimension Always Amounts to Extrapolation [Randall Balestriero, Jerome Pesenti, Yann LeCun]
    arxiv.org/abs/2110.09485
    A Spline Theory of Deep Learning [Dr. Balestriero, Dr. Baraniuk]
    proceedings.mlr.press/v80/bal...
    Neural Decision Trees [Dr. Balestriero]
    arxiv.org/pdf/1702.07360.pdf
    Interpolation of Sparse High-Dimensional Data [Dr. Thomas Lux]
    tchlux.github.io/papers/tchlu...
    If you are an old fart and offended by the background music, here is the intro (first 60 mins) with no background music. drive.google.com/file/d/16bc7...

COMMENTS • 179

  • @andreye9068
    @andreye9068 2 years ago +14

    Thanks for posting this episode! And as "that guy" at 2:08:19, I'm happy to say I found the discussion very interesting and it's changed my mind :)

    • @nomenec
      @nomenec 2 years ago +2

      Thank you, Andre! And thank you for your article. Apologies we couldn't recall your name on the fly; we did make sure to show your name in the video though ;-) I'm very curious, how did the discussion change your views?

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  2 years ago

      Hey Andre, we really appreciate you dropping in here. Great article! For the benefit of folks -- here it is medium.com/analytics-vidhya/you-dont-understand-neural-networks-until-you-understand-the-universal-approximation-theorem-85b3e7677126

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  2 years ago

      And this was the tweet where LeCun picked it up twitter.com/ylecun/status/1409940043951742981

    • @andreye9068
      @andreye9068 2 years ago +3

      @@nomenec For sure - Twitter really isn't the best platform to exchange nuanced perspectives, so when the Twitter conversation began, I took the disagreement (i.e. between LeCun, Pinker, Marcus, Booch, etc.) to be a sign that it was one of those types of ambiguous problems that one can't really confidently make one's mind up about. A lot of the Twitter thread content seemed pretty speculative or pulled willy-nilly without much organization. When I first read the paper on extrapolation, I was even more unsure of what to think - I was actually wondering about many of the questions that you all asked in the interview, e.g. why choose the convex hull instead of another definition? Does this mean that neural networks are actually extrapolating? etc. After listening to LeCun and Balestriero's responses, I have a much more well-informed perspective on the paper's context and argument, and I think it's probably correct.
      Thanks guys for all the work you do arranging context and asking insightful questions!

    • @parker9163
      @parker9163 1 year ago

      @@MachineLearningStreetTalk p

  • @AICoffeeBreak
    @AICoffeeBreak 2 years ago +72

    This is incredible! Ms. Coffee Bean's dream came true: the extrapolation interpolation beef explained in a verbal discussion! 🤯
    Cannot wait to watch this. So happy about a new episode from MLST. You kept us waiting.

    • @nomenec
      @nomenec 2 years ago +9

      Thank you, Letitia! We burned the midnight oil for weeks on this one; we are looking forward to the community enjoying (hopefully!) the effort. We are grateful to both Yann LeCun and Randall Balestriero for spending time with us!

  • @DanElton
    @DanElton 2 years ago +19

    I'm literally working on a blog post about how deep learning is interpolation only, based on double descent phenomena and distribution shift issues, and then this drops!! Lol

    • @ketilmalde3402
      @ketilmalde3402 2 years ago +1

      Link (when you're done)? In general, where is a good place to discuss this? So many questions... but YT tends to drown in a zillion low-quality comments as soon as anything gets popular. The AI Stack Exchange?

    • @Smalldatalooser
      @Smalldatalooser 2 years ago +1

      @@ketilmalde3402 Are you the Ketil Malde who wrote the arXiv paper about 'semantically' meaningful learning of plankton images with Siamese NNs?
      I just completed my bachelor's thesis using semi-supervised SimCLR for plankton image categorization and really enjoyed and heavily used it. However, I was not able to produce such nice clusters in representation space. So if you are the one who wrote it: thanks a lot for the inspiration!

    • @ketilmalde3402
      @ketilmalde3402 2 years ago +1

      @@Smalldatalooser yes, that would be me :-) Thanks for the kind words! I really should try to get it published properly, but the review asked for lots of detailed changes and a full resubmission (rather than a revision with a deadline), so it got kinda left by the roadside. And the field moves so quickly and I've learned a lot since then, so nowadays I would probably use a different method (like you did).

  • @leinarramos
    @leinarramos 2 years ago +26

    Just spent 5 hours watching this 3-hour video. This is both dense and profound. Great job, best episode yet in my book!

    • @nomenec
      @nomenec 2 years ago +4

      Thank you for your time and commitment!

    • @EmileAI
      @EmileAI 1 year ago

      I spent 7h lmao
      I'm still too new to machine learning
      I love this episode

  • @abdurrezzakefe5308
    @abdurrezzakefe5308 2 years ago +11

    I wait for your videos with more excitement than for my favorite TV shows' new seasons. Looks amazing!

  • @xorenpetrosyan2879
    @xorenpetrosyan2879 2 years ago +3

    imagine the balls to make a 1-hour intro before the main discussion :D

  • @stalinsampras
    @stalinsampras 2 years ago +19

    A couple of minutes into the video and you break some of the fundamental assumptions I had about deep learning/neural nets. Jeez, man. Excited for this 3-hour-long video.
    And as usual, the production quality of the videos keeps getting better. Happy New Year, guys!

    • @nomenec
      @nomenec 2 years ago +4

      Happy New Year! Tim and I certainly walked away with very different (upgraded, in my opinion) views on neural nets. Would love to learn how, if at all, your views change after watching.

  • @BenuTuber
    @BenuTuber 2 years ago +15

    Starting off the new year with a bang. Tim, Keith and Yannic - thank you so much for this quality work. You can clearly tell how much love and dedication goes into every episode. Also the intros just continue to amaze me - the level of understanding you approach the variety of topics with is extremely inspiring.

  • @teksasteksasen1249
    @teksasteksasen1249 2 years ago +5

    Sooo what are the odds we can get a conversation between LeCun and Chollet? Would love to watch them have a discussion on this.

  • @Kerrosene
    @Kerrosene 2 years ago +1

    "Occam's razor always makes straight cuts" (in reference to piecewise linear functions) was a great line!

  • @jaapterwoerds9850
    @jaapterwoerds9850 2 years ago +1

    The content on this channel is just mind-blowing. But the main reason I come back is the thoughtful editing, introductions and reflections on the content by Dr. Tim. I cannot yet keep up with all the content in real time, but that is exactly why it's so awesome. Thanks!

  • @victoroko3954
    @victoroko3954 2 years ago +3

    You guys really kept us waiting.
    Thank you, MLST, for this one.

  • @Artula55
    @Artula55 3 months ago

    I think I have seen this video over a dozen times, but every time I keep learning something new. Thx MLST!

  • @vishalrajput9856
    @vishalrajput9856 1 year ago +2

    Thank you guys, I've not been more amazed by anything in AI than this completely brand new revelation of neural networks' internal workings. Insanely interesting and beautiful.

  • @lucca1820
    @lucca1820 2 years ago +3

    We need more of Prof. Yann LeCun!

  • @ChaiTimeDataScience
    @ChaiTimeDataScience 2 years ago +10

    WOHOOOO! I'm so so stoked to see this video!
    Time to drop everything and watch another epic interview by the MLST team!

    • @nomenec
      @nomenec 2 years ago +4

      Cheers! Just don't drop your Chai! ;-)

  • @oncedidactic
    @oncedidactic 2 years ago +10

    5 minutes in and it feels like extended Christmas :D
    So glad to have the show back!

  • @YoungMasterpiece
      @YoungMasterpiece 11 months ago

    I love the analogy 'I feel like I'm standing on Pluto', nice :)

  • @dr.mikeybee
    @dr.mikeybee 2 years ago

    Tim, your statement about neural networks being analogous to classical decision trees absolutely hits home.

  • @Soul-rr3us
    @Soul-rr3us 11 months ago +1

    I keep coming back to this. One of the best MLSTs.

  • @stretch8390
    @stretch8390 2 years ago +1

    Time to write the afternoon off and make the most of an incredible opportunity in listening to this discussion.

  • @barlowtwin
    @barlowtwin 2 years ago +2

    Just got done watching it. Grateful for the great work the team has done. Cheers :)

  • @OisinNolanChannel
    @OisinNolanChannel 2 years ago +4

    Love these long form videos -- really appreciate the effort you guys are putting in!!

  • @vigneshpadmanabhan
    @vigneshpadmanabhan 1 year ago

    Thank you for creating this amazing channel. The amount of insight one can get from sitting for three hours with the professionals is immense!

  • @johanneslaute3675
    @johanneslaute3675 2 years ago +3

    Great episode! These long deep dives are amazing, I get a lot of intuition from them and they are a great point to start reading more papers on the topic (who except Yann can keep up with arXiv these days...). Really appreciate the effort and have a great 2022 :)

  • @mikenashtech
    @mikenashtech 2 years ago +3

    Fantastic discussion and explanation of the thinking behind interpolation, extrapolation and linearisation. This has really helped shift the needle towards the ultimate problem we all face: helping decipher what input is relevant to the task. If possible, please do V.2 covering some of the other concepts Prof. LeCun was talking about. Could be a series on its own as it's so good! Mike Nash - The AI finder

  • @madmanzila
    @madmanzila 1 month ago

    Well done, guys, it's really a pleasure to be diving into this field

  • @scottmiller2591
    @scottmiller2591 2 years ago +1

    Interesting talk - I'm working on a pile of notes, amplifications, and critiques.

  • @paxdriver
    @paxdriver 2 years ago +3

    Happy New Year's!!! I've missed you guys

  • @dinoscheidt
    @dinoscheidt 2 years ago

    They are back ❤️ if only youtube decided to use that bell 🔔. Great talk - thank you very much for all your efforts!

  • @abby5493
    @abby5493 2 years ago +2

    Wow! What an amazing video! Best one yet!

  • @tchlux
    @tchlux 2 years ago +4

    Thanks for the shoutout at 38:01 Tim! The Discord channel rocks 😆
    An additional note on extrapolation that people might find interesting:
    - In effect, the ReLU activation function prevents value extrapolation to the left. So when these are stacked, they serve as "extrapolation inhibitors".
    - This clipping could be applied to other activation functions to improve generalization (or forewarn excessive extrapolation)!
    - I.e., clipping the inputs to all activation functions within a neural network to be in the range seen at the end of training time will reduce large extrapolation errors at evaluation time (and counting the number of times an input point is clipped throughout the network could indicate how far "outside the relevant convex hull" it is).
    The clipping shouldn't be introduced until training is done (because we don't have a reason to assume the initialization vectors are "good" at identifying the relevant parts of the convex hull). But I'd be willing to bet that this "neuron input clipping" could improve generalization for many problems, is part of why ReLU works well for so many problems, and can prevent predictions from being made at all for adversarial inputs.
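    A minimal sketch of that "neuron input clipping" idea, assuming a plain NumPy MLP with a linear output layer (the network, shapes and helper names here are illustrative, not from the paper):

        import numpy as np

        def relu(z):
            return np.maximum(z, 0.0)

        def record_ranges(X_train, weights, biases):
            # Per-unit pre-activation min/max over the training set,
            # recorded only after training is finished.
            ranges, h = [], X_train
            for W, b in zip(weights[:-1], biases[:-1]):
                z = h @ W + b
                ranges.append((z.min(axis=0), z.max(axis=0)))
                h = relu(z)
            return ranges

        def forward(x, weights, biases, ranges=None):
            # With `ranges` given, every hidden pre-activation is clipped to
            # the interval seen in training: the "extrapolation inhibitor".
            h = np.atleast_2d(x)
            for i, (W, b) in enumerate(zip(weights, biases)):
                z = h @ W + b
                if ranges is not None and i < len(ranges):
                    z = np.clip(z, *ranges[i])
                h = relu(z) if i < len(weights) - 1 else z
            return h

    Counting how often a given input gets clipped on its way through the network then gives the rough "how far outside the relevant convex hull" signal mentioned above.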

    • @oncedidactic
      @oncedidactic 2 years ago

      "[Activation clipping] ... can prevent predictions from being made at all for adversarial inputs." Would love to hear more about this line of thinking! Both practical side and what this illuminates on the theory side about "what does it mean to be adversarial / robust / etc". You guys didn't get a chance to discuss adversarial stuff on your chat episode much at all but it seems to abut the topic of generalization quite often which in turn tends to come up with geometric interpretation.

    • @tchlux
      @tchlux 2 years ago +1

      ​@@oncedidactic happy to clarify. One way I like to think about it is that every basis function inside an MLP (the activations at a node after applying the nonlinearity) generates a distribution. If you have 10k points at training time, then for every internal 1D function you can plot the distribution of the 10k values at those points. That should give a pretty precise definition of the CDF (from central limit theorem), and rather tight bounds of what is "in distribution" (/ likely given observations). The issue is that the generated distribution of values at internal nodes over training data is (obviously) not independent of the training process. So to get an accurate estimation of the distributions we withhold validation data, which provides a true estimation of the error function (the error of the model over the space covered by the validation data).
      Now when you apply the model to new data, you can look at the values produced at internal nodes relative to the distributions seen at training / validation time. If you observe that a single evaluation point produces "out-of-distribution" (extrapolative) values for a substantial number of nodes in the model, then we know for certain that the point is not "nearby" to our training data. Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱
      One of the core mechanisms for making approximations outside the bounds of training data is projecting new points back into the region of space where you can make an approximation (usually onto the convex hull). So in practice we can project points onto the convex hull of the 1D basis functions by clipping all values to the minimum and maximum seen at training time. We would want to do this mainly because we have no reason to assume that the linear fit produced by one node (and its infinite linear extrapolation to the right) is correct! No training data justified that behavior. If we let our basis functions extrapolate without bounds then our error *definitely* grows without bounds. If we prevent infinite extrapolation, then we *might* be bounding our error too.
      To tie it all together, the distributions of values seen at validation time (more validation data ➞ better distribution estimates) should *precisely* match the distributions for testing. If they do not, then you know that something about the data has changed (from training & validation time) and your error will change in a commensurate fashion (in an unknown way). This relates to another important fact: we can never modify a model based on validation error. If we make decisions based on validation error, then we entirely undo the (necessary) orthogonality of the validation set (and hence remove our ability to estimate error).
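      One way to operationalize that check, as a rough sketch (same illustrative NumPy MLP setup as the clipping sketch above; the abstention threshold is an assumption):

          import numpy as np

          def ood_fraction(x, weights, biases, val_ranges):
              # Fraction of hidden units whose pre-activation at x falls outside
              # the [min, max] band recorded on *validation* data. A high value
              # means x is out of distribution, so the validation error estimate
              # tells us nothing about the error at x.
              h = np.atleast_2d(x)
              outside = total = 0
              for (W, b), (lo, hi) in zip(zip(weights[:-1], biases[:-1]), val_ranges):
                  z = h @ W + b
                  outside += int(np.sum((z < lo) | (z > hi)))
                  total += z.size
                  h = np.maximum(z, 0.0)
              return outside / total

      A model could then simply refuse to predict when, say, more than 10% of its units are out of range for a given input.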

    • @oncedidactic
      @oncedidactic 2 years ago

      ​@@tchlux Thanks for the detailed reply! Any further reading you can point to? It makes perfect sense to me you would want to use the clipping / projection to learned convex hull to prevent wild extrapolation that leaves you at the mercy of "out-of-distribution", be that natural or adversarial. I can't think of an example where this is implemented but my knowledge is *not* deep. I imagine this curtails the "magic" of kinda-sorta extrapolating well sometimes, but you win the tradeoff because the limitation of your model is predictable. Or in other words predictably dumb is better than undependably intelligent, as a system component.
      "Even more, if the new point is out of distribution for the validation data, then that means we don't have a guess as to what the error looks like! 😱" This is so insightful yet simple and really reframes the whole issue for me. Not to pile on too much, but this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before."

    • @tchlux
      @tchlux 2 years ago +1

      @@oncedidactic
      > Any further reading you can point to?
      I mostly just think about things in terms of basic linear algebra. If you get super comfortable with matrix multiplication, linear operations, and think really hard about (or better, implement) a principal component analysis algorithm (any method), then you'll start to form the same intuitions I have (for better or worse 😜).
      I try to think of everything in terms of directions, distances, and derivatives (/ rates of change). I can't think of any "necessary" knowledge in machine learning that you can't draw a nice 2D or 3D picture of, or at least produce a really simple example. I suggest aggressively simplifying anything until you can either draw a picture or clear example with minimal information. If it seems too complicated, it probably is.
      Stephen Boyd's convex optimization work (YouTube or book) is great. And 3blue1brown is wonderful too.
      > this just feels like another sign that it seems pointless to expect better training sets or bigger models to ever overcome the problem of "something you haven't seen before."
      Exactly. People will probably continue to talk about it forever, but it only makes sense to *extrapolate* in very specific scenarios with relatively strong assumptions. What we really want in most cases is a model that identifies a low dimensional subspace of the input where it can accurately interpolate.
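      In that spirit, a minimal PCA via SVD is only a few lines (a toy sketch, nothing more):

          import numpy as np

          def pca(X, k):
              # Rows of X are points. Returns the k orthonormal directions of
              # greatest variance and each point's coordinates along them.
              Xc = X - X.mean(axis=0)                  # center the cloud
              U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
              return Vt[:k], Xc @ Vt[:k].T

          rng = np.random.default_rng(0)
          X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))  # rank-3 cloud in 10-d
          directions, coords = pca(X, 3)               # 3 numbers per point suffice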

  • @NelsLindahl
    @NelsLindahl 2 years ago

    Your build quality here is really high. Nice work. My only comment is that I had to give parts of this video my full attention. That is probably a good thing.

    • @nomenec
      @nomenec 2 years ago +1

      Lol, cheers and thank you!

  • @flooreijkelboom1693
    @flooreijkelboom1693 2 years ago

    Awesome! Thanks again for arranging this :) !

  • @hossromani
    @hossromani 7 months ago

    Great video - excellent conceptual discussion

  • @crimythebold
    @crimythebold 2 years ago +2

    Loved this one. Again

  • @michaelwangCH
    @michaelwangCH 2 years ago +2

    I spent the past two years at uni and attended all the ML- and AI-related classes to try to understand DNNs, because no one in the CS department could answer my questions in a way that let me intuitively understand what a DNN is doing, and how and why it is doing it.
    Tim, thanks for the enlightening explanation.

    • @TimScarfe
      @TimScarfe 2 years ago +3

      Thanks a lot, Michael! But don't thank us too much; most of this wisdom is coming directly from Chollet, Balestriero and LeCun. We are just digesting their fascinating ideas and presenting them in the best way we can.

  • @juusokorhonen1628
    @juusokorhonen1628 2 years ago +1

    Came here from a Lex Fridman video, and gotta say these make the perfect combination (especially now that Lex has branched into some topics outside AI). Keep delivering this fantastically specialised content👍

    • @nomenec
      @nomenec 2 years ago

      Thank you, Juuso! I really appreciate that. Tim and I often struggle with finding the right balance while keeping it (hopefully) entertaining. It's not easy and we are also trying to brainstorm ways to improve. So, it's great to hear from a satisfied viewer!

  • @SLAM2977
    @SLAM2977 2 years ago +2

    Great stuff guys, LeCun is next level!:)

  • @saundersnecessary
    @saundersnecessary 2 years ago +1

    Just wonderful thank you!

  • @leonidas193
    @leonidas193 2 years ago +1

    Great episode, keep up the good work.
    Agree with reasoning = optimization, at least for the reasoning that we currently do with machine learning. There is also a well-known result in optimization which states that separation = optimization, where separation means finding a separating hyperplane between a point and some convex hull. So, in other words, membership, i.e. interpolation, is optimization.
    Many of these concepts have been well known in the optimization community for some time now. For instance, linear vs nonlinear or discrete vs continuous are known to make little difference, while convexity is the main concept that makes things tractable. Also, the curse of dimensionality can be avoided if you formulate the problem combinatorially, as a graph for instance, which is dimensionless.
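    That membership test really is a small feasibility LP: x is in the convex hull of the sample points iff there exist weights lam >= 0 with points.T @ lam = x and sum(lam) = 1. A scipy sketch (illustrative only; it also reproduces the paper's observation that random high-dimensional test points essentially never land inside the hull):

        import numpy as np
        from scipy.optimize import linprog

        def in_convex_hull(x, points):
            # Pure feasibility: zero objective, equality constraints encode
            # "x is a convex combination of the rows of `points`".
            n, d = points.shape
            A_eq = np.vstack([points.T, np.ones((1, n))])
            b_eq = np.append(x, 1.0)
            res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                          bounds=[(0, None)] * n, method="highs")
            return res.success

        rng = np.random.default_rng(0)
        train = rng.normal(size=(100, 50))                 # 100 samples in 50-d
        print(in_convex_hull(train.mean(axis=0), train))   # True: the centroid
        print(in_convex_hull(rng.normal(size=50), train))  # almost surely False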

  • @youseftraveller2546
    @youseftraveller2546 2 years ago

    Great interviews, with many abstract ideas made simple; I want to wish you all great success, and I will wait for more interesting conversations to come. I am coming from a computational engineering background. In my field, we are looking for models that can extrapolate for problems that can be categorized as a mix of differentiable and discrete in nature. Is there any possibility of a future video that discusses the ideas of the current episode but oriented more toward computational engineering and physics problems? Thanks and Happy New Year

  • @arvisz1871
    @arvisz1871 2 years ago +1

    Well done! 👍

  • @rafaeldelaflor
    @rafaeldelaflor 1 year ago

    Thanks guys

  • @sabawalid
    @sabawalid 2 years ago +4

    Very very good episode guys - kudos, as always
    I have a problem with LeCun's strong statement that "reasoning = optimization" (that most reasoning can be simulated by minimizing some cost). Inference/deduction is not optimization. That's not true at all.

  • @connorshorten6311
    @connorshorten6311 2 years ago +2

    Amazing, congratulations!

  • @rgarthwood3881
    @rgarthwood3881 2 years ago +1

    If you use ReLUs and simple feed-forward networks, yes, they're tessellations; but not with non-linear activation functions and inter-layer feedback connections. An example of the latter is the transformer hypothesis class.

  • @UmairbinMansoor
    @UmairbinMansoor 7 months ago

    The talk was beautifully presented... Thank you all
    My question is: why are we considering that the new sample (the test set) lies outside the convex hull of the training data, considering the dataset strictly represents a domain like pictures with or without cats?
    My second question is: in signal processing, the impulse contains all the frequency content, which is why we can characterize any filter by its impulse response. Having said that, for a particular domain, can we have a training set that completely characterizes the problem, and hence an ML model for which any test data must then lie within the convex hull...???

  • @gren287
    @gren287 2 years ago +2

    Reflected ReLU > ReLU 😎
    I want a neural network from you Tim ❤

  • @SimonJackson13
    @SimonJackson13 2 years ago

    The surface dividing the training set in two? How many would there be and are some better to consider as AND with the "search term"? Multiple max entropy cosearch parallelism?

  • @EngineeringNibbles
    @EngineeringNibbles 2 years ago +1

    Amazing video, the bass is super high though! Wish it was a little lower as it requires manual EQ

  • @SimonJackson13
    @SimonJackson13 2 years ago

    The converdivergence of x^n at x=1 saddle unstable point even implies input scaling has a convergence implication on a polynomial fit.

  • @citizizen
    @citizizen 2 years ago +1

    Note@15 min: if you create hyperplanes, this, my guess, will partake into extra usable information per hyperplane. No proof though.
    Note@28 min: one OR at a time; not to give properties to objects such that you lose the 'single or instant'.
    Note@38min: Experience pays off.
    Note@:41min: "math lump", creating simple datasets and putting those together. Like a sentence of 'objects'. You play with the : "semantics".
    Note@45min: Can one throw an object through all of the information present at hand and see what it does? Like an analysis: (one object at a time (no dogma)), and see, which manifold is strong and which is not.. (to entangle time as it were (@ 46.50 min))
    Note@46min: I simply love this video!
    Note@53min: So if we have a ball (lot of density), we could encode only its traits we want to have and work with that.
    Note@1:02min: You need to build from certain objects, only a single spot. Not an object you need to redraw in each case. Such that it can be applied.
    -(question) IF you are inspired at 50 minutes and see something at 60 for more inspiration and add it to the 50th minute inspiration. IS this wrong?
    -IS it possible to let some data collect some data over time and notice as it were where it is going. Perhaps even creating objects that are good in this and adding these to one's data analytic toolkit. Having one such single object, is interesting simply in itself. Perhaps creating a vocabulary of some kind???
    Term: "dataplatonic" mindset
    @on the curse: : "jackpot ;-)" ,,
    Note@51min: i guess it is utile to acquire virtualized versions of objects. Such that the data takes account of 'objects', i.e. : terms. Like a circle or a square as circle and square.
    So, if we have a term, like a concept, we should generalize(?) it into something that we can use. So getting rid of 'drawing' objects... I guess a 'vocabulary' of a dataset is a nice concept as well..
    How to make a concept. Keep track of it. Like: a point drawn, becomes a sphere. So if we create an animation, we re-encode this into data for data analysis... Perhaps even creating synesthesia for the sentences created. Such a 'gift', might parametrize for people watching.
    Current conclusion: Each thing you want to analyse needs to be built up itself, such that you do not take big objects but building block parameters.. Such that the result is not about objects but building blocks that might be like bigger objects, but without the crap (data intensive). One wants to get rid of .. and let the computer do it. Building the right concepts by the computer and by guidance of the hand.
    Note@01:33min: if you got a function where the energy is understood (being zero). You can grow and shrink it and add it to 'a sentence'. Next you should be able to adapt (add/subtract) these, use such functions in line, and create a kind of word sequence.

  • @SimonJackson13
    @SimonJackson13 2 years ago

    A discrete attraction chaoform. Convergence to attractor locations as solutions of time series. Then a disjunct split and fold to exceptional zones surrounding expected precursors to exception. Then train for drop errors triggering exceptional close zone to chaoform large split discreet?

  • @fredericln
    @fredericln 2 years ago

    Great, great talk! My reaction is based on the first 37' only, but before I go to sleep and forget… two (very non-expert) cents. 1) around 15', you say that NNs basically try to find boundaries and don't care about the internal structure of classes. How far does this hold? Loss functions do take into account how far the data point is from the boundary of the class (how dog-typical this dog is, etc.). For sure this is only one tiny part of what 'class structure' can encompass. 2) (I'm quite sure I will find the answer in the remaining part, but) ReLUs are different from previous, e.g. logistic, activation functions, which were basically smoothed separators, smoothed piecewise constant functions. ReLUs are not constant on the x>0 side :-) - which I found dangerous at first (how far will this climb? how much will a single ReLU influence the outcome, on out-of-distribution test points?) - but doesn't *that* add to the ability to extrapolate, i.e. to say things about what happens far from the convex hull of training points?

  • @filoautomata
    @filoautomata 2 years ago

    42:31 - 42:41
    In 1993 there was an architecture called ANFIS
    that combines the interpretability and monotonicity of a fuzzy logic inference system
    with the adaptiveness of a neural network.
    ANFIS guarantees a smooth, gradual change of prediction under slight modification of the input, thanks to the smoothness and monotonicity of fuzzy logic, while still being optimizable with a gradient-based optimizer if desired

  • @BuFu1O1
    @BuFu1O1 1 year ago

    Part 3 on the curse of dimensionality 🤯

  • @federicorios1140
    @federicorios1140 2 years ago +1

    This is fucking crazy, there's just no other way to put it.
    The idea of piecewise linearity of a neural network is the single biggest opening of the deep learning black box that I have ever seen

    • @nomenec
      @nomenec 2 years ago +1

      Cheers, Federico! I share your opinion; for me it was an eye-opening viewpoint.
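      To make the "opened box" concrete: on each polyhedral cell a ReLU net literally is one affine map, and you can read that map off from the activation pattern. A toy sketch with random weights (all shapes illustrative):

          import numpy as np

          rng = np.random.default_rng(0)
          W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)   # 2-d in, 8 hidden
          W2, b2 = rng.normal(size=(8, 1)), rng.normal(size=1)   # 1-d out

          def net(x):
              return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

          def local_affine(x):
              # The set of ReLUs that are "on" at x picks the cell; inside it
              # the whole network collapses to a single affine map A, c.
              mask = (x @ W1 + b1) > 0
              A = (W1 * mask) @ W2          # effective slope on this cell
              c = (b1 * mask) @ W2 + b2     # effective offset on this cell
              return A, c

          x = rng.normal(size=2)
          A, c = local_affine(x)
          print(net(x), x @ A + c)          # identical: same cell, same map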

  • @cambrinus4
    @cambrinus4 2 years ago

    Great and very inspiring interviews. Thank you!
    I wonder how to explain the fact that CNNs learn very practical features in the first layers, like edge detectors and texture detectors, from the perspective of the spline/tree theory (I mentioned these because we know what they do and that they are present in NNs). Of course we know that they are used by NNs to split the latent space, but I think that the fact that NNs are able to figure out such specific features at all is enough of a qualitative difference compared to decision trees to question if the analogy to decision trees makes sense at all. Yann LeCun claims that in high-dimensional spaces everything is an extrapolation; I think it's valid to ask if in high-dimensional spaces everything is decision-tree-like hyperplane splitting.

  • @ClaudeCOULOMBE
    @ClaudeCOULOMBE 2 years ago

    Enlightening episode! A bit long, but an exciting subject... I would have appreciated having François Chollet in this debate. Unfortunately, the elephant is not in the room...

  • @user__214
    @user__214 1 year ago

    Great video! Here's a question I have after reading the papers, if anybody can help me:
    Hypothetically, if, say, the MNIST digits *did* lie on a lower-dimensional manifold, then by definition all new data points would fall on that manifold, right? So in the Extrapolation paper, when they show in Table 1 that the test set data doesn't even fall within the convex hull of the ResNet *latent space*, this must mean either 1) ResNet is doing a poor job of learning the true latent space, or 2) MNIST digits do not actually fall on a lower-dimensional manifold.
    Is that right?

  • @PhilipTeare
    @PhilipTeare 2 years ago

    Does GELU not smooth these polyhedra from a geodesic structure into a continuous smooth manifold?

  • @User127169
    @User127169 2 years ago

    Dr. Randall is saying that even in the generative setting, in a GAN's latent space (which has a large number of dimensions), there is no interpolation (due to the curse of dimensionality, of course). What, then, is the explanation of why these models even work, and how do they manage to generate new examples? I can't quite figure it out. Great video, enjoyed it!

  • @citizizen
    @citizizen 2 years ago

    @1:36min: if we create labels for important stuff. These can be used again. Kind of 'meta propagation'. To be able to take something up. Building up a vocabulary.
    Note: IF we can have a tiny center where a lot can happen. This can be applied on say: a hand or a foot. If we have A and B connected, we do not need all that happens in between. I guess one wants to create something that is applicable everywhere. Teleportation.
    @1:46min: something differentiated, and molded together with related stuff (not yet known). Like velocity and acceleration together with the images related to it. Next normalize such information, into single principles (i guess normalization and making objects with what is normalized might be a way of creating : concepts).
    Note: IF the will can be defined as 'one or a couple of objects, taken together at once', then you must be able to work with such (like how to work in a database). Perhaps apply it as a regular expression? This can become very very aggressive, and thus interesting.
    Note: a language such that we can derive where the machine is about. Like: visualizing what happens. (disentangle). Normalization.
    To normalize a principle. Perhaps making a database of normalized principles.
    @1:56min: perhaps create classes, like : per dimension a way to go about.
    @02:00min: Match! Got the same idea somewhat.
    Note: a language that generates generation-5 programming languages (relational language). Then terms normalized, put in a dataset. So, with a proper 'calculus', one can create discrete objects, like: if it repeats a pattern on itself again: one needs 2 circles. (example).
    You do not need to know everything, If you get a couple of dimensions you work in. Like: 1, 2 and 4. Then this can be called discrete because you solve it with (underneath), these. I label this will because you can let those 3 work together and learn like that.

  • @rafaeldelaflor
    @rafaeldelaflor 1 year ago

    The order of the equations matters is what they have established. This was already a basic tenet of symbolism.

  • @Hexanitrobenzene
    @Hexanitrobenzene 1 year ago

    ~2:55:00
    I think the discrete vs continuous dichotomy is not so absolute. The human brain seems to be an analog system, but it can emulate discrete reasoning. Computers are discrete machines, but with neural networks they can emulate continuous reasoning. The main problem seems to be efficiency: emulating one via the other is extremely inefficient, which is why Dr. Balestriero noted that a hybrid system would be the most efficient.
    EDIT: Yup, a little later Keith noted that, too.

  • @jacquesgouimenou9668
    @jacquesgouimenou9668 2 years ago +1

    wahoo! it's amazing

  • @PhilipTeare
    @PhilipTeare 2 years ago

    I didn't catch the new non-contrastive method Yann mentions after BYOL and Barlow Twins. Does anyone know?

  • @CharlesVanNoland
    @CharlesVanNoland 1 year ago

    For inputs that lie within the training data it's an ellipsoid. For inputs that lie outside of the training data I imagine more of a paraboloid. It seems like data could lie both inside of training data in some dimensions and outside in other dimensions, which makes it some kind of ellipsoid paraboloid hybrid. Is this a thing?

  • @willd1mindmind639
    @willd1mindmind639 2 years ago +1

    I believe most of this is just a byproduct of the fact that brain neurons operate in analog space while computer neural networks are digital, which is an approximation of analog data where sampling is always relevant. The other issue is the fact that all data in neural networks are collected together in a singular bucket with a singular answer for various learned scenarios. Whereas in the brain things are much more decomposed into component pieces or dimensions which become inputs into higher order reasoning processes. And this is what leads to the human cognitive evolution that creates language from symbols, with embedded meanings and things like numbers and mathematics.
    An analogy for this is to say that each individual arabic numeral has a distinct identity function (learned symbol pattern recognition) corresponding to a set of neurons in the brain. Separate from that you have another set of neurons that have learned the concept of numbers and can associate that with the symbol of a number. And separate from that there is a set of neurons that have learned the principle of counting associated with numbers. That is a network of networks that work together to produce a result. And as such the brain can learn and understand linear algebra and do calculations with it because of the preservation of low level atomic identity functions or logic functions that are not simple statistics problems. Meaning the brain is a network of networks where each dimension is a distinct network unto itself as opposed a singular statistical model.

    • @nomenec
      @nomenec 2 years ago +1

      I think that's a very nice way of looking at things. In a sense, NNs breaking up the space into polyhedra is like a simple hacked version of a network of networks. They are encoding little subunit networks, by virtue of the ReLU activations, that are then forced into shared latent value array representations. That introduces artifacts and isn't as flexible as networks of networks. The killer for trying to train networks of networks is the combinatorial blowup that happens when exploring the space of all possible connection configurations. And it's why so much of what makes NNs work today is actually the human engineering of certain network architectures that structurally hardcode useful priors. Great comment, thank you!

    • @willd1mindmind639
      @willd1mindmind639 2 years ago

      @@nomenec Thanks. It is definitely a much simpler quantification effort for dealing with probability calculation within a bounded context as defined by the algorithmic model, data provided for training and tuning of calculations. However, there is no reason not to investigate more open ended architectures, especially as a thought exercise of how such a thing would be possible.

  • @siarez
    @siarez 2 years ago

    So if I replace ReLU with Softplus, does that break their arguments?

    • @nomenec
      @nomenec 2 years ago +1

      On the one hand, sure, arguments which repeatedly and explicitly state that they apply to piecewise linear functions do not immediately apply outside that assumption. On the other hand, that is not evidence against a more general interpretation of the conclusions, and there is soft evidence that in many problem spaces NNs, regardless of their activation functions, are driven towards chopping up space into affine cells. Some examples of such soft evidence are 1a) the dominance of piecewise linear activation functions overall or otherwise 1b) the dominance of activation functions that are asymptotic (including your softplus example), and 2) the dominance of NN nodes structured as nonlinear functions of linear combinations as opposed to nonlinear combinations.
      The consequence of 2) is that the softness still falls along a linear hyperplane boundary! And given 1b), there is a distance at which it effectively behaves as a piecewise linear function. It becomes a problem-specific empirical question as to how much an NN actually leverages the curvature versus the asymptotic behavior. My claim, and it's just a gut conjecture based on soft evidence at this point, is that for most problems and typical NNs, any such activation function curvature is incidental and/or sometimes useful for efficient training, and that's why ReLU and the like, which "abandon all pretense at smooth non-linearity", as I said in the video, are dominating.
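      A quick numerical illustration of that "distance at which it behaves piecewise linearly", using softplus with a sharpness parameter beta (a sketch, nothing more):

          import numpy as np

          def softplus(x, beta):
              # Numerically stable log(1 + exp(beta * x)) / beta.
              return np.logaddexp(0.0, beta * x) / beta

          x = np.linspace(-5.0, 5.0, 1001)
          relu = np.maximum(x, 0.0)
          for beta in (1, 4, 16, 64):
              print(beta, np.max(np.abs(softplus(x, beta) - relu)))
          # The gap peaks at log(2)/beta exactly on the hyperplane x = 0 and
          # decays exponentially away from it: the curvature lives only near
          # the boundary, and the far-field behavior is piecewise linear.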

  • @henryzwart1024
    @henryzwart1024 2 years ago

    I'm only 20 minutes in, so will remove this comment if it's answered in the discussion, but....
    How do you think smooth activation functions (e.g. ELU) would affect the polyhedral covering of the feature space? If ReLU functions create hard boundaries between separate polyhedra, would smooth functions create smooth boundaries? Or perhaps weighted combinations of polyhedra?

  • @dr.mikeybee
    @dr.mikeybee 2 years ago

    I wonder if a machine's ability to find a non-linear function or to integrate one would be analogous to what Stephen Wolfram calls computational reducibility? Certainly, an agent can call a non-linear function rather than a piece-wise linear model.

  • @rafaeldelaflor
    @rafaeldelaflor 1 year ago

    Can anyone discuss or comment on extrapolation in the context of projection volume? Or in more than 3D?

  • @SimonJackson13
    @SimonJackson13 2 years ago

    Extrapolation is interpolation where one endpoint is magnified by some potentiate of infinity controlled by end zone locking. The outer manifold potentiate?

    • @SimonJackson13
      @SimonJackson13 2 years ago

      The reflectome of the outer manifold into the morphology of the inner trained manifold to achieve greater performance from the IM. The focal of the reflectome as a filter to multibatch the stasis of the correct?

  • @NehadHirmiz
    @NehadHirmiz 1 month ago

    at 3:04:08 Yannic foreshadowing "active inference" 😁

  • @dr.mikeybee
    @dr.mikeybee 2 years ago

    If high-dimensional spaces only have varying gradient in 16 or fewer dimensions, doesn't that suggest that principal component analysis should always be run?

    • @nomenec
      @nomenec 2 years ago +2

      Do you mean to run PCA on the ambient space and then throw away all but the top-K eigenvectors? Or just run PCA and use the entire transformed vector as input data points instead of raw data points?
      If the former, I guess the fear (probably justified) is that we'd be subjecting the entire data set to a single linear transform and possibly throwing out factors that are only useful in smaller subsets of the data. Instead, NNs are able to chop up the space and use different transforms for different regions of the ambient data space. In a sense, they can defer and tweak decisions to throw out factors/linear-combinations. That chopping, ie piece-wise, capability seems an essential upgrade to using only a single transform for the entire data space.
      If the latter, we'd just be adding another matrix multiplication to a stack of such and it wouldn't change much beyond perhaps numerical stability or efficiency since NNs are of course capable of finding any single linear transform including a PCA projection. In a way, it's related to all the various efforts at improving learning algorithms by tweaking gradients, hessians, etc. In the end, in practice most found that doing something super simple at GPU scale was faster; I'm not sure about the state-of-the-art in numerical stability, though.
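      For the former reading, a sketch of that single global transform (sklearn; the toy data just mirrors the intro's "only ~16 directions vary" observation):

          import numpy as np
          from sklearn.decomposition import PCA

          rng = np.random.default_rng(0)
          latent = rng.normal(size=(1000, 16))     # 16 directions that matter
          mixing = rng.normal(size=(16, 784))      # embedded in 784-d
          X = latent @ mixing + 0.01 * rng.normal(size=(1000, 784))

          pca = PCA(n_components=16).fit(X)
          X_low = pca.transform(X)                 # one linear map for ALL points
          print(X_low.shape, pca.explained_variance_ratio_.sum())  # (1000, 16), ~1.0
          # The caveat above applies: this is one transform for the whole space,
          # whereas a ReLU net can apply a different affine map per region.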

    • @dr.mikeybee
      @dr.mikeybee 2 years ago

      @@nomenec I mean throw away the input data that isn't significant. Among other things, it will make smaller faster models. I hadn't heard that for really high dimensional data only 16 or fewer dimensions matter. If I'm not misunderstanding this, which I may very well be, doing PCA first makes a lot of sense. It takes me time to wrap my head around anything, and I'm often far off the mark anyway. Still, this seems logical.

  • @dennisestenson7820
    @dennisestenson7820 2 years ago

    40:51 doesn't this suggest that the input data should be transformed into a reduced dimension before training on it? Using MNIST digits, for example, the raw pixels could be transformed into the sequence of pen strokes that composed the written symbol. This might have dimensionality around a dozen rather than 784. Obviously, finding that transformation wouldn't be trivial. However, it could also allow generative models to create more realistic interpolations.

  • @democratizing-ai
    @democratizing-ai 2 years ago +3

    When will Jürgen follow? :)

  • @DavenH
    @DavenH 2 years ago +1

    Ho ho! Here we go!

  • @larrybird3729
    @larrybird3729 2 years ago +1

    hmm... what word do we use for an interpolation between interpolation and extrapolation 🤪

  • @jovian304
    @jovian304 2 years ago

    Finally!!

  • @SimonJackson13
    @SimonJackson13 2 years ago

    An effective interpolation? Do all interpolations have to be effective? Is it still not an interpolation between even if inaccurate?

  • @ShayanBanerji
    @ShayanBanerji 2 years ago +1

    Discussions from the hallowed halls of academia brought to YouTube. Or better? You 3 are setting very high standards!

  • @autobotrealm7897
    @autobotrealm7897 1 year ago

    so addictive

  • @CristianGarcia
    @CristianGarcia 2 years ago +1

    It happened!!!

  • @ketilmalde3402
    @ketilmalde3402 2 years ago

    Great stuff, extremely interesting topic and strong content. But is there a version without the background music - podcast or video? I'm probably just a grumpy old fart, but I find it really hard to concentrate, it's like trying to follow a conversation while somebody is simultaneously licking my ear.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  2 years ago +3

      drive.google.com/file/d/16bc7XJjKJzw4YdvL5rYdRZZB19dSzR70/view?usp=sharing here is the intro (first 60 mins) with no background music, you old fart! :)

  • @funkypipole
    @funkypipole 2 years ago +1

    Great! Why not consider inviting Prof. Jerome Darbon from Brown University? He always has bright views on this topic!

  • @XOPOIIIO
    @XOPOIIIO 2 years ago +2

    I would like to see this "absolute dog". I think it's possible to generate one: just reward the GAN to react both to realism and to the activation of a particular neuron. I wonder what that doggest dog ever would look like. I would also like to see the doggest cat, and the cattest dog.

  • @muhammadaliyu3076
    @muhammadaliyu3076 2 years ago +6

    Where have you been?

    • @nomenec
      @nomenec 2 years ago +6

      A tremendous amount of work went into this show let alone the MLST channel as a whole. Good things take time. Thank you for your patience and continued viewership!

  • @jabowery
    @jabowery 11 months ago

    Imputation can make interpolation appear to be extrapolation. But more importantly, people don't understand the relationship between interpolation, extrapolation and the Chomsky hierarchy. You simply cannot do extrapolation with context-free grammars. Transformers are context-free-grammar capable, not more.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  11 months ago

      Thanks James! According to arxiv.org/pdf/2207.02098.pdf, transformers map to a finite state automaton computational model with no augmented memory and can recognise finite languages only

  • @alivecoding4995
    @alivecoding4995 1 year ago

    @MLST: Do you guys still follow this line of thinking; or did you end up settling on yet another interpretation recently? 😊

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  1 year ago +1

      More or less! We are going to drop another show with RandallB next month; we now think that MLPs are more extrapolative than previously thought (along the lines of Randall's/Yann's "everything is extrapolation" paper)

    • @alivecoding4995
      @alivecoding4995 1 year ago

      @@MachineLearningStreetTalk Thank you very much for taking the time to answer my question. 😌👍

  • @lenyabloko
    @lenyabloko 2 years ago +1

    You should invite Guy Emerson from the Department of Computer Science and Technology, University of Cambridge.

  • @rafaeldelaflor
    @rafaeldelaflor 1 year ago

    Does the guy with the sunglasses have eye problems?

  • @mfpears
    @mfpears 2 years ago

    28:10 This isn't how humans understand physics. Really, really good video though.
    34:00
    54:30 It's cool that humans still understand a lot though. The possibilities in the universe are massively constrained by the fact that nothing is generated outside of physical laws.
    1:00:00 The limitation that deep learning can't extrapolate
    1:03:30 Extrapolation = reasoning? So can they reason?
    1:03:50 No
    1:05:50 Supervised learning is the thing that sucks
    1:06:50 Geoff Hinton thinks general unsupervised first, specialization after
    1:12:00 RBF network
    1:15:00 Different definitions of interpolation
    1:22:50 latent contrastive predictive models
    1:25:00 New architectures that aren't contrastive have come out
    1:29:30 No, they will be able to reason
    1:30:00 What would prove that neural networks can reason?
    1:35:30 RNNs are the networks that can train with variable number of layers
    1:37:28 Nobody can train a neural net to print the nth digit of pi (I can). Yeah, once we figure out basic things we might be able to try mathematical concepts.
    1:45:00 System 1 and 2 in chess and driving
    2:07:10 Convolution is nothing more than a performance optimization by giving the network pre-knowledge that interesting features are spatially local
    A lot of tearing down of not 100% correct analogies of neural networks and what might actually model them well
    - 2:30:30 It's impossible for a neural network to find the nth digit of pi
    2:34:45 Discrete vs smooth... Have both systems? (Actions distill, Jordan Peterson)
    2:36:30 (The real world is limited) is it because neural nets only use textures? No, resolution is low, or it would blow up
    (Man that accent was tough for me)
    2:45:30 Summary of that last interview. Intuition is fine, but mathematical rigor doesn't apply well with that definition
    2:47:30 We need a better definition of what kind of interpolation is happening, and that will help us progress
    2:50:00 It's hard to figure out where researchers exactly disagree because of politeness
    2:53:00 It's all about pushing back on the limitation that neural networks can't extrapolate
    2:54:40 Digits of pi again. It's not what he's talking about actually, too advanced. He's talking about a cat jumping in a place it's never seen before (Tesla predicts paths of cars). He thinks eventually we'll get there, but I'm not as optimistic.
    2:56:00 There's an article by Andre Ye that annoyed him because it invoked interpolation vs extrapolation to say they'll never do it, which is the real question
    2:57:10 At the end of the day, neuron signals are continuous functions, but somehow they produce digital reasoning. But will it be efficient?
    2:59:00 But there is no discrete thing (actions)
    3:00:40 (There you go. Yes. It's going to be hard. But that's the only way for a neural network to do it, and calculators aren't going to discover profound truths.)
    3:01:30 (Omg it feels like they're starting to think the way I do about it. System 1, system 2) It's insanely powerful to train a discrete algorithm on top of neural network. Longer term possibility.
    3:05:00 Underexplored.
    Feature creep? (No! That's insane. Is general intelligence feature creep?)
    3:06:30 Hard to train (it seems the opposite to me) Getting to do discrete stuff involves lots of hacks.
    3:07:30 "TABLE 1" Attacks attack on paper
    3:17:00 You can't initialize a neural net with zeros
    3:18:00 We're comparing neural nets to the entire human race and its culture and inheritance

  • @killaz5526
    @killaz5526 8 months ago

    I have no idea what any of this is about, I was watching a video by the channel Food Theory talking about a McDonald's conspiracy, then woke up 4 hours later to the end of this. Autoplay can be quite mysterious.

  • @R4RealMalfunctor
    @R4RealMalfunctor 2 years ago

    An old-school signal processing look at new-school learning:
    2:26:10

  • @rafaeldelaflor
    @rafaeldelaflor 1 year ago

    I still haven't really absorbed the interpolation vs extrapolation argument.

    • @rafaeldelaflor
      @rafaeldelaflor 1 year ago

      I think I understand the bimodal discussion of high-dimension interpolation and extrapolation. Linear regression is a fitting of interpolated volume in 3 dimensions, while extrapolation is any 3-dimensional value outside the interpolated value volume

    • @rafaeldelaflor
      @rafaeldelaflor 1 year ago

      I still don’t understand his argument.

    • @rafaeldelaflor
      @rafaeldelaflor 1 year ago

      How would this information change the direction of systematic AGI?

  • @Georgesbarsukov
    @Georgesbarsukov 9 months ago

    I love Yann LeCun

    • @Georgesbarsukov
      @Georgesbarsukov 9 months ago

      Oddly enough, he's my inspiration to leave the AWS $$$ to pursue AI research.

  • @Lumeone
    @Lumeone 2 years ago +1

    I wonder, do linear functions exist in the universe?

    • @nomenec
      @nomenec 2 years ago +2

      There are certainly linear relationships at various levels of physical description: the first and second laws of thermodynamics, the Schwarzschild radius vs mass, photon energy vs frequency, etc. Whether or not any of these actually "exist in the Universe" is something philosophers have argued for at least millennia and probably will argue until heat death. From my perspective, they "exist" as epistemic descriptions of emergent phenomena.

    • @taiducnguyen7694
      @taiducnguyen7694 2 years ago

      E = mc^2, F = ma, KE = (1/2)mv^2, tons in thermodynamics, notably W = -P(ΔV)... basically, often the most fundamental ones are linear