Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)

  • Published 2 May 2024
  • #selfsupervisedlearning #yannlecun #facebookai
    Deep Learning systems can achieve remarkable, even super-human performance through supervised learning on large, labeled datasets. However, there are two problems: First, collecting ever more labeled data is expensive in both time and money. Second, these deep neural networks will be high performers on their task, but cannot easily generalize to other, related tasks, or they need large amounts of data to do so. In this blog post, Yann LeCun and Ishan Misra of Facebook AI Research (FAIR) describe the current state of Self-Supervised Learning (SSL) and argue that it is the next step in the development of AI that uses fewer labels and can transfer knowledge faster than current systems. As a promising direction, they suggest building non-contrastive latent-variable predictive models, like VAEs, but ones that also provide high-quality latent representations for downstream tasks.
    OUTLINE:
    0:00 - Intro & Overview
    1:15 - Supervised Learning, Self-Supervised Learning, and Common Sense
    7:35 - Predicting Hidden Parts from Observed Parts
    17:50 - Self-Supervised Learning for Language vs Vision
    26:50 - Energy-Based Models
    30:15 - Joint-Embedding Models
    35:45 - Contrastive Methods
    43:45 - Latent-Variable Predictive Models and GANs
    55:00 - Summary & Conclusion
    Paper (Blog Post): / self-supervised-learni...
    My Video on BYOL: • BYOL: Bootstrap Your O...
    ERRATA:
    - The difference between loss and energy: Energy is for inference, loss is for training.
    - The R(z) term is a regularizer that restricts the capacity of the latent variable. I think I said both of those things, but never together.
    - The way I explain why BERT is contrastive is wrong. I haven't figured out why just yet, though :)
    Video approved by Antonio.
    Abstract:
    We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems.
    Authors: Yann LeCun, Ishan Misra
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 190

  • @ThichMauXanh
    @ThichMauXanh 3 years ago +8

    Any DNN with loss = some_distance(y, pred) is indeed an energy-based model, as you said. But not every energy-based model has the form loss = some_distance(y, pred) where pred = f(x) is an explicit part of the model. So by energy-based model, Yann means a generalization of the traditional formulation, one that lets us escape the problem of multiple valid y for a single x. The blog post needs to make this distinction clearer.
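    To make the distinction concrete, here is a minimal sketch (my own illustration, not from the post; f and F are placeholder networks):

    import torch

    # Explicit predictive form: the model commits to a single prediction f(x).
    def predictive_loss(f, x, y):
        return torch.sum((y - f(x)) ** 2)        # some_distance(y, pred)

    # General energy form: F scores any (x, y) pair directly, so several different y
    # can all receive low energy for the same x - no single prediction is forced.
    def energy(F, x, y):
        return F(torch.cat([x, y], dim=-1))      # scalar compatibility score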

  • @sehbanomer8151
    @sehbanomer8151 3 years ago +39

    13:00 Am I the only one who found that question mark really satisfying?

  • @YannicKilcher
    @YannicKilcher 3 years ago +54

    ERRATA:
    - The difference between loss and energy: Energy is for inference, loss is for training.
    - The R(z) term is a regularizer that restricts the capacity of the latent variable. I think I said both of those things, but never together.
    - The way I explain why BERT is contrastive is wrong. I haven't figured out why just yet, though :)
    OUTLINE:
    0:00 - Intro & Overview
    1:15 - Supervised Learning, Self-Supervised Learning, and Common Sense
    7:35 - Predicting Hidden Parts from Observed Parts
    17:50 - Self-Supervised Learning for Language vs Vision
    26:50 - Energy-Based Models
    30:15 - Joint-Embedding Models
    35:45 - Contrastive Methods
    43:45 - Latent-Variable Predictive Models and GANs
    55:00 - Summary & Conclusion

    • @bzqp2
      @bzqp2 2 years ago +1

      Can you perhaps pin this to the top? Thanks.

  • @baskaisimkalmamisti
    @baskaisimkalmamisti 3 years ago +40

    Thanks to you, I can both watch YouTube and keep up with the research at the same time.

  • @falconeagle3655
    @falconeagle3655 3 years ago +10

    Congrats!! Yann Lecun sent me to your video.

  • @EternalKernel
    @EternalKernel 3 years ago

    Thank you, Yannic!

  • @emransaleh9535
    @emransaleh9535 3 years ago +3

    Good topic to tackle in this time. I will enjoy watching the video.

  • @GianlucaTruda
    @GianlucaTruda 3 years ago +13

    A superb video as usual, Yannic! Your side commentary - such as the energy-based model just being another way of saying "has a loss function" - is so rich in intuition pumps!
    Aside: what tool/app do you use to clip the article?

  • @PavelChernov
    @PavelChernov 3 years ago

    Thank you! Very interesting topic.

  • @sheggle
    @sheggle 3 years ago +26

    We love us some content that doesn't chase sota, thank you as always Yannic!

  • @alvarohenriquez497
    @alvarohenriquez497 3 years ago

    Really enjoy your videos. Hope that you do one on the Timesformer soon. Thanks.

  • @membershipyuji
    @membershipyuji 3 years ago +1

    Very helpful video. I was able to fill in many gaps present in the post.

  • @brendawilliams8062
    @brendawilliams8062 1 year ago

    Thank you. Informative and nicely explained.

  • @shengyaozhuang3748
    @shengyaozhuang3748 3 years ago +2

    looking forward to "Barlow Twins: Self-Supervised Learning via Redundancy Reduction" review, also from Yann's group. BYOL like method but without momentum updates!

  • @apollozou9809
    @apollozou9809 2 years ago

    Thank you!

  • @cem_kaya
    @cem_kaya 1 year ago

    thanks for the explanation

  • @bsdjns
    @bsdjns 3 years ago

    Great video, please keep them coming!
    I actually didn't know you're German until you mentioned the eierlegende Wollmilchsau :D

  • @HaykTarkhanyan
    @HaykTarkhanyan 1 year ago

    Thanks, that was very helpful

  • @mar-a-lagofbibug8833
    @mar-a-lagofbibug8833 3 years ago

    Thank you.

  • @lucathiede9238
    @lucathiede9238 3 years ago +33

    There is a difference between energy functions and objective functions:
    In physics, an energy function is a scalar potential whose gradient field is conservative, i.e. has zero curl everywhere (which is important, because otherwise the integral around a closed path could be nonzero, violating conservation of energy)
    In ML, there are objectives with gradient fields that are not conservative. The best-known example for this is the GAN objective

    • @jackdkendall
      @jackdkendall 3 years ago +8

      Yes, and energy-based models are also linked to probability distributions in an explicit way which general loss functions are not. An energy function is an unnormalized probability distribution where you can explicitly get relative probabilities. To say that an energy function is the same thing as a loss function is inaccurate.

    • @lucathiede9238
      @lucathiede9238 3 years ago +4

      @@jackdkendall Correct, there is always an explicit link between energy and probability distribution. However, I would not go quite as far as saying energy functions are unnormalized pdfs. For example in thermodynamics, the likelihood of finding a system in a given state is described by the Boltzmann distribution of the energy of the given state, not just the energy normalized
      And this is also only true for the ideal gas model, for higher-order interactions it becomes non-analytical. But the connection of higher energy -> lower likelihood always remains as far as I know

    • @jackdkendall
      @jackdkendall 3 years ago +2

      @@lucathiede9238 as far as I'm aware in thermodynamics the probability is defined as the energy of a state divided by the partition function. The Boltzmann distribution is a direct consequence of this. In physics the Arrhenius equation always gives relative probabilities in terms of activation energy, which is just the energy difference between two states.
      In ML, it's similarly defined. Probability is just energy divided by partition function aka normalization term.

    • @jackdkendall
      @jackdkendall 3 years ago +2

      Actually scratch that, the Boltzmann distribution is a result of entropy maximization under conservation of energy

    • @lucathiede9238
      @lucathiede9238 3 years ago +5

      @@jackdkendall Mh, I am not sure, maybe I have to freshen up my thermodynamics knowledge
      What I am very sure of, though, is that the energy (whether we need to take the Boltzmann distribution of it or not) only describes the likelihood of a state, not the probability.
      To get the probability we need to consider the Helmholtz free energy of a macro state, which essentially takes the entropy and the temperature into account
      I know this is a bit nitpicking, but it is often messed up and very important for example for protein folding, since the protein does not just fold into the state of lowest energy, but lowest free energy
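      For reference, the standard relations we are both gesturing at (canonical ensemble at temperature T; my summary, not from the blog post):

      \[ p_i = \frac{e^{-E_i / k_B T}}{Z}, \qquad Z = \sum_j e^{-E_j / k_B T}, \qquad F = -k_B T \ln Z = \langle E \rangle - T S \]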

  • @CharlesVanNoland
    @CharlesVanNoland 2 years ago +3

    Regarding the object permanence / gravity thing: they did experiments with cats, raising them in environments that had only vertical stripes on everything, effectively denying them the opportunity to see horizontal lines. When they matured and were put into more conventional environments, they had no concept of the danger of heights or falling, because they couldn't perceive the ledges they were approaching.

  • @CosmiaNebula
    @CosmiaNebula 1 year ago +6

    "energy-based method", as defined in the paper, is *extensionally* the same as machine learning using a loss function, but is *intensionally* different. The problem is that LeCun didn't describe it rigorously using the language of symmetry, though in his subconscious (and in the subconscious of every physicist who reads the paper), the "energy function" is intended to be "energy function that has good symmetries". I will explain.
    ## Feynman's unworldliness equation
    Consider, for example, Feynman's "unworldliness equation" U = 0, where U = f^2 + g^2 + ..., and each f, g, ... is a scalar equation of nature. This equation is of course entirely correct, but it is trivial.
    However, this does not make every equation trivial. Some equations really are more substantial than others. What is the substance? It is *symmetry*, or invariance under transformations.
    When Maxwell wrote down the Maxwell equations, he used 20 scalar equations. In 4-vector notation, there are just 2 equations. Why such a great simplification? It is not the trivial kind of simplification as in U = 0, but a deep simplification -- all equations written in 4-vector notation are necessarily invariant under Lorentz transforms. Because the proper "home" of Maxwell equations is a universe that is invariant under Lorentz transforms, it's no wonder that they are more elegant when in 4-vector notations.
    Conversely, when you notice how elegant the equations are in 4-vector form, you realize that the universe should probably be invariant under Lorentz transforms.
    Modern theoretical physics is basically a game of inventing new transforms, then constructing equations invariant under the transforms, then publish it.
    > So the “beautifully simple” law in Eq. (25.32) is equivalent to the whole series of equations that you originally wrote down. It is therefore absolutely obvious that a simple notation that just hides the complexity in the definitions of symbols is not real simplicity. It is just a trick. The beauty that appears in Eq. (25.32)-just from the fact that several equations are hidden within it-is no more than a trick. When you unwrap the whole thing, you get back where you were before.
    > However, there is more to the simplicity of the laws of electromagnetism written in the form of Eq. (25.29). It means more, just as a theory of vector analysis means more. The fact that the electromagnetic equations can be written in a very particular notation which was designed for the four-dimensional geometry of the Lorentz transformations-in other words, as a vector equation in the four-space-means that it is invariant under the Lorentz transformations. It is because the Maxwell equations are invariant under those transformations that they can be written in a beautiful form.
    (Feynman 2, 25:6)
    www.feynmanlectures.caltech.edu/II_25.html#Ch25-S6
    ## Energy-based methods, from the POV of
    ### a ML scientist
    Extensionally, any machine learning problem defined using an energy function is equivalent to one defined using a loss function. And conversely, any ML problem defined by a loss function is equivalent to one defined by an energy function.
    Intensionally, if you start with any loss function, and find its equivalent energy function, you would almost certainly get an energy function with no good symmetry at all.
    Energy-based method is a principled way to convert symmetries in the problem into good priors over your neural network. Instead of using arbitrary loss functions constructed ad-hoc, or perhaps meta-learn a loss function, we impose the prior over the space of loss functions that respect the symmetries. Writing down an energy that respects the symmetries is just an efficient, implicit way to impose the prior.
    ### a physicist
    Energy-based methods provide a principled way to write down equations that are invariant under physically relevant symmetries, such as translation (R^n), rotation (SO(n)), reflections (E(n)), volume-preserving maps (SL(n)), and so on. It also allows us to use gauge theory for ML.
    Not only that, it also allows one to enforce only local interactions, by writing the energy as a sum of local interactions (such as E = x1 x2 + x2 x3 + x3 x4 + ...), bringing statistical mechanics and renormalization techniques to the table.
    Not only does this allow you to import the greatest hits of modern physics and make ML as abstract as string theory, it also imposes good priors. A ML model for physical processes should probably only consider models that are invariant under the symmetries of nature, such as translation, rotation, reflection, etc.
    ### a mathematician
    Energy-based methods are the Erlangen program for high-dimensional probability. All hail Felix Klein, the felicitous king of symmetry.
    ### a linguist
    Extensional and intensional definitions often diverge, and it's more important to discover the intension and make it explicit than to focus on the extension and quibble. For example, extensionally, an "activation function" is *any* function of type R^n → R, but that's the extensional definition. When you actually say "activation function" you mean any function of this type that has been profitably used in a neural network.
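    To make the "sum of local interactions" point concrete, a tiny sketch (my own illustration, not from the post): an Ising-like energy built only from nearest-neighbour couplings is automatically invariant under cyclic translation of its input.

    import numpy as np

    def local_energy(x):
        # E = x1*x2 + x2*x3 + ... with periodic boundary: only local couplings.
        return float(np.sum(x * np.roll(x, 1)))

    x = np.random.choice([-1.0, 1.0], size=16)
    print(local_energy(x))              # some value E
    print(local_energy(np.roll(x, 5)))  # identical: the energy respects translation symmetry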

  • @Arthurein
    @Arthurein 3 years ago +1

    I am likely wrong, so please correct me if so. In a standard classification problem, the loss function is the objective for which we optimize the neural network. In an energy-based setting, the loss function *is* the network. The concepts truly can be interchanged since, after all, the output of a classifier will describe something akin to a probability distribution over the possible categories. But the cool part about the energy function is that it does not require any sort of normalization, and thus it can just be a black box that gives you a small number if things make sense, or a big number if they don't (or vice versa).

  • @CalvinJKu
    @CalvinJKu 1 year ago

    Didn’t like the video thumbnail at first sight but the content is king! Subscribed!

  • @WhatsAI
    @WhatsAI 3 years ago +16

    Awesome video as always! And I completely agree, I feel like they are kind of trying to "set their terminology" on already existing concepts, but it was still an interesting read, and even better to hear your point of view on it!

  • @jeroenput9564
    @jeroenput9564 3 years ago

    Love your critical thinking!

  • @norabelrose198
    @norabelrose198 3 years ago +10

    I think "energy based model" more precisely is supposed to refer to models that output unnormalized scores as opposed to (log-) probabilities. LeCun has said that he doesn't like approaches that are specifically designed to output valid probabilities or approximations of probabilities (i.e. normalizing flows, traditional VAEs) when arguably some other non-probability based approach would work better. But confusingly he also seems to lump even probability based models into the EBM category when he feels like it.

    • @ruroruro
      @ruroruro 3 years ago +2

      Agreed. Extremely hand-wavy and non-specific. Also, I wonder, how do you even determine, if some model is approximating a probability distribution or not. Like I am pretty sure, that for any score function, you can produce a monotonic mapping of that score to [0; 1], that gives you a pretty good approximation of the underlying probability distribution.

    • @lucathiede9238
      @lucathiede9238 3 years ago

      Normalizing flows output valid probabilities (or more precisely, likelihoods), yes, but VAEs don't; they only output a sample without the associated likelihood

    • @norabelrose198
      @norabelrose198 3 years ago +1

      @@lucathiede9238 The loss function for VAEs is negative ELBO, which is provably a lower bound on the true log probability of the data
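      For reference, that bound (with encoder q(z|x) and decoder p(x|z)):

      \[ \log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \]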

    • @norabelrose198
      @norabelrose198 3 years ago +2

      @@ruroruro Yeah, I think the main problem with mapping arbitrary score functions to probability distributions is computing the normalization constant to ensure the integral of the score over all possible inputs is equal to 1. That’s not tractable in a lot of cases. Some people try hard to figure out ways to compute or approximate the normalizing constant, and LeCun’s approach seems to just be, forget about it, don’t normalize the scores at all.

    • @lucathiede9238
      @lucathiede9238 3 years ago +1

      @@norabelrose198 Not quite, the encoder gives you the pdf of a *latent variable*, conditioned on a sample. It will not give you the probability of the sample itself.
      At no point in the VAE can you actually get p(sample), which is what you are usually trying to approximate in energy-based models afaik.

  • @ekjotnanda6832
    @ekjotnanda6832 2 years ago

    Very nice explanation 👍🏻

  • @mrutyunjaybiswal5130
    @mrutyunjaybiswal5130 3 years ago +6

    Hey Yannic, I don't know if you already have this. Is it possible to make these available as a podcast on Spotify or anywhere? Your approach is really good, and it would be a lot easier to just plug in earphones and go for a walk with your explanations on. Thanks.

  • @waleedawad4520
    @waleedawad4520 3 years ago +1

    That was very helpful

  • @QuadraticPerplexity
    @QuadraticPerplexity 1 year ago +1

    14:05 Regarding whether the third kind of masking could be used for NLP: if the word embedding is good, probably you could mask out a subset of the dimensions.

  • @sean_vikoren
    @sean_vikoren 2 years ago

    To find the possible number of arrangements you simply count the bits in the image: there are at most 2^(number of bits) possibilities, a specific finite number determined by the bytes that represent the image. So not infinite.

  • @_kkaai
    @_kkaai 1 year ago

    Very helpful

  • @georgelomia4724
    @georgelomia4724 3 years ago

    On the point you brought up at minute 38:15: would an unsupervised clustering method help us group images together according to their similarity? If not, why not?

  • @rogerfreitas7323
    @rogerfreitas7323 3 years ago +3

    So far the best channel on YouTube

  • @ocifka
    @ocifka 3 years ago +4

    52:20 AFAIK the latent variable _z_ and the "embedding" are actually sort of the same thing. (The embedding is just a realization of that random variable I would say.) The confusion probably comes from the fact that there are different distributions over _z_ involved: _p(z)_ and _q(z|x)_ - the latter is what the encoder outputs, including the reparametrization trick.
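    Roughly, in code (a minimal sketch assuming a Gaussian encoder; the names are just for illustration), the "embedding" you actually pass downstream is one sample z drawn from q(z|x):

    import torch

    def sample_latent(encoder, x):
        mu, logvar = encoder(x)                    # parameters of q(z|x), produced by the encoder
        eps = torch.randn_like(mu)                 # noise ~ N(0, I)
        z = mu + eps * torch.exp(0.5 * logvar)     # reparametrization trick: one realization of z
        return z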

    • @PabloHuijseHeise
      @PabloHuijseHeise 3 years ago +1

      Totally right. Also the "making the latent variable fuzzy" bit refers to constraining q(z|x) to be close to p(z), the latter typically being a standard gaussian distribution

    • @frenchmarty7446
      @frenchmarty7446 1 year ago +1

      @@PabloHuijseHeise Actually no. Making the latent variable "fuzzy" refers to adding noise and a KL divergence term to p(z|x), thereby making p(z) smoother and easier to sample from. This has the side effect of making p(z) closer to a Gaussian than it otherwise would be, but enforcing p(z) to be Gaussian is a separate problem addressed by things like Adversarial Autoencoders, Factor-VAE, etc.

  • @alexandervlasov6746
    @alexandervlasov6746 2 years ago

    As far as I understood, energy-based learning is a non-probabilistic counterpart to maximum likelihood (or MAP) estimation. Similar to SVM and logistic regression: both are used as classifiers and have the same classifier form, but LR has probabilistic roots while the SVM does not. Basically, one uses the same idea but avoids the probability constraints (non-negative, sums to one).
    In that context, an energy function is not the same as a loss function. Yann LeCun uses the energy function for inference (y_pred = argmin_y F(x,y)). However, a training/validation loss can be constructed in a different way. Often the loss is (mis-)prediction based, but it's not necessarily so; e.g. one can use a structurally similar but different argmin approach to obtain loss/feedback information.
    A similar difference is between a probability mass/density function and a (log-)likelihood function.
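    A toy sketch of that inference rule (my own illustration; F is any scalar compatibility network and candidate_ys is a finite candidate set). The training loss that shapes F would be constructed separately:

    import torch

    def ebm_predict(F, x, candidate_ys):
        # Inference with an EBM: pick the y that minimizes the energy F(x, y).
        energies = torch.stack([F(x, y) for y in candidate_ys])
        return candidate_ys[int(torch.argmin(energies))]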

  • @DucNguyen-wy1ir
    @DucNguyen-wy1ir 3 years ago +3

    Hi @yannic. Thanks for the fantastic video. Regarding your confusion about limiting the capacity of the latent variable `z` in a VAE, I think what the post means is that in a VAE the authors use a unimodal Gaussian distribution, which is of limited capacity, because they could have used other distributions of higher capacity, like a Gaussian mixture model or some multimodal distribution (in fact, as far as I can remember, there is a paper doing exactly that). What do you think?

    • @moormanjean5636
      @moormanjean5636 1 year ago

      I think what they mean is that by introducing noise to the latent variable, the network learns to rely less on the latent part of the network and more on the autoencoder.

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      I'm not sure if the information capacity of the distribution itself is related to how VAEs constrict information. Those are two different kinds of information.
      The decoder doesn't receive the entire distribution, it is given one sample from the latent distribution. The information in the distribution itself is lost.
      A multivariate Cauchy distribution contains more information than a multivariate Gaussian distribution, but a random sample from the former would actually contain less information* (more dimensions of my vector will be extreme outliers that are unrepresentative of the mode). It's more accurate to say that a non-Gaussian distribution *implies* more information (I have to know more as an observer to model something as non-Gaussian, a Cauchy distribution is a less expected guess than a Gaussian a priori); the distribution doesn't pass this information on unless I take multiple samples and relearn the distribution. A vector of n dimensions is still just a vector of n dimensions.
      Here is a good explanation on why VAEs usually use Gaussian noise instead of anything else: stats.stackexchange.com/questions/517467/is-it-possible-to-use-variational-autoencoders-with-non-gaussian-data
      This article has a good explanation of why variational autoencoders are variational in the first place, with a good visualization of how Gaussian noise "smooths" the latent space (which is useful for generative sampling): www.jeremyjordan.me/variational-autoencoders/
      Gaussian noise has the side-effect of restricting information flow, but this wasn't a problem for autoencoders to begin with because we have complete control over the dimensionality of "Z" to begin with. In fact if you wanted to use noise to restrict information flow you would actually generate noise from a more "informative" distribution like Cauchy noise. Remember that the decoder is ultimately trying to reconstruct what was fed into the encoder to start with, if "Z" is more complicated it is that much harder for the decoder to guess what "X" produced it from a single sample of "Z".
      *this should actually be expected. If a distribution implies more information then by extension any particular sample must be relatively less informative. It's not an accident that highly non-Gaussian distribution are very sample inefficient to parameterize and/or require special less efficient estimators.
      Also, there is a difference between pushing individual samples to be Gaussian and pushing the entire latent space to be Gaussian. The latter is what things like Adversarial Autoencoders (arxiv.org/abs/1511.05644) and beta-VAE (openreview.net/forum?id=Sy2fzU9gl) try to do and is closer to what I think you mean.

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      @@moormanjean5636 I'm not sure what you mean by "rely less on the latent part and more on the autoencoder".
      The latent variable is the output of the encoder [p(z|x)] and is the only input the decoder [p(x|z)] receives.
      Gaussian noise is added to make the decoder more robust and the latent distribution smoother, both of which make it easier to sample new values and to generalize to new inputs.

  • @InquilineKea
    @InquilineKea 2 years ago

    If you could only watch 4 Yannic videos this would be one of them

  • @lamiaalsalloom1881
    @lamiaalsalloom1881 1 year ago

    thank you, you are a saint

  • @dwhdai
    @dwhdai 3 years ago

    are there any good examples of SSLs on timeseries data?

  • @RohitKumarSingh25
    @RohitKumarSingh25 2 years ago

    Recently, many papers using a similar approach to BYOL, i.e. keeping a moving average of the network instead of the exact network, together with lots of augmented images during training, have solved the negative-sampling problem of the contrastive-loss approach.
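    For reference, a minimal sketch of that moving-average (EMA) target update (my own illustration; assumes online_net and target_net share the same architecture, with tau close to 1 as in BYOL):

    import torch

    @torch.no_grad()
    def ema_update(online_net, target_net, tau=0.996):
        # The target network slowly tracks the online network instead of copying it exactly.
        for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
            p_target.mul_(tau).add_(p_online, alpha=1.0 - tau)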

    • @terguunzoregtiin8791
      @terguunzoregtiin8791 2 years ago

      Hi, do you think "masked auto encoders are scalable vision learners" follows this blog's idea?

  • @aamir122a
    @aamir122a 3 years ago

    At time index 45 minutes, when you say mix x with z, what do you mean (add, multiply, divide, subtract, etc.)? As a suggestion, it would be best to explain the math operations or functions as you go along.

  • @finlayl2505
    @finlayl2505 3 years ago +5

    I wonder if you trained a model on filling in video first, could you then transfer that to object recognition and get better results? Because in the video training it would no doubt get a better understanding of 3d space which may help it in object recognition.

  • @chriscanal999
    @chriscanal999 3 years ago +29

    Lmao “my raspberry pi has that capacity”

  • @marc-andrepiche1809
    @marc-andrepiche1809 3 years ago

    An energy model is only a model where the lowest loss is 0. (Some models maximise and/or can be negative.)

  • @abekang3623
    @abekang3623 3 years ago +1

    I was wondering if VAEs are considered fuzzy in reference to regular autoencoders, which do not have the distribution and sampling in the center. Also, because many autoencoders scale down their latent representation at the sampling point (center), would that be another reason why VAEs constrict or limit their representation?

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      I believe the "fuzziness" is in reference to the output, which is caused by the loss function (mean squared error) which encourages blurry outputs. VAEs that use a discriminator like GANs ("VAE-GAN") do not have the same problem.
      But you are right on both points. Though VAEs actually give better outputs when sampling outside of the data distribution, the injection of noise in the latent space forces the decoder to learn to give acceptable outputs over a wider and smoother range of outputs.

  • @citizizen
    @citizizen 3 years ago

    I want to build a brain...
    First of all, thanks for this channel! I think that all the datasets can be put together in chunks (kinds of reservoirs, as it were), represented in interesting ways. Future gene pools. In essence, my idea is that when we take multiple datasets, each represented as a special kind of information, we build something grand in the end. I guess that intelligence might emerge more effectively like this. Intelligence is about a lot of repetitions.
    Regards, Justin

    • @moormanjean5636
      @moormanjean5636 1 year ago

      what you are talking about sounds a lot like transfer learning, you should look into this maybe

  • @Blattealkiller
    @Blattealkiller 3 years ago +4

    Agree with you about the energy based model terminology ahah

    • @AlexanderMath
      @AlexanderMath 3 years ago

      You beat me to it. I was looking for anyone that took up the challenge, I thought exactly the same with energy based model.

  • @bdennyw1
    @bdennyw1 3 years ago +18

    Yannic did a great job as always, but I'm not sure what the point of the paper was. Seems like a rehash of everything from the last couple of years.

  • @larsojinnaka
    @larsojinnaka 3 years ago +2

    The smiley face at 28:10

  • @oraz.
    @oraz. 3 years ago +1

    I'm sure Lecun will put this to good use for his employers Facebook and Instagram.

  • @zoltanczesznak976
    @zoltanczesznak976 2 years ago

    Don't take it as a paper but as a tutorial with a catchy name; it is sort of high-end popular science. Big crowds won't start reading statistical learning theory. In addition, you are also a good educator, I believe, so it served its purpose.

  • @thegistofcalculus
    @thegistofcalculus 3 years ago

    I understand the motivation behind Siamese networks, but why not use a giant one-hot vector target where each input picture corresponds to a node in a giant one-hot network? (It was done before; just search for the title "Unsupervised Feature Learning via Non-Parametric Instance Discrimination".)

  • @wenxue8155
    @wenxue8155 3 years ago

    Isn't there already a paper on that: Learning to Predict Without Looking Ahead:
    World Models Without Forward Prediction? The whole idea is very similar to World Models.

  • @XX-vu5jo
    @XX-vu5jo 3 years ago +2

    I am making an implementation of this with slight variations; I will share it on my GitHub soon.

  • @mdfeatherwx
    @mdfeatherwx 3 years ago

    This video deserves to get my 1hour of time

  • @adityakane5669
    @adityakane5669 3 years ago

    Great video!
    Just some food for thought. Assume you have many random images of the world, from mountains to seas to cities and whatnot. Then you clip such a part (say 100x100) and ask the model to predict that. You use a sliding window approach to this, by which I mean you ask the model to predict many clips of the same image. Now you shuffle all these (image, clip) pairs and train the model. What will the model learn? Will it even learn anything? Or will it have a good understanding of the world as the authors suggest?

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      What exactly do you mean by shuffle the image/clip pairs? Do you mean pair a clipped image to a non-matching clip and learn to discriminate good matches from poor matches?
      If you do this correctly then yes, the model will learn quite a bit about the world. It will learn a latent representation that is semantically meaningful (like clustered with like, etc).
      I don't know where the line for "good understanding" is drawn, but we do know that these latent representations are useful for many other unrelated tasks downstream with fine tuning.
      I can use the same model to add labels to images, for example, with a small number of labeled training examples. It is faster to learn the relationship between latent representations and labels than between raw data and labels, so the latent space is capturing some important information in a compressed form.

    • @adityakane5669
      @adityakane5669 1 year ago

      @@frenchmarty7446 By clips I mean cut-outs of the image. The self-supervised task is to predict the contents of the cutout. I'm sorry for the confusion earlier. I now know that such efforts have been made in papers like MAE.

  • @hannesstark5024
    @hannesstark5024 3 years ago +1

    The way I understand the VAE statements is that we have the gaussian as latent variable and restrict it via the encoder.

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      It would be closer to say it's the other way around. The encoder generates the latent variable and we constrict the latent space in several different ways (reduced dimensionality, KL divergence loss, Gaussian noise, etc).
      The decoder is tasked with guessing what went into the encoder to produce the latent variable (p(x | z)) hence a game in which the model learns the most important information to pass through the constriction.
      The result is we get a latent space that is dense with information, generalizes well and meets whatever other requirements we impose (easy to sample from, etc). We take this and do other useful tasks much easier.
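      A condensed sketch of that game (my own illustration, assuming a Gaussian q(z|x); encoder and decoder are placeholders):

      import torch
      import torch.nn.functional as F

      def vae_step(encoder, decoder, x):
          mu, logvar = encoder(x)                                       # q(z|x)
          z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)       # constricted, noisy latent
          x_hat = decoder(z)                                            # guess what x produced this z
          recon = F.mse_loss(x_hat, x, reduction='sum')                 # reconstruction term
          kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # keeps q(z|x) close to N(0, I)
          return recon + kl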

  • @Markste-in
    @Markste-in 2 years ago +1

    My question is: what keeps the model from just ignoring the latent variable and using the input (x) directly? Wouldn't that be easier for the model than trying to handle a jittery latent variable (z)?

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      1.) The model is actually two components, the encoder and the decoder. The decoder doesn't get to see 'X'.
      2.) Modeling an intermediate latent variable forces the model to learn a more generalizable strategy. If you mapped from X* -> X it's impossible to know what the model will do with unseen values. Whereas if the encoder always maps to the same constricted latent space, the information that passes through the constriction is more likely to be meaningful.
      It's the same logic behind something like LASSO in standard regression. Better the model learns some important information than absorb everything.
      Also the beauty of using a latent intermediate representation is that you can dramatically change how the model "thinks" about the data by imposing restrictions on what "Z" can look like (see Factor-VAE, Causal-VAE, cluster-VAE).

  • @sg22r
    @sg22r 2 years ago

    How is self supervision different than augmentations?

  • @ProjectsWithRed
    @ProjectsWithRed 3 years ago +2

    Am I not correct in saying that the number of ways to fill in the missing patch/crop in an image is not infinite, just very high? Say the crop (hidden part) is 100x100 with 3 colour channels, so 100x100x3, and each colour channel has a range of 0-255. We could technically enumerate all possible crops that go in that image, i.e. all combinations of pixel values, which is discrete and not infinite, and one of those combinations will be exactly the patch from the whole image. It might be infinite in real life, but it is discrete in terms of CV.
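    Putting a number on it (simple counting, my own arithmetic): the count is finite but astronomically large,

    \[ 256^{100 \times 100 \times 3} = \left(2^{8}\right)^{30000} = 2^{240000} \approx 10^{72000}. \]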

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      A better way of putting it is ill-posed. There are more possible answers than we have data to learn from, therefore learning the parameters of a probability distribution is impossible. There are infinitely many good *parameters* of our model that fit the data perfectly well.

  • @harrywoods9784
    @harrywoods9784 1 year ago

    Just a thought, in my mind embodiment is the key to self learning. That’s why Teslas humanoid Optimus robot is exciting.
    🤔IMO

  • @ahmedtrabelsi2936
    @ahmedtrabelsi2936 3 years ago

    Hello, great work. Could you have a look at neuromorphic articles, human-brain-inspired articles (SNN algorithms, ...)?

  • @sebastiangerard9548
    @sebastiangerard9548 2 years ago

    The blog post says: "An EBM is a trainable system that, given two inputs, x and y, tells us how incompatible they are with each other."
    I think that models trained with loss functions are not EBMs, since the task that the model is solving is e.g. image classification, not to output the loss between the ground truth and the input image. That's just something you use to find the parameters of the model, but not the task that the model solves. Maybe I misunderstood you, but it seemed like you wanted to argue that using a loss function during training is enough to qualify ML models as EBMs.
    You could try to argue that e.g. an image classifier is an EBM, since it takes as input an image x and then outputs compatibility scores with each of the classes. In that case you would need to define your second input y as being constant and representing all the classes, since your model outputs needs to be a compatibility measure between x and y. However, I would argue that it is then not an EBM, since it cannot make predictions for varying values of y. Following the definition above, it would need to be able to indicate how incompatible x and y are, for any x and y, to qualify as an EBM.
    Happy to hear any counterpoints. I don't have any previous experience with EBM or their origin in physics, but this is how the definition would make sense to me. Predicting the compatibility is defined as the central task that the model is solving.

  • @willd1mindmind639
    @willd1mindmind639 3 years ago

    Most of the tasks for deep learning are associated with making sense of the buckets of binary information sitting on a computer server. The rise of machine learning is associated with the rise of big data, as they have a natural synergy (see Google). But it really isn't intelligence. Intelligence is being able to derive understanding about the world from the data provided. In terms of visual intelligence, that means learning what light is, what shadows are, what surfaces are, what textures are, what perspective is, what near and far is, left and right, up versus down, and so forth at a base level (most creatures with vision can do that). Then on top of that basic core of visual comprehension, there is the higher-order reasoning, learning, and understanding that humans have evolved, but neural networks can't do. Even with that limitation, the reason it works in so many applications of modern industry and computing is that a lot of business functioning is based around statistical models anyway. Using statistics-based models to help model behavior and generate predictions fits a large number of business tasks well. And of course, data stored in silicon as a result of human activity online is growing exponentially. So these neural network models work reasonably well for a large set of business-related use cases and scenarios on general-purpose computer hardware.

  • @junhanouyang6593
    @junhanouyang6593 3 years ago

    I may be really stupid, but for latent-variable predictive models, if we want to limit the capacity of z and make sure our model focuses on the Pred(x) decoding and doesn't care about z, then what is the purpose of z here?

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      We do care about "Z", it just isn't (always) part of our loss term. A constricted latent space forces the model to retain only important and (hopefully) meaningful information and thereby generalize much better.
      If Z was unconstrained, the model would just learn the identity function (x ≈ z) and learn nothing interesting.

  • @jean-baptistedelabroise5391
    @jean-baptistedelabroise5391 3 years ago +1

    Hmm, if you scrape images from a search engine, is it not possible to get harder negatives? For example, two images returned by a search for chess pieces are much more likely to be hard negatives than two images taken at random.

    • @frenchmarty7446
      @frenchmarty7446 1 year ago +1

      Yes and that is mostly what people do in practice in unsupervised contrastive learning.
      The problem is you have no guarantees or bounds on how different the random images are if at all. I might have a picture of a chess board and my random "negatives" are: another chess board, another board game, a picture of a dog and a child's sketch. My negatives all vary in their degree of similarity but I have no way of telling the model that.
      In practice, this kind of solves itself. The model learns to tolerate some "negatives" being closer to the "positives" because that minimizes the loss overall and generally that will cluster like things together.
      The point I think is that this is less data efficient than self-supervised methods. For the same amount of data, the model can learn more by reconstructing the data than by clustering it.
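      For concreteness, a minimal sketch of a standard contrastive (InfoNCE-style) loss with in-batch random negatives (my own illustration, not from the post):

      import torch
      import torch.nn.functional as F

      def info_nce(z_a, z_b, temperature=0.1):
          # z_a, z_b: [batch, dim] embeddings of two augmented views of the same images.
          z_a = F.normalize(z_a, dim=1)
          z_b = F.normalize(z_b, dim=1)
          logits = z_a @ z_b.t() / temperature   # similarity of every pair in the batch
          targets = torch.arange(z_a.size(0))    # positives on the diagonal; every other pair acts as a negative
          return F.cross_entropy(logits, targets)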

  • @fredericfournier5594
    @fredericfournier5594 3 years ago

    I think the energy model is not just an equivalent reformulation; it's a reformulation that allows you to do unsupervised learning, by having a loss function (energy function) that you learn, somewhat like an adversarial model. I think you didn't read all the preceding papers (you were talking about them in the other video), which is why you still don't understand.

  • @tinyentropy
    @tinyentropy 3 years ago

    Somehow disappointing takeaways from a big title like this one. Good explanation, though. Thanks!!!

  • @IqweoR
    @IqweoR 3 years ago

    29:58 So basically the energy-based model predicts this F(x,y); this function becomes another learnable component, which in this case would be what you'd call the 'loss function'. How do you train it exactly? I don't know, but, in theory, if you train it on some real videos there's a chance it will overfit to those videos and will probably assign a high value to anything your main model outputs. For example, your model predicts a cat that wears a hat; that's ridiculous at first glance compared to real videos of cats, but we can actually imagine it. We've seen bloggers doing this for the lols :)
    And if you somehow pair them (your self-supervised model and this F model) and train them together, this F should represent not just 'how real the output is', but actually 'how well it fits the data'; it wouldn't matter that much if a cat wears a hat, if that's appropriate to the context. But how to do this training properly is still an open question.

  • @GuillermoValleCosmos
    @GuillermoValleCosmos 3 years ago

    I don't get why we need to limit the capacity of z. Isn't adding conditioning on the discriminator for a GAN enough to force it to attend to the conditioning?

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      We create an information bottleneck to force the model to learn meaningful representations without labels. We could very easily make "Z" have the same dimensions as "X" but it won't learn anything interesting.
      A conditional GAN is something else entirely where we already have labels and want to generate new samples that match our labels (as opposed to just being completely random). GANs don't learn representations on their own they only generate new samples.

  • @reginaphalange2563
    @reginaphalange2563 2 years ago

    23:55 a nice glass of wine indeed

  • @zrmsraggot
    @zrmsraggot 3 years ago

    At 24:00, don't we all assume the most likely hidden thing is the most 'usual' thing until something else makes us think differently? For example, if you saw a face shot of me typing at my chair but couldn't see below the table I'm at, with no other info you would have to assume I'm wearing jeans, right? But if I had a bowl of Cheerios next to me, then you could suppose all I'm wearing is some pants.

    • @zeamon4932
      @zeamon4932 3 years ago

      Humans tend to stereotype as they grow up. Is that why children imagine more creatively

  • @jvboid
    @jvboid 3 years ago

    Thank you. I wanted to add that Chollet also defines intelligence as "skill acquisition efficiency with respect to … information"

    • @Khawalidmi
      @Khawalidmi 3 years ago

      I think this is the standard definition of intelligence in human-centric fields such as psychology as well.

  • @owlmaster1528
    @owlmaster1528 3 years ago +1

    In your video about Multimodal Neurons - that is my comment there:
    I didn't know that Picasso was connected to the AI. Now we know on what trip he went. We need an answer of just what exactly AI he was connected to (in trance or whatever) so he would made all this images.
    Do we have more proof for the Matrix now?

    • @YEASTY_COMMIE
      @YEASTY_COMMIE 3 years ago +3

      bro idk what you on but I want some of that

    • @owlmaster1528
      @owlmaster1528 3 years ago

      @@YEASTY_COMMIE Keep calm, breath in and compare them :) You earned one like from me :)

  • @samernoureddine
    @samernoureddine 3 years ago +1

    28:10 Amazon logo

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 3 years ago +1

    Better than openAI? They did release some pretrained CLIP models, so not so fast!
    But yeah, got to agree on the energy thing. Indeed it seems to mean 'loss function for people with physics envy'. Cmon Yann, you are too well paid for that.

  • @MIbra96
    @MIbra96 2 years ago

    23:54
    With only a 32x32 8-bit greyscale image you have a total of 256^(32*32) possible images. That is ridiculously huge. xD

  • @dimitriognibene8945
    @dimitriognibene8945 2 years ago

    Is this any different from predictive coding? I find it offensive and unfair to rename concepts without giving credit to related people like Mumford, Ballard, Rao, Friston...

  • @silberlinie
    @silberlinie 2 years ago

    Is Yannic constantly getting closer and closer
    to Agent Smith from the Matrix in appearance?
    If so, what can we expect from him?
    If it is not so, why do I ask for it?

  • @davidk991
    @davidk991 2 years ago

    loved the german :D

  • @dr.mikeybee
    @dr.mikeybee 3 years ago

    You need a Hadoop cluster of Raspberry Pis for that.

  • @duncanmays68
    @duncanmays68 2 years ago

    4:45 Insulting cows is very Swiss

  • @hoaxuan7074
    @hoaxuan7074 3 years ago

    There is an argument to go the other way and extensively use human labeled data. To make up for the 'fact' that neural network training algorithms are only able to search for statistical solutions, not explore the full solution space.

    • @ZakkeryDiaz
      @ZakkeryDiaz 3 years ago

      The space would be confined to the labels provided

    • @hoaxuan7074
      @hoaxuan7074 3 years ago

      @@ZakkeryDiaz The net output could be augmented with extensive labels or short descriptive sentences. You might train the net to produce an image out and a descriptive sentence. To train a net to do that you are forcing in human concepts via the training sentences.
      Anyway it would have to make the net better at its basic task of producing images. Proof needed.

    • @ZakkeryDiaz
      @ZakkeryDiaz 3 years ago

      @@hoaxuan7074 I think there are 2 things here. One is the desire to reduce the cost of data acquisition. Requiring human intervention is very expensive.
      The second point is I think there will be more sophisticated structures in the future (Sparse networks and inhibitions to subsystems etc) that will give rise to emergent properties to explore some of those spaces we can't find purely with neural networks. We lose the opportunity for other paths in the network space because the labels prematurely close them before they prove useful in a situation not covered by your test/training

    • @hoaxuan7074
      @hoaxuan7074 3 years ago

      @@ZakkeryDiaz Sufficiently sparse or small neural networks are no longer chained to statistics. There are many examples on YT of small nets trained by evolution to play games that are not statistical.
      I just had an idea of jobs for most of the population helping neural networks improve. That people have employment is important. The less valuable you are, the greater the chance you will be misused.

    • @hoaxuan7074
      @hoaxuan7074 3 years ago

      @@ZakkeryDiaz You know there are Fast Transform fixed-filter-bank neural networks? They are really fast and work nicely for many problems. Yet by construction they are 100% statistical in behavior.
      And that is quite in contrast to Numenta's sparse neural network, where ReLU is replaced by top-k magnitude selection. They are certainly both unconventional neural networks. Which is better?

  • @Hypotemused
    @Hypotemused 3 years ago

    Speaking of ‘energy functions’ - any idea how much these models cost to train and the energy they consume ? GPT3, SEER , etc -- the performance metrics are always published but they never say what it cost or how much energy they consume. Is it so negligible that it’s not even worth mentioning?

  • @citizizen
    @citizizen 3 years ago

    If it is hard to do visual stuff, a computer could do analysis in different forms, like verbal methods for visual patterns, like a language.
    Can't it be that some visual stuff has a verbal side to it? Not sure here. Perhaps energy-based models and language-based techniques can be combined. So energy might be like the 'emotion' of the objects used, and this energy combined with verbal material...
    Perhaps a joke... => p2p, internet, sharing resources by any computing device.

  • @jamiekawabata7101
    @jamiekawabata7101 3 years ago +9

    I have a new method I call "simulated annealing". Oh crap that's already taken.

  • @jonseltzer321
    @jonseltzer321 3 years ago

    Not sure I follow the connection between the example of '...a cow lying on a beach' and kids being able to identify the cow because they have a model of the world. There's nothing common about a cow lying on a beach. I think it's more likely that humans are modeling and are able to exclude the beach and still see the cow, whereas machine learning is cheating and considering context when doing the analysis, making 'the cow' not just the cow but also grass, clouds, cows eating grass, etc. And when confronted with a cow in the wrong context, these systems fail. The machine learning system has flattened the data, whereas the child has a hierarchical picture of the data.

    • @frenchmarty7446
      @frenchmarty7446 1 year ago

      That is what is meant by a world model. Your world model tells you what can and cannot be disentangled.
      A cow lying on a beach being a rare sight is a fact of the data distribution, not the world model per se. We would say that a good world model is robust to this kind of confounding, whereas a poor world model is naive to confounding (it "cheats" with spurious relationships).

  • @beans2874
    @beans2874 3 years ago +3

    54:13 - 69 vs 96

  • @wiktormigaszewski8684
    @wiktormigaszewski8684 3 years ago

    why can't you generate negative examples from the same photo, by just taking a more distant fragment of it?..

    • @andrewcutler4599
      @andrewcutler4599 3 years ago

      Forgetting the paper but there is one that breaks an image into chunks then uses an auxiliary loss to encourage the model to produce similar embeddings from each chunk. So that's an example of positive examples from the same photo. Not sure why they would be negative. You're saying neighboring portions of the image should be more similar than distant portions?

    • @wiktormigaszewski8684
      @wiktormigaszewski8684 3 years ago

      @@andrewcutler4599 sure, this is how it works in reality! :)

    • @zeamon4932
      @zeamon4932 3 years ago

      @@wiktormigaszewski8684 i dont think so. Think about copying an apple to get N*N apples to generate a single image

  • @LuisAldamiz
    @LuisAldamiz 1 year ago +1

    Imagine a baby that does not have five senses but only a connection to the Internet which it reads in ones and zeroes and has to learn everything from that. It would not work, so the AI is quite amazing: it understands binary and extracts a lot of info from it, something we can't easily do.

  • @NeoShameMan
    @NeoShameMan 3 years ago

    Humans are probably few-shot learners, as shown by the subliminal effect (i.e. one input (1 frame) isn't enough),
    and the human brain is full of feedback loops that work as working memory; few-shot learning with memory is probably what is happening, with continuous learning to bake in the data. Also, human training is quite long: babies don't wake up talking and walking on day one, AND there is a bunch of autonomous functions baked in that get overridden by training over time.

  • @XX-vu5jo
    @XX-vu5jo 3 years ago +2

    54:17 he Wants it!!!

  • @Cl0udn1n3
    @Cl0udn1n3 4 months ago

    A F Of XY energy based model. Foxy energy is what I always look for when on a train. #NyuWorldOrder #SoylentGarchBurgerQlists #yannie-chan

  • @hoaxuan7074
    @hoaxuan7074 3 years ago

    A paper to look at is Numenta's sparse neural net. A random mask is used to get rid of say 95% of the weights. Top-k magnitude selection is used on each layer of dot products and only those selected connect forward.
    I pointed out that maximum magnitude is correlated with minimum angle between the input vector and the weight vector. And minimum angle is correlated with reduced noise sensitivity. You can see the Numenta Discourse forum for the argument.
    In fact a dot product has a critical zone within which it displays error correction. The two factors are angle and distribution. A single non-zero input is not distributed; all inputs equal and non-zero is fully distributed. The absence of any discussion of this in the neural network books suggests poor scientific methodology and castles built on foundations of sand.

  • @seanreynoldscs
    @seanreynoldscs 1 year ago

    No, object permanence is learned. That is why peekaboo is so fun for kids.

  • @wilhelmvanbabbenburg8443
    @wilhelmvanbabbenburg8443 11 months ago

    25:30 😂😂

  • @vsiegel
    @vsiegel 3 years ago

    This is Yannic's cat.

  • @rnoro
    @rnoro 3 years ago +1

    I agree with most of the comments. It looks like "self-supervised learning" = "unsupervised learning". They just renamed the loss function to energy function.