Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)

Yannic Kilcher

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 30 січ 2025

КОМЕНТАРІ • 189

@YannicKilcher 3 роки тому ⁺⁵⁷
ERRATA:
- The difference between loss and energy: Energy is for inference, loss is for training.
- The R(z) term is a regularizer that restricts the capacity of the latent variable. I think I said both of those things, but never together.
- The way I explain why BERT is contrastive is wrong. I haven't figured out why just yet, though :)
OUTLINE:
0:00 - Intro & Overview
1:15 - Supervised Learning, Self-Supervised Learning, and Common Sense
7:35 - Predicting Hidden Parts from Observed Parts
17:50 - Self-Supervised Learning for Language vs Vision
26:50 - Energy-Based Models
30:15 - Joint-Embedding Models
35:45 - Contrastive Methods
43:45 - Latent-Variable Predictive Models and GANs
55:00 - Summary & Conclusion
@bzqp2 3 роки тому ⁺²
Can you perhaps pin this to the top? Thanks.
@nyoseesoyn 3 роки тому ⁺⁸
All DNN with loss = some_distance(y, pred) is indeed energy-based model as you said. But not All energy based model has the form loss = some_distance(y, pred) where pred = f(x) is an explicit part of the model. So by Energy-based model, Yann means a generalization of traditional formulation where we can escape the problem of multiple y given a single x. The blogpost needs to make this distinction clearer.
@baskaisimkalmamisti 3 роки тому ⁺⁴⁰
thanks to you that I can both watch UA-cam and keep up with the research at the same time.
@brendawilliams8062 2 роки тому
I get tired of it.
@falconeagle3655 3 роки тому ⁺¹¹
Congrats!! Yann Lecun sent me to your video.
@sheggle 3 роки тому ⁺²⁶
We love us some content that doesn't chase sota, thank you as always Yannic!
@aniruddhadatta925 3 роки тому
Because it mostly just stacks up hardware
@sehbanomer8151 3 роки тому ⁺³⁹
13:00 Am I the only one who found that question mark really satisfying?
@ProjectsWithRed 3 роки тому
I was just about to comment this haha.
@NextFuckingLevel 3 роки тому ⁺¹
Another prove that yannic is an android
@rockapedra1130 3 роки тому
Best question mark ever!
@CalvinJKu Рік тому
Didn’t like the video thumbnail at first sight but the content is king! Subscribed!
@QuadraticPerplexity Рік тому ⁺¹
14:05 Regarding whether the third kind of masking could be used for NLP: if the word embedding is good, probably you could mask out a subset of the dimensions.
@lucathiede9238 3 роки тому ⁺³⁴
There is a difference between energy functions and objective functions:
In physics, energy functions are defined as scalar fields with curl = 0 everywhere, so their gradient field is conservative (which is important, because otherwise the path integral for a closed path would be > 0, violating conservation of energy)
In ML, there are objectives with gradient fields that are not conservative. The best-known example for this is the GAN objective
@jackdkendall 3 роки тому ⁺⁸
Yes, and energy-based models are also linked to probability distributions in an explicit way which general loss functions are not. An energy function is an unnormalized probability distribution where you can explicitly get relative probabilities. To say that an energy function is the same thing as a loss function is inaccurate.
@lucathiede9238 3 роки тому ⁺⁴
@@jackdkendall Correct, there is always an explicit link between energy and probability distribution. However, I would not go quite as far as saying energy functions are unnormalized pdfs. For example in thermodynamics, the likelihood of finding a system in a given state is described by the Boltzmann distribution of the energy of the given state, not just the energy normalized
And this is also only true for the ideal gas model, for higher-order interactions it becomes non-analytical. But the connection of higher energy -> lower likelihood always remains as far as I know
@jackdkendall 3 роки тому ⁺²
@@lucathiede9238 as far as I'm aware in thermodynamics the probability is defined as the energy of a state divided by the partition function. The Boltzmann distribution is a direct consequence of this. In physics the Arrhenius equation always gives relative probabilities in terms of activation energy, which is just the energy difference between two states.
In ML, it's similarly defined. Probability is just energy divided by partition function aka normalization term.
@jackdkendall 3 роки тому ⁺²
Actually scratch that, the Boltzmann distribution is a result of entropy maximization under conservation of energy
@lucathiede9238 3 роки тому ⁺⁵
@@jackdkendall Mh, I am not sure, maybe I have to freshen up my thermodynamics knowledge
What I am very sure of though, is that the energy (whether we need to take the boltzman distribution of it or not) only describes the likelihood of a state, not the probability.
To get the probability we need to consider the Helmholtz free energy of a macro state, which essentially takes the entropy and the temperature into account
I know this is a bit nitpicking, but it is often messed up and very important for example for protein folding, since the protein does not just fold into the state of lowest energy, but lowest free energy
@norabelrose198 3 роки тому ⁺¹⁰
I think "energy based model" more precisely is supposed to refer to models that output unnormalized scores as opposed to (log-) probabilities. LeCun has said that he doesn't like approaches that are specifically designed to output valid probabilities or approximations of probabilities (i.e. normalizing flows, traditional VAEs) when arguably some other non-probability based approach would work better. But confusingly he also seems to lump even probability based models into the EBM category when he feels like it.
@ruroruro 3 роки тому ⁺²
Agreed. Extremely hand-wavy and non-specific. Also, I wonder, how do you even determine, if some model is approximating a probability distribution or not. Like I am pretty sure, that for any score function, you can produce a monotonic mapping of that score to [0; 1], that gives you a pretty good approximation of the underlying probability distribution.
@lucathiede9238 3 роки тому
Normalizing flows output valid probabilities (or more precisely, likelihood) yes, but VAEs dont, they only output a sample without the associated likelihood
@norabelrose198 3 роки тому ⁺¹
@@lucathiede9238 The loss function for VAEs is negative ELBO, which is provably a lower bound on the true log probability of the data
@norabelrose198 3 роки тому ⁺²
@@ruroruro Yeah, I think the main problem with mapping arbitrary score functions to probability distributions is computing the normalization constant to ensure the integral of the score over all possible inputs is equal to 1. That’s not tractable in a lot of cases. Some people try hard to figure out ways to compute or approximate the normalizing constant, and LeCun’s approach seems to just be, forget about it, don’t normalize the scores at all.
@lucathiede9238 3 роки тому ⁺¹
@@norabelrose198 Not quite, the encoder gives you the pdf of a *latent variable*, conditioned on a sample. It will not give you the probability of the sample itself.
At no point in the VAE you can actually get p(sample), which is what you are usually trying to approximate in energy-based models afaik.
@Arthurein 3 роки тому ⁺¹
I am likely wrong so please correct me if it's true. In a standard classification problem, the loss function is the objective for which we optimize the neural network. In an energy-based setting, the loss function *is* the network. Which truly can be interchanged concepts since, after all, the output of a classifier will describe something akin to a probability distribution on the possible categories. But the cool part about the energy function is that it does not require any sort of normalization, and thus it can just be a black box that gives you a small number if things make sense, or a big number if things don't (or viceversa).
@ocifka 3 роки тому ⁺⁴
52:20 AFAIK the latent variable _z_ and the "embedding" are actually sort of the same thing. (The embedding is just a realization of that random variable I would say.) The confusion probably comes from the fact that there are different distributions over _z_ involved: _p(z)_ and _q(z|x)_ - the latter is what the encoder outputs, including the reparametrization trick.
@PabloHuijseHeise 3 роки тому ⁺¹
Totally right. Also the "making the latent variable fuzzy" bit refers to constraining q(z|x) to be close to p(z), the latter typically being a standard gaussian distribution
@frenchmarty7446 2 роки тому ⁺¹
@@PabloHuijseHeise Actually no. Making the latent variable "fuzzy" refers to adding noise and a KL divergence term to p(z|x) and thereby making p(z) smoother and easier to sample from. This has the side-effect of making p(z) closer to a Gaussian then it otherwise would be but enforcing p(z) to be Gaussian is a separate problem addressed by things like Adversarial Autoencoders, Factor-VAE, etc.
@CharlesVanNoland 3 роки тому ⁺³
Regarding the object permanence / gravity thing. They did experiments with cats, raising them in environments that just had vertical stripes all over everything, effectively denying them the opportunity to see horizontal lines and when they matured they would put the cats in more natural conventional environments and they had no concept of the danger of heights or falling because they couldn't perceive the ledges they were approaching.
@WhatsAI 3 роки тому ⁺¹⁶
Awesome video as always! And I completely agree, I feel like they are kind of trying to "set their terminology" on already existing concepts, but it was still an interesting read, and even better to hear your point of view on it!
@CosmiaNebula 2 роки тому ⁺⁸
"energy-based method", as defined in the paper, is *extensionally* the same as machine learning using a loss function, but is *intensionally* different. The problem is that LeCun didn't describe it rigorously using the language of symmetry, though in his subconscious (and in the subconscious of every physicist who reads the paper), the "energy function" is intended to be "energy function that has good symmetries". I will explain.
## Feynmann's unworldly equation
Consider, for example, Feynman's "unworldliness equation" U = 0, where U = f^2 + g^2 + ..., and each f, g, ... is a scalar equation of nature. This equation is of course entirely correct, but it is trivial.
However, this does not make every equation trivial. Some equations really are more substantial than others. What is the substance? It is *symmetry*, or invariance under transformations.
When Maxwell wrote down the Maxwell equations, he used 20 scalar equations. In 4-vector notation, there are just 2 equations. Why such a great simplification? It is not the trivial kind of simplification as in U = 0, but a deep simplification -- all equations written in 4-vector notation are necessarily invariant under Lorentz transforms. Because the proper "home" of Maxwell equations is a universe that is invariant under Lorentz transforms, it's no wonder that they are more elegant when in 4-vector notations.
Conversely, when you notice how elegant the equations are in 4-vector form, you realize that the universe should probably be invariant under Lorentz transforms.
Modern theoretical physics is basically a game of inventing new transforms, then constructing equations invariant under the transforms, then publish it.
> So the “beautifully simple” law in Eq. (25.32) is equivalent to the whole series of equations that you originally wrote down. It is therefore absolutely obvious that a simple notation that just hides the complexity in the definitions of symbols is not real simplicity. It is just a trick. The beauty that appears in Eq. (25.32)-just from the fact that several equations are hidden within it-is no more than a trick. When you unwrap the whole thing, you get back where you were before.
> However, there is more to the simplicity of the laws of electromagnetism written in the form of Eq. (25.29). It means more, just as a theory of vector analysis means more. The fact that the electromagnetic equations can be written in a very particular notation which was designed for the four-dimensional geometry of the Lorentz transformations-in other words, as a vector equation in the four-space-means that it is invariant under the Lorentz transformations. It is because the Maxwell equations are invariant under those transformations that they can be written in a beautiful form.
(Feynman 2, 25:6)
www.feynmanlectures.caltech.edu/II_25.html#Ch25-S6
## Energy-based methods, from the POV of
### a ML scientist
Extensionally, any machine learning problem defined using an energy function is equivalent to one defined using a loss function. And conversely, any ML problem defined by a loss function is equivalent to one defined by an energy function.
Intensionally, if you start with any loss function, and find its equivalent energy function, you would almost certainly get an energy function with no good symmetry at all.
Energy-based method is a principled way to convert symmetries in the problem into good priors over your neural network. Instead of using arbitrary loss functions constructed ad-hoc, or perhaps meta-learn a loss function, we impose the prior over the space of loss functions that respect the symmetries. Writing down an energy that respects the symmetries is just an efficient, implicit way to impose the prior.
### a physicist
Energy-based methods provide a principled way to write down equations that are invariant under physically relevant symmetries, such as translation (R^n), rotation (SO(n)), reflections (E(n)), volume-preserving maps (SL(n)), and so on. It also allows us to use gauge theory for ML.
Not only that, it also allows one to enforce only local interactions, by writing the energy as a sum of local interactions (such as E = x1 x2 + x2 x3 + x3 x4 + ...), bringing statistical mechanics and renormalization techniques to the table.
Not only does this allow you to import the greatest hits of modern physics and make ML as abstract as string theory, it also imposes good priors. A ML model for physical processes should probably only consider models that are invariant under the symmetries of nature, such as translation, rotation, reflection, etc.
### a mathematician
Energy-based methods is the Erlangen program for high-dimensional probability. All hail Felix Klein, the felicitous king of symmetry.
### a linguist
Extensional definition and intensional definitions often diverge, and it's more important to discover the intension and making it explicit, than to focus on the extension and quibble. For example, extensionally, an "activation function" is *any* function of type R^n → R, but that's the extensional definition. When you actually say "activation function" you mean any function of this type that has been profitably used in a neural network.
@shengyaozhuang3748 3 роки тому ⁺²
looking forward to "Barlow Twins: Self-Supervised Learning via Redundancy Reduction" review, also from Yann's group. BYOL like method but without momentum updates!
@emransaleh9535 3 роки тому ⁺³
Good topic to tackle in this time. I will enjoy watching the video.
@StochasticCockatoo 2 роки тому
If you could only watch 4 Yannic videos this would be one of them
@membershipyuji 3 роки тому ⁺¹
Very helpful video. I was able fill in many gaps present in the post.
@rogerfreitas7323 3 роки тому ⁺³
So far best channel in youtube
@alexandervlasov6746 3 роки тому
As far as I undersood, energy-based learning is a non-probabilistic counterpart to maximum likelihood (or MAP) estimation. Similar to SVM and Logistic Regression, both are used as classifiers and have the same classifier form, but LR having probabilistic roots, while SVM - not. Basically, one uses the same idea, but avoids probability constraints (non-negative, sums to one).
In that context, an energy function is not the same as a loss function. Yann LeCun uses energy function for inference (y_pred = argmin_y F(x,y)). However, a training/validation loss can be constructed in a different way. Often, the loss is (mis-)prediction based, but it's not necessary so, e.g. one can use a structurally similar but different argmin approach to obtain loss/feedback information.
A similar difference is between a probability mass/density function and a (log-)likelihood function.
@georgelomia4724 3 роки тому
To the point you brought up on minute 38:15, would an unsupervised clustering method help us group images together according to their similarity, if not why not?
@finlayl2505 3 роки тому ⁺⁵
I wonder if you trained a model on filling in video first, could you then transfer that to object recognition and get better results? Because in the video training it would no doubt get a better understanding of 3d space which may help it in object recognition.
@brendawilliams8062 2 роки тому
A spark gap unleashed a zoo
@learnstochastic 3 роки тому ⁺⁶
Hey Yannic, I don't if you already have it. Is it possible to make these things available as a podcast over Spotify or anywhere? Your approaches are really good, and it would be a lot easier to just plug in ear phones and go for a walk with your explanations on. Thanks.
@brendawilliams8062 2 роки тому
Thankyou. Informative and nicely explained.
@alvarohenriquez497 3 роки тому
Really enjoy your videos. Hope that you do one on the Timesformer soon. Thanks.
@ProjectsWithRed 3 роки тому ⁺²
Am I not correct in saying that finding the missing patch/crop in an image is not infinite, just a very high number. Because let say the crop (hidden part) is 100x100 and for example there are 3 colour channels, so 100x100x3, we know each colour channel has a range from 0-255, so we can technically make all possible crops that go in that image, which is all combinations of pixel values, which is discrete and not infinite, and one of the combinations of pixel values will be that exact patch in the whole image. It might be infinite in real-life but it is discrete in terms of CV.
@frenchmarty7446 2 роки тому
A better way of putting it is illposed. There are more possible answers then we have data to learn from, therefore learning the parameters of a probability distribution is impossible. There are infinitely many good *parameters* of our model that fit the data perfectly well.
@bsdjns 3 роки тому
Great video, please keep them coming!
I actually didn't know you're German until you mentioned the eierlegende Wollmilchsau :D
@sg22r 3 роки тому
How is self supervision different than augmentations?
@aamir122a 3 роки тому
At time index 45 minutes, when you say mix x with z, what do you mean ( add, multiply, divide subtract ) etc, as a suggestion it would be best to explain the math operation if functions as you go along.
@abekang3623 3 роки тому ⁺¹
I was wondering if VAE's were considered fuzzy in reference to regular autoencoders which do not have the distribution and sampling in the center. Also because many auto encoders scale down their latent representation at the sampling point (center)would that be another reason why VAEs constrict or limit their representation.
@frenchmarty7446 2 роки тому
I believe the "fuzziness" is in reference to the output, which is caused by the loss function (mean squared error) which encourages blurry outputs. VAEs that use a discriminator like GANs ("VAE-GAN") do not have the same problem.
But you are right on both points. Though VAEs actually give better outputs when sampling outside of the data distribution, the injection of noise in the latent space forces the decoder to learn to give acceptable outputs over a wider and smoother range of outputs.
@DucNguyen-wy1ir 3 роки тому ⁺³
Hi @yannic. Thanks for the fantastic video. Regarding your confusion about limiting the capacity of the latent variable `z` in VAE, i think what the post means is that in VAE, the authors use a unimodel Gaussian distribution, which is of limited capacity because they could have used some other distributions that are of higher capacity, like a Gaussian mixture models or some multimodal distributions (in fact as far as my memory can serve me, there is a paper doing it). What do you think?
@moormanjean5636 2 роки тому
I think what they mean is that by introducing noise to the latent variable, the network learns to rely less on the latent part of the network and more on the autoencoder.
@frenchmarty7446 2 роки тому
I'm not sure if the information capacity of the distribution itself is related to how VAEs constrict information. Those are two different kinds of information.
The decoder doesn't recieve the entire distribution, it is given one sample from the latent distribution. The information in the distribution itself is lost.
A multivariate Cauchy distribution contains more information than a multivariate Gaussian distribution, but a random sample from the former would actually contain less information* (more dimensions of my vector will be extreme outliers that are unrepresentative of the mode). It's more accurate to say that a non-Gaussian distribution *implies* more information (I have to know more as an observer to model something as non-Gaussian, a Cauchy distribution is a less expected guess than a Gaussian a priori); the distribution doesn't pass this information on unless I take multiple samples and relearn the distribution. A vector of n dimensions is still just a vector of n dimensions.
Here is a good explanation on why VAEs usually use Gaussian noise instead of anything else: stats.stackexchange.com/questions/517467/is-it-possible-to-use-variational-autoencoders-with-non-gaussian-data
This article has a good explanation of why variations autoencoders are variational in the first place with a good visualization of how Gaussian noise "smooths" the latent space (which is useful for generative sampling): www.jeremyjordan.me/variational-autoencoders/
Gaussian noise has the side-effect of restricting information flow, but this wasn't a problem for autoencoders to begin with because we have complete control over the dimensionality of "Z" to begin with. In fact if you wanted to use noise to restrict information flow you would actually generate noise from a more "informative" distribution like Cauchy noise. Remember that the decoder is ultimately trying to reconstruct what was fed into the encoder to start with, if "Z" is more complicated it is that much harder for the decoder to guess what "X" produced it from a single sample of "Z".
*this should actually be expected. If a distribution implies more information then by extension any particular sample must be relatively less informative. It's not an accident that highly non-Gaussian distribution are very sample inefficient to parameterize and/or require special less efficient estimators.
Also, there is a difference between pushing individual samples to be Gaussian and pushing the entire latent space to be Gaussian. The later is what things like Adversarial Autoencoders (arxiv.org/abs/1511.05644) and beta-VAE (openreview.net/forum?id=Sy2fzU9gl) try to do and is closer to what I think you mean.
@frenchmarty7446 2 роки тому
@@moormanjean5636 I'm not sure what you mean by "rely less on the latent part and more on the autoencoder".
The latent variable is the output of the encoder [p(z|x)] and is the only input the decoder [p(x|z)] receives.
Gaussian noise is added to make the decoder more robust and the latent distribution smoother, both of which make it easier to sample new values and to generalize to new inputs.
@ekjotnanda6832 3 роки тому
Very nice explanation 👍🏻
@Markste-in 3 роки тому ⁺¹
My questions is: what keeps the model from just ignoring the latent variable und just use the input (x). Wouldn't that be more easy for the model than trying to handle a jittery latent variable (z)
@frenchmarty7446 2 роки тому
1.) The model is actually two components, the encoder and the decoder. The decoder doesn't get to see 'X'.
2.) Modeling an intermediate latent variable forces the model to learn a more generalizable strategy. If you mapped from X* -> X it's impossible to know what the model will do with unseen values. Whereas if the encoder always maps to the same constricted latent space, the information that passes through the constriction is more likely to be meaningful.
It's the same logic behind something like LASSO in standard regression. Better the model learns some important information than absorb everything.
Also the beauty of using a latent intermediate representation is that you can dramatically change how the model "thinks" about the data by imposing restrictions on what "Z" can look like (see Factor-VAE, Causal-VAE, cluster-VAE).
@dwhdai 3 роки тому
are there any good examples of SSLs on timeseries data?
@wenxue8155 3 роки тому
Isn't there already a paper on that: Learning to Predict Without Looking Ahead:
World Models Without Forward Prediction. The whole idea very similar to World Models.
@jean-baptistedelabroise5391 3 роки тому ⁺¹
hmm, if you scrap image from a search engine is it not possible to get harder negatives for example: 2 images issued from a research for chess pieces are much more likely to be harder negatives than 2 images taken at random.
@frenchmarty7446 2 роки тому ⁺¹
Yes and that is mostly what people do in practice in unsupervised contrastive learning.
The problem is you have no guarantees or bounds on how different the random images are if at all. I might have a picture of a chess board and my random "negatives" are: another chess board, another board game, a picture of a dog and a child's sketch. My negatives all vary in their degree of similarity but I have no way of telling the model that.
In practice, this kind of solves itself. The model learns to tolerate some "negatives" being closer to the "positives" because that minimizes the loss overall and generally that will cluster like things together.
The point I think is that this is less data efficient than self-supervised methods. For the same amount of data, the model can learn more by reconstructing the data then by clustering it.
@sean_vikoren 3 роки тому
To find the possible number of arrangements you simply add the bits in each pixel and that will be a specific number which can essentially be represented by the number of bytes that represent the image. So not infinite.
@apollozou9809 3 роки тому
Thank you!
@cem_kaya 2 роки тому
thanks for the explanation
@EternalKernel 3 роки тому
Thank you, Yannic!
@thegistofcalculus 3 роки тому
I understand the motivation behind Siamese networks but why not a giant one hot vector target where each input picture corresponds to a node in the giant one hot network (it was done before and you can just search the title "Unsupervised Feature Learning via Non-Parametric Instance Discrimination")
@PavelChernov 3 роки тому
Thank you! Very interesting topic.
@RohitKumarSingh25 3 роки тому
Recently many papers using similar approach as BYOL i.e. instead of exact network keeping a moving average of network with lots of augmented images while training, has solved the negative sampling problem in contrastive loss approach.
@terguunzoregtiin8791 3 роки тому
Hi, do you think "masked auto encoders are scalable vision learners" follows this blog's idea?
@jeroenput9564 3 роки тому
Love your critical thinking!
@junhanouyang6593 3 роки тому
I may be really stupid. But for latent variable predictive models if we want to limit the capacity of z and make sure our model focused on the Pred(x) decoding and not care about z. Then what is the purpose of z here?
@frenchmarty7446 2 роки тому
We do care about "Z", it just isn't (always) part of our loss term. A constricted latent space forces the model to retain only important and (hopefully) meaningful information and thereby generalize much better.
If Z was unconstrained, the model would just learn the identity function (x ≈ z) and learn nothing interesting.
@HaykTarkhanyan 2 роки тому
Thanks, that was very helpful
@chriscanal999 3 роки тому ⁺²⁹
Lmao “my raspberry pi has that capacity”
@G12GilbertProduction 3 роки тому ⁺¹
My capacity is better than yours. ×P
@GuillermoValleCosmos 3 роки тому
I don't get why we need to limit the capacity of z. Isn't adding conditioning on the discriminator for a GAN enough to force it to attent to the conditioning?
@frenchmarty7446 2 роки тому
We create an information bottleneck to force the model to learn meaningful representations without labels. We could very easily make "Z" have the same dimensions as "X" but it won't learn anything interesting.
A conditional GAN is something else entirely where we already have labels and want to generate new samples that match our labels (as opposed to just being completely random). GANs don't learn representations on their own they only generate new samples.
@zoltanczesznak976 3 роки тому
dont take it as a paper but as tutorial, with a catchy name, so it is sort of high end popular science, big crowds wont start to read statistical learning theory, in addition you are also a good educator i believe, so it served its purpose
@hannesstark5024 3 роки тому ⁺¹
The way I understand the VAE statements is that we have the gaussian as latent variable and restrict it via the encoder.
@frenchmarty7446 2 роки тому
It would be closer to say it's the other way around. The encoder generates the latent variable and we constrict the latent space in several different ways (reduced dimensionality, KL divergence loss, Gaussian noise, etc).
The decoder is tasked with guessing what went into the encoder to produce the latent variable (p(x | z)) hence a game in which the model learns the most important information to pass through the constriction.
The result is we get a latent space that is dense with information, generalizes well and meets whatever other requirements we impose (easy to sample from, etc). We take this and do other useful tasks much easier.
@marc-andrepiche1809 3 роки тому
An energy model is only a model where the lowest lost is 0. (some models maximise and/or can be negative)
@sebastiangerard9548 3 роки тому
The blog post says: "An EBM is a trainable system that, given two inputs, x and y, tells us how incompatible they are with each other."
I think that models trained with loss functions are not EBMs, since the task that the model is solving is e.g. image classification, not to output the loss between the ground truth and the input image. That's just something you use to find the parameters of the model, but not the task that the model solves. Maybe I misunderstood you, but it seemed like you wanted to argue that using a loss function during training is enough to qualify ML models as EBMs.
You could try to argue that e.g. an image classifier is an EBM, since it takes as input an image x and then outputs compatibility scores with each of the classes. In that case you would need to define your second input y as being constant and representing all the classes, since your model outputs needs to be a compatibility measure between x and y. However, I would argue that it is then not an EBM, since it cannot make predictions for varying values of y. Following the definition above, it would need to be able to indicate how incompatible x and y are, for any x and y, to qualify as an EBM.
Happy to hear any counterpoints. I don't have any previous experience with EBM or their origin in physics, but this is how the definition would make sense to me. Predicting the compatibility is defined as the central task that the model is solving.
@oraz. 3 роки тому ⁺¹
I'm sure Lecun will put this to good use for his employers Facebook and Instagram.
@zrmsraggot 3 роки тому
At 24:00 don't we all assume the most likely thing hidden is the most 'usual' thing until something else can make us think different ? For ex, if you saw a faceshot of me typing on my chair but couldn't see below that table I'm at, with no info you would have to assume I'm wearing some jeans right ? But if I had a bowl of Cheerios next to me then you could suppose all I wear is some pants.
@zeamon4932 3 роки тому
Humans tend to stereotype as they grow up. Is that why children imagine more creatively
@adityakane5669 3 роки тому
Great video!
Just some food for thought. Assume you have many random images of the world, from mountains to seas to cities and whatnot. Then you clip such a part (say 100x100) and ask the model to predict that. You use a sliding window approach to this, by which I mean you ask the model to predict many clips of the same image. Now you shuffle all these (image, clip) pairs and train the model. What will the model learn? Will it even learn anything? Or will it have a good understanding of the world as the authors suggest?
@frenchmarty7446 2 роки тому
What exactly do you mean by shuffle the image/clip pairs? Do you mean pair a clipped image to a non-matching clip and learn to discriminate good matches from poor matches?
If you do this correctly then yes, the model will learn quite a bit about the world. It will learn a latent representation that is semantically meaningful (like clustered with like, etc).
I don't know where the line for "good understanding" is drawn, but we do know that these latent representations are useful for many other unrelated tasks downstream with fine tuning.
I can use the same model to add labels to images for example with a small number of labeled training examples. It is faster to learn the relationship between latent labels then raw data labels, so the latent space is capturing some important information in a compressed form.
@adityakane5669 2 роки тому
@@frenchmarty7446 By clips I mean cut-outs of the image. The self-supervised task is to predict the contents of the cutout. I'm sorry for the confusion earlier. I now know that such efforts have been made in papers like MAE.
@Blattealkiller 3 роки тому ⁺⁴
Agree with you about the energy based model terminology ahah
@AlexanderMath 3 роки тому
You beat me to it. I was looking for anyone that took up the challenge, I thought exactly the same with energy based model.
@dimitriognibene8945 3 роки тому
Is this any different from predictive coding? I find offensive and unfair to rename concepts without giving credit to related people like Mumford, Ballard, rao, friston ...
@mar-a-lagofbibug8833 3 роки тому
ThAnk you.
@larsojinnaka 3 роки тому ⁺²
The smiley face at 28:10
@lamiaalsalloom1881 Рік тому
thank you, you are a saint
@XX-vu5jo 3 роки тому ⁺²
I am making an implementation of this with a slight variations will share it on my github soon.
@_kkaai 2 роки тому
Very helpful
@bdennyw1 3 роки тому ⁺¹⁹
Yannic did a great job as always, but I'm not sure what the point of the paper was. Seems like a rehash of everything from the last couple of years.
@IqweoR 3 роки тому
29:58 So basically energy based model predicts this F(x,y), this function becomes another learnable parameter, which in this case would be what you define 'loss function'. How do you train it exactly? I don't know, but, in theory, if you train it on a some real videos there's a chance it will overfit to these videos and will probably predict high value to anything your main model outputs, for example your model predicts that a cat that wears a hat, that's rediculous at a first glance, compared to real videos of a cats, but we can actually think of this. We've seen bloggers doing this for the lols :)
And if you somehow pair them (your self supervised model and this F model) and train them together in pair, this F should represent not just 'how real output is', but actually 'how it fits the data', it wouldn't matter that much if a cat wears a hat, if that's appropriate to the context. But how to do this training properly - still an open question.
@waleedawad4520 3 роки тому ⁺¹
That was very helpful
@LuisAldamiz 2 роки тому ⁺¹
Imagine a baby that does not have five senses but only a connection to the Internet which it reads in ones and zeroes and has to learn everything from that. It would not work, so the AI is quite amazing: it understands binary and extracts a lot of info from it, something we can't easily do.
@citizizen 3 роки тому
I want to build a brain...
First a thanks for this channel! I think that, all the datasets can be put together in chunks (kinds of reservoirs as it where), represented in interesting ways. Future gene pools. In essence my idea is that when we get multiple datasets, and each of those are represent as special kinds of information, we build something grand in the end. I guess that, intelligence might emerge more effectively like this.. Intelligence is about a lot of repetitions..
Regards, Justin
@moormanjean5636 2 роки тому
what you are talking about sounds a lot like transfer learning, you should look into this maybe
@Gauloi007 3 роки тому
I think energy model not just a equivalent reformulation, it's a reformulation that aloud you to make unsupervised learning , by having a lost fonction ( energy function) than you learn like a adversarial model in some way. I think don't read all the precedent ( you was talking in the other video) paper it's why hou still don't understand.
@mdfeatherwx 3 роки тому
This video deserves to get my 1hour of time
@tinyentropy 3 роки тому
Somehow disappointing take aways from a big title like this one. Good explanation, though. Thanks!!!
@harrywoods9784 2 роки тому
Just a thought, in my mind embodiment is the key to self learning. That’s why Teslas humanoid Optimus robot is exciting.
🤔IMO
@silberlinie 3 роки тому
Is Yannic constantly getting closer and closer
to Agent Smith from the Matix in appearance?
If so, what can we expect from him?
If it is not so, why do I ask for it?
@wiktormigaszewski8684 3 роки тому
why can't you generate negative examples from the same photo, by just taking a more distant fragment of it?..
@andrewcutler4599 3 роки тому
Forgetting the paper but there is one that breaks an image into chunks then uses an auxiliary loss to encourage the model to produce similar embeddings from each chunk. So that's an example of positive examples from the same photo. Not sure why they would be negative. You're saying neighboring portions of the image should be more similar than distant portions?
@wiktormigaszewski8684 3 роки тому
@@andrewcutler4599 sure, this is how it works in reality! :)
@zeamon4932 3 роки тому
@@wiktormigaszewski8684 i dont think so. Think about copying an apple to get N*N apples to generate a single image
@willd1mindmind639 3 роки тому
Most of the tasks for deep leaning are associated with making sense of the buckets of binary information sitting on a a computer server. The rise of machine learning is associated with the rise of big data as they have a natural synergy, aka Google. But it really isn't intelligence. Intelligence is being able to derive understanding about the world from the data provided. In terms of visual intelligence, that means learning what light is, what shadows are, what surfaces are, what textures are, what perspective is, what near and far is, left and right, up vs down and so forth at a base level (most creatures with vision can do that). Then on top of that basic core of fvisual comprehension, there is the higher order reasoning, learning and understanding that humans have evolved, but neural networks can't do. Even with that limitation though, the reason it works in so many applications of modern industry and computing is because a lot of business functioning is based around statistical models anyway. And using statistic based models to help model behaviors and generate predictions fits in well to a large number of business tasks. And of course data stored in silicon as a result of human activity online is growing exponentially. So these neural network models work reasonably well for a large set of business related use cases and scenarios on general purpose computer hardware.....
@owlmaster1528 3 роки тому ⁺¹
In your video about Multimodal Neurons - that is my comment there:
I didn't know that Picasso was connected to the AI. Now we know on what trip he went. We need an answer of just what exactly AI he was connected to (in trance or whatever) so he would made all this images.
Do we have more proof for the Matrix now?
@YEASTY_COMMIE 3 роки тому ⁺³
bro idk what you on but I want some of that
@owlmaster1528 3 роки тому
@@YEASTY_COMMIE Keep calm, breath in and compare them :) You earned one like from me :)
@ahmedtrabelsi2936 3 роки тому
Hello , great work . Can you have a look for neuromorphic articles, human brain inspired articles (SNN Algo ...)
@MIbra96 3 роки тому
23:54
With only a 32x32 8-bit greyscale image you have a total of 256^(32*32) possible images. That is ridiculously huge. xD
@jvboid 3 роки тому
Thank you. I wanted to add that Chollet also defines intelligence as "skill acquisition efficiency with respect to … information"
@خالد_الشيباني 3 роки тому
I think this is the standard definition of intelligence in human-centric fields such as psychology as well.
@hoaxuan7074 3 роки тому
There is an argument to go the other way and extensively use human labeled data. To make up for the 'fact' that neural network training algorithms are only able to search for statistical solutions, not explore the full solution space.
@ZakkeryDiaz 3 роки тому
The space would be confined to the labels provided
@hoaxuan7074 3 роки тому
@@ZakkeryDiaz The net output could be augmented with extensive labels or short descriptive sentences. You might train the net to produce an image out and a descriptive sentence. To train a net to do that you are forcing in human concepts via the training sentences.
Anyway it would have to make the net better at its basic task of producing images. Proof needed.
@ZakkeryDiaz 3 роки тому
@@hoaxuan7074 I think theres 2 things here. One is the desire to reduce the cost of data acquisition. Requiring human intervention is very expensive.
The second point is I think there will be more sophisticated structures in the future (Sparse networks and inhibitions to subsystems etc) that will give rise to emergent properties to explore some of those spaces we can't find purely with neural networks. We lose the opportunity for other paths in the network space because the labels prematurely close them before they prove useful in a situation not covered by your test/training
@hoaxuan7074 3 роки тому
@@ZakkeryDiaz Sufficently sparse or small neural networks are no longer chained to statistics. There are many examples of small nets trained by evolution, to play games, on YT, that are not statistical.
I just hsd an idea of jobs for most of the population helping neural networks improve. That people have employment is important. The less valuble you are the greater the chance you will be misused.
@hoaxuan7074 3 роки тому
@@ZakkeryDiaz You know there are Fast Transform fixed-fixed-filter bank neural networks? They are really fast and really work nice for many problems. Yet by construction they are 100% statistical in behavior.
And that is quite in contrast to Numenta's spare neural network where ReLU is replaced by top-k magnitude selection. They are certainly both unconventional neural networks. Which is better?
@rnoro 3 роки тому ⁺¹
I agree with most of the comments. It looks like "self-supervised learning" = "unsupervised learning". They just renamed the loss function to energy function.
@jamiekawabata7101 3 роки тому ⁺¹⁰
I have a new method I call "simulated annealing". Oh crap that's already taken.
@Hypotemused 3 роки тому
Speaking of ‘energy functions’ - any idea how much these models cost to train and the energy they consume ? GPT3, SEER , etc -- the performance metrics are always published but they never say what it cost or how much energy they consume. Is it so negligible that it’s not even worth mentioning?
@jonseltzer321 3 роки тому
not sure I follow the translation between the example of '...a cow lying on a beach' and kids being able to identify the cow because they have a model of the world. There's nothing common about a cow lying on a beach. I think it's more likely that humans are modeling and are able to exclude the beach and still see the cloud where machine learning is cheating and considering context when doing the analysis so making the cow, not just the cow, but also grass, clouds, cows eating grass, etc. And when confronted with a cow in the wrong context, they fail. The machine learning system has flattened the data where as the child has a hierarchical picture of the data.
@frenchmarty7446 2 роки тому
That is what is meant by a world model. You're world model tells you what can and cannot be disentangled.
A cow lying on a beach being a rare sight is a fact of the data distribution, not the world model per ce. We would say that a good world model is robust to this kind of confounding whereas a poor world model is naive to confounding (it "cheats" with spurious relationships).
@reginaphalange2563 3 роки тому
23:55 a nice glass of wine indeed
@eelcohoogendoorn8044 3 роки тому ⁺¹
Better than openAI? They did release some pretrained CLIP models, so not so fast!
But yeah, got to agree on the energy thing. Indeed it seems to mean 'loss function for people with physics envy'. Cmon Yann, you are too well paid for that.
@samernoureddine 3 роки тому ⁺¹
28:10 Amazon logo
@beans2874 3 роки тому ⁺³
54:13 - 69 vs 96
@GuillermoValleCosmos 3 роки тому ⁺¹
nice vs cine
@duncanmays68 3 роки тому
4:45 Insulting cows is very Swiss
@dr.mikeybee 3 роки тому
You need an hadoop cluster of raspberry pis for that.
@davidk991 3 роки тому
loved the german :D
@hoaxuan7074 3 роки тому
A paper to look at is Numenta's sparse neural net. A random mask is used to get rid of say 95% of the weights. Top-k magnitude selection is used on each layer of dot products and only those selected connect forward.
I pointed out that maximum magitude is correlated with minimum angle between input vector and weight vector. And minimum angle is correlated with reduced noise sensitivity. You can see discourse numenta for the argument.
In fact a dot product has a critical zone within which it displays error correction. The two factors are angle and distribution. A single non-zero input is not distributed, all inputs equal and non-zero is fully distributed. No talk of this in the neural network books suggests poor scientific methodology and castles being built on foundations of sand.
@citizizen 3 роки тому
If it is hard to do visual stuff. A computer do analysis in different forms.. Like verbal methods for visual patterns.. like a language.
Can't it be that some visual stuff has verbal side to it? Not sure here. Perhaps energy based models and language based techniques can be combined. So energy might be like 'emotion' of the objects used, and this energy with verbal material...
Perhaps a joke...=> p2p, internet, sharing resources by any counting device..
@seanreynoldscs 2 роки тому
no object permanence is learned. that is why peakaboo is so fun for kids.
@XX-vu5jo 3 роки тому ⁺²
54:17 he Wants it!!!
@Zantorc 3 роки тому
This is _the_ cat; This is _our_ cat; This is _his_ cat; This is _her_ cat; This is _their_ cat; This is _somebodies_ cat; This is _ones_ cat; This is _zir/hir/eir/vis/LGBTQ+ pronouns_ cat; This is _Yannic's_ cat;...
@NeoShameMan 3 роки тому
human are probably few shoot learner has proven by the subliminal effect (ie one input (1 frame) isn't enough),
and human brain is full of feedback loop, that work as working memory, few shoot with memory is probably what is happening , with continuous learning to bake the data. Also human training is quite long, baby don't wake up talking and walking day one, AND there is a bunch of autonomous function baked in that is override by the training through time.
@charlesquarra5050 3 роки тому
Liking at 27:30 how Yannic cuts through all the BS terminology in behalf of us. Yes, "energy" in ML context is just another notch of upward trending terminology entropy for the usual concept of loss. Nothing special or mystical about it. Thanks Yannic for fighting the good fight of terminology entropy reduction
@jackdkendall 3 роки тому ⁺¹
Energy based models have been around far longer than deep learning. Not to mention that it's not even the same thing as a loss function. These distinctions exist for a reason.
@lucathiede9238 3 роки тому
No it is not, energies have explicitly a rotation free gradient field, as opposed to for example the GAN objective. Also as another commentator pointed out, another way to look at it is, that energies are explicitly linked to probabilities, the lower the energy the higher the probability
This concept does simply not make sense for GAN objectives (this comment obviously refers to the OG comment, not to the previous answer)
@Cl0udn1n3 Рік тому
A F Of XY energy based model. Foxy energy is what I always look for when on a train. #NyuWorldOrder #SoylentGarchBurgerQlists #yannie-chan
@wilhelmvanbabbenburg8443 Рік тому
25:30 😂😂

Наступне

Автоматичне відтворення

DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained)