05L - Joint embedding method and latent variable energy based models (LV-EBMs)
- Published 8 Jun 2024
- Course website: bit.ly/DLSP21-web
- Playlist: bit.ly/DLSP21-UA-cam
- Speaker: Yann LeCun
Chapters
00:00:00 - Welcome to class
00:00:39 - Predictive models
00:02:25 - Multi-output system
00:06:36 - Notation (factor graph)
00:07:41 - The energy function F(x, y)
00:08:53 - Inference
00:11:59 - Implicit function
00:15:53 - Conditional EBM
00:16:24 - Unconditional EBM
00:19:18 - EBM vs. probabilistic models
00:21:33 - Do we need a y at inference?
00:23:29 - When inference is hard
00:25:02 - Joint embeddings
00:28:29 - Latent variables
00:33:54 - Inference with latent variables
00:37:58 - Energies E and F
00:42:35 - Preview on the EBM practicum
00:44:30 - From energy to probabilities
00:50:37 - Examples: K-means and sparse coding
00:53:56 - Limiting the information capacity of the latent variable
00:57:24 - Training EBMs
01:04:02 - Maximum likelihood
01:13:58 - How to pick β?
01:17:28 - Problems with maximum likelihood
01:20:20 - Other types of loss functions
01:26:32 - Generalised margin loss
01:27:22 - General group loss
01:28:26 - Contrastive joint embeddings
01:34:51 - Denoising or mask autoencoder
01:46:14 - Summary and final remarks
Love the energy from Prof. Yann LeCun; his excitement about the topic and the small smiles he has when talking about how fresh this content is are amazing. Thanks a lot, Prof. Alfredo!
😄😄😄
The most important video on the internet for computer vision researchers. I rewatch it several times a year.
😀😀😀
Gold. I've watched last year's lectures and I'm filling the gaps with this year's ones.
💛🧡💛
A cool thing about prediction systems is that they can also be used to predict the past, not only the future. For example, if you see something falling, you intuitively predict both where it is going and where it came from.
Again, @Alfredo Canziani, thank you very much for making this public; this is amazing content.
I have several questions (I refer to the timestamps in the video):
16:34 and 50:43 => An unconditional model is when the input is partially observed but you don't know exactly what part.
- What is test/inference in these unconditional EBM models? Is there a proper split between training and inference/test in the unconditional models?
- How do models like PCA or K-means fit here, i.e. what are the partially observed inputs Y? For example, in K-means you receive all the components of Y; I don't see that they are partially observed.
25:10 and 1:01:50 => With the joint embedding architecture
- What would inference be with this architecture, inferring a Y from a given X by minimizing the cost C(h, h')? I know that you could run gradient descent on Y backwards through the Pred(y) network, but the purpose of inferring Y given X in this architecture is not clear to me.
- What does the "Advantage: no pixel-level reconstruction" in green mean? (I suspect this may have something to do with my question just above.)
- Can this architecture also be trained as a latent-variable EBM, or is it always trained in a contrastive way?
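(Not an official answer, just my own toy sketch of how K-means can be read as an unconditional latent-variable EBM: Y is fully observed, and it's the latent code z, i.e. the cluster assignment, that is unobserved and minimized over. All names and numbers below are my assumptions.)

```python
import numpy as np

# Hypothetical sketch: K-means as an unconditional LV-EBM.
# Energy E(y, z) = ||y - W z||^2, where z is a one-hot latent code and the
# columns of W are the K centroids. Free energy F(y) = min_z E(y, z).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))  # 3 toy centroids in R^2 (made-up values)

def free_energy(y, W):
    # Inference over the latent: pick the centroid minimizing the energy.
    dists = ((W - y[:, None]) ** 2).sum(axis=0)  # energy for each one-hot z
    z = np.argmin(dists)                         # optimal latent code
    return dists[z], z

y = W[:, 1] + 0.01 * rng.normal(size=2)  # an observed point near centroid 1
F, z_star = free_energy(y, W)            # inference recovers z_star = 1
```

So Y itself is fully given; the "unobserved" part lives in the latent variable.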
Perfect. Thank you so much :)
😇😇😇
Thank you, Alfredo! :)
You're welcome 🥰🥰🥰
I guess I need to watch it many times to get what Yann was trying to explain :)
It's alright. It took me ‘only’ 5 repetitions 😅😅😅
Hi Alfredo, thank you for making the course public. It is super useful, especially to those who are self-learning cutting-edge AI concepts, and I've found EBMs fascinating.
I have a question regarding EBMs: how should I describe "overfitting" in the context of an EBM? Does that mean the energy landscape has a very small volume surrounding the training data points?
You're welcome. And yes, precisely. And underfitting would be having a flat manifold.
Thank you so much for making these lectures public!
The slides are very difficult to read because of being overlaid over Yann's face and the background image. I imagine this could be an accessibility issue for anyone with vision impairments, too.
That's why we provide the slides 🙂🙂🙂
so that is what contrastive learning is all about!
It seems so 😀😀😀
Very Interesting
🧐🧐🧐
Hi, Alfredo 👋
Am I missing something, or does this lecture not cover the "non-contrastive joint embeddings" methods Yann was talking about at 1:34:40? I also briefly checked the next lectures but didn't find anything related. Could you please point me to it? 😇
Thank you for the video, btw, brilliant as always :)
If you open the slides for lecture 6 you can find a whole page on non-contrastive embeddings.
Thanks a whole bunch for this lecture; after two times I think I'm starting to grasp it :) One thing that confuses me though: in the very beginning, it is mentioned that x may or may not be adapted when going for the optimum location. I cannot quickly come up with an example where I would want that. Wouldn't that mean I am just discarding the info in x and, in the case of modeling with latent variables, my inference now becomes a function of z exclusively?
You need to write down the timestamp in minutes:seconds if you want me to be able to address any particular aspect of the video.
@@alfcnz Thanks for taking the time to respond! Here we go, 15:20
Great lecture, thanks a lot. But it would also be great if you could tell us a reference book or publications for this lecture. Thanks a lot in advance.
I'm writing the book right now. A bit of patience, please 😅😅😅
@@alfcnz Looking forward to the book, Alfredo. Can you give a ballpark estimate of the 'patience' here? :-)
End of summer ‘22 the first draft will see the light.
@@alfcnz omg, I'm so excited
Dr. Yann only mentioned this in passing at 20:00, but I just wanted to clarify: why do EBMs offer more flexibility in the choice of scores and objective functions? It's from page 9 of the slides. Thank you!
Never mind, I should have just watched on: at 1:04:27 Yann explained how probabilistic models are EBMs where the objective function is the NLL.
Then, by extension, the scoring function of a probabilistic model is probably restricted to being a probability.
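(To make that link concrete, here is a toy sketch of my own, with made-up energies and β, not from the slides: the Gibbs distribution turns energies into probabilities, and the NLL objective falls out of it.)

```python
import numpy as np

# Toy sketch: recovering a probabilistic model from an EBM via the Gibbs
# distribution P(y|x) = exp(-beta * F(x, y)) / sum_y' exp(-beta * F(x, y')).
beta = 1.0
F = np.array([0.5, 1.0, 3.0])  # assumed energies for three candidate y's

logits = -beta * F
P = np.exp(logits) / np.exp(logits).sum()  # softmax over -beta * F

nll = -np.log(P[0])  # NLL of the first candidate under this model
```

The lowest-energy candidate gets the highest probability, which is exactly the restriction a probabilistic model imposes on the score.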
the info at 18:17 is underrated
What are the research papers from Facebook mentioned around 1:30?
All references are written on the slides.
At that timestamp I don't hear Yann mentioning any paper.
Hi Alfredo, which book on DL do you recommend that has the same sort of structure as the content of this course?
The one I’m writing 😇
@@alfcnz Great, I think it would be a great companion to these lectures, looking forward to it.
How do you use autograd in pytorch for "nonstochastic" gradient descent?
probably conjugate gradient
If the objective you differentiate is exact (not a per-batch approximation of the dataset loss), then you're performing non-stochastic GD. The stochasticity comes from the approximation to the objective function.
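(A minimal sketch of what that looks like, assuming a toy least-squares problem of my own: the whole dataset enters each step, so autograd returns the exact, non-stochastic gradient.)

```python
import torch

# Full-batch ("non-stochastic") gradient descent with PyTorch autograd.
# Toy linear-regression data; the true weights are an assumed example.
torch.manual_seed(0)
X = torch.randn(100, 3)
y_true = X @ torch.tensor([1.0, -2.0, 0.5])

w = torch.zeros(3, requires_grad=True)
for _ in range(500):
    loss = ((X @ w - y_true) ** 2).mean()  # exact full-dataset objective
    loss.backward()                        # exact gradient, no batch noise
    with torch.no_grad():
        w -= 0.1 * w.grad                  # deterministic descent step
    w.grad.zero_()
```

Because there is no mini-batch sampling, every run with the same data follows the identical trajectory.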
Is there any book that i can read from to know more about these methods. thank you.
I'm writing the book. It'll take some time.
@@alfcnz thank you so much!
❤️❤️❤️
Not to nitpick, but I believe there's a minus missing at 49:22 in the denominator of P(y|x), at the far end (right side of the screen) behind the beta.
Oh, yes indeed! Yann is a little heedless when crafting slides 😅
@@alfcnz These things happen. I just wanted to make sure that I'm following the calculations correctly. Thanks for the confirmation.
Sure sure 😊
Interesting: energy-based models do something very similar to metric learning. (Or am I missing something?)
Indeed metric learning can be formulated as an energy model. I'd say energy models are like a large umbrella under which many conventional models can be recast.
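(A tiny sketch of that recasting, with a toy linear embedding I made up for illustration: the energy of a pair is just the distance between their embeddings, so similar pairs get low energy.)

```python
import numpy as np

# Metric learning viewed as an EBM: E(x, y) = ||h(x) - h(y)||, with an
# assumed toy embedding h(v) = A v.
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])

def energy(x, y):
    # Pairwise energy = distance in embedding space.
    return np.linalg.norm(A @ x - A @ y)

x = np.array([1.0, 1.0])
y_close = np.array([1.1, 1.0])  # a nearby point: low energy
y_far = np.array([3.0, 1.0])    # a distant point: high energy
```

Training the embedding to pull positive pairs together and push negatives apart is then just shaping this energy, contrastive-style.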
I tried to calculate the derivative Yann mentioned (1:07:45), but I'm probably missing something, because my final result doesn't have the integral (only -P_w(.) ...). Is there any supplementary material with these calculations?
Thanks again for your amazing and hard work!
Uh… can you share your calculations? I can have a look. Maybe post them in the Discord server, maths room, so that others may be able to help as well.
@@alfcnz It was my bad. I... misunderstood the formula of P_w(y|x) and thought there was an integral in the numerator (over all y's), but that didn't make any sense to me, so I checked your notes again and... voilà, I got the right answer.
Is the Discord open to us too? I thought it was only for NYU students. I'll definitely join then (learning alone isn't fun :P).
Discord is for *non* NYU students. I have another communication system set up for them.
The French language seems better suited for music... it has a sweet tonality.
🇫🇷🥖🗼
Yannic Kilcher is asking questions, it seems.
Where, when? 😮😮😮
Meta helicopter
:-))
I spent some time deriving the step mentioned at 1:07:44. I made my best effort to get to the final result, but I am not sure my steps are correct; I hope my fellow students can help point out my mistakes. Due to the lack of LaTeX support in UA-cam comments, I'll try to make my steps as clear as possible. I use the partial derivative of the log to get to the second step. Then I use the Leibniz integral rule to move the partial derivative inside the integral in the third step. The rest is pretty straightforward, hopefully. Thank you!
∂/∂w (1/β) log[∫y′ exp(−β Fw(x, y′))]
= (1/β) [1 / ∫y′ exp(−β Fw(x, y′))] · ∂/∂w ∫y′ exp(−β Fw(x, y′))
= (1/β) [1 / ∫y′ exp(−β Fw(x, y′))] · ∫y′ ∂/∂w exp(−β Fw(x, y′))
= (1/β) [1 / ∫y′ exp(−β Fw(x, y′))] · ∫y′ exp(−β Fw(x, y′)) · ∂/∂w (−β Fw(x, y′))
= −[1 / ∫y′ exp(−β Fw(x, y′))] · ∫y′ exp(−β Fw(x, y′)) · ∂/∂w Fw(x, y′)
= −∫y′ [exp(−β Fw(x, y′)) / ∫y″ exp(−β Fw(x, y″))] · ∂/∂w Fw(x, y′)
= −∫y′ Pw(y′|x) · ∂/∂w Fw(x, y′)
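(For anyone who wants to double-check: the identity can also be verified numerically on a discretized y. The toy energy Fw(y) = w·y² below is my own assumption, purely as a sanity check.)

```python
import numpy as np

# Numerical check of the identity
#   d/dw (1/beta) log ∫ exp(-beta F_w(y)) dy = -∫ P_w(y) dF_w/dw dy
# for the assumed toy energy F_w(y) = w * y^2, on a discretized y grid.
beta, w = 2.0, 0.7
ys = np.linspace(-5, 5, 2001)
dy = ys[1] - ys[0]

def log_Z(w):
    # Log-partition function on the grid.
    return np.log(np.sum(np.exp(-beta * w * ys ** 2)) * dy)

# Left-hand side via central finite differences.
eps = 1e-6
lhs = (log_Z(w + eps) - log_Z(w - eps)) / (2 * eps) / beta

# Right-hand side: minus the expectation of dF/dw = y^2 under the
# Gibbs density P_w(y) ∝ exp(-beta * F_w(y)).
P = np.exp(-beta * w * ys ** 2)
P /= P.sum() * dy
rhs = -np.sum(P * ys ** 2) * dy
```

Both sides should agree (analytically they equal −1/(2βw) for this Gaussian case), which matches the final line of the derivation.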
Can you put a link to a LaTeX file? I did the derivative and may be able to help.