Been reading and doing ML for the last 5 years, and every time I hear Yan I always get to know something I don't know or missed. Thanks to both Alfredo and Yan for this amazing course. Lovin it!!!!
Yann, with two n's 😉
You're welcome 😊😊😊
@@alfcnz mah bad 😄
@@alfcnz then it should also be LeCunn, with 2 n's (to be read in Igor's voice)
Who's Igor? 😮😮😮
yeah who's Igor?
@8:33 One of the reasons why ReLU works better in deep networks than, say, the sigmoid is that in the backward pass the gradient shrinks after each sigmoid nonlinearity (it gets multiplied by at most 0.25, the maximum of the sigmoid's derivative), whereas with ReLU-like nonlinearities the gradient does not shrink layer after layer (the derivative is 1 on the positive part).
Assuming no normalisation layer is used in-between, yes.
What would be bounded?
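Back to the shrinking-gradient point above: a minimal numerical sketch (assuming PyTorch; the depth, width, and default initialisation are arbitrary choices, not anything from the lecture):

```python
import torch
import torch.nn as nn

def input_grad_norm(act, depth=20, width=64):
    # Stack `depth` linear layers, each followed by the given activation.
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    net = nn.Sequential(*layers)

    x = torch.randn(8, width, requires_grad=True)
    net(x).sum().backward()          # backward pass through the whole stack
    return x.grad.norm().item()      # gradient magnitude reaching the input

torch.manual_seed(0)
print("sigmoid:", input_grad_norm(nn.Sigmoid))  # tiny: each sigmoid scales gradients by at most 0.25
print("relu:   ", input_grad_norm(nn.ReLU))     # much larger: the derivative is 1 wherever the unit is active
```

With a normalisation layer in between (as Alfredo notes) the picture changes, since the activations are kept away from the sigmoid's saturated regions.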
Awesome lecture. Particularly intrigued by the mixture of experts part, definitely trying this out.
😊😊😊
First of all, thank you very much for doing this. The world owes you!
What is the meaning of the update rule when the parameter vector is the output of a function [at 1:31:21]? As the name implies, w is the output of a function, so how can you update the output?
Through changes to its input.
Yann is showing how gradient descent changes its direction when an input is a function of another input.
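If I'm reading that part of the lecture correctly, the resolution is that $w$ is never updated directly: gradient descent is applied to $u$, the actual free parameter, and $w$ just follows through $H$. A sketch of the chain rule involved (my notation, which may differ slightly from the slide):

```latex
% w is produced by another function: w = H(u).
% Gradient descent is applied to u; the change in w is induced through H.
\[
  w = H(u), \qquad
  \Delta u = -\eta \left(\frac{\partial H}{\partial u}\right)^{\!\top}
             \left(\frac{\partial C}{\partial w}\right)^{\!\top}, \qquad
  \Delta w = \frac{\partial H}{\partial u}\,\Delta u
           = -\eta\, \frac{\partial H}{\partial u}
             \left(\frac{\partial H}{\partial u}\right)^{\!\top}
             \left(\frac{\partial C}{\partial w}\right)^{\!\top}.
\]
```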
At 1:14:38 Yann says that we can do non-linear classification with a mixture of linear classifiers which are gated. Isn't that still a linear classifier? Why is it non-linear? What is it that makes the classification non-linear?
A linear classifier has a single hyperplane cutting the data space. Here we partition the space in two and then in two again, so we end up with something that is clearly not linear. It actually smells a lot like a decision tree, where you have an iterative subdivision of the data space.
@@alfcnz Thank you!!
You're welcome 😊
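To make the "gated linear classifiers give a non-linear decision" point concrete, here's a minimal hand-wired sketch (assuming PyTorch; the gate, experts, and XOR-style data are made up for illustration, nothing is learned here):

```python
import torch

# Two linear "experts", each a linear score w·x + b on the second coordinate.
w1, b1 = torch.tensor([0.0,  1.0]), 0.0   # expert 1: positive when x2 > 0
w2, b2 = torch.tensor([0.0, -1.0]), 0.0   # expert 2: positive when x2 < 0

def gater(x):
    # Soft gate on the first coordinate: ≈1 when x1 > 0, ≈0 when x1 < 0.
    return torch.sigmoid(10.0 * x[:, 0])

def mixture_score(x):
    g = gater(x)                      # convex combination of the two linear scores
    return g * (x @ w1 + b1) + (1 - g) * (x @ w2 + b2)

# XOR-like labels: class 1 iff x1 and x2 have the same sign.
# No single linear classifier can separate these four points.
x = torch.tensor([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = torch.tensor([1.0, 0.0, 0.0, 1.0])
pred = (mixture_score(x) > 0).float()
print(pred, y)                        # the gated mixture classifies all four correctly
```

With a hard gate this is exactly one linear rule per region of the input space, i.e. piecewise linear overall, which is why it is no longer a linear classifier globally.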
No one mentioned a ReLU here. 🤨🤨🤨
How is that addressing Aditya's question?
Do the ideas for the update rule for $w$ in the example @1:33:00 come from $dw = \frac{\partial H}{\partial u} \cdot du$?
There's a better comment explaining the math under the previous year's video:
ua-cam.com/video/FW5gFiJb-ig/v-deo.html&lc=UgxPIlrkdcQAncIPyQ14AaABAg
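That's my reading too: the step in $w$-space is the Jacobian of $H$ applied to the step taken in $u$-space. A quick autograd check of $dw \approx \frac{\partial H}{\partial u}\, du$ (assuming PyTorch; $H$ below is an arbitrary toy transformation, not the one from the slide):

```python
import torch

def H(u):
    # Toy parameter transformation producing w from u (made up for illustration).
    return torch.stack([u[0] * u[1], torch.sin(u[0]), u[1] ** 2])

u = torch.tensor([0.7, -1.3])
du = 1e-4 * torch.tensor([1.0, 2.0])           # a small step in u-space

J = torch.autograd.functional.jacobian(H, u)   # dH/du, shape (3, 2)
dw_linear = J @ du                              # first-order prediction of the change in w
dw_actual = H(u + du) - H(u)                    # the change actually produced by H

print(torch.allclose(dw_linear, dw_actual, atol=1e-6))  # True: dw ≈ (∂H/∂u) du
```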
Regarding the mixture of experts, would it make sense to train the expert nets separately?
I mean something like training Expert 1 on a dataset that is for sure Catalan, and Expert 2 on another language. After that we'd have those trained models, and a general dataset (with multiple languages) could be used to train only the gater (the experts' weights wouldn't change anymore).
Sure you can do that. It turns out that the joint model works better because it can exploit the similarity between the two contexts.
@@alfcnz makes sense, thanks!
You’re welcome 😊😊😊
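For what it's worth, the "pre-train the experts, then train only the gater" variant from this exchange is straightforward to express in code. A minimal sketch (assuming PyTorch; the shapes and the two-expert setup are made up), where freezing the experts is just a matter of turning off their gradients:

```python
import torch
import torch.nn as nn

n_features, n_classes, n_experts = 100, 5, 2

# Pretend these were already trained separately, one per language (e.g. Catalan vs. another).
experts = nn.ModuleList([nn.Linear(n_features, n_classes) for _ in range(n_experts)])
for p in experts.parameters():
    p.requires_grad_(False)                    # freeze the experts: only the gater will learn

gater = nn.Sequential(nn.Linear(n_features, n_experts), nn.Softmax(dim=-1))
optimiser = torch.optim.SGD(gater.parameters(), lr=0.1)

def forward(x):
    g = gater(x)                                        # (batch, n_experts)
    outs = torch.stack([e(x) for e in experts], dim=1)  # (batch, n_experts, n_classes)
    return (g.unsqueeze(-1) * outs).sum(dim=1)          # gated combination of expert outputs

# One training step on the mixed-language data, updating the gater only.
x, y = torch.randn(32, n_features), torch.randint(0, n_classes, (32,))
loss = nn.functional.cross_entropy(forward(x), y)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```

Training everything jointly is the same code with the freezing loop removed and all parameters handed to the optimiser, which is the variant Alfredo says tends to work better.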
Also, do you have quizzes for this class? On the course website I see mainly homework problems. Thanks!
Hello, Alfredo :) Thank you for the video! :))
Hello 👋🏻 You're most welcome 😇
Thank you Alfredo,
and I want to know what the 'z' is in the attention architecture example Yann introduced. Does the 'z' also come from the training data? Thank you!
You need to point out minutes:seconds for me to be able to address your question.
@@alfcnz ~59:02 the topic is "multiplicative modules", thank you
z is a latent input. Latent means it's missing from the data set. Hence, you need to infer it by minimisation of the energy using GD. We've extensively covered latent variable energy based models in previous lectures.
@@alfcnz Thank you,
Sorry, I didn't get this information in a previous lecture (if we're going in the order of the videos), but I noticed that there is a lecture called "05.1 - Latent Variable Energy Based Models (LV-EBMs), inference" among the later videos. Thank you!
My bad, I apologise.
I thought this was the lecture on associative memories. These topics are only briefly introduced here and will be extensively covered later on.
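Since the latent-variable machinery is only touched on here, a minimal sketch of what "infer z by minimising the energy with gradient descent" can look like (assuming PyTorch; the energy function below is a made-up example, not the one from the lecture):

```python
import torch

def energy(x, y, z):
    # Made-up energy: how badly (x, z) reconstructs y through a fixed non-linear map.
    return ((torch.sin(x + z) - y) ** 2).sum()

x = torch.tensor([0.3, -1.2, 0.8])        # observed input
y = torch.tensor([0.9,  0.1, 0.5])        # observed target
z = torch.zeros(3, requires_grad=True)    # latent: absent from the data set, so we infer it

opt = torch.optim.SGD([z], lr=0.5)
for _ in range(200):                      # gradient descent in z-space only
    opt.zero_grad()
    F = energy(x, y, z)
    F.backward()
    opt.step()

print(z.detach(), energy(x, y, z).item())  # z* (locally) minimising the energy
```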
Hello Alfredo, I can't find these slides on the website. Am I just not looking in the right place, or are they missing?
Click on the icon next to the lecture title on the website.
@@alfcnz There is only a camera icon and it sends me to this video :(
Oh, is this link missing?
drive.google.com/file/d/1IaDI6BJ6g4SJbJLtNjVE_miWRzBH1-MX/
Feel free to send a PR if it's correct.
Hey @Alfredo! I'm new to the SP21 course. Is there an order to the videos? You've previously uploaded videos numbered 01, 02, 03... but your recent videos are 01L, 02L, ... What does the "L" mean? Should I watch 01L after 01? I am trying to understand the naming convention here. Thanks! :)
L stands for lecture. I was not planning to release them, initially.
The order / index is on the class website, the official content organisation homepage.
@@alfcnz Thanks for clarifying! The class website isn't fully updated yet, so I was a bit confused. Will you be uploading more Lecture videos?
I've just published the first theme. You'll have a new one every week. For the latest news you want to follow me on Twitter, where I announce all these things.
@@alfcnz Thank you very much! 😇
Why doesn't ReLU have a variant with ReLU(x) = -x for x < 0?
What do you need the identity function for? 🤨🤨🤨
@@alfcnz I meant ReLU(x) = abs(x). It would still be non-linear, but I guess it serves no purpose.
Absolute value has been used as a non-linear function, but it wouldn't let you turn off specific inputs. So, the output would always be a non-zero piecewise-linear combination of the inputs.
@@HassanAliAnwar Hey! I think this is covered at the start of the lecture, when Yann takes a question about non-monotonic activation functions. To summarise, I think the intuition he gives is that, since there are two solutions for x whenever f(x) equals some value (except for x = 0), the gradient descent step could be taken in multiple directions, which can, but not always, lead to less efficient learning. E.g. if abs(x) = 2, do we walk in the direction of the gradient being -1 or in the direction of the gradient being 1?
Yup.
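The ±1 point is easy to check directly; a tiny sketch (assuming PyTorch):

```python
import torch

for x0 in (2.0, -2.0):
    x = torch.tensor(x0, requires_grad=True)
    torch.abs(x).backward()
    # Both inputs give |x| = 2, but the gradients point in opposite directions.
    print(f"x = {x0:+.0f}, |x| = 2, d|x|/dx = {x.grad.item():+.0f}")
```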