This is one of the best lectures you gave in this series, super clear, very helpful and even enjoyable, thanks to this wonderful demo. Seriously, well done!
Congrats on making it all the way through. It's not easy!
Good god! I've watched the whole series and this by far was the best one! Thank you Prof Weinberger for making these available!
I had goosebumps at 23:03 when he gives the other perspective on neural networks. HOW DID I NOT THINK OF THAT?
Best Teacher ever in my opinion!!!
This was one of the best explanations I have heard on neural networks and really cool demo at the end. I suspect you could have chosen a better picture for yourself though.
You are the best professor I have ever come across. Thanks a lot! This world needs more people like you!
I can not thank you enough professor. This is extremely helpful to me. I idolize you
Everything I have seen from this lecture is absolutely fantastic! Thank you very much for uploading this. Your enthusiasm while teaching makes it so fun to learn, and I'm very glad that I don't have to rely completely on my professor's lectures now.
Never heard such a beautiful explanation of deep learning before. Prof Kilian is worth a thousand blogs!!
I have taken three graduate courses in ML and data analytics, and this one still inspires me a lot!
After Gilbert Strang's 18.06 sets the intuition right for vector spaces, you have been doing exactly the same with ML algorithms. We are more than privileged to go through your lectures. This is a great service to mankind. Namaste from India!
He is undeniably one of the best instructors. He effortlessly simplifies complex topics, presenting them in an engaging and entertaining manner. His ability to use humor not only makes the learning experience enjoyable but also serves as a powerful teaching tool for machine learning concepts. Additionally, I was impressed by his humble and excellent personality, which greatly enhances the overall learning environment. His passion for the subject is palpable, and it genuinely enriches the course
One of the best machine learning lectures. Thank you very much, professor.
- So... do I study OR do I have some fun? - That's what I was asking myself an hour ago while thinking about watching The Expanse.
Watched this lecture. An OR became an AND. I learned AND I laughed a lot.
You are so good it even defies logic. You are the best. Thank you very much!
This was very helpful. Your understanding and the way you teach are amazing. Thank you very much!!!
I agree!!
4:35 Why we use ReLU instead of sigmoid
6:00 Sigmoid is flat at ends and has no gradient
9:45 Deep learning scales linearly with more data. Kernels are quadratic
15:00 Discussion begins
17:00 What’s going on inside a neural network
23:50 Piecewise linear approximation
28:00 With a neural network you can approximate any smooth function arbitrarily closely
29:45 Discuss layers
33:00 There is no function that you can learn with a deep network that you can’t also learn with a shallow network with just one layer
35:30 The benefit of multiple layers
37:00 The exponential effect of multiple layers
38:00 A few small matrices have the same expressive power
40:00 Why do deep networks not overfit…SGD
41:10 Demo
45:30 Second demo. Hot or not facial dataset
Wow, the way you explained the concept of layers, and the demo at the end. What a JOY it must be to be present physically in your class.
😍🤩
Really great analogy for the deep neural network.
Watched the entire semester course in 22 days. Got every single explanation. Super Clear, Extremely awesome
This is the best lecture on deep learning I have ever seen.
Simply amazing lecture! Loved every bit of it
Thank you so much Kilian for this beautiful series of lectures on Machine Learning.
I had a doubt regarding the intuition you gave for neural networks. From what I have understood, when we train a NN, we learn a mapping from a vector space where our data is highly non-linear to a vector space where our data is arranged much less 'complexly' (linearly separable for classification, linear for regression). Subsequently, our learned weights represent the nonlinear approximation of the hyperplane in the original vector space that was required in the first place to accomplish whatever task we have to do.
So, I just wanted to confirm that both these intuitions go hand in hand, right? As in, the learned weights give us a mapping and can also be seen as representing a non-linear hyperplane in our original vector space, as shown by you. Furthermore, for models like the decoder in an autoencoder, is there an intuition of a hyperplane? The mapping intuition seems to be the only one that works in that case, because we map our latent representation into a vector space with a greater number of dimensions, but the idea of a hyperplane being fit doesn't seem appropriate. Is it that we consider every pixel of the output as an individual classification or regression (depending on what loss we choose)?
It would be a great help if you could help me with this.
This was an insanely good explanation.
I bookmarked your course link www.cs.cornell.edu/courses/cs4780/2018fa/lectures/ and did not see any videos there; I was worried that I would not be able to watch your lectures. And here they are, thank you very much. I don't have enough words to say how much I appreciate your teaching. 3 lectures left to go...
After all this time I still come back here and try to comprehend again what is going on. Best lecture ever 🤖
amazing!
Fun to learn :)
Amazing lecture as usual....
Thanks a lot
best demo in the world
so amazing!!!
legendary!
They are called Matryoshka ))
Hi prof Weinberger,
I love these lectures so much and I'm so glad I learned my first machine learning class from you, thanks a lot!! One more thing: I'm really interested in the projects you mentioned in class. I found the homeworks posted online, but I just can't find any information about the projects. Could you please post these online so I have a chance to try the stuff I learned from the classes?
Thank you professor !
Hello Prof Weinberger,
I'm unable to understand how the regression model (which can fit nonlinear data using multiple lines) can be extended to a classification problem. Using a combination of multiple lines with ReLU activations would essentially approximate a non-linear function (please correct me if I'm wrong). For a classification problem with 2 classes, can this nonlinear function (a combination of multiple lines with activations) itself be treated as the decision boundary? If not, how exactly is nonlinearity achieved in a decision boundary?
I'm sorry if I'm mixing two different concepts up.
Thank you.
If you have a classification problem, you typically let the neural network have one output per class (which you normalize with a softmax function). Each output predicts the probability that the input belongs to that particular class.
Actually, this is similar to multi-class logistic regression. If you have k classes, then for logistic regression people just train k logistic regression classifiers, each one deciding whether the input belongs to a particular class or not. Here, too, the outputs are normalized with a softmax function. Essentially you use that classifier as the last layer of the neural network. Hope this helps ...
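A minimal sketch of the softmax output layer described above, in plain NumPy (the logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability; the result is unchanged
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical raw network outputs (logits) for k = 3 classes
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
# probs is non-negative, sums to 1, and the largest logit
# gets the largest class probability
```

In a real network these logits would be the last linear layer's output, and training would use a cross-entropy loss on `probs`.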
@kilianweinberger698 Thanks a lot professor!
Hello Prof Weinberger,
your insight on piecewise approximation really shed light on my understanding of NNs, thanks for sharing your amazing lessons.
Actually I keep wondering what happens with DNNs, and how that piecewise approximation propagates along the subsequent layers.
Is it reasonable to say that in the next layers each piecewise function from the previous layer is approximated with local piecewise approximations, getting closer and closer to the real function's curvature?
In the subsequent layers you are again building piecewise linear functions - however, out of the piecewise linear functions from the earlier layers. In fact, the number of “kinks” (i.e. nonlinearities) in your piecewise linear function grows exponentially with depth, which is why deep neural networks are so powerful (and why you would need exponentially many hidden nodes with a shallow one-hidden-layer network).
Imagine the first layer has 10 nodes, the second layer 20 nodes, and the third layer is the output (one node). Each node in the second layer is then a piecewise linear function consisting of 10 non-linearities. The final function is a linear combination of 20 functions, each consisting of 10 non-linearities, so you end up with 200 linear parts.
Hope this helps ...
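The kink-counting argument above can be checked numerically. A small sketch, under my own assumptions (a 1-D input and random weights, so the exact count varies with the seed): a ReLU network is exactly piecewise linear, so the slope between grid points is piecewise constant, and every slope change marks a kink.

```python
import numpy as np

rng = np.random.default_rng(0)

# a tiny 1-D ReLU network with made-up sizes: 10 hidden units, then 20, then 1 output
W1, b1 = rng.normal(size=(10, 1)), rng.normal(size=10)
W2, b2 = rng.normal(size=(20, 10)), rng.normal(size=20)
W3, b3 = rng.normal(size=(1, 20)), rng.normal(size=1)

xs = np.linspace(-5, 5, 20001)
h1 = np.maximum(0, xs[:, None] @ W1.T + b1)   # (N, 10)
h2 = np.maximum(0, h1 @ W2.T + b2)            # (N, 20)
ys = (h2 @ W3.T + b3).ravel()                 # (N,)

# count grid intervals where the slope changes ("kinks")
slopes = np.diff(ys) / np.diff(xs)
kinks = int(np.sum(~np.isclose(np.diff(slopes), 0.0, atol=1e-6)))
```

With random weights `kinks` will typically far exceed the 10 kinks a single 10-unit hidden layer could produce on its own, illustrating how composition multiplies the number of linear pieces.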
@kilianweinberger698 thank you prof for your kind reply.
I got it, again a great explanation. Simply put, the 2nd layer's input is no longer linear. The 2nd layer's functions are evaluated on values from a piecewise curve instead of from a line.
11:00 what are the kernels you are talking about?
Basically, functions that map input vectors into another space. Then you use this new mapped version as your input for the linear classifier. The linear classifier thinks it is in a higher dimension, and since it is easier to separate data points when they are far apart from each other (which is the case in higher dimensions), you usually separate them successfully. There is more information in the previous videos and lecture notes. So, it is not GPU functions, if that is what you were asking.
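As a toy illustration of such a mapping (this example is my own, not from the lecture; real kernel methods apply such maps implicitly via inner products): 1-D data with one class in the middle cannot be split by a single threshold, but after the feature map φ(x) = (x, x²) a linear threshold suffices.

```python
import numpy as np

# toy 1-D data: the positive class sits in the middle, the negative
# class on the outside, so no single threshold on x separates them
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# feature map phi(x) = (x, x^2): in this 2-D space the classes become
# linearly separable (a threshold on the second coordinate suffices)
phi = np.column_stack([x, x**2])
pred = (phi[:, 1] < 1.0).astype(int)
# pred now matches y exactly
```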
Which of your previous lectures are important requirements for understanding this one?
And they told me Germans can't make jokes.
Kilian Weinberger single handedly raising the average german comedy coefficient by 1 standard deviation.
ohhhh my god. This was funny!!
🤯 35:00
Damn, I want to give 100 thumbs up for the demo but YouTube allows me to give just one :-(
Amazing stuff
2:42 start
Why do they have a piano in their class?
kernel matrix quadratic, DL linear
I mean why don't you try stand-up. Or have you tried it before?
Anyone else accidentally raise their hands?! Just me, cool
No nesting dolls allowed!
hahah, the guy who came up with the one-layer theory is no different from Marxist-Leninist philosophy :D:D it has fooled who knows how many generations :D:D:D:D
You know why
Ears
hahaha, lazy learning is good :D:DD:D lazy is a good characteristic, not active :D:D:D active is so bad :D the new era, the new characteristics, the new elites :D:D:D