These lectures are gold! Thank you so much for putting them online! :-)
Love that you ask people to raise hands "if you understand". It really shows the will to teach is there.
Very comprehensive and clear! Thanks for sharing this video with us.
Great lecture! Now I can understand it. Many thanks from South Korea
Thanks a lot, sir, for making this video. I loved the way you explain each and every step of the proof in an easy way. Again, thank you, @Kilian Weinberger sir.
Professor Kilian, I really wish I get to meet you someday. I can't express how much I appreciate you and value these lectures.
This is the best explanation of the convergence proof on YouTube at the moment.
Great lecture! Thank you Dr. Weinberger!
Amazing explanation!!
It was amazing, Professor. Really helpful.
Currently taking 4780, and I still come home and watch your videos!
great lectures. great teacher
Amazing lectures!!
brilliant lecture indeed!
Great Teacher! I'd never heard of any of this a week ago and I'm able to keep up at each step. Danke schön, Prof. Weinberger. Is it possible to make the placement exam available? Thank you.
This proof is beautiful!
I'm really hoping you still view the comments on these videos.
Is there any way to know what the programming projects involved? The assignments and lecture notes are obviously incredibly useful, but I don't feel confident without doing any coding. I would appreciate it so much if the programming projects, or at least descriptions of them, were made available.
Is it possible to detect divergence of the algorithm during learning, i.e. the case when the data is not linearly separable? Can we infer gamma from the data to check whether we exceeded 1/gamma^2 updates?
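One wrinkle with this idea: gamma is defined through the (unknown) best separating hyperplane, so it can't be computed before training. A common workaround is to cap the number of updates and treat hitting the cap as a hint that the data may not be linearly separable. A minimal sketch (the cap value and the toy data below are my own, not from the lecture):

```python
import numpy as np

def perceptron(X, y, max_updates=10000):
    """Perceptron with an update cap. If the data is separable with
    margin gamma and all ||x|| <= 1, convergence needs at most
    1/gamma^2 updates, so hitting a generous cap suggests the data
    may not be linearly separable."""
    w = np.zeros(X.shape[1])
    updates = 0
    while updates < max_updates:
        made_mistake = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified (or on the plane)
                w += yi * xi            # the perceptron update
                updates += 1
                made_mistake = True
        if not made_mistake:
            return w, updates, True     # converged: a full clean pass
    return w, updates, False            # cap hit: possibly not separable

# Toy separable data: label is the sign of the first coordinate.
X = np.array([[1.0, 0.2], [0.8, -0.1], [-0.9, 0.3], [-1.0, -0.2]])
y = np.array([1, 1, -1, -1])
w, n_updates, converged = perceptron(X, y)
```

If you did happen to know a lower bound on gamma (with inputs scaled so ||x|| <= 1), you could set the cap to 1/gamma^2 exactly, and exceeding it would prove non-separability rather than merely suggest it.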
Does anyone know which 5 inequalities the professor is talking about?
I'm really enjoying your lectures, professor. Is there any way I can access the projects?
At 8:38, why do we rescale w star? Can we not just leave it with a norm of 1?
The intuition (for me) is that wTw* and wTw both grow (at least and at most) linearly in the number of updates M, but wTw* is linear in w while wTw is quadratic in w.
32:39 what are the other 4 inequalities that everyone should know?
I am watching these lectures and wondering if there will be any moment on the data science journey where the matrices will be self-adjoint.
Thank you sir
Where is the Valentine's poem of the proof??!?!?!
17:54 "The HOLY GRAIL weight vector that we know actually separates the data"
love the last story haha
Did anyone manage to find the projects or anything related to this class?
Why didn't you write the second constraint as wT(w+yx) instead of (w+yx)T(w+yx)? I'm confused.
I understand up to the point that w^tw* increases by at least gamma and w^tw increases by at most 1 but I do not understand how this proves that w necessarily converges to w*, could someone help me out please?
I think he means that if the second condition is true, then the only way the inner product of w and w* increases is if they align themselves better than before (cos theta increases), so w is indeed moving towards w*.
Really enjoying your lectures, thank you very much. Do you plan to put this course (along with projects) on Coursera or any other online platform?
Cornell offers an online version through their eCornell program. ecornell.cornell.edu/
@@kilianweinberger698 Thank you very much. I will have a look.
If our data is sparse, won't scaling it into a circle of radius 1 shift it to a dense distribution and cause problems?
There's one thing I couldn't get: why is gamma defined from the "best" hyperplane? If M is bounded by 1/gamma² and gamma could be arbitrarily close to zero (if you picked the worst possible hyperplane, for instance), then the proof would be spoiled.
Oh okay, I get it: finding other, bigger bounds for M says nothing about the lowest bound you found.
How is y^2xTx smaller than one when we know that y^2 is equal to 1? If the xTx term is less than one but positive, doesn't the whole term become greater than 1, breaking the inequality?
y^2 = 1
xTx <= 1, so y^2xTx <= 1
@@consumidorbrasileiro222 Oh yeah, I missed the fact that we're raising the whole term to a power, and if xTx is less than 1 the whole term will be less than 1 as well.
Professor, or anyone, please tell me why we need to consider the effect of the update on
w transpose w star, and
w transpose w.
Please reply!
wTw* increasing means that they are getting more similar. But there is another case, in which w is just being scaled up. By showing that wTw grows by at most 1 per update, we show that w is not simply being scaled, so w and w* must actually be getting more similar.
@@XoOnannoOoX Whoa! Thank you so much 💯
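For anyone following this thread, here is how the two bounds combine; a sketch of the standard argument from the lecture, assuming ||x_i|| <= 1, ||w*|| = 1, and w initialized to the zero vector:

```latex
% After M updates (each adds y_i x_i for a misclassified point):
w^\top w^* \ge M\gamma \quad \text{(grows by at least } \gamma \text{ per update)}
\\
w^\top w \le M \quad \text{(grows by at most } 1 \text{ per update)}
\\
% Cauchy--Schwarz squeezes M between the two bounds:
M\gamma \;\le\; w^\top w^* \;\le\; \|w\|\,\|w^*\| \;=\; \|w\| \;\le\; \sqrt{M}
\;\Longrightarrow\; M \le \frac{1}{\gamma^2}.
```

So neither bound alone proves anything; it is only the squeeze between the linear growth of wTw* and the at-most-square-root growth of ||w|| that caps the number of updates.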
Hi Professor. At 32:04, you write that w.T dot w_star = abs( w.T dot w_star ). How does it follow that the dot product of those two vectors is necessarily positive? My intuition says that the first update of w will point w in the direction of w_star making the dot product positive. It makes sense, but it does not seem a trivial statement to me. w.T and w_star could be pointing in opposite directions and thus yield a negative dot product. What am I missing? :) Thanks.
I have somehow figured out the answer, minutes after posting this question. w starts as the zero vector and w.T dot w_star can only increase after each iteration, by at least gamma. Thus, making w.T dot w_star positive and making the following statement true: w.T dot w_star = abs( w.T dot w_star )
Hi Professor Weinberger, it looks like this lecture is about a smart algorithm created by smart people which can classify data into two classes. But in the first introduction lecture you mentioned that machine learning is about the computer learning to design a program by itself to achieve our goal. So I'm confused: what's the relationship between this perceptron hyperplane algorithm and machine learning? It looks like we humans just design this algorithm, code it into a program, and feed it to a computer to solve the classification problem...
So the Perceptron algorithm is the learning algorithm which is designed by humans. However, given a data set, this learning algorithm generates a classifier and you can view this classifier as a program that is learned from data. The program code is stored inside the weights of the hyperplane. You could put all that in automatically generated C code if you want to and compile it. Hope this helps.
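To make that point concrete: the "learned program" is nothing more than a dot product with the learned weights. A toy sketch (the weight values here are made up for illustration, not learned from real data):

```python
import numpy as np

# Hypothetical weight vector, standing in for what the perceptron
# would produce after training on some dataset.
w = np.array([0.7, -1.2, 0.4])

def classify(x):
    """The learned 'program': its entire behavior is encoded in w."""
    return 1 if w @ x > 0 else -1

# Using the generated classifier on new inputs:
print(classify(np.array([1.0, 0.0, 0.0])))   # → 1
print(classify(np.array([0.0, 1.0, 0.0])))   # → -1
```

As the professor says, you could dump w into auto-generated C code and compile it; the "program" is just these numbers plus a sign function.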
In 27:05 you write:
2y(w^T • x) < 0
Why is it not 2y(w^T • x) <= 0?
Oh yes, good catch,
2y(w^T * x) <= 0 ==> that the data point was classified incorrectly. Even if it was exactly 0, the point lies on the hyperplane and still counts as misclassified.
What are the other 4 inequalities in computer science??
Wondering the same thing
I can't understand something:
M is positive, Gamma is positive then M times Gamma is positive.
After 1 update, M = 1.
(w^T)*(w^*) can be negative, since (w^T) might have started pointing the opposite direction of (w^*)
Then how can (w^T)*(w^*), a negative number be greater than a positive number of 1 times gamma (M*gamma)?
Here, w is initialized as the 0 vector, so (w^T)*(w^*) would be 0. Thus, after the first update, it will be at least gamma. And like this, (at least) gamma keeps getting added at each update, thus making it a positive value.
Refer, ua-cam.com/video/vAOI9kTDVoo/v-deo.html, to see how it converges when w is initialized randomly.
Why is the minimum distance between the point and the hyperplane = inner product between the point and w*?
Because (w^*)T(w^*)=1. (For details check out the detailed proof here: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html )
@@kilianweinberger698 Thank you Professor!! You're a legend!!!
I have a question about convergence. From my understanding, since there are different hyperplanes satisfying the margin, there would be a whole set of valid w*. So w* is not unique, which means that if there exists a set of separating hyperplanes, the algorithm will converge to some hyperplane in that set, but not to a fixed w*. Not sure if I understand correctly.
Hahahaha, see! Sitting in the class is not always the optimal choice :D:D:D:D The teacher led us along and crammed our heads for a while and nobody understood a thing :D:D:D Undeniable evidence, the teacher himself admitted it, it's not us badmouthing him :D:D:D
Would be cool if you also had a course in German.
0:40 Hang on, Mr. Weinberger, are you German? Your last name might give a hint, but you never know. But that German was perfect! Not only the chosen words, but also the way you pronounced them all, was absolutely perfect German.
Yes, I grew up in Bavaria. :-)
Why don't we have Nobel Prizes for computer science??? This algorithm is worth 10 Nobel Prizes, indeed.
plot twist
gamma = 0
M <= 1/gamma^2 = infinity
You defined the margin wrongly.
xt•w is the projection of the vector xt on w.
The distance is ||x-w||.
I think what he said was that the margin is the minimum distance of x to the hyperplane. Since w is the normal direction of the hyperplane, xTw is the projection of x on w, which is the distance from x to the hyperplane.
@@vincentxu1964 only assuming that ||w||=1
@@yrosenstein Yeah I think so. You can take a look at lecture 14. I think he redefines the margin with any w.
He said the distance to the hyperplane defined by w, not the distance to w itself. The distance of x to the hyperplane is equal to the projection of x onto w.
No, the Prof is correct. The margin is defined correctly as well (i.e. the distance of the closest point from the hyperplane). Read the proof here: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html#:~:targetText=Margin,closest%20point%20across%20both%20classes.
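To settle the thread: both readings are consistent once you write out the standard point-to-hyperplane distance (with the hyperplane through the origin, as in the lecture):

```latex
% Distance of a point x to the hyperplane \{z : w^\top z = 0\}:
d(x) = \frac{|w^\top x|}{\|w\|},
% which reduces to the projection length |w^\top x| exactly when \|w\| = 1,
% hence the margin in the proof:
\gamma = \min_i \, \left|{w^*}^\top x_i\right| \quad \text{with } \|w^*\| = 1.
```

So ||x - w|| is the distance to the point w, not to the hyperplane; the projection onto the unit normal w* is the correct distance.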
Starts at 0:53
At 43:02, that face on the blackboard