The best machine learning lectures I've listened to online. Thanks, Professor Yaser.
Such a great professor, he is so clear in his explanations.
agreed.
Indeed, he is an amazing professor. It is clear that he knows his mathematics. He is confident in his responses and goes into a little more depth than the usual machine learning professor. I really appreciate his desire to teach the theory behind the applications.
I’m thrilled to have studied at Caltech for this course and actually talk to him! He’s so nice!
@@cosette8570 What is your annual salary now ?
@@sivaramkrishnanagireddy8341 Very weird question and I'm not surprised the person you replied to didn't bother to respond.
Prof. Yaser Abu-Mostafa is by far the best lecturer I've ever seen. Well done, great course!
Thank you, Prof. Yaser Abu-Mostafa for these lectures. The concepts are concisely and precisely explained. I especially like how he explains the equations with real world examples. It makes the course material much more approachable.
Wow. This lecture was great. The difference between using linear regression for classification and using linear classification outright couldn't have been explained better (I mean it was just amazing).
What a great authority, fielding questions like a true pundit on the subject. Great respect and thanks a lot.
Great, great lecture. Thank you, Prof. Yaser Abu-Mostafa; it is clear and well presented!
What is the graph shown at 15:42? Why are there ups and downs in E_in?
Jon Snow, in case it still matters: when you iterate in the PLA algorithm you adjust the weights of the perceptron in each iteration, so in each iteration you might classify more points correctly or fewer, and this is reflected in the in-sample error as well as the out-of-sample error.
For the first time I am finding machine learning interesting and learnable. Thank you very much, sir.
Thank you, Prof. Yaser Abu-Mostafa.
Very, very clear and instructive; really, it's hard to find a genius explanation like that.
Thank you Caltech and Prof.
Such a compassionate lecturer 🥺
Good lecture.
I think the question had to do with correlation between a transformed feature and the original feature. This describes the problem of multicollinearity. With multicollinearity, numerical instability can occur. The weight estimates are unbiased (on average you'd expect them to be correct) but they're unstable: running with slightly different data might give different estimates.
E.g., estimating SAT scores: given weight, height, and parents' occupation, one might expect all three to be correlated. The OLS algorithm won't know which one to properly credit.
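A minimal sketch of that instability (my own illustration, not from the lecture), assuming NumPy: two nearly identical features make the individual OLS weights swing between resamples even though their sum, and hence the predictions, stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    # Least-squares weights via the pseudo-inverse: w = X^+ y
    return np.linalg.pinv(X) @ y

for trial in range(3):
    x1 = rng.normal(size=200)
    x2 = x1 + 0.01 * rng.normal(size=200)   # almost perfectly correlated with x1
    y = x1 + x2 + 0.1 * rng.normal(size=200)
    w = fit_ols(np.column_stack([x1, x2]), y)
    # Individual weights vary wildly across trials, but w[0] + w[1] stays near 2.
    print(f"trial {trial}: w = {w}, sum = {w.sum():.3f}")
```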
Absolutely well done and definitely keep it up!!! 👍👍👍👍👍
Dude is a rockstar, just threw my panties at my screen again... Great lecturer.
What an amazing and charming teacher...I wanna have a beer with him.
He is Muslim... he doesn't drink beer. Have coffee with him instead.
Any new material from Yaser?
@@sanuriishak308 I think he prefers to 'sip on tea while the machine learns'
Howard Wolowitz is so good! One of the best lectures though.
I learnt a method in another class where the primary classification features were selected based on what causes the maximum change in entropy.
Decision Tree Algorithm.
A clean classification represents the lowest level of entropy (things aren't "muddled"). So going from the current situation to the lowest level of entropy results in the maximum change in entropy.
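A rough sketch of that idea (my own construction, not from this course), assuming NumPy: score a candidate split by its information gain, i.e. the drop in entropy from the current labels to the post-split labels.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, mask):
    # mask: a boolean split of the examples into two groups
    n = len(labels)
    after = (mask.sum() / n) * entropy(labels[mask]) \
          + ((~mask).sum() / n) * entropy(labels[~mask])
    return entropy(labels) - after

labels = np.array([0, 0, 1, 1, 1, 0])
mask = np.array([True, True, False, False, False, True])  # a clean split
print(information_gain(labels, mask))  # maximal gain: entropy drops to 0
```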
A million times better than my professor.
Thank you, great and relaxing lectures!
Great lecture & channel. Thanks for such an opportunity.
37:10 The formula is correct if we define the gradient as the transpose of the Jacobian matrix, not just the Jacobian matrix. In optimization techniques this convention is very helpful, so I think he uses it.
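In symbols, the assumed convention for a scalar-valued E (my reading of it):

$$\nabla E(\mathbf{w}) \;=\; \left(\frac{\partial E}{\partial \mathbf{w}}\right)^{\!\top},$$

i.e. the gradient is the column vector obtained by transposing the 1×d Jacobian row.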
How come no one here in the comments admits that they are confused? I don't understand why college professors are so out of touch with their students. Why does learning hard engineering topics always have to be such a struggle? He is teaching as if we have years of experience in the field.
Like, in lecture 2 he mentioned that E_in was used in place of nu, which is basically the sample mean. Now he is using it here in a completely different context without explaining why. It seems like a lot of other students here have the same question. PLEASE, SOMEONE OUT THERE, JUST LEARN HOW TO TEACH PROPERLY, ugh.
Being confused doesn’t mean the professor is incapable. It’s a natural part of the learning process.
If no one teaches right according to you, it might not be the teacher.
I like the x1.5 speeding, works perfectly.
x1.25 for me!
Amazing content. Thank you.
a great lecture. Thanks for sharing it.
AWESOME, lots of thanks.
Interesting lecture
His examples help a lot in understanding the class.
Wow! This is the oldest comment I've seen on these videos thus far. How are you doing these days? Did you end up pursuing machine learning?
@@FsimulatorX Nope, I am working in cybersecurity.
"Surprise, surprise"... What a great professor!
Thank you for uploading this!
fantastic lecture
What is the difference between the simple perceptron algorithm and linear classification algorithm?
Not sure I'm a fan of the "symmetry" measure. The number 8 in that example is clearly offset from center; the example would only have apparent symmetry because it's a wide number with a lot of black space. If a 1 is slanted and off center, it will have nearly zero apparent symmetry, because only its center point would have vertical symmetry. Oh well, we'll see where it goes.
You would probably use a bit more sophistication. After you flip the number you could "slide" it over the original number, looking for the maximum matching value.
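A hedged sketch of that flip-and-slide idea (my own construction, not the course's actual feature), assuming NumPy: flip the image, slide it horizontally over the original, and score symmetry by the best overlap.

```python
import numpy as np

def symmetry_score(img):
    flipped = np.fliplr(img)
    _, w = img.shape
    best = -np.inf
    for shift in range(-w // 2, w // 2 + 1):
        shifted = np.roll(flipped, shift, axis=1)   # slide (with wrap-around)
        # Negative mean absolute difference: higher means more symmetric.
        best = max(best, -np.mean(np.abs(img - shifted)))
    return best

img = np.zeros((16, 16))
img[2:14, 5:7] = 1.0          # an off-center, upright "1"
print(symmetry_score(img))     # close to 0 at the best alignment
```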
Key takeaway: linearity in the weights, not the "variables".
@52:00
this is really good
Fantastic!
When he says hypotheses h1, h2, etc. does he mean different hypotheses that fit same general form (e.g. all 2nd order polynomials) or different hypotheses forms (e.g. linear, polynomial, etc.)? Thanks
In the previous lectures E(in) was used for in-sample performance. Was it switched to in-sample error in this lecture? Am I missing something?
Performance is a rough wording, and error is one concrete way to evaluate performance.
I see it as follows: E(in) is the fraction of "red" marbles, which is the fraction of wrong predictions by your hypothesis, i.e. the error of that h. That fraction is the probability q of a Bernoulli distribution, whose expected value is E(q) = q.
@@solsticetwo3476 The other way around. E(out) is the fraction of red marbles, i.e. the fraction of wrong predictions by your hypothesis. This value has nothing to do with a probability distribution. E(in), the in-sample error, is coupled to the sampling and hence is coupled to a probability distribution.
6:14
Hello, where can I find this dataset to implement the algorithms?
The digit dataset is called the MNIST dataset.
Great!!
Hello, it's a nice lecture!! Thanks to Caltech and Prof. Yaser. Can anyone tell me where I can get the corresponding slides and textbooks? Thanks.
work.caltech.edu/lectures.html#lectures
Sami Albouq
Many thanks
Just to clarify, what piece of theory guarantees that the in-sample error E_in will track the out-of-sample error E_out?
That there is a probability distribution on X. What this says (more or less) is that what I saw happen in the past (i.e. what I sampled, which drives my in-sample error) says something about what will happen in the future. The "what will happen in the future" is a statement about my out-of-sample data and drives my out-of-sample error. Hoeffding's inequality places numerical constraints on the relationship and is based on this probability assumption.
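For reference, that inequality from Lecture 2, for a single fixed hypothesis and sample size N:

$$\mathbb{P}\left[\,\lvert E_{\text{in}}(h) - E_{\text{out}}(h)\rvert > \epsilon\,\right] \;\le\; 2e^{-2\epsilon^2 N}$$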
The calculation sets the derivative to zero, which I guess means finding where the slope is 0, but what if there is more than one such spot? How is the global minimum guaranteed?
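One way to see it (my own note; the lecture doesn't dwell on this): the squared error is convex in w, since its Hessian

$$\nabla^2 E_{\text{in}}(\mathbf{w}) \;=\; \frac{2}{N}\, X^{\top}X \;\succeq\; 0$$

is positive semidefinite. So any stationary point is a global minimum, and it is unique whenever X^T X is invertible.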
I thought linear regression was for fitting a line to data in order to forecast future values. But here it is explained in terms of a separation boundary for classification. Can someone explain?
For fitting/forecasting you look for the line which minimizes the distance of ALL points to that line. In the classification problem you look for the line which minimizes the distance of the WRONGLY CLASSIFIED points to that line.
Are there subtitles available?
Thank you.
What should I do if I didn't understand all the math in this lecture?
Do you have some resources that explain it quickly?
Some videos on statistics/probability and linear algebra would already help a lot. Khan academy has many great videos.
www.khanacademy.org/math/statistics-probability
www.khanacademy.org/math/linear-algebra
Depends on what you don't understand. There should be introductory courses to linear algebra, analysis and stochastics at your university.
Does anybody have the homeworks accompanying these lectures? I don't have a registered account for the course, and registration is closed now. :(
Quick question, what is the y-axis label at 20:00? What probability are we tracking for E_in and E_out?
E_in and E_out represent, respectively, the in-sample error and the out-of-sample error. Usually you don't have access to E_out, the out-of-sample error, but the in-sample error approximates the out-of-sample error better the more data you have.
The y-axis represents the error fraction on the data, while the x-axis represents the iterations.
On the y-axis, we are tracking the fraction of mislabeled examples. So E_in is the fraction of training-set examples that we got wrong. Similarly, E_out is the fraction of examples (not from the training set) that we got wrong.
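A tiny sketch of that bookkeeping (hypothetical array names, not course code), assuming NumPy:

```python
import numpy as np

# Rows of X are examples; y holds the true +/-1 labels.
def error_fraction(w, X, y):
    predictions = np.sign(X @ w)
    return np.mean(predictions != y)   # fraction of mislabeled examples

# E_in uses the training set; E_out would use fresh examples:
# E_in  = error_fraction(w, X_train, y_train)
# E_out = error_fraction(w, X_test,  y_test)
```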
It seems to me that there is a small typo on the 18th slide (48:25). To perform classification using linear regression, it seems one needs to check sign(wx - y) rather than sign(wx).
really confused me without your comment!
Watching this video a second time made it clear to me that there is no mistake. The threshold value is contained in w0, so sign(wx) is correct.
wX is the output for each data point in the training set; taking the sign of each output gives you its classification.
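A small sketch of why the threshold lives in w0 (my own illustration, assuming NumPy): augment every input with a constant 1, and the bias rides along inside w, so the rule is simply sign(X @ w) with no extra "- y".

```python
import numpy as np

X_raw = np.array([[0.3, 0.7],
                  [0.9, 0.1]])
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # prepend the constant x0 = 1

w = np.array([-0.5, 1.0, 1.0])  # w[0] = -0.5 acts as the (negated) threshold
print(np.sign(X @ w))           # classify: sign(w0 + w1*x1 + w2*x2)
```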
Great lecture! Does the use of features ('features' are "higher-level representations of raw inputs") increase the performance of a model out of sample? Does it somehow add information? Or does it simply make it computationally easier to produce a model? I'm working on a problem where this could potentially be very useful.
I could also see how the use of features could make a model more meaningful for human interpretation, but there is a risk as well that interpretations will vary between people based on what words are being used. 'Intensity' and 'symmetry' are used here, which are great examples, but it could very quickly get more abstract or technical.
Thank you in advance to anyone who has an answer to my question!
It depends on whether your features could be learned implicitly by your model. That is, let's say your original data are scores on two measures: IQ and age, and you want to use those to predict people's salaries. Let's also assume that the true way in which those are related is: salary = (IQ + age)*100 + e, where e is some residual error not explained by these two variables. In this case you could define a new feature that is the sum of IQ and age, and this would reduce the number of free parameters in your model, making it slightly easier to fit. Given enough data to train on however, your old model would perform just as well, because the feature in the new model is a linear combination of features in the old model. (That is, in the old model you would have w1 = w2 = 100, whereas in the new one you would just have w1 = 100.)
Often, however, we define new features not (just) to reduce the number of model parameters, but to deal with non-linearities. In the example of the written digits, you can't really predict very well which digit is written in an image by computing a weighted sum over pixel intensities, because the mapping of digits to pixel values happens in a higher order space. So in this case we can greatly improve the performance of our model if we define our features in the same higher order space. The reason is not that we add information that wasn't in the data before, but that the information wasn't recoverable by our linear model.
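A compact sketch of that last point (my own example, assuming NumPy): a linear model fails on a circular boundary in the raw inputs, but succeeds once we hand it squared features, because the boundary becomes linear in the new space.

```python
import numpy as np

# Labels follow a circular boundary: +1 inside radius 1, -1 outside.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1.0, -1.0)

def fit_and_score(features, y):
    # One-shot linear regression (pseudo-inverse), used as a classifier.
    Z = np.column_stack([np.ones(len(features)), features])
    w = np.linalg.pinv(Z) @ y
    return np.mean(np.sign(Z @ w) != y)   # in-sample error fraction

print("raw features:    ", fit_and_score(X, y))       # poor
print("squared features:", fit_and_score(X ** 2, y))  # much better
```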
rubeseba - That was very helpful. Thank you!
Kathryn Jessen Kathryn - I was born without the ability to be a mom :/ I will never experience the depth and vastness of a mother's understanding. I can sure pick a thing or two and try to pretend ;)
Great lecture overall. However, I couldn't really understand how to implement linear regression for classification...
With your training data you know which points have been wrongly classified. Look for the line which minimizes the (least-squares) distance of all wrongly classified data. Each time you move the line, other data may become wrongly classified, so you have to redo the calculation, looking for the line which gives you the minimum overall value over its associated set of wrongly classified data.
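For what it's worth, the one-shot recipe from the lecture (the slide around 48:25) can be sketched like this, treating the ±1 class labels as real-valued targets; the variable names are mine.

```python
import numpy as np

def linear_regression_classifier(X, y):
    # Solve the least-squares problem in one shot with the pseudo-inverse:
    #   w = (X^T X)^{-1} X^T y
    Z = np.column_stack([np.ones(len(X)), X])   # absorb the bias into w[0]
    return np.linalg.pinv(Z) @ y

def classify(w, X):
    Z = np.column_stack([np.ones(len(X)), X])
    return np.sign(Z @ w)   # final classification rule: sign(w^T x)
```

The lecture also suggests using these regression weights as good initial weights for the perceptron/pocket algorithm.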
Perfect
This is not the Squared Error but the Mean Squared Error.
How was E_out computed in each iteration? Was it using a subsample of the given sample and estimating E_out on the full sample?
+movax20h You cannot calculate E_out, because you do not know the whole population of samples. But E_in can tell you something about E_out. This relation is explained in Lecture 02, "Is Learning Feasible?"
+fouad Mohammed That is exactly why I am asking. The graph clearly shows E_out being calculated somehow. I guess this is done using validation techniques from one of the later lectures. Anyway, this is a synthetic example, so it is not hard to generate a known ("unknown") target function and as many training and test examples as you want, just for the example's sake.
I do not believe it was calculated via Hoeffding, because that is a probabilistic inequality, and it would actually lead to circular reasoning here: use it to predict E_out, then use this prediction to claim that E_in tracks E_out well. That might be correct in a probabilistic sense, but it is not a good way of demonstrating it at all.
Assume that the data was already labeled and used to generate E_in (test set). Take a portion of this data (training sample) and generate your hypothesis. You can use that hypothesis to measure E_in (on the training data) and E_out (on the test set). He's making the assumption that the whole set is labeled, which doesn't usually apply to the real world.
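A sketch of that synthetic setup (my own construction, assuming NumPy): pick a known target, train on a small sample, and measure E_out on a large held-out set standing in for "out of sample".

```python
import numpy as np

rng = np.random.default_rng(2)

# A known target function, playing the role of the "unknown" f.
w_true = np.array([0.2, 1.0, -1.0])
def target(X):
    return np.sign(np.column_stack([np.ones(len(X)), X]) @ w_true)

X_train = rng.uniform(-1, 1, size=(100, 2))
X_test  = rng.uniform(-1, 1, size=(10000, 2))   # stands in for "out of sample"
y_train, y_test = target(X_train), target(X_test)

# Fit by linear regression on +/-1 labels, classify with sign().
Z_train = np.column_stack([np.ones(len(X_train)), X_train])
Z_test  = np.column_stack([np.ones(len(X_test)),  X_test])
w = np.linalg.pinv(Z_train) @ y_train

E_in  = np.mean(np.sign(Z_train @ w) != y_train)
E_out = np.mean(np.sign(Z_test  @ w) != y_test)
print(E_in, E_out)   # the two track each other, as in the lecture's plot
```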
Check the iTunes page; there are homeworks and solutions available for free for this course.
Jon Snow: The error on the sample set could increase in the next iteration if the algorithm changes the hypothesis (weights) in a way that hurts the classification. The PLA is a random walk on the weight subspace.
@@solsticetwo3476 PLA is not really a random walk in the weight subspace... The algorithm adjusts the weights for a given (randomly chosen) misclassified point. Fixing the weights so as not to misclassify this point may lead to other points that were previously correctly classified becoming misclassified. Hence the rise in error, followed by a drop, etc. The algorithm works on non-separable datasets too, so you can't really call it a random walk; it clearly has a set of rules it's following.
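A minimal PLA sketch (my own, with synthetic data, assuming NumPy) showing why E_in bounces around instead of decreasing monotonically:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, size=(100, 2))])
y = np.sign(X @ np.array([0.1, 1.0, -1.0]))   # linearly separable labels

w = np.zeros(3)
for t in range(50):
    errors = np.sign(X @ w) != y
    print(t, errors.mean())                  # E_in can go up before it goes down
    if not errors.any():
        break
    i = rng.choice(np.flatnonzero(errors))   # pick one misclassified point
    w += y[i] * X[i]                         # PLA update: fix that point
```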
I didn't understand how he obtained the X-transpose after the differentiation.
math.stackexchange.com/questions/2128462/derivative-of-squared-frobenius-norm-of-a-matrix :)
Why does learning only occur in a probabilistic sense? What other way could there be?
Literally, learning in a non-probabilistic (absolute, certain) sense. However, this runs up against the so-called induction problem first described by the philosopher Hume (you can google it). In our context, the Hume induction problem translates to: "If I pick a number of balls and they always turn out to be green, can I conclude (with certainty) that all balls in the bin are green?" In the lecture the statement is made that you can't. The philosophical discussion is a bit more nuanced. In any case, machine learning avoids this discussion by sidestepping statements made with certainty and moving to (weaker) probabilistic statements.
@@roelofvuurboom5431 great reply, thanks! I loved how you tied to Hume. I didn't think of this connection. Thanks for linking things.
42 minutes !! BANG !!
A 2.8 doesn't happen at Caltech... a 3.8 doesn't happen at Tribhuvan University.
LOL
"+1" and "-1" among other things happen to be real numbers! LOL
+Zeeshan Ali Sayyed There is something genius about the simplicity though lol
Vyas Sathya Indeed. :P
How did we get X^T at 38:25?
In linear algebra, the definition of ||v||^2 for a vector v is (v^T)(v). Apply this formula to ||Xw - y||^2.
This comment is old and the OP has probably figured it out by now. However, for anyone else who wonders:
It has to do with notation: the derivative of Xw with respect to w is either X or X^T depending on your layout convention. Here we're using denominator-layout notation, so we get X^T (see en.wikipedia.org/wiki/Matrix_calculus).
The natural follow-up question is why we use one notation over the other. Notation choices are often just the choice of the author, and can make formulae more succinct or more clear. I think this answers the question without going too far off on a tangent; any further questions are probably best answered by your own inquiry.
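Spelled out (a sketch consistent with the slide, using denominator layout):

$$E_{\text{in}}(\mathbf{w}) \;=\; \frac{1}{N}\lVert X\mathbf{w}-\mathbf{y}\rVert^2 \;=\; \frac{1}{N}(X\mathbf{w}-\mathbf{y})^{\top}(X\mathbf{w}-\mathbf{y})$$

$$\nabla E_{\text{in}}(\mathbf{w}) \;=\; \frac{2}{N}X^{\top}(X\mathbf{w}-\mathbf{y}) \;=\; \mathbf{0}
\;\Rightarrow\; X^{\top}X\mathbf{w} = X^{\top}\mathbf{y}
\;\Rightarrow\; \mathbf{w} = (X^{\top}X)^{-1}X^{\top}\mathbf{y} = X^{\dagger}\mathbf{y}$$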
I wish he showed how to write the algorithms in Python because he teaches very well.
There are plenty of other resources for that. Once you understand the theoretical component, implementation becomes easy
People don't get 2.8's at caltech? I smell grade inflation.
I understood that as people with 2.8 or better not going there. Who takes courses in all of those fields at university?
Never imagined that I'd learn ML from Emperor Palpatine himself!
32:22 MSE, mean squared error, for those with statistics background.
22:48, haha!
What's wrong? The video is not opening.
44:50
best sound ever.
The linear regression is terribly explained.
He did say this was just an appetizer. A more detailed explanation comes later.
As the professor mentions multiple times, this lecture is a bit out-of-place as it is placed before covering the theory. Other lectures are more theoretically dense and explained in reasonable depth.
Brilliant lecture
Amazing lecture