1 an overview of the course in this introductory meeting.
2 linear regression, gradient descent, and normal equations and discusses how they relate to machine learning.
3 locally weighted regression, probabilistic interpretation and logistic regression and how it relates to machine learning.
4 Newton's method, exponential families, and generalized linear models and how they relate to machine learning.
5 generative learning algorithms and Gaussian discriminant analysis and their applications in machine learning.
6 naive Bayes, neural networks, and support vector machines.
7 optimal margin classifiers, KKT conditions, and SVM duals.
8 support vector machines, including soft margin optimization and kernels.
9 learning theory, covering bias, variance, empirical risk minimization, union bound and Hoeffding's inequalities.
10 learning theory by discussing VC dimension and model selection.
11 Bayesian statistics, regularization, a digression on online learning, and the applications of machine learning algorithms.
12 unsupervised learning in the context of clustering, Jensen's inequality, mixture of Gaussians, and expectation-maximization.
13 expectation-maximization in the context of the mixture of Gaussians and naive Bayes models, as well as a digression into factor analysis.
14 factor analysis and expectation-maximization steps, and continues on to discuss principal component analysis (PCA).
15 principal component analysis (PCA) and independent component analysis (ICA) in relation to unsupervised machine learning.
16 reinforcement learning, focusing particularly on MDPs, value functions, and policy and value iteration.
17 reinforcement learning, focusing particularly on continuous state MDPs, discretization, and policy and value iteration.
18 state action rewards, linear dynamical systems in the context of linear quadratic regulation, models, and the Riccati equation, and finite horizon MDPs.
19 the debugging process, linear quadratic regulation, Kalman filters, and linear quadratic Gaussian in the context of reinforcement learning.
20 POMDPs, policy search, and Pegasus in the context of reinforcement learning.
You're a godsend. Thanks
Thank you
Thanks for the comprehensive list
Andrew Ng rocks .. he's an amazing teacher and an influential engineer as well as a great scholar.
In a rather small but unprecedented step, you've managed to popularize Machine Learning. Nice!
I disagree to some extent
@@sharjeeltahir5583 why?
The fact that, by using simple physical examples (Portland property prices), you could generalize and abstract into learning algorithms is just amazing. What an inspiration as a teacher!! Thank you.
Best lecture to understand Machine Learning that I've gone through so far. Professor Andrew Ng is all time best teacher for me.
This is pure gold mine for anyone interested in machine learning. He's doing such an amazing job explaining everything in a simple way, especially the parameters in new definitions and equations with plenty of examples and interesting videos.
Very well explained. I was going through the Coursera course video lectures, but found this one much better.
+ajayram198 same
which course?
coursera course by andrew ng himself
Agreed. Of all the MOOCs, I like Coursera the least but Ng is much better in this lecture format
It's just like studying at Stanford! Although not physically there, it really let me gain more knowledge of machine learning than I got from my university alone. And he is really a good lecturer!
Thank you to the people who proposed this to Stanford University and uploaded it!
The normal equations fall out immediately from the perpendicularity criterion for shortest distance, X^T (X * theta - y) = 0, and you don't have to get into trace computations.
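A minimal MATLAB sketch of that view (the small design matrix and price vector below are made up, purely for illustration): solving X'(X*theta - y) = 0 for theta gives the same answer as the trace-based derivation, and the residual ends up orthogonal to the columns of X.
%% Normal equations from the perpendicularity criterion X'(X*theta - y) = 0
X = [1 2104; 1 1416; 1 1534; 1 852];   % made-up design matrix: intercept + living area
y = [400; 232; 315; 178];              % made-up prices (in $1000s)
theta = (X' * X) \ (X' * y)            % solves X'X * theta = X'y
X' * (X * theta - y)                   % numerically zero: residual is perpendicular to X's columns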
Well that escalated quickly... time to brush up on some of this math before continuing.
+cmares5858 Lol, Yup
+cmares5858 Yup!! I was like ahh I'll be fine ...NOPE. What subjects do you think you need to brush up on before you can understand this?
+cmares5858 Well we barely learned anything from lecture 1..
+Pat Bradley You should know introductory probability, linear algebra and maybe some multivariate calculus. If you're determined, mit has lecture series on all of those on youtube.
You might also want to think about applying some of these algorithms yourself so the theory sticks.
I hope it is not too late. I felt exactly what you felt the first time I encountered this lecture; the math it involves is multivariate calculus and some elementary statistics. Moreover, there are good books about machine learning, plus tons of material on the internet about gradient descent, which are very helpful.
im raising my hand, why isn't professor Ng calling on me?
Jabrils i'm big fan of you !
He doesn’t like you Jabrils
lol !
Dude, you inspired me to start taking Computer Science two years ago. Thanks, Jabrils!
Do you really understand how lucky we are to have someone like this legend explain this material to us?
Andrew Ng (the lecturer in these videos) teaches a course on Coursera that is based on this class. It covers the same fundamental ideas but might not be as in depth as these Stanford lectures.
That coursera course is bs compared to this series.
If we have a dataset with the number of points =1000=m
- Batch Gradient Descent: apply the process on all points in each step of the iteration (i=1......m)
- Stochastic Gradient Descent: apply the process not on all points at once, but on one point per step of the iteration (see the sketch below for both update rules)
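A rough MATLAB sketch of both update rules on made-up data (all the numbers here are assumptions, just for illustration): the batch step sums the error over all m points before theta moves, while the stochastic step moves theta after each single point, with the example index changing every step.
%% Made-up data: y roughly 3 + 2*x
m = 1000;
x = rand(m,1) * 10;
y = 3 + 2*x + randn(m,1) * 0.5;
X = [ones(m,1) x];
alpha = 0.01;
%% Batch gradient descent: each update uses all m points
theta_b = zeros(2,1);
for step = 1:5000
    theta_b = theta_b - alpha * (X' * (X*theta_b - y)) / m;   % sum over ALL points
end
%% Stochastic gradient descent: each update uses one point, j changes every step
theta_s = zeros(2,1);
for epoch = 1:20
    for j = 1:m
        err = X(j,:) * theta_s - y(j);
        theta_s = theta_s - alpha * err * X(j,:)';
    end
end
[theta_b theta_s]    % both roughly [3; 2]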
first learning algorithm. i am so pumped.
Top KeK 😀
Stanford. Thanks for posting these lectures! Big thank you!
NOTE: A^(T) represents transpose of matrix A.
At 59:56 it should only be C^(T)AB^(T) and not C^(T)AB^(T)+ CAB as according to one of the above equations, gradient of AB wrt A is equal to B^(T), thus the gradient of ABA^(T)C should be equal to (BA^(T)C)^(T) and that is equal to C^(T)AB^(T). Please help me sort this out.
No, because A^T also depends on A. What you're saying is like: when differentiating with respect to x, d(x*a)/dx = a, so d(x*a*x)/dx would be a*x — but that's not the case (it's 2*a*x).
Aimane Harrak thanks I got it now.
Answer for 44:00: differentiate twice; if the second derivative is greater than zero, the stationary point is a minimum (descent), otherwise it is a maximum (ascent).
Question: at the 34:23, for a certain training sample, we have
adjustment of the jth of Theta= - alpha * (estimation error )*Xj
For example we only have one Theta and one x where Theta = unit price/sqr ft and X= the number of sqr ft
I don't understand why a larger Xj should lead to a larger Theta adjustment.
For example, if we have 2 cases, in both the estimation error is 10000 dollars. In the first case, the Xj = 500 sqr ft, in the second case Xj=5000 sqr ft. Then the second case feeds back a 10x larger adjustment for unit price. But why?
In the first case, you tell the machine: hey, you missed by 10,000 dollars, given that the apartment has 500 sq ft, so next time reduce 20 dollars per sq ft. This makes sense.
Then in the second case, you tell the machine: hey, you missed by 10,000 dollars, given that the house has 5000 sq ft, so next time reduce 200 dollars per sq ft. That's weird.
Thanks folks
+田野 I think it is because theta isn't a $ value in sqr ft, but a number by which the sample xi is multiplied.
+Antony Lawler
Thank you so much for the reply. Technically, as you said, Theta can't be defined as unit price. But at least, I think Theta is an analogue of unit price, and that the product of Theta1 * X1 (area) roughly represents the part of the house price corresponding to area.
This feedback design seems to be counter-intuitive.
+田野 No problem. How are you getting on with Lecture 3 ?
+田野 The adjustment formula is oversimplified. I believe that alpha in the formula should vary with xj. Basically the adjustment formula tries to arrive at a solution for which dJ_over_dtheta = 0, which is the first partial derivative of J with respect to theta. If you use Newton-Raphson's formula for zero-finding, you end up with theta := theta - beta * dJ_over_dtheta/d2J_over_dtheta2, where d2J_over_dtheta2 is the second partial derivative. If you carry out the math, you will find that the second derivative is proportional to xj^2. With the first derivative proportional to xj, you end up with the adjustment term as a constant beta multiplied by 1/xj, so a smaller adjustment is made when xj becomes larger. Hope this helps. Very interesting observation though!
+Antony Lawler I got quite frustrated with math.. I got stuck at video 3 and have not
revisited for a few weeks.
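Following the Newton-Raphson point above, a tiny MATLAB sketch with made-up numbers (one feature, no intercept): the first derivative of J is proportional to xj and the second to xj^2, so the Newton step divides the error by x rather than multiplying by it.
%% One-feature check: gradient step grows with x, Newton step shrinks with 1/x
x = 5000;  y = 1e6;  theta = 150;     % made-up: sq ft, price, initial $/sq ft
err = theta*x - y;                    % estimation error (here -250000)
first_deriv  = err * x;               % dJ/dtheta, proportional to x
second_deriv = x^2;                   % d2J/dtheta2
theta_newton = theta - first_deriv / second_deriv   % = theta - err/x = 200 = y/x exactly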
Really nice. Well taught. I am really enjoying listening to these lectures. A true service to public.
I think one alternate answer for the question @41:40 might also be that we found the minimal point, or the convergence point, when the derivative goes to 0 or nearby: the derivative of a function measures the slope, and when it goes to 0 it means that we found a local maximum or minimum; because we are hunting the minimum, it means that we found it. Am I right?
1:00:10 and further on: the training example is a row matrix and we take the transpose, so that makes it a column matrix?
learning a whole new concept easily in one hour is fantabulous.......thanx...
Around time = 28:00, Dr. Ng noted that to go in the direction of steepest descent from a point, ( theta1, theta2, J(theta1, theta2) ), we should go in the direction of the gradient of J at that point; however, this is incorrect. The gradient always points in the direction of steepest ascent, not descent; therefore, the direction of steepest descent from ( theta1, theta2, J(theta1, theta2) ) is opposite of the gradient: -Del( J( theta1, theta2 ) ).
At 1:10:20, I think there is a trace missing before the Nabla_Theta(y^TX Theta) Term (the very last term).
All the other terms have traces, why doesn't this one? Without it, one cannot apply the rules he introduced before (Nabla tr(AB) = B^T)
You are right. It should have the trace notation too. Otherwise he cannot use the 2nd fact out of the 5 facts he mentioned during the matrix algebra revision. He might have accidentally missed it.
Two ways to find the theta that minimizes the cost function:
1- Normal equation: (No Iteration)
By taking its derivative and setting it equal to zero.
2- Gradient Descent: (With iteration)
By taking its derivative and applying the GD algorithm.
*****************************************
For example: To find the minimum, if y=X^2 :
1- Normal Equation:
2X=0
X=0. This is the solution.
2- Gradient Descent:
Derivative: 2X
X1 = X0 - step_size * 2*X0
After some # of iterations, X will converge to zero.
X=0. This is the solution.
If the rest of the lectures is based on these operators ... then I will hang out till the very end ... elegant!!
just wondering if you could encode the landscape using fourier transforms and then use that multi-level representation with a slightly modified algorithm to get a faster / more accurate result?
Can someone clarify please. On 50:00 when he answers the question about stochastic gradient descent, surely he does not mean that each iteration we use the SAME training example, right? I am sure he means that each iteration we take a different training example, but the way he talks about it is slightly confusing.
i think for the first step, you use the first training example and update all of the thetas. then for the second step, you use the second training example and update all of the thetas. and so on... so yeah, you use a different training example for each step/iteration
Yeah, the confusion is because he says "for each step, you're only using one training example".
Worth emphasising that it's the jth example, which changes each step, and not the SAME training example.
In batch, you use the entire training set of all (potentially millions) of examples, so each equivalent step for stochastic is potentially millions of times faster. It's just a compromise for the sake of speed. More generally, presumably you would actually take 'a random sample' of training examples rather than the jth, for greater accuracy.
Thanks!
Am I wrong or right if I assume that the gradient is actually oriented in the direction of biggest ASCENT?
Wikipedia says so too.. so I assume we should use the gradient's orientation multiplied by -1 for the stated example, contrary to what is mentioned in the video.
+King Schultz Maybe it depends on what exactly you are trying to optimize. If you are looking for a minimum cost you would go in the direction of greatest descent and if you are looking for a maximum profit you would go in the direction of greatest ascent?
+Eric Hoft I could be talking out of my ass though.
That makes total sense of course. I just mean that the gradient is mathematically defined as the greatest ascent, so it actually points to the greatest ascent and its length is the magnitude of the ascent. That's why it irritates me that we use the gradient here as if it pointed to the biggest descent.
+King Schultz You're right, gradient points in the direction of greatest ascent, so he is slightly off when he talks about it. Not a huge deal though; just gotta keep in mind when he says "gradient" we should be thinking "negative gradient".
That is why he subtracts the gradient (which is simply adding the gradient multiplied by -1).
I'm a high school junior and I didn't know what a partial derivative was, so I walked into my AP Calc class today, asked the teacher, and was told to never speak of it again. Apparently my teacher has repressed nightmares of it from college haha. I looked it up; seems pretty straightforward, I think I get it now.
I don't think u should study machine learning now, and I don't think u got 'it', it involves way more than just partial derivative , kid.
that was a year ago, he's probably graduated college by now
@da ny You deserve to be kept far away from every learner! Give this 'kid' the hope and belief he can do it and he will, instead of trying to fix your ego.
yup, I know PDEs well enough. Shame on your teacher for turning you away!
Thank you Professor Ng and Stanford University.
At 1:01:52 the design matrix X is m by n. Then he multiplies by theta and it looks like we're just left with a mx1 vector. Is each x in the resulting vector assumed to be an n dimensional or am I missing something?
Actually I think I'm being stupid. It's because we're multiplying by theta which is n x 1 right?
Hi. Great video. I have a question: at time 1:08:40, why is the first term of the product (Xθ - y)^T (Xθ - y) equal to θ^T X^T X θ? Why is it not X^T θ^T X θ?
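A quick numeric check of where that term comes from (the small random matrices below are assumptions, just for sizes): the transpose of a product reverses the order, so (X*theta)' = theta' * X', which is what produces the theta' X' X theta term in the expansion.
%% Check: (X*theta - y)'(X*theta - y) expands using (X*theta)' = theta'*X'
X = rand(4,2);  theta = rand(2,1);  y = rand(4,1);   % made-up sizes
lhs = (X*theta - y)' * (X*theta - y);
rhs = theta'*X'*X*theta - 2*y'*X*theta + y'*y;       % first term is theta'*X'*X*theta
lhs - rhs                                            % numerically zero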
Denzel, it is all about the changing of the thetas, which are parameters (weights) that take on new values with each update. We desire to choose a theta that will minimize J(theta). Gradient descent takes the form: thetaj := thetaj - alpha * (partial derivative of J(theta) with respect to thetaj). The actual update is performed on all j values at the same time. Thus we begin with some value of theta and then repeatedly change it to make J(theta) smaller. Alpha is just the learning rate.
Firstly, I'm loving this, great class! I have a question about the derivation of Gradient Descent. How is the partial derivative of J(theta) taken in the iterative algorithm if it's simply a constant? We already have x, y, and the initial theta (zero vector), so how can we take the partial derivative AND THEN plug in what we know...could the mathematical notation possibly be improved a bit? As it stands now, it's not making sense to me and I've been through an entire calculus sequence.
why there is no m involved in the denominator? @1:04:25
I wondered the same thing. Instead he arbitrarily assigns 1/2 versus the normal sum of squared differences over n.
Am I the only one impressed by the chalk board that wipes itself clean when he lifts it up and pulls it back down
It's different board, you dumbass.
The board doesn't get cleaned, it's an illusion. The lecturer just lifts one board up and pulls a new one down. Look at 48:00.
He obviously applied a learning algo to it!
I am laughing so hard. I would be impressed by that too, but as others said, they are overlapping chalkboards.
hahah~~~
These videos are brilliant!!Andrew is super cool at teaching, thanks Stanford!!
In my course of linear systems we used the same normal equation for estimating parameters of a discrete model of continuous system.
The thing is, it can be derived in a much simpler way than the one shown in the lecture (without the use of traces, let alone the trace algebra). :)
So besides that, great lecture and certainly motivating.
%% Visualizing Gradient Descent on quadratic function using matlab:
clear all
close all
clc
%% Defining the Input and the Output :
Input=-5:0.1:5;
Output=Input.^2;
%% Plotting the function:
plot(Input,Output,'LineWidth',3)
hold on
%% Determining the required parameters:
step_size=0.01;
Iterations = 100;
%% Initialize the starting point:
X0(1)=[3.5];
%% Plotting the first step:
Ite=1;
disp(['Iteration ' num2str(Ite) ': Best Minima = ' num2str(X0(Ite))]);
Output=X0(Ite).^2;
plot(X0(Ite),Output,'.','MarkerSize',30)
%% Starting the iterative gradient descent:
Ite=2;
while( Ite < Iterations)
%% Least Mean Squares (Gradient Descent):
X0(Ite,:) = X0(Ite-1,:) - step_size.*2.*(X0(Ite-1,:));
Output=X0(Ite).^2;
disp(['Iteration ' num2str(Ite) ': Best Minima = ' num2str(X0(Ite))]);
%% Plotting the next step:
plot(X0(Ite),Output,'.','MarkerSize',30)
Ite=Ite+1;
end
One thing I did not understand is why introduce the batch gradient descent or the stochastic version if the problem can be solved by linear algebra.
Is this only a way to get through those algorithms, which we will use for more complicated minimization problems? Or do you really use these algorithms for this particular problem?
I think the case may be that doing it using linear algebra can be quite computationally intensive (it requires forming and inverting X^T X), whereas the gradient descent algorithms don't require that matrix inversion.
I think it's because of the quantity of the data involved. If the training set is too large, an iterative algorithm might not be practical due to hardware limitations. So, yes, I think we pick the most efficient algorithm depending on the situation.
+phibouafia In general, only some problems (ie, minimizing least squares with linear h function) can be solved using linear algebra closed forms. Most can't, unfortunately. I think he shows us the gradient descent methods here even though we don't need them because we WILL need them lots more later in the course.
At 18:30 he talks about the summation of the 'vectors' as being a transpose of theta * x. How did he determine this? Did he use the dot product rule for transpose where [a • b] = a^T * b ?
It's the dot (inner) product: theta^T * x = [theta0, theta1, theta2] * [1, x1, x2]^T = theta0 + theta1*x1 + theta2*x2
Around 55:47, should it be written as the gradient of f wrt A, and not be evaluated at A? i.e. drop the "(A)" before the "="?
Otherwise, you'd be taking the gradient of a real #, unless I'm reading something wrong...
So, my response is a bit late, but in this case A is regarded as a variable, so f(A) would be the same as just f. Here A has no specific value, like A = I or something.
I can't express how much i loved this video
To Maris, since they square the result, it doesn't matter whether you subtract y-h(x) or h(x)-y. (for some reason there was no reply option under your question. maybe it's too old. but someone else might have the same question.)
He is missing the index superscript i (training example) on y at the last line inside summation equation. Min 1:04:23
OK he fixed it later...
At 44:05, he says that the derivative of the function gives the steepest descent and said the TAs would probably elaborate on that in another session. Can someone please explain that?
Thanks jcbmack, between your comment and reviewing the lesson again I was able to make heads and tails of the concept I was misunderstanding. I was considering the parameters/thetas to be constant when in fact they are varying; why, I have no idea, haha. Cheers!
at 19:30...
the lecturer writes h(x) = (theta transpose) times (x)
but that would give a 3 by 3 matrix
shouldn't it be h(x) = (x) times (theta transpose)???
@astroboomboy on the course website (google it) it says you need linear algebra and probability theory, but it said you need basic linear algebra and probability and a little programming experience.
Lecture 2 is done Sir (1:13 am).
See u 2morrow on lecture 3.
Thank you Professor. Thank you Stanford.
Hi, I have a question about stochastic gradient descent. At 48:42, the inner loop has an iteration of j=1 to m. Does m signify the size of the whole dataset? If it signifies the size of the whole dataset, I think it is not really different from the sum over j=1 to m in batch gradient descent. So.... m in stochastic gradient descent is different from m in batch gradient descent, right???
Thanks to my Linear Algebra course in Peru :),
I understood this nice lecture...
so I continue with Lesson 3.
Thanks Stanford!!!
Stanford has the right idea with spreading all this knowledge for free :D
this is brilliant. thank you so much professor
This lecture would be improved by first introducing a simple quadratic equation (i.e. Y=x^2+2x+1), finding a minimum by taking the derivative, setting it to zero and solving for the value of X (the input parameter causing that minimum). Then, extend this concept to a 3D equation with two inputs X, Y and output Z: take the derivatives, set them to zero and determine the values of X, Y (in this case Theta1 and Theta2). The point of this lesson was to find a min (or max) given any # of inputs.
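A short MATLAB sketch of that suggested warm-up, using the commenter's own example (the starting point for the numeric search is a made-up value): the analytic minimum of y = x^2 + 2x + 1 comes from setting dy/dx = 2x + 2 = 0, i.e. x = -1, and a numeric minimizer agrees.
%% Warm-up: minimum of y = x^2 + 2x + 1
f = @(x) x.^2 + 2*x + 1;
x_analytic = -1                        % from dy/dx = 2x + 2 = 0
x_numeric  = fminsearch(f, 3)          % numeric search from a made-up start x = 3
f(x_numeric)                           % ~0, matching f(-1) = 0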
The images shown in white background are pretty hard to make out(Like the plot of housing price vs foot squared).
@matharoofmaths; Yes ... and that's why he makes so many mistakes in these lectures and has a hard time answering his students' questions (and occasionally evades student questions) in later lectures ... but if his research papers are any indication, he will definitely be an outstanding teacher in the future.
All criticism aside, this is much better than what we had before - nothing. Thank you Dr. Ng and Stanford for letting us in. This is making Machine Learning that much more accessible.
most impressive of all is that this lecturer is actually a robot
well that's why he is teaching at Stanford!!! Show some respect. Thanks
agree with caesiume. this type of lecture is great. both free and good.
43:35 Doesn't the gradient give the direction of steepest ascent?
Yes
Xavier Thanks!
theta is some constant. If you had a quadratic equation:
y=3+2x+5x^2
theta0 would be 3, theta1 would be 2, theta2 would be 5.
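A tiny MATLAB sketch of evaluating that hypothesis (the input value is a made-up number), tying it back to the theta^T x notation from the lecture:
%% Hypothesis h(x) = theta0 + theta1*x + theta2*x^2 with the thetas above
theta = [3; 2; 5];
x = 1.5;                               % made-up input
h = theta' * [1; x; x^2]               % 3 + 2*1.5 + 5*1.5^2 = 17.25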
I have a query, guys: the cost function in some examples, like the lectures in Andrew's AI series, had a 1/m term. My query is: what points do we need to consider when defining a cost function?
Any idea where to get the proofs to the two distinct matrix trace properties used for solving the Normal Equations?
Try Learning from Data on EdX - easier to follow and easier to work through examples. There are solutions to the homework problems.
Around time = 43:00, Dr. Ng again gave the wrong description of the gradient.
Example: Let f(x,y) = x^2 + y^2. Hence, the gradient is ( 2x, 2y ). At the point (1,1), the gradient is (2,2). Since the only local minimum of f(x,y) is at (0,0) and since (1,1)+(2,2)=(3,3), then the gradient at (1,1) points away from the only local minimum of f(x,y); therefore, the gradient does not point toward the direction of steepest descent. The gradient points in the direction of steepest ASCENT.
Interesting. Usually it will come back to the gradient descent when we solve inverse.
Isn't the cost function 1/(2*m) instead of just 1/2 of the sum of the squared errors?
Galina Staneva yes that m is missing...
I don't think so, actually that 1/2 is just added so as to get a neat expression after taking derivatives.
If we divide by m, we are subtracting from theta(i) alpha times the average of the sum. If we don't divide by m, we are subtracting from theta(i) alpha times the sum. Technically it doesn't matter whether we divide by m or not. But dividing by m will make us converge faster, I guess. Would love to hear some mathematical explanation around this.
I have implemented gradient descent in R with and without using m. In both cases it converges. But the catch here is that when you don't use m, you have to use a small value of alpha, like 0.01. If I use 0.1 it does not converge.
Thank you very much for your answer! This clears things up!
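A small MATLAB sketch supporting that observation (the data are random and made up): keeping the 1/m with learning rate alpha traces exactly the same path as dropping the 1/m and using alpha/m, so dividing by m is just a rescaling of alpha.
%% Dividing by m is the same as shrinking alpha by a factor of m
m = 100;
X = [ones(m,1) rand(m,1)];  y = rand(m,1);         % made-up data
alpha = 0.1;
t1 = zeros(2,1);  t2 = zeros(2,1);
for k = 1:200
    t1 = t1 - alpha     * (X' * (X*t1 - y)) / m;   % update with 1/m
    t2 = t2 - (alpha/m) * (X' * (X*t2 - y));       % no 1/m, alpha scaled down instead
end
max(abs(t1 - t2))                                  % essentially zero: same iterates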
I think the use of the trace operator in the derivation of the Least Squares Estimator obfuscates the derivation. I believe this would be easier to follow if the properties of matrix derivatives were used instead.
Around time = 28:00, Dr. Ng said that if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction of the gradient of J( theta1, theta2 ); however, this is incorrect. The gradient always points toward the direction of steepest ascent, not descent; therefore, if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction that is opposite of the gradient: -Del( J( theta1, theta2 ) ).
@Fusionicon Basic Calculus. Other than the weird stats stuff he brings into play when formulating the error function ("J"), you don't need anything else, so long as you really pay close attention.
I can't keep up. I want to learn this but I have no experience with the math he's using. Calculus, right?
+Jordan Shackelford it's basic calculus and basic linear algebra... you can find free online courses for both online (check out MIT OCW, for example)
linear algebra is Greek for most people. One semester will set you straight IF you do the homework.
Why are we trying to minimize (h(x)-y)^2 and not just h(x)-y?
+Xanfighter : cuz we only need the absolute value of (h(x)-y) to be minimal, but (h(x)-y)^2 is more convenient as a math expression.
+Xanfighter
Minimizing means that the derivative is equal to zero. We don't care about the coefficient (constant).
Thanks guys, really appreciate the answer :)
+Xanfighter The reason you minimize the square of the difference/error instead of the absolute error is because the linear algebra works out a lot easier this way. The assumption is that if the absolute difference is high, it is the same as if the difference squared is high. But basically, it's simply for mathematical ease. There is a lot of research on L1 norm minimization, check out the wikipedia article: "Least absolute deviations"
A couple of lectures in, it's surprisingly easy to get your head around this shit. Guess it all gets very tricky and intricate soon after, though.
What he says, I think, is right... He says that if X^T X is not invertible, which is the case when X is not a full-rank matrix (he says X has dependent columns), then you find the pseudo-inverse in that particular case.
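A minimal MATLAB sketch of that case (the matrix below is made up, with one column a multiple of another so that X'X is singular): pinv gives the minimum-norm least-squares solution where the ordinary inverse fails.
%% When X has dependent columns, X'X is singular; use the pseudo-inverse
X = [1 2 4; 1 3 6; 1 5 10; 1 7 14];   % made-up: third column = 2 * second column
y = [1; 2; 3; 4];
rank(X' * X)                          % 2, not 3, so (X'X)^(-1) does not exist
theta = pinv(X) * y                   % minimum-norm least-squares solution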
What does that transpose of theta represent?
Basically the partial derivative gives you the "steepest" way to the local minimum (about 43:00, last question).
y is missing superscript i at 1:04:10?
At 37:43, shouldn't we normalize the sum by dividing by m? Otherwise the correction amount will blow up the more training data we input.
Okay, forget about the question; the error function J itself is not normalized, so it's okay to blow everything up.
In the method described, "(Batch) Gradient Descent" is just optimization: iterating over a training set from a selected start point (initial parameters) to find new minima with their respective parameters. He is right, it can be slow if you have MANY training examples, since every step has to scan all of them. The derivative tells you which direction to move, so you don't have to try every combination of parameter values. The stochastic version is better because it "guesses" a direction from one example at a time and doesn't scan every example before each update.
He doesn't say how they decide alpha. It is just a "step size" for the gradient descent. It is the "weight" of the change in the parameter theta. Larger alpha means theta will converge faster but less accurately.
what does convergence mean here?? Is it the actual value converging to the predicted value?
Moving towards the local minimum or global minimum, where J(theta) will be minimal.
@@sanjhECE Thank you so much
His notations are listed here: 13:41
Once you reach the min (according to the GD algorithm) it stops moving; theta doesn't change anymore. That's when you know you've reached the local minimum.
I don't know about the math formulas in the lecture so what is the solution for me?
Thank you stanford ...really great work ...The lectures are great
Can we somehow get those TA Classes ? on Friday?
I think the Batch Gradient Descent formula is missing 1/m at 44:39. Am I correct?
I think the same
I cannot understand most of the equations in lecture 2. What kind of background knowledge should I look for?
you can check the least square solution.
For batch and stochastic gradient descent, is alpha (learning rate) usually the same size?
I haven't learnt Math. So, can someone please explain what exactly θ is? What is θ0 + θ1X? I understood the hypothesis, but I don't know what θ0 + θ1X actually means.
Sandy Sandeep This means that the algorithm is going to come up with a simple linear regression model where theta zero denotes the price of a very small house (theoretically zero square feet, but as you know there is no such house) and theta one denotes the price increase per additional square foot.
Hey Sandy, theta0 is the base price. Think of it as the minimum price for all houses; they have to have this theta0 price as a minimum. X is some feature of the house (size, number of bedrooms, etc.) which we multiply by a coefficient theta1. This is our hypothesis: each house has a base price, and the feature x of the house affects the price of the house by a factor of theta1. So each unit increase in X increases the price of the house by theta1. The only thing we have to do now is compute the value of theta1, which the professor does at the end of the video.
I just started to watch this lecture too, and I'm only in my second year of EE, but if you don't understand this stuff I guess you'd be better off thoroughly reading a book about linear algebra first. And probably some theory about signals and systems. He models the target as a linear function of the input, plus a constant term. I guess this is how you should think about this stuff in general.
But as I said, I'm only a 2nd-year Bachelor student^^
Hmm.. so Alvin looks at the road ahead and records the steering direction. So what if the road ahead is a curve, but since I'm on a straight patch for the moment my steering direction is still straight? Seeing where the cam was placed, and that there was no bonnet in the pictures, it must have been calculated for a few metres ahead. Does that affect anything? In the video it seems like Alvin's response is about 0.5 seconds behind a typical human response, especially in the live tests.
I just wonder if the stochastic gradient algorithm is more efficient than the batch gradient algorithm given that the number of data points n is large. The number of iterations for the batch gradient algorithm should be far less than n.
Are the discussion sessions posted online?
overview: batch gradient descent, stochastic gradient descent, normal equation
batch ~: update Theta after scanning all samples
stochastic~: update Theta after scanning one sample (useful when number of samples is large)
normal equation: the analytical solution of Theta without iteration
Just wondering, why is the normal equation only for the OLS case? What assumption was made in the derivation that restricts the equation to this specific case?
At 1:11:35, we have C=X'X and C' = X'X
Can someone please explain??
(A * B)' = B' * A', this means if we apply this to (X' * X) we will get: (X' * X)' = X' * (X')' = X' * X, thus here it is the same thing.
C = X'X, so C = C' because X'X is symmetric (its transpose is itself, just like a scalar: 1 = 1'); also (AB)' = B'A'.
Why is 1/2 multiplied with the squared value of the difference between predicted and actual value? Why not any other constant, or just keep it as it is?
It's just because when you do the derivative of the squared term, you get 1/2 * 2 which is 1 and so it's nicely legible again :)
Could someone explain how to get Gradient_A tr(ABA^T C) = CAB + C^T A B^T? I can't see how you can get an addition on the right-hand side, at least not from the rules he described in the lecture. Could one use the chain rule for the derivation?
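One way to see it is that A appears twice in tr(A B A^T C), so the product rule contributes one term per occurrence. A minimal MATLAB numeric check (random made-up matrices, sizes chosen arbitrarily) comparing a finite-difference gradient with that formula:
%% Numeric check of grad_A tr(A*B*A'*C) = C*A*B + C'*A*B'
n = 4;
A = rand(n);  B = rand(n);  C = rand(n);           % made-up square matrices
formula = C*A*B + C'*A*B';
f = @(M) trace(M*B*M'*C);
numeric = zeros(n);
h = 1e-6;
for i = 1:n
    for j = 1:n
        E = zeros(n);  E(i,j) = h;
        numeric(i,j) = (f(A+E) - f(A-E)) / (2*h);  % central difference in A(i,j)
    end
end
max(abs(formula(:) - numeric(:)))                  % tiny (roughly 1e-8)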
Lecture notes differ. The batch grad. descent in the notes calculates the residual (if I understand correctly, Data minus Fit) y-h(x), the square of which we try to minimize, but the prof has h(x)-y. Which one is correct?