Lecture 2 | Machine Learning (Stanford)

  • Published 21 Dec 2024

COMMENTS • 382

  • @sienna367 6 years ago +133

    1 an overview of the course in this introductory meeting.
    2 linear regression, gradient descent, and normal equations and discusses how they relate to machine learning.
    3 locally weighted regression, probabilistic interpretation and logistic regression and how it relates to machine learning.
    4 Newton's method, exponential families, and generalized linear models and how they relate to machine learning.
    5 generative learning algorithms and Gaussian discriminative analysis and their applications in machine learning.
    6 naive Bayes, neural networks, and support vector machines.
    7 optimal margin classifiers, KKT conditions, and SVM duals.
    8 support vector machines, including soft margin optimization and kernels.
    9 learning theory, covering bias, variance, empirical risk minimization, union bound and Hoeffding's inequalities.
    10 learning theory by discussing VC dimension and model selection.
    11 Bayesian statistics, regularization, digression-online learning, and the applications of machine learning algorithms.
    12 unsupervised learning in the context of clustering, Jensen's inequality, mixture of Gaussians, and expectation-maximization.
    13 expectation-maximization in the context of the mixture of Gaussian and naive Bayes models, as well as factor analysis and digression.
    14 factor analysis and expectation-maximization steps, and continues on to discuss principal component analysis (PCA).
    15 principal component analysis (PCA) and independent component analysis (ICA) in relation to unsupervised machine learning.
    16 reinforcement learning, focusing particularly on MDPs, value functions, and policy and value iteration.
    17 reinforcement learning, focusing particularly on continuous state MDPs, discretization, and policy and value iterations.
    18 state action rewards, linear dynamical systems in the context of linear quadratic regulation, models, and the Riccati equation, and finite horizon MDPs.
    19 debugging process, linear quadratic regulation, Kalman filters, and linear quadratic Gaussian in the context of reinforcement learning.
    20 POMDPs, policy search, and Pegasus in the context of reinforcement learning.

  • @quantummath 11 years ago +77

    Andrew Ng rocks... he's an amazing teacher and an influential engineer as well as a great scholar.
    In a rather small but unprecedented step, you've managed to popularize Machine Learning. Nice!

  • @cogent4645 7 years ago +1

    The fact that, using simple physical examples (Portland property prices), you could generalize and abstract into learning algorithms is just amazing. What an inspiration as a teacher!! Thank you.

  • @SagarPokhrel 7 years ago +2

    Best lecture to understand Machine Learning that I've gone through so far. Professor Andrew Ng is the all-time best teacher for me.

  • @yardenm15 6 years ago

    This is a pure gold mine for anyone interested in machine learning. He does such an amazing job explaining everything in a simple way, especially the parameters in new definitions and equations, with plenty of examples and interesting videos.

  • @ajayram198 9 years ago +69

    Very well explained. I was going through the Coursera course video lectures, but found this one much better.

    • @Rjsipad 9 years ago +4

      +ajayram198 same

    • @evgeniynorin7345 7 years ago

      which course?

    • @coolguy-dw5jq 7 years ago +2

      Coursera course by Andrew Ng himself

    • @davidalexander829 7 years ago +1

      Agreed. Of all the MOOCs, I like Coursera the least but Ng is much better in this lecture format

  • @teejiahen 15 years ago +2

    It's just like studying at Stanford! Although not physically there, it really let me gain more knowledge of machine learning than I got from my university alone. And he is really a good lecturer!
    Thank you to everyone who proposed this to Stanford University and uploaded it!

  • @curcicm 8 years ago +7

    The normal equations fall out immediately from the perpendicularity criterion for the shortest distance, X^T (X * theta - y) = 0, and you don't have to get into trace computations.
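
    For reference, a short sketch of that derivation in the same notation (assuming X^T X is invertible):

    X^T (X \theta - y) = 0
    X^T X \theta = X^T y
    \theta = (X^T X)^{-1} X^T y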

  • @cmares5858 9 years ago +136

    Well that escalated quickly... time to brush up on some of this math before continuing.

    • @TTGxCROTTY 9 years ago

      +cmares5858 Lol, Yup

    • @sahawndada 8 years ago

      +cmares5858 Yup!! I was like, ahh, I'll be fine... NOPE. What subjects do you think you need to brush up on before you can understand this?

    • @tj8870 8 years ago +8

      +cmares5858 Well we barely learned anything from lecture 1..

    • @elborrador333 8 years ago +16

      +Pat Bradley You should know introductory probability, linear algebra, and maybe some multivariate calculus. If you're determined, MIT has lecture series on all of those on YouTube.
      You might also want to think about applying some of these algorithms yourself so the theory sticks.

    • @danny-bw8tu 7 years ago +1

      I hope it is not too late. I felt exactly what you felt the first time I encountered this lecture. The math it involves is multivariate calculus and some elementary statistics. Moreover, there are good books about machine learning, plus tons of material on the internet about gradient descent, which is very helpful.

  • @Jabrils 7 years ago +90

    I'm raising my hand, why isn't Professor Ng calling on me?

    • @UbuntuTricks 6 years ago +1

      Jabrils, I'm a big fan of yours!

    • @leonhardeuler9839 6 years ago +2

      He doesn’t like you Jabrils

    • @kongki7563 5 years ago

      lol !

    • @JohannSuarez 4 years ago +1

      Dude, you inspired me to start taking Computer Science two years ago. Thanks, Jabrils!

  • @bennasserchafi304 7 years ago

    Do you really understand how lucky we are to have found someone like this legend to explain this material to us?

  • @dongiea 11 years ago +4

    Andrew Ng (the lecturer in these videos) teaches a course on Coursera that is based on this class. It covers the same fundamental ideas but might not be as in depth as these Stanford lectures.

    • @TheDestint 4 years ago

      That Coursera course is bs compared to this series.

  • @eng.mohammadshericmrp9251 5 years ago

    If we have a dataset with the number of points m = 1000:
    - Batch Gradient Descent: apply the update using all m points in each step of the iteration (i = 1, ..., m)
    - Stochastic Gradient Descent: apply the update using not all points, but 1 point at a time
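
    A minimal MATLAB sketch of the two update rules for linear regression (the design matrix X, targets y, learning rate alpha, and iteration counts are all assumed/illustrative):

    % Assumed setup: X is m-by-n (one example per row), y is m-by-1
    [m, n] = size(X);
    theta = zeros(n, 1);
    alpha = 0.01;                              % assumed learning rate
    % Batch gradient descent: each update sums over all m examples
    for iter = 1:50
        theta = theta - alpha * X' * (X*theta - y);
    end
    % Stochastic gradient descent: each update uses a single example
    for iter = 1:50
        for i = 1:m
            err = X(i,:)*theta - y(i);         % error on example i only
            theta = theta - alpha * err * X(i,:)';
        end
    end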

  • @George-lt6jy 8 years ago +16

    First learning algorithm. I am so pumped.

    • @KCOWMOO 7 years ago +1

      Top KeK 😀

  • @arran5498 15 years ago +2

    Stanford. Thanks for posting these lectures! Big thank you!

  • @akshatb 8 years ago +1

    NOTE: A^(T) represents the transpose of matrix A.
    At 59:56 it should only be C^(T)AB^(T) and not C^(T)AB^(T) + CAB, since according to one of the above equations the gradient of tr(AB) with respect to A is equal to B^(T); thus the gradient of tr(ABA^(T)C) should be equal to (BA^(T)C)^(T), which is C^(T)AB^(T). Please help me sort this out.

    • @harrakaymane 8 years ago +2

      No, because A^T also depends on A. What you're saying is like: since the derivative of x*a with respect to x is a, the derivative of x*a*x is a*x. That's not the case; by the product rule it is 2*a*x.

    • @akshatb 8 years ago

      Aimane Harrak thanks I got it now.

  • @abhishekkumar-os5zk 4 years ago

    Answer for 44:00: we differentiate again; if the second derivative is greater than zero it is a descent (toward a minimum), else an ascent.

  • @field-yetian6001 9 years ago +1

    Question: at 34:23, for a certain training sample, we have
    adjustment of the jth Theta = - alpha * (estimation error) * Xj
    For example, suppose we have only one Theta and one X, where Theta = unit price/sqr ft and X = the number of sqr ft.
    I don't understand why a larger Xj should lead to a larger Theta adjustment.
    For example, suppose we have 2 cases, and in both the estimation error is 10,000 dollars. In the first case Xj = 500 sqr ft; in the second case Xj = 5000 sqr ft. Then the second case feeds back a 10x larger adjustment for the unit price. But why?
    In the first case, you tell the machine: hey, you missed by 10,000 dollars, given that the apartment has 500 sqr ft, so next time reduce 20 dollars per sqr ft. This makes sense.
    Then in the second case, you tell the machine: hey, you missed by 10,000 dollars, given that the house has 5000 sqr ft, so next time reduce 200 dollars per sqr ft. That's weird.
    Thanks, folks

    • @antonylawler3423 9 years ago

      +田野 I think it is because theta isn't a $ value in sqr ft, but a number by which the sample xi is multiplied.

    • @field-yetian6001 9 years ago

      +Antony Lawler
      Thank you so much for the reply. Technically, as you said, Theta can't be defined as the unit price. But at least I think Theta is an analogue of the unit price, and that the product Theta1 * X1 (area) roughly represents the part of the house price corresponding to area.
      This feedback design seems to be counter-intuitive.

    • @antonylawler3423 9 years ago

      +田野 No problem. How are you getting on with Lecture 3 ?

    • @gt7318d 9 years ago +1

      +田野 The adjustment formula is oversimplified. I believe that the alpha in the formula should vary with xj. Basically the adjustment formula tries to arrive at a solution for which dJ/dtheta = 0, which is the first partial derivative of J with respect to theta. If you use the Newton-Raphson formula for zero-finding, you end up with theta := theta - beta * (dJ/dtheta)/(d2J/dtheta2), where d2J/dtheta2 is the second partial derivative. If you carry out the math, you will find that the second derivative is proportional to xj^2. With the first derivative proportional to xj, you end up with an adjustment term that is a constant beta multiplied by 1/xj, so a smaller adjustment is made when xj becomes larger. Hope this helps. Very interesting observation though!
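
      In symbols, for a single feature and a single sample (a sketch using the lecture's cost J):

      J(\theta) = \tfrac{1}{2}(\theta x - y)^2, \quad dJ/d\theta = (\theta x - y)\,x, \quad d^2J/d\theta^2 = x^2

      so the Newton step is \theta := \theta - (\theta x - y)\,x / x^2 = \theta - (\theta x - y)/x, and the correction indeed shrinks as x grows.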

    • @field-yetian6001 9 years ago

      +Antony Lawler I got quite frustrated with math.. I got stuck at video 3 and have not
      revisited for a few weeks.

  • @mayaahmed 15 years ago

    Really nice. Well taught. I am really enjoying listening to these lectures. A true service to the public.

  • @CosminVarlan 7 years ago

    I think one alternative answer to the question @41:40 might be that we have found the minimal point, or the convergence point, when the derivative goes to 0 or nearby: the derivative of a function measures the slope, and when it goes to 0 it means that we have found a local maximum or minimum; because we are hunting the minimum, it means that we found it. Am I right?

  • @OrakzaiSays 5 years ago

    1:00:10 and further on: the training example is a row matrix and we take the transpose, so that makes it a column matrix?

  • @rakeshprab1 14 years ago

    learning a whole new concept easily in one hour is fantabulous.......thanx...

  • @joshuaburkholder 16 years ago

    Around time = 28:00, Dr. Ng noted that to go in the direction of steepest descent from a point, ( theta1, theta2, J(theta1, theta2) ), we should go in the direction of the gradient of J at that point; however, this is incorrect. The gradient always points in the direction of steepest ascent, not descent; therefore, the direction of steepest descent from ( theta1, theta2, J(theta1, theta2) ) is opposite of the gradient: -Del( J( theta1, theta2 ) ).

  • @PMetheney84 9 years ago +2

    At 1:10:20, I think there is a trace missing before the Nabla_Theta(y^T X Theta) term (the very last term).
    All the other terms have traces; why doesn't this one? Without it, one cannot apply the rules he introduced before (Nabla tr(AB) = B^T).

    • @bidhovbizar 6 years ago

      You are right. It should have the trace notation too. Otherwise he cannot use the 2nd fact out of the 5 facts he mentioned during the matrix algebra revision. He might have accidentally missed it.

  • @eng.mohammadshericmrp9251 5 years ago +1

    Two ways to find the theta that minimizes the cost function:
    1- Normal equation (no iteration):
    by taking its derivative and setting it equal to zero.
    2- Gradient descent (with iteration):
    by taking its derivative and applying the GD algorithm.
    *****************************************
    For example, to find the minimum of y = X^2:
    1- Normal equation:
    2X = 0
    X = 0. This is the solution.
    2- Gradient descent:
    dy/dX = 2X
    X1 = X0 - step_size * 2*X0
    After # iterations, X will reach zero.
    X = 0. This is the solution.
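
    The same comparison as a tiny MATLAB sketch (the step size, starting point, and iteration count are arbitrary choices):

    % Minimize y = X^2 two ways
    X_exact = 0;                 % normal-equation style: solve 2X = 0 directly
    alpha = 0.1;                 % gradient descent: repeat X := X - alpha*2X
    X = 3.5;                     % arbitrary starting point
    for k = 1:100
        X = X - alpha * 2 * X;
    end
    [X_exact, X]                 % both are (approximately) 0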

  • @samferrer 9 years ago

    If the rest of the lectures are based on these operators ... then I will hang out till the very end ... elegant!!

  • @chriswalsh5925 8 years ago +1

    Just wondering if you could encode the landscape using Fourier transforms and then use that multi-level representation with a slightly modified algorithm to get a faster / more accurate result?

  • @signemadara2459 10 years ago +3

    Can someone clarify please. At 50:00, when he answers the question about stochastic gradient descent, surely he does not mean that in each iteration we use the SAME training example, right? I am sure he means that in each iteration we take a different training example, but the way he talks about it is slightly confusing.

    • @ericakim4587 10 years ago +4

      I think for the first step, you use the first training example and update all of the thetas. Then for the second step, you use the second training example and update all of the thetas. And so on... so yeah, you use a different training example for each step/iteration.

    • @elliottbajema3092 10 years ago

      Yeah, the confusion is because he says "for each step, you're only using one training example".
      Worth emphasising that it's the jth example, which changes each step, and not the SAME training example.
      In batch, you use the entire training set of all (potentially millions) of examples, so each equivalent step for stochastic is potentially millions of times faster. It's just a compromise for the sake of speed. More generally, presumably you would actually take 'a random sample' of training examples rather than the jth, for greater accuracy.

    • @signemadara2459 10 years ago

      Thanks!

  • @pitr2596 8 years ago +6

    Am I wrong or right if I assume that the gradient is actually oriented in the direction of biggest ASCENT?
    Wikipedia says so too... so I assume we should use the gradient's orientation multiplied by -1 for the stated example, contrary to what is mentioned in the video.

    • @erichoft7154 8 years ago

      +King Schultz Maybe it depends on what exactly you are trying to optimize. If you are looking for a minimum cost you would go in the direction of greatest descent and if you are looking for a maximum profit you would go in the direction of greatest ascent?

    • @erichoft7154 8 years ago

      +Eric Hoft I could be talking out of my ass though.

    • @pitr2596 8 years ago

      That makes total sense, of course. I just mean that the gradient is mathematically defined as the greatest ascent, so it actually points to the greatest ascent and its length is the magnitude of the ascent. That's why it irritates me that we use the gradient here as if it pointed to the biggest descent.

    • @DavidVaughan00 8 years ago

      +King Schultz You're right, gradient points in the direction of greatest ascent, so he is slightly off when he talks about it. Not a huge deal though; just gotta keep in mind when he says "gradient" we should be thinking "negative gradient".

    • @СергейКиян-ш6у 8 years ago +14

      That is why he subtracts the gradient (which is simply adding the gradient multiplied by -1).

  • @TheReaMrBurntSausage 8 years ago +25

    I'm a high school junior and I didn't know what a partial derivative was, so I walked into my AP Calc class today, asked the teacher, and was told to never speak of it again. Apparently my teacher has repressed nightmares of it from college, haha. I looked it up; it seems pretty straightforward, I think I get it now.

    • @danny-bw8tu 7 years ago +3

      I don't think you should study machine learning now, and I don't think you got 'it'; it involves way more than just partial derivatives, kid.

    • @elzilcho222 7 years ago

      that was a year ago, he's probably graduated college by now

    • @jazzpote4316 6 years ago +8

      @da ny You deserve to be kept far away from every learner! Give this 'kid' the hope and belief that he can do it and he will, instead of trying to feed your ego.

    • @superwiseman452 6 years ago

      yup, I know PDEs well enough. Shame on your teacher for turning you away!

  • @florocasta 5 years ago

    Thank you Professor Ng and Stanford University.

  • @jameskhan9383 8 years ago +1

    At 1:01:52 the design matrix X is m by n. Then he multiplies by theta and it looks like we're just left with an m x 1 vector. Is each x in the resulting vector assumed to be n-dimensional, or am I missing something?

    • @jameskhan9383 8 years ago

      Actually I think I'm being stupid. It's because we're multiplying by theta, which is n x 1, right?
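
      Exactly; a quick dimension check in MATLAB (the sizes here are arbitrary):

      m = 5; n = 3;          % arbitrary sizes
      X = randn(m, n);       % design matrix: one training example per row
      theta = randn(n, 1);   % parameter vector
      h = X * theta;         % (m x n) * (n x 1) = m-by-1 vector of predictions
      size(h)                % prints 5 1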

  • @flaviopibetagama 4 years ago +1

    Hi. Great video. I have a question: at time 1:08:40, why is the first element of the product (X*theta - y)^T (X*theta - y) equal to theta^T X^T X theta? Why is it not X^T theta^T X theta?
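
    One way to see it: the transpose rule (AB)^T = B^T A^T gives (X\theta)^T = \theta^T X^T, not X^T \theta^T. Expanding the product:

    (X\theta - y)^T (X\theta - y) = \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y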

  • @jcbmack 12 years ago

    Denzel, it is all about the changing of the thetas, which are parameters (weights) that take on new values with each update. We want to choose a theta that will minimize J(theta). Gradient descent takes the form theta_j := theta_j - alpha * dJ(theta)/d(theta_j), using the partial derivative. The update is performed on all j values at the same time. Thus we begin with some value of theta and then repeatedly change the value of theta to make J(theta) smaller. Alpha is just the learning rate.

  • @DrDizzyMorris 12 years ago +2

    Firstly, I'm loving this, great class! I have a question about the derivation of Gradient Descent. How is the partial derivative of J(theta) taken in the iterative algorithm if it's simply a constant? We already have x, y, and the initial theta (zero vector), so how can we take the partial derivative AND THEN plug in what we know...could the mathematical notation possibly be improved a bit? As it stands now, it's not making sense to me and I've been through an entire calculus sequence.

  • @SiddharthGupta234 8 years ago +4

    Why is there no m involved in the denominator? @1:04:25

    • @davidalexander829 7 years ago

      I wondered the same thing. Instead he arbitrarily assigns 1/2, versus the usual sum of squared differences over n.

  • @newbielives 8 years ago +51

    Am I the only one impressed by the chalkboard that wipes itself clean when he lifts it up and pulls it back down?

    • @UtkarshRuhela 8 years ago +44

      It's a different board, you dumbass.

    • @СергейКиян-ш6у 8 years ago +7

      The board doesn't get cleaned; it's an illusion. The lecturer just lifts one board up and pulls a new one down. Look at 48:00.

    • @mksv7663 8 years ago +14

      He obviously applied a learning algo to it!

    • @joshuaadickerson 8 years ago +17

      I am laughing so hard. I would be impressed by that too, but as others said, they are overlapping chalkboards.

    • @xiangzhang7355 7 years ago

      hahah~~~

  • @praneeta133 15 years ago

    These videos are brilliant!! Andrew is super cool at teaching, thanks Stanford!!

  • @Jacob011 13 years ago

    In my course on linear systems we used the same normal equation for estimating the parameters of a discrete model of a continuous system.
    The thing is, it can be derived in a much simpler way than the one shown in the lecture (without the use of traces, let alone trace algebra). :)
    So besides that, great lecture and certainly motivating.

  • @eng.mohammadshericmrp9251 5 years ago +1

    %% Visualizing gradient descent on a quadratic function using MATLAB:
    clear all
    close all
    clc
    %% Define the input and the output:
    Input = -5:0.1:5;
    Output = Input.^2;
    %% Plot the function:
    plot(Input, Output, 'LineWidth', 3)
    hold on
    %% Set the required parameters:
    step_size = 0.01;
    Iterations = 100;
    %% Initialize the starting point:
    X0(1) = 3.5;
    %% Plot the first step:
    Ite = 1;
    disp(['Iteration ' num2str(Ite) ': Best Minimum = ' num2str(X0(Ite))]);
    plot(X0(Ite), X0(Ite).^2, '.', 'MarkerSize', 30)
    %% Run the iterative gradient descent:
    Ite = 2;
    while (Ite < Iterations)
        % Gradient descent update: x := x - step_size * dy/dx, with dy/dx = 2x
        X0(Ite) = X0(Ite-1) - step_size * 2 * X0(Ite-1);
        disp(['Iteration ' num2str(Ite) ': Best Minimum = ' num2str(X0(Ite))]);
        % Plot the next step:
        plot(X0(Ite), X0(Ite).^2, '.', 'MarkerSize', 30)
        Ite = Ite + 1;
    end

  • @phibouafia 10 years ago +2

    One thing I did not understand is why we introduce batch gradient descent or the stochastic version if the problem can be solved by linear algebra.
    Is this only a way to get through those algorithms, which we will use for more complicated minimization problems? Or do you really use these algorithms for this particular problem?

    • @orrymr 10 years ago

      I think the case may be that doing it using linear algebra can be quite computationally intensive, whereas the gradient descent algorithms don't require inverting a matrix (which is computationally intensive).

    • @daniellee3987 10 years ago +1

      I think it's because of the quantity of the data involved. If the training set is too large, an iterative algorithm might not be practical due to hardware limitations. So, yes, I think we pick the most efficient algorithm depending on the situation.

    • @DavidVaughan00 8 years ago +2

      +phibouafia In general, only some problems (i.e., minimizing least squares with a linear h function) can be solved using closed-form linear algebra. Most can't, unfortunately. I think he shows us the gradient descent methods here, even though we don't need them, because we WILL need them lots more later in the course.
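
      For reference, the closed-form solve is a one-liner in MATLAB (X and y assumed given, with X^T X invertible):

      % Normal equations: theta = (X'X)^(-1) X'y
      theta = (X' * X) \ (X' * y);   % backslash solve; safer than an explicit inverse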

  • @dkwroot 7 years ago

    At 18:30 he talks about the summation over the 'vectors' as being the transpose of theta times x. How did he determine this? Did he use the dot product rule for transposes, where a • b = a^T * b?

    • @kentasuzuki4522 7 years ago

      It's the dot (inner) product: theta^T * x = [theta0, theta1, theta2] * [1; x1; x2] = theta0 + theta1*x1 + theta2*x2

  • @drhoads9 8 years ago

    Around 55:47, should it be written as the gradient of f wrt A, and not be evaluated at A? i.e. drop the "(A)" before the "="?
    Otherwise, you'd be taking the gradient of a real #, unless I'm reading something wrong...

    • @jose-rs 8 years ago

      So, a bit late, my response: in this case A is regarded as a variable, so f(A) would be the same as just f. Here f has no specific value, like A = I or something.

  • @filipturczynowicz-suszycki7728 7 years ago

    I can't express how much I loved this video

  • @punstress 10 years ago

    To Maris, since they square the result, it doesn't matter whether you subtract y-h(x) or h(x)-y. (for some reason there was no reply option under your question. maybe it's too old. but someone else might have the same question.)

  • @KlajdiDervishaj 5 years ago

    He is missing the index superscript i (training example) on y in the last line of the summation equation. Min 1:04:23

  • @vg9311 6 years ago

    At 44:05, he says that the derivative of the function gives the steepest descent and said the TAs would probably elaborate on that in another session. Can someone please explain that?

  • @DrDizzyMorris 12 years ago +1

    Thanks jcbmack, between your comment and reviewing the lesson again I was able to make heads or tails of the concept I was misunderstanding. I was considering the parameters/thetas to be constant when in fact they are varying; why, I have no idea, haha. Cheers!

  • @sushantkhanal480 7 years ago

    At 19:30...
    the lecturer writes h(x) = (theta transpose) times (x),
    but wouldn't that give a 3 by 3 matrix?
    Shouldn't it be h(x) = (x) times (theta transpose)???

  • @tessb 13 years ago

    @astroboomboy On the course website (google it) it says you need linear algebra and probability theory: basic linear algebra, basic probability, and a little programming experience.

  • @armanrainy 13 years ago

    Lecture 2 is done Sir (1:13 am).
    See u 2morrow on lecture 3.
    Thank you Professor. Thank you Stanford.

  • @이인서-h1p 4 years ago

    Hi, I have a question about stochastic gradient descent. At 48:42, the inner loop has an iteration of j = 1 to m. Does m signify the size of the whole dataset? If it does, I think it is not really different from the sum over j = 1 to m in batch gradient descent. So... m in stochastic gradient descent is different from m in batch gradient descent, right???

  • @sdenkasp 13 years ago

    Thanks to my Linear Algebra course in Peru :),
    I understood this nice lecture...
    so I continue with Lesson 3.
    Thanks Stanford!!!

  • @Hero7641 12 years ago

    Stanford has the right idea with spreading all this knowledge for free :D

  • @bennasserchafi304 7 years ago +1

    this is brilliant. thank you so much professor

  • @sboparai09 12 years ago +2

    This lecture would be improved by first introducing a simple quadratic equation (e.g. y = x^2 + 2x + 1), finding a minimum by taking the derivative, setting it to zero, and solving for the value of x (the input parameter causing that minimum). Then extend this concept to a 3D equation with two inputs x, y and output z: take the derivatives, set them to zero, and determine the values of x and y, in this case theta1 and theta2. The point of this lesson was to find a min (or max) given any # of inputs.

  • @ParthPatel643 7 years ago

    The images shown on a white background are pretty hard to make out (like the plot of housing price vs. square footage).

  • @joshuaburkholder 14 years ago

    @matharoofmaths Yes ... and that's why he makes so many mistakes in these lectures and has a hard time answering his students' questions (and occasionally evades student questions) in later lectures ... but if his research papers are any indication, he will definitely be an outstanding teacher in the future.
    All criticism aside, this is much better than what we had before: nothing. Thank you, Dr. Ng and Stanford, for letting us in. This is making Machine Learning that much more accessible.

  • @jamesmeikle8310 8 years ago +25

    most impressive of all is that this lecturer is actually a robot

    • @datalicious43 8 years ago +17

      well that's why he is teaching at Stanford!!! Show some respect. Thanks

  • @abramswee 13 years ago

    agree with caesiume. this type of lecture is great. both free and good.

  • @syn3rman65 6 years ago

    43:35 Doesn't the gradient give the direction of steepest ascent?

  • @tculig 12 years ago

    theta is some constant. If you had a quadratic equation:
    y=3+2x+5x^2
    theta0 would be 3, theta1 would be 2, theta2 would be 5.

  • @ajiteshbhan 4 years ago

    I have a query, guys: the cost function in some examples, like the lectures in the AI series by Andrew, had a 1/m term. My query is: what are the points we need to consider when defining a cost function?

  • @chaityapatel2703 7 years ago

    Any idea where to get the proofs of the two distinct matrix trace properties used for solving the normal equations?

  • @NetIdentity 11 years ago +1

    Try Learning from Data on edX. Easier to follow and easier to work through examples. There are solutions to the homework problems.

  • @joshuaburkholder 16 years ago

    Around time = 43:00, Dr. Ng again gave the wrong description of the gradient.
    Example: Let f(x,y) = x^2 + y^2. Hence, the gradient is ( 2x, 2y ). At the point (1,1), the gradient is (2,2). Since the only local minimum of f(x,y) is at (0,0) and since (1,1)+(2,2)=(3,3), then the gradient at (1,1) points away from the only local minimum of f(x,y); therefore, the gradient does not point toward the direction of steepest descent. The gradient points in the direction of steepest ASCENT.

  • @lsun9593 6 years ago

    Interesting. Usually it comes back to gradient descent when we solve the inverse.

  • @gal1l1l-f7c 8 years ago +1

    Isn't the cost function 1/(2*m) times the sum of the squared errors, instead of just 1/2?

    • @adityasoni121 8 years ago +1

      Galina Staneva yes that m is missing...

    • @JWang-co2vj 8 years ago +1

      I don't think so, actually that 1/2 is just added so as to get a neat expression after taking derivatives.

    • @venkatagangadharraoy5407 8 years ago +1

      If we divide by m, we are subtracting from theta(i) alpha times the average of the sum. If we don't divide by m, we are subtracting alpha times the sum. Technically it doesn't matter whether we divide by m or not, but dividing by m will make us converge faster, I guess. Would love to hear some mathematical explanation around this.

    • @venkatagangadharraoy5407 8 years ago +1

      I have implemented gradient descent in R both with and without m. In both cases it converges. But the catch is that when you don't use m, you have to use a small value of alpha, like 0.01. If I use 0.1 it does not converge.
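
      That matches the math: the 1/m factor only rescales the gradient, so it can be absorbed into the learning rate. Roughly,

      \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}

      is the same update as dropping the 1/m and using \alpha' = \alpha / m, which is why a smaller alpha is needed without the m.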

    • @gal1l1l-f7c 8 years ago

      Thank you very much for your answer! This clears things up!

  • @sharkllama 11 years ago

    I think the use of the trace operator in the derivation of the Least Squares Estimator obfuscates the derivation. I believe this would be easier to follow if the properties of matrix derivatives were used instead.

  • @joshuaburkholder 16 years ago

    Around time = 28:00, Dr. Ng said that if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction of the gradient of J( theta1, theta2 ); however, this is incorrect. The gradient always points toward the direction of steepest ascent, not descent; therefore, if we want to go in the direction of steepest descent from a point J( theta1, theta2 ), then we should go in the direction that is opposite of the gradient ... -Del( J( theta1, theta2 ) ).

  • @Gaiacarra 14 years ago

    @Fusionicon Basic Calculus. Other than the weird stats stuff he brings into play when formulating the error function ("J"), you don't need anything else, so long as you really pay close attention.

  • @JordanShackelford 9 years ago +2

    I can't keep up. I want to learn this but I have no experience with the math he's using. Calculus, right?

    • @hamsterpoop 9 years ago +1

      +Jordan Shackelford It's basic calculus and basic linear algebra... you can find free courses for both online (check out MIT OCW, for example)

    • @blahdeblah1975 9 years ago +1

      Linear algebra is Greek to most people. One semester will set you straight IF you do the homework.

  • @xxanfighter 9 years ago +1

    Why are we trying to minimize (h(x)-y)^2 and not just h(x)-y?

    • @Sonictll 9 years ago

      +Xanfighter: because we only need the absolute value of (h(x)-y) to be minimal, but (h(x)-y)^2 is more convenient for the math.

    • @WahranRai 9 years ago

      +Xanfighter
      Minimizing means that the derivative is equal to zero. We don't care about the coefficient (constant).

    • @xxanfighter 9 years ago

      Thanks guys, really appreciate the answer :)

    • @hamsterpoop 9 years ago +1

      +Xanfighter The reason you minimize the square of the difference/error instead of the absolute error is because the linear algebra works out a lot easier this way. The assumption is that if the absolute difference is high, it is the same as if the difference squared is high. But basically, it's simply for mathematical ease. There is a lot of research on L1 norm minimization, check out the wikipedia article: "Least absolute deviations"

  • @chvan2335 13 years ago

    A couple of lectures in, it's surprisingly easy to get your head around this shit. Guess it all gets very tricky and intricate soon after, though.

  • @psbbboyz123 12 years ago

    What he says, I think, is right... He says that if X^T X is not invertible, which is the case when X is not a full-rank matrix (he says that X is dependent), then you find the pseudo-inverse in that particular case.
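
    In MATLAB this is a one-line sketch using the built-in Moore-Penrose pseudo-inverse (X and y assumed given):

    % When X'X is singular (X not full column rank), use the pseudo-inverse
    theta = pinv(X' * X) * X' * y;   % equivalently here: theta = pinv(X) * y;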

  • @subhaprakash5416 11 years ago +2

    What does that transpose of theta represent?

  • @iliasasdf 12 years ago

    Basically the (negative of the) partial derivative gives you the steepest way to the local minimum (about 43:00, last question)

  • @iliachigogidze6550 5 years ago

    y is missing superscript i at 1:04:10?

  • @jaanuskiipli4647 6 years ago

    At 37:43, shouldn't we normalize the sum by dividing by m? Otherwise the correction amount will blow up the more training data we input.

    • @jaanuskiipli4647 6 years ago

      Okay, forget about the question; the error function J itself is not normalized, so it's okay for everything to blow up.

  • @sboparai09 12 years ago

    The method described, "(batch) gradient descent", is just optimization: iterating over a training set from a selected start point (initial parameters) to find new minima with their respective parameters. He is right, it can be slow if you have MANY parameters, since that will increase the number of combinations. The derivative eliminates useless combinations. The stochastic version is better because it tries to "guess" the direction and doesn't attempt to iterate over every available combination.

  • @DragonSlave49 11 years ago

    He doesn't say how they decide alpha. It is just a "step size" for the gradient descent. It is the "weight" of the change in the parameter theta. Larger alpha means theta will converge faster but less accurately.

  • @shivananda30 5 years ago

    what does convergence mean here?? Is it the actual value converging to the predicted value?

    • @sanjhECE 5 years ago +2

      Moving towards the local minimum or global minimum, where J(theta) will be minimal.

    • @shivananda30 5 years ago

      @@sanjhECE Thank you so much

  • @fupopanda 5 years ago

    His notations are listed here: 13:41

  • @CSEfreak 11 years ago

    Once you reach the min (according to the GD algorithm) it stops moving; theta doesn't change anymore. That's when you know you've reached the local minimum.

  • @djremixmusic6598 3 years ago

    I don't know about the math formulas in the lecture so what is the solution for me?

  • @hnomier 15 years ago

    Thank you stanford ...really great work ...The lectures are great

  • @OrakzaiSays 5 years ago

    Can we somehow get those TA classes on Friday?

  • @iliachigogidze6550 5 years ago

    I think the batch gradient descent formula is missing a 1/m at 44:39. Am I correct?

  • @GraceTao 12 years ago +1

    I cannot understand most of the equations in lecture 2. What kind of background knowledge should I look for?

    • @haoc5698 5 years ago

      You can check the least squares solution.

  • @taketaxisky 14 years ago

    For batch and stochastic gradient descent, is alpha (learning rate) usually the same size?

  • @sandysandeep7227 7 years ago +2

    I haven't learnt the math. So can someone please explain what exactly θ is? What is θ0 + θ1X? I understood the hypothesis, but I don't know what θ0 + θ1X actually means.

    • @lahirusomaratne7568 7 years ago +1

      Sandy Sandeep This means that the algorithm is going to come up with a simple linear regression model, where theta zero denotes the price of a very small house (theoretically zero square feet, though as you know there is no such house) and theta one denotes the price increase per additional square foot.

    • @manisharma3068 7 years ago

      Hey Sandy, theta0 is the base price; think of it as the minimum price for all houses, the price they all have to have as a minimum. X is some feature of the house (size, number of bedrooms, etc.) which we multiply by a coefficient theta1. Our hypothesis is that each house has a base price and that the feature X affects the price of the house by a factor of theta1. So each unit increase in X increases the price of the house by theta1. The only thing left is to compute the value of theta1, which the professor does at the end of the video.

    • @FelixCrazzolara 6 years ago

      I just started watching this lecture too, and I'm only in my second year of EE, but if you don't understand this stuff I guess you'd be better off thoroughly reading a book about linear algebra first, and probably some theory about signals and systems. He models the target as a linear function of the input, plus a constant term. I guess this is how you should think about this stuff in general.
      But as I said, only a 2nd-year Bachelor student^^
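
      To make it concrete, a tiny MATLAB sketch with made-up numbers (both theta values are purely illustrative):

      theta0 = 50000;              % hypothetical base price, in dollars
      theta1 = 120;                % hypothetical price per square foot
      x = 2000;                    % house size in square feet
      price = theta0 + theta1*x    % predicted price: 290000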

  • @kiriappeee 13 years ago

    Hmm... so Alvin looks at the road ahead and records the steering direction. So what if the road ahead is a curve, but since I'm on a straight patch for the moment my steering direction is still straight? Seeing where the cam was placed, and that there was no bonnet in the pictures, it must have been calculated for a few metres ahead. Does that affect anything? In the video it seems like Alvin's response is about 0.5 seconds behind a typical human response, especially in the live tests.

  • @eachonly 11 years ago

    I just wonder if the stochastic gradient algorithm is more efficient than the batch gradient algorithm given that the number of data points n is large. The number of iterations for the batch gradient algorithm should be far less than n.

  • @canadianrepublican1185 8 years ago

    Are the discussion sessions posted online?

  • @linhelen8222 5 years ago

    overview: batch gradient descent, stochastic gradient descent, normal equation
    batch ~: update Theta after scanning all samples
    stochastic~: update Theta after scanning one sample (useful when number of samples is large)
    normal equation: the analytical solution of Theta without iteration

  • @Gauravsaxena2512 11 years ago

    Just wondering, why is the normal equation only for the OLS case? Wondering what assumption was made in the derivation that restricts the equation to this specific case?

  • @rayptucha3515 10 years ago

    At 1:11:35, we have C=X'X and C' = X'X
    Can someone please explain??

    • @Korzakapitany 10 years ago

      (A * B)' = B' * A', this means if we apply this to (X' * X) we will get: (X' * X)' = X' * (X')' = X' * X, thus here it is the same thing.

    • @ritagatspy4750 9 years ago

      C = X'X, so C' = (X'X)' = X'X = C, using (AB)' = B'A'; i.e., X'X is symmetric. (For a single column x, x'x is even a scalar, so 1 = 1', though xx' is a matrix.)

  • @biswajeettripathy773 7 years ago +2

    Why is 1/2 multiplied with the squared value of the difference between the predicted and actual value? Why not any other constant, or keep it as it is?

    • @fakal007 7 years ago +2

      It's just because when you take the derivative of the squared term, you get 1/2 * 2, which is 1, and so it's nicely legible again :)

  • @juludd 15 years ago

    Could someone explain how to get Gradient tr(ABA^TC) = CAB + C^TAB^T? I can't see how you can get an addition on the right-hand side, at least not from within the rules he described in the lecture. Could one use the chain rule for differentiation?
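
    A sketch of one route: apply the product rule, letting each occurrence of A vary in turn, and use the identities \nabla_A tr(AB) = B^T and \nabla_A tr(A^T C) = C:

    \nabla_A tr(A B A^T C) = \nabla_A tr(A [B A^T C]) + \nabla_A tr([C A B] A^T)
                           = (B A^T C)^T + C A B
                           = C^T A B^T + C A B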

  • @salmonito2 11 years ago

    The lecture notes differ. The batch gradient descent in the notes calculates the residual (if I understand correctly, Data minus Fit), y - h(x), the square of which we try to minimize, but the prof has h(x) - y. Which one is correct?