Lecture 03 - The Linear Model I

  • Published 9 Nov 2024

COMMENTS • 131

  • @justpaperskw
    @justpaperskw 9 years ago +28

    The best lectures in machine learning that I've listened to online. Thanks, Professor Yaser.

  • @johnallard2429
    @johnallard2429 9 years ago +98

    Such a great professor, he is so clear in his explanations.

    • @helenlundeberg
      @helenlundeberg 9 years ago +2

      agreed.

    • @alexdamado
      @alexdamado 6 years ago +6

      Indeed, he is an amazing professor. It is clear that he knows his mathematics. He is confident in the responses and goes into a little more depth than the usual machine learning professor. I really appreciate his desire to teach the theory behind the applications

    • @cosette8570
      @cosette8570 5 years ago +2

      I’m thrilled to have studied at Caltech for this course and actually talk to him! He’s so nice!

    • @sivaramkrishnanagireddy8341
      @sivaramkrishnanagireddy8341 4 years ago +5

      @@cosette8570 What is your annual salary now ?

    • @kNowFixx
      @kNowFixx 4 years ago +3

      @@sivaramkrishnanagireddy8341 Very weird question and I'm not surprised the person you replied to didn't bother to respond.

  • @rvoros
    @rvoros 11 years ago +2

    Prof. Yaser Abu-Mostafa is by far the best lecturer I've ever seen. Well done, great course!

  • @atfchan
    @atfchan 12 years ago +3

    Thank you, Prof. Yaser Abu-Mostafa for these lectures. The concepts are concisely and precisely explained. I especially like how he explains the equations with real world examples. It makes the course material much more approachable.

  • @parvathysarat
    @parvathysarat 7 years ago +9

    Wow. This lecture was great. The difference between using linear regression for classification and using linear classification outright couldn't have been explained better (I mean it was just amazing).

  • @gautamkarmakar3443
    @gautamkarmakar3443 8 years ago +6

    What a great authority, fielding questions like a true pundit on the subject. Great respect, and thanks a lot.

  • @Tinou49000
    @Tinou49000 10 years ago +17

    Great, great lecture. Thank you, Prof. Yaser Abu-Mostafa; it is clear and well delivered!

    • @jonsnow9246
      @jonsnow9246 7 years ago

      What is the graph shown at 15:42? Why are there ups and downs in E_in?

    • @laurin1510
      @laurin1510 6 years ago

      Jon Snow, in case it still matters: when you iterate in the PLA algorithm you adjust the weights of the perceptron in each iteration, so in each iteration you might classify more points correctly or fewer, and this is reflected in the in-sample error as well as the out-of-sample error.

  • @prasanthaluru5433
    @prasanthaluru5433 11 years ago +2

    For the first time, I am finding machine learning interesting and learnable. Thank you very much, sir.

  • @ahmednasrzc
    @ahmednasrzc 8 years ago +1

    Thank you, Prof. Yaser Abu-Mostafa.
    Very, very, very clear and detailed; it's really hard to find a genius explanation like that.

  • @manjuhhh
    @manjuhhh 10 years ago +2

    Thank you Caltech and Prof.

  • @btsjiminface
    @btsjiminface 4 years ago

    Such a compassionate lecturer 🥺

  • @xelink
    @xelink 10 years ago +2

    Good lecture.
    I think the question had to do with correlation between a transformed feature and the original feature. This is describing the problem of multicollinearity. With multicollinearity, numerical instability can occur. The weight estimates are unbiased (on average you'd expect them to be correct), but they're unstable: running with slightly different data might give different estimates.
    E.g., estimating SAT scores: given weight, height, and parent's occupation, one might expect all three to be correlated. The OLS algorithm won't know which to properly credit.
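
    A minimal numpy sketch of the multicollinearity instability described above (all data and variable names here are made up for illustration):

        import numpy as np

        rng = np.random.default_rng(0)
        n = 100
        x1 = rng.normal(size=n)
        x2 = x1 + 0.01 * rng.normal(size=n)        # nearly collinear with x1
        X = np.column_stack([np.ones(n), x1, x2])  # design matrix with a constant term

        for trial in range(3):
            # Same underlying relationship each time, fresh noise each time.
            y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)
            w, *_ = np.linalg.lstsq(X, y, rcond=None)
            print(trial, np.round(w, 2))  # w[1] and w[2] swing a lot; their sum stays near 5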

  • @brainstormingsharing1309
    @brainstormingsharing1309 3 years ago +1

    Absolutely well done and definitely keep it up!!! 👍👍👍👍👍

  • @nias2631
    @nias2631 5 years ago +6

    Dude is a rockstar, just threw my panties at my screen again... Great lecturer.

  • @rezagraz
    @rezagraz 11 years ago +14

    What an amazing and charming teacher...I wanna have a beer with him.

    • @sanuriishak308
      @sanuriishak308 5 years ago +10

      He is Muslim... he doesn't drink beer. Have coffee with him instead.

    • @abubakarali6279
      @abubakarali6279 4 years ago

      Any new material from Yaser?

    • @your_name96
      @your_name96 3 years ago

      @@sanuriishak308 I think he prefers to 'sip on tea while the machine learns'

  • @AniketMane-dn5uh
    @AniketMane-dn5uh 3 years ago +2

    Howard Wolowitz is so good! One of the best lectures though.

  • @avidreader100
    @avidreader100 8 years ago +5

    I learnt a method in another class where the primary classification features were selected based on the things that cause the maximum change in entropy.

    • @EE-yv7xg
      @EE-yv7xg 8 years ago +4

      Decision Tree Algorithm.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      A clean classification represents the lowest level of entropy (things aren't "muddled"). So going from the current situation to the lowest level of entropy will result in a maximum change in entropy.
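
    A small Python sketch of the entropy-based selection idea in this thread (toy labels; the "information gain" of a split is the change in entropy it achieves):

        import numpy as np

        def entropy(labels):
            """Shannon entropy (in bits) of a sequence of class labels."""
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log2(p))

        # Labels before a candidate split, and in the two resulting branches.
        parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
        left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

        # A decision tree picks the split with the largest gain (entropy reduction).
        gain = (entropy(parent)
                - (len(left) / len(parent)) * entropy(left)
                - (len(right) / len(parent)) * entropy(right))
        print(round(gain, 3))  # 0.189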

  • @abdurrezzakefe5308
    @abdurrezzakefe5308 7 years ago +6

    A million times better than my professor.

  • @LucaMolari
    @LucaMolari 11 years ago +3

    Thank you, great and relaxing lectures!

  • @melihcan8467
    @melihcan8467 7 years ago

    Great lecture & channel. Thanks for such an opportunity.

  • @Mateusz-Maciejewski
    @Mateusz-Maciejewski 5 years ago

    37:10 The formula is correct if we define the gradient as the Jacobian matrix transposed, not just the Jacobian matrix. In optimization this convention is very helpful, so I think he is using it here.
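
    For reference, with the column-gradient (transposed-Jacobian) convention the comment describes, the gradient on the slide is

        \nabla_w \, \frac{1}{N}\lVert Xw - y \rVert^2 \;=\; \frac{2}{N}\, X^{\mathsf T}(Xw - y),

    a column vector with the same shape as w.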

  • @danielgray8053
    @danielgray8053 3 years ago +3

    How come no one here in the comments admits that they are confused? I don't understand why college professors are so out of touch with their students. Why does learning hard engineering topics always have to be such a struggle? He is teaching as if we have years of experience in the field.

    • @danielgray8053
      @danielgray8053 3 years ago

      Like, in lecture 2 he mentioned that E_in was used in place of nu, which is basically the sample mean. Now he is using it here in a completely different context without explaining why? It seems like a lot of other students here have the same question. PLEASE SOMEONE OUT THERE JUST LEARN HOW TO TEACH PROPERLY ugh

    • @Milark
      @Milark 5 months ago

      Being confused doesn't mean the professor is incapable. It's a natural part of the learning process.
      If no one teaches right according to you, the problem might not be the teacher.

  • @pippo1116
    @pippo1116 7 years ago +2

    I like the 1.5x speed; it works perfectly.

  • @dimitriosalatzoglou5033
    @dimitriosalatzoglou5033 1 year ago

    Amazing content. Thank you.

  • @edvandossantossousa455
    @edvandossantossousa455 3 years ago

    a great lecture. Thanks for sharing it.

  • @ragiaibrahim6648
    @ragiaibrahim6648 9 years ago +1

    AWESOME, lots of thanks.

  • @huleinpylo3906
    @huleinpylo3906 12 years ago

    Interesting lecture.
    His examples help a lot in understanding the class.

    • @FsimulatorX
      @FsimulatorX 2 years ago

      Wow, this is the oldest comment I've seen on these videos so far. How are you doing these days? Did you end up pursuing machine learning?

    • @huleinpylo3906
      @huleinpylo3906 2 years ago

      @@FsimulatorX nope, I am working in cyber security.

  • @fabiof.deaquino4731
    @fabiof.deaquino4731 6 years ago

    "Surprise, surprise"... What a great professor!

  • @astyli
    @astyli 11 years ago

    Thank you for uploading this!

  • @mhchitsaz
    @mhchitsaz 11 years ago

    fantastic lecture

  • @nischalsubedi9432
    @nischalsubedi9432 4 years ago +1

    What is the difference between the simple perceptron algorithm and the linear classification algorithm?

  • @pyro226
    @pyro226 5 years ago +1

    Not sure I'm a fan of the "symmetry" measure. The number 8 in that example is clearly offset from center; it would only have apparent symmetry because it's a wide number with a lot of black space. If a 1 is slanted and off-center, it will have nearly zero apparent symmetry, because only its center point would have vertical symmetry. Oh well, we'll see where it goes.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      You probably would use a bit more sophistication. After you flip the number you could "slide" it over the original number looking for the maximum matching value.
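
    A rough Python sketch of intensity and symmetry features in the spirit of the lecture (the exact definitions used for the slides may differ; this version just flips and compares, without the sliding refinement suggested above):

        import numpy as np

        def features(img):
            """img: 2-D grayscale array with values in [0, 1]."""
            intensity = img.mean()
            # Asymmetry = average absolute difference between the image and its
            # mirror images; symmetry is defined as the negative of that.
            asym = (np.abs(img - np.fliplr(img)).mean()
                    + np.abs(img - np.flipud(img)).mean()) / 2
            return intensity, -asym

        img = np.zeros((16, 16))
        img[4:12, 7:9] = 1.0            # a centered vertical bar: symmetry ~ 0
        print(features(img))            # (0.0625, -0.0)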

  • @harpreetsinghmann
    @harpreetsinghmann 7 years ago +1

    Key takeaway: linearity in the weights, not in the "variables" (@52:00).
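
    For example (the nonlinear-transformation point made near the end of the lecture), with

        \Phi(x) = (1,\; x_1,\; x_2,\; x_1^2,\; x_2^2), \qquad h(x) = \operatorname{sign}\!\big(w^{\mathsf T}\Phi(x)\big),

    the boundary is nonlinear in the inputs x, yet the model is still linear in the weights w, so PLA and the pseudo-inverse apply unchanged in the transformed space.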

  • @PerpetualEpiphany
    @PerpetualEpiphany 10 years ago

    this is really good

  • @fndTenorio
    @fndTenorio 10 years ago

    Fantastic!

  • @indatawetrust101
    @indatawetrust101 3 years ago

    When he says hypotheses h1, h2, etc., does he mean different hypotheses that fit the same general form (e.g. all 2nd-order polynomials) or different hypothesis forms (e.g. linear, polynomial, etc.)? Thanks

  • @mackashir
    @mackashir 8 years ago +1

    In the previous lectures, E(in) was used for in-sample performance. Was it replaced by in-sample error in this lecture? Am I missing something?

    • @Bing.W
      @Bing.W 7 years ago

      Performance is a rough wording; error is one way to actually evaluate the performance.

    • @solsticetwo3476
      @solsticetwo3476 6 years ago

      I see it as follows: E(in) is the fraction of "red" marbles, which is the fraction of wrong predictions by your hypothesis, which is the error of that h. That fraction is the probability q of a Bernoulli distribution, whose expected value is E(q) = q.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      @@solsticetwo3476 The other way around. E(out) is the fraction of red marbles, i.e. the fraction of wrong predictions by your hypothesis. This value has nothing to do with a probability distribution. E(in), the in-sample error, is coupled to selection and hence is coupled to a probability distribution.
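
    For reference, the two quantities being discussed, in the course's notation for classification (with [[.]] equal to 1 when its argument is true and 0 otherwise):

        E_{\text{out}}(h) = \mathbb{P}_{x}\big[\, h(x) \ne f(x) \,\big], \qquad
        E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} [[\, h(x_n) \ne f(x_n) \,]].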

  • @SphereofTime
    @SphereofTime 6 months ago +1

    6:14

  • @Michellua
    @Michellua 7 years ago +1

    Hello, where can I find the dataset used to implement the algorithms?

    • @hetav6714
      @hetav6714 4 years ago

      The digit dataset is called the MNIST dataset.

  • @AndyLee-xq8wq
    @AndyLee-xq8wq 1 year ago

    Great!!

  • @mohamedelansari5427
    @mohamedelansari5427 10 years ago +1

    Hello, it's a nice lecture!! Thanks to Caltech and Prof. Yasser. Can anyone tell me where I can get the corresponding slides and textbooks? Thanks

  • @douglasholman6300
    @douglasholman6300 5 years ago

    Just to clarify, what piece of theory guarantees that the in-sample error will track the out-of-sample error, i.e. that E_in ≈ E_out?

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      That there is a probability distribution on X. What this says (more or less) is that what I saw happen in the past (i.e. what I selected, which drives my in-sample error) says something about what will happen in the future. The "what will happen in the future" part is a statement about my out-of-sample data and drives my out-of-sample error. Hoeffding's inequality places numerical constraints on the relationship and is based on this probability assumption.
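
    For reference, the numerical constraint referred to above is Hoeffding's inequality from Lecture 02: for a single, fixed hypothesis h and sample size N,

        \mathbb{P}\big[\, |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon \,\big] \;\le\; 2e^{-2\epsilon^2 N},

    and with a finite hypothesis set of size M the right-hand side picks up a factor of M via the union bound.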

  • @googlandroid176
    @googlandroid176 2 years ago

    The calculation sets the derivative to zero, which I guess means finding where the slope is 0, but what if there is more than one such minimum? How is the global minimum guaranteed?
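
    One standard way to answer this (not spelled out in the lecture itself): the in-sample squared error is a convex quadratic in w, so there are no spurious local minima. Concretely,

        E_{\text{in}}(w) = \frac{1}{N}\lVert Xw - y \rVert^2, \qquad
        \nabla^2 E_{\text{in}}(w) = \frac{2}{N} X^{\mathsf T} X \succeq 0,

    so setting the gradient to zero gives the normal equations X^{\mathsf T} X w = X^{\mathsf T} y, and when X^{\mathsf T} X is invertible the unique stationary point w = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y is the global minimum. (If X^{\mathsf T} X is singular there is a whole set of global minimizers, but still no merely local ones.)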

  • @linkmaster959
    @linkmaster959 5 years ago +1

    I thought linear regression was for fitting a line to data to forecast future values, but here it is explained in terms of a separation boundary for classification. Can someone explain?

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago +1

      For data extrapolation you look for the line which minimizes the distance of ALL points to that line. In the classification problem you look for the line which minimizes the distance of WRONGLY CLASSIFIED points to that line.

  • @wayenwan
    @wayenwan 11 years ago

    Are there subtitles available?

  • @rippa911
    @rippa911 12 years ago

    Thank you.

  • @shahardagan1584
    @shahardagan1584 6 years ago

    What should I do if I didn't understand all the math in this lecture?
    Do you have a resource that explains it quickly?

    • @majorqueros6812
      @majorqueros6812 5 years ago +1

      Some videos on statistics/probability and linear algebra would already help a lot. Khan academy has many great videos.
      www.khanacademy.org/math/statistics-probability
      www.khanacademy.org/math/linear-algebra

    • @MrCmon113
      @MrCmon113 5 years ago

      Depends on what you don't understand. There should be introductory courses to linear algebra, analysis and stochastics at your university.

  • @naebliEcho
    @naebliEcho 11 years ago +1

    Does anybody have the homework assignments accompanying these lectures? I don't have a registered account for the course, and registration is closed now. :(

  • @ishanprasad910
    @ishanprasad910 5 years ago

    Quick question: what is the y-axis label at 20:00? What probability are we tracking for E_in and E_out?

    • @blaoi1562
      @blaoi1562 5 years ago

      E_in and E_out represent the in-sample error and the out-of-sample error, respectively. Usually you don't have access to E_out, the out-of-sample error, but the in-sample error approximates the out-of-sample error better the more data you have.
      The y-axis represents the error fraction on the data, while the x-axis represents the iterations.

    • @abhijeetsharma5715
      @abhijeetsharma5715 3 years ago

      On the y-axis, we are tracking the fraction of mislabeled examples. So E_in is the fraction of training-set examples that we got wrong. Similarly, E_out is the fraction of examples (not from the training set) that we got wrong.

  • @andysilv
    @andysilv 7 years ago

    It seems to me that there is a small typo on the 18th slide (48:25). To perform classification using linear regression, it seems one needs to check sign(wx - y) rather than sign(wx).

    • @李恒岳
      @李恒岳 6 years ago

      It would really have confused me without your comment!

    • @李恒岳
      @李恒岳 6 years ago +1

      Watching this video a second time made it clear to me that there is no mistake. The threshold value is contained in w0, so sign(wx) is correct.

    • @Omar-kw5ui
      @Omar-kw5ui 4 years ago

      wX is the output for each data point in the training set; taking the sign of each output gives you its classification.

  • @DaneJessen101
    @DaneJessen101 9 years ago

    Great lecture! Does the use of features ('features' are "higher-level representations of raw inputs") increase the performance of a model out of sample? Does it somehow add information? Or does it simply make it computationally easier to produce a model? I'm working on a problem where this could potentially be very useful.
    I could also see how the use of features could make a model more meaningful for human interpretation, but there is a risk as well that interpretations will vary between people based on what words are being used. 'Intensity' and 'symmetry' are used here, which are great examples, but it could very quickly get more abstract or technical.
    Thank you in advance to anyone who has an answer to my question!

    • @rubeseba
      @rubeseba 9 years ago +2

      It depends on whether your features could be learned implicitly by your model. That is, let's say your original data are scores on two measures: IQ and age, and you want to use those to predict people's salaries. Let's also assume that the true way in which those are related is: salary = (IQ + age)*100 + e, where e is some residual error not explained by these two variables. In this case you could define a new feature that is the sum of IQ and age, and this would reduce the number of free parameters in your model, making it slightly easier to fit. Given enough data to train on however, your old model would perform just as well, because the feature in the new model is a linear combination of features in the old model. (That is, in the old model you would have w1 = w2 = 100, whereas in the new one you would just have w1 = 100.)
      Often, however, we define new features not (just) to reduce the number of model parameters, but to deal with non-linearities. In the example of the written digits, you can't really predict very well which digit is written in an image by computing a weighted sum over pixel intensities, because the mapping of digits to pixel values happens in a higher order space. So in this case we can greatly improve the performance of our model if we define our features in the same higher order space. The reason is not that we add information that wasn't in the data before, but that the information wasn't recoverable by our linear model.

    • @DaneJessen101
      @DaneJessen101 9 years ago

      rubeseba - That was very helpful. Thank you!

    • @DaneJessen101
      @DaneJessen101 9 years ago

      Kathryn Jessen Kathryn - I was born without the ability to be a mom :/ I will never experience the depth and vastness of a mother's understanding. I can sure pick a thing or two and try to pretend ;)

  • @rahulrathnakumar785
    @rahulrathnakumar785 5 years ago

    Great lecture overall. However, I couldn't really understand how to implement linear regression for classification...

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago +1

      With your training data you know which points have been wrongly classified. Look for the line which minimizes the distance (least-squares error) of all wrongly classified data. Each time you move the line, other data may become wrongly classified, so you have to redo the calculation, but look for the line which gives you the minimum overall value for the line's associated set of wrongly classified data.
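
    A minimal Python sketch of the one-step recipe from the lecture (regress on the +/-1 labels via the pseudo-inverse, then classify with the sign of the signal); the data and names below are made up for illustration:

        import numpy as np

        def linreg_classifier(X, y):
            """X: (N, d) inputs without the bias column; y: (N,) labels in {-1, +1}."""
            Xb = np.column_stack([np.ones(len(X)), X])   # add x0 = 1 for the threshold
            w = np.linalg.pinv(Xb) @ y                   # w = X_dagger y

            def predict(Xnew):
                Xnb = np.column_stack([np.ones(len(Xnew)), Xnew])
                return np.sign(Xnb @ w)                  # classification = sign of the signal

            return w, predict

        rng = np.random.default_rng(1)
        X = np.vstack([rng.normal(+2.0, 1.0, size=(50, 2)),
                       rng.normal(-2.0, 1.0, size=(50, 2))])
        y = np.hstack([np.ones(50), -np.ones(50)])
        w, predict = linreg_classifier(X, y)
        print("in-sample error:", np.mean(predict(X) != y))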

  • @TomerBenDavid
    @TomerBenDavid 8 years ago

    Perfect

  • @raulbeienheimer
    @raulbeienheimer 3 months ago

    This is not the Squared Error but the Mean Squared Error.
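
    For reference, the error written on the slide at this point is indeed averaged over the N data points,

        E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N}\big(w^{\mathsf T}x_n - y_n\big)^2 = \frac{1}{N}\lVert Xw - y \rVert^2,

    i.e. the mean squared error, as the comment notes; the lecture simply calls it the in-sample (squared) error.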

  • @movax20h
    @movax20h 8 years ago +2

    How was E_out computed in each iteration? Was a subsample of the given sample used for training, with E_out estimated on the full sample?

    • @fuadassayadi1
      @fuadassayadi1 8 years ago

      +movax20h You cannot calculate E_out because you do not know the whole population of samples. But E_in can tell you something about E_out. This relation is explained in Lecture 02 - Is Learning Feasible?

    • @movax20h
      @movax20h 8 years ago +3

      +fouad Mohammed That is exactly why I am asking. The graph clearly shows E_out being calculated somehow. I guess this is done using validation techniques from one of the later lectures. Anyway, this is a synthetic example, so it is not hard to generate a known (but treated as unknown) target function and as many training and test examples as you want, just for the example's sake.
      I don't believe it was calculated via Hoeffding, because that is a probabilistic inequality, and it would amount to circular reasoning here: let's use it to predict E_out, and then use this prediction to claim that E_in tracks E_out well. That might be correct in a probabilistic sense, but it is not a good way of demonstrating it at all.

    • @abaskm
      @abaskm 8 years ago +5

      Assume that the data was already labeled and used to generate E_out (the test set). Take a portion of this data (the training sample) and generate your hypothesis. You can then use that hypothesis to measure E_in (on the training data) and E_out (on the test set). He's making the assumption that the whole set is labeled, which doesn't usually apply in the real world.
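
    One plausible way to produce such a plot on a synthetic example, along the lines suggested above (known target, small training set for E_in, large held-out set to estimate E_out); everything below is illustrative, not the actual setup used for the slide:

        import numpy as np

        rng = np.random.default_rng(2)

        def h(w, X):                      # linear classifier: sign(w^T x), with x0 = 1
            return np.sign(np.column_stack([np.ones(len(X)), X]) @ w)

        def err(w, X, y):                 # fraction of misclassified points
            return np.mean(h(w, X) != y)

        w_target = np.array([0.1, 1.0, -1.0])                 # known target function
        X_train = rng.uniform(-1, 1, size=(100, 2));    y_train = h(w_target, X_train)
        X_test  = rng.uniform(-1, 1, size=(10000, 2));  y_test  = h(w_target, X_test)

        w = rng.normal(size=3)            # some hypothesis produced by a learning run
        print("E_in  ~", err(w, X_train, y_train))
        print("E_out ~", err(w, X_test, y_test))   # estimated on the large held-out set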

  • @chujgowi
    @chujgowi 11 years ago +2

    Check the iTunes page.
    There are homework sets and solutions available for free for this course.

  • @jonsnow9246
    @jonsnow9246 7 years ago +1

    What is the graph shown at 15:42? Why are there ups and downs in E_in?

    • @solsticetwo3476
      @solsticetwo3476 6 years ago +1

      Jon Snow The error on the sample set could increase in the next iteration if the algorithm changes the hypothesis (weights) in a way that hurts the classification. The PLA is a random walk in the weight subspace.

    • @Omar-kw5ui
      @Omar-kw5ui 4 years ago

      @@solsticetwo3476 PLA is not really a random walk in the weight subspace... The algorithm optimises the weights for a given (randomly chosen) misclassified point. Fixing the weights so as not to misclassify this point may lead to other points that were previously correctly classified becoming misclassified. Hence the rise in error, followed by a drop, etc. The algorithm works on non-separable datasets too, so you can't really call it a random walk; it clearly has a set of rules it's following.
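
    A small Python sketch of the PLA update being discussed, recording E_in after every update so the non-monotonic behaviour is visible (made-up data; the bias column is assumed to already be in X):

        import numpy as np

        def pla_with_tracking(X, y, iters=200, seed=0):
            """y in {-1, +1}; returns final weights and E_in after each update."""
            rng = np.random.default_rng(seed)
            w = np.zeros(X.shape[1])
            history = []
            for _ in range(iters):
                miscls = np.flatnonzero(np.sign(X @ w) != y)
                if len(miscls) == 0:
                    break                              # separable case: done
                i = rng.choice(miscls)                 # a random misclassified point
                w = w + y[i] * X[i]                    # PLA update: w <- w + y_n x_n
                history.append(np.mean(np.sign(X @ w) != y))
            return w, history

        rng = np.random.default_rng(1)
        X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
        y = np.sign(X @ np.array([0.0, 1.0, -1.0]) + 0.3 * rng.normal(size=200))
        w, history = pla_with_tracking(X, y)
        print(history[:10])   # typically goes up and down rather than decreasing monotonically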

  • @ajkdrag
    @ajkdrag 5 years ago

    I didn't understand how he obtained X-transpose after the differentiation.

    • @adrianbakke1732
      @adrianbakke1732 5 years ago

      math.stackexchange.com/questions/2128462/derivative-of-squared-frobenius-norm-of-a-matrix :)

  • @JoaoVitorBRgomes
    @JoaoVitorBRgomes 4 years ago

    Why does learning only occur in a probabilistic sense? What other way could it be?

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      Literally: learning in a non-probabilistic (absolute or certain) sense. However, this runs up against the so-called induction problem first described by the philosopher Hume (you can google it). In our context, Hume's induction problem can be translated as: "If I pick a number of balls and they always turn out to be green, can I conclude (with certainty) that all balls in the bin are green?" In the lecture the statement is made that you can't. The philosophical discussion is a bit more nuanced. In any case, machine learning avoids this discussion by sidestepping statements made with certainty and moving to (weaker) probabilistic statements.

    • @JoaoVitorBRgomes
      @JoaoVitorBRgomes 3 years ago

      @@roelofvuurboom5431 Great reply, thanks! I loved how you tied it to Hume; I didn't think of this connection. Thanks for linking things.

  • @Dwright3316
    @Dwright3316 8 years ago

    42 minutes !! BANG !!

  • @auggiewilliams3565
    @auggiewilliams3565 7 years ago +3

    2.8 doesn't happen at Caltech..... 3.8 doesn't happen at Tribhuvan University

  • @ZeeshanAliSayyed
    @ZeeshanAliSayyed 9 years ago +22

    "+1" and "-1" among other things happen to be real numbers! LOL

    • @vyassathya3772
      @vyassathya3772 8 years ago +3

      +Zeeshan Ali Sayyed There is something genius about the simplicity though lol

    • @ZeeshanAliSayyed
      @ZeeshanAliSayyed 8 years ago +1

      Vyas Sathya Indeed. :P

  • @abubakarali6279
    @abubakarali6279 4 years ago

    How did we get X^T at 38:25?

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      In linear algebra, the definition of ||v||^2 for a vector v is (v^T)(v). Apply this formula to ||Xw - Y||^2, where the vector is Xw - Y.

    • @karlmadl
      @karlmadl 2 years ago

      This comment is old and the OP has probably figured it out by now, but for anyone else who wonders:
      It has to do with notation, the derivative of Xw WRT to w is either X or X^T depending on your notation. Here, we're using denominator layout notation, so we use X^T. (en.wikipedia.org/wiki/Matrix_calculus#:~:text=displaystyle%20%5Cmathbf%20%7BI%7D%20%7D-,A%20is%20not%20a%20function%20of%20x,%7B%5Cdisplaystyle%20%5Cmathbf%20%7BA%7D%20%7D,-%7B%5Cdisplaystyle%20%5Cmathbf%20%7BA)
      The natural follow-up question is why we use one notation over the other. When did we choose our notation? Notation choices are often just the choice of the author and can make formulae more succinct or clearer. I think this answers the question without going too far off on a tangent; any further questions are probably best answered by your own inquiry.

  • @bryanchambers1964
    @bryanchambers1964 5 years ago

    I wish he showed how to write the algorithms in Python, because he teaches very well.

    • @FsimulatorX
      @FsimulatorX 2 years ago

      There are plenty of other resources for that. Once you understand the theoretical component, implementation becomes easy

  • @SJohnTrombley
    @SJohnTrombley 7 years ago +4

    People don't get 2.8s at Caltech? I smell grade inflation.

    • @MrCmon113
      @MrCmon113 5 years ago

      I understood that as people with 2.8 or better not going there. Who takes courses in all of those fields at university?

  • @fahimhossain165
    @fahimhossain165 4 years ago

    Never imagined that I'd learn ML from Emperor Palpatine himself!

  • @pyro226
    @pyro226 5 years ago

    32:22 MSE (mean squared error), for those with a statistics background.

  • @millerfour2071
    @millerfour2071 5 years ago

    22:48, haha!

  • @HendSelim87
    @HendSelim87 11 years ago

    What's wrong? The video is not opening.

  • @akankshachawla2280
    @akankshachawla2280 5 years ago +1

    44:50

    • @tobalaba
      @tobalaba 4 years ago

      best sound ever.

  • @niko97219
    @niko97219 8 years ago +2

    The linear regression is terribly explained.

    • @AlexEx70
      @AlexEx70 8 years ago +6

      He did say that this is just an appetizer. A more detailed explanation will come later.

    • @abhijeetsharma5715
      @abhijeetsharma5715 3 years ago

      As the professor mentions multiple times, this lecture is a bit out-of-place as it is placed before covering the theory. Other lectures are more theoretically dense and explained in reasonable depth.

  • @brainstormingsharing1309
    @brainstormingsharing1309 3 years ago +1

    Absolutely well done and definitely keep it up!!! 👍👍👍👍👍

  • @sidk5919
    @sidk5919 8 years ago +7

    Brilliant lecture

  • @wsross3178
    @wsross3178 7 years ago

    Amazing lecture