Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)

Stanford Online

5 500


Published 4 Feb 2025

COMMENTS • 145

@raccoonious4038 10 months ago ⁺⁵⁷
The simplification of the log-likelihood function log(L(theta)) to give you back the cost function J(theta) has to be one of the most beautiful transformations I've seen in a while!
@MosesMakuei-b5z 5 months ago ⁺¹
Hehe, I'm certain that the first derivation of least squares as a cost function did not come from a probabilistic interpretation. That goes to prove that if you are right from one angle, you will also be right from all angles. It was interesting to see that too.
@SaidurRahman-c8w 6 months ago ⁺⁷⁴
Notice how views decrease by half on each new lecture. Congratulations on making it this far, keep going fellas, we got this.
@_desouvik 7 months ago ⁺³⁶
They changed the voices of the students 👩‍🎓. At first I was confused about why they all talk the same way 😮, but now that makes sense.
@manudasmd 2 years ago ⁺¹¹³
Damn, this guy just explains concepts so clearly. Love this course.
@morespinach9832 1 year ago ⁺²
Where else have you seen these techniques explained? It’s not that hard.
@hannukoistinen5329 1 year ago
Damn, chinese communist teaching in Stanford!!!
@manudasmd 1 year ago ⁺¹⁶
@@hannukoistinen5329 Damn, cool joke bro!! You must be a really funny guy.
@ujjolchakrabarty9285 7 months ago ⁺¹
Do you know where we can get the practice sets for the course?
@shaksham.22 7 months ago
@@ujjolchakrabarty9285 Trying to find the same thing. Can't access it from the website.
@FrankDong-o8b 1 month ago ⁺¹
This lecture reinforces my understanding of the relationship between OLS and MLE in a much, much better way.
@atalantinopieva 7 months ago ⁺⁸
For anyone struggling with the concept: likelihood indicates how likely a particular population is to produce an observed sample, given a particular distribution.
For example, if we have data that should follow a Gaussian distribution with mean = 5 and variance = 0.1, but the ACTUAL data in my dataset are all 0.5, well... the likelihood that my data actually follow this distribution is very low!
If each ACTUAL data point has a high probability density, the overall likelihood will be high!
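The Gaussian example in the comment above can be checked numerically. This is a minimal sketch, not something from the lecture: the function name and both sample lists are made up for illustration, and it sums log densities (the log-likelihood) rather than multiplying densities, which gives the same ordering.

```python
import math

def gaussian_log_likelihood(data, mean, variance):
    """Sum of log N(x; mean, variance) over independent samples."""
    return sum(
        -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)
        for x in data
    )

# Hypothetical samples: one set near the assumed mean, one set far from it.
near_mean = [4.9, 5.1, 5.0, 4.8, 5.2]
far_from_mean = [0.5, 0.5, 0.5, 0.5, 0.5]

ll_near = gaussian_log_likelihood(near_mean, mean=5.0, variance=0.1)
ll_far = gaussian_log_likelihood(far_from_mean, mean=5.0, variance=0.1)
print(ll_near, ll_far)  # the first value is far larger than the second
```

The data sitting at 0.5 gets a hugely negative log-likelihood under mean = 5, variance = 0.1, which is the "very low likelihood" described above.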
@carvalhoribeiro 1 year ago ⁺⁷
Your clear explanation of these concepts is greatly appreciated. Thank you so much for sharing
@SalihBekri 1 year ago ⁺⁵
I'm an EE student and we don't have anything to do with ML except a simple course in the final year, and I'm still taking this course. Wish me luck guys, because it's hard, really hard.
@AadityaSaraf69 1 year ago
Good luck! And yes, it is very hard.
@elonmusk4267 1 year ago ⁺⁷
What a phenomenal lecture! So beautiful, so elegant, just looking like a wow
@dartng5029 1 year ago ⁺¹⁷
0:28: 📚 The video discusses supervised learning, specifically linear regression, locally weighted regression, and logistic regression.
5:38: 📚 Locally weighted regression is a non-parametric learning algorithm that requires keeping data in computer memory.
13:05: 📊 Locally weighted regression is a method that assigns different weights to data points based on their distance from the prediction point.
19:01: 📚 Locally weighted regression is a learning algorithm that may not have good results and is not great at extrapolation.
24:46: 🔍 The video discusses Gaussian density and its application in determining housing prices.
31:31: 💡 The likelihood of the parameters is the probability of the data given the parameters, assuming independent and identically distributed errors.
36:55: 📊 Maximum Likelihood Estimation (MLE) is a commonly used method in statistics to estimate parameters by maximizing the likelihood or log-likelihood of the data.
43:44: 📊 Applying linear regression to a binary classification problem is not a good idea.
49:22: 🎯 The video discusses the choice of hypothesis function in learning algorithms and why logistic regression is chosen as a special case of generalized linear models.
54:45: 📚 The video explains how to compress two equations into one line using a notational trick.
1:01:31: ✏ Batch gradient ascent is used to update the parameters in logistic regression.
1:07:52: 📚 The video explains how to use Newton's method to find the maximum or minimum of a function.
1:13:55: 💡 Newton's method is a fast algorithm for finding the place where the first derivative of a function is 0, using the first and second derivatives.
Recap by Tammy AI
@adhammazen2547 2 months ago ⁺¹¹
Hello, I don't know if the comment section here is still being replied to or looked at by staff, but I want to point out that the lecture notes PDFs on the CS229 website lead to an error and don't actually load. I'm certain many others like myself would appreciate being able to look at these notes. Thank you!
@TonyDaExpert 2 months ago
I recommend looking for the GitHub that has them
@bhargavtripathi907 2 months ago
You can search for the notes on Google under the name "Andrew Ng CS229 notes".
@T-r5t 1 month ago
I have the same issue, did you get any help?
@nikhils1182 21 days ago
@@T-r5t Just check out the webpage of the CS229 2020 class; main_notes has all the course notes by Andrew Ng himself...
@Twi_543 4 days ago
@@T-r5t perhaps try the wayback machine archive
@glitchAI 8 months ago ⁺¹
He speaks with so much bass that I have to ramp up my volume.
@moussadiallo6430 1 year ago ⁺¹
Great lecture. ML is fun with you😀
@stanfordonline 1 year ago ⁺¹
Thanks for your comment and for watching!
@liketheblue5082 2 years ago ⁺⁸
32:18 I have a question about this likelihood function. Can somebody help me with it?
According to the IID assumption, the probability of all the observations is equal to the product of the individual probabilities. However, isn't the expression a density of a normal distribution rather than a probability? I am really confused. I think the probability should be the integral of the density function. If it's a density, what's the meaning of the product of densities?
@HamzaAsgharKhan 2 years ago ⁺¹⁰
For I.I.D., P(AB) = P(A)P(B). Your observation about it being the probability density of the Gaussian is correct. When we maximize it, we are trying to find the point which has the highest probability. A point that has the highest density will have the highest probability. So using the probability density function is correct in that regard. (I think you are confused by the fact that the density will probably result in a value that is not between 0 and 1, but with a little thought about what I said, hopefully you will be able to see why normalizing the values to be between 0 and 1 does not really matter.) I don't know how much help this answer will be to you; I'm simply having a hard time articulating what I'm trying to say.
@liketheblue5082 2 years ago ⁺⁴
@@HamzaAsgharKhan Thank you very much! I didn't expect someone would give me such a detailed answer! That's exactly what I thought. The product of densities might not really have a meaning in statistics, and a density can also be greater than 1, but it is enough for finding the maximum point. I appreciate it!!
@HamzaAsgharKhan 2 years ago ⁺³
@@liketheblue5082 I'm glad it helped! 😊
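One way to make the point in this thread concrete, as a sketch in my own notation (the window width \delta is an assumption introduced for illustration, not something from the lecture): the probability that an observation lands in a tiny window around its observed value is approximately the density times the window width. The product of densities therefore differs from an actual probability only by a factor that does not involve theta, so both have the same maximizer.

```latex
\begin{align*}
P\big(y^{(i)} \le Y \le y^{(i)} + \delta \mid x^{(i)}; \theta\big)
  &\approx p\big(y^{(i)} \mid x^{(i)}; \theta\big)\,\delta, \\
\prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)\,\delta
  &= \delta^{m} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)
  = \delta^{m} L(\theta), \\
\arg\max_{\theta}\; \delta^{m} L(\theta) &= \arg\max_{\theta}\; L(\theta).
\end{align*}
```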
@henryyy8625 7 months ago ⁺¹
How can we access the homework? The link in the syllabus leads to Piazza, but it does not get into the classroom.
@tomzhangg 2 years ago ⁺¹¹
A classic tradeoff in locally weighted models between training cost and accuracy, though it seems like the cost really comes from refitting for each x input during testing.
@adityachauhan7269 1 year ago ⁺²
Ohhh, so that's how it does it. Wouldn't this overfit? It's like the start of thinking towards "forest-like" methods, amazing.
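The refit-per-query cost mentioned above can be seen in a small sketch. This is an illustration under stated assumptions, not the course's code: it handles a single scalar feature plus an intercept, uses the Gaussian-shaped weight exp(-(x_i - x)^2 / (2 tau^2)) with a hypothetical bandwidth tau = 0.2, and solves the 2x2 weighted normal equations by hand. The function name and the toy data are made up.

```python
import math

def lwr_predict(x_query, xs, ys, tau=0.2):
    """Locally weighted linear regression for one scalar feature.

    A fresh (theta0, theta1) is fitted on every call, because the
    weights depend on x_query.
    """
    # Gaussian-shaped weights: training points near x_query count the most.
    w = [math.exp(-((x - x_query) ** 2) / (2 * tau ** 2)) for x in xs]

    # Weighted sums for the 2x2 weighted normal equations.
    s_w = sum(w)
    s_x = sum(wi * x for wi, x in zip(w, xs))
    s_y = sum(wi * y for wi, y in zip(w, ys))
    s_xx = sum(wi * x * x for wi, x in zip(w, xs))
    s_xy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))

    det = s_w * s_xx - s_x ** 2
    theta1 = (s_w * s_xy - s_x * s_y) / det
    theta0 = (s_y - theta1 * s_x) / s_w
    return theta0 + theta1 * x_query

# Hypothetical training set sampled from a curve (y = x^2), kept in memory.
xs = [i / 10 for i in range(-30, 31)]
ys = [x ** 2 for x in xs]

pred_a = lwr_predict(1.0, xs, ys)    # roughly 1.0, despite the linear local model
pred_b = lwr_predict(-2.0, xs, ys)   # roughly 4.0
print(pred_a, pred_b)
```

Every prediction loops over the whole training set, which is why the data has to stay in memory and why test-time cost grows with the number of training examples. A smaller tau follows the data more closely, and a larger tau smooths more.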
@Emanuel-oz1kw 6 months ago ⁺⁸
Motivation: only 10% will make it to the last video
@bwmartin24 2 years ago ⁺²¹
A lot of the links aren't working on the syllabus linked in the description. Is there an updated version with the class notes PDFs, etc.?
@aphievel 2 years ago ⁺³
You can refer to the notes of the summer 2019 class. Though the topics were covered in a different order, the content is the same.
@ikrammaizi8678 1 year ago ⁺²
@@aphievel where?
@shakeelahmad3162 1 year ago
docs.google.com/spreadsheets/d/18pHRegyB0XawIdbZbvkr8-jMfi_2ltHVYPjBEOim-6w/edit#gid=0
@karthikeyapervela3230 9 months ago ⁺¹
26:37 How is it being implied? We are assuming the error term to be Gaussian, and from there we jumped to the conditional distribution of y given x, parameterized by theta. I did not understand this implication.
@suvamsivam9658 8 months ago
It's assumed that the error term is normally distributed.
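The step being asked about can be written out as follows. This is a sketch of the standard argument in the usual CS229 notation: once x and theta are fixed, y is the constant theta^T x plus the Gaussian noise term, so substituting epsilon = y - theta^T x into the noise density gives the conditional density of y.

```latex
\begin{align*}
y^{(i)} &= \theta^{T} x^{(i)} + \epsilon^{(i)},
  \qquad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^{2}), \\
p\big(\epsilon^{(i)}\big) &= \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(\epsilon^{(i)})^{2}}{2\sigma^{2}}\right), \\
p\big(y^{(i)} \mid x^{(i)}; \theta\big) &= \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\big(y^{(i)} - \theta^{T} x^{(i)}\big)^{2}}{2\sigma^{2}}\right),
  \quad\text{i.e. } y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}\big(\theta^{T} x^{(i)}, \sigma^{2}\big).
\end{align*}
```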
@All_Kraft 11 months ago ⁺¹
Thank you for the explanation. I don't know why, but it's so annoying when the lecturer constantly erases and rewrites the same symbols((
@AditiSalunkhe-f4q 3 months ago
At timestamp 52:11, how can P(Y=1 | x ; theta) = h(x), since h(x) should only take the two values 0 and 1? This would give a value of h(x) that lies in [0, 1].
@AyanKhan-ek1iy 3 months ago ⁺¹
Basically, h(x) can take values between 0 and 1, since it is a probability. What you're talking about here is the actual class, or "y". It can be either 0 or 1. So P(Y=1|x;theta) actually means that we are finding the probability that our class would be 1, given a feature vector x and parameters theta. Hope that makes it clear.
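The distinction in the reply above, a probability from h(x) versus a 0/1 label for y, can be sketched in a few lines. The parameter and feature values are hypothetical, and the 0.5 threshold is a common convention assumed here rather than something stated in the thread.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis: estimated P(y = 1 | x; theta)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]            # hypothetical parameters
x = [1.0, 0.8]                 # x[0] = 1 is the intercept term

p = h(theta, x)                # a probability strictly between 0 and 1
label = 1 if p >= 0.5 else 0   # the predicted class is what takes values 0 or 1
print(p, label)
```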
@surajyadav1033 1 year ago ⁺²
At 1:16:59, shouldn't the formula have a negative sign before the Hessian inverse?
@neelabhsomani5129 7 months ago ⁺¹
We are trying to *maximize* the likelihood function. Hence the formula has a +ve sign instead of a negative sign.
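A note on the sign question in this thread, offered as my understanding rather than as a transcript of the board at 1:16:59: the usual statement of Newton's method is theta := theta - H^{-1} * gradient, and the minus sign stays whether l(theta) is being maximized or minimized. Near a maximum the Hessian is negative definite, so -H^{-1} * gradient already points uphill. The sketch below checks this on a hypothetical one-dimensional concave objective that is not from the lecture.

```python
def newton_maximize(grad, hess, theta, steps=6):
    """1-D Newton's method: theta := theta - grad(theta) / hess(theta)."""
    for _ in range(steps):
        theta = theta - grad(theta) / hess(theta)
    return theta

# Hypothetical concave objective l(theta) = log(theta) - theta, maximized at theta = 1.
def grad(t):
    return 1.0 / t - 1.0      # l'(theta)

def hess(t):
    return -1.0 / t ** 2      # l''(theta), negative because l is concave

theta_hat = newton_maximize(grad, hess, theta=0.5)

# One step with the sign flipped moves away from the maximizer instead.
wrong_step = 0.5 + grad(0.5) / hess(0.5)
print(theta_hat, wrong_step)
```

Starting from 0.5, the minus-sign update approaches 1 very quickly, while a single plus-sign step lands at 0.25, further from the maximum.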
@stephendiopter2289 1 year ago ⁺¹
The course page has some problem sets and class notes provided by the prof., but they are inaccessible. Is there any way to get those?
P.S. I just need those problem sets
@stephendiopter2289 1 year ago
Never mind, got them
@prienee 11 months ago
@@stephendiopter2289 how did you get them?
@sowaszpieg7528 10 months ago
@@stephendiopter2289 where did you find them?
@durai5213 10 months ago
@@stephendiopter2289 Can you help me find the lecture notes?
@CLL-mr3kz 1 month ago
Let's do it guys!
@גבריאלחדאד 13 days ago
I have a question, if anybody can help:
In the locally weighted linear regression algorithm, do we need to recalculate the theta vector for each prediction? Because the weight function depends on the example we are trying to make the prediction for... Or maybe in the training stage we pick several points, calculate the theta vector for each of them, and use the closest option when making predictions? How exactly does this work?
And by the way, can anyone provide a link to the course website?
@RiyaSharma-lq1ok 18 days ago
Where can I find the lecture notes and the problem sets?
@НиколайТодоров-и9т 1 year ago ⁺²⁴
I love the videos and Mr Ng explains things clearly, but gosh, the markers he uses are so pale and hard to read
@jaymistry689 4 months ago ⁺¹
Can someone help me find the partial derivative of l(theta) at 1:01:10?
@AyushGupta-zc4lh 1 year ago ⁺¹
Awesome lecture
@KipIngram 2 years ago
10:24 - How is this not just a form of interpolation using shape functions? That doesn't really seem like "learning" to me.
@aysukeskin3749 10 days ago
How can I get access to the homework and projects?
@ShubhamKumar-it2uy 10 months ago ⁺¹
Can anyone explain what happened to Andrew's voice at 19:32?
@mshoshan9698 10 months ago ⁺²
It seems like they applied some audio distortion effect whenever a student asks a question (to preserve anonymity) that makes their voice sound very deep.
@rayugamax183 1 year ago ⁺¹
The links given in the description don't have the class notes he keeps mentioning and tells us to read. Can anyone help? I mean, how do I get those?
@kinetic_kane9033 1 year ago
Same question. I think the notes are only available to Stanford students because they're on their intranet.
@prathmeshmishra4357 9 months ago
@@kinetic_kane9033 cs229.stanford.edu/lectures-spring2022/main_notes.pdf
@ramankr0022 1 year ago ⁺¹
Very helpful.
@albertlei9249 2 years ago ⁺²
Looks like in gradient ascent, if we replace the scalar learning rate alpha by the inverse Hessian H^{-1}, we get Newton's method.
@tomzhangg 2 years ago ⁺¹
Also remember that the partial derivative is replaced with the gradient vector, allowing for matrix multiplication.
@shubhamkumar-nw1ui 2 years ago
Can you guys help me out? I can't get my head around the likelihood of theta thing... why is it equal to the product of the probabilities of y?
@ZeroManifold 1 year ago
Newton's method uses a 2nd-order approximation whereas gradient descent uses a 1st-order approximation; the rationale is quite similar.
@xinli7836 6 months ago
Awesome totally
@fahimesokhangou3646 2 years ago ⁺⁴
I have a question about locally weighted regression. Imagine we want to calculate the studentized residual. We have a different hat matrix (projection matrix) for each observation, and each hat matrix is a k-by-k matrix, where k is the number of observations in the span. Now I would like to calculate the leverage. How do I determine the leverage for each observation?
@haoranlee8649 1 year ago
I like this guy's video, it's amazing
@morespinach9832 1 year ago ⁺²
Since when did these basic statistics techniques become “machine learning”??
@closingtheloop2593 11 months ago
It's all marketing. To be fair, many of these results fall out of linear system theory that does not require any statistics. So the branding is somewhat subjective.
@Adnan_19946 4 months ago
Where can we get the fabled lecture notes?
@viharivemuri7202 11 months ago
While deriving maximum likelihood for linear regression, the professor modelled a Gaussian error term. However, for logistic regression he did not use an error term. Does anyone know why that is?
@shaksham.22 7 months ago ⁺¹
You won't need an error term for logistic regression, because in linear regression you are trying to predict h(x), which can vary based on some real-world phenomenon. However, in the case of classification (for which logistic regression is used) you are more or less trying to fit h(x) into a few defined classes of output, for instance the true or false of an occurrence. Hence the presence of an error term does not have any effect on the outcome. In other words, the output is discrete in classification, so there's no requirement for an error term.
I may be wrong about this though.
@sophiafunworldatthepark6740 1 year ago
I'm trying to find a way to use this to teach kids.
@pavel.pavlov 1 year ago ⁺¹
He needs to get the IBM guy's blackboard
@codehere142 10 months ago
Can't find the derivation of the MLE
@b14ckb0y9 5 months ago
I couldn't find Newton's method in the lecture notes. Can anybody tell me which page it is on?
@niayeshshafieian3290 4 months ago
Can I ask where you found the lecture notes?
@elching.8924 1 month ago
Andrew is an applied professional; he has a surface-level understanding of the math behind these concepts, and the lack of depth in his knowledge affects the quality of his arguments. There is no way students can understand Newton's method without knowing the proofs of the mean value theorems.
@studybuddy8307 7 months ago
Please help me get the problem set
@dr.owl_the_great 1 year ago ⁺¹
Where can we get the problem sets for this course?
@ujjolchakrabarty9285 7 months ago
Did you find the problem sets?
@youssera6352 1 year ago
Hi, I'm trying to follow this course in order to start reading papers for my PhD research/preparation. I don't seem to understand most of the mathematical equations. Do I really need to understand them to achieve my goal, or do I just need to understand the concepts and memorize the formulas?
@Nett6799 1 year ago
I have the same problem as you. What's your PhD research theme?
@jaimehernandezbascur8619 1 year ago
Hi, it's highly recommended to have a background in probability and statistics and in linear algebra before studying machine learning. Personally, I think some knowledge of optimization is useful but not necessary.
@mekuzeeyo 2 years ago ⁺¹
Then what is the difference between locally weighted regression and polynomial regression, in application?
@closingtheloop2593 11 months ago
I've got the same question. I'm sure with more exposure it will become clear. Polynomial regression and locally weighted regression echo, in concept, gain-scheduled control design for nonlinear systems. Same trick, different pony. I.e., how can we apply linear theory to nonlinear systems?
@krishnaaa___03 6 months ago
Can anyone tell me how h_theta(x), which is equal to the sum over j = 0 to n of (theta_j x_j), can be written as theta(transpose) times x? Where did the transpose come from? It should be theta times x, then it makes sense somewhat... how is this transpose imposed on theta?
(obviously in linear regression)
@ippilisaisugandhasri9952 1 month ago
h_theta(x) gives a value which is a scalar, right? And the scalar can be obtained only if you multiply theta(transpose) by x, or x(transpose) by theta.
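The shape argument in the reply above can be checked directly. This is a small sketch with made-up numbers, using plain Python lists: theta and x are both column vectors, so theta^T is a 1 x 3 row, and multiplying it by the 3 x 1 column x gives a 1 x 1 result equal to the sum form.

```python
theta = [0.5, 2.0, -1.0]   # hypothetical parameters theta_0..theta_2
x = [1.0, 3.0, 4.0]        # features, with x_0 = 1 as the intercept term

# Sum form: h(x) = sum_j theta_j * x_j
h_sum = sum(theta[j] * x[j] for j in range(len(theta)))

# Matrix form: theta^T is a 1 x 3 row, x is a 3 x 1 column, the product is 1 x 1.
theta_row = [theta]                 # shape 1 x 3
x_col = [[xj] for xj in x]          # shape 3 x 1
h_mat = [
    [sum(theta_row[i][k] * x_col[k][j] for k in range(len(x))) for j in range(1)]
    for i in range(1)
]

print(h_sum, h_mat)
```

Multiplying two 3 x 1 columns together is not defined as a matrix product, which is why the transpose appears.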
@rushinshah4344 1 year ago ⁺¹
Where can I access the problem sets?
@iamnotsure237 2 months ago
CS229 website or GitHub
@logeshwaran1537 11 months ago ⁺¹
Does anybody know how to get familiar with these concepts, like where to apply and practice them??
@YisneySoto 11 months ago
I'm thinking about ChatGPT. Ask it for exercises and to evaluate your responses.
@sanatani_0228 1 year ago
Is it better than his course on Coursera, or is it the same?
@stanfordonline 1 year ago ⁺⁶
Hi there, thanks for your comment! The material on Coursera is more introductory level, and this lecture is from the graduate course CS229 and covers more advanced topics.
@anuragsahu4527 7 months ago
@@stanfordonline So which should I prefer?? This course or the one on Coursera??
@ilpreterosso 2 years ago ⁺²
What happened at 17:45?
@sanspapyrus683 2 years ago ⁺¹
Probably a mic failure. Not sure though.
@bouazizzied5086 1 year ago
Can someone tell me, after we derive the maximum likelihood of theta, how do we use it to modify all our parameters theta?
@gautamgirotra3572 1 year ago
From the MLE of theta we have the function that should be maximized, i.e. l(theta).
Now use any optimization algorithm (like gradient ascent or Newton's method) to optimize it.
For example, using gradient ascent:
theta(new) = theta(old) + alpha * partial derivative of l(theta)
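The update rule in the reply above can be sketched for logistic regression as follows. This is an illustration under assumptions, not the course's implementation: the toy data, the learning rate alpha = 0.1, and the step count are all hypothetical, and the gradient used is the batch form sum_i (y_i - h(x_i)) * x_ij.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i y_i log h(x_i) + (1 - y_i) log(1 - h(x_i))."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def gradient_ascent(X, y, alpha=0.1, steps=200):
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        # Residuals y_i - h(x_i) under the current parameters.
        errors = [yi - sigmoid(sum(t * v for t, v in zip(theta, xi)))
                  for xi, yi in zip(X, y)]
        # Batch gradient: for each j, sum_i (y_i - h(x_i)) * x_ij.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X))
                for j in range(len(theta))]
        # Ascent step: theta(new) = theta(old) + alpha * gradient.
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical, non-separable toy data; X[i][0] = 1 is the intercept term.
X = [[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_ascent(X, y)
print(theta, log_likelihood(theta, X, y))
```

All components of theta are updated together on each pass, using the gradient computed from the old theta.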
@meeqvin 11 months ago
In our case, is the parameter of the learning algorithm (theta) the cost of our house?
@shaksham.22 7 months ago ⁺¹
Nope, the cost of the house is y. The parameters are the weights of each feature in x that help you compute the corresponding h(x).
@neelabhsomani5129 1 year ago
Check point 44:16
@malfuriosstormrage5218 1 year ago
Thanks. Can you explain what he meant by h(x) being different when using the logistic function? Is it because it's bounded to [0,1]?
@neelabhsomani5129 7 months ago
@@malfuriosstormrage5218 h(x) is nothing but our hypothesis function. So depending on the task (classification or regression), our hypothesis function will look different. For example, for linear regression our h(x) was w0 + w1x1 + ... + wnxn. (Here w is the same as theta, the parameters.) But h(x) looks different in logistic regression.
Our hypothesis also depends on the preferred outcome. Like you mentioned, h(x) looks different because we want to bound the output to [0,1]. Hope this helps.
@ayushipanda7228 6 months ago
My boyfriend does this course very diligently 😊
@haitematik5832 11 months ago
ML for Goa'ulds
@browndonkey 2 years ago ⁺¹
Are the class notes he mentions throughout the course available anywhere for download?
@agustinsalazar9351 2 years ago ⁺¹
In the link to the syllabus in the description there are some lecture notes available, although many are dead links
@patrickt.4121 2 years ago ⁺⁴
Google it and you'll find them. First hit.
@shashankrana977 1 year ago ⁺²
What I am doing is following the current year's course page for assignments, as those are mostly working links. Lecture notes can be found on the course page given in the Lecture 1 description.
@stephendiopter2289 1 year ago
Can you share the link to the current year's course page? @@shashankrana977
@طالبالعلم-ج1ث 1 year ago
Thank You Very Much
@shwetatiwari7910 2 months ago ⁺¹
Why do the students sound like Optimus Prime😂
@namphan9281 1 year ago
Now I know why my university teaches optimization techniques in the CS program 💀
@nanunsaram 2 years ago ⁺¹
Thank you!
@kaipingli-mh3mw 1 year ago
thx
@malfuriosstormrage5218 1 year ago
What a concept. I just "wow"d when MLE was shown. Anyone here familiar with Power System State Estimation?
@The-Thinking-Room 1 month ago
If anyone has notes on it, please feel free to share the link here!
@k4gaurav 1 month ago
Do you still need this??
@aramuradyan2138 1 year ago
Where are the lecture notes?
@lohitaksha244 1 year ago ⁺¹
Look up cs229 autumn 2018 on Google; you should find the repository maxim5/cs229-2018-autumn
@LOGENDIRANVD 1 year ago
soooooo goood
@DagmawiAbate 2 years ago
Okay.
@vasudevrv7417 1 year ago
Can anyone explain to me where that x came from in the final equation of gradient ascent?
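On where the x comes from: a sketch of the standard derivation, written in the usual CS229 notation with h_\theta(x) = g(\theta^T x) and the identity g'(z) = g(z)(1 - g(z)). The factor x_j appears through the chain rule, because the derivative of \theta^T x with respect to \theta_j is x_j.

```latex
\begin{align*}
\ell(\theta) &= \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)})
  + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big), \\
\frac{\partial \ell(\theta)}{\partial \theta_j}
  &= \sum_{i=1}^{m} \left( \frac{y^{(i)}}{h_\theta(x^{(i)})}
     - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right)
     h_\theta(x^{(i)}) \big(1 - h_\theta(x^{(i)})\big)\,
     \frac{\partial\, \theta^{T} x^{(i)}}{\partial \theta_j} \\
  &= \sum_{i=1}^{m} \Big( y^{(i)} \big(1 - h_\theta(x^{(i)})\big)
     - \big(1 - y^{(i)}\big) h_\theta(x^{(i)}) \Big)\, x_j^{(i)} \\
  &= \sum_{i=1}^{m} \big( y^{(i)} - h_\theta(x^{(i)}) \big)\, x_j^{(i)}, \\
\theta_j &:= \theta_j + \alpha \sum_{i=1}^{m} \big( y^{(i)} - h_\theta(x^{(i)}) \big)\, x_j^{(i)}.
\end{align*}
```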
@ras4884 1 year ago ⁺⁵
Someone, buy this guy better markers!
@vientios_talisman 1 year ago
1:02:02
@McAwesomeReaper 1 year ago
You know he's thought about just getting slightly shorter sleeves tailored, right?
@Lalala_1701 8 months ago ⁺¹
Why does every student sound like a giant?😂
@KevenDuan-cn 4 months ago ⁺²
It may be to protect the privacy of the students, so the audio gets special processing
@Lionsboy86 2 years ago
Otu yo
@notsodope7227 1 year ago ⁺¹
The way it started and the way it is going forward :/ So much math
@김연우-i6v 7 months ago
The gambling-site bosses will make just about anything, lol
