6. Maximum Likelihood Estimation (cont.) and the Method of Moments
- Published 15 Oct 2024
- MIT 18.650 Statistics for Applications, Fall 2016
View the complete course: ocw.mit.edu/18-...
Instructor: Philippe Rigollet
In this lecture, Prof. Rigollet continued discussing maximum likelihood estimators and covered the Weierstrass Approximation Theorem (WAT) and a statistical application of the WAT.
License: Creative Commons BY-NC-SA
More information at ocw.mit.edu/terms
More courses at ocw.mit.edu
I've never taken a regular basic statistics course, and it takes me literally a day to fully understand one lecture video. But as the instructor said, I feel much smarter after taking this lecture.
I'd like to say, this video is the best I've ever seen. The instructor's thinking is so clear that he can relate all the critical notions together and paint vivid pictures for us in just a few words.
Method of Moments starts at 32:36
Thank you !
Thanks
@28:50 Why would there be a square root of 2 pi there? I don't get the significance of what he is saying when there are no fudge factors and this is the true asymptotic variance. Why would there be any of that?
@ 19:20, the dotted curve represents our ESTIMATOR for KL(theta, theta*), whereas the solid line is the actual KL(theta, theta*); the values theta and theta* are the minimum points of the estimator and the actual KL divergence, respectively. Can you guys help me verify if I understood correctly? Is the dotted line something else? Or did I interpret the solid line incorrectly? Please help me out here.
Yes, that is what I understood as well. The point of him drawing these two lines was basically to illustrate that if the curve has a very flat bottom, then even if you somehow manage to find the min of the estimator, there is still a chance that you end up pretty far away from the actual parameter theta star.
how is the fisher information used in modern machine learning - especially in practice?
How does his theorem at 30:55 imply that the MLE is just going to be an average?
Would have been nice to put in the description or title that this lecture focuses on the Fisher Information (Matrix), to make it easier to search. I honestly don't know how or why I found this, especially since it was at the bottom of my search results. Relevant MIT videos should be at the top.
17:10 The word is his name Rigollet in French
46:29 The next-to-last row of the matrix on the left side should be x_1^(r1-1), x_2^(r1-1), etc., instead of r-1.
At 41:23 he says that it's actually enough to look only at terms of the form X^k — why is that enough?
Hi, Adam. I hope this answer suits you well.
The reason terms of the form X^k suffice is "linearity". The operation of taking an average is linear, meaning you can pull the constants out.
It is the same reason why constants can "escape" an integral.
If E is the expectation, and there's a polynomial a_0 + a_1 X + a_2 X^2 + ... + a_n X^n, its expectation is
E ( a_0 + a_1 X + a_2 X^2 + ... + a_n X^n ) = a_0 + a_1 E( X ) + a_2 E ( X^2 ) + ... + a_n E ( X^n ).
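This is easy to check numerically. A minimal sketch (my own example, with made-up coefficients, not from the lecture): averaging the polynomial directly gives the same answer as combining the empirical moments linearly, because both are the same sum rearranged.

```python
import random

random.seed(0)

# Sample X ~ Uniform(0, 1) and estimate E[a0 + a1*X + a2*X^2] two ways.
n = 100_000
xs = [random.random() for _ in range(n)]
a0, a1, a2 = 2.0, -3.0, 5.0

# Way 1: average the polynomial directly.
lhs = sum(a0 + a1 * x + a2 * x * x for x in xs) / n

# Way 2: combine the estimated moments E[X] and E[X^2] linearly.
m1 = sum(xs) / n
m2 = sum(x * x for x in xs) / n
rhs = a0 + a1 * m1 + a2 * m2

# Same sums, just rearranged — linearity of the average.
print(abs(lhs - rhs) < 1e-9)
```

So knowing how to estimate each E(X^k) is all you need to estimate the expectation of any polynomial in X.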
@@owenmireles9615 Ah that's right indeed, thank you!
@@owenmireles9615 @ 19:20, the dotted curve represents our ESTIMATOR for KL(theta, theta*), whereas the solid line is the actual KL(theta, theta*); the values theta and theta* are the minimum points of the estimator and the actual KL divergence, respectively. Can you guys help me verify if I understood correctly? Is the dotted line something else? Or did I interpret the solid line incorrectly? Please help me out here.
@@jaspreetsingh-nr6gr Hi, Jaspreet.
Your interpretation seems correct. I'll just emphasize some parts which I think weren't covered in as much detail in the lecture.
That's right, the dotted line represents the estimator for the KL divergence.
However, the relationship between theta and theta* is more subtle... there's a bit more going on.
Throughout the video, they mention that theta* is the true parameter that you're trying to find. To do this, you'd like to minimize a function. That function would be f(X) = KL(P_theta*, P_X). In words, you want to find the parameter X that is the "closest" (under KL divergence) to theta*. The graph of this f(X) is the solid line in the video.
If you had perfect information, then obviously theta* is that minimizer.
However, under real-world conditions, you never have perfect data, and have to resort to an approximation, that being Hat(KL). So, what you're actually trying to minimize now is g(X) = Hat(KL) (P_theta*, P_X). The graph of this g(X) is the dotted line in the video.
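A tiny numerical sketch of the dotted line vs. the solid line (using a Bernoulli model for concreteness — my own stand-in, not the lecture's example): minimizing the empirical curve g, i.e. the average negative log-likelihood, gives an estimate that lands near, but not exactly at, theta*.

```python
import math
import random

random.seed(1)

# True parameter theta* for a Bernoulli model.
theta_star = 0.3
n = 10_000
ones = sum(1 for _ in range(n) if random.random() < theta_star)

def neg_avg_log_lik(theta):
    # hat(KL)(P_theta*, P_theta) up to a constant not depending on theta:
    # -(1/n) * sum_i log p_theta(X_i).  This is the "dotted line".
    return -(ones * math.log(theta) + (n - ones) * math.log(1 - theta)) / n

# Minimize the estimated curve over a grid of candidate parameters.
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = min(grid, key=neg_avg_log_lik)
print(theta_hat)
```

With a flat bottom (small n, or a less informative model), theta_hat can sit noticeably far from theta* even though it exactly minimizes the dotted curve.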
@@owenmireles9615 Understood: using data (for the sample mean), the guarantees given by the LLN and by continuous functions under the LLN ensure hat(KL) reasonably approximates the KL divergence. Thanks Owen, will ping you again if I get stuck on subsequent lectures.
Thank you very much.
Fisher proof is awesome!
41:04
- moment: the expectation of a power, i.e. the k-th moment is E[X^k]
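A minimal sketch of how this turns into an estimator (a hypothetical Exponential example of my own, not from the lecture): estimate the moment by the sample average of X_i^k, then invert the moment map to recover the parameter.

```python
import random

random.seed(2)

# Method of moments for Exponential(lam): E[X] = 1/lam,
# so inverting the first moment gives lam_hat = 1 / mean(X).
lam = 2.0
n = 50_000
xs = [random.expovariate(lam) for _ in range(n)]

m1 = sum(xs) / n      # first empirical moment: (1/n) * sum X_i
lam_hat = 1.0 / m1    # plug the estimated moment into the inverse map
print(lam_hat)
```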
What does the support of P_theta mean, please?
This is way too advanced for me. I can understand the calculus, but when he starts talking about convergence in probability and distribution, I get really lost. Can anyone point me to a book where I can get a better understanding of these topics of inference and convergence?
asymptotic theory?
www.stat.cmu.edu/~siva/705/lec4.pdf
www.stat.cmu.edu/~siva/705/lec5.pdf
www.stat.cmu.edu/~siva/705/lec6.pdf
I found these helpful!
Try Wasserman's "All of Statistics"; it's pretty concise and straightforward, and designed for people coming in from other fields.
@@SrEstroncio So true, I was gonna say the same thing; it explains them very well and in detail.
I have now a clear idea of Fisher
22:50
That was a Harry Potter on a broom entry!
What a doozy. Great lecture.
damn it's hard
I hate when teachers go "who doesn't know this? Go and read about it." LOL
agree
He's so bad at cleaning the board omg
He has a broken leg and MIT has staff that come in and clean after each lecture.
@@imtryinghere1 I honestly think it's more the eraser than his lack of skill
I literally searched "oof moment"