Lecture 7 "Estimating Probabilities from Data: Maximum Likelihood Estimation" -Cornell CS4780 SP17

  • Published 24 Jul 2024
  • Cornell class CS4780. (Online version: tinyurl.com/eCornellML )
    Lecture Notes: www.cs.cornell.edu/courses/cs4...
    Past 4780 exams are here: www.dropbox.com/s/zfr5w5bxxvizmnq/Kilian past Exams.zip?dl=0
    Past 4780 homeworks are here: www.dropbox.com/s/tbxnjzk5w67...
    If you want to take the course for credit and obtain an official certificate, there is now a revamped version (with much higher quality videos) offered through eCornell ( tinyurl.com/eCornellML ). Note, however, that eCornell does charge tuition for this version.

COMMENTS • 63

  • @RS-el7iu
    @RS-el7iu 4 years ago +18

    I've just stumbled on a treasure of high-class lectures, and for free. You make me enjoy these topics after graduating back in 2000, and believe me, it's hard to make someone in their mid-40s enjoy this when all I think about nowadays is learning stuff like sailing :)). I wish we had profs like you in my country; it would have been a hundred times more enjoyable. Thank you for sharing all of these.

  • @cuysaurus
    @cuysaurus 4 years ago +16

    48:46 He looks so happy.

  • @xiaoweidu4667
    @xiaoweidu4667 3 years ago +3

    The key to a deeper understanding of algorithms is the set of assumptions made about the underlying data. Thank you and great respect.

  • @meenakshisarkar7529
    @meenakshisarkar7529 4 years ago +8

    This is probably the best explanation I have come across of the difference between Bayesian and frequentist statistics. :D

  • @crestz1
    @crestz1 1 year ago +3

    This lecturer is amazing. As a Ph.D. candidate, I always revisit these lectures to familiarise myself with the basics.

  • @SundaraRamanR
    @SundaraRamanR 4 years ago +28

    "Bayesian statistics has nothing to do with Bayes' rule" - knowing this would have avoided a lot of confusion for me over the years. I kept trying to make the (presumably strong) connection between the two and assumed I didn't understand Bayesian reasoning because I couldn't figure out this mysterious connection

    • @WahranRai
      @WahranRai 7 months ago +1

      You are totally wrong!

  • @abunapha
    @abunapha 5 years ago +20

    Starts at 2:37

  • @deltasun
    @deltasun 4 years ago +2

    Impressive lecture, thanks a lot!
    I was also impressed to discover that if, instead of taking the MAP, you take the EAP (expected a posteriori), then the Bayesian approach implies smoothing even with a uniform prior (that is, alpha = beta = 1)! Beautiful.
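
A minimal Python sketch of the observation above (hypothetical code, not from the lecture; it assumes a Beta(alpha, beta) prior, so the posterior over theta is Beta(nH + alpha, nT + beta)):

# Posterior mean ("EAP") vs. posterior mode (MAP) for a coin with a Beta prior.
# With a uniform prior (alpha = beta = 1) the posterior mean is
# (nH + 1) / (nH + nT + 2), i.e. "+1" (Laplace) smoothing, while the MAP
# estimate collapses back to the unsmoothed MLE nH / (nH + nT).

def posterior_mean(nH, nT, alpha=1.0, beta=1.0):
    # mean of Beta(nH + alpha, nT + beta)
    return (nH + alpha) / (nH + alpha + nT + beta)

def posterior_mode(nH, nT, alpha=1.0, beta=1.0):
    # mode of Beta(nH + alpha, nT + beta); assumes the posterior is unimodal
    return (nH + alpha - 1) / (nH + alpha + nT + beta - 2)

print(posterior_mean(3, 0))  # 0.8 -- smoothed, even though the prior is uniform
print(posterior_mode(3, 0))  # 1.0 -- identical to the MLE under a uniform prior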

  • @sandeepreddy6295
    @sandeepreddy6295 3 years ago +4

    Makes the concepts of MLE and MAP very, very clear. We also get to learn that Bayesians and frequentists both trust Bayes' rule.

  • @JohnWick-xd5zu
    @JohnWick-xd5zu 4 years ago +3

    Thank you Kilian, you are very talented!!

  • @saitrinathdubba
    @saitrinathdubba 5 years ago +1

    Just brilliant!! Thank you, Prof. Kilian!!!

  • @brahimimohamed261
    @brahimimohamed261 2 years ago

    Someone from Algeria confirms that this lecture is incredible. You have made complex concepts very simple.

  • @zelazo81
    @zelazo81 4 years ago

    I think I finally understand the difference between frequentist and Bayesian reasoning, thank you :)

  • @sumithhh9379
    @sumithhh9379 4 years ago +3

    Thank you professor Kilian.

  • @mohammadaminzeynali9831
    @mohammadaminzeynali9831 2 years ago

    Thank you, Dr. Weinberger. You are a great lecturer, and also the UA-cam algorithm subtitles your "also" as "eurozone".

  • @Jeirown
    @Jeirown 3 years ago +3

    When he says "basically", it sounds like "Bayesly". And most of the time it still makes sense.

  • @arjunsigdel8070
    @arjunsigdel8070 3 years ago +1

    Thank you. This is great service.

  • @KulvinderSingh-pm7cr
    @KulvinderSingh-pm7cr 5 years ago

    Made my day !! Learnt a lot !!

  • @dude8309
    @dude8309 4 years ago +2

    I have a question about how MLE is formulated when using the binomial distribution (or maybe in general?): I might be overly pedantic or just plain wrong, but looking at 18:01, wouldn't it be "more correct" to say P(H | D; theta) instead of just P(D; theta), since we're looking at the probability of H given the data, while using theta as a parameter?

  • @vishchugh
    @vishchugh 4 years ago +1

    Hi Kilian,
    while calculating the likelihood function in the example, you also took (nH+nT) choose (nH) into consideration. That doesn't change the optimization, but I guess it shouldn't be there, because in P(Data | parameter), with all samples being independent, it should just be Q^nH * (1-Q)^nT.
    Right?
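
A small numerical sketch of this point (hypothetical Python, not from the lecture): the nCr(nH + nT, nH) factor does not depend on Q/theta, so including it only rescales the likelihood curve; both versions peak at the same theta = nH / (nH + nT).

import numpy as np
from math import comb

nH, nT = 7, 3
thetas = np.linspace(0.001, 0.999, 999)

# likelihood of the observed i.i.d. sequence itself: theta^nH * (1 - theta)^nT
L_sequence = thetas**nH * (1 - thetas)**nT
# likelihood of "nH heads out of nH + nT tosses" (binomial PMF, with the coefficient)
L_binomial = comb(nH + nT, nH) * L_sequence

print(thetas[np.argmax(L_sequence)])  # ~0.7
print(thetas[np.argmax(L_binomial)])  # ~0.7 -- the constant factor does not move the argmax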

  • @andrewstark8107
    @andrewstark8107 1 year ago

    From 30:00 pure gold content. :)

  • @DavesTechChannel
    @DavesTechChannel 4 years ago +8

    Amazing lecture, best explanation of MLE vs MAP

  • @StarzzLAB
    @StarzzLAB 3 years ago

    I teared up at the end as well

  • @abhinavmishra9401
    @abhinavmishra9401 3 years ago +1

    Impeccable

  • @marcogelsomini7655
    @marcogelsomini7655 2 years ago

    48:18 loop this!! Thx Professor Weinberger!

  • @jandraor
    @jandraor 4 years ago +1

    What's the name of the last equation?

  • @jijie133
    @jijie133 3 years ago +1

    Great!

  • @yuniyunhaf5767
    @yuniyunhaf5767 4 years ago +1

    thanks prof

  • @JoaoVitorBRgomes
    @JoaoVitorBRgomes 3 years ago +4

    At circa 37:28, professor, you say something along the lines of "which parameter makes our data most likely". Could I say, in other words, "which parameter corresponds to this distribution of data"? But not "which parameter most probably corresponds to this distribution"? Or neither? What confuses me is reading P(D|theta): I read it as the probability of this data/dataset given that I have this theta/parameters/weights, but when I start, I start with the data and then try to estimate the parameters, not the opposite. Suppose I somehow had the weights; then I would try to discover the probability that these weights/parameters/theta belong to this dataset. Weird. Am I a Bayesian? Lol. (e.g. a logistic classification task for fraud). Kind regards!

    • @kilianweinberger698
      @kilianweinberger698  3 years ago +5

      Yes, you may be in the early stages of turning into a Bayesian. Basically, if you treat theta as a random variable and assign it a prior distribution, you can estimate P(theta|D), i.e. what is the most likely parameter given this data.
      If you are a frequentist, then theta is just a parameter of a distribution and you pretend that you drew the data from exactly this distribution. You then maximize P(D;theta), i.e. which parameter theta makes my data most likely.
      (In practice these two approaches end up being very similar ...)
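
A rough illustration of the reply above (hypothetical Python, not from the lecture; the Beta(alpha, beta) prior and the toy counts are made up):

# Frequentist view: theta is a fixed parameter; maximizing P(D; theta) gives the MLE.
# Bayesian view: theta is a random variable with prior P(theta); maximizing
# P(theta | D), which is proportional to P(D | theta) * P(theta), gives the MAP estimate.

def mle(nH, nT):
    return nH / (nH + nT)

def map_estimate(nH, nT, alpha=2.0, beta=2.0):
    # mode of the Beta(nH + alpha, nT + beta) posterior (closed form for a Beta prior)
    return (nH + alpha - 1) / (nH + nT + alpha + beta - 2)

print(mle(2, 0))                 # 1.0   -> "the coin always comes up heads"
print(map_estimate(2, 0))        # 0.75  -> the prior pulls the estimate toward 0.5
print(mle(2000, 1000))           # ~0.6667
print(map_estimate(2000, 1000))  # ~0.6665 -> with lots of data the two nearly agree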

  • @HimZhang
    @HimZhang 2 years ago

    In the coin toss example (lecture notes, under "True" Bayesian approach), P(heads∣D)=...=E[θ|D] = (nH+α)/(nH+α+nT+β). Can anyone explain why the last equality holds?
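
A sketch of the missing step (assuming the same Beta(α, β) prior and binomial likelihood as the lecture notes): the posterior is again a Beta distribution, and the last equality is just its mean.

P(θ|D) ∝ P(D|θ) P(θ) ∝ θ^nH (1-θ)^nT · θ^(α-1) (1-θ)^(β-1) = θ^(nH+α-1) (1-θ)^(nT+β-1),

so θ|D ~ Beta(nH+α, nT+β). The mean of a Beta(a, b) distribution is a/(a+b), hence

P(heads|D) = ∫ P(heads|θ) P(θ|D) dθ = ∫ θ P(θ|D) dθ = E[θ|D] = (nH+α)/(nH+α+nT+β).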

  • @SalekeenNayeem
    @SalekeenNayeem 4 years ago +3

    MLE starts at 11:40

  • @thachnnguyen
    @thachnnguyen 5 months ago

    I raise my hand. Why do you assume any particular type of distribution in this discussion? What if I don't know that formula? All I see is nH and nT. Why not work with those?

  • @Klisteristhashit
    @Klisteristhashit 4 years ago +1

    xkcd comic mentioned in the lecture: xkcd.com/1132/

  • @jachawkvr
    @jachawkvr 4 years ago

    I have a question. Is P(D;theta) the same as P(D|theta)? The same value seems to be used for both in the lecture, but I recall Dr. Weinberger saying earlier in the lecture that there is a difference.

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +9

      Well, for all intents and purposes it is the same. If you write P(D|theta) you imply that theta is a random variable, enabling you to impose a prior P(theta). If you write P(D;theta) you treat it as a parameter, and a prior distribution wouldn't make much sense. If you don't use a prior, the two notations are identical in practice.

    • @jachawkvr
      @jachawkvr 4 years ago +1

      Ok, I get it now. Thank you for explaining this!

  • @coolblue5929
    @coolblue5929 2 years ago

    Very enjoyable. I think a Kilian is like a thousand million, right?
    I got confused at the end though. I need to revise.

  • @imnischaygowda
    @imnischaygowda 1 year ago

    "nH + nT choose nH": what exactly do you mean here?

  • @hafsabenzzi3609
    @hafsabenzzi3609 2 years ago

    Amazing

  • @pritamgouda7294
    @pritamgouda7294 6 months ago

    Can someone tell me which lecture has the proof about the k-nearest-neighbor algorithm that he mentions @5:09?

    • @kilianweinberger698
      @kilianweinberger698  5 months ago

      ua-cam.com/video/oymtGlGdT-k/v-deo.html

    • @pritamgouda7294
      @pritamgouda7294 5 months ago

      @@kilianweinberger698 Sir, I saw that lecture and its notes as well, but the notes mention the Bayes optimal classifier and I don't think it is in the video lecture. Please correct me if I'm wrong. Thank you for your reply 😊

  • @abhishekprajapat415
    @abhishekprajapat415 4 years ago +1

    18:19 How did that expression even come about? What is this expression even called in maths?
    By the way, I am a B.Tech. student, so I guess I might not have studied the math behind this expression.

    • @SalekeenNayeem
      @SalekeenNayeem 4 years ago +1

      Just look up the binomial distribution. That's the usual way of writing the probability of an event that follows a binomial distribution. You may also want to check the Bernoulli distribution first.
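
For reference, a short sketch of the expression asked about at 18:19: it is the probability mass function of the binomial distribution, with theta the probability of heads,

P(nH heads in n = nH + nT tosses; theta) = nCr(nH + nT, nH) * theta^nH * (1 - theta)^nT,

where nCr(n, k) = n! / (k! (n - k)!) counts the orderings in which the nH heads can occur among the n tosses.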

  • @Bmmhable
    @Bmmhable 4 years ago

    At 36:43 you call P(D|theta) the likelihood, the quantity we maximize in MLE, but earlier you emphasized how MLE is about maximizing P(D ; theta) and noted how you made a "terrible mistake" in your notes by writing P(D|theta), which is the Bayesian approach...I'm confused.

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +10

      Actually, it is more subtle. Even if you optimize MAP, you still have a likelihood term. So it is not that Bayesian statistics doesn't have likelihoods; it is just that it allows you to treat the parameters as a random variable. So P(D|theta) is still the likelihood of the data, just here theta is a random variable, whereas in P(D;theta) it would be a hyper-parameter. Hope this makes sense.
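
In symbols, a sketch of that distinction (using the notation from the lecture):

MLE:  theta_MLE = argmax_theta P(D; theta)
MAP:  theta_MAP = argmax_theta P(theta | D) = argmax_theta P(D | theta) P(theta)   (by Bayes' rule; P(D) does not depend on theta)

A likelihood term therefore appears in both objectives; the Bayesian/MAP version just multiplies it by the prior P(theta).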

    • @Bmmhable
      @Bmmhable 4 years ago

      @@kilianweinberger698 Thanks a lot for the explanation. Highly appreciated.

  • @prwi87
    @prwi87 1 year ago

    Edit: After thinking, checking, finishing the lecture, and watching a bit of the lecture after this one, I have come to the conclusion that my first explanation was wrong, as I didn't have enough knowledge. The way it is calculated is fine; where I struggled was to understand which PDF the Professor was using. What threw me off was P(D; theta), which is the joint PDF (I know it's a PMF, but for me they are all PDFs if you put a delta function in there) of obtaining exactly the data D, because D is a realization of some random vector X; so, to be more precise in notation, P(D; theta) should be written as P(X = D; theta). But what the Professor meant was the PDF P(H = n_h; len(D), theta), which is a binomial distribution. Then we can calculate the MLE just as it was calculated during the lecture. But this is not the probability of getting the data D; it is the probability of observing exactly n_h heads in len(D) tosses. Then in MAP we have the conditional PDF H|theta ~ Binom(len(D), theta), written as P(H = n_h | theta; len(D)): we treat theta as a random variable but len(D) as a parameter.
    There are two problems with the explanation that starts around 18:00. Let me state the notation first. Let D be the data gathered; this data is the realization of a random vector X. n_h is the number of heads tossed in D. nCr(x, y) is the number of combinations of x choose y.
    1. The Professor writes that P(D;theta) is equal to the binomial distribution of the number of heads tossed, which is not true. A binomial distribution is determined by two parameters, the number of independent Bernoulli trials (n) and the probability of the desired outcome (p), thus theta = (n, p). If we have tossed the coin n times, there is nothing we don't know about n, since we have chosen it, so n is fixed and, most importantly, known to us! Because of that, let us denote n = len(D), and then theta = p. Now let H = the number of heads tossed; then
    P(H = n_h; len(D), theta) = nCr(len(D), n_h) * theta ^ n_h * (1 - theta) ^ (len(D) - n_h)
    is precisely the distribution that was written by the Professor. I have also noticed that one person in the comments asked why we cannot write P(H|D;theta), or more precisely P(H = n_h|len(D); theta). The reason is that len(D) is not a random variable; we are the ones choosing the number of tosses, and there is nothing random about it. Note that in the notation used in that particular comment, theta is treated as a parameter, as it is written after ";".
    2. To be precise, P(X = D; theta) is a joint distribution. For example, if we had tossed the coin three times, then D = (d1, d2, d3) with d_i = {0, 1} (0 for tails and 1 for heads), and P(X = D;theta) = P(d1, d2, d3;theta). P(X = D;theta) is the joint probability of observing the data D we got from the experiment. The likelihood function is then defined as L(theta|D) = P(X = D;theta), but keep in mind that the likelihood is not a conditional probability distribution, as theta is not a random variable. The correct way to interpret L(theta|D) is as a function of theta whose value also depends on the underlying measurements D. Now, if the data are i.i.d., then we can write
    P(X = D;theta) = P(X_1 = d1;theta) * P(X_2 = d2;theta) * ... * P(X_len(D) = d_len(D);theta) = L(theta|D)
    In our example of coin tossing,
    P(X_i = d_i;theta) = theta ^ d_i * (1 - theta) ^ (1 - d_i), where d_i = {0, 1} (0 for tails and 1 for heads)
    Given that,
    L(theta|D) = theta ^ sum(d_i) * (1 - theta) ^ (len(D) - sum(d_i)),
    where sum(d_i) is simply n_h, the number of heads observed. And now we are maximizing the likelihood of observing the data we have obtained. Note that the way it was done during the lectures was right! But we were maximizing the likelihood of observing n_h heads in len(D) tosses, not of observing exactly the data D.
    Also, for anyone curious, the "true Bayesian method" that the Professor described at the end is called minimum mean-squared error (MMSE) estimation, which aims to minimize the expected squared error between the random variable theta and some estimate of theta computed from the data random vector, g(X).
    To support my argument, here are the sources I used to write the above statements: "Foundations of Statistics for Data Scientists" by Alan Agresti (Chapter 4.2), and "Introduction to Probability for Data Science" by Stanley Chan (Chapter 8.1). Sorry for any grammar mistakes, as English is not my first language. As I'm still learning all this data science stuff I can be wrong, and I'm very open to any criticism and discussion. Happy learning!

    • @beluga.314
      @beluga.314 10 months ago

      You're mixing up "distribution" and "density". P(d1, d2, d3;theta) is correct notation, but P(X = D;theta) would be wrong for a density function; you can't write it like that. Since these are also (discrete) probabilities, though, you can write it like that here.

  • @utkarshtrehan9128
    @utkarshtrehan9128 3 years ago

    MVP

  • @deepfakevasmoy3477
    @deepfakevasmoy3477 4 years ago

    12:46

  • @sushmithavemula2498
    @sushmithavemula2498 5 years ago +4

    Hey Prof, your lectures are really good. But if you could provide some real-world applications/examples while explaining a few of the concepts, it would let everyone understand them better!

  • @logicboard7746
    @logicboard7746 2 years ago

    Bayesian @23:30, then 32:00

  • @vatsan16
    @vatsan16 4 years ago +2

    So the trick to getting past the spam filter is to use obscure words in the English language, eh? Who would have thought xD

    • @kilianweinberger698
      @kilianweinberger698  4 years ago

      Not the lesson I was trying to get across, but yes :-)

    • @vatsan16
      @vatsan16 4 years ago

      @@kilianweinberger698 Okay, I am now having an "omg he replied!!" moment. :D Anyway, you are a really great teacher. I have searched long and hard for a course on machine learning that covers it from a mathematical perspective. I found yours on a Friday and I have now finished 9 lectures in 3 days. Danke schön! :)

  • @kartikshrivastava1500
    @kartikshrivastava1500 2 years ago

    Wow, apt explanation. The captions were bad; at some point they read: "This means that theta is no longer parameter it's a random bear" 🤣

  • @subhasdh2446
    @subhasdh2446 2 years ago

    I'm in the 7th lecture. I hope I find myself commenting on the last one.

  • @xiaoweidu4667
    @xiaoweidu4667 3 years ago

    Talking about logistics and taking stupid questions from students is a major waste of this great teacher's talent.