MLE for the Multivariate Normal distribution | with example in TensorFlow Probability

  • Published 14 Jun 2024
  • With the Maximum Likelihood Estimate (MLE) we can derive the parameters of the Multivariate Normal from observed data. This video is a full derivation. Here are the notes: raw.githubusercontent.com/Cey...
    The Multivariate Normal is arguably the most important distribution in all of Machine Learning. Observed data is commonly distributed according to it. Therefore, it is necessary to infer the parameters of the underlying distribution (or the distribution we believe to be the underlying one) from the data. This is where the Maximum Likelihood Estimate comes in: we essentially solve an optimization problem, which in this case can be done in closed form.
    For this, we first derive the likelihood and the log-likelihood of the observed data. The derivation of the MLE for the mu/mean vector is straightforward. For the covariance matrix, on the other hand, we need some special matrix derivatives that we take from the Matrix Cookbook: www.math.uwaterloo.ca/~hwolko...
    This book is the "bible" of matrix calculus. We might see it more often when it comes to the Multivariate Normal ;)
    ----------------------------------------
    Information on why the constraint does NOT arise naturally:
    Actually, things don't always arise naturally in reality (unfortunately :/). There are cases where the MLE for the covariance matrix will not be positive definite, although it is still symmetric. In scenarios with more features/dimensions than samples, our covariance matrix can even become singular. There will be a video on this in the future. Until then, take a look at this page of the fantastic scikit-learn documentation: scikit-learn.org/stable/modul...
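    A quick toy illustration (my own example, not from the video) of how the 1/N covariance estimate degenerates when there are fewer samples than dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 5))                  # N = 3 samples, k = 5 dimensions
    mu = X.mean(axis=0)
    sigma = (X - mu).T @ (X - mu) / X.shape[0]   # (5, 5) MLE covariance
    print(np.linalg.matrix_rank(sigma))          # at most N - 1 = 2 < 5, i.e. singular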
    ----------------------
    -------
    📝 : Check out the GitHub Repository of the channel, where I upload all the handwritten notes and source-code files (contributions are very welcome): github.com/Ceyron/machine-lea...
    📢 : Follow me on LinkedIn or Twitter for updates on the channel and other cool Machine Learning & Simulation stuff: / felix-koehler and / felix_m_koehler
    💸 : If you want to support my work on the channel, you can become a patron here: / mlsim
    -------
    ⚙️ My Gear:
    (Below are affiliate links to Amazon. If you decide to purchase the product or something else on Amazon through this link, I earn a small commission.)
    - 🎙️ Microphone: Blue Yeti: amzn.to/3NU7OAs
    - ⌨️ Logitech TKL Mechanical Keyboard: amzn.to/3JhEtwp
    - 🎨 Gaomon Drawing Tablet (similar to a WACOM Tablet, but cheaper, works flawlessly under Linux): amzn.to/37katmf
    - 🔌 Laptop Charger: amzn.to/3ja0imP
    - 💻 My Laptop (generally I like the Dell XPS series): amzn.to/38xrABL
    - 📱 My Phone: Fairphone 4 (I love the sustainability and repairability aspect of it): amzn.to/3Jr4ZmV
    If I had to purchase these items again, I would probably change the following:
    - 🎙️ Rode NT: amzn.to/3NUIGtw
    - 💻 Framework Laptop (I do not get a commission here, but I love the vision of Framework. It will definitely be my next Ultrabook): frame.work
    As an Amazon Associate I earn from qualifying purchases.
    -------
    Timestamps:
    00:00 Introduction
    00:44 Recap: Multivariate Normal
    04:20 Likelihood
    08:12 Log-Likelihood
    10:27 Defining the MLE
    11:28 Maximizing for Mu
    17:04 Maximizing for Sigma (Covariance Matrix)
    28:53 Computational Considerations
    35:59 TFP: Creating a dataset
    38:36 TFP: MLE for Mu
    39:08 TFP: MLE for Sigma (Covariance Matrix)
    42:21 TFP: A simpler way
    43:13 Outro

COMMENTS • 24

  • @MachineLearningSimulation
    @MachineLearningSimulation  2 years ago

    Errata:
    At 14:40, the derivative of the log-likelihood is taken with respect to the mu vector. There is a small error in that derivative which led to an incorrect equality (a row vector is not equal to a column vector). The correct derivative has to follow formula (108) of the matrix cookbook: www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf . The mistake is merely technical and affects neither the subsequent derivations nor the final result. Still, it's interesting to see what errors arise. Thanks to @gammeligeskäsebrot and @Aditya Mehrota for pointing this out :)
    The file on GitHub has been updated with some additional information and a correct version of the derivative: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/essential_pmf_pdf/multivariate_normal_mle.pdf
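    For reference, here is my own reconstruction of the corrected step (using the standard derivative of a quadratic form with a symmetric matrix, with the gradient written as a column vector):

    \frac{\partial}{\partial \mu} \left( -\frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) \right)
        = \sum_{n=1}^{N} \Sigma^{-1} (x_n - \mu)
        \overset{!}{=} 0
        \quad \Rightarrow \quad
        \mu_{\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} x_n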

  • @xiaoweidu4667
    @xiaoweidu4667 2 years ago +3

    You are giving fantastic tutorials, please keep it up. Really appreciate it.

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Thanks a lot ❤️
      There is more like this coming in the future.
      It's amazing feedback like yours that motivates me a lot. Therefore, thanks again 😊

  • @ivankissiov
    @ivankissiov 1 year ago +4

    Thanks!

  • @zejiachen9657
    @zejiachen9657 2 years ago +1

    Thank you for these great tutorials! They helped a lot in getting my assignment done.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +2

    Very clear and instructive. 😊

  • @heitorcarvalho4940
    @heitorcarvalho4940 1 year ago +2

    Found your videos just recently while studying for Multivariate Analysis at college. As many have said and I repeat, they're awesome, congratulations!
    Now, I'd like to ask you two questions if I may:
    1. I'm using Richard Wichern as my main reference book. Have you used that one too, or do you have other recommendations?
    2. Quite often in statistics - at least for me - it's very easy to get lost among all that math, and oftentimes I can't build the intuition on how to transfer that information to a Data Science problem. Do you have any other resources, besides your channel, to help us with that intuition and how to apply that knowledge? Even though Wichern's book is called Applied Multivariate Statistics, it still seems too heavy for me sometimes.

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago +1

      Thanks for the kind words. :)
      1.) I have not heard of the Richard Wichern book, but it seems to be a good resource :). The reason I came to these topics is Machine Learning rather than pure statistics (although the two are of course tightly coupled :)!). The book I mainly used was Christopher Bishop's "Pattern Recognition and Machine Learning", although, admittedly, it is quite a hard read.
      2.) I can totally understand this problem. Transferring knowledge to real issues is always tricky. Often, the necessary experience comes with time. Thanks for also mentioning the channel, although I would argue that the problems this channel deals with are still toy-ish, in a sense :D.
      Probably, solving Kaggle challenges (and looking at the submission notebooks of other Kagglers) can be a helpful thing to do. I can also recommend this beautiful book by Aurélien Géron: www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +2

    It's interesting with MLE that you always have to know or assume that the data follows a certain distribution, like in this case the Normal. I guess the assumption about the distribution is kind of like a prior.

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago +2

      I would agree. Our assumption about the model of the data is somewhat like a prior. Generalizing this idea, one could even speak of "variational priors", i.e., probability distributions over potential probability distributions (which then leads again to functionals).
      But for the applications in Machine Learning & Simulation, it might be interesting to look at it the other way around: you don't necessarily choose the distribution based on how the data looks in a scatter plot, but based on your understanding of the underlying phenomenon (or the simplifying assumptions you made about it).
      Think, for instance, of Gaussian Mixture Models. Surely, in many clustering scenarios the clusters do not follow a Gaussian distribution. But we make this particular assumption for two (probably rather pragmatic) reasons: first, we can still get decent results with it, and in the end that is what we usually care about; second, and more importantly, we found something that is tractable, or that we at least know how to work with.
      This is also the reason the (Multivariate) Normal appears in so many scenarios. Of course, we can also derive it from maximum entropy etc., but it is simply a distribution that is simple, yet versatile. We know its properties, and many things are available analytically or in closed form.

  • @user-hc9kp5tp3z
    @user-hc9kp5tp3z 3 years ago +2

    I really appreciate your videos. Have you considered making a Bayesian Deep Learning tutorial?

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago +1

      Thanks a lot for the feedback :)
      There is a lot of content on my to-do list. I will add Bayesian Deep Learning there, but I think it will take me some time until I can release something on it. For the next weeks I have planned mostly videos on deep generative models. Optimistically, I could have some videos on Bayesian Deep Learning by the end of the year.

  • @GammeligesKaeseBrot
    @GammeligesKaeseBrot 2 years ago +1

    Very nice video! I've got a question on the equation presented at min 14: if x is a 1-by-n vector and mu too, and the Sigma matrix is n-by-n, then the 1st term would be of shape n-by-1 and the second would be 1-by-n. How can they be the same then?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hey,
      thanks a lot for the kind words :)
      Regarding your question: I am not sure if I understand it correctly. What I wanted to show at that point in the video was the following:
      x and mu are R^k vectors, i.e., vectors with k entries. Those are column vectors; one could also write R^{k x 1}. The sigma matrix is then of shape R^{k x k}.
      Let's just abbreviate (x - mu) as x and let sigma^{-1} be A.
      Then the equation I wrote down is: x^T @ A = A @ x
      The left-hand side of the equation describes a vector-matrix product and the right-hand side a matrix-vector product. The vector-matrix product works because we transpose x first, turning it into a row vector, i.e., it is of shape R^{1 x k}.
      The given equation holds since A is symmetric.
      Does this shine some additional light on it? :) Let me know what is unclear.

    • @knowledgedistiller
      @knowledgedistiller 2 years ago +1

      ​@@MachineLearningSimulation I had a similar question
      Assume sigma is of shape (D x D). At 14:42, you are multiplying a vector of shape (1 x D) by a matrix of shape (D x D) on the left side, which produces shape (1 x D). On the right side, you are multiplying a matrix of shape (D x D) by a vector of shape (D x 1), which produces shape (D x 1). Aren't the left and right sides transposes of each other?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      ​@@knowledgedistiller @gammeligeskäsebrot you are both correct. :) I think I made a mistake there. Though, it's not changing the derivation much, but might be an interesting technicality.
      Indeed, the equality sign is not fully correct since, as you note, the left-hand side is a row vector, whereas the right-hand side is a column vector.
      The issue is actually in the derivative of the log-likelihood with respect to the mu vector. If we choose the correct form, as highlighted in formula (108) of the matrix cookbook (www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf ), everything works out fine (due to the symmetry of the covariance matrix). You could also show this in index notation.
      Thanks for the hint. I will leave a pinned comment that there was a small mistake.
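      In symbols, a short way to see the point raised above: since Sigma^{-1} is symmetric,

      \left( \Sigma^{-1} (x - \mu) \right)^\top = (x - \mu)^\top \left( \Sigma^{-1} \right)^\top = (x - \mu)^\top \Sigma^{-1}

      so the (1 x D) row vector on the left-hand side of the video's equation is exactly the transpose of the (D x 1) column vector on the right-hand side, rather than being equal to it.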

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      @@knowledgedistiller I updated the file on GitHub with a small hint on the correct vector derivative: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/essential_pmf_pdf/multivariate_normal_mle.pdf

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 2 years ago +1

    Do you have any thoughts about the pros and cons of PyMC3 or Stan? Can TensorFlow Probability do everything that PyMC3 can do?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      I haven't worked with Stan yet and have only done some introductory tutorials in PyMC3, but I'll still share my (not too informed) thoughts.
      PyMC3 is similar to TFP in that it uses plain Python for most parts (or the parts you interact with). Stan, on the other hand, is a domain-specific language, meaning that you write "probabilistic programs" and then call Stan to evaluate them for you, for example by MCMC. I personally prefer the approach of PyMC3 and TFP, as it helps in combining your projects with other stuff you might do in Python anyway.
      When it comes to performance, it's hard for me to judge, but I would say that Stan could be faster on pure CPU, whereas TFP could benefit from the TensorFlow XLA backend and perform better on a GPU. However, for the tutorials on this channel, I think any of the three is more than sufficient.
      Speaking of functionality: for the basics (MCMC & Variational Inference), I think all three should be mostly identical. I had the impression that TFP has a lot of distributions to choose from which, together with its bijectors, can model almost everything commonly used nowadays. It's probably similar for the others. One big advantage of TFP over the other two (as far as I am aware) is the tight integration with Keras to build "probabilistic layers" (with an integrated reparameterization trick), which makes it soooo easy to build Variational Autoencoders (something I also want to introduce in a future video).
      One thing that PyMC3 or Stan users would probably criticize about TFP is that it uses long function names and sometimes unintuitive ways of doing things, and that the documentation can sometimes be a little too complicated. However, I personally like the API of TFP a lot.
      Based on this limited view of the world of probabilistic programming, you could have probably guessed that I prefer TFP, which is also the reason I am presenting it here on the channel. However, I think that the other projects are also excellent work, and it might be worth taking a look at them.
      Interesting projects to also take a look at are Turing.jl (something like PyMC3, but for Julia) and Edward2, which is similar to Stan (in that it is a domain-specific language) but, afaik, uses TFP as a backend.

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      I found this notebook tutorial for PyMC3, docs.pymc.io/pymc-examples/examples/variational_inference/convolutional_vae_keras_advi.html , but haven't looked at it yet. Based on this, I think my point about the Keras integration being an advantage of TFP might not hold.
      Therefore, to answer your 2nd question: I think that both have similar capabilities; maybe PyMC3 can even do more (or lets you do things in fewer lines of code than TFP).

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      One last point in favor of PyMC3 over TFP: it is more mature; it has existed longer than TensorFlow (Probability). And Google (the main maintainer of TFP) is known for killing projects: killedbygoogle.com/