MLE for the Multivariate Normal distribution | with example in TensorFlow Probability
- Published 14 Jun 2024
- With the Maximum Likelihood Estimate (MLE) we can derive parameters of the Multivariate Normal based on observed data. This video is a full derivation. Here are the notes: raw.githubusercontent.com/Cey...
The Multivariate Normal is potentially the most important distribution in all of Machine Learning. Observed data is commonly distributed according to it. Therefore, it is necessary to infer the parameters of the underlying distribution (or the distribution we believe to be the underlying one) from the data. That is where the Maximum Likelihood Estimate comes in: we essentially solve an optimization problem, which in this case can be done in closed form.
For this we first derive the Likelihood and Log-Likelihood of the observed data. The derivation of the MLE for the mu/mean vector is straightforward. For the covariance matrix, on the other hand, we need some special matrix derivatives that we take from the Matrix Cookbook: www.math.uwaterloo.ca/~hwolko...
This book is the "bible" of matrix calculus. We might see it more often when it comes to the Multivariate Normal ;)
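For reference, the closed-form estimators derived in the video can be sketched in a few lines of NumPy (this is my own minimal illustration, not the video's TFP code; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples from a ground-truth 2-D Gaussian
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=10_000)  # shape (N, D)

# MLE for the mean: simply the sample average
mu_hat = X.mean(axis=0)

# MLE for the covariance: average outer product of the centered samples
# (note the 1/N factor, not the unbiased 1/(N-1) of np.cov's default)
centered = X - mu_hat
sigma_hat = centered.T @ centered / X.shape[0]
```

With 10,000 samples both estimates land close to the ground-truth parameters, which is a quick sanity check for the closed-form result.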
----------------------------------------
Information on why the constraint does NOT arise naturally:
Actually, things don't always arise naturally (unfortunately :/) in reality. There are cases where the MLE for the covariance matrix is not positive definite, although still symmetric. In scenarios with more features/dimensions than samples, the covariance matrix can even become singular. There will be a video on this in the future. Until then, take a look at this page of the fantastic scikit-learn documentation: scikit-learn.org/stable/modul...
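To see why more dimensions than samples is problematic, here is a small NumPy experiment (my own illustration): with N samples, the centered data matrix has rank at most N - 1, so the MLE covariance cannot have full rank when D > N.

```python
import numpy as np

rng = np.random.default_rng(42)

# More dimensions (D = 5) than samples (N = 3)
N, D = 3, 5
X = rng.normal(size=(N, D))

centered = X - X.mean(axis=0)
sigma_hat = centered.T @ centered / N  # MLE covariance, shape (D, D)

# The rank is at most N - 1 < D, so the matrix is singular
rank = np.linalg.matrix_rank(sigma_hat)
det = np.linalg.det(sigma_hat)
```

The determinant is (numerically) zero, so the density of the fitted Gaussian is not even defined; shrinkage estimators like the ones in the linked scikit-learn page fix this.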
-------
📝 : Check out the GitHub Repository of the channel, where I upload all the handwritten notes and source-code files (contributions are very welcome): github.com/Ceyron/machine-lea...
📢 : Follow me on LinkedIn or Twitter for updates on the channel and other cool Machine Learning & Simulation stuff: / felix-koehler and / felix_m_koehler
💸 : If you want to support my work on the channel, you can become a Patreon here: / mlsim
-------
⚙️ My Gear:
(Below are affiliate links to Amazon. If you decide to purchase the product or something else on Amazon through this link, I earn a small commission.)
- 🎙️ Microphone: Blue Yeti: amzn.to/3NU7OAs
- ⌨️ Logitech TKL Mechanical Keyboard: amzn.to/3JhEtwp
- 🎨 Gaomon Drawing Tablet (similar to a WACOM Tablet, but cheaper, works flawlessly under Linux): amzn.to/37katmf
- 🔌 Laptop Charger: amzn.to/3ja0imP
- 💻 My Laptop (generally I like the Dell XPS series): amzn.to/38xrABL
- 📱 My Phone: Fairphone 4 (I love the sustainability and repairability aspect of it): amzn.to/3Jr4ZmV
If I had to purchase these items again, I would probably change the following:
- 🎙️ Rode NT: amzn.to/3NUIGtw
- 💻 Framework Laptop (I do not get a commission here, but I love the vision of Framework. It will definitely be my next Ultrabook): frame.work
As an Amazon Associate I earn from qualifying purchases.
-------
Timestamps:
00:00 Introduction
00:44 Recap: Multivariate Normal
04:20 Likelihood
08:12 Log-Likelihood
10:27 Defining the MLE
11:28 Maximizing for Mu
17:04 Maximizing for Sigma (Covariance Matrix)
28:53 Computational Considerations
35:59 TFP: Creating a dataset
38:36 TFP: MLE for Mu
39:08 TFP: MLE for Sigma (Covariance Matrix)
42:21 TFP: A simpler way
43:13 Outro
Errata:
At 14:40, the derivative of the log-likelihood is taken with respect to the mu vector. There is a small error in that derivative which led to an incorrect equality (a row vector is not equal to a column vector). The correct derivative follows formula (108) of the Matrix Cookbook: www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf . The mistake is rather technical and affects neither the subsequent derivations nor the final result. Still, it's interesting to see which errors arise. Thanks to @gammeligeskäsebrot and @Aditya Mehrota for pointing this out :)
The file on GitHub has been updated with some additional information and a correct version of the derivative: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/essential_pmf_pdf/multivariate_normal_mle.pdf
You are giving fantastic tutorials, please keep it up. Really appreciate it.
Thanks a lot ❤️
There is more like this coming in the future.
It's amazing feedback like yours that motivates me a lot. Therefore, thanks again 😊
Thanks!
Thank you for the kind donation 😊
Thank you for these great tutorials! They helped a lot in getting my assignment done.
You're very welcome 😁
Very clear and instructive. 😊
Thanks a lot :)
Found your videos just recently while studying for Multivariate Analysis at college. As many have said and I repeat, they're awesome, congratulations!
Now, I'd like to ask you two questions if I may:
1. I'm using Richard Wichern as my main reference book. Have you used that one too, or do you have other recommendations?
2. Quite often in statistics - at least for me - it's very easy to get lost among all that math, and oftentimes I can't build the intuition on how to transfer that information to a Data Science problem. Do you have any other resources, besides your channel, to help us with that intuition and with how to apply that knowledge? Even though Wichern's book is called Applied Multivariate Statistics, it still seems too heavy for me sometimes.
Thanks for the kind words. :)
1.) I have not heard of the Richard Wichern book, but it seems to be a good resource :). The reason I came to these topics is Machine Learning and not pure statistics (although the two are of course tightly coupled :)!). The book I mainly used was Christopher Bishop's "Pattern Recognition and Machine Learning". Although, admittedly, it is quite a hard read.
2.) I can totally understand this problem. Transferring knowledge to real issues is always tricky. Often, the necessary experience comes over time. Thanks for also mentioning the channel, although I would argue that the issues this channel deals with are still toy-ish, in a sense :D.
Probably, solving Kaggle challenges (and looking at the submission notebooks of other Kagglers) can be a helpful thing to do. I can also recommend this beautiful book by Aurélien Géron: www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/
It's interesting with MLE that you always have to know, or assume, that the data follows a certain distribution, like the Normal in this case. I guess the assumption about the distribution is kind of like a prior.
I would agree. Our assumption on the model of the data is somewhat like a prior. Generalizing this idea, one could even speak of "variational priors", i.e. probability distributions over potential probability distributions (which then leads again to functionals).
But for the application in Machine Learning & Simulation: it might be interesting to look at it the other way around: You don't necessarily choose the distribution based on how the data looks in a scatter, but you choose it based on your understanding of the underlying phenomenon (or the simplifying assumptions you made on it).
Think for instance of Gaussian Mixture Models. Surely, in many clustering scenarios the clusters do not necessarily follow a Gaussian distribution. But we made this particular assumption for two (probably rather pragmatic) reasons: First, we can still get decent results with it and in the end that is what we usually care about. And more importantly, secondly, we found something that is tractable or we at least know how to work with it.
This is also the reason the (Multivariate) Normal appears in so many scenarios. (Of course we can also derive it from Maximum entropy etc.), but it is just a distribution that is simple, yet versatile. We know its properties and many things are available analytically or in a closed-form.
I really appreciate your videos. Have you considered making a Bayesian Deep Learning tutorial?
Thanks a lot for the feedback :)
There is a lot of content on my To-Do list. I will add Bayesian Deep Learning there, but I think it will take me some time until I can release something on it. For the next weeks I planned mostly videos on deep generative models. Optimistically, I could have some videos on Bayesian Deep Learning at the end of the year.
Very nice video! I've got a question on the equation presented at min 14: if x is a 1-by-n vector and mu too, and the Sigma matrix is n-by-n, then the first term would be of shape n-by-1 and the second 1-by-n. How can they be the same then?
Hey,
thanks a lot for the kind words :)
Regarding your question: I am not sure, if I understand it correctly. What I wanted to show at that point in the video was that:
x and mu are R^k vectors, i.e., vectors with k entries. And those are column vectors, one could also write R^{k x 1}. Then the sigma matrix is of shape R^{k x k}.
Let's just simplify (x - mu) as x and let sigma^{-1} be A.
Then the equation, I wrote down is: x^T @ A = A @ x
The left-hand side of the equation describes a vector-matrix product and the right-hand side a matrix vector product. The vector-matrix product works, because we transpose x first, turning it into a row vector, i.e., it is of shape R^{1 x k}.
The given equation holds, since A is symmetric.
Does this shed some additional light on it? :) Let me know what is unclear.
@@MachineLearningSimulation I had a similar question
Assume sigma is of shape (D x D). At 14:42, you are multiplying a vector of shape (1 x D) by a matrix of shape (D x D) on the left side, which produces shape (1 x D). On the right side, you are multiplying a matrix of shape (D x D) by a vector of shape (D x 1), which produces shape (D x 1). Aren't the left and right sides transposes of each other?
@@knowledgedistiller @gammeligeskäsebrot you are both correct. :) I think I made a mistake there. Though it doesn't change the derivation much, it might be an interesting technicality.
Indeed, the equality sign would not be fully correct, since, as you note, the left-hand side refers to a row vector, whereas the right-hand side refers to a column vector.
The issue is actually in the derivative of the log-likelihood with respect to the mu vector. If we choose the correct form, as highlighted in formula (108) of the Matrix Cookbook (www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf ), everything should be fine (due to the symmetry of the covariance matrix). You could also show that in index notation.
Thanks for the hint. I will leave a pinned comment that there was a small mistake.
@@knowledgedistiller I updated the file on GitHub with a small hint on the correct vector derivative: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/essential_pmf_pdf/multivariate_normal_mle.pdf
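To make the shape discussion in this thread concrete, here is a small NumPy check (my own sketch): for a symmetric A (standing in for Sigma^{-1}), x^T A and A x contain the same numbers, but one is a row vector and the other a column vector, so an equality sign between them is, strictly speaking, wrong.

```python
import numpy as np

rng = np.random.default_rng(1)

k = 3
x = rng.normal(size=(k, 1))  # column vector, shape (k, 1)
A = rng.normal(size=(k, k))
A = (A + A.T) / 2            # symmetrize A, like a covariance (inverse)

lhs = x.T @ A                # vector-matrix product: row vector, shape (1, k)
rhs = A @ x                  # matrix-vector product: column vector, shape (k, 1)

# Entry-wise they agree because A is symmetric: x^T A = (A x)^T ...
same_entries = np.allclose(lhs, rhs.T)
# ... but the shapes differ, so lhs == rhs as written is a transpose off.
```

This is exactly why the corrected derivative has to follow the row-vector convention of formula (108) rather than equating the two forms directly.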
Do you have any thoughts about the pro and the cons of PyMC3 or Stan? Can TensorFlow Probability do everything that PyMC3 can do?
I haven't worked with Stan yet and have only done some introductory tutorials in PyMC3, but I'll still share my (not too informed) thoughts.
PyMC3 is similar to TFP in that it uses plain Python for most parts (or the parts you interact with). Stan, on the other hand, is a domain-specific language, meaning that you write "probabilistic programs" and then call Stan to evaluate them for you, for example by MCMC. I personally prefer the approach of PyMC3 and TFP as this helps in combining your projects with other stuff you might do in Python anyway.
When it comes to performance, it's hard for me to judge, but I would say that Stan could be faster on pure CPU, whereas TFP could benefit from the TensorFlow XLA backend and perform better on a GPU. However, for the tutorials on this channel, I think any of the three are more than sufficient.
Speaking about functionality: For the basics (MCMC & Variational Inference), I think all three should be mostly identical. I had the impression that TFP has a lot of distributions to choose from which, together with its bijectors, can model almost everything commonly used nowadays. Probably, it's similar for the others. One big advantage of TFP over the other two (as far as I am aware) is the tight integration with Keras to build "probabilistic layers" (with an integrated reparameterization trick), which makes it soooo easy to build Variational Autoencoders (sth I also want to introduce in a future video).
One thing that PyMC3 or Stan users would probably criticize about TFP is that it uses long function names and sometimes unintuitive ways of doing things, and that the documentation can sometimes be a little too complicated. However, I personally like the API of TFP a lot.
Based on this limited view on the world of probabilistic programming, you could have probably guessed that I prefer TFP, which is also the reason I am presenting it here on the channel. However, I think that the other projects are also excellent work, and it might be worth taking a look at them.
Interesting projects to also take a look at would be Turing.jl (something like PyMC3, but for Julia) and Edward2, which is similar to Stan (in that it is a domain-specific language) but, afaik, uses TFP as a backend.
I found this notebook tutorial for PyMC3, docs.pymc.io/pymc-examples/examples/variational_inference/convolutional_vae_keras_advi.html , but haven't looked at it yet. Based on this, I think my point about the Keras integration being TFP's advantage may not be true.
Therefore, to answer your 2nd question: I think that both have similar capabilities; maybe PyMC3 can even do more (or you can do things in fewer lines of code than in TFP).
One last point in favor of PyMC3 over TFP: It is more mature; it has existed longer than TensorFlow (Probability). And Google (the main maintainer of TFP) is known for killing projects: killedbygoogle.com/