Expectation Maximization Algorithm | Intuition & General Derivation

  • Published 15 Nov 2024

COMMENTS • 33

  • @MachineLearningSimulation
    @MachineLearningSimulation  3 years ago +8

    Error at 13:20: It is a lower bound, not an upper bound. Maximizing an upper bound is not meaningful. See also the comment by @Flemming for more details.

  • @flemming9262
    @flemming9262 3 years ago +13

    Very well produced video! But log is concave, so you flipped the sign/direction of Jensen's inequality. In other words, you are finding a lower bound on the log-likelihood. BTW, that is in fact arguably desirable, as maximizing a lower bound is informative while maximizing an upper bound is not. Maybe that should be clarified for people learning this stuff.

    • @P4el1co
      @P4el1co 3 years ago +2

      You are correct, this should be noted in the video. Maximizing an upper bound makes no sense, because it doesn't guarantee you any optimal result. You need a tight lower bound that touches your function.

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago +2

      Thanks a lot for the feedback :)
      You are absolutely right, it has to be a lower bound. As you correctly note, maximizing an upper bound is not meaningful. Thanks a lot for pointing it out :)
      P.S.: Sorry for the late reply, I was a little busy the last days.

    • @algorithmo134
      @algorithmo134 2 years ago +1

      The sign should be log >=

    • @algorithmo134
      @algorithmo134 2 years ago +1

      @@MachineLearningSimulation Sign should be >=

    • @algorithmo134
      @algorithmo134 2 years ago +1

      For a concave function, the direction of Jensen's inequality is reversed.
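
Spelled out for readers following this thread, in the video's notation as I understand it here (observed words W, latent thoughts T, proposal distribution q(T), parameters θ), the corrected bound uses Jensen's inequality for the concave log:

```latex
\log p(W \mid \theta)
  = \log \sum_{T} q(T)\,\frac{p(W, T \mid \theta)}{q(T)}
  = \log \mathbb{E}_{q(T)}\!\left[\frac{p(W, T \mid \theta)}{q(T)}\right]
  \;\ge\; \mathbb{E}_{q(T)}\!\left[\log \frac{p(W, T \mid \theta)}{q(T)}\right]
```

Maximizing this lower bound (tight when q(T) is the posterior) is what the E- and M-steps alternate on.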

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago

    Wow, I am still amazed how EM works. It’s really brilliant probability.

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago +3

      Yes, thanks for sharing my enthusiasm for it :D I do think probability theory, the EM algorithm, and math in general are extremely beautiful, if done correctly (which is my goal for the channel).
      There is a lot more you can do with the EM algorithm, like Hidden Markov Models, and there are also more things to consider, like avoiding local optima by sieving, which we will be doing in a future video. Stay tuned ;)

    • @mohammadjoshaghani5388
      @mohammadjoshaghani5388 2 years ago +1

      @@MachineLearningSimulation Really beautiful and cool!
      Thanks for sharing.

  • @todianmishtaku6249
    @todianmishtaku6249 2 years ago +1

    Amazing lovely video. Great job. I feel a bit unlucky that I have not come across your channel earlier.

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      Nice, thanks so much 😊
      I'm happy to help and it's amazing to hear that my approach to teaching is helpful.

  • @ryanyu512
    @ryanyu512 2 years ago +1

    The video is of high quality.
    It would be highly appreciated if the summation symbol were written as just Σ. It is a bit confusing when I look at your written summation symbol; I thought it was summing from 1 to 1 (haha). But this confusion does not degrade your video quality.
    Thanks

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Thanks a lot for the kind words! :)
      I can understand that the notation can be a bit cluttered from time to time. Back when I created this video, I thought it could be helpful not to leave out certain parts or shorten the notation. It's always a compromise :D. I will take your point into consideration for future videos, thanks :)

  • @lucavisconti1872
    @lucavisconti1872 2 years ago +1

    The video explains the algorithm formally and in a very clear way. My question is: what if we have a mix of missing data, i.e. some missing Words and some missing Thoughts?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hi,
      thanks a lot for the comment and the interesting question :). I do not have a good idea of how one would handle such cases. The EM algorithm presented here is particularly powerful for mixture models; what you describe would be a more general probabilistic model. This is unfortunately beyond my knowledge, so I cannot give you a good answer.

  • @pravingaikwad1337
    @pravingaikwad1337 1 year ago +1

    3:50, Theta bar has two components right? (you said 3 components)

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      Great question! :)
      I think it depends on how you count parameters. We have one parameter for the "thoughts" distribution and then there are two parameters for the "words", based on the value of the thoughts distribution. Given a good thought, the probability of a good word is different from that given a bad thought. If you only look at the node level of the directed graphical model, then there are two parameters: one scalar parameter and one two-dimensional vector parameter. If you concatenate the scalar and the two-dimensional vector into a three-dimensional vector (which is what I wanted to express at the time stamp you mentioned; I think it was a bit hidden), then there are three scalar parameters in total.
      Hope that helps :). Let me know if it is still unclear.
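
As a concrete illustration of that parameter count (the variable names and numerical values below are my own assumptions, not taken from the video): one scalar for the thought distribution plus a two-component vector for the word distribution, i.e. three scalars in total.

```python
import numpy as np

# One scalar parameter for the "thoughts" distribution (illustrative value).
p_thought = 0.6                              # P(T = good thought)

# Two parameters for the "words", one per value of the thought (illustrative values).
p_word_given_thought = np.array([0.2, 0.9])  # P(W = good | T = bad), P(W = good | T = good)

# Concatenated into a single parameter vector: three scalars in total.
theta = np.concatenate(([p_thought], p_word_given_thought))
print(theta)        # [0.6 0.2 0.9]
print(theta.shape)  # (3,)
```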

  • @imvijay1166
    @imvijay1166 1 year ago +1

    Hi Felix, this is a nice video on EM, thanks for that. A question: I don't clearly understand why we have to take the posterior as q(T). Why not something else? Why does only the posterior suit q(T)?

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago +1

      Hi, thanks for the comment and the kind words :).
      I hope I can answer your question sufficiently; it has been some time since I uploaded the video. In a more general setting, the Expectation Maximization algorithm is a tool to train latent-variable models. More concretely, it can only train those latent-variable models for which the posterior is tractable. As such, it is a special case of Variational Inference (see also my video on VI: ua-cam.com/video/HxQ94L8n0vU/v-deo.html ).
      Latent-variable models can be of various kinds. Here in the video, I present a super simplistic Bernoulli-Bernoulli model, in which the latent variable is just a binary scalar. However, it can have arbitrary shapes. Commonly, what you see is that the observed variable is a (high-dimensional) image (e.g. 28x28 pixels with 3 color channels is already more than 1000-dimensional) and the latent space might be 30-dimensional. Though for those kinds of models, the EM algorithm might only work if you are interested in clustering with GMMs.
      Hence, maybe as a first answer to your question, the T (or more generally Z) as presented in the video can also be high-dimensional instead of just a binary scalar.
      If your question was more related to why it is just one T and one W, or why we do not have a more complicated graphical model: this is just a modelling assumption. Typically speaking, you can find learning methods for many more complicated graphs, but those are out of scope for this video.
      Hope that helped. :) Let me know if sth is still unclear.

    • @imvijay1166
      @imvijay1166 1 year ago +1

      @@MachineLearningSimulation Thank you!

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      @@imvijay1166 You're welcome! :)
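
Regarding the question in this thread about why the posterior is chosen for q(T): in the Bernoulli-Bernoulli model, setting q(T) = p(T | W, θ_k) is exactly what the E-step computes, and it is tractable via Bayes' rule. The sketch below is illustrative only; the variable names and numbers are assumptions, not taken from the video.

```python
import numpy as np

# Current parameter guess theta_k (illustrative values).
p_thought = 0.6                              # P(T = 1)
p_word_given_thought = np.array([0.2, 0.9])  # P(W = 1 | T = 0), P(W = 1 | T = 1)

def e_step(w):
    """Set q(T) to the posterior P(T | W = w, theta_k) via Bayes' rule."""
    prior = np.array([1.0 - p_thought, p_thought])                               # P(T)
    likelihood = p_word_given_thought if w == 1 else 1.0 - p_word_given_thought  # P(W = w | T)
    joint = prior * likelihood                                                   # P(W = w, T)
    return joint / joint.sum()                                                   # P(T | W = w) = q(T)

print(e_step(1))  # responsibilities for T = 0 and T = 1
```

With this choice the Jensen lower bound touches the log-likelihood at the current parameters, which is why the posterior, rather than an arbitrary distribution, makes the E-step tight.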

  • @kartikkamboj295
    @kartikkamboj295 2 years ago

    Hi there! Can you please explain why we have a parameter vector for 'words' but just a single parameter for 'thoughts'?
    Thanks in advance!

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hi, this is because the Bernoulli distribution for the words is conditioned on the thoughts. Depending on the thought, we have a different distribution for the word. :)
      Take a look at my video on Ancestral Sampling, I think that should clear things up. ua-cam.com/video/EhoYs77xufQ/v-deo.html
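
A minimal ancestral-sampling sketch of that structure (illustrative numbers and names, my own assumptions): one Bernoulli parameter for the thought, and one word parameter per value of the thought, because the word distribution is conditioned on the sampled thought.

```python
import numpy as np

rng = np.random.default_rng(0)

p_thought = 0.6                              # single parameter for the thought T
p_word_given_thought = np.array([0.2, 0.9])  # one word parameter per value of T

def sample_pair():
    """Ancestral sampling: first draw the thought, then the word given that thought."""
    t = int(rng.random() < p_thought)                # T ~ Bernoulli(p_thought)
    w = int(rng.random() < p_word_given_thought[t])  # W | T = t ~ Bernoulli(p_word_given_thought[t])
    return t, w

print([sample_pair() for _ in range(5)])
```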

  • @EngRiadAlmadani
    @EngRiadAlmadani 1 year ago

    The only one that made me understand this evil E_q[P/q] = Σ q · (P/q).
    Thank you!
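
For reference, the identity behind that step is just the definition of an expectation under q together with the marginalization of T (a standard manipulation, written here in the video's T/W notation):

```latex
\mathbb{E}_{q(T)}\!\left[\frac{p(W, T)}{q(T)}\right]
  = \sum_{T} q(T)\,\frac{p(W, T)}{q(T)}
  = \sum_{T} p(W, T)
  = p(W)
```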

  • @ananthakrishnank3208
    @ananthakrishnank3208 6 months ago

    10:27: I don't think that step is right. The summation applies to the whole product (q * p/q); we cannot conveniently apply the summation to just q alone.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +1

    I think I see why theta_k is associated with the responsibilities, instead of theta_{k+1}.

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago +1

      Nice! Good job.
      I also think that this is one of my big takeaways from the EM algorithm: the fact that we somehow have to solve a chicken-and-egg problem, which calls for an iterative algorithm.
      Let me know if something was not clear enough :)
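
To close the loop on that chicken-and-egg point, here is a compact, illustrative EM iteration for a Bernoulli-Bernoulli model (data, names, and initial values are my own assumptions, not the video's): the responsibilities in the E-step are computed with the old θ_k, and only the M-step produces θ_{k+1}.

```python
import numpy as np

# Observed binary words, one per data point (illustrative data).
W = np.array([1, 1, 0, 1, 0, 1, 1, 0])

# Initial guess theta_0: P(T = 1) and P(W = 1 | T) for T = 0 and T = 1.
p_t = 0.4
p_w_given_t = np.array([0.3, 0.8])

for k in range(50):
    # E-step: responsibilities r[n, t] = P(T = t | W = W[n], theta_k),
    # computed with the OLD parameters theta_k.
    prior = np.array([1.0 - p_t, p_t])                               # shape (2,)
    lik = np.where(W[:, None] == 1, p_w_given_t, 1.0 - p_w_given_t)  # shape (N, 2)
    r = prior * lik
    r /= r.sum(axis=1, keepdims=True)

    # M-step: closed-form maximization of the lower bound gives theta_{k+1}.
    p_t = r[:, 1].mean()
    p_w_given_t = (r * W[:, None]).sum(axis=0) / r.sum(axis=0)

print(p_t, p_w_given_t)
```

Note how r is computed from p_t and p_w_given_t before they are overwritten; that ordering is exactly the chicken-and-egg structure discussed above.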