Error at 13:20: It is a lower bound, not an upper bound. Maximizing an upper bound is not meaningful. See also the comment by @Flemming for more details.
Very well produced video! But log is concave, so you flipped the sign/direction of Jensen's inequality. In other words, you are finding a lower bound on the log-likelihood. BTW, that is in fact arguably desirable, since maximizing a lower bound is informative while maximizing an upper bound is not. Maybe that should be clarified for people learning this stuff.
You are correct, this should be noted in the video. Maximizing an upper bound makes no sense, because it doesn't guarantee you any optimal result. You need a tight lower bound that touches your function.
Thanks a lot for the feedback :)
You are absolutely right, it has to be a lower bound. As you correctly note, maximizing an upper bound is not meaningful. Thanks a lot for pointing it out :)
P.S.: Sorry for the late reply, I was a little busy the last few days.
The sign should be log(...) >= ...
@@MachineLearningSimulation The sign should be >=.
The direction of Jensen's inequality is reversed for a concave function.
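For readers following along, the corrected step reads as follows (a sketch in my own notation, not the video's exact symbols). Since log is concave, Jensen's inequality gives a lower bound:

$$ \log p(W) \;=\; \log \sum_T q(T)\,\frac{p(W, T)}{q(T)} \;\ge\; \sum_T q(T)\,\log \frac{p(W, T)}{q(T)} $$

The right-hand side is the lower bound (ELBO) that EM maximizes; the inequality would point the other way only for a convex function.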
Wow, I am still amazed how EM works. It’s really brilliant probability.
Yes, thanks for sharing my enthusiasm for it :D I do think Probability Theory, EM Algorithm and also Math in general is extremely beautiful, if done correctly (which is my goal for the channel).
There is a lot more you can do with the EM like Hidden Markov Models and there are also more things to consider like avoiding local minima by sieving, which we will be doing in a future video. Stay tuned ;)
@@MachineLearningSimulation Really beautiful and cool!
Thanks for sharing.
Amazing, lovely video. Great job. I feel a bit unlucky that I did not come across your channel earlier.
Nice, thanks so much 😊
I'm happy to help and it's amazing to hear that my approach to teaching is helpful.
The video is high quality.
It would be highly appreciated if the summation symbol were written as just Σ. It is a bit confusing when I look at your written summation symbol; I thought it was summing from 1 to 1 (haha). But this confusion does not degrade your video quality.
Thanks
Thanks a lot for the kind words! :)
I can understand that the notation can be a bit cluttered at times. Back when I created this video, I thought it could be helpful not to leave out certain parts or shorten the notation. It's always a compromise :D. I will take your point into consideration for future videos, thanks :)
The video explains the algorithm formally and in a very clear way. My question is: what if we have a mix of missing data, i.e. some missing Words and some missing Thoughts?
Hi,
thanks a lot for the comment and the interesting question :). I do not have a good idea of how one would handle such cases. The EM algorithm, as presented here, is particularly powerful for mixture models. What you describe would be a more general probabilistic model. This is unfortunately beyond my knowledge, so I cannot give you a good answer.
3:50, Theta bar has two components right? (you said 3 components)
Great question! :)
I think it depends on how you count parameters. We have one parameter for the "thoughts" distribution and then there are two parameters for the "words", based on the value of the thoughts distribution. Given a good thought, the probability of a good word is different from that given a bad thought. If you only look at the node level of the directed graphical model, then there are two parameters: one scalar parameter and one two-dimensional vector parameter. If you concatenate the scalar and the two-dimensional vector into a three-dimensional vector (which is what I wanted to express at the time stamp you mentioned; I think it was a bit hidden), then there are three scalar parameters in total.
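To make the counting concrete, in my own shorthand (these symbols are my notation, not necessarily the video's):

$$ \bar{\theta} = \big(\theta_T,\; \theta_{W \mid T=1},\; \theta_{W \mid T=0}\big) \in [0,1]^3, \qquad \theta_T = p(T{=}1), \quad \theta_{W \mid T=t} = p(W{=}1 \mid T{=}t) $$

So at the node level it is one scalar plus one two-dimensional vector, and concatenated it is three scalar parameters.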
Hope that helps :). Let me know if it is still unclear.
Hi Felix, this is a nice video on EM, thanks for that. I have a question: I don't clearly understand why we have to take only the posterior as q(T). Why not something else? Why does only the posterior suit q(T)?
Hi, thanks for the comment and the kind words :).
I hope I can answer your question sufficiently, it has been some time since I uploaded the video. In a more general setting, the Expectation Maximization algorithm is a tool to train latent-variable models. More concretely, it can only train those latent-variable models for which the posterior is tractable. As such, it is a special case of Variational Inference (see also my video on VI: ua-cam.com/video/HxQ94L8n0vU/v-deo.html ).
Latent-variable models can be of various kinds. Here in the video, I present a super simplistic Bernoulli-Bernoulli model. In this model, the latent variable is just a binary scalar. However, it can have arbitrary shapes. Commonly, what you see is that the observed variable is a (high-dimensional) image (e.g. 28x28 pixels with 3 color channels is already more than 1000-dimensional) and the latent space might be 30 dimensional. Though, for those kinds of models, the EM algorithm might only work if you are interested in clustering with GMMs.
Hence, maybe as a first answer to your question, the T (or more generally Z) as presented in the video can also be high-dimensional instead of just a binary scalar.
If your question was more related to why it is just one T and one W, or why we do not have a more complicated graphical model: this is just a modelling assumption. Generally speaking, you can find learning methods for many more complicated graphs, but those are out of the scope of this video.
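For the "why the posterior" part specifically, the standard identity behind EM may help (a sketch in my notation):

$$ \log p(W \mid \theta) \;=\; \underbrace{\sum_T q(T) \log \frac{p(W, T \mid \theta)}{q(T)}}_{\text{lower bound } \mathcal{L}(q, \theta)} \;+\; \underbrace{\mathrm{KL}\big(q(T) \,\|\, p(T \mid W, \theta)\big)}_{\ge 0} $$

The KL term vanishes exactly when q(T) = p(T | W, θ), so choosing the posterior makes the lower bound tight; any other q leaves a gap between the bound and the log-likelihood.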
Hope that helped. :) Let me know if something is still unclear.
@@MachineLearningSimulation Thank you!
@@imvijay1166 You're welcome! :)
Hi there! Can you please explain why we have a parameter vector for 'words' but just a single parameter for 'thoughts'?
Thanks in advance!
Hi, this is because the Bernoulli distribution for the words is conditioned on the thoughts. Depending on the thought, we have a different distribution for the word. :)
Take a look at my video on Ancestral Sampling, I think that should clear things up. ua-cam.com/video/EhoYs77xufQ/v-deo.html
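As a minimal sketch of that conditioning (hypothetical parameter values and names, not taken from the video): the thought is sampled first, and the word's Bernoulli parameter is then selected by the sampled thought.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only.
theta_T = 0.7                        # p(T = good thought)
theta_W_given_T = {1: 0.9, 0: 0.2}   # p(W = good word | T), one entry per thought value

def sample_pair():
    # Ancestral sampling: parent node (thought) first, then the child (word),
    # whose distribution depends on the sampled thought.
    t = rng.binomial(1, theta_T)
    w = rng.binomial(1, theta_W_given_T[t])
    return t, w

print([sample_pair() for _ in range(5)])
```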
The only one that made me understand this evil expression (E_q[P/q] = Σ q · P/q).
Thank you!
You're welcome :) I am honored, thanks for the kind words.
10:27 I don't think it is right. The summation is over the whole product (q * p/q); we cannot conveniently apply the summation to just q alone.
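For reference, the step in question is usually written as multiplying and dividing by q(T) inside the sum, so the summation indeed stays over the whole product (my reconstruction of the standard derivation, not a transcript of the video):

$$ \sum_T p(W, T) \;=\; \sum_T q(T)\,\frac{p(W, T)}{q(T)} $$

This is then read as an expectation under q and bounded from below with Jensen's inequality, as in the step discussed above.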
I think I see why theta_k is associated with responsibilities, instead of theta_k+1.
Nice! Good job.
I also think that this is one of my big learnings for the EM-Algorithm, the fact that we somehow have to solve a chicken-egg problem which calls for an iterative algorithm.
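To make the chicken-and-egg structure concrete, here is a minimal NumPy sketch of such a loop for a two-component Bernoulli model (variable names, initial values, and data are my own illustration, not the video's notation): the E-step computes responsibilities with the current theta_k, and the M-step then produces theta_{k+1} from those responsibilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: observed words only (1 = good word, 0 = bad word); the thoughts are latent.
w = rng.binomial(1, 0.6, size=200)

# Hypothetical initial guesses: p(T=1), p(W=1 | T=1), p(W=1 | T=0).
theta_T, theta_W1, theta_W0 = 0.5, 0.8, 0.3

for k in range(50):
    # E-step: responsibilities r_n = p(T=1 | w_n, theta_k), computed with the CURRENT parameters.
    lik1 = theta_T * theta_W1**w * (1 - theta_W1)**(1 - w)
    lik0 = (1 - theta_T) * theta_W0**w * (1 - theta_W0)**(1 - w)
    r = lik1 / (lik1 + lik0)

    # M-step: closed-form updates turn the responsibilities into theta_{k+1}.
    theta_T = r.mean()
    theta_W1 = (r * w).sum() / r.sum()
    theta_W0 = ((1 - r) * w).sum() / (1 - r).sum()

print(theta_T, theta_W1, theta_W0)
```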
Let me know if something was not clear enough :)