Really amazing explanation Kapil🔥!! Your 'its the recap' sidebars and accurate vocabulary are especially helpful. Would it be a good idea to also draw parallel with implementation? Like how the equation is executed in case of a ML model being trained (for example how the summation over N which is assumed to be large, terms holds true to mini-batch training)?
Thanks Aditya. You have raised a very good point: "How does it relate to mini-batch training?". As you know, during mini-batch training we generally have "noisy gradients" (per step), and this has implications in this context as well, i.e. when KL divergence (rather, the ELBO) is used as a loss function. There are remedies such as the re-parameterization trick, the score function, importance sampling etc. to reduce the gradient variance. No silver bullet, but some workarounds.
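For intuition, here is a minimal sketch contrasting the score-function and re-parameterization gradient estimators mentioned above on a toy objective (my own example; the distribution and target are assumptions, not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: gradient of E_{z ~ N(mu, 1)}[z^2] w.r.t. mu (true value 2*mu).
mu, n = 1.0, 10_000

# Re-parameterization trick: write z = mu + eps with eps ~ N(0, 1),
# then differentiate through the sample: d(z^2)/d(mu) = 2 * z.
eps = rng.normal(size=n)
grad_reparam = 2 * (mu + eps)

# Score-function (REINFORCE) estimator: f(z) * d/d(mu) log N(z; mu, 1).
z = rng.normal(mu, 1.0, size=n)
grad_score = z**2 * (z - mu)

# Both are unbiased estimates of the true gradient 2.0, but the
# re-parameterized estimator has much lower variance.
print(grad_reparam.mean(), grad_score.mean())
print(grad_reparam.var() < grad_score.var())
```

The variance gap is exactly why the re-parameterization trick is the default choice when it is applicable.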
Just to confirm, when using the law of large numbers, that approximation equation should be the sum of the p(x)*log(p(x)/q(x)) averaged over N correct?
Although I won't use this in my life, you made it sound so easy :)) But one thing I didn't understand: the outcome of KL divergence is supposed to be a number that shows the divergence of an approximate distribution from the reference. How do you get a whole distribution at the end of the video?
From now onwards you will see KL divergence in many areas: machine learning, statistics and probability theory :)
KL divergence will give you a number that shows how two distributions differ from each other. But the important thing to understand is what you would do with this number. Why do you need it in the first place? Let's answer that, and then you will get your answer.
A parametric probability distribution is summarized using its parameters. E.g. for a normal distribution you need only the mean and variance. Once you have them, you can generate samples from it and plot them.
Now let's say for your "approximate distribution" you select a normal distribution, but you do not know its mean and variance. To get them you would use an optimization procedure, which requires a number that tells how different you are from the reference. This is where you use the KL divergence.
Once the optimization is done (i.e. minimization using KL divergence as the loss function), you get good values for the mean and variance of your approximate distribution. You would then use this mean and variance to generate samples and plot them ... this is how you get the "whole distribution". More precisely: you now have the parameters of your distribution function, and this enables you to create the plot, compute likelihoods etc.
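For the last step, a minimal sketch (the mu and sigma values here are hypothetical stand-ins for whatever the optimization returns):

```python
import numpy as np

# Suppose the optimization returned these (hypothetical) values for the
# parameters of the approximate normal distribution q.
mu, sigma = 1.5, 0.8

# The parameters ARE the distribution: with them you can draw samples ...
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=100_000)

# ... and the samples let you plot the "whole distribution" (histogram),
# compute likelihoods, etc. Their statistics recover the parameters:
print(round(samples.mean(), 1), round(samples.std(), 1))
```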
@@KapilSachdeva even without video, u can explain clearly the same! Thanks. Also, as a suggestion, I would appreciate it if you put subset simulation in your list of to-do videos :)
VAEs are one of my favorite neural architectures. I have been thinking of explaining various variants of VAEs for dynamical data (time series). Also, I have a tutorial on ELBO, which is used as a loss function in VAEs; check that out. But yes, suggestion accepted :) I will do a set of tutorials on various variants of VAEs. Though I know how GANs work, I have never used them in an industrial setting, and I prefer to explain concepts that I have applied to a real-world problem.
Hello dj S, first, apologies for such a delay in making the tutorial on VAE that I had promised earlier. I published a tutorial on it a few minutes ago - ua-cam.com/video/h9kWaQQloPk/v-deo.html. Most likely by now you already have a good understanding of what a VAE is, but I have tried to explain it in a different manner than is usually done. While my explanation is different, it highlights the primary motivations the VAE's inventors had. So check it out if you are still interested.
Not really; unfortunately it is the notation and nomenclature that make the subject difficult to understand.
Provided that when you write x_1 you are referring to one sample/state of the random variable X (note the capital letter), you should read p_theta(x_1) as: the probability of x_1 (a sample/state of the random variable X), where the distribution function of X is represented using the symbol p, and "theta" (I cannot write latex in comments!) denotes the parameters of the distribution function.
For P(x_1|theta), I am not sure it is valid notation. A valid notation could be P(X|theta), which you should read as the conditional distribution of X given theta. In this case both X and theta could be random variables. When theta is not a random variable, it is an abuse of the notation. This is a bit sensitive, and it is possible that in some books/papers you may find the second notation you have provided; the meaning will depend on the context.
Hope this helps.
@@KapilSachdeva I see what you mean, thank you for taking time to answer both of my questions! You helped me a lot, and your videos are great! I think my problem came from getting confused between frequentist and bayesian notation for likelihood, and also notation between unsupervised and supervised likelihood. Your videos and answers put me on the right path though, it's all clear to me now! Thank you!
Great video. Between your video and ua-cam.com/video/ErfnhcEV1O8/v-deo.html I now have several interesting ways to understand this KL Divergence stuff. Your descriptions of why we use logs, and why we use the ratio inside the log were quite brilliant. Also, thanks for the tip on creating bi-modal distributions using two gaussians. Keep up the good work.
Thanks for the feedback. I have not been good at this aspect; I will try to create the sections. For smaller videos (e.g. this one is only 11 minutes) it is a bit difficult too. 🙏
These videos should be highly recommended by the UA-cam algorithm
Fantastic tutorial! What I find great is that you anticipate the questions arising in the student's mind and then address them with very satisfying explanations!
🙏
Your explanations and visualizations are very good! Also, your teaching style has the perfect tempo. Thank you very much for this great explanation.
The best explanation I've heard about KL-Divergence. Keep up the great work.
🙏
I cannot express the gratitude I have for your explanation. What a beautiful soul you are. Wow.
Thanks Paedru for the kind words.
It was fantastic. The most informative video on KL divergence.
Amazing value, Kapil. I like several of the things you do when you teach: refreshing necessary concepts (expectation), the precision of your language and notation, equivalent expressions, and so on. The pace is also great. Thank you very much!
🙏 Many thanks for the kind words and appreciation.
These videos are pure gold. Thank you so much. You can explain incredible well.
🙏
Was searching for some tutorials on approximate inference and the prerequisites for it and stumbled upon this; my mind was literally blown by the way you explained the concepts here.
🙏
This is the simplest and clearest explanation of KL divergence, thank you
🙏
He is good! Very good to say, for example, that we want the average difference, but when talking about random variables we talk about the expected value... And many other very careful explanations.
Thank you for the video. I am preparing a paper for my math stats class and after many videos yours gave me the best total explanation with terminology I am familiar with so far.
🙏 Good luck with your paper!
@@KapilSachdeva thanks!!
Hi, I had one doubt: at 5:18, why do we multiply p_theta(xi) with log(p_theta(xi)/q_phi(xi)) and not q_phi(xi) with log(p_theta(xi)/q_phi(xi))?
At 7:30 you show the variation with q_phi(xi). It seems like the probability distribution that is multiplied with the function of the random variable, log(P1(x)/P2(x)), is the probability distribution that appears in the numerator. Is there a reason not to multiply by the probability distribution in the denominator?
A slight rephrasing of your question (correct me if this is not what you meant):
Is there a reason to not multiply "with" the probability distribution in the denominator?
It's a good question. So far in the literature only two types of KL divergence (forward & reverse) are defined, and in both the weighting factor is the distribution in the numerator. If you used the probability distribution from the denominator as the weighting factor, you would get 2 more variations. At the moment, I am not aware of any "mathematical" reason/justification that would suggest it is an invalid operation.
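To see the two defined variants side by side, here is a tiny numerical sketch (the distributions are made up for illustration):

```python
import numpy as np

# Two discrete distributions over the same three states (made-up numbers).
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

# Forward KL(p || q): the weight is p, the numerator inside the log.
kl_pq = np.sum(p * np.log(p / q))

# Reverse KL(q || p): swap the roles; now q is both weight and numerator.
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq, kl_qp)  # two different positive numbers: KL is not symmetric
```

In both standard forms the weighting distribution matches the numerator; weighting by the denominator would indeed define two further (non-standard) quantities.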
i loved this small session on KL Divergence. Thank you sir for this beautiful lecture.
🙏
Thank you very much! Your explanations are really clear and neat. Thanks to your video, now I understand KL divergence much much better than I did before.
🙏
This is amaaaazing! What a nicely paced and deep explanation!
🙏
You have a talent for teaching. Good explanation.
🙏
Thanks
🙏
Great content! Definitely need more views. Please keep uploading videos.
🙏
I fell in love with the explanation. Thanks a lot Kapil.
🙏
Thank you for the great explanation. @9:27 I can't see why the reverse KL divergence has mode-seeking behaviour and the forward one has mean-seeking. I understood that P(x) is a multimodal Gaussian distribution, but what is Q(x)? We need both distributions for finding the KL divergence.
I am assuming you have understood the following:
a) KL divergence will give you a "number" that quantifies the difference between two probability distributions, say p and q.
b) There are two different ways you could write the KL divergence. In one approach you would have p as the weighting distribution, and in the other q.
Now, the next question to ask yourself is: in what situation would you even need to compare two probability distributions?
This situation occurs when we are trying to "estimate" p using q.
One way you can estimate q (i.e. figure out its function) is by using an "optimization" process.
The below will not make sense if you do not know what optimization is.
Optimization => you need a cost (loss or objective) function. An objective function's role is to compare true and predicted values. This is what I am doing when I first create/use a multimodal distribution (p) and then try to estimate q as a normal distribution that has 2 parameters (mu and sigma). I use the optimization process with KL divergence as the loss function to assess how well I am adjusting/predicting the values of mu and sigma (of q).
When I used reverse KL, the predicted mu and sigma create a function (q) that converges to the bigger mode of p, whereas for forward KL the function (q) lies somewhere in the middle of p.
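Here is a rough grid-based sketch of that experiment (my own reimplementation with assumed mixture weights and starting points, not the actual code from the video):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Dense grid standing in for the integral over x.
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

# Target p: bimodal mixture, with the bigger mode at x = 4 (assumed weights).
p = 0.7 * norm.pdf(x, 4.0, 1.0) + 0.3 * norm.pdf(x, -4.0, 1.0)

def q_pdf(theta):
    # Candidate approximation q: a single normal with parameters mu, sigma.
    return norm.pdf(x, theta[0], abs(theta[1]) + 1e-6)

def forward_kl(theta):  # KL(p || q): log-ratio weighted by p
    q = q_pdf(theta) + 1e-300
    return np.sum(p * np.log((p + 1e-300) / q)) * dx

def reverse_kl(theta):  # KL(q || p): log-ratio weighted by q
    q = q_pdf(theta) + 1e-300
    return np.sum(q * np.log(q / (p + 1e-300))) * dx

mu_f, sd_f = minimize(forward_kl, [0.0, 2.0], method="Nelder-Mead").x
mu_r, sd_r = minimize(reverse_kl, [1.0, 1.0], method="Nelder-Mead").x

# Forward KL is mean/mass seeking: mu lands near p's overall mean (~1.6)
# with a sigma large enough (~3.8) to stretch over both modes.
# Reverse KL is mode seeking: q collapses onto the bigger mode (mu near 4).
print(round(mu_f, 1), round(abs(sd_f), 1))
print(round(mu_r, 1), round(abs(sd_r), 1))
```

Plotting the two fitted normals against p reproduces the two extra curves discussed in these comments.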
Best Explanation Ever! 🙏 Thanks for this and you do save my life!!!!
🙏
Thank you very much for this lecture. Could you share the code/math you used to generate the "q" distribution in the toy example? I am struggling to understand how to compute q from p.
Compared to other videos, this one's fantastic.
Very nice and lucid way of explanation
🙏
you explain like a messiah! :) life saver
🙏
Great tutorial Kapil. I have a question regarding the approximation in the discrete case at 7:00. Does dividing by N not assume a uniform distribution? Maybe the approximation would have to be weighted by the likelihood?
That is the magic of the law of large numbers. The weighting is implicit: the samples are drawn from p itself, so likelier values of x appear in the sample proportionally more often. If you have sampled a large number of data points, the plain unweighted average therefore converges to the expected value under that distribution. en.wikipedia.org/wiki/Law_of_large_numbers
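A quick numerical check of this (my own toy setup, using two univariate normals so that the KL has a closed form to compare against):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Two normals (made-up parameters) so that KL(p||q) has a closed form.
p_mu, p_sd, q_mu, q_sd = 0.0, 1.0, 1.0, 1.5
exact = np.log(q_sd / p_sd) + (p_sd**2 + (p_mu - q_mu) ** 2) / (2 * q_sd**2) - 0.5

# Monte Carlo: draw the x_i FROM p, then take the plain unweighted average.
# No explicit p(x_i) weight is needed: likelier x values simply show up
# more often in the sample, which is where the weighting hides.
xs = rng.normal(p_mu, p_sd, size=1_000_000)
mc = np.mean(np.log(norm.pdf(xs, p_mu, p_sd) / norm.pdf(xs, q_mu, q_sd)))

print(exact, mc)  # the two values agree closely
```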
Very intuitive explanation. Thank you.
Simple, short and precise.
🙏
Thanks a lot Kapil for your fantastic tutorial series. Keep up the good work :)
🙏
Excellent Explanation! Thanks
🙏
5:21 - what does the subscript notation p mean in the expectation?
It means that the probability distribution used to compute the expectation is "p". Watch the tutorial on Importance Sampling [ ua-cam.com/video/ivBtpzHcvpg/v-deo.html] ... at the start of it I describe this in more detail.
Great video. The only thing I'm having trouble understanding is the plot you showed at the end, at 10:00. I understood that I have 2 distributions p and q and I calculate D_KL(p || q) and it results in a number, so why did you show 2 curves for the forward and reverse KL? Where are those coming from?
I can see why it can be confusing; I should have done a better job to give a proper context for the plots/curves. Thanks for asking for the clarification.
I take that you have understood that KL divergence will give you a number that quantifies the difference between two probability distributions p and q.
The next question that should occur to you is - "What is the application of obtaining this difference (i.e. KL divergence)?"
One application is when you try to approximate a complex distribution with a simpler distribution. Let's say the complex distribution is denoted by `p` and it has 3 parameters (mu, sigma and gamma) and you would want to approximate it using a simpler distribution denoted by `q` which has 2 parameters (mu and sigma).
Now, when you are creating an approximation (function), you could rely on an optimization procedure (gradient descent etc.). But an optimization procedure requires a cost (loss/objective) function. A cost function tells you how far you are from the ground truth. This is where KL divergence is used as the cost function when probability distributions are involved.
Finally, coming to your question about the additional curves: these are the resulting approximate distributions I get when I use forward and reverse KL as the cost (loss/objective) functions. Both of them are normal distributions (mu and sigma), but since the loss functions used are different (forward vs reverse), the results are different.
Someone else also asked a similar question some time back; I do not know how to provide a direct link to the reply of a comment. See the question by Arash Poursadrollah on this video in the comment section and my reply.
Hope this helps!
@@KapilSachdeva Excellent answer! It took me a while to figure it out.
Thank you for the beautifully illustrated explanation!
🙏
Thanks Kapil for making this video. It was super helpful!
🙏
9:00 - here it seems like q is the resulting approximation, but up until then I thought q was a distribution being compared to p? So for example, if p is the red line, shouldn't q be shown as a line as well, and then the resulting forward KL and reverse KL divergences be overlaid?
KL divergence at its core is giving you a way to compare two distributions.
The question you should ask is why one should be interested in doing a comparison in the first place?
In many machine learning / statistics tasks we want to "approximate" a complex distribution (say p) by a simpler distribution (say q). Here we talk only about parametric distributions, i.e. p and q are represented using functions with parameters, e.g. a normal distribution with mu and sigma as parameters.
The exercise (or rather the utility of KL divergence) I showed in the tutorial was about finding the parameters of q (i.e. mu and sigma). In order to find them I used optimization with KL divergence as the loss function. Like in any optimization problem, we start with random values for mu and sigma and then adjust them in such a way that the loss function decreases.
I tried both types of KL divergence as loss functions and showed how the results (i.e. mu and sigma) obtained differ from each other.
I did not do a good job in the tutorial of clearly explaining when and how you really end up using KL divergence, and hence it is confusing for you.
Hope this note makes sense.
@@KapilSachdeva ahhhh i see now! makes sense! thanks so much for your time!! :)
Wonderful explanation 😀 Thanks!
🙏
4:22 - Not sure whether it is so straightforward... going from a discrete to a continuous RV by just replacing the summation with an integral. I think Claude Shannon also made the same mistake.
Thanks Gowri Shankar for pointing this out. Indeed it is not as simple as I made it sound; however, in terms of the "final" structure of the expectation "formula", the change from discrete to continuous is reflected as the change from summation to integration. It would be great if you could share any insights you have that highlight the differences in interpretation (if any).
@@KapilSachdeva I have no light to shed as of now. Planning to read these two papers, but I had a Claude Shannon moment at 4:22 - hence my comment.
www.tsc.uc3m.es/~fernando/bare_conf3.pdf
web.stanford.edu/class/stats311/Lectures/lec-02.pdf
Thanks for the reply, you made my day. Thanks for your services on this platform; your content is cool.
🙏 Thanks Gowri Shankar for these links and the kind words. I gave them a cursory read and they are very well written papers; I have added them to my library to re-read. I still have to do a proper study of measure theory, as a lot of the explanations in the continuous realm are based on it. Last year I wrote an article on entropy that you may find interesting, as I go over some of the topics mentioned in the lecture notes. Here is the link - towardsdatascience.com/but-what-is-entropy-ae9b2e7c2137?sk=23e69749005be756a1d19b6e1c3531f6 - maybe it is of some assistance to you.
Amazing. Thanks for posting.
great video that gives an intuitive understanding 👍
🙏
Great explanation!
🙏
That's a great explanation. Thanks a lot!
Really fantastic 🙏🙏
🙏
Min 6:00 - why is that integral over R considered "hard to compute"?
Integrals in general are computationally heavy, but more importantly, start thinking in terms of multivariate distributions, i.e. when you have a vector (high-dimensional data). In simple words, in a high-dimensional space you have not a single integral but rather multiple nested integrals, and that would make the computation intractable. Needless to say, real-world problems are high dimensional.
@@KapilSachdeva Why are "integrals in general computationally heavy"? I thought we used the law of large numbers for approximating that integral, since it goes from -infty to +infty and a calculator cannot afford continuous sums due to its discrete nature.
Ah, it seems I misunderstood your original question; the usage of the law of large numbers is precisely to avoid the integrals. I first provided the general formulation of the expectation (continuous) and then suggested that in practice one would use the law of large numbers to work around the integral problem.
@@KapilSachdeva I understood that you used that law for approximating that hard integral. The question is: why is that integral hard? Sorry for my doubts!
I think I kind of answered it by suggesting to consider high-dimensional spaces. But here it is in slightly different words -
Most of the time the integrals that we encounter in statistics do not have "closed form" solutions, so we must approximate them using "numerical" methods. The computational complexity of these numerical methods (such as quadrature) increases as the number of nested integrals increases (think double, triple integrals etc.). Therefore we try to find simpler approximations to avoid them. One such simpler approximation, which in the limit leads to the same result, is Monte Carlo integration.
See if en.wikipedia.org/wiki/Numerical_integration gives you more insight.
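As a 1-D illustration of the trade-off (the integrand is an assumed toy example; in higher dimensions quadrature-style schemes blow up while the Monte Carlo recipe stays the same):

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Toy 1-D expectation E_p[x^2] with p = standard normal (true value: 1.0).
f = lambda t: t**2 * norm.pdf(t)

# Deterministic numerical integration. Fine in 1-D, but the cost of such
# quadrature schemes grows exponentially with the number of dimensions.
quad_val, _ = integrate.quad(f, -np.inf, np.inf)

# Monte Carlo integration: sample from p and average. The per-sample cost
# does not grow with the number of nested integrals.
rng = np.random.default_rng(0)
mc_val = np.mean(rng.normal(size=500_000) ** 2)

print(quad_val, mc_val)  # both approach the true value 1.0
```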
Excellent video. I have subscribed since it was worth it. But one very kind and gentle request. Although I understand how much time , effort one needs to put in and how much pain on needs to undertake for coming up with videos of this nature but still would like to request to have couple of things in the future if feasible:
1. A detailed set of videos on Intermediate and advanced statistics.
2. A strongly mathematical set of videos on ML and Deep Learning. (Most resources are a bit too verbose or skip many vital points.)
I repeat, asking for anything is always easy, and I was also a bit hesitant to put these requests forward. But since I loved your explanation and the clarity I gained in quick time, I couldn't hold back. Please don't mind!
Regards
Thanks for your comment and kind words. I will try to make more tutorials on these kinds of topics. Please do not hesitate to suggest any specific topics you have in mind. If I can, I will try to explain them.
@@KapilSachdeva Sure I shall post. Thank you so much for your support!
Nicely explained. Thank you. Which software did you use to prepare your slides for the presentation?
🙏 Powerpoint
Fantastic, amazing! Thank you!
🙏
Thank you for the video! Can I ask: the one thing I don't understand is whether x_i is a possible state that X can take, or a sample. In the beginning you say x_1, etc. are possible states, but when we compute the expected value of the log-likelihood ratio, is that not over samples in a dataset instead?
"sample" and "state" are the same thing here.
e.g. A Die can be in one of the six possible states {1,2,3,4,5,6}
When you roll the die you will get either 1 or 2 or 3 or 4 or 5 or 6 ... you call them sample.
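A tiny Python sketch of this state-vs-sample distinction (assuming numpy; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

states = np.array([1, 2, 3, 4, 5, 6])      # the possible states of the die
samples = rng.choice(states, size=10_000)  # each simulated roll yields one sample

# The empirical mean of the samples approaches the expected value 3.5,
# which is exactly the law-of-large-numbers idea used in the video.
empirical_mean = samples.mean()
```

The states are fixed; the samples are the (repeated, random) draws from those states.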
Simply beautiful ❤️
Well explained.👍
Professor, can you suggest any good tutorials or books to go from estimation theory to somewhat advanced statistics? Moreover, can you tell how much statistics we need in ML fields?
Great work!
🙏
amazing video, is there any link for the slides ?
🙏
epic explanation!
🙏
Really amazing explanation, Kapil 🔥!! Your "it's the recap" sidebars and accurate vocabulary are especially helpful. Would it be a good idea to also draw a parallel with implementation? For example, how the equation is executed when an ML model is being trained (e.g. how the summation over N terms, where N is assumed to be large, holds up under mini-batch training)?
Thanks Aditya. You have raised a very good point: "How does it relate to mini-batch training?"
As you know, during mini-batch training we generally have "noisy gradients" (per step), and this has implications in this context as well, i.e. when KL divergence (rather, the ELBO) is used as a loss function. There are remedies such as the re-parameterization trick, the score function, importance sampling, etc. to reduce the gradient variance. No silver bullet, but some workarounds.
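For the curious, the re-parameterization trick mentioned above can be sketched in a few lines of Python (a minimal numpy illustration only; in practice this is done in an autodiff framework, and the function name here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, n_samples):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).
    All the randomness lives in eps, so mu and sigma remain deterministic
    inputs that gradients can flow through in an autodiff framework."""
    eps = rng.standard_normal(n_samples)
    return mu + sigma * eps

z = reparameterize(mu=2.0, sigma=0.5, n_samples=100_000)
```

The draws behave exactly like samples from N(2.0, 0.5²), but the parameters now appear as a simple deterministic transformation of parameter-free noise.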
Great video, thank you!
🙏
This was very helpful
🙏
Hello, at 7:05, the final equation is missing p_{\theta}(x_i).
Hi Luping, it's not missing; we are approximating the expected value using the law of large numbers. Note that it is now a simple average.
@@KapilSachdeva hello Kapil, yes, you are right, thank you for confirming.
Well done!
🙏
Just to confirm: when using the law of large numbers, that approximation should be the sum of log(p(x_i)/q(x_i)) over samples x_i drawn from p, averaged over N, correct?
Yes
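This law-of-large-numbers approximation of the KL divergence can be sketched in Python (assuming numpy and scipy; the two Gaussians here are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

p = norm(loc=0.0, scale=1.0)  # reference distribution p_theta
q = norm(loc=1.0, scale=1.0)  # approximate distribution q_phi

# Draw N samples from p, then average log(p(x)/q(x)) over them; by the
# law of large numbers this converges to KL(p || q) as N grows.
x = rng.normal(loc=0.0, scale=1.0, size=200_000)
kl_estimate = np.mean(p.logpdf(x) - q.logpdf(x))

# Closed form for two unit-variance Gaussians: (mu_p - mu_q)^2 / 2 = 0.5,
# so the estimate should land near 0.5.
```

Note that the p_theta(x_i) weighting never appears explicitly: drawing the samples from p is what carries that weight, which is why the final expression is just a simple average.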
great job
🙏
Very good - thank you!
🙏
Although I won't use this in my life, you made it sound so easy :)) But one thing I didn't understand: the outcome of the KL divergence is supposed to be a number that shows the divergence of an approximate distribution from the reference. How do you get a whole distribution at the end of the video?
From now onwards you will see KL divergence in many areas: machine learning, statistics, and probability theory :)
KL divergence will give you a number that shows how two distributions differ from each other.
But the important thing to understand is what you would do with this number. Why do you need it in the first place?
Let's answer that, and then you will get your answer.
A parametric probability distribution is summarized by its parameters. E.g. for a normal distribution you need only the mean and variance. Once you have them (mean and variance) you can generate samples from it and then plot them.
Now let's say for your "approximate distribution" you select a normal distribution, but you do not know its mean and variance.
To get them you would use an optimization procedure, which requires a number that tells you how different you are from the reference. This is where you use the KL divergence.
Once the optimization procedure is done (i.e. minimization using KL divergence as the loss function), you will have good values for the mean and variance of your approximate distribution.
You would then use this mean and variance to generate samples and plot them ... this is how you get the "whole distribution". Though the more precise way to say it is that you now have the parameters of your distribution function, which enables you to create the plot, compute likelihoods, etc.
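The whole procedure described above can be sketched in Python (a minimal illustration assuming numpy/scipy; the "unknown" reference distribution, the optimizer choice, and all names are my own assumptions, not the video's implementation):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Pretend the reference distribution is unknown: we only observe samples
# from it (here secretly N(3, 2^2), so we can check the answer).
observed = rng.normal(loc=3.0, scale=2.0, size=50_000)

def objective(params):
    """Monte Carlo estimate of KL(p || q) up to p's (constant) entropy:
    minimizing E_p[-log q(x)] over q's parameters is equivalent to
    minimizing the forward KL divergence."""
    mu, log_sigma = params  # optimize log(sigma) to keep sigma positive
    return -norm(loc=mu, scale=np.exp(log_sigma)).logpdf(observed).mean()

result = minimize(objective, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# With mu_hat and sigma_hat in hand, you can now generate samples from
# the fitted q, plot them, compute likelihoods, etc.
```

The recovered mu_hat and sigma_hat should land close to 3 and 2: the single KL number drives the optimization, and the resulting parameters give you the "whole distribution".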
@@KapilSachdeva Even without video, you can explain just as clearly! Thanks. Also, as a suggestion, I would appreciate it if you put subset simulation on your list of to-do videos :)
@@ArashSadr will try!
Dear Kapil ji,
If time permits, can you kindly upload some videos on Variational Autoencoders and GANs?
Warm Regards
Dj
VAEs are one of my favorite neural architectures. I have been thinking of explaining various variants of VAEs for dynamical data (time series). Also, I have a tutorial on the ELBO, which is used as a loss function in VAEs; check that out. But yes, suggestion accepted :) I will do a set of tutorials on various variants of VAEs. Though I know how GANs work, I have never used them in an industrial setting, and I prefer to explain concepts that I have applied to a real-world problem.
@@KapilSachdeva Thank you so much again. Your videos are really impactful.
🙏
Hello Dj S, first, apologies for such a delay in making the tutorial on VAEs that I had promised earlier.
I published a tutorial on it a few minutes ago - ua-cam.com/video/h9kWaQQloPk/v-deo.html
Most likely by now you already have a good understanding of what a VAE is, but I have tried to explain it in a different manner than is usually done. While my explanation is different, it highlights the primary motivations the VAE's inventors had. So check it out if you are still interested.
Is there any book you would recommend for the same?
Thanks.
Not really! Rather, many different articles, papers, PhD theses, etc.
@@KapilSachdeva ok. Thanks!
🙏
I hope you had a great day today :)
Sorry, another question: is it correct to say that p_theta(x_1) is the same as P(x_1|theta)?
Not really; unfortunately, it is the notation and nomenclature that make this subject difficult to understand.
Provided that when you write x_1 you are referring to one sample/state of the random variable X (note the capital letter),
you should read p_theta(x_1) as:
the probability of x_1 (a sample/state of the random variable X), where the distribution function of X is represented by the symbol p, and "theta" (I cannot write LaTeX in comments!) denotes the parameters of that distribution function.
For P(x_1|theta), I am not sure it is valid notation. A valid notation would be P(X|theta), which you should read as the conditional distribution of X given theta. In that case both X and theta would be random variables. When theta is not a random variable, it is an abuse of notation.
This is a bit subtle, and it is possible that in some books/papers you may find the notation you have provided (the second one). Its meaning will depend on the context.
Hope this helps.
@@KapilSachdeva I see what you mean; thank you for taking the time to answer both of my questions! You helped me a lot, and your videos are great! I think my problem came from confusing frequentist and Bayesian notation for the likelihood, and also the notation for unsupervised versus supervised likelihoods. Your videos and answers put me on the right path, though; it's all clear to me now! Thank you!
🙏
Great video. Between your video and ua-cam.com/video/ErfnhcEV1O8/v-deo.html I now have several interesting ways to understand this KL Divergence stuff. Your descriptions of why we use logs, and why we use the ratio inside the log were quite brilliant. Also, thanks for the tip on creating bi-modal distributions using two gaussians. Keep up the good work.
🙏 Thank you. Happy that it was useful!
Thanks! I wish the video had timecodes.
Hello Yong, what is a time code? As in UA-cam chapter timestamps?
@@KapilSachdeva Oops, yep, timestamps; I used the wrong word. I've been doing too much with OpenCV.
Thanks for the feedback. I have not been good at this aspect; I will try to create the sections. For shorter videos (e.g. this one is only 11 minutes) it is also a bit difficult. 🙏
Best
🙏