KL Divergence - CLEARLY EXPLAINED!

  • Published 1 Jan 2025

COMMENTS • 151

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +27

    These videos should be highly recommended by the UA-cam algorithm

  • @Vikram-wx4hg
    @Vikram-wx4hg 2 years ago +23

    Fantastic tutorial! What I find great is that you anticipate the questions arising in the student's mind and then address them with very satisfying explanations!

  • @hannes7218
    @hannes7218 5 days ago

    Your explanations and visualizations are very good! Also, your teaching style has the perfect tempo. Thank you very much for this great explanation

  • @kaiponel5506
    @kaiponel5506 1 year ago +3

    The best explanation I've heard about KL-Divergence. Keep up the great work.

  • @paedrufernando2351
    @paedrufernando2351 3 years ago +2

    I cannot express the gratitude I have for your explanation. What a beautiful soul you are. Wow.

  • @homakashefiamiri3749
    @homakashefiamiri3749 5 months ago

    It was fantastic. The most informative video on KL divergence

  • @aclapes
    @aclapes 3 years ago +6

    Amazing value, Kapil. I like several of the things you do when you teach: refreshing necessary concepts (expectation), the precision of your language and notation, equivalent expressions, and so on. The pace is also great. Thank you very much!

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      🙏 Many thanks for the kind words and appreciation.

  • @TheProblembaer2
    @TheProblembaer2 10 months ago +1

    These videos are pure gold. Thank you so much. You can explain incredibly well.

  • @shantanudixit5453
    @shantanudixit5453 1 year ago +1

    was searching for some tutorials on approximate inference and the pre-reqs for it and stumbled upon this, literally my mind got blown with the way you explained the concepts here.

  • @matiascaceres2600
    @matiascaceres2600 2 years ago +1

    this is the most simple and clear explanation of KL divergence, thank you

  • @pauledam2174
    @pauledam2174 5 months ago

    He is good! Very good to say, for example, we want the average difference, but when talking about rv we talk about expected value ... . And many other very careful explanations.

  • @puck6016
    @puck6016 3 years ago +1

    Thank you for the video. I am preparing a paper for my math stats class and after many videos yours gave me the best total explanation with terminology I am familiar with so far.

    • @KapilSachdeva
      @KapilSachdeva 3 years ago +1

      🙏 Good luck with your paper!

    • @puck6016
      @puck6016 3 years ago

      @@KapilSachdeva thanks!!

  • @nikhilsrajan
    @nikhilsrajan 1 year ago +1

    Hi, I had one doubt: at 5:18, why do we multiply log(p_theta(xi)/q_phi(xi)) by p_theta(xi) and not by q_phi(xi)?
    At 7:30 you show the variation with q_phi(xi). It seems that the probability distribution multiplied with the function of the random variable log(P1(x)/P2(x)) is always the one that appears in the numerator. Is there a reason not to multiply by the probability distribution in the denominator?

    • @KapilSachdeva
      @KapilSachdeva 1 year ago

      Rephrasing your question a bit (correct me if this is not what you meant):
      Is there a reason not to multiply "with" the probability distribution in the denominator?
      It's a good question. So far in the literature only two types of KL divergence (forward & reverse) are defined. If you use the prob dist from the denominator as the weighting factor then you would have 2 more variations. At the moment, I am not aware of any "mathematical" reason/justification that would suggest that it is an invalid operation.
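
A small numerical sketch of the weighting question in this thread, in Python with NumPy. The two discrete distributions are made-up illustrative values, not taken from the video. It computes forward and reverse KL, then the denominator-weighted sum being asked about. Algebraically, sum q log(p/q) = -sum q log(q/p), so that sum is the negative of the KL written in the other direction.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions over three states (assumed values).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

forward = kl(p, q)   # weighted by p, the numerator of the log ratio
reverse = kl(q, p)   # weighted by q, the numerator of the log ratio

# Weighting log(p/q) by the denominator q instead of the numerator p.
denominator_weighted = float(np.sum(q * np.log(p / q)))

print(f"forward KL(p||q)     = {forward:.4f}")
print(f"reverse KL(q||p)     = {reverse:.4f}")
print(f"sum q * log(p/q)     = {denominator_weighted:.4f}")
```

Under these assumptions the third number is just the second one with its sign flipped, so the denominator weighting carries the same information as the reverse form.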

  • @mdekramnazar1732
    @mdekramnazar1732 11 months ago +1

    i loved this small session on KL Divergence. Thank you sir for this beautiful lecture.

  • @bluepeace93
    @bluepeace93 2 years ago +1

    Thank you very much! Your explanations are really clear and neat. Thanks to your video, now I understand KL divergence much much better than I did before.

  • @sergiobromberg9233
    @sergiobromberg9233 1 year ago +2

    This is amaaaazing! What a nicely paced and deep explanation!

  • @gauravkumarshah2771
    @gauravkumarshah2771 3 years ago +2

    You have a talent for teaching. Good explanation.

  • @shadabalam2122
    @shadabalam2122 9 months ago +1

    Thanks

  • @raghav2198
    @raghav2198 3 years ago +3

    Great content! Definitely need more views. Please keep uploading videos.

  • @spandanbasu5653
    @spandanbasu5653 2 years ago +1

    I fell in love with the explanation. Thanks a lot Kapil.

  • @TJ-zs2sv
    @TJ-zs2sv 1 year ago +1

    Thank you for the great explanation. At 9:27 I can't see how reverse KL divergence has mode-seeking behaviour and forward KL has mean-seeking behaviour. I understood that P(x) is a multimodal Gaussian distribution, but what is Q(x), given that we need both distributions to compute the KL divergence?

    • @KapilSachdeva
      @KapilSachdeva 1 year ago

      I am assuming you have understood the following:
      a) KL divergence will give you a "number" that quantifies the difference between two probability distributions, say p and q.
      b) There are two different ways you could write the KL divergence. In one approach you would have p as the weighting distribution and in the other q.
      Now, the next question to ask yourself is in what situation you would even need to compare two probability distributions.
      This situation occurs when we are trying to "estimate" p using q.
      One way you can estimate q (or figure out the function for q) is by using an "optimization" process.
      The following will not make sense if you do not know what optimization is.
      Optimization => you need a cost (loss or objective) function. An objective function's role is to compare true and predicted values. This is what I am doing when I first create a multimodal distribution (p) and then try to estimate q as a normal distribution that has 2 parameters (mu and sigma). I am using an optimization process with KL divergence as the loss function to assess how well I am adjusting/predicting the values of mu and sigma (of q).
      When I used reverse KL, the predicted mu and sigma create a function (q) that seems to converge to the bigger mode of p, whereas for forward KL the function (q) lies somewhere in the middle of p.
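
The experiment described in this reply can be sketched as follows. This is not the code used in the video: the mixture weights and locations of p are assumptions, the integrals are approximated on a grid, and a brute-force grid search over (mu, sigma) stands in for a gradient-based optimizer to keep the sketch short.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Integration grid (assumed range; wide enough to cover both modes of p).
x = np.linspace(-15.0, 15.0, 1501)
dx = x[1] - x[0]

# Target p: a two-component Gaussian mixture (weights and means are assumptions).
log_p = np.logaddexp(np.log(0.7) + log_normal_pdf(x, 3.0, 1.0),
                     np.log(0.3) + log_normal_pdf(x, -3.0, 1.0))
p = np.exp(log_p)

def forward_kl(mu, sigma):
    """D_KL(p || q): the log ratio is weighted by p."""
    log_q = log_normal_pdf(x, mu, sigma)
    return np.sum(p * (log_p - log_q)) * dx

def reverse_kl(mu, sigma):
    """D_KL(q || p): the log ratio is weighted by q."""
    log_q = log_normal_pdf(x, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

def grid_search(loss):
    """Return the (mu, sigma) on a coarse grid that minimizes the given loss."""
    candidates = [(loss(mu, sigma), mu, sigma)
                  for mu in np.linspace(-5.0, 5.0, 101)
                  for sigma in np.linspace(0.5, 4.0, 71)]
    _, mu, sigma = min(candidates)
    return float(mu), float(sigma)

mu_fwd, sigma_fwd = grid_search(forward_kl)
mu_rev, sigma_rev = grid_search(reverse_kl)
print(f"forward KL fit: mu={mu_fwd:.2f}, sigma={sigma_fwd:.2f}")
print(f"reverse KL fit: mu={mu_rev:.2f}, sigma={sigma_rev:.2f}")
```

With this particular p, the forward fit should land between the two modes with a wide sigma, and the reverse fit should sit on the heavier mode with a narrow sigma, which matches the behaviour described in the reply.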

  • @yuqiwang7829
    @yuqiwang7829 2 years ago +1

    Best Explanation Ever! 🙏 Thanks for this and you do save my life!!!!

  • @davidshen84
    @davidshen84 11 months ago +1

    Thank you very much for this lecture. Could you share the code/math you used to generate the "q" distribution in the toy example? I am struggling to understand how to compute q from p.

  • @mostafahamidifard6427
    @mostafahamidifard6427 4 months ago

    Compared to other videos, this one's fantastic.

  • @pvtgcn8752
    @pvtgcn8752 1 year ago +1

    Very nice and lucid way of explanation

  • @AI_ML_DL_LLM
    @AI_ML_DL_LLM 1 year ago +1

    you explain like a messiah! :) life saver

  • @sandeshbrl1
    @sandeshbrl1 1 year ago +2

    Great tutorial Kapil. I have a question regarding the approximation in the discrete case at 7:00. Does multiplying by 1/N not assume a uniform distribution? Maybe the approximation would have to be weighted by the likelihood?

    • @KapilSachdeva
      @KapilSachdeva 1 year ago +1

      That is the magic of the law of large numbers. If you draw a large number of samples from the distribution, the regular unweighted average over those samples converges to the expected value. en.wikipedia.org/wiki/Law_of_large_numbers
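
A minimal sketch of why the plain 1/N average works, assuming two Gaussians chosen purely for illustration (p = N(0, 1), q = N(1, 1.5^2)). The samples are drawn from p, so values that p considers likely simply show up more often in the average; no explicit p(x_i) weight is needed. The Monte Carlo estimate is compared with the closed-form KL between two univariate Gaussians.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Illustrative parameters (assumptions, not from the video).
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 1.5

rng = np.random.default_rng(0)
samples = rng.normal(mu_p, sigma_p, size=200_000)   # x_i drawn from p

# Unweighted average of the log ratio over samples from p.
mc_kl = float(np.mean(log_normal_pdf(samples, mu_p, sigma_p)
                      - log_normal_pdf(samples, mu_q, sigma_q)))

# Closed-form KL between two univariate Gaussians, for comparison.
exact_kl = float(np.log(sigma_q / sigma_p)
                 + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
                 - 0.5)

print(f"Monte Carlo estimate: {mc_kl:.4f}")
print(f"closed form:          {exact_kl:.4f}")
```

If the samples were instead drawn uniformly over the states, the p(x_i) weight would have to be put back in; the 1/N form relies on sampling from p itself.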

  • @lewisclifton1892
    @lewisclifton1892 5 months ago

    Very intuitive explanation. Thank you.

  • @durgeshmishra4005
    @durgeshmishra4005 1 year ago +1

    Simple, short and precise.

  • @nikhilpriyatam
    @nikhilpriyatam 1 year ago +1

    Thanks a lot Kapil for your fantastic tutorial series. Keep up the good work :)

  • @tanmaybhise
    @tanmaybhise 1 year ago +1

    Excellent Explanation! Thanks

  • @mysteriousXsecret
    @mysteriousXsecret 1 year ago

    5:21 - what does the subscript p mean in the expectation?

    • @KapilSachdeva
      @KapilSachdeva 1 year ago

      It means that the probability distribution used to compute the expectation is "p". Watch the tutorial on Importance Sampling [ua-cam.com/video/ivBtpzHcvpg/v-deo.html] ... at the start of it I describe it in more detail.
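
Written out for the discrete case, using the p_theta / q_phi notation that appears elsewhere in this thread, the subscript simply names the distribution that supplies the weights. The generic function f is a placeholder introduced here for illustration.

```latex
\mathbb{E}_{p_\theta}\!\left[f(X)\right] = \sum_{i} p_\theta(x_i)\, f(x_i)

D_{\mathrm{KL}}(p_\theta \,\|\, q_\phi)
  = \mathbb{E}_{p_\theta}\!\left[\log \frac{p_\theta(X)}{q_\phi(X)}\right]
  = \sum_{i} p_\theta(x_i) \log \frac{p_\theta(x_i)}{q_\phi(x_i)}

D_{\mathrm{KL}}(q_\phi \,\|\, p_\theta)
  = \mathbb{E}_{q_\phi}\!\left[\log \frac{q_\phi(X)}{p_\theta(X)}\right]
  = \sum_{i} q_\phi(x_i) \log \frac{q_\phi(x_i)}{p_\theta(x_i)}
```

Changing the subscript from p to q therefore changes both the weights and, for KL, which distribution sits in the numerator.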

  • @Lyr00
    @Lyr00 2 years ago +1

    Great video. The only thing I'm having trouble understanding is the plot you showed at the end, at 10:00. I understood that I have 2 distributions p and q, and I calculate D_KL(p || q), which results in a number, so why did you show 2 curves for the forward and reverse KL? Where are those coming from?

    • @KapilSachdeva
      @KapilSachdeva 2 years ago +1

      I can see why it can be confusing; I should have done a better job of giving proper context for the plots/curves. Thanks for asking for the clarification.
      I take it that you have understood that KL divergence will give you a number that quantifies the difference between two probability distributions p and q.
      The next question that should occur to you is - "What is the application of obtaining this difference (i.e. KL divergence)?"
      One application is when you try to approximate a complex distribution with a simpler distribution. Let's say the complex distribution is denoted by `p` and it has 3 parameters (mu, sigma and gamma) and you want to approximate it using a simpler distribution denoted by `q` which has 2 parameters (mu and sigma).
      Now when you are creating an approximate (function) you could rely on an optimization procedure (gradient descent etc). But then an optimization procedure will require a cost (loss/objective) function. A cost function tells how far you are from the ground truth. This is where the KL divergence will be used as the cost function when probability distributions are involved.
      Finally, coming to your question about the additional curves: these are the resulting approximate distributions I get when I use forward and reverse KL as the cost (loss/objective) functions. Both of them are normal distributions (mu and sigma), but since the loss functions used are different (forward vs reverse) they are different.
      Someone else also asked a similar question some time back; I do not know how to provide a direct link to the reply to that comment. See the question by Arash Poursadrollah on this video in the comment section and my reply.
      Hope this helps!

    • @beelogger5901
      @beelogger5901 2 years ago +1

      @@KapilSachdeva Excellent answer! It took me a while to figure it out.

  • @Joqu1nn
    @Joqu1nn 2 years ago +1

    Thank you for the beautifully illustrated explanation!

  • @dipendrayadav6068
    @dipendrayadav6068 2 years ago +1

    Thanks Kapil for making this video. It was super helpful!

  • @jaso403
    @jaso403 1 year ago

    9:00 - here it seems like q is the resulting approximation but up until then i thought q was a distribution being compared to p? so for example, p is a red line, shouldn't q be shown as a line as well and then the resulting forward KL and reverse KL divergences be overlaid?

    • @KapilSachdeva
      @KapilSachdeva 1 year ago +2

      KL divergence at its core is giving you a way to compare two distributions.
      The question you should ask is why one should be interested in doing a comparison in the first place.
      In many machine learning / statistics tasks we want to "approximate" a complex distribution (say p) by a simpler distribution (say q). Here we talk only about parametric distributions, i.e. p and q are represented using functions with parameters, e.g. a Normal distribution with mu and sigma as parameters.
      The exercise (or rather the utility of KL divergence) I showed in the tutorial was about finding the parameters of q (i.e. mu and sigma). In order to find them I used optimization with KL divergence as the loss function. Like in any optimization problem, we start with random values for mu and sigma and then adjust them in such a way that the loss function decreases.
      I tried both types of KL divergence as loss functions and showed how the results (i.e. mu and sigma) obtained are different from each other.
      I did not do a good job in the tutorial of clearly explaining when and how you really end up using KL divergence, and hence it is confusing for you.
      Hope this note makes sense.
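
The "start with random values and adjust them so the loss decreases" loop from this reply can be sketched with plain gradient descent. Everything specific here is an assumption rather than the tutorial's actual setup: the bimodal target p, the grid used to approximate the integral, the learning rate, and the use of finite-difference gradients in place of automatic differentiation. Only forward KL is shown; substituting a reverse-KL loss would follow the same loop.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-15.0, 15.0, 1201)   # integration grid (assumed range)
dx = x[1] - x[0]

# Assumed bimodal target p: 0.7 * N(3, 1) + 0.3 * N(-3, 1).
log_p = np.logaddexp(np.log(0.7) + log_normal_pdf(x, 3.0, 1.0),
                     np.log(0.3) + log_normal_pdf(x, -3.0, 1.0))
p = np.exp(log_p)

def forward_kl(params):
    """D_KL(p || q) with q = N(mu, sigma^2); params = [mu, log(sigma)]."""
    mu, log_sigma = params
    log_q = log_normal_pdf(x, mu, np.exp(log_sigma))
    return np.sum(p * (log_p - log_q)) * dx

def numerical_grad(loss, params, eps=1e-5):
    """Central finite-difference gradient of loss at params."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (loss(params + step) - loss(params - step)) / (2.0 * eps)
    return grad

# Random starting point for mu and sigma (sigma is optimized on a log scale
# so that it stays positive).
rng = np.random.default_rng(0)
params = np.array([rng.uniform(-2.0, 2.0), np.log(rng.uniform(1.0, 3.0))])
initial_loss = forward_kl(params)

learning_rate = 0.02
for _ in range(5000):
    params = params - learning_rate * numerical_grad(forward_kl, params)

mu_fit, sigma_fit = float(params[0]), float(np.exp(params[1]))
final_loss = forward_kl(params)
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"fitted q: mu={mu_fit:.3f}, sigma={sigma_fit:.3f}")
```

For forward KL with a Gaussian q, the minimizer matches the mean and variance of p, so with this target the loop should settle near mu = 1.2 and sigma = sqrt(8.56), roughly 2.93.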

    • @jaso403
      @jaso403 1 year ago

      @@KapilSachdeva ahhhh i see now! makes sense! thanks so much for your time!! :)

  • @prateekcaire4193
    @prateekcaire4193 1 year ago +1

    Wonderful explanation 😀 Thanks!

  • @shankar2chari
    @shankar2chari 3 years ago +1

    4:22 - Not sure whether it is so straightforward... going from a discrete to a continuous RV by replacing the sum with an integral. I think Claude Shannon also made the same mistake.

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      Thanks Gowri Shankar for pointing this out. Indeed it is not as simple as I made it sound however in terms of the "final" structure of the expectation "formula" the change from discrete to continuous is reflected as the change from summation to integration. Would be great if you can share insights that you have to highlight the differences in the interpretation (if any).

    • @shankar2chari
      @shankar2chari 3 years ago

      @@KapilSachdeva I have no insight as of now. I am planning to read these two references, but I had a Claude Shannon moment at 4:22, hence I posted my comment.
      www.tsc.uc3m.es/~fernando/bare_conf3.pdf
      web.stanford.edu/class/stats311/Lectures/lec-02.pdf
      Thanks for the reply, you made my day. Thanks for your services in this platform, your content is cool.

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      🙏 Thanks Gowri Shankar for these links and the kind words. I gave them a cursory read and they are very well written papers; I have added them to my library to re-read. I still have to do a proper study of measure theory, as a lot of the explanations in the continuous realm are based on it. Last year I wrote this article on entropy that you may find interesting, as I do go over some of the topics mentioned in the lecture notes. Here is the link - towardsdatascience.com/but-what-is-entropy-ae9b2e7c2137?sk=23e69749005be756a1d19b6e1c3531f6 - maybe it is of some assistance to you.
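
One standard way to see the subtlety raised in this thread is to discretize a density into bins of width Δ and take the limit. The symbols Δ, X_Δ, P_Δ and Q_Δ are introduced here for illustration and do not come from the video. Entropy picks up a term that diverges as the bins shrink, whereas in the KL divergence the bin width cancels inside the log ratio, so the sum does pass over to the integral.

```latex
% Discretize densities p and q into bins of width \Delta:
%   P_\Delta(i) \approx p(x_i)\,\Delta, \qquad Q_\Delta(i) \approx q(x_i)\,\Delta

H(X_\Delta) = -\sum_i p(x_i)\Delta \,\log\!\big(p(x_i)\Delta\big)
            \;\approx\; -\int p(x)\log p(x)\,dx \;-\; \log\Delta
            \;\xrightarrow[\Delta \to 0]{}\; \infty

D_{\mathrm{KL}}(P_\Delta \,\|\, Q_\Delta)
  = \sum_i p(x_i)\Delta \,\log\frac{p(x_i)\Delta}{q(x_i)\Delta}
  \;\xrightarrow[\Delta \to 0]{}\; \int p(x)\,\log\frac{p(x)}{q(x)}\,dx
```

So the concern is well founded for entropy itself (differential entropy is not the limit of discrete entropy), while for KL divergence the replacement of the sum by an integral is better behaved.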

  • @desmondteo855
    @desmondteo855 8 months ago

    Amazing. Thanks for posting.

  • @zrkzheng7604
    @zrkzheng7604 2 years ago +1

    great video that gives an intuitive understanding 👍

  • @abhinavkulkarni1174
    @abhinavkulkarni1174 2 years ago +1

    Great explanation!

  • @AlstonMisquitta
    @AlstonMisquitta 5 months ago

    That's a great explanation. Thanks a lot!

  • @carbonrun2416
    @carbonrun2416 2 years ago +1

    Really fantastic 🙏🙏

  • @mysteriousXsecret
    @mysteriousXsecret 1 year ago

    Min 6:00 - why is that integral over R considered "hard to compute"?

    • @KapilSachdeva
      @KapilSachdeva 1 year ago

      Integrals in general are computationally heavy, but more importantly, start thinking in terms of multivariate distributions, i.e. when you have a vector (high-dimensional data). In simple words, in a high-dimensional space you have not a single integral but multiple integrals, which would make the computations intractable. Needless to say, real-world problems are high-dimensional.

    • @mysteriousXsecret
      @mysteriousXsecret 1 year ago

      @@KapilSachdeva Why are "integrals in general computationally heavy"? I thought we used the law of large numbers to approximate that integral since it "goes from -infty to +infty" and a calculator cannot handle continuous sums due to its discrete nature

    • @KapilSachdeva
      @KapilSachdeva 1 year ago

      Ah, it seems I misunderstood your original question. The law of large numbers is used to avoid the integrals. I first provided the general formulation of the expectation (continuous) and then suggested that in practice one would use the law of large numbers to work around the integral problem.

    • @mysteriousXsecret
      @mysteriousXsecret 1 year ago

      @@KapilSachdeva I understood that you used that law for approximating that hard integral. The question is: why is that integral hard? Sorry for my doubts!

    • @KapilSachdeva
      @KapilSachdeva 1 year ago

      I think I kind of answered it by suggesting that you consider high-dimensional spaces. But here it is in slightly different words -
      Most of the time the integrals that we encounter in statistics do not have "closed" form solutions. So we must approximate them using "numerical" methods. The computational complexity of these numerical methods (such as quad) increases as the number of integrals increases (think double, triple integrals etc). Therefore we try to find a simpler approximation to avoid them. One such simpler approximation (that, in the limit, leads to a similar result) is Monte Carlo integration.
      See if en.wikipedia.org/wiki/Numerical_integration gives you more insight.
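
A rough sketch of the scaling argument in this reply. The numbers are illustrative assumptions: 100 grid points per axis for a quadrature-style rule, and a 10-dimensional standard normal whose expected squared norm is known to equal the dimension, so the Monte Carlo answer can be checked.

```python
import numpy as np

# Cost of a tensor-product grid: points_per_axis ** dimension evaluations.
points_per_axis = 100
grid_cost = {d: points_per_axis ** d for d in (1, 2, 3, 6, 10)}
for d, cost in grid_cost.items():
    print(f"d={d:2d}: {cost:.1e} function evaluations on a grid")

# Monte Carlo uses a fixed number of samples regardless of dimension.
rng = np.random.default_rng(0)
dim = 10
num_samples = 100_000
samples = rng.standard_normal((num_samples, dim))

# E[||x||^2] under a d-dimensional standard normal equals d.
mc_estimate = float(np.mean(np.sum(samples ** 2, axis=1)))
print(f"Monte Carlo estimate with {num_samples} samples: {mc_estimate:.3f} (exact: {dim})")
```

The grid needs 10^20 evaluations in ten dimensions, while the sample average gets close to the exact value with 10^5 draws. The Monte Carlo error shrinks like one over the square root of the sample count, independent of the dimension, although the constant in front can still grow with it.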

  • @djs749
    @djs749 3 years ago +1

    Excellent video. I have subscribed since it was worth it. But one very kind and gentle request. I understand how much time and effort one needs to put in, and how much pain one needs to undertake, to come up with videos of this nature, but I would still like to request a couple of things in the future if feasible:
    1. A detailed set of videos on intermediate and advanced statistics.
    2. A strongly mathematical set of videos on ML and Deep Learning. (Most resources are a bit too verbose or completely skip many vital points.)
    I repeat, asking for anything is always easy, and I was also feeling a bit hesitant to put in these requests. But since I loved your explanation and the clarity I got in quick time, I couldn't hold back. Please don't mind!
    Regards

    • @KapilSachdeva
      @KapilSachdeva 3 years ago +1

      Thanks for your comment and kind words. I will try to make more tutorials on these kinds of topics. Please do not hesitate to suggest any specific topics u have in mind. If I can, I will try to explain them.

    • @djs749
      @djs749 3 years ago

      @@KapilSachdeva Sure I shall post. Thank you so much for your support!

  • @ksenapati
    @ksenapati 1 year ago +1

    Nicely explained. Thank you. Which software did you use to prepare your slides for the presentation?

  • @FlavioBarrosProfessor
    @FlavioBarrosProfessor 1 year ago +1

    Fantastic, amazing! Thank you!

  • @xTheSkace
    @xTheSkace 2 years ago +1

    Thank you for the video! Can I ask: the one thing I don't understand is whether x_i is a possible state that X can take, or a sample. In the beginning you say x_1, etc. are possible states, but when we get the expected value of the log likelihood ratio, is that not over samples in a dataset instead?

    • @KapilSachdeva
      @KapilSachdeva 2 years ago +2

      "sample" and "state" are the same thing here.
      e.g. A Die can be in one of the six possible states {1,2,3,4,5,6}
      When you roll the die you will get either 1 or 2 or 3 or 4 or 5 or 6 ... you call them sample.
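
The die example can be made concrete with a short simulation, assuming a fair die. The set of states is fixed at six values, each roll is a sample that takes one of those values, and over many rolls the relative frequency of each state approaches its probability.

```python
import numpy as np

states = np.arange(1, 7)          # the six possible states of a die
rng = np.random.default_rng(0)

# Each roll is a sample; its value is always one of the states.
samples = rng.choice(states, size=60_000)

# Relative frequency of each state among the samples.
frequencies = np.array([np.mean(samples == s) for s in states])

print("first ten samples:", samples[:10])
for s, f in zip(states, frequencies):
    print(f"state {s}: frequency {f:.4f} (probability {1 / 6:.4f})")
```

An average over samples therefore visits each state in proportion to its probability, which is what lets a sum over a dataset stand in for a probability-weighted sum over states.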

  • @ChetanAnnam
    @ChetanAnnam 6 months ago

    Simply beautiful ❤️

  • @DevelopersHutt
    @DevelopersHutt 3 years ago +3

    Well explained.👍

  • @kiit8337
    @kiit8337 1 month ago

    Professor, can u suggest any good tutorials or books that start from estimation theory and go on to slightly more advanced stats? Also, can u tell us how much statistics we need in ML fields?

  • @martinschulze5399
    @martinschulze5399 2 years ago +1

    Great work!

  • @PrinceYadav-xz2mb
    @PrinceYadav-xz2mb 1 year ago +1

    Amazing video, is there any link for the slides?

  • @Vivekagrawal5800
    @Vivekagrawal5800 2 years ago +1

    epic explanation!

  • @AdityaKumar-sm6hk
    @AdityaKumar-sm6hk 3 years ago +1

    Really amazing explanation Kapil🔥!! Your 'its the recap' sidebars and accurate vocabulary are especially helpful. Would it be a good idea to also draw a parallel with implementation? For example, how the equation is executed when an ML model is being trained, and how the summation over N, which is assumed to be large, holds up under mini-batch training?

    • @KapilSachdeva
      @KapilSachdeva 3 years ago +1

      Thanks Aditya. You have raised a very good point - “How does it relate to mini batch training?”.
      As you know, during mini batch training we in general have “noisy gradients” (per step), and this has implications in this context as well, i.e. when KL divergence (or rather ELBO) is used as a loss function. There are remedies such as the re-parameterization trick, score function, importance sampling etc. to reduce the gradient variance. No silver bullet, but some workarounds.
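
A small illustration of the noise being discussed, under simplifying assumptions: it looks at the sample-average KL estimate itself rather than its gradient, uses two fixed Gaussians (p = N(0, 1), q = N(1, 1.5^2)) chosen for illustration, and compares how much the estimate fluctuates across repeated batches of two sizes. None of the variance-reduction techniques named in the reply are implemented here.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)

def kl_estimate(batch_size):
    """Sample-average estimate of D_KL(p || q) from one batch drawn from p."""
    batch = rng.normal(0.0, 1.0, size=batch_size)
    return np.mean(log_normal_pdf(batch, 0.0, 1.0) - log_normal_pdf(batch, 1.0, 1.5))

# Standard deviation of the estimate across 500 independent batches.
spread = {size: float(np.std([kl_estimate(size) for _ in range(500)]))
          for size in (32, 2048)}

for size, value in spread.items():
    print(f"batch size {size:5d}: std of KL estimate = {value:.4f}")
```

The small batches give a visibly noisier estimate; the spread shrinks roughly with the square root of the batch size, which is why the large-N argument only holds approximately for each mini-batch step.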

  • @aleksandermolak3518
    @aleksandermolak3518 3 years ago +1

    Great video, thank you!

  • @cse-a-049mrinmoymondal9
    @cse-a-049mrinmoymondal9 3 years ago +1

    This was very helpful

  • @lupingxiang1208
    @lupingxiang1208 3 years ago

    Hello, at 7:05, the final equation is missing p_{\theta}(x_i)

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      Hi luping, it is not missing; we are approximating the expected value using the law of large numbers. Note that it is now a simple average.

    • @lupingxiang1208
      @lupingxiang1208 3 years ago +1

      @@KapilSachdeva hello Kapil, yes, you are right, thank you for confirming.
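
For the discrete case, the step where p_theta(x_i) seems to vanish can be written out by grouping the samples by state. The symbols x^{(n)} (the n-th sample drawn from p_theta) and N_i (how many of the N samples equal state x_i) are notation introduced here for illustration, with f(x) = log(p_theta(x) / q_phi(x)).

```latex
\frac{1}{N}\sum_{n=1}^{N} f\!\left(x^{(n)}\right)
  = \sum_{i} \frac{N_i}{N}\, f(x_i)
  \;\xrightarrow[N \to \infty]{}\;
  \sum_{i} p_\theta(x_i)\, f(x_i)
  = \mathbb{E}_{p_\theta}\!\left[f(X)\right],
\qquad x^{(n)} \sim p_\theta
```

The weight has not been dropped: it reappears as the relative frequency N_i / N, which converges to p_theta(x_i) by the law of large numbers. Multiplying each term by p_theta(x_i) again inside the sample average would count the weighting twice.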

  • @chyldstudios
    @chyldstudios 2 years ago +1

    Well done!

  • @DrAhdol
    @DrAhdol 1 year ago

    Just to confirm: when using the law of large numbers, the approximation equation should be the sum of p(x)*log(p(x)/q(x)) averaged over N, correct?

  • @micahdelaurentis6551
    @micahdelaurentis6551 2 years ago +1

    great job

  • @zwitter689
    @zwitter689 1 year ago +1

    Very good - thank you

  • @ArashSadr
    @ArashSadr 2 years ago +1

    Although I won't use this in my life, you made it sound so easy :)) But there is one thing I didn't understand. The outcome of this KL divergence is supposed to be a number that shows the divergence of an approximate distribution from the reference. How do you get a whole distribution at the end of the video?

    • @KapilSachdeva
      @KapilSachdeva 2 years ago +1

      From now onwards u will see KL divergence in many areas - machine learning, statistics and prob theory :)
      KL divergence will give you a number that shows how two distributions differ from each other.
      But the important thing to understand is what you would do with this number. Why do u need it in the first place?
      Let’s answer that and then you will get the answer.
      A parametric prob distribution is summarized using parameters. E.g. for a normal distribution u need only the mean and variance. Once u have them (mean and variance) u can generate samples from it and then plot them.
      Now let’s say for your “approx distribution” you select a normal distribution but u do not know its mean and variance.
      To get them you would use an optimization procedure, which requires you to have a number that tells how different u are from the reference. This is where u will use the KL divergence.
      Once you have the optimization procedure done (i.e. minimization using KL divergence as the loss function) u will have a good value for the mean and variance of your approx distribution.
      You would then use this mean and variance to generate the samples and plot them … this is how u get the “whole distribution” … but the right way of saying it is that u now have the parameters of your distribution function, and this enables you to create the plot, compute the likelihood etc.

    • @ArashSadr
      @ArashSadr 2 years ago +1

      @@KapilSachdeva even without video, u can explain clearly the same! Thanks. Also, as a suggestion, I would appreciate it if you put subset simulation in your list of to-do videos :)

    • @KapilSachdeva
      @KapilSachdeva 2 years ago

      @@ArashSadr will try!

  • @djs749
    @djs749 3 years ago +1

    Dear Kapil ji,
    If time permits can you kindly upload some videos on Variational Autoencoders and GAN ?
    Warm Regards
    Dj

    • @KapilSachdeva
      @KapilSachdeva 3 years ago +1

      VAEs are one of my favorite neural architectures. I have been thinking about explaining various variants of VAEs for dynamical data (time series). I also have a tutorial on ELBO, which is used as a loss function in VAEs. Check that out. But yes, suggestion accepted :) I will do a set of tutorials on various variants of VAEs. Though I know how GANs work, I have never used them in an industrial setting and I prefer to explain concepts that I have applied to a real-world problem.

    • @djs749
      @djs749 3 years ago +1

      @@KapilSachdeva Thank you so much again. Your videos are really impactful.

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      🙏

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      Hello dj S, first, apologies for such a delay in making the tutorial on VAE that I had promised earlier.
      I published a tutorial on it a few minutes ago - ua-cam.com/video/h9kWaQQloPk/v-deo.html
      Most likely by now you already have a good understanding of what a VAE is, but I have tried to explain it in a different manner than is usually done. While my explanation is different, it highlights the primary motivations the VAE's inventors had. So check it out if you are still interested.

  • @AbhishekSinghSambyal
    @AbhishekSinghSambyal 2 years ago

    Is there any book you would recommend for this?
    Thanks.

  • @BigTiredDog
    @BigTiredDog 11 months ago

    I hope you had a great day today :)

  • @xTheSkace
    @xTheSkace 2 years ago

    Sorry, another question, Is it correct to say that p_theta(x_1) is the same as P(x_1|theta)?

    • @KapilSachdeva
      @KapilSachdeva 2 years ago +1

      Not really; unfortunately it is the notation and nomenclature that make the subject difficult to understand.
      Provided that when you write x_1 you are referring to one sample/state of the Random Variable X (note - capital letter),
      you should read p_theta(x_1) as follows:
      You want to calculate the probability of x_1 (which is a sample/state of Random Variable X), where the distribution function of X is represented using the symbol p and "theta" (I cannot write latex in comments!) denotes the parameters of the distribution function.
      For P(x_1|theta), I am not sure if it is valid notation. A notation could be P(X|theta), which you should read as the conditional distribution of X given theta. In this case both X and theta could be random variables. When theta is not a random variable, it is an abuse of the notation.
      This is a bit sensitive, and it is possible that in some books/papers you may find the notation (the second one) you have provided. The meaning will depend on the context.
      Hope this helps.

    • @xTheSkace
      @xTheSkace 2 years ago +1

      @@KapilSachdeva I see what you mean, thank you for taking time to answer both of my questions! You helped me a lot, and your videos are great! I think my problem came from getting confused between frequentist and bayesian notation for likelihood, and also notation between unsupervised and supervised likelihood. Your videos and answers put me on the right path though, it's all clear to me now! Thank you!

    • @KapilSachdeva
      @KapilSachdeva 2 years ago

      🙏

  • @GregThatcher
    @GregThatcher 3 years ago +12

    Great video. Between your video and ua-cam.com/video/ErfnhcEV1O8/v-deo.html I now have several interesting ways to understand this KL Divergence stuff. Your descriptions of why we use logs, and why we use the ratio inside the log were quite brilliant. Also, thanks for the tip on creating bi-modal distributions using two gaussians. Keep up the good work.

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      🙏 Thank you. Happy that it was useful!

  • @yongen5398
    @yongen5398 3 years ago

    Thanks! I wish the video had time codes.

    • @KapilSachdeva
      @KapilSachdeva 3 years ago +1

      Hello Yong, what is time code? … as in UA-cam chapters timestamp?

    • @yongen5398
      @yongen5398 3 years ago +1

      @@KapilSachdeva oops, yep, timestamps; I used the wrong word here. I have been doing too many things with OpenCV.

    • @KapilSachdeva
      @KapilSachdeva 3 years ago

      Thanks for the feedback. I have not been good at this aspect; I will try to create the sections. For smaller videos (e.g. this one is only 11 minutes) it is also a bit difficult. 🙏

  • @sumaiyaafridi
    @sumaiyaafridi 2 years ago +1

    Best