Intuitively Understanding the KL Divergence

  • Published 22 Aug 2024
  • This video discusses the Kullback Leibler divergence and explains how it's a natural measure of distance between distributions. The video goes through a simple proof, which shows how with some basic maths, we can get under the KL divergence and intuitively understand what it's about.

COMMENTS • 107

  • @Vroomerify
    @Vroomerify 2 years ago +91

    I just want to say. This is--by far--the best explanation of KL divergence I've found on the internet. Thanks so much!

  • @niofer7247
    @niofer7247 2 years ago +23

    This was actually one of the most helpful videos. Thank you

  • @liliz1902
    @liliz1902 1 year ago +2

    KL divergence confused me for so long, and I understood it just by watching your video for one time, thank you very much!

  • @AashraiRavooru
    @AashraiRavooru 1 year ago +5

    A question here why will the number of heads and number of tails be the same for both the distributions at 3:04. If the probabilities for both the coins are different then the number of occurrences of heads and tails can also be different

  • @haresage6110
    @haresage6110 1 year ago +12

    Great explanation! One technical remark I have is that (from my understanding) KL divergence is not technically a measure of distance, since it's not symmetric ( D_KL(P||Q) != D_KL(Q||P) ).

    • @charchitsharma8902
      @charchitsharma8902 6 months ago +2

      Yes, that's why it's called divergence instead of distance.
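
A minimal sketch of the asymmetry discussed in this thread, assuming two hypothetical coins (the probabilities 0.5 and 0.9 are illustrative and not taken from the video). `kl_divergence` is a helper name introduced here:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical coins, each written as [P(heads), P(tails)].
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # coin P is the "true" coin
kl_qp = kl_divergence(q, p)  # coin Q is the "true" coin

# The two directions give different values, so KL is not a metric.
print(kl_pq, kl_qp)
```

Swapping the arguments changes which coin generates the observations, which is why the two numbers need not agree.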

  • @nericarcasci9919
    @nericarcasci9919 1 year ago +4

    You are unbelievably good at teaching man. You explained it better than they did in my course.

  • @jimmygan801
    @jimmygan801 2 years ago +1

    holy smoke, you are legit GOAT. so concise yet clear and intuitive explanation.

  • @baskaisimkalmamisti
    @baskaisimkalmamisti 2 years ago +2

    I didn't expect such a good explanation from a randomly suggested YouTube video

  • @karstenhannes9628
    @karstenhannes9628 1 year ago

    This type of explanation is perfect! First boiling the problem down to the most intuitive understanding and from there deduce the general formula. Thanks so much!

  • @matakos22
    @matakos22 2 years ago +15

    Thanks so much for this, needed to understand what KL Divergence is for a paper I'm reading and you just saved me so much time!

  • @wynandwinterbach455
    @wynandwinterbach455 7 months ago

    I'm just rewatching this video to freshen up my deep learning fundamentals. Super clear video, thank you so much!

  • @marcegger7411
    @marcegger7411 2 years ago +5

    Great video! Loved the intuition behind the KL divergence. For anyone thinking about applications, this is used in the loss function of variational autoencoders, a class of deep networks, where the encoder is used to find low-dimensional features of high-dimensional input data (e.g. use this to deconstruct images into "features").
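
As a rough illustration of the application mentioned above: the comment does not say which form of the KL term is meant, so this sketch assumes the common setup of a diagonal Gaussian encoder and a standard normal prior, for which the KL divergence has a closed form. The function name and the example values are hypothetical:

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumes a diagonal Gaussian parameterized by its mean and log-variance.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# The penalty is zero when the encoder output already matches the prior...
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# ...and grows as the mean moves away from zero.
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```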

  • @ian-haggerty
    @ian-haggerty 4 months ago +1

    Best explanation on the interwebs!

  • @sharingpurpose237
    @sharingpurpose237 1 year ago

    Bro, this intuition was not normal, u r just genius!!

  • @adityakulkarni5577
    @adityakulkarni5577 4 months ago +1

    Perfectly explained in 5 minutes. Wow.

  • @balasubramanyamevani7752
    @balasubramanyamevani7752 2 years ago +5

    @3:26 I don't understand how are we normalizing by raising it to the power of 1/N. Could you please explain that?

    • @Chris-zg1me
      @Chris-zg1me 2 years ago +1

      Same question here. This is a fantastic explanation but it defeats me when you mention “we normalize by raising to power of 1/N”. Why do we do this? What does that do or mean to the data? Thanks for making this video! Awesome!

    • @vyasraina3930
      @vyasraina3930 1 year ago +3

      I think the 1/N gives us the 'average' probability of a single toss; e.g. if we had a fair coin and had 3 tosses, the probability of our sequence would be 1/2 * 1/2 * 1/2 = 1/8. If we had ten tosses, the probability of the sequence would be 1/(2^10). These numbers are currently incomparable. If we now look at the probability of the sequence to the power of 1/N, where N is the number of tosses, then suddenly they are the same ... which is what we would want .... it basically normalizes the probability sequence!

    • @aniruddhajoshi7496
      @aniruddhajoshi7496 9 months ago

      @@vyasraina3930 thanks for the explanation! In general, why is the power 1/N more important than, let's say, multiplying by 1/N?

    • @franklyvulgar1
      @franklyvulgar1 3 months ago

      @@vyasraina3930 so basically the 1/N gets rid of the number of tosses/sample size and in your case of a fair coin makes it so the probability would be 1/2 regardless of N by getting rid of the N (exponent in your probability sequence)
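
One way to see the 1/N step discussed in this thread, as a rough sketch: the probabilities of independent tosses multiply, so the per-toss average is a geometric mean, and it is the N-th root (rather than division by N) that removes the dependence on sequence length. In log space the same step becomes an ordinary division by N. The helper name is hypothetical:

```python
import math

def per_toss_probability(sequence_probability, n_tosses):
    """Geometric mean of the per-toss probabilities: P(sequence) ** (1 / N)."""
    return sequence_probability ** (1.0 / n_tosses)

# Fair coin: any specific sequence of N tosses has probability 0.5 ** N.
three_tosses = 0.5 ** 3    # 1/8
ten_tosses = 0.5 ** 10     # 1/1024

# Both sequences normalize to (approximately) the same per-toss value.
print(per_toss_probability(three_tosses, 3))
print(per_toss_probability(ten_tosses, 10))

# The same normalization in log space is a plain division by N.
print(math.log(ten_tosses) / 10, math.log(0.5))
```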

  • @germangarcia5599
    @germangarcia5599 1 year ago

    One of the most useful explanations ever. Thanks!!

  • @alkanair7325
    @alkanair7325 1 year ago

    Thank you so much for this content. By far the best explanation of KL Divergence I've seen so far.

  • @drdca8263
    @drdca8263 2 years ago +1

    Thanks, that made the idea make a lot more sense to me. Showing how it arises so nicely from a large sample size, made it feel much more natural.

  • @drondasgupta9378
    @drondasgupta9378 1 year ago

    Thanks for the brilliant, intuitive and crystal-clear explanation!

  • @moopoo123
    @moopoo123 2 years ago

    Thanks Adian! The connection back to cross entropy loss is cool. Slowly coming together for me.

  • @kukuster
    @kukuster 9 months ago +1

    Thanks for the explanation!! One thing is, formulas were confusing with how you denoted *q1* & *q2* for probabilities for coin 2, instead of *p2* & *q2=1-p2*

  • @farshadsaberi2740
    @farshadsaberi2740 2 years ago +1

    Thanks for the simple, yet helpful, explanation!

  • @shahriarrahman8425
    @shahriarrahman8425 4 months ago

    Great explanation. Thank you so much!

  • @zukofire6424
    @zukofire6424 1 year ago

    this was great and super useful in my internship (which really just started), Thanks! :)

  • @alecpanayotov
    @alecpanayotov 2 years ago

    This is awesome, thanks for breaking it down Adian

  • @karthikeyans3
    @karthikeyans3 9 months ago

    Great video. Thanks for sharing. Really intuitive.

  • @reformed8246
    @reformed8246 1 year ago

    Thanks a lot! 5 minutes to explain what I couldn't understand in hours

  • @brianlee4966
    @brianlee4966 6 months ago

    Thank you so much for this video and clear explanation!

  • @adamtaylor2142
    @adamtaylor2142 10 months ago

    Great content! Thank you.

  • @SunilKumarSamji
    @SunilKumarSamji 5 months ago +1

    Excellent video. Can someone help me understand why is it called Divergence in the first place? Why are we taking 1/N power to normalise it to sample space, I did not understand the logic behind this.

  • @akidnag
    @akidnag 9 months ago

    It's just that it is not a distance (because it is not symmetric), but a pseudo-distance. Great video!

  • @ferkstkojtt
    @ferkstkojtt 2 years ago

    Dude just plops in some God-tier eye openers in the credits and leaves. Never realized this relationship between KL and cross-entropy loss.

  • @yashrathi6862
    @yashrathi6862 2 years ago +2

    Hi, I don't get why you assume that the nH and nT for the coin two would be the same as the coin 1?

    • @Marcus-ok2jy
      @Marcus-ok2jy 2 years ago

      Yeah, I don't get it either. Any explanations, anyone?

    • @Drewbie_T
      @Drewbie_T 2 years ago +1

      @@Marcus-ok2jy nH and nT are just the numbers of heads and tails generated in the sequence by the 'true coin', not by coin 2. For example, if I have a fair true coin and flip it a few times, I may get H,H,T,H (nH=3, nT=1), and you will notice that nH/N = 0.75 and nT/N = 0.25, which are not equal to p1 and p2 respectively. However, if we were to flip the coin many more times, the proportion of heads would approach the proportion of tails. Thus, he is saying that in the limit of many coin flips, nH/N approaches p1 and nT/N approaches p2 (0.5 each for a fair coin).

    • @Marcus-ok2jy
      @Marcus-ok2jy 2 years ago

      @@Drewbie_T Hi Andrew, But in 3:21 , the formula P(observations|coin 2) looks at the nH and nT of Coin 2 does it not? This is so that the KL divergence could take into the account the disparity in probability distribution between the 2 coins.

    • @Drewbie_T
      @Drewbie_T 2 years ago +1

      @@Marcus-ok2jy No it does not, it is only looking at nH and nT of the true coin. Coin 2 is not being flipped at all. The only part where coin 2 comes in is after flipping the true coin (which has probability p1 heads and p2 tails), we obtain some chain of outcomes (i.e., H,H,T,H,T,T). Now that we have flipped the true coin and obtained an outcome, we look at the coin 2 probabilities and say, how likely is it that this sequence (H,H,T,H,T,T) could have come from coin 2? If coin 2 has .95 probability of landing on heads every time, it is unlikely that we would see an equal number of heads and tails in the distribution.

    • @adytya
      @adytya 1 year ago

      It's because we first flip a coin N times and record the number of heads (nH) and the number of tails (nT). It is assumed here that the coin used represents the real coin (which has probability p1 for heads and probability p2 for tails). We are now interested in finding how closely coin 2 can mimic the real coin's flips. And since the real coin produced nH heads and nT tails during our experiment, we use the same values.
      Hope this helped.
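
The procedure described in this thread can be sketched as a small simulation: flip only the true coin, then score the same observed counts under both coins. The probabilities (a fair true coin and a coin 2 with 0.95 heads, as in the example above), the seed, the number of tosses, and the function names are all assumptions made for illustration:

```python
import math
import random

def normalized_log_ratio(p_heads, q_heads, n_tosses, seed=0):
    """Flip the true coin n_tosses times, then compare both coins on that data."""
    rng = random.Random(seed)
    # Only the true coin (heads probability p_heads) is ever flipped.
    n_heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    n_tails = n_tosses - n_heads
    # Log-probability of the observed counts under each coin.
    log_p = n_heads * math.log(p_heads) + n_tails * math.log(1 - p_heads)
    log_q = n_heads * math.log(q_heads) + n_tails * math.log(1 - q_heads)
    # Dividing the log ratio by N is the log-space version of the 1/N power.
    return (log_p - log_q) / n_tosses

def kl_coin(p_heads, q_heads):
    """Closed-form KL divergence between two coins, in nats."""
    return (p_heads * math.log(p_heads / q_heads)
            + (1 - p_heads) * math.log((1 - p_heads) / (1 - q_heads)))

estimate = normalized_log_ratio(0.5, 0.95, 200_000)
exact = kl_coin(0.5, 0.95)
print(estimate, exact)  # the estimate should approach the exact value as N grows
```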

  • @sanjayadhith
    @sanjayadhith 1 month ago

    Useful video.

  • @Luca-yy4zh
    @Luca-yy4zh 2 years ago

    Finally a simple explanation

  • @soroushmehraban
    @soroushmehraban 1 year ago

    Very well-explained. Thank you!

  • @gaoyang6608
    @gaoyang6608 3 years ago +1

    thx for sharing very helpful and intuitive.

  • @tudor6210
    @tudor6210 1 year ago

    Beautiful explanation!

  • @cuongnguyenuc1776
    @cuongnguyenuc1776 7 months ago

    Great video! Can you make a video about soft actor critic?

  • @clairewang8370
    @clairewang8370 1 year ago

    This is so intuitive!!!!!!!!!❤

  • @_jiwi2674
    @_jiwi2674 2 years ago +3

    Great explanation, would be perfect if you spoke more slowly

  • @filipedstrom4462
    @filipedstrom4462 2 years ago

    Concise and clear, thank you!

  • @jessechen6541
    @jessechen6541 3 years ago +1

    excellent explanation

  • @user-sx4wm5ls5q
    @user-sx4wm5ls5q 2 years ago +1

    Wow this is an amazing explanation. So is KL divergence equivalent to Bayes factor with equal priors?

  • @akshaydongare2136
    @akshaydongare2136 1 year ago

    Thank you!

  • @DC-gq6ww
    @DC-gq6ww 2 years ago +2

    Thank you!
    May I ask how you made the video?
    I want the numbers to move like they do in your show.
    It looks great and maintains comprehensibility by bringing it to life!
    We have to make a video about AIC for our neuroinformatics class, so your video would be a nice introduction to the topic anyway...
    You do it a little better than our prof^^

    • @adianliusie590
      @adianliusie590 2 years ago +8

      This might break the magic a bit but I just use plain old fashioned Microsoft power point! To move the equations I use the inbuilt animations functionality, though it can get a bit tedious to make everything move exactly how you’d like to. But best of luck on making your video.

    • @DC-gq6ww
      @DC-gq6ww 2 years ago +1

      @@adianliusie590 thx for your answer! Good to know. It doesn't break the magic. I just use another program and I am a noob at some points

  • @alifarrokh9863
    @alifarrokh9863 2 years ago

    Very great explanation!

  • @ananya_sutradhar
    @ananya_sutradhar 11 months ago

    Just perfect!

  • @petercourt
    @petercourt 2 years ago

    Amazing explanation, thanks!

  • @mormonteg4073
    @mormonteg4073 9 months ago

    Thank you a lot

  • @researchmedicine6950
    @researchmedicine6950 3 years ago

    Keep the vids coming this is so so useful

  • @ramendrachaudhary9784
    @ramendrachaudhary9784 2 years ago

    Very well explained! Thank you!

  • @Darkev77
    @Darkev77 3 years ago +4

    Awesome video, but at 3:27, on what basis did we take the log?

    • @adianliusie590
      @adianliusie590 3 years ago +6

      That's a good question which I'm not sure I could answer too well. One could claim that the log function makes numbers more readable, and often when we deal with large/small numbers we log expressions first since the log operation is reversible and squeezes the range into a smaller one (e.g. e^10, about 22000, becomes 10), like is done with things like log probabilities. It could also just be mathematical convenience to drop the powers so that the overall expression looks much simpler.
      However, I think you'd find a more satisfying answer by looking in the direction of entropy, as entropy is defined as the negative expected log probability of a distribution. Since the KL is interlinked tightly with entropy, something may drop out there which will show that logging the ratio makes the expression more natural and intuitive. I'd have to think about it more, and maybe I'll make a video on entropy in the near future, but if I figure anything out I'll get back to you then.

    • @Darkev77
      @Darkev77 3 years ago

      @@adianliusie590 wow that’s such a great answer. I truly appreciate that! And yeah, what you said makes sense, and with regards to entropy you’re very right; since entropy is the expected/avg information of a distribution of random events and KL div measures the *relative* difference in expected information between two distributions.

    • @skeletonrowdie1768
      @skeletonrowdie1768 2 years ago +2

      Hi Darkev and Adian, there is another video on youtube (study squad academy) which explains the KL divergence from the perspective of Jensen's inequality. The main argument for taking the log is that it is a concave function, which does somewhat touch Adian's comment.
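
The Jensen's inequality argument mentioned above can be sketched as follows, assuming discrete distributions P and Q with the sums taken over outcomes where P(x) > 0. Because the log is concave, the expected log of the ratio is at most the log of the expected ratio, which gives the non-negativity of the KL divergence:

```latex
\begin{aligned}
-D_{\mathrm{KL}}(P \,\|\, Q)
  &= \sum_{x} P(x) \log \frac{Q(x)}{P(x)} \\
  &\le \log \sum_{x} P(x)\, \frac{Q(x)}{P(x)}
   = \log \sum_{x} Q(x) \le \log 1 = 0 .
\end{aligned}
```

Equality holds when Q(x) = P(x) for every outcome, which matches the intuition that identical coins have zero divergence.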

  • @annaly2318
    @annaly2318 2 years ago

    Very good video. Thanks so much!

  • @yatinarora9650
    @yatinarora9650 2 years ago

    thank you so much, very nicely explained

  • @blakeedwards3582
    @blakeedwards3582 2 years ago

    This was awesome. Thank you.

  • @yingliu350
    @yingliu350 2 years ago

    The video is good, but what confuses me is the correctness of the division. Sometimes the coins have different probabilities (e.g. NH = NT = 1 with p1 = q2 and p2 = q1), but the ratio is 1, which would mean they are similar or the same. That is actually wrong. So maybe this explanation is just a coincidence, or I have made a mistake. Hopefully you can help me. (I'm sorry if my poor English makes this confusing.)
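
A small sketch of the case raised in this comment, under the assumption (made here for illustration) that the true coin has p1 = 0.7 and coin 2 has the mirrored probabilities. A balanced sequence does give a ratio of 1, but such a sequence is not typical of the true coin; once the counts match the true coin's probabilities, the per-toss log ratio is clearly positive. The function name and values are hypothetical:

```python
import math

def log_ratio_per_toss(n_heads, n_tails, p_heads, q_heads):
    """Per-toss log of P(observations | coin 1) / P(observations | coin 2)."""
    n = n_heads + n_tails
    log_p = n_heads * math.log(p_heads) + n_tails * math.log(1 - p_heads)
    log_q = n_heads * math.log(q_heads) + n_tails * math.log(1 - q_heads)
    return (log_p - log_q) / n

p_heads, q_heads = 0.7, 0.3  # mirrored coins: p1 = q2 and p2 = q1

# One head and one tail: both coins explain this equally well (log ratio near 0).
balanced = log_ratio_per_toss(1, 1, p_heads, q_heads)

# Counts in the proportions the true coin tends to produce (70% heads).
typical = log_ratio_per_toss(70, 30, p_heads, q_heads)

print(balanced, typical)
```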

  • @hackercop
    @hackercop 2 years ago

    This was very good have liked and subscribed

  • @yihongli350
    @yihongli350 1 year ago

    beautiful!

  • @thapargerrard123
    @thapargerrard123 2 years ago

    Great video . Thanks.

  • @BillHaug
    @BillHaug 10 months ago

    ...tremendous!

  • @longh
    @longh 1 year ago

    super helpful! Thank you

  • @unbridled_exciton
    @unbridled_exciton 2 years ago

    This is gold!

  • @JingyueWu
    @JingyueWu 11 months ago

    Nice video! Can you say something about alternatives? E.g. why wouldn't mean squared error (of two probability distributions) work as well?

  • @ian-haggerty
    @ian-haggerty 4 months ago

    So a Kale Divergence of zero means identical distributions? What do the || lines mean?

  • @xxluapxx
    @xxluapxx 1 year ago

    Thanks for the explanation. With the RLHF stuff happening in ChatGPT, does anyone know why they choose to use KL divergence instead of Cross-entropy loss when calculating the RL policy penalty?

  • @juliocardenas4485
    @juliocardenas4485 2 years ago

    Excellent!!!

  • @hamzeasadi671
    @hamzeasadi671 2 years ago

    Greaaaat job

  • @Gathanokos
    @Gathanokos 3 years ago

    This video is amazing

  • @salehmontazeran1130
    @salehmontazeran1130 1 year ago

    Awesome

  • @amrahmed2009
    @amrahmed2009 2 years ago

    Thanks very much.

  • @onamixt
    @onamixt 11 months ago

    Why raise to 1/n power, why use log? Why don't we use just sum(P/Q)?

  • @1.4142
    @1.4142 2 years ago +1

    It has my initials

  • @treksis
    @treksis 2 years ago

    😁😁😁gotcha. super ez explanation

  • @joshholder359
    @joshholder359 1 year ago

    So fire

  • @zyzhang1130
    @zyzhang1130 1 year ago

    KL loss is not exactly equivalent to cross entropy loss right
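
A small numerical check of the relationship, as a sketch with made-up distributions: cross-entropy equals the entropy of P plus KL(P||Q), so the two losses differ by a term that does not depend on Q. With one-hot labels the entropy of P is zero and the two values coincide. All function names and numbers here are illustrative:

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

soft_target = [0.6, 0.3, 0.1]      # hypothetical "true" distribution
one_hot_target = [1.0, 0.0, 0.0]   # hypothetical one-hot label
prediction = [0.7, 0.2, 0.1]       # hypothetical model output

# General case: cross-entropy = entropy + KL, so the two are not equal.
print(cross_entropy(soft_target, prediction),
      entropy(soft_target) + kl_divergence(soft_target, prediction))

# One-hot case: the entropy term vanishes, so cross-entropy and KL agree.
print(cross_entropy(one_hot_target, prediction),
      kl_divergence(one_hot_target, prediction))
```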

  • @vi5hnupradeep
    @vi5hnupradeep 3 years ago

    Thank you so much

  • @Justin-zw1hx
    @Justin-zw1hx 1 year ago

    When you say "likelihood of the observation of each coin", you really mean "probability" instead of "likelihood", right?

  • @ViralPanchal97
    @ViralPanchal97 11 months ago

    I love you Biradr

  • @cliveemary4806
    @cliveemary4806 1 year ago

    nice

  • @gzitterspiller
    @gzitterspiller 1 year ago +1

    I still don't know why the log appears there.

    • @MessiahAtaey
      @MessiahAtaey 1 month ago

      It lets you break the expression up by addition rather than multiplication, and since the log is a strictly monotonically increasing function, it preserves ordering. Practically speaking, this is more efficient to compute than a product of terms.
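
A minimal sketch of the practical side of this point, using a made-up sequence of 2000 fair-coin tosses: the raw product of probabilities underflows to zero in double precision, while the sum of logs stays well within range.

```python
import math

# Hypothetical example: per-toss probabilities for 2000 tosses of a fair coin.
probabilities = [0.5] * 2000

# The direct product is smaller than the smallest representable double.
product = math.prod(probabilities)

# The sum of logs carries the same information without underflow.
log_sum = sum(math.log(p) for p in probabilities)

print(product)   # underflows to 0.0
print(log_sum)   # equals 2000 * log(0.5) up to rounding
```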

  • @dogukan463
    @dogukan463 3 years ago

    Nice video :)

  • @gottlobfreige1075
    @gottlobfreige1075 2 years ago

    So, why is KL Divergence not symmetric?

  • @ian-haggerty
    @ian-haggerty 4 months ago

  • @user-ut4zh3pw7l
    @user-ut4zh3pw7l 4 months ago

    wowowowo

  • @yegounkim1840
    @yegounkim1840 1 year ago +1

    It is not a measure of distance between distributions!

  • @zjy2936
    @zjy2936 2 years ago +1

    It’s technically not “distance”

  • @Yassinius
    @Yassinius 1 year ago

    Thanks so much