Fantastic video, incredibly clear. Definitely going to subscribe!
I do have one suggestion. I think some people might struggle a little bit around 2m22s where you introduce the idea that if P(sun)=0.75 and P(rain)=0.25, then a forecast of rain reduces your uncertainty by a factor of 4. I think it's a little hard to see why at first. Sure, initially P(rain)=0.25 while after the forecast P(rain)=1, so it sounds reasonable that that would be a factor of 4. But your viewers might wonder why you can’t equally compute this as, initially P(sun)=0.75 while after the forecast P(sun)=0. That would give a factor of 0!
You could talk people through this a little more, e.g. say imagine the day is divided into 4 equally likely outcomes, 3 sunny and 1 rainy. Before, you were uncertain about which of the 4 options would happen but after a forecast of rain you know for sure it is the 1 rainy option - that’s a reduction by a factor of 4. However after a forecast of sun, you only know it is one of the 3 sunny options, so your uncertainty has gone down from 4 options to 3 - that’s a reduction by 4/3.
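If it helps, here is that arithmetic spelled out in a few lines of Python (my own sketch, just double-checking the numbers above):

```python
import math

p_sun, p_rain = 0.75, 0.25

# Uncertainty-reduction factor = 1 / P(outcome); information = log2 of that factor.
for name, p in [("rain", p_rain), ("sun", p_sun)]:
    factor = 1 / p                # 4 for a rain forecast, 4/3 for a sun forecast
    info = -math.log2(p)          # bits of information carried by that forecast
    print(f"forecast of {name}: reduction factor {factor:.3f}, information {info:.3f} bits")

# forecast of rain: reduction factor 4.000, information 2.000 bits
# forecast of sun:  reduction factor 1.333, information 0.415 bits
```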
Thanks Jenny! You're right, I went a bit too fast on this point, and I really like the way you explain it. :)
Shouldn't one use information gain to check the extent of the reduction? IG = H(before) - H(after) = (-(3/4)log2(3/4) - (1/4)log2(1/4)) - (-1·log2(1) - 0·log2(0)) ≈ 0.811 bits
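For reference, here is that computation in Python (a quick sketch, assuming a forecast of rain fully resolves the weather):

```python
import math

def entropy(dist):
    """Shannon entropy in bits (terms with p = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

prior = [0.75, 0.25]       # P(sun), P(rain) before the forecast
after_rain = [0.0, 1.0]    # after a forecast of rain

print(entropy(prior))                        # ~0.811 bits
print(entropy(prior) - entropy(after_rain))  # information gain ~0.811 bits
```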
thank youuuuuuuuuuuuuuuuu
Actually I understood the concept better from your comment than from the video itself :) thanks a lot
Awesome, great insight. I did struggle to get it at first; checked out the comments and bam! Thanks :)
As a Machine Learning practitioner & YouTube vlogger, I find these videos incredibly valuable! If you want to freshen up on those so-often-needed theoretical concepts, your videos are much more efficient and clear than reading through several blogposts/papers. Thank you very much!!
Thanks! I just checked out your channel and subscribed. :)
I like your video too! Especially the VAE one
Arxiv, it was actually your video on VAE's that encouraged me to check out this video for KL-Divergence. Keep up the good work, both of you.
Thank you, at first I messed up trying to understand, but after reading your comment I understand it. Thank you! 😊
This feels like a 1.5-hour course conveyed in just 11 minutes, i wonder how much entropy it has :)
hahaha
Underrated Comment
ahhh....too clever. the comment has distracted my entropy from the video. Negative marks for you!
@klam77 Could you elaborate on his joke please?
@Darkev77 The idea here is that most other resources (videos, blogs) take a very long time (and more importantly say a lot of things) to convey the ideas that this video did in a short time (and with just the essential ideas). This video, thus, has low entropy (vs most other resources that have much higher entropy).
I've been googling KL Divergence for some time now without understanding anything... your video conveys that concept effortlessly. beautiful explanation
Haven't seen a better, clearer explanation of entropy and KL-Divergence, ever, and I've studied information theory before, in 2 courses and 3 books. Phenomenal, this should be made the standard intro for these concepts, in all university courses.
Beautiful short video, explaining the concept that is usually a 2 hour explanation in about 10 minutes.
Thank you, I have always been confused about these three concepts; you made them really clear for me.
Incredible video, easily one of the top three I've ever stumbled across in terms of concise educational value. Also love the book, great for anyone wanting this level of clarity on a wide range of ML topics.
Not sure if this will help anyone else, but I was having trouble understanding why we choose 1/p as the "uncertainty reduction factor," and not, say, 1-p or some other metric. What helped me gain an intuition for this was realizing that 1/p is the number of equally likely outcomes in a uniform distribution where every event has probability p, so log2(1/p) is the number of bits we would need to encode it. So the information, -log(p), is how many bits that event would be "worth" were it part of a uniform distribution. This uniform distribution is also the maximum entropy distribution that event could possibly come from given its probability... though you can't reference entropy without first explaining information.
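For anyone who wants to poke at that numerically, here is a small sketch of the same intuition (my own check, not from the video):

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

p = 0.25
n_outcomes = int(1 / p)         # 4 equally likely outcomes
info = -math.log2(p)            # 2.0 bits: the information of one such event
uniform = [p] * n_outcomes      # the maximum-entropy distribution with P = p per event
print(info, entropy(uniform))   # both 2.0: -log2(p) equals the entropy of that uniform distribution
```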
Phenomenal explanation of a seemingly esoteric concept into one that's simple & easy-to-understand. Great choice of examples too. Very information-dense yet super accessible for most people (I'd imagine).
I always seem to come back to watch this video every 3-6 months, when I forget what KL Divergence is conceptually. It's a great video.
Really, I definitely cannot come up with an alternative way to explain this concept more concisely.
Wow! It's just incredible to convey so much information while still keeping everything simple & well-explained, and within 10 min.
Wow, best explanation ever. I found this while I was in college; I just come back once a year to refresh my intuition.
Sir, you have a talent for explaining stuff in a crystal clear manner. You take something that is usually explained with a huge pile of math equations and make it this simple. Great job, please continue making more YouTube videos!
Thank you, very well explained! I decided to get into machine learning during this hard quarantine period, and I didn't have high expectations. Thanks to your clear and friendly explanations in your book I am learning, improving and, not least, enjoying it a lot. So thank you so much!
Finally, someone who understands, and doesn't just regurgitate the Wikipedia page :) Thanks a lot!
This is by far the best and most concise explanation of the fundamental concepts of information theory we need for machine learning.
You are a genius at creating clarity.
I came to find Entropy, but I received Entropy, Cross-Entropy and KL-Divergence. You are so generous!
This 11-ish minute presentation so clearly and concisely explained what I had a hard time understanding from a one hour lecture in school. Excellent video!
Wow! This was the perfect mix of motivated examples and math utility. I watched this video twice. The second time I wrote it all out. 3 full pages! It’s amazing that you could present all these examples and the core information in ten minutes without it feeling rushed. You’re a great teacher. I’d love to see you do a series on Taleb’s books - Fat Tails and Anti-Fragility.
I want to like this video 1000 times. To the point, no BS, clear, understandable.
Fantastic! This short video really explains the concept of entropy, cross-entropy, and KL-Divergence clearly, even if you know nothing about them before.
Thank you for the clear explanation!
Not only is this video fantastic at explaining the concepts, but the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" (O'Reilly Media, 2019), by the same author (Aurélien Géron), is the best book I've studied on the subject of machine learning.
You are the most talented tutor I've ever seen
I am new to information theory and computer science in general, and this is the best explanation I could find about these topics by far!
I have been using cross-entropy for classification for years and I just understood it. Thanks Aurélien!
Very elegant, indicating how cognizant the presenter is.
This channel will sky rocket. no doubt. Thank you so much! Clear, visualized and well explained at a perfect pace! Everything is high quality! Keep it up sir!
Kinda feels like 3Blue1Brown's version of Machine learning Fundamentals. Simply Amazing
Thanks a lot, I'm a huge fan of 3Blue1Brown! 😊
Phew!! As a newbie to Machine Learning without a background in maths, this video saved me; otherwise I never expected to grasp the entropy concept.
This explanation is absolutely fantastic. Clear, concise and comprehensive. Thank you for the video.
Thanks! For people who are looking for the ML connection: the cross-entropy loss for a single example is -log(q), where q is the predicted probability of the true class, e.g. -log(0.25) if the model assigns probability 0.25 to the true class.
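In code, a minimal sketch of that single-example loss (one-hot target assumed, base-2 logs to stay in bits as in the video):

```python
import math

def cross_entropy(p_true, q_pred):
    """H(p, q) = -sum_i p_i * log2(q_i)."""
    return -sum(p * math.log2(q) for p, q in zip(p_true, q_pred) if p > 0)

p = [0.0, 1.0, 0.0, 0.0]       # one-hot target: the true class is class 1
q = [0.25, 0.25, 0.25, 0.25]   # the model predicts 0.25 for the true class
print(cross_entropy(p, q))     # 2.0 bits, i.e. -log2(0.25)
```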
Finally I understood Shannon's theory of information. Thank you Aurélien!
I'm loving the slides and explanation. I noticed the name in the corner and thought, oh nice, I know that name. Then suddenly... it's the author of that huge book I love!
This is the 3rd time I've watched this video: in April, September, and December 2018. The first time I watched it, I thought I understood this topic, but I know now that I knew nothing back then.
One of the most beautiful videos I've ever watched for understanding a concept :')
Highly recommendable! Finally, I found someone who could explain the concepts of entropy and cross-entropy in very intuitive ways.
You make the toughest concepts seem super easy! I love your videos!!!
Thank you for the useful video, and also really thanks for your book. You make very difficult machine learning concepts a piece of cake.
Really the best explanation of KL divergence I have seen so far !! Thank you.
This is by far the best description of those 3 terms, can't be thankful enough.
Thank you so much, Monsieur Géron, for this simple and crystal-clear explanation.
Aurelien has a knack for making things simpler. Check out his Deep Learning using TensorFlow course on Udacity. It's amazing.
I really enjoyed the way you explain it. It's so inspiring watching and learning difficult concepts from the author of such an incredible book in the ML realm. I wish you would teach other concepts via video as well.
Cheers,
Roxi
Thank you for such a wonderful and to-the-point video. Now I know: Entropy, Cross-Entropy, KL Divergence, and also why cross-entropy is such a good choice as a loss function.
Guys this is the best explanation on Entropy , Cross-Entropy and KL-Divergence.
Very few people can explain like you, to be honest! I read so many decision tree tutorials and they are all actually talking about the same thing (information gain), but after reading their articles I still had zero understanding. Big thanks to this video!
Hey Aurélien, thanks so much for this great video! I have a few questions:
1/ I struggle with the concept of uncertainty. In the example where p(sun)=0.75 and p(rain)=0.25, what would be my uncertainty ?
2/ At 6:42, I don't understand why using 2 bits for the sunny weather means that we are implicitly predicting that it'll be sunny every four days on average.
3/ Would it be a bad idea to try to use a cross entropy loss for something different from classification (i.e. where the targets wouldn't be one-hot vectors) ? I think there is a possibility that we can find a predicted distribution q different from the true distribution p, which would also minimise the value of the cross entropy, but I'm not sure.
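Edit: regarding question 3, a quick numeric check I tried (my own sketch, not from the video) suggests that for a fixed true distribution p, the cross-entropy H(p, q) is smallest exactly when q = p (Gibbs' inequality), so minimizing it should still push q towards p even without one-hot targets:

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # a non-one-hot "true" distribution
candidates = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.6, 0.2, 0.2], [1/3, 1/3, 1/3]]
for q in candidates:
    print(q, round(cross_entropy(p, q), 4))
# The smallest value is obtained for q == p, where H(p, q) equals the entropy of p (~1.485 bits).
```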
Fantastic video! Now all the dots are connected! I have used this loss function for NN machine learning without knowing the math behind it! This is so enlightening!
I've learned about this before, but this is the best explanation I've come across. And was a helpful review, since it's been a while since I used this. Well done.
You have no idea how much this video has helped me. Thanks for making such quality content and keep creating more.
Thank you so much! Not only did it help me understand KL-Divergence, it also makes the formula easy to remember. From now on I will put the signs in the right places. Keep it up!
This explanation really helps the learner understand such seemingly vague scientific concepts. Thanks for the clear explanation!!
Your tutorials are always unbeatable: quite explicit, with great examples. Thanks for your work.
Great work on the explanation. I had been pretty confused about this concept and the implications of information theory for ML. This video does the trick in clarifying the concepts while connecting information theory to its usage in ML. Thanks much for the video.
This is the best explanation of the topics that I have ever seen. Thanks!
This is the best explanation of entropy and KL I have found. Thanks
Hats off! One of the best teachers ever! This definitely helped me better understand it both mathematically and intuitively just in a single watch. Thanks for reducing my 'learning entropy'. My KL divergence on this topic is near zero now. ;)
5:12 A tiny typo: the entropy should have a negative sign
You are a 3Blue1Brown kind of guy. Nowadays I see a lot of YouTubers making machine learning videos by repeating the words found in research papers and Wikipedia. You are different.
Grant Sanderson is like the Morgan Freeman of visual Mathematics.....I wish his videos existed during my earlier days in college
OK, maybe I should pay more attention when reading my books, but when I heard here that cross-entropy is entropy + KL it made sense. Then when I read my notes I saw I had written something similar, without even realizing how big it was.
This was the best intuitive explanation of entropy and cross entropy I've seen. Thanks!
I am reading your book, and oh man, what a book!!! At first I wondered how the book and the video had exactly the same example, until I saw your book later in the video and realized it's you. It's so great to listen to you after reading you!!
I came here to learn how to correctly pronounce his name :).
The content is simply great. Thanks a lot.
the best video on cross entropy on youtube so far
Your book rocks.
EVERYONE BUY THE HANDS ON GUIDE
Edit: in fact, if you can't afford it, contact me and make a case; I might buy it for you.
Great video! However, I have a doubt related to around 7:11 onwards. I don't understand the point where you say that "the code doesn't use messages starting with 1111, and hence the sum of predicted probabilities is not 1". Could you explain this?
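Edit: here is how I currently read that point, with a made-up prefix code in which no codeword starts with 1111 (the exact codewords are my guess, not necessarily the video's):

```python
# Hypothetical prefix code: no codeword starts with 1111.
code = {"sunny": "0", "cloudy": "10", "rainy": "110", "snowy": "1110"}

# A codeword of length L implicitly "bets" a probability of 2**-L on that event.
implied = {state: 2 ** -len(bits) for state, bits in code.items()}
print(implied)                # {'sunny': 0.5, 'cloudy': 0.25, 'rainy': 0.125, 'snowy': 0.0625}
print(sum(implied.values()))  # 0.9375 < 1: the unused '1111...' messages account for the missing 1/16
```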
Thanks, the explanation is clear. I found it clean and easy to understand compared with my lecture notes; I don't even think they mentioned the history or the derivation/origin.
Nicely conveyed what is to be learned about the topic. I think I absorbed it all. Best tutorial, keep dropping videos like this.
Fantastic video! It made me understand and get together many "loose" concepts. Thank you very much for this contribution!
The no. of bits I received is way higher than I expected !!
Nice video
Your channel has become one of my favorite channels. Your explanation of CapsNet and now this is just amazing. I am going to get your book too. Thanks a lot. :)
Please do a video on 'PAC learning'. It seems very complex. Your way of explanation can make it easy!!
The best explanation of the topic I have ever had. It was really insightful.
Very nice. Really short yet clearly grasping the point of these concepts. Subscribed.
I was really excited when I found this channel. I mean, the book Hands-On Machine Learning is maybe the best book you can find these days.
Still can't believe these can be taught in such a short video.
Best Entropy and Cross-Entropy explanation I have ever seen
I've seen all your videos now. You've taught me a lot of things, and these were some good moments. Can't wait for more. Thanks so much!
I'm amazed by this video, you are a gifted teacher.
This video is so clear and so well explained, just like his book!
Very clear and well-structured explanation. Your book is great, too! Thank you very much!
Excellent explanation and discussion. Thank you very much!!
Normally when I like a video, I just click the like button. Since this is sooooo helpful, I will also leave a comment to thank you for making this.
Great video to learn interpretations of the concept of cross-entropy.
To-the-point and intuitive explanation and examples! Thank you very much! Salute to you!
Super clear... never have I heard this explanation of entropy and cross-entropy!
I have that book, didn't realize you wrote it until now.
Really good explanation, the visuals were also great for understanding! Thanks Aurelien.
Thank you very much for this excellent video. Looking for a similar one on the topic of Expectation Maximization.
Thanks Aly! Basically, you can think of EM as a generalization of K-Means.
K-Means is a clustering algorithm that works like this: first you randomly select k points called "centroids" (there are various ways to do that, but the simplest option is to pick k instances randomly from the dataset and place the centroids there). Then you alternate two steps until convergence: (1) assign each instance to the closest centroid, (2) update each centroid by moving it to the mean of the instances that are assigned to it. I recommend you search for an animation of this process, it's really quite simple, fast and often very efficient. This is guaranteed to converge, since both steps always reduce the mean squared distance between the instances and their closest centroid (this number is called the "inertia"). Unfortunately, the algorithm may converge to a local optimum, so you would typically repeat the whole process multiple times and pick the best solution (i.e., the one with the lowest inertia).
Okay, now EM is basically the same idea, but instead of just searching for the cluster centers, the algorithm also tries to find each cluster's density, size, shape and orientation. Typically, we assume that the clusters are generated from a number of Gaussian distributions (this is called a Gaussian Mixture Model), so basically the clusters look like ellipsoids. Like K-Means, the EM algorithm alternates between two steps: the Expectation step (assigning instances to clusters), and the Maximization step (updating the cluster parameters). However, there are a few differences: during the Expectation step, EM uses soft clustering rather than hard clustering: this means that each instance is given a weight for each cluster, rather than being assigned to the closest cluster. Specifically, the algorithm estimates (using the current cluster parameters) the probability that each instance was generated by each cluster (this is called the cluster's "responsibility" for that instance). Next, the Maximization step updates the cluster parameters, i.e., the centroid, the covariance matrix (which determines the ellipsoid's size, shape and orientation), and the cluster's weight (basically how many instances it contains relative to the other clusters; you can think of it as the cluster's density). For example, to update a cluster's centroid, the algorithm computes a weighted mean of all the instances, using the cluster's responsibilities for the weights (so if the algorithm estimated that a particular instance had a very small probability of belonging to this cluster, then it will not affect the update much).
To summarize: EM is very much like K-Means, but using soft-clustering, and based on a probabilistic model that allows it to capture not only each cluster's center, but also its size, shape and orientation. Check out Scikit-Learn's user guide on GaussianMixture for more details. Hope this helps! :)
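If it helps to see it in code, here is a minimal scikit-learn sketch of the two approaches side by side (the toy data and parameter choices below are just illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy data: three blobs in 2D (illustrative only).
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ([0, 0], [3, 0], [0, 3])])

# K-Means: hard assignments, centroids only.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)      # centroids
print(km.inertia_)              # the inertia (squared-distance criterion)

# EM on a Gaussian Mixture: soft assignments plus each cluster's size, shape and orientation.
gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)
print(gm.means_)                # cluster centers
print(gm.covariances_)          # size, shape and orientation of each ellipsoid
print(gm.weights_)              # relative weight of each cluster
print(gm.predict_proba(X[:3]))  # responsibilities: one probability per cluster, per instance
```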
Thank you Aurélien Géron, that was a very beautiful presentation!
Dear Aurélien Géron,
I have the following questions. It would be great if you can answer these also.
1. What about continuous systems where the set of possible states is not discrete? Is it possible to use entropy in such cases?
2. What if we have no idea about the probability distribution of the weather states? In such a case, how can we assign more bits to rare events and fewer bits to frequent events?
3. In the cross-entropy calculation, why is the same number of bits assumed for each state rather than a varying number of bits (more bits for rare events and fewer bits for frequent events)?
I rarely comment on videos, but this video is so good. I just couldn't resist. Thank you so much for the video. :)
I had to find a word for how well you explain. Perspicuous. Thank you.
I just learned a new word, thanks James! :)
Awesome video, you made the concept of entropy so much clearer.
Great understanding and a very good mentor.
Thanks for the explanation, very clear and complements your excellent book
Lol, I was just reading your book when I searched for 'cross entropy', and boom, I never knew you had a YouTube channel too!
Haha, I hope you enjoy it! :) I haven't posted a video in months, because I've been busy moving to Singapore and writing the 2nd edition of my book, but as soon as I finish the book I'll get back to posting videos!
@AurelienGeron Good luck with that! You are a great teacher.
Your explanations are so much better than those of other "famous" ML vloggers (... looking at you, Siraj Raval!). You truly know what you are talking about; even my grandma could understand this!! Subscribed, liked and belled. More, please!
Thanks Martin, I'm glad you enjoyed this presentation! My agenda is packed, but I'll do my best to upload more videos asap. :)
I really enjoyed your book and these videos! Keep them coming! Even though part of my PhD had to do with Information Theory, I enjoyed the way you explain IT and cross-entropy in a very practical way. It helped me understand why it is used in machine learning the way it is. Looking forward to more great videos (and maybe a second book?)!
Thanks Omri, I'm glad you enjoyed the book & videos. :) I recently watched a great series of videos by Grant Sanderson (3Blue1Brown) about the Fourier Transform, and I loved the way he presents the topic: I thought I already knew the topic reasonably well, but it's great to see it from a different angle. Cheers!
Yes, the Fourier transform is a fascinating and multifaceted topic ;) In physics we use it very often for very surprising reasons. I'm looking for a book similar to yours which focuses specifically on NLP with Python and is very well written and modern. Do you have any recommendations? Thanks! Omri