Відео

Intuitively Understanding the Cross Entropy Loss

5:24

Intuitively Understanding the Cross Entropy Loss

Переглядів 96 тис.3 роки тому

This video discusses the Cross Entropy Loss and provides an intuitive interpretation of the loss function through a simple classification set up. The video will draw the connections between the KL divergence and the cross entropy loss, and touch on some practical considerations. Twitter: AdianLiusie

Beginners Overview of Machine Learning and Artificial Intelligence

11:06

Beginners Overview of Machine Learning and Artificial Intelligence

Переглядів 9223 роки тому

This is a recorded talk which I created for my old school, Dubai College. This video is an introduction for complete beginners into artificial intelligence and machine learning, gives the general idea of the field and an overview of the machine learning approach.

5:12

Understanding Neural Networks Training

Переглядів 8633 роки тому

This video gives an overview of general optimisation of neural networks and explains how training is done through loss minimisation. This video will provide a foundation to then take a deeper look at more complex practical algorithms like stochastic gradient descent and backwards propagation which will be covered in future videos.

Intuitively Understanding the KL Divergence

5:13

Intuitively Understanding the KL Divergence

Переглядів 96 тис.3 роки тому

This video discusses the Kullback Leibler divergence and explains how it's a natural measure of distance between distributions. The video goes through a simple proof, which shows how with some basic maths, we can get under the KL divergence and intuitively understand what it's about.

6:06

Understanding deep neural networks

Переглядів 8913 роки тому

This video gives a basic introduction to neural networks and discusses what they are, how they work and ways to see the system in a matrix framework. This is the first video in a series which will ultimately build up to programming neural networks and running the back propagation algorithm from scratch in python.

КОМЕНТАРІ

@AxlProg-re3dz 6 днів тому
THE GOAT
@blackglitter 18 днів тому
I would give my life for you, fam. Thanks.
@sumitsp01 28 днів тому
Thank you sir! After watching so many videos, reading articles and talking with chatgpt, this explanation made sense to me and will remain with me forever. Grateful 🙏
@yawenliu-p7m Місяць тому
小猫到此一游
@jakobmiesner3995 Місяць тому
Thanks
@omarrayyann Місяць тому
Thanks!
@maxlehtinen4189 2 місяці тому
Here is my thought process on why the Shannon Entropy formula makes sense. Hope it helps some of you. Also, if someone wants to use this explanation anywhere, like a blog post etc., please go ahead. No credit necessary. 1) Let’s say that X = “person x has access to site” is a random variable (RV), where P(X = yes) = 0.75 and P(X = no) = 0.25. Then why does it make sense that Entropy(X) = - 0.75 * log2(0.75) - 0.25 * log2(0.25) = 0.75 * log(1 / 0.75) + 0.25 * log(1 / 0.25)? 2) Well, entropy(X) = average_surprise(X), right? Think about it: entropy is something uncontrollable, something NOT known in advance. Put (maybe even too) simply: entropy IS surprise, and to quantify entropy, we must quantify how surprised we are by the news of whether someone, let's say Amy, has been granted access to a site. 3) Average_surprise(X) = average_information_contained_in_distribution(X). This should be intuitive. The information that we need to transmit the result of an event with probability 1 is 0, as we already know it will happen. Yes, there is information in the fact that we know, but there is no information to be extracted FROM THE RESULT of a deterministic probabilistic process. The same logic works continuously: the more average information there is to be gained from the results of a probabilistic process, the more surprise it has embedded in it. 4) By combining these results (2) and (3), we get that Entropy(“person x has access to site”) = average_information_contained_in_distribution(X). 5) But wait, what is this information you speak of? Surely it isn't a mathematical object. It isn't a number. So, how can we even calculate it? Well, P(X = yes) = 0.75 means that with probability = 0.75, we can know the result of both X = yes (happened or didn’t) and X = no (happened or didn’t). We can quantify this GAIN IN INFORMATION from a piece of news in the following way. 6) When we get the news that Amy indeed got access (p = 0.75), we get information worth 1.0 units. What are the units? Well, they are information. We don’t have a unit like Hz or Amps for it. So it’s just units of information (gained). 7) But why is the information worth exactly 1.0 units? Well, we need to have some measure for information. You can’t measure 1 cm without first deciding that “this distance is 1 cm”. So, because we need a measure, it makes sense that the information of the event that actually occurred would be the measuring stick of 1.0 units, because we are going to be using it in trying to solve whether this event was probable or not. The other events have relative magnitudes w.r.t. the event that happened. By using there relative magnitudes, we can start to reason about whether we should weigh this outcome as having high entropy = high surprise, when it happens. 8) Okay, now we know the information gained on the event X = yes. It is 1.0 units, as agreed. But we also now know that the event X = no has NOT happened. This is also information. It must be "worth" something. But how do we quantify it? 9) Quantifying the information gained from events that have NOT happened is simple now that we have a measurement stick for what 1 unit of information is. We can use this measuring stick to calculate how much an event that has NOT happened is “worth” based on ITS own probability. 10) We now know that p = 0.75 is worth 1 unit of information. Then, p = 0.25, corresponding to X = no, is only worth 0.25 / 0.75 = 0.33 units. In total, the news has given us 1.33 units of information. 11) Notice a pattern that explains WHY this approach of comparing event probabilities with the event that happened makes sense: if the event that happens has high probability, the unit of information has “higher standards”. It doesn’t accept any lower probability events as having high information. 12) More generally, the probability of the event that happens “controls” how large the information gained from all its counter events is. Similarly, if a low-probability event happens, it will make the information gained from this event larger by lowering the bar for an unit of information gained. This corresponds to the intuitive idea we humans have while reasoning about this: "a low probability event has happened, so it must mean that the information gained from this event is larger”, and vice versa. The ratios of the probabilities make the magic work, so it is no coincidence that the Shannon formula has them. Namely, 1 / p. (Please only consider the formula without the minus sign, which has log(1 / p) and not just log(p). The minus sign is just there to make the formula more concise. In reality, everything we are talking about makes much more sense without the minus sign.) 13) If you’ve made it this far, congrats. We are almost there. But just to make sure we are on the same page, let’s reiterate on what the number 1.33 is telling us. It is telling us the “number” of different "event units" we have gained information on after hearing the news, where the unit of one full event/outcome corresponds to the probability of the outcome that actually happened. (Side quest: For simplicity, you can also use numbers that only output integers for event units. For example, let’s say our probability_distribution = [0.25, 0.25, 0.5]. If p = 0.25 happens, we gain info on 1 + 1 + 2 = 4 event units, where having p = 0.25 is the measure of 1 event unit. As discussed in (6), it can also be understood as the information unit, i.e. surprise unit, i.e. the reciprocal of the event’s probability, i.e. 1 / p.) 14) As we discussed earlier, Entropy(X) = average_information_contained_in(X) = expected_information_gained_from_knowing(X). This we have already calculated for X = yes (1.33). Do the same process for X = no and we get 4. (Left as an exercise to the reader. Note: if you can't do it, you haven't understood the most important point.) 15) Now we are ready to tie it all together. Notice the log(1 / p) in the entropy formula? That’s just our 1.33 and 4 with log wrappers (1 / 0.75 = 1.33 and 1 / 0.25 = 4). (Side quest: The Log. The reason we use log, originally log2, is because The Great Shannon wanted everything to be measured in bits. That makes sense because he was a computer scientist working on quantifying the information of (binary-encoded) messages. And to be fair, it is quite neat to have a binary interpretation for the total information in a system (the expected number of bits required to represent the result of its corresponding RV). The log also makes entropy additive (this is super useful in ML, even if we use nats or some other base, which obviously doesn’t have a bit interpretation. The bit interpretation doesn't really mean anything in the larger context of information anyway. It’s just one way to ENCODE information, but it itself is not information and we can only use the number of bits as a measuring stick for information. Any other measuring stick works just as well, at least in theory. For humans, bits are a friend.). 16) So, 1 / 0.75 = 1.33 and 1 / 0.25 = 4. That is exactly what we calculated with our intuitive method. That's because we used the same exact method as is the formula. The nominator is the measuring stick. It is the 1.0 units. To drive the point home, note that 1 / p = 1 + (1 - p) / p. The (1 - p) / p part is what calculates the information units gained from leftover events (0.33 when the event that happened is X = yes), and 1 is the information gained from the event that actually happened. 17) Now we are at the finish line. We just need to quantify the EXPECTED information required to encode something in bits. This is easy, if you understand expected values, which I expect you do, because you’ve made it this far. 18) Entropy(X) = 0.75 * log2(1 / 0.75) + 0.25 * log2(1 / 0.25) = - 0.75 * log2(0.75) - 0.25 * log2(0.25)
@newbie8051 3 місяці тому
oh wow this was simple and amazing thanks !
@semionababo 3 місяці тому
IMO this video gives pretty bad explanation of the essence just hiding by algebraic transformations. Found a much better video that actually explains why we use log() without "believe me or not...": ua-cam.com/video/q0AkK8aYbLY/v-deo.html
@unsaturated8482 3 місяці тому
The most intelligent people are the one's who are able to explain the hardest concepts in the most intutive way possible. Thanks.
@RedwanKarimSony_napstar_1455 4 місяці тому
Best explanation of the KL divergence in UA-cam for sure.... Thanks...
@fVNzO 4 місяці тому
I skipped through the video but i don't think you managed to explain how the formula itself deals with the infinites that are created when inputting log(0). That's what i don't understand.
@kushalneo 4 місяці тому
Great Video
@openroomxyz 4 місяці тому
Thanks for creating this video is awsome
@default_123-f4k 5 місяців тому
Thanks bro, that is such an elegant and concise explanation of the concept! Really Helpful
@MissPiggyM976 5 місяців тому
Very good!
@yingjiawan2514 6 місяців тому
This is so well explained. thank you so much!!! Now I know how to understand KL divergence, cross entropy, logits, normalization, and softmax.
@chunheichau7947 6 місяців тому
I wish more professors can hit all the insights that you mentioned in the video.
@Sars78 6 місяців тому
Well done, Adian. I just found out-though I'm not surprised at all, in the Shannon sense 🤓 -that you're doing a PhD at Cambridge. Congratulations! Best wishes for everything 🙂
@PoojaKumawat-z7i 6 місяців тому
How does the use of soft label distributions, instead of one-hot encoding hard labels, impact the choice of loss function in training models? Specifically, can cross-entropy loss still be effectively utilized, or should Kullback-Leibler (KL) divergence be preferred?
@deerwithantlers 7 місяців тому
Useful video.
@HaykTarkhanyan 7 місяців тому
great video, thank you!
@AdeshBenipal 7 місяців тому
Nice video
@Micha-ku2hu 8 місяців тому
What a great and simple explanation of the topic! Great work 👏
@xyzct 8 місяців тому
Excellent. Short and sweet.
@genkidama7385 8 місяців тому
distirbution
@avatar00001 8 місяців тому
thank you codexchan
@mathy642 8 місяців тому
Thank you for the best explanation
@franklyvulgar1 9 місяців тому
this is a great explanation thank you !
@debasishraychawdhuri 9 місяців тому
It does not explain the most important part - how the formula for non-uniform distribution came about
@hanriver8838 4 місяці тому
I agree. The transition from uniform distribution to ununiform distribution should be the most important and confusing part.
@shahriarrahman8425 9 місяців тому
Great explanation. Thank you so much!
@ian-haggerty 9 місяців тому
Adian. I know you're probably super busy doing PhD things, but come back & make some more videos! You're a gifted orator.
@ian-haggerty 9 місяців тому
Best explanation on the interwebs!
@madarahuchiha1133 9 місяців тому
what is true class distribution?
@elenagolovach384 8 місяців тому
the frequency of occurrence of a particular class depends on the characteristics of the objects
@ian-haggerty 10 місяців тому
So a Kale Divergence of zero means identical distributions? What do the || lines mean?
@ian-haggerty 10 місяців тому
<3 this. 👌
@KemalCetinkaya-i3q 10 місяців тому
wowowowo
@adityakulkarni5577 10 місяців тому
Perfectly explained in 5 minutes. Wow.
@LiHongxuan-ee7qs 10 місяців тому
So clear explanation! Thanks!
@charleswilliams8368 10 місяців тому
Three bits to tell the guy on the other side of the wall what happened, and it suddenly made sense. Thanks.
@EricPham-gr8pg 11 місяців тому
This is an incomplete hypothesis because on p(x) and ln(p(x)) come without proof is grossly insulting and it ignored the basic principle of conservation of energy meaning if pressure goes up then volume also goes up to release presured to go back to normal and that is the frustrating truth of life. The beast never dies it only change its phases or faces
@SunilKumarSamji 11 місяців тому
Excellent video. Can someone help me understand why is it called Divergence in the first place? Why are we taking 1/N power to normalise it to sample space, I did not understand the logic behind this.
@ajitzote6103 11 місяців тому
not really a great explaination, so many terms were thrown in. that's not a good way to explain something.
@백인선-u4z 11 місяців тому
Thankkkk youuuuu.
@brianlee4966 11 місяців тому
Thank you so much for this video and clear explanation!
@jiwoni523 Рік тому
make more videos please , you are awesome
@ananthakrishnank3208 Рік тому
Excellent expositions on KL divergence and Cross Entropy loss within 15 mins! Really intuitive. Thanks for sharing.
@cuongnguyenuc1776 Рік тому
Great video! Can you make a video about soft actor critic?
@sirelegant2002 Рік тому
Thank you!
@Z-eng0 Рік тому
Bro really explained it in less than 10 mins when my professors don't bother even if it could be done in 5 secs, true master piece thus video keep it up man 🔥🔥🔥

Adian Liusie

КОМЕНТАРІ