for everyone trying to understand this concept even more thoroughly, towardsdatascience's article "The intuition behind Shannon’s Entropy" is amazing. it gives added insight on why information is the reciprocal of probability
Thank you!!
And it's paywalled...
@@benjahnz perhaps see if there's a PDF version floating around: search random forums for it with "site:[insert site here] [random search term]" (or just look up Google dorking), check whether a Reddit post has it, or try the Internet Archive (it might not have been paywalled previously).
@@benjahnz Worse, the sign up process doesn't work.
Ain't paying all that
Please make more videos this is literally the only time I've ever seen entropy be explained in a way that makes sense
Three bits to tell the guy on the other side of the wall what happened, and it suddenly made sense. Thanks.
Bro really explained it in less than 10 mins when my professors don't bother even if it could be done in 5 secs. True masterpiece, this video, keep it up man 🔥🔥🔥
Some parts of the concept are confusing, but the process of rethinking it is helpful, and as far as I can tell:
sum over the possible outcomes of (the surprise of each outcome) * (its probability) = entropy (the lower the entropy, the less surprising the result will be)
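To make that reading concrete with a standard example: for a fair coin, each outcome has surprise log2(1/0.5) = 1 bit, so the entropy is 0.5 * 1 + 0.5 * 1 = 1 bit; the more lopsided the probabilities get, the smaller that weighted sum becomes.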
Dude, I took information theory from a rigorously academic and formal professor. I'm a little slow, and under the pressure of getting assignments done I couldn't always see the forest for the trees. Just the sentence "how much information, on average, would we need to encode an outcome from a distribution" summed up the whole motivation and intuition. Thanks!
At 4:54, may I know the reason to consider 10 bits and triples? why not any other combination? Thanks.
I was just showing arbitrary examples, but I could have chosen many different ones. The triples (when there were 8 outcomes) were to show this can easily be extended to any power of 2, and the 10 outcomes were to show that it also generalises to non-powers of 2.
What I understand is that entropy is directly related to the number of outcomes, right? So I don't get why we need such a parameter/term when we could simply state the number of outcomes of the probability distribution instead. What new thing does entropy bring to the table?
Consider the case where a biased coin is flipped. There are two outcomes, just like an unbiased coin, but let's say this biased coin has a (0.1)^10000 chance of being heads. Do you have exactly the same information about the outcome beforehand as you do with an unbiased coin?
@@derickd6150 yes, it makes sense that a non-uniform distribution should have an effect on the uncertainty of a distribution, but can you explain how the bias affects the outcome via the entropy formula?
@@maxlehtinen4189 I'm not sure what you mean by bias here? Edit: Oh right, you're referring to my answer, not something in the video. Yes, well, the entropy formula says something along the lines of: "How many bits do we need to represent the outcome of the coin?" That is a very natural measure of how much information you have about the outcome. If the coin is unbiased, you need one bit. If it is as severely biased as I describe above, and you plug the numbers into the entropy formula, it will essentially tell you "Well... we barely need any bits to describe the outcome, right? We're essentially certain it will be tails." Something intuitively along these lines. Edit 2: to see this, plot y(p) = -p log(p) - (1-p) log(1-p) for p in [0,1]. That is the expression for the entropy of the coin, whatever its bias. You will see that when p is very close to 1 or to 0 (which it is in my example), y(p) is almost 0. This is to say, you need almost no information to represent the outcome. It is just known. You need not transfer any information to someone, on the moon say, for that person to guess that the biased coin I described gives tails. However, when p is 0.5, the entropy is maximised, and so you would need to transfer the most information to someone on the moon to tell them the outcome of the coin, because they cannot use their prior knowledge at all to make any kind of educated guess.
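If it helps to see those numbers, here's a minimal Python sketch of the y(p) curve described above (the 1e-9 below just stands in for the comment's astronomically small bias, which would underflow an ordinary float):

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome needs no bits at all
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: a fair coin is maximally uncertain
print(binary_entropy(1e-9))  # ~3e-8 bits: a heavily biased coin is essentially certain
```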
Each slice of probability requires log2(1/p_i) bits to represent, and the entropy is the sum of these bit counts weighted by each slice's probability. Each slice of probability is basically one of the possible outcomes, say, getting the combination ABCDEF in a six-letter scramble. (correct me if I am wrong)
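As a concrete check on the slice picture (assuming all 6! = 720 orderings of the six-letter scramble are equally likely): p(ABCDEF) = 1/720, so representing that one outcome takes log2(1/p) = log2(720) ≈ 9.49 bits, and since every ordering is equally likely, the probability-weighted sum over all 720 slices is 720 * (1/720) * log2(720) ≈ 9.49 bits of entropy as well.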
Here is my thought process on why the Shannon Entropy formula makes sense. Hope it helps some of you. Also, if someone wants to use this explanation anywhere, like a blog post etc., please go ahead. No credit necessary.
1) Let’s say that X = “person x has access to site” is a random variable (RV), where P(X = yes) = 0.75 and P(X = no) = 0.25. Then why does it make sense that Entropy(X) = - 0.75 * log2(0.75) - 0.25 * log2(0.25) = 0.75 * log2(1 / 0.75) + 0.25 * log2(1 / 0.25)?
2) Well, entropy(X) = average_surprise(X), right? Think about it: entropy is something uncontrollable, something NOT known in advance. Put (maybe even too) simply: entropy IS surprise, and to quantify entropy, we must quantify how surprised we are by the news of whether someone, let's say Amy, has been granted access to a site.
3) Average_surprise(X) = average_information_contained_in_distribution(X). This should be intuitive. The information that we need to transmit the result of an event with probability 1 is 0, as we already know it will happen. Yes, there is information in the fact that we know, but there is no information to be extracted FROM THE RESULT of a deterministic probabilistic process. The same logic works continuously: the more average information there is to be gained from the results of a probabilistic process, the more surprise it has embedded in it.
4) By combining these results (2) and (3), we get that Entropy(“person x has access to site”) = average_information_contained_in_distribution(X).
5) But wait, what is this information you speak of? Surely it isn't a mathematical object. It isn't a number. So, how can we even calculate it? Well, P(X = yes) = 0.75 means that with probability = 0.75, we can know the result of both X = yes (happened or didn’t) and X = no (happened or didn’t). We can quantify this GAIN IN INFORMATION from a piece of news in the following way.
6) When we get the news that Amy indeed got access (p = 0.75), we get information worth 1.0 units. What are the units? Well, they are information. We don’t have a unit like Hz or Amps for it. So it’s just units of information (gained).
7) But why is the information worth exactly 1.0 units? Well, we need to have some measure for information. You can’t measure 1 cm without first deciding that “this distance is 1 cm”. So, because we need a measure, it makes sense that the information of the event that actually occurred would be the measuring stick of 1.0 units, because we are going to use it to work out whether this event was probable or not. The other events have relative magnitudes w.r.t. the event that happened. By using their relative magnitudes, we can start to reason about whether we should weigh this outcome as having high entropy = high surprise, when it happens.
8) Okay, now we know the information gained on the event X = yes. It is 1.0 units, as agreed. But we also now know that the event X = no has NOT happened. This is also information. It must be "worth" something. But how do we quantify it?
9) Quantifying the information gained from events that have NOT happened is simple now that we have a measurement stick for what 1 unit of information is. We can use this measuring stick to calculate how much an event that has NOT happened is “worth” based on ITS own probability.
10) We now know that p = 0.75 is worth 1 unit of information. Then, p = 0.25, corresponding to X = no, is only worth 0.25 / 0.75 = 0.33 units. In total, the news has given us 1.33 units of information.
11) Notice a pattern that explains WHY this approach of comparing event probabilities with the event that happened makes sense: if the event that happens has high probability, the unit of information has “higher standards”. It doesn’t accept any lower probability events as having high information.
12) More generally, the probability of the event that happens “controls” how large the information gained from all its counter events is. Similarly, if a low-probability event happens, it will make the information gained from this event larger by lowering the bar for a unit of information gained. This corresponds to the intuitive idea we humans have while reasoning about this: "a low probability event has happened, so it must mean that the information gained from this event is larger”, and vice versa. The ratios of the probabilities make the magic work, so it is no coincidence that the Shannon formula has them. Namely, 1 / p. (Please only consider the formula without the minus sign, which has log(1 / p) and not just log(p). The minus sign is just there to make the formula more concise. In reality, everything we are talking about makes much more sense without the minus sign.)
13) If you’ve made it this far, congrats. We are almost there. But just to make sure we are on the same page, let’s reiterate on what the number 1.33 is telling us. It is telling us the “number” of different "event units" we have gained information on after hearing the news, where the unit of one full event/outcome corresponds to the probability of the outcome that actually happened.
(Side quest: For simplicity, you can also use numbers that only output integers for event units. For example, let’s say our probability_distribution = [0.25, 0.25, 0.5]. If p = 0.25 happens, we gain info on 1 + 1 + 2 = 4 event units, where having p = 0.25 is the measure of 1 event unit. As discussed in (6), it can also be understood as the information unit, i.e. surprise unit, i.e. the reciprocal of the event’s probability, i.e. 1 / p.)
14) As we discussed earlier, Entropy(X) = average_information_contained_in(X) = expected_information_gained_from_knowing(X). This we have already calculated for X = yes (1.33). Do the same process for X = no and we get 4. (Left as an exercise to the reader. Note: if you can't do it, you haven't understood the most important point.)
15) Now we are ready to tie it all together. Notice the log(1 / p) in the entropy formula? That’s just our 1.33 and 4 with log wrappers (1 / 0.75 = 1.33 and 1 / 0.25 = 4).
(Side quest: The Log. The reason we use log, originally log2, is because The Great Shannon wanted everything to be measured in bits. That makes sense because he was a mathematician and engineer working on quantifying the information of (binary-encoded) messages. And to be fair, it is quite neat to have a binary interpretation for the total information in a system (the expected number of bits required to represent the result of its corresponding RV). The log also makes entropy additive, which is super useful in ML, even if we use nats or some other base, which obviously doesn’t have a bit interpretation. The bit interpretation doesn't really mean anything in the larger context of information anyway. It’s just one way to ENCODE information; it itself is not information, and we can only use the number of bits as a measuring stick for information. Any other measuring stick works just as well, at least in theory. For humans, bits are a friend.)
16) So, 1 / 0.75 = 1.33 and 1 / 0.25 = 4. That is exactly what we calculated with our intuitive method. That's because we used the exact same method as the formula does. The numerator is the measuring stick. It is the 1.0 units. To drive the point home, note that 1 / p = 1 + (1 - p) / p. The (1 - p) / p part is what calculates the information units gained from the leftover events (0.33 when the event that happened is X = yes), and 1 is the information gained from the event that actually happened.
17) Now we are at the finish line. We just need to quantify the EXPECTED information required to encode something in bits. This is easy, if you understand expected values, which I expect you do, because you’ve made it this far.
18) Entropy(X) = 0.75 * log2(1 / 0.75) + 0.25 * log2(1 / 0.25) = - 0.75 * log2(0.75) - 0.25 * log2(0.25) ≈ 0.811 bits. (A small numerical check of this is sketched right below.)
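If it helps, here's a minimal Python sketch of step 18, using nothing beyond the formula itself; the inputs are the access-to-site example and the side-quest distribution from above:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the probability-weighted average of log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.75, 0.25]))       # ~0.811 bits: the access-to-site example
print(entropy([0.25, 0.25, 0.5]))  # 1.5 bits: the side-quest distribution
print(entropy([0.5, 0.5]))         # 1.0 bit: a fair coin
```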
Thank you for your video. Keep it up!
Well done, Adian. I just found out (though I'm not surprised at all, in the Shannon sense 🤓) that you're doing a PhD at Cambridge. Congratulations! Best wishes for everything 🙂
Nice explanation. Keep up the good work, man!
god this is an incredible video thank you so much
Excellent. Short and sweet.
Fantastic job in explaining this,
this is great! i hope you will film more!
Thank you for the best explanation
It does not explain the most important part - how the formula for non-uniform distribution came about
I agree. The transition from a uniform distribution to a non-uniform distribution should be the most important and confusing part.
You CERTAINLY DESERVE MORE VIEWS 👏 👍👍👍👍
Blew my mind!
Great video, well explained!
Great job. Thank you
I don't quite understand the very last step. What does summing over all the probability outcomes give us?
That is the way we calculate expectation values. For a random variable X which takes values {x_i}, E(X) = sum_i P(x_i) * x_i
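To connect that to the entropy formula: the quantity being averaged is the surprisal log2(1/p_i) of each outcome, so H(X) = sum_i p_i * log2(1/p_i) is just the expected surprisal. For a fair coin, for instance, that's 0.5 * log2(2) + 0.5 * log2(2) = 1 bit.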
Intuitively, you sum over it to get some understanding of the average uncertainty.
Great video!
Simple and precise!
THE GOAT
I didn't quite understand the rationale at 4:36.
thank you codexchan
Uncertainty is a confusing way to describe this. For the lottery example, wouldn't you be very certain of the outcome?
It’s about the numbers, not whether you win the lottery or not.
Very good!
excelent video, thank you!
Nice video, what do you think about set shaping theory (information theory)?
Can someone explain the triplets part?
Thank you!
Wonderful explanation!!
Nice video
beautiful explanation :)
I appreciate your effort, but the video is quite confusing. For example, in the example about 8 football teams, you explain why 3 bits are required by flat out stating as a starting premise that 3 bits are required! It's a circular argument.
great video thank you
Why 1/p?????????
I'm not too sure, but I think it's just a bit-count expression for M possible outcomes. Considering there are M outcomes with equal probability p, we have p = 1/M -> 1/p = 1/(1/M) = M.
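Put concretely with the video's 8-team example (assuming all 8 teams are equally likely): p = 1/8, so 1/p = 8 and log2(1/p) = log2(8) = 3, which matches the 3 bits needed to label 8 equally likely outcomes.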
awesome!
You should have written H[U(x)] = logM / M to better relate it to the entropy explanation.
Thankkkk youuuuu.
awesome
Most of your understanding is good, but 4:50 is an unnecessary leap of logic. At a level this introductory, it's probably best to assume the number of outcomes is a power of 2 (2^n).
This seems so intuitive, why did it take so long to get "discovered"?
"so long" relative to what?
I expected something else, but it's also ok.
The example is even more difficult than the concept itself 🤦🏼♂️😃
Nice try by the way
this sucks, really unintuitive