CORRECTIONS:
• At 5:22, instead of "more random" I probably should have said "less predictable," and instead of "less random, more deterministic," I probably should have said "more predictable," since it's only deterministic if one class has a value of 1 and all others 0, and otherwise it just represents randomness with different distributions. Also, this is related to the context of sampling from the distribution, for example when choosing the next character or word in a sentence during text generation (there's a small sampling sketch below).
• At 16:51 I said, "It's basically the idea that as our networks are training, they're trying to push these logit values towards those Gaussian distributions," but this should only be taken loosely since the gradients don't point towards any single point. It would be more accurate to describe the direction of the gradient as being a weighted combination of both towards the subspace occupied by that Gaussian and towards the intersection of all the other subspaces, which can be seen by the slope of the gradient shown at 10:04.
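Here's a small sketch of that sampling context, just to make it concrete (the tokens and probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token probabilities that could have come from a softmax.
tokens = np.array(["the", "a", "an"])
probs = np.array([0.7, 0.2, 0.1])

# A flatter distribution makes the sampled choice less predictable, a peakier
# one makes it more predictable (it's only deterministic at [1, 0, 0]).
print(rng.choice(tokens, size=10, p=probs))
```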
Thank you!
After 30s you know that this lecture is worth its weight in gold.
Thanks!
This content is amazing, and the visualizations are so intuitive and easy-to-understand. Awesome video!
Thanks, Harry! I appreciate it.
this is probably the best math video I have ever watched
Thanks!
What an explanation! All I had to do was watch the first 2 minutes and everything made sense to me.
Thanks!
I was searching for an intuitive explanation of sigmoid and softmax because they both have something in common, and for that something there are rarely any materials. And to my surprise, Elliot, you explained most of the nuances of the concept, if not all of them. I was hoping to find intuition on some other aspects of activation functions, but sadly you haven't made any videos for the last many years. I understand that due to lack of time, and probably because nothing comes out of this hobby of yours, you stopped making videos on technical content. But the way you have perceived these ideas has produced gold.
Thanks. I'm glad you liked the explanation. I may make more videos in the future. The main reason I stopped is because I want to build an AI company, and I felt like making these videos was slowing down my progress towards that goal. But I think once I get my business going a little more, it will actually be beneficial to get back into making videos. But we'll see. Thanks again for the kind comment.
Ultimate explanation. Thank you a million.
Thanks!
Amazing video, by far one of the best and most visual explanations I have yet to see. Thanks for making such a great video. Can't wait to watch more of your videos.
Thanks!
Best, most thorough explanation possible!
Thank you!
I started a neural nets from scratch tutorial and your videos are amazing for support/understanding.
Thanks!
I particularly love the way you explain it in a graphing calculator, which most YouTubers won't dare to use. The whole ML operation is math. Love your videos!! Please update us by uploading more videos like this!
Thanks, I appreciate the comment. I hope to make more videos eventually.
Awesome illustrations! Thanks
Wow, you make Desmos sing! I love Desmos as well. Such an amazing thing, an incredible visualization tool. Great video, please keep making more!
Agreed, Desmos is great! Fun to use and helpful in gaining insights. Thanks, bArda26! I'll keep the videos coming.
I don't understand why you initialize the last linear layer to zeros. Great video! You provide a really good explanation for a paper I wrote recently.
I'm not sure how much benefit zeroing the last layer actually adds, but the idea behind initializing the last linear layer's weights and biases to zero is that it will make all the inputs to the softmax be the same value (zero in this case). And if all the inputs to the softmax are the same value, then the initial guess for each output will be the same probability (for 2 classes it will output a probability of 0.5 for each class, and for 10 classes it will output a probability of 0.1 for each class, etc.). This essentially means that we initially assume we don't know anything about the data, which makes sense, and is a stable starting state for the neural network to start learning from. This also will make all the outputs initially be at the center of our output space where there will be a good amount of gradient for all values that will start separating them in the correct directions.
If we didn't zero the last layer, the learning algorithm would still work, but it would start in a bit of a tangled state, where some outputs would be correct and some would be wrong, and the gradient would essentially have to untangle them. I think I read somewhere (but I can't remember) that someone tested zeroing the last layer and it tended to improve the convergence rate since it makes it so the network doesn't initially start in a tangled state, but I think it is only a minor optimization and either way the network will be able to learn a correct set of weights.
Another thing to note is that the last linear layer is the only layer that we can safely zero out. If we zeroed out any of the earlier layers, then the input to the next linear layer would be all zeros, which would make it so that all the gradients for that layer's weights would also be zero, and it wouldn't be able to learn. The only reason we can zero out the last layer is that those values are fed to a softmax function instead of another linear layer, and the softmax function still provides gradients even if all the inputs are zeros.
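If it helps, here's a minimal PyTorch sketch of the kind of zeroing I'm describing (the model and layer sizes are just made up for illustration):

```python
import torch
import torch.nn as nn

# A toy classifier; the sizes are arbitrary, just for illustration.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # The last linear layer, whose outputs feed the softmax.
)

# Zero out only the last linear layer's weights and biases.
nn.init.zeros_(model[-1].weight)
nn.init.zeros_(model[-1].bias)

# All logits are now 0, so the softmax outputs a uniform 1/10 for every class.
x = torch.randn(4, 784)
print(torch.softmax(model(x), dim=-1))  # Every entry is 0.1.
```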
Thanks for the question. I probably should have explained this more in the video.
@@elliotwaite thank you very much for your reply :) ! I think that starting in a somewhat tangled state may depend on the initialization of the various layers, the types of layers and activations, and above all on the distribution of the input data. I agree that initializing the last layer (the classifier) with zeros allows us to better optimize the softmax function, placing all the values in the center, and therefore at the equiprobability point of the probability simplex of the softmax function. If we remove the dependence on the distribution of the data and the network (as in the Unconstrained Feature Model), most likely the logits would be optimized following the best direction that leads from the center of the probability simplex to the vertices. Do you agree?
@@opaelfpv I think I agree, if I understand you correctly. And that's a good point about the tangledness depending on those other factors. Also, if the training data is known ahead of time, so that we could figure out the expected probability for each output class, maybe it would be even better to still zero the weights but set the biases to match the expected distribution of the output classes, since that may be an even more neutral and stable starting state.
I'm not sure I fully understand what you mean by removing the dependence on the distribution of the data and the network, and I'm not familiar with the Unconstrained Feature Model yet, so I'd have to do some more research before being able to offer any feedback on your last question.
By the way, if your paper is publicly available, on arXiv or something, I'd be interested in finding out more about it.
Well thought out and extremely informative. Thank you for making this.
Thanks!
Thanks so much for creating this thorough amazing explainer exposing the hidden details of softmax visually. Amazing!
:) I'm glad you liked it.
Man. This is the best video I've seen on this topic.
Thanks!
I just did a presentation with Desmos and I thought mine was good, but WOW props to you. Ur Desmos skills are insanely good man
Thanks! Yeah, I'm a big Desmos fan. It's a great learning tool.
The best video I have ever seen about softmax
Thanks!
This is great! You have put so much effort into this to make it easy to understand
Thanks!
Even without watching this video, I liked it! Because I know it's really going to change the way I think about the softmax function, and I'll learn a lot through this video.
Very helpful! The visualisation was very good!! Thank you...
Thanks! Glad you liked it.
I love you man, this video is what I needed. Much love and best of luck mate.
Thanks!
Very well explained 👍
Thanks!
Awesome visuals! Thank you for a great work!
Thanks, Yuriy! Glad you liked it.
Amazing video Elliot!!! Great insights, keep it up.
Thanks, Wenceslao! Comments like this make my day.
thank you for taking the effort to produce this, it is super helpful!
Thanks, catthrowvandisc! Glad you found it helpful.
I didn't get why exactly the smaller curve itself changes shape when you decrease/increase the green value at 2:25. I guess going through the Desmos graph will bring more clarity. Can you open source it? Or is it already shared?
The link to the publicly accessible Desmos graph is in the description. The bottom graph is a copy of the top graph, but scaled down vertically so that if you add up the heights of all the dots, that total height has a height of 1. So when you increase the height of the green dot, that increases the total height of all the dots, so the bottom graph has to be scaled down even more to make that total height stay at a height of 1, which is why the bottom graph squashes down more as the height of the green value increases.
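Here's the same rescaling with some made-up dot heights, in case the numbers make it clearer:

```python
import numpy as np

# Hypothetical dot heights from the top graph.
heights = np.array([2.0, 3.0, 5.0])
print(heights / heights.sum())  # [0.2, 0.3, 0.5]

# Raising one of the dots (standing in for the green one) raises the total
# height, so every dot in the bottom graph gets squashed down more to keep
# the total at 1.
heights[2] = 15.0
print(heights / heights.sum())  # [0.1, 0.15, 0.75]
```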
An addition (shift) to all classes changes the scores. For example, say we have 3 classes with values 5, 3, and 2. So we can say: 5/10 = 3/10 + 2/10. After adding 1 to the value of every class: 6/13 != 4/13 + 3/13. So the addition must change the scores.
It is also easy to see if we add 1000, for example. Then the values would be 1005, 1003, and 1002, and the score of each would roughly become 0.3333...
It was a comment on this part of the video: ua-cam.com/video/ytbYRIN0N4g/v-deo.html.
The reason an addition to all inputs doesn't change their outputs is that instead of using the raw input values in the fractions, you exponentiate the inputs first before using them in the fraction:
e^2 / (e^2 + e^3 + e^5) = e^1002 / (e^1002 + e^1003 + e^1005)
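And here's a quick numerical check of that shift invariance (a minimal NumPy sketch):

```python
import numpy as np

def softmax(x):
    # Subtracting the max before exponentiating is the standard numerical
    # stability trick, and it works because of this same shift invariance.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([2.0, 3.0, 5.0])
print(softmax(x))           # ≈ [0.042 0.114 0.844]
print(softmax(x + 1000.0))  # Same values: the shift cancels out in the ratio.
```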
Thanks for the question.
@@elliotwaite Thank you for explanation.
What you've done is just amazing!!
Thanks!
Great video, many thanks! One question: around 5:00, you mentioned that bringing the probabilities closer together makes them more random, and separating them makes them more deterministic. I can see that graphically, but I can't reconcile it with the definition of variance, which says that the more spread out the points are, the higher the variance. Maybe I'm mixing things up here??
Thanks, glad you liked the video. About your question, it is a bit counterintuitive at first, but I think the key to understanding how having more similar probabilities can lead to a more random outcome is remembering that the probabilities we are talking about are the probabilities for different outcomes. And when I say "random," I'm using it a bit vaguely and not in the same way as variance, which I explain below.
So for example, if we were rolling a six sided die, we could think of plotting the bar heights of the probabilities of getting each of the six sides. If it were a weighted die that almost always rolled a 5, then the bar for 5 would be large and the others would be small, and we might say the die was not very random. In this case, the distribution of the value we'd get from rolling the die would also have a low variance with most of the distribution mass being for 5.
On the other hand, for a fair die, the probabilities for the sides would almost all be equal with their bar heights all being about the same, and we might say the die was more random. In this case the distribution of the value we'd get from rolling the die would also have a higher variance with the probability mass distributed more evenly across the outcomes.
To see the difference between my usage of the word "random" and the precise term "variance," let's consider a weighted die that rolls a 1 half the time and a 6 the other half. We might say that this die is less random than a fair die, but the distribution of the outcome values would actually have a higher variance than that of a fair die. In fact, this is the weighting that gives the die the highest variance possible.
So when I say "random" I just mean that we are less certain about which class will be the outcome, but if those classes correspond to specific values in a field (a 1d, 2d, 3d, or higher dimensional space), then the variance of the distribution of those values could be higher or lower depending on which values the classes correspond to.
I hope that helps.
@@elliotwaite Ah, yes you're right, those are different outcomes, appreciate the detailed explanation, thanks for your time 👍
very good explanation
very helpful, content was amazing😍
Great content. Recommended.
Thank you!
I have no words for you !!! Awesome
So useful! Thanks a lot
Hi, can I see your neural network training (at 12:12) in some environment such as a Jupyter notebook?
I don't have a notebook of it, but you can get the code here: github.com/elliotwaite/softmax-logit-paths
Superb, excellent, Elliot!!! BTW, which software did you use?
Thanks, Snehotosh! For the 2D graph, I used Desmos, and for the 3D graphs, I used GeoGebra. They both have web-based versions and GeoGebra also has a native version. I posted links in the description to all the interactive graphs shown in the video if you want to get some ideas for how to use them.
awesome video, thank you
:) thanks for the comment. I'm glad you liked it.
Please keep doing it!
Thanks, Cankun! I plan to make more videos soon.
Great video. The point where you take the log of all these functions to make the loss seems like it could use a bit more motivation: The log makes sense for exponential-like sigmoids and if you're requiring a probabilistic interpretation, but perhaps there is some other function more appropriate for algebraic sigmoids? (Although I can't think of any that would produce the "nice" properties that you get from cross-entropy loss & logistic sigmoid, apart from maybe a piecewise scheme like "Generalized smooth hinge loss".)
Good point. I didn't go into the reasoning of why I chose the negative log-likelihood loss other than mentioning that it is the standard used for softmax outputs that are interpreted as probabilities. I wasn't sure of a concise way to explain it, so I thought I would just take it as a given without going into the reasoning. But maybe I could have at least pointed viewers to another resource that they could read if they wanted to better understand why it is typically used and how other loss functions could also potentially be used. Thanks for the feedback.
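For anyone reading along, here's a minimal PyTorch sketch of the standard pairing I was referring to (softmax plus negative log-likelihood, i.e., cross-entropy), with made-up logits; other losses could be swapped in here:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 3.0, 5.0]])  # made-up logits for one example
target = torch.tensor([2])                # index of the correct class

# These two lines compute the same loss: cross-entropy fuses the
# log-softmax and the negative log-likelihood into one stable step.
print(F.cross_entropy(logits, target))
print(F.nll_loss(F.log_softmax(logits, dim=-1), target))
```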
Thank you so much.
Excellent !
great stuff
very helpful!
Awesome !
@@nagakushalageeru135 Thanks!
THANK YOU
Thanks Elliot, great video! 👍
Could you please clarify why the output space is evenly distributed among classes? Shouldn't there be more space for the dominant class?
Or you consider that all the classes are equivalently dominant?
Thanks 🙌
So the visuals are actually showing all possible cases in one graph, both what the output would be when one class is dominant and what the output would be when the classes are equivalent. The spaces where one color is above the others is where that class is dominant and the place in the middle where the curves cross and are all equal height is what the output would be if the classes were equivalent. And the spaces are all the same size because the softmax function doesn’t give any preference to any of the inputs before knowing their values. Let me know if it is still unclear.
@@elliotwaite oh, okay, thank you 👍
So what we see is just "the middle of the decision" and therefore it only looks like the classes are equal.
Which means, this does not imply that the classes are like 50/50 everywhere in case of 2 classes, but at the intersection it indeed looks like they are 50/50.
Hope I got it right 😊
@@my_master55 I’m not sure what you mean by "the middle of the decision." But yes, at the intersection they are 50/50 in the 2 class example and 1/3, 1/3, 1/3 in the 3 class example.
Thanks!
Impressive!
@@moorboor4977 thanks!
The comment (5:06) about the temperature parameter is not accurate, when the temp parameter
I'm not sure I understand your comment. Can you clarify what you mean by "theta values"? In the video I was trying to say that increasing the temperature flattens the distribution (making it more uniform), and that lowering the temperature pushes the arg max up and the others down (making it more like a spike on only the arg max probability). Are you saying that you think it is the other way around, or was there something else that you thought was inaccurate?
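For reference, here's roughly what I mean, as a minimal NumPy sketch with made-up logits:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    z = np.asarray(logits) / temperature
    z = np.exp(z - z.max())  # subtract the max for numerical stability
    return z / z.sum()

logits = [2.0, 3.0, 5.0]
print(softmax_with_temperature(logits, 1.0))   # ≈ [0.04 0.11 0.84]
print(softmax_with_temperature(logits, 10.0))  # flatter, closer to uniform
print(softmax_with_temperature(logits, 0.1))   # nearly a spike on the arg max
```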
"To imagine 23-dimensional space, I first consider n-dimensional space, and then I set n = 23"
Haha. For real though, that probably is the best way to do it.
Why do you have so few subscribers :
I know 😥. I think I need to make more videos.
wow this is real shit
@@nexyboye5111 thanks
Lol "I am not confortable with geometry"... If you aren't, imagine marginal people... Haha
😁 Thanks, I try. The higher dimensional stuff is tough, but I've slowly been getting better at finding ways to reason about it. But I'd imagine there are people, perhaps mathematicians who work in higher dimensions regularly, that have much more effective mental tools than I do.
@@elliotwaite Maybe, but higher dimensions are hypothetical since we are trapped in 3 dimensions... the rest is just imagination hehe
Great video. Thank you