Softmax Function Explained In Depth with 3D Visuals

  • Published Dec 21, 2024

COMMENTS • 108

  • @elliotwaite
    @elliotwaite  4 years ago +17

    CORRECTIONS:
    • At 5:22, instead of "more random" I probably should have said "less predictable," and instead of "less random, more deterministic," I probably should have said "more predictable," since it's only deterministic if one class has a value of 1 and all others 0, and other than that it represents randomness just with different distributions. Also, this is related to the context of when one is sampling from this distribution, for example when choosing the next character or word in a sentence when doing text generation.
    • At 16:51 I said, "It's basically the idea that as our networks are training, they're trying to push these logit values towards those Gaussian distributions," but this should only be taken loosely since the gradients don't point towards any single point. It would be more accurate to describe the direction of the gradient as being a weighted combination of both towards the subspace occupied by that Gaussian and towards the intersection of all the other subspaces, which can be seen by the slope of the gradient shown at 10:04.

  • @rembautimes8808
    @rembautimes8808 3 years ago +26

    After 30s you know that this lecture is worth its weight in gold.

  • @harry8
    @harry8 1 year ago +12

    This content is amazing, and the visualizations are so intuitive and easy to understand. Awesome video!

  • @tomgreene4329
    @tomgreene4329 1 year ago +1

    this is probably the best math video I have ever watched

  • @aureliobeiza3422
    @aureliobeiza3422 2 months ago

    What an explanation! All I had to do was watch the first 2 minutes and everything made sense to me.

  • @alidanish6303
    @alidanish6303 1 year ago +1

    I was searching for an intuitive explanation of Sigmoid and Softmax because they both have something in common, and for that something there are rarely any materials. And to my surprise, Elliot, you explained most of the nuances of the concept, if not all. I was hoping to find intuition on some other aspects of activation functions, but sadly you haven't made any videos for the last many years. I understand that due to lack of time, and probably because nothing comes out of this hobby of yours, you stopped making videos on technical content. But the way you have perceived these ideas has produced gold.

    • @elliotwaite
      @elliotwaite  1 year ago

      Thanks. I'm glad you liked the explanation. I may make more videos in the future. The main reason I stopped is because I want to build an AI company, and I felt like making these videos was slowing down my progress towards that goal. But I think once I get my business going a little more, it will actually be beneficial to get back into making videos. But we'll see. Thanks again for the kind comment.

  • @shashankdhananjaya9923
    @shashankdhananjaya9923 3 years ago +3

    Ultimate explanation. Thank you a million.

  • @ryzurrin
    @ryzurrin 2 years ago +5

    Amazing video, by far one of the best and most visual explanations I have seen yet. Thanks for making such a great video. Can't wait to watch more of your videos.

  • @RomainPuech
    @RomainPuech 1 year ago

    Best, most thorough explanation possible!

  • @Nico-rl4bo
    @Nico-rl4bo 3 years ago +1

    I started a neural nets from scratch tutorial and your videos are amazing for support/understanding.

  • @LifeKiT-i
    @LifeKiT-i 1 year ago

    I particularly love the way you explain it in a graphing calculator, which most YouTubers won't dare to use. The whole ML operation is math. Love your videos!! Please update us by uploading more videos like this!

    • @elliotwaite
      @elliotwaite  1 year ago

      Thanks, I appreciate the comment. I hope to make more videos eventually.

  • @chrisogonas
    @chrisogonas 2 years ago +1

    Awesome illustrations! Thanks

  • @bArda26
    @bArda26 4 years ago +6

    Wow, you make Desmos sing! I love Desmos as well. Such an amazing thing, an incredible visualization tool. Great video, please keep making more!

    • @elliotwaite
      @elliotwaite  4 years ago

      Agreed, Desmos is great! Fun to use and helpful in gaining insights. Thanks, bArda26! I'll keep the videos coming.

  • @opaelfpv
    @opaelfpv 2 months ago +1

    I don't understand why you initialize the last linear layer to zeros. Great video! You provide a really good explanation for a paper I wrote recently.

    • @elliotwaite
      @elliotwaite  2 months ago +1

      I'm not sure how much benefit zeroing the last layer actually adds, but the idea behind initializing the last linear layer's weights and biases to zero is that it will make all the inputs to the softmax be the same value (zero in this case). And if all the inputs to the softmax are the same value, then the initial guess for each output will be the same probability (for 2 classes it will output a probability of 0.5 for each class, and for 10 classes it will output a probability of 0.1 for each class, etc.). This essentially means that we initially assume we don't know anything about the data, which makes sense, and is a stable starting state for the neural network to start learning from. This also will make all the outputs initially be at the center of our output space where there will be a good amount of gradient for all values that will start separating them in the correct directions.
      If we didn't zero the last layer, the learning algorithm would still work, but it would start in a bit of a tangled state, where some outputs would be correct and some would be wrong, and the gradient would essentially have to untangle them. I think I read somewhere (but I can't remember) that someone tested zeroing the last layer and it tended to improve the convergence rate since it makes it so the network doesn't initially start in a tangled state, but I think it is only a minor optimization and either way the network will be able to learn a correct set of weights.
      Another thing to note is that the last linear layer is the only layer that we can safely zero out. If we zeroed out any of the earlier layers, then the input to the next linear layer would be all zeros, which would make it so that all the gradients for that layer's weights would also be zero, and it wouldn't be able to learn. The only reason we can zero out the last layer is that those values are fed to a softmax function instead of another linear layer, and the softmax function still provides gradients even if all the inputs are zeros.
      Thanks for the question. I probably should have explained this more in the video.
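      For reference, the zero-initialization described above might look roughly like this in PyTorch (a minimal sketch with made-up layer sizes; this is not code from the video's repository):

          import torch
          import torch.nn as nn

          model = nn.Sequential(
              nn.Linear(784, 128),
              nn.ReLU(),
              nn.Linear(128, 10),  # Last linear layer, feeds the softmax.
          )

          # Zero the last layer's weights and biases so every class initially
          # gets the same logit (0) and the softmax outputs a uniform 1/10 each.
          nn.init.zeros_(model[-1].weight)
          nn.init.zeros_(model[-1].bias)

          logits = model(torch.randn(2, 784))
          print(torch.softmax(logits, dim=-1))  # Every probability is 0.1 at initialization.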

    • @opaelfpv
      @opaelfpv 2 months ago +1

      @@elliotwaite thank you very much for your reply :) ! I think that starting in a somewhat tangled state may depend on the initialization of the various layers, the type of layers and activations, and above all on the distribution of the input data. I agree that initializing the last layer (classifier) with zeros allows the softmax function to be optimized better, placing all the values at the center, that is, at the equiprobability point of the probability simplex of the softmax function. If we remove the dependence on the distribution of the data and on the network (as in the Unconstrained Feature Model), most likely the logits would be optimized following the best direction that leads from the center of the probability simplex to the vertices. Do you agree?

    • @elliotwaite
      @elliotwaite  2 months ago

      @@opaelfpv I think I agree, if I understand you correctly. And that's a good point about the tangledness depending on those other factors. Also, if the training data is known ahead of time, in which case we could figure out the expected probability for each output class, maybe it would be even better to still zero the weights but to set the biases to match the expected distribution of the output classes, since that may be an even more neutral and stable starting state.
      I'm not sure I fully understand what you mean by removing the dependence on the distribution of the data and the network, and I'm not familiar with the unconstrained feature model yet, so I'd have to do some more research before being able to offer any feedback on your last question.
      By the way, if your paper is publicly available, on arXiv or something, I'd be interested in finding out more about it.

  • @buffmypickles
    @buffmypickles 4 years ago +9

    Well thought out and extremely informative. Thank you for making this.

  • @boutiquemaths
    @boutiquemaths 1 year ago

    Thanks so much for creating this thorough, amazing explainer exposing the hidden details of softmax visually. Amazing!

  • @eugene63218
    @eugene63218 3 years ago

    Man. This is the best video I've seen on this topic.

  • @stanshunpike369
    @stanshunpike369 2 years ago

    I just did a presentation with Desmos and I thought mine was good, but WOW, props to you. Your Desmos skills are insanely good, man.

    • @elliotwaite
      @elliotwaite  2 years ago

      Thanks! Yeah, I'm a big Desmos fan. It's a great learning tool.

  • @andreykorneychuk8053
    @andreykorneychuk8053 3 years ago

    The best video I have ever seen about softmax

  • @mic9657
    @mic9657 9 months ago

    This is great! You have put so much effort into this to make it easy to understand

  • @rutvikjaiswal4986
    @rutvikjaiswal4986 3 years ago +2

    Even before watching this video, I liked it, because I knew it was really going to change how I think about the softmax function and that I would learn a lot from it.

  • @avinrajan4815
    @avinrajan4815 1 year ago

    Very helpful! The visualisation was very good!! Thank you...

  • @EpicGamer-ux1tu
    @EpicGamer-ux1tu 1 year ago

    I love you man, this video is what I needed. Much love and best of luck mate.

  • @vatsal_gamit
    @vatsal_gamit 4 years ago +1

    Very well explained 👍

  • @EBCEu4
    @EBCEu4 4 years ago +5

    Awesome visuals! Thank you for the great work!

    • @elliotwaite
      @elliotwaite  4 years ago

      Thanks, Yuriy! Glad you liked it.

  • @wencesvm
    @wencesvm 4 years ago +3

    Amazing video, Elliot!!! Great insights, keep it up.

    • @elliotwaite
      @elliotwaite  4 years ago +1

      Thanks, Wenceslao! Comments like this make my day.

  • @catthrowvandisc5633
    @catthrowvandisc5633 4 years ago +2

    thank you for taking the effort to produce this, it is super helpful!

    • @elliotwaite
      @elliotwaite  4 years ago

      Thanks, catthrowvandisc! Glad you found it helpful.

  • @Mahesha999
    @Mahesha999 3 years ago +1

    I didn't get why exactly the smaller curve itself changes shape when you decrease/increase the green value at 2:25. I guess going through this Desmos graph will bring more clarity. Can you open source it? Or is it already shared?

    • @elliotwaite
      @elliotwaite  3 years ago +1

      The link to the publicly accessible Desmos graph is in the description. The bottom graph is a copy of the top graph, but scaled down vertically so that the heights of all the dots add up to 1. So when you increase the height of the green dot, that increases the total height of all the dots, and the bottom graph has to be scaled down even more to keep that total at 1, which is why the bottom graph squashes down more as the height of the green value increases.
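      A tiny numeric sketch of that scaling (assuming NumPy; the specific heights are made up):

          import numpy as np

          # Heights of the dots on the top (unnormalized) graph.
          top = np.array([1.0, 2.0, 4.0])

          # The bottom graph is the top graph divided by the total height,
          # so the dot heights sum to 1.
          print(top / top.sum())  # ≈ [0.143, 0.286, 0.571]

          # Raising the "green" value raises the total, so all the heights,
          # including the unchanged ones, get squashed down more.
          top[2] = 8.0
          print(top / top.sum())  # ≈ [0.091, 0.182, 0.727]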

  • @anton_98
    @anton_98 3 years ago +1

    Adding a shift to all classes changes the scores. For example, we have 3 classes with values 5, 3, and 2. So we can say: 5/10 = 3/10 + 2/10. After adding 1 to the value of all classes: 6/13 != 4/13 + 3/13. So the addition must change the scores.
    It is also easy to see if we add 1000, for example. Then the values will be 1005, 1003, and 1002, and the score of each roughly becomes 0.3333...

    • @anton_98
      @anton_98 3 years ago

      It was a comment on this part of the video: ua-cam.com/video/ytbYRIN0N4g/v-deo.html.

    • @elliotwaite
      @elliotwaite  3 years ago

      The reason an addition to all inputs doesn't change their outputs is because instead of using the raw input values in the fractions, you exponentiate the inputs first before using them in the fraction:
      e^2 / (e^2 + e^3 + e^5) = e^1002 / (e^1002 + e^1003 + e^1005)
      Thanks for the question.
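      A quick numeric check of that identity (a small NumPy sketch; it subtracts the max before exponentiating, which uses this same shift invariance to keep the large inputs from overflowing):

          import numpy as np

          def softmax(x):
              # Subtracting the max doesn't change the output (shift invariance),
              # but it keeps the exponentials from overflowing.
              z = np.exp(x - np.max(x))
              return z / z.sum()

          print(softmax(np.array([2.0, 3.0, 5.0])))           # ≈ [0.042, 0.114, 0.844]
          print(softmax(np.array([1002.0, 1003.0, 1005.0])))  # Same output.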

    • @anton_98
      @anton_98 3 years ago

      @@elliotwaite Thank you for the explanation.

  • @patite3103
    @patite3103 4 years ago +1

    What you've done is just amazing!!

  • @alroy01g
    @alroy01g 3 years ago +1

    Great video, many thanks! One question: around 5:00, you mentioned that bringing the probabilities together makes them more random, and separating them makes them more deterministic. I can see that graphically, but I can't reconcile it with the definition of variance, which says that the more spread out the points are, the higher the variance. Maybe I'm mixing things up here??

    • @elliotwaite
      @elliotwaite  3 years ago +3

      Thanks, glad you liked the video. About your question, it is a bit counterintuitive at first, but I think the key to understanding how having more similar probabilities can lead to a more random outcome is remembering that the probabilities we are talking about are the probabilities for different outcomes. And when I say "random," I'm using it a bit vaguely and not in the same way as variance, which I explain below.
      So for example, if we were rolling a six sided die, we could think of plotting the bar heights of the probabilities of getting each of the six sides. If it were a weighted die that almost always rolled a 5, then the bar for 5 would be large and the others would be small, and we might say the die was not very random. In this case, the distribution of the value we'd get from rolling the die would also have a low variance with most of the distribution mass being for 5.
      On the other hand, for a fair die, the probabilities for the sides would almost all be equal with their bar heights all being about the same, and we might say the die was more random. In this case the distribution of the value we'd get from rolling the die would also have a higher variance with the probability mass distributed more evenly across the outcomes.
      To see the difference in my usage of the word "random" and the precise term "variance," let's consider a weighted die that rolls a 1 half the time and a 6 the other half. We might say that this die was less random than a fair die, but the distribution of the outcome values would actually have a higher variance than a fair die. And in fact it would be the way to weight the die to have the highest variance possible.
      So when I say "random" I just mean that we are less certain about which class will be the outcome, but if those classes correspond to specific values in a field (a 1d, 2d, 3d, or higher dimensional space), then the variance of the distribution of those values could be higher or lower depending on which values the classes correspond to.
      I hope that helps.
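      The die examples above can be made concrete by comparing entropy (one way to quantify that loose sense of "random") with the variance of the rolled value (a small NumPy sketch; the exact weightings are made up):

          import numpy as np

          faces = np.arange(1, 7)

          def entropy(p):
              # Shannon entropy in bits; skip zero-probability faces.
              p = p[p > 0]
              return -(p * np.log2(p)).sum()

          def variance(p):
              mean = (faces * p).sum()
              return ((faces - mean) ** 2 * p).sum()

          dice = {
              "fair": np.full(6, 1 / 6),
              "mostly 5": np.array([0.02, 0.02, 0.02, 0.02, 0.90, 0.02]),
              "1 or 6": np.array([0.5, 0.0, 0.0, 0.0, 0.0, 0.5]),
          }
          for name, p in dice.items():
              print(name, round(entropy(p), 2), round(variance(p), 2))

          # The "1 or 6" die has lower entropy than the fair die (1 bit vs. ~2.58 bits),
          # so it is less "random" in that sense, yet its variance (6.25) is the highest,
          # while the fair die's variance is only ~2.92.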

    • @alroy01g
      @alroy01g 3 years ago +1

      @@elliotwaite Ah, yes you're right, those are different outcomes, appreciate the detailed explanation, thanks for your time 👍

  • @tato_good
    @tato_good 3 years ago

    very good explanation

  • @mohammadkashif750
    @mohammadkashif750 3 years ago +1

    very helpful, content was amazing😍

  • @TheMotivationFrontier
    @TheMotivationFrontier 1 month ago

    Great content. Recommended.

  • @nikhilshingadiya7798
    @nikhilshingadiya7798 3 years ago

    I have no words for you !!! Awesome

  • @toniiicarbonelll287
    @toniiicarbonelll287 3 years ago +1

    So useful! Thanks a lot

  • @arseniymaryin738
    @arseniymaryin738 3 years ago +1

    Hi, can I see your neural network training (at 12:12) in some environment such as a Jupyter notebook?

    • @elliotwaite
      @elliotwaite  3 years ago

      I don't have a notebook of it, but you can get the code here: github.com/elliotwaite/softmax-logit-paths

  • @snehotoshbanerjee1938
    @snehotoshbanerjee1938 4 years ago +1

    Superb, excellent, Elliot!!! BTW, which software did you use?

    • @elliotwaite
      @elliotwaite  4 years ago

      Thanks, Snehotosh! For the 2D graph, I used Desmos, and for the 3D graphs, I used GeoGebra. They both have web-based versions and GeoGebra also has a native version. I posted links in the description to all the interactive graphs shown in the video if you want to get some ideas for how to use them.

  • @HeitorPastorelli
    @HeitorPastorelli 1 year ago

    awesome video, thank you

    • @elliotwaite
      @elliotwaite  1 year ago

      :) thanks for the comment. I'm glad you liked it.

  • @horizon9863
    @horizon9863 4 years ago +1

    Please keep doing it!

    • @elliotwaite
      @elliotwaite  4 years ago

      Thanks, Cankun! I plan to make more videos soon.

  • @scotth.hawley1560
    @scotth.hawley1560 4 years ago +2

    Great video. The point where you take the log of all these functions to make the loss seems like it could use a bit more motivation: The log makes sense for exponential-like sigmoids and if you're requiring a probabilistic interpretation, but perhaps there is some other function more appropriate for algebraic sigmoids? (Although I can't think of any that would produce the "nice" properties that you get from cross-entropy loss & logistic sigmoid, apart from maybe a piecewise scheme like "Generalized smooth hinge loss".)

    • @elliotwaite
      @elliotwaite  4 years ago

      Good point. I didn't go into the reasoning of why I chose the negative log-likelihood loss other than mentioning that it is the standard used for softmax outputs that are interpreted as probabilities. I wasn't sure of a concise way to explain it, so I thought I would just take it as a given without going into the reasoning. But maybe I could have at least pointed viewers to another resource that they could read if they wanted to better understand why it is typically used and how other loss functions could also potentially be used. Thanks for the feedback.
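      For readers who want the definition being taken as a given here: the negative log-likelihood loss for a softmax output is just the negative log of the probability assigned to the correct class (a minimal NumPy sketch with made-up logits):

          import numpy as np

          def softmax(x):
              z = np.exp(x - np.max(x))
              return z / z.sum()

          logits = np.array([2.0, -1.0, 0.5])  # Raw network outputs for 3 classes.
          target = 0                           # Index of the correct class.

          probs = softmax(logits)
          loss = -np.log(probs[target])
          print(loss)  # ≈ 0.24; small because the correct class gets high probability.

      In PyTorch, this log-softmax plus negative log-likelihood combination is what nn.CrossEntropyLoss computes directly from the raw logits.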

  • @adityaghosh8601
    @adityaghosh8601 3 years ago

    Thank you so much.

  • @yoggi1222
    @yoggi1222 3 years ago

    Excellent !

  • @ناصرالبدراني-ش9س
    @ناصرالبدراني-ش9س 2 years ago

    great stuff

  • @zyzhang1130
    @zyzhang1130 3 years ago

    very helpful!

  • @nagakushalageeru135
    @nagakushalageeru135 1 month ago

    Awesome !

    • @elliotwaite
      @elliotwaite  1 month ago

      @@nagakushalageeru135 Thanks!

  • @melkenhoning158
    @melkenhoning158 3 years ago

    THANK YOU

  • @my_master55
    @my_master55 2 years ago

    Thanks Elliot, great video! 👍
    Could you please clarify why the output space is evenly distributed among the classes? Shouldn't there be more space for the dominant class?
    Or do you consider all the classes to be equally dominant?
    Thanks 🙌

    • @elliotwaite
      @elliotwaite  2 years ago

      So the visuals are actually showing all possible cases in one graph, both what the output would be when one class is dominant and what the output would be when the classes are equivalent. The spaces where one color is above the others are where that class is dominant, and the place in the middle where the curves cross and are all equal height is what the output would be if the classes were equivalent. And the spaces are all the same size because the softmax function doesn't give any preference to any of the inputs before knowing their values. Let me know if it is still unclear.

    • @my_master55
      @my_master55 2 years ago

      @@elliotwaite oh, okay, thank you 👍
      So what we see is just "the middle of the decision" and therefore it only looks like the classes are equal.
      Which means, this does not imply that the classes are like 50/50 everywhere in case of 2 classes, but at the intersection it indeed looks like they are 50/50.
      Hope I got it right 😊

    • @elliotwaite
      @elliotwaite  2 years ago

      @@my_master55 I’m not sure what you mean by "the middle of the decision." But yes, at the intersection they are 50/50 in the 2 class example and 1/3, 1/3, 1/3 in the 3 class example.

  • @TylerMatthewHarris
    @TylerMatthewHarris 4 years ago +1

    Thanks!

  • @moorboor4977
    @moorboor4977 1 month ago

    Impressive!

  • @zhiyingwang1234
    @zhiyingwang1234 1 year ago

    The comment (5:06) about the temperature parameter is not accurate, when temp parameter

    • @elliotwaite
      @elliotwaite  1 year ago

      I'm not sure I understand your comment. Can you clarify what you mean by "theta values"? In the video I was trying to say that increasing the temperature flattens the distribution (making it more uniform), and that lowering the temperature pushes the arg max up and the others down (making it more like a spike for only the arg max probability). Are you saying that you think it is the other way around, or was there something else that you thought was inaccurate?
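      A small sketch of the temperature behavior described here (assuming NumPy; the logits are made up):

          import numpy as np

          def softmax_with_temperature(logits, t):
              # Dividing the logits by the temperature before exponentiating.
              z = np.exp((logits - np.max(logits)) / t)
              return z / z.sum()

          logits = np.array([2.0, 1.0, 0.5])

          print(softmax_with_temperature(logits, 10.0))  # High T: close to uniform.
          print(softmax_with_temperature(logits, 1.0))   # T = 1: the ordinary softmax.
          print(softmax_with_temperature(logits, 0.1))   # Low T: a spike on the arg max.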

  • @fibonacci112358s
    @fibonacci112358s 1 year ago

    "To imagine 23-dimensional space, I first consider n-dimensional space, and then I set n = 23"

    • @elliotwaite
      @elliotwaite  1 year ago

      Haha. For real though, that probably is the best way to do it.

  • @lenam8048
    @lenam8048 1 year ago

    why you have so few subscribers :

    • @elliotwaite
      @elliotwaite  1 year ago

      I know 😥. I think I need to make more videos.

  • @nexyboye5111
    @nexyboye5111 3 months ago

    wow this is real shit

  • @DiogoSanti
    @DiogoSanti 4 years ago +2

    Lol, "I am not comfortable with geometry"... If you aren't, imagine marginal people... Haha

    • @elliotwaite
      @elliotwaite  4 years ago +1

      😁 Thanks, I try. The higher dimensional stuff is tough, but I've slowly been getting better at finding ways to reason about it. But I'd imagine there are people, perhaps mathematicians who work in higher dimensions regularly, that have much more effective mental tools than I do.

    • @DiogoSanti
      @DiogoSanti 4 years ago +1

      @@elliotwaite Maybe but higher dimensions are hypothetical as we are trapped in 3 dimensions ... the rest is just imagination hehe

  • @joeybasile545
    @joeybasile545 6 months ago

    Great video. Thank you