Why Do We Use the Sigmoid Function for Binary Classification?

  • Published Jan 8, 2025

COMMENTS • 85

  • @danielwie8472 3 years ago +8

    You are great! Thanks for making this so much easier to understand. I had a hard time understanding this while I was studying 6 years ago, but with all these great visualizations it all makes so much sense!

    • @elliotwaite 3 years ago +1

      Thanks, Daniel! I'm glad the video and visualizations helped.

  • @TerragonDE 4 years ago +30

    I like the visualization of the two normal distributions with the sigmoid function, very cool, never seen before :-)

    • @kingsleyibeh1214 2 months ago +1

      THANK You so much Prof. I am grateful for the explanation you gave.

  • @neuromancer13 5 months ago +1

    Another reason is that the 1st derivative of the sigmoid function is a function of itself, making calculation of weight corrections in back-prop computationally efficient. Great video BTW!

    • @elliotwaite 5 months ago +1

      Good point. At around 3:40, I mentioned the importance of considering the computational cost of the different functions. However, I neglected to mention the additional cost of also needing to compute the derivatives of these functions during back-prop. Thanks for pointing that out.
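
The point in this exchange, that the sigmoid's derivative can be written in terms of the sigmoid's own output, can be sketched in a few lines. This is a minimal illustration, not code from the video; the helper names are hypothetical, and the finite-difference check is only there to confirm the identity s'(x) = s(x) * (1 - s(x)).

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad_from_output(s: float) -> float:
    """Derivative of the sigmoid, written in terms of its own output s."""
    return s * (1.0 - s)


def numerical_grad(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of f'(x), used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-3.0, 0.0, 2.5):
    s = sigmoid(x)  # forward-pass value, reused for the backward pass
    print(x, sigmoid_grad_from_output(s), numerical_grad(sigmoid, x))
```

Because the backward pass only needs the value `s` already computed in the forward pass, no extra exponential has to be evaluated.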

  • @cornelisderuiter4279 3 years ago +2

    This has to be the best explanation on the web for the SF.

  • @carnright 4 years ago +6

    Thanks for going over that with so much visual detail! I just heard about the swish activation function, would love to see your take on it!

  • @aryang5511 2 years ago +1

    Excellent explanation, it really helped me understand the concept! I honestly don't think it could've been explained better!

  • @shvprkatta 3 years ago +2

    Thank you, Elliot! This is brilliant content. It helped me understand this in a more intuitive way.

    • @elliotwaite 3 years ago +1

      Thanks, glad you found it helpful.

  • @ilkero1067 7 months ago

    Dear Elliot please do more ML videos, you are giving the most intuitive explanations, love your content

    • @elliotwaite 7 months ago

      Thanks! I'm glad you like my explanations. I may make more videos eventually, but recently I've been busy with a project I'm working on.

  • @satyakamshailesh184 1 month ago

    This blew my mind. Today I truly understood why sigmoid was chosen.

    • @elliotwaite 1 month ago

      @satyakamshailesh184 🙌🏻

  • @jennyjumpjump 2 years ago

    Thank you! I'm learning neural networks through self-study. This is the answer to the question I had.

    • @elliotwaite 2 years ago

      Nice, I learned through self study as well. I'm glad I was able to answer your question.

  • @CalculationConsulting 4 years ago +36

    Fun fact: the Sigmoid function was first introduced by Jack Cowan to model experiments on real neurons.

    • @elliotwaite 4 years ago +4

      Chuck! The flip master! Good to hear from you, man. Thanks for the fun fact. I wasn't sure where to verify that Jack was the first person to use the sigmoid function in the context of neural networks, but this looks like one of the early papers where he mentions it (from 1972): www.cell.com/biophysj/pdf/S0006-3495(72)86068-5.pdf

  • @Kikikuku2 7 months ago

    I'm in love with your keyboard, and thanks for the video.

  • @taiwanSmart 4 years ago +2

    many many thanks, I've been thinking about the reason for quite some time...

  • @HeduAI 3 years ago +3

    The branches example was so cool! Felt mentally transported to a foggy forest so as to observe the dripping dew drops.

    • @elliotwaite 3 years ago +2

      Haha, nice. I had a little fun with that part. I'm glad you liked it.

  • @Blure 4 years ago +1

    Quality content. Thanks!

  • @kbostr 1 year ago

    great videos on the sigmoid function

  • @kennymaccaferri2602 4 years ago +1

    Thanks for this comparison of the different functions and for explaining why we use the sigmoid function. Brilliant content, glad I found it. One small issue (well, not so small really, since I could have missed the great content because of it): to native UK English ears, I had no idea what you were talking about when you mentioned the "Lawssssed" function, and it seemed really important. However, after I persevered and FINALLY looked at the legend on full screen, I realised that what you called the "lawsssed" function is what we call the "Lost Function" (it has a T at the end here in the UK). Thanks again. But you might consider putting on subtitles for UK viewers and speakers of other English accents who pronounce ST as if there is a T in it.

    • @elliotwaite 4 years ago +1

      Ah, I was actually saying "loss function," referring to the function used to compute the loss that is used to perform gradient descent in the context of machine learning. I checked the auto-generated English captions and it looks like they are currently correct, transcribing it as "loss function," unless they are wrong in another part of the video that I didn't check.

    • @kennymaccaferri2602 4 years ago

      @elliotwaite Dear Elliot, you really should not bother replying to (partially) deaf Scotsmen. The problem was on my end, not yours. I've watched your video three times today, and it has given me an insight into this sigmoid function like no other. The actual trouble is that I'm coming at this from a completely different angle, one where the loss function is not of great importance. I'm interested in the derivative of a function whose curve is similar in shape to the sigmoid function (an S curve), and I only got confused by the emphasis you put on this loss function. Sorry if it grated, because your video is phenomenal. Phenomenal. Thanks. Kenny, Glasgow.

    • @elliotwaite 4 years ago +1

      @kennymaccaferri2602 Ah, got it. Yeah, I usually make videos related to machine learning, but I'm glad to hear you still found it useful even though you are using the sigmoid function for something else. And I appreciated your comment. Others may have been similarly confused but didn't take the time to mention it, so your comment may also help others in the future.

  • @anneni4438 3 years ago

    Nice intuition for the sigmoid function!!

  • @loganyang 4 years ago +2

    If you want other perspectives, search for the video: why sigmoid: a probabilistic perspective. The GDA perspective is mentioned as well but it's not a full formulation, more like a motivating example.

    • @elliotwaite 4 years ago +4

      Nice video. I'll leave a link to it here so others can more easily find it: ua-cam.com/video/oxGC9LLY6ZQ/v-deo.html

    • @loganyang 4 years ago

      @elliotwaite Thanks Elliot! Love your videos, keep up the great work!

  • @LAKXx 2 years ago

    very informative and clear! thanks m8

  • @ankitdixit2754 3 years ago

    Awesome! Thank you so much! This video was so intuitive!🙌

  • @oheldad 2 years ago

    Excellent video !

  • @bruceb85 4 years ago +1

    awesome explanation, I use normal distribution for trading - working on an ML system at the moment

  • @AayushJariwala-j4n 3 months ago

    At 6:33, suppose the droplet somehow falls at x=10. The model is very sure that it came from the right tree, but when we think about it, the probability of the right tree dropping a droplet at x=10 is almost 0, and still the model is that certain. One way to think about it is by considering the left tree. Just because the left tree is 2.4 units away from the right tree, the model becomes sure that this event must have come from the right tree, even though both trees have a probability of almost 0 at x=10. Let's assume the droplet dropped at x=100000. Now what? The model is almost sure that it came from the right tree even though the distance between the trees is still 2.4. (I am not saying that this model is wrong, it's just a thought I wanted to share.)

    • @elliotwaite 3 months ago

      @AayushJariwala-j4n That is a good point. When something so rare happens, maybe it would be better to consider that something else is happening other than just random nudges left or right, in which case it might be a good idea to use a different model for those events.

  • @sughosh100 3 years ago

    Wow! Thank you so much for the video!

  • @sukanya4498 3 years ago

    Awesome 👍🏼! Thank you ! ...

  • @InquilineKea 1 year ago

    What is the CPU/RAM extension you're running at the top of the macbook toolbar?

  • @gauravms2799 3 years ago +1

    Amazing video! How did you get that intuition about the two normal distributions? Any links to books or articles would help. Thanks for the video, it's on another level.

    • @elliotwaite 3 years ago +3

      Thanks! I think I was just wondering at some point what the curve would be if I applied Bayes' rule to two normal distributions with different means and variances. That was difficult to solve, so I instead tried first solving the simpler case where the two normal distributions had the same variance, and I was surprised that it came out to be a scaled sigmoid curve. I then did the version where they had different variances, and it turned out to be a bi-modal curve in that the higher-variance distribution always dominates in both tails. I later did a Google search to verify this relationship and found it mentioned in several resources including statistics books, but I haven't read any of those books so I don't have any to recommend.
      But one probability resource that I did like that I can recommend (although I don't think it mentions anything about the sigmoid curve) is this UA-cam playlist that covers some of the ideas in the book, "Probability Theory: The Logic of Science," by E.T. Jaynes. I also read some of the book, but I found this UA-cam playlist easier to follow: ua-cam.com/video/rfKS69cIwHc/v-deo.html
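
The equal-variance result described in this reply can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's derivation: the function names are hypothetical, the two classes are assumed equally likely, the spacing of 2.4 between the means is borrowed from another comment in this thread, and the unit standard deviation is an arbitrary choice.

```python
import math


def normal_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_right(x, mean_left, mean_right, std_left, std_right):
    """Bayes' rule with equal class priors: P(right class | x)."""
    p_left = normal_pdf(x, mean_left, std_left)
    p_right = normal_pdf(x, mean_right, std_right)
    return p_right / (p_left + p_right)


def scaled_sigmoid(x, mean_left, mean_right, std):
    """Sigmoid whose slope and midpoint are set by the two Gaussians."""
    slope = (mean_right - mean_left) / std**2
    midpoint = 0.5 * (mean_left + mean_right)
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


# Assumed example values: means 2.4 apart, unit standard deviation.
MEAN_LEFT, MEAN_RIGHT, STD = -1.2, 1.2, 1.0

# Equal variances: the Bayes posterior and the scaled sigmoid agree.
for x in (-3.0, -1.0, 0.0, 0.5, 3.0):
    bayes = posterior_right(x, MEAN_LEFT, MEAN_RIGHT, STD, STD)
    print(x, bayes, scaled_sigmoid(x, MEAN_LEFT, MEAN_RIGHT, STD))

# Unequal variances: the wider (right) distribution wins in both far tails.
print(posterior_right(-8.0, MEAN_LEFT, MEAN_RIGHT, 1.0, 2.0))
print(posterior_right(8.0, MEAN_LEFT, MEAN_RIGHT, 1.0, 2.0))
```

The last two lines illustrate the unequal-variance case mentioned in the reply, where the posterior is no longer a monotonic S-curve.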

    • @gauravms2799 3 years ago +1

      @elliotwaite OMG, thank you so much for replying 😍😍 You made my day. You are a genius, because this understanding is now helping me connect other things. Thank you so much, love you lots. One day this video is gonna be big for sure ❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️

  • @titotitoburg6298 4 years ago +3

    Can somebody walk me through how exactly 1/1+e^-x = e^x/e^x+1
    If you multiply both sides by e^x you get
    1(e^x) / 1(e^x) + (e^-x) = e^x / e^2x
    or whatever you get 1(e^x)/(e^x) e^-x + 1 = e^x / e^2x + 1
    either of those are correct.
    Nobody on the internet has an answer. I've been searching for hours for somebody to simplify this equation for me, because this doesn't make any sense at all.
    Edit: Okay, after many hours of research I found out that when someone writes:
    1/1+e^-x = e^x/e^x+1 they actually mean 1/(1+e^-x) = e^x/(e^x+1)
    Nobody told me this in all my life, and while it seems obvious, I did not pick up on it, and I guarantee it's why I've failed some math problems here and there.
    I also found out that when you multiply exponentials with opposite exponents you get 1. Example: e^x * e^-x = 1
    Therefore:
    1/(1+e^-x) = (1 * e^x) / (e^x * (1 + e^-x)) = e^x / ((e^x * 1) + (e^x * e^-x)) = e^x / (e^x + 1)

    • @durzehra6642 4 years ago

      Well, for any x (for example, x = 1), e^x * e^-x = e^(x - x) = e^0, and when the power is zero, no matter what the (nonzero) base is, the result is one.

    • @AlexandrBorschchev 4 years ago

      e^-x is just another way of writing 1/e^x. When he said to multiply by e^x, that was oversimplified. Here's how you solve it if you substitute 1/e^x:
      1/(1+e^-x)
      = 1/(1 + 1/e^x)
      = 1/((e^x + 1)/e^x), where we rewrote 1 + 1/e^x over the common denominator e^x to get (e^x + 1)/e^x
      = e^x/(e^x + 1)
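
For reference, the identity discussed in this thread can be written with explicit fractions, which removes the ambiguity about parentheses. The only step used is multiplying the numerator and denominator by e^x:

```latex
\[
\frac{1}{1 + e^{-x}}
= \frac{1}{1 + e^{-x}} \cdot \frac{e^{x}}{e^{x}}
= \frac{e^{x}}{e^{x} + e^{x} e^{-x}}
= \frac{e^{x}}{e^{x} + 1},
\qquad \text{since } e^{x} e^{-x} = e^{0} = 1 .
\]
```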

  • @josephsantarcangelo9310 4 years ago +4

    Cool video! So the sigmoid function comes from applying Bayes' theorem to two Gaussians.

    • @elliotwaite 4 years ago +1

      Yep! Good ol' Reverend Bayes.

    • @loganyang 4 years ago +1

      Doesn't need to be two Gaussians though, two Gaussians is one special case.

    • @elliotwaite 4 years ago +1

      @loganyang True. And there may be a better distribution for modeling the output from the last layer of a neural network. I could see it being asymmetric, and it could depend on the loss function used to update the values, but I'm not sure what a better distribution would be. The Gaussian is kind of the distribution that makes the fewest assumptions (just random noise), so it seems like the best choice without knowledge of a better option.

  • @a1x45h 4 years ago +17

    my dumass brain thought this dude's a dj or something

  • @ryancodrai487 3 years ago

    I read that the sigmoid function arises naturally in the form of the posterior probability distribution in a Bayesian treatment of two-class classification? I believe this is what you showed in your video? Could you explain briefly how this is the posterior probability?

    • @elliotwaite 3 years ago +2

      So when doing classification, what it means to apply a Bayesian treatment is to use Bayes' rule, which roughly says that to get the probability that a sample came from class A (to get the probability that a specific drop came from the left branch, using the dripping branches analogy from the video), what we do is we figure out what the probability would be for a random sample from class A to have produced that sample (we figure out what the probability is that a random drop from the left branch would have landed in the exact same spot as our drop in question), and then we divide that probability by the sum of that same kind of probability for each of the different possible classes (we divide that first probability by the sum of both that first probability and the probability that a random drop from the right branch would have landed in that exact same spot). This assumes both classes are equally likely beforehand; otherwise each probability would also be weighted by how common its class is.
      Also, all these probabilities are very small, essentially infinitesimal if we are using a continuous probability distribution like the normal distribution. So what we are actually doing is using probability densities. The probability that a random drop from the left branch will fall in the exact same spot as our sample is essentially zero, but if we look at what the probability would be of falling in a tiny slice around that sample, and divide that probability by the width of that slice, then we get a probability density.
      So to reiterate, applying Bayesian classification means that to get the probability that a specific sample came from class A, what we do is we invert the question, instead of asking what the probability is that the sample came from A, we ask what is the probability that class A would have produced this sample. And then we do that same thing for class B. And then to get the probability that the sample came from class A, we divide that "inverted" probability for A by the sum of all the "inverted" probabilities over all the possible classes.
      And to restate that in more common language, it's like saying, "Well, the probability that a drop from the left branch would have landed exactly here is very low, and it's also very low for a drop from the right branch, but if we are going to assume it had to have come from one of those two branches, then we can limit our reasoning to just these two rare cases, and figure out the probability for each by dividing the rare probability for each case by the sum of all the rare probabilities for each of the cases." The dividing by the sum is the part that "limits our reasoning to just these specific cases."
      And if we do this Bayesian procedure using class-conditional distributions that are normal distributions with the same variance for each class (identical in shape, just centered under different branches), and we do it for a bunch of different places along the ground, plotting the output probability at each spot as a height, then all those different points will exactly follow a sigmoid curve. And by "class-conditional distribution," I just mean the probability distribution of where we think a random sample from a class will end up before it is observed (the probability distribution of where we think a random drop from a branch will land before we actually observe where it landed).
      That may have been over explained, but I hope it helps. Let me know if there is anything else I can clarify.
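
The procedure described in this reply can be written out as a short derivation. It assumes two classes A and B that are equally likely beforehand, with Gaussian densities of equal variance; the symbols \mu_A, \mu_B and \sigma^2 (the class means and the shared variance) are notation introduced here, not taken from the video.

```latex
\begin{align*}
P(A \mid x)
  &= \frac{p(x \mid A)}{p(x \mid A) + p(x \mid B)}
   = \frac{1}{1 + \dfrac{p(x \mid B)}{p(x \mid A)}} \\[6pt]
\frac{p(x \mid B)}{p(x \mid A)}
  &= \exp\!\left( \frac{(x - \mu_A)^2 - (x - \mu_B)^2}{2\sigma^2} \right)
   = \exp\!\left( -\frac{\mu_A - \mu_B}{\sigma^2}
       \left( x - \frac{\mu_A + \mu_B}{2} \right) \right) \\[6pt]
P(A \mid x)
  &= \frac{1}{1 + e^{-k (x - m)}},
  \qquad k = \frac{\mu_A - \mu_B}{\sigma^2},
  \quad m = \frac{\mu_A + \mu_B}{2}
\end{align*}
```

The quadratic terms in x cancel only because the variances are equal, which is why the result is a sigmoid in x, scaled by k and centered at the midpoint m.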

  • @shassy7253 2 years ago

    Hi, nice visualisation and explanation. I thought you were a DJ too. Your explanation of the comparison with the other equations is a bit unclear to me, especially in the upper left quadrant. Also, for your foggy forest analogy, what does the sigmoid function mean there? Does it show where the raindrop will end up on average? How is this related to a binary option if the option is between 0 and 1?

    • @elliotwaite 2 years ago +1

      Thanks. Haha, yep, I had a DJ lighting setup going.
      About the upper left quadrant part of the video, it might have been confusing because I was using some terms and ideas related to machine learning that I was assuming that the audience was familiar with, so it was a bad explanation if you weren't familiar with those terms and ideas I was mentioning. But if I try to re-explain it here, I feel like I might just be repeating what I said in the video, so I was thinking it might be better if instead you let me know which parts of the explanation in the video seemed confusing, and then I can try to clarify those parts here.
      About the water droplet part, the function that shows where the drop will fall on average is actually the normal distribution. The probability distribution for where a drop from the left branch will fall is a normal distribution centered underneath that branch. And the same for the right branch, it is a normal distribution centered underneath that branch. What the sigmoid function shows is, if you see a drop land at a certain location and you don't know which branch that drop came from, then you look at the height of the sigmoid function at that location, and that height will be the probability that the drop came from the right branch. For example, if the height at the location of a drop is 0.75, then there is a 75% chance that that drop came from the right branch. And this example is related to binary classification because we can replace "came from the left branch" and "came from the right branch" with any other two competing propositions of which we are trying to decide which of them is true. Again, I might just be repeating what I said in the video, so feel free to let me know which parts of this explanation or the video explanation still seem confusing.
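
The reading of the sigmoid's height as "the probability that the drop came from the right branch" can also be checked by simulation. This is a rough sketch under assumptions that are not from the video: both branches drip equally often, drops scatter with a unit standard deviation, the branches sit at -1.2 and 1.2, and the function name is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def right_fraction_near(x_center, half_width, n_drops, mean_left, mean_right, std, seed=0):
    """Simulate drops; return the share of drops landing near x_center that came from the right branch."""
    rng = random.Random(seed)
    right_hits = 0
    total_hits = 0
    for _ in range(n_drops):
        from_right = rng.random() < 0.5          # each branch drips equally often
        mean = mean_right if from_right else mean_left
        x = rng.gauss(mean, std)                 # landing spot with Gaussian scatter
        if abs(x - x_center) <= half_width:
            total_hits += 1
            right_hits += int(from_right)
    return right_hits / total_hits


MEAN_LEFT, MEAN_RIGHT, STD = -1.2, 1.2, 1.0      # assumed example values
X = 0.5                                          # spot on the ground to inspect

observed = right_fraction_near(X, 0.1, 200_000, MEAN_LEFT, MEAN_RIGHT, STD)
slope = (MEAN_RIGHT - MEAN_LEFT) / STD**2
predicted = sigmoid(slope * (X - 0.5 * (MEAN_LEFT + MEAN_RIGHT)))
print(observed, predicted)
```

With these values, the simulated fraction and the sigmoid height should agree up to sampling noise.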

  • @minhnguyen-pt7lu 4 years ago +1

    Very helpful.

  • @徐聖旂 1 year ago

    Very interesting!

  • @nirbhay_raghav 2 years ago

    But doesn't the derivative of the sigmoid function saturate to zero as the output gets closer to one, leading to vanishing gradients in neural networks? That can cause the neurons to freeze, right? I am not sure, could you please shed some more light on this?

    • @elliotwaite 2 years ago

      Yes, the derivative saturates to near zero when the output is near one or zero, but if the output is near one and the label says that it should have been a zero (or vice versa), the loss will be so high that it compensates for how small the derivative is, so much so that the resulting gradient (loss * derivative) does not vanish when the prediction is incorrect. In fact, the resulting gradient of an incorrect prediction is very consistent even when the output is very saturated, as can be seen by the slope of the loss to the left of the origin at around 3:00. The gradient only vanishes when the prediction is correct.
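
One way to make this concrete is to apply the chain rule by hand for the loss -ln(probability of the correct label). The sketch below is an illustration with a hypothetical function name, not code from the video. It multiplies the derivative of the loss with respect to the sigmoid output by the sigmoid's own derivative, and the product simplifies to `s - label`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll_grad_wrt_logit(z, label):
    """Chain rule for loss = -ln(p_correct), where s = sigmoid(z)."""
    s = sigmoid(z)
    if label == 1:
        dloss_ds = -1.0 / s          # derivative of -ln(s), huge when s is near 0
    else:
        dloss_ds = 1.0 / (1.0 - s)   # derivative of -ln(1 - s), huge when s is near 1
    ds_dz = s * (1.0 - s)            # sigmoid derivative, tiny when saturated
    return dloss_ds * ds_dz          # algebraically equal to s - label


# Label is 1. A very negative z is a confident wrong prediction.
for z in (-10.0, 0.0, 10.0):
    print(z, nll_grad_wrt_logit(z, 1))
```

For a confidently wrong prediction the large loss derivative and the tiny sigmoid derivative cancel to a gradient of magnitude close to 1, while for a confidently correct prediction the gradient goes to 0. This naive product is for illustration only: for very large |z| it can divide by zero in floating point, which is one reason to use the simplified form `s - label` in practice.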

    • @nirbhay_raghav 2 years ago

      @elliotwaite Thanks for the reply. The video was just amazing. But I have a follow-up question. When you say "gradient (loss * derivative)", what exactly do you mean? I could not follow. And even though the loss compensates for the gradients, it is the gradients that are used to update the weights, right, and not the loss? So it does not matter how big the loss is if we cannot change the weights because the gradient is the limiting factor. I am sorry, but I am now confused. I will surely read about the vanishing gradient problem in more detail. I am probably missing something crucial here.

    • @elliotwaite 2 years ago +1

      @nirbhay_raghav To calculate the gradient for a weight you perform back propagation, which means you start with the loss and go backwards through all the operations that lead to the output, multiplying the loss by the derivatives of those operations. That's what I meant by saying the gradient was the loss * the derivative. I have a video about how PyTorch's autograd works that walks through some examples of the backprop process that might help explain it better. It's a bit difficult to explain it well just using text.

    • @nirbhay_raghav 2 years ago +1

      @elliotwaite Thanks. I will watch it. I had watched it earlier, but I will need a refresher.

  • @Ettoyeaz 3 years ago

    Why is the natural logarithm used as a loss function?

    • @elliotwaite 3 years ago +1

      The loss function "-ln(output)" is called the negative log likelihood loss. It is currently the most popular loss function used for optimizing neural networks for classification problems.
      The idea behind this loss function is the assumption that we are trying to maximize the probability that our model would produce the training data. Or in other words, if we fed each of the inputs in the training data into our model, and our model sampled labels for those inputs from its output probability distribution, what is the probability that all of the labels from our model will match all of the labels in the training data.
      So to get that probability we measure the probability that our model will predict the correct label for each of the inputs, and then multiply all of those probabilities together. This is called the likelihood of the model, it's how likely it is that our model would produce the data, and when doing maximum likelihood optimization, it's what we want to maximize.
      However, instead of maximizing this product of many different values, we often instead maximize the log of this product (the log likelihood), since this often makes the optimization more convenient and more stable. And we can get away with doing this because the input that maximizes a function also maximizes the log of that function, so using the log likelihood instead of the likelihood preserves the optimization gradient direction we would get. However, it does change the magnitude of that gradient, making it smaller the higher our probabilities get, which is often desirable anyway since it's equivalent to reducing the optimization step size as the model gets better, which is something that is usually done anyway when training a model.
      And the reason that using the log of the product is convenient and more stable is because the log of the product is the same as the sum of the log of each of the individual values in the product, so instead of optimizing a product over our entire training set, we are now optimizing a sum over our entire training set, which can be conveniently estimated by just optimizing the sum of just a minibatch of our dataset, and training using minibatches often speeds up training and can even help avoid local optima.
      And regarding stability, if we were to try to optimize the product directly (the likelihood), our loss would be the product of many probabilities (values between 0 and 1), which can sometimes lead to a very small final value that might be affected by floating point error issues. This is avoided when optimizing the log likelihood, since the sum of many logs of probabilities will usually be a well-behaved value that can be accurately represented using a floating point number (as long as none of the probabilities are zero, which would lead to a log value of negative infinity, but since we usually use the sigmoid or softmax function to generate the probabilities, none of our output probabilities should be zero).
      And then finally, instead of maximizing the log likelihood, we usually instead refer to it as minimizing the negative log likelihood, which is the same thing, but often in machine learning we talk about minimizing a loss, so when referring to it as a loss, we call it the negative log likelihood loss.
      I hope this helps. Feel free to ask any questions about my explanation if there were any parts that seemed unclear.
      P.S. - The negative log likelihood is just one of the possible loss functions that can be used. There are other optimization functions that can also be used, and some that are theoretically better at generalizing, such as the techniques used in Bayesian deep learning, which are less prone to overfitting. However, using the negative log likelihood loss is simpler and faster than Bayesian deep learning, and it seems to work very well when you have a large amount of training data. However, I still find Bayesian deep learning very interesting, and I could see it, or some version of it, potentially replacing the popularity of the likelihood maximization approach at some point in the future if the efficiency of the technique can be improved.

    • @Ettoyeaz 3 years ago +1

      @elliotwaite Wow, thank you a lot for answering and explaining!

  • @veggiet2009 3 years ago

    So basically because you don't want to classify something as "150% a cat" or you don't want your ultimate obviously-its-a-cat image to suddenly register as "75% a cat" as it would if you were just using the normal distribution

    • @elliotwaite 3 years ago

      I'm not sure I understand what you mean, but I think yes.

  • @thelstan8562 1 year ago

    AMAZING!!! WOW!!!

  • @no-qm1kn 4 years ago

    thanks so much

  • @rushikeshbulbule8120 4 years ago +2

    Awesome...

  • @wenjiezhu70 4 years ago

    The voice really sounds like Trump... but this guy is a genius tho!!! Love the video, so clear!

  • @prasenjitgiri919 25 days ago

    Honestly appreciate what you are doing, but I still didn't get it...

    • @elliotwaite 25 days ago

      @prasenjitgiri919 understandable