This is exactly the video you need for the "Cross Entropy Loss" keyword. Straight to the point.
Best explanation I could find! This channel is gonna be big.
Thanks man!
It was really great pointing out that it's the gradient that matters more than the actual loss value.
Great video, keep it up
The video production quality along with the explanation is really nice. Keep making such great content, your channel is bound to gain traction very rapidly. :)
Thanks man...Another interesting video is on the way!
Best video to understand cross entropy out there, I was struggling a bit until I found this one.
I love the math, calculus and all the visualizations that come in this video. Great job
Glad you liked it!
Amazing explanation! just what I was looking for :)
great job man!
I am a big fan of how you remove all the unnecessary steps in the formulas in order to explain it as simply as possible. Very nice!
Thanks! Yeah, this way you get to know the essence.
No, those steps are too important.
Your explanations are great! Thanks for the vids!
Thanks for watching!
@@NormalizedNerd I just wanted to come back and let you know - I got a distinction in my MSc (I did my thesis on GANs for tabular data) and your vids were a huge factor in helping me achieve this! So thank you!
@@TheQuantumPotato Your comment just made my day! Best wishes for your future endeavors 😊
@@NormalizedNerd I am currently working on a medical computer vision project - so it’s all going well! Thanks again, I look forward to watching more of your vids
Undoubtedly the best explanation of cross-entropy loss I found on YouTube.
Thanks for this helpful video! It delivers a clear visual explanation that my professor didn't give.
You're very welcome!
Very nicely explained! Your video helped me a lot in my classroom discussion today. Thank you very much.
Really glad to hear that!
@@NormalizedNerd And my students enjoyed that explanation. I'll surely share your channel link with them.
@@angelinagokhale9309 Omg! I thought you were attending the class as a student! Really happy to see other educators appreciating the videos :)
@@NormalizedNerd And well, as a teacher (or better still a facilitator) of the subject, I am a student first. There is just so much to keep learning! And I enjoy it :)
You took this explanation to the next level man! Great analysis
Amazing video, totally understood the concept
Extremely helpful. Thank you
Loved it thanks for making this video
Best explanation I could find!
Thanks a lot!
Even as a mathematically handicapped person I can understand it now fully. Bravo!
Thanks a lot ❤
This is such an amazing video! Thanks!
Glad you liked it!
Nicely explained....I was struggling to decode it
Glad to help :)
The neatest and clearest explanation I've ever found.
Thanks a lot ❤.
You're welcome!
Great explanation 🫡
You can use CE in regression too, if you can quantize the target into, let's say, n classes. Otherwise, if you're interested in MAPE, you can use a MAPE loss or (1+log) compression.
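A minimal sketch of that quantization idea, assuming plain NumPy and equal-width bins (both are my choices here, not anything from the video): once the continuous target is binned into n classes, the bin index plays the role of the class label and an ordinary cross-entropy loss can be used.

```python
import numpy as np

# Hypothetical continuous regression target (synthetic data, just for illustration)
rng = np.random.default_rng(0)
y = rng.normal(loc=50.0, scale=10.0, size=1000)

n_classes = 10
# Equal-width bin edges spanning the observed range of y
edges = np.linspace(y.min(), y.max(), n_classes + 1)
# Interior edges -> np.digitize returns bin indices 0 .. n_classes - 1
y_class = np.digitize(y, edges[1:-1])

print(y_class[:10])           # integer class labels usable with a cross-entropy loss
print(np.bincount(y_class))   # how many samples landed in each bin
```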
Thank you, your explanation was perfect....
awesome video, thanks for explaining it so well! keep it up.
very good explanation
Amazing explanation
Very good explanation, you got a new subscriber
Awesome, thank you!
Your explanation helps me a lot!
Glad to hear that
Such an awesome explanation! Thanks!
You're very welcome!
Very beautiful video, I liked it a lot. Keep it up :)
Thank you so much 😊😊
great explanation!
Another way of looking at it is that L2 is already way too punishing where outliers are concerned; hence we use L1, so cross-entropy is likely to exacerbate the issues already found in L2.
Really good explanation, good job !
Thank you!
Great video. Thanks!
Awesome explanation - keep up the good work 👍
Thank you! 👍
great explanation
wow, great explanation
There is a point that I felt was missing: I've read on websites that the cross-entropy function helps reach the global minimum more quickly.
Nice video and visualisation
Wow, very helpful
Thank you for your graphical presentation
You're very welcome!
Thanks for sharing your insights ~
Glad it was helpful!
Awesome explanation! Thank you!
You're very welcome!
Congrats from Brazil!
Brilliant insight thank you so so much!!!
OK, great, but what if the number inside the log is zero? For example, when the ground truth is 1 but my model predicts zero? I'm having trouble understanding this. I'm trying to build an XOR multilayer perceptron, but 1 and 0 are not good inputs: if an input is zero, the weight update for the corresponding weight is impossible. I tried -1 and 1 as inputs and labels, but then the loss function doesn't work. I'm using the sigmoid activation function and have two hidden neurons and one output neuron, but it doesn't work. Man, I'm going crazy with this ML stuff.
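On the log(0) part of this question: a common practical workaround (my sketch, not something shown in the video) is to clip the predicted probabilities away from exactly 0 and 1 before taking the log, so the loss stays finite even for confidently wrong predictions.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """BCE with predictions clipped into (eps, 1 - eps) so log(0) never occurs."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.0, 1.0, 0.9, 0.1])      # the first two predictions are confidently wrong
print(binary_cross_entropy(y_true, y_pred))  # large but finite, no -inf or NaN
```

The zero-input issue is separate: when an input is exactly 0, the gradient through that input's weight is 0 for that sample, which is one reason people sometimes rescale inputs (e.g. to -1/+1) while keeping the labels as 0/1, since the labels have to match the sigmoid's output range inside the log terms.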
Is there any difference between BCE and the weighted cross-entropy loss function?
Yes. The second one has an extra weight term, and the weight value is different for each class.
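For reference, the weighted version just multiplies the two log terms by per-class weights $w_1$ and $w_0$ (my notation, not from the video); setting both weights to 1 recovers plain BCE:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,w_1\, y_i \log \hat{p}_i + w_0\,(1 - y_i)\log(1 - \hat{p}_i)\,\Big]$$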
How do we come up with this formula for binary cross-entropy loss? Is it linked to any proof? It would be a great help.
I did a video about it -> ua-cam.com/video/2PfGO753UHk/v-deo.html
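For anyone who doesn't want to leave this page, here is a condensed sketch of the standard derivation (under the usual Bernoulli assumption; the linked video covers it in detail). For a label $y_i \in \{0,1\}$ with predicted probability $\hat{p}_i$, the Bernoulli likelihood is

$$P(y_i \mid \hat{p}_i) = \hat{p}_i^{\,y_i}\,(1-\hat{p}_i)^{\,1-y_i}.$$

Maximizing the likelihood of independent samples is the same as minimizing the negative log-likelihood,

$$-\log \prod_{i=1}^{N} \hat{p}_i^{\,y_i}(1-\hat{p}_i)^{\,1-y_i} \;=\; -\sum_{i=1}^{N}\big[\,y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\,\big],$$

which, after averaging over $N$, is exactly the binary cross-entropy loss.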
That's an amazing introduction.
Thanks mate!!
I don't get why the growth rate of cross entropy is the "sweet spot". If, in classification tasks, a very steep gradient is important even when the prediction is only slightly wrong, why don't we just use a linear loss function with a very steep slope (the gradient would be constant and high over the whole domain, not only when the prediction is far from the ground truth)? Otherwise, if what we want is a gradient that starts low and then grows fast (even faster than the parabolic MSE), why don't we use an exponential loss, or something that grows even faster than the n*log(n) of cross entropy?
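One hedged way to think about the "sweet spot": it is not the raw growth rate by itself but how the loss composes with the sigmoid. Assuming the usual setup $\hat{p} = \sigma(z)$ and $L_{\text{MSE}} = \tfrac{1}{2}(\hat{p}-y)^2$ (my notation, not from the video), the chain rule gives

$$\frac{\partial L_{\text{BCE}}}{\partial z} = \hat{p} - y, \qquad \frac{\partial L_{\text{MSE}}}{\partial z} = (\hat{p}-y)\,\hat{p}\,(1-\hat{p}).$$

The log exactly cancels the sigmoid's flat tails, so the gradient with respect to the logit stays large when the model is confidently wrong, whereas for MSE (and likewise for a plain linear loss on $\hat{p}$) the $\sigma'(z) = \hat{p}(1-\hat{p})$ factor squashes it toward zero. A loss that grows even faster than $-\log\hat{p}$ is possible, but it tends to make training overly sensitive to mislabeled or hard examples.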
Is the curvature of the gradient the only reason we prefer CEL over MSE? Does this mean that MSE would work but just converge more slowly, needing more data to train on?
Yes, the slope is an important point. There's another thing... CEL arises naturally if you solve the classification problem using the maximum likelihood method. More about that here: ua-cam.com/video/2PfGO753UHk/v-deo.html
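To make the "MSE still works but learns more slowly" intuition concrete, here is a small numeric illustration (my own sketch, assuming a single sigmoid output and the gradient formulas discussed above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                   # true label
z = np.array([-6.0, -2.0, 0.0, 2.0])      # logits; the first one is confidently wrong
p = sigmoid(z)

grad_bce = p - y                          # d(BCE)/dz with a sigmoid output
grad_mse = (p - y) * p * (1.0 - p)        # d(0.5 * (p - y)^2)/dz

for zi, gb, gm in zip(z, grad_bce, grad_mse):
    print(f"z = {zi:+.1f}   BCE grad = {gb:+.4f}   MSE grad = {gm:+.4f}")
```

For the confidently wrong logit (z = -6) the BCE gradient stays close to -1 while the MSE gradient almost vanishes, so MSE-trained classifiers can still converge, just with much weaker updates exactly where they are needed most.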
this made sense! thank you
Awesome video, thank you very much! :)
You're very welcome!
Great explanation, Thank you :)
You are welcome!
Well explained, nice!
Thank you
Excellent 👍
How do you make such animations? What softwares do you use?
It's called manim.
Just in a point! thanks
You're welcome!
Nice video! It is possible to train a neural network for a classification task using MSE. For binary classification, we can use one output neuron per class (two in total) and train the network with the MSE loss. To compute the classification accuracy, you do it the same way as in the usual classification setup: the predicted class is the index of the largest logit in the output layer. Any idea why this works?
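A rough sketch of the setup described above (PyTorch, with my own toy data and layer sizes; nothing here comes from the video): one output per class, one-hot targets, MSE loss, argmax at prediction time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: 2-D points, class 1 if the coordinates sum to a positive number
X = torch.randn(512, 2)
y = (X.sum(dim=1) > 0).long()

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.MSELoss()                        # regression loss on one-hot targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    out = model(X)                              # raw outputs, one per class
    target = F.one_hot(y, num_classes=2).float()
    loss = criterion(out, target)
    loss.backward()
    optimizer.step()

pred = model(X).argmax(dim=1)                   # predicted class = index of largest output
print("accuracy:", (pred == y).float().mean().item())
```

It works because minimizing the squared distance to the one-hot vector still pushes the correct output toward 1 and the others toward 0, so argmax recovers the right class; what you lose compared to cross-entropy is the strong gradient on confident mistakes.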
What a great explanation, thank you!
One question: don't we want the derivative to be zero when the model performs as well as it possibly can, i.e. when p̂ always equals p? With binary cross-entropy, the loss function has derivatives of +1 and -1 at its intersections with the x-axis...
Ideally yes, but for functions with log terms it's not possible to achieve a derivative of 0, right?
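Putting a formula to this exchange (my notation): for a hard label $y = 1$ the per-sample loss is $L = -\log\hat{p}$, so

$$\frac{dL}{d\hat{p}} = -\frac{1}{\hat{p}} \;\longrightarrow\; -1 \quad \text{as } \hat{p} \to 1,$$

so with respect to the predicted probability the slope indeed never reaches 0 for hard labels. With respect to the sigmoid's input $z$, though, the gradient is $\hat{p} - y$, which does go to 0 as the prediction becomes correct, so training still settles down at the optimum.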
I thought this was a great video!
Can you explain how this generalizes to multi-class classification problems or link me to a video where I can learn more?
Thank you :)
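While waiting for a reply, here is the standard multi-class generalization (textbook notation, not specific to this channel): apply a softmax over the $K$ outputs and sum the log term over classes with a one-hot label $y$,

$$\hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad L = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_{c},$$

where $c$ is the index of the true class; for $K = 2$ this reduces to the binary cross-entropy discussed in the video.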
very well done!
Thanks :)
Excellent
Thanks, this was really helpful! Though I had to put it on 0.75x speed :D
Thanks for the feedback on speed :)
great video man
Glad you enjoyed it
good Video!
Thanks!
Speechless! Paid courses fail to deliver these concepts; only an experienced data scientist can explain them like this.
Thanks a lot mate :D
3Blue1Brown... but it's literally a brown guy in this case. Loved the videos, man...
Great video!
Glad you enjoyed it
Brilliant... also, delightful Bong accent :)
Haha :P
Subscribed just because of the YouTube channel name!
haha
Superb!
very nice
Brilliant
Great!
Thank you!!!!
good
Noice Explanation
Thanks!!
Thanks!!!!!!!
Happy to help
AMAZING!
nice
Thanks man!
3blue1brown copy :/
Amazing explanation
Amazing explanation!! Thanks!
Awesome video!
Glad you enjoyed it
Excellent