The full Neural Networks playlist, from the basics to deep learning, is here: ua-cam.com/video/CqOfi41LfDw/v-deo.html
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Just to help with promotion: the study guides he posts on his site only cost $3.00, and they are immensely helpful to refer back to. It's like having his entire video condensed into a handy step-by-step guide as a PDF. Yes, you could just watch the video over and over, but this way you help Josh continue making great content for us at the cost of a cup of coffee.
TRIPLE BAM!!! Thanks for the promotion! I'm glad you like the study guides. It takes a lot of work to condense everything down to just a few pages.
every time I go to the bathroom and use "soft plus" I think about neural nets again, and this accelerates my learning process :)
BAM! :)
This is such a nice and clear visualization of how activation functions work inside a neural network, and a perfect way to remember the inner workings. This is a masterpiece!
Before that I knew: apply an input, activation functions, etc., etc., and you will receive an output with a magic value. But now I have a much deeper understanding of WHY we are applying these activation functions / different activation functions.
Thank you very much! :)
This is just so crystal clear and must have taken you some time to really deconstruct in order to explain it, really fantastic, thank you Josh!
Thanks! It took a few years to figure out how to create this whole series.
@@statquest Thank you for the videos! This helps a lot :)
@@statquest That's a big BAM!!! StatQuest is by far one of the best resources for statistics and ML - thanks a lot, you helped me understand so many concepts I never got, such as PCA, and even how activation functions like ReLU actually bring non-linearity through slicing, flipping, etc.!
The theoretical simplicity of deep learning is a beautiful thing
:)
Google should give you award for spreading knowledge to us all...
Bam! Thank you! :)
OMG - CLEAREST EXPLANATION OF RELU ON THE PLANET!!! PLEASE TEACH ME EVERYTHING YOU KNOW
bam! Here's a list of all of my videos: statquest.org/video-index/
@@statquest Thank you Mister :)
Bro you are a genius. Much love from South Africa. Soon i will be able to buy your stuff. You deserve it
Thanks!
How could you make difficult Machine Learning content so easy? Incredible!
Wow, thanks!
The way you explain is on another level sir..Thanks🙏
Thank you very much! :)
Your explanation deserves a huge bam! that's great man
Thanks!
Thanks Josh!!! It's such a fun to learn machine learning from your videos.
Thank you! :)
Awesome guy, Awesome channel, Awesome video, TRIPLE BAM!!!
Hooray!!!
You saved my life. Thank you from Korea!
Bam! :)
your explanation makes things so simple ...
Thanks!
I like how you explain how the affine transformation rotates, scales, and flips the activation function with vivid illustrations. Now I can relate to LeCun's deep learning class, which talks about this in abstract matrix form. Thanks.
Glad it was helpful!
I always expect your intro music... I like it. Your content is also satisfying. Thank you!
Glad you enjoy it!
I come here every time I learn some new concept to understand it clearly. Thanks a ton!!
Would really love to jam with you someday for the intros, and maybe we can call it a BAMMING session.
That would be awesome!
My mind is blown. Triple Bam. Josh Starmer - a great thanks to you for making such amazing videos and educating others free of cost.
Wow, thank you!
Hey Josh, here is a topic you may be interested in making a video about. It is very relevant and I feel like not many videos on the web explain it:
- Which hyperparameters affect a NN and what is the intuition behind each of them. Most packages (e.g. caret) run a grid of models with all combinations of the parameters you have specified, but it gets very computationally expensive pretty easily. It would be great to learn about some of the intuition behind them in order to feed that grid something better than random guesses.
Let me know what you think about this topic, and thanks again for your great job.
I'll keep that in mind. In the next few months I want to do a webinar on how to do neural networks in python (with sklearn) and maybe that will help you.
@@statquest thank you
Very good explanation, especially when you talked about the ReLU function not being differentiable everywhere.
keep up the great work brother
Thanks, will do!
Another banger
Thank you! :)
You said that ReLU sounds like a robot. My personification of this function is actually a robot who is simple-minded when solving problems! Coincidence!
Bam!
@@statquest Bigger bam! A bam caused by fire magic.
[Notes excerpt from this video]
7:10 The reason why ReLU works. -- Like other activation functions, the weights and biases on the connections slice, flip, and stretch the function into a new shape.
7:40 The method to solve the problem that the derivative of the ReLU function is not defined at the bent point (0,0). -- Manually define the derivative to be 0 or 1.
bam!
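For anyone who wants to poke at the "slice, flip, and stretch" idea from those notes, here is a minimal NumPy sketch; the parameter values are made up for illustration and are not the ones from the video:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

dosage = np.linspace(0, 1, 5)      # example inputs between 0 and 1

# hypothetical parameter values, just for illustration (not from the video)
w1, b1 = 1.7, -0.85                # the weight and bias scale and shift the input
w3 = -3.0                          # a negative outgoing weight flips the bent line

hidden = relu(w1 * dosage + b1)    # the hidden node's bent line
scaled = w3 * hidden               # flipped and stretched contribution to the output
print(hidden)
print(scaled)
```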
I really like the intro of the video statquest....
bam! :)
Good explanation. It is interesting to see how the ReLU is used to gradually refine the function that defines the probabilistic output.
Looking at how the ReLU is used reminds me of the use of diodes (with an approximate characteristic curve) in electronic circuits.
Same here - the ReLU reminds me of a circuit.
Not skipping ads for my guy Josh
BAM! :) thanks for your support!
I first like the video, then start learning 😊😊🔥🔥🔥🔥
bam! :)
This is so helpful! Thank you!!
bam! :)
amazing job... understood clearly... now i don't have to search more for ReLU :D
bam! :)
note regarding ReLU: the derivative at 0 is not defined, so we assume the derivative at 0 to be either 0 or 1.
yep
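In code, that convention could look something like this tiny NumPy sketch (here 0 is the value chosen at x = 0):

```python
import numpy as np

def relu_derivative(x):
    # 1 where x > 0, and 0 where x <= 0 (the value at exactly 0 is an arbitrary choice)
    return np.where(x > 0, 1.0, 0.0)

print(relu_derivative(np.array([-2.0, 0.0, 3.5])))   # -> [0. 0. 1.]
```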
a tiny bam is just a declaration of humility
:)
I was struggling (well, I still am) with understanding activation functions and these videos are helping me a lot. And man, you answer every single comment, even in old videos. Thanks a lot and many bams from Argentina :D .
There's one thing that I didn't quite get, and maybe this is reviewed in the next episodes.
I understand activation functions are used to introduce "non-linearity" to predictions, but to me they still seem very arbitrary. I mean, why would I, for example, with ReLU in mind, keep positive values and change negative values to 0? Am I not losing a lot of information there?
I know when it comes to deep learning sometimes the answer is along the lines of "because some dude tried it 10 years ago and it worked. Here's a paper discussing it" but I'd still like to ask.
It might help to see the ReLU in action. Here's an example that shows how it can help us fit any shape to the data: ua-cam.com/video/83LYR-1IcjA/v-deo.html
You aren't losing information, because what travels on once you start manipulating the algorithm is no longer the original information but a "shadow" of it, and what you're trying to do is adjust the parameters so that everything fits that shadow.
The goal of the activation functions is to add non-linearity, but remember that the magic is also in the neurons and the hidden layers of the network. Each hidden layer adds an intersection point, and that further fits the information that is observed.
Could you do another video on backpropagation with 2 hidden layers combined with ReLU functions?
I really love ur visualization
According to the formula for the derivative of the loss function with respect to the weights/bias in the previous video, if we replace the softplus function with the ReLU function, then (e**x / (1+e**x)) is replaced with 0 or 1.
If it is 0, then the derivative of the loss function with respect to the weights/bias is 0, so the step size = 0, and there is no tweak to the weights and bias.
Can you make a video about that? Or am I wrong?
Yes, that's correct. So, for very simple neural networks, it can be hard to train with ReLU. However, for bigger, more complicated networks, the value is rarely 0 since there are so many possible inputs.
@@statquest Thanks, I really appreciate your dedication to making videos like this. It has helped so much with the concepts.
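Here is a toy sketch of what that reply describes: when the ReLU derivative in the chain rule is 0, the whole gradient term, and therefore the step size, is 0. All of the numbers below are made up:

```python
# if the ReLU's input is negative, its derivative is 0, so the whole chain-rule
# product (and therefore the step size) is 0 and the weight is not tweaked
relu_deriv = lambda x: 1.0 if x > 0 else 0.0

observed, predicted = 1.0, 0.4
hidden_input = -0.3                                  # made-up input to the ReLU

d_loss_d_pred = -2 * (observed - predicted)          # from the squared residual
d_pred_d_weight = relu_deriv(hidden_input) * 0.5     # 0.5 stands in for the rest of the chain

gradient = d_loss_d_pred * d_pred_d_weight
step_size = gradient * 0.1                           # learning rate = 0.1
print(gradient, step_size)                           # both are 0 -> no update for this sample
```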
Thank you sir so much for such a clear explanation
Happy to help! :)
Hi, I was really excited when I saw you start to neural networks after your great machine learning videos. Can you create a playlist for neural networks as you did for machine learning? It is easier to follow with playlists. Thanks
Yes, soon! As soon as I finish all the videos in this series. There are at least 2, maybe 3 or 4 more to go.
Thank you, this helped me with an assignment
Glad it helped!
Thank you for the explanation.
You are welcome!
As always everything is very clear, but I still don't understand why the ReLU function is currently the most effective function in machine learning. I mean, the shape seems so much less natural than the softplus function, for instance.
It's simple - super easy to calculate - so it doesn't take any time, which means we can use a lot more of them to fit more complicated shapes to the data.
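If you're curious just how much cheaper it is, here's a rough timing sketch (assumed setup; exact numbers will vary by machine):

```python
import numpy as np, timeit

x = np.random.randn(1_000_000)

relu_time     = timeit.timeit(lambda: np.maximum(0.0, x), number=100)   # just a max
softplus_time = timeit.timeit(lambda: np.log1p(np.exp(x)), number=100)  # exp + log

print(f"ReLU:     {relu_time:.3f} s")
print(f"Softplus: {softplus_time:.3f} s")   # typically noticeably slower
```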
'at least it's ok with me' - all we need for peace of mind
:)
Thank you very much Starmer, very clear and great video! Thank you!
Glad it was helpful!
I love the toilet paper image for softplus... didn't catch it on the first watch but it became more and more suspicious as I went through this a few times... LOL.
:)
thank you Professor
Any time!
*this video/ most of his videos have less than 100K views!!??*
*ppl are missing out on a gold mine!*
Thank you! :)
Again a appreciation for u
Thank you! :)
Hey Josh! Amazing content - as always. I have always found your videos to be very useful in understanding the fundamental ideas, rather than just accepting the 'theoretical definitions'. I just wanted to throw out a suggestion that it would be great if you could collaborate with other open source/free-for-all learning mediums like Khan Academy. This would not only increase the viewer base for all open source platforms but it would also fill in the gaps where the content on your channel or their channel has not been created yet.
I'll keep that in mind. However, I have no idea how to collaborate with Khan Academy. If you have suggestions, let me know.
With ReLU life is easier, you don't have to compute the complicated CHAIN RULE :D
Great series!!! I finally get it because of you Josh!
Thanks!
Learning math is becoming sooooo easy and fun with your effort!
Thank you so MUCH! BAM~~~~~~
If possible, can you make a video about the "gradient vanishing problem" in the future~?
I'll keep that in mind.
*I am already familiar with Neural Networks Part One* 😂😂😂 So this is where my quest will start, so it's time to _Start Quest_. This time my quest is leading me to *ReLU in Action*, then I will unwind and backpropagate 🎉 to *Recurrent Neural Networks (RNNs)…* I will then learn what *Seq2Seq* is, but I must go watch *Long Short-Term Memory*. I think I will also have to check out the quest *Word Embedding and Word2Vec…* and then I will be happy to come back to learn with Josh 😅 I am impatient to learn *Attention for Neural Networks* _Clearly Explained_
Please just just watch the videos in order: ua-cam.com/play/PLblh5JKOoLUIxGDQs4LFFD--41Vzf-ME1.html
Can u also talk about CNN and RNN? U r my favorite teacher.
CNNs are coming up soon.
Hi, there are a lot of activation functions like relu and tanh etc..., can you make a video about the usage of different activation functions?
I'll keep that in mind.
Thank you Sir
Most welcome!
Could you shed some light on the advantage and disadvantage of Relu vs Soft Plus please? Thank you. I didn't know there was soft plus until this video. lol
There seems to be a raging debate as to whether or not ReLU is better or worse than Soft Plus and it could be domain specific. So I don't really know the answer - maybe just try them both and see what works better.
@@statquest thank you!
That SoftPlus toilet paper and the Chain Rule sound effect in every video lol
:)
So good!
bam!
That was very satisfying
Thank you!
AMAZING!
Thanks!
How would we deal with the derivative of this function? I've read that the derivative is 0 for x < 0 and 1 for x > 0, but I'm having issues in that when weights are initialized below zero, (something like -0.5), the derivative of the activation function is 0. The chain rule would then make the entire gradient for that weight 0, and the weight would just never change.
Yes, that's true, so you may have to try a few different sets of random numbers. However, in practice, we usually have much larger networks with lots of inputs and lots of connections to every node and a lot of data for training. In this case, because we sum all of the connections, having a few negative weights may, for some subset of the training data, result in 0 for the derivative, but not for all data.
@@statquest Ah, I see. My network had 1 neuron per layer, makes sense now
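A toy sketch of that answer: with a single connection and a negative weight, every input lands in the flat part of the ReLU, but a weighted sum over many connections is only negative for some of the data. All numbers here are made up:

```python
import numpy as np
rng = np.random.default_rng(0)

inputs = rng.uniform(0, 1, size=(100, 5))    # 100 samples, 5 features (made up)

single_w = -0.5
pre_single = inputs[:, 0] * single_w         # one connection, negative weight
print(np.mean(pre_single > 0))               # 0.0 -> the ReLU derivative is 0 for every sample

many_w = rng.normal(size=5)                  # several connections; some weights are negative
pre_sum = inputs @ many_w                    # the weighted *sum* over all connections
print(np.mean(pre_sum > 0))                  # usually > 0, so only a subset of the data hits 0
```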
I really like the great content you make that helps us understand such difficult topics.
Also, if you kindly include the formula generally used for the concept and break them down in the video, it would be immensely useful.
Thanks anyway.
I'm not sure I understand your question. This video discusses the formula for ReLU and breaks it down.
I desperately need a recurrent neural networks video :'(
Good Job btw, liked and subscribed as always
Thanks! I'm working on convolutional neural networks right now.
Just awesome...
Thank you! :)
You are a godsend :')
Thank you! :)
Brilliant as usual. Is there an SQ on softmax activation function in the pipeline?
Thank you and yes. Right now I'm working on Convolutional Neural Networks and image recognition, but after that (and perhaps part of that) we'll cover softmax.
@@statquest It would be great if you could visualize what happens to the data to show us how those methods work. Best regards,
Why does it make sense to use a ReLu at the end? Is it to reduce the complexity of the green squiggle from a curvy to a pointy squiggle?
It restricts the final output to be between 0 and 1.
You are great!
bam!
Thank you for the very clear explanations - this video series is wonderful in its capacity to communicate complex topics in a very clear and understandable manner. I have a question: does the use of the ReLu function for this network result in a less accurate model overall due to its piecewise linearity? It seems the more elegant curve has been replaced by a more simplistic linear-looking triangle. Would this mean a larger network would be needed in order for the model to be more accurate - so that the non-linear relationship between dosage and effectiveness can be modelled more accurately through a more complete/complex interaction of nodes?
That's a good question and I'm not 100% sure what the answer is other than, "it probably depends on the data". That said, ReLU, because of its simplicity, allows for much deeper neural networks (more hidden layers and more nodes per layer) than the sigmoid shape, and, as a result, allows for more complicated shapes to be fit to the data. ReLU's introduction to neural networks a little over 20 years ago made a huge impact on AI because "deep learning" wasn't possible with the sigmoid shape. In contrast, with its super simple derivative, ReLU allowed neural networks to model much more complicated datasets than ever before.
@@statquest Thanks for taking the time to answer my question, much appreciated :). I'm very much looking forward to watching the rest of your videos related to NNs and related concepts.
Hi, are you planning on making a video on Radial Basis Function Networks and self-organizing maps?
Especially with self-organizing maps it's very hard to find good resources; at least so far I've found nothing that could really help me wrap my head around this topic. I figured, since this seems like your kind of topic and you have a knack for explaining these things in an easy-to-understand way, there'd be no harm in asking :)
I have a video on the radial basis function with respect to Support Vector Machines, if you are interested in that. SVMs: ua-cam.com/video/efR1C6CvhmE/v-deo.html Polynomial Kernel: ua-cam.com/video/Toet3EiSFcM/v-deo.html and Radial Basis Function Kernel: ua-cam.com/video/Qc5IyLW_hns/v-deo.html
At 8:17, could you please explain in detail why it does not matter that ReLU is bent? How is it related to gradient vanishing/exploding? The reason I am asking is that if, like you said, we can get around the non-differentiability at the bent point by setting the gradient to 0, then it leads to a vanishing gradient during backpropagation, right? Thanks.
The gradient for ReLU is either 1 or 0. Thus, the gradient can not vanish unless every single value that goes through it is less than 0. If this happens, the node goes dark and becomes unusable, but this is rare if your data is relatively large.
@@statquest Thanks. That makes sense. I guess what I didn't understand was that you said in the video that "we can simply define the gradient at bent point to be 0 or 1". So the choice really does not make a difference? Thanks.
@@DED_Search It doesn't make a difference because the probability of having an input value equal to 0 *exactly* (and not 0.000001 or -0.00001) is pretty much 0. So it doesn't matter what value we give the derivative at *exactly* 0.
@@statquest thank you!
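A quick numerical sketch of that point, using an arbitrary weight and bias: with continuous inputs, the value fed into the ReLU is essentially never *exactly* 0, so the derivative we pick there never comes into play.

```python
import numpy as np
rng = np.random.default_rng(42)

pre_activations = rng.normal(size=1_000_000) * 1.7 - 0.85   # arbitrary weight and bias
print(np.sum(pre_activations == 0.0))                       # almost certainly prints 0
```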
Day 4 of leaving the comment and studying with statquest :)
:)
Is it only me that didn't understand where the values of the weights and biases come from? Why, for example, is the first weight w1 = 1.70?
The weights and biases come from backpropagation, which I talk about in Part 1 in this series, and then show a simple example of in Part 2 ua-cam.com/video/IN2XmBhILt4/v-deo.html and then go into more detail in these videos: ua-cam.com/video/iyn2zdALii8/v-deo.html ua-cam.com/video/GKZoOHXGcLo/v-deo.html
@@statquest thank you!! Back propagation video made it clear.
@@ThePanagiotisvm bam!
At 7:45 could you please tell why gradient descent wouldn't work for a bent line? Gradient descent videos didn't help clear this doubt.
Amazing video btw! Thanks
Gradient Descent needs to have a derivative defined at all points. Technically the bent line does not have a derivative when x = 0 (at the bend). However, at 8:03 I say that we just define the derivative at x = 0 to be either 0 or 1, and when we do that, gradient descent works just fine with the ReLU.
Hey Josh, Can you please please make videos on Recurrent Neural Network and Transformers
I'll keep that in mind.
Hi really love your work. Any plans on doing the Recurrent Neural networks including the modern RNN units (LSTM, GRU)?
I'll keep that in mind. I'm working on convolutional neural networks right now.
@@statquest Thanks. Appreciate your efforts 🙂
God bless you, I mean it.
Thank you! :)
Josh, could you please do a video on NLP and its implementation in Python? Would really love that.
And about the video, it is awesome as always!
I'll keep that in mind.
How do we know values won't become really high above 0? I thought the activation function contained the values in both negative and positive directions so that they wouldn't explode. Is that not a problem?
A much bigger problem is something called a "vanishing gradient", which is where the gradient gets very small and moves very slowly - too slowly to learn. ReLU helps eliminate that problem by having a gradient that is either 0 or 1.
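Here's a hedged sketch of that comparison (a toy 20-layer chain, not the network from the video): backpropagation multiplies one activation derivative per layer, so sigmoid derivatives (at most 0.25) shrink the product fast, while ReLU contributes 1s for positive inputs.

```python
import numpy as np

def sigmoid_deriv(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

x = 0.5                                   # some value flowing through each layer
layers = 20

sigmoid_chain = np.prod([sigmoid_deriv(x)] * layers)
relu_chain    = np.prod([1.0] * layers)   # ReLU derivative is 1 for positive inputs

print(sigmoid_chain)                      # tiny number -> the gradient vanishes
print(relu_chain)                         # 1.0 -> the gradient survives
```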
Triple BAM!!! 💥 💥 💥
:)
Fourple BAM!!!! (ik its not a word :p lol)
I gave "like" before watched it...
BAM! :)
Take me to your leader Josh 🤲
:)
There is an error at 3:15 because you change the dosage from 0.2 to 0.1 in the equation.
Oops!
@@statquest amazing video nonetheless
You did not show how the rest of the weights are updated. I need to understand how the derivative of the activation function affects the weight updates.
See this video for details and just replace the derivative of the softmax with 0 or 1 depending on the value for x: ua-cam.com/video/GKZoOHXGcLo/v-deo.html
I have been watching your quest on NNs for the past few days and the way you explain is good, but I didn't get one thing you said about adding 2 lines on a graph. How do we add 2 lines on a graph and find a third curve?
What time point, minutes and seconds, are you asking about?
Nice video. If ReLU always outputs a positive number, how can the neural network produce a negative sloping curve?
A weight that comes after the ReLU can be negative, and flip it over.
@@statquest So is it inefficient to end a neural network with a ReLU function? Because then we never allow the network to generate a negative slope.
Correct me if I'm wrong:
Input -> ReLU(X1) -> only positive outputs -> negative weights -> ReLU(X2) -> only positive final outputs.
I guess my real question is this: can negative weights followed by a ReLU function produce a negative slope?
Thanks!
@@nickmishkin4162 This video pretty much illustrates everything you want to know about the ReLU. Look at the shape of the function that comes out of the final ReLU at 5:35
@@statquest Yes! Didn't realize your nn ended with a ReLU. Thank you
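A small sketch of that exchange (made-up weights): the ReLU itself only outputs values >= 0, but a negative weight on the *next* connection flips the bent line, so the overall function can still slope downward.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

x = np.linspace(0, 1, 6)
hidden = relu(2.0 * x - 0.5)      # non-negative, increasing after the bend
output = -3.0 * hidden            # a negative weight flips it: now decreasing
print(output)                     # slope is negative wherever the ReLU is "on"
```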
Awesome.. Awaiting CNN from SQ ..
It should come out in the next few weeks.
Great video!! But might you explain more about how to estimate w1,w2,b1,b2,...? Thank you!
I explain how to estimate parameters (w1, w2, b1, b2 etc.) in Part 2 in this series (this is Part 3): ua-cam.com/video/IN2XmBhILt4/v-deo.html and in these videos: ua-cam.com/video/iyn2zdALii8/v-deo.html and ua-cam.com/video/GKZoOHXGcLo/v-deo.html
@@statquest Thank you so much!
Great video, but I don't really understand the curvy vs. bent ReLU discussion in the last part of the video.
What time point, minute and seconds, are you asking about?
@@statquest 7:42-8:15
@@muzammelmokhtar6498 Bent lines don't have derivatives where they are bent. This, in theory, is a problem for backpropagation, which is what we use to optimize the weights and biases. To get around this, we simply define a value for the derivative at the bend. For details on back propagation, see: ua-cam.com/video/IN2XmBhILt4/v-deo.html ua-cam.com/video/iyn2zdALii8/v-deo.html and ua-cam.com/video/GKZoOHXGcLo/v-deo.html
@@statquest ouh okay, thank you👍
Hi, you are really awesome. I would appreciate it if you did a video on maximum a posteriori (MAP) estimation.
I'll keep that in mind.
Is the derivative of SSR with respect to w1
-2(obs - pred) * ifelse(y1 * w3 + y2 * w4 + b3
I do the full backpropagation for all the parameters, including w1, in this video (just sub in the ReLU derivative for the SoftMax derivative): ua-cam.com/video/GKZoOHXGcLo/v-deo.html
@@statquest I was trying to use this video and sub in the ReLU as you suggested. I think that the ReLU at the end might be what is tripping me up. I haven't figured it out yet, but I will keep trying.
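In case it helps anyone trying the same substitution, here is a hedged sketch of the chain-rule term for w1 with the ReLU derivative subbed in, assuming the network layout discussed in this thread (one input, two hidden ReLU nodes with w1, b1 and w2, b2, output weights w3 and w4, bias b3, and a final ReLU). All parameter values are made up, and the analytic term is checked against a numerical gradient:

```python
relu = lambda x: max(0.0, x)
relu_d = lambda x: 1.0 if x > 0 else 0.0     # derivative convention at the bend

def predict(x, w1, b1, w2, b2, w3, w4, b3):
    y1, y2 = relu(w1 * x + b1), relu(w2 * x + b2)
    return relu(y1 * w3 + y2 * w4 + b3)

def d_ssr_d_w1(x, obs, w1, b1, w2, b2, w3, w4, b3):
    y1, y2 = relu(w1 * x + b1), relu(w2 * x + b2)
    final_in = y1 * w3 + y2 * w4 + b3
    pred = relu(final_in)
    # chain rule: dSSR/dpred * dpred/dfinal_in * dfinal_in/dy1 * dy1/d(w1*x+b1) * d(w1*x+b1)/dw1
    return -2 * (obs - pred) * relu_d(final_in) * w3 * relu_d(w1 * x + b1) * x

params = dict(w1=1.7, b1=-0.5, w2=-0.5, b2=0.3, w3=-2.0, w4=1.5, b3=0.6)  # made-up values
x, obs = 0.4, 1.0

analytic = d_ssr_d_w1(x, obs, **params)
eps = 1e-6
bumped = dict(params, w1=params["w1"] + eps)
numeric = ((obs - predict(x, **bumped)) ** 2 - (obs - predict(x, **params)) ** 2) / eps
print(analytic, numeric)   # the two values should agree closely
```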
thank you
:)
Can you make a video about Recursive Feature Elimination ? I like your video style.
I'll keep that in mind.
Hi Josh, thank you for the excellent videos. How is the input data split across the 2 hidden neurons?
I'm not sure I understand your question. However, I will say that input data simply follows the connections from the input node to the nodes in the hidden layers. For details, see: ua-cam.com/video/CqOfi41LfDw/v-deo.html
@@statquest ,Thanks Josh
Shameless Self Promotion is the funniest thing I've seen on YouTube!
BAM! :)
Tiny bam 😂 omfg I couldn't stop laughing. It's like Ryan Reynolds is explaining Neural Networks 😂
BAM! :)
Why are you using 2 nodes in the hidden layer? I mean, what is the purpose of solving it with 2 nodes when we could construct it with just 1 node? Please clarify.
The more nodes you have in a hidden layer, and the more hidden layers you have, the more complicated a dataset you can fit your neural network to. In this case, we have a simple dataset and I found, with trial and error, that we could fit it with just 2 nodes.
I want to know whether the weights in a neural network are a linear relation between the input and the nonlinear output, or something else?
The weights and biases are linear transformations. Remember, the equation for a line is y = slope * x + intercept, and we can replace the "slope" with the "weight" and the "intercept" with the "bias". So y = weight * x + bias = a linear transformation. All of the non-linearity comes from the non-linear activation functions.
@@statquest So are we using the weights in a neural network as a linear relation between the input and the output, and introducing non-linearity because the weighted sum of inputs and weights is only a linear combination? In other words, since the linear combination only shows a linear relation between input and output, we apply a nonlinear activation function to convert that linear relation into a nonlinear result?
f(x)=x is "connect". f(x)=0 is "disconnect". You can view ReLU as a switch. What is being connected and disconnected? Dot products (weighted sums). The funny thing is, if all the switch states are known, you can simplify many connected dot products into a single dot product.
@@nguyenngocly1484 Hey, thanks. Can you tell me why we use the weighted sum as the activation function's input? Can't we use the neuron's raw input as the activation function's input in a neural network?
@@Anujkumar-my1wi The "weighted sum" is simply the weights times the input values. If you have multiple connections to a neuron, then each one has its own weight - just like the connections to the final ReLU function in this video.
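A small sketch of the point above, with made-up matrices: stacking weights and biases without an activation function collapses to a single linear transformation, which is why the non-linearity really does have to come from the activation functions.

```python
import numpy as np
rng = np.random.default_rng(1)

W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.normal(size=2)

two_linear_layers = W2 @ (W1 @ x + b1) + b2
one_linear_layer  = (W2 @ W1) @ x + (W2 @ b1 + b2)        # collapses to one linear map
print(np.allclose(two_linear_layers, one_linear_layer))   # True

with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2        # no longer collapsible
print(with_relu)
```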
Really like your video, and your shirt is nice
Thank you so much 😀!
Hey Josh, how did you add the blue and orange lines at 5:24? I mean, how did that blue line with negative y values end up with positive y values (do we need to multiply it by something that is not shown in the video)? Hopefully you reply soon :)
We added the y-axis coordinates of the two lines. However, those lines are not exactly to scale, so that might be confusing you.
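In case it helps, here is a rough sketch of what "adding the y-axis coordinates" of two scaled bent lines looks like (toy parameters, not the exact values from 5:24):

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
dosage = np.linspace(0, 1, 11)

blue   = -1.5 * relu( 2.0 * dosage - 0.4)   # one scaled bent line
orange =  2.5 * relu(-1.0 * dosage + 0.9)   # another scaled bent line

green_squiggle = blue + orange              # simple y-coordinate addition
print(green_squiggle)
```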
If you add an activation function to the final layer, doesn't it restrict the possible output values? (i.e. if you add ReLU to the last layer before the output, doesn't it restrict outputs to positive values? Similarly, if you use sigmoid, doesn't it restrict the output to between [0, 1]?)
I'm a little confused by your question. Are you asking about what happens at 5:35 ?
I have a question: will the y output of a ReLU function always be the same as the x input if the x input is greater than 0?
Yes
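A tiny sketch tying the last couple of questions together (plain NumPy, nothing video-specific): for x > 0 the ReLU passes x through unchanged, so a ReLU on the final layer can only produce non-negative outputs.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                 # negatives become 0; positive inputs come out unchanged
print(np.all(relu(x) >= 0))    # True -> a final ReLU restricts outputs to values >= 0
```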