Don't we use ln(1+exp(x)) instead of the real ReLU in practice? As far as I know, it's differentiable everywhere (and its derivative is super easy to calculate), has a similar shape to ReLU, and so on.
@Yunchan Hwang We actually want that exact 0 output from ReLU: it's valuable because it gives sparse outputs and sparse gradients. With your function you can't 'deactivate' a path (only push it very close to 0, which is quite different). You also have to consider computation time: max(0, x) is far easier to compute than ln(1+exp(x)).
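A tiny sketch of the trade-off being discussed (plain Python, no framework assumed): ReLU outputs an exact 0 for negative inputs, while softplus only approaches 0.

```python
import math

def relu(x):
    # Exact zero for x <= 0: units can be fully switched off,
    # which is what gives sparse outputs and sparse gradients
    return max(0.0, x)

def softplus(x):
    # ln(1 + e^x): smooth and differentiable everywhere,
    # but strictly positive, so a unit is never exactly "off"
    return math.log1p(math.exp(x))

print(relu(-3.0))      # 0.0
print(softplus(-3.0))  # ~0.0486, close to zero but never exactly zero
```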
So if ReLU is best for hidden layers and softmax/linear is best for the output, what is best for the input layer? Sorry, I'm new, but your video makes a lot of sense.
Just divide the input data into small separate streams and feed each one into a single node, and a linear function would get around 80 percent accuracy no matter what data you feed into it. The fewer data points you feed into each node, the smaller the error to solve. The problem of finding a better activation function is really a problem of solving too much data per node. Make a parallel-processing network that only breaks down a small problem per node and the error could stay low no matter what. I don't think the real neurons in the brain get the entire feed of the photoreceptor neurons all at once; rather, each neuron solves a tiny piece of data, being fed only a small part of a big problem instead of the whole input at once.
Thanks, my biological neural network now has learned how to choose activation functions!
awesome
Hahahah
Remember, the whole is not just the sum of its parts; the behaviour of the whole differs from that of its elements.
Great video, super helpful!
thx Dan love u
You are both awesome
You are both awesome
I absolutely love the energy you both have in your videos :)
Be soo cool if both did a collab video!
From experience I'd recommend in order, ELU (exponential linear units) >> leaky ReLU > ReLU > tanh, sigmoid. I agree that you basically never have an excuse to use tanh or sigmoid.
I'm using tanh, but I always read saturated neurons as 0.95 or -0.95 while backpropagating so the gradient doesn't disappear.
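A minimal sketch of that trick (hypothetical helper name, plain Python): clamp the activation used in the backward pass so the derivative term 1 - a² never fully vanishes.

```python
import math

def tanh_backward_activation(x, limit=0.95):
    # Read saturated outputs as +/-0.95 instead of +/-1.0 so the
    # derivative term (1 - a**2) used in backprop never reaches zero
    a = math.tanh(x)
    return max(-limit, min(limit, a))

a = tanh_backward_activation(10.0)  # heavily saturated input
print(a, 1.0 - a ** 2)              # 0.95, derivative term stays ~0.0975
```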
just watched your speech @TNW Conference 2017, I am really happy that you are growing every day, You are my motivation and my idol. proud of you love you
thx stevey love u
I love you man, 4 f***** months passed and my stupid prof couldn't explain it as you did, not even partially. Keep up the good work.
Thanks a lot
Dude! DUUUDE! You are AMAZING! I've read multiple papers already, but now the stuff is really making sense to me!
I really like your videos as they strike the very sweet spot between being concise and precise!
Really enjoyed the video as you add subtle humor in between.
Excellent and entertaining at a high level of entropy reduction. A fan.
Amazing video! Thank you! I'd never heard of neural networks until I started my internship. This is really fascinating.
hey Siraj- just wanted to say thanks again. Apparently you got carried away and got busted being sneaky w crediting. I still respect your hustle and hunger. I think your means justify your ends- if you didn't make the moves that you did to prop up the image etc, I probably wouldn't have found you and your resources. At the end of the day, you are in fact legit bc you really bridge the gap of 1) knowing what ur talking about (i hope) 2) empathizing w someone learning this stuff (needed to break it down) 3) raising awareness about low hanging fruit that ppl outside the realm might not be aware of. Thank you again!!!!
this guy needs more subs. Finally a good explanation. Thanks man!
Valuable introduction to generative methods for establishing meaning in artificial intelligence. A great way of bringing things together and expressing them in one single, accessible language.
Thanks Siraj Raval, great!
Learning more from your videos than all my college classes together!
Dank memes and dank learning, both in the same video. Who would have thought. Thanks Raj!
1. The (activation) value of a neuron should be between 0 and 1, right? ReLU has a leaky minimum around 0; shouldn't ReLU also have a (leaky) maximum around 1?
2. Is there one best activation function, delivering the best neural network with the least amount of effort, like the amount of tests needed, and computer power?
3. Should weights and biases be between 0 and 1 or between -1 and 1? Or any different values?
4. Against vanishing and exploding gradients: can this be prevented with a (leaking) correction minimum and maximum for the weights and biases? There would be some symmetry then with the activation function suggested in the first paragraph.
I gained a lot of understanding and got that "click" moment after you explained linear vs non linearity. Thanks man. Keep up w/ the dank memes. My dream is that some day, I'd see a collab video between you, Dan Shiffman, and 3Blue1Brown. Love lots from Philippines!
"I can't control the gradient", the best part of the video.
If we use GA we do not need differentiable activation functions , inclusive we can build our own function.The issue is the back propagation method , this limits the activation functions
Wow, man, this is a seriously amazing video. Very entertaining and informative at the same time. Keep up great work! I'm now watching all your other videos :)
8:44 I liked this motto on the wall.
Cool. Your lecture cleared the cloud in my brain. I now have better understanding about the whole picture of the activation function.
Still can't decide if I like the number of memes in these videos. It's humorous of course and I did grow up on the internet, but I'm trying to learn a viciously hard subject and they are somewhat distracting. I suppose it helps the less-intrinsically-motivated keep watching, and I can always read more about it elsewhere, as these videos are more like cursory summaries. Great channel.
this is a well thought out comment. so is the reply to it i see. making them more relevant and sparse should help. ill do that
Siraj I agree with Jotto. I enjoy them, but at some critical points in the video I found myself replaying several times as the first time through I was a little distracted.
i read papers and articles... but a 10 min video helped me more than all of that :D
@@SirajRaval It keeps it fresh and helps me remember. I find I remember things you say by remembering the joke! Relu, relu, relu....
Now I understood why we are using these activation functions. Til now I was just using them; now I know why I'm using them. Thanks Siraj!
Update: There is another activation function called "elu" which trains faster than "relu". Try it out guys! :D
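For reference, ELU is simple to define (a plain-Python sketch, not a tuned implementation):

```python
import math

def elu(x, alpha=1.0):
    # Identity for x > 0; for x <= 0 it saturates smoothly toward -alpha,
    # so negative inputs still produce a nonzero output and gradient
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(elu(2.0))   # 2.0
print(elu(-2.0))  # ~-0.865, saturating toward -1.0
```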
Super clear & concise. Amazing simplicity. You Rock !!!
I love watching these videos, even if I don't understand 90% of what he is saying.
Hey Siraj, here is a great trick: show us a neural net that can perform inductive reasoning! Great videos as always, keep them coming! Learning so much!
thx will do
@Siraj
NN can potentially grow in so many directions, you will always have something to explain to us.
As you used to say 'this is only the beginning'.
And ohh maaan ! you're so clear when you explain NN ;)
Please keep doing what you're doing again and again and again...and again !
You are to NNs what Neil deGrasse Tyson is to astrophysics.
Thanks for sharing the GitHub source that details each activation function.
By far the best videos of Machine Learning Ive watched. Amazing work! Love the energy and Vibe!
Super Siraj Raval!!!!! Great compilation Bro.
Sir, likes for your memetics and fun explanation! All the spice you add to this video might bring some tech kids like me to the realm of Machine Learning!
(And today, a mysterious graph sheet with the plot of max(0,x), a.k.a. the ReLU function, appeared in my High School Maths notebook, between the pages about piecewise functions, after I got up and arrived at school.)
Crystal clear explanation, just loved it
this guy makes learning so much fun!
I've been wondering what loss function to use D: Can you make a video for loss functions pls :)
Crashed2DesktoP this is a little less generically answerable than which activation.
For standard tasks there are a few loss functions available: binary cross-entropy and categorical cross-entropy for classification, mean squared error for regression. But more generally, the cost function encodes the nature of your problem. Once you go deeper and your problem fleshes out a bit, the exact loss you use might change to reflect your task. Custom losses might reflect auxiliary learning tasks, domain-specific weights, and many other things. Because of this, "which loss should I use" is quite close to asking "how should I encode my problem", and so can be a little trickier to answer beyond the well-studied settings.
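The standard pairings mentioned above, sketched as plain Python (simplified per-example versions; real frameworks add batching and numerical-stability tricks):

```python
import math

def binary_cross_entropy(y_true, p):
    # Binary classification: y_true in {0, 1}, p = predicted probability
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(y_true, probs):
    # Multi-class classification: y_true is one-hot, probs sums to 1
    return -sum(t * math.log(p) for t, p in zip(y_true, probs) if t)

def mean_squared_error(y_true, y_pred):
    # Regression: average squared distance to the target
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(binary_cross_entropy(1, 0.9))                           # ~0.105
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))  # ~0.223
print(mean_squared_error([1.0, 2.0], [1.5, 2.0]))             # 0.125
```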
Sina Samangooei thanks for your answer. It is useful for me too. 😃
hmm. sina has a good answer but more vids similar to this coming
Log-likelihood cost function with a softmax output layer for classification.
omg, this is the first time i am seeing his video and its quite entertaining
Your channel is GOLD!
Great video! Also make a video on How to choose the number of hidden layers and number of nodes in each layer?
will do thx
If I understand the subject right, in theory you only ever need one hidden layer, because of the universal approximation theorem.
Please do a detailed video regarding the difference between multilayer neural network and deep neural network and the evolution. Pleeeease!
Very helpful video, thanks a lot. To introduce non-linearities we add activation functions, but how does ReLU, which looks linear, outperform other non-linear functions? Can you please give the correct intuition behind this? Thanks in advance :)
But isn't ReLU a linear function? You mentioned at the beginning that linear activations should be avoided, since non-linear functions make both backpropagation and classifying data points that don't fit a single hyperplane work better.
Or did I get the whole thing wrong?
It's not linear because any -X sits at zero on the Y axis. "Linear" basically means "straight line". The ReLU line is bent, hard, at 0. So it's linear if you're only looking at > 0 or < 0, but if you look at the whole line it's kinked in the middle, which makes it non-linear.
it is a piece-wise linear function which is essentially a nonlinear function. For more info, google "piece-wise linear functions".
The sparsity of the activations adds to the non-linearity of the neural net.
@@10parth10 that explanation helped. Thanks
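The kink argument above can be checked directly: a linear function f must satisfy f(a + b) = f(a) + f(b), and ReLU fails this (a minimal check in plain Python):

```python
def relu(x):
    return max(0.0, x)

a, b = 3.0, -5.0
print(relu(a + b))        # relu(-2.0) -> 0.0
print(relu(a) + relu(b))  # 3.0 + 0.0 -> 3.0: additivity fails, so not linear
```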
your teaching way is so cool and crazy :)
8x better than my data mining professor, thank you 🙏
Just gotta say Siraj. You are amazing because i only understand half of what you say.
thx keep watching
Love this video so much. Helped me so much with my LSTM RNN network
Another question is what the difference is if I use more hidden layers or more hidden neurons.
I think that at this moment there isn't a clear-cut approach to choosing the NN architecture.
More layers makes learning very slow compared to more neurons. Before training, all the biases will overcome the inputs and make the output side of the network static. It takes a long time to get past that.
Maybe you should limit the starting biases so you can get past that phase quicker. I always initialize biases between 0 and 0.5.
2 Hidden layers are enough.
Depends on the situation: a simple text-recognition kind of thing is fine with 2 layers, but something like a convolutional neural network may have to have 10. For the majority of things these days, though, 2 is plenty.
despised the stale memes. loved the explanation
Excellent, as usual.
I think that the reason ReLU hasn't been popular until now is that it is mathematically inelegant, in that it can't be used in commutable functions, while a sigmoid function can.
It does beg the question though: if ReLU is being used, do we need the backpropagation algorithm at all? Perhaps some simpler recursive algorithm could be used.
While the activation function must be non-linear, neural nets store weights as binary numbers. If the input range is small enough, you can store each activation function value in a lookup table. In other words, for every possible x, given a function f(x), simply precompute value_table[x] = f(x). The time it takes to calculate the activation function becomes zero for all intents and purposes, no matter how complex it might be. In days when I can purchase gigabytes of memory for a couple of hundred bucks, it's hard to see why anyone would embed a hyperbolic function calculation in their innermost loops. Even a modified ReLU function requires more work than a simple table lookup. Furthermore, a simple table lookup can be much more easily coded into a matrix library calculation.
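A sketch of that lookup-table idea (the input range and table size here are hypothetical; a real implementation would tune both): precompute tanh over a clamped, quantized input range and index into it instead of evaluating the function.

```python
import math

LO, HI, N = -8.0, 8.0, 256  # assumed input range and table resolution
STEP = (HI - LO) / N
TABLE = [math.tanh(LO + i * STEP) for i in range(N)]

def tanh_lookup(x):
    # Clamp into range, quantize to a bucket, and read the stored value
    # instead of evaluating the hyperbolic function in the inner loop
    x = min(max(x, LO), HI - STEP)
    return TABLE[int((x - LO) / STEP)]

print(abs(tanh_lookup(0.5) - math.tanh(0.5)))  # small quantization error
```

The cost is a fixed quantization error, traded for a memory read in place of a transcendental-function call.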
Dude.... exactly what i needed.. Thanks again!
Thanks @Siraj. What an amazing and easy-to-digest explanation.
siraj you are a good ai teacher
Hey Siraj, you missed the Swish function, which returns very tiny negative values for negative inputs. Probably you made this video before the Swish function came into existence.
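For reference, Swish is easy to sketch: x · sigmoid(βx), which dips slightly negative for small negative inputs instead of clipping to zero.

```python
import math

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): near-identity for large positive x,
    # slightly negative (not zero) for small negative x
    return x / (1.0 + math.exp(-beta * x))

print(swish(-1.0))  # ~-0.269: a tiny negative value, unlike ReLU's 0
```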
This video is very easy to understand!
"Leep Dearning": thank you for this one LOL
Yes!!! A new episode. SWEET!!! Thanks Siraj.
hard humor with gifs and memes makes me lose track of what Siraj is saying and had to rewind a bit ... LoL :)
u covered half of what my ai principles course covered on learning in 3 and half hrs in 8 mins. nice
digging your vids and enthusiasm from Portland Oregon!
Siraj, ur videos inspired me to study machine learning. I've been learning python for the past month, and am looking to start playing around with more advanced stuff. Do you have any good book recommendations for machine or deep learning, or online resources that beginners should start with?
awesome. watch my playlist Learn Python for Data Science
Siraj Raval Do you have videos on matlab using nn?
Great explanation of activation functions. Now I need to tweak my model.
So the slide at 4:00 says "Activation functions should be differentiable", but the conclusion of the video is that you should use the ReLU activation function, which is not differentiable. (Great video btw.)
Daniel O'Connor 2 years later and you've probably figured it out, but ReLU is non-differentiable only at exactly x=0, which is really rare in practice.
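In practice frameworks simply pick a value at that single point. A sketch of the common subgradient convention (plain Python, illustrative only):

```python
def relu_grad(x):
    # Derivative of max(0, x) everywhere except x == 0, where it is
    # undefined; by convention we return 0 there (some libraries pick 1)
    return 1.0 if x > 0 else 0.0

print(relu_grad(2.5), relu_grad(-2.5), relu_grad(0.0))  # 1.0 0.0 0.0
```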
Hard stuff made easy. Congrats to a great video! Keep it up, mate!
Woah ! thanks man, you made things so clear !!!
Entire video is a GEM 💎
Totally makes sense to use ML
You said that linear activations can't solve more complex problems. However, ReLU itself is a linear function (with the negative part discarded). So how does the ReLU activation function help learn the non-linear behavior of a system? Also, how does ReLU address the exploding gradient problem?
I don't get how a "linear" activation function adds non-linearity to a network.
As I see it, it just switches some neurons off. Couldn't we basically use dropout, followed by multiplying each neuron by a random value,
to get the same effect as a ReLU-activated dense layer?
While it's true that ReLU is used in a lot of deep neural networks, I think it's far from true that you should always use ReLU.
In classification ReLU is used because, e.g., we want to "switch off" the unimportant parts of the picture; there ReLU is nearly perfect.
Thanks Siraj. Awesome explanation.
I'm new to deep learning. It would be great if you could make videos about regularization and cost functions.
Hello Sir Siraj, would you please answer the following questions?
I am a PhD student. I would like to use the artificial neural networks approach to measure the productive efficiency of 31 farms, and their production and cost functions as well. My questions are:
Are 31 farms enough for applying this approach? What is the minimum number of observations that must be obtained in the research?
Is it a sufficient number, or do I need more farms?
Note that I have many independent variables that affect the outcome.
What type of activation functions should I use for the efficiency measurement and for estimating the production and cost functions?
Many thanks in advance.
Awesome video. Can you explain a bit more why we aren't using an activation function in the output layer?
Hi Siraj:
Your videos are great!
CONGRATULATIONS!
Great demonstration 🙏...I have a question. How to choose the number of neurons?.. please explain 🙏
Noticed a tiny issue in your vid. When talking about the tanh function, you included python code that suggests the derivative is equal to 1-x^2. If I’m correct, the derivative is equal to 1-tanh(x)^2
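The comment above is right: d/dx tanh(x) = 1 - tanh(x)². (Code that computes 1 - a² is only correct when a is the already-activated output tanh(x) rather than the raw input, a common convention in tutorial code.) A quick numerical check in plain Python:

```python
import math

def tanh_prime(x):
    # Correct derivative of tanh in terms of the raw input x
    return 1.0 - math.tanh(x) ** 2

x, h = 0.7, 1e-6
# Central-difference estimate of the derivative for comparison
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
print(abs(tanh_prime(x) - numeric))  # ~0: the analytic form agrees
```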
Quick note; if you're playing with autoencoders, use tanh or elu. Softplus and relu fail to converge, and sigmoid is too slow.
This is definitely one of your funnier videos.
hi Siraj,
you nailed it in a very short period of time. Loved it. Please keep it up. Cheers....
Great video! Except LSTM doesn't use ReLU
rarely
The leaky ReLU transfer function looks like the V-I curve of an approximated diode.
There are plenty of reasons to use a tanh activation at times, you can't just say never to use something - that's a gross simplification.
Yes, for very large networks with many layers, ReLU is best because of vanishing gradients, but that's not necessarily true for smaller networks. And for RNNs? Just use LSTM or GRU instead of regular RNNs, then you don't have that problem.
I have found that tanh works the best for GRUs
Thank you Sir for this wonderful video, I have a question. How are the basis functions determined in practice? Why do you choose the Gaussian function as the basis function?
Today I did some testing. I created a neural network that has some dense layers, then some LSTM layers, then some dense layers again, with linear units on top, since this was a regression problem. Then I tried algorithmic hyperparameter optimisation that involved an activation-function search. As it turns out, the best performing network had relu as the activation for the initial dense layers and tanh for the rest of the network. The second best (the loss was almost identical) had tanh first, then relu for the LSTM layers, then tanh again for the last dense layers. Sigmoid ruined pretty much everything, and relu didn't really work for the last dense layers. My guess is that relu doesn't work that well with negative numbers (I think). I may be wrong.
The presentation is good. I learned how to choose an activation function; thanks for the video, it helped a lot.
What's a good way to test if your neurons are dying? Any heuristics to check?
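One simple heuristic (a sketch with made-up data; in a real model you would record the layer's activations over a validation batch): track which units output exactly 0 on every example, since a unit that never fires has likely died.

```python
import random

def relu(x):
    return max(0.0, x)

def dead_units(activations):
    # activations[i][j] = output of unit j on example i;
    # a unit is "dead" if it outputs 0 on every example in the batch
    n_units = len(activations[0])
    return [j for j in range(n_units)
            if all(row[j] == 0.0 for row in activations)]

random.seed(0)
# Simulated batch of 64 examples: unit 2's pre-activations are always negative
batch = [[relu(random.uniform(-1, 1)),
          relu(random.uniform(-1, 1)),
          relu(random.uniform(-1, 0))] for _ in range(64)]
print(dead_units(batch))  # [2]: only the always-negative unit is dead
```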
Excellent explanation!!! You're really funny and I loved the way you explain things. Thank you!!!
Well explained. If I have a dataset X = (3.3, 1.6), (7.5, 48.2), (100, 20)... with targets t = {0, 2, ..., 4.5}, and I use a feedforward network to learn the function t = F(x), which activation should I choose for this learning task, and why? I'd really appreciate it if someone could help me.
But ReLU is still piecewise linear and not differentiable at zero, so you can't compute a gradient there — it combines that issue with your previous point about linear functions being bad.
Loving the KEK :) Awesome Siraj :) Can you do a piece on CFR+ and its geopolitical implications?
Initially we apply activation functions to squash the output of each neuron into the range (0, 1) or (-1, 1). But for ReLU the range is (0, x), and x can be arbitrarily large. Can you please give the correct intuition behind this? Thanks in advance :)
Awesome explanation. +1 for creating such a big shadow over the Earth.
I don't see how ReLU avoids the vanishing gradient problem. The entire left side gives a value of zero and a gradient of zero! Maybe it depends on the data?
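To the question above: ReLU doesn't shrink gradients on the active (positive) side — the per-layer factor is exactly 1, versus at most 0.25 for sigmoid — though it's true that a unit stuck on the negative side gets zero gradient (the "dying ReLU" case). A toy sketch, with illustrative numbers:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # never exceeds 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 while the unit is active

# Gradient factor surviving 20 stacked activations at one (active) point:
z, depth = 1.5, 20
print("sigmoid:", sigmoid_grad(z) ** depth)  # shrinks toward zero
print("relu:   ", relu_grad(z) ** depth)     # stays 1.0
```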
The problem with ReLU is that it returns 0 if the number is less than 0... If I'm working with unsigned hexadecimal numbers there are no negative values, so ReLU is effectively just wasted CPU time jumping in and out of an unnecessary function, because it will always return the input value.
This channel is gold! Thanks
Hi Siraj, you mentioned that activation functions should be differentiable, but from my understanding ReLU is not. I was wondering how this affects backpropagation in our neural net.
From the math point of view it's not. But the only point where it's not differentiable is at 0, where you simply declare the gradient to be 0 (or the gradient of the identity). It doesn't matter much because you're using float32 in an optimization problem, so you're very unlikely to land exactly on 0. Just approximate it.
The purpose of ReLU is to produce sparse outputs and sparse gradients; it allows the network to 'activate paths'.
stackoverflow.com/questions/30236856/how-does-the-back-propagation-algorithm-deal-with-non-differentiable-activation
It doesn't matter in practice. You can return 0 or 1 when the input is at the non-differentiable point and it would do fine. Remember that neural networks are just approximators. Its algorithm is plain simple and dumb but it does the job.
Don't we use ln(1+exp(x)) instead of the real ReLU in practice? As far as I know, it's differentiable (and its derivative is super easy to calculate), has a similar shape to ReLU, and so on.
@Yunchan Hwang We actually want that 0 output from ReLU: it's valuable because it gives sparse outputs and gradients. With your function you can't 'deactivate' a path (only push it very close to 0, which is quite different). Also consider computation time: max(0, x) is far cheaper to compute than ln(1+exp(x)).
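A quick numeric sketch of that trade-off: softplus ln(1+exp(x)) tracks ReLU closely for large inputs but never reaches exactly zero, so it can't give truly sparse activations.

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    # log1p(exp(x)) avoids precision loss when exp(x) is small
    return math.log1p(math.exp(x))

# softplus hugs relu for large x but stays strictly positive for x < 0
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  softplus={softplus(x):.4f}")
```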
Great video Siraj. Keep up the good work
thx love u
Awesome, thanks. Could follow it all the way through at full speed 👍
thx for feedback
I died at the Einstein meme
So if ReLU is best for hidden layers and softmax/linear is best for the output, what is best for the input layer? Sorry, I'm new, but your video makes a lot of sense.
Thanks, for the video!
I have a question: Why should't I use tanh?
tanh suffers from the vanishing gradient problem, i.e. the gradients become so small that the weights barely change. We use ReLU because it doesn't saturate for positive inputs.
Are you serious, he literally just told you.
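To make that reply concrete: tanh'(x) = 1 − tanh²(x) peaks at 1 (at x = 0) and collapses once a unit saturates, and backprop multiplies one such factor per layer. A tiny sketch (the layer count and input value are illustrative):

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # tiny once |x| is large (saturation)

# Product of derivative factors through ten saturated tanh layers:
grad = np.prod([tanh_grad(2.0) for _ in range(10)])
print(grad)   # vanishingly small
```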
Siraj great video. Your views about Parametric Rectified Linear Unit (PReLU)?
Just divide the input data into small separate data streams and feed each one into a single node; the linear function would get around 80 percent accuracy no matter what data you feed into it. The fewer data points you feed into each node, the smaller the error to solve. The problem of finding a better activation function is really a problem of solving too much data per node. Make a parallel processing network that only breaks down a small problem per node, and the error could stay low no matter what. I don't think the real neurons in the brain get the entire feed of the photoreceptor neurons all at once; rather, each real neuron solves a tiny piece of data by being fed only a small part of a big problem, not the whole input at once.