Don't we use ln(1+exp(x)) instead of the real ReLU in practice? As far as I know, it's differentiable everywhere (and its derivative is super easy to calculate), has a similar shape to ReLU, and so on.
@Yunchan Hwang We actually want that exact 0 output from ReLU: it's valuable because it gives sparse outputs and sparse gradients. With your function you can't 'deactivate' a path (only push it very close to 0, which is quite different). You also have to consider computation time: max(0, x) is far easier to compute than ln(1+exp(x)).
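A tiny sketch of the trade-off being discussed (plain Python, no framework assumed): ReLU outputs an exact 0 for negative inputs, while softplus only approaches 0.

```python
import math

def relu(x):
    # Exact zero for x <= 0: units can be fully switched off,
    # which is what gives sparse outputs and sparse gradients
    return max(0.0, x)

def softplus(x):
    # ln(1 + e^x): smooth and differentiable everywhere,
    # but strictly positive, so a unit is never exactly "off"
    return math.log1p(math.exp(x))

print(relu(-3.0))      # 0.0
print(softplus(-3.0))  # ~0.0486, close to zero but never exactly zero
```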
So if ReLU is best for hidden layers and softmax/linear is best for the output, what is best for the input layer? Sorry, I'm new, but your video makes a lot of sense.
Just divide the input data into small separate streams and feed each one into a single node, and a linear function would get around 80 percent accuracy no matter what data you feed into it. The fewer data points you feed into each node, the smaller the error to solve. The problem of finding a better activation function is really a problem of solving too much data per node. Make a parallel-processing network that only breaks down a small problem per node and the error could stay low no matter what. I don't think the real neurons in the brain get the entire feed of the photoreceptor neurons all at once; rather, each neuron solves a tiny piece of data, being fed only a small part of a big problem instead of the whole input at once.
Thanks, my biological neural network now has learned how to choose activation functions!
awesome
Hahahah
Remember, the whole is not just the sum of its parts; the behaviour of the whole differs from that of its elements.
Great video, super helpful!
thx Dan love u
You are both awesome
You are both awesome
I absolutely love the energy you both have in your videos :)
Be soo cool if both did a collab video!
From experience I'd recommend in order, ELU (exponential linear units) >> leaky ReLU > ReLU > tanh, sigmoid. I agree that you basically never have an excuse to use tanh or sigmoid.
I'm using tanh, but I always read saturated neurons as 0.95 or -0.95 while backpropagating so the gradient doesn't disappear.
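A minimal sketch of that trick (hypothetical helper name, plain Python): clamp the activation used in the backward pass so the derivative term 1 - a² never fully vanishes.

```python
import math

def tanh_backward_activation(x, limit=0.95):
    # Read saturated outputs as +/-0.95 instead of +/-1.0 so the
    # derivative term (1 - a**2) used in backprop never reaches zero
    a = math.tanh(x)
    return max(-limit, min(limit, a))

a = tanh_backward_activation(10.0)  # heavily saturated input
print(a, 1.0 - a ** 2)              # 0.95, derivative term stays ~0.0975
```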
just watched your speech @TNW Conference 2017, I am really happy that you are growing every day, You are my motivation and my idol. proud of you love you
thx stevey love u
I love you man, 4 f***** months passed and my stupid prof couldn't explain it as you did, not even partially. Keep up the good work.
Thanks a lot
Dude! DUUUDE! You are AMAZING! I've read multiple papers already, but now the stuff is really making sense to me!
I really like your videos as they strike the very sweet spot between being concise and precise!
Really enjoyed the video as you add subtle humor in between.
Excellent and entertaining at a high level of entropy reduction. A fan.
Amazing video! Thank you! I'd never heard of neural networks until I started my internship. This is really fascinating.
hey Siraj- just wanted to say thanks again. Apparently you got carried away and got busted being sneaky w crediting. I still respect your hustle and hunger. I think your means justify your ends- if you didn't make the moves that you did to prop up the image etc, I probably wouldn't have found you and your resources. At the end of the day, you are in fact legit bc you really bridge the gap of 1) knowing what ur talking about (i hope) 2) empathizing w someone learning this stuff (needed to break it down) 3) raising awareness about low hanging fruit that ppl outside the realm might not be aware of. Thank you again!!!!
this guy needs more subs. Finally a good explanation. Thanks man!
Valuable introduction to generative methods for establishing meaning in artificial intelligence. A great way of bringing things together and expressing them in one single, accessible language.
Thanks Siraj Raval, great!
Learning more from your videos than all my college classes together!
Dank memes and dank learning, both in the same video. Who would have thought. Thanks Raj!
1. The (activation) value of a neuron should be between 0 and 1, right? ReLU has a leaky minimum around 0; shouldn't ReLU also have a (leaky) maximum around 1?
2. Is there one best activation function, delivering the best neural network with the least amount of effort, like the amount of tests needed, and computer power?
3. Should weights and biases be between 0 and 1 or between -1 and 1? Or any different values?
4. Against vanishing and exploding gradients: can this be prevented with a (leaking) correction minimum and maximum for the weights and biases? There would be some symmetry then with the activation function suggested in the first paragraph.
I gained a lot of understanding and got that "click" moment after you explained linear vs non linearity. Thanks man. Keep up w/ the dank memes. My dream is that some day, I'd see a collab video between you, Dan Shiffman, and 3Blue1Brown. Love lots from Philippines!
"I can't control the gradient", the best part of the video.
If we use GA we do not need differentiable activation functions , inclusive we can build our own function.The issue is the back propagation method , this limits the activation functions
Wow, man, this is a seriously amazing video. Very entertaining and informative at the same time. Keep up great work! I'm now watching all your other videos :)
8:44 I liked this motto on the wall.
Cool. Your lecture cleared the cloud in my brain. I now have better understanding about the whole picture of the activation function.
Still can't decide if I like the number of memes in these videos. It's humorous of course and I did grow up on the internet, but I'm trying to learn a viciously hard subject and they are somewhat distracting. I suppose it helps the less-intrinsically-motivated keep watching, and I can always read more about it elsewhere, as these videos are more like cursory summaries. Great channel.
this is a well thought out comment. so is the reply to it i see. making them more relevant and sparse should help. ill do that
Siraj I agree with Jotto. I enjoy them, but at some critical points in the video I found myself replaying several times as the first time through I was a little distracted.
i read papers and articles... but a 10 min video helped me more than all of that :D
@@SirajRaval It keeps it fresh and helps me remember. I find I remember things you say by remembering the joke! Relu, relu, relu....
Now I understood why we are using these activation functions. Til now I was just using them; now I know why I'm using them. Thanks Siraj!
Update: There is another activation function called "elu" which trains faster than "relu". Try it out guys! :D
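For reference, ELU is simple to define (a plain-Python sketch, not a tuned implementation):

```python
import math

def elu(x, alpha=1.0):
    # Identity for x > 0; for x <= 0 it saturates smoothly toward -alpha,
    # so negative inputs still produce a nonzero output and gradient
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(elu(2.0))   # 2.0
print(elu(-2.0))  # ~-0.865, saturating toward -1.0
```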
Super clear & concise. Amazing simplicity. You Rock !!!
I love watching these videos, even if I don't understand 90% of what he is saying.
Hey Siraj, here is a great trick: show us a neural net that can perform inductive reasoning! Great videos as always, keep them coming! Learning so much!
thx will do
@Siraj
NN can potentially grow in so many directions, you will always have something to explain to us.
As you used to say 'this is only the beginning'.
And ohh maaan ! you're so clear when you explain NN ;)
Please keep doing what you're doing again and again and again...and again !
You are to NNs what Neil deGrasse Tyson is to astrophysics.
Thanks for sharing the GitHub source that details each activation function.
By far the best videos of Machine Learning Ive watched. Amazing work! Love the energy and Vibe!
Super Siraj Raval!!!!! Great compilation Bro.
Sir, likes for your memetics and fun explanation! All the spice you add to this video might bring some tech kids like me to the realm of Machine Learning!
(And today, a mysterious graph sheet with the plot of max(0,x), a.k.a. the ReLU function, appeared in my High School Maths notebook, between the pages about piecewise functions, after I got up and arrived at school.)
Crystal clear explanation, just loved it
this guy makes learning so much fun!
I've been wondering what loss function to use D: Can you make a video for loss functions pls :)
Crashed2DesktoP this is a little less generically answerable than which activation.
For standard tasks there are a few loss functions available: binary cross-entropy and categorical cross-entropy for classification, mean squared error for regression. But more generally, the cost function encodes the nature of your problem. Once you go deeper and your problem fleshes out a bit, the exact loss you use might change to reflect your task. Custom losses might reflect auxiliary learning tasks, domain-specific weights, and many other things. Because of this, "which loss should I use" is quite close to asking "how should I encode my problem", and so can be a little trickier to answer beyond the well-studied settings.
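The standard pairings mentioned above, sketched as plain Python (simplified per-example versions; real frameworks add batching and numerical-stability tricks):

```python
import math

def binary_cross_entropy(y_true, p):
    # Binary classification: y_true in {0, 1}, p = predicted probability
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(y_true, probs):
    # Multi-class classification: y_true is one-hot, probs sums to 1
    return -sum(t * math.log(p) for t, p in zip(y_true, probs) if t)

def mean_squared_error(y_true, y_pred):
    # Regression: average squared distance to the target
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(binary_cross_entropy(1, 0.9))                           # ~0.105
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))  # ~0.223
print(mean_squared_error([1.0, 2.0], [1.5, 2.0]))             # 0.125
```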
Sina Samangooei thanks for your answer. It is useful for me too. 😃
hmm. sina has a good answer but more vids similar to this coming
Log-likelihood cost function with a softmax output layer for classification.
omg, this is the first time i am seeing his video and its quite entertaining
Your channel is GOLD!
Great video! Also make a video on How to choose the number of hidden layers and number of nodes in each layer?
will do thx
If I understand the subject right, in theory you only ever need one hidden layer, because of the universal approximation theorem.
Please do a detailed video regarding the difference between multilayer neural network and deep neural network and the evolution. Pleeeease!
Very helpful video, thanks a lot. To introduce non-linearities we add activation functions, but how does ReLU, which looks linear, outperform other non-linear functions? Can you please give the correct intuition behind this? Thanks in advance :)
But isn't ReLU a linear function? You mentioned at the beginning that linear activations should be avoided, since non-linear functions make both backpropagation and classifying data points that don't fit a single hyperplane work better.
Or did I get the whole thing wrong?
It's not linear because any -X sits at zero on the Y axis. "Linear" basically means "straight line". The ReLU line is bent, hard, at 0. So it's linear if you're only looking at > 0 or < 0, but if you look at the whole line it's kinked in the middle, which makes it non-linear.
it is a piece-wise linear function which is essentially a nonlinear function. For more info, google "piece-wise linear functions".
The sparsity of the activations adds to the non-linearity of the neural net.
@@10parth10 that explanation helped. Thanks
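The kink argument above can be checked directly: a linear function f must satisfy f(a + b) = f(a) + f(b), and ReLU fails this (a minimal check in plain Python):

```python
def relu(x):
    return max(0.0, x)

a, b = 3.0, -5.0
print(relu(a + b))        # relu(-2.0) -> 0.0
print(relu(a) + relu(b))  # 3.0 + 0.0 -> 3.0: additivity fails, so not linear
```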
your teaching way is so cool and crazy :)
8x better than my data mining professor, thank you 🙏
Just gotta say Siraj. You are amazing because i only understand half of what you say.
thx keep watching
Love this video so much. Helped me so much with my LSTM RNN network
Another question is what the difference is if I use more hidden layers or more hidden neurons.
I think that at this moment there isn't a clear-cut approach to choosing the NN architecture.
More layers makes learning very slow compared to more neurons. Before training, all the biases will overcome the inputs and make the output side of the network static. It takes a long time to get past that.
Maybe you should limit the starting biases so you can get past that phase quicker. I always initialize biases between 0 and 0.5.
2 Hidden layers are enough.
Depends on the situation: a simple text-recognition kind of thing is fine with 2 layers, but something like a convolutional neural network may have to have 10. For the majority of things these days, though, 2 is plenty.
despised the stale memes. loved the explanation
Excellent, as usual.
I think that the reason ReLU hasn't been popular until now is that it is mathematically inelegant, in that it can't be used in commutable functions, while a sigmoid function can.
It does beg the question though: if ReLU is being used, do we need the backpropagation algorithm at all? Perhaps some simpler recursive algorithm could be used.
While the activation function must be non-linear, neural nets store weights as binary numbers. If the input range is small enough, you can store each activation function value in a lookup table. In other words, for every possible x, given a function f(x), simply precompute value_table[x] = f(x). The time it takes to calculate the activation function becomes zero for all intents and purposes, no matter how complex it might be. In days when I can purchase gigabytes of memory for a couple of hundred bucks, it's hard to see why anyone would embed a hyperbolic function calculation in their innermost loops. Even a modified ReLU function requires more work than a simple table lookup. Furthermore, a simple table lookup can be much more easily coded into a matrix library calculation.
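A sketch of that lookup-table idea (the input range and table size here are hypothetical; a real implementation would tune both): precompute tanh over a clamped, quantized input range and index into it instead of evaluating the function.

```python
import math

LO, HI, N = -8.0, 8.0, 256  # assumed input range and table resolution
STEP = (HI - LO) / N
TABLE = [math.tanh(LO + i * STEP) for i in range(N)]

def tanh_lookup(x):
    # Clamp into range, quantize to a bucket, and read the stored value
    # instead of evaluating the hyperbolic function in the inner loop
    x = min(max(x, LO), HI - STEP)
    return TABLE[int((x - LO) / STEP)]

print(abs(tanh_lookup(0.5) - math.tanh(0.5)))  # small quantization error
```

The cost is a fixed quantization error, traded for a memory read in place of a transcendental-function call.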
Dude.... exactly what i needed.. Thanks again!
Thanks @Siraj. What an amazing and easy-to-digest explanation.
siraj you are a good ai teacher
Hey Siraj, you missed the Swish function, which returns very tiny negative values for negative inputs. Probably you made this video before the Swish function came into existence.
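For reference, Swish is easy to sketch: x · sigmoid(βx), which dips slightly negative for small negative inputs instead of clipping to zero.

```python
import math

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): near-identity for large positive x,
    # slightly negative (not zero) for small negative x
    return x / (1.0 + math.exp(-beta * x))

print(swish(-1.0))  # ~-0.269: a tiny negative value, unlike ReLU's 0
```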
This video is very easy to understand!
"Leep Dearning": thank you for this one LOL
Yes!!! A new episode. SWEET!!! Thanks Siraj.
hard humor with gifs and memes makes me lose track of what Siraj is saying and had to rewind a bit ... LoL :)
u covered half of what my ai principles course covered on learning in 3 and half hrs in 8 mins. nice
digging your vids and enthusiasm from Portland Oregon!
Siraj, ur videos inspired me to study machine learning. I've been learning python for the past month, and am looking to start playing around with more advanced stuff. Do you have any good book recommendations for machine or deep learning, or online resources that beginners should start with?
awesome. watch my playlist Learn Python for Data Science
Siraj Raval Do you have videos on matlab using nn?
Great explanation of activation functions. Now I need to tweak my model.
So the slide at 4:00 says "Activation functions should be differentiable", but the conclusion of the video is that you should use the ReLU activation function, which is not differentiable. (Great video btw.)
Daniel O'Connor 2 years later and you've probably figured it out, but ReLU is non-differentiable only at exactly x=0, which is really rare in practice.
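In practice frameworks simply pick a value at that single point. A sketch of the common subgradient convention (plain Python, illustrative only):

```python
def relu_grad(x):
    # Derivative of max(0, x) everywhere except x == 0, where it is
    # undefined; by convention we return 0 there (some libraries pick 1)
    return 1.0 if x > 0 else 0.0

print(relu_grad(2.5), relu_grad(-2.5), relu_grad(0.0))  # 1.0 0.0 0.0
```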
Hard stuff made easy. Congrats to a great video! Keep it up, mate!
Woah ! thanks man, you made things so clear !!!
Entire video is a GEM 💎
Totally makes sense to use ML
You said that linear activations can't solve more complex problems. However, ReLU itself is a linear function (with the negative part discarded). So how does the ReLU activation function help learn the non-linear behavior of a system? Also, how does ReLU address the exploding gradient problem?
I don't get how a "linear" activation function adds non-linearity to a network.
As I see it, it just switches some neurons off. Couldn't we basically use dropout, followed by multiplying each neuron by a random value,
to get the same effect as a ReLU-activated dense layer?
While it's true that ReLU is used in a lot of deep neural networks, I think it's far from true that you should always use ReLU.
In classification ReLU is used because, e.g., we want to "switch off" the unimportant parts of the picture; there ReLU is nearly perfect.
Thanks Siraj. Awesome explanation.
I'm new to deep learning. It would be great if you could make videos about regularization and cost functions.
Hello Sir Siraj, would you please answer the following questions?
I am a PhD student. I would like to use the artificial neural networks approach to measure the productive efficiency of 31 farms, and their production and cost functions as well. My questions are:
Are 31 farms enough for applying this approach? What is the minimum number of observations that must be obtained in the research?
Is it a sufficient number, or do I need more farms?
Note that I have many independent variables that affect the outcome.
What type of activation functions should I use for the efficiency measurement and for estimating the production and cost functions?
Many thanks in advance.
Awesome video. Can you explain a bit more why we aren't using an activation function in the output layer?
Hi Siraj:
Your videos are great!
CONGRATULATIONS!
Great demonstration 🙏...I have a question. How to choose the number of neurons?.. please explain 🙏
Noticed a tiny issue in your vid. When talking about the tanh function, you included python code that suggests the derivative is equal to 1-x^2. If I’m correct, the derivative is equal to 1-tanh(x)^2
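The comment above is right: d/dx tanh(x) = 1 - tanh(x)². (Code that computes 1 - a² is only correct when a is the already-activated output tanh(x) rather than the raw input, a common convention in tutorial code.) A quick numerical check in plain Python:

```python
import math

def tanh_prime(x):
    # Correct derivative of tanh in terms of the raw input x
    return 1.0 - math.tanh(x) ** 2

x, h = 0.7, 1e-6
# Central-difference estimate of the derivative for comparison
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
print(abs(tanh_prime(x) - numeric))  # ~0: the analytic form agrees
```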
Quick note; if you're playing with autoencoders, use tanh or elu. Softplus and relu fail to converge, and sigmoid is too slow.
This is definitely one of your funnier videos.
hi Siraj,
you nailed it in a very short period of time. Loved it. Please keep it up. Cheers....
Great video! Except LSTM doesn't use ReLU
rarely
The leaky ReLU transfer function looks like the V-I curve of an approximated diode.
There are plenty of reasons to use a tanh activation at times, you can't just say never to use something - that's a gross simplification.
Yes, for very large networks with many layers, ReLU is best because of vanishing gradients, but that's not necessarily true for smaller networks. And for RNNs? Just use LSTM or GRU instead of regular RNNs, then you don't have that problem.
I have found that tanh works the best for GRUs
Thank you Sir for this wonderful video, I have a question. How are the basis functions determined in practice? Why do you choose the Gaussian function as the basis function?
Today I did some testing. I created a neural network that has some dense layers, then some LSTM layers, then some dense layers again, with linear units on top, since this was a regression problem. Then I tried algorithmic hyperparameter optimisation that involved an activation-function search. As it turns out, the best performing network had relu as the activation for the initial dense layers and tanh for the rest of the network. The second best (the loss was almost identical) had tanh first, then relu for the LSTM layers, then tanh again for the last dense layers. Sigmoid ruined pretty much everything, and relu didn't really work for the last dense layers. My guess is that relu doesn't work that well with negative numbers (I think). I may be wrong.
The presentation is good. I learned how to choose an activation function; thanks for the video, it helped a lot.
What's a good way to test if your neurons are dying? Any heuristics to check?
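One simple heuristic (a sketch with made-up data; in a real model you would record the layer's activations over a validation batch): track which units output exactly 0 on every example, since a unit that never fires has likely died.

```python
import random

def relu(x):
    return max(0.0, x)

def dead_units(activations):
    # activations[i][j] = output of unit j on example i;
    # a unit is "dead" if it outputs 0 on every example in the batch
    n_units = len(activations[0])
    return [j for j in range(n_units)
            if all(row[j] == 0.0 for row in activations)]

random.seed(0)
# Simulated batch of 64 examples: unit 2's pre-activations are always negative
batch = [[relu(random.uniform(-1, 1)),
          relu(random.uniform(-1, 1)),
          relu(random.uniform(-1, 0))] for _ in range(64)]
print(dead_units(batch))  # [2]: only the always-negative unit is dead
```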
Excellent explanation!!! You're really funny and I loved the way you explain things. Thank you!!!
Well explained. If I have a dataset X = (3.3, 1.6), (7.5, 48.2), (100, 20)... with targets t = {0, 2, ..., 4.5}, and I use a feedforward network to learn the function t = F(x), which activation should I choose for this learning task, and why? I'd really appreciate it if someone could help me.
But ReLU is still piecewise linear and not differentiable at zero, so you can't compute a gradient there — it combines that issue with your previous point about linear functions being bad.
Loving the KEK :) Awesome Siraj :) Can you do a piece on CFR+ and its geopolitical implications?
Initially we apply activation functions to squash the output of each neuron into the range (0, 1) or (-1, 1). But for ReLU the range is (0, x), and x can be arbitrarily large. Can you please give the correct intuition behind this? Thanks in advance :)
Awesome explanation. +1 for creating such a big shadow over the Earth.
I don't see how ReLU avoids the vanishing gradient problem. The entire left side gives a value of zero and a gradient of zero! Maybe it depends on the data?
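To the question above: ReLU doesn't shrink gradients on the active (positive) side — the per-layer factor is exactly 1, versus at most 0.25 for sigmoid — though it's true that a unit stuck on the negative side gets zero gradient (the "dying ReLU" case). A toy sketch, with illustrative numbers:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # never exceeds 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 while the unit is active

# Gradient factor surviving 20 stacked activations at one (active) point:
z, depth = 1.5, 20
print("sigmoid:", sigmoid_grad(z) ** depth)  # shrinks toward zero
print("relu:   ", relu_grad(z) ** depth)     # stays 1.0
```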
The problem with ReLU is that it returns 0 if the number is less than 0... If I'm working with unsigned hexadecimal numbers there are no negative values, so ReLU is effectively just wasted CPU time jumping in and out of an unnecessary function, because it will always return the input value.
This channel is gold! Thanks
Hi Siraj, you mentioned that activation functions should be differentiable, but from my understanding ReLU is not. I was wondering how this affects backpropagation in our neural net.
From the math point of view it's not. But the only point where it's not differentiable is at 0, where you simply declare the gradient to be 0 (or the gradient of the identity). It doesn't matter much because you're using float32 in an optimization problem, so you're very unlikely to land exactly on 0. Just approximate it.
The purpose of ReLU is to produce sparse outputs and sparse gradients; it allows the network to 'activate paths'.
stackoverflow.com/questions/30236856/how-does-the-back-propagation-algorithm-deal-with-non-differentiable-activation
It doesn't matter in practice. You can return 0 or 1 when the input is at the non-differentiable point and it would do fine. Remember that neural networks are just approximators. Its algorithm is plain simple and dumb but it does the job.
Don't we use ln(1+exp(x)) instead of the real ReLU in practice? As far as I know, it's differentiable (and its derivative is super easy to calculate), has a similar shape to ReLU, and so on.
@Yunchan Hwang We actually want that 0 output from ReLU: it's valuable because it gives sparse outputs and gradients. With your function you can't 'deactivate' a path (only push it very close to 0, which is quite different). Also consider computation time: max(0, x) is far cheaper to compute than ln(1+exp(x)).
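A quick numeric sketch of that trade-off: softplus ln(1+exp(x)) tracks ReLU closely for large inputs but never reaches exactly zero, so it can't give truly sparse activations.

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    # log1p(exp(x)) avoids precision loss when exp(x) is small
    return math.log1p(math.exp(x))

# softplus hugs relu for large x but stays strictly positive for x < 0
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  softplus={softplus(x):.4f}")
```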
Great video Siraj. Keep up the good work
thx love u
Awesome, thanks. Could follow it all the way through at full speed 👍
thx for feedback
I died at the Einstein meme
So if ReLU is best for hidden layers and softmax/linear is best for the output, what is best for the input layer? Sorry, I'm new, but your video makes a lot of sense.
Thanks, for the video!
I have a question: Why should't I use tanh?
tanh suffers from the vanishing gradient problem, i.e. the gradients become so small that the weights barely change. We use ReLU because it doesn't saturate for positive inputs.
Are you serious, he literally just told you.
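To make that reply concrete: tanh'(x) = 1 − tanh²(x) peaks at 1 (at x = 0) and collapses once a unit saturates, and backprop multiplies one such factor per layer. A tiny sketch (the layer count and input value are illustrative):

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # tiny once |x| is large (saturation)

# Product of derivative factors through ten saturated tanh layers:
grad = np.prod([tanh_grad(2.0) for _ in range(10)])
print(grad)   # vanishingly small
```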
Siraj great video. Your views about Parametric Rectified Linear Unit (PReLU)?
Just divide the input data into small separate data streams and feed each one into a single node; the linear function would get around 80 percent accuracy no matter what data you feed into it. The fewer data points you feed into each node, the smaller the error to solve. The problem of finding a better activation function is really a problem of solving too much data per node. Make a parallel processing network that only breaks down a small problem per node, and the error could stay low no matter what. I don't think the real neurons in the brain get the entire feed of the photoreceptor neurons all at once; rather, each real neuron solves a tiny piece of data by being fed only a small part of a big problem, not the whole input at once.