During high bias, weights tend to be very small; during high variance, weights tend to be large. Regularization works the same way: if lambda is very high (near infinity), the weights are driven down, because gradient descent will always try to minimize the overall cost, which now includes the penalty term. If lambda is small, the weights can grow and the model will try to fit each data point, which creates overfitting problems. So we tune lambda so that both bias and variance stay in an acceptable range.
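A minimal NumPy sketch of that trade-off (toy ridge regression; the data and lambda values are made up purely for illustration):

```python
import numpy as np

# Toy data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.normal(size=50)

# Closed-form L2-regularized solution: w = (X^T X + lam * I)^{-1} X^T y
for lam in [0.0, 1.0, 100.0, 1e6]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    print(f"lambda = {lam:>9}: ||w|| = {np.linalg.norm(w):.4f}")
# ||w|| shrinks as lambda grows; with lambda near 0 the weights are free
# to fit every point, noise included.
```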
I like your explanation
Thanks for your comment, it clicked for me now.
Thanks, I finally got it.
Interesting perspective, never thought of using the shape of tanh to help understand the intuition behind regularization
I feel like the second explanation, with the tanh function, is much better
But for other activations (e.g. ReLU) this explanation is totally counterintuitive. For ReLU the nonlinearity is exactly at 0, so a smaller absolute value of w wouldn't really reduce the unit's use of the nonlinear range.
@@bzqp2 It's linear near 0 in ReLU.
@@cbt0949 Well, NEAR 0 it is linear, but exactly at 0 the first derivative changes, which makes it nonlinear there.
Leaky ReLU can be given as an example of non-linearity, but ReLU is linear in my opinion, since it is linear on each piece of its domain, [0, inf) and (-inf, 0).
@@cbt0949 @Seunggu Kang Nope. Normal ReLU is also nonlinear. An activation function needs to be nonlinear (check Hinton's article "Learning representations by back-propagating errors" for a more detailed explanation), and that's why we use ReLU and not a simple linear wx + b. A neural network with linear activations would collapse to a single linear model no matter how deep it was. The fact that ReLU has different derivatives over different parts of its domain makes it nonlinear.
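A quick numeric check of that point: a linear function must satisfy f(a + b) = f(a) + f(b), and ReLU does not (the input values here are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

a, b = 3.0, -5.0
print(relu(a + b))        # relu(-2.0) -> 0.0
print(relu(a) + relu(b))  # 3.0 + 0.0  -> 3.0; additivity fails, so ReLU is nonlinear
```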
I have a doubt: if z is small, why does it have any effect on overfitting of the curve? We obtain the decision boundary by setting z = 0, so there is no involvement of the tanh(z) function in plotting the decision boundary!
What an amazing intuition 🙇♂️
Very nice explanation. Need to watch it again.
ReLU is also a linear activation function, so how does it not reduce the network into a linear network?
Andrew says tanh in its linear region will deliver linear models only... what about ReLU? That's all linear, so can it only deliver linear models? ReLU is popular, so I find that hard to believe... where am I confused?
I see what you mean, it doesn't make sense to regularize with lambda for ReLU, but it does if you are regularizing with dropout. Great point though.
datascience.stackexchange.com/questions/26475/why-is-relu-used-as-an-activation-function
ReLU has non-linearity. It does not have a constant gradient. It's piecewise linear, not simply linear.
ReLU has its nonlinearity at z = 0 (which actually makes the explanation totally counterintuitive)
@@bzqp2 Yes, I accept this reasoning now. Cheers.
Never thought of tanh (or sigmoid) from that perspective, thanks
As I've already done Prof. Andrew Ng's machine learning lectures, the terms seem familiar, as does the lambda/2m explanation he gave in the logistic regression part. But it would have been more helpful if you could give us some idea of the scenarios where we prefer L2 over L1, or whether we can just default to L2 since it has almost the same effect. I'm a big fan of yours, professor.
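Not from the lecture, but one common way to see the practical difference, shown as a single penalty-only update of each kind (values made up, loss gradient omitted to isolate the effect): L2 shrinks all weights proportionally, while L1 (via soft-thresholding) snaps small weights to exactly 0, which is why L1 is preferred when you want sparse weights:

```python
import numpy as np

w = np.array([0.05, -0.5, 2.0])
alpha, lam = 0.1, 1.0

# L2 step: subtract alpha * lam * w -> proportional shrinkage, never exactly 0.
w_l2 = w - alpha * lam * w

# L1 step (soft-thresholding): shift |w| down by a constant and clip at 0.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)

print(w_l2)  # [ 0.045 -0.45   1.8  ]
print(w_l1)  # [ 0.    -0.4    1.9  ]  <- the smallest weight driven exactly to 0
```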
The tanh intuition makes sense to me [linearizing and so simplifying the model]... the first intuition, described as making the weights so small their effects disappear, does not make sense to me... if ALL the weights are reduced by the same factor, then it's the same model, isn't it? There would need to be some selectivity in the reductions, it seems to me.
The weights in the model are randomly initialized, and some start close to 0 while others don't. L2 regularization puts pressure on these weights during backpropagation, keeping them from reaching high values. This results in models where fewer hidden units contribute significantly, and thus reduced complexity in terms of the number of units that produce the network's output.
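A minimal sketch of where that pressure comes from, in the course's dW notation (`dW_from_loss` is a hypothetical stand-in for the plain backprop gradient):

```python
import numpy as np

def dW_with_l2(dW_from_loss, W, lam, m):
    # Cost = loss + (lam / (2 * m)) * ||W||^2, so the gradient gains a (lam / m) * W term.
    return dW_from_loss + (lam / m) * W

W = np.array([2.0, -3.0, 0.01])
print(dW_with_l2(np.zeros(3), W, lam=10.0, m=100))
# [ 0.2  -0.3   0.001]: whatever a weight's sign, the extra term points
# back toward 0, so every update shrinks |W| a little.
```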
@@RickLamers Absolutely, that is how regularization works. Overfitting occurs when you unnecessarily give importance to most of the weights in order to fit the training set well, which ends up giving a model that tries to capture the whole pattern of the training data in a precise manner. But our goal isn't to build a model that precisely captures the training data pattern; it's to have a more generalized model that will perform well on the test/dev set, which is completely unseen by the model. So, to reduce the complexity, one may penalize the higher weights: the weights the model thinks are majorly responsible for matching the hypothesis on the training set, but which fail when applied to the test/dev set.
I am confused... if lambda is big, your dw will be big, so your w could become very negative after the update (and a very negative w puts tanh in its nonlinear region)... won't that overfit even more?
Well, I think it's because of the optimization problem. In order to minimize the cost function J when lambda is large, we need to choose smaller values of W.
You're absolutely right; there is no restriction on lambda imposed here. For instance, we could choose any negative lambda and it would obviously minimize J; choosing -infinity would be "best". He doesn't explain this clearly.
@ bro if we just take lambda=-infinity, we're done, that's the minimum of J you can get. Think about it
@@drummatick Hi, as Andrew Ng explained above, if lambda is very big then the optimizer will pick very small values of W (W nearly 0). It turns out the z values will be small and the model becomes nearly a linear model (in the case of the tanh activation function) ==> the model is now UNDERFITTING.
In contrast, if you choose lambda to be very small (in your case, -infinity), then the optimizer can pick values of W that are not small at all. This time the model will be complex, since the values of z will be large. Of course, the value of the cost function J can be made arbitrarily low (because lambda is -infinity, as you said), but now the risk of overfitting is super high. Your model will learn too much about the particularities of the training data and won't be able to generalize to new data.
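A quick check of the "small z keeps tanh in its linear range" part (the z values are picked just for illustration):

```python
import numpy as np

for z in [0.01, 0.1, 0.5, 2.0]:
    gap = abs(np.tanh(z) - z) / z
    print(f"z = {z}: tanh(z) = {np.tanh(z):.4f}, relative gap from z = {gap:.2%}")
# For small z (small W means small z = Wx + b), tanh(z) ~ z and the unit is
# almost linear; for large z the curve saturates and is clearly nonlinear.
```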
If w does become negative, the update is w = w - alpha * (gradient from backprop + (lambda/m) * w), where (lambda/m) * w would be negative. Since you'd be subtracting a negative number, it becomes equivalent to adding (lambda/m) * |w|, which pushes w back up toward 0.
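A tiny numeric illustration of that (alpha, lambda, and m are made-up values, and the backprop term is set to 0 to isolate the penalty's effect):

```python
w = -4.0
alpha, lam, m = 0.1, 5.0, 10
for step in range(5):
    w = w - alpha * (0.0 + (lam / m) * w)  # backprop gradient taken as 0 here
    print(f"step {step}: w = {w:.4f}")
# w: -4.0 -> -3.8 -> -3.61 -> ... the penalty decays |w| toward 0 from either sign.
```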
How come a large lambda makes the W matrix zero? Please guide.
When we are regularizing, the update formula for the weights in a layer is W = (1 - learning_rate * lambda / m) * W - learning_rate * (dCost/dW), so from here you can see that a larger lambda shrinks W more on every step.
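A minimal sketch of that update step (shapes and values are hypothetical):

```python
import numpy as np

def l2_update(W, dW, alpha, lam, m):
    # W <- (1 - alpha * lam / m) * W - alpha * dW; the first factor is the "weight decay".
    return (1 - alpha * lam / m) * W - alpha * dW

W = np.ones((2, 2))
W = l2_update(W, dW=np.zeros((2, 2)), alpha=0.1, lam=2.0, m=100)
print(W)  # every entry multiplied by (1 - 0.1 * 2 / 100) = 0.998
```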
He is genuinely excited about dropout, so I'm going to click on it
Why isn't a high lambda the same as a high learning rate?
Check this: stats.stackexchange.com/questions/168666/boosting-why-is-the-learning-rate-called-a-regularization-parameter
Too ez for a genius like me, but thx for the explanation anyway; watched the video at 2x speed trying not to fall asleep
The meaning of the word "intuition" is being single-handedly destroyed by its misuse in this video.
lol