During high bias, weights tend to be very small; during high variance, weights tend to be large. Regularization works the same way: if lambda is very high (near infinity), the weights are driven down, because gradient descent will always try to minimize the overall cost, which now includes the penalty term. If lambda is small, the weights can grow and the model will try to fit each data point, which creates overfitting problems. So we tune lambda so that both bias and variance stay in an acceptable range.
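A minimal NumPy sketch of that trade-off (toy ridge regression; the data and lambda values are made up purely for illustration):

```python
import numpy as np

# Toy data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.normal(size=50)

# Closed-form L2-regularized solution: w = (X^T X + lam * I)^{-1} X^T y
for lam in [0.0, 1.0, 100.0, 1e6]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    print(f"lambda = {lam:>9}: ||w|| = {np.linalg.norm(w):.4f}")
# ||w|| shrinks as lambda grows; with lambda near 0 the weights are free
# to fit every point, noise included.
```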
I like your explanation
Thanks for your comment, it clicked for me now.
Thanks, I finally got it.
Interesting perspective, never thought of using the shape of tanh to help understand the intuition behind regularization
I feel like the second explanation, with the tanh function, is much better
But for other activations (e.g. ReLU) this explanation is totally counterintuitive. For ReLU the nonlinearity is exactly at 0, so a smaller absolute value of w wouldn't really reduce the unit's use of the nonlinear range.
@@bzqp2 It's linear near 0 in ReLU.
@@cbt0949 Well, NEAR 0 it is linear, but exactly at 0 the first derivative changes, which makes it nonlinear there.
Leaky ReLU can be given as an example of non-linearity, but ReLU is linear in my opinion, since it is linear on each piece of its domain, [0, inf) and (-inf, 0).
@@cbt0949 @Seunggu Kang Nope. Normal ReLU is also nonlinear. An activation function needs to be nonlinear (check Hinton's article "Learning representations by back-propagating errors" for a more detailed explanation), and that's why we use ReLU and not a simple linear wx + b. A neural network with linear activations would collapse to a single linear model no matter how deep it was. The fact that ReLU has different derivatives over different parts of its domain makes it nonlinear.
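A quick numeric check of that point: a linear function must satisfy f(a + b) = f(a) + f(b), and ReLU does not (the input values here are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

a, b = 3.0, -5.0
print(relu(a + b))        # relu(-2.0) -> 0.0
print(relu(a) + relu(b))  # 3.0 + 0.0  -> 3.0; additivity fails, so ReLU is nonlinear
```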
I have a doubt: if z is small, why does it have any effect on overfitting of the curve? We obtain the decision boundary by setting z = 0, so there is no involvement of the tanh(z) function in plotting the decision boundary!
What an amazing intuition 🙇♂️
Very nice explanation. Need to watch it again.
ReLU is also a linear activation function, so how does it not reduce the network into a linear network?
Andrew says tanh in its linear region will deliver linear models only... what about ReLU? That's all linear, so can it only deliver linear models? ReLU is popular, so I find that hard to believe... where am I confused?
I see what you mean, it doesn't make sense to regularize with lambda for ReLU, but it does if you are regularizing with dropout. Great point though.
datascience.stackexchange.com/questions/26475/why-is-relu-used-as-an-activation-function
ReLU has non-linearity. It does not have a constant gradient. It's piecewise linear, not simply linear.
ReLU has its nonlinearity at z = 0 (which actually makes the explanation totally counterintuitive)
@@bzqp2 Yes, I accept this reasoning now. Cheers.
Never thought of tanh (or sigmoid) from that perspective, thanks
As I've already done Prof. Andrew Ng's machine learning lectures, the terms seem familiar, as does the lambda/2m explanation he gave in the logistic regression part. But it would have been more helpful if you could give us some idea of the scenarios where we prefer L2 over L1, or whether we can just default to L2 since it has almost the same effect. I'm a big fan of yours, professor.
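Not from the lecture, but one common way to see the practical difference, shown as a single penalty-only update of each kind (values made up, loss gradient omitted to isolate the effect): L2 shrinks all weights proportionally, while L1 (via soft-thresholding) snaps small weights to exactly 0, which is why L1 is preferred when you want sparse weights:

```python
import numpy as np

w = np.array([0.05, -0.5, 2.0])
alpha, lam = 0.1, 1.0

# L2 step: subtract alpha * lam * w -> proportional shrinkage, never exactly 0.
w_l2 = w - alpha * lam * w

# L1 step (soft-thresholding): shift |w| down by a constant and clip at 0.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)

print(w_l2)  # [ 0.045 -0.45   1.8  ]
print(w_l1)  # [ 0.    -0.4    1.9  ]  <- the smallest weight driven exactly to 0
```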
The tanh intuition makes sense to me [linearizing and so simplifying the model]... the first intuition, described as making the weights so small their effects disappear, does not make sense to me... if ALL the weights are reduced by the same factor, then it's the same model, isn't it? There would need to be some selectivity in the reductions, it seems to me.
The weights in the model are randomly initialized, and some start close to 0 while others don't. L2 regularization puts pressure on these weights during backpropagation, keeping them from reaching high values. This results in models where fewer hidden units contribute significantly, and thus reduced complexity in terms of the number of units that produce the network's output.
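A minimal sketch of where that pressure comes from, in the course's dW notation (`dW_from_loss` is a hypothetical stand-in for the plain backprop gradient):

```python
import numpy as np

def dW_with_l2(dW_from_loss, W, lam, m):
    # Cost = loss + (lam / (2 * m)) * ||W||^2, so the gradient gains a (lam / m) * W term.
    return dW_from_loss + (lam / m) * W

W = np.array([2.0, -3.0, 0.01])
print(dW_with_l2(np.zeros(3), W, lam=10.0, m=100))
# [ 0.2  -0.3   0.001]: whatever a weight's sign, the extra term points
# back toward 0, so every update shrinks |W| a little.
```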
@@RickLamers Absolutely, that is how regularization works. Overfitting occurs when you unnecessarily give importance to most of the weights in order to fit the training set well, which ends up giving a model that tries to capture the whole pattern of the training data in a precise manner. But our goal isn't to build a model that precisely captures the training data pattern; it's to have a more generalized model that will perform well on the test/dev set, which is completely unseen by the model. So, to reduce the complexity, one may penalize the higher weights: the weights the model thinks are majorly responsible for matching the hypothesis on the training set, but which fail when applied to the test/dev set.
I am confused... if lambda is big, your dw will be big, so your w could become very negative after the update (and a very negative w puts tanh in its nonlinear region)... won't that overfit even more?
Well, I think it's because of the optimization problem. In order to minimize the cost function J when lambda is large, we need to choose smaller values of W.
You're absolutely right; there is no restriction on lambda imposed here. For instance, we could choose any negative lambda and it would obviously minimize J; choosing -infinity would be "best". He doesn't explain this clearly.
@ bro if we just take lambda=-infinity, we're done, that's the minimum of J you can get. Think about it
@@drummatick Hi, as Andrew Ng explained above, if lambda is very big then the optimizer will pick very small values of W (W nearly 0). It turns out the z values will be small and the model becomes nearly a linear model (in the case of the tanh activation function) ==> the model is now UNDERFITTING.
In contrast, if you choose lambda to be very small (in your case, -infinity), then the optimizer can pick values of W that are not small at all. This time the model will be complex, since the values of z will be large. Of course, the value of the cost function J can be made arbitrarily low (because lambda is -infinity, as you said), but now the risk of overfitting is super high. Your model will learn too much about the particularities of the training data and won't be able to generalize to new data.
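A quick check of the "small z keeps tanh in its linear range" part (the z values are picked just for illustration):

```python
import numpy as np

for z in [0.01, 0.1, 0.5, 2.0]:
    gap = abs(np.tanh(z) - z) / z
    print(f"z = {z}: tanh(z) = {np.tanh(z):.4f}, relative gap from z = {gap:.2%}")
# For small z (small W means small z = Wx + b), tanh(z) ~ z and the unit is
# almost linear; for large z the curve saturates and is clearly nonlinear.
```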
If w does become negative, the update is w = w - alpha * (gradient from backprop + (lambda/m) * w), where (lambda/m) * w would be negative. Since you'd be subtracting a negative number, it becomes equivalent to adding (lambda/m) * |w|, which pushes w back up toward 0.
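A tiny numeric illustration of that (alpha, lambda, and m are made-up values, and the backprop term is set to 0 to isolate the penalty's effect):

```python
w = -4.0
alpha, lam, m = 0.1, 5.0, 10
for step in range(5):
    w = w - alpha * (0.0 + (lam / m) * w)  # backprop gradient taken as 0 here
    print(f"step {step}: w = {w:.4f}")
# w: -4.0 -> -3.8 -> -3.61 -> ... the penalty decays |w| toward 0 from either sign.
```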
How come a large lambda makes the W matrix zero? Please guide.
When we are regularizing, the update formula for the weights in a layer is W = (1 - learning_rate * lambda / m) * W - learning_rate * (dCost/dW), so from here you can see that a larger lambda shrinks W more on every step.
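A minimal sketch of that update step (shapes and values are hypothetical):

```python
import numpy as np

def l2_update(W, dW, alpha, lam, m):
    # W <- (1 - alpha * lam / m) * W - alpha * dW; the first factor is the "weight decay".
    return (1 - alpha * lam / m) * W - alpha * dW

W = np.ones((2, 2))
W = l2_update(W, dW=np.zeros((2, 2)), alpha=0.1, lam=2.0, m=100)
print(W)  # every entry multiplied by (1 - 0.1 * 2 / 100) = 0.998
```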
He is genuinely excited about dropout, so I'm going to click on it
Why isn't a high lambda the same as a high learning rate?
Check this: stats.stackexchange.com/questions/168666/boosting-why-is-the-learning-rate-called-a-regularization-parameter
Too ez for a genius like me, but thx for the explanation anyway; watched the video at 2x speed trying not to fall asleep
The meaning of the word "intuition" is being single-handedly destroyed by its misuse in this video.
lol