Ridge Regression Part 2 | Mathematical Formulation & Code from scratch | Regularized Linear Models

  • Published 8 Jul 2024
  • In the second part of our series, we break down the mathematical formulation of Ridge Regression and guide you through coding it from scratch. Explore the essence of Ridge Regression, a form of regularized linear models, and gain hands-on experience in implementing this advanced regression technique.
    Code : github.com/campusx-official/1...
    Matrix Differentiation : www.gatsby.ucl.ac.uk/teaching/...
    Videos to watch:
    • Simple Linear Regressi...
    • Multiple Linear Regres...
    Sklearn Ridge Class: scikit-learn.org/stable/modul...
    ============================
    Do you want to learn from me?
    Check my affordable mentorship program at : learnwith.campusx.in/s/store
    ============================
    📱 Grow with us:
    CampusX's LinkedIn: / campusx-official
    CampusX on Instagram for daily tips: / campusx.official
    My LinkedIn: / nitish-singh-03412789
    Discord: / discord
    E-mail us at support@campusx.in
    ⌚Time Stamps⌚
    00:00 - Intro
    00:32 - Revision on Ridge Regression
    12:27 - Code Demo
    20:50 - Ridge Regression for N-Dimensional Data
    33:46 - Coding ridge regression from scratch
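
    For quick reference, below is a minimal sketch (editor's illustration, not the video's exact code — the class and attribute names are assumptions; see the GitHub link above for the actual implementation) of the closed-form N-dimensional Ridge solution the video derives, W = (XᵀX + λI)⁻¹Xᵀy with the intercept entry of I zeroed out:

    import numpy as np

    class RidgeScratch:
        # Closed-form ridge: W = (X^T X + alpha*I)^(-1) X^T y,
        # with I[0][0] = 0 so the intercept is not penalized.
        def __init__(self, alpha=0.1):
            self.alpha = alpha
            self.coef_ = None
            self.intercept_ = None

        def fit(self, X_train, y_train):
            X_train = np.insert(X_train, 0, 1, axis=1)  # prepend a column of 1s for the intercept
            I = np.identity(X_train.shape[1])
            I[0][0] = 0                                  # do not regularize the intercept term
            W = np.linalg.inv(X_train.T @ X_train + self.alpha * I) @ X_train.T @ y_train
            self.intercept_ = W[0]
            self.coef_ = W[1:]

        def predict(self, X_test):
            return X_test @ self.coef_ + self.intercept_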

COMMENTS • 87

  • @satyabratanayak2264
    @satyabratanayak2264 15 days ago +2

    As in the simple regression case, we added lambda*(m^2) and did not include the b term, but in the end b gets adjusted automatically because b = y.mean() - m*x.mean(), i.e. b is a function of m. Similarly, in the N-d case the intercept term is a function of the remaining N weights, and a change in those weights alters the value of the intercept, so we do not need to apply the penalty to the intercept explicitly.
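
    (A worked form of the 1-D case described above, as a sketch; λ here is the ridge penalty applied only to the slope m:)

    m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2 + \lambda},
    \qquad
    b = \bar{y} - m\,\bar{x}

    so any change in m caused by λ propagates to b automatically.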

  • @sabir_hussain_01
    @sabir_hussain_01 8 months ago +18

    43:01
    The reason for using I[0][0] = 0 is that in our matrix W the first term is the intercept, not a slope, and we have to multiply lambda with the slopes only.
    That's why the first entry is set to zero, so lambda is multiplied with only the slopes.

    • @rijanpokhrel9281
      @rijanpokhrel9281 6 months ago +2

      Yes, exactly, this is the reason for it. I was about to comment the same, but found the same answer in your comment.

    • @princeagrawal9565
      @princeagrawal9565 A month ago

      Thank you bro. Good job.

    • @anilkumarreddykorivi
      @anilkumarreddykorivi 27 days ago +1

      No bro, the reason is not correct, because in the W matrix all entries are coefficients; I mean to say they are all slopes, not intercepts, so the reason is not correct, bro.

    • @korivianilkumarreddy3273
      @korivianilkumarreddy3273 27 days ago

      Yes

    • @atharvkazarid2-354
      @atharvkazarid2-354 24 days ago

      But we are adding that one column of 1's, no? Before training: [ X_train = np.insert(X_train,0,1,axis=1) ]

  • @arnabroy9782
    @arnabroy9782 2 years ago +44

    Luckily stumbled across your videos a few days back. Found them to be better than most on this platform.
    I like the fact that your explanation involves both the intuition and the maths (wherever necessary) behind these algorithms. It gives greater justification for their usage. Many tutorials fail to do that and rely only on intuition.
    Really appreciate the effort in finding out the difference between the custom code and the library code and sharing that with us as well.
    You've definitely earned a sub!

    • @RamandeepSingh_04
      @RamandeepSingh_04 5 months ago

      Can you please help? I am unable to understand [XW-Y]` [XW-Y], even though I have revised and understood that E (error for multiple linear regression) = e transpose * e.

  • @sudarshansutar1463
    @sudarshansutar1463 A year ago +14

    If you watch Nitish sir's playlists for machine learning and deep learning, you will easily get hired for a data scientist role with a good package.

    • @mihirsrivastava2668
      @mihirsrivastava2668 A year ago

      Did you get selected?

    • @hritikroshanmishra3630
      @hritikroshanmishra3630 11 months ago

      @@mihirsrivastava2668 What about you?

    • @RamandeepSingh_04
      @RamandeepSingh_04 5 months ago

      Can you please help? I am unable to understand [XW-Y]` [XW-Y], even though I have revised and understood that E (error for multiple linear regression) = e transpose * e.

  • @abhinavkale4632
    @abhinavkale4632 3 years ago +40

    Don't know why I paid so much to study the same material that is already available on your channel... great work sir... cheers!

    • @mr.deep.
      @mr.deep. 2 years ago +1

      true

    • @Noob31219
      @Noob31219 A year ago +1

      More content than a paid course.

    • @mohitkushwaha8974
      @mohitkushwaha8974 A year ago +4

      @@Noob31219 Very true; even in paid courses they don't go into that much detail.

    • @hritikroshanmishra3630
      @hritikroshanmishra3630 11 months ago

      @@mohitkushwaha8974 Have you taken the course?

    • @mihirthakkare504
      @mihirthakkare504 10 months ago

      Us bro, us. I paid 1,50,000 just for a roadmap at some institute 😂😂

  • @krutikashimpi626
    @krutikashimpi626 7 months ago +2

    The statement I[0][0] = 0 is likely setting the regularization strength for the bias term to zero. Regularization is a technique used to prevent overfitting in a model by penalizing large coefficients. However, the comment suggests that the bias term, which represents the baseline value of the target variable when all input features are zero, should not be heavily regularized.
    In simpler terms, the bias term is essential for capturing the inherent value of the target variable when there's no influence from the input features. By setting its regularization strength to zero, the model is allowed to keep this baseline value without being penalized too much, as it's crucial for accurate predictions.

  • @animeshsingh4645
    @animeshsingh4645 11 months ago +12

    43:00
    I guess the reason for using I[0][0] = 0 is:
    the bias term should not be heavily regularized, because it represents the baseline value of the target variable when all input features are zero.

  • @ujjalroy1442
    @ujjalroy1442 A month ago

    Thanks for such a detailed explanation

  • @balrajprajesh6473
    @balrajprajesh6473 2 years ago

    best teacher ever!

  • @piyushpathak7311
    @piyushpathak7311 2 years ago +7

    Sir, please upload videos on the DBSCAN and XGBoost algorithms 🙏 Sir, your teaching style is awesome.

  • @siyays1868
    @siyays1868 A year ago

    Only on this channel is everything explained so thoroughly; you'll get conceptual clarity. This channel has that magic, and the magic is Nitish sir. Thank you so much sir for working so hard every time.

    • @RamandeepSingh_04
      @RamandeepSingh_04 5 months ago

      Can you please help? I am unable to understand [XW-Y]` [XW-Y], even though I have revised and understood that E (error for multiple linear regression) = e transpose * e.

  • @ParthivShah
    @ParthivShah 3 months ago +1

    Thank You Sir.

  • @sudduswetu8912
    @sudduswetu8912 A year ago

    In overfitting, bias is low and variance is high... underfitting means low variance and high bias.

  • @pramodshaw2997
    @pramodshaw2997 2 years ago

    God bless you sir!!

  • @MohammadZeeshan-zg6ek
    @MohammadZeeshan-zg6ek 2 years ago

    Because lambda is meant to change the slope values, i.e. coef_. The first column is intercept_ (added at the start as required), so they didn't want the identity matrix to act on the intercept_ column, as we can also observe.

  • @Ishant875
    @Ishant875 5 months ago

    42:58 b0 is optional in regularization because it doesn't change the shape of the fitted curve or hyperplane; b0 just shifts it. We use regularization to make a complex model simpler, so regularizing b0 will not help with that.

  • @StartGenAI
    @StartGenAI 2 years ago

    Thank you very much!!!

  • @piyushkumar0i0
    @piyushkumar0i0 A year ago

    As W0 is the intercept, alpha can only act on W1 to Wm, not on the intercept; hence they kept the value I[0][0] = 0.

  • @akash.deblanq
    @akash.deblanq 2 years ago

    Am I right in assuming that, instead of changing the individual values of m, you just add one term at the end, and tuning that affects the entire equation?

  • @Noob31219
    @Noob31219 A year ago

    you are great

  • @sujithsaikalakonda4863
    @sujithsaikalakonda4863 A year ago +4

    Hi sir, great explanation.
    I have a doubt: at 30:13 the formula for the differentiation of x^T B x is given as 2Bx, but the formula you stated in the Multiple Regression video was 2Bx^T.
    I think we should have taken 2w^T y^T x instead of 2w^T x^T y.
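
    (For reference, the standard matrix-calculus identity behind the step at 30:13; the two forms discussed above differ only in the layout convention used for the derivative:)

    \frac{\partial}{\partial x}\, x^{\top} B x = (B + B^{\top})\,x = 2Bx \quad \text{when } B = B^{\top}
    \qquad \text{(denominator layout; in numerator layout the result is } 2x^{\top}B\text{)}.

    Since XᵀX is symmetric, both conventions lead to the same normal equations, and the scalar cross terms wᵀXᵀy and yᵀXw are equal (each is the transpose of the other).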

  • @princekhunt1
    @princekhunt1 3 months ago

    OMG Explanation

  • @FarhanAhmed-xq3zx
    @FarhanAhmed-xq3zx 2 years ago +7

    The reason they replace it with 0 (i.e. the first value in the I matrix) is that in lambda*(W)^2 the first value is w0 (i.e. the intercept), and they don't want to consider that first value of the weights vector, because lambda is concerned only with the coefficients as per the regularization term. This could be the reason, I think, but I'm not sure. Correct me if I'm wrong. Thanks.

  • @rohitdahiya6697
    @rohitdahiya6697 A year ago

    Why is there no learning-rate hyperparameter in scikit-learn Ridge/Lasso/ElasticNet? Since there is a hyperparameter called max_iter, that suggests an iterative solver like gradient descent, but still there is no learning rate among the hyperparameters. If anyone knows, please help me out with it.
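
    (A possible answer, sketched with scikit-learn calls that exist but with illustrative parameter values: Ridge exposes no learning rate because its solvers either solve the linear-algebra problem directly or, for iterative solvers such as 'sag'/'saga', choose their step size internally from the data; max_iter only caps those iterative solvers.)

    from sklearn.linear_model import Ridge

    # Direct solver: closed-form style solution, no learning rate involved
    ridge_direct = Ridge(alpha=1.0, solver='cholesky')

    # Iterative solver: step size is chosen internally, max_iter caps the iterations
    ridge_iter = Ridge(alpha=1.0, solver='sag', max_iter=1000)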

  • @krishnakanthmacherla4431
    @krishnakanthmacherla4431 2 years ago +2

    As per the definition of regularization, we regularize only the coefficients and not the intercept; if we kept a 1 in the first row of the identity matrix, which handles the intercept part, we would be regularizing the intercept as well.
    I guess I am right.

  • @varunahlawat9013
    @varunahlawat9013 A year ago +2

    I do not agree with the idea that if m is too high it leads to overfitting, and if m is too low it leads to underfitting. Overfitting only means something when performance on the training data is really good but performance on the test data is poor, whereas the model will perform poorly on both datasets if m is either too high or too low.
    Please confirm whether that understanding is right or wrong.

    • @email4ady
      @email4ady 3 months ago

      Great point! I was thinking the same. I think Nitish meant this only for the example data points he showed on the whiteboard to clarify the example. In general, a higher m might give an optimal model or a bad one, and a low m might underfit or even overfit the data; it depends completely on the data.

  • @univer_se1306
    @univer_se1306 5 months ago +1

    import numpy as np

    class MyRidge:
        def __init__(self, alpha=0.1):
            self.intercept = None
            self.coef = None
            self.alpha = alpha

        def fit(self, X, y):
            num = np.sum(np.dot(y - y.mean(), X - X.mean()))
            denom = np.sum((X - X.mean()) * (X - X.mean())) + self.alpha
            self.intercept = num / denom
            self.coef = y.mean() - self.intercept * X.mean()
            print(self.intercept, self.coef)

        def predict(self, X):
            y_pred = self.intercept * X + self.coef
            return y_pred
    is the above class correct???
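
    (A possible correction, sketched below: the math in the class above matches the 1-D ridge formula, but self.intercept and self.coef appear swapped — num/denom is the slope m, and y.mean() - m*X.mean() is the intercept b. This sketch assumes X and y are 1-D NumPy arrays; the class name is illustrative.)

    import numpy as np

    class MyRidgeFixed:
        def __init__(self, alpha=0.1):
            self.coef_ = None        # slope m
            self.intercept_ = None   # intercept b
            self.alpha = alpha

        def fit(self, X, y):
            num = np.sum((y - y.mean()) * (X - X.mean()))
            denom = np.sum((X - X.mean()) ** 2) + self.alpha
            self.coef_ = num / denom                             # m = sum((x-x̄)(y-ȳ)) / (sum((x-x̄)²) + λ)
            self.intercept_ = y.mean() - self.coef_ * X.mean()   # b = ȳ - m·x̄

        def predict(self, X):
            return self.coef_ * X + self.intercept_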

  • @saptarshisanyal6738
    @saptarshisanyal6738 A year ago +2

    At 8:30 you are multiplying all the terms by minus (-), but there is an error in the sign of the resulting expression.

    • @ritwikdubey5331
      @ritwikdubey5331 2 months ago

      No, the equation is okay! When you take the minus sign inside the summation, it becomes +m {x - x(mean)}^2...

  • @RamandeepSingh_04
    @RamandeepSingh_04 5 months ago +1

    Sir, I am unable to understand [XW-Y]` [XW-Y], even though I have revised and understood that E (error for multiple linear regression) = e transpose * e.

    • @adenmukhtar9804
      @adenmukhtar9804 3 days ago

      In the previous video the coefficient matrix was written as beta, which in this video's derivation is W, so the first thing to keep in mind is that W (in this video) = beta (in the previous video).
      Moreover, y_hat = X*beta in the previous video,
      so in this video it becomes Y_hat = XW.
      E (error for multiple linear regression) = e transpose * e,
      and e = (Y - Y_hat), so the final expression here becomes [XW-Y] transpose [XW-Y].
      I hope this helps.
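
      (The same point, written out compactly as a sketch in the reply's notation:)

      e = Y - \hat{Y} = Y - XW, \qquad
      E = e^{\top} e = (Y - XW)^{\top}(Y - XW) = (XW - Y)^{\top}(XW - Y),

      since (-e)ᵀ(-e) = eᵀe, the order inside the brackets does not change the loss.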

  • @arslanahamd7742
    @arslanahamd7742 2 years ago +1

    I don't know why your views are so low. Sir, your teaching style is too good.

  • @RohitKumar-iw3tt
    @RohitKumar-iw3tt 2 years ago +3

    Reason why the intercept is not regularized:
    The intercept acts as a receiver of the reduction in the coefficients, so regularizing both will not improve the model; in other words, you are regularizing the curve, not shifting it.

    • @barryallen3051
      @barryallen3051 A year ago

      I found a similar reason too. When regularizing, we are trying to reduce the variance of our model, not the bias. The first term m0 (or theta0) is the bias term; regularizing it will not reduce the variance but will shift the whole curve.

    • @tanmaythaker2905
      @tanmaythaker2905 A year ago

      Thanks for this!

  • @602rohitkumar8
    @602rohitkumar8 5 months ago

    I think they did not want to change the intercept, because the intercept does not show the weightage of any feature, so changing the intercept will have no effect on overfitting.

  • @krishnakanthmacherla4431
    @krishnakanthmacherla4431 2 years ago

    Sir, now that we are adding regularization to the loss function, won't it change the parabolic nature of the earlier function? And won't it affect our solution?

    • @aadarshbhalerao8507
      @aadarshbhalerao8507 2 years ago

      This is called the bias-variance trade-off, i.e. we should only increase the bias (regularization) in our model if the variance (total loss on test data minus total loss on train data) is reduced. Lambda is a tuning factor, as you know; tune it to get a result better than plain linear regression.

    • @RamandeepSingh_04
      @RamandeepSingh_04 5 months ago

      @@aadarshbhalerao8507 Can you please help? I am unable to understand [XW-Y]` [XW-Y], even though I have revised and understood that E (error for multiple linear regression) = e transpose * e.

  • @symonhalder391
    @symonhalder391 A year ago

    Dear Sir, I am from Bangladesh. We have learned (Y - Y_predicted) whole squared as the error formula, so I would put XW in place of Y_predicted. Why did you apply (XW - Y) instead of (Y - XW)? Kindly advise on this please.

    • @YogaNarasimhaEpuri
      @YogaNarasimhaEpuri A year ago

      This equation's calculation is taught in the Linear Regression video (N-dimensional).

  • @maheshwaroli653
    @maheshwaroli653 2 years ago +6

    Just curious whether we really need to learn how to derive the model from scratch. I mean, we already have scikit-learn for that, no? These formulations are a little complex! Any comments would be appreciated.

    • @campusx-official
      @campusx-official  2 years ago +4

      No, it's not mandatory.

    • @Vinayworks666
      @Vinayworks666 10 months ago +2

      Well, it's not mandatory, but companies can ask you to code it or explain the mathematics; it happened to me in the Microsoft DS test, so don't miss out.

    • @flakky626
      @flakky626 10 months ago

      @@Vinayworks666 Hello bhaiya, can we please talk? I need guidance.
      If yes, please provide a way I can contact you.

    • @vinayrathore560
      @vinayrathore560 10 months ago

      @@flakky626 Check my channel.

    • @Vinayworks666
      @Vinayworks666 10 months ago

      @@flakky626 Well, I could help you out, but YouTube is not allowing me to do it.

  • @uditjec8587
    @uditjec8587 8 months ago

    @28:37 The two terms are not the same; one term is the transpose of the other.

  • @anshulsharma7080
    @anshulsharma7080 A year ago +2

    22:31,
    (Y^trans - (X.Beta)^trans) .
    (Y - (X.Beta))
    By the way, in the end it doesn't matter; even though bhaiya took it the other way round, it still works out.

    • @tafiquehossainkhan3740
      @tafiquehossainkhan3740 11 months ago

      I have the same doubt; can you please tell me how it's correct?

  • @ganeshreddy1808
    @ganeshreddy1808 A year ago

    Sir, the main reason behind optimization techniques like gradient descent is to find the appropriate parameters that do not overfit or underfit, right? Then why do we use regularization on top of that?

    • @campusx-official
      @campusx-official  A year ago +1

      The parameters that you find using gradient descent are the best parameters for the given dataset (which can be considered sample data); how would you know that these parameters will be suitable for all of the population data (read: testing data)?

    • @animeshsingh4645
      @animeshsingh4645 11 months ago +1

      Also remember that a big advantage of using gradient descent over the closed form is faster convergence, since computing the inverse increases the time complexity.

    • @animeshsingh4645
      @animeshsingh4645 11 months ago

      And also, closed-form linear regression is limited to convex functions and can't work on very big data efficiently.

    • @RamandeepSingh_04
      @RamandeepSingh_04 5 months ago

      @@campusx-official Can you please help? I am unable to understand [XW-Y]` [XW-Y], even though I have revised and understood that E (error for multiple linear regression) = e transpose * e.

  • @TheVicky888
    @TheVicky888 2 years ago +3

    Shouldn't the formula for the loss be (Y-XW)^T (Y-XW)?
    I think it got reversed.

  • @flakky626
    @flakky626 10 months ago

    Sir, I'm a little bit confused here. Last time, in multiple regression, you took (y - ŷ) (actual - predicted), so our equation was (y - XB),
    but in this video/lecture you took (ŷ - y) (predicted - actual), and the equation here is (XB - y).
    So why this difference?

    • @ankurlohiya
      @ankurlohiya 9 months ago

      Take the minus sign common out of both factors and you will get the same answer as above; it doesn't matter. He took it the other way round by mistake.

    • @flakky626
      @flakky626 9 months ago

      @@ankurlohiya Can we even take a minus sign common out of transposed brackets?
      Also, thank you so much brutha!!

    • @shreyasmhatre9393
      @shreyasmhatre9393 6 months ago

      22:45
      L = ( yi - ŷi ) ²
      In matrix form:
      L = ( y - Xw )ᵀ ( y - Xw )
      L = ( y - Xw )ᵀ ( y - Xw ) + λ || w || ²
      L = ( y - Xw )ᵀ ( y - Xw ) + λ wᵀw
      L = ( yᵀ - wᵀXᵀ )( y - Xw ) + λ wᵀw
      L = yᵀy - wᵀXᵀy - yᵀXw + wᵀXᵀXw + λ wᵀw
      As he said, wᵀXᵀy and yᵀXw are the same, so
      L = yᵀy - 2(wᵀXᵀy) + wᵀXᵀXw + λ wᵀw
      This is the same equation he got.

      E.g.:
      (A-B)(C-D) = AC - BC - AD + BD ----- 1
      (B-A)(D-C) = BD - AD - BC + AC, which rearranges to (AC - BC - AD + BD) ----- 2
      Both equations turn out to be the same; don't get confused 😵‍💫

  • @shreyasmhatre9393
    @shreyasmhatre9393 6 months ago

    Ref 22:45
    L = ( yi - ŷi ) ²
    In matrix form:
    L = ( y - Xw )ᵀ ( y - Xw )
    L = ( y - Xw )ᵀ ( y - Xw ) + λ || w || ²
    L = ( y - Xw )ᵀ ( y - Xw ) + λ wᵀw
    L = ( yᵀ - wᵀXᵀ )( y - Xw ) + λ wᵀw
    L = yᵀy - wᵀXᵀy - yᵀXw + wᵀXᵀXw + λ wᵀw
    As he said, wᵀXᵀy and yᵀXw are the same, so
    L = yᵀy - 2(wᵀXᵀy) + wᵀXᵀXw + λ wᵀw
    This is the same equation he got.

    E.g.:
    (A-B)(C-D) = AC - BC - AD + BD ----- 1
    (B-A)(D-C) = BD - AD - BC + AC, which rearranges to (AC - BC - AD + BD) ----- 2
    Both equations turn out to be the same.

    • @ali75988
      @ali75988 6 months ago

      Thank you so much, man. I also thought something was odd, as it looked like he replaced y with Xw instead of ŷ.

    • @RamandeepSingh_04
      @RamandeepSingh_04 5 months ago

      Can we connect on LinkedIn?
      Thank you so much for the explanation.