The Unspoken Effectiveness of L3 Regularization

  • Published Feb 7, 2025
  • We know about L1 Regularization (Lasso) and L2 Regularization (Ridge), but what would L3 Regularization look like and when would we use it?
    Lasso (L1) : • Lasso Regression
    Ridge (L2): • Ridge Regression
    Elastic Net : • Elastic Net Regulariza...
    Visuals Created with Excalidraw : excalidraw.com/

COMMENTS • 33

  • @pragyan-099
    @pragyan-099 5 days ago +15

    What I learnt from this video is that as we move towards greater norms, we tend to go from distinction to similarity among the coefficients. The sensitivity to outliers also increases (in regression). Maybe one reason people hardly use L3 regularization is that L2 already sets the bar high on sensitivity.
    One use case I can think of is identifying outliers, since the more sensitive the penalty is to them, the more clearly they should show up.

  • @ellysian
    @ellysian 4 days ago +6

    Great video! I remember seeing a question on reddit about why we don't use L3 and above norms. Here are the conclusions I drew from reading the comments and thinking about it:
    1) L0 is the norm we would ideally use to promote sparsity; however, it is non-differentiable and therefore not suitable for gradient-based optimization.
    2) L2 regularization works best when we can assume the data is corrupted by zero-mean Gaussian noise. If you do the math, you will find that imposing a Gaussian prior to regularize beta is equivalent to the L2 norm (sketched right after this comment).
    3) Similar reasoning applies to the L1 norm with a Laplacian prior.
    4) In practice, the L1 norm is something in between the L0 and L2 norms.
    5) Ln norms where n is greater than 2 and odd cannot be valid regularizers as written, since a distance function by definition should be non-negative (although the absolute value could be applied).
    6) Other Ln norms are not based on any theoretical assumption afaik, and their gradient magnitude probably decreases faster than the L2 norm's as coefficients approach zero, so they are usually impractical.
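
    A compact version of the argument behind points (2) and (3), written as a standard MAP calculation (not taken from the video): with y = Xβ + ε, ε ~ N(0, σ²I), and a prior on β, minimizing the negative log posterior gives

    % MAP estimate: -log p(beta | y) = -log p(y | beta) - log p(beta) + const.
    % A Gaussian prior beta_j ~ N(0, tau^2) yields the ridge (L2) objective:
    \[
    \hat{\beta}_{\text{MAP}}
    = \arg\min_{\beta}\; \frac{1}{2\sigma^{2}} \lVert y - X\beta \rVert_2^{2}
      + \frac{1}{2\tau^{2}} \lVert \beta \rVert_2^{2} ,
    \]
    % while a Laplace prior p(beta_j) \propto exp(-|beta_j| / b) yields the lasso (L1) objective:
    \[
    \hat{\beta}_{\text{MAP}}
    = \arg\min_{\beta}\; \frac{1}{2\sigma^{2}} \lVert y - X\beta \rVert_2^{2}
      + \frac{1}{b} \lVert \beta \rVert_1 .
    \]
    % Multiplying through by 2*sigma^2 recovers the usual penalized least-squares forms,
    % with lambda fixed by the noise variance and the prior scale.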

  • @farzinnasiri1084
    @farzinnasiri1084 2 days ago

    Your content is very different from other channels, and I really appreciate that you go into the actual mathematics of things while giving good intuition. Thanks!

  • @philwebb59
    @philwebb59 4 days ago +9

    L1 gives you sparsity. L2 gives you smoothness. L3 just gives you more smoothness, right? Coefficients are less likely to approach zero, making it basically the opposite of L1. There really is no practical reason to do that, except, like in your example, where you're basically taking an average.

    • @ritvikmath
      @ritvikmath  4 days ago +2

      Yeah, that's about where I landed too. You'd mostly want L3 and beyond when you actively want uniformity in your parameters, but that's a bit counterproductive to think about for most applications.
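
    A minimal sketch of this thread's point (illustrative only, not from the video), assuming plain NumPy and a synthetic regression problem: gradient descent on squared error plus an L_p penalty, for p = 1, 2, 3. The penalty gradient p * |b|^(p-1) * sign(b) vanishes near zero when p = 3, so small coefficients are barely shrunk while large ones are pulled toward similar magnitudes.

    import numpy as np

    # Synthetic data (assumed for illustration): sparse ground-truth coefficients.
    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    true_beta = np.array([3.0, 1.5, 0.0, 0.0, -2.0])
    y = X @ true_beta + rng.normal(scale=0.5, size=n)

    def fit_lp(X, y, p, lam=0.5, lr=1e-3, steps=20_000):
        """Minimize ||y - Xb||^2 / (2n) + lam * sum(|b_j|^p) by plain gradient descent."""
        n_samples = X.shape[0]
        b = np.zeros(X.shape[1])
        for _ in range(steps):
            grad_loss = X.T @ (X @ b - y) / n_samples
            # d/db_j |b_j|^p = p * |b_j|^(p-1) * sign(b_j), taken as 0 at b_j = 0
            grad_pen = lam * p * np.abs(b) ** (p - 1) * np.sign(b)
            b -= lr * (grad_loss + grad_pen)
        return b

    for p in (1, 2, 3):
        print(f"p={p}:", np.round(fit_lp(X, y, p), 3))

    Plain gradient descent only drives L1 coefficients approximately (not exactly) to zero, but the qualitative contrast is the point: p = 1 pushes the small coefficients toward zero, while p = 3 mostly leaves them alone and compresses the large ones toward each other.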

  • @대윤-y7o
    @대윤-y7o 4 days ago

    I usually watch your videos and this is really helpful to me.

    • @ritvikmath
      @ritvikmath  4 days ago

      Thanks! I'm glad it was helpful!

  • @yensteel
    @yensteel 4 days ago

    It's a game changer for machine learning! I've been using these techniques for over 8 years, even with the prevalence of other methods. It's efficient, cuts the bloat, and increases robustness.

  • @MarkusKofler
    @MarkusKofler 14 hours ago

    Another use case would be deriving channel effectiveness with a marketing mix model. You would want coefficients of similar magnitude because, by assumption, we want to avoid one channel receiving all the credit for the generated KPI (the target variable).
    However, it also does not make intuitive sense to let coefficients become arbitrarily large (they give us a measure of ROI) or negative at all (the worst a marketing channel can do is 0).
    So in my opinion it would make sense to penalize with both L2 and Lk with k > 2, to encourage coefficients that are both smaller and of similar magnitude.
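
    A minimal numeric sketch of the mixed penalty proposed here (the function name, weights, and numbers are assumptions for illustration): an L2 term plus an L_k term with k > 2, which charges far more for one dominant coefficient than for several of similar magnitude.

    import numpy as np

    def mixed_penalty(beta, lam2=0.1, lamk=0.01, k=4):
        """Return lam2 * sum(beta_j^2) + lamk * sum(|beta_j|^k)."""
        beta = np.asarray(beta, dtype=float)
        return lam2 * np.sum(beta ** 2) + lamk * np.sum(np.abs(beta) ** k)

    # Both coefficient vectors sum to 1.5, but the lopsided one is penalized much harder.
    print(mixed_penalty([0.5, 0.6, 0.4]))     # ~0.079
    print(mixed_penalty([1.40, 0.05, 0.05]))  # ~0.235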

  • @Alexander-pk1tu
    @Alexander-pk1tu 4 days ago +3

    In multiple dimensions it could be even more interesting to visualize, since the constraint region would look like a cube.

    • @ritvikmath
      @ritvikmath  4 days ago +2

      yes exactly! in 3d the L3 norm constraint is a sort-of "puffy cube" which is really fun to think about.

  • @tod9141
    @tod9141 3 days ago +4

    Hello! Where can I learn the prerequisites for understanding this video?

    • @ritvikmath
      @ritvikmath  3 days ago +2

      Hey please check the videos in the description for those prerequisites!

  • @michaelzumpano7318
    @michaelzumpano7318 2 days ago

    I don’t know if I’m understanding regularization correctly, but can you map a distribution of betas to the rounded corners (since they are not precisely fixed to equality by the sharp corners of L1)? If you randomly sample that distribution you should get a normal distribution. You can get a mean and std deviation from the Lx. Would these be the equilibrium (level curve) intersectors?

  • @maulikshah9078
    @maulikshah9078 5 days ago +1

    Awesome

  • @nicholaswilson8357
    @nicholaswilson8357 12 hours ago

    I guess it might be useful if you are trying to regularise in Fourier space?

  • @zakaria1252
    @zakaria1252 5 days ago +2

    I notice that when we use regularization for a model trained on centered, standardized data, its effect is not noticeable and does not change the parameters much. Why?
    I'm new to data science btw.

  • @yurcchello
    @yurcchello 4 days ago +2

    Using a power of 0.5 gives a star shape. What properties does it have?

    • @ritvikmath
      @ritvikmath  4 days ago +3

      This is an excellent question and one I did think about addressing in the video. The reason I ultimately decided not to talk about norms less than 1 is that they are not "real norms", in that they do not satisfy the triangle inequality, and that the shape their constraint creates (a star shape, as you mentioned) is non-convex and therefore trickier to use in optimization problems.
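
    A concrete instance of the triangle-inequality failure mentioned here (my choice of vectors, not from the video):

    % For p = 1/2, with \|x\|_p = (\sum_i |x_i|^p)^{1/p}, take x = (1, 0) and y = (0, 1):
    \[
    \lVert x \rVert_{1/2} = \lVert y \rVert_{1/2} = 1,
    \qquad
    \lVert x + y \rVert_{1/2} = \bigl(1^{1/2} + 1^{1/2}\bigr)^{2} = 4
    > 2 = \lVert x \rVert_{1/2} + \lVert y \rVert_{1/2},
    \]
    % so the unit "ball" is the non-convex star shape and the quantity is only a quasi-norm.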

  • @TheZork1995
    @TheZork1995 10 hours ago

    For mixture-of-experts routing, maybe. You want all the experts to take on a roughly equal amount of work.
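
    A tiny numeric check of that intuition (an assumed penalty form, not an actual MoE implementation): for expert load fractions with a fixed total, a sum of cubes is smallest when the loads are equal, so adding such a term to the routing loss nudges the router toward balance.

    import numpy as np

    def cube_penalty(loads):
        """Sum of cubed load fractions; minimized (for a fixed total) when loads are equal."""
        loads = np.asarray(loads, dtype=float)
        return np.sum(loads ** 3)

    print(cube_penalty([0.25, 0.25, 0.25, 0.25]))  # 0.0625 (balanced experts)
    print(cube_penalty([0.70, 0.10, 0.10, 0.10]))  # 0.346  (one overloaded expert)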

  • @mannyourfriend
    @mannyourfriend 4 days ago

    Great videos but I don’t know what L1 regularization is, and I wonder why we expect the objective functions to hit a corner instead of overlapping with the shape or existing far from the shape, or why there are multiple objective functions in the first place, and why we’re limited to a 2D plane. Still, I found the video interesting, so you did a great job.

    • @cvanaret
      @cvanaret 4 days ago +1

      You want to minimize the objective (whose contours/level curves are shown as ellipses here) while staying inside the convex shape (a diamond for the L1 norm). At 2:33, you see that the constrained minimizers are located at the corners of the diamond.
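
    In symbols, the picture described here is the standard constrained form of the lasso (a textbook formulation, not specific to this video), together with its equivalent penalized form:

    % Constrained form: minimize the objective while staying inside the L1 diamond of radius t.
    \[
    \min_{\beta}\; \tfrac{1}{2} \lVert y - X\beta \rVert_2^{2}
    \quad \text{subject to} \quad \lVert \beta \rVert_1 \le t ,
    \]
    % Penalized form: equivalent for a suitable lambda that depends on t.
    \[
    \min_{\beta}\; \tfrac{1}{2} \lVert y - X\beta \rVert_2^{2}
    + \lambda \lVert \beta \rVert_1 .
    \]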

  • @anantshukla3415
    @anantshukla3415 3 days ago

    A video about variational linear regression please.

  • @Jdrake4e
    @Jdrake4e 4 days ago

    I was thinking: what difference would larger versions of the L_n norm make compared to placing all the features in the same column?
    My other thought was: would it encode similar information to L1, in the sense that -1 is anti-correlated, 0 is no correlation, and 1 is correlated, but just lose the ability to express no correlation?
    I'll definitely have to think a bit more about this.

  • @irobot-ng6li
    @irobot-ng6li 5 days ago +3

    You could've also quickly talked about nuclear norms ($L_0$). They are more pointy and spiky.

    • @ritvikmath
      @ritvikmath  5 days ago +5

      thanks for the note and good suggestion for a later video!

    • @yensteel
      @yensteel 4 days ago

      Spiked? What are the use cases?

  • @tantzer6113
    @tantzer6113 3 days ago

    What about L_p, where p < 1?

  • @DavidHarris-ce6zh
    @DavidHarris-ce6zh 1 day ago

    We should talk. I am doing something not precisely similar but asking a related question.

  • @kevon217
    @kevon217 3 days ago

    L3 shower thoughts, lol.