Advanced Regression - Categorical X variables and Interaction terms

  • Published 14 Oct 2024

COMMENTS • 136

  • @leosizaret4104
    @leosizaret4104 6 years ago +14

    Your videos on regression are amazing! Interesting, clear, very informative, you turn the stats & intuition behind regressions into something fun to lose oneself in :D

  • @petercrooks3166
    @petercrooks3166 3 years ago +1

    One of the best explanations of the Dummy Variable Trap and how to circumvent it!

  • @hitm43
    @hitm43 5 years ago +14

    This video was exactly what I needed. Clear and thorough. Keep it up!

  • @arushibhattacharya2143
    @arushibhattacharya2143 2 years ago +2

    These videos are a godsend. These are going to save my life for my massive regression analysis research paper. Thanks for the great content!!

  • @benflis1618
    @benflis1618 3 years ago +2

    Thanks. My professor threw this into the review of SLR and MLR (which we didn't originally learn), but he didn't explain it very well. This video was a big help
    Edit: and by "this," I mean interaction terms

    • @PunmasterSTP
      @PunmasterSTP 2 years ago

      How'd the rest of your class go?

  • @danielalonso3664
    @danielalonso3664 3 years ago +16

    20:08 you should also take into account that being in cat4 gets you -0.390, so apart from adding the 0.123 of the pink slip, you should subtract 0.390 for being in cat4

    • @abdulmateen6101
      @abdulmateen6101 3 years ago +3

      You are absolutely right in pointing out this omission

    • @azzakamoun2294
      @azzakamoun2294 3 years ago +4

      Think of it in terms of marginal effects: consider a simpler regression with only pink slip (X1), agecat4 (X2) and the interaction term (X1*X2) with a coefficient of b3. To assess the marginal effect of pink slip, take the derivative of Y with respect to X1: dY/dX1 = b1 + b3*X2. Given the "boolean" nature of X2, if X2 = 1 then Y increases by b1 + b3; otherwise it increases only by b1. The same applies if you want the marginal effect of X2. Conclusion: always think of it as the derivative with respect to the variable and you'll get the answer.
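
A minimal Python sketch of the derivative argument in this reply. The function name `marginal_effect_x1` is hypothetical; the coefficients 0.123 (pink slip) and 1.371 (pink slip x AgeCat4) are the Model 7 values quoted elsewhere in this thread, and the results are changes in ln(price), not in price itself.

```python
def marginal_effect_x1(b1: float, b3: float, x2: int) -> float:
    """Return dY/dX1 for Y = b0 + b1*X1 + b2*X2 + b3*X1*X2."""
    return b1 + b3 * x2


# Coefficients as quoted in the thread for Model 7.
B1_PINK_SLIP = 0.123
B3_INTERACTION = 1.371

# x2 is the AgeCat4 dummy: 0 for other cars, 1 for cars in AgeCat4.
effect_not_cat4 = marginal_effect_x1(B1_PINK_SLIP, B3_INTERACTION, x2=0)
effect_cat4 = marginal_effect_x1(B1_PINK_SLIP, B3_INTERACTION, x2=1)

print(f"pink slip effect, not AgeCat4: {effect_not_cat4:.3f}")
print(f"pink slip effect, AgeCat4:     {effect_cat4:.3f}")
```

The AgeCat4 main effect (-0.390) does not appear because it does not depend on X1, so it drops out of the derivative.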

    • @anthonyabolarin4961
      @anthonyabolarin4961 1 year ago

      There is no omission: Ln(P) = constant (b0) - 0.181 Ac2 - 0.800 Ac3 - 0.390 Ac4 - 0.209 LnD + 0.123 PS + 1.371 (PS x Ac4)
      The term [-0.390 Ac4] is what you counted as omitted, but it is present.
      Model 7     Coef      Var (ind)       LN
      c           9.125     1               9.1250
      ac2        -0.181     0               0.0000
      ac3        -0.800     0               0.0000
      ac4        -0.390     1              -0.3900
      lnd        -0.209     5.669880923    -1.1850
      ps          0.123     1               0.1230
      psac        1.371     1               1.3710
      Total LN                              9.0440
      Anti-Ln                               8,467.54
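
The arithmetic in this reply can be reproduced with a short Python sketch. The dictionary keys mirror the labels in the comment; the coefficients and inputs are the ones quoted there (5.669880923 is ln(290)), and the variable names are illustrative only.

```python
import math

# Model 7 coefficients as quoted in the comment above.
coefs = {
    "c": 9.125, "ac2": -0.181, "ac3": -0.800, "ac4": -0.390,
    "lnd": -0.209, "ps": 0.123, "psac": 1.371,
}

# Inputs for a car in AgeCat4 with a pink slip; lnd is ln(290).
values = {
    "c": 1, "ac2": 0, "ac3": 0, "ac4": 1,
    "lnd": math.log(290), "ps": 1, "psac": 1,
}

# Sum coefficient * value to get ln(price), then exponentiate.
ln_price = sum(coefs[name] * values[name] for name in coefs)
price = math.exp(ln_price)

print(f"ln(price) = {ln_price:.4f}")
print(f"price     = {price:,.2f}")
```

Every term, including -0.390 for ac4, enters the sum before the anti-log is taken.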

    • @existentialrap521
      @existentialrap521 1 year ago

      I was wondering about this as well. Glad you caught it as well!
      Edit: He did do it while calculating price in model 7 at the very end.

    • @freeSpiritNonna
      @freeSpiritNonna 8 months ago

      That time slot is about the effect of attaining a 'pink slip' and the -0.390 item is only for the age of the car, hence not included, I think.

  • @jovial129
    @jovial129 4 years ago +4

    Love the excitement of the pink slip variable becoming significant lol 9:55

  • @bernardosangir2698
    @bernardosangir2698 5 years ago +3

    I love your videos, especially this video has given me the insight on explaining a regression model with interactions which I have struggled with a lot. Thank you so much

  • @ivangarcialaverde2065
    @ivangarcialaverde2065 4 years ago +1

    You explain 100 times better than my statistics teacher, you've just saved my exam, thanks a lot!!

    • @PunmasterSTP
      @PunmasterSTP 2 years ago

      Hey I know it's been awhile, but I just came across your comment and was curious. How'd the rest of the class go?

  • @jasminepandit9861
    @jasminepandit9861 2 years ago +1

    Thank you SO much for this series! Best I've seen on YouTube so far!

  • @PunmasterSTP
    @PunmasterSTP 2 years ago +1

    Categorical X? More like "Certainly the best." These videos rock!

  • @shashankkhare1023
    @shashankkhare1023 4 years ago +4

    Hi Justin, hope you are doing great! I love your videos and have been following them on your website as well. I have one doubt in this video. At 3:48, where you added pink slip variable, you say that having a pink slip increases the price by 15.6% as coeff for pink slip is 0.156. I am confused here as y is logged and when y is logged and x is not, general interpretation is that 1 unit increase in x means exp(coeff)-1 percent change in y. Please help me understand where I am going wrong. Thanks :)

    • @JoaoVitorBRgomes
      @JoaoVitorBRgomes 4 years ago +1

      I think it is because he is not doing a logistic regression but a linear regression.

  • @kanchangupta19
    @kanchangupta19 5 years ago +2

    I am totally sold by this video. I mean, your teaching pattern is totally up to par. Please make some videos on logistic regression, random forests, neural networks and clustering as well.

  • @azizsakr5565
    @azizsakr5565 5 years ago +9

    Thank you for the interesting videos.
    I think there is a little bit of confusion in the interpretation of the resulting coefficients. A change in one of the independent variables, holding all the others constant, does not mean the dependent variable increases by the same percentage. Please check an example of this at 11:28

    • @imglenngarcia
      @imglenngarcia 2 years ago

      On the time stamp you provided, I think it should be 60.64% higher rather than 47.4%.

    • @helloworld1537
      @helloworld1537 2 years ago

      Yes! I found the same issue in the last video as well. It should be: the median car price will be e^0.474 = 1.6064 times as high, thus a 60.64% increase
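
A small Python sketch of the conversion described in this reply, from a coefficient on a dummy variable in a logged-y model to an exact percentage change. The helper name `pct_change_from_log_coef` is hypothetical; 0.474 is the coefficient discussed in this thread.

```python
import math


def pct_change_from_log_coef(coef: float) -> float:
    """Exact % change in y for a one-unit change in x when ln(y) is modelled."""
    return (math.exp(coef) - 1.0) * 100.0


coef = 0.474
exact_pct = pct_change_from_log_coef(coef)  # uses exp(coef) - 1
approx_pct = coef * 100.0                   # reading the coefficient directly

print(f"exact:  {exact_pct:.2f}%")
print(f"approx: {approx_pct:.1f}%")
```

Reading the coefficient directly as a percentage is a reasonable approximation only when the coefficient is small; the gap widens as it grows.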

    • @helloworld1537
      @helloworld1537 2 years ago

      @imglenngarcia I think the pink slip interaction term case also has the same problem.

  • @houlipouli3559
    @houlipouli3559 2 years ago

    brooooo, how did I not see your videos in 5 years *sigh* I would have gone through uni so much easier!!

  • @cravenhealth1563
    @cravenhealth1563 4 years ago +8

    "15-35, well, they're the shit boxes aren't they?" lol

    • @petercrooks3166
      @petercrooks3166 3 years ago +1

      Think about the kids learning advanced regression! /s lol

  • @the5to9life
    @the5to9life 4 years ago +2

    Sir, you are amazing. Thank you for making these videos.

  • @rrrprogram8667
    @rrrprogram8667 4 years ago +2

    Change your channel name to "zstatistics for machine learning" and you're gonna have a million subs soon

  • @crock1255
    @crock1255 6 years ago +11

    In doing an actual analysis, would you still add the pink-slip coefficient to the pink-slip x cat4 interaction even when the pink slip variable alone is not statistically significant?

    • @malikakbar
      @malikakbar 4 years ago +3

      I have the same question and would appreciate it if anyone is willing to share the answer to this one

    • @lekjov6170
      @lekjov6170 4 years ago +4

      @malikakbar It does, even if the variable on its own is not statistically significant. Think about this scenario:
      There's a new AgeCat5 variable that is added for all cars that are over 70 years old, and we are also going to add another variable called redCar, which takes the value 1 if the car is red and 0 otherwise.
      Now, if I were to ask you: "Does the color of the car being red have an impact on the price of the car?" Probably not; cars are customizable and the color on its own doesn't seem to be too relevant in determining the price of the car.
      For the sake of the example, let's stir things up and claim that in the 40's (80 years ago) almost all cars were either grey or black, and only BMW was producing fancy red cars that were way more expensive than the grey/black cars; but since it was so long ago, it's really hard nowadays to find those special edition BMW cars.
      So, I believe it's pretty easy to tell that if you find a car that is over 80 years old whose original paint is red, it's almost certain that it is one of those special edition red BMWs that are way more expensive than the others.
      So, for all the other cars that were made less than 70 years ago, the color is not a factor that affects the price, so the variable "redCar" is probably not going to be statistically significant on its own. But if the car is over 70 years old, the color is going to play a huge role in predicting the value of the car, because if it is red the price is going to be way higher; therefore, it's going to be significant in that scenario.
      In this case, intuition tells me we should add another predictor to the equation that describes that relationship, which you would do like this: "+ Beta6 * redCar * ageCat5"

  • @garyabrams1020
    @garyabrams1020 5 years ago +3

    This is probably a very basic question - you added variables to get the final price of your car at the end of IVb that were not statistically significant - why - thanks gary

  • @leehyeah9133
    @leehyeah9133 9 months ago

    omg you have saved my thesis now. Thank you 3000

  • @woodcrestshop5621
    @woodcrestshop5621 3 years ago +1

    Very well explained, thank you so much!! Stay safe, Professor!

  • @anandparanjape1
    @anandparanjape1 5 years ago +7

    Awesome videos man! You are a very good teacher :-)

  • @niv2419
    @niv2419 6 years ago +5

    Quality content and well explained. Thanks a ton!

  • @huntermarshall161
    @huntermarshall161 4 years ago +2

    Hey Z,
    You mentioned interaction terms should only be included when the added IV(2) affects the relationship between IV(1) and Y. How do you determine the effect of adding IV(2)? Is it a change in the regression coefficient of IV(1)?
    I’m conducting orthopaedic related research involving canine models, and I’m using multiple regression models to control for the size of the specimen. This would really help me out as I’m working to nail down models for our various parameters.

    • @JoaoVitorBRgomes
      @JoaoVitorBRgomes 4 years ago +1

      Actually I think it is when the p value of the interaction term is statistically significant. Then you see there's an effect modifying e.g. age modifying selling a car with a pink slip.

  • @tedofbeverlyhills
    @tedofbeverlyhills 2 years ago

    Awesome videos, any book you particularly recommend to understand how to do linear regressions?

  • @pomme_paille
    @pomme_paille 4 years ago +1

    You forgot minus 1 at the end 😉
    Thanks for the awesome content

  • @UdoLattek02
    @UdoLattek02 2 years ago

    You saved my thesis

  • @rutu.dances.to.express
    @rutu.dances.to.express 5 years ago

    Firstly, thank you so much, Sir, for your videos filled with detailed explanations! Any query that pops up in my mind mostly gets resolved within a few minutes of the video. I got a clear idea about interaction terms and correlation.
    So just a small doubt: is correlation the only difference between multicollinearity and interaction terms, given that we do not prefer multicollinearity?

  • @TÔMTIÊNYÊN
    @TÔMTIÊNYÊN 10 months ago

    thank you Professor, why do we have to spare the variable age in model 5 ?

  • @amirnashed9701
    @amirnashed9701 5 years ago

    amazing work with the car example

  • @jahongirmuratov1576
    @jahongirmuratov1576 4 years ago +1

    Why do all videos on interaction terms discuss only the cases where both variables are expected to have a positive impact on Y and the results of the interaction are also positive? What about having 2 independent variables (x1 and x2) with one of them having a positive impact on Y and the other a negative one? How do you interpret the interaction term if it is positive? If it is negative?

  • @ael3377
    @ael3377 3 years ago +1

    Don't you have to exponentiate the coefficient of the dummy variable and then interpret it as a multiplier? The sales are still logged, so I guess you would have to exponentiate both sides of the equation to see the change in sales from non pink slips to pink slips.

    • @jakeandersonbell5993
      @jakeandersonbell5993 2 years ago

      Same here, I thought it was (exp(coef) - 1) * 100 for non-transformed independent variables. Someone please correct me.

  • @Nereknu93
    @Nereknu93 4 years ago +1

    Hey, what if the categorical variable had a lot of levels (nationality, religion...) so that it would effectively mean so many variables that the adjusted R squared would be very low?
    And 2) when some variables in your model are insignificant, shouldn't you remove them from the model? But then some of the then-significant variables become insignificant... :(

  • @eddiele644
    @eddiele644 4 years ago +1

    So when do we actually interact our variables? Is there a way to see if it is necessary or do we just do it and then see if the coefficient on the interaction term is statistically significant?

  • @folumb
    @folumb 5 years ago +21

    Why is this video categorized as comedy?

    • @arnavtube
      @arnavtube 4 years ago +8

      statistics is funny

    • @anthonyabolarin4961
      @anthonyabolarin4961 1 year ago

      ...to them that are statistically lost, it is indeed a comedy! 1CO1!18

  • @neelabhdubey8453
    @neelabhdubey8453 2 years ago +1

    Why do we say there's a 47.4% increase in price in Cat4 as compared to Cat1, and why don't we read it as a 47.4% jump from Cat3? I understand that the base category is 1, but I fail to see the reason behind the interpretation.

  • @timothyagandahabagre1091
    @timothyagandahabagre1091 4 years ago +1

    Hi, thanks for your videos. They are being very helpful to me.
    Could we simply code the new AgeCat variable such that:
    AgeCat:
    = 0 if age

    • @gamerchil
      @gamerchil 2 years ago

      Yes, I was thinking this as well. I have always learned that the number of dummies that must be created = category levels (for age in this case, 4) - 1

    • @anthonyabolarin4961
      @anthonyabolarin4961 1 year ago

      Restoring the Ac1 (age 1 to 5 years) into the Model is okay. However, the value of Ac1 = 0, annulling the entire term and contributing nothing to the Model, just like the Ac2 and Ac3. Other interactions are feasible. (e.g., odometer vs. age).
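
The "levels minus one" rule discussed in this thread can be sketched in plain Python. The function names are hypothetical, and the cut-offs of 5, 15 and 35 years are assumptions pieced together from comments here (Ac1 as 1 to 5 years, "15-35", and "older than 35 years"); the video's exact bin edges may differ.

```python
def age_category(age_years: float) -> int:
    """Map a car's age to category 1..4 (cut-offs are assumed, see above)."""
    if age_years <= 5:
        return 1
    if age_years <= 15:
        return 2
    if age_years <= 35:
        return 3
    return 4


def age_dummies(age_years: float) -> dict:
    """Return k - 1 = 3 dummies; category 1 is the omitted base level."""
    category = age_category(age_years)
    return {f"AgeCat{k}": int(category == k) for k in (2, 3, 4)}


print(age_dummies(3))   # base category: all three dummies are 0
print(age_dummies(40))  # only AgeCat4 is 1
```

A car in the base category is identified by all dummies being zero, which is why a fourth dummy would add no new information alongside the intercept.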

  • @banashreeshiva4506
    @banashreeshiva4506 5 years ago +1

    Changing the format of age into a categorical variable increased the significance of pink slip.
    Is this by chance, or does it happen for every variable?
    Would it be okay if we had just dropped the pink slip variable for the next model without changing the format of the age variable?

  • @petercrooks3166
    @petercrooks3166 3 years ago +2

    Why did you not include [-0.390(AgeCat4)] when the car is older than 35 years and has a pink slip? Isn't your answer incorrect? The answer should be, "... for models older than 35 years, attaining a pink slip increases the price by an average of 110.4%, holding all else constant."

  • @kanewilliams1653
    @kanewilliams1653 7 months ago

    You're a legend!

  • @jongkargrinang8012
    @jongkargrinang8012 2 months ago

    If you add age and ln(age) to the model, it will create multicollinearity. Which coefficient is useful, age or ln(age)?

  • @zoyaaqib9269
    @zoyaaqib9269 2 years ago

    I don't understand why we turned age, a continuous variable, into a categorical variable. Was that just for explaining how a multi-level categorical variable works or it was actually important to our model?

  • @Skey1337
    @Skey1337 2 years ago

    The interpretation of the pink slip coefficient in Model 7, 19:44 - is that still 149.4% relative to cat 1?

  • @aritradatta448
    @aritradatta448 3 years ago +1

    In the model 7, the pvalue for pink slip is 0.5 and therefore, quite highly insignificant. Shouldn't it be removed? If yes, then after removing the base variable, can the interaction variable pink slip*agecat4 be retained in the model?

    • @drnurintankamaruddin3164
      @drnurintankamaruddin3164 2 years ago +1

      I have the same exact question. Have you found the answer?

    • @anthonyabolarin4961
      @anthonyabolarin4961 1 year ago

      If an interaction variable is SIGNIFICANT, then its components must be admitted into the model.

  • @drewtamales5999
    @drewtamales5999 3 years ago

    These videos are fantastic thank you!

  • @kanikabagree1084
    @kanikabagree1084 4 years ago

    Thank you so much for such an amazing explanation, you're my saviour, thank you :)

  • @johnpark1797
    @johnpark1797 2 years ago

    15:35 really clarifies it

  • @rogerdoux
    @rogerdoux 3 years ago

    "but mate, this thing goes"
    bloody oath

  • @Zerudite
    @Zerudite 4 years ago +1

    why do we add the pinkslip and pinkslip*agecat4 coefficients for the interpretation but not the agecat4?
    is it because it's not significant?
    or because we were only interpreting the coefficient of pinkslip with the condition that agecat4 is true?

    • @KissingPL
      @KissingPL 4 years ago +1

      You can also interpret the effect of agecat4, which works the same way. Models that are in agecat4 but have NO pink slip have on average a 39% lower price than the baseline, holding all else constant. If they do have a pink slip, the price is on average 98.1% (1.371 - 0.39) higher than the baseline, holding all else constant.

  • @compilations6358
    @compilations6358 3 years ago

    Here you intuitively decided that the coefficient of a certain feature is not as you would expect. So usually we have thousands of dimensions, how can we know if the coefficients make sense? any sort of analysis other than manual checking of coefficients we can do here?

  • @akanshabari6394
    @akanshabari6394 8 months ago

    Amazing content!!!

  • @sum1sw
    @sum1sw 4 years ago +1

    Any idea how one calculates the SE for multi-parameter non-linear regression?

  • @olb47
    @olb47 4 years ago

    Hi, can we use the model for Datsun if pinkslip's pvalue is not significant and we are highly restrictive? I'm asking because pvalue is higher than 0.05 and with common methodology it seems insignificant.

  • @lauramollema1817
    @lauramollema1817 3 years ago +1

    I noticed you're interpreting insignificant variables multiple times. Could you please leave these out when performing calculations?

  • @yasminfatima5948
    @yasminfatima5948 4 years ago +1

    How does removing one age-category variable make sense when including all four won't?

  • @siddhft3001
    @siddhft3001 3 years ago

    Great video! Thank you!

  • @yourstrulysj2183
    @yourstrulysj2183 3 months ago

    How do I understand the price conversion from e^9.044 to real dollar terms ($8,468)? Can you please help me understand?

  • @HANXIZHANG-d6u
    @HANXIZHANG-d6u 1 year ago

    extremely clear!

  • @thenineteennn
    @thenineteennn 5 years ago

    1. In regression models with an intercept, the coefficients cannot be interpreted as a % change, as the coefficient doesn't affect the constant intercept.
    2. At 20mins you forget to include the ageCat4 solo term. I.e. there are 3 terms to sum not just 2.
    Otherwise well done for a perspicuous explanation!

    • @zedstatistics
      @zedstatistics 5 years ago +1

      Hi! Thanks for watching! Regarding your observations:
      1. They certainly can be interpreted as a % change in y. Remember that if you were to exponentiate the log, you'd find that y = exp(B0 + B1X), or in other words y = exp(B0)*exp(B1X), so the constant term just becomes a multiplying constant (i.e. it won't affect the percentage change in y for a given absolute change in X).
      2. The ageCat4 term is def there! Just on the second line of the equation.
      Hope that helps.
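
A quick numerical check of point 1 in this reply, as a hedged Python sketch. The function names and the trial values of B0 and X are arbitrary illustrations; B1 = 0.156 is the pink slip coefficient quoted elsewhere in the thread.

```python
import math


def predicted_y(b0: float, b1: float, x: float) -> float:
    """y = exp(B0 + B1*X), i.e. the model ln(y) = B0 + B1*X."""
    return math.exp(b0 + b1 * x)


def ratio_for_unit_increase(b0: float, b1: float, x: float) -> float:
    """How many times larger y becomes when X goes from x to x + 1."""
    return predicted_y(b0, b1, x + 1) / predicted_y(b0, b1, x)


B1 = 0.156
ratios = [
    ratio_for_unit_increase(b0, B1, x)
    for b0 in (0.0, 2.0, 9.125)  # different intercepts
    for x in (0, 1, 5)           # different starting points
]

# Every ratio equals exp(B1): exp(B0) cancels in the division.
print([round(r, 6) for r in ratios])
```

The intercept scales the level of y but cancels out of the ratio, which is why the percentage effect of a unit change in X is the same for any B0.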

    • @thenineteennn
      @thenineteennn 5 years ago

      @zedstatistics
      1. Erm, exactly my point. When B1 increases by 10%, B1X increases by 10%. But Exp(B1X) does not.
      2. It was in the formula but not summed

    • @thenineteennn
      @thenineteennn 5 years ago

      Sorry ignore 2.

    • @zedstatistics
      @zedstatistics 5 years ago

      @thenineteennn Be careful here, B1 does not increase at all. B1 is the coefficient. It is X that increases. Also, if you're talking about a log-linear relationship (i.e. ln(y) = B0 + B1X...) then a 1 unit increase in X has a constant % effect on Y, as per my first response.
      If you have a log-log relationship (i.e. ln(y) = B0 + B1(ln(X))), only THEN can you interpret it as "a 1% increase in X relates to a B1% increase in Y". But this is different to your example above. Hope that helps :)
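
The two cases distinguished in this reply can be written side by side. This is a sketch using the reply's own symbols (B0, B1, X, y); the derivative notation is added here for illustration.

```latex
\begin{align*}
\text{log-linear:}\quad & \ln y = B_0 + B_1 X
  &&\Rightarrow\quad \frac{d \ln y}{dX} = B_1 \\
\text{log-log:}\quad & \ln y = B_0 + B_1 \ln X
  &&\Rightarrow\quad \frac{d \ln y}{d \ln X} = B_1
\end{align*}
```

In the log-linear case a one-unit increase in X multiplies y by e^{B1}, a constant percentage effect. In the log-log case B1 is an elasticity, so a 1% increase in X goes with roughly a B1% change in y.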

    • @jaliu
      @jaliu 4 years ago

      @zedstatistics How come at 20:00 you added the main effect of pink slip but didn't subtract the main effect of agecat4?

  • @qinghuafeng1705
    @qinghuafeng1705 1 year ago

    Thank you!

  • @Mona-xl6mv
    @Mona-xl6mv 3 years ago

    I love u, made my life so much easier

  • @KFIR93
    @KFIR93 3 years ago

    You are the best!

  • @heidilinnsandster920
    @heidilinnsandster920 3 years ago

    This was so helpful!

  • @JoaoVitorBRgomes
    @JoaoVitorBRgomes 4 years ago

    Keep posting!

  • @serikshamgunov7940
    @serikshamgunov7940 6 years ago

    thank you very much for this video

  • @_Anonymous_9
    @_Anonymous_9 3 years ago +1

    brooo, you didn't show how you coded the interaction term with dummy vars
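
One common way to code such a term, shown here as a hedged Python sketch rather than the video's actual workflow: the interaction column is the row-by-row product of the two dummies. The column names `pink_slip`, `age_cat4` and `pink_slip_x_age_cat4` and the four example rows are hypothetical.

```python
# Four illustrative cars covering every combination of the two dummies.
rows = [
    {"pink_slip": 1, "age_cat4": 1},
    {"pink_slip": 1, "age_cat4": 0},
    {"pink_slip": 0, "age_cat4": 1},
    {"pink_slip": 0, "age_cat4": 0},
]

# The interaction dummy is 1 only when both component dummies are 1.
for row in rows:
    row["pink_slip_x_age_cat4"] = row["pink_slip"] * row["age_cat4"]

interaction_column = [row["pink_slip_x_age_cat4"] for row in rows]
print(interaction_column)
```

The new column is then passed to the regression alongside the two original dummies, so the model can estimate a separate pink slip effect for AgeCat4 cars.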

  • @sociologie4507
    @sociologie4507 3 years ago

    excellent!

  • @in100seconds5
    @in100seconds5 4 years ago

    well-done bro !

  • @haroldbradford690
    @haroldbradford690 3 years ago

    brilliant!!!

  • @samuelthomaz
    @samuelthomaz 2 years ago

    Thanks!

  • @korman9872
    @korman9872 3 years ago

    Tx sir

  • @piku9290dgp
    @piku9290dgp 4 years ago

    How do you identify which two variables can interact? Is this based on business or domain knowledge?

    • @JoaoVitorBRgomes
      @JoaoVitorBRgomes 4 years ago +1

      I think according to what he said is a bit subjective (domain knowledge)

    • @anthonyabolarin4961
      @anthonyabolarin4961 1 year ago

      Any combination can interact! But to what effect? If the interaction is INSIGNIFICANT, we save time by ignoring the crossing, except we are constrained, say in an exam setting.

  • @couragee1
    @couragee1 2 years ago

    thanks

  • @abdelkaderkaouane1944
    @abdelkaderkaouane1944 1 year ago

    👌

  • @bk6prod490
    @bk6prod490 3 years ago

    Can I remove one control variable from the model without removing the others (the interaction and one control variable)?

    • @anthonyabolarin4961
      @anthonyabolarin4961 1 year ago

      Whatever you remove demands that you recalculate your Model Regression and obtain new coefficients and significances.

  • @Hitesh_0421
    @Hitesh_0421 3 years ago

    Why do you take (Age)^2?

  • @95FH95
    @95FH95 6 years ago +1

    haha i can get everything but cannot solve -0.209Ln(290). can someone please quickly show me the calculations for this?

    • @EpicLaith
      @EpicLaith 6 years ago +2

      Why not? Use a calculator and type in -0.209log(290)

    • @anthonyabolarin4961
      @anthonyabolarin4961 1 year ago

      @EpicLaith By now, you have settled this matter. To use Excel, change 'log' to 'LN' and add = ahead of the minus sign: = -0.209 * LN(290) and press Enter (voila!)
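
The same calculation as a short Python sketch. Note that `math.log` is the natural log (Excel's LN), while `math.log10` is the base-10 log that a calculator's "log" key often means; the variable names are illustrative.

```python
import math

# Natural log, as used in the regression: -0.209 * ln(290).
term_natural = -0.209 * math.log(290)

# Base-10 log, shown only to illustrate the mismatch a "log" key can cause.
term_base10 = -0.209 * math.log10(290)

print(f"with ln:    {term_natural:.4f}")
print(f"with log10: {term_base10:.4f}")
```

The natural-log version matches the -1.1850 contribution listed for the odometer term in the Model 7 breakdown elsewhere in this thread.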

  • @annabrenner5995
    @annabrenner5995 1 year ago

    Pre-ZedStats I respected my profs and trusted they knew best even though it was impossible to understand the jargon they mumbled. Nowadays I'm furious that those small-minded fools are getting thousands of dollars for smugly explaining nothing while I'm spending hours each week learning from ZedStats' free videos. American universities are the worst.

  • @lav1093
    @lav1093 2 years ago

    Big mistake in min 19:15 , you should have considered the coefficient of Cat4 honey

    • @zedstatistics
      @zedstatistics 2 years ago

      We're talking about the effect of pinkslip. So not a mistake, sweetie.

    • @lav1093
      @lav1093 2 years ago

      @zedstatistics But without considering Cat4, that term is 0. You should combine the effect of having the pink slip on Cat4, not over Cat1 (the BASE category). Thus, the price increases by 110.4%.

    • @OskarBienko
      @OskarBienko 2 years ago

      Could you elaborate? Please

    • @lav1093
      @lav1093 2 years ago

      @OskarBienko He should have summed the coefficients when cat4=1, pinkslip=1 and pinkslipXcat4=1

    • @lav1093
      @lav1093 2 years ago

      The extra effect of pinkslip is zero without considering cat4

  • @pradeep2005s
    @pradeep2005s 5 years ago

    For log-level regression interpretation
    sites.google.com/site/curtiskephart/ta/econ113/interpreting-beta

  • @dashnaso
    @dashnaso 3 years ago

    18:00 nah bro but thanks.

  • @TheCsePower
    @TheCsePower 2 years ago

    that's a really cheap and cool car

  • @masoudparpanchi505
    @masoudparpanchi505 5 years ago

    speak LOUDER man!!!!!