In this video tutorial we learn how to fit a polynomial regression model in R and assess it using the partial F-test, with examples. For a more in-depth explanation of linear regression, check our series on linear regression concepts and R (bit.ly/2z8fXg1). Like to support us? You can Donate (statslectures.com/support-us), Share our Videos, Leave us a Comment, Give us a Like, or Write us a Review! Either way, We Thank You!
Actually, it is pretty much linear. You can always use a log transform to make it more linear and then run the tests.
Your videos helped me to complete my MSc degree successfully - thank you very much for your very informative videos!
Dear Marin and Ladan,
Hats off! Clearly explained, with such deep knowledge and human understanding! Thank you very, very much! You = lm(teacher ~ knowledge + I(statwizard^3)), a talent = "TRUE" in your field. Enjoy your life with your family, and if you find the time and opportunity, a new series guiding Us = lm(astronauts ~ strayed + I(how^2)) in the space of R would be highly welcome!
All the best,
Gergo Dioszegi
Hello Mike,
With this video I've finished your course of videos on the introduction to R, and I don't have the words to express my gratitude. Thanks to your amazing work I've entered the world of data science, and I will continue diving into this wonderful field full of possibilities. Since I'm a student of Economics, this will be incredibly useful.
You have helped me immensely without asking for anything, as I'm sure you have thousands of other people who feel equally thankful.
The world needs more people like you, and I will try to continue the chain of helping others.
Sincerely, from Universidad Carlos III, Madrid,
Luis
Thank you for the best tutorial; providing the dataset made it even more beneficial.
Hello Mike,
I am glad you managed to teach all of us with such an explanatory, step-by-step approach. I watched all the videos in your series and I wish more people could take advantage of your knowledge and skill. Thank you so much. Looking forward to more.
Regards
Richa
Really excellent tutorial series. Thank you very much.
you're welcome :)
All the videos are very informative and interactive. Thanks very much, Professor. :)
This dude is a fxking life saver
Thanks for your linear regression series. So helpful!
THANKS MAN, THIS VIDEO WAS SUPER USEFUL
Excellent. Thank you so much for this helpful video. I'm waiting for a new tutorial video.
Thank you so much. You are way better than my teacher.
Thank you +Anqi Dai
Great video! Thank you!
you're welcome :)
Waiting for your new tutorials on R programming. :)
Very clear, my friend. Thumbs up
thanks :)
Well explained, thanks for the upload
Hello Respected Professor Mike Marin, I really appreciate your great tutorials about R. I have watched all of your lectures and am more and more grateful for this great, helpful lecture series. I hope it will continue in the future. Wishing you a happy and healthy life. Thank you very much, stay blessed!
You're the best!
Hi Mike, what test should I perform on the data prior to selecting a polynomial model? Great video, man.
Superb. Helped me a lot. Thank you!
This video is gold!
Thanks for well-featured videos
you're welcome :)
That was an interesting video comparing first- and second-order polynomials for linear models; I really liked it. However, I am dealing with a mixed model right now and need to do the same comparison of first- and second-order polynomials for it, and this does not work for me. Do you have a tutorial video for mixed models as well? Thanks a lot.
Hi. Thank you for your great explanation. The page for Dataset & R Script doesn't exist and the provided link doesn't work.
Fantastic series. Very clear and crisp explanations. Thank you very much again for this. Would it be possible to make some videos on longitudinal data and logistic regression?
Is polynomial regression the same as orthogonal polynomial regression? Thanks!
Great help. Thanks a lot.
Thanks for the tutorial. But may I ask how to use the poly() function in multivariable regression? :D
When do you use an orthogonal polynomial rather than a raw polynomial?
I noticed that the summary output for the cubic model had large p-values for all the coefficients, but the multiple R-squared still seemed large, the residual error seemed low, and the overall F-statistic was large too, so we would reject the null (all coefficients = 0).
QUESTION: What should we say about each coefficient, given that their individual p-values are so high?
Thank you for the nicely explained tutorial. I have a question regarding the poly() function. Why do we use raw = T in this case? As I understand it, multicollinearity is a general problem in this situation, since x and x^2 are correlated, and the solution usually presented is to set raw = F, i.e., to use only orthogonal polynomials. But why do orthogonal polynomials solve the problem of multicollinearity? I'm lost in this field. I hope you can help me out.
Great video! Very helpful :-)
Very nice! Thank you :)
you're welcome +Benjamin Gutzmann
Thank you!
Is there any way to control for a variable inside the model? For example, controlling for age
What about the F-statistic's p-value (2.2e-16)? What is its significance or importance compared to the other p-values for height and height^2? Which one should we consider?
To be honest, none of them are particularly enlightening. The F-stat p-value tests the overall significance of the model... that is somewhat helpful, but it tests whether ALL coefficients are 0, so it essentially tests whether your model is significantly better than just guessing the mean y-value for everyone (is it better than nothing?).
The p-values for height and height^2 can be misleading, as those variables are correlated with each other (and can be correlated with other variables in a model), so their p-values can get inflated by this collinearity.
The best way to test the significance of variables is to compare models with/without a variable included. We have a separate video talking about this here: ua-cam.com/video/G_obrpV70QQ/v-deo.html
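As a rough sketch of that comparison (using simulated data here, not the exact dataset from the video), the partial F-test in R looks like this:
set.seed(1)
Height <- rnorm(100, mean = 168, sd = 10)            # simulated heights
LungCap <- 0.03 * Height + rnorm(100, sd = 0.5)      # simulated outcome
reduced <- lm(LungCap ~ Height)                      # model without X^2
full <- lm(LungCap ~ Height + I(Height^2))           # model with X^2
anova(reduced, full)   # partial F-test: does adding Height^2 significantly reduce the SSE?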
thank you!!
Does anybody know what it means if I get a non-significant p-value for the first-order polynomial beta of my predictor while the second-order beta is significant?
Hi, I'm getting the error message "Error in xy.coords(x, y, setLab = FALSE) : 'x' and 'y' lengths differ" when trying to add the regression lines to the original plot. But my data is of the same length. Any advice?
Hi, if you expand a bit on the exact commands you've entered, I may be able to figure out the issue. The error means that one of the variables (X or Y) that you are trying to plot has more elements than the other... but without seeing the code you've entered, I'm not able to figure out where the error is.
Hi Mike. Thanks for your great videos. By adding a polynomial predictor how does the interpretation of it change?
Well, it becomes harder to interpret the effect of that variable: the effect of the X variable is now being modelled using a polynomial, so the effect is not linear... the effect of a 1-unit increase in X on Y is not the same everywhere. One way to provide an interpretation is to take a value for x, calculate Y, then calculate the value of Y for x+1... do this for a few different values of x, and this will tell you the effect of a 1-unit increase in x at specific values of X. That's one way to go, if you want to talk about the effect of a 1-unit increase in X on Y.
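A quick sketch of that idea in R (assuming the LungCap and Height data from the video are loaded; the height values are hypothetical):
model <- lm(LungCap ~ Height + I(Height^2))
x.values <- c(150, 165, 180)   # hypothetical heights of interest
y.at.x <- predict(model, newdata = data.frame(Height = x.values))
y.at.x.plus.1 <- predict(model, newdata = data.frame(Height = x.values + 1))
y.at.x.plus.1 - y.at.x   # the effect of a 1-unit increase differs at each value of X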
Hi Mike and youtubers, I need to plot two sigmoid curves to the same dataset one for control and one for treatment points. If I subset the x axis it gives me an error. If I do not subset it gives me one line only to fit all the points. Do you have any suggestion to solve this? Thank you!
Hello, hi Mike Marin. I have a question and hopefully you can help me answer it. First, is this method of polynomial regression in R applicable if I have 3 variables (2 independent variables and 1 dependent variable), and how do I develop it?
Second, can any data use this method, or is there a way to verify whether the data is suitable for it?
Hopefully you can help me, Mike. Thank you.
Thanks a lot
Thanks a lot for the videos... very helpful.
I have a question, for the community as well: is there any way to automate the process of selecting the best regression model, instead of comparing the models by hand? I have a scenario with a lot of variables.
Hi Diego, there are, but I wouldn't fully recommend them. Here's a brief summary. You can read more about things like *step-wise selection* or *all-subsets*; the key phrase for these is *automated model selection procedures*. The "stepwise" approach alternates steps (forward and backward), adding and removing variables until you hit a steady state where you cannot add/remove any variables. This also requires specifying a "maximum model" (e.g., will you consider one-way interactions, two-way interactions, etc.). The "all possible subsets" approach considers every possible model with every possible subset of variables, and chooses the one with the lowest AIC (or lowest BIC, if you prefer that).
The drawback of these is that they are purely automated and don't allow input from the user; they are sort of a "black-box" approach. As an example of what I mean by this, suppose you have a set of data for a bunch of school-aged children, and one variable is "age" and another is "grade that they are in". These variables contain almost exactly the same info, but are not exactly the same. The one that is selected into your model will be based mostly on chance... I myself would prefer to have some control over which one ends up in the model (I would personally choose age, as I think it is more meaningful than "grade they are in").
Automated procedures also allow variables to be included/excluded based on chance correlations. By chance, some meaningless variables will always end up correlated with your "Y" variable, and automated procedures will end up including these. I prefer an approach where, if I KNOW conceptually that a variable is correlated with Y, then I will want to get it into my model; similarly, if I KNOW something is not correlated with Y, I want to exclude it. These automated procedures let "chance" take a lot of control over your model building and variable selection.
MY PERSONAL STANCE is that automated selection procedures can be useful as an exploratory tool... to help discover which variables may or may not be important... but I would always revise the model and the variables in it from there, and I wouldn't let an algorithm choose my model... I will combine "what I already know" with "what the data is telling me".
Hope that's helpful...
Hi. Thanks a lot for your answer; it is really helpful, in fact. For sure many people will also benefit from this answer. I will look for more information on the topics suggested. Using it as an exploratory tool when there are a lot of variables sounds like a good idea.
Thanks a lot again.
Hi, thanks, your video is really useful for me. I have a question: one of my regression coefficients is not significant whether I apply a linear or a non-linear regression model. Do you have any suggestions for my case? Thanks in advance!
Does that even matter if I keep on using the models?
Hi Mike, I just watched your playlist about regression models in R and it was very helpful!
So far you have worked with the lm() function in R, but there are so many others, like glm(), or lmer() and glmer() from the lme4 package. What are the differences between those models? Certainly it depends somehow on your data, but how can I find out which model I should use for my analysis? It would be great if you have a tip on what I should focus on... Thank you in advance!
Hi +Fa Fa, the others are for entirely different regression models. *lm* fits a linear regression (y/outcome is numeric, and assumed normal). *glm* is for generalized linear models, which are a whole class of models on their own. For example, logistic regression is a generalized linear model (for a y/outcome that is binary, and assumed binomial), and Poisson regression is a generalized linear model (for a y/outcome that is a count or rate, and assumed to follow a Poisson distribution). And there are many other GLMs. *lme* models are linear mixed-effects models, and are often used for longitudinal data. Each of these is a very large topic on its own.
In a traditional stats department, there are usually multiple courses offered on GLMs, a full course on longitudinal data analysis, and so forth. So I cannot do these justice in a few short paragraphs.
The short answer of which to use for your analysis would depend mostly on the type of data (and, more importantly, the type of outcome (y) variable that you are working with). The example I use in the videos has an outcome/y of lung capacity (which is numeric/continuous) and assumed to be normally distributed, so I'm using linear regression.
I hope that helps clarify some things.
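As a rough sketch of how the calls differ (the formulas, variable names, and data frame here are all hypothetical):
fit.lin <- lm(y ~ x1 + x2, data = mydata)                             # linear regression: numeric y
fit.log <- glm(y.binary ~ x1 + x2, family = binomial, data = mydata)  # logistic regression: binary y
fit.poi <- glm(y.count ~ x1 + x2, family = poisson, data = mydata)    # Poisson regression: count y
library(lme4)
fit.mix <- lmer(y ~ x1 + (1 | subject.id), data = mydata)             # mixed model: random intercept per subject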
Hi Marin, so when polynomial terms are in the model, how do you interpret the coefficients in a valuable way? Intuitively, coefficient makes sense for just "height" but what about "height^2"? Or "height^3" and so forth... thanks!
In this case, the coefficient doesn't have a simple interpretation... that's because the relationship between X and Y isn't assumed to be something simple like a line (which has a slope with a nice, simple interpretation). With X and X^2 in the model, the change in Y for a 1-unit change in X is NOT the same everywhere... and so you can't have a simple interpretation. If you want interpretable model coefficients, there are other options for addressing non-linearity. One that works and maintains a simple interpretation is to "categorize" the numeric variable (to convert it from numeric into a set of categories).
We have a separate video talking about the different ways to address non-linearity, focusing on the concepts. I'm linking to it here in case you want to explore that: ua-cam.com/video/tOzwEv0PoZk/v-deo.html
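A minimal sketch of the categorizing idea (the cut-points here are made up for illustration, assuming the video's Height and LungCap data are loaded):
Height.cat <- cut(Height, breaks = c(-Inf, 155, 170, Inf),
                  labels = c("short", "medium", "tall"))   # numeric X -> 3 categories
model.cat <- lm(LungCap ~ Height.cat)
summary(model.cat)   # each category coefficient has a simple interpretation vs the reference group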
What if we have around 8 independent variables? How do we determine the x^2 / x^3 terms?
I'm not sure what you mean by this. if you try clarifying, i may be able to help
In polynomial regression, do we take the log of the Y value? E.g., lm(log(Y) ~ poly(X, 2, raw = T)).
Good question, I am looking for something similar. I need to fit a polynomial regression with log10 of the x values using poly(), and I can't get it to work.
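For what it's worth, R lets you apply transformations directly inside the model formula, so something like this should work (a sketch, assuming numeric vectors Y and X with positive values):
fit <- lm(log(Y) ~ poly(log10(X), 2, raw = TRUE))   # log of Y, polynomial in log10 of X
summary(fit)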
Hi Mike,
I have seen somewhere that we have to divide the data into two groups: one for development and a second for validation/testing. So is it necessary to validate the model before presenting it to business peers?
Please advise.
Regards
Nirmal
Hi Nirmal, it depends on your reason for fitting a model. If you are using the model to make predictions (a predictive model), then you probably want to do some sort of validation of the model (to ensure that it makes good and reliable predictions). There are lots of packages in R for different sorts of validation. Keywords to research are "cross validation" and "leave-one-out validation"; when you search those topics you will come across different sorts of validation methods. Cross validation is probably what you want to research the most. Good luck!
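A bare-bones sketch of k-fold cross-validation in base R (assuming a hypothetical data frame dat with an outcome y and a predictor x; packages like caret automate this sort of thing):
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))   # randomly assign each row to a fold
cv.errors <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]                        # fit on k-1 folds
  test <- dat[folds == i, ]                         # evaluate on the held-out fold
  fit <- lm(y ~ x, data = train)
  pred <- predict(fit, newdata = test)
  cv.errors[i] <- mean((test$y - pred)^2)           # mean squared prediction error
}
mean(cv.errors)                                     # overall cross-validated error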
Hi Mike, thanks for your time.
I request you to make a video on validation of the model. I know you would do it in 5 minutes; others would take hours to explain the same things.
I hope you will consider my request.
Again, thanks for making R easy and fast to understand. Cheers from India.
Regards,
Nirmal
Sir, in LungCap vs Height: first you should check the correlation coefficient r. If r = 0, that means there is no linear relationship, and that is when you can go for polynomial regression. But you fitted a polynomial regression here, where r is not 0. Why did you move to the polynomial regression concept?
Hi, polynomial regression is not for when the correlation is 0; it is an option when there is a relationship, but not a linear one (maybe a somewhat curved/non-linear relationship). We have a video that talks a bit about this: ua-cam.com/video/tOzwEv0PoZk/v-deo.html
Hi Mike Marin, I'm so sad you stopped making videos! I have a question for you and I hope you can help me (you could start a new series talking about it :)). How do I treat historical data? I have daily data for 200 years. I have to plot it all first and then plot only the maximum for each year. And what can I do if I have 365 days in some years and 366 in others? Hope you understand what I mean. Thank you in advance!
I checked the ts() function but I'm having trouble with it (especially with the frequency argument).
Hi +TKSGL89, thanks! We haven't actually stopped making videos... life has just gotten busy and we've had to slow down a bit... but we plan on continuing to make videos for the foreseeable future! We've actually got a few different ones in the works, and a list of topics we want to cover that is WAY too long... there are so many cool topics that could be covered... just no time!!
So, that's time-series data you've got there, so you'll want to be using time-series methods (looks like you've started there, with the ts() function). I won't have time to make anything helpful for you anytime soon, but I'd suggest searching around for resources on time series in R.
As for picking out the max for each year, there are different ways to do that, and some of it depends on exactly how your data is organized. But you should have the *variable* of interest, as well as a *year* variable. To find the max for each year, you would use something like *max(variable[year==2015])*, and this could be done for every year. And you can do this in more efficient ways (like using apply statements, or other ways) once you've coded it in a simple way.
Hope that helps get you started!
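A tiny sketch of the more efficient version (assuming numeric vectors variable and year of the same length):
max.per.year <- tapply(variable, year, max)   # maximum of the variable within each year
max.per.year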
Great to hear you haven't stopped! Thank you for these explanations and for being so quick! I'll be waiting for next videos :) see you soon
Why is it good to have orthogonal polynomials? Are they needed in the model?
And by the way, your link is broken. Not available.
Hi, it isn't completely necessary, but what it does is reduce the collinearity between the predictors in the model... because X and X^2 will be highly correlated, and thus their SEs will get inflated... orthogonal polynomials address this.
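A quick sketch of that with made-up data: raw polynomial columns are highly correlated, while orthogonal polynomial columns are uncorrelated by construction:
set.seed(1)
x <- rnorm(100, mean = 50, sd = 5)
cor(poly(x, 2, raw = TRUE))[1, 2]   # correlation between x and x^2: close to 1
cor(poly(x, 2))[1, 2]               # orthogonal polynomials: essentially 0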
try this one: statslectures.com/r-scripts-datasets
How do you decide what degree of polynomial you should go to?
To do a formal test of polynomial terms, you can compare a model that uses just "X" to one that also includes "X^2", to test if the model with "X^2" is significantly "better". If it is, then you can compare the model with X, X^2 to a model with X, X^2, X^3 and test if that model is a significant improvement... and continue until the model does not improve. To do this test you can use the "Partial F-test" or "Likelihood Ratio Test"; we have a video showing that here: ua-cam.com/video/G_obrpV70QQ/v-deo.html
You can also decide conceptually which you think makes sense and begin from there. I work in health research, and most of the time we don't want to go beyond X^2, or maybe up to X^3, as beyond that usually isn't realistic. E.g., some things have a sort of exponential growth, and including X^2 may be appropriate; at times including X^3 to allow for another inflection may be relevant... but past that, there aren't many things where you could conceptually justify a relationship up to the power of X^4.
The most important part of model building is that your model is conceptually sound... don't rely purely on statistical testing... make sure that your model also makes sense in context.
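A sketch of that sequential testing (assuming the LungCap and Height data from the video are loaded):
m1 <- lm(LungCap ~ Height)
m2 <- lm(LungCap ~ Height + I(Height^2))
m3 <- lm(LungCap ~ Height + I(Height^2) + I(Height^3))
anova(m1, m2, m3)   # each row tests whether that model significantly improves on the previous one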
I think there is something I am missing here. When running anova for the 2 models, I understand the null hypothesis (no significant difference), but what about the alternative? If there is no significant difference, couldn't it be that the full model is worse? My question, more clearly: with anova, are these always the hypotheses for the models? Is the alternative always that the full model is better, or could the alternative hypothesis be that the full model is worse?
Thank you.
Yes, the alternative is always that the full model is "better" (it has significantly less unexplained error). Adding an unnecessary variable can never increase the SSE (the unexplained error), so this test is testing whether the larger model has significantly lower unexplained error (lower SSE).
please upload new tutorials regarding quadratic regression.
I think R has changed how it treats x^2 if you just put it into the formula since this video was made.
What about the multivariate case?
I'm a bit unclear on what you are asking, but if you are asking how to include a polynomial term in a model that also has many X variables, it would look like this:
lm(y ~ X1 + I(X1^2) + X2 + X3 +...)
This is exactly the same as shown in this video, except with other variables included as well.
Hope that answers your question.
Doesn't the inclusion of Height^2 and Height^3 in the model cause multicollinearity?
BTW you make excellent content, Thank You.
Thanks! Yes, including X^2, X^3, ... introduces collinearity between X, X^2, etc. This may or may not be an issue. First, let me mention that the usual solutions are to "center the X variable" (i.e., include centered-X and (centered-X)^2, which can help reduce the collinearity between the two), or to use "orthogonal polynomials".
Collinearity between X and X^2 really only serves to inflate the SEs of the coefficients for X and X^2 (it doesn't really affect the coefficients themselves, or the shape of the model fit), so in that sense it is not such a big issue.
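A quick sketch of the centering idea with simulated data:
set.seed(1)
x <- rnorm(100, mean = 50, sd = 5)
y <- 2 + 0.1 * x + 0.02 * x^2 + rnorm(100)   # simulated response
cor(x, x^2)                                  # very high: raw x and x^2 are nearly collinear
x.c <- x - mean(x)                           # centered x
cor(x.c, x.c^2)                              # much lower after centering
model.c <- lm(y ~ x.c + I(x.c^2))            # same fitted shape, less collinearity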
Thanks. But the data we can download is different from the data you use in the video.
Is it possible to find the VIF in R?
Yes, but it's not in base R (at least to my knowledge). There are packages that you can install that will get you the VIF and other related things.
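For example, the car package has a vif() function; a sketch (assuming the video's LungCap and Height data are loaded, and that car is installed):
library(car)
model <- lm(LungCap ~ Height + I(Height^2))
vif(model)   # large values (e.g., above ~5-10) suggest problematic collinearity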
wow wow
Your script link doesn't work.
Hi @Semihardbagels, it is fixed now. Let us know if you have any trouble accessing the files.
Never studied statistics? This stuff is absolutely linear:). Not even outliers.