Thanks for the video. You made it look so simple. Could you upload a video on how to get risk ratios in a multiple logistic regression model?
Since the values are 0 and 1, it probably doesn't matter. However, to be safe, it's probably a good idea to make all categorical variables, regardless of their values, factors.
Hi Josh, firstly many thanks for your videos on this topic. I have noticed very odd and conflicting results between R and SPSS with regard to entering factors (with more than 1 level) into a logistic regression model. SPSS produces a simplified output containing an odds ratio with 95% CI and p-value for each individual variable entered into a logistic regression model (rather than for the factor levels, as displayed in R). In R, I have not found a good way to do this. I have used the logistic.display command as well as exp() to get odds ratios, but they do not provide an overall value like SPSS does (instead listing these for each individual level within the factor). Do you have any idea why SPSS and R handle logistic regression differently like this? All I would like is output similar to SPSS's, where I get a single odds ratio, 95% CI, and p-value for each individual factor variable entered.
Unfortunately I've never used SPSS so I'm not really familiar with the problem you are having. That said, perhaps this will help: stats.stackexchange.com/questions/543540/different-output-for-logistic-regression-between-r-and-spss-how-to-get-correct
These videos are so amazing! Do you have a suggestion for a book that explains Logistic Regression to newbies? The videos are super awesome, but extra references may help too. Hopefully you will write your own book soon! Thanks!
I know this is probably 10 months too late, but the book "Introduction to Categorical Data Analysis" by Alan Agresti is a great book. It does a really good job explaining logistic regression and is pretty light on the math.
Awesome! Thank you so much! Please could you do a video about conditional logistic regression like clogit in R with result interpretation and how it works when using adjusted parameters.
Hi, I really like your videos, every topic is as clear as water after watching it. I've watched this one and also the three videos about logistic regression's details. If you want to go further in this topic, you could do a video explaining emmeans package for R. Many people, including me, would understand post hoc tests for glm using emmeans, if someone like you explained it. Thank you!
At 11:30, the video states, "Since we are not estimating the variance from the data (and instead deriving it from the mean) it is possible that the variance is UNDERESTIMATED." Q. How can we say that we are UNDER-estimating the value of the variance? BTW, awesome vids, music man! ;)
@@statquest It's 1am but let me see if I got this straight... Due to the nature of discrete functions (like logistic functions) they do not always vary smoothly. With discrete functions it is possible to see variances (and their corresponding probabilities) differ from (in our case) the proposed logistic model. In other words, it is conceivable to have a leptokurtic or platykurtic distribution. It is possible to see probabilities which differ from the expected probabilities due to the fact that the "real" model may be different and/or the samples may not be i.i.d. As it happens, the Bernoulli distribution tends toward the platykurtic. ...It's just the wrong dang model sometimes...
It's actually a little simpler than that. With binomial data (like logistic regression) we estimate the mean value = number of positive responses / total number of responses. Once we have the mean value estimated, we use that, and that alone, to calculate the variance. In other words, once we have calculated the mean, we do not need the data anymore to calculate the variance. This is in contrast to linear regression (or a lot of other things) where we estimate the mean with the data and then use the data again to calculate how it varies around the estimated mean. Thus, there is a possibility that with logistic regression (and other "generalized linear models") we did not correctly estimate the variance since the data were not involved in that calculation. If we overestimate the variance, that just makes the calculations more conservative and, generally speaking, that's not a problem. However, if we underestimate the variance, then that means we're more likely to say things are significantly different even if they are not, and that's no good. So the dispersion parameter takes care of that.
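Jumping in with a small sketch of the point above (Python, with made-up counts, just to make the "variance comes from the mean" idea concrete):

```python
import statistics

# Made-up example: 6 clinics, each with m = 10 patients; we count how
# many patients at each clinic are "unhealthy".
counts = [1, 2, 8, 9, 2, 8]
m = 10

# Estimate the mean proportion from the data...
p = sum(counts) / (m * len(counts))          # 30 / 60 = 0.5

# ...and the binomial model then DERIVES the variance from that mean
# alone; the data are never consulted a second time.
var_binomial = m * p * (1 - p)               # 10 * 0.5 * 0.5 = 2.5

# The variance the counts actually show:
var_observed = statistics.pvariance(counts)  # about 11.3

print(var_binomial, var_observed)
```

Here the observed variance is much larger than the model-implied one (overdispersion), which is exactly the situation where an underestimated variance would make p-values too optimistic, and where the dispersion parameter comes in.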
Hi Josh, thanks for this amazing tutorial. Would you be able to add something about interactions between predictors and random effects? I am trying to run a mixed-model logistic regression with three-way interactions but am not entirely sure how to deal with them. Thanks so much :)
I like the way you presented the information in a manner that is easily understood. I have 2 questions:
1. While doing the xtabs, what do we need to do if we find that either or both of healthy and unhealthy under cp3 is 0 or very minimal (video clip at 6:04)?
2. At 15:55 of the video clip, you mentioned using cross validation to get a better idea of how well the model might perform with new data. Do you have a separate video specifically on that topic?
Many thanks Williams
1) Unfortunately I don't understand what you're asking in this question. However, I think you are asking what do we do when one level from a categorical variable does not have strong preference for healthy or unhealthy or doesn't have much data to begin with. It really depends. You can just try it and see what happens, but you might also try removing the variable and see if that improves predictions. 2) I have a video on cross validation here: ua-cam.com/video/fSytzGwwBVw/v-deo.html
@@statquest Thanks for your prompt reply. Let me clarify my 1st question: at 6:24 in the video, you mentioned that there are 4 patients representing level 1 of the restecg category. My first question is, why can only 4 cause a problem? Is it because it is too minimal compared with the others (level 0 and level 2)? How do I know exactly that it is causing a problem when I do the analysis? And if it does cause a problem, how do I go about fixing it? Does just removing level 1 solve it? Thanks for your help
@@williamstan1780 When you don't have much data supporting a specific category, then chances are it will have a lot of variance - in other words, further samples may be very different from the ones in the original dataset. You can test this with cross validation (use some of the data to fit the model, use the rest to see how well it performs). If things are no good, you can remove the variable, or try to lump categories together.
All of your videos are great and fun to learn from! Could you please upload a tutorial on mediation analysis using STATA and R (using the mediation package)?
Great video, thanks so much Josh! After the 4th minute you mention how to address the NA samples. Can you teach us the RANDOM FOREST method, if we don't want to get rid of our NA samples (e.g. in multivariate cases, where the rows include other useful info)? Thanks!
Great salute! If you can, please post a video on all the machine learning models, with a large-dataset example implementation in R, with clear intuition and the mathematics/statistics behind it. Thanks.
At the end, where you make the graph, you could have used the broom package's augment() function to create the data frame with the fitted and actual values.
Excellent video, very clear and easy to follow! Do you have any videos that show how to do best subsets and cross validation with logistic regression on R? I know you have a video that explains the concept of cross validation but I am looking for a video like this that goes through it step-by-step for logistic regression on R. Same thing for how to run all possible models (best subsets) using logistic regression on R. I have found one by another youtuber for linear regression but not for logistic.
Great videos, Josh! You make things so easy! I just had a question though - Is it mandatory to convert all variables (which can be converted into factors) into factors? For example, what would have happened if we have kept the sex variable as numeric? Does it make my logistic regression model incorrect?
@@statquest Well yes, doing this for the sex variable makes sense. However, for my data, I have a religiousness column with discrete values 1-5 and a rating column again with a discrete rating of 1-5. So should I make these two variables factors as well? Or is it fair to keep them as numeric? Also, thanks for such a prompt reply. Really appreciate it!
Hi, I love the way you explain all these things! I have a couple of questions. I see that it's necessary to establish a coding for the predictors; if they are dichotomous, for example, they are assigned 1 and 0 (in the example, male/female). So: - How should we proceed with polytomous predictors? - What results of the model should be reported in a scientific article? Thank you in advance, and keep making great content!
1) For all categorical data (with 2 or more classes), just make sure you are storing it in a factor. 2) That depends on the journal. I would look at other articles in that journal to figure it out.
Hi Josh, I find your videos very informative and they help me a lot with my bachelor's thesis. Because you put some variables into "factors" and others stay "numeric", I think I can ask my question, which I can't find an answer to anywhere on the internet (or I don't know how to search for it!). I am running a logistic regression on NBA regular season games to find out whether the fact that a team has been eliminated from the playoffs has an effect on its winning probability (to find out if they "tank" = intentionally lose). For the current strength of the team I use its current winning percentage (games won over games played), and this variable is refreshed after every game. I was wondering if I can put this variable in as "numeric"? Or how would you define the type of this winning percentage? The opponent's winning percentage, whether the game is on the home court or not, whether the team is statistically eliminated or in the playoffs, and whether the opponent is statistically eliminated or in the playoffs are also in the regression. It is the same regression some researchers ran back in 2002 to test the same thing, but no one has done it recently. I hope you understand my question and very much hope that you can and are willing to help me. Thank you very much and have a great day!
For logistic regression, it will be easier to understand what the estimated coefficients mean if you multiply the percentage of games won by 100. When you do this, you can use these values as "numeric" and the coefficient will tell you how much the log(odds) of the outcome changes for every 1-percentage-point change in that variable. For more details on interpreting the coefficients, check out ua-cam.com/video/vN5cNN2-HWE/v-deo.html
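To make that concrete, here is a hedged sketch (Python; beta and the intercept are made-up numbers, not values fitted to any NBA data) showing that exp(coefficient) is the odds ratio per 1-percentage-point step, while the change in probability depends on where you start:

```python
import math

# Hypothetical coefficients: each 1-percentage-point increase in winning
# percentage adds beta to the log(odds) of the outcome.
intercept = -2.0
beta = 0.04

def prob(win_pct):
    """Predicted probability for a given winning percentage (0-100)."""
    log_odds = intercept + beta * win_pct
    return 1 / (1 + math.exp(-log_odds))

# The odds are multiplied by exp(beta) for every 1-point increase...
odds_ratio = math.exp(beta)   # about 1.041

# ...but the change in PROBABILITY is not constant:
print(prob(50) - prob(49))    # near the middle of the curve
print(prob(90) - prob(89))    # out on the flatter part of the curve
```

The two printed differences are not equal, which is why the coefficient is best read on the log(odds) scale.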
@@statquest I'm trying to use a logistic regression model on a set of binary events, each with a different probability of happening... and I have no idea what I'm doing haha.. so I'm loading up on coffee and I'm going to start your videos soon
Here's the link to the code: github.com/StatQuest/logistic_regression_demo/blob/master/logistic_regression_demo.R
Support StatQuest by buying my books The StatQuest Illustrated Guide to Machine Learning, The StatQuest Illustrated Guide to Neural Networks and AI, or a Study Guide or Merch!!! statquest.org/statquest-store/
Hi Josh,
Love your content. It has helped me learn a lot and grow. You are doing awesome work. Please continue to do so.
Wanted to support you, but unfortunately your PayPal link seems to be broken. Please update it.
Your videos never disappoint, Sir. I have gone through many of them and think you've earned the right to brand the phrase "clearly explained", because your explanations are indeed very clear. I am building a better understanding of statistics thanks to you. I appreciate you and hope you continue to pass on the knowledge.
Wow, thanks!
This 89-year-old guy says BAM!! So clearly explained, indeed. DOUBLE-BAM!!!!
BAM!!! And thank you for your support!!!!
Thank you so much Josh for all these videos! I got an A-plus in most of my stat courses quite a few years ago when I was doing my MSc in Biostat, but it took me quite some time to come up with a better understanding of a few concepts. You just summarized and presented these ideas and more in a few minutes! You are a genius and, on top of that, you are so kind to share all this work with everyone for free! With my limited vocabulary, all I can say is THANK YOU! It makes me feel the world is a beautiful place with beautiful minds and souls. I love your song “hello”, it reminds me of the day I met my daughter and brought happy tears to my eyes :)
Thank you so much!!! I'm really glad you like my videos and my music. :)
Where have you been my whole thesis! Thank you!!
Hooray! I'm glad to help! :)
I feel the same!! hah
Josh, it’s Saturday morning here and I’m enjoying a cup of Bam! learning R from the best teacher on the planet. I’m so grateful and appreciative of your efforts to share your considerable talents with us!
Thank you very much! :)
I just wish one day all this information actually stays and sticks in my mind... thank you though! Your videos are amazing!
Thanks for watching!
What is great about your videos is that even when I forget my headphones I am able to follow along in a computer room full of other students! Thank you so so so much !!!! From the University of Bordeaux
Solal Sténou Merci!! :)
You will surely be in my thesis acknowledgments. Thank you for making our lives relatively easier and statistics truly more intelligible. BAAAAAM!!
Thanks so much! :)
You are an absolute life saver. My data science paper is due in two days and now I have my pretty log graph and I understand this better. DOUBLE BAM!!!!!
Hooray!
so, how did it go today?
Your simple English explanation of the meaning of "Intercept" in the output from 8:30 to 8:38 of this video was something I could not find after searching for 2 hours. Thank you!
Awesome!!! Now that you have that concept down, a lot of other stuff in statistics should make more sense. (At least I hope!) :)
Your videos are great! It's also so nice of you that you take the time reply to so many of the comments here !
Thank you!
Nice channel to land on! Happiest discovery of my 2020! Great job!
Thank you! :)
Your videos cover everything in my course and I wish I found you sooner! So much detail and clear explaining in such little time
I was just here for the logistic regression but bam!! I will be watching all of your videos. As a DS learner using R, double bam!!!, your videos will surely help big time! Bambambam! 👌😅
Thank you. 🙂
Awesome! Thank you!
I really enjoyed the clear way you explained this topic to us. Many thanks for the teaching!!!
Thank you very much!!!
"one last shameless self promotion" got me 😂😂😂.....that's why I love your videos, u make learning stats fun
Hooray! Thank you! :)
Both my husband and I learned so much from your video. (Inspired by the top comment:) whenever you come to Toronto, let us know for a few free accommodations in our Asian restaurant/bubble-tea-surrounded neighborhood (North York Centre)!
Thx again!
Xin
Hooray!!! That would be awesome. I will dream of the day I can visit you in Toronto. :)
Thank you very much for all this material. You have saved me from failing my exams. Amazing quality channel; the low number of likes is unbelievable. A very appreciated channel, at least by me. Thanks again.
Wow, thanks!
Thank you so much! I've a stat project to do in R with logistic Regression and this simplified the coding portion so much!
Hooray!
You are an amazing teacher. God bless you!
Thank you! 😃
You sir deserve a promotion 👏 thanks for this incredibly helpful video
Thank you! :)
I recommend all the videos by stat quest with Josh Starmer. Thank you for your good explanations.
Thank you very much! :)
Thank you so much for this video! I've been suffering with the coding for my project but this really helped. You're a star!
Thanks!
good looking white background...
graphs are beautiful...
whatever you say, you write it on screen....
your sound and sound system, very good..
the way you explain things, CLEARLY EXPLAINS everything..
and loved that music part and BAM!!!
and here, i have something to say about your work..
and that is VERY BIG BAM !!!... good luck.. keep growing..
Thank you very much! :)
It's 1:11 AM and what I am doing is DOUBLE BAM. Thank you for this awesome video. You are a hero.
Thanks! :)
Thanks for an excellent video. As usual.
Thanks again!
Thank you for saving my study. Not gonna lie, this video made me cry. I was about to drop out because of statistics, but this saved my project.
Hooray!
It must be so much fun working with you! Thank you for this tutorial. =)
Thank you! :)
So helpful, thanks!
Whenever you come to Cyprus, let me know for a few free accommodations in our mountainous region, Marathasa!
Thx again!
Γ
Wow! That sounds awesome!!!
@@statquest
oh yes!
I owe you a lot - you saved me so many hours!
Γ
Thanks for also showing how to wrangle data and explore missing data in a simple helpful way ❤
My pleasure 😊
I dont understand why you used both categorical for logistic regression??
7:00
The outcome is dichotomous.
Doing a master's program in analytics, and this video made more sense than all the lectures on logistic regression combined. Thank you!
Thanks!
Hi Josh, thanks for your videos they are very easy to understand. Really appreciate your efforts. I believe I speak for many,
Because of you many people are able to understand with the utmost clarity, and you cover all the small details with super ease. Keep up the noble work. Cheers 👍
Would it be possible for you to put up a video on model evaluation, i.e. determining the cutoff and model performance?
Thanks
Thank you! :)
Great job bro.
Gratitude for your help. You also have a place to stay if you come to Uganda (Africa).
Thank you very much!!! :)
I am impressed; you are talented. Thanks for sharing your knowledge.
Thank you! :)
I won't forget you in the acknowledgments sir haha!!! Great job!
Thank you very much! :)
Hi, Josh. I cannot thank you enough for these videos... It would also be good to have a similar video in Python.
Great suggestion!
@@statquest Where's the Python video, sir?
Thanks Josh - you are our saviour!
BAM! :)
@@statquest Triple Booyah BAM from my side!
THANK YOU! somehow I couldn't find any websites explaining this
Glad you found it.
Hoooray! We made it to the end of an exciting journey through logistic regression! Hope you have a nice day, and thank you for understanding the output for logistic regression in R, which really can't be understood thoroughly without watching all the logistic + odds videos!
Yep, that is correct. That's why I made all those other videos first - the output is jam packed with stuff.
Josh, joining all the folks here in thanking you! I have a question: around minute 9:05 you talk about the odds of being unhealthy for a female. How do we know that these are the odds of being unhealthy vs being healthy? I feel I am floating when it comes to intercepts, reference categories, and baseline categories. Thanks a lot!
R orders factors ("healthy" vs "unhealthy") in alphabetical order. So that means "healthy" is first, and the default, and "unhealthy" is the difference from that. Likewise, "sexF" and "sexM" are ordered alphabetically, so "sexF" is the default value and "sexM" is the difference from that.
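To add a sketch of what that default "treatment" coding amounts to (Python, purely illustrative; R does this internally when you pass a factor to glm()):

```python
def treatment_code(values):
    """Mimic R's default factor coding: sort the levels alphabetically,
    absorb the first level into the baseline (the intercept), and give
    each remaining level its own 0/1 indicator column."""
    levels = sorted(set(values))
    baseline, rest = levels[0], levels[1:]
    rows = [[1 if v == lvl else 0 for lvl in rest] for v in values]
    return baseline, rest, rows

baseline, columns, rows = treatment_code(["M", "F", "F", "M"])
print(baseline)   # 'F'  -> the default level, represented by the intercept
print(columns)    # ['M'] -> the "sexM" indicator column
print(rows)       # [[1], [0], [0], [1]]
```

So the intercept gives the log(odds) for the baseline level ("healthy" females here), and each indicator coefficient is the difference from that baseline.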
BAM! Spot on, thanks for such a video. My journey with logistic regression and R has started.
Awesome!!! :)
Mine too
amazing. thank you man!
Thanks!
You may like this video too:
Another great video about logistic regression in JMP
ua-cam.com/video/9yN_yjGAJZE/v-deo.htmlsi=jUwEZUDobBudE8AE
incredibly brilliant tutorial!
Thanks! :)
me binge watching Josh's videos before midterm... anyone else? lmao
Good luck! :)
You are a genius... and your teaching style too... hurray!!!! and Bamm!!!!
Wow, thank you!
Great tutorials. I started with your PCA video and have since been hooked on the other videos. Could I request a video on the various types of probability distributions and when to use them?
Those are all in the works. I wish I could work 2 or 4 times faster than I can. I've wanted to cover the major probability distributions for over a year, but got sucked down a machine learning path and now feel spread pretty thin. However, these will happen eventually! :)
StatQuest with Josh Starmer could you make a video on how to work 2 to 4 times faster? :-)
As soon as I figure that out, I'll make a video on it! ;)
BAM!!!
A small request - you have done a lot already, a big thank you for that. Is it possible to make a video on logistic regression in Python?
I'll keep that in mind.
@@statquest thank you so much
This has been extremely helpful. Thank you!
Thank you! :)
Clear as water. Super BAM!!! Thanks for sharing
Your videos are awesome. Thank you very much.
Thank you! :)
YOU RAWK !! Awesome explanations of ML concepts.
Thank you! :)
Thank you so much for this effort, really appreciated.
We need a StatQuest on three topics:
1. The Chi-square test,
2. The Hosmer-Lemeshow goodness-of-fit test for logistic regression,
3. Iteratively reweighted least squares (IRLS) using Newton's method.
If you don't mind :) of course.
Can you tell us the title of the next video?!
The Chi-Square test is on the list. I've looked into the Hosmer-Lemeshow fit... Can you tell me what you think about the limitations? Specifically those mentioned in the Wikipedia article about it? en.wikipedia.org/wiki/Hosmer%E2%80%93Lemeshow_test#Limitations_and_alternatives
And iteratively reweighted least squares is also on the list. However, up next are some basic statistics videos and then videos on lasso, ridge, and elastic-net regression.
The Hosmer-Lemeshow statistic was introduced to avoid a problem with the Pearson chi-squared statistic: when observations are grouped by the values of the x variables, the Pearson chi-squared goodness-of-fit test cannot be readily applied if there are only one or a few observations for each possible value of an x variable, or for each possible combination of values of x variables.
(A sufficiently large sample size is assumed. If a chi-squared test is conducted on a smaller sample, it will yield an inaccurate inference.)
So in the Hosmer-Lemeshow statistic, the observations are grouped by expected probability instead. But there is very little guidance on selecting the number of subgroups. The number of subgroups, g, is usually chosen using the rule g > P + 1, where P is the number of covariates. For example, if you had 12 covariates in your model, then g > 13. How much bigger g should be is essentially left up to you. Small values of g give the test less opportunity to find mis-specifications. Larger values mean that the number of items in each subgroup may be too small to find differences between observed and expected values. Sometimes changing g by very small amounts (e.g. by 1 or 2) can result in wild changes in p-values. As such, the selection of g is often confusing and arbitrary. Also, it doesn't take overfitting into account and tends to have low power. For these reasons, the Hosmer-Lemeshow test is no longer recommended.
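The grouping step described above can be sketched roughly as follows (Python; the probabilities and outcomes are made up, and the chi-squared tail probability that turns the statistic into a p-value is omitted):

```python
def hosmer_lemeshow_stat(probs, outcomes, g):
    """Group observations by predicted probability into g subgroups and
    sum (observed - expected)^2 / (n * pbar * (1 - pbar)) over groups.
    The result is compared to a chi-squared distribution with g - 2 df."""
    paired = sorted(zip(probs, outcomes))      # sort by predicted prob
    size = len(paired) // g
    stat = 0.0
    for i in range(g):
        group = paired[i*size:] if i == g - 1 else paired[i*size:(i+1)*size]
        n = len(group)
        expected = sum(p for p, _ in group)    # E_g: sum of predicted probs
        observed = sum(y for _, y in group)    # O_g: count of positive outcomes
        pbar = expected / n
        stat += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
    return stat

probs    = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0,   0,   1,   0,   1,   0,   1,   0]
print(hosmer_lemeshow_stat(probs, outcomes, g=2))  # about 1.33
```

Rerunning this with a different g can change the statistic noticeably, which is exactly the arbitrariness complained about above.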
Am I right? Are these enough reasons to no longer use the HL test?
I have another question. (Overfitting happens when your sample size is too small. If you put enough predictor variables in your regression model, you will nearly always get a model that looks significant.
While an overfitted model may fit the idiosyncrasies of your data extremely well, it won’t fit additional test samples or the overall population. The model’s p-values, R-Squared and regression coefficients can all be misleading. Basically, you’re asking too much from a small set of data.)
If I have a small sample, is there any problem with using maximum likelihood to fit the model and McFadden's pseudo-R squared? Is there any rule for choosing the number of samples for any regression?
Sorry for so many questions; it is my first year in biostatistics. :)
These are all great questions. You are correct about the HL test and you are correct about overfitting. There are, however, lots of tricks you can use to compensate for overfitting (lasso regression, ridge regression, elastic net regression etc.)
One way to test to see if you have a model that is "overfit" is to use cross validation.
As for a minimum number of samples for logistic regression - people often say "10 samples per level of each discrete variable". It's a general rule of thumb and it doesn't always apply. However, again you can use cross validation to verify if you have enough samples or not. Cross validation is a very practical tool!
Thank you, Mr Josh, for answering me, I need to study more about Cross-validation.
Sorry I have more than one account 🙈🙊
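To make the cross-validation suggestion concrete, here is a minimal base-R sketch of 5-fold cross validation for a logistic regression model; the data frame and its column names are made up purely for illustration.

```r
## Minimal 5-fold cross validation for logistic regression in base R.
## All data and column names here are simulated for illustration.
set.seed(1)
n  <- 100
df <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b"), n, replace = TRUE))
)
df$y <- rbinom(n, 1, plogis(0.8 * df$x1))

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
acc <- numeric(k)
for (i in 1:k) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit <- glm(y ~ x1 + x2, data = train, family = "binomial")
  p   <- predict(fit, newdata = test, type = "response")
  acc[i] <- mean((p > 0.5) == test$y)  # out-of-fold accuracy
}
mean(acc)
```

If the average out-of-fold performance is much worse than the in-sample performance, that is a sign the sample may be too small for the number of predictors.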
you are just the best! Thanks for doing this!
Thank you! :)
Excellent!!!! Thank you very much.
bam!
Sir, you are a savior
Thanks! :)
Josh you are amazing, thank you!
You're the man! thanks for everything!
Thank you very much! :)
so incredibly helpful and well done. Thank you so much!!
Thank you! :)
You really are wonderful for explaining this in a way morons like me can understand, this is so incredibly helpful. Thank you so much!
This channel has helped me a lot understanding statistics! Could you please make a video explaining the linear mixed model too?
Yes! However, it might be a while before I get to it.
Great content and incredible value. Thank you so much
Thanks! :)
The more I watch your videos the more I wish I had a teacher like you in my school days..
Do we have a video on chi square test?
Not yet. :( But one day we will.
Great video!!! Thank you so much!
Thanks!
This man is a legend
:)
Thanks for the video. Your video made it look so simple. I request you to upload a video on how to get risk ratios in a multiple logistic regression model.
I'll keep that in mind.
This is great stuff as I am just learning R; so pardon a very basic question: Why does "sex" need to be a factor vs number here?
Since the values are 0 and 1, it probably doesn't matter. However, to be safe, it's probably a good idea to make all categorical variables, regardless of their values, factors.
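A minimal sketch of the factor conversion being discussed; the toy data frame and its column names are hypothetical, not from the video's dataset.

```r
## Hypothetical toy data; the column names are made up for illustration.
df <- data.frame(
  hd  = c(0, 1, 0, 1, 1, 0, 1, 0),        # outcome: heart disease yes/no
  sex = c(0, 1, 1, 0, 1, 0, 0, 1),        # 0/1 coded, but really categorical
  age = c(45, 60, 52, 39, 66, 48, 58, 50) # genuinely numeric
)

## Convert the 0/1 codes to factors so glm() treats them as categories
## rather than as points on a continuous scale.
df$hd  <- as.factor(df$hd)
df$sex <- as.factor(df$sex)

## glm() now creates a dummy variable for sex automatically.
fit <- glm(hd ~ sex + age, data = df, family = "binomial")
summary(fit)
```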
Hi Josh
Firstly many thanks for your videos on this topic. I have noticed very odd and conflicting results between R and SPSS with regards to entering factors (with more than 1 level) into a logistic regression model. SPSS produces a simplified output containing an odds ratio with 95% CI and p-value, for each individual variable entered into a logistic regression model (rather than the factor levels, as displayed in R).
In R - I have not found a good way to do this. I have used the logistic.display command as well as exp() to get odds ratios, but they do not provide an overall value like in SPSS (instead, listing these for each individual level within the factor).
Do you have any idea why SPSS and R handle logistic regression differently like this? All I would like is a similar output to SPSS - where I get a single odds ratio, 95% CI and p-value for each individual factor variable entered.
Unfortunately I've never used SPSS so I'm not really familiar with the problem you are having. That said, perhaps this will help: stats.stackexchange.com/questions/543540/different-output-for-logistic-regression-between-r-and-spss-how-to-get-correct
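For anyone landing on this thread: a hedged sketch of how R can report an odds ratio (with a Wald 95% CI) per factor level, plus a single p-value for the whole factor via a likelihood-ratio test, which is closer to the SPSS-style summary described above. The data are simulated and the variable names are illustrative.

```r
## Simulated data: a 3-level factor predictor and a binary outcome.
set.seed(5)
n   <- 150
grp <- factor(sample(c("low", "mid", "high"), n, replace = TRUE))
y   <- rbinom(n, 1, 0.4)
fit <- glm(y ~ grp, family = "binomial")

## One odds ratio (with Wald 95% CI) per non-reference level:
or_tab <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(or_tab)

## One p-value for the factor as a whole (likelihood-ratio test),
## closer to what SPSS reports for the entire variable:
anova(fit, test = "Chisq")
```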
Hi!! You may like this video too:
Another great video about logistic regression in JMP
ua-cam.com/video/9yN_yjGAJZE/v-deo.htmlsi=jUwEZUDobBudE8AE
super dangg! Good explanation, bro!
Thank you! :)
These videos are so amazing!
Do you have a suggestion for a book that explains Logistic Regression to newbies? The videos are super awesome, but extra references may help too. Hopefully you will write your own book soon!
Thanks!
I know this is probably 10 months too late, but the book “Introduction to Categorical Data Analysis” by Alan Agresti is a great book. Does a really good job explaining logistic regression and is pretty light on the math.
Awesome! Thank you so much! Please could you do a video about conditional logistic regression like clogit in R with result interpretation and how it works when using adjusted parameters.
I'll keep that in mind.
Hi, I really like your videos, every topic is as clear as water after watching it. I've watched this one and also the three videos about logistic regression's details. If you want to go further in this topic, you could do a video explaining emmeans package for R. Many people, including me, would understand post hoc tests for glm using emmeans, if someone like you explained it. Thank you!
Thanks! :)
Is it necessary to turn all the variables into factors before the regression analysis?
All of the categorical variables need to be converted to factors.
@@statquest Thank you very much. What do you classify as categorical?
@@afiapriscilla8276 Variables that represent discrete categories. Like "favorite color=Blue" or "Red"
And...BAM, thanks for sharing, your video is really useful :D
Thanks! :)
At 11:30, the video states, "Since we are not estimating the variance from the data (and instead deriving it from the mean) it is possible that the variance is UNDERESTIMATED." Q. How can we say that we are UNDER-estimating the value of the variance? BTW, awesome vids, music man! ;)
That's a great question. Here's a (hopefully) useful discussion on the topic: newonlinecourses.science.psu.edu/stat504/node/162/
@@statquest
It's 1am but let me see if I got this straight...
Due to the nature of discrete functions (like logistic functions) they do not always vary smoothly. With discrete functions it is possible to see variances (and their corresponding probabilities) differ from (in our case) the proposed logistic model. In other words, it is conceivable to have a leptokurtic or platykurtic distribution.
It is possible to see probabilities which differ from the expected probabilities due to the fact that the "real" model may be different and/or the samples may not be i.i.d. As it happens, the Bernoulli distribution tends toward the platykurtic.
...It's just the wrong dang model sometimes...
It's actually a little simpler than that. With binomial data (like logistic regression) we estimate the mean value = number of positive responses / total number of responses. Once we have the mean value estimated, we use that, and that alone, to calculate the variance. In other words, once we have calculated the mean, we do not need the data anymore to calculate the variance. This is in contrast to linear regression (or a lot of other things) where we estimate the mean with the data and then use the data again to calculate how it varies around the estimated mean. Thus, there is a possibility that with logistic regression (and other "generalized linear models") we did not correctly estimate the variance since the data were not involved in that calculation. If we overestimate the variance, that just makes the calculations more conservative and, generally speaking, that's not a problem. However, if we underestimate the variance, then that means we're more likely to say things are significantly different even if they are not, and that's no good. So the dispersion parameter takes care of that.
@@statquest Cheers,
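The dispersion point above can be illustrated by refitting a model with `family = "quasibinomial"`, which estimates the dispersion parameter from the data instead of fixing it at 1. This sketch uses simulated grouped binomial data (20 trials per group); note that for ungrouped 0/1 responses the dispersion estimate is much less informative.

```r
## Grouped binomial data (20 trials per group), simulated for illustration.
set.seed(42)
groups <- 30
x    <- rnorm(groups)
size <- 20
y    <- rbinom(groups, size = size, prob = plogis(0.4 * x))

## Standard binomial fit: the dispersion parameter is fixed at 1.
fit <- glm(cbind(y, size - y) ~ x, family = "binomial")

## Quasibinomial refit: same coefficient estimates, but the dispersion
## is now estimated from the data; values well above 1 would suggest
## overdispersion (an underestimated variance in the standard fit).
qfit <- glm(cbind(y, size - y) ~ x, family = "quasibinomial")
disp <- summary(qfit)$dispersion
disp
```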
Hi Josh, thanks for this amazing tutorial. Would you be able to add something about interactions between predictors and random effects? I am trying to run a mixed-model logistic regression and have three-way interactions but am not entirely sure how to deal with them. Thanks so much :)
same!
This is so helpful thank you!!
Hooray! :)
You are so good!! Thank you!
Thanks! :)
I like the way you presented the information in such a manner that is easily understood
I have 2 questions
1. While doing the xtabs, what do we need to do if we find that, say, the count for healthy and/or unhealthy under cp3 is 0 or very minimal (video clip at 6:04)?
2. At 15:55 in the video, you mentioned using cross validation to get a better idea of how well the model might perform with new data. Do you have a separate video specifically for that topic?
Many thanks
Williams
1) Unfortunately I don't understand what you're asking in this question. However, I think you are asking what do we do when one level from a categorical variable does not have strong preference for healthy or unhealthy or doesn't have much data to begin with. It really depends. You can just try it and see what happens, but you might also try removing the variable and see if that improves predictions.
2) I have a video on cross validation here: ua-cam.com/video/fSytzGwwBVw/v-deo.html
@@statquest
Thanks for your prompt reply
Let me clarify my 1st question: at 6:24 in the video, you mentioned that there are 4 patients representing level 1 under the restecg category.
My first question is: why can only 4 cause a problem? Is it because it is too minimal compared with the others (level 0 and level 2)? How do I know exactly that it is causing a problem when I do the analysis? And if it does cause a problem, how do I go about fixing it? Does just removing level 1 solve the problem?
Thanks for your help
@@williamstan1780 When you don't have much data supporting a specific category, then chances are it will have a lot of variance - in other words, further samples may be very different from the ones in the original dataset. You can test this with cross validation (use some of the data to fit the model, use the rest to see how well it performs). If things are no good, you can remove the variable, or try to lump categories together.
@@statquest thanks Josh ..: appreciated
gazillion bam THANKS to you!
Thanks!
great work done here
Thank you!
All of your videos are great and fun to learn from! Could you please upload a tutorial on mediation analysis using STATA and R (using the mediation package)?
I'll keep that in mind.
Thank You. SOOOOOOOOOooOOOoo Helpful
bam!
SO clear!! Thanks!!
Awesome!
Great video, thanks so much Josh! After the 4th minute you mention how to address the NA samples. Can you teach us the RANDOM FOREST method, if we don't want to get rid of our NA samples (e.g. in multivariate cases, where the rows include other useful info)? Thanks!
I cover the random forest method in this video: ua-cam.com/video/6EXPYzbfLCE/v-deo.html (the theory is here: ua-cam.com/video/sQ870aTKqiM/v-deo.html )
This video is very helpful. Do you also have a video about multinomial logistic regression in R? It would be very helpful if you could post one.
I'm glad you like the video. I don't have one on multinomial logistic regression, so I'll put it on the to-do list.
@@statquest hello! did you ever make a video for this one? would love to check it out if you did, thanks so much for what you do!
@@joshuabudi4787 Not yet. :(
Hi Josh, Thank you for the very informative tutorial. Do you have any videos on multilevel modelling?
Not yet.
Great video, salute! If you can, please post a video on all the machine learning models, with a large-dataset example implemented in R and clear intuition about the mathematics and statistics behind them. Thanks.
Love your videos. Could you do one on mixed logistic regression?
I'll keep that in mind.
This video is amazing! Thanks!!!
Thank you! :)
Thaaaanks! very useful and clear!
Hooray! I'm glad you like it! :)
lets goooo StatQuest
bam!
At the end where you make the graph, you could have used the broom package and its augment function to create the data frame containing the fitted and actual values.
Cool.
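A sketch of the broom approach suggested above, with the model and data simulated for illustration; it assumes the broom package is installed and falls back to base R otherwise.

```r
## Simulated model; 'fit' and the variable names are illustrative.
set.seed(2)
x <- rnorm(60)
y <- rbinom(60, 1, plogis(x))
fit <- glm(y ~ x, family = "binomial")

if (requireNamespace("broom", quietly = TRUE)) {
  ## augment() returns the original data plus a .fitted column;
  ## type.predict = "response" puts .fitted on the probability scale.
  plot_df <- broom::augment(fit, type.predict = "response")
} else {
  ## Base-R fallback: build the same data frame by hand.
  plot_df <- data.frame(y = y, x = x,
                        .fitted = predict(fit, type = "response"))
}
head(plot_df[, c("y", "x", ".fitted")])
```

Either way, plot_df now holds the actual outcomes alongside the fitted probabilities, ready for plotting.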
Excellent video, very clear and easy to follow! Do you have any videos that show how to do best subsets and cross validation with logistic regression on R? I know you have a video that explains the concept of cross validation but I am looking for a video like this that goes through it step-by-step for logistic regression on R. Same thing for how to run all possible models (best subsets) using logistic regression on R. I have found one by another youtuber for linear regression but not for logistic.
Not yet. :(
@@statquest Wow thank you for the quick reply! That's alright, if you do make any videos like that, I'll be among the first to watch them! :)
Great videos, Josh! You make things so easy!
I just had a question though - Is it mandatory to convert all variables (which can be converted into factors) into factors? For example, what would have happened if we have kept the sex variable as numeric? Does it make my logistic regression model incorrect?
Unless you were expecting a continuous range of values between the two sexes, your model would be incorrect.
@@statquest Well yes, doing this for the sex variable makes sense. However, for my data, I have a religiousness column with discrete values 1-5 and a rating column again with a discrete rating of 1-5. So should I make these two variables factors as well? Or is it fair to keep them as numeric?
Also, thanks for such a prompt reply. Really appreciate it!
Hi, I love the way you explain all these things! I have a couple of questions. I observe that it's necessary to establish a coding for the predictors; if these are dichotomous, for example, they are assigned 1 and 0 (in the example, male/female), so:
- How should we proceed with polytomous predictors?
- What results of the model should be reported in a scientific article?
Thank you in advance, and keep making great content!
1) For all categorical data (with 2 or more classes), just make sure you are storing it in a factor.
2) That depends on the journal. I would look at other articles in that journal to figure it out.
The last graph deserves a quadruple BAM!!!!🤣🤣🤣🤣
Yes!
Hi Josh
I find your videos very informative and they help me a lot with my bachelor's thesis. Because you put some variables into "factors" and others stay "numeric", I think I can ask my question, which I can't find an answer to anywhere on the internet! I am doing a logistic regression with NBA regular season games to find out if the fact that the teams are eliminated from the playoffs has an effect on their winning probability (to find out if they "tank" = intentionally lose). For the variable of the current strength of the team I use the current winning percentage of the team (how many games won over how many games played), and this variable is refreshed after every game. I was wondering if I can put this variable in as "numeric"? Or as what kind of type would you define this winning percentage? The opponent's winning percentage, whether the game is on the home court or not, whether the team is statistically eliminated or in the playoffs, and whether the opponent is statistically eliminated or in the playoffs are also in the regression. It is the same regression some researchers did back in 2002 to test the same thing, but no one has done it recently. I hope you understand my question and hope very much that you can and are willing to help me. Thank you very much and have a great day!
For logistic regression, it will be easier to understand what the estimated coefficients mean if you multiply the percentage of games won by 100. When you do this, you can use these values as "numeric" and the coefficient will tell you how much the probability of the outcome changes for every 1 percentage change in that variable. For more details on interpreting the coefficients, check out ua-cam.com/video/vN5cNN2-HWE/v-deo.html
Thank you very much for your help!!! I appreciate it a lot! I'm glad it's not a complicated solution... :D
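The scaling advice above can be sketched like this; all variables are simulated and the names (win_pct, home, win) are made up for illustration.

```r
## Simulated illustration: winning percentage on a 0-100 scale as a
## numeric predictor, plus a home/away factor.
set.seed(7)
n       <- 300
win_pct <- runif(n) * 100                           # 0 to 100 scale
home    <- factor(sample(c("home", "away"), n, replace = TRUE))
win     <- rbinom(n, 1, plogis(-2 + 0.04 * win_pct))

fit <- glm(win ~ win_pct + home, family = "binomial")

## Because win_pct is on a 0-100 scale, this coefficient is the change
## in log(odds of winning) per 1 percentage-point increase:
coef(fit)["win_pct"]
```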
OK.. I haven't even watched the video yet but it looks like exactly what I need
I hope so! :)
@@statquest I'm trying to use a logistic regression model on a set of binary events, each with a different probability of happening.. and I have no idea what I'm doing haha.. so I'm loading up on coffee and I'm going to start your videos soon
@@Gypsy_Danger_TMC Good luck! :)
Thank you!!! Do you have awesome videos on Tobit or Logit model too?
The Logit model is the same as logistic regression - the only difference is how the output is presented.
Thank you a lot!! Also, Good wishes to North Carolina.
Thank you! :)
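The point above that the logit model and logistic regression differ only in how the output is presented can be checked directly in R with simulated data: the "link" (logit) output and the "response" (probability) output come from the same fitted model.

```r
## Simulated data for illustration.
set.seed(3)
x <- rnorm(50)
y <- rbinom(50, 1, plogis(x))
fit <- glm(y ~ x, family = "binomial")

log_odds <- predict(fit, type = "link")      # logit (log odds) scale
probs    <- predict(fit, type = "response")  # probability scale

## Applying the inverse logit to the link-scale output recovers
## the probabilities exactly:
all.equal(probs, plogis(log_odds), check.attributes = FALSE)
```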