The law of total variance is what made it make sense for me! None of my classes covered why something called "analysis of variance" would be a hypothesis test for significantly different means.
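A minimal sketch of the decomposition this comment refers to, assuming three made-up groups with a common standard deviation (the group means, sizes, and seed are illustrative assumptions, not data from the video). It checks numerically that the total sum of squares splits into a between-group part and a within-group part, and forms the F statistic from the two parts.

```python
import random

random.seed(0)

# Hypothetical data: three groups with assumed true means 0, 0.5 and 1, common sd 1.
groups = [[random.gauss(mu, 1.0) for _ in range(30)] for mu in (0.0, 0.5, 1.0)]

all_obs = [y for g in groups for y in g]
grand_mean = sum(all_obs) / len(all_obs)
group_means = [sum(g) / len(g) for g in groups]

# Total variation: every observation against the grand mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
# Between-group variation: group means against the grand mean, weighted by group size.
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
# Within-group variation: observations against their own group mean.
ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

k, n = len(groups), len(all_obs)
# F compares the between-group mean square with the within-group mean square.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"SS_total               = {ss_total:.4f}")
print(f"SS_between + SS_within = {ss_between + ss_within:.4f}")
print(f"F = {f_stat:.3f} on ({k - 1}, {n - k}) degrees of freedom")
```

The two printed sums of squares should agree up to floating-point error, which is the sample version of "total variance = variance of the group means + average variance within groups".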
I wish my statistics classes had gone this deep into ANOVA. Unfortunately, we were limited by time constraints and sort of took for granted why they work. Thank you for providing more background context in a fun and engaging way!
At my school, linear models is a two year course, regression and anova get their own semester then we do generalized models and other things
An additional point on using ANOVA in practice: the F-test can only tell you that a difference between the means is present, not necessarily which groups are different or not. You have to use a more specific test (Tukey's HSD) to compare specific groups against each other.
Immensely underrated channel, 46k subscribers is criminal
Almost done with my ANOVA class, and this built my intuition more than the course itself! thanks :)
Hey dude. I'm in Highschool and I got back my (self studied) AP statistics score earlier today. Scored a 5/5. I don't think I could've done it without you lol. tysm.
Great job! I’m sure I only played a small role in that, you’re the one who hustled to learn the material, congratulations!
This is the best explanation of the ANOVA I've seen so far. It directly answers why a test of the "equality" of different means is called "ANOVA" (Analysis of Variance). I also liked how you showed its direct connection with the F-statistic using the actual equations. Keep up the good work!
this is amazing and totally underrated. keep going
That was fucking amazing. Why does nobody explain it using the law of total variance. It all clicked now. Thank you!
Christian, would you consider making a video specifically about multiple regression? I still don't have an intuitive understanding of why the Gauss-Markov hypotheses need to be confirmed in order to make inferences, and I think your videos would be of great help, since you're an incredible teacher. Thank you for your work! Keep it up!
It's because the assumptions of the Gauss-Markov theorem are used to determine what the standard errors of the coefficient estimators are. So, if those assumptions aren't met, but you still calculate the standard errors in the same way as you would if they were met, then you're going to get incorrect values for the standard errors. Then you use those standard errors to calculate t-statistics and such, so you'll get incorrect values for the t-statistics, and hence incorrect confidence intervals and potentially incorrect results for hypothesis tests.
Oh, I love it that you not only know the term Homoskedasticity but also mention it as an assumption we are taking!
Sometimes I ask Psychologists about what they think of Nassim Taleb's criticism of IQ - it being too heteroskedastic - and then usually their looks give away that they have never learned about Heteroskedasticity in their Psychometric lessons... I think this is sad, so all the better you mention it ;-)
Thanks for continuously producing these videos! Your channel is by far the best explainer on statistics compared to other YouTube channels IMO. I’m curious: what software do you use to create the videos? PowerPoint?
Thanks! I use Final Cut Pro for editing, Figma and Midjourney for graphics and the manim python library for animations
can you do a video on GLMs please!! Your videos are great
I want to mention I am currently taking a parametric stats course! so I understand the vids about it better!
what's crazy is that my stat inference midterm is literally tomorrow, it's about one way anova 🤣
👀 good luck!
I think an important note on this is that the more populations you check the higher the likelihood is that one differs significantly by sheer luck. If instead of 5 cancers you're checking 100, the odds that statistical fluke will make one mean look further away from the others is fairly high.
Yeah I thought about covering multiplicity here, but it deserves its own video
This is not true with ANOVA. It has a type I error rate of 5% for finding *any* difference, not for each particular difference. If you had 1 million populations that were all the same, you would still only have an alpha% chance of finding a fluke. This is the advantage of running an ANOVA and not just running a bunch of two-sample pairwise tests.
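A rough sketch of the arithmetic behind this exchange. It treats the pairwise tests as independent, which is a simplifying assumption (real pairwise comparisons share data, so the true rate differs), and the function name is made up for illustration. The point is only how quickly the chance of at least one fluke grows when each pair gets its own test at level alpha, whereas a single ANOVA F-test stays at alpha.

```python
from math import comb

alpha = 0.05  # per-test type I error rate

def familywise_error(num_groups, alpha=alpha):
    """Chance of at least one false positive across all pairwise tests,
    under the simplifying assumption that the tests are independent."""
    num_pairs = comb(num_groups, 2)
    return num_pairs, 1 - (1 - alpha) ** num_pairs

for k in (5, 20, 100):
    pairs, fwer = familywise_error(k)
    print(f"{k:>3} groups -> {pairs:>4} pairwise tests, family-wise error ~ {fwer:.3f}")
```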
Hmmm what if 5 out of 6 drug-organ pairs see success in cancer treatment? (1 mean singled out from the group, but not what we expect)
Or if the group means are clustered, split in half (pairs 1,2,3 have the same mean, so do pairs 4,5,6)?
You’d have a similar conclusion. The ANOVA is only detecting that at least one of them is different, so if that’s the case, there should be some compelling evidence to reject the null hypothesis. But to actually figure out *which* one is different, you’d need to follow up with secondary testing for each of the means
At 9:22, why are they Chi square distributed?
It comes from the distribution assumption on the residuals.
The residuals were assumed to be normally distributed with some variance, sigma^2. If you divide each residual by sigma, you get a standard normal random variable, and squaring and summing those gives you a chi-squared distribution. That’s what dividing a sum of squares by sigma^2 amounts to, and it applies to both the numerator and denominator in the F-statistic.
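A small simulation sketch of that claim, with an assumed sigma, number of terms, and seed chosen only for illustration. It uses residuals drawn around a known mean of 0, so the sum has k degrees of freedom; in the actual ANOVA the means are estimated, which costs degrees of freedom (k - 1 in the numerator, N - k in the denominator).

```python
import random

random.seed(1)

sigma = 2.0   # assumed residual standard deviation
k = 5         # number of residuals summed in each replicate
reps = 20000  # number of simulated sums of squares

totals = []
for _ in range(reps):
    residuals = [random.gauss(0.0, sigma) for _ in range(k)]
    # Standardize each residual, square it, and add them up.
    totals.append(sum((e / sigma) ** 2 for e in residuals))

sim_mean = sum(totals) / reps
sim_var = sum((t - sim_mean) ** 2 for t in totals) / (reps - 1)

# A chi-squared variable with k degrees of freedom has mean k and variance 2k.
print(f"simulated mean     {sim_mean:.2f} (theory {k})")
print(f"simulated variance {sim_var:.2f} (theory {2 * k})")
```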
@Very Normal What textbook would you suggest for this content?
Rosner’s Fundamentals of Biostatistics (7th ed) is a good source with a solutions manual that can also easily be found online
@@very-normal Thanks! And I must say that you are an excellent teacher.
Wow I was just working on this exact scenario
In the one sample t test, we take alpha error to be constant and play around with error beta. Could we do it the other way around, and what would the implications be?
you could, but most of the time we’re interested in detecting a significant effect, so power is the thing we want to maximize. There’s a trade off between reducing type-I error and power, so we choose to keep alpha constant to signify we tolerate a defined probability of making a wrong decision about rejecting the null
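To make the trade-off concrete, here is a hedged sketch using a one-sided z-test as a stand-in for the t-test (a simplifying assumption: the standard deviation is treated as known). The effect size, sigma, and sample size are made-up values, and `power_one_sided_z` is a hypothetical helper, not something from the video.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, sd 1

def power_one_sided_z(alpha, effect, sigma, n):
    """Power of a one-sided z-test when the true mean shift is `effect`."""
    critical = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(critical - effect * n ** 0.5 / sigma)

# Tightening alpha lowers power (raises beta) for a fixed effect and sample size.
for alpha in (0.10, 0.05, 0.01, 0.001):
    power = power_one_sided_z(alpha, effect=0.5, sigma=1.0, n=20)
    print(f"alpha = {alpha:<5}  power = {power:.3f}  beta = {1 - power:.3f}")
```

Fixing beta instead would mean solving this relationship for alpha, so the tolerated false-positive rate would drift with the sample size and the effect you assume.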
what if the drug has effect on all the test group and the means for all the groups are shifted the same amount?
You’d prolly get a null result. If you shift all the distributions by the same amount, there wouldn’t be a change in the variance in group means
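A quick check of this with made-up numbers (the group sizes, means, shift, and seed are assumptions for illustration, and `f_statistic` is a helper written for this sketch). Adding the same constant to every observation moves the grand mean and every group mean together, so both sums of squares, and therefore F, stay the same.

```python
import random

random.seed(2)

def f_statistic(groups):
    """One-way ANOVA F statistic from a list of groups (lists of floats)."""
    all_obs = [y for g in groups for y in g]
    grand = sum(all_obs) / len(all_obs)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    k, n = len(groups), len(all_obs)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical outcomes for four groups that share the same true mean.
groups = [[random.gauss(10.0, 2.0) for _ in range(25)] for _ in range(4)]
# The same data after every group is shifted by the same amount.
shifted = [[y + 3.0 for y in g] for g in groups]

f_before = f_statistic(groups)
f_after = f_statistic(shifted)
print(f"F before shift: {f_before:.6f}")
print(f"F after shift:  {f_after:.6f}")
```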
Question: What to do when the assumption of h̶o̶m̶o̶s̶k̶e̶d̶a̶s̶t̶i̶c̶i̶t̶y̶ homogeneity of variance is not met, i.e. there are different variances in the different populations?
I would think this is a rather major assumption, especially if the sample size is small, as that would make ̶h̶e̶t̶e̶r̶o̶s̶k̶e̶d̶a̶s̶t̶i̶c̶i̶t̶y̶ heterogeneity of variance harder to test...
Shouldn't one always test, in some form, for ̶h̶e̶t̶e̶r̶o̶s̶k̶e̶d̶a̶s̶t̶i̶c̶i̶t̶y̶ heterogeneity of variance? Is this done in practice?
Edit: Sorry, I wrote homoskedasticity and heteroskedasticity, but I meant homogeneity of variance and heterogeneity of variance. (The former assumes constant variance in the regressor variables, while the latter assumes the same variance for different sub-populations.)
Same question here. In regression I remember them teaching us that, in the presence of heteroscedasticity, you can scale down the data by the different variances. I wonder if that would work here or if we have to do some sort of nonparametric test
Yeah, common variance is a pretty strong assumption to make. One solution I know of is a variant of the ANOVA called Welch’s ANOVA that can be used when you don’t want to make this assumption.
It’s from the same guy behind Welch’s t-test, the version that students learn for two-sample problems when they also can’t assume common variance.
@@very-normal Thank you, that's great to know. It seems like Welch's ANOVA is really the way to go, both for small sample size and for no knowledge about the data. (Apparently, it is almost as powerful as the standard ANOVA, even if homogeneity of variance is fulfilled, so...)
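For anyone curious what Welch's version computes, here is a minimal sketch of the statistic in plain Python. The data are made up, and `welch_anova` is a helper written for this illustration rather than a library function. Each group is weighted by its size divided by its own sample variance, so no common variance is assumed.

```python
from statistics import mean, variance

def welch_anova(groups):
    """Welch's F statistic and its two degrees of freedom for a list of groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n_j / s_j^2, so noisier groups count for less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_w

    numerator = sum(w * (m - weighted_mean) ** 2
                    for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)  # usually not an integer
    return f_stat, df1, df2

# Made-up data with visibly different spreads across groups.
groups = [
    [4.1, 5.0, 4.7, 5.3, 4.9],
    [6.2, 9.8, 3.1, 8.7, 5.5, 7.4],
    [5.1, 5.2, 4.9, 5.0],
]
f_stat, df1, df2 = welch_anova(groups)
print(f"Welch F = {f_stat:.3f}, df1 = {df1}, df2 = {df2:.2f}")
```

With two groups this reduces to the square of Welch's t statistic with the usual Welch-Satterthwaite degrees of freedom. A p-value would come from the upper tail of an F distribution with (df1, df2) degrees of freedom, for instance via `scipy.stats.f.sf` if SciPy is available.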
Can you explain the statistics behind weather prediction
I’m not very well versed in it, but it sounds like it’d be a fancy, high dimensional regression model
If the goal is to find out whether the drug is potentially useful, whether all the mu's are the same or not doesn't really tell you anything. The drug could be equally useful for all 5 illnesses or unequally useful.
It depends on what mu represents. If you define mu to be some baseline value or rate, then if something is statistically different from this baseline, then it could merit further investigation in a larger experiment.
If the residuals are normally distributed, aren't the original data normally distributed as well? Aren't they just shifted by the mean?
You’re right, I just wanted to emphasize that the main assumption is on the residuals. It implies that the outcome is normally distributed, but it’s more of a consequence of the fact that the residuals are normally distributed, rather than an assumption of the model
Could you explain a bit further about the "residuals are normally distributed not that the variable is normally distributed itself" thing? This is one of the things that trips me up most often..
Yeah for sure, I’ll try my best. This is partially my opinion, so just a heads up.
My feeling is that assuming something about the data itself is much stronger than assuming something about the residuals. Very rarely will real-world data follow nice distributions like the Normal, so it’s harder to convince people (read: the statistical referee) that this will hold up.
On the other hand, assuming that the residuals are normally distributed is not so bad. It’s like saying, we know there’s an average outcome and people will differ from this average, but they won’t differ too badly from it. In other words, outlier residuals are very rare. It’s confusing because this residual assumption implies that the outcome is also normally distributed in this case, but it’s important to note that it’s the residual assumption we make.
It’s also important because with stuff like linear regression, we’re looking at how different values of the predictor (i.e. cancer group) shift the distribution of the outcome. If you assume the data itself to have a distribution, it gets more complicated to try to work in how other variables influence it. Assuming the distribution is on the residuals doesn’t come with this baggage.
Some people are taught that they should try to transform the outcome so that it “works better” with linear regression or ANOVA. Even though you’re manipulating the outcome, the hope is that this transformation makes the -residuals- look more normal.
I hope this helps clarify somewhat. If anyone else sees this and thinks I left something out, please chip in. This is a common question, but even I don’t feel like I get all the nuances.
@@very-normal "it's confusing because this residual assumption implies the outcome is also normally distributed in this" yeah that's the bit that always tripped me up, like I get that you can make one or other the core assumption and build it up from there (it's like picking your axioms in pure maths or something), but in my head the fact that the kinda nebulous residuals assumption implies the much more intuitive distribution assumption meant that I was often fighting between intuition and logic in terms of thinking it through. It also doesn't help that thinking of an example where the residuals are Normal but the distribution _isn't_ is much harder...
So it's more about being an assumption of convenience in that it makes the maths much nicer to deal with and is also a weaker and more generalisable assumption, rather than it being anything else like purity or tradition or something.
Thanks, I think I get it now! Though no doubt this will be one of those weird bits that'll always feel a little bit off, I feel like I have a much better grasp of the rationale! Much appreciated!
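One way to see the distinction in this thread is a sketch with two made-up groups whose means are far apart (the means, sample sizes, and seed are assumptions for illustration). Within each group the outcome is normal, but the outcome pooled across groups is a two-humped mixture, which is not normal, while the residuals around each group mean still look normal. Excess kurtosis is used here only as a crude normality check.

```python
import random

random.seed(3)

def excess_kurtosis(xs):
    """Sample excess kurtosis; roughly 0 for normally distributed data."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

# Two hypothetical groups with true means 0 and 10 and a common sd of 1.
groups = [[random.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 10.0)]
group_means = [sum(g) / len(g) for g in groups]

# Outcome pooled over both groups: a mixture of two normals.
pooled_outcome = [y for g in groups for y in g]
# Residuals: each observation minus its own group's sample mean.
residuals = [y - m for g, m in zip(groups, group_means) for y in g]

kurt_pooled = excess_kurtosis(pooled_outcome)
kurt_resid = excess_kurtosis(residuals)
print(f"excess kurtosis, pooled outcome: {kurt_pooled:.2f}")
print(f"excess kurtosis, residuals:      {kurt_resid:.2f}")
```

The pooled outcome should come out strongly negative (flat and bimodal), while the residuals should sit near 0, which is why the assumption is stated for the residuals rather than for the raw outcome.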
At 9:17, shouldn't Y_i be \mu_j? Or \mu_i, depending on what you are summing over
My notation was a little sloppy here... I think you are right. The denominator is supposed to be the variance of the residuals, but my sum doesn't look like it there. Thanks for catching that
1:19 is there heart cancer? i thought no, since the cells are from birth. cool video either way, thx!
I saw it was really rare, but deep down, I was just looking for an emoji to represent the group lol 😅
anova did my head in stats, i
i will be counting days until you do videos about 2-factor ANOVA and then ANCOVA. and then :-) special video explaining what is actually the difference.
because i`m dense m****f**** and i dont get it.... thank you)))
not gonna lie, you’ll be counting a lot of days friend. But I can explain a bit
The rationale behind two-way anova is almost exactly like the one-way anova. As its name suggests, one-way anova looks at a single categorical variable, two-way looks at two. The “groups” in two-way anova include not just the main categories (i.e. being in treatment A vs not, or being in treatment B vs not), but also interactions as their own groups (i.e. someone being on both treatment A and B).
As for ANCOVA, I’ve never dealt with it before myself lol, so I can’t comment on it here
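A tiny sketch of what "interaction as its own group" means, using invented cell means for a 2x2 layout (every number and label here is hypothetical). The interaction is how much the effect of treatment A changes depending on whether treatment B is also given.

```python
# Hypothetical mean outcomes for each combination of two treatments.
cell_means = {
    ("no A", "no B"): 10.0,
    ("A", "no B"): 12.0,
    ("no A", "B"): 13.0,
    ("A", "B"): 19.0,
}

# Effect of A among people not on B, and among people on B.
effect_of_a_without_b = cell_means[("A", "no B")] - cell_means[("no A", "no B")]
effect_of_a_with_b = cell_means[("A", "B")] - cell_means[("no A", "B")]

# If the two effects match, there is no interaction; the gap between them is the interaction.
interaction = effect_of_a_with_b - effect_of_a_without_b

print(f"effect of A without B: {effect_of_a_without_b}")
print(f"effect of A with B:    {effect_of_a_with_b}")
print(f"interaction:           {interaction}")
```

Two-way anova then asks, with an F-test for each piece, whether the A effect, the B effect, and this interaction are larger than the within-cell noise would explain.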
🎉
I think the example is incorrect: if the new drug is effective on different types of cancer, ANOVA may still show a statistically non-significant result in spite of the drug being effective, leading to a wrong conclusion and a loss to the company 😂
that’s all hypothesis tests tho lol
@@very-normal I meant you need at least one more group, a standard or control, to come to any conclusion regarding efficacy
I heard recently that Fisher was great at stats but not the best in moral and ethical character.
yeahhh he had some L opinions with smoking and eugenics
Wait... You spend most of the time on the ANOVA test and then run an irrelevant simulation. Could you make a better simulation that looks more like the cancers and drugs problem we were looking at?
i don’t have access to data like that, so a simulation from a particular situation was the next best thing lol
ur the goat