No one explains it as clearly as you do. You are literally a life saver. Thank you so much
The only words that will keep on ringing in my mind are 'beetweeeeen' and 'withhhhiiiiinnnn' . Great explanation !!
I simply cannot find the words to express my gratitude.
I am in a data science MSc and you are about to save my life...
+ilias siablis You are very welcome Ilias! I hope your MSc studies go very well!
Thanks! I'm glad you find them interesting. I may get to logistic regression eventually, but it might take a while.
As in many statistical inference scenarios, here the assumption of normally distributed populations becomes less important as the sample size increases.
So yes, for large sample sizes normality is not much of an issue. It's still always an assumption of the procedure; it just isn't very important for large sample sizes. And even in small sample size situations, ANOVA can still work well under some violations of the normality assumption, depending on the type of violation.
In the other video on "Inference for 2 variance ratio", you mentioned that the violation of the normality assumption can lead to very poor results when you use the F-statistic, even for large sample sizes, which means that the normality assumption in the F-statistic is crucial regardless of sample size unlike the t-test and the z-test.
Since ANOVA also uses the F-statistic, shouldn't it also be the case that the ANOVA test will perform very poorly if the normality assumption is not met, even for large sample sizes? Your comment here seems to contradict that, which confuses me. Why is it that, unlike the two variance ratio test, the assumption of normality becomes less relevant in ANOVA as the sample size increases? They both use the same F-statistic, right?
@Robin1997311 That's a very good question. One reason is that in ANOVA the numerator of the F statistic is related to the variance *of the sample means*, and the sample means will be approximately normal for large sample sizes.
When sampling from non-normal populations, the sampling distribution of the sample variance also gets closer to normal as the sample size increases; it just does so very slowly (relative to the mean).
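A rough sketch of why the mean and the variance behave so differently, using standard results rather than anything stated in the video (the notation is assumed here: i.i.d. observations with variance $\sigma^2$, fourth central moment $\mu_4$, and kurtosis $\kappa = \mu_4 / \sigma^4$):

```latex
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n},
\qquad
\operatorname{Var}(S^2) = \frac{\sigma^4}{n}\left(\kappa - \frac{n-3}{n-1}\right)
```

The variability of the sample mean does not depend on the shape of the population, while the variability of the sample variance depends on it through $\kappa$. For a normal population $\kappa = 3$ and the second expression reduces to $2\sigma^4/(n-1)$; a population with different kurtosis gives a different value no matter how large $n$ is.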
@jbstatistics Thank you for the answer. I have two follow-up questions:
1. I can see that as the sample size gets larger and larger, both the numerator and the denominator of an F-statistic may become approximately normal. However, what does that have to do with ANOVA becoming more robust as the sample size increases? How does the numerator and the denominator becoming closer to normal affect the F-statistic? Does it bring the distribution of the F-statistic closer to the theoretical F-distribution you would get if the assumptions weren't violated?
2. The F-statistic for "Inference for 2 variance ratio" and the F-statistic for "ANOVA" are both ratios of variances, which means that the effect you described in your reply applies to both cases. My original question therefore remains unanswered: according to your videos, large sample sizes do NOT improve the performance of "Inference for 2 variance ratio" tests but DO improve the performance of "ANOVA" (when the normality assumption is violated). I want to know why this distinction exists when both tests use the same F-statistic; your answer doesn't explain it, since the effect is common to both tests.
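One way to look at this distinction empirically is a small simulation. The sketch below is only an illustration under assumptions that are not from the videos: exponential populations stand in for the non-normal case, the variance-ratio test is one-sided, and critical values are calibrated by simulation under normality instead of being read from F tables. All function names are made up for this example.

```python
import random


def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def anova_f(groups):
    """One-way ANOVA F statistic: mean square between / mean square within."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * sample_variance(g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def variance_ratio_f(groups):
    """F statistic for comparing two variances: s1^2 / s2^2."""
    return sample_variance(groups[0]) / sample_variance(groups[1])


def simulate(stat, draw, k, n, reps):
    """Draw k groups of size n from one population, reps times; return the statistics."""
    return [stat([[draw() for _ in range(n)] for _ in range(k)]) for _ in range(reps)]


def type_one_error(stat, k, n, reps, rng, alpha=0.05):
    """Estimated rejection rate under H0 when the populations are exponential.

    The critical value is calibrated by simulation under normality, standing in
    for the F table the usual procedure would use (an assumption of this sketch).
    """
    normal_stats = sorted(simulate(stat, lambda: rng.gauss(0.0, 1.0), k, n, reps))
    critical = normal_stats[min(reps - 1, round((1 - alpha) * reps))]
    skewed_stats = simulate(stat, lambda: rng.expovariate(1.0), k, n, reps)
    return sum(f > critical for f in skewed_stats) / reps


if __name__ == "__main__":
    rng = random.Random(1)
    for n in (10, 50):
        a = type_one_error(anova_f, k=3, n=n, reps=2000, rng=rng)
        v = type_one_error(variance_ratio_f, k=2, n=n, reps=2000, rng=rng)
        print(f"n={n}: ANOVA {a:.3f}, variance ratio {v:.3f} (nominal 0.050)")
```

If the claim in this thread holds, the ANOVA rejection rate should sit close to the nominal 5% at the larger sample size, while the variance-ratio rate should stay well above it. Other non-normal populations may behave differently.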
Yes, it's done in LaTeX (a Beamer presentation). Thanks for the offer, but I think I'm going to keep this a one-person show. Cheers.
Wait, where did you get that F statistic?
I'm brushing up on my stats and I wish ANOVA had been taught to me this way in the first place. Thanks for a great video in plain English.
I'm glad to be of help! Thanks for the kind words!
Perfect. Just what I was needing :-) I need explanations of the theory not the math. Thank you so much for this video!
+metapsych27 You are very welcome. I'm glad I could be of help!
...keep on going... your tutorials are so interesting~~~ cannot stop watching ^^ what about logistic regression, please make a tutorial video!!! ;)
Please do a two-way ANOVA series as well, man. I really appreciate your work.
GREAT! Now it makes perfect sense. Thank you!
Thank you so much, these videos are great!
Why do prof's make this shit so confusing, thank u bro!
to be honest they too are figuring it out lol
Awesome explanation
We're looking forward to your R tutorial :-)
Excellent video. Keep up the good work.
Thanks for the compliment!
In the assumptions you say that the population should be normally distributed. Can't we avoid that and use the central limit theorem to get the same result?
thanks a lot for your informative demonstration ..
Professor--you reference the box and whisker plots and cite the mean. But I thought box and whisker plots display/mark the median. Therefore the "apparent" variances or spread are really about the median. Unless of course your box and whisker plots display/mark the mean and not the median. Your reply is requested. Thank you. Steve G Sept 27, 2020
Yes, the line within the box represents the median. But I'm not sure why you feel that's problematic. I give the sample means under the boxplots, and any way you slice it, it is visually apparent that the within-group variability of the groups on the right is larger than that of those on the left.
Thank you... is the presentation built in TeX/LaTeX? If so, and if you agree, please send me your ideas for the presentation and I'll write the LaTeX code for you. Then you save some time... ;)
There is a difference between error and residual! (Look that up)
The error row is sometimes called "residuals", and it is the default in R to do so. SSE is the sum of squared residuals. If you're saying there is a difference between the theoretical error terms and the residuals, then sure, but that's not relevant here.
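To make "SSE is the sum of squared residuals" concrete, here is a minimal sketch for the one-way layout, where each residual is an observation minus its own group's sample mean. The data and function names are made up for illustration and are not from the video.

```python
def residuals(groups):
    """Residuals for one-way ANOVA: each observation minus its group mean."""
    result = []
    for g in groups:
        group_mean = sum(g) / len(g)
        result.append([x - group_mean for x in g])
    return result


def sse(groups):
    """Sum of squared residuals (the 'Error' or 'Residuals' row of the ANOVA table)."""
    return sum(r ** 2 for group in residuals(groups) for r in group)


# Hypothetical data: two groups of three observations each.
groups = [[4.0, 6.0, 8.0], [1.0, 2.0, 3.0]]

error_ss = sse(groups)                            # (-2)^2 + 0 + 2^2 + (-1)^2 + 0 + 1^2
error_df = sum(len(g) for g in groups) - len(groups)
mse = error_ss / error_df                         # mean square error: denominator of F

print(error_ss, error_df, mse)
```

The unobservable error terms would use the true group means instead of the sample means, which is the distinction raised in the comment above; the ANOVA table is built from the residuals.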