Hi everyone, I have mistakenly mentioned that the p-value should be greater than 0.5. It should be 0.05.
Literally popped in my recommended 10 minutes ago. This is great, thank you!
Hope it's helpful!!!
Just what I was looking for.
😄
Nice explanation, thanks. I was looking at scipy's chi2 before and it was a little difficult, but it turns out sklearn's chi2 is pretty straightforward and well explained on the website. Thanks for introducing it.
Glad it was helpful!!!😄
Hi Ashwin, your explanation is very good. I liked it and, in fact, I have subscribed to your channel as well.
Glad you liked the video!!! I will try my best to share more videos like this!!!
Hello sir, excellent work... kindly share the playlist link for the previous video.
Thanks. Which video are you referring to?
Please, what do you do next after finding the chi-square values and p-values and plotting the graph? How do you use this to analyse the data and come to a conclusion?
You can rank the features by importance and try to eliminate the rest if you have many features, e.g. 1000 features.
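For example, here's a minimal sketch using sklearn's SelectKBest to keep only the top-scoring features. The file path and column names are assumptions based on the loan dataset in the video, so adjust them to your data:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# hypothetical path; the actual file is in the GitHub repo linked in the description
df = pd.read_csv("loan_prediction.csv").dropna()

cat_cols = ["Gender", "Married", "Dependents", "Education",
            "Self_Employed", "Property_Area", "Credit_History"]

# chi2 needs non-negative numeric input, so encode each column to integer codes
X = df[cat_cols].apply(lambda col: LabelEncoder().fit_transform(col))
y = df["Loan_Status"]

# keep the 3 highest-scoring features and drop the rest
selector = SelectKBest(chi2, k=3)
selector.fit(X, y)
print(X.columns[selector.get_support()])                    # selected feature names
print(pd.Series(selector.pvalues_, index=cat_cols).sort_values())
```

With 1000 features you would just raise k (or switch to SelectPercentile) instead of inspecting the p-values by hand.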
Can we do label encoding if one of the features has more than 10 categories?
Yes, you can
Tremendous explanation! Thank you very much.
Glad you liked it!!!
Do we need to label encode the variables before applying this, or will it work as is?
You need to encode them before applying it.
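For example, a minimal sketch of that encoding step (the toy values here are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# chi2 expects non-negative numbers, not strings
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
})

# map each category to an integer code, column by column
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
print(encoded)   # Gender -> 1/0/1, Education -> 0/1/0
```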
Hi Ashwin,
I have a question. In the list of categorical variables that you have extracted, why have you added "Dependents" & "Credit_History"? Are they not numerical variables? I just want to understand the basis behind adding them to the categorical variables list! An early response is highly appreciated.
If you check the data, Dependents is categorical since it has the value "4+", which is a string, and Credit_History is also a category, similar to Gender... only continuous values are considered numerical.
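You can verify this quickly in pandas; a minimal sketch (the path is hypothetical, the actual file is in the repo linked in the description):

```python
import pandas as pd

df = pd.read_csv("loan_prediction.csv")   # hypothetical path

# Dependents shows up as object dtype because of the "+"-suffixed string value
print(df.dtypes)
print(df["Dependents"].value_counts())

# Credit_History holds only a couple of discrete codes, like Gender
print(df["Credit_History"].value_counts())
```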
@@HackersRealm Where can we find this dataset? Could you please share the link here?
@@pradeeppaladi8513 It's in the github repo and the link is in the description!!!
Great explanation... Can I use this technique with any dataset for regression?
This is mostly for categorical data... chi2 tests the association between categorical features and a categorical target, so it doesn't directly apply to a continuous regression target.
Hi, it's a beneficial video. But how can we use this chi-square for malware detection in Android applications? Could you please reply?
Could you please explain in more detail, like what attributes you're considering?
My chi scores array is giving NaN values, and the Series attribute in pandas is also not working.
Could you please help me with my problem?
Are you using a different dataset or the same one?
A different dataset @@HackersRealm
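In case it helps: a common cause of NaN chi2 scores is a feature column that is entirely zero after encoding (its expected counts are zero, so the statistic ends up dividing 0 by 0); missing values in the data are another thing to rule out. And if pd.series fails, note the capital S: it's pd.Series. A minimal sketch with made-up data:

```python
import pandas as pd
from sklearn.feature_selection import chi2

# toy encoded data with one all-zero column
X = pd.DataFrame({"a": [1, 0, 2, 1], "b": [0, 0, 0, 0]})
y = pd.Series([1, 0, 1, 0])

scores, pvals = chi2(X, y)
print(scores)   # the score for "b" comes out as NaN (0/0)

# fix: drop all-zero columns (and handle missing values) before scoring
X_clean = X.loc[:, X.sum(axis=0) != 0]
scores, pvals = chi2(X_clean, y)
print(scores)
```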
Thank you! This video was very clear and very insightful to check.
I do have one quick question which still isn't clear to me: what is the null hypothesis H0? Is it the hypothesis of some correlation between the categorical variables and the y target variable? If this is the case, then only the variables Credit_History and Education end up with a p-value lower than 0.05, and hence they mean something (H0 valid), while the other categorical variables are to be dropped (as their p-values are higher than 0.05, hence rejecting H0). Did I get it right?
Anyway, really nice job, keep it up ;)
The end result is correct, but the reasoning isn't; I think you have misunderstood the chi2 independence test, so let me clarify it for you:
- H0: the target and the dependent variable are independent
- H1: the target and the dependent variable are dependent
The p-value is linked to the test statistic chi2 (a measure of the distance between observed and expected results): the greater chi2 is, the greater the distance, and therefore the less likely it is that the variables are independent (if they were independent, observed and expected results would be close and chi2 would be small). Also, the greater the chi2, the smaller the p-value.
To sum it up: if the p-value is small (0.05 is a common threshold), independence is unlikely and we reject H0, hence we keep only the variables whose p-values are lower than 0.05, since they are dependent on the target (and therefore useful).
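To make that concrete, here's a minimal sketch with scipy's chi2_contingency on a made-up 2x2 table (the counts are illustrative, not from the video's dataset):

```python
from scipy.stats import chi2_contingency

# rows: Credit_History = 0/1, columns: Loan_Status = N/Y (made-up counts)
observed = [[60, 10],
            [40, 90]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2_stat:.2f}, p-value = {p_value:.4f}")

# the decision rule described above
if p_value < 0.05:
    print("reject H0 -> variables look dependent, keep the feature")
else:
    print("fail to reject H0 -> no evidence of dependence, consider dropping it")
```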
@@AnasAbid-zm1lk Thank you for getting back to me!
The p-value should be > .05 (not .5) to fail to reject H0.
Thanks for finding the mistake, I will update it!!!