This is what I was looking for sir...Thanks a lot for this video
You explained it very well! Thanks for producing and sharing this tutorial.
Thank you very much, you are a great teacher
Awesome videos with great content. Loved it :) Waiting for more videos on feature engineering.
Hello, this was a very helpful video. If you have done a Bayesian analysis, please provide the video link.
Best way to start your morning !!:)
Thank you very much for the videos. I had two questions.
When we have categorical variables, can we use Pearson correlation to get the order of significance, for example that PaperlessBilling is more significant than SeniorCitizen? Or do we need to use only the chi-squared test?
Another question: if I have a few categorical variables with multiple categories, should we first create dummy variables and then run the chi-squared test on each of the dummy variables against the target variable?
The strength of the relationship between two categorical variables can be measured with Cramér's V. You can check my Cramér's V video in case you have not already.
One-hot encoding might not be required. You just create a contingency table based on the number of categories.
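To illustrate the reply above, here is a minimal sketch of Cramér's V computed straight from a contingency table, with no dummy variables. The helper name `cramers_v` and the toy column values are hypothetical, and the formula is the plain (not bias-corrected) version.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Plain (not bias-corrected) Cramér's V for two categorical series."""
    table = pd.crosstab(x, y)                 # contingency table of the raw categories
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()                # total number of observations
    k = min(table.shape) - 1                  # min(rows, columns) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical toy data: a three-category feature against a binary target
df = pd.DataFrame({
    "Contract": ["Monthly", "Monthly", "Yearly", "Yearly", "TwoYear", "TwoYear"],
    "Churn":    ["Yes",     "Yes",     "No",     "No",     "No",      "No"],
})
v = cramers_v(df["Contract"], df["Churn"])
print(f"Cramér's V = {v:.3f}")
```

Because V is scaled to the range 0 to 1, it can be used to rank features by strength of association, which the chi-square p-value alone does not give you.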
@@AIEngineeringLife Thank you very much for the quick reply.
What would our feature selection approach be if we are using a mixture of continuous and categorical variables to predict a categorical variable?
Best way to have a productive lunch, thank you! I have a question: did you choose chi-square because the degrees of freedom is 1 (for churn x gender, for example)? If it had been DOF > 30, what would you have chosen?
Chi-square can be used with higher-cardinality categories as well. But if there are a lot of low-tail categories, it is better to group them before feeding them in, otherwise the low tails can distort the output statistics.
@@AIEngineeringLife thank you for your answer, what do you call low tail?
Say you have a feature with 30 categories. You might see that the last few have too few observations to support a strong conclusion. These are the low-tail ones.
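The grouping of low-tail categories described above could be sketched as follows. The helper `group_rare`, the `min_count` threshold of 5 and the toy data are assumptions for illustration, not something taken from the video.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def group_rare(s: pd.Series, min_count: int = 5, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with a single label."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), other)

# Hypothetical feature with two common categories and three low-tail ones
plan = pd.Series(["A"] * 20 + ["B"] * 20 + ["C"] * 3 + ["D"] * 3 + ["E"] * 2)
churn = pd.Series(
    ["Yes"] * 12 + ["No"] * 8        # plan A
    + ["Yes"] * 6 + ["No"] * 14      # plan B
    + ["Yes", "No"] * 4              # plans C, D, E
)

grouped = group_rare(plan, min_count=5)
table = pd.crosstab(grouped, churn)
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```

A common rule of thumb is that the chi-square approximation becomes unreliable when expected cell counts fall below about 5, so it is worth checking `expected` after grouping as well.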
Hi sir, thanks for posting the video. I had a question: to check the significance between two categorical variables we use the chi-square test, and for significance between two continuous variables we use the t-test. How can we check significance between an independent categorical variable and a dependent continuous variable, or vice versa?
You can use regression after converting your categorical variable to numeric values. If you're looking for a statistical test, then ANOVA would suffice.
This will help: www.researchgate.net/post/What_if_an_independentvariable_is_categorical_and_dependent_variables_iscontinuous_variable_can_anyone_suggest_a_suitable_test
@@devpratap Thanks for your answer. But ANOVA will work in the case of an independent categorical and a continuous dependent variable. What test applies in the case of a continuous independent and a categorical dependent variable? Is there any test for such a case, or do we need to convert the categorical dependent variable to numerical?
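As a rough sketch of the ANOVA suggestion above, `scipy.stats.f_oneway` compares the mean of a continuous variable across the groups of a categorical variable. The column names and values below are made up for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical data: categorical independent variable, continuous dependent variable
df = pd.DataFrame({
    "Contract": ["Monthly"] * 4 + ["Yearly"] * 4 + ["TwoYear"] * 4,
    "MonthlyCharges": [70.0, 75.5, 80.0, 72.5,
                       55.0, 60.5, 58.0, 62.0,
                       40.0, 45.5, 42.0, 44.0],
})

# One array of the continuous values per category
groups = [g["MonthlyCharges"].to_numpy() for _, g in df.groupby("Contract")]

# Null hypothesis: all group means are equal
f_stat, p_value = f_oneway(*groups)
print(f"F={f_stat:.2f}, p={p_value:.6f}")
```

A small p-value suggests that at least one group mean differs. It does not say which one, so a follow-up comparison would be needed for that.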
Awesome...
Sir, a reviewer has asked me this question and I don't know how to address it. Can you please guide me? "Use some statistical significant test such as T-test or ANOVA to prove you validate the proposed diagnostic model on patients and quality improvements of your method". I have two datasets. Dataset 1 was used to train the model and dataset 2 was used to validate the trained model. I have trained the ML model, deployed it, validated it on new data and presented the results. Actually, I have understood the question. Shall I apply the statistical test between the performance metrics of the trained model results and the validation results? Please help me, sir.
Is it good practice to perform a statistical test on all columns available for modelling, or is there any trigger point to consider for this?
Akhilesh.. it is very subjective. I would say it is good to investigate each variable to see how it impacts the model. How exhaustive you are depends on the case. Most of these tests can be automated.
I have one doubt: instead of using the stats package, we can use chi-square directly from the sklearn library, right?
Yes, you can. Since I did not use an sklearn pipeline, I used the stats one.
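For reference, a minimal sketch of the sklearn route using `sklearn.feature_selection.chi2`. That function expects non-negative numeric features, so this sketch one-hot encodes the categorical columns first; the toy data is hypothetical. Note that sklearn computes its statistic from per-class feature sums, so the numbers will not generally match `scipy.stats.chi2_contingency` on the full contingency table.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical toy data with two categorical features and a binary target
df = pd.DataFrame({
    "gender":           ["Male", "Female"] * 4,
    "PaperlessBilling": ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
    "Churn":            ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
})

# chi2 needs non-negative numeric input, so encode the categories as 0/1 columns
X = pd.get_dummies(df[["gender", "PaperlessBilling"]]).astype(int)
y = df["Churn"]

scores, p_values = chi2(X, y)
result = pd.DataFrame({"chi2": scores, "p_value": p_values}, index=X.columns)
print(result.sort_values("chi2", ascending=False))
```

The same scoring function can be passed to `SelectKBest(chi2, k=...)` when the selection needs to live inside an sklearn pipeline.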
Fantastic video, you just helped me with a major assignment and saved me a lot of stress. Buy you a beer if I could!
How can I access your github repo?
Thank you sir
Can we do chi-square between two categorical variables when there is no target variable (gender and PaperlessBilling), i.e. unsupervised data?
Yes Junaid, you can. It can be any two categorical variables.
@@AIEngineeringLife In that way, can we find out multicollinearity between two categorical features?
Good one!!
Hello sir,
What types of projects should we do as a fresher to get a job?
And also, to what extent should one know Python?
You can check my video on projects. These are sample approaches, and you can try out something similar:
ua-cam.com/play/PL3N9eeOlCrP7RBbok898Yk0SsUw1O9urP.html
Learn Python to the extent that you can do data science work. You need to have a good understanding of the pandas, numpy, scikit and matplot packages.
Will surely watch and work on your recommended approach.
Thank You
Epic tut
What if I have a dataset with multiple data? Should I change it to 1NF? How can I do it in Python? Any resources, please?
Vijay.. Can you elaborate? The dataset I have shown in the video has multiple data as well. Typically we first test each individual column against the target during the data analysis phase.
Instead of doing this column by column manually, we can create functions and iterate through multiple columns.
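The idea of iterating through columns could look roughly like this. The function name `chi_square_report`, the 0.05 threshold and the toy data are illustrative assumptions, and the sketch assumes every non-target column is categorical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_report(df: pd.DataFrame, target: str, alpha: float = 0.05) -> pd.DataFrame:
    """Run a chi-square test of independence for each column against the target."""
    rows = []
    for col in df.columns:
        if col == target:
            continue
        table = pd.crosstab(df[col], df[target])
        chi2, p, dof, _ = chi2_contingency(table)
        rows.append({"feature": col, "chi2": chi2, "p_value": p,
                     "dof": dof, "reject_null": p < alpha})
    # Strongest evidence against independence first
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

# Hypothetical toy data
df = pd.DataFrame({
    "gender":           ["Male", "Female"] * 20,
    "PaperlessBilling": ["Yes"] * 20 + ["No"] * 20,
    "Churn":            ["Yes"] * 15 + ["No"] * 5 + ["Yes"] * 5 + ["No"] * 15,
})
report = chi_square_report(df, target="Churn")
print(report)
```

In a real dataset you would pass only the categorical columns, for example `df.select_dtypes(include="object")` plus the target.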
Nice ✌
Hello sir,
Thank you for posting this video.
But sir, I have some doubts regarding this chi-square test.
Is it possible to use it for a numerical dataset? I have a numerical dataset, not categorical data.
I'm working on a lung cancer dataset in which all the data is numerical.
Can you please post a video on selecting the best features using the chi-square test for numerical data?
It would be a great help if you do that and explain.
The chi-square test is for categorical data. If your data is numeric, won't Pearson or Spearman correlation work? Or you can use any other feature elimination method, such as forward selection.
@@AIEngineeringLife So the chi-square test is not possible for numerical data? But at the beginning of your video you said that the next video will show how to use the chi-square test for a numerical dataset...
@@AIEngineeringLife Is it not even possible to use it for continuous data?
@@poojashah5095 .. What you can do is bucket the numerical data and run chi-square. This is for continuous data where bucketing makes sense, like age or salary buckets. If I said chi-square for pure continuous data, then I made a mistake, but I do have a video for continuous data using regular correlation.
@@AIEngineeringLife Can you please provide that video link?
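A minimal sketch of the bucketing approach mentioned in the reply, using `pd.cut` to turn a continuous column into categories before running chi-square. The simulated data, bin edges and labels are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: age as a continuous feature, churn loosely increasing with age
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=500)
churn = np.where(rng.random(500) < (age - 18) / 80, "Yes", "No")

# Bucket the continuous variable; these bin edges are arbitrary
age_bucket = pd.cut(age, bins=[17, 30, 45, 60, 80],
                    labels=["18-30", "31-45", "46-60", "61+"])

table = pd.crosstab(pd.Series(age_bucket, name="age_bucket"),
                    pd.Series(churn, name="Churn"))
chi2, p, dof, _ = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}")
```

The result depends on where the bucket boundaries fall, so it is worth choosing bins that make sense for the domain (or trying `pd.qcut` for equal-sized buckets) rather than tuning them until the test looks significant.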
Thank you😊😊
How do we know which variables in the given data should be compared using the chi-square test?
If you do not have much business background, you can compare all the categorical variables with each other, or compare every independent variable with the dependent variable.
What if a feature has a larger number of categories, say 15-20? What should we use in such cases?
You can still use it. But if some categories have very low occurrence, it might not give the correct outcome.
@@AIEngineeringLife Thanks for your reply
Gran explicación
It would be helpful if a Colab link were shared for all the videos. Thanks.
You can find repo details of my courses here - github.com/srivatsan88/
The one you are seeing is part of my applied stats course
Null hypothesis: there is no relation between the variables.
13:30 We fail to reject the null hypothesis, so the gender column is not significant for the Churn column!
How is that possible?
Why did you choose "no relation" as the null hypothesis? Why is the null hypothesis not something like: there is some relationship between the two categorical variables?
That is how the chi-square test is defined. Every test is framed around some hypothesis when it is formulated. The null hypothesis of the chi-square test is that no relationship exists between the categorical variables in the population, but other tests might have a different null hypothesis.
@@AIEngineeringLife Thanks for your quick reply. Suppose for an observation the p-value is very small and less than the significance level, and the Cramér's V score is also very low (due to the high sample size). What can we conclude from this?
@@swapnanilsharma .. This can be an input to the feature selection process of your ML model, to see whether this variable is important in modelling the target variable. One more thing: this is a statistical test that only indicates how likely such an association is to arise by chance, so you can always override it if you feel the variable is important based on domain understanding.
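To make the p-value versus Cramér's V question above concrete, here is a small sketch with made-up counts. It scales the same 2x2 table by 100: the proportions (and so the effect size) stay the same, while the p-value shrinks because it depends on sample size. The helper `p_and_cramers_v` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

def p_and_cramers_v(table) -> tuple:
    """Return (p-value, plain Cramér's V) for a contingency table."""
    table = np.asarray(table)
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return float(p), float(np.sqrt(chi2 / (n * k)))

small = np.array([[260, 240],
                  [240, 260]])      # n = 1,000, weak association
large = small * 100                 # n = 100,000, same proportions

p_small, v_small = p_and_cramers_v(small)
p_large, v_large = p_and_cramers_v(large)
print(f"n=1,000:   p={p_small:.4f}, V={v_small:.3f}")
print(f"n=100,000: p={p_large:.3g}, V={v_large:.3f}")
```

Read this way, a tiny p-value together with a low V suggests the association is probably real but weak, which is one more input to weigh alongside domain understanding.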
Can you give me the code please ?
it is in my git repo here - github.com/srivatsan88/UA-camLI/blob/master/statistics/Statistical_Thinking_Feature_Selection_Categorical_Variables.ipynb