Revise Feature Selection From the below playlist
ua-cam.com/play/PLZoTAELRMXVPgjwJ8VyRoqmfNs2CJwhVH.html
Hello everyone, here's something that was done wrong in this video; I'm pretty sure Krish did it by mistake. In the last part where the sorting is done, between 18:38 and 21:00, that step is not sorting the columns by their associated p-values but by their names, i.e. 'Sex' comes before 'Embarked' even though the p-value for 'Embarked' is greater than that of 'Sex', and that is why you get randomly ordered results when you look at the decimal-converted p-values. To fix this and sort the Series by the p-values themselves, use p_values.sort_values(ascending=False) instead of p_values.sort_index(ascending=False); then the results are ordered by the actual p-values, which should be correct imo. Hope that helps!
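A quick demo of the difference (the p-values below are made-up numbers, just for illustration): sort_index orders by the feature names, sort_values orders by the p-values themselves.

import pandas as pd

p_values = pd.Series({'Sex': 1.2e-25, 'Pclass': 3.4e-10,
                      'Embarked': 8.9e-4, 'Alone': 0.91})

print(p_values.sort_index(ascending=False))   # alphabetical by name: Sex, Pclass, Embarked, Alone
print(p_values.sort_values(ascending=False))  # by p-value: Alone (largest p) comes first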
Yes, It's a mistake
Thanks for pointing out
Thank you for validating my sanity, I thought something looked out of place
Who knew that after the traumatic event of the Titanic, it would become a famous practice problem in the ML industry
Sir, at the end when we check the p-values, you did p_values.sort_index(ascending=False). I think sort_values should be used instead, right?
Thank you for providing us such great content! I am glad that I found your UA-cam channel!
You got another subscriber!
8:16, Embarked is not an ordinal variable, so how can we give it an ordinal encoding?
How can we do feature selection for clustering tasks?
Hi Krish, we are doing the train-test split and then feature selection. Suppose out of 10 features 7 are important; then we have to remove 3 from X_train and X_test. Why can't we do feature selection first and then the train-test split?
Do we have any function in pandas which picks out all the categorical features from a dataset and shows them?
try this.... df.select_dtypes(include='category')
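A small sketch of that (using seaborn's Titanic dataset as an example, which is my assumption here): note that plain string columns load as dtype 'object', so include both 'object' and 'category' unless you've explicitly cast them.

import seaborn as sns

df = sns.load_dataset('titanic')
cat_df = df.select_dtypes(include=['object', 'category'])
print(cat_df.columns.tolist())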
Loved your passionate discussion on why certain people survived lol.
How do we drop the columns based on those F-values and p-values if we have many more columns?
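A compact sketch of one way to do it (my own illustration on the seaborn Titanic data, assuming the 0.05 cutoff as in the video): compute chi2 p-values on the training set, then keep the same surviving columns in both sets.

import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2

df = sns.load_dataset('titanic')[['sex', 'pclass', 'embarked', 'alone', 'survived']].dropna()
for col in ['sex', 'embarked', 'alone']:
    df[col] = df[col].astype('category').cat.codes  # quick label encoding

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('survived', axis=1), df['survived'], random_state=0)

# keep only features whose chi2 p-value is below the 0.05 threshold,
# and subset the test set with the SAME columns
p_values = pd.Series(chi2(X_train, y_train)[1], index=X_train.columns)
keep = p_values[p_values < 0.05].index
X_train, X_test = X_train[keep], X_test[keep]
print('kept:', list(keep))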
Thanks sir for making a video on this topic
Sir, how is the careerx data science course?
Sir, as we are label encoding, aren't we introducing an order in those features? Is it OK? We won't use it while building models, right?
Perfect 🙏
Thanks a lot❤
How does a lower p-value mean a more important feature? Isn't a value less than 0.05 lying somewhere at the extremes of the bell curve, so we reject that hypothesis?
A lower p-value means there is approximately zero probability of seeing that result by chance, so for lower values we become more confident. And yes, we take 0.05, but there are some exceptional cases; in the video's example all features have p-values much lower than 0.05.
Why do we use a train-test split while performing chi-square? Does imbalanced data with a Boolean output have any impact?
You sort p_values in descending order at 20:12 and say the feature with the lower p-value is important. So according to this, 'alone' must be the most important feature?
Is feature selection a part of feature engineering?
Feature selection can be considered a separate module in the life cycle of a data science project
@@krishnaik06 thank you
Hi sir,
what if we have more than 2k columns? How can we perform encoding for all of them?
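If the columns share a dtype, you can encode them all in one shot; a sketch (again on the seaborn Titanic data, my assumption):

import seaborn as sns
from sklearn.preprocessing import LabelEncoder

df = sns.load_dataset('titanic').dropna()
# grab every object/category column, then label-encode them in one apply;
# handle NaNs first, since LabelEncoder raises on missing values
cat_cols = df.select_dtypes(include=['object', 'category']).columns
df[cat_cols] = df[cat_cols].apply(LabelEncoder().fit_transform)
print(df[cat_cols].head())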
Are label encoding and dummy variables the same thing?
Dummy variables and one-hot encoding are the same
@@raghavramola7012 Thank you Raghav
Label encoding won't make a separate column for each unique category, while dummies will do that
Label encoding imposes an order, so it suits ordinal variables, while dummy variables suit nominal ones.
Label encoding just replaces the values in place, so no extra features are added, but in one-hot encoding new features are added, which are called dummy variables
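A tiny toy illustration of the difference (made-up values):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['S', 'C', 'Q', 'S'], name='embarked')
print(LabelEncoder().fit_transform(s))  # one column of integer codes: [2 0 1 2]
print(pd.get_dummies(s))                # three new dummy columns: C, Q, S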
Wonderful explanation. What happens if your dependent variable (here Survived has only two values: 0 or 1) were also categorical with more than 2 values? How do you identify the features to drop? How do you analyse the odds of the independent variables associated with that dependent variable? Logistic regression with multi-class?
Do you have a use case or example of a scenario where all your dependent and independent variables are categorical? What type of test can be done to determine the odds of the output variable given the input features, specifically when the target variable has more than 2 values?
@16:00 you ran the np.where cell twice; that was the reason for all the zero values
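For anyone wondering why that happens, here's a sketch (assuming the cell was something like np.where(df['sex'] == 'male', 1, 0)): after the first run the column holds integers, so the string comparison is False everywhere on the second run.

import numpy as np
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female']})
df['sex'] = np.where(df['sex'] == 'male', 1, 0)  # first run: [1, 0]
df['sex'] = np.where(df['sex'] == 'male', 1, 0)  # second run: everything becomes 0
print(df['sex'].tolist())  # [0, 0]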
Thanks Krish
I have two questions.
1) Do we need to separate the categorical values from our dataset?
2) How to apply this on X_test?
Please answer. Thank you.
You don't have to apply it on X_test, as we already got the best features from the train data; we skip this step on the test data and just keep the same selected columns.
Sir, how is the careerx course?
Sir will you be uploading more feature selection techniques?
Why did he drop the values?
There is a minor mistake in the code. The correct code is: p_values.sort_values(ascending=True)
Great video brother, but how about this: instead of encoding columns that already have binary values (like sex, alone, survived), suppose we have some categorical features that, after encoding, carry more than one column each; this happens when a column has more than 2 categories. In this case, with nominal variables, instead of having one p-value per feature we might have 3 p-values or more for a single feature.
Question: what should I do in this situation? (One possible workaround is sketched below.)
I wish you the best of luck brother
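One common workaround (my suggestion, not from the video): compute a single chi-square p-value per original categorical feature by testing its contingency table against the target, so the number of categories doesn't matter.

import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency

df = sns.load_dataset('titanic')
for col in ['sex', 'embarked', 'alone']:
    table = pd.crosstab(df[col], df['survived'])   # contingency table vs target
    stat, p, dof, expected = chi2_contingency(table)
    print(f'{col}: p-value = {p:.3e}')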
Hello sir, there are so many feature selection techniques in your playlist. Which one of them do you think is best to use?
It depends on your dataset
The p-value of the 'alone' column is 0.9, which is greater than the significance level of 0.05
To apply chi-square, is it compulsory to use only label encoding for the category columns, or can I use one-hot too and then proceed with the test?
If we are considering many features (columns), I guess one-hot encoding will increase the computation. Depending on the algorithm, the type of encoding varies
OHE will generate more columns and the originality of the original variable will be gone, i.e. you are no longer comparing two original variables of your dataset; instead you are comparing an original variable with a substitute variable which only represents part of the original. Thus IMO you should not use it if you are comparing two original variables, but if you are trying to get the chi-square value between an OHE column and the class variable, then you can use it. The only reason to do the latter is to check whether the OHE step was actually useful/relevant or not.
In Tutorial 32 you taught that for 2 variables, where one has more than 2 categories, we should apply the ANOVA test, but here we are using chi2. I am still confused; I will get it cleared after watching your ANOVA tutorial
The ANOVA test is for one categorical feature with one numerical feature, whereas if we want to compare features that are both categorical, we use the chi2 test
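A small sketch of that distinction with sklearn's univariate selectors (my illustration on the seaborn Titanic data):

import seaborn as sns
from sklearn.feature_selection import f_classif, chi2

df = sns.load_dataset('titanic').dropna(subset=['age'])
y = df['survived']

f_stat, p_anova = f_classif(df[['age']], y)   # ANOVA: numeric feature vs categorical target
chi_stat, p_chi2 = chi2(df[['pclass']], y)    # chi2: integer-coded categorical feature vs target
print(f'ANOVA p-value (age):   {p_anova[0]:.3e}')
print(f'chi2 p-value (pclass): {p_chi2[0]:.3e}')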
@@gurdeepsinghbhatia2875 got it bro here we are only applying it to categorical features
@@nitingoswami4993 yss
@NITIN GOSWAMI here we are taking categorical features. For example, we get a p-value for the sex and survived columns, and after that for the pclass and survived columns, which means we compare one independent feature with the target feature, get the p-value, and do this for the rest of the columns.
The p-value doesn't say anything about IMPORTANCE; it is about the significance of the statistic, i.e. how likely we are to get a value at least that extreme if the null hypothesis were true.
Does the p-value tell us the chance that the result we got is just by chance, i.e. the reliability of the result?
What keyboard are you using sir? It does make a lot of clicky noise :D Thank you for making this video.
Any mechanical keyboard would make this possible. ☺️☺️
from sklearn.preprocessing import LabelEncoder

# label-encode several categorical columns in one go; apply() calls
# fit_transform per column (handle NaNs first, LabelEncoder can't)
vals = ['sex', 'embarked', 'alone']
le = LabelEncoder()
df[vals] = df[vals].apply(le.fit_transform)
Everything has merits as well as demerits; idk if you have already seen the demerit of letting LabelEncoder() do the encoding on whatever basis it wants. 😉😉
@@yashasvibhatt1951 ok
Thanks a lot, you're amazing
Hello everyone, I need someone to help me with my project on feature selection using chi2, please
Amazing
Sir, please add this to the playlist
The lower the p-value the better, so why is ascending=False?
It's ordering by index, not by value... I also had the same doubt and checked on Google.
I think instead of the F-value he took the p-value. Everything else is correct :)
Actually, univariate feature selection is a flawed technique; statistical significance doesn't mean practical significance
When I run chi2 on my dataset it shows 0.000000e+00 for most of the category columns. I converted the category columns with label encoding first:
UOM 1.311792e-23
TYPE 0.000000e+00
SUPPLIER NUMBER 0.000000e+00
SUPPLIER GROUP 0.000000e+00
SUPPLIER COUNTRY REGION 0.000000e+00
SUPPLIER COUNTRY 0.000000e+00
SUPPLIER 0.000000e+00
SUBCATEGORY 0.000000e+00
SOURCE ROW ID 0.000000e+00
SIFOT EXCLUSION NaN
RELEASE NUMBER NaN
RECEIVED QUANTITY 0.000000e+00
PRODUCT TYPOLOGY 0.000000e+00
PRICE 0.000000e+00
PONUMBER 0.000000e+00
PO SPEND ORIGINAL CURRENCY 0.000000e+00
PO SPEND (DKK) 0.000000e+00
PO QUANTITY 0.000000e+00
PO PROMISED DATE 0.000000e+00
PO LINE NUMBER 0.000000e+00
PO LINE DESCRIPTION EN 0.000000e+00
PO LINE DESCRIPTION 0.000000e+00
PO FULFILLMENT DATE 0.000000e+00
PO FLAG STATUS 0.000000e+00
PO DELIVERY STATUS DETAILS NaN
PO DATE 0.000000e+00
ORIGINAL CURRENCY 0.000000e+00
ORDER TYPOLOGY 2.924802e-73
ITEM NO 0.000000e+00
ITEM EN 0.000000e+00
ITEM 0.000000e+00
INCO TERMS 0.000000e+00
EXPEDITOR NAME NaN
DELIVERY STATUS 0.000000e+00
DELIVERY IN DAYS 0.000000e+00
COUNTRY OF ORIGIN NaN
CATEGORY 8.510489e-106
BUYER 0.000000e+00
dtype: float64
Process finished with exit code 0