Tutorial 5 - Feature Selection - Perform Feature Selection Using Chi2 Statistical Analysis

  • Published 22 Dec 2024

COMMENTS • 71

  • @krishnaik06
    @krishnaik06 3 years ago +2

    Revise feature selection from the playlist below:
    ua-cam.com/play/PLZoTAELRMXVPgjwJ8VyRoqmfNs2CJwhVH.html

  • @yashasvibhatt1951
    @yashasvibhatt1951 3 years ago +31

    Hello everyone, here's something that was done wrong in this video; I'm fairly sure Krish did it by mistake. In the last part, where the sorting happens (between 18:38 and 21:00), that step is not sorting the columns by their associated p-values but by their names, i.e. 'Sex' comes before 'Embarked' even though the p-value for 'Embarked' is greater than that for 'Sex'; that is why you see randomly sorted results when you look at the decimal-converted p-values. To sort the Series by its p-values, use p_values.sort_values(ascending=False) instead of p_values.sort_index(ascending=False); then the results are ordered by p-value, which should be correct IMO. Hope that helps!

    • @drdominance9392
      @drdominance9392 3 years ago +3

      Yes, it's a mistake.
      Thanks for pointing it out.

    • @HerpaDerpaZX
      @HerpaDerpaZX 2 years ago +3

      Thank you for validating my sanity, I thought something looked out of place.
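
The fix described in this thread can be sketched as follows; the column names match the video's Titanic example, but the p-values here are made-up placeholders, not the video's actual numbers:

```python
import pandas as pd

# Made-up p-values for illustration only (not the video's actual numbers)
p_values = pd.Series({'Sex': 1e-58, 'Embarked': 1e-30, 'Pclass': 1e-5, 'Alone': 0.9})

# sort_index orders by the column NAMES (alphabetically), which is the bug:
# 'Sex' lands before 'Embarked' purely because of the spelling
by_name = p_values.sort_index(ascending=False)
print(by_name.index.tolist())  # ['Sex', 'Pclass', 'Embarked', 'Alone']

# sort_values orders by the p-values themselves: most significant first
by_p = p_values.sort_values(ascending=True)
print(by_p.index.tolist())     # ['Sex', 'Embarked', 'Pclass', 'Alone']
```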

  • @anshumaanphukan4388
    @anshumaanphukan4388 3 years ago +5

    Who knew that after the traumatic sinking of the Titanic, it would become a famous practice problem in the ML industry.

  • @sivachaitanya6330
    @sivachaitanya6330 3 years ago +4

    Sir, at the end when we check the p-values, you did p_values.sort_index(ascending=False). I think sort_values should be used instead, right?

  • @pedrocolangelo5844
    @pedrocolangelo5844 3 years ago +1

    Thank you for providing us such great content! I am glad that I found your UA-cam channel!
    You got another subscriber!

  • @rafibasha4145
    @rafibasha4145 2 years ago

    8:16, Embarked is not an ordinal variable, so how can we apply ordinal encoding?

  • @mohamedmouhiha
    @mohamedmouhiha 7 months ago

    How can we do feature selection for clustering tasks?

  • @ajaykushwaha4233
    @ajaykushwaha4233 2 years ago +1

    Hi Krish, we do the train-test split first and then feature selection. Suppose that out of 10 features, 7 are important; then we have to remove 3 from X_train and X_test. Why can't we do feature selection first and then the train-test split?

  • @ajaykushwaha4233
    @ajaykushwaha4233 3 years ago +3

    Do we have any function in pandas which picks out all the categorical features from a dataset and shows them?

    • @sanjitonly
      @sanjitonly 3 years ago +3

      try this.... df.select_dtypes(include='category')
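
One caveat with the reply above: plain string columns usually load as the object dtype, not category, so select_dtypes(include='category') can come back empty. A small sketch (with a made-up DataFrame) that covers both cases:

```python
import pandas as pd

# Hypothetical frame: 'sex' loads as object, 'embarked' was cast to category
df = pd.DataFrame({
    'sex': ['male', 'female'],
    'age': [22, 38],
    'embarked': pd.Categorical(['S', 'C']),
})

# Include both dtypes to catch every categorical-like column
cats = df.select_dtypes(include=['object', 'category'])
print(cats.columns.tolist())  # ['sex', 'embarked']
```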

  • @waichingleung412
    @waichingleung412 3 years ago

    Loved your passionate discussion on why certain people survived, lol.

  • @maskman9630
    @maskman9630 2 years ago

    How do we drop columns based on those F-values and p-values if we have more columns?

  • @SahilKhan-yu3oh
    @SahilKhan-yu3oh 3 years ago +1

    Thanks, sir, for making a video on this topic.

  • @neeleshpandya5406
    @neeleshpandya5406 3 years ago +1

    Sir, how is the careerx data science course?

  • @sane7263
    @sane7263 2 years ago

    Sir, since we are label encoding, aren't we introducing an order into those features? Is that OK? We won't use it while building models, right?

  • @pouriaforouzesh5349
    @pouriaforouzesh5349 1 year ago

    Perfect 🙏

  • @dinushachathuranga7657
    @dinushachathuranga7657 10 months ago

    Thanks a lot❤

  • @nik7867
    @nik7867 3 years ago +5

    How does a smaller p-value mean a more important feature? Isn't a value less than 0.05 lying somewhere at the extremes of the bell curve, so we reject that hypothesis?

    • @gurdeepsinghbhatia2875
      @gurdeepsinghbhatia2875 3 years ago

      A smaller value means there is approximately zero probability of seeing it under the null distribution, so for smaller values we become more confident. And yes, we take 0.05, but there are some exceptional cases; in the video's example, all the features have values far smaller than 0.05.
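
To make the interpretation in this thread concrete: a chi-square test's p-value is the probability of seeing a statistic at least this extreme if the feature and target were independent, so a tiny p-value is strong evidence of association. A minimal sketch using scipy, with made-up counts roughly shaped like the Titanic's sex-vs-survival table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 contingency table: rows = (female, male), cols = (died, survived)
table = np.array([[81, 233],
                  [468, 109]])

stat, p, dof, expected = chi2_contingency(table)

# Under the null hypothesis (sex independent of survival), counts this
# lopsided are essentially impossible, so p is tiny and we keep the feature
print(p < 0.05)  # True
```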

  • @kumarabhishek8957
    @kumarabhishek8957 3 years ago

    Why do we use a train-test split while performing chi-square? Does imbalanced data with a Boolean output have any impact?

  • @utkarshsharma1867
    @utkarshsharma1867 2 years ago

    You sort p_values in descending order at 20:12, and you say the feature with the smaller p-value is more important. So according to this, 'Alone' must be the important feature?

  • @TheBlessingMoon
    @TheBlessingMoon 3 years ago +1

    Is feature selection part of feature engineering?

    • @krishnaik06
      @krishnaik06 3 years ago +2

      Feature selection can be considered a separate module in the life cycle of a data science project.

    • @TheBlessingMoon
      @TheBlessingMoon 3 years ago +1

      @@krishnaik06 thank you

  • @amardeepkumar3944
    @amardeepkumar3944 2 years ago

    Hi sir,
    what if we have more than 2k columns? How can we perform encoding for all of those columns?

  • @ajaykushwaha4233
    @ajaykushwaha4233 3 years ago +1

    Are label encoding and dummy variables the same?

    • @raghavramola7012
      @raghavramola7012 3 years ago +1

      Dummy variables and one-hot encoding are the same.

    • @ajaykushwaha4233
      @ajaykushwaha4233 3 years ago +1

      @@raghavramola7012 Thank you Raghav

    • @nik7867
      @nik7867 3 years ago +2

      Label encoding won't make a separate column for each unique category, while dummies will.

    • @vishaldas6346
      @vishaldas6346 3 years ago

      Label encoding is Nominal, and Dummy variable is Ordinal.

    • @gurdeepsinghbhatia2875
      @gurdeepsinghbhatia2875 3 years ago +1

      Label encoding just replaces the values and adds no extra features, but one-hot encoding adds new features, which are called dummy variables.
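
A minimal side-by-side of the two encodings discussed in this thread, using a made-up 'embarked' column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})

# Label encoding: stays one column, categories become integer codes
# (C=0, Q=1, S=2 here, since LabelEncoder sorts the classes)
df['embarked_le'] = LabelEncoder().fit_transform(df['embarked'])
print(df['embarked_le'].tolist())  # [2, 0, 1, 2]

# One-hot encoding: one new dummy column per category, no implied order
dummies = pd.get_dummies(df['embarked'], prefix='embarked')
print(dummies.columns.tolist())    # ['embarked_C', 'embarked_Q', 'embarked_S']
```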

  • @sarvatra539
    @sarvatra539 2 years ago

    Wonderful explanation. What happens if your dependent variable (here Survived, which has only two values: 0 or 1) is also categorical with more than 2 values? How do you identify the features to drop? How do you analyse the odds of the independent variables associated with that dependent variable? Logistic regression with multi-class?
    Do you have a use case or example of a scenario where all your dependent and independent variables are categorical? What type of test can determine the odds of the output variable given the input features, specifically when the target variable has more than 2 values?

    • @maskman9630
      @maskman9630 2 years ago

      How do we drop columns based on those F-values and p-values if we have more columns?

  • @ashwinshetgaonkar6329
    @ashwinshetgaonkar6329 3 years ago

    @16:00 you had run the np.where cell twice; that was the reason for all the zero values.

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @md.younusahamed4969
    @md.younusahamed4969 2 years ago +1

    I have two questions.
    1) Do we need to separate the categorical values from our dataset?
    2) How do we apply this to X_test?
    Please answer. Thank you.

    • @pritampatra6077
      @pritampatra6077 2 years ago +1

      You don't have to apply it to X_test, as we already got the best features from the train data; we skip this step on the test data.

  • @neeleshpandya5406
    @neeleshpandya5406 3 years ago +1

    Sir, how is the careerx course?

  • @pushpitkumar99
    @pushpitkumar99 3 years ago

    Sir will you be uploading more feature selection techniques?

  • @vent_srikar7360
    @vent_srikar7360 1 year ago

    Why did he drop the values?

  • @ajaykushwaha-je6mw
    @ajaykushwaha-je6mw 2 years ago +1

    There is a minor mistake in the code. The correct code is: p_values.sort_values(ascending=True)

  • @abdelouadoudkhouri8234
    @abdelouadoudkhouri8234 2 years ago

    Great video, brother, but how about this: instead of encoding columns that already have binary values (like sex, alone, survived), suppose we have some categorical features that, after encoding, each carry more than one column; this happens when a column has more than 2 values. In that case, considering nominal variables, instead of having one p-value per feature we might have 3 or more p-values for a single feature.
    Question: what should I do in this situation?
    I wish you the best of luck, brother.

  • @owusubright1046
    @owusubright1046 3 years ago

    Hello sir, there are so many feature selection techniques in your playlist. Which one of them do you think is best to use?

  • @wahidnabi9476
    @wahidnabi9476 3 years ago

    The p-value of the alone column is 0.9, which is greater than the significance level of 0.05.

  • @manishsharma2211
    @manishsharma2211 3 years ago

    To apply the chi-square test, is it compulsory to use only label encoding for categorical columns, or can I use one-hot too and then proceed with the test?

    • @yaminimadan4087
      @yaminimadan4087 3 years ago

      If we have many features (columns) to consider, I guess one-hot encoding will increase the computation. Depending on the algorithm, the type of encoding varies.

    • @yashasvibhatt1951
      @yashasvibhatt1951 3 years ago

      OHE will generate more columns and the originality of the original variable will be gone, i.e. you are no longer comparing two original variables of your dataset; you are comparing an original variable with a substitute variable that represents only part of the original. Thus, IMO, you should not use it if you are comparing two original variables; but if you are trying to get the chi-square value between an OHE variable and the class variable, then you can. The only reason to do the latter is to check whether or not the OHE step was actually useful/relevant.

  • @nitingoswami4993
    @nitingoswami4993 3 years ago

    In tutorial 32 you taught that for 2 categorical variables, with one having more than 2 categories, we should apply the ANOVA test, but here we are using chi2. I am still confused; I will get it cleared up after watching your tutorial on the ANOVA test.

    • @gurdeepsinghbhatia2875
      @gurdeepsinghbhatia2875 3 years ago

      The ANOVA test is for one categorical feature with one numerical feature, whereas if we want to compare features that are both categorical, we use the chi2 test.

    • @nitingoswami4993
      @nitingoswami4993 3 years ago +1

      @@gurdeepsinghbhatia2875 Got it, bro; here we are only applying it to categorical features.

    • @gurdeepsinghbhatia2875
      @gurdeepsinghbhatia2875 3 years ago

      @@nitingoswami4993 yss

    • @manojrangera
      @manojrangera 3 years ago +1

      @NITIN GOSWAMI Here we are taking categorical features. For example, we get a p-value for the sex and survived columns, and after that for the pclass and survived columns, which means we compare one independent feature against the target feature, get the p-value, and do this for the rest of the columns.
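
The rule of thumb in this thread, sketched with scipy on a tiny made-up frame: chi-square for a categorical feature vs. a categorical target, one-way ANOVA (F-test) for a numerical feature vs. a categorical target:

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# Made-up miniature Titanic-style data
df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'age': [22, 38, 26, 35, 27, 54],
    'survived': [0, 1, 1, 0, 1, 0],
})

# Categorical feature vs categorical target -> chi-square on a contingency table
table = pd.crosstab(df['sex'], df['survived'])
chi2_p = chi2_contingency(table)[1]

# Numerical feature vs categorical target -> one-way ANOVA across the classes
groups = [g['age'].values for _, g in df.groupby('survived')]
anova_p = f_oneway(*groups)[1]

print(chi2_p, anova_p)  # both are probabilities between 0 and 1
```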

  • @chandangarg5713
    @chandangarg5713 3 years ago +1

    The p-value doesn't say anything about IMPORTANCE; it is about the significance of the statistic: how likely we are to get that value of the statistic.

    • @akshaychauhan5919
      @akshaychauhan5919 3 years ago

      Does the p-value tell us the chances that the result we got is just by chance, i.e. the reliability of the result?

  • @bishwarup1429
    @bishwarup1429 3 years ago +1

    What keyboard are you using, sir? It makes a lot of clicky noise :D Thank you for making this video.

    • @yashasvibhatt1951
      @yashasvibhatt1951 3 years ago

      Any mechanical keyboard would make this possible. ☺️☺️

  • @hardikvegad3508
    @hardikvegad3508 3 years ago

    from sklearn.preprocessing import LabelEncoder
    vals = ['sex','embarked','alone']
    le = LabelEncoder()
    df[vals] = df[vals].apply(le.fit_transform)

    • @yashasvibhatt1951
      @yashasvibhatt1951 3 years ago

      Everything has merits as well as demerits; idk if you have already seen the demerit of letting LabelEncoder() do the encoding on whatever basis it wants. 😉😉

    • @hardikvegad3508
      @hardikvegad3508 3 years ago

      @@yashasvibhatt1951 ok

  • @salsabilemed6700
    @salsabilemed6700 3 years ago

    Thanks a lot, you're amazing.

  • @bahoussimeriemrabab5416
    @bahoussimeriemrabab5416 2 years ago

    Hello everyone, I need someone to help me with my project on feature selection using chi2, please.

  • @felixoluoch1371
    @felixoluoch1371 3 years ago

    Amazing

  • @surajprajapati3447
    @surajprajapati3447 3 years ago

    Sir, please add this to the playlist.

  • @swethakulkarni3563
    @swethakulkarni3563 3 years ago

    The lower the p-value the better, so why is ascending=False?

  • @thepresistence5935
    @thepresistence5935 2 years ago

    I think he took the p-value instead of the F-value. Everything else is correct :)

  • @shrikantdeshmukh7951
    @shrikantdeshmukh7951 3 years ago

    Actually, univariate feature selection is a flawed technique; statistical significance doesn't mean practical significance.

  • @vijaysavarimuthu
    @vijaysavarimuthu 2 years ago

    When I run chi2 on my dataset, it shows 0.000000e+00 for most of the category columns (I applied label encoding to the category columns):
    UOM 1.311792e-23
    TYPE 0.000000e+00
    SUPPLIER NUMBER 0.000000e+00
    SUPPLIER GROUP 0.000000e+00
    SUPPLIER COUNTRY REGION 0.000000e+00
    SUPPLIER COUNTRY 0.000000e+00
    SUPPLIER 0.000000e+00
    SUBCATEGORY 0.000000e+00
    SOURCE ROW ID 0.000000e+00
    SIFOT EXCLUSION NaN
    RELEASE NUMBER NaN
    RECEIVED QUANTITY 0.000000e+00
    PRODUCT TYPOLOGY 0.000000e+00
    PRICE 0.000000e+00
    PONUMBER 0.000000e+00
    PO SPEND ORIGINAL CURRENCY 0.000000e+00
    PO SPEND (DKK) 0.000000e+00
    PO QUANTITY 0.000000e+00
    PO PROMISED DATE 0.000000e+00
    PO LINE NUMBER 0.000000e+00
    PO LINE DESCRIPTION EN 0.000000e+00
    PO LINE DESCRIPTION 0.000000e+00
    PO FULFILLMENT DATE 0.000000e+00
    PO FLAG STATUS 0.000000e+00
    PO DELIVERY STATUS DETAILS NaN
    PO DATE 0.000000e+00
    ORIGINAL CURRENCY 0.000000e+00
    ORDER TYPOLOGY 2.924802e-73
    ITEM NO 0.000000e+00
    ITEM EN 0.000000e+00
    ITEM 0.000000e+00
    INCO TERMS 0.000000e+00
    EXPEDITOR NAME NaN
    DELIVERY STATUS 0.000000e+00
    DELIVERY IN DAYS 0.000000e+00
    COUNTRY OF ORIGIN NaN
    CATEGORY 8.510489e-106
    BUYER 0.000000e+00
    dtype: float64
    Process finished with exit code 0