Statistical Thinking - Imputing Missing Values

  • Published 24 Oct 2024

COMMENTS • 73

  • @aravindcr4998
    @aravindcr4998 4 years ago +13

    One of the very few members in the Data Science community who provides quality content.

  • @divyamarora4903
    @divyamarora4903 3 years ago +5

    I can't believe that all this is free. This level of practical content is awesome and rare to find.

  • @arpitakar3384
    @arpitakar3384 26 days ago +1

    Multivariate Imputation by Chained Equations (MICE), or iterative imputation, from scratch just got its back scratched here...
    Great video, sir...
    Love from your own country.
    WRITE DOWN THE DIFFERENCE: SCIKIT-LEARN ITERATIVE IMPUTER

    • @arpitakar3384
      @arpitakar3384 26 days ago

      Iterative Imputer
      This estimator is still experimental for now: the predictions and the API might change without any deprecation cycle. To use it, you need to explicitly import enable_iterative_imputer:
      # explicitly require this experimental feature
      from sklearn.experimental import enable_iterative_imputer # noqa
      # now you can import normally from sklearn.impute
      from sklearn.impute import IterativeImputer
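A minimal usage sketch following the imports quoted above. The toy array is hypothetical (its second column is twice the first), and the default settings of `IterativeImputer` are used apart from `max_iter` and `random_state`; treat it as an illustration rather than the approach shown in the video.

```python
import numpy as np
# explicitly require this experimental feature before importing IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second column is exactly twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],  # the value to impute
    [5.0, 10.0],
])

# Each column with missing values is regressed on the other columns.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Observed entries are passed through unchanged; only the NaN cell is replaced by the model's estimate.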

  • @ravi_krishna_reddy
    @ravi_krishna_reddy 3 years ago +1

    Very good content and awesome explanation. Thank you so much.

  • @jayaraghavendra9025
    @jayaraghavendra9025 4 years ago +1

    Awesome, and expecting more info like this.

  • @abhisheksolet8494
    @abhisheksolet8494 4 years ago +1

    Amazing tutorial, sir. Thank you so much for providing such great learning material. Looking forward to many more.

  • @madhukerbillapati3944
    @madhukerbillapati3944 4 years ago +1

    Good one. Worth reading; I wish to see more videos.

  • @raviirla459
    @raviirla459 4 years ago +1

    Wow-moment videos.. you have nailed it... your videos are fun to watch, with great content.. looking forward to more videos on visualization, feature engineering and data interpretation..

  • @MrSmarthunky
    @MrSmarthunky 4 years ago +1

    Very good video, Srivatsan sir. I would be happy if you could make more videos on such foundational knowledge.

  • @chidiedim3166
    @chidiedim3166 4 years ago +1

    great one sir

  • @hardikraja
    @hardikraja 4 years ago +1

    Awesome...

  • @muralikrishna9499
    @muralikrishna9499 4 years ago +1

    Your videos are making me more and more inspired!

  • @manassharma869
    @manassharma869 4 years ago +1

    Awesome explanation. I hope more parts are coming, thanks.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Yes, framing the problem is difficult :).. I need to see good problems to provide solutions. Suggestions are welcome as well.

  • @mukeshkund4465
    @mukeshkund4465 4 years ago +1

    Worth watching.

  • @TravelWithIndoCanadian
    @TravelWithIndoCanadian 4 years ago +1

    Very well explained.

  • @blue_sapphire8650
    @blue_sapphire8650 3 years ago

    Simple and neat. I goddamn love the way you covered the concepts in this video. In fact, I have been searching for content like this for a while and now I have found it here 😊. Thanks a lot, sir.

  • @udaysai2647
    @udaysai2647 4 years ago +1

    Srivatsan- I am very excited for this series of videos. It is a great elucidation of how we can impute missing values. To generalize your point, we need to check how the column containing NaNs varies with the target variable and how the other independent variables influence the column with NaNs, and then figure out the best way to impute. This is awesome, but I am curious about the case where other independent variables might also contain NaNs when applying such a technique. For example, if the 'MonthlyCharges' column or 'tenure' contains NaNs, how would we implement the 'lmmodel' technique in this example?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +3

      Uday.. Nice questions.. The first thing in the DS process is to understand the source of the data and why nulls are populated: is a null an unavailability scenario or an exception scenario? Say monthly charges is null due to a capture error; I would first try imputing it if that field cannot be dropped, for example by contract type and services. Below is the original dataset:
      github.com/srivatsan88/UA-camLI/blob/master/dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv
      It has the individual services each customer has, so I can try running KNN based on similarity. Then, if I am able to approximate monthly charges, I can use that to get to TotalCharges.
      Again, as I said, these approaches are probabilistic and there is no one solution to it :)
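The KNN idea in the reply above could be sketched roughly as follows. This is only an illustration under assumptions: the two service-flag columns and all values are made up, and scikit-learn's `KNNImputer` stands in for whatever similarity search one would actually use on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical customers: two service flags plus MonthlyCharges (one value missing).
df = pd.DataFrame({
    "HasInternet": [1, 1, 0, 0, 1],
    "HasPhone": [1, 1, 1, 1, 1],
    "MonthlyCharges": [80.0, 82.0, 20.0, 22.0, np.nan],
})

# The missing charge is filled with the mean of the 2 most similar customers,
# where similarity is measured on the columns that are present in both rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

With real data the features would normally be scaled first, since the distance is otherwise dominated by whichever column has the largest range.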

    • @udaysai2647
      @udaysai2647 4 years ago

      Srivatsan- Thank you for the explanation. I think the very first line answers my question. I will try to apply this perspective when dealing with a dataset and try to figure out the reason that is causing the NaNs. As always, you add value to your videos with these suggestions :) Thank you once again.

  • @mohdhammadkhan5570
    @mohdhammadkhan5570 3 years ago +1

    This content is so rare.

  • @rushikeshbulbule8120
    @rushikeshbulbule8120 4 years ago +1

    Comprehensive ✌
    Looking forward to a video on how to transform data toward a normal distribution...

  • @rakeshkedar4096
    @rakeshkedar4096 4 years ago +2

    Thanks for this video. I have a question which was even asked in one of my interviews.
    How can we evaluate our imputation strategy without applying any machine learning model? For example, suppose I had replaced Total Charges with the mean/median and did not have actual values to compare against, as you had in this case. What are the various statistical approaches to check how good our imputation strategy is?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      You can still evaluate the regression model's output by creating a split within the data points for which you have values and evaluating on the held-out part.
      Mean and median can be a good strategy when your data points are close together and not spread out. Another option, I would say, is to use models that can handle missing values rather than imputing.
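The hold-out idea in the reply above can be sketched like this: hide some values that are actually known, impute them, and compare against the truth. Everything here is an assumption for illustration: the data is synthetic, and using `tenure * MonthlyCharges` as the single regression feature is a guess, not necessarily the model from the video.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the churn data (not the real dataset).
tenure = rng.integers(1, 73, size=n).astype(float)
monthly = rng.uniform(20.0, 120.0, size=n)
total = tenure * monthly + rng.normal(0.0, 50.0, size=n)

# Pretend 40 known TotalCharges values are missing.
holdout = np.zeros(n, dtype=bool)
holdout[rng.choice(n, size=40, replace=False)] = True

def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Strategy 1: fill every hidden value with the median of the visible ones.
median_pred = np.full(holdout.sum(), np.median(total[~holdout]))

# Strategy 2: fit a straight line on the visible rows and predict the hidden ones.
x = tenure * monthly
slope, intercept = np.polyfit(x[~holdout], total[~holdout], deg=1)
regression_pred = slope * x[holdout] + intercept

rmse_median = rmse(total[holdout], median_pred)
rmse_regression = rmse(total[holdout], regression_pred)
print(f"median RMSE: {rmse_median:.1f}, regression RMSE: {rmse_regression:.1f}")
```

The same comparison works for any pair of imputation strategies, as long as the hidden rows resemble the rows that are genuinely missing.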

    • @rakeshkedar4096
      @rakeshkedar4096 4 years ago

      @@AIEngineeringLife Yes, I agree about the intuitive part of using the mean/median strategy and using models that can handle missing values, but I am curious to know whether there are any statistical tests to evaluate whether mean/median imputation works for our case.

  • @arianaquek6036
    @arianaquek6036 4 years ago +1

    @AIEngineering hello sir, thank you for the insightful video! Just to clarify a few points:
    What does 'makes null values as larger numbers or high negative numbers (in this case since 0 can be valid values, default is 0)', which you wrote in a comment, mean?
    Does the '0' you mention in 'default is 0' represent a missing value, or a value that you impute into the dataset?
    From what I understood from the paper, xgboost is capable of taking a dataset with missing values, sending them in different directions at a split and then choosing the best direction.
    There is a little part where it says 'The same algorithm can also be applied when the non-presence corresponds to a user specified value by limiting the enumeration only to consistent solutions.' I assume that this is what you meant by 'makes null values as larger numbers or high negative numbers', but what does larger/high negative numbers mean?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Ariana.. There are 2 things in xgboost. You can set your own missing value in params like the example below
      xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values
      Now what I meant is: if a continuous value being zero is normal for your business and does not really mean missing, then keeping the default value is not right. Say a retailer has an offer to give a product free if someone buys another product. Then the value can be zero for the free product and cannot be treated as a missing value. So in these cases we typically impute with a high negative number.
      Now in many cases, if the number is substantially an outlier, XGBoost will sometimes create a separate split for it, so it can be covered in an alternate tree path even without setting the missing value. This can be seen by visualizing the xgboost trees.
      I hope it makes sense.. I will try to cover it in one of my future videos, where I will be visualizing and interpreting trees.
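To make the "separate split" point above concrete, here is a small sketch. It uses scikit-learn's `DecisionTreeRegressor` as a stand-in for a single boosted tree, not XGBoost itself; the sentinel `-999.0`, the feature, and the target values are all hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

SENTINEL = -999.0  # hypothetical "missing" marker, far outside the valid range

# Price paid per product: 0.0 is a valid value (a free product), NaN is truly missing.
price = np.array([0.0, 1.0, 2.0, 3.0, np.nan, np.nan])
# Toy target in which the rows with a missing price behave differently.
target = np.array([5.0, 5.0, 5.0, 5.0, 50.0, 50.0])

# Replace NaN with the sentinel instead of 0, so free products stay distinguishable.
filled = np.where(np.isnan(price), SENTINEL, price).reshape(-1, 1)

# A depth-1 tree has exactly one split; inspect where it lands.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(filled, target)
threshold = float(stump.tree_.threshold[0])
print(threshold)
```

The single split falls between the sentinel and the smallest valid value, so the formerly missing rows get their own branch while 0.0 stays with the other real prices.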

    • @arianaquek6036
      @arianaquek6036 4 years ago

      @@AIEngineeringLife Thank you for your reply Sir! Am looking forward to more of your videos!

  • @sachingalugade8092
    @sachingalugade8092 4 years ago +1

    Thanks for the video, sir. Can you please make a video showing various ways to impute values for categorical variables?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Categorical is typically simple. We can go with the mode of the data or use logistic regression to impute it, depending on the data and business need.

  • @vaibhavbhatia4641
    @vaibhavbhatia4641 4 years ago +1

    Great video sir. Can you please also share a link to the notebook through description, thank you.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Notebook is in my git repo here - github.com/srivatsan88/UA-camLI/tree/master/statistics

    • @vaibhavbhatia4641
      @vaibhavbhatia4641 4 years ago

      @@AIEngineeringLife thanks a lot.

  • @ragulshan6490
    @ragulshan6490 4 years ago +1

    Sir, could you please make more videos on the different kinds of t-tests using Python? Please also elaborate more on the different types of normality tests.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Ragul.. Will do as and when I get time.. I have too many in the backlog and little free time, so bear with me please.

    • @ragulshan6490
      @ragulshan6490 4 years ago

      @@AIEngineeringLife Take your time, sir. I'll be waiting for that!

  • @anishnama2091
    @anishnama2091 4 years ago +1

    Thanks for this informative video. How do we impute categorical missing values using statistical thinking?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Anish.. It depends on the distribution of categories. In most cases you can tag missing values as "Others" and train the model, or impute with the most frequent category. A similar approach can also be followed to see if we can find the category from other variables, but this is applicable in very few cases.
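The two simple options mentioned above (most frequent category versus a separate "Other" tag) might look like this in pandas. The column name and values are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
service = pd.Series(
    ["DSL", "Fiber optic", None, "Fiber optic", None, "DSL", "Fiber optic"],
    name="InternetService",
)

# Option 1: impute with the mode (the most frequent category).
most_frequent = service.mode().iloc[0]
mode_filled = service.fillna(most_frequent)

# Option 2: keep missingness visible as its own category.
other_filled = service.fillna("Other")

print(mode_filled.value_counts())
print(other_filled.value_counts())
```

The second option lets a downstream model learn whether "missing" itself carries signal, which mode imputation hides.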

    • @anishnama2091
      @anishnama2091 4 years ago

      @@AIEngineeringLife Thanks

  • @antoniushka
    @antoniushka 4 years ago +1

    Hi guys! Great job! For some reason I got the same values for both columns "TotChargeNew" and "TotChargesAct". Where could the mistake be?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Antonio, that's interesting.. Looking at the data, it is very rare to have no standard error at all, although you may get values different from mine due to some randomness. Is your data for the TotChargeNew column the same before and after the pandas concat? You can check my notebook below to compare with yours
      colab.research.google.com/drive/1fzf5bm_HvbtAQS_2jxR8UoQsCliDr5fa

    • @antoniushka
      @antoniushka 4 years ago

      @@AIEngineeringLife Thank you! I'll check it out!

  • @devpratap
    @devpratap 4 years ago +1

    First of all, thanks for sharing all this.
    Sir, I executed the notebook code after typing it myself to get a better understanding.
    TotalCharges has only 11 NA values but yours had 28.
    Also, when I loaded the merged data, the values in actual Total Charges were empty.
    Did I do something wrong or have you made changes to the dataset?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Devpratap.. My bad.. I think I overwrote the file by mistake.. Check now. I created a new one.. It has 27 though, but it should work as expected. Let me know if you still have a problem.

    • @devpratap
      @devpratap 4 years ago

      @@AIEngineeringLife I checked using my notebook. It worked fine now. Thanks.

  • @rajeshk1739
    @rajeshk1739 4 years ago +2

    Thanks a lot for your efforts. Request you to please share the ipynb file.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      It is in my gitrepo - github.com/srivatsan88/UA-camLI/tree/master/statistics

  • @ashirbaddas2573
    @ashirbaddas2573 4 years ago +1

    Hello Sir.
    Could we not simply check all the correlation values and then go with the best-fitting pair? Then we could easily find a line function for predicting the missing values. Please correct me if I am wrong. Thank you for all this.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      You can, but this is a simple dataset for demo. Think of sparse data: more correlated values can introduce bias as well.. It is like how we do feature selection for models; even in the case of imputation, an analysis cycle helps.

  • @bharathjc4700
    @bharathjc4700 4 years ago +1

    We can use MICE to impute missing continuous data. Is this technique better than MICE? What are the gaps?
    Please share your insights, sir.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Bharath.. The video highlights how to analyze data and use statistical techniques for imputation. MICE internally uses the same technique, but if you already have knowledge of the data, it is better to use that knowledge instead of having MICE do the wrong thing.

  • @kachrooabhishek
    @kachrooabhishek 2 years ago

    How did we switch from "Monthly Charges" to "Tenure" at 19:34?
    Sir, was that a random guess to check what the R-squared and standard error would be with it?

  • @midhileshelavazhagan2541
    @midhileshelavazhagan2541 4 years ago +1

    Why does imputing very high values work with the gradient boosting method, as mentioned at 8:15?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Sorry for the confusion. I think I did not articulate it well.. In models like xgboost I can just make null values as larger numbers or high negative numbers (in this case, since 0 can be a valid value while the default is 0). Since GBMs work on splits, the model might create a separate split for these values. You can check sparsity-aware splitting in the paper below
      arxiv.org/pdf/1603.02754v3.pdf

  • @datatales1063
    @datatales1063 3 years ago

    @19:36 - If we look at the graph, it shows the fitted line touching the y-axis above 0. But in the equation the intercept value is negative, -924.8180. Why is that?

  • @arulsebastian6338
    @arulsebastian6338 4 years ago +1

    Thanks for the post.
    What is the github url for this code?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Here you go - github.com/srivatsan88/UA-camLI/tree/master/statistics

  • @raghumarusu4019
    @raghumarusu4019 4 years ago +1

    Sir, you are doing a great job. Could you also please tell me: if I have any doubts in data science, can I reach you by email?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Thank you. You can message me on LinkedIn or post as video comments as well

  • @sumanthreddy1542
    @sumanthreddy1542 4 years ago +1

    Why are we imputing values of TotalCharges with MonthlyCharges where tenure is zero? Why can't we set them to zero?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Sumanth.. we can.. I am just assuming they might have to pay for the first month anyway. If they are on a contract, they get penalized for breaking the contract. But you can put zero as well. I was just showing the thinking used to differentiate user personas.
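Both choices discussed above are easy to express as a rule on the rows where tenure is zero. A rough pandas sketch, with hypothetical values and the column names used elsewhere in this thread:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: new customers (tenure 0) have no TotalCharges yet.
df = pd.DataFrame({
    "tenure": [0, 12, 0, 24],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [np.nan, 683.40, np.nan, 1015.20],
})

# Only touch rows that are both missing and brand new.
new_customer = df["TotalCharges"].isna() & (df["tenure"] == 0)

# Option A: assume the first month is billed regardless.
first_month_fill = df["TotalCharges"].mask(new_customer, df["MonthlyCharges"])

# Option B: assume nothing has been charged yet.
zero_fill = df["TotalCharges"].mask(new_customer, 0.0)

print(pd.DataFrame({"first_month": first_month_fill, "zero": zero_fill}))
```

Which option is right depends on the billing rules behind the data, which is the persona-based reasoning the reply describes.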

    • @sumanthreddy1542
      @sumanthreddy1542 4 years ago

      @@AIEngineeringLife Thank you for your response. Very much appreciate your kind effort to share your knowledge.

  • @valerysalov8208
    @valerysalov8208 4 years ago +1

    Please update your GitHub for this video series.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      I thought I did already. Did you check the statistics folder in my git? I will check later and update if not.

  • @rajeshvenaganti6797
    @rajeshvenaganti6797 3 years ago

    Where can I find this code?

  • @username42
    @username42 4 years ago

    Any GitHub links for the Jupyter notebooks?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Here it is - github.com/srivatsan88/UA-camLI/blob/master/statistics/Statistical_Thinking_Imputing_Missing_Value.ipynb

    • @username42
      @username42 4 years ago

      @@AIEngineeringLife thanks :)

  • @shivankumar9060
    @shivankumar9060 4 years ago

    Sir, please provide the dataset.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Shivan.. It is in my git repo at the link below
      github.com/srivatsan88/UA-camLI/blob/master/dataset/churn_data_st.csv