Adapt this pattern to solve many Machine Learning problems

  • Published 29 Sep 2024
  • Here's a simple pattern that can be adapted to solve many ML problems. It has plenty of shortcomings, but can work surprisingly well as-is! (A sketch of the pattern follows the list of shortcomings below.)
    Shortcomings include:
    - Assumes all columns have proper data types
    - May include irrelevant or improper features
    - Does not handle text or date columns well
    - Does not include feature engineering
    - Ordinal encoding may be better
    - Other imputation strategies may be better
    - Numeric features may not need scaling
    - A different model may be better
    - And so on...
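    A minimal sketch of the kind of pipeline this pattern describes, assuming a pandas DataFrame with a mix of numeric and categorical columns and a target column named 'target' (the file name and column name are hypothetical):

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.compose import make_column_transformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # hypothetical dataset: any DataFrame with numeric and categorical columns
    df = pd.read_csv('your_data.csv')
    X = df.drop(columns='target')
    y = df['target']

    # select columns by data type (Tips 1 and 2)
    num_cols = make_column_selector(dtype_include='number')
    cat_cols = make_column_selector(dtype_exclude='number')

    # numeric: impute with a missing indicator, then scale (Tips 9, 11)
    num_pipe = make_pipeline(SimpleImputer(add_indicator=True), StandardScaler())
    # categorical: impute, then one-hot encode, ignoring unknown categories (Tips 6, 7, 27)
    cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

    preprocessor = make_column_transformer((num_pipe, num_cols), (cat_pipe, cat_cols))

    # preprocessing plus any model, evaluated with cross-validation (Tip 16)
    pipe = make_pipeline(preprocessor, LogisticRegression())
    print(cross_val_score(pipe, X, y).mean())

    Swapping LogisticRegression for another estimator, or adjusting the imputation, encoding, and scaling steps, addresses several of the shortcomings listed above.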
    Want to watch all 50 scikit-learn tips? Enroll in my FREE online course:
    👉 courses.datasc... 👈
    Tips mentioned in this video:
    Tip 1: • Use ColumnTransformer ...
    Tip 2: • Seven ways to select c...
    Tip 6: • Encode categorical fea...
    Tip 7: • Handle unknown categor...
    Tip 9: • Add a missing indicato...
    Tip 11: • Impute missing values ...
    Tip 16: • Use cross_val_score an...
    Tip 27: • Two ways to impute mis...
    Tip 43: • Use OrdinalEncoder ins...
    === WANT TO GET BETTER AT MACHINE LEARNING? ===
    1) LEARN THE FUNDAMENTALS in my intro course (free!): courses.datasc...
    2) BUILD YOUR ML CONFIDENCE in my intermediate course: courses.datasc...
    3) LET'S CONNECT!
    - Newsletter: www.dataschool...
    - Twitter: / justmarkham
    - Facebook: / datascienceschool
    - LinkedIn: / justmarkham

COMMENTS • 16

  • @dataschool
    @dataschool  2 years ago +4

    Want to watch all 50 scikit-learn tips? Enroll in my FREE online course: courses.dataschool.io/scikit-learn-tips
    This is the last scikit-learn tip I'll be posting... thank you SO MUCH for watching! 🙌

    • @grzegorzzawadzki8718
      @grzegorzzawadzki8718 2 years ago

      I recently learned that you can use handle_unknown for OrdinalEncoder, but this requires scikit-learn 0.24 or later.
      What do you think about using OneHotEncoder for only the 5 or 10 most common values?

    • @dataschool
      @dataschool  2 years ago +1

      Regarding handle_unknown with OrdinalEncoder, that's correct! I was excited to see that option released.
      Regarding OneHotEncoder with a frequency cut-off, that can be a useful strategy. It's not currently easy to do in scikit-learn, but it will be possible in a future version.
      Thanks for your comment!
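
      For reference, a small sketch of both ideas: OrdinalEncoder's handle_unknown option requires scikit-learn 0.24+, and the frequency cut-off discussed above did land later as the min_frequency / max_categories options on OneHotEncoder (scikit-learn 1.1+):

      from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

      # map categories unseen during fit to a sentinel value (scikit-learn 0.24+)
      oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

      # keep only the most common categories and group the rest as "infrequent"
      # (scikit-learn 1.1+)
      ohe = OneHotEncoder(handle_unknown='infrequent_if_exist', max_categories=10)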

  • @RRSS-ce5hf
    @RRSS-ce5hf 2 years ago +2

    Hey Kevin, very helpful videos! In this video,
    num_cols = make_column_selector(dtype_include='number')
    -> Does 'num_cols' here also include the dependent/target column (assuming it is a numerical column)?
    If yes, say we are scaling the other independent features using RobustScaler() because of the presence of a lot of outliers, but the target column does not have many outliers. Will that affect the regression output?
    What is the way out? (I want to scale all numerical columns except the target column.)

    • @dataschool
      @dataschool  2 years ago

      Excellent question! No, num_cols does not include the target column, because the preprocessor is only applied to the columns in X. Hope that helps!
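
      A tiny illustration of that point, with made-up data and column names: the selector is evaluated against whatever DataFrame the transformer is fit on, and because the pipeline only ever sees X (the target is passed separately as y), the target is never scaled.

      import pandas as pd
      from sklearn.compose import make_column_transformer, make_column_selector
      from sklearn.preprocessing import RobustScaler

      df = pd.DataFrame({'age': [25, 30, 99], 'city': ['a', 'b', 'a'], 'price': [1.0, 2.0, 3.0]})
      X = df.drop(columns='price')    # 'price' is the numeric target
      y = df['price']

      num_cols = make_column_selector(dtype_include='number')
      preprocessor = make_column_transformer((RobustScaler(), num_cols), remainder='passthrough')

      # the selector only sees X at fit time, so only 'age' is scaled;
      # 'price' never enters the transformer at all
      preprocessor.fit(X)
      print(preprocessor.get_feature_names_out())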

  • @KartikeyRiyal
    @KartikeyRiyal 2 years ago +2

    Great video. I have been learning from your videos since the end of 2018.
    Thank you so much, and God bless you Kevin. From India

    • @dataschool
      @dataschool  2 years ago

      That's great to hear! 🙏

  • @blink4037
    @blink4037 2 years ago +1

    Thank you for all the tips, I've learned so much. I just wondered: is it possible, or proper, to use FeatureUnion instead of make_pipeline when combining transformer objects, and to pass them as featureunion1 and featureunion2 with these numerical/non-numerical constraints?
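
    For what it's worth, a rough sketch of that idea (the names are made up): FeatureUnion applies every transformer to all input columns and concatenates the results, so each branch still needs its own column selection, which is exactly what ColumnTransformer handles for you. Nesting one per branch makes the two approaches roughly equivalent:

    from sklearn.pipeline import FeatureUnion, make_pipeline
    from sklearn.compose import make_column_transformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # each branch selects its own columns, because FeatureUnion itself does not
    numeric_branch = make_column_transformer(
        (make_pipeline(SimpleImputer(), StandardScaler()),
         make_column_selector(dtype_include='number')))
    categorical_branch = make_column_transformer(
        (make_pipeline(SimpleImputer(strategy='most_frequent'),
                       OneHotEncoder(handle_unknown='ignore')),
         make_column_selector(dtype_exclude='number')))

    union = FeatureUnion([('num', numeric_branch), ('cat', categorical_branch)])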

  • @abir95571
    @abir95571 2 years ago +1

    Great videos... one question: let's say the number of categories in a column is large; what would be the ideal encoding? One-hot encoding isn't really ideal, as it will create too many dummy columns.

    • @dataschool
      @dataschool  2 years ago +1

      Glad you like the videos! As for your question, there are a lot of factors that influence the optimal encoding, but you can certainly try OrdinalEncoder instead. However, you will find that it's often not a problem to create thousands of dummy columns, and those features will still improve the performance of your model. Hope that helps!

    • @abir95571
      @abir95571 2 years ago

      @dataschool I thought of ordinal encoding, but ordinal encoding inherently introduces rank... like 1 > 2 > 3, and so on. In my case the categories have no order; all have equal weight. I've chosen binary encoding because at least it reduces the columns to log N, where N is the count of distinct categories. My only doubt is: does it introduce order, or is it unordered?
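
      For reference, a hedged sketch of the OrdinalEncoder route for a high-cardinality column: the integer codes do imply an arbitrary order, but tree-based models split on thresholds and are largely insensitive to it (see Tip 43). Binary encoding is not in scikit-learn itself; it is available in the third-party category_encoders package.

      from sklearn.preprocessing import OrdinalEncoder
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.pipeline import make_pipeline

      # one column per feature, regardless of how many distinct categories it has;
      # categories unseen during fit are mapped to -1 (scikit-learn 0.24+)
      high_cardinality_pipe = make_pipeline(
          OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
          RandomForestClassifier())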

  • @pruthvips9565
    @pruthvips9565 2 years ago

    Can you explain how we can perform EDA in NLP?

  • @johnanih56
    @johnanih56 2 years ago +1

    THE BEST TIP SO FAR!

    • @dataschool
      @dataschool  2 years ago

      You are so kind, thank you! 🙏

  • @sargonsarkis1292
    @sargonsarkis1292 2 years ago

    Awesome!