Want to watch all 50 scikit-learn tips? Enroll in my FREE online course: courses.dataschool.io/scikit-learn-tips
This is the last scikit-learn tip I'll be posting... thank you SO MUCH for watching! 🙌
I recently learned that you can use handle_unknown for OrdinalEncoder, but this requires scikit-learn 0.24 or later.
What do you think about using OneHotEncoder for only the 5 or 10 most common values?
Regarding handle_unknown with OrdinalEncoder, that's correct! I was excited to see that option released.
Regarding OneHotEncoder with a frequency cut-off, that can be a useful strategy. It's not currently easy to do in scikit-learn, but it will be possible in a future version.
Thanks for your comment!
THE BEST TIP SO FAR!
You are so kind, thank you! 🙏
Hey Kevin, very helpful videos! In this video,
num_cols = make_column_selector(dtype_include='number')
-> Does 'num_cols' here also include the dependent/target column? (Assuming it is a numerical column)
If yes, say we are scaling the other independent features using RobustScaler() because of the presence of a lot of outliers, but the target column does not have many outliers. Will it affect the regression output?
What is the way out (I want to scale all numerical columns except the target column)?
Excellent question! No, num_cols does not include the target column, because the preprocessor is only applied to the columns in X. Hope that helps!
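A quick way to verify this yourself (the toy DataFrame and column names below are invented): split off the target first, so the selector is only ever evaluated against X.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

df = pd.DataFrame({'age': [25, 30, 99],
                   'city': ['SF', 'NYC', 'SF'],
                   'price': [100.0, 200.0, 150.0]})  # 'price' is the target

# separate the target BEFORE preprocessing, so the selector never sees it
X = df.drop(columns='price')
y = df['price']

num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')

preprocessor = make_column_transformer(
    (RobustScaler(), num_cols),
    (OneHotEncoder(), cat_cols))

# the selector runs against X at fit time, so only 'age' is scaled
print(preprocessor.fit_transform(X))
print(num_cols(X))
```

make_column_selector returns a callable, so you can call num_cols(X) directly to see exactly which columns it picks up, and the numeric target never appears because it was dropped from X.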
Thank you for all the tips, I learned so much. I just wondered: are we able to (or is it proper to) use FeatureUnion instead of make_pipeline when combining transformer objects, passing them as featureunion1 and featureunion2 with these numerical/non-numerical constraints?
Great videos ... one question: let's say the number of categories in a column is large. What would be the ideal encoding? One-hot encoding isn't really ideal, as it will create too many dummy columns.
Glad you like the videos! As for your question, there are a lot of factors that influence the optimal encoding, but you can certainly try OrdinalEncoder instead. However, you will find that it's often not a problem to create thousands of dummy columns, and that feature can still improve the performance of your model. Hope that helps!
@@dataschool I thought of ordinal encoding, but ordinal encoding inherently introduces rank, like 1 > 2 > 3 and so on. In my case the categories have no order; all have equal weight. I've chosen binary encoding because it at least reduces the columns to log N, where N is the count of distinct categories. My only doubt is: does it introduce order, or is it unordered?
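One data point on the "too many dummy columns" worry from this thread: OneHotEncoder returns a sparse matrix by default, so even thousands of categories cost only one stored value per row. A rough sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# one column with up to 5,000 distinct categories, 100,000 rows
X = rng.integers(0, 5000, size=(100_000, 1))

ohe = OneHotEncoder()    # sparse output by default
Xt = ohe.fit_transform(X)
print(Xt.shape)          # one dummy column per distinct category
print(Xt.nnz)            # only one stored value per row
```

Despite thousands of dummy columns, the stored data is just one nonzero entry per row, which is why high-cardinality one-hot encoding is often more practical than it first appears.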
Great video. I have been learning from your videos since the end of 2018.
Thank you so much and God bless you, Kevin. From India
That's great to hear! 🙏
Can you explain how we can perform EDA in NLP?
Awesome!
Thanks!