Thanks for watching! 🙌 If you're new to Machine Learning, I'd love for you to take my FREE 4-hour introductory course: courses.dataschool.io/introduction-to-machine-learning-with-scikit-learn
Stratified splitting is indeed a valuable technique used in machine learning and data analysis to maintain the distribution of categorical variables, such as keywords in your example, in both training and test sets. This ensures that the data split accurately represents the overall distribution of the categorical variable, helping to mitigate potential biases and maintain data integrity.
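For anyone who wants to try this, here's a minimal sketch of a stratified split with scikit-learn (the dataset and column names are made up for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1
df = pd.DataFrame({'feature': range(100),
                   'target': [0] * 90 + [1] * 10})
X = df[['feature']]
y = df['target']

# stratify=y keeps the 90/10 class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(y_train.value_counts(normalize=True))  # ~90% class 0, ~10% class 1
print(y_test.value_counts(normalize=True))   # same proportions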
As always, brilliant! Your tutorials motivated me to become a data scientist. Your tricks make me more confident every time in handling these types of queries.
Thank you so much Abhishek! 🙏
If we handle the class imbalance by first oversampling the minority class (using the SMOTE library, for example), is there any reason left to do a stratified split? I thought most models need balanced datasets for training, so handling the class imbalance is imperative, which would render stratified sampling moot since the classes are now balanced 50/50. Is this understanding correct or not?
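For context: a common practice is to do the stratified split first and oversample only the training set, so the test set keeps the real-world class ratio and stratification stays relevant. A minimal sketch, assuming the imbalanced-learn library and a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced dataset: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Stratified split first, so the test set keeps the real-world imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Oversample the minority class in the training set only (avoids leakage)
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)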
Thank you, Kevin, for your fantastic tips! I have a question: if we have a regression problem where one of the features has 3 unordered categories (every category has 100 samples), can we pass stratify=THAT_FEATURE to train_test_split, i.e. can we stratify on a feature when making the split? If not, how can we preserve the category proportions when splitting into train and test data? Thank you in advance!
I have the same question. Did you figure it out? I have a linear regression model, but I want to stratify on a feature (say, 3 locations). Will writing stratify=<the location feature> work, or what's the correct way?
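It should: the stratify parameter accepts any array-like of the same length as the data, not only the target, so a feature column works. A sketch with a hypothetical 'location' column:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical regression data with a 3-category 'location' feature
df = pd.DataFrame({'location': ['A', 'B', 'C'] * 100,
                   'sqft': range(300),
                   'price': range(300)})
X = df[['location', 'sqft']]
y = df['price']

# stratify accepts any array-like, so a feature column works here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=df['location'], random_state=42)

print(X_train['location'].value_counts())  # the 3 categories stay equally represented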
Please do a separate video on how to tackle class imbalance.
Thanks for the suggestion! But it's a far larger topic than a single video. In fact, I have already created two full chapters on this topic for my upcoming ML course... stay tuned!
What if we have a categorical feature (say 0 & 1) that is strongly correlated with the response class, and that feature also has class imbalance (more 1's than 0's)? How would we split it evenly across our training and testing data so that the proportions of 1's and 0's are similar in both?
Thank you! What about data with multi-labels?
What if the dataset is made of images?
Great question! Stratified sampling is concerned with the target values, not the features, thus the type of input data is not important. Hope that helps!
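To make that concrete, here's a sketch with a hypothetical list of image file paths: you stratify on the labels regardless of what X contains:

from sklearn.model_selection import train_test_split

# Hypothetical image dataset: X is just file paths, y holds the class labels
image_paths = ['images/img_%d.png' % i for i in range(100)]
labels = ['cat'] * 70 + ['dog'] * 30

# Stratification only looks at the labels, so X can be anything
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42)

print(test_labels.count('cat'), test_labels.count('dog'))  # 14 and 6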
What if we had 3 not fraud and 5 fraud? How would stratify=y split then?
Great question! There would be 1 "not fraud" in train and 2 in test, or 2 "not fraud" in train and 1 in test. For any dataset with a sufficient number of samples, it doesn't matter if the proportions are identical, just that they are close. Hope that helps!
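You can check this yourself with a tiny sketch (assuming a 50% test size, since 3 samples can't split evenly):

from sklearn.model_selection import train_test_split

# 3 "not fraud" and 5 "fraud" samples
y = ['not fraud'] * 3 + ['fraud'] * 5
X = list(range(8))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42)

# Prints the train/test counts of "not fraud": 1 and 2, or 2 and 1
print(y_train.count('not fraud'), y_test.count('not fraud'))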
You're a genius!
Thanks!
Helpful! Thanks, Kevin!
You're welcome! Great to hear!
Thank you so much ! 😎
You're welcome!
Thanks Brother
You're welcome!