I ran into the same problem with an imbalanced dataset, and at the time I researched different methods to handle it. Here is the shortlist I put together, which might help you.
Possible Solutions:
1. Make changes to the algorithm
• Adjust the class weights so the model becomes sensitive to the minority class
• Adjust the decision threshold (we can pick it using the precision-recall curve)
• Penalize misclassification of the minority class, e.g. by setting class_weight='balanced'
2. Reframe the problem as anomaly detection
• Treat the majority class as "normal" and the minority class as anomalies, instead of doing plain classification
• For anomaly detection, Isolation Forest tends to give promising results
3. Balance the dataset by sampling
• Undersample
• Oversample (random oversampling or SMOTE)
4. Ensemble learning with downsampling
• Bootstrap different samples, balance each one by undersampling the majority class, and then aggregate the results by voting
5. Use other techniques
• Tomek links (removes majority-class points that form nearest-neighbour pairs with minority points, sharpening the class boundary)
• Focal loss
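Point 1 above can be sketched with scikit-learn; this is only a minimal illustration on made-up synthetic data, assuming a binary problem:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (roughly 5% positives)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' re-weights the loss inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Tune the decision threshold using the precision-recall curve
probs = clf.predict_proba(X_te)[:, 1]
prec, rec, thr = precision_recall_curve(y_te, probs)
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
best_thr = thr[np.argmax(f1)]  # threshold maximizing F1 on this split
y_pred = (probs >= best_thr).astype(int)
```

In practice the threshold should be chosen on a validation split, not the test set; this sketch folds them together for brevity.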
I have also looked through Kaggle notebooks, where people found that XGBoost slightly outperforms other algorithms, although it still needs appropriate class weights.
This was my cheat sheet of the 5 ways. Share your thoughts!!
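On the anomaly-detection reframing (point 2), a minimal Isolation Forest sketch with scikit-learn, using made-up data where the minority class sits away from the bulk, might look like:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Majority class as "normal" points; minority class as shifted anomalies
normal = rng.normal(0, 1, size=(950, 2))
anomalies = rng.normal(6, 1, size=(50, 2))
X = np.vstack([normal, anomalies])

# contamination is set to the expected minority (anomaly) fraction
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
pred = iso.predict(X)  # +1 = inlier, -1 = flagged anomaly

n_flagged = int((pred == -1).sum())
```

Note this only works well when the minority class really behaves like an outlier population; overlapping classes break the assumption.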
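SMOTE itself lives in the imbalanced-learn package (imblearn.over_sampling.SMOTE); as a dependency-free sketch of point 3, plain random oversampling of the minority class with NumPy looks like this (toy data, hypothetical 90/10 split):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

# Randomly resample minority rows (with replacement) up to the majority count
minority_idx = np.where(y == 1)[0]
n_extra = int((y == 0).sum()) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```

Unlike SMOTE, this duplicates existing points rather than interpolating new ones, so it is more prone to overfitting; and either way it must be applied only to the training split.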
Very good explanation and thanks for putting the learning here. I will pin this comment on top for others' benefit.
My view - Data Science is all about trying/experimenting/failing and learning. Then something very good comes up.
@@UnfoldDataScience Won't lie, when I started watching your videos, your explanations made things much simpler. I used to freak out (sorry for the word) on hearing about DBSCAN, Hierarchical Clustering and what not, but when I see those topics explained by you I feel so comfortable that I know I can understand them. You explain things simply but accurately, without missing the important parts.
PS: I was introduced to the assumptions of linear regression by your channel. Before that I only knew the model; your videos taught me that there is something called "assumptions" and how important they are!! Totally missed by the instructors on online courses! Your channel is a huge contribution to the data science community on YT.
You are the best Data Science And Machine Learning Teacher I have ever seen. Thanks a lot!!
You are welcome!
Thank you sir, I was searching for methods for imbalanced data, and finally I found them in your video. Thank you so much once again. Of all the methods, which one is the best?
Learned new things about the balancing of data sets for Imbalanced data sets. Thanks.
Welcome.
Undersampling, oversampling minority class, combo, ensemble random forest, batch selection
An important concept when dealing with classification
Thanks for sharing Aman 👍👍
Thanks Nived.
Such a clear and concise explanation. Thank you, Aman!
Thanks A lot.
Good explanation
Hello Aman, this is the third time I have watched this video, simply to learn your way of explaining things. Is it possible for you to create a video or give us some notes so we can find all the important questions for ensembling techniques?
Thanks Mamata, I do keep sharing on Instagram, please follow "unfolddatascience" on Instagram.
Sure, Aman. Thank you
Awesome explanation. Can you please tell us when to use which technique? I mean with an example dataset and the choice of sampling technique.
Hi Aman,
In most videos we see imbalanced datasets for classification problems, but how do we check for and handle an imbalanced dataset in a regression problem?
Thanks,
Very nice explanation, kudos!
Thanks for liking Bijay
Very nice Aman
Thank you
Can I use oversampling or undersampling before splitting the dataset into training and testing?
Thank you so much! This video helped me a lot.
I have a question: how can we choose which method is the best fit for our imbalanced dataset?
It's always trial and error.
Thanks very much for sharing and explaining. What are your thoughts on logistic regression? Would imbalanced data still be a problem if you build the model as a GLM with logistic regression?
Can we use the SMOTE technique?
Does oversampling the entire dataset cause data leakage issues? And if I apply SMOTE to the train data after the train-test split, it gives poor precision for the minority class even though recall is OK... so what do I do to improve the precision of the minority class?
Hi Aman, thanks for explaining the various methods. One question: when to use which method?
Thanks Naga, there isn't a one-to-one rule for that. There are some pointers, which I can cover in a different video. Thanks for asking.
The video was good, but I request you to add some extra information and explanation for each method.
In my case I have more than 2 classes, and those classes are text labels, e.g. "well being", "business analytics" etc.
How do I balance such classes?
Which method is preferable?
This is discussed towards the end.
Thanks for sharing knowledge 🙂, please also share how to explain PCA in front of an interviewer.
ua-cam.com/video/osgqQy9Hr8s/v-deo.html
hey sir, how about imbalanced image data for deep learning?
Data augmentation is one option.
Hello sir, how do I solve a classification problem in which the training data has only one class? I get the error 'This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1'. Please help me out.
I am doing market basket analysis (MBA) with the Apriori algorithm on Google Colab. The problem is that when I use more than 20 rows of CSV transaction data, it displays an error. If the number of rows is less than 20, the expected result comes.
That's not a number-of-rows problem; there may be some hidden issue with row number 21. I am just guessing.
Please share the source of the notebook. Thanks.
drive.google.com/drive/u/0/folders/13pZrCIqk1XN6W4I95A07bK8YRHBB3btt
Hello
I'm keen to know from you how to deal with unbalanced panel data. How can I transform the data when there is a missing year?
github link