5 ways to work with imbalanced data | Imbalanced dataset machine learning | Imbalanced data

  • Published 26 Nov 2024

COMMENTS • 47

  • @enchanted_swiftie
    @enchanted_swiftie 2 years ago +9

    I ran into the same problem with an imbalanced dataset and researched the different methods for handling it. Here is the shortlist I put together, in case it helps you somewhere.
    Possible Solutions:
    1. Make some changes in the algorithm
    • Adjust the class weights so the model becomes more sensitive to the minority class
    • Adjust the decision threshold (the PR curve helps to pick one)
    • Penalize mistakes on the minority class, e.g. by setting class_weight='balanced'
    2. Stop treating it as classification and model the majority class on its own
    • Here we can treat the problem as an "anomaly detection" problem, with the minority examples as the anomalies
    For anomaly detection, "Isolation Forest" tends to give promising results
    3. Balance the dataset by sampling
    • Undersample
    • Oversample & SMOTE
    4. Ensemble learning with downsampling
    • It bootstraps different samples, each time balancing the classes by undersampling
    the majority classes, and then aggregates the results by voting
    5. Use other techniques
    • Techniques such as Tomek links (pairs of nearest neighbors from opposite classes; removing the majority-class point of each pair sharpens the separation)
    • Focal loss
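
The Tomek links bullet above can be sketched in a few lines. This is a minimal illustration, not a library implementation: a Tomek link is taken here to be a pair of points from different classes that are each other's nearest neighbor, and the function names and toy points are made up for the example.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links

def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {
        idx
        for pair in tomek_links(points, labels)
        for idx in pair
        if labels[idx] == majority_label
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Toy data, made up for illustration: class 0 plays the majority here.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.1, 3.2)]
labels = [0, 0, 0, 1, 1, 1]

links = tomek_links(points, labels)
cleaned_points, cleaned_labels = remove_majority_in_links(points, labels, majority_label=0)
```

Only the points at indices 2 and 3 sit right next to each other across the class boundary, so they form the single link and the class-0 point of that pair is removed.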
    I also looked through Kaggle notebooks, where people have found that XGBoost slightly outperforms other algorithms, even though it still needs different class weights.
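
The class weights mentioned above are often derived from the class counts. A minimal sketch, assuming the common "balanced" heuristic n_samples / (n_classes * class_count), which is the formula scikit-learn documents for class_weight='balanced'; the helper name and the toy labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels, made up for illustration: 90 majority (0) and 10 minority (1) rows.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)
```

With a 90/10 split the minority class gets a weight nine times larger than the majority class, so each minority mistake costs the model nine times as much.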
    -
    This was my cheat sheet of the 5 ways. Share your thoughts!!
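
Item 4 of the cheat sheet (ensemble learning with downsampling) could look roughly like the sketch below. It only builds the balanced bootstrap samples and the voting step; training a model on each bag is left out. The function names, the number of bags, and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter

def balanced_bags(labels, n_bags=5, seed=0):
    """Return n_bags lists of row indices, each with equal counts per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority_size = min(len(rows) for rows in by_class.values())
    bags = []
    for _ in range(n_bags):
        bag = []
        for rows in by_class.values():
            # Bootstrap: draw with replacement, down to the minority class size.
            bag.extend(rng.choices(rows, k=minority_size))
        rng.shuffle(bag)
        bags.append(bag)
    return bags

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one voted label per sample."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]

# Toy labels, made up for illustration.
labels = [0] * 90 + [1] * 10
bags = balanced_bags(labels, n_bags=3)

# Hypothetical predictions from three models on three samples.
voted = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
```

An odd number of bags avoids tied votes in the binary case.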

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago +4

      Very good explanation and thanks for putting the learning here. I will pin this comment on top for others' benefit.
      My view - Data Science is all about trying/experimenting/failing and learning. Then something very good comes up.

    • @enchanted_swiftie
      @enchanted_swiftie 2 years ago +3

      @UnfoldDataScience Won't lie, but when I started watching your videos, your explanations made things much simpler. You know, I used to freak out (sorry for the words) on hearing DBSCAN, Hierarchical Clustering and whatnot, but when I see those topics explained by you, I feel comfortable that I will now understand them. You explain so simply yet accurately, without missing the important things.
      PS: I was introduced to the assumptions of linear regression by your channel. Before that I knew the model, but only then did I learn that there is something called "assumptions" and how important they are!! The online courses totally missed this. Your channel is a huge contribution to the data science community on YT.

  • @KastijitBabar
    @KastijitBabar 6 months ago +1

    You are the best Data Science And Machine Learning Teacher I have ever seen. Thanks a lot!!

  • @sreebvmcreation9388
    @sreebvmcreation9388 4 months ago

    Thank you, sir. I was searching for methods for imbalanced data, and I finally found them in your video. Thank you so much once again. Of all the methods, which one is the best?

  • @karthebans248
    @karthebans248 2 years ago

    Learned new things about balancing imbalanced data sets. Thanks.

  • @Samtoosoon
    @Samtoosoon 25 days ago

    Undersampling, oversampling minority class, combo, ensemble random forest, batch selection

  • @nivednambiar6845
    @nivednambiar6845 2 years ago

    An important concept when dealing with classification
    Thanks for sharing Aman 👍👍

  • @zahedinima732
    @zahedinima732 2 years ago

    Such a clear and concise explanation. Thank you, Aman!

  • @ayushparihar5989
    @ayushparihar5989 1 year ago

    Good explanation

  • @mamataparab9803
    @mamataparab9803 2 years ago

    Hello Aman, this is the third time I have watched this video, simply to learn your way of explaining things. Is it possible for you to create a video or give us some notes so we can find all the important questions for ensembling techniques?

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago

      Thanks Mamata, I do keep sharing on Instagram, please follow "unfolddatascience" on Instagram.

    • @mamataparab9803
      @mamataparab9803 2 years ago

      Sure, Aman. Thank you

  • @atod2572
    @atod2572 2 years ago

    Awesome explanation. Can you please tell us when to use which technique? I mean with an example dataset and the choice of sampling technique for it.

  • @NeeRaja_Sweet_Home
    @NeeRaja_Sweet_Home 2 years ago

    Hi Aman,
    Most videos cover imbalanced datasets for classification problems, but how do we check for and handle an imbalanced dataset in a regression problem?
    Thanks

  • @bijaynayak6473
    @bijaynayak6473 2 years ago

    Very nice explanation, kudos.

  • @sadhnarai8757
    @sadhnarai8757 2 years ago

    Very nice Aman

  • @younesgasmi8518
    @younesgasmi8518 10 months ago

    Can I use oversampling or undersampling before splitting the dataset into training and testing sets?
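
The usual concern behind this question is leakage: if rows are duplicated before the split, copies of the same row can land in both the training and the test set. A minimal sketch of the split-first order, using plain random oversampling on row indices; the helper names, the 25% test fraction, and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter

def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into (train, test)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

def oversample_minority(train_indices, labels, seed=0):
    """Duplicate rows of the smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for idx in train_indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

# Toy labels, made up for illustration.
labels = [0] * 90 + [1] * 10

# 1) Split first, 2) resample the training part only.
train_idx, test_idx = train_test_split_indices(len(labels))
balanced_train_idx = oversample_minority(train_idx, labels)
train_counts = Counter(labels[i] for i in balanced_train_idx)
```

The test indices are never touched by the resampling step, so the test set keeps the original class ratio and shares no rows with the training set.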

  • @riva.4484
    @riva.4484 1 year ago

    Thank you so much! This video helped me a lot.
    I have a question: how can we decide which way is the best fit for our imbalanced dataset?

  • @dd3371
    @dd3371 2 years ago

    Thanks very much for sharing and explaining. What are your thoughts on logistic regression? Would imbalanced data still be a problem if you build the model as a GLM using logistic regression?

  • @swapnilgiram1355
    @swapnilgiram1355 2 months ago

    Can we use the SMOTE technique?

  • @avikdinda7827
    @avikdinda7827 4 months ago

    Does oversampling the whole dataset cause data leakage issues? And if I apply SMOTE to the training data after the train/test split, I get poor precision on the minority class, although recall is OK. What can I do to improve the precision of the minority class?
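
One lever for the precision/recall trade-off described here is the decision threshold: raising it usually buys precision at the cost of recall. A minimal sketch that measures both at two thresholds; the scores and labels are toy values made up for illustration, not results from any real model.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall of the positive (minority) class at a given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data, made up for illustration: 3 minority rows out of 10.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.55, 0.60, 0.65, 0.80, 0.90]

default_cut = precision_recall(y_true, scores, threshold=0.5)
stricter_cut = precision_recall(y_true, scores, threshold=0.7)
```

On this toy data the default threshold catches every minority row but also flags two majority rows, while the stricter threshold flags no majority rows and misses one minority row.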

  • @nagarajsundar7931
    @nagarajsundar7931 2 years ago

    Hi Aman, thanks for explaining the various methods. One question: when should we use which method?

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago

      Thanks Naga, there can't be a one-to-one, go-to rule. There are some pointers, which I can cover in a different video. Thanks for asking.

  • @dhanushraj3697
    @dhanushraj3697 2 years ago

    The video was good, but I would request some extra information and explanation for each method.

  • @dilshadmuhammed8224
    @dilshadmuhammed8224 11 months ago

    In my case I have more than 2 classes, and those classes are text labels, e.g. well being, business analytics, etc.
    How do I balance such classes?

  • @chalmerilexus2072
    @chalmerilexus2072 2 years ago +1

    Which method is preferable?

  • @snehalvaidya5843
    @snehalvaidya5843 2 years ago

    Thanks for sharing knowledge 🙂. Please share how to explain PCA in front of an interviewer.

  • @mihretdesta9153
    @mihretdesta9153 1 year ago

    Hey sir, how about imbalanced image data for deep learning?

  • @maasahebbiustad8514
    @maasahebbiustad8514 2 years ago

    Hello sir, how do I solve a classification problem in which the training data has only one class? 'This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1'. Please help me out.

  • @tharindumadusanka3038
    @tharindumadusanka3038 2 years ago

    I am doing MBA with the Apriori algorithm in Google Colab. The problem is that when I use more than 20 rows of CSV transaction data, it displays an error. If the number of rows is less than 20, the expected result comes out.

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago

      That's not a number-of-rows problem; there is probably some hidden issue with row number 21. I am just guessing.

  • @ratnajyotibhowmick9801
    @ratnajyotibhowmick9801 2 years ago

    Please share the source of the notebook. Thanks.

    • @UnfoldDataScience
      @UnfoldDataScience 2 years ago +1

      drive.google.com/drive/u/0/folders/13pZrCIqk1XN6W4I95A07bK8YRHBB3btt

  • @hasantalib6254
    @hasantalib6254 1 year ago

    Hello
    I'd like to know from you how I can deal with unbalanced panel data. How can I transform the data when there is a missing year?

  • @PalaSheshu111
    @PalaSheshu111 2 years ago

    github link