Undersampling for Handling Imbalanced Datasets | Python | Machine Learning

  • Published 3 Dec 2024

COMMENTS • 50

  • @ruhinehri5607
    @ruhinehri5607 2 years ago

    Awesome explanation... I was really struggling to balance a dataset... This video made my day...

  • @dhananjaykansal8097
    @dhananjaykansal8097 5 years ago +2

    This is awesome. Please market more. The likes and comments don't do justice to the kind of work you're doing. Obviously it might happen that you stop making frequent videos for obvious reasons, but I'd like to tell you that I personally really liked your videos, and your teaching style is straightforward and lucid. Thanks

  • @angelsandemons
    @angelsandemons 2 years ago

    Amazing video, great teaching style. I struggled for hours and finally found this gem of a video, thank you so much!!

  • @manaralassf2896
    @manaralassf2896 5 years ago +4

    Please, could you tell us why you applied undersampling to the whole dataset? I think we should apply this technique only to the training set, like we would with SMOTE.
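
A minimal sketch of the pattern the commenter suggests, assuming scikit-learn and the imbalanced-learn package; undersampling is applied only to the training split, so the test set keeps the original class distribution. The synthetic dataset below stands in for the one used in the video.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data standing in for the dataset used in the video.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Undersample the training split only; X_test / y_test are left untouched.
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_train_res, y_train_res)
print(model.score(X_test, y_test))  # evaluated on the original, imbalanced distribution
```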

  • @joseluismanzanares3662
    @joseluismanzanares3662 4 years ago

    Great. Very useful. I'm just facing this issue with a target variable in a classification model for lung cancer. THANK YOU

  • @abhijeetpatil6634
    @abhijeetpatil6634 5 years ago

    Thanks Bhavesh, never stop making such videos

  • @RobertWei-p1l
    @RobertWei-p1l 1 year ago

    Hi there, I'm a little confused at 4:44. You have the imbalanced data and split it without `stratify`, but the model still fits well. When I apply this to my imbalanced data, where class 0 has 582689 samples and class 1 has 1296, it raises an error saying my X_train only got 1 class instead of 2. What can I do to solve this? I used `stratify` but it is still not working. Really appreciate it.
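
The error described above usually means the rare class never made it into one of the splits. A minimal sketch of a stratified split with scikit-learn, using the class counts from the comment (582689 zeros, 1296 ones) and placeholder features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts taken from the comment: 582689 zeros and 1296 ones.
y = np.array([0] * 582689 + [1] * 1296)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the 0/1 ratio identical in both splits,
# so y_train is guaranteed to contain both classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_train), np.bincount(y_test))
```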

  • @shreyachandra5175
    @shreyachandra5175 4 years ago +1

    Thank you! This was an excellent video and extremely helpful :)

  • @hemantsah8567
    @hemantsah8567 4 years ago +1

    How would you perform sampling when the target feature has more than 2 categories...?
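
For what it's worth, imbalanced-learn's RandomUnderSampler handles more than two classes out of the box; a small sketch with a synthetic 3-class target:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic, imbalanced 3-class target.
X, y = make_classification(
    n_samples=6000, n_classes=3, n_informative=4,
    weights=[0.7, 0.2, 0.1], random_state=42
)
print("before:", Counter(y))

# By default every class is downsampled to the size of the smallest class.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("after:", Counter(y_res))
```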

  • @powellmenezes584
    @powellmenezes584 5 years ago +2

    Simple and easy - I appreciate you bro :) Subscribed and liked :P

  • @sidgupta1957
    @sidgupta1957 3 years ago

    Thanks for the explanation. When undersampling, the output scores we get would be inflated/deflated depending on the majority class (what I mean is that if the dependent variable takes values 1 and 0 and the majority class is 0, then we will get inflated scores after the model is built). So how do we factor that in?
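
One common way to deal with the inflated scores described above is to recalibrate the predicted probabilities for the undersampling rate. A sketch of a standard prior-correction formula, where beta is the fraction of majority-class samples that were kept; treat this as an assumption about the commenter's setup rather than something shown in the video:

```python
import numpy as np

def correct_probability(p_s: np.ndarray, beta: float) -> np.ndarray:
    """Map a probability predicted after undersampling back to the original prior.

    p_s  : probability of class 1 predicted by the model trained on undersampled data
    beta : fraction of majority-class (class 0) samples that were kept
    """
    return beta * p_s / (beta * p_s - p_s + 1)

# Example: only 10% of the majority class was kept and the model outputs 0.8;
# the corrected probability is noticeably lower.
print(correct_probability(np.array([0.8]), beta=0.10))
```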

  • @luismagana6347
    @luismagana6347 4 years ago

    Thanks, it was clear to me, good video.

  • @mr.techwhiz4407
    @mr.techwhiz4407 4 years ago

    Great video. Is this the same case if you use a Random Forest model?

  • @sobinbabu984
    @sobinbabu984 4 years ago

    How can we apply SMOTE to a dataset containing categorical variables? Or should we apply one-hot encoding before SMOTE?
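
imbalanced-learn ships SMOTENC, a SMOTE variant that treats selected columns as categorical, so one-hot encoding them beforehand isn't required; a small sketch on made-up mixed-type data:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.normal(size=n),              # numeric feature (column 0)
    rng.integers(0, 3, size=n),      # categorical feature (column 1)
    rng.integers(0, 2, size=n),      # categorical feature (column 2)
])
y = (rng.random(n) < 0.1).astype(int)  # roughly 10% positive class

# categorical_features lists the column indices SMOTENC treats as categorical.
smote_nc = SMOTENC(categorical_features=[1, 2], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
print(Counter(y), Counter(y_res))
```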

  • @debatradas1597
    @debatradas1597 2 years ago

    Thank you so much Sir

  • @joseluismanzanares3662
    @joseluismanzanares3662 4 years ago

    Hi Bhavesh Bhatt, just a question. I wonder if undersampling may be appropriate for my dataset. The minority class is 8.4% of the data, with 6976 observations for the minority and 83687 for the majority. Any comments on this issue? Thanks

  • @deutschvalley3574
    @deutschvalley3574 3 years ago

    Great explanation sir. Kindly make videos on all the performance metrics and how we can get the most information from our model and data

  • @santanusarangi
    @santanusarangi 4 years ago

    Hello,
    Once we get the optimum threshold value, how do we make the model use it as the new decision threshold?
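
The usual answer is that nothing inside the fitted model changes: you apply the chosen threshold to predict_proba yourself. A small sketch assuming a scikit-learn classifier and an optimum threshold found elsewhere (the 0.3 below is made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Suppose 0.3 is the optimum threshold found from the ROC / precision-recall analysis.
optimum_threshold = 0.3
proba = model.predict_proba(X_test)[:, 1]
y_pred = (proba >= optimum_threshold).astype(int)  # replaces model.predict(X_test)
print(np.bincount(y_pred))
```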

  • @agnibhohomchowdhury
    @agnibhohomchowdhury 5 years ago

    Your videos are very simple and easy to understand... Love your work. Can you provide the code?

  • @niranjanbehera4591
    @niranjanbehera4591 5 years ago

    good one

  • @pratapdutta4
    @pratapdutta4 3 years ago

    So here we are splitting the data into test and train after undersampling?

  • @akashm103
    @akashm103 4 years ago

    Dude, I have a doubt: what about the training accuracy, does it go down? I'm training a model where oversampling made the test accuracy go up but the training accuracy went down.

  • @jagritisehgal3867
    @jagritisehgal3867 4 years ago

    Thanks, nice work :)

  • @halilibrahimozkan9799
    @halilibrahimozkan9799 3 years ago

    I have a question. I created a model. My data has 1s and 0s, with more 1s than 0s. I applied both undersampling and oversampling. Undersampling gives lower accuracy than oversampling. Why is that?

  • @karthicradha4834
    @karthicradha4834 4 years ago

    Very interesting, easy to understand, and I could follow all the steps. Btw, I am facing issues with the code. While executing `generate_auc_roc_curve`, it shows `name 'auc' is not defined` for this line:
    `plt.plot(fpr, tpr, label="AUC ROC Curve with area under the curve = " + str(auc))`
    Could you please explain this line of code to me? Thanks

    • @bhattbhavesh91
      @bhattbhavesh91 4 years ago +1

      If you have followed the process as shown in the video, it shouldn't give you an error! If it's giving you an error, then you are one Google search away from the final solution!
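
The exact helper from the video isn't reproduced here, but a sketch of a typical `generate_auc_roc_curve` function shows where `fpr`, `tpr`, and `auc` come from; the NameError above usually means `auc` was never computed (or imported) before being used in the plot label. The names below are assumptions, not the video's code:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def generate_auc_roc_curve(clf, X_test, y_test):
    # The probability of the positive class drives both the curve and the AUC value.
    y_score = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc = roc_auc_score(y_test, y_score)  # define `auc` before using it in the label
    plt.plot(fpr, tpr, label="AUC ROC Curve with area under the curve = " + str(auc))
    plt.legend(loc="lower right")
    plt.show()
```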

  • @deepikadusane9051
    @deepikadusane9051 4 years ago

    Hi, I have seen all your videos on imbalanced datasets, but which one should we prefer the most: oversampling, undersampling, or class weights?

    • @bhattbhavesh91
      @bhattbhavesh91 4 years ago

      It depends on your problem statement! Is your business OK with trusting synthetic data? Are you OK with losing out on data in the case of undersampling? So, I can't give you a single answer!

  • @muza6322
    @muza6322 3 years ago

    Thank you

  • @poojarani9860
    @poojarani9860 5 years ago

    Hi Bhavesh, I liked your video. I have a large text dataset of some violation data. I need to apply ML techniques to find the major key areas that are causing violations. Can you guide me on how to proceed? The data I have is in Excel and we can apply supervised machine learning. I have also manually created categories, for which I tried applying supervised machine learning algorithms to predict the target variable. But my motto is not to find the target variable; my motto is to find the major key areas because of which violations exist. When I created the categories, I found that around 90% of the data belongs to one category, which is causing class imbalance.

  • @yasserothman4023
    @yasserothman4023 3 years ago

    Why apply the undersampling to the whole dataset and not the training set only?

  • @TejaDuggirala
    @TejaDuggirala 5 years ago

    Great work bro.. helped me a lot ! Thank you so much! Liked and subscribed :)

  • @Neerajpl7
    @Neerajpl7 5 years ago

    Good One 👌

  • @kinglovesudelhi
    @kinglovesudelhi 3 years ago

    Why can't we use Firth logistic regression, which penalizes the maximum likelihood?

  • @Lion9781
    @Lion9781 4 years ago

    Great video. Undersampling on the entire dataset, so both train and test data, is a mistake though. Generally it should only be applied to the training set, otherwise the great performance will be misleading. Nonetheless, the code itself is nice.

  • @NaviVlogs76
    @NaviVlogs76 4 years ago

    Sir, what about 3 classes? How do we handle them? It was really helpful

  • @taskynrakhym1542
    @taskynrakhym1542 5 years ago

    Thanks Bro!!!!

  • @AbdullahQamer
    @AbdullahQamer 4 years ago

    Can anyone answer this question please?
    A dataset with the following numbers of instances for three classes A, B, and C shall be balanced:
    A: 3100
    B: 3200
    C: 3600
    a) How many instances does the dataset have in total after balancing with undersampling?
    b) How many instances does the dataset have in total after balancing with oversampling?
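
A worked answer under the usual definitions, i.e. undersampling trims every class down to the smallest class and oversampling grows every class up to the largest:

```python
counts = {"A": 3100, "B": 3200, "C": 3600}

# a) undersampling: every class is reduced to the smallest count
undersampled_total = min(counts.values()) * len(counts)  # 3 * 3100 = 9300

# b) oversampling: every class is increased to the largest count
oversampled_total = max(counts.values()) * len(counts)   # 3 * 3600 = 10800

print(undersampled_total, oversampled_total)
```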

  • @vaibhavmishra2283
    @vaibhavmishra2283 2 years ago

    I think there is a mistake in this. The metric values came out that good because the test data was also balanced (as you performed undersampling on the entire dataset). This would lead us to misleading results, as we have never tested the imbalanced scenario, which unfortunately is the real case. We should perform under- or oversampling only on the training set and validate on the imbalanced dataset to make sure we get correct results.

  • @radcyrus
    @radcyrus 5 years ago

    Thank you :-)