Awesome explanation... I was really struggling to balance a dataset... This video made my day...
Glad it helped!
This is awesome. Pls market more. The likes and comments doesn’t justify the kinda of work you’re doing. Obviously it might happen that you stop making frequent videos for obvious reasons, but I like to tell you I personally really liked your videos and your teaching style is straight forward and lucid. Thanks
Amazing video, great teaching style. I struggled for hours and finally found this gem of a video, thank you so much!!
Glad it was helpful!
Please, could you tell us why you applied undersampling to the whole dataset? I think we should implement this technique only on the training set, as we would do with SMOTE?
Great. Very useful. I'm just facing this issue with a target variable in a classification model for lung cancer. THANK YOU
Thanks Bhavesh, never stop making such videos
Hi there, I'm a little confused at 4:44. You have the imbalanced data and split it without the `stratify` option, but the model still fits well. When I apply this to my imbalanced data, which has 582689 samples of class 0 and 1296 of class 1, it raises an error saying my training set only contains one class instead of two. How can I solve this? I used the `stratify` option but it is still not working. Really appreciate any help.
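For anyone hitting the same "only one class" error: the usual fix is to pass the label array itself to `stratify` (not the column name as a string), which forces both classes into the train and test splits. A minimal, self-contained sketch with a made-up imbalanced dataset (the column names here are hypothetical, not from the video):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up imbalanced dataset: 95 rows of class 0, 5 rows of class 1
df = pd.DataFrame({"feature": range(100), "Class": [0] * 95 + [1] * 5})
X = df[["feature"]]
y = df["Class"]

# stratify=y preserves the 0/1 ratio in both splits, so even a rare class
# shows up in X_train and X_test instead of landing entirely in one split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.value_counts())
print(y_test.value_counts())
```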
Thank you! This was an excellent video and extremely helpful :)
Glad it was helpful!
How will you perform sampling when the target feature has more than 2 categories...?
simple and easy - i appreciate you bro :) Subscribed and liked :P
Thanks for the explanation. When undersampling, the output scores that we get would be inflated or deflated depending on the majority class (what I mean is that if the dependent variable takes values 1 and 0, and the majority class is 0, then we will get inflated scores after the model is built). So how do we factor that in?
Thanks, it was clear for me, good video.
Great to hear!
Great video. Is this the same case if you use a Random Forest model?
How can we apply SMOTE to a dataset containing categorical variables? Or should we apply one-hot encoding before SMOTE?
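One option worth looking at (not covered in the video) is imbalanced-learn's SMOTENC, which handles mixed numeric and categorical features without one-hot encoding first; a minimal sketch with a made-up dataset:

```python
import pandas as pd
from imblearn.over_sampling import SMOTENC

# Made-up dataset: one numeric and one categorical feature, imbalanced target
X = pd.DataFrame({
    "amount": [10, 12, 11, 13, 9, 10, 14, 12, 11, 10, 50, 52, 55, 51, 49, 53],
    "country": ["IN", "US", "IN", "US", "IN", "IN", "US", "IN",
                "US", "IN", "US", "US", "IN", "US", "IN", "US"],
})
y = [0] * 10 + [1] * 6

# categorical_features points at the "country" column (index 1), so SMOTENC
# interpolates "amount" numerically but picks categories by nearest-neighbour vote
smote_nc = SMOTENC(categorical_features=[1], k_neighbors=3, random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)
print(pd.Series(y_resampled).value_counts())
```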
Thank you so much Sir
Most welcome
Hi Bhavesh Bhatt, just a question. I wonder if undersampling may be appropriate for my data set. The minority class is 8.4% of the data, with 6976 observations for the minority and 83687 for the majority. Any comments on this issue? Thanks
Great explanation sir, kindly make videos on all the performance metrics and how we can get the best information from our model and data.
Already uploaded
Hello,
Once we get the optimum threshold value, how do we reset the model's threshold to it?
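There isn't a built-in way to change the cutoff inside a scikit-learn model; the usual approach is to keep the model as-is and apply your chosen threshold to `predict_proba` yourself. A minimal sketch, assuming a hypothetical optimum threshold of 0.3 found earlier (e.g. from the ROC curve):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up imbalanced data and a simple fitted classifier
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Apply the chosen cutoff to the positive-class probabilities
# instead of relying on the default 0.5 used by clf.predict()
custom_threshold = 0.3  # hypothetical optimum
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= custom_threshold).astype(int)
```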
Your videos are very simple and easy to understand ... Love your work. Can u provide the code?
good one
So here we are splitting the data into test and train after under sampling?
Dude, I have a doubt: what about the training accuracy, does it go down? I'm training a model where oversampling made the testing accuracy go up, but the training accuracy went down.
Thanks, nice work :)
Glad you liked it!
I have a question. I created a model. My data has 1s and 0s, with more 1s than 0s. I applied both undersampling and oversampling. Undersampling gives lower accuracy than oversampling. Why is that?
Very interesting, easy to understand and follow all the steps. By the way, I am facing an issue with the code: while executing `generate_auc_roc_curve`, it shows `name 'auc' is not defined` on this line:
`plt.plot(fpr, tpr, label="AUC ROC Curve with area under the curve = " + str(auc))`
Could you please explain this line of code? Thanks
If you have followed the process as shown in the video, it shouldn't give you an error! If it's giving you an error, then you are a Google search away from the final solution!
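For anyone stuck on the same `name 'auc' is not defined` error: it usually means the AUC value was never computed (or imported) before being used in the plot label. The exact helper in the video may differ, but a minimal sketch of such a function using scikit-learn could look like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def generate_auc_roc_curve(clf, X_test, y_test):
    # Probability scores for the positive class
    y_score = clf.predict_proba(X_test)[:, 1]
    # fpr/tpr trace out the ROC curve; auc summarises it as one number
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc = roc_auc_score(y_test, y_score)
    plt.plot(fpr, tpr, label="AUC ROC Curve with area under the curve = " + str(auc))
    plt.legend()
    plt.show()

# Usage (with an already fitted classifier):
# generate_auc_roc_curve(clf, X_test, y_test)
```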
Hi, I have seen all your videos on imbalanced datasets, but which one should we prefer the most: oversampling, undersampling, or class weights?
It depends on your problem statement! Is your business OK with trusting synthetic data? Are you OK with losing data in the case of undersampling? So I can't give you a single answer!
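For completeness, the class-weight option sidesteps both trade-offs mentioned above (no synthetic rows, no discarded rows) by reweighting the loss instead; a minimal sketch with scikit-learn's logistic regression on a made-up imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up 95/5 imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# class_weight="balanced" upweights the minority class in the loss,
# so nothing is synthesised and nothing is thrown away
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```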
Thank you
You're welcome
Hi Bhavesh, I liked your video. I have a large text data set of violation data. I need to apply ML techniques to find the major key areas which are causing violations. Can you guide me on how to proceed? The data I have is in Excel and we can apply supervised machine learning. I have also manually created the categories, and I tried to apply a supervised machine learning algorithm to predict the target variable. But my goal is not to predict the target variable; my goal is to find the major key areas because of which violations exist. When I created the categories, I found around 90% of the data belongs to one category, which is causing class imbalance.
Why apply the undersampling on the whole dataset and not on the training set only?
Great work bro.. helped me a lot ! Thank you so much! Liked and subscribed :)
Good One 👌
Why can't we use Firth logistic regression, which penalizes the maximum likelihood?
Great video. Undersampling on the entire data set, so both train and test data, is a mistake though. Generally it can only be applied to the training set, otherwise the great performance will be misleading. Nonetheless, the code itself is nice.
Sir, what about 3 classes? How to handle them? It was really helpful.
did you find any way to do that?
Thanks Bro!!!!
Can anyone answer this question please?
A dataset with the following numbers of instances for three classes A, B, and C shall be balanced:
A: 3100
B: 3200
C: 3600
a) How many instances does the dataset have in total after balancing with undersampling?
b) How many instances does the dataset have in total after balancing with oversampling?
Under: every class is reduced to the smallest class, so 3100 × 3 = 9300 instances.
Over: every class is raised to the largest class, so 3600 × 3 = 10800 instances.
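To sanity-check those numbers in code: imbalanced-learn's random samplers handle multi-class targets directly. A small sketch with a made-up three-class dataset matching the counts above:

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Made-up three-class example: 3100 A, 3200 B, 3600 C
X = np.arange(3100 + 3200 + 3600).reshape(-1, 1)
y = np.array(["A"] * 3100 + ["B"] * 3200 + ["C"] * 3600)

X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)

print(Counter(y_under), len(y_under))  # each class 3100 -> 9300 total
print(Counter(y_over), len(y_over))    # each class 3600 -> 10800 total
```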
I think there is a mistake in this: the metric values came out that good because the test data was also balanced (since the undersampling was performed on the entire dataset). This leads to misleading results, as we never tested the imbalanced scenario, which unfortunately is the real case. We should perform under- or oversampling only on the training set and validate on the imbalanced data to make sure we get correct results.
Thank you :-)
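A minimal sketch of the ordering suggested in the correction above (not the exact code from the video): split first, resample only the training portion, and evaluate on the untouched, still-imbalanced test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Made-up imbalanced dataset
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

# 1. Split first, so the test set keeps the real-world class imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Undersample only the training data
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)

# 3. Fit on the balanced training set, evaluate on the untouched test set
clf = LogisticRegression(max_iter=1000).fit(X_train_res, y_train_res)
print(clf.score(X_test, y_test))
```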