To downsample or not? Handling class imbalance in bird feeder observations

Julia Silge

Published 31 Jan 2025

COMMENTS • 23

@wouldntyaliktono 2 years ago ⁺¹¹
One way I like to think about this question of downsampling is whether it alters the bias term of my model. Rebalancing the data will force the model to assume that the global average probability of SQUIRREL is 50%, but that isn't the case in the empirical data. And that can affect how successful my models are when they're deployed to production.
@JuliaSilge 2 years ago ⁺²
Love this!
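The point above about the bias term can be made concrete. A minimal sketch, assuming a model trained on rebalanced (50/50) data and a true squirrel base rate of about 80% (the 4-to-1 ratio mentioned elsewhere in this thread): rescaling the predicted odds by the ratio of true to training prior odds is a standard prior-shift correction. The function name `correct_for_prior` is made up for illustration and is not part of tidymodels.

```python
def correct_for_prior(p_balanced, train_prior, true_prior):
    """Rescale a predicted probability from the training prior to the true prior.

    Works on the odds scale: multiply the predicted odds by the ratio of the
    true prior odds to the training prior odds. Assumes 0 < p_balanced < 1.
    """
    odds = p_balanced / (1.0 - p_balanced)
    train_odds = train_prior / (1.0 - train_prior)
    true_odds = true_prior / (1.0 - true_prior)
    corrected = odds * true_odds / train_odds
    return corrected / (1.0 + corrected)


# A "coin flip" prediction from the rebalanced model, moved back to the
# assumed real-world base rate of 80% squirrels.
p = correct_for_prior(0.5, train_prior=0.5, true_prior=0.8)
print(round(p, 3))
```

When the training and true priors match, the function returns the input unchanged, which is one way to see that the shift lives entirely in the intercept.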
@natarajanlalgudi 1 year ago
Downsampling will have an impact in production, as it affects the model's ability to generalize to unseen data. A weighted loss function approach could actually yield far lower variance and far better model performance on unseen data outside of the training and validation process.
@JuliaSilge 1 year ago
@natarajanlalgudi In tidymodels, a similar/related approach is tuning using a custom cost function for classification:
yardstick.tidymodels.org/reference/classification_cost.html
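For readers unfamiliar with the weighted-loss idea raised in this exchange, here is a generic sketch of a class-weighted log loss. It is not the yardstick `classification_cost()` implementation; the function name `weighted_log_loss` and the example weights are assumptions chosen only for illustration.

```python
import math


def weighted_log_loss(y_true, p_pred, class_weights):
    """Mean negative log-likelihood, with each row weighted by its true class.

    y_true: 0/1 labels; p_pred: predicted probability of class 1;
    class_weights: dict mapping each class label to a positive weight.
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        w = class_weights[y]
        p = min(max(p, 1e-15), 1 - 1e-15)  # clip to avoid log(0)
        total += -w * (math.log(p) if y == 1 else math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical 4-to-1 data: up-weight the rarer class (label 0) by 4
# instead of throwing away rows from the common class.
labels = [1, 1, 1, 1, 0]
probs = [0.9, 0.8, 0.7, 0.9, 0.6]
print(weighted_log_loss(labels, probs, {0: 4.0, 1: 1.0}))
```

Unlike downsampling, this keeps every row in the training data; only the contribution of each row to the loss changes.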
@alexandroskatsiferis 2 years ago ⁺¹
Nice demonstration showing the complexity of imbalanced classes. An issue with choosing specificity, sensitivity, and similar metrics is that they all depend on the decision threshold (in this case 0.5), which further complicates decision making.
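The threshold dependence described in this comment is easy to see on a toy example. The labels and probabilities below are invented for illustration and do not come from the bird feeder data.

```python
def sens_spec(y_true, p_pred, threshold):
    """Return (sensitivity, specificity) for a given decision threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, p_pred):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Made-up predictions: four positives followed by four negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
p_pred = [0.9, 0.7, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, sens_spec(y_true, p_pred, threshold))
```

The same set of predicted probabilities trades sensitivity for specificity as the threshold moves, so comparing models on these metrics at a single threshold can be misleading.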
@yangyang6008 2 years ago ⁺¹
Hi Julia, how can we define a class imbalance? In the example, there are 4 times as many "squirrels" as "no squirrels". If there are only 1.5 times as many "squirrels" as "no squirrels", is it still called imbalance?
@JuliaSilge 2 years ago
I think anything other than perfect balance (i.e. the categories are equal) is imbalance, but in typical modeling projects you don't start having problems until you have proportions like 5-to-1 or 10-to-1.
@yangyang6008 2 years ago
@JuliaSilge Thank you for your help Julia!
@natarajanlalgudi 1 year ago
@JuliaSilge I'm guessing 4:1 is on the borderline of "serious imbalance". Some learners may be tuned better using resampling or penalties, and some may not.
@517127 2 years ago
Excellent work. I learn a lot from your videos.
@ismaelmontero4811 1 year ago
Hi Julia, thank you very much for your videos. I have a question. I have a dataset that only has nominal variables transformed as factors (it's a classification problem), however, when I try to use your code, I get an error:
error: Some columns are non-numeric. The data cannot be converted to numeric matrix: 'ICode_Weather', 'ICode_Gender', 'ICategory_Age', 'iCode_Accident_Category', 'ICategory_Vehicle', 'ICategory_Time', 'BDrugs', 'BAlcohol', 'Week_Day', 'IZone'.
There were issues with some computations A: x1
Can you give some advice? Thank you very much.
@JuliaSilge 1 year ago
You'll want to convert those to dummy or indicator variables using `step_dummy()`. Read more about this here:
recipes.tidymodels.org/articles/Dummies.html
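To illustrate what converting a factor to dummy or indicator variables means, here is a language-agnostic sketch in plain Python. It is not the recipes implementation; the function `dummy_encode` and the example values are hypothetical, and the column name `Week_Day` is borrowed from the error message above. Like the default in `step_dummy()`, it drops one reference level so that a factor with C levels becomes C - 1 numeric columns.

```python
def dummy_encode(values, name, drop_first=True):
    """Turn a list of category labels into rows of 0/1 indicator columns.

    The first level (in sorted order) is treated as the reference level and
    dropped when drop_first is True.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return [{f"{name}_{lvl}": int(v == lvl) for lvl in kept} for v in values]


# Hypothetical values for a nominal predictor.
week_day = ["Mon", "Tue", "Mon", "Wed"]
for row in dummy_encode(week_day, "Week_Day"):
    print(row)
```

After this kind of transformation every predictor is numeric, which is what models requiring a numeric matrix expect.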
@ismaelmontero4811 1 year ago
@JuliaSilge Thank you for the information you shared, it was helpful. Do you know of any ways I could obtain the marginal effects?
@JuliaSilge 1 year ago
@ismaelmontero4811 Many of the typical methods for getting marginal effects will work just fine. Here is an example of generating partial dependence profiles: www.tmwr.org/explain.html#building-global-explanations-from-local-explanations
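As a rough idea of what a partial dependence profile computes, here is a minimal sketch with a toy prediction function. The model `toy_predict`, its coefficients, and the feature names `x1` and `x2` are all invented for illustration; the linked chapter shows how to do this for real tidymodels fits.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows, with `feature` forced to each grid value."""
    profile = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        profile.append((value, sum(preds) / len(preds)))
    return profile


def toy_predict(row):
    # Hypothetical fitted model, linear in two features.
    return 0.2 + 0.1 * row["x1"] + 0.05 * row["x2"]


rows = [{"x1": 0, "x2": 1}, {"x1": 1, "x2": 3}]
profile = partial_dependence(toy_predict, rows, "x1", [0, 1, 2])
for value, mean_pred in profile:
    print(value, round(mean_pred, 3))
```

Each point on the profile answers "what would the model predict on average if every observation had this value of the feature", which is why it serves as a global summary built from per-row predictions.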
@CaribouDataScience 2 years ago
Thanks for sharing!!
@shauryamehta5339 2 years ago
Hi, I have a question: if I use two or more different models in my workflow set with two different specifications, how many models in total will be computed? For example, say I want to compute two models, one using regularized regression and the other a tree-based model, with two different specifications, one without downsampling and the other with downsampling. Will 4 models be computed in total? Two for regularized regression and two for, say, random forest?
Thanks
@JuliaSilge 2 years ago ⁺¹
If I'm understanding you correctly, it sounds like you will have 4 models (logistic regression + downsampling, logistic regression without, tree-based + downsampling, tree-based without). When you decide to compare them, they will be fit to your resamples. If you have 10 folds, then you will fit 40 models to understand which will be the right one for you.
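The arithmetic in this reply can be sketched as a cross product of models and preprocessing choices, multiplied by the number of folds. The labels below are placeholders, and the count assumes no hyperparameter tuning; each tuning candidate would multiply the number of fits further.

```python
from itertools import product

models = ["logistic_regression", "tree_based"]
preprocessors = ["downsampled", "not_downsampled"]
n_folds = 10

# Every model is paired with every preprocessing choice.
workflows = list(product(models, preprocessors))

# Each workflow is fit once per resample.
total_fits = len(workflows) * n_folds

print(len(workflows), "workflows,", total_fits, "fits")
```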
@yangyang6008 2 years ago ⁺¹
Hi Julia, thank you for the amazing tutorial! I wonder if it is possible to include Extreme Learning Machines in Tidymodels? Extreme learning machine (ELM) is a training algorithm for single hidden layer feedforward neural network (SLFN), which converges much faster than traditional methods and yields promising performance. The algorithm is currently included in the R package "elmNNRcpp" and "ELMR". Thank you.
@JuliaSilge 2 years ago
Not currently, no! You might be interested in learning how to create a parsnip model for it, like this:
www.tidymodels.org/learn/develop/models/
Feel free to ask on GitHub or RStudio Community if you run into problems!
@yangyang6008 2 years ago ⁺¹
@JuliaSilge Thank you Julia and I will try to create a parsnip model for ELM. Hopefully, Tidymodels will update to include the algorithm in the future as ELM is very popular nowadays in machine learning.
@joshuapooley8993 2 years ago
I am not sure if @ijessup is into data science, but if she were then this would be the video for her. #Gary
@xxXXCarbon6XXxx 2 years ago ⁺²
I love squirrels, they are so cute so I could never be a hater. We were in Washington at the Vietnam memorial wall & my brother-in-law offered a squirrel a piece of banana. It bit his finger and I laughed so hard (yes they may have rabies!). Adorable.
@cuysaurus 2 years ago
You look awesome, Julia.