To downsample or not? Handling class imbalance in bird feeder observations

  • Published 15 Sep 2024
  • Will squirrels come eat from your bird feeder? Let's fit a model with #TidyTuesday data on bird feeders both with and without downsampling to find out. Check out the code on my blog: juliasilge.com...

COMMENTS • 23

  • @wouldntyaliktono 1 year ago +11

    One way I like to think about this question of downsampling is whether it alters the bias term of my model. Rebalancing the data will force the model to assume that the global average probability of SQUIRREL is 50%, but that isn't the case in the empirical data. And that can affect how successful my models are when they're deployed to production.

    • @JuliaSilge 1 year ago +2

      Love this!

    • @natarajanlalgudi 1 year ago

      Downsampling will have an impact in production, because it affects the model's ability to generalize to unseen data. A weighted loss function approach could actually yield far lower variance and far better model performance on unseen data outside of the training and validation process.

    • @JuliaSilge 1 year ago

      @natarajanlalgudi In tidymodels, a similar/related approach is tuning using a custom cost function for classification:
      yardstick.tidymodels.org/reference/classification_cost.html
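
The "bias term" point at the top of this thread can be made concrete with a little arithmetic. The following is only a minimal sketch in plain Python, not code from the video or from tidymodels: the function name and its `keep_pos` / `keep_neg` arguments are hypothetical, and the numbers assume the roughly 4-to-1 squirrel ratio mentioned elsewhere in these comments. Dropping rows from one class multiplies the odds the model sees by a constant, which for a logistic model amounts to a shift in the intercept, and that shift can be undone after the fact.

```python
import math


def correct_downsampled_probability(p_model, keep_pos=1.0, keep_neg=1.0):
    """Map a probability from a model fit on resampled data back to the
    original class balance.

    keep_pos / keep_neg are the fractions of positive / negative rows that
    were kept when resampling (hypothetical argument names).
    """
    odds_sampled = p_model / (1.0 - p_model)
    # Resampling scaled the odds by keep_pos / keep_neg, so undo that factor.
    odds_original = odds_sampled * (keep_neg / keep_pos)
    return odds_original / (1.0 + odds_original)


# Assumed setup: 80% of rows are "squirrels" (the positive class), and we
# downsample to 50/50 by keeping 1 in 4 squirrel rows and every other row.
p = correct_downsampled_probability(0.5, keep_pos=0.25, keep_neg=1.0)

# The same factor on the log-odds scale: how far downsampling moves the intercept.
intercept_shift = math.log(0.25 / 1.0)

print(round(p, 3), round(intercept_shift, 3))
```

Under these assumptions, a prediction of 0.5 from the rebalanced model corresponds to 0.8 at the original class balance, which is the empirical base rate the first comment is worried about losing.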

  • @xxXXCarbon6XXxx 1 year ago +2

    I love squirrels, they are so cute so I could never be a hater. We were in Washington at the Vietnam memorial wall & my brother-in-law offered a squirrel a piece of banana. It bit his finger and I laughed so hard (yes they may have rabies!). Adorable.

  • @alexandroskatsiferis 1 year ago +1

    Nice demonstration showing the complexity of imbalanced classes. An issue with choosing specificity, sensitivity, and similar metrics is that they all depend on the decision threshold (in this case 0.5), which further complicates decision making.
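
To illustrate the threshold dependence described above, here is a small sketch in plain Python. The labels and predicted probabilities are made up for the example, and `sensitivity_specificity` is a hypothetical helper, not a yardstick function.

```python
def sensitivity_specificity(y_true, p_pred, threshold):
    """Return (sensitivity, specificity) when predicting 1 for p >= threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, p_pred):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example data: 4 positives and 4 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
p_pred = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

# The same predictions give different metric values at each cutoff.
results = {t: sensitivity_specificity(y_true, p_pred, t) for t in (0.3, 0.5, 0.7)}
for threshold, (sens, spec) in results.items():
    print(threshold, sens, spec)
```

The model and its predictions never change here; only the cutoff does, which is why threshold-free summaries such as ROC AUC are often reported alongside these metrics.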

  • @517127 1 year ago

    Excellent work. I learn a lot from your videos.

  • @CaribouDataScience 1 year ago

    Thanks for sharing!!

  • @yangyang6008 1 year ago +1

    Hi Julia, how can we define a class imbalance? In the example, "squirrels" is 4 times more than "no squirrels". If "squirrels" is only 1.5 times more than "no squirrels", is it still called imbalance?

    • @JuliaSilge 1 year ago

      I think anything other than perfect balance (i.e. the categories are equal) is imbalance, but in typical modeling projects you don't start having problems until you have proportions like 5-to-1 or 10-to-1.

    • @yangyang6008 1 year ago

      @JuliaSilge Thank you for your help Julia!

    • @natarajanlalgudi 1 year ago

      @JuliaSilge I'm guessing 4:1 is on the borderline of "serious imbalance". Some learners may do better with resampling or penalties, and some may not.

  • @ismaelmontero4811 1 year ago

    Hi Julia, thank you very much for your videos. I have a question. I have a dataset that only has nominal variables transformed as factors (it's a classification problem), however, when I try to use your code, I get an error:
    error: Some columns are non-numeric. The data cannot be converted to numeric matrix: 'ICode_Weather', 'ICode_Gender', 'ICategory_Age', 'iCode_Accident_Category', 'ICategory_Vehicle', 'ICategory_Time', 'BDrugs', 'BAlcohol', 'Week_Day', 'IZone'.
    There were issues with some computations A: x1
    Can you give some advice? Thank you very much.

    • @JuliaSilge 1 year ago

      You'll want to convert those to dummy or indicator variables using `step_dummy()`. Read more about this here:
      recipes.tidymodels.org/articles/Dummies.html

    • @ismaelmontero4811 1 year ago

      @JuliaSilge Thank you for the information you shared, it was helpful. Do you know of any ways I could obtain the marginal effects?

    • @JuliaSilge 1 year ago

      @ismaelmontero4811 Many of the typical methods for getting marginal effects will work just fine. Here is an example of generating partial dependence profiles: www.tmwr.org/explain.html#building-global-explanations-from-local-explanations
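
For readers wondering what the dummy or indicator variables suggested earlier in this thread look like, here is a rough sketch in plain Python. It is not how `step_dummy()` is implemented; `dummy_encode`, the `weather` column, and its values are all hypothetical, and the sketch orders levels alphabetically, whereas an R factor uses its own level order. It does follow the common convention of dropping the first level as the reference category.

```python
def dummy_encode(name, values, drop_first=True):
    """Turn one categorical column into 0/1 indicator columns.

    With drop_first=True the first level (alphabetically, in this sketch)
    is treated as the reference and gets no column of its own.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return [{f"{name}_{level}": int(v == level) for level in kept} for v in values]


# Hypothetical nominal predictor with three levels.
rows = dummy_encode("weather", ["rain", "clear", "snow", "clear"])
for row in rows:
    print(row)
```

Every resulting column is numeric, which is what a model that needs a numeric matrix expects; a row of all zeros stands for the reference level ("clear" here).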

  • @shauryamehta5339 1 year ago

    Hi, I have a question: if I use two different models in my workflow set with two different specifications, how many models will be computed in total? For example, say I want to fit one regularized regression and one tree-based model, each with two specifications, one without downsampling and one with downsampling. Will 4 models be computed in total, two for regularized regression and two for, say, random forest?
    Thanks

    • @JuliaSilge 1 year ago +1

      If I'm understanding you correctly, it sounds like you will have 4 models (logistic regression + downsampling, logistic regression without, tree-based + downsampling, tree-based without). When you decide to compare them, they will be fit to your resamples. If you have 10 folds, then you will fit 40 models to understand which will be the right one for you.
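
The counting in this reply can be spelled out as a quick sketch in plain Python. The labels below are just illustrative strings, and the count assumes no hyperparameter tuning; a tuning grid would multiply the number of fits further.

```python
from itertools import product

# Illustrative labels for the two preprocessing options and two model types.
preprocessors = ["no_downsampling", "downsampling"]
models = ["logistic_regression", "tree_based"]
n_folds = 10

# Every preprocessor is crossed with every model.
workflows = list(product(preprocessors, models))

# Each workflow is fit once per resample.
total_fits = len(workflows) * n_folds

print(len(workflows), total_fits)
```

This gives 4 workflows and 40 fits, matching the numbers in the reply above.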

  • @yangyang6008 1 year ago +1

    Hi Julia, thank you for the amazing tutorial! I wonder if it is possible to include Extreme Learning Machines in Tidymodels? Extreme learning machine (ELM) is a training algorithm for single hidden layer feedforward neural network (SLFN), which converges much faster than traditional methods and yields promising performance. The algorithm is currently included in the R package "elmNNRcpp" and "ELMR". Thank you.

    • @JuliaSilge 1 year ago

      Not currently, no! You might be interested in learning how to create a parsnip model for it, like this:
      www.tidymodels.org/learn/develop/models/
      Feel free to ask on GitHub or RStudio Community if you run into problems!

    • @yangyang6008 1 year ago +1

      @JuliaSilge Thank you Julia and I will try to create a parsnip model for ELM. Hopefully, Tidymodels will update to include the algorithm in the future as ELM is very popular nowadays in machine learning.

  • @cuysaurus 1 year ago

    You look awesome, Julia.

  • @joshuapooley8993 1 year ago

    I am not sure if @ijessup is into data science, but if she were then this would be the video for her. #Gary