Tune xgboost more efficiently with racing methods

  • Published 7 Feb 2025
  • In this screencast, I use data on baseball home runs from the recent episode of #SLICED to build an xgboost model, and tune through possible hyperparameters more efficiently using #rstats tidymodels racing methods. Check out the code on my blog: juliasilge.com...
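
    A minimal sketch of the kind of pipeline the screencast builds (the outcome column, resamples, and grid size here are assumptions, not the exact code from the video; see the blog post for the real thing):

        library(tidymodels)
        library(finetune)

        # tune tree count, minimum node size, and learning rate
        xgb_spec <-
          boost_tree(trees = tune(), min_n = tune(), learn_rate = tune()) %>%
          set_engine("xgboost") %>%
          set_mode("classification")

        xgb_wf <-
          workflow() %>%
          add_formula(is_home_run ~ .) %>%   # hypothetical outcome/predictors
          add_model(xgb_spec)

        # racing discards poor candidates early instead of
        # evaluating every candidate on every resample
        set.seed(123)
        xgb_res <- tune_race_anova(
          xgb_wf,
          resamples = vfold_cv(train),   # assumes a `train` data frame
          grid = 20,                     # integer -> space-filling grid
          control = control_race(verbose_elim = TRUE)
        )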

COMMENTS • 27

  • @mattm9069 · 3 years ago +1

    Your blogs have helped me so much. Tidymodels for life!

  • @MattRosinski · 3 years ago +3

    Thanks Julia! Love the inclusion of a linear model for imputing speed and angle!

  • @mkklindhardt · 3 years ago +3

    Thank you Julia! I have been waiting for this unknowingly for too long. It's a great pleasure to follow your videos, always very insightful! Congratulations on your new space :)

  • @alexandroskatsiferis · 3 years ago

    Another splendid screencast Julia!

  • @pabloormachea3404 · 3 years ago +1

    Impressive! Thanks so much for the educational video - it makes tidymodels very appealing!

  • @deannanuboshi1387 · 2 years ago

    Great video! Do you know how to get a prediction or confidence interval in R? Thanks~~

    • @JuliaSilge · 2 years ago

      An algorithm like xgboost doesn't involve math that can produce one natively (unless I am mistaken), but you can use resampling to create those kinds of intervals: markjrieke.github.io/workboots/
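
      A rough sketch of that resampling approach with workboots (function names as I read them from the package docs at the link above; `xgb_wf`, `train`, and `test` are assumed to be a finalized workflow and data splits):

          library(workboots)

          # fit many bootstrap models, predict on new data with each one
          set.seed(123)
          boot_preds <- predict_boots(
            xgb_wf,                  # finalized, untrained workflow
            n = 200,                 # number of bootstrap resamples/models
            training_data = train,
            new_data = test
          )

          # collapse the bootstrap draws into point estimates with intervals
          summarise_predictions(boot_preds)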

  • @hansmeiser6078 · 3 years ago

    Thank you Julia! I was asking myself what the benefit would be. Can you tell us something about the advantages of tune_sim_anneal() too? And when is it better to fill the param grid with a grid rather than an integer?

    • @JuliaSilge · 3 years ago +1

      When you use an integer, the tune package uses a space-filling design rather than a regular grid design for the possible parameters to try. You can read about these two kinds of grids here:
      www.tmwr.org/grid-search.html#grids
      We write a bit about iterative search with simulated annealing here:
      www.tmwr.org/iterative-search.html#simulated-annealing
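
      For example, a quick sketch contrasting the two grid types for a couple of xgboost parameters:

          library(dials)

          # regular grid: 5 levels per parameter, fully crossed
          grid_regular(trees(), learn_rate(), levels = 5)

          # space-filling grid: 25 semi-random points spread over the space
          grid_latin_hypercube(trees(), learn_rate(), size = 25)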

    • @hansmeiser6078 · 3 years ago

      @@JuliaSilge But when I fill the grid param with grid_latin or max_entropy, that would be space-filling too, or do I misunderstand this?

    • @hansmeiser6078 · 3 years ago

      Simulated annealing is tough stuff... hope you make a video about it.

    • @JuliaSilge · 3 years ago

      @@hansmeiser6078 Yes, that's right. If you pass an integer, then it uses `grid_latin_hypercube()` by default to make a semi-random space-filling grid:
      tune.tidymodels.org/reference/tune_grid.html#parameter-grids
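
      In other words, these two calls end up in roughly the same place (a sketch; `xgb_wf` and `folds` are assumed from the screencast):

          # let tune build a space-filling grid of 20 candidates
          tune_grid(xgb_wf, resamples = folds, grid = 20)

          # or build that kind of grid yourself and pass it in
          xgb_grid <- grid_latin_hypercube(trees(), learn_rate(), size = 20)
          tune_grid(xgb_wf, resamples = folds, grid = xgb_grid)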

    • @hansmeiser6078 · 3 years ago

      @@JuliaSilge In a regression case, what is better for tune_bayes(), tune_sim_anneal(), tune_race_anova(): to provide an external tuned grid (maybe grid_latin or grid_regular), or an integer? Where is the benefit? Could we avoid overhead or some redundancy?

  • @gkuleck · 1 year ago

    Hi Julia,
    Nice video on a topic I find intrinsically interesting as a baseball AND tidymodels fan. I did run into an error when executing tune_race_anova():
    Error in `test_parameters_gls()`:
    ! There were no valid metrics for the ANOVA model.
    I am not sure how to fix this, and I have been careful to follow the scripts. Any idea what might be causing the error?

    • @JuliaSilge · 1 year ago

      When you see an error like that, it usually means your models are not able to fit/train. If you ever run into trouble with a workflow set or racing method like this, I recommend trying to just plain _fit_ the workflow on your training data one time, or use plain old `tune_grid()`. You will likely get a better understanding of where the problems are cropping up.
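
      A sketch of those two debugging steps (`xgb_wf`, `train`, and `folds` are assumed):

          # fit the workflow once on the training data; errors surface directly
          fit(xgb_wf, data = train)

          # or fall back to plain grid search, which records per-model failures
          tune_grid(xgb_wf, resamples = folds, grid = 5)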

  • @juliantagell1891 · 3 years ago

    Cheers Julia, great video! I have been wondering about xgboost a bit lately, in regard to using tidymodels vs. using the underlying xgboost package directly with xgb.train(). I've heard mention that xgb.train() has an "automatic stop" that limits the number of trees when no more improvement is detected. This seems pretty helpful (and a great processing-time saver) rather than having to pre-specify the number of trees used. But I'm certainly not a pro at xgboost, so I was just wondering your opinion. I like that tidymodels can be applied to all models, but I was wondering if, in doing so, this comes at a cost (for xgboost tuning, specifically).

    • @JuliaSilge · 3 years ago +3

      Yes, you can specify this (and even tune it to find the best value) in tidymodels. We call this early stopping parameter `stop_iter`:
      parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html
      I used it in the last episode of SLICED I was on (with the Spotify dataset) if you want to watch that to see it in action, but I'll try to put together a tutorial/blog post demoing that soon.
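
      A sketch of what early stopping looks like in a parsnip spec (the specific values here are assumptions; `validation` is the engine argument that holds out part of the training set to watch for improvement):

          boost_tree(trees = 1000, stop_iter = 10) %>%
            set_engine("xgboost", validation = 0.2) %>%
            set_mode("classification")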

  • @AndreaDalseno · 3 years ago

    Thank you very much once more for your videos, Julia. Another question for you: is there a way to have a progress bar or something like that to monitor the tuning process (which may take a long time to run)?

    • @JuliaSilge · 3 years ago +1

      We don't have support for a progress bar due to how we use parallel workers (we are considering using the future package for this, though, which may open up other options), but you can set various `verbose` options in `control_race()` that may give you some of what you want:
      finetune.tidymodels.org/reference/control_race.html
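
      For example (a sketch; `xgb_wf` and `folds` are assumed):

          race_ctrl <- control_race(verbose = TRUE, verbose_elim = TRUE)

          tune_race_anova(xgb_wf, resamples = folds, grid = 20, control = race_ctrl)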

    • @AndreaDalseno · 3 years ago

      @@JuliaSilge Thank you very much for your kind reply. I tried to use control_grid(verbose = TRUE) in the random forest example, just before fitting the grid, but I couldn't get any output (with parallel processing). Can you kindly give me an example? I will check out the future package.

    • @JuliaSilge · 3 years ago

      @@AndreaDalseno Ah, I'm sorry I wasn't more clear; we are considering adding support for the future package, which will likely allow for better progress messaging in the... future. I'm not sure if the `verbose` option will work right now. Here is an example to try:
      github.com/tidymodels/tune/issues/377

    • @AndreaDalseno · 3 years ago

      @@JuliaSilge thank you very much for your hint. I did:
      regular_res

    • @JuliaSilge · 3 years ago

      @@AndreaDalseno Yes, you can read more about the current status of how parallel processing works here:
      tune.tidymodels.org/articles/extras/optimizations.html#parallel-processing-1
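
      The usual foreach-backend setup that tune detects automatically looks like this (a sketch; the core count is arbitrary):

          library(doParallel)

          cl <- makePSOCKcluster(4)
          registerDoParallel(cl)

          # ... run tune_grid() / tune_race_anova() here ...

          stopCluster(cl)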

  • @recordyao · 3 years ago

    Hi Julia. Great tutorial! I think it's a great time-saving solution for tuning random grid points. It would be awesome if tune_race_anova() could work with tune_bayes(), in that once random grids are selected by tune_race_anova(), they could be passed as "initial" into tune_bayes() to fine-tune the best. But currently it does not work, as tune_race_anova() only finishes one point that fits all folds, and tune_bayes() needs at least the same number as tuning parameters. Is there a way around this? Again, great work! : )

    • @JuliaSilge · 3 years ago +2

      Ah no, this doesn't currently work, as the infrastructure for tune_bayes() expects all the tuning parameters to have been evaluated completely on resamples. You could post an issue on the repo asking if tune_bayes() could be changed to accept the subset, and we could discuss it there: github.com/tidymodels/tune/issues

    • @recordyao · 3 years ago

      @@JuliaSilge Thanks for pointing to the right place. It'll be awesome if the two can be combined. But of course, it'll be a lot of work for the developers. We users take things for granted haha.

  • @jacquesboubou · 1 year ago

    Thank you so much! Great presentation. I have learned a lot.
    New subscription!