Tuning XGBoost using tidymodels

  • Published 22 Jul 2024
  • Learn how to tune hyperparameters for an XGBoost model, using #TidyTuesday data on beach volleyball matches.
    You can check out the code here on my blog: juliasilge.com/blog/xgboost-t...
  • Science & Technology

COMMENTS • 84

  • @kemikao
    @kemikao 4 years ago +12

    Your videos are very informative! I love that you take the time to show the data first and explain what the variables are.
    And the fact that you explain the tidy functions and even repeat a bit of what you said in earlier videos is great! You use just the right amount of detail for me at least. Thank you.

  • @mygeorgyboy
    @mygeorgyboy 2 years ago +1

    Very nice example. You show the whole process, very illustrative. Thank you, Julia.

  • @mathteacher1729
    @mathteacher1729 4 years ago +9

    Thank you so much for this video (and for all your videos). I've been using R for about two or three years, and this was just the right amount of detail and exposition for me. Your workflow is clean and easy to follow, I like how you use the help function, and your overall layout is nice too (console in the top right). I look forward to trying XGBoost on some data sets now! :)

  • @talitabac
    @talitabac 3 years ago

    Amazing video, super clear! Thank you, Julia!

  • @mehdi1270
    @mehdi1270 3 years ago +1

    Thank you so much Julia for all your tutorial videos. They are easy to follow and very informative... just great! Please keep posting them. I hope you can find some time to post a video on neural network optimization with Keras in R. I can even start a petition for that. LOL

  • @BonifaceMakone
    @BonifaceMakone 4 years ago +1

    These videos are super informative. Keep them coming. Thanks

  • @flachboard84
    @flachboard84 2 years ago

    Very helpful video! I look forward to following this example in a future project!

  • @davidjackson7675
    @davidjackson7675 4 years ago +2

    I always learn something from your videos.

  • @luisfernandocuestasanchez4343
    @luisfernandocuestasanchez4343 3 years ago

    You are the most amazing person I've ever come across
    Thanks a lot
    Blessings =)

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    Very nice presentation of xgboost by the way.

  • @badrGamer11
    @badrGamer11 2 years ago

    Always an amazing content thank you

  • @edneideramalho2363
    @edneideramalho2363 10 months ago

    You are the best!

  • @erickcohen1876
    @erickcohen1876 4 years ago +3

    Hi Julia, this video was amazing and very informative! Would you be able to help us find resources for (or post a video about :) ) the math behind these models? I.e., gradient descent for XGBoost models. Thank you very much for posting these videos! I am learning a ton!

  • @JoseAyerdis
    @JoseAyerdis 4 years ago

    If you get an RStudio crash related to "Initializing libomp.dylib, but found libomp.dylib already initialized" when fitting the final workflow, you can use a workaround on macOS:
    Sys.setenv(KMP_DUPLICATE_LIB_OK = TRUE)

  • @geilin2394
    @geilin2394 4 years ago

    These vids are great. Can we see a classification model with calibration curves, and then recalibrate it, within the tidymodels framework? How long did the hyperparameter tuning take here?

  • @amahoela730
    @amahoela730 3 years ago

    Does anyone know how you can save the workflow for later use? I have problems with it since it is not of class 'xgb.Booster', and using saveRDS() might result in compatibility issues with future package versions.

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    I should mention that the Mac mini ran this quietly and I heard no noise from an overworked fan. The unit is also cool to the touch.

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    Julia,
    I ran this model on a new Mac mini and it produced results in about 7 minutes. Much faster than my old Mac desktop, which I did not dare run it on.

  • @alanjiang2930
    @alanjiang2930 3 years ago +1

    Watched more than half of your videos within one week. Don't even want to blink! Saw you plotted XGB importance - wonder if there is a tidymodels way to plot SHAP values from XGBoost. Thanks, Julia!

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      If you are only doing xgboost, you might try the SHAPforxgboost package: cran.r-project.org/package=SHAPforxgboost (it takes a bit of munging the model to get it to work with that package)
      For modeling in general, I like DALEX for explainability, which also supports tidymodels:
      modeloriented.github.io/DALEXtra/reference/explain_tidymodels.html
      We have a chapter in process on explainability in our upcoming book, so keep your eyes out for that:
      www.tmwr.org/
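      For example, a rough R sketch of that munging might look like this (final_res and vb_train follow this video's object names, and the as.matrix() step assumes all-numeric predictors; adapt to your own workflow):

      library(tidymodels)
      library(SHAPforxgboost)

      # Pull the raw xgb.Booster out of the fitted tidymodels workflow
      booster <- extract_fit_engine(extract_workflow(final_res))

      # SHAPforxgboost wants the numeric feature matrix the booster was trained on;
      # this simple as.matrix() only works if every predictor is numeric
      x_train <- vb_train %>% dplyr::select(-win) %>% as.matrix()

      shap_long <- shap.prep(xgb_model = booster, X_train = x_train)
      shap.plot.summary(shap_long)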

    • @alanjiang2930
      @alanjiang2930 3 years ago

      @@JuliaSilge Got it. Thanks for the direction! Again, amazing video series! Really really tidy.

  • @lucaskramer438
    @lucaskramer438 4 years ago +1

    Great explanation, but I have one question: when you call last_fit() you make use of your split object. In my particular case I was only provided with the train and test set initially, so I don't have a split object. Is there any way to call last_fit() nevertheless? Thanks!

    • @JuliaSilge
      @JuliaSilge  4 years ago

      You can't call last_fit() directly if you don't have the split, but you *can* manually do what it is a wrapper for, which is train one last time on the training data and then evaluate one last time on the testing data.
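      In code, that manual version might look something like this (xgb_final, train_df, test_df, and the win outcome are placeholder names, not objects from the video):

      library(tidymodels)

      # Fit the finalized workflow one last time on the training data
      fitted_wf <- fit(xgb_final, data = train_df)

      # Then evaluate once on the held-out test data, as last_fit() would
      augment(fitted_wf, new_data = test_df) %>%
        roc_auc(truth = win, .pred_win)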

  • @faiazrummankhan5589
    @faiazrummankhan5589 3 years ago

    All your videos are such a great learning resource for real-world EDA and modelling. I was just wondering what theme you are using in RStudio?

    • @JuliaSilge
      @JuliaSilge  3 years ago

      It's one of the themes available via rsthemes: www.garrickadenbuie.com/project/rsthemes/

  • @raminziaei6411
    @raminziaei6411 4 years ago +1

    Thanks a lot Julia. I really love your videos. Do you have any plans for making a video on neural network and tuning it in tidymodels? That would be awesome if possible. Please continue these videos. They are really great.
    Cheers

  • @JerryWho49
    @JerryWho49 9 months ago

    Great video, thanks. But I’ve got a question. Say, my local computer is too small to fit a model fast enough. How would I train a model in the cloud? Do you have any best practices?

    • @JuliaSilge
      @JuliaSilge  9 months ago

      One of the easiest ways to go is to use RStudio on SageMaker:
      posit.co/blog/getting-started-rstudio-sagemaker/

  • @briancostello939
    @briancostello939 4 years ago

    Great video! Is there any difference between “pivot_longer” and “gather”? They look identical to me, just with the arguments having different names, but want to make sure I’m not missing something.

    • @JuliaSilge
      @JuliaSilge  4 years ago +1

      You can read this blog post that introduced the pivot verbs: www.tidyverse.org/blog/2019/09/tidyr-1-0-0/
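      As a tiny illustration (made-up data, not from the video), the two verbs do the same reshaping; pivot_longer() is just the newer interface with clearer arguments:

      library(tidyr)

      df <- tibble::tibble(team = c("a", "b"), kills = c(10, 12), errors = c(3, 5))

      # older verb
      gather(df, key = "stat", value = "value", kills, errors)

      # newer verb, same result
      pivot_longer(df, c(kills, errors), names_to = "stat", values_to = "value")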

    • @briancostello939
      @briancostello939 4 years ago

      Julia Silge oh awesome thanks!

  • @angvl8793
    @angvl8793 1 year ago

    Hi Julia! Great video as always :) Can I ask you something, please? At around 34:08, if we don't want to use the xgb_grid you built and instead pass something else to the grid argument of tune_grid(), say grid = 50, is that ok? I mean, generally, is it ok to set grid equal to a number? Thank you very much!

    • @JuliaSilge
      @JuliaSilge  1 year ago +1

      Yes, that argument can take a couple of different kinds of values, either a dataframe or an integer value:
      tune.tidymodels.org/reference/tune_grid.html
      You can read a bit more about this here:
      www.tmwr.org/grid-search.html#evaluating-grid
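      As a sketch (reusing this video's xgb_wf, vb_folds, and xgb_grid object names), both of these are valid calls:

      library(tidymodels)
      set.seed(123)

      # a data frame of candidate values you built yourself
      xgb_res_manual <- tune_grid(xgb_wf, resamples = vb_folds, grid = xgb_grid)

      # an integer: tune_grid() builds a space-filling grid of that size for you
      xgb_res_auto <- tune_grid(xgb_wf, resamples = vb_folds, grid = 50)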

    • @angvl8793
      @angvl8793 1 year ago

      @@JuliaSilge Thank you again ! :) .

  • @gkuleck
    @gkuleck 11 months ago

    Hi Julia! Great video. Have you done a video on multiclass classification? I am struggling to find guidance for this type of problem with text classification. Thanks!!

    • @JuliaSilge
      @JuliaSilge  10 months ago

      Check out these two:
      - juliasilge.com/blog/nber-papers/
      - juliasilge.com/blog/multinomial-volcano-eruptions/

    • @gkuleck
      @gkuleck 10 months ago

      Thank you!

  • @artathearta
    @artathearta 3 years ago

    48:44 My autoplot was flipped along the x = y line, and I wonder why.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      It's because of a global change in how yardstick finds the "first" or base level event:
      juliasilge.com/blog/xgboost-tune-volleyball/#comment-5015180544

  • @Matthew-px9nu
    @Matthew-px9nu 4 years ago

    Julia, thank you for these great videos, keep it up! Quick question: after using last_fit(), if I want to predict on NEW data, what are the workflow steps? last_fit() doesn't really work on new data that wasn't in the original split. Thank you!

    • @JuliaSilge
      @JuliaSilge  4 years ago +2

      Once you get to last_fit(), check out the objects that are inside of it. One of the columns contains a *fitted model* that can be used on new data. In fact, that fitted model is used on the testing data to compute the metrics!

    • @Matthew-px9nu
      @Matthew-px9nu 4 years ago

      @@JuliaSilge Thank you Julia! Last quick Q: I noticed you always run the commands in the console from the notebook Rmd. What button do you click to run in the console instead of in the notebook?

    • @JuliaSilge
      @JuliaSilge  4 years ago +1

      @@Matthew-px9nu That's probably my most used keyboard shortcut! Ctrl+Shift+Enter for a chunk, Cmd+Enter for a line
      In RStudio, you can find them under Tools -> Keyboard Shortcuts Help, but there's just a handful that I use regularly.

    • @vincentpepe1064
      @vincentpepe1064 4 years ago

      @@JuliaSilge Hi Julia! Where do I find this exactly? The columns I have are splits, id, .metrics, .notes, .predictions, .workflow. I can't find the fitted model in .workflow either, so I'm not sure where it is. Thanks!

    • @JuliaSilge
      @JuliaSilge  4 years ago +2

      @@vincentpepe1064 The .workflow is a *fitted* workflow at this point. For example, try tidying it or predicting on it. I show how to tidy it here: juliasilge.com/blog/palmer-penguins/
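      A minimal sketch of that (final_res is the result of last_fit(); new_df is a placeholder for new data with the same predictor columns):

      library(tidymodels)

      # The fitted workflow inside the last_fit() result
      fitted_wf <- extract_workflow(final_res)
      # (on older versions of tune: fitted_wf <- final_res$.workflow[[1]])

      predict(fitted_wf, new_data = new_df)                 # predicted class
      predict(fitted_wf, new_data = new_df, type = "prob")  # class probabilities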

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    Julia,
    I was able to follow along and everything looked fine until the final roc_auc curve. I get a mirror image of your curve. I have combed through the code and found nothing wrong. The confusion matrix outcome is similar to yours, etc. It seems like a systematic error. When I looked at the data that generates the curve, I noticed that my numbers for specificity are somehow switched. While your table starts with a specificity of 1, mine starts at zero, so the values seem more like 1 - specificity in my case. I am puzzled.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      You can look at the first comment at the relevant blog post here:
      juliasilge.com/blog/xgboost-tune-volleyball/
      Since I published this blog post, there was a change in yardstick in version 0.0.7:
      github.com/tidymodels/yardstick/blob/master/NEWS.md#yardstick-007
      that changed how to choose which level (win or lose) is the "event". You can change this by using the `event_level` argument for functions like `roc_curve()`:
      yardstick.tidymodels.org/reference/roc_curve.html
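      As a sketch against this video's objects (assuming the level you care about, "win", is the second level of your outcome factor):

      library(tidymodels)

      collect_predictions(final_res) %>%
        roc_curve(win, .pred_win, event_level = "second") %>%
        autoplot()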

  • @shamsulhoquekhan933
    @shamsulhoquekhan933 1 year ago

    Can someone tell me why we used sample_prop inside the search grid?

    • @JuliaSilge
      @JuliaSilge  1 year ago

      It's what proportion of the total available sample is used for modeling within one boosting iteration:
      dials.tidymodels.org/reference/trees.html#details
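      The tuning grid in the video is built roughly like this (vb_train is the training data from the video; double-check the exact parameters and size against the linked blog post):

      library(tidymodels)

      xgb_grid <- grid_latin_hypercube(
        tree_depth(),
        min_n(),
        loss_reduction(),
        sample_size = sample_prop(),   # a proportion of rows per boosting iteration
        finalize(mtry(), vb_train),
        learn_rate(),
        size = 30
      )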

  • @dudeadulto
    @dudeadulto 4 years ago

    Hi, I'm getting a warning/error: ! Fold01: model 1/20: The `x` argument of `as_tibble.matrix()` must have colum...
    when the tune_grid() function runs... I found in a GitHub issue that it's related to "name repairing"...
    Do you have any idea if it really affects the results of the tuning process, or if there's an update/solution for it?

    • @JuliaSilge
      @JuliaSilge  4 years ago +1

      Hmmmm, do you want to make sure all your packages are updated? That sounds like a message from an older version of the packages. If you are still getting that warning, I recommend creating a reprex and posting on RStudio Community: community.rstudio.com/c/ml/15

    • @dudeadulto
      @dudeadulto 4 years ago

      @@JuliaSilge After reading your response, I did update all my packages, and the error still occurs, but the process seems to keep running. I will let it finish and see if it affects the results of tune_grid().

  • @deltax7159
    @deltax7159 3 months ago

    What appearance theme are you using here?

    • @JuliaSilge
      @JuliaSilge  3 months ago

      I use one of the themes from rsthemes:
      www.garrickadenbuie.com/project/rsthemes/
      I think Oceanic Plus? There are lots of nice ones available in that package.

  • @tamaraabzhandadze2712
    @tamaraabzhandadze2712 3 years ago

    Thank you for the great tutorial. I have been having a problem with a confusion matrix. Namely, when I run the code "final_res_r %>%
    collect_predictions() %>% roc_curve(dependent_var, .pred_dependent_var) %>% autoplot()", I get the error: Can't subset columns that don't exist.
    x Column `.pred_dependent_var` doesn't exist. I cannot understand how to solve the problem. What am I doing wrong?

    • @JuliaSilge
      @JuliaSilge  3 years ago +2

      Hmmmm, do you see the column with the predicted class probability in it, after you run `collect_predictions()`? You can check out the documentation for `roc_curve()` here:
      yardstick.tidymodels.org/reference/roc_curve.html
      And if you continue to have trouble, I recommend creating a reprex and posting it on RStudio Community:
      rstd.io/tidymodels-community
      It's often easier to get help with coding problems in a format like that rather than comments.
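      One thing worth checking, as a sketch: the probability column is named after the outcome's factor level (.pred_<level>), not after the variable name. The object and column names below (final_res_r, .pred_yes) are placeholders:

      preds <- collect_predictions(final_res_r)
      names(preds)   # look for columns like .pred_yes / .pred_no

      preds %>%
        roc_curve(dependent_var, .pred_yes) %>%   # use the actual .pred_<level> column
        autoplot()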

    • @tamaraabzhandadze2712
      @tamaraabzhandadze2712 3 years ago

      @@JuliaSilge Dear Julia! Just amazing to read your response :). I have solved that problem :). However, another problem that I could not solve was related to variable importance. I managed to create a figure, but I cannot get the actual values per variable. I tried to use varImp(model_name) and xgb.importance(model = model_name), but I'm just getting lovely red text, without the results :)

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      @@tamaraabzhandadze2712 I typically use the vip package for variable importance, as I show in this blog post:
      juliasilge.com/blog/xgboost-tune-volleyball/
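      A sketch of that approach, using this video's final_xgb workflow and vb_train data; vi() returns the numeric importance values per variable, and vip() plots them:

      library(tidymodels)
      library(vip)

      fitted <- fit(final_xgb, data = vb_train)

      fitted %>% extract_fit_parsnip() %>% vi()                 # importance values
      fitted %>% extract_fit_parsnip() %>% vip(geom = "point")  # importance plot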

    • @tamaraabzhandadze2712
      @tamaraabzhandadze2712 3 years ago

      @@JuliaSilge Thank you! I have actually posted the question there as well :). I read your answer and got the results :). I just really have to decide now on the importance cutoff for choosing some variables out of the ten features.
      P.S. I did factor analysis as well and could identify 3 variables with good loadings, but there it was a bit easier since there are cutoffs for loadings :). For XGBoost I have no idea what to do :)

  • @wecsleyprates3205
    @wecsleyprates3205 3 years ago

    Hey Julia, congrats again. This error shows up when I run:
    xgb_res

    • @JuliaSilge
      @JuliaSilge  3 years ago +2

      You need to *install* xgboost, actually; you don't have the package installed: install.packages("xgboost")

    • @wecsleyprates3205
      @wecsleyprates3205 3 years ago

      @@JuliaSilge Yeah... but I don't know what is happening. When I try to install the xgboost package, it gives an error telling me that xgboost is not available for my R version. My RStudio is the current version.

    • @JuliaSilge
      @JuliaSilge  3 years ago +2

      @@wecsleyprates3205 Ah, a classic problem that folks run into when things get borked! Check out this SO question + answers:
      stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-for-r-version-x-y-z-wa

    • @wecsleyprates3205
      @wecsleyprates3205 3 years ago

      Thanks @@JuliaSilge... Do you know what the error below means?
      Error in (function (classes, fdef, mtable) :
      unable to find an inherited method for function ‘predict’ for signature ‘"xgb.Booster"’

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      @@wecsleyprates3205 That sounds like xgboost still isn't getting loaded correctly to me. Could you try creating a reprex showing your problem and posting on RStudio Community? rstd.io/tidymodels-community

  • @Simonsayztaga
    @Simonsayztaga 4 years ago +1

    Do you have a course on tidymodels?? Video Course or Tutorials?

    • @JuliaSilge
      @JuliaSilge  4 years ago +6

      You can check out this interactive course on tidymodels: supervised-ml-course.netlify.app/

    • @artathearta
      @artathearta 3 years ago

      @@JuliaSilge Amazing resource, thank you

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    Julia,
    I do like Markdown, but for testing out code I prefer an R script simply because I make a lot of mistakes. So I am curious to know why you work in Markdown. Is it because you have already written and debugged your code and would like to save the lesson in a nicer format?

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      No, I work in R Markdown regularly. In R I basically am either building package code or I am working in R Markdown. I'm a huge believer in the idea of "literate programming" as a real way to work. I make a lot of mistakes too, but I don't think that reduces the value of combining narrative and code in one document.

    • @haraldurkarlsson1147
      @haraldurkarlsson1147 3 years ago

      I am working on setting up a class for students in my department and am quite torn on whether to go the Markdown or R script route. Since most of the class work will be around coding and simply learning how to use R, I am inclined to start with the regular setup (a script) and then move on to Markdown later. Thanks.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      @@haraldurkarlsson1147 The person I know who has thought the most about this is Mine Çetinkaya-Rundel; you can see one of her resources for teaching here: datasciencebox.org/
      She recommends teaching R Markdown to emphasize reproducible analyses.

    • @haraldurkarlsson1147
      @haraldurkarlsson1147 3 years ago

      I see. Thanks a lot for the tip.

    • @haraldurkarlsson1147
      @haraldurkarlsson1147 3 years ago

      Julia,
      I will have a deeper dive into datasciencebox. However, I will be teaching grad students who should have some inkling of the basic statistics concepts. Most have already worked with data, done some data processing, and generated tables and graphs. I would like to teach them R to simplify their lives and hopefully give them a valuable new skill for their current or future work. As grad students, the science part is covered.

  • @hansmeiser6078
    @hansmeiser6078 2 years ago

    Is .pred_win = .pred_class ?

    • @JuliaSilge
      @JuliaSilge  2 years ago +1

      No, .pred_win should be a class probability (like a number) and .pred_class should be the predicted class (like the factor level).
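      As a quick sketch against this video's final_res object (assuming the outcome win has the levels win/lose, so that .pred_lose also exists):

      library(tidymodels)

      collect_predictions(final_res) %>%
        dplyr::select(win, .pred_class, .pred_win, .pred_lose)
      # .pred_win / .pred_lose are probabilities; .pred_class is the factor prediction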

    • @hansmeiser6078
      @hansmeiser6078 2 years ago

      @@JuliaSilge Ah ok, thank you!