Lasso regression with tidymodels and The Office

  • Published Dec 1, 2024

COMMENTS • 44

  • @rrmaximiliano 4 years ago +8

    Thanks, Julia, for the video. It was really interesting to see how you approached the cleaning and modeling compared to David. It's great that you keep making these videos. They are super helpful.

  • @RAPmastaGBLASCO63 4 years ago +5

    Every time I watch one of your videos I learn something new and become more confident in my modeling. Thank you so much for them!

  • @erickknackstedt3131 4 years ago +1

    Love it! Finding this channel has made my day.

  • @iugaMovil 4 years ago

    Great video, Julia.
    It was a refresher on add_count and geom_col, because I stopped using them for some reason.

  • @minhnguyenbui6827 4 years ago

    Oh wow, it's so amazing. I know you from the Text Mining with R book. Finding David's and your channels is a memorable milestone in my process of learning R :D

  • @ethanthealien 4 years ago

    This was fantastic! It got me really excited about tidymodels =)

  • @luisfernandobaldanfechio8958 3 years ago

    Thanks a lot, excellent material. I'm getting a different result from the fitted workflow (at 27:00). I receive a 31 x 3 tibble with only one intercept, while yours is a 1,563 x 5 tibble with many intercepts. I copied and pasted the code from the blog post.

    • @JuliaSilge 3 years ago

      Ah, I believe there has been a change in parsnip since this video was published that you only get the lambda you actually specified, not the whole path of lambdas: github.com/tidymodels/parsnip/blob/master/NEWS.md#parsnip-013
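
The difference described in this reply can be sketched roughly as follows. This is only an illustration under stated assumptions: `mtcars` stands in for the Office data, the penalty value is arbitrary, and the glmnet package is assumed to be installed. Tidying the parsnip fit gives coefficients at the specified penalty, while tidying the underlying glmnet object gives the whole path.

```r
library(tidymodels)  # attaches parsnip, broom, dplyr, and friends

# mtcars is only a stand-in dataset; penalty = 0.1 is an arbitrary choice
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

lasso_fit <- fit(lasso_spec, mpg ~ ., data = mtcars)

# Coefficients at the one penalty that was specified
single <- tidy(lasso_fit)

# Coefficients along the whole regularization path,
# taken from the underlying glmnet object
path <- lasso_fit %>%
  extract_fit_engine() %>%
  tidy()

single
path
```

With a workflow, the same idea should apply after pulling out the parsnip fit first, for example with `extract_fit_parsnip()`.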

  • @iqu3261 3 years ago

    Thanks so much, Julia, for the valuable videos. I'm trying to evaluate LDA topic modelling on tweets using NPMI. Do you have an idea how to implement it in R? Thanks, Sam

  • @juliantagell1891 4 years ago +1

    Thanks Julia, this is great. Just got one question at 4:20.
    The other day I realised I can put pipes inside a mutate to get something like the code below. Do you reckon this is a good idea? (I don't see it much, but it feels really efficient.)
    transmute(episode_name = title %>% str_to_lower() %>% str_remove_all(remove_regex) %>% str_trim(),
    imdb_rating)
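
For readers who want to try the piped form from this comment, here is a small self-contained sketch. The `episodes` data, its ratings, and the `remove_regex` pattern are all made up for illustration, since the comment does not define them. It compares the piped version with the equivalent nested calls.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in data; titles and ratings are illustrative only
episodes <- tibble(
  title = c("The Dundies", "Stress Relief (Part 1)"),
  imdb_rating = c(8, 9)
)

# Hypothetical pattern; the comment does not show its definition
remove_regex <- "[[:punct:]]|[[:digit:]]|parts |part |the |and"

# Pipes inside transmute(), as in the comment
piped <- episodes %>%
  transmute(
    episode_name = title %>%
      str_to_lower() %>%
      str_remove_all(remove_regex) %>%
      str_trim(),
    imdb_rating
  )

# The same computation written as nested function calls
nested <- episodes %>%
  transmute(
    episode_name = str_trim(str_remove_all(str_to_lower(title), remove_regex)),
    imdb_rating
  )

identical(piped, nested)
```

Both forms compute the same column, so the choice between them is mostly a matter of readability.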

  • @AdrianaCastilloC 1 year ago

    Julia, this is great!! It's so well explained (: ... Do you know by any chance how to do exactly this for spatial (polygon) data?

    • @JuliaSilge 1 year ago

      You might check out the spatialsample package:
      spatialsample.tidymodels.org/
      And here is a blog post where I walk through how to use it:
      juliasilge.com/blog/drought-in-tx/

    • @AdrianaCastilloC 1 year ago

      @JuliaSilge oh, my god! This is GREATTTT!!! many many thanks!!

  • @mindlessgreen 3 years ago

    Thanks for the nice tutorial. At 22:30, office_prep was created. What was that about? It was never used downstream. In general, I don't get the use of prep and bake.

    • @JuliaSilge 3 years ago +1

      I think it *is* useful to know how to use `prep()` and `bake()` if you are going to be a tidymodels user, in order to debug and problem solve when things don't go right with your recipes. It's a way to check out how your recipe will preprocess your data for modeling. You can read about what the two functions do here: www.tmwr.org/recipes.html#using-recipes
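
As a rough illustration of what `prep()` and `bake()` do, here is a minimal sketch. The assumptions are mine: `mtcars` stands in for the Office data, the row split is arbitrary, and `step_normalize()` is just one example step. `prep()` estimates what the recipe needs from the training data, and `bake()` applies those estimates to data.

```r
library(recipes)

# Hypothetical train / new-data split of a stand-in dataset
car_train <- mtcars[1:24, ]
car_new <- mtcars[25:32, ]

car_rec <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations from the training data
car_prep <- prep(car_rec)

# bake() with new_data = NULL returns the processed training data
baked_train <- bake(car_prep, new_data = NULL)

# bake() on other data reuses the statistics learned from the training data
baked_new <- bake(car_prep, new_data = car_new)

summary(baked_train$wt)
```

Looking at the baked data like this is one way to check that a recipe does what you expect before it goes into a workflow.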

  • @hesamseraj 2 years ago

    I am reviewing all the videos and adding the tree episode names as some sort of homework for myself.

  • @brendang8610 4 years ago +2

    Awesome and informative video as always! I have a question and hope you can help clarify. I noticed that when you did the bootstrap resampling you used office_train as the dataset, which is the unmodified training data. In another video (the hotel bookings one) you used the juiced recipe as the dataset when creating the Monte Carlo cross-validation resamples. Is there a best practice on which dataset to use when resampling with tidymodels: the unprocessed training data or the preprocessed and juiced recipe data? Thanks!

    • @brendang8610 4 years ago

      Oh! Wait, is it because here you're using a workflow() and in the hotel bookings video you weren't? And if so, is the workflow applying the recipe, prepping and juicing in the resampling step for you?

    • @JuliaSilge 4 years ago +2

      @brendang8610 Yes, that's basically it! A workflow that includes a recipe will apply that recipe. Generally it is probably better practice to do resampling on the unmodified training set, because otherwise you can get LEAKAGE from your preprocessing steps and then overly optimistic results from resampling.
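
The pattern described in this reply can be sketched as follows. This is a minimal illustration under my own assumptions: `mtcars` stands in for the Office data, the recipe and the plain linear model are arbitrary choices, and only 5 bootstrap resamples are drawn to keep it quick. The resamples are built from the raw training data, and the workflow estimates the recipe inside each resample.

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars)
car_train <- training(car_split)

# The recipe is declared here but not prepped by hand
car_rec <- recipe(mpg ~ wt + hp + disp, data = car_train) %>%
  step_normalize(all_numeric_predictors())

car_wf <- workflow() %>%
  add_recipe(car_rec) %>%
  add_model(linear_reg())  # default "lm" engine

# Resample the unmodified training data, not a juiced or baked version
car_boot <- bootstraps(car_train, times = 5)

# The workflow preps the recipe on each analysis set and applies it to the
# matching assessment set, so preprocessing is estimated within each resample
boot_res <- fit_resamples(car_wf, resamples = car_boot)
boot_metrics <- collect_metrics(boot_res)
boot_metrics
```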

  • @vladimirmijatovic883 1 year ago

    Hi @julia - great video!
    Funny: I tried tuning hyperparameters with two different values of trees. When I tune the model with trees = 100 and with trees = 1000, the order of variable importance changes. With trees = 100 the most important variable is mhi_2018, followed by one_race_a, while with trees = 1000 the most important variable is one_race_a (followed by mhi_2018).
    How is this possible? Where could this be coming from?

    • @JuliaSilge 1 year ago

      I think you may be asking about a different video in this comment?
      But yes, maybe I should have been more clear that the variable importance I show is for *that model specifically*. The hyperparameters you choose for your algorithm often have an impact on variable importance. (And if you use variable importance to do feature selection, then that will change the hyperparameters you choose!) There is some related discussion here:
      stats.stackexchange.com/questions/264533/how-should-feature-selection-and-hyperparameter-optimization-be-ordered-in-the-m

    • @vladimirmijatovic883 1 year ago

      @JuliaSilge OMG, how embarrassing :), indeed it is related to another video of yours.
      The question was about this video: ua-cam.com/video/OMn1WCNufo8/v-deo.html (Predict Childcare Costs), but UA-cam kept rolling to the next video while I was waiting for my model to be trained :).
      However, I was surprised that a hyperparameter such as the number of trees could impact the order of variable importance. I guess my intuition was wrong.

  • @vincentpepe1064 4 years ago

    Hi Julia,
    Love the video! I was wondering how you would compare the accuracy of the model to the testing data? I need to submit a report with both the predicted and actual values and cannot seem to find it.
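
One possible way to get predicted and actual values for the testing data side by side is `last_fit()` followed by `collect_predictions()`. The sketch below rests on my own assumptions: `mtcars` stands in for the Office data, and a plain linear model stands in for the tuned lasso workflow.

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars)

# Stand-in workflow; in practice this would be the finalized workflow
car_wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

# Fit on the training portion and evaluate on the testing portion of the split
final_res <- last_fit(car_wf, car_split)

# Test-set metrics
collect_metrics(final_res)

# Predicted (.pred) and actual (mpg) values for the test set, side by side
test_preds <- collect_predictions(final_res) %>%
  select(.pred, mpg)
test_preds
```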

  • @alexnoble17 4 years ago +2

    This is super interesting. I would love to do this analysis with Doctor Who (specifically New Who!)

  • @drinks3544 2 years ago

    What does the value used to indicate "importance" on the x-axis mean? Is that R^2?

    • @JuliaSilge 2 years ago

      In the vip package, what "importance" is varies from model to model. You can look more at the documentation but for a linear model like a lasso regularized model, it is just literally the coefficients from the model itself (similar to coefficients from `lm()`). You can check out documentation for vip here:
      koalaverse.github.io/vip/

  • @muttbane1072 4 years ago

    Great video! Love it!

  • @TheFrankyguitar 4 years ago

    Thanks for the great video, Julia! I learned a lot. If we use a GLM, we might want to use a univariate filter to keep only relevant variables in the model, since GLMs don't have built-in variable selection. Is there a way to do this with tidymodels? Maybe with recipes?

    • @JuliaSilge 4 years ago +2

      Not currently, but we're interested in recipes supporting feature selection like that in the future!

    • @TheFrankyguitar 4 years ago

      That's great! Thank you.

  • @travisknoche5639 4 years ago

    Hi Julia, thanks for the video! I am getting the error: "All models failed in tune_grid(). See the `.notes` column." when running tune_grid(). My code is identical to yours and I'm also using a Mac. Any ideas?

    • @travisknoche5639 4 years ago

      All of the .notes say "model 1/1 (predictions): Error in cbind2(1, newx) %*% nbeta: invalid class 'NA' to dup_mMatrix_as_dgeMatrix"

    • @JuliaSilge 4 years ago

      @travisknoche5639 Is this using the same code/data as in my blog post? juliasilge.com/blog/lasso-the-office/ Or different data?

    • @travisknoche5639 4 years ago

      @JuliaSilge Yep!

    • @JuliaSilge 4 years ago

      @travisknoche5639 Does the first fit work, when you are not tuning?

  • @k.d0721 4 years ago

    You are the best. I should put your name in my PhD thesis.

  • @ryankirk574 4 years ago

    What RStudio theme are you using? I could not find that in the default appearances.

    • @JuliaSilge 4 years ago

      It's one of the themes from the rsthemes package: github.com/gadenbuie/rsthemes

    • @ryankirk574 4 years ago

      @JuliaSilge Thank you for the quick reply! I watched the video and am now reading through the blog explanation for further understanding.

  • @hesamseraj 4 years ago

    Amazing, thank you very much.

  • @hoschie211 4 years ago +1

    Very nice video! Well explained and above all: 30:18 :-)