Get started with tidymodels and classification of penguin data

  • Published 1 Dec 2024

COMMENTS • 82

  • @HamJeong
    @HamJeong 4 years ago +40

    I can't emphasize enough how useful this content is, from the screencast style, to the focus on using specific packages to the insight into the modelling process. I really love it, hope it keeps coming!!

  • @SC-pm7zd
    @SC-pm7zd 2 years ago

    Perfect YouTube content for people wanting to know how to analyze data using R in an elegant way.

  • @Simonsayztaga
    @Simonsayztaga 4 years ago +7

    30 minutes is the sweet spot!! Ur awesome @julia

  • @mocabeentrill
    @mocabeentrill 11 months ago

    Clearly explained and straight to the point! Thank you Julia.

  • @dasrotrad
    @dasrotrad 4 years ago +5

    What a great video Julia! Thank you for such a wonderful introduction to ML and for sharing your knowledge. You are, indeed, awesome.

  • @edGoldi
    @edGoldi 4 years ago +1

    Many thanks Julia!!! can't wait for the next video!!!

  • @WhySoBroke
    @WhySoBroke 2 years ago

    Superbly done!! Will rewatch a couple times, lots to learn! Many thanks Julia!! ❤️🇲🇽❤️

  • @kentico1234
    @kentico1234 4 years ago +2

    Great job, Julia .... you put a lot of effort into this very worthwhile endeavor!

  • @danielalvarezmd
    @danielalvarezmd 4 years ago +1

    Great video Julia. You are the best. Thank u very much!!!

  • @dariyatukhmetova1172
    @dariyatukhmetova1172 3 years ago +1

    Amazing tutorial, thank you. Love how you give an interesting explanation for each output value of the model.

  • @yussifmohammed9324
    @yussifmohammed9324 2 years ago

    Thanks Julia, would like to see more

  • @julietterose5753
    @julietterose5753 2 years ago

    Thank you so much for this video. Appreciate it. It is so helpful to see how it actually works.

  • @gabrielrosa9738
    @gabrielrosa9738 4 years ago +2

    Excellent! The content is very useful and your way of going through it makes it easy to grasp. Thank you!

  • @averyrobbins68
    @averyrobbins68 4 years ago +2

    Very helpful! Thank you very much for doing these videos. `tidy(exponentiate = TRUE)` was a new one for me. Very useful.

  • @socratesoliveira1176
    @socratesoliveira1176 4 years ago

    Very clear and easy to follow, so useful! Thank you very much!

  • @sophiej4605
    @sophiej4605 3 years ago

    Great for getting started with tidymodels!!

  • @TURALOWEN
    @TURALOWEN 2 years ago

    Amazing lecture! Thank you!

  • @malkhaz.jokhadze
    @malkhaz.jokhadze 4 years ago +3

    Dear Julia, I want to ask how you execute markdown code in the console. I mean, what key do you use for that purpose? Thank you in advance.

    • @JuliaSilge
      @JuliaSilge 4 years ago +8

      Those are probably my most used keyboard shortcuts! Cmd/Ctrl+Shift+Enter for a chunk, Cmd/Ctrl+Enter for a line.
      In RStudio, you can find them under Tools -> Keyboard Shortcuts Help, but there's just a handful that I use regularly.

    • @PatrickBateman12420
      @PatrickBateman12420 4 years ago

      @@JuliaSilge thanks a lot Julia!

  • @cb5231
    @cb5231 8 months ago

    thanks for this video Julia

  • @buraktiras93
    @buraktiras93 2 years ago

    Thanks for the content! I have a question. How can we change the cutoff value in glm when we use tidymodels?

    • @JuliaSilge
      @JuliaSilge 2 years ago

      Do you mean using the probability threshold to decide what label to predict? You can get out the probabilities via `type = "prob"` and can go from there as you wish, or you may be interested in using probably:
      probably.tidymodels.org/
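
A minimal sketch of the first option in the reply above, assuming class probabilities shaped like what `predict(..., type = "prob")` returns in tidymodels for an outcome with levels `female` and `male` (columns `.pred_female` and `.pred_male`). The probability values and the `apply_cutoff()` helper are hypothetical, made up for illustration; they are not from the video.

```r
# Hypothetical class probabilities, shaped like the output of
# predict(fit, new_data, type = "prob") for an outcome with levels female/male.
probs <- data.frame(
  .pred_female = c(0.92, 0.55, 0.38, 0.10),
  .pred_male   = c(0.08, 0.45, 0.62, 0.90)
)

# Hypothetical helper: predict "female" only when its probability
# reaches the chosen cutoff, otherwise predict "male".
apply_cutoff <- function(probs, cutoff = 0.5) {
  factor(
    ifelse(probs$.pred_female >= cutoff, "female", "male"),
    levels = c("female", "male")
  )
}

default_labels <- apply_cutoff(probs)                # usual 0.5 cutoff
strict_labels  <- apply_cutoff(probs, cutoff = 0.6)  # stricter cutoff for "female"
```

With these made-up numbers, raising the cutoff from 0.5 to 0.6 flips only the second row from `female` to `male`. The probably package linked above covers the same idea with more tooling.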

  • @raydePay
    @raydePay 2 years ago

    Would it be useful to compare the predicted probabilities ("probs", I think, in caret) where rf and glm diverge? So, if glm-pos > rf-neg the outcome is glm, else rf?

  • @carolinemimeault3668
    @carolinemimeault3668 4 years ago

    Thank you so much for making those videos!

  • @crgIN07
    @crgIN07 4 years ago +1

    Really great, thank you! Do you have plans to cover time series analysis or an SVM model?

  • @rank4816
    @rank4816 3 years ago

    Really instructive video, thank you!

  • @jaredminetola
    @jaredminetola 4 years ago +1

    Hi Julia, I'm newISH to R and VERY new to predictive modelling in R. I really enjoy watching your videos! I'm wondering if you would start over and exclude Flipper_Length_mm from this model (if you were actually going to use this going forward), since it had a higher p-value in your summary statistics. Thanks!

    • @JuliaSilge
      @JuliaSilge 4 years ago +6

      That would be like one step of "stepwise regression", basically, and stepwise regression has a lot of problems when applied in general. However, in real life problems (where the goal was prediction, i.e. good model fit), I probably *would* try the model without the insignificant variable term to see if it still fit about as well and then I would pick the simpler model if it did.

    • @jaredminetola
      @jaredminetola 4 years ago

      @@JuliaSilge Thanks for the quick reply!

  • @cgmiguel
    @cgmiguel 3 years ago +1

    Excellent video and content, as usual! One quick question though: what do you mean when you say a logistic regression model is easier to deploy than a random forest?

    • @JuliaSilge
      @JuliaSilge 3 years ago +2

      I was thinking about how a logistic regression model is linear so you don't need to get an R object deployed somewhere to make predictions; you can just use a flat file of model coefficients that could be incorporated into any kind of production system (no R necessary) pretty easily.
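
A rough sketch of the flat-file idea in the reply above, written in base R so the arithmetic is visible. It is an illustration under assumptions, not the workflow from the video: the built-in `mtcars` data stands in for the penguin data, and the `score()` helper and the temporary CSV file are hypothetical. The scoring step uses only a sum and an inverse logit, so it could be rewritten in any language.

```r
# Fit a logistic regression on a built-in dataset (a stand-in for the penguin model).
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial())

# Export the coefficients as a flat file; this is all the production side needs.
coef_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(term = names(coef(fit)), estimate = unname(coef(fit))),
  coef_file,
  row.names = FALSE
)

# "Production" side: read the file and score rows without the model object.
coefs <- read.csv(coef_file)
beta <- setNames(coefs$estimate, coefs$term)

# Hypothetical helper that reproduces the model's predicted probabilities.
score <- function(new_data, beta) {
  predictors <- setdiff(names(beta), "(Intercept)")
  # Linear predictor: intercept plus the weighted sum of the predictors.
  eta <- beta[["(Intercept)"]] +
    as.matrix(new_data[, predictors, drop = FALSE]) %*% beta[predictors]
  # Inverse logit turns log-odds into probabilities.
  as.numeric(1 / (1 + exp(-eta)))
}

manual <- score(mtcars, beta)
reference <- unname(predict(fit, newdata = mtcars, type = "response"))

# Check that the flat-file scores agree with predict() on the fitted model.
agreement <- all.equal(manual, reference)
```

A random forest has no equivalent short list of numbers, which is why it generally needs the fitted object (and a runtime that can load it) wherever predictions are made.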

  • @marianklose1197
    @marianklose1197 1 year ago

    great tutorial!

  • @maxcopa83
    @maxcopa83 4 years ago

    I wonder what the results would be if the independent fields were dummy coded. Great code as always.

  • @maksim0933
    @maksim0933 4 years ago

    I have a very silly question: for the practical purpose of filling in missing values in this particular dataset (setting aside all the great regressions), wouldn't it be better to fill in the NAs with the help of a package such as mice?

    • @JuliaSilge
      @JuliaSilge 4 years ago +1

      Here, sex is the thing we are predicting so we would need to be careful using the predictors to impute the outcome and then also to predict the outcome. If on the other hand you want to use imputation for predictors, tidymodels has a number of functions for that in the recipes package: recipes.tidymodels.org/reference/index.html#section-step-functions-imputation

  • @sotirismargaritis4965
    @sotirismargaritis4965 4 years ago

    May I ask what lines 9 to 15 do?
    Is theme_set(theme_plex()) from the rstheme package, which defines the RStudio theme?
    Thank you very much

    • @JuliaSilge
      @JuliaSilge 4 years ago +4

      theme_set() is for ggplot2, to set what the plots look like: ggplot2.tidyverse.org/reference/theme_get.html
      The part above that sets options for knitr chunks, such as whether to cache results, whether to print messages and warnings, what size to print figures, etc. You can read more about knitr chunk options here: yihui.org/knitr/options/

    • @sotirismargaritis4965
      @sotirismargaritis4965 4 years ago

      @@JuliaSilge Thank you very much for the quick response. I hope you will make in the future some interactive courses like supervised ml case studies

  • @brendanmcewen7190
    @brendanmcewen7190 1 year ago

    Around minute 22:00 you're mentioning that the (generalized) linear model did just as well at classifying sex as the random forest model, despite not being able to identify interactions (e.g. a flipped dimorphism for one of the species). Isn't this rather expected, though, as the dataset itself contained no interactions between sex and the other identifying characteristics? Would the RF model have performed better if, say, one of the species had an inverse relationship between sex and flipper/beak dimensions?

    • @JuliaSilge
      @JuliaSilge 1 year ago +1

      I think it's a little strong to say there are *no* interactions in the penguins dataset, as for example the slope for bill depth vs length isn't the same for all species and/or sexes. However, yep, the fact that the linear model performs just as well does indicate that any interactions aren't that important and we would expect a random forest model to do better when there are more important interactions.

    • @brendanmcewen7190
      @brendanmcewen7190 1 year ago

      @@JuliaSilge Gotcha, that makes sense. Thanks for the reply on a two year old video! Ben Bolker recommended I look into TidyModels, so I've been watching lots of your videos. Very clear and informative!

  • @jamespaz4333
    @jamespaz4333 3 years ago

    Great presentation! How can I include grid search in my recipes?

    • @JuliaSilge
      @JuliaSilge 3 years ago +1

      You can tune many recipe parameters, in much the same way you tune model parameters. You can check out some examples here:
      www.tidymodels.org/learn/work/tune-text/
      And here:
      www.tidymodels.org/learn/work/bayes-opt/

    • @jamespaz4333
      @jamespaz4333 3 years ago

      @@JuliaSilge amazing! Thank you!!!!!

  • @upendra8050
    @upendra8050 4 years ago +1

    Dear Julia, great video, and I learned a lot about tidy models today. I have a couple of questions.
    1. For tree-based models, I can use feature importance and packages such as SHAP for interpreting them. Is this something that we can do with linear models such as logistic regression? Or in other words, can we assume coefficients of features in linear models to be the same as feature importance in tree-based models?
    2. From your analyses, you found that the bill depth is the most important feature that differentiates the sexes. Can we come up with rules/cut-offs using which we can say whether a particular bill depth corresponds to a male penguin or female penguin?
    Thanks in advance.

    • @JuliaSilge
      @JuliaSilge 4 years ago +2

      Absolutely, the coefficients of a linear model give you analogous information to feature importance of a tree model. In fact, they are *better* in terms of feature importance because they literally are just which features are most important for your model, directly.
      If you want a set of rules, I would use a specific model for that: www.tidyverse.org/blog/2020/05/rules-0-0-1/

    • @upendra8050
      @upendra8050 4 years ago

      @@JuliaSilge Thanks Julia.

  • @yujuansun8522
    @yujuansun8522 2 years ago

    Your video is so useful! I use the same method as yours but I got this Error message when I use fit_resamples "Error: For a classification model, the outcome should be a factor." Do you know how to fix this problem? Thanks in advance!!!

    • @JuliaSilge
      @JuliaSilge 2 years ago

      It sounds like you may be fitting a classification model to data with a numeric outcome. Try choosing a model that is a good fit for your particular data, like a regression model if you have a numeric outcome.

  • @selecta_ssbm
    @selecta_ssbm 4 years ago

    Love this! However, I got an error at the last step:
    Error: No tidy method for objects of class ranger

    • @JuliaSilge
      @JuliaSilge 4 years ago

      Seems like you tried to tidy the random forest instead of the logistic regression model. A random forest model doesn't have simple coefficients so can't be tidied in the same way that a logistic regression model can.

  • @andrewnguyen3312
    @andrewnguyen3312 1 year ago

    Great video ty so much

  • @sabbamussadiq9818
    @sabbamussadiq9818 3 years ago

    Ma'am,
    Can you kindly teach how to plot 2 or 3 variables on the same ROC curve graph in SPSS for easy visual comparison, like you did in this video? Though this does not look like SPSS.

    • @JuliaSilge
      @JuliaSilge 3 years ago

      Well, it definitely is not SPSS! 😁 If you can outline in detail more of what you are trying to do with a reproducible example, I suggest you post on RStudio Community where folks will be able to help you:
      rstd.io/tidymodels-community

    • @sabbamussadiq9818
      @sabbamussadiq9818 3 years ago

      @@JuliaSilge Well, thank you for the reply, Ma'am.
      I am comparing 2 biomarkers for a disease diagnosis, so I needed ROC curves, but I was not able to plot both on the same graph like you did (plotting many ROC curves on one graph).
      I will look at the site you mentioned. Thank you.

  • @felipetorres4464
    @felipetorres4464 4 years ago +1

    Hi Julia. Why is this video called "unknown"?

    • @Ledgerdomain
      @Ledgerdomain 4 years ago

      Frozen 2. Penguins. Ice.
      Just kidding

    • @felipetorres4464
      @felipetorres4464 4 years ago

      @@Ledgerdomain Hahaha maybe ... it's a good name.

  • @byronpop2
    @byronpop2 4 years ago

    Hi @julia, I love your videos! Thank you so much for making them. I am following along and using my own data for some modeling and unfortunately when I try to train the random forest model with:
    rf_rs %
    add_model(rf_spec) %>%...
    I get the following error: "model: Error: spark objects can only be used with the formula interface to `fit()` with a spark data object."
    Any idea what might be going on? For context, my data is described below:
    tibble [4,428 × 12] (S3: tbl_df/tbl/data.frame)

    $ deployment : Factor w/ 13 levels
    $ realty_status : Factor w/ 2 levels "opted IN","opted OUT":
    $ property_county : Factor w/ 356 levels "
    $ property_state : Factor w/ 44 levels
    $ loan_amount : num [1:4428]
    $ total_income : num [1:4428]
    $ age : num [1:4428]
    $ n_schooling_years : num [1:4428]
    $ n_owned_properties: num [1:4428]
    $ n_dependents : num [1:4428]
    $ device_type_start : Factor w/ 4 levels
    $ completion_time : 'difftime' num

    • @JuliaSilge
      @JuliaSilge 4 years ago

      I don't think that I can get enough info in the comments here to help. Can you post on RStudio Community with a little more detail (preferably a whole reprex, if possible) so we can check it out and see what's going on? rstd.io/tidymodels-community

  • @kenkoonwong2166
    @kenkoonwong2166 4 years ago

    thank you. very helpful!

  • @jakebersabe6511
    @jakebersabe6511 3 years ago

    Thank you!

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    Interesting point about not building a classification model for species. However, perhaps a classification model would work better than a classification made by a biologist. I would think that a model would definitely do a better job than a beginner or amateur. The classification of any sort of thing, be it a rock or a bird, is often fraught with mistakes.

  • @nadiamekhloufi8744
    @nadiamekhloufi8744 1 year ago

    Could you please share the script code with us?

    • @JuliaSilge
      @JuliaSilge 1 year ago +1

      Check out the description here on YouTube, where I always include that info:
      juliasilge.com/blog/palmer-penguins/

  • @Mohamed-sq8od
    @Mohamed-sq8od 3 years ago

    you are awesome

  • @farnooshsheikhi
    @farnooshsheikhi 4 years ago

    Thank you Julia. This was really helpful. Quick question: do you always create a balanced dataset, where you have the same number of cases and controls, before modeling and then resample from that dataset? I was wondering if this is a general approach to building predictive models. Thank you again. I love your videos :)

    • @JuliaSilge
      @JuliaSilge 4 years ago +1

      I don't think it's best practice to *always* create a balanced training set, but often this is a helpful preprocessing step to build a model that can learn to recognize both, say, the majority and minority classes. One important note is that it is best to resample the original, imbalanced dataset, and then do the over/undersampling on the resamples, to avoid data leakage. In tidymodels, we have tools for dealing with imbalanced data in the themis package:
      themis.tidymodels.org/

    • @farnooshsheikhi
      @farnooshsheikhi 4 years ago

      @@JuliaSilge thank you so much for getting back to me. I'll check the themis package out :)

    • @TheFrankyguitar
      @TheFrankyguitar 4 years ago +1

      I use the SMOTE algorithm in the themis package. You just have to add one line to your recipe: step_smote(your_response_variable, smote_parameters).

  • @oddsratio4070
    @oddsratio4070 2 years ago

    It's confusing that you use `recipes` in all your other videos, but not here?

    • @JuliaSilge
      @JuliaSilge 2 years ago

      If you want to learn about using a formula vs. a recipe, I recommend checking out these sections of our book:
      www.tmwr.org/base-r.html#formula
      www.tmwr.org/workflows.html#workflow-encoding
      www.tmwr.org/recipes.html

    • @oddsratio4070
      @oddsratio4070 2 years ago

      @@JuliaSilge Thanks! I am also ordering the book in hardcopy on Amazon today :)

  • @deiro04
    @deiro04 2 years ago

    Amazing

  • @elOtorongo96
    @elOtorongo96 4 years ago

    Awesome

  • @Blackhole-yy6yq
    @Blackhole-yy6yq 4 years ago

    i love you julia.. how r u today