I can't emphasize enough how useful this content is, from the screencast style, to the focus on using specific packages to the insight into the modelling process. I really love it, hope it keeps coming!!
Perfect YouTube content for people wanting to learn how to analyze data using R in an elegant way.
30 minutes is the sweet spot!! You're awesome @julia
Clearly explained and straight to the point! Thank you Julia.
What a great video Julia! Thank you for such wonderful introduction to ML and for sharing your knowledge. You are indeed, awesome.
Many thanks Julia!!! can't wait for the next video!!!
Superbly done!! Will rewatch a couple times, lots to learn! Many thanks Julia!! ❤️🇲🇽❤️
Great job, Julia .... you put a lot of effort into this very worthwhile endeavor!
Great video Julia. You are the best. Thank u very much!!!
amazing tutorial, thank you. Love how you give interesting explanation for each output value of the model.
Thanks Julia - would like to see more
Thank you so much for this video. Appreciate it. It is so helpful to see how it actually works.
Excellent! The content is very useful and the way you go through it makes it easy to grasp. Thank you!
Very helpful! Thank you very much for doing these videos. `tidy(exponentiate = TRUE)` was a new one for me. Very useful.
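For anyone curious what `exponentiate = TRUE` is doing under the hood, here is a minimal base-R sketch (using the built-in `mtcars` data, not the penguins model from the video): exponentiating a logistic regression coefficient turns it into an odds ratio.

```r
# Fit a small logistic regression on built-in data
fit <- glm(am ~ mpg, data = mtcars, family = binomial())

log_odds   <- coef(fit)       # raw coefficients, on the log-odds scale
odds_ratio <- exp(coef(fit))  # what tidy(exponentiate = TRUE) reports

# An odds ratio above 1 means the odds of the outcome increase
# as the predictor increases
odds_ratio["mpg"] > 1  # TRUE
```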
Very clear and easy to follow, so useful! Thank you very much!
Great way to get started with tidymodels!!
Amazing lecture! Thank you!
Dear Julia, I want to ask how you execute R Markdown code in the console, that is, what keyboard shortcut you use for that purpose. Thank you in advance.
That's probably my most used keyboard shortcut! Ctrl+Shift+Enter for a chunk, Cmd+Enter for a line
In RStudio, you can find them under Tools -> Keyboard Shortcuts Help, but there's just a handful that I use regularly.
@@JuliaSilge thanks a lot Julia!
thanks for this video Julia
Thanks for the content! I have a question. How can we change the cutoff value in glm when we use tidymodels?
Do you mean using the probability threshold to decide what label to predict? You can get out the probabilities via `type = "prob"` and can go from there as you wish, or you may be interested in using probably:
probably.tidymodels.org/
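To make the threshold idea concrete, here is a hedged base-R sketch (built-in `mtcars` data, not the tidymodels code from the video): get the probabilities out, then apply whatever cutoff suits your problem.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial())

# Probabilities, analogous to predict(..., type = "prob") in tidymodels
probs <- predict(fit, type = "response")

# The default behaviour corresponds to a 0.5 cutoff ...
pred_default <- ifelse(probs > 0.5, 1, 0)
# ... but you can choose any threshold, e.g. to trade sensitivity
# for specificity
pred_strict <- ifelse(probs > 0.8, 1, 0)

# A stricter cutoff never predicts more positives than the default
sum(pred_strict) <= sum(pred_default)  # TRUE
```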
Would it be useful to compare the prediction weights ("probs", I think, in caret) where rf and glm diverge? So, if glm-pos > rf-neg, the outcome is glm; else rf?
Thank you so much for making those videos!
Really great, thank you! Do you have plans to do time series analysis or an SVM model?
Really instructive video, thank you!
Hi Julia, I'm newish to R and VERY new to predictive modelling in R. I really enjoy watching your videos! I'm wondering if you would start over and exclude Flipper_Length_mm from this model (if you were actually going to use this going forward), since it had a higher p-value in your summary statistics. Thanks!
That would basically be one step of "stepwise regression", and stepwise regression has a lot of problems when applied in general. However, in real-life problems (where the goal was prediction, i.e. good model fit), I probably *would* try the model without the insignificant term to see if it still fit about as well, and then I would pick the simpler model if it did.
@@JuliaSilge Thanks for the quick reply!
Excellent video and content, as usual! One quick question though: what do you mean by being easier to deploy a logistic regression model than a random forest?
I was thinking about how a logistic regression model is linear so you don't need to get an R object deployed somewhere to make predictions; you can just use a flat file of model coefficients that could be incorporated into any kind of production system (no R necessary) pretty easily.
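Here is a minimal base-R sketch of that flat-file idea (built-in data, not the model from the video): once you have the coefficient vector, prediction is just a dot product plus the logistic function, which any production system can compute without R.

```r
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial())

coefs <- coef(fit)                    # this is all you would need to ship
X <- cbind(1, mtcars$mpg, mtcars$wt)  # intercept column plus predictors

# Reproduce the model's predictions without the fitted R object
manual_probs <- plogis(as.vector(X %*% coefs))
model_probs  <- predict(fit, type = "response")

all.equal(manual_probs, unname(model_probs))  # TRUE
```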
great tutorial!
I wonder what the results would be if the independent variables were dummy coded. Great code as always.
I have a very silly question: for the practical purpose of filling missing values in a particular dataset (setting aside all the great regressions), wouldn't it be better to fill NAs with the help of a package such as mice?
Here, sex is the thing we are predicting so we would need to be careful using the predictors to impute the outcome and then also to predict the outcome. If on the other hand you want to use imputation for predictors, tidymodels has a number of functions for that in the recipes package: recipes.tidymodels.org/reference/index.html#section-step-functions-imputation
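As a base-R concept demo of what an imputation step does (the recipes step functions such as `step_impute_median()` wrap this same idea, learning the statistic from the training data only):

```r
x <- c(2.1, NA, 3.5, 4.0, NA, 2.8)

train_median <- median(x, na.rm = TRUE)         # learned from training data
x_imputed <- ifelse(is.na(x), train_median, x)  # fill the gaps

any(is.na(x_imputed))  # FALSE: no missing values remain
```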
May I ask what lines 9 to 15 do?
Is theme_set(theme_plex()) from the rstheme package, which defines the RStudio theme?
Thank you very much
theme_set() is for ggplot2, to set what the plots look like: ggplot2.tidyverse.org/reference/theme_get.html
The part above that sets options for knitr chunks, such as whether to cache results, whether to print messages and warnings, what size to print figures, etc. You can read more about knitr chunk options here: yihui.org/knitr/options/
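A typical setup chunk along those lines looks something like this (the specific option values here are illustrative, not necessarily the ones from the video):

```r
# knitr chunk options: applied to every chunk in the document
knitr::opts_chunk$set(
  cache = TRUE,     # cache results between knits
  message = FALSE,  # hide package startup messages
  warning = FALSE,  # hide warnings in the rendered output
  fig.width = 8,    # default figure size in inches
  fig.height = 5
)

# ggplot2 theme: applied to every plot made after this call
library(ggplot2)
theme_set(theme_minimal())
```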
@@JuliaSilge Thank you very much for the quick response. I hope you will make some interactive courses in the future, like the supervised ML case studies.
Around minute 22:00 you mention that the (generalized) linear model did just as well at classifying sex as the random forest model, despite not being able to identify interactions (e.g. a flipped dimorphism for one of the species). Isn't this rather expected, though, since the dataset itself contained no interactions between sex and the other identifying characteristics? Would the RF model have performed better if, say, one of the species had an inverse relationship between sex and flipper/beak dimensions?
I think it's a little strong to say there are *no* interactions in the penguins dataset, as for example the slope for bill depth vs length isn't the same for all species and/or sexes. However, yep, the fact that the linear model performs just as well does indicate that any interactions aren't that important and we would expect a random forest model to do better when there are more important interactions.
@@JuliaSilge Gotcha, that makes sense. Thanks for the reply on a two year old video! Ben Bolker recommended I look into TidyModels, so I've been watching lots of your videos. Very clear and informative!
Great presentation! How can I include grid search in my recipes?
You can tune many recipe parameters, in much the same way you tune model parameters. You can check out some examples here:
www.tidymodels.org/learn/work/tune-text/
And here:
www.tidymodels.org/learn/work/bayes-opt/
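A hedged sketch of what that looks like (assumes the tidymodels packages are installed; the data and the choice of `step_ns()` here are illustrative, not from the video): mark a recipe argument with `tune()` and it gets searched right alongside any model parameters.

```r
library(tidymodels)

dat <- dplyr::mutate(mtcars, am = factor(am))

rec <- recipe(am ~ mpg + wt, data = dat) %>%
  step_ns(mpg, deg_free = tune())  # a tunable *recipe* parameter

spec <- logistic_reg() %>% set_engine("glm")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

# tune_grid(wf, resamples = vfold_cv(dat, v = 5), grid = 5)
# would then grid search over deg_free
```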
@@JuliaSilge amazing! Thank you!!!!!
Dear Julia, great video, and I learned a lot about tidy models today. I have a couple of questions.
1. For tree-based models, I can use feature importance and packages such as SHAP for interpreting them. Is this something that we can do with linear models such as logistic regression? Or in other words, can we assume coefficients of features in linear models to be the same as feature importance in tree-based models?
2. From your analyses, you found that the bill depth is the most important feature that differentiates the sexes. Can we come up with rules/cut-offs using which we can say whether a particular bill depth corresponds to a male penguin or female penguin?
Thanks in advance.
Absolutely, the coefficients of a linear model give you analogous information to feature importance of a tree model. In fact, they are *better* in terms of feature importance because they literally are just which features are most important for your model, directly.
If you want a set of rules, I would use a specific model for that: www.tidyverse.org/blog/2020/05/rules-0-0-1/
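One caveat worth adding to coefficients-as-importance: predictors need to be on a comparable scale before coefficient magnitudes can be compared. A base-R sketch on built-in data (not the penguins model):

```r
# Standardize predictors so coefficient magnitudes are comparable
dat <- data.frame(
  am  = mtcars$am,
  mpg = scale(mtcars$mpg)[, 1],
  wt  = scale(mtcars$wt)[, 1]
)
fit <- glm(am ~ mpg + wt, data = dat, family = binomial())

# Larger absolute standardized coefficient = more important to the model
importance <- sort(abs(coef(fit)[-1]), decreasing = TRUE)
importance
```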
@@JuliaSilge Thanks Julia.
Your video is so useful! I used the same method as yours, but I got this error message when I use fit_resamples: "Error: For a classification model, the outcome should be a factor." Do you know how to fix this problem? Thanks in advance!!!
It sounds like you may be fitting a classification model to data with a numeric outcome. Try choosing a model that is a good fit for your particular data, like a regression model if you have a numeric outcome.
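If classification really is the goal, the usual fix (sketched on made-up data) is to convert the outcome column to a factor before fitting:

```r
df <- data.frame(
  outcome = c("yes", "no", "yes", "no"),  # character, not yet a factor
  x = c(1.2, 0.8, 1.5, 0.7)
)

df$outcome <- factor(df$outcome)  # now suitable for a classification model

is.factor(df$outcome)  # TRUE
```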
Love this! However, I got an error at the last step with the following:
Error: No tidy method for objects of class ranger
Seems like you tried to tidy the random forest instead of the logistic regression model. A random forest model doesn't have simple coefficients so can't be tidied in the same way that a logistic regression model can.
Great video ty so much
Ma'am, can you kindly teach how to plot 2 or 3 ROC curves on the same graph for easy visual comparison in SPSS, like you made in this video? But this does not look like SPSS.
Well, it definitely is not SPSS! 😁 If you can outline in detail more of what you are trying to do with a reproducible example, I suggest you post on RStudio Community where folks will be able to help you:
rstd.io/tidymodels-community
@@JuliaSilge Well, thank you for the reply, Ma'am. I am comparing 2 biomarkers in a disease diagnosis, so I needed ROC curves, but I was not able to plot both on the same graph like you did (plotting many ROC curves on one graph). I will look at the site you have mentioned. Thank you.
Hi Julia. Why is this video called "unknown"?
Frozen 2. Penguins. Ice.
Just kidding
@@Ledgerdomain Hahaha maybe ... it's a good name.
Hi @julia, I love your videos! Thank you so much for making them. I am following along and using my own data for some modeling and unfortunately when I try to train the random forest model with:
rf_rs <- ... %>%
add_model(rf_spec) %>%...
I get the following error: "model: Error: spark objects can only be used with the formula interface to `fit()` with a spark data object."
Any idea what might be going on? For context, my data is described below:
tibble [4,428 × 12] (S3: tbl_df/tbl/data.frame)
$ deployment : Factor w/ 13 levels
$ realty_status : Factor w/ 2 levels "opted IN","opted OUT":
$ property_county : Factor w/ 356 levels "
$ property_state : Factor w/ 44 levels
$ loan_amount : num [1:4428]
$ total_income : num [1:4428]
$ age : num [1:4428]
$ n_schooling_years : num [1:4428]
$ n_owned_properties: num [1:4428]
$ n_dependents : num [1:4428]
$ device_type_start : Factor w/ 4 levels
$ completion_time : 'difftime' num
I don't think that I can get enough info in the comments here to help. Can you post on RStudio Community with a little more detail (preferably a whole reprex, if possible) so we can check it out and see what's going on? rstd.io/tidymodels-community
thank you. very helpful!
Thank you!
Interesting point about not building a classification model for species. However, perhaps a model's classification would work better than one made by a biologist. I would think that a model would definitely do a better job than a beginner or amateur. The classification of any sort of thing, be it a rock or a bird, is often fraught with mistakes.
Please, can you share the script code with us?
Check out the description here on YouTube, where I always include that info:
juliasilge.com/blog/palmer-penguins/
you are awesome
Thank you Julia. This was really helpful. Quick question, do you always create a balanced data where you have the same number of cases and controls before modeling and then resample from that data set? I was wondering if this is a general approach to build predictive models. Thank you again. I love your videos :)
I don't think it's best practice to *always* create a balanced training set, but often this is a helpful preprocessing step to build a model that can learn to recognize both, say, the majority and minority classes. One important note is that it is best to resample the original, imbalanced dataset, and then do the over/undersampling on the resamples, to avoid data leakage. In tidymodels, we have tools for dealing with imbalanced data in the themis package:
themis.tidymodels.org/
@@JuliaSilge thank you so much for getting back to me. I'll check the themis package out :)
I use the SMOTE algorithm contained in the themis package. You just have to add one line to your recipe: step_smote(your_response_variable, smote_parameters).
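A short sketch of that suggestion (assumes the recipes and themis packages are installed; the data and the `over_ratio` value are made up for illustration):

```r
library(recipes)
library(themis)

# Made-up imbalanced data: 5 "yes" vs 45 "no"
dat <- data.frame(
  response = factor(c(rep("yes", 5), rep("no", 45))),
  x1 = rnorm(50),
  x2 = rnorm(50)
)

rec <- recipe(response ~ ., data = dat) %>%
  step_smote(response, over_ratio = 1)  # oversample minority class to parity
```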
It's confusing that in all your other videos you use `recipes`, but not here?
If you want to learn about using a formula vs. a recipe, I recommend checking out these sections of our book:
www.tmwr.org/base-r.html#formula
www.tmwr.org/workflows.html#workflow-encoding
www.tmwr.org/recipes.html
@@JuliaSilge Thanks! I am also ordering the book in hardcopy on Amazon today :)
Amazing
Awesome
i love you julia.. how r u today