thatRnerd, you invite the viewer to follow along but don't make the code available. Straight away RStudio says "Error in tbl_df(iris) : could not find function "tbl_df"". So which library is that? I've never seen tbl_df() before.
Thank you for letting me know that! The library you want for that function is dplyr. I use the tidyverse package, which loads dplyr as well as a bunch of other packages that are very helpful for data science. That should definitely have been at the top with the other libraries. That said, tbl_df() is not saving anything that the random forest uses; it just gives a clean way to get an overview of the data.
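For anyone following along, here is a minimal sketch of the setup step described above. It is not the code from the video. It assumes dplyr (and therefore tibble) is installed, and it uses as_tibble() as a stand-in because tbl_df() is deprecated in recent dplyr releases; the name iris_tbl is only an illustration.

```r
# Load dplyr (library(tidyverse) would also attach it).
# Assumption: dplyr is installed, e.g. via install.packages("tidyverse").
library(dplyr)

# tbl_df() is deprecated in recent dplyr versions; tibble::as_tibble()
# is used here instead to get the same kind of compact overview.
iris_tbl <- tibble::as_tibble(iris)

# Printing a tibble shows the dimensions, column types, and first rows.
print(iris_tbl)
```

Since this step is only for looking at the data, the random forest can still be fit on the original iris data frame.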
thanks a lot for your explanation
Thank you thatRnerd! You are going to help me so much with my Graduate Thesis
That's awesome! Good luck!
Do you do tutorials based on predictive models and is it similar to Matlab?
I have another video on decision trees, and I plan on doing more predictive models in the future. I haven't worked with Matlab, so I'm not sure how close the code would be.
well explained
I am busy with a project where I have 65 000 entries and 70 variables. What method would you propose for me to use to identify the most important predictors? Will I be able to use randomForest?
I would do what is done in the video with varImpPlot(rf), then keep the variables that are positive on the left graph. A value greater than zero means the variable is increasing the accuracy of the model. Depending on how well the top few variables predict, I would also consider a model with maybe only the top ten variables.
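The selection idea above could be sketched roughly as follows. This is an illustration on the built-in iris data, not the video's code or the commenter's 70-variable dataset. The names imp, keep, top_n, and rf_small are made up for the example. Note that importance = TRUE has to be set when fitting; otherwise varImpPlot() only has the Gini-based measure to show.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# importance = TRUE is needed to compute the permutation-based
# (mean decrease in accuracy) measure shown in the left panel.
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

varImpPlot(rf)

# type = 1 requests the mean decrease in accuracy for each variable.
imp <- importance(rf, type = 1)
imp_sorted <- sort(imp[, 1], decreasing = TRUE)

# Keep variables with positive importance, then at most the top ten.
keep <- names(imp_sorted)[imp_sorted > 0]
top_n <- head(keep, 10)

# Refit on the reduced set of predictors to compare against the full model.
rf_small <- randomForest(x = iris[, top_n, drop = FALSE], y = iris$Species)
print(rf_small)
```

Comparing the out-of-bag error printed for rf and rf_small is one way to judge whether the smaller model predicts about as well as the full one.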
@thatrnerd4265 Thank you for your response. I appreciate it :)