R Stats: Multiple Regression - Variable Preparation

  • Published 17 Apr 2016
  • This video gives a quick overview of constructing a multiple regression model in R to estimate vehicle prices from their characteristics. The video focuses on how to prepare variables while employing stepwise regression with backward elimination of variables. The lesson explains how to transform highly skewed variables (using a log10 transform) and later report their characteristics, how to check variable normality, their multicollinearity (using variance inflation factors), and their extreme values (using Cook's distance). The process is guided by measures of model quality, such as the R-squared and adjusted R-squared statistics, and the variables' p-values, which indicate confidence in the coefficient estimates. As always, the final model is evaluated by calculating the correlation between the predicted and actual vehicle prices for both the training and validation data sets, with correction for the previously transformed variables. The explanation is quite informal and avoids the more complex statistical concepts. Note that visual presentation and interpretation of multiple regression results will be explained in the next lesson.
    The data for this lesson can be obtained from the well-known UCI Machine Learning archives:
    * archive.ics.uci.edu/ml/datase...
    The R source code for this video can be found here (some small discrepancies are possible):
    * jacobcybulski.com/youtube-rsrc...
    Videos in data analytics and data visualization by Jacob Cybulski, jacobcybulski.com.
  • Science & Technology
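The overall workflow from the description — fit a full model, then let backward elimination prune it while watching adjusted R-squared — can be sketched as below. This is a minimal illustration using R's built-in mtcars data as a stand-in, since the lesson's actual vehicle data and column names come from the UCI link above.

```r
# Backward stepwise elimination, guided by AIC (base-R step() from stats).
# mtcars stands in for the lesson's vehicle data set.
full <- lm(mpg ~ ., data = mtcars)             # start with all predictors
reduced <- step(full, direction = "backward",  # drop weakest variables
                trace = 0)                     # suppress step-by-step log
summary(reduced)$adj.r.squared                 # model-quality measure
```

On mtcars this prunes the model down to a handful of predictors while keeping the adjusted R-squared close to that of the full model.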

COMMENTS • 21

  • @allisonhaaning44 6 years ago

    Very helpful video. Thank you for posting!

  • @klaldju 7 years ago

    Hi professor, thanks for the great tutorial.
    Just out of curiosity, why do you use the number 2017 in set.seed()?
    Many thanks

    • @ironfrown 7 years ago +3

      Claudio Lira, set.seed() initialises the random number generator, which is used in sampling and data splitting. If you use the same number that I used, you'll get the same result as in this video. 2017 felt like a good number for a video produced this year 😁
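To see the reproducibility point concretely: the seed value itself is arbitrary, but re-seeding with the same number replays the same random draws, so a train/validation split comes out identical each run. A minimal sketch:

```r
# Same seed => same random sample, hence the same data split every run.
set.seed(2017)
split.a <- sample(1:100, 70)   # e.g. indices of a 70% training sample
set.seed(2017)
split.b <- sample(1:100, 70)   # re-seed and draw again
identical(split.a, split.b)    # TRUE: the split is reproducible
```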

  • @ironfrown 7 years ago

    Since this video was created, the UCI Machine Learning repository has moved to a new location. This means the web location shown in the script no longer works. However, I have updated the link to the lesson data in the video description.

    • @aajaykapoor 7 years ago

      The R source code for this video can be found here (some small discrepancies are possible):
      * visanalytics.org/youtube-rsrc/...
      This doesn't seem to be working; could you please post the updated link?
      Thanks!

    • @ironfrown 7 years ago +1

      Ajay Kapoor, do not copy the text of the link, as YouTube abbreviates it; click on it and the file will download.

  • @benediktusnugrohoadiwiyoto7046 4 years ago

    Hello prof. Thank you for all of your lessons; they are really helpful. My question is: how do we do the back-transformation of log10 for reporting purposes? And what does the model equation look like? Thank you in advance.

    • @ironfrown 4 years ago +1

      Hi Benediktus, the answer is at the end of the video, where I discuss the performance results: the inverse of log10(x) is 10^x.
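In code, the back-transformation is a single power operation; the price value below is a hypothetical figure chosen for illustration:

```r
# Back-transforming a log10 model prediction to the original price scale.
price <- 13495                # hypothetical vehicle price
log.price <- log10(price)     # transform applied before modelling
10^log.price                  # inverse transform recovers 13495
```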

  • @mohammadumam897 6 years ago

    What if, after eliminating some extreme values, the R-squared becomes smaller instead?

    • @ironfrown 6 years ago

      Mohammad Umam, extreme values are cases that generate high residuals, as well as those with high leverage on the model, i.e. their removal causes large changes to the model formula. If you remove too many data points in one go, or your data set is small, it is possible that R-squared will go down. If the change was small, do not worry. Also watch the F-ratio, which is likewise used to see if the model improves: if the F-ratio grows, this is an indication that your change is beneficial. Ultimately, model validation can help determine which model is better.
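The check described in the reply — flag influential points with Cook's distance, refit, and compare F-statistics — might look like this (a sketch on mtcars as stand-in data; the 4/n cutoff is a common rule of thumb, not the only choice):

```r
# Remove high-influence points by Cook's distance, then compare F-ratios.
fit <- lm(mpg ~ wt + hp, data = mtcars)
keep <- cooks.distance(fit) <= 4 / nrow(mtcars)   # 4/n rule of thumb
fit2 <- lm(mpg ~ wt + hp, data = mtcars[keep, ])  # refit without them
c(before = summary(fit)$fstatistic[[1]],          # F-ratio, full data
  after  = summary(fit2)$fstatistic[[1]])         # F-ratio, cleaned data
```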

  • @muhammadsaleemkhan5761 2 years ago

    Many thanks, nice video. Can you please check the link for the R source code? It is not working. Thanks.

  • @harithsyafiqhalim4996 3 years ago

    Hello sir, how do we check for non-linearity if the variables are factors instead of numerical?
    Or do we just fit the full model and then check for linearity from that full model?

    • @ironfrown 3 years ago

      When your nominal variables are dummy encoded, you of course do not have continuity of values, and the same goes for any ordinal variables. In both cases non-linearity can be overlooked for these independent variables, as the model will fit those data points. The issue of non-linearity is still applicable to ordinal variables.

    • @ironfrown 3 years ago

      You must encode your factors; do not rely on their ordering, as it is essentially arbitrary.
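For concreteness, R's lm() dummy-encodes factors automatically through model.matrix(); the sketch below makes the encoding explicit for a hypothetical fuel-type column (the column name is an illustration, not from the lesson data):

```r
# Dummy encoding of a factor, as lm() would do internally.
fuel <- factor(c("diesel", "gas", "gas", "diesel"))
model.matrix(~ fuel)   # an intercept column plus a 0/1 indicator column
```

The first factor level becomes the baseline absorbed into the intercept, so a two-level factor produces a single 0/1 indicator, not two.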

  • @sambad8429 6 years ago

    Shouldn't it be sqrt(vif(fit)) instead?

    • @ironfrown 6 years ago

      Samba D, if you look at the way VIF is calculated, vif(var_j) = 1/(1 - R2(var_j)). This means that if you create a regression model predicting variable var_j from the remaining predictors, then VIF is the reciprocal of one minus that model's coefficient of determination R2. If R2 > 0.8 then VIF > 5; if R2 > 0.9 then VIF > 10. While the threshold of 5 or 10 is arbitrary, this explains the choice. If you want to use sqrt(vif), then use correspondingly smaller thresholds.
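The vif = 1/(1 - R2) relationship can be verified by hand: regress one predictor on the others and plug its R2 into the formula. A sketch on mtcars as stand-in data (the vif() function mentioned in the thread comes from the car package; the computation below uses only base R):

```r
# Hand-computing the VIF of 'wt' in a model mpg ~ wt + hp + disp:
# regress wt on the remaining predictors and apply vif = 1/(1 - R2).
fit.j <- lm(wt ~ hp + disp, data = mtcars)
r2.j <- summary(fit.j)$r.squared
vif.wt <- 1 / (1 - r2.j)   # > 5 (or > 10) would flag collinearity
vif.wt
```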

    • @sambad8429 6 years ago

      Thank you sir for this clear answer :)

  • @killa14108 2 years ago

    Hi Sir, I see that the final model for your multiple regression after backward elimination only uses two variables: Peak.rpm and Curb.weight. When testing the final model with the validation/test set, can't we just do this:
    valid.sample$Pred.Price

    • @ironfrown 2 years ago

      In the process of creating the model we've been manipulating the training data, slowly dropping columns. However, we have never touched the validation data set, so to predict the outcome on the validation set I've used the corresponding subset of its columns as well.
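The point about keeping the validation set untouched can be sketched as follows: fit on the training rows only, then call predict() on the held-out rows, which pulls just the columns the model needs. This uses mtcars as stand-in data; the lesson's real data set has different column names and the Price variable is log10-transformed before modelling.

```r
# Fit on training data only; predict on the untouched validation rows.
set.seed(2017)
idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]
fit <- lm(mpg ~ wt + hp, data = train)
valid$Pred.mpg <- predict(fit, newdata = valid)  # uses valid's own columns
cor(valid$Pred.mpg, valid$mpg)                   # evaluation on validation
```

With a log10-transformed target, the predictions would additionally be back-transformed with 10^pred before computing the correlation on the original scale.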

  • @xymabuka3538 7 years ago

    Great video, but the link to the data is not working. I have cleaned and prepared the data to save you guys time.
    The data is formatted to be suitable for the R code in the description above.
    Download link: app.box.com/s/1rieq2r1fensn4bjxu4m32wzdn1sbvnt
    The data file's name is Auto.csv, with 205 rows and 26 columns.
    After importing the data into R, when imputing the NA values, please note:
    auto$Num.of.doors

    • @ironfrown 7 years ago

      And I have also found the new home the data moved into, and updated the link again!