Thank you, Professor, you are the best! I wish you 100 years of teaching!
Since this video was created, the UCI Machine Learning Repository has moved to a new location, which means the web address shown in the script no longer works. However, I have updated the link to the lesson data in the video description.
Very useful in helping me better understand the stats graduate course I am taking. A great complement, thank you!
Thank you so much for this video! I am a beginner in statistical modeling and R, and this helped me a lot in understanding how to analyze a model, which I needed for my homework.
5:45 I don't like your regression: where is the normalization of features (into the 0:1 interval, for example)?
Regression linearly scales its variables, so scaling into the 0:1 interval is not needed; the regression will do so automatically. However, normalisation of some skewed variables, e.g. using log / sqrt / square transforms, could improve the regression, but this needs to be tested. Such unnormalised variables can be unnecessarily rejected in the process demoed in this video (which aims at a simple case for learning purposes).
I agree with you that variables should ideally have a similar distribution of variance, and here normalisation could improve the outcome.
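For illustration, here is a minimal sketch of the point above, using a made-up log-normally distributed price variable (not the video's data set): a log transform pulls a right-skewed variable much closer to symmetry, though whether it actually helps the regression still needs to be tested on your data.

```r
# Hypothetical right-skewed variable, e.g. a price; not the lesson's data
set.seed(42)
price <- rlnorm(1000, meanlog = 9, sdlog = 0.5)

# Simple moment-based skewness (base R has no built-in skewness function)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(price)       # strongly positive: long right tail
skew(log(price))  # close to zero after the log transform
```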
May God BLESS YOU!!!!! Thank you so much!
Could you please explain what the function impute does here? Does it replace the missing values with the mean/median value? Also, why did you replace NAs with question marks? And how did you define the variable 'Num of doors' to be a categorical variable?
Dibyajyoti Chowdhury, as you said, impute replaces missing values with either the mean or the median. I do not replace NAs with question marks; quite the opposite: missing values in the data files have been coded with question marks, and so question marks are replaced with NAs while reading the file. As the number of doors is coded in the file as words, such as "four" or "two", the variable will automatically become a factor on reading the data frame. It would be different if we were to assign the values ourselves, in which case these variables would become characters.
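As a sketch of both mechanics (the file name and columns below are stand-ins for illustration; the actual lesson data is linked in the video description):

```r
# In the lesson, the data file would be read with something like:
# auto <- read.csv("imports-85.data", na.strings = "?", stringsAsFactors = TRUE)

# A tiny inline equivalent showing the same behaviour:
txt <- "num.of.doors,price
four,13495
?,16500
two,?"
auto <- read.csv(text = txt, na.strings = "?", stringsAsFactors = TRUE)

is.factor(auto$num.of.doors)  # word-coded column becomes a factor
is.na(auto$num.of.doors[2])   # the "?" was read in as NA
is.numeric(auto$price)        # with "?" mapped to NA, the column stays numeric
```

Note that since R 4.0, stringsAsFactors defaults to FALSE, so it must be set to TRUE explicitly to get the automatic factor conversion described above.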
Thank you very much! What happens if missing data makes up over 90% of an independent variable? Should we delete this variable?
@yiyuan 90% missing values in an independent variable is a lot; virtually all attempts at dealing with this, apart from dropping the variable, will harm your predictive model. Eliminating rows with missing values will leave you with less than 10% of the data for training; replacing NAs with the mean will falsify the distribution; and if you think imputation with decision trees or k-NN would work (or rather, prediction of the missing values from the remaining independent variables), then you really do not need this variable for predicting the dependent variable (due to redundancy). However, if the missing values are a "feature" of data entry and not random (say, a missing value was entered instead of zero), then you had better investigate and fix the way the data was recorded, in which case you could deal with 90% of such intentional missingness.
@@ironfrown Thank you! If I understand you correctly, it is better for me to either 1) delete this variable, or 2) if the variable is important to the analysis, investigate and fix whatever caused such huge missingness. Am I correct?
@@ironfrown Also, I now have a multiple regression which contains many independent variables with missing values varying from 1% to 99% of the total observations. I will proceed with my analysis in the following way: 1) delete variables with over 90% missing values (if the missingness is not a feature of data entry); 2) for variables with missing values over 60% but less than 90%, I will run a regression of the dependent variable on each of these independent variables respectively, and if it is not significant, I will also delete the variable; 3) for independent variables with missing values under 60%, I will use a k-NN approach to impute the missing values. Do you think my approach works? Thank you very much.
@@yiyuanzhang6335 Any variable which has over 20% missing values is a good candidate for removal. If your variable has over 50% missing values, I would not touch it. If it is between 20-50%, you can try imputing the missing values. In all cases, validate the resulting model on test data (which has no missing values) to see which option gives you the best outcome. Never use imputed variables for testing, as the results will be misleading.
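Those rules of thumb can be sketched in a few lines of R (the data frame here is made up purely for illustration):

```r
# Hypothetical data frame with different amounts of missingness per column
df <- data.frame(
  a = c(1:9, NA),                    # 10% missing
  b = replace(rnorm(10), 1:4, NA),   # 40% missing
  c = replace(rnorm(10), 1:7, NA)    # 70% missing
)

miss <- colMeans(is.na(df))  # fraction of missing values per variable

drop_cols   <- names(miss[miss > 0.5])                # too sparse: drop
impute_cols <- names(miss[miss > 0.2 & miss <= 0.5])  # try imputation
```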
Very good and funny videos that bring a great sense of entertainment!
When I applied impute() to number of doors, it became all NAs. Why?
How do I increase the text size of the labels (Horsepower, City.mpg, Peak.rpm, Curb.weight, Num.of.doors, Price)?
Very nicely explained!
Thank you Professor.
And again, the data set is not available. Wouldn't it be better to have it all (data set, Jupyter notebook) in, say, GitHub?
I am not sure what you mean by the data set not being available. Please follow the link in the video description and you'll find the data there. As the data used in my videos was copyrighted, I resisted simply copying and redistributing it. Any links in the script itself may not be accurate, as the script is quite old by now. At that time, running R in Jupyter was a black art which not many beginners could master, so this seemed the simplest means of distribution.
@@ironfrown Well, I mean that the link provided leads to a data (and names) file. Apparently this can easily be used with pandas, but I expected a CSV (or Excel) file (currently I'm working with R). I mentioned the Jupyter notebook because you talked about it.
@@danielj5851 I'll check this; they must have given up on R.
Thanks for your response. Just one more thing: I applied a log transformation to Sale Price, and now when I want to convert it back to the normal price I am getting infinite values. I did 10^SalePrice. Please advise what the correct way of converting back is.
Log does not like zeros; just add a small value before taking the log (to be deducted later, after raising to the power). It should fix the problem.
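Here is a minimal sketch of that shift-and-invert approach (assuming base-10 logs were used for the transform; if the natural log, log(), was used instead, invert with exp() rather than 10^):

```r
# Hypothetical prices, including a zero that would break a plain log
price <- c(0, 500, 13495)

log_price <- log10(price + 1)  # shift by 1 so log10(0 + 1) = 0, not -Inf

recovered <- 10^log_price - 1  # invert with the SAME base, then undo the shift
all.equal(recovered, price)    # TRUE
```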
Very Helpful, thank you.
Hey, very nice and clear explanation. Thanks
Unable to use the pairs.panels() function. How do I fix it?
At the beginning of the script (lines 17-19), you will find a number of commented-out statements which load the required libraries. Uncomment them and run them once only. The package 'psych' is the one which provides pairs.panels. Good luck!
What if train.corr and the adjusted R squared are not nearly equal? What is wrong in that case, and how do I fix it?
@danish mumtaz, adjusted R2 and the squared correlation are mathematically very close for simple regression. However, once we start adding more variables, their values will diverge; there is nothing wrong with this. In any case, cross-validation measures, such as MAE, RMSE or correlation, are better indicators of model quality and performance than adjusted R2, which is only correct when all of the regression assumptions/requirements have been met.
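As a sketch of computing those cross-validation measures on held-out data (synthetic data, not the lesson's; MAE and RMSE are the standard formulas):

```r
# Synthetic example: y depends linearly on x, plus noise
set.seed(1)
df <- data.frame(x = runif(100, 0, 10))
df$y <- 2 * df$x + rnorm(100)

train <- df[1:70, ]
test  <- df[71:100, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

mae  <- mean(abs(pred - test$y))       # mean absolute error on held-out data
rmse <- sqrt(mean((pred - test$y)^2))  # root mean squared error
summary(fit)$adj.r.squared             # adjusted R2, for comparison
```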
Many thanks, nice video. Can you please check the link for the R source code? It is not working. Thanks.
I see what you mean; unfortunately visanalytics.org has reached the end of its life and those links have expired. Give me a few days and I'll provide a new home for these sources and update the video descriptions. I am sorry about the inconvenience!
Ok, I have fixed links in this video, I will progressively correct the links in my other videos soon!
@@ironfrown Many thanks for your reply and for updating the links. I want to ask a question regarding linear regression variable selection. My outcome is continuous lung function and my main exposure variable is NO2 air pollution (quintiles). In the linear regression, when I add a deprivation score to my model, the sign of the NO2 coefficient changes (from negative to positive), and I am not sure what the reason could be. I checked for collinearity and there is none. I would really appreciate your help in this regard. Is it possible to have your email so I can contact you, please? Thanks.
Can you re-upload the data for this lesson? I tried to open the link, but Google said it has been changed.
Thanks Agus, and sorry for the delay; I have been traveling a bit. Indeed, UCI have moved their repository, so I have modified the link to the data set used in this video. This affects the video description, though not what is shown in the comments of the script within the video itself. Good luck with this.
And they moved it again, so I have re-updated the link again :)
Hey, that was really great!
Thanks a lot for the video!
The video sound is pretty good, beyond my expectations.
Hi, I am getting an error at auto$price.
Ajay Kumar, R is case-sensitive; you are using both Price and price, so one of them has no values in it!
How do I calculate new columns of data?
Anthony WongKL, I am not sure I understand the question. Let me assume you are asking how to create a new column, such as c, in a data frame df based on a formula, such as +, using values from other columns, such as a and b? If so, you'd execute an R statement: df$c <- df$a + df$b
I get "impute function not found" despite having loaded the same libraries. This stops my run.
Satish Rahi, make sure that the packages for all the required libraries have been installed first. Note that the sample R code has commented-out "install.packages" statements, which need to run once only, so uncomment them for a single run. Once the package "Hmisc" is properly installed and the "library" statement executes successfully, the "impute" function will work.
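A minimal sketch of that setup, with a base-R mean imputation shown alongside for illustration (the Hmisc lines are left commented out, as in the lesson script, so this runs without the package):

```r
# One-time setup, uncomment on the first run only:
# install.packages("Hmisc")
# library(Hmisc)
# x_imp <- impute(x, mean)   # Hmisc's impute: replaces NAs with the mean

# Base-R equivalent of mean imputation:
x <- c(21.5, NA, 19.0, 24.5)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x  # no NAs remain; the gap now holds the mean of the observed values
```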
Uninstall the package, then install it again; when prompted "Do you want to install from sources the package which needs compilation?", answer no instead of yes. It should work after this.
Can you share your .Rmd notebook so I can reverse engineer it?
Nicco, if you check the description of the video, it has the links to both the data and the .R script. Enjoy!
Seems good, but the view is too small and I cannot read it.
@Jabab Namgay, indeed the RStudio format is not ideal for YouTube presentations. However, you can get the R script from the links included in the video description. My new R videos utilise Jupyter Notebooks, which allow a much clearer display of R code.
@@ironfrown thank you sir
great!