Good job on this video series! Question on the code below, near the end of video/notebook 8. In my run, the first item in the grid_scores_ list is:
mean: 0.97333, std: 0.03266, params: {'weights': 'distance', 'n_neighbors': 18}, ...
When I check the type of the first item in the first list item (print(type(rand.grid_scores_[1][0])) as below), it returns:
How come the items in the list, e.g. mean: 0.97333, are not surrounded by braces {} if they are dicts?
thanks, Tom
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
print(rand.grid_scores_)
print(type(rand.grid_scores_[1][0]))
Great question! I'm not positive, but I think that rand.grid_scores_[1][0] actually refers to the 'params' dictionary. I can't say for sure without checking, but let me know what you find if you decide to check!
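For anyone checking today: grid_scores_ held named tuples (which is why the printed means aren't wrapped in braces, even though the params part is a dict), and it was removed in scikit-learn 0.20 in favor of cv_results_. A sketch of the replacement, assuming a recent scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_dist = {'n_neighbors': list(range(1, 31)), 'weights': ['uniform', 'distance']}

# n_iter controls the number of searches
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10,
                          scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)

# cv_results_ is a dict of parallel arrays: one entry per parameter setting tried
print(rand.cv_results_['mean_test_score'])
print(rand.cv_results_['params'])   # this part really is a list of dicts
print(rand.best_score_, rand.best_params_)
```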
Hello sir, I want to ask a couple of questions. 1. What do "{:^9}", "{:^61}", "{:^25}" actually mean? 2. Why do higher k values in KNN correspond to lower complexity? Is that just for KNN, or does it hold for other models too, for instance random forest?
1. That's code for string formatting. Some examples are here: pyformat.info/ 2. See this article: scott.fortmann-roe.com/docs/BiasVariance.html Hope that helps!
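To illustrate point 1: the ^ in a format spec centers the value within a field of the given width (9, 61, and 25 characters in those examples). A quick sketch:

```python
# '^' centers the text inside a field of the given width, padding with spaces
print('{:^9}'.format('abc'))         # pads 'abc' to width 9, centered
print(len('{:^61}'.format('data')))  # the result is always 61 characters wide
print('{:^25}'.format('training set'))
```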
Great video sir, but I have a question. You are using accuracy for selecting the model, but I have read that accuracy is not a sufficient measure of a model's performance in classification. Also, is a model with higher accuracy always better than a model with lower accuracy (in the case of classification)?
@Data School: Hi sir, I have been working with a customer churn dataset. It contains 21 features including the target variable churn, which is an integer type. While doing missing value treatment, I tried mean imputation on the dependents and city columns (both numeric in nature). The mean value was fractional (i.e. float64). This results in an error in train_test_split(): "TypeError: Singleton array cannot be considered a valid collection". I searched Stack Overflow, where they said the error might be caused by the float value. If I go with mean imputation, what would be a possible solution? I would be grateful for your response.
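Float64 values themselves don't break train_test_split; that TypeError usually means one of the arguments isn't an array-like of samples. A sketch of mean imputation that splits cleanly, assuming scikit-learn 0.20+ (the column names here are hypothetical stand-ins for the churn data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# hypothetical churn-style frame with missing numeric values
df = pd.DataFrame({'dependents': [1.0, np.nan, 3.0, 2.0],
                   'city': [10.0, 20.0, np.nan, 40.0],
                   'churn': [0, 1, 0, 1]})

# mean imputation yields float64 values, which split without any error
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(df[['dependents', 'city']])
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train.dtype, X_train.shape)
```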
Is there any benefit to using "-scores" instead of "abs(scores)"? They look like they do the same thing in this case. abs looks safer to me, because it will always return a positive number, no matter whether you use a loss function or a reward function.
The difference between the average RMSE with the Newspaper feature and the 10-fold cross-validated RMSE without the Newspaper feature seems, to me, quite negligible. Could you walk through the logic behind deciding when the difference is small enough to keep a feature versus when one should drop it? Thanks! And, BTW: your video series is amazing! Keep it up!
Thanks for your kind words! The simple answer is that you should always prefer a simpler model (fewer features), unless having more features provides a "meaningful" increase in performance. There's no strict definition of "meaningful"; it depends on context. Hope that helps!
Nice to see your video after a long time. I have some confusion that I hope will be cleared up by your response.
1. Please correct me if I've misunderstood: the drawback of train/test split is high variance, meaning differences between the training and testing data will affect the testing accuracy.
2. I really want to know what the random_state parameter does when you change it from 4 to 3, 2, 1, and 0.
3. For classification, you mentioned stratified sampling to create the K folds. How does it affect the accuracy of the model (for example, if out of 5,000 observations my dataset has 80% ham and 20% spam, versus 50% ham and 50% spam)?
4. Since you have only numerical features, you used accuracy as the metric to select the best features. What do you suggest if the dataset contains object dtypes like date formats, or string objects like text data?
I am a new student of data science, so I am sorry for the long comment.
unique raj Great questions! My responses:
1. The disadvantage of train/test split is that the resulting performance estimate (called "testing accuracy") is high variance, meaning that it may change a lot depending upon the random split of the data into training and testing sets.
2. Try removing the random_state parameter and running train_test_split multiple times: every time you run it, you will get a different split of the data. Now use random_state=1 and run it multiple times: every time, you will get the same exact split. Now change it to random_state=2 and run it multiple times: again you will get the same exact split every time, though it will differ from the split produced by random_state=1. Thus, the point of random_state is to introduce reproducibility into your process. It doesn't actually matter whether you use random_state=1 or random_state=9999; what matters is that if you set a random_state, you can reproduce your results.
3. In this context, stratified sampling relates to how the observations are assigned to the cross-validation folds. The reason to use it here is that it will produce a more reliable estimate of out-of-sample accuracy. It doesn't actually have anything to do with making the model itself more accurate.
4. My choice of classification accuracy as the evaluation metric is not actually related to the data types of the features. Your features in a scikit-learn model will always be numeric. If you have non-numeric values that you want to use as features, you have to transform them into numeric features (which I will cover in a future video).
Hope that helps!
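Point 2 above can be seen directly in code; a minimal sketch using the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# same random_state -> the identical split on every run
a = train_test_split(X, y, random_state=1)
b = train_test_split(X, y, random_state=1)
print((a[0] == b[0]).all())   # True: X_train is reproduced exactly

# different random_state -> a different (but still reproducible) split
c = train_test_split(X, y, random_state=2)
print((a[0] == c[0]).all())   # almost surely False: the rows are shuffled differently
```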
While explaining the graph of model complexity vs. model accuracy, you mentioned the best model is a middle value, k = 13, where it balances bias and variance. But you also mentioned that the best value of k should produce the simplest model, and higher values of k give a simpler model, so k = 20 is taken as the lowest-complexity model. I am confused about this.
At minute 22:15: why is a KNN model with a higher k less complex than one with a lower k? From the perspective of computational complexity it should be the opposite, because fewer neighbours have to be found, shouldn't it? And of course: thanks for the great video.
Why does a higher K produce a simpler KNN model? I thought the higher the K, the more complex? 22:09 EDIT: Alright, it makes sense now. I rewatched the episode on KNN and looked at the partition diagrams. In contrast to the other models introduced, a higher value makes KNN less complex.
Hello Kevin, could you please elaborate a little on why the bigger K is, the less complicated the model becomes, and therefore why K=20 should be selected instead of K=14? Thanks, Terry
Hello dear sir, your video tutorial is really appreciated; you are doing a great job. Could you make a tutorial on image datasets? I mean, if we have images and want to build a dataset like MNIST or Fashion-MNIST, how do we do that? Thanks, sir.
After we use K-fold cross-validation for selecting optimal tuning parameters, choosing between models, and selecting features, how do we build the final model? It looks like cross-validation only helps us select a model or tune its parameters. If we want to build a model and make predictions, do we still need to split the data into training and testing sets and then fit the chosen algorithm, such as KNN? For instance: knn.fit(X_train, y_train) then knn.predict(X_test). Is that the correct concept?
No, cross-validation is an alternative to train/test split. Once you choose your tuning parameters, you fit the model on all of the data before making predictions. Hope that helps!
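That workflow as a sketch (the dataset and parameter range here are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# step 1: use cross-validation only to choose the tuning parameter
k_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X, y, cv=10, scoring='accuracy').mean()
            for k in range(1, 31)}
best_k = max(k_scores, key=k_scores.get)

# step 2: refit the chosen model on ALL of the data, then make predictions
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print(best_k, knn.predict([[3, 5, 4, 2]]))
```

No separate train/test split is needed here, because cross-validation already provided the out-of-sample performance estimate.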
Another question: how do you find the final numeric values for the parameter coefficients that minimize the loss function? In your linear regression example, how do you find beta_0, ..., beta_3 in y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper? Are beta_0, ..., beta_3 averages over the K models, i.e. beta_i = (sum over k of beta_ik)/K for i = 0,1,2,3 and k = 1,...,K? Or, after CV tells you that y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper is the best model, do you just run this on ALL data points to compute the coefficients?
+donkeyenvious Once you have selected the best model via cross-validation, you train that same model (meaning the same features and model tuning parameters) on all of your training data, and that tells you the coefficients to use when making actual future predictions.
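As a sketch of that final refit step, with synthetic advertising-style data standing in for the real dataset (the true coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic TV/Radio/Newspaper-style features and a known linear response
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 5 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(0, 1, 200)

# after CV has chosen the model, refit it once on ALL the data;
# these single fitted values (not fold averages) are used for future predictions
lm = LinearRegression().fit(X, y)
print(lm.intercept_)   # estimate of beta_0
print(lm.coef_)        # estimates of beta_1..beta_3, one per feature
```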
While applying cv=10 in the cross_val_score function, we get "All the n_labels for individual classes are less than 10 folds," but when we reduce it to cv=5 we get a result. Can you please help explain why we get this error?
How do you write the code on line 7 at 9:20 using the model_selection module? KFold seems to be set up slightly differently in that module and it's not really working for me. Thanks, love your videos.
Extremely useful, and thank you for sharing. My question is about very, very large datasets. Is it okay to randomly sample a portion of the data and do this cross-validation on it? If the random sample is representative of the original data, will the cross-validation results for the sample be representative of the cross-validation results for the original data?
That should work, if the random sample is truly representative and there is not severe class imbalance. (If the positive class for the response value you are studying is extremely rare, then sampling may remove too many instances of that class, which would skew the results.)
Newbie question here: after the whole cross-validation process, do fitting the data and making predictions need a different process? Or do you just fit the training data as normal and then make predictions on your test data?
Sorry, it depends a lot on what you mean by "a different process" and "you just do as normal"... it's hard for me to say without a lot more details! I recommend reviewing the notebooks or videos here: github.com/justmarkham/scikit-learn-videos
Around 29:40, when you're redoing the cross-validation with just the TV and Radio columns, wouldn't you need to refit the linear model? Because the previous one was fitted using TV, Radio, and Newspaper, but this new one is only involving TV and Radio.
Great video! The only line of code that I needed to update was reshaping the data to pass into the binarize function and then flattening the returned ndarray:
y_pred_class_2 = binarize(y_pred_prob.reshape((192, 1)), threshold=0.3).flatten()
I found lots and lots of videos about ML and cross-validation. I watched them all and tried to follow, but it was very hard to understand. But you make it easier. I was very confused by cross-validation, and now it's more than clear. Thank you very much for this video and for your channel.
@Data School I have a question about minute 29:44: you conclude that we choose the second model because its score is less than the other one. But we already negated and flipped the sign of the mean_squared_error, so why do we choose the smaller number? I thought it should be the higher one. Thanks!
This is very clear, a bit slow but very clear. Question: do you know what the impact of the seed value is on, e.g., 10-fold cross-validation? I'm assuming from your video that it changes where the test set starts within the training data, because in Weka I get a different accuracy every time I use seeds 1 through 10.
For KNN, lower values of K result in a model with lower bias and higher variance. This article explains that point further: scott.fortmann-roe.com/docs/BiasVariance.html
Having problems with the code? I just finished updating the notebooks to use *scikit-learn 0.23* and *Python 3.9* 🎉! You can download the updated notebooks here: github.com/justmarkham/scikit-learn-videos
These videos are so well done, so clear and easy to follow, that they make ML seem like child's play. Congratulations, great teaching.
Thanks for your kind words!
can't agree more! best video resource on cross validation on the internet.
*Note:* This video was recorded using Python 2.7 and scikit-learn 0.16. Recently, I updated the code to use Python 3.6 and scikit-learn 0.19.1. You can download the updated code here: github.com/justmarkham/scikit-learn-videos
Updated link shows me this message :
Sorry, something went wrong. Reload?
Is this still relevant in 2020?
I almost gave up Python until I found your channel. You are my savior.
Wow, thank you! Good luck with your Python education! :)
Amazing video! Very instructive. And the presenter has a very clear voice and pace.
Thank you so much! I'm glad you liked it!
Kevin, I appreciate the slow but thorough walk through, you and StatQuests are awesome people. Thank you.
Thank you so much!
Sir, firstly I would like to thank you a lot, because you spent so much time making this video. This is really helpful for early-stage ML learners; keep it up, sir. I stopped this video midway to say thanks; you saved me many hours in understanding cross-validation.
That's awesome to hear! Thanks so much for letting me know! 🙌
In some places I’ve seen people run K fold cross validation on the entire dataset and in other places I’ve seen people run it only on the training set. They then calculate on the test set separately. Is there any recommendation regarding which practice makes more sense? Great video!!
This is beyond the scope of what I can get into in a YouTube comment... sorry!
Brilliant explanation. But can you (or somebody) please help me understand why you chose k=20? Couldn't it be 13 or 17?
I can't remember exactly, but I probably chose it because it's the simplest model among those options with the best performance. For more details on what I mean by "simple", see here: scott.fortmann-roe.com/docs/BiasVariance.html
Awesome explanation....!
Thanks!
I know this is from a long time ago, but why is it that the complexity decreases as the value of K increases? Shouldn't it be the opposite, since for greater values of K we'd need to iterate over more data from y?
Thanks for the videos, the entire playlist is amazing
Great question, and thanks for your kind words! It's a tricky topic, but the best way to understand it is by reading this article: scott.fortmann-roe.com/docs/BiasVariance.html
Hope that helps!
Excellent. Thanks for sharing your expertise.
Thank you!
Great ! Thank you very much for sharing such a clear explanation 🙌
You're very welcome!
I usually don't subscribe to any channel, but you earned this sub from me. Keep going; lots of love.
Thank you! 🙌
Thank you for your great video. I know how to get the accuracy of my classifier, but how can I get my actual predictions when using cross-validation?
I think you would use cross_val_predict: scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html
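A minimal sketch of cross_val_predict, using iris and logistic regression as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# each observation is predicted by the model from the fold in which it was held out
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)
print(y_pred.shape)               # one prediction per observation
print(accuracy_score(y, y_pred))  # accuracy of those out-of-fold predictions
```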
best explanation, easy to understand. thank you so much
Thank you!
The best explanation I have ever seen.
Wow, thank you so much!
Great videos. Please keep up the good work. We really need your lectures. Thank you so much.
Thanks for your kind words! I will definitely release more videos! :)
Waiting eagerly for a new series because I have finished watching all your videos. Thank you so much for teaching me Python. You have educated me; you are a teacher to me and I respect you. Thank you so much.
Awesome! Thank you for watching and learning! :)
Outstanding work once again Kevin. A treasure to newcomers in the area.
Thank you!
First of all, let me thank you for the extraordinary work you are doing.
You are explaining extraordinary things in a very ordinary way.
I had a query regarding fit with cross-validation.
When we do linear regression with a SINGLE train/test split dataset, we get a SINGLE fit to predict over the test data.
Whereas when we do cross-validation (e.g. cv=10) in linear regression, we have 10 training datasets.
But do we also have 10 fitted models, one for each training dataset? Or do we have an average of all 10 fitted models?
I am able to get the coefficients and intercept of the fitted model via a single training dataset:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=1)
lm = LinearRegression()
fitting = lm.fit(X_train, y_train)
fitting.coef_
fitting.intercept_
How do I get the intercept and coefficients for the fitted model via cross-validation?
And what is the significance of cross_val_predict? Does it have any relation to my query?
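For reference: cross-validation with cv=10 fits 10 separate models, one per fold; they are not averaged. Assuming scikit-learn 0.20+, cross_validate with return_estimator=True exposes each fold's fitted model, so you can inspect all 10 sets of coefficients (synthetic data used here for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# synthetic regression data with a known linear relationship
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(0, 0.1, 100)

cv_results = cross_validate(LinearRegression(), X, y, cv=10,
                            return_estimator=True)

# ten separately fitted models, one per fold -- not an average
for lm in cv_results['estimator']:
    print(lm.intercept_, lm.coef_)
```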
Thank you so much. This is very helpful^^
Glad it was helpful!
Hi Kevin,
Are you going to provide any tutorial on neural networks? The currently available tutorials are great, but they mainly use very large datasets such as MNIST, and they don't pay enough attention to how to decide on the architecture of the model, for instance the number of inputs and outputs, the number of hidden layers, and the number of nodes in each hidden layer. A tutorial using smaller and simpler datasets such as iris or wine would be very highly appreciated.
Thanks so much for the suggestion! I will strongly consider creating this in the future.
Thanks a ton, perfectly explained the concept and the code
Great to hear!
This is the best sklearn tutorial I have come across.
Thank you!
Thank you for your super clear and thorough videos. Do you talk about the validation/dev set in your workflow course? I have a hard time with that concept, but my teachers require that we use it. It gets confusing for me when I try to select a model. Because I've heard you should tune your models to the validation set, and then get an "unbiased" score on the test set. But they say you're supposed to select your model according to the validation score. I don't understand why, because I thought that was biased because we used it to tune the models. Is the validation score the better of the two to use to select the model because if you use the test set to select the model, then we're biased to the test set and then have no unbiased score left? I almost feel like there should be one more split in the data, to create a second test set. So the first test set will be used to choose the model, and the second test set is used to show an unbiased score. There is honestly a lot of conflicting information about this entire topic on the internet. I've seen people say the model should be chosen based on the test set, and the validation set is only present to help tune the model. Cross validation is not always possible, because it can be very slow with big datasets.
Yes, I do talk about this issue in my newest ML course. It's a complicated topic, but the bottom line is that whether or not you "need" to do an extra split (train/test/validation instead of just train/test) depends on whether your goal is only to choose the optimal model or also to estimate that model's performance on out-of-sample data. Also, you need to be mindful that an extra split can have a negative side effect if you have an insufficient amount of training data.
If you decide you want to enroll, you can do so here: gumroad.com/l/ML-course?variant=Live%20Course%20%2B%20Advanced%20Course
Hope that helps!
@@dataschool Thanks for the reply!
Great video, thanks a lot!
+Matina Fragkogianni You're very welcome, I'm happy to help!
You are awesome...you teach everything in a simple way....ask for the feedback....and make them much better.....And the best thing is you make everything (I REPEAT EVERYTHING) easy for us....So sweet of you :)
That is so kind of you to say! Thank you so much 😄
How do we test on an independent dataset other than X and y, i.e. other than the 10-fold cross-validated dataset? Any response would be highly appreciated. Thank you.
Could you please recommend a couple of books, or preferably videos/notes, regarding multi-agent systems?
Hopefully something as thoroughly explained as your lectures.
Many thanks
I watched a dozen videos on this topic. I was pretty certain I understood it, but I still had a few questions. Your video cleared those questions up amazingly! Thank you.
That's awesome to hear! 🙌
Thank you for your video. How is the cross validated accuracy calculated? Is it simply the number of good classification over the total number of data used in the testing set?
It's the model's accuracy on the testing set for a given fold.
@@dataschool Thank you, but my question was more about the term "accuracy". What is the formula used to calculate this accuracy?
Got it! Accuracy is just the percentage of correct predictions.
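In scikit-learn terms, that is what accuracy_score computes; a tiny sketch:

```python
from sklearn.metrics import accuracy_score

# accuracy = number of correct predictions / total number of predictions
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```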
You are charismatic, thank you!
Thanks!
Always eager to learn. You demystified the subject and you even made it easy for a 75 year old brain to comprehend ML methods. :) :) :)
Great to hear!
I am facing a strange error. If I run my Python scripts from the terminal, all the modules except seaborn are imported fine. However, if I use Spyder IDE, seaborn is imported okay but sklearn is not. Can someone help me out?
Whenever I need a reference, I always end up at your videos after a long search. That proves you are THE best teacher.
What a nice thing to say! Thank you! :)
When we use cross_val_score with X and y, is the model trained there, or do we need to train it again?
Kevin, you are very good in explaining. I wish I found you earlier. I just subscribed to receive all future videos and thank you for all the explanations.
Thanks for your kind words! 🙏
I am so happy to follow this kind of lecture, because your teaching style is engaging and your language clarity is excellent, so I am gaining knowledge and input for my ML thesis, as I am working on a classification (prediction) problem.
Glad to hear that my videos are helpful to you! Good luck with your thesis!
Amazingly explained. Suitably slow for a newcomer to machine learning; easily understandable. Keep it up.
Awesome, thank you! :)
I really really appreciate your efforts..this series is so helpful in learning.
I'm glad to hear the series is helpful to you! :)
I couldn't understand the feature engineering concept at 31:22. Any example?
I'm getting the following error at 9:20, please help:
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(25, n_splits=5, shuffle=False)
# print the contents of each training and testing set
TypeError: __init__() got multiple values for argument 'n_splits'
I recommend just skipping that code instead of trying to fix the error.
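For anyone who does want to run it: the model_selection version of KFold no longer takes the number of observations in the constructor; the data is passed to split() instead. A sketch:

```python
from sklearn.model_selection import KFold

# the number of observations is no longer a constructor argument;
# pass the data (or anything of the right length) to split() instead
kf = KFold(n_splits=5, shuffle=False)
for iteration, (train_idx, test_idx) in enumerate(kf.split(range(25)), start=1):
    print(iteration, train_idx, test_idx)
```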
REALLY well done!! I found this video extremely helpful. Keep up the good work
Thanks for your kind comment! :)
Thanks for the Video. This is the best Video on KNN Cross validation that I have watched. Appreciate your effort...
Thanks! :)
Just chiming in to thank you for the series; it really helps demystify things and fill in the gaps. Looking forward to working through it.
You're very welcome!
What should I do if I want to apply MinMaxScaler or StandardScaler so that it fits on the training folds and only transforms the test fold within cross_val_score? The rule of thumb is to apply these techniques to train and test separately, so how can I do this? cross_val_score doesn't have a specific argument for it.
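One common approach is to wrap the scaler and model in a Pipeline, so that each CV fold fits the scaler on its training portion only and applies that same transformation to the held-out portion; a sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# the pipeline refits the scaler on each fold's training data only,
# then transforms that fold's test data with the fitted scaler
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=10, scoring='accuracy')
print(scores.mean())
```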
Hey, I have a doubt. If I use scoring='r2' in cross-validation and get negative results, should I flip the signs the same way as with MSE scoring?
I really love your videos. They are so simple and to the point! Thanks for making such videos. :)
Thanks very much for your kind words!
HOW DOES THIS VIDEO NOT HAVE LIKE A MILLION VIEWS?!? So good. Thank you, man!
HA! Thank you :)
Good video. Maybe it's my level of English, but I don't understand why at 24:00 we compare the accuracy of KNN with linear regression, when KNN is used for classification and linear regression is used for regression. I know cross-validation works for both, but the response variable should be different in each case, since for KNN it should be categorical and for linear regression it should be continuous.
Great series, enjoying them so far. Thanks for the good content :)
I'm comparing KNN with Logistic Regression, which is used for classification. Hope that helps!
I used scoring='r2' and some of the scores are more than one. How is 'r2' calculated? Can you share some reference resources, please?
Just like Andrew Ng, you are a genius: he gave a clear explanation of the theory, and you produced a mesmerizing implementation in such a simple way. This is the most awesome machine learning video I have watched so far. You helped me more than you could ever imagine. Thank you, sir!
Thanks very much for your kind words!
you are most welcome sir
Excellent series! This is the first time I've studied machine learning. You are doing an outstanding job of transforming it from a science fiction term into a tangible subject. I really appreciate these videos!
BrothersFreedive You're very welcome! I greatly appreciate your kind comments!
I just want to know how we can use cross-validation with a non-negative matrix factorization model.
Good job on this video series!
Question on the code below near the end of video/notebook 8. In my run the first item in the grid_scores_ list is:
mean: 0.97333, std: 0.03266, params: {'weights': 'distance', 'n_neighbors': 18}, ...
When I check the type of the first item in that list item (print(type(rand.grid_scores_[1][0])) as below), it returns:
How come the items in the list, e.g. mean: 0.97333 are not surrounded by braces {} if they are dict's?
thanks, Tom
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
print( rand.grid_scores_)
print( type(rand.grid_scores_[1][0]))
Great question! I'm not positive, but I think that rand.grid_scores_[1][0] actually refers to the 'params' dictionary. I can't say for sure without checking, but let me know what you find if you decide to check!
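For anyone on a newer scikit-learn: grid_scores_ was removed in favor of cv_results_, a plain dictionary of parallel arrays, which makes the types much easier to inspect. A sketch using the same iris/KNN setup (the parameter ranges are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_dist = {'n_neighbors': list(range(1, 31)),
              'weights': ['uniform', 'distance']}

# n_iter controls the number of searches
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10,
                          scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)

# cv_results_['params'] is a list of plain dicts, one per parameter setting
print(rand.cv_results_['mean_test_score'])
print(type(rand.cv_results_['params'][0]))
```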
Okay, I'm having a boy crush moment 🥵🥵🥵
Hello sir, i wanna ask a couple of questions
1. What do "{:^9}", "{:^61}", and "{:^25}" actually mean?
2. Why do higher k values in KNN correspond to lower complexity? Is that just for KNN, or also for other models, for instance random forest?
1. That's code for string formatting. Some examples are here: pyformat.info/
2. See this article: scott.fortmann-roe.com/docs/BiasVariance.html
Hope that helps!
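To expand on point 1, a quick sketch of those format specifiers: the number is the field width and the symbol is the alignment.

```python
# '^' centers the value in the given field width; '<' left-aligns, '>' right-aligns
print(repr('{:^9}'.format('ab')))   # '   ab    '  (9 characters wide, centered)
print(repr('{:<9}'.format('ab')))   # 'ab       '
print(repr('{:>9}'.format('ab')))   # '       ab'
```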
best tutorial in youtube
Thank you!
ValueError: Found input variables with inconsistent numbers of samples: [299, 150]
I'm sorry, I can't evaluate the cause of the error without knowing what code you wrote that caused the error. Good luck!
Great video sir, but I have a question: you are using accuracy for selecting the model, but I have read that accuracy is not a sufficient measure of performance for classification models. Also, is a model with higher accuracy always better than a model with lower accuracy (in classification)?
Accuracy can be a useful evaluation metric. See here for more: ua-cam.com/video/85dtiMz9tSo/v-deo.html
@Data School: Hi sir, I have been working with a customer churn dataset. It contains 21 features, including the target variable churn, which is an integer type. While doing missing-value treatment, I tried mean imputation on the dependents and city columns (both numeric). The mean value was a fraction (i.e. float64), which resulted in an error in train_test_split(): "TypeError: Singleton array cannot be considered a valid collection". I searched on Stack Overflow; they said the error might be caused by the float value. If I go with mean imputation, what would be a possible solution? I would be grateful for your response.
Hard to say without seeing your dataset and your code... good luck!
Is there any benefit to using "-scores" instead of "abs(scores)"? They look like they do the same thing in this case. abs looks safer to me because it will always return a positive number, no matter whether you use a loss function or a reward function.
Great idea! No, I don't see any benefit of using "-scores" instead of "abs(scores)". Thanks for the suggestion!
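A small sketch illustrating the point, using the current scorer name 'neg_mean_squared_error' and synthetic data (the dataset here is made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression data, just for illustration
rng = np.random.RandomState(1)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.randn(100) * 0.1

# the scorer returns negated MSE, so every score is <= 0 and
# negation gives the same result as abs()
scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring='neg_mean_squared_error')
print(np.sqrt(-scores).mean())  # average RMSE
```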
All your videos W.I.N - you’re the best
Can someone please explain this formatting? :) print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))
This should help: pyformat.info/
The difference between the average 10-fold cross-validated RMSE with the Newspaper feature and without the Newspaper feature seems, to me, quite negligible. Could you walk through the logic behind deciding when the difference is small enough to drop a feature versus keep it? Thanks!
And, BTW: your video series is amazing! Keep it up!
Thanks for your kind words!
The simple answer is that you should always prefer a simpler model (less features), unless having more features provides a "meaningful" increase in performance. There's no strict definition for "meaningful", it depends on context. Hope that helps!
Nice to see your video after a long time. I have some confusion that I hope will be cleared up by your response.
1. Please correct me if I'm wrong: the demerit of train/test split is high variance, i.e. differences between the training and testing data will affect the testing accuracy.
2. I really want to know what the random_state parameter does when you change it from 4 to 3, 2, 1, and 0.
3. For classification, you mentioned stratified sampling to make the K folds. How does it affect the accuracy of the model (for example, if out of 5000 rows my collected dataset has 80% ham and 20% spam mail, versus 50% ham and 50% spam mail)?
4. Since you have numerical features only, you used accuracy as the metric to select the best features. What do you suggest if the dataset contains object dtypes like date formats or string objects like text data?
I am a new student of data science, so I am sorry for the long comment.
unique raj Great questions! My responses:
1. The disadvantage of train/test split is that the resulting performance estimate (called "testing accuracy") is high variance, meaning that it may change a lot depending upon the random split of the data into training and testing sets.
2. Try removing the random_state parameter, and running train_test_split multiple times. Every time you run it, you will get different splits of the data. Now use random_state=1, and run it multiple times. Every time you run it, you will get the same exact split of the data. Now change it to random_state=2, and run it multiple times. Every time you run it, you will get the same exact split of the data, though it will be different than the split resulting from random_state=1. Thus, the point of using random_state is to introduce reproducibility into your process. It doesn't actually matter whether you use random_state=1 or random_state=9999. What matters is that if you set a random_state, you can reproduce your results.
3. In this context, stratified sampling relates to how the observations are assigned to the cross-validation folds. The reason to use stratified sampling in this context is that it will produce a more reliable estimate of out-of-sample accuracy. It doesn't actually have anything to do with making the model itself more accurate.
4. My choice of classification accuracy as the evaluation metric is not actually related to the data types of the features. Your features in a scikit-learn model will always be numeric. If you have non-numeric values that you want to use as features, you have to transform them into numeric features (which I will cover in a future video).
Hope that helps!
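To illustrate point 2, a quick reproducibility check (using the iris dataset as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# same random_state -> the exact same split every time you run it
a = train_test_split(X, y, random_state=1)
b = train_test_split(X, y, random_state=1)
print((a[0] == b[0]).all())

# different random_state -> a different (but still reproducible) split
c = train_test_split(X, y, random_state=2)
print((a[0] == c[0]).all())
```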
Your explanations are very straightforward. Thanks a lot.
Thanks!
While explaining the graph of model complexity vs. model accuracy, you mentioned the best model is the middle value, k=13, where it balances bias and variance. But you also said the best value of k should produce the simplest model, and higher values of k give a simpler model, so k=20 is taken as the lowest-complexity model. I am confused about this.
I know it's confusing! This article might help to clarify: scott.fortmann-roe.com/docs/BiasVariance.html
scoring='mean_squared_error' is replaced with scoring = 'neg_mean_squared_error'
Right! I have updated the code in the GitHub repository: github.com/justmarkham/scikit-learn-videos
cross_val_score(lm, X, y, cv=10, scoring='mean_squred_error') is not working; it gives an error about 'mean_squred_error'.
Please see this notebook: github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb
At minute 22:15, why is a KNN model with a higher k less complex than one with a lower k? From the perspective of computational complexity it should be the opposite, because fewer neighbours have to be found, shouldn't it?
And of course: Thanks for the great video.
Great question! I don't have a short answer, but this article should help explain: scott.fortmann-roe.com/docs/BiasVariance.html
Why does a higher K produce a simpler KNN model? I thought the higher K the more complex? 22:09
EDIT: Alright, makes sense now. I rewatched the episode on KNN and looked at the partition diagrams. In contrast to the other introduced models a higher value makes KNN less complex.
Glad you were able to figure it out!
i have watched several videos on this subject. this was the only one that has met my expectations
Thanks for your kind words!
Hello Kevin, could you please elaborate a little bit why the bigger K is the less complicated the model is and thereafter K=20 should be selected instead of K=14?
Thanks,
Terry
This should help explain it: scott.fortmann-roe.com/docs/BiasVariance.html
Hello dear sir, your video tutorial is really appreciated, you are doing a great job. Sir, can you make one tutorial on image datasets? That is, we have images and we want to build a dataset like MNIST or FMNIST; how do we do this job? Thanks, sir.
Thanks for your suggestion!
After we use K-fold cross-validation, which can be used for selecting optimal tuning parameters, choosing between models, and selecting features, how do we make the model? It looks like it only helps us select a model or tune the parameters. If we want to build a model and make predictions, do we still need to split the data into training and testing sets and then fit the chosen algorithm, such as KNN?
For instance: knn.fit(X_train, y_train) and then knn.predict(X_test).
Is that the correct concept?
No, cross-validation is an alternative to train/test split. Once you choose your tuning parameters, you fit the model on all of the data before making predictions. Hope that helps!
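A minimal sketch of that workflow, using iris and KNN as stand-ins (the candidate range of k is an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1. use cross-validation (not train/test split) to choose the tuning parameter
k_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X, y, cv=10, scoring='accuracy').mean()
            for k in range(1, 31)}
best_k = max(k_scores, key=k_scores.get)

# 2. refit the chosen model on ALL of the data, then predict new observations
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print(best_k, knn.predict([[3, 5, 4, 2]]))
```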
I'd really like some tutorials on Pipeline and the ROC/AUC curve, if you can. Thanks in advance!
This covers ROC: ua-cam.com/video/85dtiMz9tSo/v-deo.html
This covers Pipeline: www.dataschool.io/learn/
Is Python better than R? Which one do you recommend for beginners?
You might find this article helpful: www.dataschool.io/python-or-r-for-data-science/
Another question: how do you find the final numeric values of the parameter coefficients that minimize the loss function? In your linear regression example, how do you find beta_0, ..., beta_3 in y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper? Are beta_0, ..., beta_3 averages over the K models, i.e. beta_i = (1/K) * sum_k beta_i^(k) for i = 0, 1, 2, 3 and k = 1, ..., K? Or, once CV tells you that y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper is the best model, do you just fit it on ALL the data points to compute the coefficients?
+donkeyenvious Once you have selected the best model via cross-validation, you train that same model (meaning the same features and model tuning parameters) on all of your training data, and that tells you the coefficients to use when making actual future predictions.
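A small sketch of that final step on synthetic stand-in data (the features and true coefficients here are made up for illustration, not the Advertising dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic stand-ins for the TV, Radio, Newspaper features
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 2.9 + X @ np.array([0.05, 0.19, 0.0]) + rng.randn(200) * 0.01

# the coefficients come from one final fit on ALL the data,
# not from averaging the K per-fold models
lm = LinearRegression().fit(X, y)
print(lm.intercept_, lm.coef_)
```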
+Data School Thanks for your explanation!
While applying cv=10 in the cross_val_score function we get "All the n_labels for individual classes are less than 10 folds," but when we reduce it to cv=5 we get a result. Can you please help explain why we get this error?
It's hard for me to say without having access to your code and dataset. Good luck!
Great information, but I found the delivery much too slow. You can't please everyone, though, so please disregard :)
If it's helpful, you can change the speed of any UA-cam video to 1.25x or more.
How do you write the code on line 7 at 9:20 using the model_selection module? KFold seems to be set up slightly differently in that module and it's not really working for me. Thanks, love your videos.
I'll be updating the Jupyter notebooks for this video series soon, and will let you know when that is complete. Stay tuned! :)
Extremely useful, and thank you for sharing. My question is about very, very large datasets. Is it okay to randomly sample a portion of the data and do this cross-validation on the sample? If the sample is representative of the original data, will the cross-validation results for the sample be representative of the cross-validation results for the original data?
That should work, if the random sample is truly representative and there is not severe class imbalance. (If the positive class for the response value you are studying is extremely rare, then sampling may remove too many instances of that class, which would skew the results.)
Newbie question here:
After all the process of Cross-Validation, do fitting the data and making prediction need a different process?
Or you just do as normal fitting the training data and then make predictions on your test data?
Sorry, it depends a lot on what you mean by "a different process" and "you just do as normal"... it's hard for me to say without a lot more details! I recommend reviewing the notebooks or videos here: github.com/justmarkham/scikit-learn-videos
Around 29:40, when you're redoing the cross-validation with just the TV and Radio columns, wouldn't you need to refit the linear model? Because the previous one was fitted using TV, Radio, and Newspaper, but this new one is only involving TV and Radio.
Great video! The only line of code that I needed to update was reshaping the data to pass into the binarize function and then flattening the returned ndarray:
y_pred_class_2 = binarize(y_pred_prob.reshape((192, 1)), threshold=0.3).flatten()
I found lots and lots of videos about ML and cross-validation; I watched them all and tried to follow, but they were very hard to understand.
But you make it easy. I was very confused by cross-validation, and now it's more than clear.
Thank you very much for this video and for your channel!
You're very welcome! Glad it was helpful to you!
@Data School I have a question about minute 29:44, where you conclude that we choose the second model because its score is less than the other one. But we already negated and flipped the sign of the mean_squared_error, so why do we choose the smaller number? I thought it should be the higher one. Thanks!
Any metric that is named "error" is something you want to minimize. Thus we chose the second model because it minimized RMSE. Hope that helps!
So the program knows that scoring='accuracy' means to calculate the accuracy from sklearn.metrics?
Exactly.
thank you for introducing me to ML, and also for helping me understand Python through your great pandas videos!
You are very welcome!
This is very clear; a bit slow, but very clear.
Question: do you know what the impact of the seed value is on, e.g., 10-fold cross-validation? I'm assuming from your video that it starts the test set at a different location within the training set, because Weka gives me a different accuracy each time I use seed values from 1 to 10.
+olie tim Are you asking about the choice of "K" for K-fold cross-validation, meaning the number of folds?
This is great, thanks! But how do we use this to predict on the test set?
This is very good for supervised models. Can you please give your input on how to measure the accuracy of an unsupervised model? Thank you.
Sorry, I won't be able to advise. Good luck!
A low value of K produces a model with low bias and high variance? I thought it was the opposite: low K means high bias and low variance.
For KNN, lower values of K result in a model with lower bias and higher variance. This article explains that point further: scott.fortmann-roe.com/docs/BiasVariance.html
very good video and channel...
In cross_val_score, how should we decide which scoring parameter to use when?
This might help: github.com/justmarkham/DAT8/blob/master/other/model_evaluation_comparison.md
All about cross validation in one video , THAAAAAAAAAAAAAAAAAAAAAANK YOU
You're very welcome! :)
Great vid, Kevin! But I have a question: why is k-fold cross-validation so similar to leave-one-out cross-validation?
Leave one out cross-validation is just a special case of K-fold cross-validation. Glad you like the video!
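A quick check of that equivalence, with a tiny 10-observation array just for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)

# leave-one-out is K-fold with K equal to the number of observations
loo_splits = list(LeaveOneOut().split(X))
kf_splits = list(KFold(n_splits=len(X)).split(X))

for (tr1, te1), (tr2, te2) in zip(loo_splits, kf_splits):
    assert (tr1 == tr2).all() and (te1 == te2).all()
print(len(loo_splits))  # one fold per observation
```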
Very clear and detailed explanations. Also the links after the videos are very helpful. Thanks.
You're welcome!
Very well explained!!!! Explanation is very impressive...
Thanks!