your voice is extremely soothing
Super good, thank you for that clear run through!
Is it required to train the model on the entire dataset after cross-validation?
In the CV scores, do we get the training accuracy? If yes, can we get the training accuracy at each split?
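For reference, a minimal sketch of one way to see the training accuracy of every split, using scikit-learn's cross_validate with return_train_score=True (the dataset and model here are placeholders, not from the video):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# return_train_score=True adds the training accuracy of every split
results = cross_validate(clf, X, y, cv=5, return_train_score=True)
print(results["train_score"])  # training accuracy per split
print(results["test_score"])   # validation accuracy per split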
Hi,
A question for you! We run CV, which tells us the performance of the model. This by itself doesn't build any model for us. We eventually still need to call .fit() to train our model on the training set. What's the point of doing CV in this case?
Moreover, does it matter which kind of CV we use, since the model isn't actually built by the CV?
Great question! CV is used to get a more stable estimate of your model's performance. So it's not really needed to fit a model (unless you want to ensemble the models trained during CV, which is another topic).
The type of CV is important for different datasets so that the estimate of model performance is unbiased.
All this being said, if you have a ton of homogeneous data, CV is probably overkill (but you can always use a bootstrap power calculation to figure that out too -- shameless plug -- ua-cam.com/video/G77qfPVO6x8/v-deo.html)
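As a minimal sketch of that workflow (the dataset and estimator are placeholders, not from the video): cross_val_score only estimates performance, and the model you keep still comes from a single .fit() on the training set.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0)

# cross-validation only estimates performance - it builds no final model
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores.mean(), scores.std())

# the model you actually use still comes from a plain fit on the training set
clf.fit(X_train, y_train)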
Here's my question.
When we train our model using StratifiedKFold we actually get k models in return, and we can calculate the accuracy of each of them. But how do we get one final model instead of these k models?
I've read that we take the average of these models, but how do we take the average of a model?
To put it more simply, how can we use StratifiedKFold to make a final model?
Great question! Ultimately you will find the hyperparameters that you like best from the StratifiedKFold and then retrain the model with those hyperparameters on the full data. Hope that helps
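A minimal sketch of that "pick hyperparameters with StratifiedKFold, then retrain on the full data" flow (the candidate values and estimator are placeholders):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# score each candidate hyperparameter with the same k folds
candidates = [0.01, 0.1, 1.0, 10.0]
mean_scores = [cross_val_score(LogisticRegression(C=c, max_iter=1000), X, y, cv=skf).mean()
               for c in candidates]

# keep the best setting and retrain one final model on the full data
best_C = candidates[int(np.argmax(mean_scores))]
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X, y)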
If you do model selection or hyperparameter tuning, the CV estimate isn't unbiased for the selected model. Should we hold out a separate test set to evaluate the best model on, to get a truly unbiased performance estimate?
Do you know anything about forecast.baseline?
So if we use train_test_split, do we also need to use cross-validation?
Great question! You will always need to have a test set - that's what's gonna tell you how well your model will do in production. Cross-validation is a way to have a validation set with a lower amount of data, where your validation set is what you use to optimize hyperparameters.
@@DataTalks thank you for clearing that up for me!
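A minimal sketch of those three roles (the dataset, estimator, and candidate values are placeholders): the test set is held out once, and cross-validation on the training portion stands in for the validation set.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# hold out a test set once - it is only touched at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# cross-validation on the training portion plays the role of the validation set
for C in [0.1, 1.0, 10.0]:
    print(C, cross_val_score(SVC(C=C), X_train, y_train, cv=5).mean())

# after picking C, fit on the training set and report the test score once
final = SVC(C=1.0).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))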
I used the following code:
X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)
kf = KFold(n_splits=4, random_state=1, shuffle=True)
for train_index, test_index in kf.split(X_train):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X_train[train_index], X_train[test_index]
    y_train, y_test = y_train[train_index], y_train[test_index]
But I got the following message: index 7 is out of bounds for axis 0 with size 6
What could be the reason?
Thanks in advance
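A likely cause, sketched here as an assumption rather than an answer from the thread: reassigning X_train and X_test inside the loop shrinks X_train after the first fold (from 8 rows to 6), so the next fold's indices no longer fit. Using different names for the fold arrays avoids it:

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

kf = KFold(n_splits=4, random_state=1, shuffle=True)
for train_index, test_index in kf.split(X_train):
    print("TRAIN:", train_index, "TEST:", test_index)
    # new names inside the loop keep X_train / y_train at their original size
    X_tr, X_val = X_train[train_index], X_train[test_index]
    y_tr, y_val = y_train[train_index], y_train[test_index]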
In k-fold, when using the function kf.split(X), how do we separate the data from the target (X's, y's)?
I mean, the function splits the X array, but where do we define our y's?
Great question! kf.split returns indices for the training and the test set, and you use those to index into the sets that you split previously. Generally you will use pandas to split them beforehand.
That being said, there are particular types of data that need different splits (like class-imbalanced or time-series data).
thanks
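A minimal sketch of that indexing (the DataFrame here is a placeholder): X and y are separated beforehand, and the same row indices from kf.split are applied to both.

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({"feature_1": range(10), "feature_2": range(10, 20), "target": [0, 1] * 5})
X = df[["feature_1", "feature_2"]]   # the data and the target are separated beforehand
y = df["target"]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # the same row indices are applied to both X and y
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]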
I went through the complete video, and I still don't know how to perform cross-validation using stratified training sets...?
cross_val_score in the model selection module estimates the accuracy of a classifier using stratified k-fold by default. The cv parameter adjusts the number of folds; the default is 5. If you need to compare different classifiers, this is most likely the way. The actual StratifiedKFold class seems most useful for making charts, etc. Same with the standard KFold. This is all mentioned near the end of the video very briefly - blink and you miss it.
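A minimal sketch of that comparison (the two classifiers are arbitrary placeholders): cross_val_score with its defaults handles the stratified splitting for you.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# the default cv=5 uses stratified k-fold for classifiers
for clf in [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]:
    scores = cross_val_score(clf, X, y)
    print(type(clf).__name__, scores.mean())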
I just had a question: what's the difference between using clf.predict and clf.score?
Great question. predict will return the prediction itself (if you are predicting house value, it will return the predicted values). score will take the predictions one step further and compute how well you have done on a set of labeled data (generally your accuracy or R squared).
Thank you sooo much! BTW, I just tried an example which gave me a predict() accuracy of 94% and a score() value of 81%. What information can I get from these two scores? Which one should I use to test the model? To be specific, I used grid search to tune the parameters first, then got the predict() scores for all classifications, and also used score() to get scores. Lots of questions, thank you in advance! :-)
It will be hard for me to debug without seeing the full code. But predicting and then scoring should be the same as just scoring (as long as you use the same data and score measure), so do double-check. In the future, just go ahead and use score - it's the method used behind the scenes in GridSearchCV and is built for getting the model score.
Hope that helps. If you want to chat more feel free to message me through YT :)
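A minimal sketch of the predict/score relationship described above (dataset and model are placeholders): score is essentially the metric computed on predict's output.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = clf.predict(X_test)           # the predicted labels themselves
print(accuracy_score(y_test, y_pred))  # metric computed from those predictions
print(clf.score(X_test, y_test))       # the same number, computed in one call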
Hey man, first of all great vid! One doubt tho: if I need to normalize or scale my data, should I do it beforehand on my whole training dataset, or should I normalize/scale within each fold, using only the subset of the training data that is being extracted?
You should normalize within each fold if you are doing cross-validation! Your full training procedure - normalization, feature selection, etc. included - should be run on each fold of the cross-validation.
@@DataTalks got it! Thanks
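One common way to get that per-fold behaviour is a scikit-learn Pipeline, so the scaler is refit on each fold's training portion only (a sketch; the scaler and classifier are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# the scaler is fit inside each training fold, never on the held-out fold
pipe = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(pipe, X, y, cv=5).mean())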
Thanks for the amazing video! I used it for a project and found a peculiar "problem", and came back to notice that it happens in your video too. When you use mean + 2*std the value is bigger than 1 - is that normal?
Great question! The interval we calculate has a max greater than 1, which is a bit silly because the score can't be greater than 1. This is because we assume a symmetric distribution (a normal distribution centered around the mean of the scores). You don't need to do this, however. My favorite confidence intervals are bootstrap confidence intervals, which don't have this type of behavior. (Check out my series here for the full course: ua-cam.com/video/uWLMtCtsHmc/v-deo.html&ab_channel=DataTalks)
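A minimal sketch of a percentile bootstrap interval over the fold scores (the resample count is arbitrary, and this is just one way to get an interval that stays inside [0, 1], not necessarily the method from the linked series):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

# resample the fold scores with replacement and take percentiles of the resampled means
rng = np.random.default_rng(0)
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean() for _ in range(2000)]
print(np.percentile(boot_means, [2.5, 97.5]))  # stays within [0, 1] because each score does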
Can you please tell me which cross-validation technique is used in cross_val_score?
@@umasharma6119 Without specifying, it uses 5-fold CV. However, you can specify particular CV techniques through the cv parameter:
scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
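A short sketch of overriding that default with an explicit splitter (the splitter settings are just an example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# any CV splitter can be passed through the cv parameter instead of the default
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)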
Thanks for your video, it helps me a lot. By the way, could you zoom in on your code page? It is not easy to read the code on an 11-inch laptop. Thanks.
Absolutely, you are definitely not alone. I'll try to make the text bigger in subsequent vids!
Really nice content. Thanks a lot!
Lovely presented, would love to see more ;)
Thanks, it was very useful! masterclass by young hugh grant
nice video can you plz share me the code
Thank you!
I am not gay but I have to say that you are one attractive personality
Good video....but why make the video in the kitchen :D:D
lol
Hard to focus on what he is saying because the teacher it too cute LOL
Focus... focus .. focus!!!
If you pass the cross_val_predict result as y_pred to classification_report(y_pred, y), it will output 3 classes: 0, 1, 2. Why does it output 3 classes instead of 2, since the iris dataset is binary classification?
Iris has three classes: each is a species of plant :)
@@DataTalks yeah I got mind fk just realized the fact it has 3 classes....
@@MasterofPlay7 No problem! You'd be totally right if there were two!
@@DataTalks Thanks for the help! Yeah, I was totally confused that the confusion matrix is 3x3..... But for cross_val_predict, is the output the average of the n folds' predictions? How come it only outputs one set of predictions, whereas cross_val_score outputs multiple scores (i.e. accuracy)?
@@DataTalks Should it not output the metrics for each iteration of the k-fold? Hence if I have cv=3 folds, shouldn't it output 3 classification summaries and confusion matrices?
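For what it's worth, a minimal sketch of the difference (dataset and model are placeholders): cross_val_predict returns one out-of-fold prediction per sample - each sample is predicted by the model that was trained without it - so the classification report and confusion matrix are computed once over all samples, while cross_val_score returns one score per fold.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

print(cross_val_score(clf, X, y, cv=3))      # three scores, one per fold
y_pred = cross_val_predict(clf, X, y, cv=3)  # one out-of-fold prediction per sample
print(confusion_matrix(y, y_pred))           # a single 3x3 matrix for the three iris classes
print(classification_report(y, y_pred))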
Thank you!