Cross Validation in Scikit Learn

  • Published 29 Dec 2024

COMMENTS • 51

  • @TD-ph3wb · 3 years ago

    your voice is extremely soothing

  • @ninjaduck3534 · 3 years ago +2

    Super good, thank you for that clear run through!

  • @samuelpradhan1899 · 5 months ago

    Is it required to train the model on the entire dataset after cross-validation?

  • @manojachari3682 · 2 years ago

    Does cross_val_score give training accuracy? If yes, can we get the training accuracy at each split?

  • @liuchengyu5420 · 5 years ago +1

    Hi,
    A question for you! We run CV, which tells us the performance of the model. This by itself doesn't build any model for us; we eventually still need to call .fit() to train our model on the training set. What's the point of doing CV in this case?
    Moreover, does the kind of CV we use matter, given that the final model isn't actually built by CV?

    • @DataTalks · 5 years ago

      Great question! CV is used to get a more stable estimate of your model's performance, so it isn't really needed to fit a model (unless you want to ensemble the models trained during CV, which is another topic).
      The type of CV matters for different datasets, so that the estimate of model performance is unbiased.
      All that being said, if you have a ton of homogeneous data, CV is probably overkill (but you can always use a bootstrap power calculation to figure that out too -- shameless plug -- ua-cam.com/video/G77qfPVO6x8/v-deo.html)
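
      A sketch of that workflow (LogisticRegression and the iris data are just stand-ins): CV estimates performance but leaves no trained model behind, so the model you ship is fit separately.

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      X, y = load_iris(return_X_y=True)
      clf = LogisticRegression(max_iter=1000)

      # cross_val_score clones and fits a fresh model per fold and
      # returns only the fold scores, not a fitted model
      scores = cross_val_score(clf, X, y, cv=5)
      print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

      # the model you actually use is fit separately, on all the data
      clf.fit(X, y)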

  • @backgroundnoiselistener3599 · 6 years ago +1

    Here's my question:
    when we train our model using StratifiedKFold we actually get k models in return, and we can calculate the accuracy of each of them. But how do we get one final model instead of these k models?
    I've read that we take the average of these models, but how do you take the average of a model?
    To put it more simply, how can we use StratifiedKFold to make a final model?

    • @DataTalks · 5 years ago

      Great question! Ultimately you will find the hyperparameters you like best from the StratifiedKFold runs and then retrain the model with those hyperparameters on the full data. Hope that helps!
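
      A sketch of that flow, assuming GridSearchCV as the tuner (it does the final retraining for you when refit=True, the default):

      from sklearn.datasets import load_iris
      from sklearn.model_selection import GridSearchCV, StratifiedKFold
      from sklearn.svm import SVC

      X, y = load_iris(return_X_y=True)
      skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

      # the k per-fold models are scored and thrown away
      search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=skf)
      search.fit(X, y)

      # one final model, refit on all of X, y with the winning hyperparameters
      final_model = search.best_estimator_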

  • @DrJohnnyStalker · 6 years ago

    If you do model selection or hyperparameter tuning, the CV estimate isn't unbiased for the selected model. Should we hold out a separate test set to evaluate the best model on, to get a truly unbiased performance estimate?
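
    One standard pattern for exactly this concern is nested cross-validation, sketched below (SVC, iris, and the C grid are just stand-ins): the outer loop scores models whose hyperparameters were chosen using only the inner loop's data, so the selection bias stays out of the outer estimate.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # inner loop: hyperparameter selection
    inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

    # outer loop: each fold refits the whole search, so no outer test
    # point ever influences the hyperparameter choice
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean())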

  • @dikshyasurvi6869 · 3 years ago

    Do you know anything about forecast.baseline?

  • @alice9737 · 11 months ago

    So if we use train_test_split, do we also need to use cross-validation?

    • @DataTalks · 11 months ago +1

      Great question! You will always need to have a test set - that's what's gonna tell you how well your model will do in production. Cross validation is a way to get a validation set when you have a lower amount of data, where your validation set is what you use to optimize hyperparameters.
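
      A sketch of how the two fit together (names here are just illustrative): the test set is held out once, and cross validation plays the role of the validation set on the remainder.

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score, train_test_split

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=0)

      clf = LogisticRegression(max_iter=1000)

      # validation: use these scores to pick hyperparameters
      val_scores = cross_val_score(clf, X_train, y_train, cv=5)

      # test: touched once at the end, estimates production performance
      clf.fit(X_train, y_train)
      print(clf.score(X_test, y_test))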

    • @alice9737 · 11 months ago

      @DataTalks thank you for clearing that up for me!

  • @armelsokoudjou8696 · 3 years ago

    I used the following code:

    X, y = np.arange(20).reshape((10, 2)), np.arange(10)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.2,
                                                        random_state=111)
    kf = KFold(n_splits=4, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X_train):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X_train[train_index], X_train[test_index]
        y_train, y_test = y_train[train_index], y_train[test_index]

    But I got the following message: index 7 is out of bounds for axis 0 with size 6
    What could be the reason?
    Thanks in advance
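
    The cause, for what it's worth: the loop rebinds X_train and y_train, so after the first fold X_train has only 6 rows, while kf.split generated indices for the original 8-row array - hence "index 7 is out of bounds for axis 0 with size 6". A sketch with fresh per-fold names (X_tr, X_val, etc. are illustrative) avoids this:

    import numpy as np
    from sklearn.model_selection import KFold, train_test_split

    X, y = np.arange(20).reshape((10, 2)), np.arange(10)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=111)

    kf = KFold(n_splits=4, random_state=1, shuffle=True)
    for train_index, val_index in kf.split(X_train):
        print("TRAIN:", train_index, "VAL:", val_index)
        # new names each fold, so X_train keeps its original 8 rows
        X_tr, X_val = X_train[train_index], X_train[val_index]
        y_tr, y_val = y_train[train_index], y_train[val_index]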

  • @kokkiniarkouda1 · 7 years ago

    In k-fold, when using kf.split(X), how do we separate the data from the target (the x's and y's)?
    I mean, the function splits the X array, but where do we define our y's?

    • @DataTalks · 7 years ago

      Great question! kf.split returns indices for the training and test sets, and you use those to index into the data you split previously. Generally you will use pandas to split out X and y beforehand.
      That being said, there are particular types of data that need different splits (like class-imbalanced or time series data).
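
      A small sketch of that: kf.split(X) yields index arrays, and you apply the same indices to both X and y yourself.

      import numpy as np
      from sklearn.model_selection import KFold

      X = np.arange(20).reshape(10, 2)
      y = np.arange(10)

      kf = KFold(n_splits=5)
      for train_idx, test_idx in kf.split(X):
          # the indices come from X, but they index y just as well
          X_tr, X_te = X[train_idx], X[test_idx]
          y_tr, y_te = y[train_idx], y[test_idx]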

    • @kokkiniarkouda1 · 7 years ago

      thanks

  • @guillermoabrahamgonzaleztr4852 · 3 years ago

    I went through the complete video, and I still don't know how to perform cross validation using stratified training sets...?

    • @TaiHeardPostAudio · 3 years ago

      'cross_val_score' in the model selection module estimates the accuracy of a classifier using stratified k-fold by default. The 'cv' parameter adjusts the number of folds; the default is 5. If you need to compare different classifiers, most likely this is the way. The actual StratifiedKFold class seems most useful for making charts, etc. Same with the standard KFold. This is all mentioned near the end of the video very briefly - blink and you miss it.
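
      A sketch of that usage (the two classifiers are arbitrary stand-ins): one cross_val_score call per model, with stratified k-fold under the hood for classifiers.

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      X, y = load_iris(return_X_y=True)
      for clf in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
          # stratified 5-fold by default for a classifier
          scores = cross_val_score(clf, X, y, cv=5)
          print(type(clf).__name__, scores.mean())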

  • @fanwu281 · 6 years ago

    I just had a question: what's the difference between using clf.predict and clf.score?

    • @DataTalks · 6 years ago +2

      Great question! Predict will return the prediction itself (if you are predicting house value, it will return the predicted values). Score will take the predictions one step further and compute how well you have done on a set of labeled data (generally your accuracy or r squared).

    • @fanwu281 · 6 years ago

      Thank you sooo much! BTW, I just tried an example which gave me a predict() score of 94% and a score() score of 81%. What information can I get from these two scores? Which score should I use to test the model? To be specific, I used grid search to tune parameters first, then got the predict() scores of all classifications, and also used score() to get scores. Lots of questions, thank you in advance! :-)

    • @DataTalks · 6 years ago +1

      It will be hard for me to debug without seeing the full code. But predicting and then scoring should be the same as just scoring (as long as you use the same data and score measure), so do double check. In the future, just go ahead and use score. It's the method used behind the scenes in GridSearchCV and is built for getting the model score.
      Hope that helps. If you want to chat more, feel free to message me through YT :)
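
      A quick sketch of that equivalence (iris and LogisticRegression as stand-ins): .score is just .predict followed by the default metric, accuracy for a classifier.

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
      clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

      y_pred = clf.predict(X_test)              # the predictions themselves
      print(accuracy_score(y_test, y_pred))     # score them by hand...
      print(clf.score(X_test, y_test))          # ...same number from .score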

  • @ericsuda4143 · 3 years ago

    Hey man, first of all great vid! One doubt though: if I need to normalize or scale my data, should I do it beforehand on my whole training dataset, or should I normalize or scale within each fold, on the subset of the training data being extracted?

    • @DataTalks · 3 years ago +1

      You should normalize on each fold if you are doing cross validation! You should run your full training procedure on each fold of the cross validation - normalization, feature selection, etc. included.
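
      A sketch of the per-fold way, using a Pipeline so that cross_val_score refits the scaler inside each fold:

      from sklearn.datasets import load_iris
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      X, y = load_iris(return_X_y=True)
      pipe = make_pipeline(StandardScaler(), SVC())

      # per fold: the scaler is fit on that fold's training part only,
      # so no information leaks from the held-out part into the scaling
      scores = cross_val_score(pipe, X, y, cv=5)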

    • @ericsuda4143 · 3 years ago

      @DataTalks got it! Thanks

  • @rafaelaraujo5988 · 2 years ago

    Thanks for the amazing video. I used it for a project, found a peculiar "problem", and came back to notice that it happens in your video too. When you use mean + 2*std, the value is bigger than 1. Is that normal?

    • @DataTalks · 2 years ago

      Great question! The interval we calculate has a max greater than 1, which is a bit silly because the score can't be greater than 1. This is because we assume a symmetric distribution (a normal distribution centered around the mean of the scores). You don't need to do this, however. My favorite confidence intervals are bootstrap confidence intervals, which don't have this type of behavior. (Check out my series here for the full course: ua-cam.com/video/uWLMtCtsHmc/v-deo.html&ab_channel=DataTalks)
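
      A sketch of one simple variant, a percentile bootstrap over the fold scores, which stays inside [0, 1] unlike mean + 2*std:

      import numpy as np
      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      X, y = load_iris(return_X_y=True)
      scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

      # resample the fold scores with replacement and take percentiles
      rng = np.random.default_rng(0)
      boot_means = [rng.choice(scores, size=len(scores)).mean()
                    for _ in range(10_000)]
      lo, hi = np.percentile(boot_means, [2.5, 97.5])  # never leaves [0, 1]
      print("95% CI:", lo, hi)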

    • @umasharma6119 · 2 years ago

      Can you please tell me which cross validation technique is used in cross_val_score?

    • @DataTalks · 2 years ago

      @umasharma6119 Without specifying, it uses 5-fold. However, you can specify CV techniques in the parameter cv:
      scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
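
      For instance, a sketch of passing an explicit splitter through cv:

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import KFold, cross_val_score

      X, y = load_iris(return_X_y=True)

      # swap the default for any splitter (or just an integer fold count)
      cv = KFold(n_splits=10, shuffle=True, random_state=0)
      scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)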

  • @hangxu7967 · 6 years ago +1

    Thanks for your video, it helps me a lot. By the way, could you zoom in on your code? It is not easy to read on an 11-inch laptop. Thanks.

    • @DataTalks · 6 years ago

      Absolutely, you are definitely not alone. I'll try to make the text bigger in subsequent vids!

  • @marinovik7954 · 4 years ago

    Really nice content. Thanks a lot!

  • @VladyVeselinov · 7 years ago

    Lovely presented, would love to see more ;)

  • @alejandrodecoud7319 · 2 years ago

    Thanks, it was very useful! Masterclass by young Hugh Grant

  • @shankrukulkarni3234 · 4 years ago

    Nice video! Can you plz share the code?

  • @cyberdudecyber · 5 years ago

    Thank you!

  • @syyamnoor9792 · 6 years ago +2

    I am not gay but I have to say that you are one attractive personality

  • @rasu84 · 6 years ago +1

    Good video....but why make the video in the kitchen :D:D

  • @desertrose00 · 2 years ago

    Hard to focus on what he is saying because the teacher is too cute LOL
    Focus... focus... focus!!!

  • @MasterofPlay7 · 4 years ago

    If you pass the cross_val_predict result as y_pred to classification_report(y_pred, y), it outputs 3 classes: 0, 1, 2. Why does it output 3 classes instead of 2, since the iris dataset is binary classification?

    • @DataTalks · 4 years ago

      Iris has three classes: each is a species of plant :)

    • @MasterofPlay7 · 4 years ago

      @DataTalks yeah I got mind fk, just realized the fact it has 3 classes....

    • @DataTalks · 4 years ago

      @MasterofPlay7 No problem! You'd be totally right if there were two!

    • @MasterofPlay7 · 4 years ago

      @DataTalks thx for the help! Yeah, I was totally confused that the confusion matrix is 3x3..... But for cross_val_predict, is the output the average of the n folds' predictions? How come it only outputs 1 set of predictions, whereas cross_val_score outputs multiple scores (i.e. accuracy)?

    • @MasterofPlay7 · 4 years ago

      @DataTalks Shouldn't it output the metrics for each iteration of the k folds? Hence if I have cv=3 folds, shouldn't it output 3 classification summaries and confusion matrices?
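
      A sketch bearing on both questions above: cross_val_predict is not an average - each sample gets exactly one prediction, made by the model trained on the folds that sample was left out of - so there is one prediction array rather than one per fold.

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_predict, cross_val_score

      X, y = load_iris(return_X_y=True)
      clf = LogisticRegression(max_iter=1000)

      print(cross_val_score(clf, X, y, cv=3).shape)    # (3,)   one score per fold
      print(cross_val_predict(clf, X, y, cv=3).shape)  # (150,) one label per sample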

  • @GWebcob · 3 years ago

    Thank you!