I have done a course and also did a bit of prediction work using ML and DL, including ANN, CNN, and LSTM. However, now I understand which libraries to use for different cases. Thanks for coming up with such videos. Please do more; I am your subscriber.
This video helped me clear all my doubts regarding cross-validation and data leakage.
Krish, a very simple explanation of how CV can be used in algorithm selection. Very well done.
got an exam in a couple of hours, and this video cleared up a LOT of things! thank you for going into the concept, and using that to explain what's going on in your code. kudos man, kudos
That is clean AF! Thanks for this video, really appreciated.
Explained with simplicity.
Thanks Krish..
Awesome and clean, simple explanation.
Hi, it's a very good video. Could you please let me know if cross-validation is done on the train data or on the total data?
best video 🙌
Keep posting, sir, you are awesome.
Nice video, sir. How do you find the cross-validation score for a non-standard metric, for example specificity?
Hi Krish, after selecting the model, how do you select the best chunk of data for training, since different splits of the data will give different accuracy? It would be very helpful if you could post a video on this.
Hi Anime,
I think after selecting the best model with a good average accuracy, we don't need to split again; we can now train on the whole dataset and make/save a model. What do you say?
Excellent
Very apt and straight to the point. Thanks for sharing
Great teaching
Excellent explanation. So if I am using cross_val_score instead of train_test_split, I don't need to compute and analyse metrics like precision, recall, F1, and ROC? Just getting accuracy.mean() is good enough? P.S. I am new to DS, so I have probably mixed up a few things.
Can you increase the font size of the editor? It's very small and straining on the eyes when reading on mobile.
Thank you, sir, for a lucid explanation.
Excellent.
Thanks for the video.
I am following your videos.
The way you explain is simply awesome. Many happy thanks for sharing the information and knowledge about DS.
Krish, for a logistic regression problem we should use the mode, right? You used the mean here. Why?
Great explanation.
Very Good Explanation
I have a question. In cross-validation we perform multiple experiments based on the cv value. In K-fold we do the same thing. What is the difference between the two?
Is it okay to use your train_x and train_y data in your cross-validation? Or is it better to use your whole X and y variables?
The X and y variables. The whole point of using cross-validation techniques is to try various combinations of train and test sets from your original dataset and find out how effective your algorithm is across these combinations.
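In scikit-learn terms (assumed here, since the video's code uses it), that means handing the full X and y to cross_val_score and letting it do the splitting internally. A minimal sketch, with the iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cross_val_score splits X and y into cv folds itself, so no manual
# train_test_split is needed; it returns one accuracy per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```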
Don of the data science community.
It is difficult to work out which model suits the data, because we would have to try all the models to check which gives good accuracy. This leads to a big problem in the coding part.
You said that if cv=10, ten experiments are conducted, but you did not say in what ratio the train and test samples are split. Also, there are different interpretations of random_state on the internet: if random_state=None the sample changes on each run, and if random_state is any integer the sample stays the same regardless of which integer you choose. But in your case the sample did change when an integer was used. Please clarify.
In the case of cv=10 the ratio is train = 0.9, test = 0.1, because you split the "cake" into 10 pieces. Take cv=5: the cake is split into 5 pieces, so you have 4 pieces for training and 1 piece for testing, giving train = 4/5 = 0.8 and test = 1/5 = 0.2. For cv=4, training = 3/4 = 0.75 and testing = 1/4 = 0.25. I hope this clarifies it.
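The arithmetic above can be checked directly with scikit-learn's KFold (a sketch; the 100-row array is just illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(100, 1)

# With 10 folds, each split holds out 1/10 of the rows for testing
# and trains on the remaining 9/10.
sizes = [(len(train), len(test)) for train, test in KFold(n_splits=10).split(X)]
print(sizes[0])  # (90, 10)
```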
very nice video. Thank you.
Sir, how can I retrieve the validated datasets? Why are we applying cross-validation if we can't select the split that scored highest? Thank you.
Very good
Hi sir, when we use cv=10 it simply applies K-fold sampling. Can we import StratifiedKFold and pass it as cv when working with a dataset that has class imbalance, since stratified sampling keeps the same ratio of classes in the train and validation data?
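This is possible: cross_val_score accepts any splitter object as its cv argument. A minimal sketch assuming scikit-learn (note that for classifiers an integer cv already defaults to stratified folds, but passing StratifiedKFold explicitly lets you control shuffling and the random seed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each fold preserves the class ratio of the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=skf)
print(scores.mean())
```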
How do you use other CV techniques in code, like stratified CV, time-series CV, and leave-one-out CV?
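One way, assuming scikit-learn: the same cross_val_score call accepts any of these splitters as its cv argument. A sketch on a synthetic dataset (make_classification is used purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, TimeSeriesSplit, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# Leave-one-out: one fold per sample, so 100 single-row test sets.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# Time-series split: folds respect row order (train on the past,
# test on the future), so no shuffling occurs.
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

print(len(loo_scores), len(ts_scores))
```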
Thanks Krish
In cross-validation we are training several models and taking the mean of all the accuracies. So which model will be our final model?
Sir, what if the total number of observations is 107 or 191 or any prime number? How does K-fold CV split then?
great tutorial
What do I do if I want to apply MinMaxScaler or StandardScaler inside cross_val_score, fitting on the train folds and only transforming the test fold? The rule of thumb is to fit these on train and apply them to test separately, so how can I do this? cross_val_score doesn't have a specific argument for it.
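The usual answer, assuming scikit-learn, is to wrap the scaler and the model in a Pipeline: cross_val_score then fits the scaler on each fold's training part only and applies the learned transform to that fold's test part, avoiding leakage. A sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline is refit per fold: StandardScaler sees only that fold's
# training rows, then transforms its test rows with those statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```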
If the difference between the train scores and the cross_validate scores is negative, does that mean the model performs very well?
This is very useful.
Sir, you didn't define the test_size in train_test_split().
If you don't define it, the train_test_split function automatically takes a 75:25 ratio.
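That default is easy to verify (a sketch; the 100-row arrays are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.arange(100)

# With no test_size argument, 25% of rows are held out by default.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))  # 75 25
```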
Very Good! Thanks
Thank you sir
If I just want to apply K-fold cross-validation, I don't need to do a train/test split, right?
Have you done it using decision tree, random forest, and naive Bayes?
Sir, will it work for a multi-class problem?
When should we use train_test_split() versus cross_val_score() on the dataset? Most programs I have seen use train_test_split with a 70/30 or 60/40 train/test split and fit the model. So which is the best approach?
There is not really a neat rule. A rule of thumb is to choose cv so each fold matches your usual test/train split ratio.
Is K-fold cross-validation used only to decide which model is best for regression and classification, or can I also use it to decide which model is best for clustering?
Just for classification; not used for clustering, I think.
Great
Please, can you explain feature selection for the model?
Does the cross_val_score function use hyperparameters and stratified folds?
I am unable to use that cross-validation function on my system.
What is that accuracy? Is it train accuracy or test accuracy?
Can you explain the effect of random_state on the accuracy?
Each random_state is a different randomization of the train/test split of the data. The accuracy changes because in each case the split was done differently and led to different results, which is why a single split is quite unreliable, and cross-validation helps us solve that.
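A sketch of this effect, assuming scikit-learn and using the breast-cancer dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Each seed produces a different split, and typically a slightly
# different test accuracy from the same model and data.
accs = []
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    accs.append(model.score(X_te, y_te))
print(accs)
```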
Excellent video!
I am a little confused about CV folds and the number of values in the X and y datasets.
If you take your K-fold value as 5, then CV will perform 5 experiments. Suppose there are 50 records and you take the K-fold value as 5; then the size of the test data in each experiment would be 50/5, i.e., 10:
Exp1 ==> test data = df[0:10, :]
Exp2 ==> test data = df[10:20, :]
Exp3 ==> test data = df[20:30, :]
Exp4 ==> test data = df[30:40, :]
Exp5 ==> test data = df[40:50, :]
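The same slicing falls out of scikit-learn's KFold when shuffling is off (a sketch; the 50-row array stands in for df):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(50).reshape(50, 1)

# Without shuffle, KFold carves the rows into consecutive test slices.
splits = list(KFold(n_splits=5).split(X))
for i, (train_idx, test_idx) in enumerate(splits, 1):
    print(f"Exp{i}: test rows {test_idx[0]}..{test_idx[-1]}")
```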
Sir, I am confused about this. We select a part of the dataset for testing and the rest for training; in the next iteration we select a part for testing that was already used for training in the previous iteration. If so, won't it give an incorrect accuracy, since the model has already seen that data? Or am I missing some point?
Bro, in each iteration the model is retrained from scratch on that iteration's training folds, so the test fold is always unseen by the model being evaluated in that iteration.
@@zulfiquarshaikh3461 Okay, thank you, bro!
Nice video.
Sir, does cross-validation come first and then parameter tuning, always, in general?
First hyperparameter tuning, then cross-validation.
@@krishnaik06 Ok sir..
@@krishnaik06 Could you upload videos on some more algorithms with the meaning of their parameters, like the hyperparameters video, for the logistic regression algorithm and for regression, if you have time, sir?
❤❤
Thanks
I have a doubt: should you use cross_val_score on the train dataset or on the whole dataset?
@@generationwolves If you use cross-validation to tune your hyperparameters and improve your model, then you shouldn't apply cross-validation to the entire dataset but only to the training data. The test data must always stay independent; otherwise it will result in data leakage. If you just want an overall look at the scores across the splits, then you can apply it to the whole dataset.
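A sketch of that workflow, assuming scikit-learn (the dataset and split sizes are illustrative): hold out a final test set first, run cross-validation only on the training portion, and touch the held-out set just once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# The held-out test set is carved off before any cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)

# Cross-validation for model selection/tuning uses the training data only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# The independent test set is evaluated once, after the final fit.
final_score = model.fit(X_train, y_train).score(X_test, y_test)
print(cv_scores.mean(), final_score)
```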
can you output the model summary and the confusion matrix using cross_val_score?
No, I wouldn't think so. I think cross-validation is a quick way of determining which ML algorithm is most suitable. Once you pick whichever one returns a high CV score, you can then produce the model summary and confusion matrix using the confusion-matrix utilities.
@@chinedumezeakacha1604 Actually, I tried it, and you can do it.
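One way this works, assuming scikit-learn: cross_val_predict gathers out-of-fold predictions for every row, which can then be passed to confusion_matrix. A sketch, with the iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# Each row's prediction comes from the fold in which it was test data,
# so no row is predicted by a model that trained on it.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
cm = confusion_matrix(y, y_pred)
print(cm)
```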