Good job on this video series! Question on the code below, near the end of video/notebook 8. In my run, the first item in the grid_scores_ list is:
mean: 0.97333, std: 0.03266, params: {'weights': 'distance', 'n_neighbors': 18}, ...
When I check the type of the first item in the first list item (print(type(rand.grid_scores_[1][0])) as below), it returns:
How come the items in the list, e.g. mean: 0.97333, are not surrounded by braces {} if they are dicts?
thanks, Tom
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
print(rand.grid_scores_)
print(type(rand.grid_scores_[1][0]))
Great question! I'm not positive, but I think that rand.grid_scores_[1][0] actually refers to the 'params' dictionary. I can't say for sure without checking, but let me know what you find if you decide to check!
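For anyone checking today: grid_scores_ held named tuples (which is why the printed means aren't wrapped in braces, even though the params part is a dict), and it was removed in scikit-learn 0.20 in favor of cv_results_. A sketch of the replacement, assuming a recent scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_dist = {'n_neighbors': list(range(1, 31)), 'weights': ['uniform', 'distance']}

# n_iter controls the number of searches
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10,
                          scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)

# cv_results_ is a dict of parallel arrays: one entry per parameter setting tried
print(rand.cv_results_['mean_test_score'])
print(rand.cv_results_['params'])   # this part really is a list of dicts
print(rand.best_score_, rand.best_params_)
```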
Hello sir, I want to ask a couple of questions. 1. What do "{:^9}", "{:^61}", "{:^25}" actually mean? 2. Why do higher k values in KNN correspond to lower complexity? Is that just for KNN, or does it hold for other models too, for instance random forest?
1. That's code for string formatting. Some examples are here: pyformat.info/ 2. See this article: scott.fortmann-roe.com/docs/BiasVariance.html Hope that helps!
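To illustrate point 1: the ^ in a format spec centers the value within a field of the given width (9, 61, and 25 characters in those examples). A quick sketch:

```python
# '^' centers the text inside a field of the given width, padding with spaces
print('{:^9}'.format('abc'))         # pads 'abc' to width 9, centered
print(len('{:^61}'.format('data')))  # the result is always 61 characters wide
print('{:^25}'.format('training set'))
```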
Great video sir, but I have a question. You are using accuracy for selecting the model, but I have read that accuracy is not a sufficient measure of a model's performance in classification. Also, is a model with higher accuracy always better than a model with lower accuracy (in the case of classification)?
@Data School: Hi sir, I have been working with a customer churn dataset. It contains 21 features including the target variable churn, which is an integer type. While doing missing value treatment, I tried mean imputation on the dependents and city columns (both numeric in nature). The mean value was fractional (i.e. float64). This results in an error in train_test_split(): "TypeError: Singleton array cannot be considered a valid collection". I searched Stack Overflow, where they said the error might be caused by the float value. If I go with mean imputation, what would be a possible solution? I would be grateful for your response.
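Float64 values themselves don't break train_test_split; that TypeError usually means one of the arguments isn't an array-like of samples. A sketch of mean imputation that splits cleanly, assuming scikit-learn 0.20+ (the column names here are hypothetical stand-ins for the churn data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# hypothetical churn-style frame with missing numeric values
df = pd.DataFrame({'dependents': [1.0, np.nan, 3.0, 2.0],
                   'city': [10.0, 20.0, np.nan, 40.0],
                   'churn': [0, 1, 0, 1]})

# mean imputation yields float64 values, which split without any error
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(df[['dependents', 'city']])
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train.dtype, X_train.shape)
```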
Is there any benefit to using "-scores" instead of "abs(scores)"? They look like they do the same thing in this case. abs looks safer to me, because it will always return a positive number, no matter whether you use a loss function or a reward function.
The difference between the average RMSE with the Newspaper feature and the 10-fold cross-validated RMSE without the Newspaper feature seems, to me, quite negligible. Could you walk through the logic behind deciding when the difference is small enough to keep a feature versus when one should drop it? Thanks! And, BTW: your video series is amazing! Keep it up!
Thanks for your kind words! The simple answer is that you should always prefer a simpler model (fewer features), unless having more features provides a "meaningful" increase in performance. There's no strict definition of "meaningful"; it depends on context. Hope that helps!
Nice to see your video after a long time. I have some confusion that I hope will be cleared up by your response.
1. Please correct me if I've misunderstood: the drawback of train/test split is high variance, meaning differences between the training and testing data will affect the testing accuracy.
2. I really want to know what the random_state parameter does when you change it from 4 to 3, 2, 1, and 0.
3. For classification, you mentioned stratified sampling to create the K folds. How does it affect the accuracy of the model (for example, if out of 5,000 observations my dataset has 80% ham and 20% spam, versus 50% ham and 50% spam)?
4. Since you have only numerical features, you used accuracy as the metric to select the best features. What do you suggest if the dataset contains object dtypes like date formats, or string objects like text data?
I am a new student of data science, so I am sorry for the long comment.
unique raj Great questions! My responses:
1. The disadvantage of train/test split is that the resulting performance estimate (called "testing accuracy") is high variance, meaning that it may change a lot depending upon the random split of the data into training and testing sets.
2. Try removing the random_state parameter and running train_test_split multiple times: every time you run it, you will get a different split of the data. Now use random_state=1 and run it multiple times: every time, you will get the same exact split. Now change it to random_state=2 and run it multiple times: again you will get the same exact split every time, though it will differ from the split produced by random_state=1. Thus, the point of random_state is to introduce reproducibility into your process. It doesn't actually matter whether you use random_state=1 or random_state=9999; what matters is that if you set a random_state, you can reproduce your results.
3. In this context, stratified sampling relates to how the observations are assigned to the cross-validation folds. The reason to use it here is that it will produce a more reliable estimate of out-of-sample accuracy. It doesn't actually have anything to do with making the model itself more accurate.
4. My choice of classification accuracy as the evaluation metric is not actually related to the data types of the features. Your features in a scikit-learn model will always be numeric. If you have non-numeric values that you want to use as features, you have to transform them into numeric features (which I will cover in a future video).
Hope that helps!
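Point 2 above can be seen directly in code; a minimal sketch using the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# same random_state -> the identical split on every run
a = train_test_split(X, y, random_state=1)
b = train_test_split(X, y, random_state=1)
print((a[0] == b[0]).all())   # True: X_train is reproduced exactly

# different random_state -> a different (but still reproducible) split
c = train_test_split(X, y, random_state=2)
print((a[0] == c[0]).all())   # almost surely False: the rows are shuffled differently
```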
While explaining the graph of model complexity vs. model accuracy, you mentioned the best model is a middle value, k = 13, where it balances bias and variance. But you also mentioned that the best value of k should produce the simplest model, and higher values of k give a simpler model, so k = 20 is taken as the lowest-complexity model. I am confused about this.
At minute 22:15: why is a KNN model with a higher k less complex than one with a lower k? From the perspective of computational complexity it should be the opposite, because fewer neighbours have to be found, shouldn't it? And of course: thanks for the great video.
Why does a higher K produce a simpler KNN model? I thought the higher the K, the more complex? 22:09 EDIT: Alright, it makes sense now. I rewatched the episode on KNN and looked at the partition diagrams. In contrast to the other models introduced, a higher value makes KNN less complex.
Hello Kevin, could you please elaborate a little on why the bigger K is, the less complicated the model becomes, and therefore why K=20 should be selected instead of K=14? Thanks, Terry
Hello dear sir, your video tutorial is really appreciated; you are doing a great job. Could you make a tutorial on image datasets? I mean, if we have images and want to build a dataset like MNIST or Fashion-MNIST, how do we do that? Thanks, sir.
After we use K-fold cross-validation for selecting optimal tuning parameters, choosing between models, and selecting features, how do we build the final model? It looks like cross-validation only helps us select a model or tune its parameters. If we want to build a model and make predictions, do we still need to split the data into training and testing sets and then fit the chosen algorithm, such as KNN? For instance: knn.fit(X_train, y_train) then knn.predict(X_test). Is that the correct concept?
No, cross-validation is an alternative to train/test split. Once you choose your tuning parameters, you fit the model on all of the data before making predictions. Hope that helps!
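That workflow as a sketch (the dataset and parameter range here are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# step 1: use cross-validation only to choose the tuning parameter
k_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X, y, cv=10, scoring='accuracy').mean()
            for k in range(1, 31)}
best_k = max(k_scores, key=k_scores.get)

# step 2: refit the chosen model on ALL of the data, then make predictions
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print(best_k, knn.predict([[3, 5, 4, 2]]))
```

No separate train/test split is needed here, because cross-validation already provided the out-of-sample performance estimate.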
Another question: how do you find the final numeric values for the parameter coefficients that minimize the loss function? In your linear regression example, how do you find beta_0, ..., beta_3 in y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper? Are beta_0, ..., beta_3 averages over the K models, i.e. beta_i = (sum over k of beta_ik)/K for i = 0,1,2,3 and k = 1,...,K? Or, after CV tells you that y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper is the best model, do you just run this on ALL data points to compute the coefficients?
+donkeyenvious Once you have selected the best model via cross-validation, you train that same model (meaning the same features and model tuning parameters) on all of your training data, and that tells you the coefficients to use when making actual future predictions.
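As a sketch of that final refit step, with synthetic advertising-style data standing in for the real dataset (the true coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic TV/Radio/Newspaper-style features and a known linear response
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 5 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(0, 1, 200)

# after CV has chosen the model, refit it once on ALL the data;
# these single fitted values (not fold averages) are used for future predictions
lm = LinearRegression().fit(X, y)
print(lm.intercept_)   # estimate of beta_0
print(lm.coef_)        # estimates of beta_1..beta_3, one per feature
```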
While applying cv=10 in the cross_val_score function, we get "All the n_labels for individual classes are less than 10 folds," but when we reduce it to cv=5 we get a result. Can you please help explain why we get this error?
How do you write the code on line 7 at 9:20 using the model_selection module? KFold seems to be set up slightly differently in that module and it's not really working for me. Thanks, love your videos.
Extremely useful, and thank you for sharing. My question is about very, very large datasets. Is it okay to randomly sample a portion of the data and do this cross-validation on it? If the random sample is representative of the original data, will the cross-validation results for the sample be representative of the cross-validation results for the original data?
That should work, if the random sample is truly representative and there is not severe class imbalance. (If the positive class for the response value you are studying is extremely rare, then sampling may remove too many instances of that class, which would skew the results.)
Newbie question here: after the whole cross-validation process, do fitting the data and making predictions need a different process? Or do you just fit the training data as normal and then make predictions on your test data?
Sorry, it depends a lot on what you mean by "a different process" and "you just do as normal"... it's hard for me to say without a lot more details! I recommend reviewing the notebooks or videos here: github.com/justmarkham/scikit-learn-videos
Around 29:40, when you're redoing the cross-validation with just the TV and Radio columns, wouldn't you need to refit the linear model? Because the previous one was fitted using TV, Radio, and Newspaper, but this new one is only involving TV and Radio.
Great video! The only line of code that I needed to update was reshaping the data to pass into the binarize function and then flattening the returned ndarray:
y_pred_class_2 = binarize(y_pred_prob.reshape((192, 1)), threshold=0.3).flatten()
I found lots and lots of videos about ML and cross-validation. I watched them all and tried to follow, but it was very hard to understand. But you make it easier. I was very confused by cross-validation, and now it's more than clear. Thank you very much for this video and for your channel.
@Data School I have a question about minute 29:44: you conclude that we choose the second model because its score is less than the other one. But we already negated and flipped the sign of the mean_squared_error, so why do we choose the smaller number? I thought it should be the higher one. Thanks!
This is very clear, a bit slow but very clear. Question: do you know what the impact of the seed value is on, e.g., 10-fold cross-validation? I'm assuming from your video that it changes where the test set starts within the training data, because in Weka I get a different accuracy every time I use seeds 1 through 10.
For KNN, lower values of K result in a model with lower bias and higher variance. This article explains that point further: scott.fortmann-roe.com/docs/BiasVariance.html
Having problems with the code? I just finished updating the notebooks to use *scikit-learn 0.23* and *Python 3.9* 🎉! You can download the updated notebooks here: github.com/justmarkham/scikit-learn-videos
These videos are so well done, so clear and easy to follow, that they make ML seem like child's play. Congratulations, great teaching.
Thanks for your kind words!
can't agree more! best video resource on cross validation on the internet.
*Note:* This video was recorded using Python 2.7 and scikit-learn 0.16. Recently, I updated the code to use Python 3.6 and scikit-learn 0.19.1. You can download the updated code here: github.com/justmarkham/scikit-learn-videos
Updated link shows me this message :
Sorry, something went wrong. Reload?
Is this still relevant in 2020?
I almost gave up Python until I found your channel. You are my savior.
Wow, thank you! Good luck with your Python education! :)
Amazing video! Very instructive. And the presenter has a very clear voice and pace.
Thank you so much! I'm glad you liked it!
Kevin, I appreciate the slow but thorough walk through, you and StatQuests are awesome people. Thank you.
Thank you so much!
Sir, firstly I would like to thank you a lot, because you spent so much time making this video. This is really helpful for early-stage ML learners; keep it up, sir. I stopped this video midway to say thanks; you saved me many hours in understanding cross-validation.
That's awesome to hear! Thanks so much for letting me know! 🙌
In some places I’ve seen people run K fold cross validation on the entire dataset and in other places I’ve seen people run it only on the training set. They then calculate on the test set separately. Is there any recommendation regarding which practice makes more sense? Great video!!
This is beyond the scope of what I can get into in a YouTube comment... sorry!
Brilliant explanation. But can you (or somebody) please help me understand why you chose k=20? Couldn't it be 13 or 17?
I can't remember exactly, but I probably chose it because it's the simplest model among those options with the best performance. For more details on what I mean by "simple", see here: scott.fortmann-roe.com/docs/BiasVariance.html
Awesome explanation....!
Thanks!
I know this is from a long time ago, but why is it that the complexity decreases as the value of K increases? Shouldn't it be the opposite, since for greater values of K we'd need to iterate over more data from y?
Thanks for the videos, the entire playlist is amazing
Great question, and thanks for your kind words! It's a tricky topic, but the best way to understand it is by reading this article: scott.fortmann-roe.com/docs/BiasVariance.html
Hope that helps!
Excellent. Thanks for sharing your expertise.
Thank you!
Great ! Thank you very much for sharing such a clear explanation 🙌
You're very welcome!
I usually don't subscribe to any channel, but you earned this sub from me. Keep going; lots of love.
Thank you! 🙌
Thank you for your great video. I know how to get the accuracy of my classifier, but how can I get my actual predictions when using cross-validation?
I think you would use cross_val_predict: scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html
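A minimal sketch of cross_val_predict, using iris and logistic regression as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# each observation is predicted by the model from the fold in which it was held out
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)
print(y_pred.shape)               # one prediction per observation
print(accuracy_score(y, y_pred))  # accuracy of those out-of-fold predictions
```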
best explanation, easy to understand. thank you so much
Thank you!
The best explanation I have ever seen.
Wow, thank you so much!
Great videos. Please keep up the good work. We really need your lectures. Thank you so much.
Thanks for your kind words! I will definitely release more videos! :)
Waiting eagerly for a new series because I have finished watching all your videos. Thank you so much for teaching me Python. You have educated me; you are a teacher to me and I respect you. Thank you so much.
Awesome! Thank you for watching and learning! :)
Outstanding work once again Kevin. A treasure to newcomers in the area.
Thank you!
First of all, let me thank you for the extraordinary work you are doing.
You are explaining extraordinary things in a very ordinary way.
I had a query regarding fit with cross-validation.
When we do linear regression with a SINGLE train/test split dataset, we get a SINGLE fit to predict over the test data.
Whereas when we do cross-validation (e.g. cv=10) in linear regression, we have 10 training datasets.
But do we also have 10 fitted models, one for each training dataset? Or do we have an average of all 10 fitted models?
I am able to get the coefficients and intercept of the fitted model via a single training dataset:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=1)
lm = LinearRegression()
fitting = lm.fit(X_train, y_train)
fitting.coef_
fitting.intercept_
How do I get the intercept and coefficients for the fitted model via cross-validation?
And what is the significance of cross_val_predict? Does it have any relation to my query?
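For reference: cross-validation with cv=10 fits 10 separate models, one per fold; they are not averaged. Assuming scikit-learn 0.20+, cross_validate with return_estimator=True exposes each fold's fitted model, so you can inspect all 10 sets of coefficients (synthetic data used here for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# synthetic regression data with a known linear relationship
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(0, 0.1, 100)

cv_results = cross_validate(LinearRegression(), X, y, cv=10,
                            return_estimator=True)

# ten separately fitted models, one per fold -- not an average
for lm in cv_results['estimator']:
    print(lm.intercept_, lm.coef_)
```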
Thank you so much. This is very helpful^^
Glad it was helpful!
Hi Kevin,
Are you going to provide any tutorial on neural networks? The currently available tutorials are great, but they mainly use very large datasets such as MNIST, and they don't pay enough attention to how to decide on the architecture of the model, for instance the number of inputs and outputs, the number of hidden layers, and the number of nodes in each hidden layer. A tutorial using smaller and simpler datasets such as iris or wine would be very highly appreciated.
Thanks so much for the suggestion! I will strongly consider creating this in the future.
Thanks a ton, perfectly explained the concept and the code
Great to hear!
This is the best sklearn tutorial I have come across.
Thank you!
Thank you for your super clear and thorough videos. Do you talk about the validation/dev set in your workflow course? I have a hard time with that concept, but my teachers require that we use it. It gets confusing for me when I try to select a model. Because I've heard you should tune your models to the validation set, and then get an "unbiased" score on the test set. But they say you're supposed to select your model according to the validation score. I don't understand why, because I thought that was biased because we used it to tune the models. Is the validation score the better of the two to use to select the model because if you use the test set to select the model, then we're biased to the test set and then have no unbiased score left? I almost feel like there should be one more split in the data, to create a second test set. So the first test set will be used to choose the model, and the second test set is used to show an unbiased score. There is honestly a lot of conflicting information about this entire topic on the internet. I've seen people say the model should be chosen based on the test set, and the validation set is only present to help tune the model. Cross validation is not always possible, because it can be very slow with big datasets.
Yes, I do talk about this issue in my newest ML course. It's a complicated topic, but the bottom line is that whether or not you "need" to do an extra split (train/test/validation instead of just train/test) depends on whether your goal is only to choose the optimal model or also to estimate that model's performance on out-of-sample data. Also, you need to be mindful that an extra split can have a negative side effect if you have an insufficient amount of training data.
If you decide you want to enroll, you can do so here: gumroad.com/l/ML-course?variant=Live%20Course%20%2B%20Advanced%20Course
Hope that helps!
@@dataschool Thanks for the reply!
Great video, thanks a lot!
+Matina Fragkogianni You're very welcome, I'm happy to help!
You are awesome...you teach everything in a simple way....ask for the feedback....and make them much better.....And the best thing is you make everything (I REPEAT EVERYTHING) easy for us....So sweet of you :)
That is so kind of you to say! Thank you so much 😄
How do we test on an independent dataset other than X and y, i.e. other than the 10-fold cross-validated dataset? Any response would be highly appreciated. Thank you.
Could you please recommend a couple of books, or preferably videos/notes, regarding multi-agent systems?
Hopefully something as thoroughly explained as your lectures.
Many thanks
I watched a dozen videos on this topic. I was pretty certain I understood it, but I still had a few questions. Your video cleared those questions up amazingly! Thank you.
That's awesome to hear! 🙌
Thank you for your video. How is the cross validated accuracy calculated? Is it simply the number of good classification over the total number of data used in the testing set?
It's the model's accuracy on the testing set for a given fold.
@@dataschool Thank you, but my question was more about the term "accuracy". What is the formula used to calculate this accuracy?
Got it! Accuracy is just the percentage of correct predictions.
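In scikit-learn terms, that is what accuracy_score computes; a tiny sketch:

```python
from sklearn.metrics import accuracy_score

# accuracy = number of correct predictions / total number of predictions
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```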
You are charismatic, thank you!
Thanks!
Always eager to learn. You demystified the subject and you even made it easy for a 75 year old brain to comprehend ML methods. :) :) :)
Great to hear!
I am facing a strange error. If I run my Python scripts from the terminal, all the modules except seaborn are imported fine. However, if I use Spyder IDE, seaborn is imported okay but sklearn is not. Can someone help me out?
Whenever I need a reference, I always end up at your videos after a long search. That proves you are THE best teacher.
What a nice thing to say! Thank you! :)
When we use cross_val_score with X and y, is the model trained there, or do we need to train it again?
Kevin, you are very good in explaining. I wish I found you earlier. I just subscribed to receive all future videos and thank you for all the explanations.
Thanks for your kind words! 🙏
I am so happy to follow this kind of lecture, because your teaching style is engaging and your language clarity is excellent, so I am gaining knowledge and input for my ML thesis, as I am working on a classification (prediction) problem.
Glad to hear that my videos are helpful to you! Good luck with your thesis!
Amazingly explained. Suitably slow for a newcomer to machine learning; easily understandable. Keep it up.
Awesome, thank you! :)
I really really appreciate your efforts..this series is so helpful in learning.
I'm glad to hear the series is helpful to you! :)
I couldn't understand the feature engineering concept at 31:22. Any example?
I'm getting the following error at 9:20, please help:
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(25, n_splits=5, shuffle=False)
# print the contents of each training and testing set
TypeError: __init__() got multiple values for argument 'n_splits'
I recommend just skipping that code instead of trying to fix the error.
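For anyone who does want to run it: the model_selection version of KFold no longer takes the number of observations in the constructor; the data is passed to split() instead. A sketch:

```python
from sklearn.model_selection import KFold

# the number of observations is no longer a constructor argument;
# pass the data (or anything of the right length) to split() instead
kf = KFold(n_splits=5, shuffle=False)
for iteration, (train_idx, test_idx) in enumerate(kf.split(range(25)), start=1):
    print(iteration, train_idx, test_idx)
```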
REALLY well done!! I found this video extremely helpful. Keep up the good work
Thanks for your kind comment! :)
Thanks for the Video. This is the best Video on KNN Cross validation that I have watched. Appreciate your effort...
Thanks! :)
Just chiming in to thank you for the series; it really helps demystify things and fill in the gaps. Looking forward to working through it.
You're very welcome!
What should I do if I want to apply MinMaxScaler or StandardScaler so that it fits on the training folds and only transforms the test fold within cross_val_score? The rule of thumb is to apply these techniques to train and test separately, so how can I do this? cross_val_score doesn't have a specific argument for it.
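One common approach is to wrap the scaler and model in a Pipeline, so that each CV fold fits the scaler on its training portion only and applies that same transformation to the held-out portion; a sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# the pipeline refits the scaler on each fold's training data only,
# then transforms that fold's test data with the fitted scaler
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=10, scoring='accuracy')
print(scores.mean())
```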
Hey, I have a doubt. If I use scoring='r2' in cross-validation and get negative results, should I flip the signs the same way as with MSE scoring?
I really love your videos. They are so simple and to the point! Thanks for making such videos. :)
Thanks very much for your kind words!
HOW DOES THIS VIDEO NOT HAVE LIKE A MILLION VIEWS?!? So good. Thank you, man!
HA! Thank you :)
Good video. Maybe it's my level of English, but I don't understand why at 24:00 we compare the accuracy of KNN with linear regression, when KNN is used for classification and linear regression is used for regression. I know cross-validation works for both, but the response variable should be different in each case, since for KNN it should be categorical and for linear regression it should be continuous.
Great series, enjoying them so far. Thanks for the good content :)
I'm comparing KNN with Logistic Regression, which is used for classification. Hope that helps!
I used scoring='r2' and some of the scores are more than one. How is 'r2' calculated? Can you share some reference resources, please?
Just like Andrew Ng, you are a genius: he gave a clear explanation of the theory, and you produced a mesmerizing implementation in such a simple way. This is the most awesome machine learning video I have watched so far. You helped me more than you could ever imagine. Thank you, sir!
Thanks very much for your kind words!
you are most welcome sir
Excellent series! This is the first time I've studied machine learning. You are doing an outstanding job of transforming it from a science fiction term into a tangible subject. I really appreciate these videos!
BrothersFreedive You're very welcome! I greatly appreciate your kind comments!
I just want to know how we can use cross-validation with a non-negative matrix factorization model.
Good job on this video series!
Question on the code below near the end of video/notebook 8. In my run the first item in the grid_scores_ list is:
mean: 0.97333, std: 0.03266, params: {'weights': 'distance', 'n_neighbors': 18}, ...
When I check the type of the first item in that list item (print(type(rand.grid_scores_[1][0])) as below), it returns:
How come the items in the list, e.g. mean: 0.97333 are not surrounded by braces {} if they are dict's?
thanks, Tom
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
print( rand.grid_scores_)
print( type(rand.grid_scores_[1][0]))
Great question! I'm not positive, but I think that rand.grid_scores_[1][0] actually refers to the 'params' dictionary. I can't say for sure without checking, but let me know what you find if you decide to check!
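For anyone on a newer scikit-learn: grid_scores_ was removed in favor of cv_results_, a plain dictionary of parallel arrays, which makes the types much easier to inspect. A sketch using the same iris/KNN setup (the parameter ranges are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_dist = {'n_neighbors': list(range(1, 31)),
              'weights': ['uniform', 'distance']}

# n_iter controls the number of searches
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10,
                          scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)

# cv_results_['params'] is a list of plain dicts, one per parameter setting
print(rand.cv_results_['mean_test_score'])
print(type(rand.cv_results_['params'][0]))
```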
Okay, I'm having a boy crush moment 🥵🥵🥵
Hello sir, i wanna ask a couple of questions
1. What do "{:^9}", "{:^61}", and "{:^25}" actually mean?
2. Why do higher k values in KNN correspond to lower complexity? Is that just for KNN, or also for other models, for instance random forest?
1. That's code for string formatting. Some examples are here: pyformat.info/
2. See this article: scott.fortmann-roe.com/docs/BiasVariance.html
Hope that helps!
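To expand on point 1, a quick sketch of those format specifiers: the number is the field width and the symbol is the alignment.

```python
# '^' centers the value in the given field width; '<' left-aligns, '>' right-aligns
print(repr('{:^9}'.format('ab')))   # '   ab    '  (9 characters wide, centered)
print(repr('{:<9}'.format('ab')))   # 'ab       '
print(repr('{:>9}'.format('ab')))   # '       ab'
```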
best tutorial in youtube
Thank you!
ValueError: Found input variables with inconsistent numbers of samples: [299, 150]
I'm sorry, I can't evaluate the cause of the error without knowing what code you wrote that caused the error. Good luck!
Great video sir, but I have a question: you are using accuracy for selecting the model, but I have read that accuracy is not a sufficient measure of performance for classification models. Also, is a model with higher accuracy always better than a model with lower accuracy (in classification)?
Accuracy can be a useful evaluation metric. See here for more: ua-cam.com/video/85dtiMz9tSo/v-deo.html
@Data School: Hi sir, I have been working with a customer churn dataset. It contains 21 features, including the target variable churn, which is an integer type. While doing missing-value treatment, I tried mean imputation on the dependents and city columns (both numeric). The mean value was a fraction (i.e. float64), which resulted in an error in train_test_split(): "TypeError: Singleton array cannot be considered a valid collection". I searched on Stack Overflow; they said the error might be caused by the float value. If I go with mean imputation, what would be a possible solution? I would be grateful for your response.
Hard to say without seeing your dataset and your code... good luck!
Is there any benefit to using "-scores" instead of "abs(scores)"? They look like they do the same thing in this case. abs looks safer to me because it will always return a positive number, no matter whether you use a loss function or a reward function.
Great idea! No, I don't see any benefit of using "-scores" instead of "abs(scores)". Thanks for the suggestion!
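A small sketch illustrating the point, using the current scorer name 'neg_mean_squared_error' and synthetic data (the dataset here is made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression data, just for illustration
rng = np.random.RandomState(1)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.randn(100) * 0.1

# the scorer returns negated MSE, so every score is <= 0 and
# negation gives the same result as abs()
scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring='neg_mean_squared_error')
print(np.sqrt(-scores).mean())  # average RMSE
```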
All your videos W.I.N - you’re the best
Can someone please explain this formatting? :) print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))
This should help: pyformat.info/
The difference between the average 10-fold cross-validated RMSE with the Newspaper feature and without the Newspaper feature seems, to me, quite negligible. Could you walk through the logic behind deciding when the difference is small enough to drop a feature versus keep it? Thanks!
And, BTW: your video series is amazing! Keep it up!
Thanks for your kind words!
The simple answer is that you should always prefer a simpler model (less features), unless having more features provides a "meaningful" increase in performance. There's no strict definition for "meaningful", it depends on context. Hope that helps!
Nice to see your video after a long time. I have some confusion that I hope will be cleared up by your response.
1. Please correct me if I'm wrong: the demerit of train/test split is high variance, i.e. differences between the training and testing data will affect the testing accuracy.
2. I really want to know what the random_state parameter does when you change it from 4 to 3, 2, 1, and 0.
3. For classification, you mentioned stratified sampling to make the K folds. How does it affect the accuracy of the model (for example, if out of 5000 rows my collected dataset has 80% ham and 20% spam mail, versus 50% ham and 50% spam mail)?
4. Since you have numerical features only, you used accuracy as the metric to select the best features. What do you suggest if the dataset contains object dtypes like date formats or string objects like text data?
I am a new student of data science, so I am sorry for the long comment.
unique raj Great questions! My responses:
1. The disadvantage of train/test split is that the resulting performance estimate (called "testing accuracy") is high variance, meaning that it may change a lot depending upon the random split of the data into training and testing sets.
2. Try removing the random_state parameter, and running train_test_split multiple times. Every time you run it, you will get different splits of the data. Now use random_state=1, and run it multiple times. Every time you run it, you will get the same exact split of the data. Now change it to random_state=2, and run it multiple times. Every time you run it, you will get the same exact split of the data, though it will be different than the split resulting from random_state=1. Thus, the point of using random_state is to introduce reproducibility into your process. It doesn't actually matter whether you use random_state=1 or random_state=9999. What matters is that if you set a random_state, you can reproduce your results.
3. In this context, stratified sampling relates to how the observations are assigned to the cross-validation folds. The reason to use stratified sampling in this context is that it will produce a more reliable estimate of out-of-sample accuracy. It doesn't actually have anything to do with making the model itself more accurate.
4. My choice of classification accuracy as the evaluation metric is not actually related to the data types of the features. Your features in a scikit-learn model will always be numeric. If you have non-numeric values that you want to use as features, you have to transform them into numeric features (which I will cover in a future video).
Hope that helps!
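To illustrate point 2, a quick reproducibility check (using the iris dataset as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# same random_state -> the exact same split every time you run it
a = train_test_split(X, y, random_state=1)
b = train_test_split(X, y, random_state=1)
print((a[0] == b[0]).all())

# different random_state -> a different (but still reproducible) split
c = train_test_split(X, y, random_state=2)
print((a[0] == c[0]).all())
```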
Your explanations are very straightforward. Thanks a lot.
Thanks!
While explaining the graph of model complexity vs. model accuracy, you mentioned the best model is the middle value, k=13, where it balances bias and variance. But you also said the best value of k should produce the simplest model, and higher values of k give a simpler model, so k=20 is taken as the lowest-complexity model. I am confused about this.
I know it's confusing! This article might help to clarify: scott.fortmann-roe.com/docs/BiasVariance.html
scoring='mean_squared_error' is replaced with scoring = 'neg_mean_squared_error'
Right! I have updated the code in the GitHub repository: github.com/justmarkham/scikit-learn-videos
cross_val_score(lm, X, y, cv=10, scoring='mean_squred_error') is not working; it gives an error about 'mean_squred_error'.
Please see this notebook: github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb
At minute 22:15, why is a KNN model with a higher k less complex than one with a lower k? From the perspective of computational complexity it should be the opposite, because fewer neighbours have to be found, shouldn't it?
And of course: Thanks for the great video.
Great question! I don't have a short answer, but this article should help explain: scott.fortmann-roe.com/docs/BiasVariance.html
Why does a higher K produce a simpler KNN model? I thought the higher K the more complex? 22:09
EDIT: Alright, makes sense now. I rewatched the episode on KNN and looked at the partition diagrams. In contrast to the other introduced models a higher value makes KNN less complex.
Glad you were able to figure it out!
i have watched several videos on this subject. this was the only one that has met my expectations
Thanks for your kind words!
Hello Kevin, could you please elaborate a little bit why the bigger K is the less complicated the model is and thereafter K=20 should be selected instead of K=14?
Thanks,
Terry
This should help explain it: scott.fortmann-roe.com/docs/BiasVariance.html
Hello dear sir, your video tutorial is really appreciated, you are doing a great job. Sir, can you make one tutorial on image datasets? That is, we have images and we want to build a dataset like MNIST or FMNIST; how do we do this job? Thanks, sir.
Thanks for your suggestion!
After we use K-fold cross-validation, which can be used for selecting optimal tuning parameters, choosing between models, and selecting features, how do we make the model? It looks like it only helps us select a model or tune the parameters. If we want to build a model and make predictions, do we still need to split the data into training and testing sets and then fit the chosen algorithm, such as KNN?
For instance: knn.fit(X_train, y_train) and then knn.predict(X_test).
Is that the correct concept?
No, cross-validation is an alternative to train/test split. Once you choose your tuning parameters, you fit the model on all of the data before making predictions. Hope that helps!
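A minimal sketch of that workflow, using iris and KNN as stand-ins (the candidate range of k is an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1. use cross-validation (not train/test split) to choose the tuning parameter
k_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X, y, cv=10, scoring='accuracy').mean()
            for k in range(1, 31)}
best_k = max(k_scores, key=k_scores.get)

# 2. refit the chosen model on ALL of the data, then predict new observations
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print(best_k, knn.predict([[3, 5, 4, 2]]))
```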
I'd really like some tutorials on Pipeline and the ROC/AUC curve, if you can. Thanks in advance!
This covers ROC: ua-cam.com/video/85dtiMz9tSo/v-deo.html
This covers Pipeline: www.dataschool.io/learn/
Is Python better than R? Which one do you recommend for beginners?
You might find this article helpful: www.dataschool.io/python-or-r-for-data-science/
Another question: how do you find the final numeric values of the parameter coefficients that minimize the loss function? In your linear regression example, how do you find beta_0, ..., beta_3 in y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper? Are beta_0, ..., beta_3 averages over the K models, i.e. beta_i = (1/K) * sum_k beta_i^(k) for i = 0, 1, 2, 3 and k = 1, ..., K? Or, once CV tells you that y = beta_0 + beta_1*TV + beta_2*Radio + beta_3*Newspaper is the best model, do you just fit it on ALL the data points to compute the coefficients?
+donkeyenvious Once you have selected the best model via cross-validation, you train that same model (meaning the same features and model tuning parameters) on all of your training data, and that tells you the coefficients to use when making actual future predictions.
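A small sketch of that final step on synthetic stand-in data (the features and true coefficients here are made up for illustration, not the Advertising dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic stand-ins for the TV, Radio, Newspaper features
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 2.9 + X @ np.array([0.05, 0.19, 0.0]) + rng.randn(200) * 0.01

# the coefficients come from one final fit on ALL the data,
# not from averaging the K per-fold models
lm = LinearRegression().fit(X, y)
print(lm.intercept_, lm.coef_)
```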
+Data School Thanks for your explanation!
While applying cv=10 in the cross_val_score function we get "All the n_labels for individual classes are less than 10 folds," but when we reduce it to cv=5 we get a result. Can you please help explain why we get this error?
It's hard for me to say without having access to your code and dataset. Good luck!
Great information, but I found the delivery much too slow. You can't please everyone, though, so please disregard :)
If it's helpful, you can change the speed of any UA-cam video to 1.25x or more.
How do you write the code on line 7 at 9:20 using the model_selection module? KFold seems to be set up slightly differently in that module and it's not really working for me. Thanks, love your videos.
I'll be updating the Jupyter notebooks for this video series soon, and will let you know when that is complete. Stay tuned! :)
Extremely useful, and thank you for sharing. My question is about very, very large datasets. Is it okay to randomly sample a portion of the data and do this cross-validation on the sample? If the sample is representative of the original data, will the cross-validation results for the sample be representative of the cross-validation results for the original data?
That should work, if the random sample is truly representative and there is not severe class imbalance. (If the positive class for the response value you are studying is extremely rare, then sampling may remove too many instances of that class, which would skew the results.)
Newbie question here:
After all the process of Cross-Validation, do fitting the data and making prediction need a different process?
Or you just do as normal fitting the training data and then make predictions on your test data?
Sorry, it depends a lot on what you mean by "a different process" and "you just do as normal"... it's hard for me to say without a lot more details! I recommend reviewing the notebooks or videos here: github.com/justmarkham/scikit-learn-videos
Around 29:40, when you're redoing the cross-validation with just the TV and Radio columns, wouldn't you need to refit the linear model? Because the previous one was fitted using TV, Radio, and Newspaper, but this new one is only involving TV and Radio.
Great video! The only line of code that I needed to update was reshaping the data to pass into the binarize function and then flattening the returned ndarray:
y_pred_class_2 = binarize(y_pred_prob.reshape((192, 1)), threshold=0.3).flatten()
I found lots and lots of videos about ML and cross-validation; I watched them all and tried to follow, but they were very hard to understand.
But you make it easy. I was very confused by cross-validation, and now it's more than clear.
Thank you very much for this video and for your channel!
You're very welcome! Glad it was helpful to you!
@Data School I have a question about minute 29:44, where you conclude that we choose the second model because its score is less than the other one. But we already negated and flipped the sign of the mean_squared_error, so why do we choose the smaller number? I thought it should be the higher one. Thanks!
Any metric that is named "error" is something you want to minimize. Thus we chose the second model because it minimized RMSE. Hope that helps!
So the program knows that scoring='accuracy' means to calculate the accuracy from sklearn.metrics?
Exactly.
thank you for introducing me to ML, and also for helping me understand Python through your great pandas videos!
You are very welcome!
This is very clear; a bit slow, but very clear.
Question: do you know what the impact of the seed value is on, e.g., 10-fold cross-validation? I'm assuming from your video that it starts the test set at a different location within the training set, because Weka gives me a different accuracy each time I use seed values from 1 to 10.
+olie tim Are you asking about the choice of "K" for K-fold cross-validation, meaning the number of folds?
This is great, thanks! But how do we use this to predict on the test set?
This is very good for supervised models. Can you please give your input on how to measure the accuracy of an unsupervised model? Thank you.
Sorry, I won't be able to advise. Good luck!
A low value of K produces a model with low bias and high variance? I thought it was the opposite: low K means high bias and low variance.
For KNN, lower values of K result in a model with lower bias and higher variance. This article explains that point further: scott.fortmann-roe.com/docs/BiasVariance.html
very good video and channel...
In cross_val_score, how should we decide which scoring parameter to use when?
This might help: github.com/justmarkham/DAT8/blob/master/other/model_evaluation_comparison.md
All about cross validation in one video , THAAAAAAAAAAAAAAAAAAAAAANK YOU
You're very welcome! :)
Great vid, Kevin! But I have a question: why is k-fold cross-validation so similar to leave-one-out cross-validation?
Leave one out cross-validation is just a special case of K-fold cross-validation. Glad you like the video!
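A quick check of that equivalence, with a tiny 10-observation array just for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)

# leave-one-out is K-fold with K equal to the number of observations
loo_splits = list(LeaveOneOut().split(X))
kf_splits = list(KFold(n_splits=len(X)).split(X))

for (tr1, te1), (tr2, te2) in zip(loo_splits, kf_splits):
    assert (tr1 == tr2).all() and (te1 == te2).all()
print(len(loo_splits))  # one fold per observation
```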
Very clear and detailed explanations. Also the links after the videos are very helpful. Thanks.
You're welcome!
Very well explained!!!! Explanation is very impressive...
Thanks!