Having problems with the code? I just finished updating the notebooks to use *scikit-learn 0.23* and *Python 3.9* 🎉! You can download the updated notebooks here: github.com/justmarkham/scikit-learn-videos
Really simple to understand. Doesn't make it seem like "its a library thing, library does it for ya". Thank you for doing this
You're very welcome! Thanks for your kind words!
This is by far the best scikit-learn tutorial on UA-cam. I can say this because I have seen almost every tutorial, and this covers everything starting from scratch. I knew how all the algorithms work, but what I needed was how to implement those algorithms, from loading the dataset to all the terminology to checking the accuracy and whatnot, and this series has everything I was looking for. Thank you so much for this. Really appreciate it.
Wow! Thank you so much for your kind words! :)
That's some killer delivery, you didn't waste a word! Great tutorial!
Thanks so much!
"models that overfit have learned the noise in the data rather than the signal" - yes, well said!
Glad it was helpful to you!
I like the pace of these videos. You speak really slow and clear which helps your viewer to digest the information on the fly. Loving your work!
Thanks for the feedback! I'm really glad to hear that my presentation of the material works well for you. Good luck with your education!
Yeah, the slow pace is generally great, though personally I view these at 1.25 speed. Still clear at that rate too. :)
*Note:* This video was recorded using Python 2.7 and scikit-learn 0.16. Recently, I updated the code to use Python 3.6 and scikit-learn 0.19.1. You can download the updated code here: github.com/justmarkham/scikit-learn-videos
You're very welcome!
@@dataschool Can we have a lecture about TensorFlow?
Most notable takeaways from the video:
- "Plotting testing accuracy vs model complexity is a very useful way to tune any parameters that relate to model complexity."
- "Once you have chosen a model and its optimal parameters and are ready to make predictions on out-of-sample data, it's important to retrain your model on all of the available training data."
- Repeating the train/test split process multiple times in a systematic way using k-fold cross-validation (see the sketch after this list)
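For anyone following along, here is a minimal sketch of that workflow on the iris dataset (assuming K=11 was the value chosen during tuning, and using the current model_selection import path):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)

# split once to estimate out-of-sample (testing) accuracy for the chosen K
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, knn.predict(X_test)))

# once the tuning parameter is chosen, retrain on ALL available data
# before making predictions on genuinely out-of-sample observations
knn.fit(X, y)
print(knn.predict([[3, 5, 4, 2]]))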
Great summary! I approve :)
Your way of delivery is exceptional. I have never seen somebody teach so well. It made me interested in ML. Thanks bro... God bless U
Awesome, thank you!
The last two videos are the best ones I’ve seen someone explain scikit learn’s predictions. Every other video jumps straight to the full analysis but in reality, you can predict in as little as 4 lines of code. Great job!
Thanks! :)
This video series sets such a high standard for content, context, and delivery of machine learning training! It's a winner for all those who are starting to learn machine learning!! Thank you so much for your efforts Kevin!!!
Wow, thank you so much for your very kind comment! I really appreciate your support!
Your teaching style is outstanding. As someone who has used R in the past, I really appreciate the clarity of your explanations and demonstrations.
Thank you, Frank!
Dear Kevin. To me your videos are a reference, as those of Mr Andrew Ng. Very good job! Thank you very much from Spain :)
You're very welcome!
One of the best "machine learning" tutorials I have seen. Thank you for giving us the opportunity to learn. Also, your pronunciation is perfect for Spanish speakers.
When it comes to using the model for future predictions on real-life data, you can directly use the trained model without retraining it with the whole training data, including the test data. The idea is that the model has learned patterns and relationships from the training data that generalize well to unseen data, including real-life data.
Retraining the model with the entire dataset, including the test data, is generally not recommended as it may lead to overfitting. Overfitting occurs when the model becomes too specific to the training data, capturing noise and irrelevant patterns, which can reduce its performance on new data.
Thanks for sharing, but if I'm understanding you correctly, I respectfully disagree.
Loving this series, man! Just started out with ML and DS, and I'm understanding everything.
That's excellent to hear!
I thank God I landed on your videos. I see things clearer than ever. You are a gifted tutor. God bless you sir.
Wow, thanks so much for your incredibly kind comments!
A student from China who jumped over the Great Firewall and learned from this excellent class. Thx.
Awesome! You're very welcome!
Wow I must say your teaching style is amazing. Very organized, thorough and easy to follow. Thanks for your time, and keep making great videos! I wish more professors were like you at my school.
+Juan P Castillo What a nice comment! Thank you so much for your generous words! I'm glad the series has been helpful to you :)
I couldn't agree more with berry jordaan.
The way you deliver the content of a quite complex topic naturally guides me to want to learn more about machine learning.
Thank you very much
Thank you so much for your comment - you're very welcome!
Excellent teaching!!! I am required to set up competency around advanced analytics involving ML/DS in my organization (since I am coming from a DWH and BI practice), so I wanted to learn and practice. Now I feel like taking this up as a full-time profession and becoming a data scientist. It's so much fun and exciting work, and videos like this have made it a lot easier. Thank you!!!
Awesome! So glad to hear! Thanks for your kind words, and good luck on your educational journey :)
My confidence level about learning machine learning is super high after seeing this video. Your every word is very clear and correct. Thank you very much.
That's awesome to hear!
for i in range(1, 10001):
    print("THANK YOU VERY MUCH")
HA! Love it! You're very welcome :)
Data School Thank you.
Your reply shows your passion for programming.
Keep up the good work of teaching.
Thanks! :)
No, this is the correct one:
while 1 == 1:
    print("THANK YOU VERY MUCH")
while True: print("THANK YOU VERY MUCH")
Kevin, this series is excellent; you are able to really simplify the topic and make it easy to learn. Thanks!
Thank you!
Best video series I've come across on sklearn! I tried a few other channels before this and was left feeling like I still had no idea what was going on, but after only 5 of your videos I already feel way more confident that I can actually get into it, cheers!
Awesome! Thanks for your kind comments, and good for you! :)
I have been reading from a lot of sources, but to date this series is the best! I wish there were many more videos and references to take us to the advanced level!
Thanks so much for your kind comment!
Man, he just makes it so easy to learn.
Wish we had teachers half as good as him in school.
Thank you so much Gautam!
I was looking for ML tutorials and can say that your videos are simply the best.Thanks a lot
Wow, thank you so much! What a nice comment!
This was one of the best videos on the topic that I've found. Thank you for being so succinct and breaking this down so clearly!
You're so very welcome!
“Overfitting learns the noise of the data, rather than the signals”
I finally understand what overfitting means.
Awesome! So glad to hear!
Yes , you are a great teacher.
I really disliked machine learning after we got taught it at uni. You really have sparked my interest again; thank you so much for this series.
That's great to hear! You are very welcome.
OMG, finally found an ML tutor who is awesome... I can't skip a single second of your videos; every word is informative.
Thanks so much for your kind words! I truly appreciate it!
Thank you Kevin for sharing well-organized, normal-speed video lectures on scikit-learn. These videos are very helpful for teaching ML in Python to graduate students. The links in the resources are also very valuable. You deserve appreciation. I would suggest uploading lectures on ML with R.
You're very welcome! I'm glad to hear the videos have been helpful to you! I'm focused on Python these days, so I don't anticipate making any videos on R - sorry!
It used to be hard for me to learn Machine Learning, but now thanks to you it isn't anymore
Thanks so much! That is awesome to hear 😄
This is such a gem for beginners .Thank you very much Kevin
You're very welcome!
Would you please clarify why we need to use `solver='liblinear'` as one of the parameters in the LogisticRegression model? Why do we leave the rest of the parameters at their defaults? Also, why do we import `metrics` from `sklearn` to compute accuracy, whereas we could simply use the `score` function straight from the `LogisticRegression` model that we imported?
Great questions!
1. liblinear was the default solver when I recorded the video, but it is no longer the default solver. Thus, in the current version of scikit-learn, you have to set it explicitly because it happens that the current default solver does not converge with this particular dataset.
2. scikit-learn uses sensible defaults, and so generally you start with the defaults and tweak them as needed (or as part of the hyperparameter tuning process).
3. You can use the score function, but the metrics module offers far more flexibility, and thus that is what I tend to use.
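A minimal sketch of points 1 and 3, assuming a recent scikit-learn version (where the solver must be set explicitly for this dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

X, y = load_iris(return_X_y=True)

# set the solver explicitly; all other parameters keep their sensible defaults
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X, y)

# the estimator's built-in score method...
print(logreg.score(X, y))
# ...and the more flexible metrics module give the same accuracy here
print(metrics.accuracy_score(y, logreg.predict(X)))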
@@dataschool thank you!
Hi, Kevin. Let me make a remark. At 3:00 you mentioned that the training dataset must take ALL the samples (so including the "test" samples as well)? The thing is, if we do so, the test samples won't demonstrate any error at all. So it must be said that BEFORE training the model, the whole dataset should be split into two groups (one for training the model and one for testing). What's your opinion on that? And thankfully, around 11:00 you talked about exactly that; I was worried I would be lost, but it's totally great that you covered that information. Thanks.
In the video, I outline why you should not use evaluation procedure #1. I included it for explanatory purposes.
Hope that helps!
Great series; honestly, it's the most easily understandable lecture on one of the most complicated topics in computer science.
Love the flow of the video and the pacing of the complexity; it's really easy to follow. I have several suggestions for improvement:
1. When you point out specific parts of the screen, it would be great to not just use the cursor but also some more visually impactful feedback (there are tools for this)
2. Would love to get a repeated definition of the specific terms (such as model complexity, what does that mean? The higher the value of n_neighbors the more complex it is? what does it mean to be complex?)
3. I understand that this is an introduction class, but it would be really helpful to show the industry's best practices (advanced series?)
Great work, I subscribed, and liking all of your videos.
+SomeIndoGuy Thanks for your very kind comments, as well as your feedback!
Regarding model complexity, this is an excellent essay on the bias-variance tradeoff (a critical machine learning topic) that touches on model complexity: scott.fortmann-roe.com/docs/BiasVariance.html
I believe the K value you have set for teaching is perfect for my learning. thanks
+vishwas s HA! Love a good machine learning joke :)
I have a question, Why have you used logistic regression here (03:24), while before you said that this is not regression problem, its a classification problem ?
+umar0021 Logistic regression is actually a model used for classification problems, despite its name. (Confusing, I know!)
This is a brilliant tutorial -- I love everything about it. Thanks.
+Khalil Muhammad Wow, thank you! I really appreciate your kind words!
I've been watching this tutorial over the last few days... Very, very precise and accurate content. It made me rewind and watch many times! Great!
Awesome! Glad it's helpful to you!
I started after you put up a video on how to make a submission on Kaggle at my request. I did well in the last contest and finished 144th on the leaderboard :) All credit goes to you
Amazing!!! That's great to hear! :)
For others who might be interested, this is my video about creating Kaggle submissions: ua-cam.com/video/ylRlGCtAtiE/v-deo.html
Thanks so much for all these videos! I'm doing an internship at a really nice group, but they're letting me figure out most of the stuff by myself, so this is super useful!
You are so welcome!
Quarantine with Data School is lit!!
Thank you so much for putting up this series. I was looking for something basic yet comprehensive and easy to follow. This is being very helpful to me. Thanks.
That's great to hear! You are very welcome.
Hi Kevin, regression is supervised learning in which the response is ordered and continuous, so why can we use it to analyze the iris dataset? Thank you!
Logistic regression is a classification model, not a regression model. (It's confusing, I know!) That's why we can use logistic regression in this case.
Thanks, sir, for putting in the effort to make these videos. Being a beginner, I find these resources extremely helpful.
You're welcome!
awesome video. i would also love to see a video regarding SVM kernels, major differences among them, when to choose them, and how the different parameters may affect the classification and the metrics.
Glad you liked it, and thanks for the suggestion!
Thank you for these videos! They are well made and clear. I don't think I understood ML until sitting through your videos.
Thanks for your kind comment! That's so nice to hear.
I have a question. How can I test the accuracy score of prediction of a random sample? Let's consider your previous video where you calculated knn.predict([3,5,4,2]) which gives the output value: 2 (virginica). How can I calculate the accuracy score for this prediction? Thanks anyway
The only way to check the accuracy of a single prediction is if you know the "ground truth", meaning the actual value. If you don't know the ground truth, then you can't measure whether the prediction was accurate. Hope that helps!
thanks
My understanding is that for any kind of prediction (single or multiple), to know the accuracy you have to know the "ground truth", unless the accuracy is derived from the train/test models. However, you can calculate the variance (some also call it precision) for multiple predictions. I read the short paper you linked, "Understanding the Bias-Variance Tradeoff", and wonder about relative systems: in some cases an absolute ground truth is not available and you have to rely on a reference system to build a "relative ground truth". In that case, what can be done to guide the precision (or variance)?
I also calculated the KNN values versus the accuracy scores in this video. I got different values each time I ran the function, and the differences are quite significant. Is it supposed to be so?
When you say "I have different values each time I ran the function", I can't think of a reason that that would occur, unless you are changing which observations are in the training and testing sets.
Hi Kevin,
Do you have any blog which you continuously update to stay in touch with your latest info on machine learning/Data science articles ?
I have a blog, but it doesn't focus on what you are describing: www.dataschool.io
Hello,
I don't understand: why do we eventually fit the whole dataset into the model after we get the best K value from using the training and testing datasets?
Jason
You use all of the data when fitting so that the model can learn from all of the data you have available.
Hi, Kevin. Wonderful videos for freshmen. Thank you very much.
At 19:54, 'the relationship between the value of `X` and the testing accuracy': I think it should be `K`, not `X`. Am I right?
You're right! I meant to say "K".
Glad the videos have been helpful to you!
I have to say this is another great lesson by Kevin. Thank you very much indeed.
Thanks! I'm glad my videos are helpful to you!
Hi, at 21:56, why does a low K value make KNN more complex?
This article might be helpful to you: scott.fortmann-roe.com/docs/BiasVariance.html
I didn't understand: what is the use of random_state?
Is it the case that the same random_state value and the same test_size give the same accuracy on a particular dataset?
It's complicated to explain briefly, but random_state is used for reproducibility.
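A minimal sketch of what that reproducibility means in practice (the parameter values are just for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# the same random_state and test_size always produce the identical split,
# so any accuracy computed from it is reproducible across runs
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.4, random_state=4)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.4, random_state=4)
print((X_train1 == X_train2).all())  # True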
Hi Kevin,
For logistic regression, the model is fit with (X, y) input data. When you call the predict method for this model with input X, we might expect the model to output exactly y, since it has been fit with this sample of data (X, y). Am I getting it wrong here? Thanks for your answer.
Great question! Models (with a few exceptions) don't exactly memorize the training data. Thus when given the same exact data, they don't make the same exact predictions.
@@dataschool Hi Kevin,
Thanks for the clarification. Your channel is very helpful and informative. Keep it up :-)
Thanks!
Great videos kevin. I like your deliberately slow style. It is hard to improve, but if I may suggest something. As your videos are long, it would be useful if you have an index in the description with links to the times of the subtopics. That would help a lot on review and certainly would increase the number of re-visits.
Thanks for the suggestion! I know the videos are super long, but ever since making this series, I have tried to make shorter videos.
And, thanks for the time-coding suggestion! I'll consider it.
I have a question on 05_model_evaluation about prediction. The model is fit with its own X and y: logreg.fit(X, y).
So why is logreg.predict(X) a little bit different from the actual target (y)? This X is not even from another source. I'm confused by this part.
I'm sorry, I don't understand your question!
Hello jake Monk. I think we get an accuracy value lower than 1.0 because our model makes some approximations. You can see this in this plot: scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#example-linear-model-plot-iris-logistic-py
The model divides all points into three areas, but some points belonging to one area fall into another area under our model.
What will happen if the values of the features are different in scale?
I love you, man. I have watched every single video of yours.
Thank you so much!
I love the series so far. I have learned so much. Thank you for creating these. They are quite easy to follow.
Awesome, that's great to hear!
Thanks for the video. What do you think about dividing the data into 3 categories: training (60%), cross-validation (20%), and test (20%)? Or more advanced techniques: divide the training set into 10 pieces and train the model several times on 9/10 of the training set, each time using another 1/10 for testing.
Eli Lavi Great questions!
1. A three-way split of your data (sometimes called train/test/holdout) is useful if you need a less biased estimate of out-of-sample error. In that case, you use train/test to select model parameters (as shown in the video), and then use the holdout set at the very end to estimate out-of-sample error.
2. What you described with the 10-way split is called 10-fold cross-validation. It's very useful, and I'll cover it in an upcoming video! Briefly, it provides slightly better estimates of out-of-sample error than train/test split, but is also 10 times more computationally expensive and less flexible than train/test split (in terms of use cases).
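For readers who want to try it before that video, here is a minimal sketch of 10-fold cross-validation using scikit-learn's cross_val_score (KNN with K=5 is just an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# fits the model 10 times, each time holding out a different 1/10 for testing
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores.mean())  # average of the 10 accuracy estimates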
Hello Kevin.
I was following your tutorials, but encountered this problem.
I trained a classifier on a (12000, 5) data frame, and when I tried to predict from a new (15, 5) dataset I got an error: shapes (15,5) and (9,) not aligned: 5 (dim 1) != 9 (dim 0).
Would you please explain what I am doing wrong?
Thank you.
I'm sorry, I wouldn't be able to help without having access to more of your code and data. Good luck!
Dear Kevin. I need to find the significance of each feature (like p-values in R), odds ratios, and other stats regarding the model, while using scikit-learn for logistic regression. Most sources say we should use Statsmodels instead of scikit-learn for that. Isn't that possible with scikit-learn? Please help.
Yes, it's correct that Statsmodels is more appropriate for those tasks. Hope that helps!
Okay, thanks a lot. Can we use both libraries on the same model? Since cross-validation, probability prediction, and class prediction are pretty simple in scikit-learn, right!
No, you can't use both libraries with the same model. And yes, scikit-learn does make many common machine learning tasks simple!
You did the job quickly enough for me to jump-start ML. All the courses I saw had durations ranging from 3 months to a year... thank you!!
Do you have any tutorial for neural nets and deep learning?
Glad the videos were helpful to you! I don't currently have any tutorials on neural networks or deep learning, but please subscribe to my newsletter for updates on future tutorials: www.dataschool.io/subscribe/
if it means anything to you, i really like the way you put things and simplify them, thanx man
Thanks for your kind comment!
I get this error when trying to make a prediction: ValueError: Expected 2D array, got 1D array instead:
array=[3 5 4 2].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
can anyone help me?
Try this: knn.predict([[3, 5, 4, 2]])
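A self-contained sketch of that fix, assuming a KNN classifier fit on the iris data as in the video:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# newer scikit-learn versions require a 2D array (one row per sample)
print(knn.predict([[3, 5, 4, 2]]))                         # nested list: shape (1, 4)
print(knn.predict(np.array([3, 5, 4, 2]).reshape(1, -1)))  # equivalent, via reshape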
I am getting the error below when I try to use the LogisticRegression model. "ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT." Anyone who knows how I can resolve it?
Try changing the solver to liblinear when creating the LogisticRegression object.
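Sketched below are two common workarounds; raising max_iter is an alternative I'm suggesting here, not something from the video:

from sklearn.linear_model import LogisticRegression

# option 1: switch to a solver that converges on this small dataset
logreg = LogisticRegression(solver='liblinear')
# option 2: keep the default lbfgs solver but allow it more iterations
logreg = LogisticRegression(max_iter=1000)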
Great explanation of noise and signal!
Awesome, thank you!
Awesome... highly effective communication... so far the best of the machine learning videos... very grateful to the author. The flow and methodology make machine learning look so simple, when in fact it is quite complex for beginners like me.
Thanks so much for your kind comment! I'm glad to hear the machine learning videos have been helpful to you. I know it's complex but you will get it eventually... good luck with your education!
Does train_test_split use simple random sampling or stratified random sampling? If it uses simple random sampling, then how can I use stratified sampling?
There are input options to the function where you can specify stratification; see the scikit-learn train_test_split page for details.
This section of the scikit-learn documentation should be helpful to you: scikit-learn.org/stable/modules/cross_validation.html#a-note-on-shuffling
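A minimal sketch, assuming scikit-learn 0.17 or later, where train_test_split accepts a stratify parameter:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the class proportions in both the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4, stratify=y)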
I have a doubt about the accuracy part: we are testing (y, y_pred), in which both are the (labelled) targets, so how can we compute the accuracy from targets only? I think we need both the data from X_test and the target (y_pred) predicted from X_train for accuracy. Please can anyone explain and clarify?
I don't quite follow your question, I'm sorry!
You're doing a great job. I would just emphasize giving more examples that are relatable, and speaking as if you're talking to another person in the room. I only give feedback because that's what I would've wanted from people tuning in.
Thanks for your suggestions!
Dear lovely Kevin,
I have read the NumPy PDF you suggested.
I am wondering if you can suggest any PDF for matplotlib.
Dear lovely Kostas, I can't think of a PDF for matplotlib right now, sorry! :)
Awesome... Can you post some videos on using random forest and SVM techniques, with examples?
Thanks for your suggestion!
Wow. This video in particular is one of the most useful videos that I have found on all of UA-cam. Thank you very much; you're a great person and a great teacher!
Wow, thank you so much! :)
Thank you for such clear and well done tutorials!
Thanks for your kind words!
Nicely paced set of tutorials. Thanks
You're welcome!
Can we say that KNN would overfit as K values get smaller?
Mate, awesome videos. You saved my ass for an ML deadline. Awesome, really.
That's awesome to hear!
I have a recurring problem, please, and I could not find a solution.
Yesterday I followed the instructions and everything was OK, but this error popped up:
ModuleNotFoundError: No module named 'sklearn'
Sometimes it is enough to restart the computer and everything works fine; sometimes it doesn't, and I need to reinstall Anaconda and reinstall the packages. This happens with all packages.
What should I do to install everything once and have it always work? I am tired of these errors.
Not sure, I'm sorry!
Hi Kevin... I have several tokenized text files. I want to compare each of these text files with another text file and check the similarities or differences.
How am I able to do that using scikit-learn or NLTK?
I'm sorry, it's hard for me to say how you should approach this task without knowing a lot more information from you. Good luck!
I want to do a ranking system for CVs submitted for a job post. First, through an API, I get all the CVs in JSON format. Then I separate each skill, education, and experience entry of each resume into separate text files after stemming and lemmatizing. Each CV has many skills, education entries, and experiences. In the same way, through an API, I get the requirements, from which I extract the required skills, education, and experience. After doing that, I want to compare each CV's skills, education, and experience with the required skills, education, and experience. For skills, I did keyword matching and checked whether each skill is present in the required skills set. But for experience and education, I am planning to build models, and through machine learning I want to do the comparison. I wish you could give me an idea of how to approach this, sir. :)
If you want to use supervised machine learning with this task, there has to be a "ground truth" that you are trying to predict. It doesn't sound like there is a ground truth in your case, such as "did a particular resume result in a job offer for a particular job".
I can provide you my code, sir. I am not good at Python and my code is not up to standard: pastebin.com/4CQnjMd8
I know much of the code there is redundant, but I did it that way since I was testing some of the outputs while writing the code.
How do I deal with 3D medical image datasets? And suppose I have a 3D image; how can I test that image to get the categorical result?
I'm not sure I understand your question, I'm sorry!
1. Suppose I want to do k-learning or a CNN on medical images; for example, take skin disease images. How will I preprocess them, and how will I create that k-learning or CNN network?
2. If for the skin disease images we have our own images and values, then how do we feed that image and the CSV values to get the categorical result of what kind of disease it is?
Sorry, I won't be able to help... good luck!
If we train the model on the entire dataset with KNN (K=5), and then pass the same dataset to it to evaluate its performance, how come we don't receive 100% accuracy?
When you train and test on the same dataset, you will get an overly optimistic estimate of out-of-sample performance. However, the estimated performance will rarely be 100%, because most models will only learn an approximation of the training data during the model fitting step, rather than learning it exactly.
I'm sorry if that is not clear - it's a very difficult question to answer in a few sentences!
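A minimal sketch that demonstrates this on the iris dataset (the exact number you see may differ slightly):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# train and test on the SAME data: optimistic, yet usually below 100%,
# because K=5 averages over 5 neighbors rather than memorizing each point
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.score(X, y))  # high, but typically not 1.0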
Good video, thanks for the information. Is there any other video with an example other than the iris dataset? (Regarding logistic regression and KNN.)
Sure, how about this video: ua-cam.com/video/85dtiMz9tSo/v-deo.html
Thank u
Thank u. Is there any video on time series forecasting? Thank u.
Great videos Sir. But i have one question i.e. how come the accuracy is not 100% since we are predicting on the same features X that we fit it with :
logreg.fit(X,y)
y_pred = logreg.predict(X)
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
0.96
Great question! It's because this type of model does not memorize the training data in the way you are assuming. That's the best I can explain it briefly.
Hi, thanks for the very detailed tutorial, very useful. I am just having a problem understanding this line:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
Maybe this is a Python question more than a machine learning question, but how can you have "X_train, X_test, y_train, y_test = something"? Aren't all these variables going to end up with the same value? Thanks
That is a great question! It's a Python question, not a machine learning question.
It's called tuple unpacking. Try this:
a, b = 1, 2
print(a)
print(b)
Search for "tuple unpacking" on this page for more examples: www.dataschool.io/python-quick-reference/
Hope that helps!
awesome, thanks
The train/test split module will be removed in 0.20; any suggestions?
It was just moved to the model_selection module after I recorded this video: scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
Love the videos! I'm curious: why does the accuracy drop so harshly at k = 18, but then rise again at k = 19? Is there a mathematical explanation? Does it just have to do with the way testing/training accuracy works? Just trying to fully wrap my head around everything, thank you!
Glad the videos are helpful to you! Regarding your question, that drop is just due to the natural variation in the data. As well, train/test split is a high variance procedure, meaning its results will vary depending upon which observations happen to be in the training set versus the testing set. Finally, I would just mention that the drop is not actually large -- this is a tiny dataset, and we are zoomed way in to the plot. Hope that helps!
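For reference, a minimal sketch of the testing-accuracy-versus-K plot under discussion (the split parameters mirror the ones used in the video):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing accuracy')
plt.show()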
Great video lecture series. Love the slow, clear delivery. I noticed a few deprecation warnings when running the code myself. Is there a forum for reporting technical issues/questions?
Glad you like the series!
Regarding technical issues, you are welcome to log them as issues in my GitHub repo: github.com/justmarkham/scikit-learn-videos/issues. I will eventually update the notebooks to reflect Python 3, and I know the API is changing slightly in the upcoming scikit-learn 0.18 release, so I will address that as well. But I'd love to know the specifics of any errors or warnings that you receive!
Regarding questions, you can ask them on GitHub, or post UA-cam comments, and I'll see them either way. Thanks!
I've tried using the fit method, but I got the following error:
fit() missing 1 required positional argument: 'y'
Help?
I'm sorry, I can't diagnose your code without a lot more information. Good luck!
This lecture is fantastic and extremely helpful for learning machine learning from scratch. I very much appreciate you sharing this wonderful video.
Thanks for your kind comment!
Great video! Great resources to understand the bias-variance tradeoff. You are a reference, Kevin. Thanks a ton.
You're very welcome!
I tried to play with the iris dataset and a question came to me, if you can help me please. Here the response vector is evenly split in 3x50 (for each type of iris), so when you split and test the dataset, the split command has a high chance to split equally and provide a training dataset representative of the overall starting dataset to train your classification algorithm. Question: what if the response vector that is given to you is "not" evenly split (say: 30% of irises type 0, 50% type 1, 20% type 2)? Does the split command of scikit-learn take care of it automatically so that I have a "representative" training dataset to train my model? Or do I have to do it manually? Or is there another command I should use?
P.S.: Sorry if it is a silly question, and thanks again for your time.
+arab ilies Great question! You're asking about stratified sampling, which means that the response class proportions should be (approximately) preserved between the training and testing sets. As to whether 'train_test_split' does this by default, I believe the answer is "no" prior to version 0.17. Starting in version 0.17, there is a 'stratify' parameter you can use with 'train_test_split' to accomplish this. More information is here: scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
thanks a lot sir !!
Do you have a full machine learning Python course I can do? Post the link please.
www.dataschool.io/learn/
Very well explained and great teaching style!! I am doing my first pass through your videos. I will go back and enter the python code and run these on my next pass. I was hoping I could find a set of graded exercises at the end of each video. Any thoughts on this ?
Glad you like the videos! The only course for which I offer exercises is my paid online course, Machine Learning with Text: www.dataschool.io/learn/
My goodness, what intriguing and useful videos. You have a true gift.
Thank you!
What is the convention behind naming X capital and y lowercase?
As I remember, in the previous parts of this series this convention was explained as: use a capital letter for matrices (2D arrays) and a lowercase letter for vectors (1D arrays).
Exactly correct, thanks! More information is available in this video: ua-cam.com/video/hd1W4CyPX58/v-deo.html