K-means Clustering From Scratch In Python [Machine Learning Tutorial]
- Published Jun 12, 2024
- In this project, we'll build a k-means clustering algorithm from scratch. Clustering is an unsupervised machine learning technique that can find patterns in your data. K-means is one of the most popular forms of clustering.
We'll create our algorithm using Python and pandas. We'll then compare it to the reference implementation from scikit-learn.
You can find the full project code here - github.com/dataquestio/projec... .
You can download the data here - www.kaggle.com/datasets/stefa... .
Project Steps
- Write out pseudocode for the algorithm
- Code the k-means algorithm
- Plot the clusters from the algorithm
- Compare performance to the scikit-learn algorithm
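The steps above can be sketched roughly as follows. This is a minimal sketch rather than the code from the video: the function names (`random_centroids`, `assign_labels`, `update_centroids`, `kmeans`), the toy data, the arithmetic-mean update, and the one-centroid-per-row layout are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def random_centroids(data, k, rng):
    # Pick k distinct rows as starting centroids (one common initialization).
    idx = rng.choice(len(data), size=k, replace=False)
    return data.iloc[idx].to_numpy(dtype=float)

def assign_labels(data, centroids):
    # Euclidean distance from every point to every centroid, shape (n, k).
    diffs = data.to_numpy(dtype=float)[:, None, :] - centroids[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Each point gets the index of its closest centroid.
    return distances.argmin(axis=1)

def update_centroids(data, labels, centroids):
    # Move each centroid to the arithmetic mean of its assigned points.
    values = data.to_numpy(dtype=float)
    new = centroids.copy()
    for j in range(len(centroids)):
        members = values[labels == j]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            new[j] = members.mean(axis=0)
    return new

def kmeans(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = random_centroids(data, k, rng)
    for _ in range(max_iter):
        labels = assign_labels(data, centroids)
        new_centroids = update_centroids(data, labels, centroids)
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of points.
toy = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
    "y": [0.0, 0.1, 0.0, 5.0, 5.1, 5.0],
})
labels, centroids = kmeans(toy, k=2)
print(labels)
print(centroids)
```

Real data would normally be scaled first, as the "Scaling the data" chapter covers, so that no single feature dominates the distance calculation.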
Chapters
00:00 Intro
00:37 k-means overview
02:51 Loading in and cleaning FIFA data
06:11 Scaling the data
10:31 Initialize random centroids
14:20 Finding cluster labels for each data point
19:29 Update centroid values
23:30 Plotting k-means iterations
28:24 Pulling the algorithm together
35:25 Comparing our implementation to scikit-learn
37:56 Conclusion and next steps
------------------------------
Join 1M+ Dataquest learners today!
Master data skills and change your life.
Sign up for free: bit.ly/3O8MDef
Here's all the code for this video - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . Hope you enjoy it!
Thank you, sir. This is how tutorials should be conducted: with in-depth explanations, step-by-step implementation, and the release of all code and datasheets to enable everyone to practice and advance their own personal projects. Congrats!
This was amazing. Brilliantly explained, demonstrated and presented clearly. Helped me so much with my current bootcamp task. Thank you.
From the bottom of my heart; thank you. This was so clear and easily understandable, fantastic video!
Amazingly clear! Thank you so much, Dataquest!
One of the best tutorials on the internet, thank you.
Such good and clearly delivered material. Thanks a lot!
Awesome stuff, Vik. Thanks for sharing.
This is THE best tutorial online. I am so grateful for this! Thank you
Great video. Really helpful looking at implementing it manually. Thank you so much
Outstanding! Thank you, man! This really helped me do my master's thesis. I really appreciate that you explained every small step, used as many visuals as possible, and focused on us being able to learn!
- In case others run into the same problem: with scikit-learn's K-means, when using the fit(data) function, I got a "split" error message (AttributeError: 'NoneType' object has no attribute 'split'). I checked my BLAS and updated all libraries through conda, then shut everything down and opened it again. This resolved the problem, but it took a long time.
(I asked chatgpt for help)
very helpful and clear explanations - thank you!
Excellent video !! Many thanks 🙏🏼
I have never thought that we can visualize K means by using Dimension Reduction (PCA)!! Awesome Tutorial Sir
Your explanation is absolutely clear. You have the best knowledge. Keep posting new topics and encouraging us ❤
This is a nice and powerful way to learn. Thanks for teaching.
Thank you, thank you, thank you!!! Being able to perform and explain what runs under the hood is really important- I agree. Please keep these videos coming 🙌🏼❤️ The “From Scratch” series :)
That's a great idea :) I'm working on linear regression from scratch.
This is amazing, keep up a good job
Thanks a LOT for this tutorial!😀
Terrific implementation! I also really liked the way you used PCA for iterative visualization... Nicely done
Thanks a lot, Tim! -Vik
It's a great job, the only one on YouTube that explains every part of the code 👍👍
Absolutely fantastic
Would love a similar video on PAM clustering for mixed integer and categorical variables
Thanks for the suggestion :)
Thank you very much for this clearly understood video.
You might be a hero... thanks a lot for the content...
I can't thank you enough. Thank you for this content.
Very insightful and step by step code explanation.
Thank you for this excellent tutorial
:)
Glad it was helpful! -Vik
@@Dataquestio Vik,
How do I assign new data points to a cluster? That is, once I have run my K-means clustering, I want to use it to assign clusters to new datasets, such as out-of-time datasets or testing/validation datasets.
There doesn't seem to be anything online about this. Is it the case that I'd have to re-run the K-Means with the new data included?
Thanks in advance
Elvy
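One common way to handle the question above, without re-running K-means, is to keep the fitted centroids and give each new point the label of its nearest centroid. The sketch below assumes a hypothetical helper `assign_to_nearest`, made-up centroid values, and centroids stored one per row; new data would also need the same scaling that was applied before fitting. scikit-learn's `KMeans` offers a `predict` method for the same purpose.

```python
import numpy as np

def assign_to_nearest(new_points, centroids):
    # Hypothetical helper: label each new point with the index of its closest centroid.
    new_points = np.asarray(new_points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diffs = new_points[:, None, :] - centroids[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n_points, n_centroids)
    return distances.argmin(axis=1)

# Made-up centroids from an earlier fit, one centroid per row.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
# New, already-scaled observations that were not part of the fit.
new_points = np.array([[1.0, 2.0], [9.0, 8.0]])

new_labels = assign_to_nearest(new_points, centroids)
print(new_labels)
```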
great video, you are a great teacher
Thanks a lot, that was a great help!
Great video!
Hi, Thanks so much for the video!! Can you please advise on how one adds a legend to the cluster scatter plots? I've been trying but can't figure it out.
I loved your video, it is so well explained!! I had only used scikit-learn, but now I understand better how it works.
But I have a question: why is it not good to use height and weight as features?
Thanks for the video. It is just brilliant. One of the best ones on Clustering that I have seen for sure!
I just had a question. I tried using this on data with 13 variables. It worked perfectly, but when I scale the data using standardization (normal distribution) or an sklearn scaler rather than min-max scaling, I get an error after the PCA transformation code saying there are NaNs in the data variable, when there clearly were none before. I can't put my finger on what is causing this. Would appreciate any insights on your part. Thanks
Very insightful explanation of the code. By the way, how can I make an elbow plot of SSE vs. k, computing the SSE for each value of k iteratively? This would help me optimise the value of k using this code... Looking forward to hearing from you
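For the elbow question, one possible approach is to run K-means once per candidate k, record the sum of squared errors (SSE, the squared distance from each point to its nearest centroid), and look for the k where the curve flattens. The sketch below uses a compact NumPy K-means written for illustration, not the video's implementation; `kmeans_sse` and the synthetic data are assumptions.

```python
import numpy as np

def kmeans_sse(X, k, seed=0, max_iter=100):
    # Compact K-means (hypothetical helper) that returns only the final SSE.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Arithmetic mean of each cluster; an empty cluster keeps its old centroid.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).sum()

# Synthetic data: three blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

sse_by_k = {k: kmeans_sse(X, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(float(sse), 2))
```

The keys and values of `sse_by_k` can be passed to matplotlib's `plt.plot` to draw the curve. With scikit-learn, the fitted model's `inertia_` attribute holds the same quantity.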
good tutorial thank you
More videos like these please on other algos
TYSM :)
It's a very well explained video. Just a quick question, how can we add random_state in the final model code?
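One way to get reproducible results in a from-scratch version is to seed the random centroid initialization. The sketch below is an assumption about how that could look, not the video's code: `random_centroids` here is a hypothetical variant that takes a `random_state` argument and shares one `np.random.RandomState` across all draws, so that the k centroids differ from each other but are identical from run to run. It keeps features as rows and centroids as columns.

```python
import numpy as np
import pandas as pd

def random_centroids(data, k, random_state=None):
    # One shared generator: its state advances between draws, so the k
    # centroids differ, but the whole sequence repeats for a given seed.
    rng = np.random.RandomState(random_state)
    centroids = [
        data.apply(lambda col: col.sample(random_state=rng).iloc[0])
        for _ in range(k)
    ]
    # Features as rows, one column per centroid.
    return pd.concat(centroids, axis=1)

# Hypothetical, already-scaled data.
data = pd.DataFrame({"overall": [1.0, 3.0, 5.0, 7.0], "age": [2.0, 4.0, 6.0, 8.0]})

first = random_centroids(data, k=3, random_state=42)
second = random_centroids(data, k=3, random_state=42)
print(first.equals(second))
```

For the scikit-learn comparison, `KMeans` accepts a `random_state` argument directly, for example `KMeans(n_clusters=3, random_state=42)`.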
Hi, thank you so much for this clear tutorial.
I need one more bit of help from you. How do we export this cluster result to a CSV file?
Great video. BTW, why did you use the geometric mean instead of the arithmetic mean for finding the clusters? Please make a whole series on building models from scratch.
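A simple way to export the result is to attach the labels as a new column and call pandas' `to_csv`. The frame and column names below are made up for illustration, and the sketch writes to an in-memory buffer so it runs anywhere; passing a path such as "clusters.csv" instead writes a file.

```python
import io
import pandas as pd

# Hypothetical data and cluster labels, aligned on the same index.
players = pd.DataFrame({"name": ["A", "B", "C"], "overall": [90, 60, 88]})
labels = pd.Series([0, 1, 0], index=players.index)

# Add the labels as a column, then write the frame out as CSV.
result = players.assign(cluster=labels)
buffer = io.StringIO()
result.to_csv(buffer, index=False)  # or: result.to_csv("clusters.csv", index=False)
print(buffer.getvalue())
```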
Cool 👍
can we follow up based on the identified clusters, by using them to regress for another variable, e.g. with a logistic regression?
Thanks, teacher. Could you show how to calculate the SSE for a k-means clustering solution when you choose not to use k-means directly from the sklearn package?
Can you make a video implementing Local Outlier Factor (LOF) with Pandas and NumPy in Python for identifying outliers?
Can we also use the player's position as one of the features? If yes, then how (because it isn't numeric)?
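One option for a non-numeric column like position is one-hot encoding, which turns each category into its own 0/1 column. This is only a sketch with made-up data and column names. Whether Euclidean K-means behaves well on such columns is a judgment call, since a single category mismatch can dominate the distance between scaled features.

```python
import pandas as pd

# Hypothetical player data with one categorical column.
players = pd.DataFrame({
    "overall": [91, 85, 78],
    "position": ["ST", "GK", "ST"],
})

# Replace "position" with one 0/1 column per category.
encoded = pd.get_dummies(players, columns=["position"], dtype=float)
print(encoded)
```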
what is the maximum amount of variables recommendable for a clustering analysis?
If we have IP addresses in the data, should we still scale the data? I had a dataset where IP addresses and fraud transactions are given, and I converted the IP addresses to numerical data.
How do you know which 5 features to pick at the beginning?
do we have to get rid of outliers beforehand?
Your explanation is great. I found out that the "k" parameter of the "new_centroids" method has no effect on the application. Correct me if I'm wrong.
How would you include Ordinal features ?
Thanks for this, I really don't get how I can possibly use it for fraud detection
Which platform you are using for coding??
I didn’t understand why we took the geometric mean instead of the arithmetic mean. Can you explain that, please?
Could you explain the meaning of the x- and y-axis?
I am getting an error when calculating centroids - 'float' object has no attribute 'sqrt'..... Please help
Keep sending the emails, thanks for the vids
What does groupby() return? How can I see what groupby() has returned? Can you please also share code showing what data.groupby(labels) does?
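To see what `data.groupby(labels)` produces, it helps to try it on a tiny frame. `groupby` returns a lazy `DataFrameGroupBy` object rather than a table. Iterating over it yields one `(label, sub-frame)` pair per cluster, and an aggregation such as `.mean()` collapses each group to a single row. The data below is made up.

```python
import pandas as pd

# Hypothetical data and cluster labels (one label per row).
data = pd.DataFrame({"x": [1.0, 2.0, 10.0, 12.0], "y": [0.0, 2.0, 4.0, 6.0]})
labels = pd.Series([0, 0, 1, 1])

grouped = data.groupby(labels)
print(type(grouped).__name__)  # the groupby object itself, not a DataFrame

# Each group is the sub-frame of rows sharing a label.
for label, group in grouped:
    print("cluster", label)
    print(group)

# Aggregating gives one row per cluster.
means = grouped.mean()
print(means)
```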
can someone help with the issue at 29:48
when we use old_centroids=centroids
in my code
this error comes
'DataFrame' object has no attribute 'equal'
it should be .equals with an s
make a video on ''customer segmentation and clustering in retail using machine learning'' using real retail dataset
Please unpack what is going on in centroid = data.apply(lambda x: float(x.sample())). Without the float cast, the line returns a DataFrame with NaN values in the non-sampled/selected columns. There appears to be some voodoo magic going on here, driven by the float cast!
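A likely explanation for the behavior described above: `x.sample()` returns a one-element Series that still carries its original row index. When the function passed to `apply` returns a Series, pandas assembles a DataFrame and aligns on those indices, so columns that sampled different rows get NaN elsewhere. Converting to a scalar makes `apply` return a plain Series with one value per column. The sketch below uses fixed rows instead of random ones to make the effect repeatable, and uses `.iloc[0]` for the scalar, since recent pandas versions warn when `float()` is called on a one-element Series.

```python
import pandas as pd

data = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# sample() returns a Series of length 1, index label included.
one = data["a"].sample(random_state=0)
print(one)

# Simulate columns "sampling" different rows: the results align on index
# labels 0 and 2, leaving NaN where a column has no value for that label.
misaligned = data.apply(lambda col: col.iloc[[0]] if col.name == "a" else col.iloc[[2]])
print(misaligned)

# Returning a scalar per column yields a Series: one value per feature.
centroid = data.apply(lambda col: col.sample().iloc[0])
print(centroid)
```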
Sir how to find out the individual elements present in each cluster? For example, I'm working on a dataset of genes. How will i get the names of the individual genes that are present in each cluster?
I am looking for the same thing right now. Were you able to find anything? If yes, then please help me too 😊
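One way to list the members of each cluster is to group the identifier column by the labels. The gene names and labels below are hypothetical. The only requirement is that the labels share the same index as the original rows.

```python
import pandas as pd

# Hypothetical identifiers and the cluster label found for each row.
genes = pd.DataFrame({"gene": ["g1", "g2", "g3", "g4"]})
labels = pd.Series([0, 1, 0, 1], index=genes.index)

# Collect the names that fall in each cluster.
members = genes["gene"].groupby(labels).apply(list)
print(members)

# Or filter for a single cluster.
cluster_0 = genes[labels == 0]
print(cluster_0)
```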
Why you did not apply fit_transform to centroids_2d variable as well?
Fit transform will both compute the fit and transform the data. In this case, we already computed the fit on the data, and we want to just apply the same fit to the centroids, so that they're all on the same scale and can be visualized. -Vik
Please can someone tell me how to apply arithmetic mean instead of geometric mean in lambda function of getting new centroids. I am dealing with negative datasets and applying geometric mean is of no use to me. will it be like this : data.groupby(labels).apply(lambda x: np.mean(x,axis=0))
Thank you, I required arithmetic mean too and your code worked for me.
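The two means discussed in this thread can be compared side by side. This is a sketch with made-up data: the arithmetic mean is a plain `groupby(...).mean()`, while a geometric mean can be written as the exponential of the mean of logs. Because the logarithm is undefined for negative numbers, the geometric version yields NaN on data that contains them, for example after standard scaling. Both results have one row per cluster. Code that expects one column per cluster would need a transpose (`.T`).

```python
import numpy as np
import pandas as pd

# Hypothetical positive data and labels.
data = pd.DataFrame({"a": [1.0, 4.0, 2.0, 8.0], "b": [1.0, 1.0, 3.0, 27.0]})
labels = pd.Series([0, 0, 1, 1])

# Arithmetic mean per cluster.
arithmetic = data.groupby(labels).mean()

# Geometric mean per cluster: exp of the mean of the logs.
geometric = np.exp(np.log(data).groupby(labels).mean())

print(arithmetic)
print(geometric)

# Negative values break the log step and produce NaN.
with np.errstate(invalid="ignore"):
    logged = np.log(pd.Series([-1.0, 2.0]))
print(logged)
```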
I think k = 4, because the young players include both high-overall and low-overall groups, like young stars at a high league level and young normal players.
Amazing!! But, how to implement the scatter without PCA?
Did you figure out? I'd like to know too.
@@animal40 Just leave out the PCA. Still transpose the centroids, though, and remember to use iloc. Here's my code:
import matplotlib.pyplot as plt

def plot_clusters(data, labels, centroids, iteration):
    # Transpose so that each row is one centroid
    centroid_T = centroids.T
    plt.title(f'Iteration {iteration}')
    # Plot the first two feature columns directly, colored by cluster label
    plt.scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=labels)
    plt.scatter(x=centroid_T.iloc[:, 0], y=centroid_T.iloc[:, 1])
    plt.show()
@@akosuakoranteng3327 thanks very much for this. Tried a few things today but couldn't quite get it working. Will try again tomorrow with this. Appreciate it, cheers.
At 10:08, how did you know row 0 belongs to lionel messi?
does anyone have the code ?
Code is here - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . It's linked in the description
@@Dataquestio Sir, I'm getting an error when building it from scratch. Is there any platform where I can send my query?
SUUUUIII
Nice but a little too much for a newbie 😅