K-means Clustering From Scratch In Python [Machine Learning Tutorial]

  • Published Jun 12, 2024
  • In this project, we'll build a k-means clustering algorithm from scratch. Clustering is an unsupervised machine learning technique that can find patterns in your data. K-means is one of the most popular forms of clustering.
    We'll create our algorithm using Python and pandas. We'll then compare it to the reference implementation from scikit-learn.
    You can find the full project code here - github.com/dataquestio/projec... .
    You can download the data here - www.kaggle.com/datasets/stefa... .
    Project Steps
    - Write out pseudocode for the algorithm
    - Code the k-means algorithm
    - Plot the clusters from the algorithm
    - Compare performance to the scikit-learn algorithm
    Chapters
    00:00 Intro
    00:37 k-means overview
    02:51 Loading in and cleaning FIFA data
    06:11 Scaling the data
    10:31 Initialize random centroids
    14:20 Finding cluster labels for each data point
    19:29 Update centroid values
    23:30 Plotting k-means iterations
    28:24 Pulling the algorithm together
    35:25 Comparing our implementation to scikit-learn
    37:56 Conclusion and next steps
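
The steps listed in the chapters (initialize centroids, label points, update centroids, repeat) can be sketched roughly as below. This is an independent NumPy sketch, not the code from the video: the function name `kmeans`, the toy data, the choice of picking `k` data rows as starting centroids, and the arithmetic-mean update are all assumptions made for illustration. The linked repository holds the actual project code.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on a 2-D float array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct rows of X (one common choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the arithmetic mean of its points;
        # keep the old centroid if a cluster ends up empty.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated blobs as toy data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

On this toy input each blob should end up in its own cluster; real data usually needs scaling first, as the video's scaling chapter covers.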

COMMENTS • 83

  • @vikasparuchuri 1 year ago +11

    Here's all the code for this video - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . Hope you enjoy it!

  • @maleck25 1 month ago

    Thank you, sir. This is how tutorials should be conducted: with in-depth explanations, step-by-step implementation, and the release of all code and datasheets to enable everyone to practice and advance their own personal projects. Congrats!

  • @animal40 1 year ago +4

    This was amazing. Brilliantly explained, demonstrated and presented clearly. Helped me so much with my current bootcamp task. Thank you.

  • @stevenlomon 1 year ago +2

    From the bottom of my heart, thank you. This was so clear and easily understandable. Fantastic video!

  • @tejasvinnarayan2887 1 year ago +1

    Amazingly clear! Thank you so much, Dataquest!

  • @mo_l9993 1 year ago +1

    One of the best tutorials on the internet, thank you.

  • @hounddog1 1 year ago

    Such good and clearly delivered material. Thanks a lot!

  • @jessemunson7091 1 year ago

    Awesome stuff, Vik. Thanks for sharing.

  • @amandamorrow73 1 year ago

    This is THE best tutorial online. I am so grateful for this! Thank you

  • @krlwshu 1 year ago

    Great video. Really helpful looking at implementing it manually. Thank you so much

  • @MarianneHMiettinen 5 months ago +1

    Outstanding! Thank you, man! This really helped me with my master's thesis. I really appreciate that you explained every small step, used as many visuals as possible, and focused on us being able to learn!
    - In case others run into the same problem: with scikit-learn's K-means, when using the fit(data) function, I got a "split" error message (AttributeError: 'NoneType' object has no attribute 'split'). I checked my BLAS and updated all libraries through conda, then shut everything down and reopened it. This resolved the problem, but it took a long time.
    (I asked ChatGPT for help)

  • @sashagalanova818 4 months ago

    very helpful and clear explanations - thank you!

  • @VaradKashmire 1 year ago

    Excellent video !! Many thanks 🙏🏼

  • @photoish3863 1 year ago

    I have never thought that we can visualize K means by using Dimension Reduction (PCA)!! Awesome Tutorial Sir

  • @vishwas5344 4 months ago

    Your explanation is absolutely clear. You have best knowledge. Keep posting new topics and encourage us ❤

  • @elu1 4 months ago

    This is a nice and powerful way to learn. Thanks for teaching.

  • @oskeeg619 1 year ago +1

    Thank you, thank you, thank you!!! Being able to perform and explain what runs under the hood is really important- I agree. Please keep these videos coming 🙌🏼❤️ The “From Scratch” series :)

    • @Dataquestio 1 year ago +3

      That's a great idea :) I'm working on linear regression from scratch.

  • @obeynjanjeni4466 1 month ago

    This is amazing, keep up a good job

  • @shreshthasingh 1 year ago

    Thanks a LOT for this tutorial!😀

  • @TimHerrin 1 year ago +2

    Terrific implementation! I also really liked the way you used PCA for iterative visualization... Nicely done

  • @allaguimaouia6510 7 months ago

    It's a great job, the only one on YouTube that explains every part of the code 👍👍

  • @SuperJohnnyuk 1 year ago +1

    Absolutely fantastic
    Would love a similar video on PAM clustering for mixed integer and categorical variables

  • @ahmetatasever8315 1 year ago

    Thank you very much for this clearly understood video.

  • @ytustatistics 3 months ago

    You might be a hero... thanks a lot for the content...

  • @user-sz3zb1rq5z 7 months ago

    I can't thank you enough. Thank you for this content.

  • @elvykamunyokomanunebo1441 1 year ago

    Very insightful and step by step code explanation.
    Thank you for this excellent tutorial
    :)

    • @Dataquestio 1 year ago +1

      Glad it was helpful! -Vik

    • @elvykamunyokomanunebo1441 1 year ago

      @Dataquestio Vik,
      how do I assign new data points to a cluster? That is, once I have run my k-means clustering, how do I use it to assign a cluster to new data, such as out-of-time datasets or testing/validation datasets?
      There doesn't seem to be anything online about this. Is it the case that I'd have to re-run k-means with the new data included?
      Thanks in advance
      Elvy
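
For the question above, one common approach is to keep the fitted centroids and give each new point the label of its nearest centroid, without re-running the clustering. The sketch below assumes centroids stored one per row and new points already scaled with the same minimum and maximum as the training data; `assign_clusters` is a hypothetical helper name. scikit-learn's `KMeans` offers a `predict` method for the same purpose.

```python
import numpy as np

def assign_clusters(new_points, centroids):
    """Return the index of the nearest centroid for each row of new_points."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids).
    dists = np.linalg.norm(new_points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

centroids = np.array([[1.0, 1.0], [8.0, 8.0]])            # toy fitted centroids
new_points = np.array([[0.5, 1.5], [9.0, 7.0], [4.0, 4.0]])
print(assign_clusters(new_points, centroids))  # [0 1 0]
```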

  • @rajeshmanjrekar3614 1 year ago

    great video, you are a great teacher

  • @user-do6zb9mt5q 1 year ago

    Thanks a lot, that was a great help!

  • @adriancondie831 1 year ago

    Great video!

  • @akosuakoranteng3327 1 year ago

    Hi, Thanks so much for the video!! Can you please advise on how one adds a legend to the cluster scatter plots? I've been trying but can't figure it out.
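
One way to get a legend on a cluster scatter plot is to draw each cluster with its own `scatter` call and a `label`, then call `legend()`. This is a sketch on random toy points, not the plotting function from the video; the headless `Agg` backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))   # toy 2-D points
labels = np.repeat([0, 1, 2], 10)   # toy cluster labels

fig, ax = plt.subplots()
# One scatter call per cluster, each with its own legend entry.
for cluster in np.unique(labels):
    mask = labels == cluster
    ax.scatter(points[mask, 0], points[mask, 1], label=f"Cluster {cluster}")
ax.legend()
```

In a notebook, dropping the `matplotlib.use` line and ending with `plt.show()` should display the figure.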

  • @HelloIamLauraa 5 days ago

    I loved your video, it is so well explained!! I had only used scikit-learn, but now I understand better how it works.
    But I have a question: why is it not good to use height and weight as features?

  • @a3i3m1an 1 year ago +1

    Thanks for the video. It is just brilliant. One of the best ones on clustering that I have seen for sure!
    I just had a question. I tried using this on data with 13 variables. It worked perfectly, but when I scale the data using a normal distribution (standardization) or a sklearn scaler rather than min-max, I get an error after the PCA transformation code saying there are NaNs in the data variable, when there clearly were none before. I can't put my finger on what is causing this. I would appreciate any insights. Thanks
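
A possible cause, assuming the centroid update uses a geometric mean as other comments in this thread describe: standardized data contains negative values, and the logarithm inside a geometric mean is undefined for them. The toy values below are made up to show the effect; they are not from the commenter's data.

```python
import numpy as np
import pandas as pd

x = pd.Series([-1.2, -0.4, 0.3, 0.9])  # standardized values can be negative

with np.errstate(invalid="ignore"):
    logs = np.log(x)                    # log of a negative number is NaN
print(logs.isna().tolist())             # [True, True, False, False]

# If every value in a cluster's column is negative, the geometric mean is NaN.
neg = pd.Series([-1.2, -0.4])
with np.errstate(invalid="ignore"):
    geometric = np.exp(np.log(neg).mean())
arithmetic = neg.mean()                 # the arithmetic mean has no such limit
```

Under that assumption, either keeping the data strictly positive (as min-max scaling to a positive range does) or switching the update to an arithmetic mean would avoid the NaNs.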

  • @UkrainVsRussoReaction 1 year ago

    Very insightful explanation of the code. By the way, how can I draw an elbow plot of SSE against k, computing the SSE for each value of k in turn? This would help me optimise the value of k using this code... Looking forward to hearing from you
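
A minimal sketch of the elbow computation, using scikit-learn's `KMeans` rather than the from-scratch code: its `inertia_` attribute is the sum of squared distances of points to their closest centroid. The three toy blobs are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three toy blobs centred at 0, 5 and 10.
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 5, 10)])

ks = range(1, 7)
# Fit once per k and record the SSE (called inertia_ in scikit-learn).
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
for k, s in zip(ks, sse):
    print(k, round(s, 1))
```

Plotting `list(ks)` against `sse` with matplotlib (for example `plt.plot(list(ks), sse, marker="o")`) gives the elbow curve; the bend suggests a reasonable k.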

  • @jagajaga6908 10 months ago

    good tutorial thank you

  • @saemamiftah1669 11 months ago

    More videos like these please on other algos

  • @itsamankumar403 6 months ago

    TYSM :)

  • @payalpatel2560 10 months ago

    It's a very well explained video. Just a quick question, how can we add random_state in the final model code?
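
One way to make a from-scratch initialisation reproducible is to draw every random choice from a single seeded NumPy generator. The function below is a hypothetical variant written for this note (its name, signature, and column-per-centroid layout are assumptions), not the function from the video. pandas' own `sample` method also accepts a `random_state` argument.

```python
import numpy as np
import pandas as pd

def random_centroids(data, k, seed=None):
    """Build k centroids by picking one random value per column, reproducibly."""
    rng = np.random.default_rng(seed)
    centroids = [
        # For each column, pick one value at a random row position.
        data.apply(lambda col: col.iloc[rng.integers(len(col))])
        for _ in range(k)
    ]
    return pd.concat(centroids, axis=1)  # one column per centroid

data = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
first = random_centroids(data, k=3, seed=42)
second = random_centroids(data, k=3, seed=42)
print(first.equals(second))  # True
```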

  • @ayushadhikari2357 1 year ago

    Hi, thank you so much for this clear tutorial.
    I need help with one more thing. How do we export this cluster result to a CSV file?
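
A short sketch of exporting the result, assuming the cluster labels are a pandas Series aligned with the data's index. The column names, values, and file name are made up for the example.

```python
import os
import tempfile
import pandas as pd

data = pd.DataFrame({"overall": [93, 60, 75], "age": [34, 19, 27]})  # toy rows
labels = pd.Series([0, 1, 2], index=data.index)                     # toy labels

# Attach the cluster label as a new column, then write everything out.
result = data.assign(cluster=labels)
path = os.path.join(tempfile.gettempdir(), "clusters.csv")  # hypothetical file name
result.to_csv(path, index=False)
```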

  • @virendrakhanduri4897 1 year ago +1

    Great video. BTW, why did you use the geometric mean instead of the arithmetic mean for finding the clusters? Please make a whole series on building models from scratch.

  • @dedisupardi2815 1 year ago

    Cool 👍

  • @user-un6em6bd6h 3 months ago

    can we follow up based on the identified clusters, by using them to regress for another variable, e.g. with a logistic regression?

  • @soothingszelam2607 3 months ago

    Thanks, teacher. Could you show how to calculate the SSE for a k-means clustering solution when you choose not to use k-means directly from the sklearn package?
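
The SSE (sum of squared errors) is the sum, over all points, of the squared distance to the point's assigned centroid: $\mathrm{SSE} = \sum_i \lVert x_i - c_{\ell(i)} \rVert^2$. A NumPy sketch, assuming centroids stored one per row and integer labels; the helper name `sse` and the toy numbers are illustrative.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]   # row i minus the centroid of its cluster
    return float((diffs ** 2).sum())

X = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.0], [8.5, 8.0]])
print(sse(X, labels, centroids))  # 1.0
```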

  • @dataprofessor_ 1 year ago

    Can you make a video implementing Local Outlier Factor (LOF) with Pandas and NumPy in Python for identifying outliers?

  • @anirudhpurohit2251 5 months ago

    Can we also use a player's position as one of the features? If yes, then how (because it isn't numeric)?
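
One option for a non-numeric feature is one-hot encoding with `pd.get_dummies`, sketched below on made-up rows. Euclidean distance on one-hot columns is only a rough fit for categorical data, so treat this as a starting point rather than a recommendation.

```python
import pandas as pd

players = pd.DataFrame({
    "overall": [93, 60, 75],
    "position": ["ST", "GK", "CM"],  # toy values
})
# Each category becomes its own 0/1 column.
encoded = pd.get_dummies(players, columns=["position"], dtype=float)
print(encoded.columns.tolist())
# ['overall', 'position_CM', 'position_GK', 'position_ST']
```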

  • @user-un6em6bd6h 3 months ago

    What is the maximum number of variables recommended for a clustering analysis?

  • @prgyagupta8079 10 months ago

    If we have IP addresses in the data, should we still scale the data? I had a dataset where IP addresses and fraud transactions are given, and I converted the IP addresses to numerical data.

  • @Anae2003 2 months ago

    How do you know which 5 features to pick at the beginning?

  • @user-un6em6bd6h 3 months ago

    do we have to get rid of outliers beforehand?

  • @sadeepmihiranga6958 1 year ago

    Your explanation is great. I found that the "k" parameter of the "new_centroids" method has no effect in the application. Correct me if I'm wrong.

  • @causticmonster 7 months ago

    How would you include ordinal features?

  • @goodnessawe4262 1 year ago

    Thanks for this, I really don't get how I can possibly use it for fraud detection

  • @NadeemAkhtar-gu4up 1 year ago

    Which platform are you using for coding?

  • @sukshithshetty8349 1 year ago

    I didn't understand why we took the geometric mean instead of the arithmetic mean. Can you explain that, please?

  • @jakubharas9477 6 months ago

    Could you explain the meaning of the x- and y-axis?

  • @2919091986 2 months ago

    I am getting an error when calculating centroids - 'float' object has no attribute 'sqrt'..... Please help

  • @rodneymawero9063 1 year ago

    Keep sending the emails, thanks for the vids

  • @sukshithshetty8349 1 year ago

    What does groupby() return? How can I see what groupby() has returned? Can you please also share code showing what data.groupby(labels) does?
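
`groupby` returns a lazy `DataFrameGroupBy` object rather than a table, so nothing prints until you iterate over it or aggregate it. A small sketch on made-up rows, assuming the labels are a Series aligned with the data's index:

```python
import pandas as pd

data = pd.DataFrame({"overall": [93, 91, 60, 58], "age": [34, 30, 19, 20]})
labels = pd.Series([0, 0, 1, 1], index=data.index)

grouped = data.groupby(labels)   # a lazy DataFrameGroupBy object
print(type(grouped).__name__)    # DataFrameGroupBy

# Iterating yields (label, sub-DataFrame) pairs, one per cluster.
for label, group in grouped:
    print(label)
    print(group)

sizes = grouped.size()           # number of rows in each cluster
```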

  • @swayamjoshi7667 1 year ago +1

    can someone help with the issue at 29:48
    when we use old_centroids=centroids
    in my code
    this error comes
    'DataFrame' object has no attribute 'equal'

    • @engineervol 2 months ago

      it should be .equals with an s

  • @AbrarMuhtasim 1 year ago

    Make a video on "customer segmentation and clustering in retail using machine learning" using a real retail dataset

  • @ZigBehaviour 1 year ago

    Please unpack what is going on in centroid = data.apply(lambda x: float(x.sample())). Without the float cast, the line returns a DataFrame with NaN values in the non-sampled columns. There appears to be some voodoo magic going on here, driven by the float cast!
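
A sketch of what is likely happening: `x.sample()` returns a one-element Series that keeps its original row label. When the function passed to `apply` returns a Series, pandas aligns the per-column results by label and builds a DataFrame, filling the gaps with NaN wherever different columns sampled different rows. Converting to a plain scalar removes the label, so the result is a Series with one value per column. The example uses `.iloc[0]` for the conversion, since recent pandas versions warn about calling `float` on a Series.

```python
import pandas as pd

data = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Each column returns a one-row Series that keeps its original row label,
# so apply() lines the results up by label and returns a DataFrame.
as_frame = data.apply(lambda col: col.sample())

# Returning a plain scalar instead gives one value per column: a Series.
as_series = data.apply(lambda col: col.sample().iloc[0])

print(type(as_frame).__name__, type(as_series).__name__)  # DataFrame Series
```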

  • @63_mayukhdebnath22 1 year ago

    Sir how to find out the individual elements present in each cluster? For example, I'm working on a dataset of genes. How will i get the names of the individual genes that are present in each cluster?

    • @subhasishtripathy6933 11 months ago

      I am looking for the same thing right now. Were you able to find anything? If yes, please help me too 😊
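
One way to list the members of each cluster, assuming the item names are the DataFrame index and the labels are a Series aligned with it. The gene names and values below are hypothetical.

```python
import pandas as pd

genes = pd.DataFrame(
    {"expr_a": [0.1, 0.2, 5.0, 5.1], "expr_b": [1.0, 1.1, 9.0, 9.2]},
    index=["geneA", "geneB", "geneC", "geneD"],  # hypothetical names
)
labels = pd.Series([0, 0, 1, 1], index=genes.index)

# Map each cluster label to the list of index names that fall in it.
members = {int(cluster): list(group.index) for cluster, group in genes.groupby(labels)}
print(members)  # {0: ['geneA', 'geneB'], 1: ['geneC', 'geneD']}
```

A single cluster can also be selected directly with a boolean mask, for example `genes[labels == 0]`.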

  • @dataprofessor_ 1 year ago

    Why did you not apply fit_transform to the centroids_2d variable as well?

    • @Dataquestio 1 year ago +1

      Fit transform will both compute the fit and transform the data. In this case, we already computed the fit on the data, and we want to just apply the same fit to the centroids, so that they're all on the same scale and can be visualized. -Vik
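
The distinction in the reply above can be sketched with scikit-learn's `PCA`: `fit_transform` learns the projection and applies it, while `transform` only applies a projection that was already learned. The random data and the row-per-centroid layout are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
centroids = data[:3]                     # stand-in centroids, one per row

pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)        # learn the projection from the data
centroids_2d = pca.transform(centroids)  # reuse that same projection
```

Calling `fit_transform` on the centroids instead would fit a new projection to just those few points, so they would no longer share axes with the plotted data.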

  • @shreyanshkhandelwal6499 1 year ago

    Can someone please tell me how to apply the arithmetic mean instead of the geometric mean in the lambda function that computes the new centroids? I am dealing with datasets that contain negative values, so the geometric mean is of no use to me. Would it be like this: data.groupby(labels).apply(lambda x: np.mean(x,axis=0))

    • @animal40 1 year ago

      Thank you, I required arithmetic mean too and your code worked for me.
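
For an arithmetic-mean update, pandas' built-in `groupby(...).mean()` is a simpler equivalent of the lambda above and handles negative values without trouble. The toy data and the final transpose (so that each column is one centroid) are assumptions for this sketch.

```python
import pandas as pd

data = pd.DataFrame({"x": [-1.0, -3.0, 4.0, 6.0], "y": [2.0, 4.0, -5.0, -7.0]})
labels = pd.Series([0, 0, 1, 1], index=data.index)

# One row per cluster, one column per feature ...
centroids = data.groupby(labels).mean()
# ... transposed so that each column is a centroid (one possible layout).
centroids = centroids.T
print(centroids)
```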

  • @viencong 6 months ago

    I think k = 4, because the young players include both high-overall and low-overall ones, like young stars at a high league level and ordinary young players.

  • @bgizzanm 1 year ago

    Amazing!! But, how to implement the scatter without PCA?

    • @animal40 1 year ago

      Did you figure out? I'd like to know too.

    • @akosuakoranteng3327 1 year ago +1

      @animal40 Just leave out the PCA. Still transpose the centroids (.T), though, and remember to include iloc. Here's my code:

      import matplotlib.pyplot as plt

      def plot_clusters(data, labels, centroids, iteration):
          # Centroids are stored as columns, so transpose to get one row each.
          centroid_T = centroids.T
          plt.title(f'Iteration {iteration}')
          # Plot the first two features directly instead of PCA components.
          plt.scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=labels)
          plt.scatter(x=centroid_T.iloc[:, 0], y=centroid_T.iloc[:, 1])
          plt.show()

    • @animal40 1 year ago

      @akosuakoranteng3327 thanks very much for this. Tried a few things today but couldn't quite get it working. Will try again tomorrow with this. Appreciate it, cheers.

  • @itsmitasha 4 months ago

    At 10:08, how did you know row 0 belongs to Lionel Messi?

  • @DeepakKumarBCH 1 year ago +1

    does anyone have the code ?

    • @Dataquestio 1 year ago

      Code is here - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . It's linked in the description

    • @DeepakKumarBCH 1 year ago

      @Dataquestio Sir, I'm getting an error when doing it from scratch. Is there any platform where I can send my query?

  • @I_balit 1 year ago

    SUUUUIII

  • @jinluwang5671 5 days ago

    Nice but a little too much for a newbie 😅