Kilian Weinberger
  • 41
  • 1,400,915
The DBSCAN Clustering Algorithm Explained
DBSCAN has become one of my favorite clustering algorithms.
The original paper is here: www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf
(This video is part of the CS4780 Machine Learning Class.)
2,153 views

Videos

CS4780 Transformers (additional lecture 2023)
6K views · 1 year ago
A brief explanation of the Transformer architecture used in GPT-3 and ChatGPT for language modelling. (Uploaded here for those who missed class due to the unusually nice weather :-) )
On the Importance of Deconstruction in Machine Learning Research
6K views · 3 years ago
This is a talk I gave in December 2020 at the NeurIPS Retrospective Workshop. I explain why it is so important to carefully analyze your own research contributions, through the story of 3 recent publications from my research group at Cornell University. In all three cases, we first invented something far more complicated, only to realize that the gains could be attributed to something far simpl...
Machine Learning Lecture 18 "Review Lecture II" -Cornell CS4780 SP17
11K views · 4 years ago
In-class Kaggle Competition in less than 5 Minutes
11K views · 5 years ago
The Fall 2018 version of CS4780 featured an in-class Kaggle competition. The students had 3 weeks to beat my submission, for which I only had 5 minutes. Some students challenged me to show a screencast of me actually training and uploading the model in time, so here you go. Happy Xmas.
Machine Learning Lecture 22 "More on Kernels" -Cornell CS4780 SP17
22K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote13.html www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote14.html
Machine Learning Lecture 37 "Neural Networks / Deep Learning" -Cornell CS4780 SP17
14K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.pdf
Machine Learning Lecture 36 "Neural Networks / Deep Learning Continued" -Cornell CS4780 SP17
13K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.pdf
Machine Learning Lecture 35 "Neural Networks / Deep Learning" -Cornell CS4780 SP17
20K views · 5 years ago
Machine Learning Lecture 34 "Boosting / Adaboost" -Cornell CS4780 SP17
16K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote19.html
Machine Learning Lecture 33 "Boosting Continued" -Cornell CS4780 SP17
16K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote19.html
Machine Learning Lecture 31 "Random Forests / Bagging" -Cornell CS4780 SP17
44K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote18.html If you want to take the course for credit and obtain an official certificate, there is now a revamped version (with much higher quality videos) offered through eCornell ( tinyurl.com/eCornellML ). Note, however, that eCornell does charge tuition for this version.
Machine Learning Lecture 21 "Model Selection / Kernels" -Cornell CS4780 SP17
27K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote11.html www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html
Machine Learning Lecture 32 "Boosting" -Cornell CS4780 SP17
33K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote19.html
Machine Learning Lecture 30 "Bagging" -Cornell CS4780 SP17
24K views · 5 years ago
Lecture Notes: www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote18.html
Machine Learning Lecture 29 "Decision Trees / Regression Trees" -Cornell CS4780 SP17
42K views · 5 years ago
Machine Learning Lecture 28 "Ball Trees / Decision Trees" -Cornell CS4780 SP17
29K views · 5 years ago
Machine Learning Lecture 27 "Gaussian Processes II / KD-Trees / Ball-Trees" -Cornell CS4780 SP17
28K views · 5 years ago
Machine Learning Lecture 26 "Gaussian Processes" -Cornell CS4780 SP17
67K views · 5 years ago
Machine Learning Lecture 25 "Kernelized algorithms" -Cornell CS4780 SP17
14K views · 5 years ago
Machine Learning Lecture 24 "Kernel Support Vector Machine" -Cornell CS4780 SP17
18K views · 5 years ago
Machine Learning Lecture 23 "Kernels Continued Continued" -Cornell CS4780 SP17
14K views · 5 years ago
Machine Learning Lecture 20 "Model Selection / Regularization / Overfitting" -Cornell CS4780 SP17
20K views · 5 years ago
Machine Learning Lecture 19 "Bias Variance Decomposition" -Cornell CS4780 SP17
46K views · 5 years ago
Machine Learning Lecture 17 "Regularization / Review" -Cornell CS4780 SP17
16K views · 5 years ago
Machine Learning Lecture 16 "Empirical Risk Minimization" -Cornell CS4780 SP17
26K views · 5 years ago
Machine Learning Lecture 15 "(Linear) Support Vector Machines continued" -Cornell CS4780 SP17
25K views · 5 years ago
Machine Learning Lecture 14 "(Linear) Support Vector Machines" -Cornell CS4780 SP17
41K views · 5 years ago
Machine Learning Lecture 13 "Linear / Ridge Regression" -Cornell CS4780 SP17
33K views · 5 years ago
Machine Learning Lecture 12 "Gradient Descent / Newton's Method" -Cornell CS4780 SP17
45K views · 5 years ago

COMMENTS

  • @30saransh
    @30saransh 4 days ago

    Not sure why this playlist doesn't come up on top when one searches for ML on YouTube. Andrew Ng might be a good researcher, but he's not a very good teacher. Kilian teaches in such a good manner that I never once felt bored or felt as if I was studying. Thanks Kilian, you're a gem.

  • @30saransh
    @30saransh 7 days ago

    Is there any way we can get access to the projects for this course?

  • @30saransh
    @30saransh 8 days ago

    Amazing!!!!!!!!!!!!!!!!!!!!!!!!

  • @WellItsNotTough
    @WellItsNotTough 15 days ago

    There is a quiz question in the lecture notes: "How does k affect the classifier? What happens if k = n? What happens if k = 1?" I do not think it is discussed in the lectures. In my opinion, k is the only hyperparameter in this algorithm. For k = n, we take the mode of all the labels in the dataset as the output for the test point, whereas for k = 1 the test point is assigned the label of its single nearest neighbor. I have a doubt here: since we are using a distance metric, what if we have 2 points (for simplicity) that are at equal distance to the test point and have different labels? What happens in that case for k = 1? Similarly, for k = n, if we have equal proportions of the two binary class labels, how does the mode work in that case?
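
To make the two extremes in the question above concrete, here is a minimal k-NN sketch (an illustration, not the course code; it assumes plain Euclidean distance and a majority vote, with ties broken arbitrarily, which in practice is usually handled by a fixed rule, a coin flip, or by preferring the closer neighbor):

```python
# Minimal k-NN sketch: k = 1 vs. k = n on a deliberately ambiguous test point.
# Assumptions (not from the lecture): Euclidean distance, majority vote, and ties
# resolved arbitrarily by whichever label Counter encounters first.
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority label

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])
x_test = np.array([0.5, 0.5])                     # equidistant from all four points

print(knn_predict(X, y, x_test, k=1))   # k = 1: label of one (arbitrary) nearest neighbor
print(knn_predict(X, y, x_test, k=4))   # k = n: mode of all labels (here a 2-2 tie)
```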

  • @eliasboulham
    @eliasboulham 16 days ago

    Thank you, Professor.

  • @pnachtwey
    @pnachtwey 20 days ago

    Everyone seems to have a different version. AdaGrad doesn't always work: the running sum of squared gradients (the dot product of the gradient with itself) gets too big unless one scales it down. Also, AdaGrad works best with a line search. All variations work best with a line search.
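
To see the accumulation issue numerically, here is a minimal AdaGrad sketch (the quadratic objective, step size, and iteration count are illustrative assumptions, not anything from the lecture): the running sum of squared gradients never decreases, so the effective per-coordinate step size shrinks over time unless it is rescaled.

```python
# Minimal AdaGrad sketch on a toy objective f(w) = 0.5 * ||w||^2 (illustrative choice).
# The accumulator g2 is a running sum of squared gradients: it grows monotonically,
# so the effective step size lr / (sqrt(g2) + eps) keeps shrinking over time.
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, eps=1e-8, steps=200):
    w = np.asarray(w0, dtype=float)
    g2 = np.zeros_like(w)                      # accumulated squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        g2 += g * g                            # never decreases
        w -= lr * g / (np.sqrt(g2) + eps)      # per-coordinate shrinking step
    return w

print(adagrad(lambda w: w, w0=[5.0, -3.0]))    # gradient of 0.5 * ||w||^2 is w
```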

  • @gwonchanyoon7748
    @gwonchanyoon7748 22 days ago

    i am wondering teenager hahaha!

  • @neelmishra2320
    @neelmishra2320 23 days ago

    RIP harambe

  • @studentgaming3107
    @studentgaming3107 27 days ago

    Wow, I'm only on my second video already. I scrolled through the videos and it seems it will be much better to learn this in parallel with Bishop's pattern recognition book.

  • @habeebijaz5907
    @habeebijaz5907 1 month ago

    He is Hermann Minkowski and was Einstein's teacher. Minkowski metric is the metric of flat space time and forms the backbone of special relativity. The ideas developed by Minkowski were later extended by Einstein to develop the theory of general relativity.

  • @iaroslavakornach
    @iaroslavakornach 1 month ago

    OMG! I'm in love with this guy! the energy he has makes me want to learn ML!

  • @maxfine3299
    @maxfine3299 1 month ago

    the Donald Trump bits were very funny!

  • @nassimhaddam7136
    @nassimhaddam7136 1 month ago

    Thank you for this conference, I learned a lot! I do have one question though. In the course, it is shown how to diagnose the ML model to balance the bias/variance trade-off, but what about noise? How is it possible to know if the error of the model comes from significant noise in the dataset?
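
For reference, the decomposition from that lecture splits the expected squared error into three terms (standard notation, with h_D the classifier trained on dataset D, \bar{h} the expected classifier over training sets, and \bar{y}(x) the expected label at x):

```latex
\mathbb{E}\left[\left(h_D(x) - y\right)^2\right]
= \underbrace{\mathbb{E}\left[\left(h_D(x) - \bar{h}(x)\right)^2\right]}_{\text{variance}}
+ \underbrace{\left(\bar{h}(x) - \bar{y}(x)\right)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\left[\left(\bar{y}(x) - y\right)^2\right]}_{\text{noise}}
```

The noise term is the error of the best possible predictor \bar{y}(x) and does not depend on the learned model at all, which is why it cannot be diagnosed by varying the model alone; in practice it shows up as the error floor that remains even after bias and variance have both been driven down.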

  • @almichaelraza5851
    @almichaelraza5851 1 month ago

    Alrighht... hello everybody...

  • @lonnybulldozer8426
    @lonnybulldozer8426 1 month ago

    You are a J word.

  •  2 months ago

    I've never encountered anything better than this playlist. Thank you, Professor, for the detailed explanation and, most importantly, for presenting it in such an engaging way that it sparks a deep passion in everyone. Here I am in 2024, following along since lecture one and feeling how much I have developed after your lectures.

  • @vishaljain4915
    @vishaljain4915 2 months ago

    What was the question at 14:30 anyone know? Brilliant lecture - easily a new all time favourite.

  • @giulianobianco6752
    @giulianobianco6752 2 months ago

    Great lecture, thanks Professor! Integrating the online Certificate with deeper math and concepts

  • @KW-fb4kv
    @KW-fb4kv 2 months ago

    He's such a good lecturer that 2/3rds of students show up in person.

  • @KW-fb4kv
    @KW-fb4kv 2 months ago

    A minor suggestion about the charts showing distance vs. dimensions at around the 29:00 mark: I think at least one student was a bit misled because they didn't notice how drastically the x-axis changes, since the peak of the distribution visually appears to be in the same spot on each of the 6 charts.
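
A quick numerical check of what those six charts show (a sketch, not the lecture's code; the uniform cube and sample size are arbitrary choices): as the dimensionality grows, pairwise distances concentrate, so their spread relative to the mean shrinks even though the absolute scale of the x-axis keeps growing.

```python
# Pairwise-distance concentration in high dimensions (illustrative sketch).
# For random points in [0, 1]^d, the mean distance grows with d while the
# relative spread (std / mean) shrinks -- which is why the histograms can
# look alike unless you notice the changing x-axis.
import numpy as np

rng = np.random.default_rng(0)
n = 500
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    dists = np.sqrt(d2[np.triu_indices(n, k=1)])          # unique pairs only
    print(f"d={d:5d}  mean={dists.mean():7.3f}  std/mean={dists.std() / dists.mean():.3f}")
```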

  • @seetsamolapo5600
    @seetsamolapo5600 2 months ago

    Supervised learning - making predictions from data:
    - dataset D with n data points and their outputs; each point is a feature vector with a label
    - the data points are a sample from some distribution
    - the aim is to get a function learned from the dataset that maps x to y
    Examples of y (labelling):
    - binary classification: a "there or not" label
    - k classes: multiple labels

  • @damian_smith
    @damian_smith 2 months ago

    Loved that "the answer will always be Gaussian, the whole lecture!" moment.

  • @jumpingcat212
    @jumpingcat212 2 months ago

    Hi Professor Weinberger, it looks like this lecture is about a smart algorithm created by smart people which can classify data into two classes. But in the first introduction lecture you mentioned that machine learning is about the computer learning to design a program by itself to achieve our goal. So I'm confused: what's the relationship between this perceptron hyperplane algorithm and machine learning? It looks like we humans just design this algorithm, code it into a program, and feed it into a computer to solve the classification problem...

    • @kilianweinberger698
      @kilianweinberger698 1 month ago

      So the Perceptron algorithm is the learning algorithm which is designed by humans. However, given a data set, this learning algorithm generates a classifier and you can view this classifier as a program that is learned from data. The program code is stored inside the weights of the hyperplane. You could put all that in automatically generated C code if you want to and compile it. Hope this helps.
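
A minimal Perceptron sketch to make that distinction concrete (illustrative, not the course implementation): the training loop below is the human-designed learning algorithm, while the classifier it produces is just the weight vector w, i.e. a tiny learned "program" whose only instruction is sign(w · x).

```python
# Perceptron sketch: the learning algorithm is hand-written, but the classifier it
# outputs is the learned weight vector w -- the "program" stored in the hyperplane.
import numpy as np

def perceptron_train(X, y, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):            # labels y_i in {-1, +1}
            if y_i * (w @ x_i) <= 0:          # misclassified (or on the boundary)
                w += y_i * x_i                # Perceptron update
                mistakes += 1
        if mistakes == 0:                     # converged: training data separated
            break
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
predict = lambda x: int(np.sign(w @ x))       # the learned "program"
print(w, predict(np.array([1.5, 1.5])))
```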

  • @KW-fb4kv
    @KW-fb4kv 2 months ago

    Your English is so perfect.

  • @Theophila-FlyMoutain
    @Theophila-FlyMoutain 3 months ago

    Hi Professor, thank you so much for the lecture. I wonder: does AdaBoost stop when the training error reaches zero? Because I see from your demo that after the training error is close to zero and the exponential loss keeps getting smaller, the test error doesn't change much. I guess we don't need to waste time on making the exponential loss smaller and smaller.

    • @kilianweinberger698
      @kilianweinberger698 1 month ago

      No it typically doesn't stop when zero training error is reached. The reason is that even if the training error is zero, the training LOSS will still be >0 and can be further reduced (e.g. by increasing the margin of the decision boundary).
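
To make that concrete (standard AdaBoost notation, not a quote from the lecture notes), the exponential loss of the ensemble H(x) = sum_t alpha_t h_t(x) on labels y_i in {-1, +1} is

```latex
L(H) = \frac{1}{n} \sum_{i=1}^{n} \exp\bigl(-\,y_i H(x_i)\bigr)
```

Zero training error only means y_i H(x_i) > 0 for every i, so every term is small but strictly positive; further boosting rounds keep increasing the margins y_i H(x_i), which drives L(H) toward zero without ever reaching it.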

  • @nikolamarkovic9906
    @nikolamarkovic9906 3 months ago

    16:40

  • @newbie8051
    @newbie8051 3 months ago

    17:54 volunteers 🤣 Thanks prof for the fun and interesting lecture, got to revise these fundamentals quickly 🙏

  • @jahnvi8373
    @jahnvi8373 3 months ago

    thank you for these lectures!

  • @cge007
    @cge007 3 months ago

    Hello, Thank you for the lecture. Why is the variance equal for all points? 17:23 Is this an assumption that we are taking?
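
The thread does not answer this, but assuming the question refers to the usual linear-regression noise model, the equal variance is indeed an assumption (homoscedasticity), not something derived:

```latex
y_i = \mathbf{w}^{\top}\mathbf{x}_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)\ \text{i.i.d.}
```

If the noise variance were allowed to differ per point (sigma_i^2), the maximum-likelihood solution would become weighted least squares instead of ordinary least squares.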

  • @vishnuvardhan6625
    @vishnuvardhan6625 3 months ago

    Best video on the Bias-Variance Decomposition ❤

  • @Karim-nq1be
    @Karim-nq1be 3 months ago

    I was looking for an answer that was quite technical in another video but I got hooked. Thank you so much for providing such great knowledge.

  • @user-kf9tp2qv9j
    @user-kf9tp2qv9j 3 months ago

    The party example and the demo make the algorithm so easy that a middle school student could understand it.

  • @fermisurface2616
    @fermisurface2616 3 months ago

    *says some stuff* "aaaAAAHHhh" *says some more stuff* "aaAAAHHhhhh"

  • @georgestu7216
    @georgestu7216 3 months ago

    Hi All, can someone give some more information about the notation at 31:40? What does the indicator “I” mean? Thanks
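
The notation is standard rather than specific to this course: the indicator function evaluates to 1 when its condition holds and 0 otherwise,

```latex
\mathbf{1}\!\left[\,\text{condition}\,\right] =
\begin{cases}
1 & \text{if the condition is true,}\\
0 & \text{otherwise,}
\end{cases}
\qquad\text{so}\qquad
\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\!\left[\,h(x_i) \neq y_i\,\right]
```

counts the fraction of training points that h misclassifies, i.e. the 0/1 training error.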

  • @MemeConnoisseur
    @MemeConnoisseur 3 months ago

    Very beautiful algorithm

  • @itachi4alltime
    @itachi4alltime 3 months ago

    Wish we could get a new lecture series.

  • @VV-xt7fj
    @VV-xt7fj 3 months ago

    I just want to say, Kilian, that I haven't seen anyone this talented who can explain a concept in such a clear manner. Even people with minimal background get the big picture of what you're trying to say. Please keep uploading these excellent lectures.

  • @Theophila-FlyMoutain
    @Theophila-FlyMoutain 3 months ago

    Hi Professor. Thank you for sharing the video. I am now using Gaussian Process Regression in physics. One thing I noticed is that even though there exist specific loss functions for GPR, many people use root-mean-squared error as the loss function. Is there any rule for choosing the loss function and regularization?

  • @ForcesOfOdin
    @ForcesOfOdin 4 months ago

    The intuition here was so so satisfying. The way it all comes together at the end, when he points out that the sigmoidal functions people used to use (because of emulating neuronal activation functions) have these flat parts which slow down the gradient. Not only is the slowed learning bad, but that slowed learning dampens the ability of the noisy SGD to escape the thin deep wells which represent ideal parameters only for a SPECIFIC data set. I.e. the thin deep wells = overfitting, the noise of SGD escapes them along with big alpha, and a slowed gradient from the sigmoidal flat parts causes an effective reduction in learning rate, which leads to getting trapped in the wells even with SGD, which causes overfitting. Just awesome.
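
A tiny numerical illustration of the "flat parts" point (not from the lecture; the input values are arbitrary): the sigmoid's derivative collapses for inputs of even moderate magnitude, whereas ReLU keeps a gradient of 1 whenever it is active, so saturated sigmoid units quietly shrink the effective learning rate of SGD.

```python
# Sigmoid vs. ReLU gradients at a few input values (illustrative numbers).
# The sigmoid derivative peaks at 0.25 and is nearly zero in its flat tails;
# ReLU's derivative is exactly 1 on its active side.
import numpy as np

z = np.array([-10.0, -4.0, -1.0, 0.0, 1.0, 4.0, 10.0])
sigmoid = 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = sigmoid * (1.0 - sigmoid)
relu_grad = (z > 0).astype(float)

for zi, sg, rg in zip(z, sigmoid_grad, relu_grad):
    print(f"z = {zi:6.1f}   sigmoid' = {sg:.5f}   relu' = {rg:.0f}")
```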

  • @suvkaka
    @suvkaka 4 months ago

    @kilianweinberger698 Sir, How do we ensure that adding pos encoding does not distort the original embedding too much? or how is that the sums of embedding and positional encoding of different tokens do not collide?

    • @kilianweinberger698
      @kilianweinberger698 3 months ago

      It can change the encoding a little, and lately people have started developing alternatives. However, in general it isn’t really a big problem, because the positional embedding is always exactly the same for every training sequence, so the network can easily learn to remove it.

    • @suvkaka
      @suvkaka 3 months ago

      @@kilianweinberger698 Thank you professor
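
For readers wondering what the addition looks like, here is a sketch of sinusoidal positional encodings (the "Attention Is All You Need" variant, which is an assumption here; the lecture may discuss a learned variant): the encoding values are bounded in [-1, 1] and identical for every sequence, which is part of why the network can learn to account for them.

```python
# Sinusoidal positional encodings added to token embeddings (illustrative sketch).
# The encodings are deterministic, bounded in [-1, 1], and the same for every
# sequence, so the network sees a consistent, learnable perturbation.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

embeddings = np.random.randn(8, 16)                   # 8 tokens, d_model = 16 (toy sizes)
x = embeddings + positional_encoding(8, 16)           # what enters the first layer
print(x.shape)
```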

  • @Theophila-FlyMoutain
    @Theophila-FlyMoutain 4 months ago

    For non-parametric regression, when we test a data point, we always need the whole training dataset, right? Does that mean we need a lot of memory to save the model, i.e. the training dataset?

    • @kilianweinberger698
      @kilianweinberger698 4 months ago

      Exactly, that is one downside of non-parametric models. (Some keep a subset of the data, or a digest of the data, but in general the model size grows with the training set size.)

    • @Theophila-FlyMoutain
      @Theophila-FlyMoutain 4 months ago

      Thank you!@@kilianweinberger698

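A small sketch of what "the model is the data" means in practice (illustrative, not course code): the fitted k-NN regressor below literally stores the training arrays, so its memory footprint grows linearly with the number of training points.

```python
# Non-parametric regression sketch: the "model" is the stored training set itself.
import numpy as np

class KNNRegressor:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = np.asarray(X)   # the whole training set is kept around
        self.y_train = np.asarray(y)
        return self

    def predict(self, x):
        dists = np.linalg.norm(self.X_train - x, axis=1)
        nearest = np.argsort(dists)[:self.k]
        return self.y_train[nearest].mean()

model = KNNRegressor(k=2).fit(np.random.randn(1000, 5), np.random.randn(1000))
print(model.predict(np.zeros(5)))      # every prediction touches the stored data
```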

  • @zaidamvs4905
    @zaidamvs4905 4 months ago

    I have a question: how do we know the best sequence of features to use at each depth level? If we want to try each one, optimizing with 30 to 40 features will take forever. And how can we do this for m features? I can't really visualize how this works.

  • @Aesthetic_Euclides
    @Aesthetic_Euclides 4 months ago

    I was thinking about modeling the prediction of y given x with a Gaussian. Are these observations/reasoning steps correct? I understand the Gaussianness comes in because we have a true linear function that perfectly models the relationship between X and Y, but it is unknown to us. But we have data (D) that we assume comes from sampling the true distribution (P). Now, we only have this limited sample of data, so it's reasonable to model the noise as Gaussian. This means that for a given x, our prediction y actually belongs to a Gaussian distribution, but since we only have this "single" sample D of the true data distribution, our best bet is to assign this y as the expectation of the true Gaussian. Which results in us predicting y as the final prediction (also because a good estimator of the expectation is the average, I guess). Now, I have explained how in the end we are going to fit the model to the data and predict that, so why do we have to model the noise in the model? Why not make it purely an optimization problem? I guess more like the DL approach.
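
One connection worth keeping in mind here (a standard result, not a quote from the lecture): under the Gaussian-noise model, maximizing the likelihood is exactly the squared-loss optimization problem, so the probabilistic view and the "pure optimization" view coincide,

```latex
\arg\max_{\mathbf{w}} \prod_{i=1}^{n} \mathcal{N}\!\left(y_i \mid \mathbf{w}^{\top}\mathbf{x}_i,\ \sigma^2\right)
\;=\;
\arg\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^{\top}\mathbf{x}_i\right)^2
```

and the noise model is what tells you which loss that optimization problem should minimize.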

  • @Gg-kw9ql
    @Gg-kw9ql 4 months ago

    hi nice

  • @vivi412a8nl
    @vivi412a8nl 4 months ago

    I have a question regarding masked multi-head attention around 53:30: if the outputs are generated one by one, then how can the word 'bananas' know about the word 'cherries' (because at the time 'bananas' is generated, 'cherries' has not yet been generated) and be modified by it? I.e., why do we have to worry about 'cherries' modifying 'bananas' (aka having information about the future) if 'cherries' hasn't even existed at that point?

    • @kilianweinberger698
      @kilianweinberger698 4 months ago

      In some sense it is really all a speed-up. The moment cherry comes along, it could modify bananas. However, you don't want this, because you want to avoid re-computing all the representations of all the words you have already generated. If you do the masked attention, then you are safe, and you can re-use the representation you computed for bananas when cherry didn't even exist. Does this make sense?

    • @vivi412a8nl
      @vivi412a8nl 4 months ago

      @@kilianweinberger698 Thank you Professor that makes a lot of sense, I never thought about the idea of avoiding recalculation. Thank you again for making these great materials available for free.
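
A small sketch of the causal mask being discussed (illustrative, not the lecture's code): each position can only attend to itself and earlier positions, which is exactly what lets the representation computed for 'bananas' be reused unchanged once 'cherries' arrives.

```python
# Causal (masked) self-attention weights: the strictly upper triangle is blocked,
# so no token can attend to positions that come after it.
import numpy as np

def causal_attention_weights(scores):
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # positions after i
    scores = np.where(future, -np.inf, scores)                      # mask them out
    scores = scores - scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)            # row-wise softmax

scores = np.random.randn(4, 4)        # toy scores for a 4-token sequence
print(np.round(causal_attention_weights(scores), 2))
```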

  • @thachnnguyen
    @thachnnguyen 4 months ago

    I raise my hand. Why do you assume any type of distribution when discussing this? What if I don't know that formula? What I actually observe is nH and nT. Why not work with those?
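
The distributional assumption is exactly what turns nH and nT into an estimate. Assuming the lecture's coin-flip setup (independent flips with an unknown heads probability theta), the likelihood of the observed counts and its maximizer are

```latex
P(D \mid \theta) = \theta^{\,n_H} (1-\theta)^{\,n_T},
\qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D \mid \theta) = \frac{n_H}{n_H + n_T}
```

so the final estimate does only use nH and nT; the Bernoulli/Binomial assumption is what justifies that particular formula rather than some other function of the counts.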

  • @user-kc1xf6hq1b
    @user-kc1xf6hq1b 4 months ago

    legendary!

  • @pritamgouda7294
    @pritamgouda7294 4 months ago

    Can someone tell me where the lecture is in which he proves the k-nearest-neighbor result he mentions at 5:09?

    • @kilianweinberger698
      @kilianweinberger698 4 months ago

      ua-cam.com/video/oymtGlGdT-k/v-deo.html

    • @pritamgouda7294
      @pritamgouda7294 4 months ago

      @@kilianweinberger698 Sir, I saw that lecture and its notes as well, but the notes mention the Bayes optimal classifier and I don't think it's in the video lecture. Please correct me if I'm wrong. Thank you for your reply 😊

  • @Tyokok
    @Tyokok 4 months ago

    Hi, if anyone can advise: at 40:00, since we have a closed-form solution, if we need to implement ridge regression with a kernel, we don't use gradient descent but the closed form directly, is that correct? Because in lecture 21 at 16:40 Prof. Kilian gave the recursive gradient descent solution, and when I tried to implement it, it diverged very quickly and was sensitive to the step size. Thank you!
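
With the closed form there is indeed no gradient descent and no step size to tune. A minimal kernel ridge regression sketch under an assumed RBF kernel (an illustration, not Prof. Weinberger's code): solve (K + lambda I) alpha = y once, then predict with kernel evaluations against the training set.

```python
# Kernel ridge regression with the closed-form solution (illustrative sketch,
# assuming an RBF kernel): alpha = (K + lambda * I)^{-1} y, computed with one
# linear solve instead of gradient descent.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)      # one linear solve

X_test = np.linspace(-3, 3, 5)[:, None]
print(np.round(rbf_kernel(X_test, X) @ alpha, 2))         # predictions at test points
```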

  • @hamzak93
    @hamzak93 4 months ago

    Can't thank you enough for these lectures Professor Weinberger!