I've learned this kernel thing in college classes, in Andrew Ng's ML courses, and many other times, but this is literally the best explanation so far. It really blew my mind once I grasped how the Gaussian kernel can help me work in infinite dimensions!
I am a professor of engineering, and I have to say that your chain of thought and the back-drive are just amazing. Also the simplicity of the explanation and the energy in this video. Keep it up.
Our professor just told us that it is an RBF kernel and I was not convinced, but your video helped me believe it. This is amazing, thanks a lot, sir.
Man, you are just amazing! Every time I come across something I don't quite get in machine learning theory, there you are! Thanks a million!
Hey Ritwik, I've been trying to intuitively understand the link between kernels and infinite dimensions for ages now and had not remotely come close to doing it, but your one video has melted the fuzziness away in a trice. Thank you so much!
Thank you, thank you, I just finished my test.
Your Markov videos were amazing and helped me a lot. Thank you again!
Thank you. You warped my fragile little mind. Fresh air. I love the RBF. Well presented. Nice zest too
Thank you for such an excellent explanation! This helps me understand ML models better!
Thanks!
Best explanation. Thank you so much!
Glad it was helpful!
I just got emotional. what a video
Amazing concept with an amazing explanation!! Hats off to you!!
Glad you liked it!
Amazing explanation
Sir you deserve a million subscribers. Hope you get soon what you deserve 😊
Thank you, I have learned a lot about kernel functions.
This is art, really nice explanation.
Love your videos! I have just gone through the SVM and kernel videos. However, I feel a little like I'm on a cliffhanger. That is, I now understand SVM, and I see where you're going with the kernel, but it seems there needs to be a follow-on video to finally link the kernel back explicitly to SVM and show how the kernel is then explicitly used to do the classification. Specifically, what is missing in this video (or more accurately, needed in a follow-on video) is the linkage back to the alphas of the Lagrangian, or to w and b, because in the end that is what defines the discrimination line. That last piece is tantalizingly missing (i.e., hint for the next video ;-). Thanks!
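In case it helps anyone else stuck on that cliffhanger, here is my own summary of the missing link (not from the video, just the standard dual form of the SVM): once the alphas are found, the kernel enters the decision rule directly, so you never need to form w in the high-dimensional feature space:

f(x) = \mathrm{sign}\Big( \sum_i \alpha_i \, y_i \, K(x_i, x) + b \Big)

Only the support vectors have \alpha_i > 0, and b can be recovered from any one of them, so classification only ever needs kernel evaluations against the support vectors.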
Beautiful
Thank you! Cheers!
thank you!!! so good.
Amazing explanation. Keep up the good work!
brilliant!
What has my life become. I genuinely anticipate the release of new math videos smh. Thanks for the great videos though :)
This was fun!
There are so many different ways to explain things. In this case you've based the explanation on the "property of a kernel", which seems so stodgy. Are there maybe other "street math" explanations of why this kernel is so great? For example, e^x is its own derivative. Why does it use the 2-norm? Would a 4-norm be OK too? Why the -1/2? It turns out there are lots of variations of the RBF that work just as well; this canonical edition is often the most efficient.
I think it would be fun to see the RBF in action, applied to a thorny classification problem: why do the operators of the RBF work so well, and what makes the wrong variations work poorly?
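If anyone wants to see it in action, here is a minimal sketch of an RBF-kernel SVM on a non-linearly-separable toy problem, assuming scikit-learn is available; the dataset and the gamma grid are just illustrative choices, not recommendations:

```python
# Minimal sketch: RBF-kernel SVM on a non-linearly-separable toy problem.
# scikit-learn is assumed; the gamma grid is illustrative only.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_tr, y_tr)
    print(f"gamma={gamma:6.1f}  "
          f"train acc={clf.score(X_tr, y_tr):.3f}  "
          f"test acc={clf.score(X_te, y_te):.3f}")
```

A very large gamma makes each Gaussian bump extremely narrow, so the train accuracy approaches 1 while the test accuracy drops, which also speaks to the overfitting question below.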
So cool to understand this infinite-dimension idea. How do you avoid overfitting with such a powerful model?
Good question! I have an SVM kernels coding video coming soon that will answer that.
@ritvikmath Hello sir, can you create a video on what role the hyperparameters play in SVM?
This is probably because we can differentiate/integrate e^x infinitely many times and it always results in the same function, e^x.
Just great! Would you do a few examples (preferably in Python) and make the code available?
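To make that a bit more precise (my own gloss, not from the video): it is the infinite Taylor series of the exponential that supplies the infinitely many terms,

\exp(x_i^\top x_j) \;=\; \sum_{n=0}^{\infty} \frac{(x_i^\top x_j)^n}{n!}

so the implicit feature map stacks suitably scaled monomials of every degree n = 0, 1, 2, ..., which is where the infinite-dimensional feature space comes from.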
4:50 I think the reason why you are able to treat that as a constant (even the terms involving xi) is that xi is normalized, so xi.T @ xi = 1.
Great point! Maybe he forgot to mention this in the video. I think without this condition the definition of the high-dimensional feature vector is not consistent.
Sorry, I was wrong. The term exp(xi^T * xi) is just a scalar, and it's part of the function that defines the high-dimensional feature vector.
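For anyone following along, the decomposition being discussed is, as I understand it (using the video's 1/2 factor),

\exp\big(-\tfrac{1}{2}\lVert x_i - x_j\rVert^2\big) \;=\; \exp\big(-\tfrac{1}{2} x_i^\top x_i\big)\,\exp\big(-\tfrac{1}{2} x_j^\top x_j\big)\,\exp\big(x_i^\top x_j\big)

so exp(-x_i^T x_i / 2) is just a scalar that depends only on x_i and can be folded into the feature map \phi(x_i); no normalization of x_i is required.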
I don't understand how my teacher turns 8 minutes of content into a 1-hour confusing and boring class.
Absolutely brilliant! Could you maybe elaborate on the Gaussian Radial Basis Function? How do the variance and mean fit into the context?
Your comment is two years old, but here's how I tried to make some intuitive sense of it. In regression, Gaussian processes are used as a prior over functions, and it is often said that the kernel of a Gaussian process specifies the "form" of those functions, for example in the sense that a larger lengthscale places more mass on smoother functions. If you sample from a GP with 1-d inputs and an RBF kernel, it looks exactly like this, but that does not really explain why it's the case.
What I did next was look into a kernel smoother. Roughly speaking: you have a bunch of observations of a function f(x) at locations x, and you predict the unknown function value at some location z by computing a linear combination of the RBF kernel times the known function values and normalising that sum. Let's say we know f(x1) and f(x2) and want to predict f(x3). Then
f(x3) ≈ (k(x3,x2)*f(x2) + k(x3,x1)*f(x1)) / (k(x3,x2) + k(x3,x1))
If you try to construct an equation with nice vector-matrix notation, you might get something like
f = C * K_{fy} * y
where f is the prediction of the unknown function values, y are the known function values, and C is a matrix that does the normalisation. When you look at the equation for the posterior mean of a GP in GP regression, it looks something like
mean = K(X_known, X_unknown)^T @ K(X_known, X_known)^(-1) @ y
which is also a linear combination of kernel values and observed function values, here "centred" by the inverse of the kernel matrix evaluated at the locations of the observations. This similarity between the posterior mean and a kernel smoother helps me with the intuition. Of course it's not a solid mathematical explanation, but maybe it's a nice point of view from which to start when looking into it.
Did anyone have their Oppenheimer moment while understanding the RBF kernel? I did.
Doesn't exp(x_i^T x_j) give the same power?
I came here looking for intuition on what RBFs are.
3:43 How can you add the cross terms to get -2 * xi^T * xj? Can anyone help me?
xi^T * xj produces the same result as xj^T * xi, I think :)
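Spelling it out, since xi^T xj is a scalar and equals xj^T xi:

\lVert x_i - x_j\rVert^2 = (x_i - x_j)^\top (x_i - x_j) = x_i^\top x_i - x_i^\top x_j - x_j^\top x_i + x_j^\top x_j = x_i^\top x_i - 2\,x_i^\top x_j + x_j^\top x_j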