To learn more about Lightning: lightning.ai/
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
The explanation is so clean. I was clapping for him from my room. How can someone be so good at their job!
Thank you! :)
I clapped too, twice! :)
You are EXCEPTIONALLY good at CLEARLY describing complex topics!!! Thank you!
Thank you very much! :)
Such a simple, yet beautiful and powerful concept of similarity.
Thanks, StatQuest!
bam!
Never understood something at such a slow but efficient pace, thanks 💯
Thanks!
I think and hope that this video is a preamble to more complex NLP topics such as Word Embeddings etc. Many thanks for all of your efforts!
Yes it is! :)
Cosine Similarity is used as an evaluation tool on word2vec
You literally make it so easy!!
I can't help but smile 😊😊😊❤️❤️❤️
By far one of my favorite UA-cam channels!
Thank you so much! :)
I usually hate when people say that a video explains something well, because usually this is not the case. But, haha, amazing job! Well done, really nicely explained; it's like gamification, the way I understand it!
Thanks!
Dude, these are so good. I have to watch them several times, and then I try to write some code to reinforce the concept. Your videos are absolutely amazing.
Thank you!
My Love for learning Data Science and Statistics has increased multi-folds because of you. Thank you Josh!!🙂
bam! :)
QUADRUPLE BAM!!! Thanks for such fun yet pragmatic explainers.
Thank you!
Great video. It would've been worth noting that the magnitude of the feature vector matters in certain cases and doesn't in others. Your example of [Hello Hello Hello] caught my eye. In that example, the magnitude of that feature didn't matter because its direction didn't change. However, the difference between [Hello World!] and [Hello Hello Hello World!] does have an impact on the angle.
Good point!
This is another great video, Josh!
question: @3:51 you talk about having 3 Hellos and that still results in a 45 degree angle with Hello World.
However, comparing Hello to Hello World seems to give a different angle than comparing Hello to Hello World World.
Is there an intuition as to why this is the case? That is, adding as many Hellos to Hello keeps the angle the same, but adding more Worlds to Hello World seems to change the Cosine Similarity.
Two answers:
1) Just plot the points on a 2-dimensional graph for the two pairs of phrases and you'll see that the angles are different.
2) The key difference is that "hello hello hello" only contains the word "hello". If we had included "world", then the angles would be different. Again, you can plot the points to see the differences.
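To see this concretely, here's a minimal Python sketch (my own, not from the video) that builds word-count vectors over the two-word vocabulary and computes the cosine similarities by hand; the phrase-cleaning rule is just an assumption for illustration.

```python
import math

def count_vector(phrase, vocab=("hello", "world")):
    # Count how many times each vocabulary word appears in the phrase.
    words = phrase.lower().replace("!", "").split()
    return [words.count(w) for w in vocab]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

hello        = count_vector("Hello")                # [1, 0]
hello_x3     = count_vector("Hello Hello Hello")    # [3, 0] -- same direction as [1, 0]
hello_world  = count_vector("Hello World!")         # [1, 1]
hello_world2 = count_vector("Hello World! World!")  # [1, 2] -- the direction changes

print(cosine_similarity(hello, hello_world))     # ~0.707 (a 45 degree angle)
print(cosine_similarity(hello_x3, hello_world))  # ~0.707 -- extra Hellos don't change the angle
print(cosine_similarity(hello, hello_world2))    # ~0.447 -- extra Worlds do change the angle
```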
Your videos are such a lifesaver! Could you do one on the difference between PCA and ICA?
I'll keep that in mind.
Terrific video, thanks Josh! I learned the basics of Linear Algebra before, but the explanations were never this clear (or fun).
Thank you!
This guy is seriously funny. I thought I was the only person who ever watched gymkata (like 50 times, especially the part in the town where everyone was crazy). This video def explains cosine sim clearly. Thk u!
BAM! :)
You democratize mathematics! Schools should teach it this way.
Thank you very much!
I came here as I need to learn something in NLP. Thank you, I understood it clearly.
BAM! :)
Wow, thank you!!! I didn't know how to calculate it, but after watching this, I've become a mathematician!!
bam!
Wonderfully explained, Josh! You've earned a subscriber!
Thank you!
It all seems so easy when you speak about such complicated things! Huge talent! And so funny ⚡⚡⚡
Thank you!
The quality of your explanation is more than triple bam!!😂
Thanks!
Thank you for making all of these informative, simple and precise videos. I wondered what happens if two phrases deliver the same meaning but have different orders of words, for instance: A) I like Gymkata. B) I really like Gymkata. In this case doesn't the extra adverb "really" in the second sentence disturb the phrase matrix? And one more question, if the three phrases have the same length and two of them have the same meaning but have used different words, like: A) I like Gymkata. B) I love Gymkata. C) I like volleyball. In this case, would the cosine similarity between A and B be more than A and C?
In this video, we're simply counting the number of words that are the same in different phrases; however, you can use other values to calculate the cosine similarity, and that is often the case. For example, we could calculate "word embeddings" for each word in each phrase and then compute the cosine similarity using the word embedding values, and that would allow phrases with similar meanings to have larger similarities. To learn more about word embeddings, see: ua-cam.com/video/viZrOnJclY0/v-deo.html
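As a rough sketch of that word-counting approach (my own example, assuming scikit-learn is installed; the phrases come from the question above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

phrases = ["I like Gymkata",
           "I really like Gymkata",
           "I like volleyball"]

# One word-count vector per phrase, then all pairwise cosine similarities.
counts = CountVectorizer().fit_transform(phrases)
print(cosine_similarity(counts).round(2))

# With plain counts, "I really like Gymkata" scores higher against "I like Gymkata"
# than "I like volleyball" does, because the extra word only nudges the angle.
# A synonym like "love" would contribute nothing here, which is why word embeddings
# are the next step for capturing meaning.
```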
Excellent explanation! I hope it is the first of an NLP series of videos!
I hope to do word embeddings soon.
What a great way of explaining!! Love it ❤
Thanks!
Awesome video! I had no idea what Cosine Similarity was, but you explained it super clearly.
Thanks!
Pretty good.
Great video! My notes: 3:52 4:23
bam!
Great video! I've seen in many articles out there, though, that people consider cosine similarity the same as Pearson's correlation, since they produce the same outcome when the means of X and Y are both 0.
In general this is not true, since the two measure different things. Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space and returns a similarity score, as explained in the video, while Pearson's correlation measures the linear relationship between 2 variables.
Correct!
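A small numeric check of that point (my own sketch, not from the video): the cosine similarity of two vectors matches their Pearson correlation only after each vector has been mean-centered.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 5.0, 9.0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(x, y))                        # cosine similarity of the raw vectors
print(np.corrcoef(x, y)[0, 1])             # Pearson correlation
print(cosine(x - x.mean(), y - y.mean()))  # matches the Pearson correlation
```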
You deliver the moment I need it. Thanks
BAM! :)
Could you cover discrete cosine/Fourier transforms, pretty please? I'd love to know how to break signals up into their component frequencies.
If you haven't already!
I'll keep that in mind.
Have you seen the 3blue1brown video on this topic? Not sure if it's about the discrete FT.
This is an AMAZING explanation !!
Thank you!
Super simple explanation! Thanks for your effort.
Thanks!
This video needs more views; it is awesome.
Thank you! :)
This video saved me, I cannot thank you enough.
Bam! :)
Hilarious, easy to understand, and entertaining. Bravo!
Glad you enjoyed it!
I love you!!!! Salute from Brazil.
Thank you very much! :)
This video was the goat!
Thanks!
Excellent explanation!
Thanks!
you are the King Josh 👏👏👏👏 wonderful job!!!
Thank you! 😃
I must watch Gymkata ! Thanks for the recommendation ! And excellent explanation of the topic !
bam! :)
This video also does a good job highlighting how cosine similarity and the dot product are related. Unless I'm mistaken, that equation can be written dot(a, b) / (magnitude(a) * magnitude(b)), where magnitude(x) = sqrt(dot(x, x)).
yep
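As a quick check of that identity (my own sketch), here the cosine is computed from dot products and the angle between two of the video-style count vectors is recovered:

```python
import numpy as np

a = np.array([1.0, 1.0])  # counts for "Hello World!"
b = np.array([3.0, 0.0])  # counts for "Hello Hello Hello"

magnitude = lambda x: np.sqrt(np.dot(x, x))
cos_theta = np.dot(a, b) / (magnitude(a) * magnitude(b))

print(cos_theta)                         # ~0.707
print(np.degrees(np.arccos(cos_theta)))  # ~45 degrees, the angle between the vectors
```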
What an amazing explanation. Thanks for the video.
Thanks!
Great video. Very interesting. I hope to see you apply this to more examples.
We'll see it used in CatBoost for sure.
You really are the best !
Thank you!
Your Explanation is great
Thanks!
Great video! Have you made one for the Word Embeddings?
Coming soon!
super clear, thank you dude
Thanks!
Holy shit did I land on a gold mine. Love the explanation (minus the intro, sorry Josh). Thanks a bunch!
Thanks!
How am I able to understand this topic? Wasn't this supposed to be difficult? 😭
Seriously Great Explanation Josh.
Thank you!
You nailed it, enjoyed it. Hello, BAM, and best teacher ever 😂😂😂
Thank you!
Very useful 👍
Thank you! :)
Thank you so much
You're most welcome!
Thanks a lot. These kinds of videos are super helpful for me!!!
Thanks! :)
this video was absolutely a BAM!!
Thanks!
superb ! Thank you for the explanation
Thanks!
Excellent and clear video! I wonder why NLP applications more often use cosine similarity rather than other metrics, such as Euclidean distance. Is there a clear reason for that? Thanks in advance.
I'm not certain, but one factor might be how easy it is to compute (people often omit the denominator, making the calculation even easier), and it might be nice that the cosine similarity is always between -1 and 1 (and usually just between 0 and 1), so it doesn't need to be normalized.
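One practical difference you can see for yourself (my own sketch, not part of the reply above): scaling a count vector, e.g. a longer document with the same word proportions, leaves the cosine similarity unchanged but blows up the Euclidean distance.

```python
import numpy as np

short_doc = np.array([2.0, 1.0, 0.0])  # word counts for a short document
long_doc  = 10 * short_doc             # same word proportions, 10x longer
query     = np.array([4.0, 2.0, 1.0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(query, short_doc), cosine(query, long_doc))  # identical values
print(np.linalg.norm(query - short_doc),
      np.linalg.norm(query - long_doc))                   # very different distances
```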
Thank you!
Thanks!
Great video and great explanation! Thanks.
Glad it was helpful!
Hey Josh, thanks for the video, nice explanation.
You bet!
nice easy explanation
Thanks!
thanks a lot. easy to understand
Thanks!
Perfection! BAM
Thank you!
Amazing!!
Thanks!
Hey, great video as always!! Is the cosine similarity good for regression problems in which the targets are pretty close to zero? I'm trying to implement some accuracy metrics for a transformer model.
Hmm... I bet it would work (if you had a row of predictions and a row of known values).
Great video 🎉
Thank you 😁!
Hello! Hello! Hello! Thank you for introducing me to this topic! Subscribed.
Awesome! Thank you!
Great video, as we've come to expect.
Thank you! :)
Thank you so much
I had no idea what cosine similarity was, and you illustrated it so clearly; appreciate it.
Btw, how can cosine similarity result in a negative number?
The cosine similarity can be calculated for any 2 sets of numbers, and that can result in a negative value.
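A tiny illustration of that (my own example, not from the video): when the values aren't just positive counts, the cosine similarity can drop below zero, all the way to -1 for vectors pointing in exactly opposite directions.

```python
import numpy as np

a = np.array([ 1.0,  2.0])
b = np.array([-1.0, -2.0])  # points in exactly the opposite direction from a
c = np.array([-2.0,  1.0])  # perpendicular to a

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(a, b))  # -1.0
print(cosine(a, c))  #  0.0
```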
You're insanely good at explaining things clearly; btw, you sing really well 😂
Thanks! 😃
The generalized equation of cosine similarity comes from the dot product of 2 vectors in multidimensional space... by the way, big fan of yours ❤
scaled to be between -1 and 1. :)
Congratulations on the content. An excellent explanation, unlike any I've found in another video.
Thank you very much! :)
useful, thanks
Thanks!
Cool! (in StatQuest voice)
bam! :)
Hello!! Nice video!
Thank you!
Cosine similarity is a good method for comparing the embedding vectors, especially for face recognition.
Nice!
Great video! One question - how is this different from the regular string comparison we use in various programming languages?
I'm not sure I understand your question. My understanding of string comparison in programming languages is that it just compares the bits to make sure they are equal and the result is a boolean True/False type thing.
Like most things, it is relatively straightforward when you remove the jargon
bam! :)
Thank you, as always, for the great and easy-to-understand video!
And I have a question about totally different words.
If there are 2 sentences like "very good" / "super nice", then since very, good, super, and nice are totally different words, the cosine similarity will be 0.
However, they actually have the same meaning!
I want to ask what other preprocessing we should do in such a situation?
Thank you so much!
I think you might need more context (longer phrases) to get a better cosine similarity. I just used 2 words because I could draw them, but in practice, you use more.
Hi Josh, I'm trying to understand why cosine similarity may be the best metric for finding semantically similar texts (using pretrained embeddings). It sounds like the two vectors only have to be directionally similar for the cosine similarity to be high. What about using something like Euclidean or Manhattan distance? Would a distance metric be better for checking whether two texts are semantically similar?
That's a good question and, to be honest, I don't know the answer. I do know, however, that most neural networks - when they use "attention" (like in transformers, which are used for ChatGPT) - just use the numerator of the cosine similarity as the "similarity metric". In other words, they just compute the dot-product. Maybe they do this because it's super fast, and the speed outweighs the benefits of using another, more sophisticated method.
Also, it's worth noting that this is a similarity metric and not a distance. In other words, as the value goes up, things are "more similar" (the angle is smaller). In contrast, the Euclidean and Manhattan distances are...distances. That is, as the value goes up, the things are further away and considered "less similar".
Lastly, cool music on your channel! You've got a dynamite voice.
@@statquest thank you! let me know if you need another voice in any of your intro jingles 😁
@@SalahMusicOfficial bam!
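A small sketch related to the dot-product point a few replies up (my own example, not from the video): if the vectors are normalized to unit length first, the plain dot product is exactly the cosine similarity, which is one reason dropping the denominator can be a cheap shortcut.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Full cosine similarity vs. the dot product of the unit-length versions.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_unit_vectors = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))

print(cosine, dot_of_unit_vectors)  # both ~0.98
```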
you are amazing
Thanks!
Amazing
Thanks!
Super interesting! Do you have examples of how these are implemented in practice?
I talk about that at the start of the video, but it's also used by CatBoost to compare the predicted values for a bunch of samples to their actual values.
How can someone be so good at something! Thank you. I have bought a copy of your book, The StatQuest Illustrated Guide to Machine Learning, because I wanted to convey my gratitude. I have yet to go through the book (just bought it!) but I am sure it will be awesome.
Thank you very much!!! I really appreciate your support! :)
@@statquest can you please do an episode on NMF
@@notjustanyuser I'll keep that in mind, but it will probably be a long time before I can get to it.
Wow, Math is awesome!
:)
I'm a native Spanish speaker, and it surprised me when the video started speaking Spanish. It will reach more people, but they will miss your motivating silly songs xD
Thanks! Yeah - I'm not sure what to do about the silly songs. :)
Hey... so cosine similarity only depends on the angle, not on the lengths... In the case where three Hellos were shown, how can the sentences be distinguished when the similarity is the same for both?
What time point, minutes and seconds, are you asking about?
Can u please help me with this?
This is my data:
A: cosine: 0.58, z-score: 372
B: cosine: 0.63 , z-score: 370
How can I find the p-value/significance of the 0.05 change in the cosine similarities?
We didn't cover p-values in the video.
Does the cosine similarity equation end up being a normalized projection of one vector onto the other?
I believe that is correct.
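A quick numeric check of that idea (my own sketch, not from the video): the cosine similarity equals the length of one vector's projection onto the other, divided by the first vector's own length.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
projection_length = (a @ b) / np.linalg.norm(b)  # length of a projected onto b

print(cosine)                                 # ~0.707
print(projection_length / np.linalg.norm(a))  # the same value
```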
Can you please talk about some applications of cosine similarity, like where it is used and in which types of problems?
I talk about that at the start of the video, but you can also use it whenever you want to compare two rows of data. For example, CatBoost uses it to compare predicted values for a bunch of data to their actual values.
Love you
Love you too
:)
Why is it specifically Cos, and not Tan? Since you’re collecting the opposite and adjacent length??
The cosine is easy to calculate and, unlike the tangent function, is defined for all possible angles.
"in contrast, this last sentence is from someone who does not like troll 2" - I was expecting a BOOOO after that lol
Ha! That would have been great.
Can the cosine similarity be greater than the distance between words?
I guess it depends on how you measure the distance. However, in general, the cosine similarity will always be between -1 and 1 (and is usually just between 0 and 1).
@@statquest In what cases can cosine similarity be -1? Isn't it a similarity measure, meaning 0 would imply nothing in common and 1 perfect similarity? What would -1 imply?
@@aizazkhan5439 A similarity of -1 is sort of like an inverse correlation - when one goes up, the other goes down, etc.
Can I use cosine similarity for building a similarity matrix between two different brain regions?
Probably.
Wow, I used this to make a WhatsApp bot that puts a client on a flow/menu based on the client's first message.
bam!
Somewhere it says Cosine Similarity is a number between -1 and +1, but in other places it is said to be between 0 and 1. What is the truth?
The cosine similarity can be between -1 and 1. If all the input data are positive (like they are in a bunch of the examples in this video, since we are just using count data, and count data is positive) then you'll be restricted to values between 0 and 1, but the data don't always have to be positive.
I am conducting sentiment analysis research and found that some data has a Cosine Similarity of 0. Are there any methods to make the Cosine Similarity not equal to 0?
You could pad each phrase with something, so all phrases have at least one thing in common.
@@statquest Thank you so much😁
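A tiny sketch of that padding idea (my own example, assuming scikit-learn; the PAD token is made up for illustration): adding a shared dummy token to every phrase guarantees at least one overlapping "word", so the cosine similarity is never exactly 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

phrases = ["very good", "super nice"]
padded  = [p + " PAD" for p in phrases]  # every phrase now shares the PAD token

print(cosine_similarity(CountVectorizer().fit_transform(phrases))[0, 1])  # 0.0 -- no shared words
print(cosine_similarity(CountVectorizer().fit_transform(padded))[0, 1])   # > 0 thanks to PAD
```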
Can you please do Spherical K Means with Cosine Similarity as the distance metric?
I'll keep that in mind.
tysm
Thanks!
How can we relate this to the correlation between two continuous random variables?
See: stats.stackexchange.com/questions/235673/is-there-any-relationship-among-cosine-similarity-pearson-correlation-and-z-sc