Timestamps:
00:00:00 Basic bag-of-words (BoW)
00:00:22 The need for vectors
00:00:53 Selecting and extracting features from our data
00:04:04 Idea: similar documents share similar vocabulary
00:04:46 Turning a corpus into a BoW matrix
00:07:10 What vectorization helps us accomplish
00:08:20 Measuring document similarity
00:11:09 Shortcomings of basic BoW
00:12:37 Capturing a bit of context with n-grams
00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
00:17:47 DEMO: measuring document similarity
00:18:40 DEMO: creating n-grams with scikit-learn
00:19:35 Basic BoW recap
I am truly amazed by the excellence of this course. It is undoubtedly the finest NLP course I have come across, and the teaching and explanations provided are unparalleled. I have the utmost respect and admiration for it. Kudos to you, and thank you for such a remarkable learning experience! BOWING DOWN IN RESPECT!
Thank you so much!
Thanks for this awesome course. :)
Great lectures, I learned a lot of NLP concepts.
You are the best!! This course is soo soo helpful man!!
Great lectures.
Hi sir, I think the dot product calculation at 9:05 in the video may be wrong. It should be (6x4)+(6x2)=36. By the way, your videos are very helpful for a beginner. Thank you very much for your effort. Looking forward to seeing more good videos on your channel.
Thank you for the correction!
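For anyone double-checking the correction above, here is a minimal sketch in NumPy. It assumes the two document vectors from the video's example are (6, 6) and (4, 2), as implied by the comment:

```python
import numpy as np

# Assumed document count vectors from the video's example
doc_a = np.array([6, 6])
doc_b = np.array([4, 2])

# Dot product: (6*4) + (6*2) = 24 + 12 = 36
dot = np.dot(doc_a, doc_b)
print(dot)  # 36
```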
If anybody is getting "ValueError: Input vector should be 1-D" in the Cosine Similarity section, the fix is simple: index after calling toarray() instead of before. For example, replace
bow[0].toarray()
with
bow.toarray()[0]
Thank you! Code updated.
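To see why that fix works, here is a small sketch using a hypothetical 2x3 bag-of-words matrix (the variable name bow and the counts are illustrative, not taken from the demo). Indexing a scipy sparse matrix first keeps it 2-D, while converting to a dense array first yields a 1-D row, which is what scipy's cosine distance expects:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine

# Hypothetical BoW matrix: rows = documents, columns = word counts
bow = csr_matrix(np.array([[6, 6, 0],
                           [4, 2, 0]]))

# bow[0].toarray() keeps the matrix shape: (1, 3), still 2-D
print(bow[0].toarray().shape)   # (1, 3)

# bow.toarray()[0] converts first, then takes a row: (3,), 1-D
print(bow.toarray()[0].shape)   # (3,)

# scipy.spatial.distance.cosine returns a distance, so the
# similarity is 1 minus it; it requires 1-D input vectors.
similarity = 1 - cosine(bow.toarray()[0], bow.toarray()[1])
print(similarity)
```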
I am a bit confused about the cosine similarity metric. I thought the cosine similarity range is from -1 to 1, not 0 to 1. I've seen the 0-to-1 threshold used elsewhere as well, but I do notice that more popular embedding models generate negative vector elements, and their normalized versions naturally produce values ranging from -1 to 1. Can you please clarify this? I've been struggling to wrap my head around it.
It's mentioned in the video, on the line with the asterisk at the 10:00 timestamp. Cosine similarity in general ranges over [-1, 1], but word frequencies are never negative, so, to put it plainly, frequency vectors can't point in opposite directions. In the context of this task, the effective range is therefore [0, 1]. Frequency-based vectors are not rare, so you can easily see both [0, 1] and [-1, 1] ranges in the wild; think of the first as a subset of the second.
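The point above can be demonstrated directly. This sketch reuses the count vectors from the dot-product discussion earlier in the thread, plus two made-up embedding vectors with negative components:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Count vectors are non-negative, so their dot product can't be
# negative and the similarity is confined to [0, 1].
counts_a = np.array([6, 6, 0])
counts_b = np.array([4, 2, 0])
print(cosine_similarity(counts_a, counts_b))  # ~0.9487

# Embedding vectors can have negative components, so the full
# [-1, 1] range is reachable: these two point in opposite directions.
emb_a = np.array([1.0, -1.0])
emb_b = np.array([-1.0, 1.0])
print(cosine_similarity(emb_a, emb_b))  # -1.0
```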