NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity

  • Published Jan 7, 2025

COMMENTS • 13

  • @futuremojo
    @futuremojo  2 years ago +4

    Timestamps:
    00:00:00 Basic bag-of-words (BoW)
    00:00:22 The need for vectors
    00:00:53 Selecting and extracting features from our data
    00:04:04 Idea: similar documents share similar vocabulary
    00:04:46 Turning a corpus into a BoW matrix
    00:07:10 What vectorization helps us accomplish
    00:08:20 Measuring document similarity
    00:11:09 Shortcomings of basic BoW
    00:12:37 Capturing a bit of context with n-grams
    00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
    00:17:47 DEMO: measuring document similarity
    00:18:40 DEMO: creating n-grams with scikit-learn
    00:19:35 Basic BoW recap
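
    A minimal sketch of the demo steps listed above, using scikit-learn (the corpus and variable names here are placeholders, not the notebook's actual code):

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Placeholder corpus; the notebook uses its own documents.
      corpus = [
          "the cat sat on the mat",
          "the dog sat on the log",
          "astronomy is the study of stars",
      ]

      # Basic bag-of-words: each row is a document, each column a vocabulary term.
      vectorizer = CountVectorizer()
      bow = vectorizer.fit_transform(corpus)

      # Document similarity: cosine similarity between BoW rows.
      print(cosine_similarity(bow[0], bow[1]))

      # N-grams: capture a bit of context by counting unigrams and bigrams together.
      bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
      bow_ngrams = bigram_vectorizer.fit_transform(corpus)
      print(bigram_vectorizer.get_feature_names_out()[:10])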

  • @vipulmaheshwari2321
    @vipulmaheshwari2321  1 year ago +4

    I am truly amazed by the excellence of this course. It is undoubtedly the finest NLP course I have come across, and the teaching and explanations provided are unparalleled. I have the utmost respect and admiration for it. Kudos to you, and thank you for such a remarkable learning experience! BOWING DOWN IN RESPECT!

  • @NAEXTRO
    @NAEXTRO  1 year ago +3

    Thanks for this awesome course. :)

  • @FrankCai-e7r
    @FrankCai-e7r  1 year ago

    Great lectures; I learned a lot of NLP concepts.

  • @aneshsrivastav8092
    @aneshsrivastav8092  1 year ago

    You are the best!! This course is so, so helpful, man!!

  • @frankrobert9199
    @frankrobert9199  1 year ago

    Great lectures.

  • @zhuchenwang4747
    @zhuchenwang4747  1 year ago

    Hi sir, the calculation of the dot product at 9:05 in the video may be wrong. It should be (6x4)+(6x2)=36. By the way, your videos are very helpful for a beginner. Thank you very much for your effort. Looking forward to seeing more good videos on your channel.
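
    A quick check of the arithmetic in the comment above (the component values 6, 6 and 4, 2 are taken from the comment, not re-verified against the video):

      import numpy as np

      # Components as stated in the comment above (hypothetical vectors).
      a = np.array([6, 6])
      b = np.array([4, 2])

      # (6 * 4) + (6 * 2) = 24 + 12 = 36
      print(np.dot(a, b))  # 36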

  • @metavore7790
    @metavore7790  1 year ago

    If anybody is getting "ValueError: Input vector should be 1-D" in the Cosine Similarity section, the fix is simple: change where the index is applied relative to toarray(). For example:
    bow[0].toarray() is replaced by
    bow.toarray()[0]
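
    A minimal sketch of the fix described above, assuming the notebook computes cosine distance with scipy on a CountVectorizer matrix (the corpus and variable names here are illustrative):

      from sklearn.feature_extraction.text import CountVectorizer
      from scipy.spatial.distance import cosine

      corpus = ["the cat sat on the mat", "the dog sat on the log"]  # placeholder documents
      bow = CountVectorizer().fit_transform(corpus)

      # bow[0].toarray() has shape (1, n_terms), so scipy's cosine() raises
      # "ValueError: Input vector should be 1-D".
      # Indexing after toarray() yields a 1-D vector of shape (n_terms,) instead:
      doc0 = bow.toarray()[0]
      doc1 = bow.toarray()[1]

      print(1 - cosine(doc0, doc1))  # cosine() returns a distance, so similarity = 1 - distance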

  • @techaztech2335
    @techaztech2335  1 year ago

    I am a bit confused about the cosine similarity metric. I thought the cosine similarity range is from -1 to 1, not 0 to 1. I've seen the 0 to 1 threshold used elsewhere as well, but I notice that the more popular embedding models generate negative vector elements, and naturally their normalized versions produce ranges from -1 to 1. Can you please clarify this? I've been struggling to wrap my head around it.

    • @IvanKleshnin
      @IvanKleshnin  3 months ago

      It's mentioned in the video, in the line with an asterisk at the 10:00 timestamp. Cosine similarity in general ranges over [-1, 1], but word frequencies never produce negative values, so the vectors can't point in opposite directions, to put it in plain words. In the context of this task, the effective range is therefore [0, 1]. Frequency-based vectors are not rare; you can easily see both [0, 1] and [-1, 1] ranges in the wild. Think of the first as a subset of the second.
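
      In symbols (the standard definition of cosine similarity, not taken from the video):

        \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \in [-1, 1]
        \quad\text{in general, but if } A_i \ge 0 \text{ and } B_i \ge 0 \text{ for all } i,
        \text{ then } A \cdot B \ge 0 \text{ and thus } \cos(\theta) \in [0, 1].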