How to Create a BM25 Index in Python with Rank BM25 (Search Engine)

  • Published Jan 29, 2025

COMMENTS • 21

  • @jesusmtz29 2 years ago +4

    I love how you take the time to show how it can produce incorrect results. It's very helpful.

    • @jesusmtz29 2 years ago +1

      Is there a nice way to combine this library with spaCy?

    • @python-programming 2 years ago

      Thanks for that comment! It is good to know that others find that approach helpful. Good question about spaCy. There would be a way. I think you would use the tokens of the Doc container as the token sequence for BM25, but how you put it in the spaCy pipeline would depend on what you want it to do, and you would need to wrap it in a custom component. If you wanted it to sit outside of spaCy, you could save your Doc containers as an index, use BM25 to search, and then populate the results by looking up the matching Doc containers in that index.

  • @karndeepsingh 2 years ago

    How can we extract the trained weights from a trained BM25 model?

  • @SOUFTVOFFICIEL 1 year ago

    How can we use an inverted index with BM25? Or do we not need an inverted index when we use a BM25 model?

  • @airesearch2024 2 years ago +1

    How can we use BM25L with this package?

    • @python-programming 2 years ago +1

      Great question! You simply call the BM25L class instead, see line 137: github.com/dorianbrown/rank_bm25/blob/master/rank_bm25.py

    • @airesearch2024 2 years ago +1

      @python-programming thank you!!!! Also, I'm wondering if you know how to combine sentence transformers with BM25 for better search results?

    • @python-programming 2 years ago +1

      @airesearch2024 No problem! In this scenario, I would recommend using a sentence transformer to vectorize your documents and then using Annoy as the search algorithm. I don't have a video on doing this with texts, but I do have one using a CLIP model (images and text).

  • @lukasmarteleur9318 2 years ago +1

    Does this library work with text in languages other than English?

    • @python-programming 2 years ago +1

      I have used it with Latin and it worked fine for me. So it should work with most Western languages.

  • @venkatesanr9455 2 years ago +1

    Thanks for your valuable videos. I have one question. After semantic search I get many documents, and some of them have the same content under slightly different filenames because they were saved and backed up at different times. Can you suggest a way to keep only one document from each set of documents with the same content, since the other copies are not required? Would cosine similarity help to choose one document from such a set?

    • @python-programming 2 years ago

      Thanks for the comment and question. Would you mind rephrasing this a bit? I just want to make sure I understand the core part of your question.

    • @venkatesanr9455 2 years ago

      @python-programming I have handled this by loading the PDF content for the different filenames into a pandas DataFrame and dropping duplicates, keeping the last one. I think semantic search (symmetric/asymmetric) can be done using a bi-encoder/cross-encoder. Can you discuss this, please?
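
The pandas approach mentioned in this reply might look like the minimal sketch below. The filenames, contents, and column names are hypothetical, and only exact duplicates are caught; near-duplicates with small text differences would need a similarity measure instead.

```python
import pandas as pd

# Hypothetical records: the same extracted PDF text saved under different filenames.
records = pd.DataFrame(
    {
        "filename": ["report_v1.pdf", "report_backup.pdf", "notes.pdf"],
        "content": [
            "Quarterly results for the first quarter.",
            "Quarterly results for the first quarter.",
            "Meeting notes from Monday.",
        ],
    }
)

# Keep one row per distinct content; keep="last" retains the most recent copy
# when rows are ordered from oldest to newest.
deduped = records.drop_duplicates(subset="content", keep="last")
print(deduped["filename"].tolist())
```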

  • @superfreiheit1 11 months ago

    Awesome video quality.

  • @kenchang3456 10 months ago

    Hi. Did you ever get around to making a video on storing metadata in a dictionary that accompanies a tokenized index? Thanks for sharing.

  • @wakam229 2 years ago +1

    I want my queries to be all of my corpus sentences. Is that possible? For example, instead of "windy london", the queries would be "hello there good man!", "it is quite windy at london"...

    • @python-programming 2 years ago

      Yes, absolutely. You would just adjust the index accordingly.

  • @rChandan_Singh 1 month ago

    Not a single method is explained for a non-English corpus.