Semantic Chunking for RAG

  • Published 12 Jan 2025

COMMENTS • 67

  • @energyexecs
    @energyexecs 6 months ago +3

    James Briggs is one of my favorites, and I believe I am a "Patreon" member. I spend hundreds of hours listening to about 10 podcasts, studying Large Language Models, Machine Learning and so-called "AI". James Briggs breaks things down into easier-to-understand concepts. Thank you, James Briggs.

    • @jamesbriggs
      @jamesbriggs 6 months ago +1

      hey that's awesome, I really appreciate the support!

  • @aaronsmyth7943
    @aaronsmyth7943 8 months ago +12

    At this point, you are practically Captain Chunk.

  • @AaronJOlson
    @AaronJOlson 8 months ago +2

    Thank you! I’ve been doing this for a while, but did not have a good name for it.

  • @xuantungnguyen9719
    @xuantungnguyen9719 8 months ago +3

    Need a video on cross-chunk attention. Wasn't attention all about key, query and value anyway?

  • @lalamax3d
    @lalamax3d 8 months ago

    Best I have seen so far for understanding the core concept of chunking, thanks.

  • @AdrienSales
    @AdrienSales 8 months ago

    Excellent content and explanation, especially of the core concepts and challenges of chunking. Keep up your work, it's so precious for learning 👍

  • @baskarjayaraman5821
    @baskarjayaraman5821 8 months ago

    Great video. Thanks for posting. I have been thinking of document chunking but using the LLM itself via prompting + k-shot. The approach you show will be cheaper of course but curious to see how these two approaches will compare in terms of any relevant non-cost metrics.

  • @GeertBaeke
    @GeertBaeke 8 months ago

    We use a simple combination of Microsoft's Document Intelligence with markdown output and a simple markdown splitter. The improvement is noticeable although the Document Intelligence models do come at an additional cost.

    • @jamesbriggs
      @jamesbriggs 8 months ago +2

      yeah it depends on what you need of course. I'm mostly interested in further abstraction and more analytical methods for chunking, not for where it is now but for where this type of experimentation might lead in the future. I could see a few more iterations and improvements making more intelligent doc parsing and chunking increasingly performant, but we'll see

    • @alivecoding4995
      @alivecoding4995 8 months ago

      Do you have a link for this markdown processing? :)
      We are using Document Intelligence as well, but not for layout analysis, yet.

    • @GayathriG-h5h
      @GayathriG-h5h 8 months ago

      @alivecoding4995 you can also use LayoutPDFReader from llmsherpa

  • @gullyburns1280
    @gullyburns1280 8 months ago

    Another killer video. Great work!

  • @naromsky
    @naromsky 8 months ago +9

    King of Chunk

    • @jamesbriggs
      @jamesbriggs 8 months ago +5

      a title I have always wanted

  • @rodgerb2645
    @rodgerb2645 8 months ago

    Love all your content sir!

  • @AGI-Bingo
    @AGI-Bingo 8 months ago +4

    Hi James, would you please tell me how you would tackle this one?
    How would you design a real-time updating RAG system? For example, let's say our clients updated some details in a watched doc. I want the old chunks to be removed and the doc rechunked automatically. Have you seen such a pipeline existing already? No one seems to cover this, and I think it is what sets apart fun projects from actual production systems. Thanks and all the best! Love your channel ❤

    • @shameekm2146
      @shameekm2146 8 months ago +1

      I have achieved this for one of the sources in my RAG bot. It has an API for accessing the data, so I run the embedding script on the delta changes.

    • @AGI-Bingo
      @AGI-Bingo 8 months ago +2

      @shameekm2146 amazing, would you please open-source it so we can all improve the pipeline as a community? 🌈

    • @rohansingh1057
      @rohansingh1057 2 months ago

      RAG does not mean you "have to use vector embeddings and a vector DB". If you can run APIs to fetch relevant info, that should be good enough. Use function/tool calling to call the API.
      Otherwise, if you are planning to watch some doc live, you need the following pipeline (it will only work if you are making small, frequent changes to the doc):
      Doc changed -> Webhook/trigger to your system -> If a diff is available, use it; if not, compute the diff between the old and new doc
      -> Take nearby text as a sample and compute embeddings -> Fetch the top N nearby docs from the vector DB (hybrid search will work really well; tune the sparse vector weighting higher than in normal RAG here) -> Ask an LLM agent to mark relevant chunks, or use reranking models (old chunk as query) -> Delete these old chunks from the vector DB -> Compute embeddings for the new changes -> Upsert the new vectors into the DB.
      There are tons of edge cases that you will run into when running this pipeline, and they always vary for each use case, so you will have to consider those accordingly.
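
A minimal sketch of the delete-and-upsert end of the pipeline described above; it is not the commenter's implementation. It swaps the diff-plus-similarity-search step for a simpler technique, content hashing: each chunk gets an ID derived from its text, so chunks whose text disappeared are deleted and only genuinely new chunks are embedded. The names `chunk_id`, `fake_embed`, `sync_document` and the dict-based `store` are hypothetical stand-ins for a real embedding model and vector DB.

```python
import hashlib


def chunk_id(doc_id: str, text: str) -> str:
    """Stable ID built from the document ID and a hash of the chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{doc_id}:{digest}"


def fake_embed(text: str) -> list:
    """Hypothetical placeholder for a real embedding call."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def sync_document(store: dict, doc_id: str, new_chunks: list) -> tuple:
    """Bring `store` in line with `new_chunks`; returns (deleted_ids, added_ids)."""
    new_by_id = {chunk_id(doc_id, text): text for text in new_chunks}
    old_ids = {cid for cid, rec in store.items() if rec["doc_id"] == doc_id}

    stale = old_ids - set(new_by_id)   # text no longer present in the doc
    fresh = set(new_by_id) - old_ids   # text not seen before

    for cid in stale:
        del store[cid]
    for cid in fresh:
        text = new_by_id[cid]
        # Only new or changed chunks pay the embedding cost.
        store[cid] = {"doc_id": doc_id, "text": text, "vector": fake_embed(text)}
    return stale, fresh
```

This only works if re-chunking an unchanged passage yields the same text; with semantic chunking, an edit can shift nearby boundaries, so more chunks than expected may be replaced.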

  • @dinoscheidt
    @dinoscheidt 8 months ago +3

    People since GPT2: Simply ask an LLM recursively to please insert “{split}“ where a topic change etc happens according to a summary of prior text. Get embeddings. Use to separate and group.
    2024: We would like to introduce a novel concept called Semantic Chunking with a sliding Context……..
    Beginners must be truly lost 😮‍💨

  • @FrankenLab
    @FrankenLab 2 months ago

    @James Briggs Newbie here, was wondering if it was necessary to store the chunk with the vector, it seems like a lot of data duplication and a good way to fill your disk. I like the idea of storing the title, I was thinking about storing the document path and filename also. I haven't been able to find good info about what data besides vectors is also kept in the vector db. I understand that the vectors need to correlate to data, I just don't understand what data is actually represented in the vectors. If you just have an ID and the vectors, can't that ID point back to the document with the content?

  • @shameekm2146
    @shameekm2146 8 months ago

    Thank you so much for this. Will test it out on the RAG flow in the company.

    • @jamesbriggs
      @jamesbriggs 8 months ago

      welcome, would love to hear how it goes

  • @FatherNovelty
    @FatherNovelty 8 months ago +1

    At ~4:40, you mention that you should use the same encoder for the chunking and the encoding. Why? A chunk captures a "single meaning", so why would it matter that the same encoder is used? If you look at the chunking as a clustering algorithm that creates meaningful chunks, then what does it matter that the encoders match? What am I missing?

    • @jamesbriggs
      @jamesbriggs 8 months ago +1

      good point - yes they are capturing the "single meaning" and that single meaning will (hopefully) overlap a lot, but embedding models are not perfect, so different models will not align exactly with each other. It's similar to someone asking you and me to chunk an article: we'd likely overlap for the majority of the article, but I'm sure there would be differences
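
To show where the encoder enters the chunking decision discussed in this exchange, here is a toy sketch of similarity-based chunking. It is not the algorithm from the video or the library: it simply starts a new chunk whenever consecutive sentences fall below a similarity threshold. The bag-of-words `encode` function and the `threshold` value are assumptions standing in for a real embedding model; swapping in a different encoder would change the similarity scores and therefore the chunk boundaries.

```python
import math
from collections import Counter


def encode(sentence: str) -> Counter:
    """Toy bag-of-words encoder standing in for a real embedding model."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list, threshold: float = 0.2) -> list:
    """Group consecutive sentences, splitting where similarity drops."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(encode(prev), encode(cur)) < threshold:
            chunks.append([cur])      # topic shift: start a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend current chunk
    return chunks


sentences = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell today",
    "stocks rose today",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```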

  • @jonm691
    @jonm691 6 months ago

    Loved this explanation

  • @NhatNguyen-bq6jj
    @NhatNguyen-bq6jj 7 months ago

    Can you introduce some articles related to this topic? Thanks!

  • @brianferrell9454
    @brianferrell9454 7 months ago

    Do you think this causes the results to be biased towards smaller chunks? The user will probably query with no more than 10 words, so the most semantically similar results may also be only about 10 words long, and chunks of 400 tokens wouldn't score as highly unless you provide more context with the query?

  • @bastabey2652
    @bastabey2652 7 months ago

    Using a high-end LLM like GPT-4, Opus, or Gemini Ultra or Pro might be effective for performing semantic chunking. Google's large context window seems suitable for chunking large files. We need to introduce LLMs into automating the RAG stack.

    • @jamesbriggs
      @jamesbriggs 7 months ago +1

      Yeah I’d like to introduce an LLM chunker and see how they compare

    • @bastabey2652
      @bastabey2652 7 months ago

      @jamesbriggs better than any non-LLM chunker. If we aim to empower users with AI, why not empower the developer? Chunking is not easy.

  • @scottmiller2591
    @scottmiller2591 8 months ago +2

    "Grab complete thoughts" is an obvious good and expensive thing. Except for tables, for instance.

    • @jamesbriggs
      @jamesbriggs 8 months ago +2

      yeah tables need to be handled differently - doable if you are identifying text vs. table elements in your processing pipeline

  • @amantandon-ln9xx
    @amantandon-ln9xx 8 months ago

    I see the #abstract is in the same chunk as the #title. Ideally both should be in different chunks so that the LLM can better understand the semantics.

  • @luciolrv
    @luciolrv 8 months ago

    How does Parent Document RAG fit in with your new techniques?

  • @MrMoonsilver
    @MrMoonsilver 8 months ago

    Can this be used to create chunks for building a training dataset as well? It would be great to chunk a document into 'statements' and use those statements for a dataset. In essence, have an LLM create questions for each of those statements and use those pairs for training. Could you make a video to show how that works?

  • @nikhilmaddirala
    @nikhilmaddirala 7 months ago

    What's a good way to use the metadata for retrieval and ranking of the chunks?

  • @MrMoonsilver
    @MrMoonsilver 8 months ago

    Amazing video, thank you so much!!

  • @MrDespik
    @MrDespik 8 months ago

    Hi James. Excuse me, maybe I missed it, but how do you handle the fact that with semantic chunking we lose the page numbers for chunks? Is it possible to get them using this package?

  • @botondvasvari5758
    @botondvasvari5758 8 months ago

    And how can I use big models from Hugging Face? I can't load them into memory because many of them are bigger than 15 GB, and some of them are 130 GB+. Any thoughts?

  • @swethak7198
    @swethak7198 5 months ago

    I have a question. I have a document with many references from one page to another. Should I group all the referenced data into the same chunk (for example, if the first page refers to page 3, should I take the data from both pages and store it in a single chunk)? Is this the only way, or are there any special models for it? Otherwise, please give me some ideas.

    • @drosi1994
      @drosi1994 4 months ago

      Hmm, that's an issue you could solve at the retrieval stage rather than the chunking stage... When you retrieve a chunk, you can check with a fast LLM whether it has references to other chunks and fetch those as well
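
A rough sketch of the retrieval-stage idea in this reply, under simplifying assumptions. Instead of the fast LLM the commenter suggests, it uses a plain regular expression to spot "page N" references, and it assumes one chunk per page. `PAGE_REF`, `expand_with_references` and the chunk dict layout are hypothetical names chosen for illustration.

```python
import re

# Matches references such as "page 3" or "Page 12" (assumed phrasing).
PAGE_REF = re.compile(r"\bpage\s+(\d+)\b", re.IGNORECASE)


def expand_with_references(chunk: dict, chunks_by_page: dict) -> list:
    """Return the retrieved chunk plus any chunks for pages it refers to."""
    result = [chunk]
    seen = {chunk["page"]}
    for match in PAGE_REF.finditer(chunk["text"]):
        page = int(match.group(1))
        if page in chunks_by_page and page not in seen:
            seen.add(page)
            result.append(chunks_by_page[page])
    return result


chunks_by_page = {
    1: {"page": 1, "text": "Pricing rules are described on page 3."},
    2: {"page": 2, "text": "Unrelated appendix."},
    3: {"page": 3, "text": "Prices are updated quarterly."},
}
expanded = expand_with_references(chunks_by_page[1], chunks_by_page)
print([c["page"] for c in expanded])
```

A regex only catches explicit phrasing; an LLM check, as suggested above, would also cover looser references such as "see the next section".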

  • @klik24
    @klik24 8 months ago

    Just what I was trying to learn... awesome mate, thanks

  • @fayluu248
    @fayluu248 6 months ago

    Hi James, do you think that the chunking and embedding process in RAG will become unnecessary in the near future, once input token length is no longer a limitation?

    • @jamesbriggs
      @jamesbriggs 6 months ago

      I don’t think the input token length will become unlimited any time soon - but for smaller use cases (fitting within Anthropic limits) where latency and token cost are not important then you can use a pure LLM solution rather than RAG

  • @trn450
    @trn450 8 months ago

    Great material. 🙏

  • @talesfromthetrailz
    @talesfromthetrailz 8 months ago

    Dude already embedded whole documents of texts into PC haha would've helped a month ago. But awesome thanks for this! 🤘🏾

    • @jamesbriggs
      @jamesbriggs 8 months ago +1

      Maybe for the next project 😅

    • @talesfromthetrailz
      @talesfromthetrailz 8 months ago

      @jamesbriggs quick question man. Is the objective of semantic chunking to achieve broader search results, or to decrease query times? I'm thinking of it in terms of medium-sized text docs, for example movie summaries and such. Thanks!

  • @FDasdana
    @FDasdana 4 months ago

    Does this library also support Ollama, Gemini or HF encoders, or is it only for ChatGPT?

    • @jamesbriggs
      @jamesbriggs 4 months ago

      it supports these encoders github.com/aurelio-labs/semantic-router/tree/main/semantic_router/encoders

  • @manslaughterinc.9135
    @manslaughterinc.9135 4 months ago

    Unfortunately, the semantic router has removed this feature, or refactored it in some way.

    • @jamesbriggs
      @jamesbriggs 4 months ago

      hey yes they were deprecated in favour of this ua-cam.com/video/7JS0pqXvha8/v-deo.html

  • @itzuditsharma
    @itzuditsharma 8 months ago

    I am facing this problem in my Jupyter notebook, please help:
    2024-05-10 10:59:50 WARNING semantic_router.utils.logger Retrying in 2 seconds...

  • @mrchongnoi
    @mrchongnoi 8 months ago

    Why not chunk based on paragraphs, lists, and tables?

  • @jimmc448
    @jimmc448 8 months ago +1

    My son just asked if you were the Rock

  • @saqqara6361
    @saqqara6361 8 months ago +1

    "What is the title of the document?" -> 99% of RAG pipelines fail, because there is no answer in the document as it is embedded.

    • @jamesbriggs
      @jamesbriggs 8 months ago

      in that case we can try including the title in our chunk, and possibly consider different routing logic for this type of query: when a user asks for metadata about a received document, we trigger a function that identifies the document ID in previously retrieved contexts and uses it to pull in the document metadata, so the LLM can generate the answer from that
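
The routing idea in this reply could look roughly like the sketch below. It is only an illustration under stated assumptions: the keyword check is a crude stand-in for whatever routing logic is actually used, and `METADATA_KEYWORDS`, `build_context`, the chunk dicts and `metadata_by_doc` are all hypothetical names.

```python
# Words that suggest the user is asking about the document itself (assumed list).
METADATA_KEYWORDS = ("title", "author", "published")


def build_context(query: str, retrieved_chunks: list, metadata_by_doc: dict) -> list:
    """Pick what to hand to the LLM: document metadata or chunk text."""
    if any(word in query.lower() for word in METADATA_KEYWORDS):
        # Collect document IDs from previously retrieved chunks, in order,
        # without duplicates, then look up their metadata.
        doc_ids = list(dict.fromkeys(c["doc_id"] for c in retrieved_chunks))
        return [metadata_by_doc[d] for d in doc_ids if d in metadata_by_doc]
    return [c["text"] for c in retrieved_chunks]


retrieved = [
    {"doc_id": "paper-1", "text": "We propose a new chunking method."},
    {"doc_id": "paper-1", "text": "Results improve over fixed windows."},
]
metadata = {"paper-1": {"title": "Example Paper Title"}}

print(build_context("What is the title of the document?", retrieved, metadata))
print(build_context("What method is proposed?", retrieved, metadata))
```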

  • @maharun
    @maharun 3 months ago +1

    Using the semantic chunker gives this error even though I'm not using Cohere:
    cannot import name 'EmbedResponse_EmbeddingsByType' from 'cohere.types.embed_response'
    How do I solve it? I have already wasted one day on it. This is so annoying, please help :)

    • @jamesbriggs
      @jamesbriggs 3 months ago

      cohere did a surprise SDK update and they are a default package in the library (we may change this) - try doing a `pip install -qU semantic-chunkers semantic-router==0.68`
      more info here if needed github.com/aurelio-labs/semantic-router/issues/422