Semantic Chunking - 3 Methods for Better RAG

  • Published 11 Oct 2024

COMMENTS • 26

  • @wassfila
    @wassfila 4 months ago +4

    This is really promising, thank you. It's really hard to get an overview of the cost/benefit of end results from a RAG end-user's perspective. Something like a comparison table would help.

  • @BB-ou5ui
    @BB-ou5ui 4 months ago +1

    Hi! That's exactly what I was looking for; I've been explaining and experimenting with some personal implementations, trying different strategies beyond dense vectors... Have you considered using multi-vector models like ColBERT? To some extent, you could work with matrix similarities on bigger contexts... I'm also testing some weighted strategies using SPLADE, but it's too early to make claims 😊

  • @samcavalera9489
    @samcavalera9489 4 months ago

    Hi James,
    First off, I want to express my immense gratitude for your insightful videos on RAG and other AI topics. Your content is so enriching that I find myself watching each video at least twice!
    I do have a couple of questions that I hope you can shed some light on:
    1) When using OpenAI’s small embedding model with the RecursiveCharacterTextSplitter, is there a general guideline for determining the optimal chunk size and overlap size? I’m looking for a rule of thumb that could help me set the right values for these parameters.
    2) My work primarily involves using RAG on scientific papers, which often include figures that sometimes convey more information than the text itself. Is there a technique to incorporate these figures into the vector database along with the paper’s text? Essentially, for multi-modal vector embedding that includes both text and images, what’s the best approach to achieve this?
    I greatly appreciate your insight 🙏🙏🙏

    • @jamesbriggs
      @jamesbriggs 4 months ago +1

      Hey, thanks for the message! For (1) my rule of thumb is 200-300 tokens with a 20-40 token overlap. For (2) you can use multimodal models (like gpt-4o) to describe what is in the image, then embed that description; alternatively you could use a text-image embedding model, but they don’t capture as much detail as a multimodal LLM. Hope that helps :)
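The 200-300 token / 20-40 overlap rule of thumb above can be sketched in plain Python. This is a minimal illustration only: whitespace-separated words stand in for tokens (a real pipeline would count tokens with a tokenizer such as tiktoken), and the function name is illustrative, not LangChain's API.

```python
def chunk_by_tokens(text, chunk_size=250, overlap=30):
    """Split text into overlapping chunks.

    Whitespace words are a rough stand-in for tokens here;
    a real pipeline would count tokens with a tokenizer.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(600))
chunks = chunk_by_tokens(doc)
print(len(chunks))             # → 3
print(len(chunks[0].split()))  # → 250
```

Each chunk repeats the last 30 "tokens" of its predecessor, so context straddling a chunk boundary is not lost.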

    • @samcavalera9489
      @samcavalera9489 4 months ago

      @@jamesbriggs many thanks James 🙏🙏🙏

  • @KenRossPhotography
    @KenRossPhotography 4 months ago

    Super interesting - thanks for that! I'll definitely be experimenting with those chunking variations.

    • @jamesbriggs
      @jamesbriggs 4 months ago

      Awesome, would love to hear how it goes

  • @AGI-Bingo
    @AGI-Bingo 4 months ago +3

    Hi James, could you please cover how to do "citing" with RAG, with an option to open the original source? That would be cool ❤
    Also, I'd love to see an example of LiveRAG that watches certain files or folders for changes, then rechunks, embeds, removes outdated entries, and saves diffs.
    What do you think about these?
    Thanks a lot!

    • @tarapogancev
      @tarapogancev 3 months ago

      If you are using Pinecone or similar vector database, along with the vector entry you can usually also add specific metadata. I mostly keep the original text stored within that vector as a 'content' metadata field, and then add other fields for the file's name, topic etc. :) This way, you can cross-reference your data for the users to navigate easily.
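The metadata approach described above might look like the following minimal sketch. The record shape (an id, the embedding values, and a free-form metadata dict) mirrors what Pinecone-style vector stores accept, but the store here is just an in-memory list, and field names like 'content' and 'filename' are the commenter's convention, not a required schema.

```python
# In-memory stand-in for a vector DB record; the same shape works with
# Pinecone-style stores, where each vector carries a metadata dict.
records = [
    {
        "id": "doc1-chunk0",
        "values": [0.1, 0.2, 0.3],  # embedding (toy values here)
        "metadata": {
            "content": "Semantic chunking groups related sentences.",
            "filename": "chunking_intro.pdf",
            "topic": "RAG preprocessing",
        },
    },
]

def cite(record):
    """Format a retrieved record as an answer snippet plus its source."""
    meta = record["metadata"]
    return f'"{meta["content"]}" (source: {meta["filename"]})'

print(cite(records[0]))
```

Because the original text and filename travel with the vector, the app can show the quoted passage and open the source file without a second lookup.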

    • @AGI-Bingo
      @AGI-Bingo 3 months ago

      Got it, so you could also add "filepath" and trigger opening the file. I wonder if there's a way to jump to and highlight a specific part of the text after opening (e.g. a PDF).
      Also, @@tarapogancev do you know of a way to run diffs on files and delete/re-upload all relevant chunks? Watching files and folders for changes, then triggering re-embedding, to keep everything automatically up to date. Thanks 🙏 👍

    • @tarapogancev
      @tarapogancev 3 months ago +1

      @@AGI-Bingo The idea of highlighting relevant text sounds great! I have yet to face the UI portion of this problem, trying to achieve similar results. :)
      I haven't worked with automatic syncs, but they would be very useful! So far, from what I've seen, AWS Knowledge Bases and Azure's AI Search (if I remember correctly) both offer options to sync data manually when needed. It's not as convenient, but I'm thinking it's not a bad solution either, considering it is possibly less work on the server side, and maybe fewer credits for OpenAI or other LLM services.
      Sorry I couldn't offer more help on this topic, but I hope you come up with a great solution! :D
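The diff-and-resync workflow discussed in this thread could be sketched as a content-hash check: only re-chunk and re-embed a file when its hash changes, deleting that file's stale chunks first. Everything here (the in-memory index, the helper names) is hypothetical scaffolding, not any library's API.

```python
import hashlib

def file_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

# toy "index": chunk_id -> (filename, chunk_text); hashes: filename -> hash
index, hashes = {}, {}

def sync_file(filename: str, text: str, chunk_size: int = 50):
    """Re-chunk (and, in a real system, re-embed) a file only when
    its content hash has changed since the last sync."""
    h = file_hash(text)
    if hashes.get(filename) == h:
        return False  # unchanged, nothing to do
    # drop stale chunks for this file, then re-add fresh ones
    for cid in [c for c, (f, _) in index.items() if f == filename]:
        del index[cid]
    words = text.split()
    for i, start in enumerate(range(0, len(words), chunk_size)):
        index[f"{filename}-{i}"] = (filename, " ".join(words[start:start + chunk_size]))
    hashes[filename] = h
    return True

sync_file("notes.txt", "alpha beta gamma")   # first upload → True
sync_file("notes.txt", "alpha beta gamma")   # unchanged → False
sync_file("notes.txt", "alpha beta DELTA")   # changed → re-chunked, True
```

A file watcher (e.g. OS-level change notifications) would simply call `sync_file` on each event; per-chunk hashing would make the diff finer-grained than the whole-file hash shown here.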

  • @hughesadam87
    @hughesadam87 2 months ago

    I've been using the unstructured tool to split my documents into known sections (i.e. title, abstract, paragraphs). Do you think having these sections a priori is helpful for chunking, or is it better to just feed plaintext to the chunking strategy and let it do all the grouping/separation?

  • @Piero-xi1yi
    @Piero-xi1yi 4 months ago +1

    Could you please explain the logic and concept of your code? How does this compare with the semantic_chunker from LangChain / LlamaIndex? (It uses something like your cumulative approach, with a sliding window of n sentences and an "adaptive" threshold based on percentile.)

  • @talesfromthetrailz
    @talesfromthetrailz 4 months ago

    How would you compare the Statistical chunker with the rolling window splitter you used for semantic chunking? Do you prefer one over the other? I'm designing a recommendation system that uses user queries to match to certain outputs they may want. Thanks!

    • @jamesbriggs
      @jamesbriggs 4 months ago +1

      The StatisticalChunker is actually just a more recent version of the rolling window splitter; it includes handling for larger documents and some other optimizations, so I'd recommend the statistical one.
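To illustrate what "statistical" means here: rather than a manually tuned similarity threshold, the threshold is derived from the data itself. The toy below (my own sketch of the idea, not the library's actual algorithm) splits wherever the similarity between consecutive sentence embeddings falls below the mean minus one standard deviation of all consecutive similarities.

```python
import statistics

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def statistical_split(embeddings):
    """Return split points where consecutive similarity drops below an
    automatic threshold: mean - 1 stdev of all consecutive similarities."""
    sims = [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
    threshold = statistics.mean(sims) - statistics.stdev(sims)
    # a split at index i+1 means sentences i and i+1 go to different chunks
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# toy sentence embeddings forming two clear topic clusters
embs = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1], [0.1, 0.9]]
print(statistical_split(embs))  # → [3]: split before the 4th sentence
```

Because the threshold adapts to each document's similarity distribution, there is no per-corpus tuning knob, which matches the "automatic parameter adjustments" selling point mentioned in the video.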

  • @maxlgemeinderat9202
    @maxlgemeinderat9202 4 months ago

    Nice video! So e.g. if I am reading in docs with unstructured.io, I can then use the semantic chunker instead of a RecursiveCharacterTextSplitter?

    • @jamesbriggs
      @jamesbriggs 4 months ago

      yes you can, there's an (old, I should update) example here github.com/aurelio-labs/semantic-router/blob/main/docs/examples/unstructured-element-splitter.ipynb
      ^ the "splitter" here is equivalent to the StatisticalChunker in semantic-chunkers

  • @ΛΑΦ
    @ΛΑΦ 4 months ago +1

    Can we use Ollama for the embedding?

  • @ariugarte
    @ariugarte 4 months ago

    Hello, it's a fantastic tool! But I encountered some problems with tables in PDFs and with strings that use characters such as '-' to separate phrases or sections. I end up with chunks that are much bigger than the maximum size.

  • @jamesbriggs
    @jamesbriggs 4 months ago

    📌 Code:
    github.com/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb
    ⭐ Article:
    www.aurelio.ai/learn/semantic-chunkers-intro

  • @looppp
    @looppp 4 months ago

    great video

  • @prasunkumar2106
    @prasunkumar2106 1 month ago

    How can I use llama3.1 to achieve this?

  • @CBCELIMUPORTALORG
    @CBCELIMUPORTALORG 4 months ago

    🎯 Key points for quick navigation:
    📘 The video introduces three semantic chunking methods for text data, improving retrieval-augmented generation (RAG) applications.
    💻 Demonstrates use of the "semantic chunkers library," showcasing practical examples via a Colab notebook, requiring OpenAI's API key.
    📊 Focuses on a dataset of AI archive papers, applying semantic chunking to manage the data's complexity and improve processing efficiency.
    🤖 Discusses the need for an embedding model to facilitate semantic chunking, highlighting OpenAI's Embedding Model as a primary tool.
    📈 Outlines the "statistical chunking method" as a recommended approach for its efficiency, cost-effectiveness, and automatic parameter adjustments.
    🔍 Explains "consecutive chunking" as being cost-effective and relatively fast, but requiring more manual input for tuning parameters.
    📝 Presents "cumulative chunking" as a method that builds embeddings progressively, offering noise resistance but at a higher computational cost.
    🌐 Notes the adaptability of chunking methods to different data modalities, with specific mention of their suitability for text and potential for video.
    Made with HARPA AI
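The "consecutive chunking" method from the summary above can be illustrated with a toy sketch: walk through the sentences and start a new chunk whenever similarity to the previous sentence drops below a manually chosen threshold (that manual tuning is exactly the drawback the summary notes). Word-overlap Jaccard similarity stands in for real embedding similarity; none of this is the library's actual code.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, a toy stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def consecutive_chunk(sentences, threshold=0.2):
    """Start a new chunk whenever similarity to the previous
    sentence drops below a manually tuned threshold."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) < threshold:
            chunks.append([cur])  # topic shift: start a new chunk
        else:
            chunks[-1].append(cur)
    return chunks

sents = [
    "Chunking splits documents into pieces",
    "Good chunking splits documents semantically",
    "Cats sleep most of the day",
]
# the two chunking sentences group together; the cat sentence gets its own chunk
print(consecutive_chunk(sents))
```

Swapping the fixed `threshold` for one computed from the similarity distribution is what turns this consecutive scheme into the statistical variant the video recommends.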

  • @lavamonkeymc
    @lavamonkeymc 3 months ago

    Where’s the advanced LangGraph video?

  • @kevinozero
    @kevinozero 1 month ago

    Very strange: this keeps breaking sentences up midway even though the sentence conveys one message, like a clause in a contract. Not impressed.