Building long context RAG with RAPTOR from scratch

  • Published 6 Jun 2024
  • The rise of long-context LLMs and embeddings will change RAG pipeline design. Instead of splitting docs and indexing doc chunks, it will become feasible to index full documents. RAG approaches will need to flexibly answer lower-level questions from single documents or higher-level questions that require information across many documents.
    RAPTOR (Sarthi et al.) is one approach to tackle this by building a tree of document summaries: docs are clustered, and clusters are summarized to capture higher-level information across similar docs.
    This is repeated recursively, resulting in a tree of summaries, from individual docs as leaves, to intermediate summaries of related docs, to high-level summaries of the full doc collection.
    In this video, we build RAPTOR from scratch and test it on 33 web pages (each ranging from 2k to 12k tokens) of LangChain docs, using the recently released Claude 3 model from Anthropic to build the summarization tree. The pages and tree of summaries are indexed together for RAG with Claude 3, enabling QA on lower-level questions or higher-level concepts (captured in summaries that span related pages).
    This idea can scale to large collections of documents or to documents of arbitrary size (up to the embedding / LLM context window).
    Code:
    github.com/langchain-ai/langc...
    Paper:
    arxiv.org/abs/2401.18059
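
To make the clustering-and-summarization loop described above concrete, here is a minimal sketch of the tree build. It assumes two hypothetical helpers that are not part of the video's code: embed(texts) returns one vector per text, and summarize(texts) asks an LLM (e.g. Claude 3) for a single summary of a group of texts. The actual RAPTOR implementation additionally reduces dimensionality with UMAP and picks the number of clusters automatically; this sketch fixes a small cluster count instead.

```python
# Minimal sketch of a RAPTOR-style tree build (not the exact notebook code).
# Assumed hypothetical helpers: embed(texts) -> list of vectors,
# summarize(texts) -> one summary string produced by an LLM.
import numpy as np
from sklearn.mixture import GaussianMixture

def build_raptor_texts(docs, embed, summarize, max_levels=3, n_clusters=5):
    """Recursively cluster texts and summarize each cluster, collecting every
    level (leaf docs plus intermediate and root summaries) for indexing."""
    all_texts = list(docs)                    # level 0: the raw pages (leaves)
    current = list(docs)
    for _ in range(max_levels):
        if len(current) <= 1:                 # reached the root of the tree
            break
        vectors = np.array(embed(current))
        k = min(n_clusters, len(current))
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(vectors)
        summaries = []
        for c in range(k):
            members = [t for t, lbl in zip(current, labels) if lbl == c]
            if members:
                summaries.append(summarize(members))
        all_texts.extend(summaries)           # summaries are indexed with the docs
        current = summaries                   # the next level clusters the summaries
    return all_texts                          # embed and index all of these for RAG
```

The returned list (leaf pages plus every summary level) is what gets embedded and indexed together for retrieval.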

COMMENTS • 38

  • @maxi-g • 2 months ago • +3

    Lance is killing it with these videos. Keep it up!

  • @danielschoenbohm • 3 months ago • +6

    That was so useful. Thanks! I'd love to see more advanced techniques like that.

  • @cnmoro55 • 3 months ago • +17

    I think this approach is very interesting, and it was very well presented; thank you for the video.
    One caveat: this works when we have a "closed" context, where we know we will query ONLY these 31 pages, let's say.
    If we are in an environment where this is dynamic, the clustering approach might not work so well.
    When we add more documents, we would have to run the clustering again rather than simply load the model and predict the cluster, because the new documents might contain completely new information. This becomes a problem when scaling up, both in the time spent and in the cost of running the summarization again.

    • @seanpitcher1102 • 2 months ago • +2

      That was my first thought. This approach seems to work well with static content, but what happens if I want to add new documents? It seems like you would need to rerun the entire process, which will get increasingly expensive over time.

    • @alchemication • 1 month ago

      Agreed, we need another paper on scalable RAPTOR ;)

  • @SimonMariusGalyan • 9 days ago

    Thank you for your awesome presentation :)

  • @MrPlatinum148 • 2 months ago

    Fantastic video. Thanks heaps for the content. It really feels like you could present a series of these talks. I want to learn more about implementation of some of these ideas.

  • @Novacasa88 • 3 months ago

    Hilarious, I just came up with this idea a few months ago for a project. It really makes me think I should get into doing research in this field, since over the last few years my ideas seem to keep becoming common concepts. 😊 Such a cool field

  • @johnnydubrovnic • 2 months ago • +4

    Excellent approach and very well explained.
    One challenge that comes to mind with this summarisation hierarchy is maintaining it as the source content changes or is revised. I am thinking of scenarios where there are hundreds of millions of documents to index.

  • @isa-bv481 • 3 months ago

    First, I want to mention I like your explanations/videos. Thanks for your great work.
    On this occasion I got blocked (but I will solve it) because of the following:
    1. Claude is not available in some regions (like mine, Belgium) - I'm on the waiting list.
    2. I tried GPT-4 as an alternative, but I forgot that you must put money on the account (I still have most of the $5 free test credit, but that's limited to GPT-3.5).

  • @paraconscious790 • 3 months ago

    This approach and implementation are amazing for alleviating the 3 issues you mentioned, thanks! One question though: have you checked the accuracy of the output against putting the entire content into a single prompt of a long-context LLM?

  • @JonWillis9 • 3 months ago • +5

    F yes, it's Lance from LangChain again; it is going to be a good day.

  • @f2f4ff6f8f0 • 3 months ago

    Great stuff

  • @8eck • 1 month ago

    Anyway, thank you for the high level explanation.

  • @gowtham-user2834 • 3 months ago

    You are a great one, champ

  • @jaysonp9426 • 3 months ago

    This is great, long context is a tool for a specific use case. Until costs and latency with long context are the same as RAG, RAG will be what most apps use.

  • @henkhbit5748 • 3 months ago • +1

    Indeed an interesting approach that is not limited by the context length of the LLM. I have some remarks:
    a) Is choosing the threshold not the same as choosing the K parameter of KNN? (Can a Kohonen map not be used? It's also unsupervised clustering...)
    b) Don't you see a performance impact when retrieving from both the long embedded texts and the summarization clusters?
    c) As already pointed out in some of the comments: how do you update efficiently when adding new docs? (Of course you can, for example, use a copy of the vector store, do the update, and switch over when done.)
    d) Have you compared the results of the "standard" method without summarization against this "RAPTOR" method, and timed the inference of both?
    Btw: using long context is NOT very cost effective if you are using the big commercial AI companies.

  • @mr_adisa • 3 months ago

    Awesome walkthrough, going to give it a try.
    One thing this approach seems to lack is the ability to include metadata (e.g. source) on the summarizations. Has anyone found a solution to this?

  • @bertobertoberto3 • 2 months ago

    Interesting idea. However, if you retrieve from an intermediate summary, would it still be possible to cite the original documents? Citations are key for most production-level deployments.

  • @jeffsteyn7174 • 3 months ago • +1

    So in the example you add a batch of 30 pages and they get clustered and summarized. What happens when you add another batch, or even just one extra doc? Is it added to an existing cluster and summary, or does it become a new cluster and summary?

  • @anhvunguyen7935 • 2 months ago

    Will you make videos about RAG over PDFs (containing not only text but also tables and images)? That would be very helpful for me. Thank you for the great work!

  • @insitegd7483 • 3 months ago

    I think the solution in the last part, to avoid exceeding the token limit, could be this:
    If we know that the first document is very large, we could embed only that whole document and add an ID in its metadata, then do the similarity search in another vector database, retrieving the documents by the ID.
    I am not sure, but I think that could solve the problem.
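
A rough sketch of that idea, assuming LangChain-style components (Chroma, OpenAIEmbeddings) and a hypothetical summarize helper, none of which are from the video: only a short text per document goes into the vector store, tagged with a doc ID, and the full (possibly very large) document is fetched from a separate plain dictionary by that ID.

```python
# Hypothetical sketch: index one small text per document, keep the full
# documents in a separate store keyed by ID. `docs` and `summarize` are assumed.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

full_docs = {}                                    # doc_id -> full document text
texts, metadatas = [], []
for i, doc in enumerate(docs):
    doc_id = f"doc-{i}"
    full_docs[doc_id] = doc
    texts.append(summarize(doc))                  # or the doc itself if it fits
    metadatas.append({"doc_id": doc_id})

vectorstore = Chroma.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas)

hits = vectorstore.similarity_search("question about the docs", k=3)
context = [full_docs[h.metadata["doc_id"]] for h in hits]   # full docs by ID
```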

  • @perrygoldman612 • 3 months ago • +1

    One key question for your approach is how to define the summary so that it offers adequate information for RAG. If the summary does not include some minor information points, it would be impossible for RAG to identify the document as relevant based solely on the summary. Moreover, if the document itself contains too much scattered information and is hard to summarize, the approach would run into many issues. I do believe in using this approach for many docs, but it does have some prerequisites.

    • @easvidi6325 • 2 months ago • +2

      I think we should shift from summaries to abstract summaries, making them more conceptual and higher level. Then, before sending a search request, the LLM should (re)formulate the question so that it is compatible with the abstract summaries, then search, then find the real texts based on the matched abstract summaries.
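
A rough sketch of that two-step flow, assuming a LangChain-style chat model and vector store; the summary_id metadata key and the sources_by_summary_id mapping are hypothetical names, not from the video.

```python
# Hypothetical sketch of: rewrite the question abstractly, search the abstract
# summaries, then answer over the underlying source texts.
def answer_via_abstract_summaries(question, llm, summary_store, sources_by_summary_id):
    # 1) Rewrite the question at a more abstract, conceptual level.
    abstract_q = llm.invoke(
        f"Rephrase this question at a higher, more conceptual level: {question}"
    ).content
    # 2) Search the index of abstract summaries with the rewritten question.
    hits = summary_store.similarity_search(abstract_q, k=3)
    # 3) Map the matched summaries back to the underlying source texts.
    context = "\n\n".join(
        text
        for hit in hits
        for text in sources_by_summary_id[hit.metadata["summary_id"]]
    )
    # 4) Answer the original question over the recovered source texts.
    return llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {question}").content
```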

  • @HealthLucid • 1 month ago

    Some of the readers have commented that we need to run the entire clustering algorithm again if we get a new set of documents, or if we need it to be dynamic.
    I do NOT think we need to do this. Here is why:
    Lance (the speaker) shows how the documents are clustered recursively until reaching n or a single cluster.
    So let us say there are 10,000 clusters and the new documents impact only 4 clusters [see at 06:33, where he talks about the Gaussian Mixture model (AFAIK, this means a point can belong to multiple clusters)]. Then we have two cases:
    1. No new clusters are created: only those 4 clusters have to be rebuilt, and their changes need to be propagated up through the chain to the root node, right? We continue to have 10,000 clusters.
    2. Say it ends up expanding the number of clusters from 4 to 6: then only the impacted clusters have to be rebuilt from that point up to the root cluster. We will now have 10,002 clusters.
    If this is true, we do not need to rebuild everything, only the clusters that get impacted. It's like rebalancing the tree.
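
A rough sketch of case 1 above (no new clusters), assuming the fitted per-level GaussianMixture is kept around; this is a hypothetical extension, not something shown in the video or the paper. embed and summarize are the same assumed helpers as in the earlier sketch.

```python
# Hypothetical incremental update: assign new docs to existing clusters and
# re-summarize only the clusters they land in; the new-cluster case is omitted.
import numpy as np

def incremental_update(new_docs, embed, summarize, gmm, members, summaries):
    """members / summaries map cluster id -> member texts / cluster summary."""
    vectors = np.array(embed(new_docs))
    labels = gmm.predict(vectors)          # assign to existing clusters only
    for doc, lbl in zip(new_docs, labels):
        members[lbl].append(doc)
    touched = set(labels.tolist())
    for lbl in touched:                    # rebuild just the impacted summaries
        summaries[lbl] = summarize(members[lbl])
    return touched                         # re-embed these and propagate the change up a level
```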

  • @YueNan79 • 1 month ago

    Hey, I've got an issue: what if the combined documents of a cluster exceed the maximum tokens of the summary chain?

  • @HashimWarren • 1 month ago

    7:56 What does it mean to "embed" the document?

  • @MattJonesYT • 3 months ago • +3

    The content is great, but the audio has a lot of echo. If you use a headset with the mic positioned below the chin to avoid plosive pop sounds, it will greatly improve the audio quality.

  • @byrondelgado • 2 months ago

    This is a more comprehensive, scalable RAG approach.

  • @dejoma. • 2 months ago

    How is running all your context through an LLM in "chunks" cheaper than throwing it all in one chunk? I think this approach is not viable for most people since it requires passing ALL the context through an LLM, either by adding it as context or by passing it through the summary prompt. Opinions?

  • @maskedvillainai • 2 months ago • +1

    Y’know what works better than all of this? Something we’ve done for centuries. Versioning the model itself in a server cache as an instance the model can prompt - using the exact same method for every instance until it finds the model that holds the summary .

    • @easvidi6325 • 2 months ago • +1

      Please elaborate

    • @peterwlodarczyk3987 • 2 months ago

      I believe he's trying to make a joke along the lines of "just fine tune the model bro lol". Which is, of course, useless advice. Impossible for most valid use cases (using e.g. GPT-4 / Claude 3). Impractical for the less popular ones (prohibitively expensive for anything above a 14B). His writing style is pretty schizo though, so I'm giving him the benefit of the doubt by assuming he was actually trying to provide some kind of constructive feedback or suggestion rather than going on a free-association word rant. He's not describing fine-tuning, but is vaguely in the neighborhood with that nonsense.

  • @8eck • 1 month ago

    Still, "k" problem haven't gone anywhere. 😅

  • @nogool111 • 3 months ago

    Can this approach solve a multi-hop question? I should try it myself. Thank you for a great video.