Lance is killing it with these videos. Keep it up!
I think this approach is very interesting, and was very well presented, thank you for the video.
One thing, though: this works with a "closed" context, where we know we will query ONLY these 31 pages, let's say.
If we are in an environment where this is dynamic, the clustering approach might not work so well.
When we add more documents, we would have to run the clustering again, not simply load the model and predict the cluster, because the new documents might contain completely new information. This becomes a problem when scaling up, both in terms of time spent and in the cost of running the summarization again.
That was my first thought. This approach seems to work well with static content, but what happens if I want to add new documents? It seems like you would need to rerun the entire process, which would get increasingly expensive over time.
Agreed, we need another paper on scalable RAPTOR ;)
I think that once you have a summary tree, adding new documents or modifying old ones can be done by comparing the new information with the existing clusters from root to leaf, and deciding whether a leaf should be created for new information or updated with changed information. The tricky part for me will be splitting the original document so I can find the differences when a document is expanded or duplicated.
That was so useful. Thanks! I'd love to see more advanced techniques like that.
Excellent approach and very well explained.
One challenge that comes to mind with this summarisation hierarchy is maintaining it as the source content changes or is revised. I am thinking of scenarios where there are hundreds of millions of documents to index.
You can use an LLM to regenerate the summaries of the documents and clusters each time an update is necessary.
Hilarious, I came up with this idea a few months ago for a project. It really makes me think I should just get into doing research in this field, since my ideas keep ending up as common concepts over the last few years. 😊 Such a cool field
This is great, long context is a tool for a specific use case. Until costs and latency with long context are the same as RAG, RAG will be what most apps use.
Fantastic video. Thanks heaps for the content. It really feels like you could present a series of these talks. I want to learn more about implementation of some of these ideas.
First, I want to mention I like your explanations/videos. Thanks for your great work.
On this occasion I was blocked (but I will solve that) because of the following:
1. Claude is not available in some regions (like mine, being Belgium) - I'm on the waiting list.
2. I tried with GPT4 as an alternative, but I forgot that you must put money on the account (I still have most of the $5 free test account, but that's limited to GPT 3.5).
F yes, it's Lance from LangChain again, it's going to be a good day.
Indeed
Thank you for your awesome presentation :)
thanks for sharing, love your videos
Some of the readers have commented that we need to run the entire clustering algorithm again if we get a new set of documents, or need it to be dynamic.
I DO NOT think we need to do this. Here is why.
Lance (the speaker) shows how the documents are clustered recursively until they reach n or a single cluster.
So let us say there are 10,000 clusters and the new documents impact only 4 of them [see 06:33, where he talks about the Gaussian Mixture model - AFAIK this means a point can belong to multiple clusters]. Then we have two cases:
1. No new clusters are created: only "those 4 clusters" have to be rebuilt, and the changes need to be propagated up through the chain to the root node, right? We continue to have 10,000 clusters.
2. Say it ends up expanding the number of clusters from 4 to 6; then only the impacted clusters have to be rebuilt from that point up to the root cluster. We will now have 10,002 clusters.
If this is true, we do not need to rebuild everything, only the clusters that get impacted. It's like rebalancing the tree.
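A minimal sketch of this "rebuild only the impacted clusters" idea, assuming you kept the fitted Gaussian Mixture model and the per-cluster membership lists from the original build. `embed`, `summarize`, and `rebuild_parents` are hypothetical stand-ins for your own embedding, summarization, and tree-update code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def incremental_update(gmm: GaussianMixture,
                       cluster_members: dict[int, list[str]],
                       new_chunks: list[str],
                       embed, summarize, rebuild_parents,
                       threshold: float = 0.1) -> set[int]:
    """Assign new chunks to existing clusters and re-summarize only those clusters."""
    new_vecs = np.array([embed(c) for c in new_chunks])
    probs = gmm.predict_proba(new_vecs)  # soft assignment: a chunk can land in several clusters
    affected = set()
    for chunk, p in zip(new_chunks, probs):
        for cluster_id in np.where(p >= threshold)[0]:
            cluster_members[int(cluster_id)].append(chunk)
            affected.add(int(cluster_id))
    # Re-summarize only the affected clusters and propagate the change up to the root,
    # like rebalancing a tree.
    for cluster_id in affected:
        new_summary = summarize(cluster_members[cluster_id])
        rebuild_parents(cluster_id, new_summary)
    return affected
```

This only works as long as the new documents fit the existing cluster structure reasonably well; if they introduce genuinely new topics, a periodic full rebuild would still be needed.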
Good point! I think it would be useful to attach metadata to the summarized clusters so that when new documents are uploaded only the relevant clusters are retrieved and re-indexed.
Having said that, I do have my doubts about the efficiency of this system. It only works if the data going in is already pre-processed correctly, although one could say this applies to all RAG solutions.
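A minimal sketch of attaching that metadata, assuming `cluster_id`, `member_ids`, and `summary_text` come from your own clustering/summarization step; only the `Document` class from langchain_core is a real API here:

```python
from langchain_core.documents import Document

def make_summary_doc(cluster_id: int, member_ids: list[str],
                     summary_text: str, level: int) -> Document:
    """Wrap a cluster summary with enough metadata to find and re-index it later."""
    return Document(
        page_content=summary_text,
        metadata={
            "cluster_id": cluster_id,
            "level": level,                      # 0 = raw chunks, 1+ = summary layers
            "member_ids": ",".join(member_ids),  # joined, since many vector stores only accept scalar metadata
        },
    )
```

When a new document arrives, you can look up which summary docs reference the chunks it overlaps with and re-summarize only those clusters.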
An enhancement here would be to have it expand the summarised nodes into the original nodes.
This approach and implementation are amazing for alleviating the 3 issues you mentioned, thanks! One query though: have you checked the accuracy of the output against putting the entire content into a single prompt in a large-context LLM?
One key question with your approach is how to define the summary so that it offers adequate information for RAG. If the summary does not include some minor information points, it would be impossible for RAG to identify the document as relevant based solely on the summary. Moreover, if the document itself contains too much scattered information and is hard to summarize, the approach would run into many issues. I do believe in using this approach for many docs, but it does have some prerequisites...
I think we should shift from summaries to abstract summaries, making them more conceptual and higher level. Then, before sending a search request, the LLM should (re)formulate the question so that it is compatible with the abstract summaries, then search, then find the real texts based on the abstract summaries it found.
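A minimal sketch of that flow, assuming a chat model (the model name is just an example), any LangChain vector store over the abstract summaries, and a hypothetical `doc_lookup` dict that maps a summary's `cluster_id` metadata back to the real texts:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def abstract_search(question: str, summary_store, doc_lookup, k: int = 3):
    # 1. Reformulate the question at the same conceptual level as the abstract summaries.
    abstract_q = llm.invoke(
        "Rewrite this question as a short, high-level conceptual query, "
        "dropping specific names and numbers:\n" + question
    ).content
    # 2. Search the abstract-summary index.
    hits = summary_store.similarity_search(abstract_q, k=k)
    # 3. Map each matched summary back to the real underlying texts.
    return [doc_lookup[h.metadata["cluster_id"]] for h in hits]
```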
Hey, I've got an issue: what if the sum of the cluster documents exceeds the maximum token limit of the summary chain?
This is a more comprehensive, scalable RAG approach.
Indeed an interesting approach that is not limited by the context length of the LLM. I have some remarks:
a) Isn't choosing the threshold essentially the same as choosing the K parameter of KNN (couldn't a Kohonen map be used, since it's also unsupervised clustering...?)
b) Don't you get a performance impact retrieving both from the long embedded texts and from the summarization clusters?
c) Already pointed out in some of the comments: how do you update efficiently when adding new docs? (Of course you can, for example, use a copy of the vectorstore, do the update, and switch over when done.)
d) Have you tested the results of the "standard" method without summarization versus this "RAPTOR" method, timing the inference of both?
By the way: using long context is NOT very cost effective if you are using the big commercial AI companies.
Anyway, thank you for the high level explanation.
If a higher-level summary is being used as the context during generation, how would one go about providing references? Especially in use cases where answers have to be 100% factual and references are necessary for transparency. Thanks!
I think the solution to the last part, not exceeding the token limit, could be this:
If we know that the first document is very large, we could embed only this whole document and add an ID in its metadata, then do the similarity search in another vector database and retrieve the documents by that ID.
I am not sure, but I think that could solve the problem.
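A minimal sketch of that two-store idea, embedding one representation per large document and joining back to the full text by ID. Chroma and OpenAIEmbeddings are just example choices, `large_documents` is an assumed list of big source texts, and `full_docs` is a hypothetical plain dict standing in for the second database:

```python
import uuid
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

full_docs = {}     # doc_id -> full original text (the "other" store)
index_docs = []
for text in large_documents:
    doc_id = str(uuid.uuid4())
    full_docs[doc_id] = text
    # Assumes the full text still fits the embedding model's input limit,
    # even if it does not fit the summarization chain.
    index_docs.append(Document(page_content=text, metadata={"doc_id": doc_id}))

vectorstore = Chroma.from_documents(index_docs, OpenAIEmbeddings())

def retrieve_full(query: str, k: int = 2) -> list[str]:
    """Similarity search over the embedded docs, then fetch full texts by ID."""
    hits = vectorstore.similarity_search(query, k=k)
    return [full_docs[h.metadata["doc_id"]] for h in hits]
```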
Will you make videos about RAG with PDFs (containing not only text but also tables and images)? That would be a very helpful video for me. Thank you for the great work!
So in the example you add a batch of 30 pages and they get clustered and summarized. What happens when you add another batch, or even just one extra doc? Is it added to an existing cluster and summary, or does it become a new cluster summary?
Interesting idea; however, if you retrieve from an intermediate summary, would it still be possible to provide citations to the original documents? Citations are key for most production-level deployments.
How is running all your context through an LLM in "chunks" cheaper than throwing it all in one chunk? I think this approach is not viable for most people, since it requires passing ALL of the context through an LLM, either by adding it as context or by passing it through the summary prompt. Opinions?
Great stuff
Awesome walkthrough, going to give it a try.
One thing this approach seems to lack is the ability to include metadata (e.g. source) on the summarizations. Has anyone found a solution to this?
The paper doesn't share the experimental script. I don't know if there's an expert who can share what they wrote; I have never been able to reproduce the results in the paper.
You are a great one, champ.
7:56 What does it mean to "embed" the document?
The content is great, but the audio has a lot of echo. If you use a headset with the mic positioned below the chin to avoid plosive pop sounds, it will greatly improve the audio quality.
Still, "k" problem haven't gone anywhere. 😅
Y'know what works better than all of this? Something we've done for centuries. Versioning the model itself in a server cache as an instance the model can prompt, using the exact same method for every instance until it finds the model that holds the summary.
Please elaborate
I believe he’s trying to make a joke along the lines of “just fine tune the model bro lol”. Which is, of course, useless advice. Impossible for most valid use cases (using e.g. GPT4 / Claude 3). Impractical for the less popular ones (prohibitively expensive for anything above a 14B). His writing style is pretty schizo though so I’m giving him the benefit of the doubt by assuming he was actually trying to provide some kind of constructive feedback or suggestion rather than going on a free association word rant. He’s not describing fine-tuning but is vaguely in the neighborhood with that nonsense.
Can this approach solve a multi-hop question? I should try it myself. Thank you for a great video.
I was wondering the same.