James Briggs is one of my favorites, and I believe I am a "Patreon" member - I spend hundreds of hours listening to about 10 podcasts, studying Large Language Models, Machine Learning and so-called "AI". James Briggs breaks things down into easier-to-understand concepts. Thank you, James Briggs
hey that's awesome, I really appreciate the support!
At this point, you are practically Captain Chunk.
Thank you! I’ve been doing this for a while, but did not have a good name for it.
Need a video on cross-chunk attention. Wasn't attention all about key, query and value anyway?
Best I have seen so far for understanding the core concepts of chunking, thanks
glad it was helpful :)
Excellent content and explanation, especially the core chunking concepts and challenges. Keep up your work, it's so precious for learning 👍
Glad to hear it helps
Great video. Thanks for posting. I have been thinking of document chunking but using the LLM itself via prompting + k-shot. The approach you show will be cheaper of course but curious to see how these two approaches will compare in terms of any relevant non-cost metrics.
We use a simple combination of Microsoft's Document Intelligence with markdown output and a simple markdown splitter. The improvement is noticeable although the Document Intelligence models do come at an additional cost.
yeah it depends on what you need of course - I'm mostly interested in further abstraction and more analytical methods for chunking, not for where it is now, but for where this type of experimentation might lead in the future - I could see a few more iterations and improvements making more intelligent doc parsing and chunking increasingly performant - but we'll see
Do you have a link for this markdown processing? :)
We are using Document Intelligence as well, but not for layout analysis, yet.
@@alivecoding4995 you can also use LayoutPDFReader from llmsherpa
Another killer video. Great work!
King of Chunk
a title I have always wanted
Love all your content sir!
Hi James, would you please tell me how you would tackle this one...
How would you design a real-time updating RAG system? For example, let's say our clients updated some details in some watched doc; I want the old chunks to be removed and the doc rechunked automatically. Have you seen such a pipeline existing already? No one seems to cover this, and I think it is what sets apart fun projects from actual production systems. Thanks and all the best! Love your channel ❤
I have achieved this for one of the sources in my RAG bot. It has an API provided to access the data, so I run the embedding script on the delta changes.
@@shameekm2146 amazing, would you please opensource it so we can all improve the pipeline as a community? 🌈
RAG does not mean you "have to use vector embeddings and Vector DB". If you can run APIs to fetch relevant info, it should be good enough. Use function/tool calling to call the API.
Otherwise, if you are planning to watch some doc live, you need a pipeline like the following (this will only work if you are making small, frequent changes to the doc):
Doc changed -> Webhook/trigger to your system -> If a diff is available, use it; if not, compute the diff between the old and new doc
-> Take nearby text as a sample and compute embeddings -> Fetch the top N nearby docs from the vector DB (hybrid search will work really well; tune the sparse vector weighting higher than in normal RAG here) -> Ask an LLM agent to mark the relevant chunks, or use a reranking model (old chunk as query) -> Delete these old chunks from the vector DB -> Compute embeddings for the new changes -> Upsert the new vectors into the DB.
There are tons of edge cases that you will run into when running this pipeline and they always vary for each use case, so you will have to consider those accordingly.
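As a rough illustration of the delete-then-upsert step described above, here is a minimal sketch. It is a simplified variant, not the commenter's exact pipeline: instead of locating old chunks with hybrid search and a reranker, it assumes you can rechunk the whole document and compare content hashes. The in-memory `index` dict, `sync_document`, `chunk_id`, and `toy_embed` are all hypothetical names standing in for a real vector DB client and encoder.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def chunk_id(doc_id: str, text: str) -> str:
    """Stable ID derived from the document ID and the chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{doc_id}:{digest}"


def sync_document(
    index: Dict[str, dict],
    doc_id: str,
    new_chunks: List[str],
    embed: Callable[[str], Vector],
) -> Dict[str, int]:
    """Delete stale chunks for doc_id and embed/upsert only the changed ones."""
    old_ids = {cid for cid, rec in index.items() if rec["doc_id"] == doc_id}
    new_by_id = {chunk_id(doc_id, text): text for text in new_chunks}
    new_ids = set(new_by_id)

    stale = old_ids - new_ids   # chunks that no longer exist in the doc
    added = new_ids - old_ids   # chunks that are new or edited

    for cid in stale:
        del index[cid]
    for cid in added:
        text = new_by_id[cid]
        index[cid] = {"doc_id": doc_id, "text": text, "vector": embed(text)}
    return {"deleted": len(stale), "upserted": len(added)}


def toy_embed(text: str) -> Vector:
    """Placeholder encoder (hypothetical); swap in a real embedding model."""
    return [float(len(text)), float(text.count(" "))]
```

Calling `sync_document` from the webhook handler means unchanged chunks are never re-embedded, which keeps the cost proportional to the size of the edit. It does assume the chunker is deterministic; with a semantic chunker, a small edit can shift neighbouring chunk boundaries, which is one of the edge cases mentioned above.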
People since GPT2: Simply ask an LLM recursively to please insert “{split}“ where a topic change etc happens according to a summary of prior text. Get embeddings. Use to separate and group.
2024: We would like to introduce a novel concept called Semantic Chunking with a sliding Context……..
Beginners must be truly lost 😮💨
@James Briggs Newbie here, was wondering if it was necessary to store the chunk with the vector, it seems like a lot of data duplication and a good way to fill your disk. I like the idea of storing the title, I was thinking about storing the document path and filename also. I haven't been able to find good info about what data besides vectors is also kept in the vector db. I understand that the vectors need to correlate to data, I just don't understand what data is actually represented in the vectors. If you just have an ID and the vectors, can't that ID point back to the document with the content?
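One way to picture the "ID points back to the document" idea from this question is the sketch below. It assumes a layout where the vector index holds only IDs and vectors, and a separate lookup table maps each ID to a file path and character offsets. The names `ChunkRef`, `top_ids`, and `resolve` are hypothetical, and the `documents` dict stands in for reading files from disk. The vectors themselves cannot be decoded back into text, so something has to hold the content or a pointer to it.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChunkRef:
    """Pointer from a chunk ID back to its source text (no text stored)."""
    path: str
    title: str
    start: int  # character offset where the chunk begins
    end: int    # character offset where the chunk ends (exclusive)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_ids(query_vec: List[float], vectors: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the k chunk IDs whose vectors are most similar to the query."""
    ranked = sorted(vectors, key=lambda cid: cosine(query_vec, vectors[cid]), reverse=True)
    return ranked[:k]


def resolve(cid: str, refs: Dict[str, ChunkRef], documents: Dict[str, str]) -> str:
    """Fetch the chunk text from the source document using the stored offsets."""
    ref = refs[cid]
    return documents[ref.path][ref.start:ref.end]
```

The trade-off is an extra lookup per retrieved chunk, plus the need to keep the offsets valid whenever the source document changes, in exchange for not duplicating the text inside the vector DB.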
Thank you so much for this. Will test it out on the RAG flow in the company.
welcome, would love to hear how it goes
At ~4:40, you mention that you should use the same encoder for the chunking and the encoding. Why? A chunk captures a "single meaning", so why would it matter that the same encoder is used? If you look at the chunking as a clustering algorithm that creates meaningful chunks, then what does it matter that the encoders match? What am I missing?
good point - yes they are capturing the "single meaning" and that single meaning will (hopefully) overlap a lot, but embedding models are not perfect and so they will not align perfectly with each other. Similar to if someone asked you and me to chunk an article: we'd likely overlap for the majority of the article, but I'm sure there would be differences
Loved this explanation
Can you introduce some articles related to this topic? Thanks!
Do you think this causes the results to be biased towards smaller chunks? The user will probably query with no more than 10 words, so the most semantically similar results may also be only 10 words, and the chunks that are 400 tokens wouldn't have as high a score unless you provide more context to the query?
Using a high-end LLM like GPT-4, Opus, or Gemini Ultra or Pro might be effective for performing semantic chunking. Google's large context window seems suitable for chunking large files. We need to introduce LLMs into automating the RAG stack
Yeah I’d like to introduce an LLM chunker and see how they compare
@@jamesbriggs better than any non-LLM chunker... if we aim to empower users with AI, why not empower the developer? chunking is not easy
"Grab complete thoughts" is an obvious good and expensive thing. Except for tables, for instance.
yeah tables need to be handled differently - doable if you are identifying text vs. table elements in your processing pipeline
I see the #abstract is also with the #title. Ideally both should be in different chunks so that the LLM can better understand the semantics.
How does Parent Document RAG fit in with your new techniques?
Can this be used to create chunks for creating a training dataset as well? It would be great to chunk a document into 'statements' and use those statements for a dataset. In essence have an LLM create questions for each of those statements and use those pairs for training. Could you make a video to show how that works?
What's a good way to use the metadata for retrieval and ranking of the chunks?
Amazing video, thank you so much!!
Hi James. Excuse me, maybe I missed it, but how do you handle the situation where we lose the page numbers for chunks when using semantic chunking? Is it possible to get them using this package?
And how can I use big models from Hugging Face? I can't load them into memory because many of them are bigger than 15 GB, and some of them are 130 GB+. Any thoughts?
I have a question: I have a document with many references from one page to another. Should I group all the referenced data into the same chunk? (For example, if the first page references page 3, should I get the data from both pages and store it in a single chunk?) Is this the only way, or are there any special models for this? Otherwise, please give some ideas
Hmm that's an issue that you could solve in the retrieval stage, not chunking... When you retrieve a chunk you can check with a fast LLM whether it has references to another one, and fetch those as well
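A minimal sketch of that retrieval-stage idea is below. It assumes each chunk was stored with a `page` field in its metadata (which also addresses the page-number question above), and it swaps the suggested fast LLM check for a plain regex that looks for "page N" mentions. The names `referenced_pages` and `expand_with_references` are hypothetical, as is the chunk dict layout.

```python
import re
from typing import Dict, List

# Stand-in for the LLM check: matches explicit mentions like "page 3".
PAGE_REF = re.compile(r"\bpage\s+(\d+)\b", re.IGNORECASE)


def referenced_pages(text: str) -> List[int]:
    """Return the sorted, de-duplicated page numbers mentioned in the text."""
    return sorted({int(n) for n in PAGE_REF.findall(text)})


def expand_with_references(
    chunk: dict, chunks_by_page: Dict[int, List[dict]]
) -> List[dict]:
    """Return the retrieved chunk plus all chunks from pages it references."""
    expanded = [chunk]
    for page in referenced_pages(chunk["text"]):
        if page == chunk["page"]:
            continue  # ignore self-references
        expanded.extend(chunks_by_page.get(page, []))
    return expanded
```

A regex only catches explicit "page N" wording; references like "see the appendix" or "as discussed earlier" are where an LLM check would likely do better.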
Just what I was trying to learn... awesome mate, thanks
Nice np
Hi James, do you think that the chunking and embedding process in RAG will become unnecessary in the near future, as the input token length is no longer a limitation?
I don’t think the input token length will become unlimited any time soon - but for smaller use cases (fitting within Anthropic limits) where latency and token cost are not important then you can use a pure LLM solution rather than RAG
Great material. 🙏
Dude already embedded whole documents of texts into PC haha would've helped a month ago. But awesome thanks for this! 🤘🏾
Maybe for the next project 😅
@@jamesbriggs quick question man. Is the objective of semantic chunking to achieve broader search results? Or to decrease query times? I'm thinking of it in terms of medium-sized text docs, for example movie summaries and such. Thanks!
Does this library also support Ollama, Gemini or HF encoders, or is it only for ChatGPT?
it supports these encoders github.com/aurelio-labs/semantic-router/tree/main/semantic_router/encoders
Unfortunately, the semantic router has removed this feature, or refactored it in some way.
hey yes they were deprecated in favour of this ua-cam.com/video/7JS0pqXvha8/v-deo.html
I am facing this problem in my Jupyter notebook, please help:
2024-05-10 10:59:50 WARNING semantic_router.utils.logger Retrying in 2 seconds...
Why not chunk based on paragraphs, lists, and tables?
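For comparison, structure-based chunking of this kind can be sketched in a few lines. The example below assumes markdown-like input where blocks are separated by blank lines, list items start with `-`, `*`, `+` or a number, and table rows start with `|`. The function names are hypothetical, and real documents (PDFs especially) would need a layout parser first.

```python
import re
from typing import List, Tuple

LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")


def classify_block(block: str) -> str:
    """Label a block as 'table', 'list', or 'paragraph' from its line prefixes."""
    lines = block.splitlines()
    if all(line.lstrip().startswith("|") for line in lines):
        return "table"
    if all(LIST_ITEM.match(line) for line in lines):
        return "list"
    return "paragraph"


def structural_chunks(text: str) -> List[Tuple[str, str]]:
    """Split on blank lines and return (kind, block) pairs in document order."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    return [(classify_block(b), b) for b in blocks]
```

This keeps lists and tables intact, but it cannot merge several short paragraphs about one topic or split one very long paragraph, which is the gap semantic chunking tries to fill.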
My son just asked if you were the Rock
I hope you said yes
"What is the title of the document?" -> 99% of RAG pipelines fail, because there is no answer in the document as it is embedded.
in that case we can try including the title in our chunk, and possibly consider different routing logic for this type of query - something that triggers when a user asks for metadata about a received document: we trigger a function that identifies the document ID in previously retrieved contexts, and uses that to pull in the document metadata for the answer to be generated by the LLM
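The two suggestions in this reply could look roughly like the sketch below. It is only an illustration under stated assumptions: the keyword check in `is_metadata_query` is a crude stand-in for real routing logic, and `build_chunk`, `metadata_context`, the `METADATA_HINTS` tuple, and the dict layouts are all hypothetical.

```python
from typing import Dict, List

# Hypothetical trigger words; a real router would use semantic matching.
METADATA_HINTS = ("title", "author", "published")


def build_chunk(doc_id: str, title: str, text: str) -> dict:
    """Prefix the chunk text with the document title before embedding it."""
    return {"doc_id": doc_id, "text": f"{title}\n\n{text}"}


def is_metadata_query(query: str) -> bool:
    """Crude routing check: does the query ask about document metadata?"""
    q = query.lower()
    return any(hint in q for hint in METADATA_HINTS)


def metadata_context(
    query: str, retrieved: List[dict], metadata: Dict[str, dict]
) -> List[dict]:
    """For metadata queries, look up metadata for the retrieved document IDs."""
    if not is_metadata_query(query):
        return []
    doc_ids: List[str] = []
    for chunk in retrieved:
        if chunk["doc_id"] not in doc_ids:
            doc_ids.append(chunk["doc_id"])  # keep retrieval order, no repeats
    return [metadata[d] for d in doc_ids if d in metadata]
```

The returned metadata would then be added to the prompt alongside the retrieved chunks, so the LLM can answer "what is the title?" even when the chunk text alone does not say.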
Using the semantic chunker is giving this error even though I'm not using cohere:
cannot import name 'EmbedResponse_EmbeddingsByType' from 'cohere.types.embed_response'
How to solve it? I have already wasted one day on it... this is so annoying... please help :)
cohere did a surprise SDK update and they are a default package in the library (we may change this) - try doing a `pip install -qU semantic-chunkers semantic-router==0.68`
more info here if needed github.com/aurelio-labs/semantic-router/issues/422