What's the Best Chunk Size for LLM Embeddings?

  • Published 10 Feb 2025
  • When working with embeddings, one of the challenging decisions is how big the chunks you embed should be. In this video I look at the question and discover some surprising conclusions.
    My Links 🔗
    👉🏻 Subscribe (free): / technovangelist
    👉🏻 Join and Support: / @technovangelist
    👉🏻 Newsletter: technovangelis...
    👉🏻 Twitter: / technovangelist
    👉🏻 Discord: / discord
    👉🏻 Patreon: / technovangelist
    👉🏻 Instagram: / technovangelist
    👉🏻 Threads: www.threads.ne...
    👉🏻 LinkedIn: / technovangelist
    👉🏻 All Source Code: github.com/tec...
    Want to sponsor this channel? Let me know what your plans are here: technovangelis...

COMMENTS • 101

  • @sergey_is_sergey
    @sergey_is_sergey 11 months ago +11

    I read the thumbnail as "Talk With Your Dogs" and was impressed with just how multi-modal Ollama has become.

  • @AlekseyRubtsov
    @AlekseyRubtsov 11 months ago +2

    Thanks!

  • @sebingtoon
    @sebingtoon 11 months ago +5

    Hi Matt, I really like your experimental approach to chunk size, so useful. I'm looking forward to installing bun (a step into the unknown for me 😱) and trying out your code! Thanks for your videos, they are a pleasure to watch 🙂

  • @steffenmuller2888
    @steffenmuller2888 6 months ago +1

    "Way way way back, at the dawn of the beginning... 6 month ago" XD
    A very eye-catching description of how fast we are traveling in terms of LLMs and NLP. Your content is great; I was looking for exactly this. Well explained, and understandable even for rookies in this area.

  • @c0t1
    @c0t1 11 months ago +3

    Thank you for addressing this - great video! Thanks for providing an example for analyzing the chunk size and overlap. I've wanted to try this exact thing but wasn't sure of a good way to programmatically assess the quality of the LLM's response. I'm surprised that the chunk size makes so much of a difference in the end results, and I'd LOVE to see your analysis of the myriad vector DBs out there!

    • @technovangelist
      @technovangelist  11 months ago +3

      It’s so rare to see someone use the word myriad correctly. Yes, the vector DB looks are coming soon.

    • @c0t1
      @c0t1 11 months ago +1

      I know there are not an innumerable number of vector dbs out there, but from the perspective of a relative newbie in this space, there might as well be. I feel like the ex-Soviet man when first confronted with 100 different brands of shampoo.

  • @lucioussmoothy
    @lucioussmoothy 11 months ago +1

    Great video - I dig your style. Just back from Spring break adventures and needed something to get pumped about before diving into the AI madness on Monday. Well done, sir.

  • @MrRavaging
    @MrRavaging 22 days ago

    You've inspired me to look into whether it's possible to use a variable chunk size, and make the size equal to the length of the sentence or phrase + 3 words preceding + 3 words following. That way a searching LLM will be more likely to match more complicated concepts when performing tasks, which would lead to more reliability. Or... at least, that's the hypothesis...
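    In code, that variable-chunk hypothesis might look something like this rough TypeScript sketch (the function name and the crude boundary rule are invented for illustration, not taken from the video):

    ```ts
    // One chunk per sentence, padded with the 3 words before and after it.
    function sentenceChunks(text: string, pad = 3): string[] {
      const words = text.split(/\s+/).filter(Boolean);
      const chunks: string[] = [];
      let start = 0;
      words.forEach((word, i) => {
        // crude sentence boundary: the word ends in ., ! or ?
        if (/[.!?]$/.test(word) || i === words.length - 1) {
          const from = Math.max(0, start - pad);
          const to = Math.min(words.length, i + 1 + pad);
          chunks.push(words.slice(from, to).join(" "));
          start = i + 1;
        }
      });
      return chunks;
    }
    ```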

  • @espero7757
    @espero7757 6 days ago +1

    How does it work with sentence embeddings? Or with a big semantic embedding model?

  • @sahajamitrawat
    @sahajamitrawat 11 months ago +2

    I love all your videos so please continue posting them.
    On chunking, I recently built a RAG app to do Q&A on our product docs, where I loaded all the product documentation into a vector DB. In my case I converted these docs into a JSON array where each JSON object describes one independent requirement for a given capability. So there is no fixed chunk size for me: I stored each requirement in one chunk irrespective of its size, and I have not used any overlap. I was surprised by the accuracy of the results with this approach with the nomic-embed-text embedding model.
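    A sketch of that per-requirement idea in TypeScript (the Requirement shape, field names, and prompt are invented; it assumes a local Ollama serving the nomic-embed-text model via its /api/embeddings endpoint):

    ```ts
    // Each JSON requirement becomes one chunk, whatever its length.
    type Requirement = { capability: string; requirement: string };

    async function embedRequirements(reqs: Requirement[]) {
      const results: { req: Requirement; embedding: number[] }[] = [];
      for (const req of reqs) {
        const res = await fetch("http://localhost:11434/api/embeddings", {
          method: "POST",
          body: JSON.stringify({
            model: "nomic-embed-text",
            prompt: `${req.capability}: ${req.requirement}`,
          }),
        });
        const { embedding } = await res.json();
        results.push({ req, embedding }); // keep the record next to its vector
      }
      return results;
    }
    ```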

    • @NLPprompter
      @NLPprompter 10 months ago

      Care to make a video presentation about it? Seems interesting.

    • @shresthmeharia5632
      @shresthmeharia5632 2 months ago

      Yo, sounds really cool... can I see it?

  • @melaronvalkorith1301
    @melaronvalkorith1301 7 months ago

    Loved your video and especially the experiment you ran! Everything is so new and changing so fast with LLMs that experiments like yours are very valuable. I would love to see you perform a more in-depth experiment on this and/or anything else you think would be worthwhile!
    It would be interesting to see a program utilizing two representations of the same document with different chunk sizes, one high and one low (e.g. 100 and 5), and using one or the other based on what you need from it.

  • @darenbaker4569
    @darenbaker4569 11 months ago +1

    Brilliant results thank you.

  • @tal7atal7a66
    @tal7atal7a66 11 months ago

    Excellent, pro tutorials ❤, clean English and smooth explanations too.
    Thank you.

  • @HistoryIsAbsurd
    @HistoryIsAbsurd 11 months ago +1

    Thank you good sir!

  • @phizc
    @phizc 11 months ago

    9:35 One interesting twist would be to use a rather short chunk size to look up the embedding, then provide the text of the couple of previous and next chunks to the LLM as well. That would give the LLM a longer context to use for providing the answer while still being able to find the closest matching embedding.
    I'm not too surprised about the results, except perhaps about overlap not making much of a difference. I would think having enough overlap to at least have complete sentences would be ideal.
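    That neighbor-expansion twist is easy to sketch (a rough illustration; `chunks` is the ordered list the document was split into, and `bestIndex` is whatever index your vector search returned):

    ```ts
    // Search on a short chunk, but hand the LLM the neighbors too.
    function withNeighbors(chunks: string[], bestIndex: number, span = 2): string {
      const from = Math.max(0, bestIndex - span);
      const to = Math.min(chunks.length, bestIndex + span + 1);
      return chunks.slice(from, to).join(" ");
    }
    ```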

  • @HistoryIsAbsurd
    @HistoryIsAbsurd 11 months ago +1

    Hey, not sure if this is your thing or not (no worries if not, totally cool!) but could you concatenate your videos about embeddings into a super tutorial, or maybe set up a playlist from your embedding videos? Super, super good info in your videos, man.

  • @LuisEduardoHernandezT
    @LuisEduardoHernandezT 3 months ago

    I don’t know why I feel the urge to comment about the awesomeness of this video

  • @mariacardoso5145
    @mariacardoso5145 1 month ago

    Great video!

  • @jim02377
    @jim02377 11 months ago

    Nicely done. I struggle with chunking on my projects and sometimes I think the performance varies from day to day. I am curious how chunk size affects RAG performance on smaller models like Mistral 7B vs the big ones like GPT-3.5 Turbo.

  • @meyermc80
    @meyermc80 11 months ago

    Stating the obvious a little, but there are more than 2 variables to control for here. The video mentions another big one: the specific application. Another huge variable is the specific embedding model you use and what capabilities it was trained with. Also, how are you retrieving: the top match, the top 5 matches, anything above a similarity threshold? How do you measure similarity: cosine, Euclidean, Manhattan, ...? Are you embedding raw chunks or generating hypothetical embeddings? I love any material that helps answer any part of this. 😍

    • @technovangelist
      @technovangelist  11 months ago +1

      Yes. Can only do so much in a video, so a lot of that is coming in future videos.

  • @cnmoro55
    @cnmoro55 11 months ago

    Several months ago I made a very similar test.
    I found that chunks of 150 words with 20 words overlap were the best for my case.
    But now I am using token chunks instead of word chunks. Approximately, those 150 words are equivalent to 300 tokens.
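    For reference, that word-count chunking with overlap is only a few lines of TypeScript (a minimal sketch with invented names, defaulting to the 150/20 split described above):

    ```ts
    // Fixed-size word chunks where each chunk repeats the last `overlap`
    // words of the previous one.
    function wordChunks(text: string, size = 150, overlap = 20): string[] {
      const words = text.split(/\s+/).filter(Boolean);
      const step = Math.max(1, size - overlap); // guard against overlap >= size
      const chunks: string[] = [];
      for (let i = 0; i < words.length; i += step) {
        chunks.push(words.slice(i, i + size).join(" "));
        if (i + size >= words.length) break; // last chunk reached the end
      }
      return chunks;
    }
    ```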

    • @technovangelist
      @technovangelist  11 months ago

      It tends to average out to 3 words to 4 tokens in general. But depends on the text.

    • @cnmoro55
      @cnmoro55 11 months ago +2

      @@technovangelist Yes, you're right, but my use case is the Portuguese language, so the token count tends to be a little higher :(

  • @panckreous
    @panckreous 11 months ago

    Remember when the MCU was great, and how as good as a movie may have been, the post-credits stingers were always the best part? Thank you for filling the void. 10:31
    (alternatively, "dude...")

    • @technovangelist
      @technovangelist  11 months ago

      The only reference I know of for MCU is related to Marvel, so no, I don't remember when MCU was great, because every Marvel movie has been garbage. I have gone to some and just walked out they were so bad. I went to see Dune 2 this weekend and would have walked out on that if I hadn't fallen asleep.

  • @deetechprojects
    @deetechprojects 8 days ago

    Based on experience, each chunk should have an independent context, making the search process more effective and efficient. I prefer not to use a specific chunk size; instead I use specific markers, such as the "==" symbol, to separate the chunks so that every chunk stands on its own.
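    Splitting on such a marker is about as simple as chunking gets; a one-function sketch (names invented):

    ```ts
    // Split wherever the author placed the "==" separator, so every chunk
    // is a self-contained section rather than a fixed number of words.
    function markerChunks(text: string, marker = "=="): string[] {
      return text
        .split(marker)
        .map((chunk) => chunk.trim())
        .filter((chunk) => chunk.length > 0);
    }
    ```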

  • @jjolla6391
    @jjolla6391 2 months ago

    It would be useful if you could do a video on chunking non-prose. For example, what if I had JSON structured data... what should the embedding model be there? Is it even possible?

    • @technovangelist
      @technovangelist  2 months ago

      I would probably have a model interpret the data 5 or more different ways, changing perspective each time. Embed that. And in the metadata for the embedding, point to the source data.
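      A rough sketch of that multi-perspective idea (the `generate` and `embed` parameters stand in for whatever model calls you use, and the prompt wording is invented):

      ```ts
      // Restate one structured record from several viewpoints, embed each
      // restatement, and keep a pointer back to the source in the metadata.
      async function embedPerspectives(
        record: object,
        sourceId: string,
        generate: (prompt: string) => Promise<string>,
        embed: (text: string) => Promise<number[]>,
        n = 5,
      ) {
        const entries: { embedding: number[]; metadata: { sourceId: string } }[] = [];
        for (let i = 0; i < n; i++) {
          const text = await generate(
            `Describe this data from perspective #${i + 1}: ${JSON.stringify(record)}`,
          );
          entries.push({
            embedding: await embed(text),
            metadata: { sourceId }, // points back at the raw source data
          });
        }
        return entries;
      }
      ```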

  • @philipthomas3503
    @philipthomas3503 3 months ago

    Thank you so much for going to the trouble of this. Question - does Ollama's embedding format vary by the LLM that's active? E.g. if you're running it with Mistral or Llama, will the embedding format be different? (And do the tuning numbers for overlap / chunk size vary by model?)

    • @technovangelist
      @technovangelist  3 months ago +1

      Yes, but you should never use Llama or Mistral for embedding. There are embedding models that are smaller and faster, with far better results.

    • @philipthomas3503
      @philipthomas3503 3 months ago

      @@technovangelist can Llama or Mistral then understand the output of those embedding models you mention? I thought each LLM had its own distinct embedding format it could work with

    • @technovangelist
      @technovangelist  3 months ago

      The model will never see the embeddings. Even if you use those models to create the embedding you will only ever pass the plaintext to the model.

    • @philipthomas3503
      @philipthomas3503 3 months ago

      @@technovangelist I was reading that in LangChain one could use a QARetriever object to semantically retrieve relevant embeddings from a vector store, which then allows an LLM to process those embeddings.

    • @technovangelist
      @technovangelist  3 months ago

      the model only understands the plaintext

  • @user-ro4ov2xv7s
    @user-ro4ov2xv7s 7 months ago

    What K value for retrieval did you see had the best results?

  • @rj7855
    @rj7855 11 months ago

    Very interesting, as usual

  • @ginisksam
    @ginisksam 11 months ago

    Ciao Matt,
    Thanks for the insight. Been playing with Langchain + Ollama (nomic-embed-text) + Groq (free for now, and fast) - chunking a PDF article with 20 sub-topics.
    Found that chunk sizes of 1024 & 512 without overlap sufficed. I finish off using gpt-3.5 (free) to merge my outputs for each topic to produce a rather good summary for my own use, on a CPU laptop for now.
    On a large single-topic PDF, my go-to approach is to ask the LLM to generate a list of questions for me, then peruse those questions - this allows for more enjoyable reading and understanding of the PDF. What's your view on this?
    Keep up your videos. Cheers.

  • @DeepakKumar-in7er
    @DeepakKumar-in7er 6 months ago

    I have 100+ documents; what should the best size be? Sometimes the relevant information is present in a doc, but I am not getting answers to the questions about it.

    • @technovangelist
      @technovangelist  6 months ago

      Depends on your docs. Play around with different sizes to figure out what works. But if they cover different topics make sure you have a way to filter out the irrelevant ones from your search.

  • @veiculoseaventuras
    @veiculoseaventuras 10 months ago

    What a spectacular video! I loved it! Big hug, straight from Brazil!

  • @StudyWithMe-mh6pi
    @StudyWithMe-mh6pi 11 months ago

    Music is inspiring me to try chunking text :-)

  • @neodim1639
    @neodim1639 11 months ago +1

    What about semantic chunking?

    • @ilianos
      @ilianos 10 months ago

      I was waiting for this as well.

    • @ilianos
      @ilianos 10 months ago

      Recently, I even heard about "agentic chunking". Pretty interesting concept!

    • @prajnaparamitahrdaya
      @prajnaparamitahrdaya 4 months ago

      @@ilianos One of the reasons we need chunking is that the text is too long for the LLM to consume. So I guess agentic chunking is only useful in some not-so-generic use cases.

    • @prajnaparamitahrdaya
      @prajnaparamitahrdaya 4 months ago

      I am experimenting with semantic chunking with custom embeddings. The speed is acceptable for a 100-page doc, but the results still need further fine-tuning/clean-up.

  • @yevhendyachenko1384
    @yevhendyachenko1384 8 months ago

    Could you test chunking a code repo?

  • @HenryETaylor
    @HenryETaylor 10 months ago +1

    I'm barely a novice on any of this, but intuitively I'm wondering why the length of your longest question doesn't factor into choosing your minimum embedding length. Along the same intuitive line, wouldn't you want to pad or augment your answers so that questions always start at the beginning of a chunk? If your questions, and combinations of your questions, are the most likely search criteria you'll receive from users, wouldn't guaranteeing that each question is totally encapsulated in a chunk help the speed and accuracy of the resulting searches?

    • @rafaelrodrigues6320
      @rafaelrodrigues6320 10 months ago

      The embedding length (number of dimensions) is fixed, depending on the model you choose, so I think by "minimum embedding length" you meant "minimum chunk length".
      It's a good idea to determine your chunking method according to the queries you're expecting, but the longest question may be an outlier. Be careful with that.
      I didn't quite get your final question, but you can't tailor your chunks to each specific query. Chunking your texts, embedding them, and indexing them for efficient searching takes quite a long time. Everything must be ready for when your user wants to do a search.

  • @mvdiogo
    @mvdiogo 11 months ago

    Very nice video. I use pgvector; for me, Chroma DB did not support all my files.

  • @HarmAarts
    @HarmAarts 10 months ago

    I do have a question: wouldn't chunking by sentence(s) make sense?

  • @95jack44
    @95jack44 11 months ago

    And what about using an LLM to do the embedding? Seems like it would know the best place to cut the text!

    • @technovangelist
      @technovangelist  11 months ago

      you definitely don't want to use a regular llm to do the embedding. that’s not what they are for. Your results are going to be so much better using an embedding model for that purpose

    • @spartaleonidas540
      @spartaleonidas540 10 months ago

      @@technovangelist I think he meant in terms of chunking the text. A semantic chunker instead of rigid arithmetic.

    • @technovangelist
      @technovangelist  10 months ago

      Same answer. Embedding model is much better for this.

  • @idleidle3448
    @idleidle3448 10 months ago

    What's the difference between RAG and Embeddings?

    • @technovangelist
      @technovangelist  10 months ago

      Embeddings are a part of what goes into a vector DB that is used for RAG.
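      In pseudo-TypeScript the relationship looks roughly like this (every name here is an illustrative stand-in, not a specific library):

      ```ts
      type Embed = (text: string) => Promise<number[]>;       // text -> embedding
      type Search = (vector: number[]) => Promise<string[]>;  // vector DB lookup
      type Generate = (prompt: string) => Promise<string>;    // the LLM itself

      // RAG is the whole retrieve-then-generate flow; embeddings are just
      // the vectors it stores and searches.
      async function rag(question: string, embed: Embed, search: Search, generate: Generate) {
        const qVector = await embed(question);  // embed the question
        const chunks = await search(qVector);   // nearest chunks from the vector DB
        return generate(`Answer using this context:\n${chunks.join("\n")}\n\nQ: ${question}`);
      }
      ```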

    • @idleidle3448
      @idleidle3448 10 months ago

      @@technovangelist thanks Matt! Do you have a buymeacoffee link or patreon?

    • @technovangelist
      @technovangelist  9 months ago

      Well I do have that patreon now. Just set it up: patreon.com/technovangelist

  • @darenbaker4569
    @darenbaker4569 11 months ago

    Vector DBs: can't wait! My favourite is Milvus for local dev.

  • @daryladhityahenry
    @daryladhityahenry 11 months ago

    Hi! Really interesting to hear that 100 words is kind of the best... Still, this is based on the use case, right?
    Say the question is: how do I do xyz step by step?
    It's impossible to answer that in 100 words, right? And if the how-to list is split into chunks (even with overlap), it will lose the context above, right?
    I mean...
    Say there are 10 list items. The first 3 list items know the context since they're at the start (1st chunk).
    The second chunk has 3 more list items, and maybe it doesn't know the start of the sentence, which is the key.
    This is the problem, right? Or can embeddings somehow capture the connection between the list and the main intention?
    Thanks

    • @technovangelist
      @technovangelist  11 months ago

      That was the summary at the end... you have to experiment with what you are trying to do.

    • @daryladhityahenry
      @daryladhityahenry 11 months ago

      @@technovangelist I see... so there's really no generalization about that? Hmm...
      If that's right, then real implementations are really limited to simple things? (Or it basically depends on how well the LLM handles context?)

    • @technovangelist
      @technovangelist  11 months ago +1

      I found that for my use case even the complicated stuff was covered by 100 words or less.

    • @daryladhityahenry
      @daryladhityahenry 11 months ago

      @@technovangelist Woahh.. I see.. Nice! Thanks for sharing :D:D:D:D

  • @AndreaBorruso
    @AndreaBorruso 9 months ago

    Hi, the ua-cam.com/video/6QAlbThWomc/v-deo.html video isn't available any more. Is there a new URL? Thank you

    • @technovangelist
      @technovangelist  9 months ago

      Ok. I don’t know what that is or why you are leaving a comment here

    • @AndreaBorruso
      @AndreaBorruso 9 months ago

      @@technovangelist it's the URL you suggest at the 2:30 mark: ua-cam.com/video/9HbU9Of-Ptw/v-deo.html

    • @technovangelist
      @technovangelist  9 months ago +1

      That was more of an example of the kind of output I want. The actual video doesn’t matter.

    • @AndreaBorruso
      @AndreaBorruso 9 months ago

      @@technovangelist I feel like a jerk. Thank you

    • @technovangelist
      @technovangelist  9 months ago +1

      I should have put a silly word in the url to be more obvious

  • @mungojelly
    @mungojelly 11 months ago

    The SMS length of 160 characters was based on research that suggested it was long enough for most things people say... I think we overestimate how much information we're adding most times that we say more than that... a few words is enough to enter a completely different world of meaning; we forget how just a sentence or two can bring us almost anywhere.

    • @technovangelist
      @technovangelist  11 months ago

      Yes, but if we have a long document with a bunch of concepts discussed in it, I thought I would need a longer chunk size to capture the ideas in the doc.

    • @technovangelist
      @technovangelist  11 months ago

      And I remember seeing that story once, so why was the original limit 140? The story was specifically about 160 characters fitting on one guy's postcard. I worked for a fax vendor, and folks who peddle out-of-date communications mediums tend to be interested in each other, I think.

    • @mungojelly
      @mungojelly 11 months ago

      @@technovangelist Idk why we talk about it in terms of static docs; I think of it mostly in terms of asking LLMs for specific forms & categorizations... maybe it depends on how much text you have already. It's like a dirty way to ingest stuff made by dirty random humans, I guess.

    • @mshonle
      @mshonle 11 months ago

      @@technovangelist I will send you some Gregg shorthand via carrier pigeon. I must use duct tape, alas, because my wax for seals ran out when my second cousin thought he found “ceiling wax”.

    • @technovangelist
      @technovangelist  11 months ago +1

      Nice, I look forward to that. I remember watching a video a year ago comparing authentication to wax seals in olden days.

  • @chrisBruner
    @chrisBruner 11 months ago

    I'm starting to think I'm going to have to learn ts (and bun).

  • @sdaiwepm
    @sdaiwepm 3 months ago

    Thought-provoking, thanks. Are there some other dimensions to consider, e.g. whether the chunks can be organized by topic, and which LLM is being fed these chunks (you touched on this)?

  • @themax2go
    @themax2go 5 months ago

    I just realized chunking might be a thing of the past now (well, it has been for a while, ~6 months to a year, but I just realized it).

  • @valueray
    @valueray 10 months ago

    Why do docs all need to have the same size? Can't it be dynamic?

  • @wholeness
    @wholeness 11 months ago

    Chunk from Goonies?

  • @Attlas-b1k
    @Attlas-b1k 3 days ago

    I really thought you would talk about dimensions. Aren’t they actually relevant?

    • @technovangelist
      @technovangelist  3 days ago

      Tell me more. What would you like to hear? If more were better it might be worth adding, but it’s not the case.

  • @fabriziocasula
    @fabriziocasula 11 months ago

    Ciao Matt, where is the code? :-)

    • @technovangelist
      @technovangelist  11 months ago +4

      On GitHub: Technovangelist/videoprojects is the repo.

    • @fabriziocasula
      @fabriziocasula 11 months ago

      @@technovangelist thank you :-)

  • @mshonle
    @mshonle 11 months ago

    What about a more content-driven approach, such as using a traditional NLP library like spaCy to first break your text up into sentences? (You can remove all punctuation and let spaCy decide the breaks, or I suppose you could go an evil regular-expression route.)
    Also, what about hierarchical embeddings? E.g., use the sentence embeddings, but also have paragraph embeddings, section embeddings, and so on until you have document embeddings at the very top?
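    For the sentence-splitting half, the TypeScript/Bun stack this channel's code uses has a built-in option: Intl.Segmenter can stand in for spaCy's sentence splitter. A minimal sketch (not the video's code):

    ```ts
    // Content-driven chunking via sentence boundaries.
    function sentences(text: string, locale = "en"): string[] {
      const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
      return [...segmenter.segment(text)]
        .map((s) => s.segment.trim())
        .filter((s) => s.length > 0);
    }
    ```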

    • @technovangelist
      @technovangelist  11 months ago +1

      Normally I do it based on multiple sentences. No need to use a library for simple stuff like that. Actually, the first version of this video was going to do that, but I wanted to see if sub-sentence chunk sizes were useful... and they are. Most vector DBs that I have used also supported what you have referred to as hierarchical embeddings. But you can go too far pretty easily, potentially giving the whole document, which means you have made it almost as inefficient as without RAG.

  • @technobanjo
    @technobanjo 11 months ago

    Like granted for using Bun

  • @florentflote
    @florentflote 11 months ago