Thanks for the vid and the experiment. One little thing to consider: the sentence transformer you are using (all-MiniLM) is limited to 512 tokens (well, technically it's 256 from their training, but Chroma might have overridden that), and anything longer gets truncated. The chunk size you apply when chunking the document should be adjusted to that limit.
Since you have chosen 800/400 overlap, there is not a very big chance that you end up with questions which are "out of bounds" of what the embedder even takes as input. This gap naturally increases if you use larger chunks with a small sentence transformer model. The fairly large overlap ratio has a second effect: if the question the LLM created for the synthetic data generation is based on information beyond what the embedder takes into account, the next chunk will most likely be returned as the best match. This wouldn't degrade the overall RAG performance, but it's certainly not optimal to train the adapter to optimize against something that is not represented in the input data.
Hope this is understandable. It would be interesting to look at the "non-successful" cases (chunk not in first place) and analyze how many times it actually chose the following chunk instead.
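A minimal sketch of the analysis suggested above, assuming each evaluation result can be reduced to a pair of chunk positions in document order (the expected chunk and the top-ranked chunk). The function names and the example pairs are hypothetical and made up for illustration; they are not results from the video.

```python
from collections import Counter


def classify_misses(results):
    """Bucket (expected_index, retrieved_index) pairs by how the top hit relates
    to the expected chunk. Indices are chunk positions in document order."""
    counts = Counter()
    for expected, retrieved in results:
        if retrieved == expected:
            counts["hit"] += 1
        elif retrieved == expected + 1:
            counts["next_chunk"] += 1
        elif retrieved == expected - 1:
            counts["previous_chunk"] += 1
        else:
            counts["other"] += 1
    return counts


def next_chunk_share(results):
    """Fraction of misses where the following chunk was returned instead."""
    counts = classify_misses(results)
    misses = sum(n for label, n in counts.items() if label != "hit")
    return counts["next_chunk"] / misses if misses else 0.0


# Hypothetical evaluation pairs, invented for illustration only.
example = [(0, 0), (1, 2), (2, 2), (3, 4), (4, 1), (5, 4)]
print(classify_misses(example))
print(next_chunk_share(example))
```

A high share of "next_chunk" misses would be consistent with the overlap explanation, though it would not prove it.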
Thanks for the additional information, all great things to consider and ideas for further work! I wasn't aware of the base model's token limitation off the bat. I'll have to look and see how Chroma approaches this, and whether they do anything specific for documents with larger token counts by default.
Definitely agree with your points on the larger chunking with overlap and how that can affect ranking/accuracy. My main goal was to test this research in an environment that best simulates what would likely be an existing RAG tool for a company, rather than the nicely organized benchmark data. That definitely leads to inefficiencies for training, but I was hoping it would be a more realistic/applicable test, as most company data is not as nicely structured or chunked as the benchmark data that research gets via MTEB.
Thanks for the writeup! Great context and considerations for the future
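One rough way to check the token-limit concern from this thread is to measure how much of each chunk falls past the embedder's limit. This is only a sketch: whitespace splitting stands in for the model's real tokenizer (which would usually produce more tokens), the 256 limit is the figure mentioned above, and `split_at_limit` and `dropped_fraction` are hypothetical helper names.

```python
def split_at_limit(text, max_tokens, tokenize=str.split):
    """Return (seen, dropped) token lists for one chunk under a token limit.
    `tokenize` is a stand-in; swap in the embedder's own tokenizer if available."""
    tokens = tokenize(text)
    return tokens[:max_tokens], tokens[max_tokens:]


def dropped_fraction(text, max_tokens, tokenize=str.split):
    """Share of a chunk's tokens that the embedder would never see."""
    seen, dropped = split_at_limit(text, max_tokens, tokenize)
    total = len(seen) + len(dropped)
    return len(dropped) / total if total else 0.0


# Synthetic 400-"token" chunk, invented for illustration.
chunk = " ".join(f"w{i}" for i in range(400))
print(dropped_fraction(chunk, 256))
```

Running this over all chunks would show whether synthetic questions could have been generated from text the embedder never encoded.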
Great post. Keep up the practical videos.
Awesome video! Can you do a similar one on fine-tuning not the retriever but the generator? I found something called "RAFT". If I understand it correctly, it's about fine-tuning the actual LLM for RAG by getting it used to answering questions with additional context.
For me it would be interesting:
- which fine-tuning method is suited best for that (full-finetuning, adapters, ...)
- how one does the E2E process
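As a rough illustration of the data-preparation side of this idea, here is a sketch that builds one training example from a question, the document that actually contains the answer, and a few distractors. It loosely follows the description above; the field names, the `oracle_prob` parameter, and the prompt layout are assumptions, and it says nothing about which fine-tuning method works best.

```python
import random


def build_raft_example(question, answer, oracle_doc, distractor_pool,
                       num_distractors=3, oracle_prob=0.8, rng=None):
    """Build one prompt/completion pair with distractor documents in the context.
    With probability `oracle_prob` the answer-bearing document is included too."""
    if rng is None:
        rng = random.Random()
    context = rng.sample(distractor_pool, num_distractors)
    include_oracle = rng.random() < oracle_prob
    if include_oracle:
        context.append(oracle_doc)
    rng.shuffle(context)  # avoid the model learning a fixed oracle position
    prompt = "\n\n".join(f"[doc {i}] {doc}" for i, doc in enumerate(context))
    prompt += f"\n\nQuestion: {question}"
    return {"prompt": prompt, "completion": answer, "has_oracle": include_oracle}


# Toy documents, invented for illustration.
pool = ["Cats sleep a lot.", "Paris is in France.", "Water boils at 100 C.", "Bees make honey."]
ex = build_raft_example("What is the refund window?", "30 days",
                        "Refunds are accepted within 30 days.", pool,
                        rng=random.Random(0))
print(ex["prompt"])
```

The resulting pairs could then be fed to whatever supervised fine-tuning setup is in use, whether full fine-tuning or adapters.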
amazing content, thank you so much, looking forward to more
Question: What if you have multiple documents per query in the golden dataset? For example, if you’re training a retriever for multi-hop queries, where multiple documents need to be retrieved, is this possible? I know we’d use other metrics like NDCG or MAP, but how about preparing the data?
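One possible way to prepare such data, sketched under assumptions: store a set of relevant chunk ids per query, expand it into one (query, positive) pair per relevant chunk for training, and score rankings with binary-relevance average precision and nDCG. The `golden` structure, ids, and function names are hypothetical.

```python
import math

# Hypothetical golden set: each query maps to ALL chunk ids needed to answer it.
golden = {
    "q1": {"c1", "c7"},  # multi-hop: two chunks required
    "q2": {"c3"},
}


def training_pairs(golden_set):
    """Expand the golden set into one (query, positive_chunk) pair per relevant chunk."""
    return [(q, c) for q, chunks in golden_set.items() for c in sorted(chunks)]


def average_precision(ranked_ids, relevant):
    """Average precision for one query; the mean over queries gives MAP."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked_ids, relevant, k):
    """nDCG@k with binary relevance (a chunk is either needed or not)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


print(training_pairs(golden))
print(average_precision(["c1", "c2", "c7"], golden["q1"]))
```

Whether one pair per relevant chunk is the right training signal for multi-hop retrieval is an open design choice; it is simply the most direct extension of the single-positive setup.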
Thank you so much, the video was really helpful
Great stuff, good job. New subscriber🙏
Just curious, how long did the fine-tuning take for the 30 epochs? I honestly thought it'd increase the recall even more, but given its performance on the training set it seems to overfit.
The video is AWESOME
Why would anyone finetune embedding models?
Are you saying we convert data into embeddings and store them in a vector DB, which we query during our call to the LLM?
The embeddings returned will be used as context for our query to the LLM.
This is called Simple RAG.
But what do you mean when you say "fine-tuning embeddings"?
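To make the question concrete, here is a toy sketch of one reading of the adapter idea mentioned elsewhere in this thread: the stored chunk embeddings stay untouched, and a small learned matrix is applied to the query embedding before the similarity search. The 2-d vectors and the matrix values are invented for illustration; a real adapter would be trained on (query, relevant chunk) pairs.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, adapter=None):
    """Rank chunk ids by cosine similarity, optionally adapting the query first."""
    q = matvec(adapter, query_vec) if adapter is not None else query_vec
    scores = {cid: cosine(q, vec) for cid, vec in chunk_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings, invented for illustration.
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
query = [0.8, 0.6]
adapter = [[0.5, 0.0], [0.0, 2.0]]  # stands in for a learned matrix

print(rank_chunks(query, chunks))           # ranking from the base embedder
print(rank_chunks(query, chunks, adapter))  # ranking after adapting the query
```

The point of the toy example is that the ranking can change without re-embedding any stored documents, which is what makes this cheaper than retraining the embedding model itself.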
one single layer doesn't really do anything
two is the holy grail
Then why is this research published as just one layer by Chroma? Do you have other sources indicating that two is much better?
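One piece of context that may help with the one-layer vs. two-layer question: two stacked linear layers with no nonlinearity between them are mathematically equivalent to a single linear layer, so a second layer only adds expressive power if something like a ReLU sits in between. The sketch below shows just that property with invented 2x2 matrices; it does not settle which setup retrieves better on real data.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix product A @ B for lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def relu(x):
    return [max(0.0, v) for v in x]


# Toy weights and input, invented for illustration.
W1 = [[1.0, -1.0], [0.0, 1.0]]
W2 = [[1.0, 1.0], [0.0, 1.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers, no activation
collapsed = matvec(matmul(W2, W1), x)        # one layer with the merged matrix
with_relu = matvec(W2, relu(matvec(W1, x)))  # two layers with a nonlinearity

print(stacked, collapsed, with_relu)
```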