Big fan of Quivr. There has been a lot of progress by Stan and the contributors since this video was made. Loved the discussions many months later.
Can we have one integrated with PrivateGPT, with support for Hugging Face models?
I’m of the understanding that chunking into smaller sizes yields better results when querying the vector store. A large chunk’s embedding amounts to a semantic summary of the whole chunk, so fine-grained information (context) risks being diluted, whereas smaller chunks provide finer, more granular semantic targets for the search to locate.
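The trade-off described above can be sketched with a minimal fixed-size splitter. This is purely illustrative, not Quivr's actual implementation; the sizes and the character-based splitting are assumptions (real splitters usually respect token or sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into chunks of `chunk_size` characters, each sharing
    `overlap` characters with the previous chunk so that information
    straddling a boundary is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing chunk that is entirely contained in the previous one.
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

Each chunk is then embedded separately, so a smaller `chunk_size` gives the retriever more, narrower targets to match against a query.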
Kind of a hard one to explain why I want to do this, but I've been working on using knowledge graphs to map the interconnectedness of technical documentation, and this would be a fantastic way to achieve it. LLMs are quite good at extracting the triples implied in a collection of tokens. That's useful, for example, if you want to quickly build a workflow for a business, along with the rules that constrain user actions based on, say, a policy or regulatory document.
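For what the comment above calls triples: a knowledge-graph triple is a (subject, predicate, object) tuple, and a common pattern is to prompt an LLM to emit one triple per line and parse the result. The pipe-separated line format below is an assumption for illustration, not any standard output format:

```python
def parse_triples(llm_output: str) -> list[tuple[str, str, str]]:
    """Parse lines of the assumed form 'subject | predicate | object'
    into (subject, predicate, object) tuples, skipping malformed lines."""
    triples = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples

# Hypothetical LLM output for a policy document:
sample = """employee | must complete | compliance training
manager | approves | expense report"""
```

The resulting tuples can be loaded into any graph store as edges, which is what makes the policy-to-workflow mapping quick.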
Harrison mentioned that there is a company building doc loaders called Instructor? I tried to search but couldn't find any reference. Do you guys know the name of the company?
@dgmarshmallow thank you!
Haven’t gotten there yet, but he might be talking about “hkunlp/instructor-base” on Hugging Face.
Harrison, is it not possible to create an entity-extraction agent to store metadata during embedding?
The installation procedure no longer works with recent updates.
Text splitting by context and file type is a great takeaway from this video!
Quick question: theoretically, would it be possible to manually adapt the chunk and overlap size for each document separately, or would that cause some sort of issues down the line? Thank you!! @StanGirard
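One way the per-document idea in the question could look: a lookup that picks chunking parameters by file extension, with a global default as fallback. This is a sketch of the concept only; the extension-to-size mapping and the parameter names are invented for illustration, not Quivr's configuration:

```python
from pathlib import Path

# Hypothetical per-file-type chunking parameters.
CHUNK_PARAMS = {
    ".md": {"chunk_size": 500, "chunk_overlap": 50},
    ".py": {"chunk_size": 300, "chunk_overlap": 30},
}
DEFAULT_PARAMS = {"chunk_size": 400, "chunk_overlap": 40}

def params_for(filename: str) -> dict:
    """Return the chunking parameters for a file, falling back to the
    default when its extension is not listed."""
    return CHUNK_PARAMS.get(Path(filename).suffix, DEFAULT_PARAMS)
```

Exposing these values as a UI control at upload time (as the reply below suggests) would just mean letting the user override the looked-up dict per document.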
My thinking too. From my experience this would not be an issue, and it would actually be a good UI option to add, so the user can refine the chunk size during upsert.
Does Quivr support other languages or only English for documents?
I would assume French as well. When he started sharing his screen, the context popup was briefly in French. Also, Mistral and other similar LLMs seem to have a larger French training set due to the organizations creating the models. Dolphin Mistral/Mixtral LLMs end up being great for me as a native English speaker who uses French frequently. There are LLMs trained on Japanese and Korean, and I'm sure other languages as well. I got PrivateGPT to use different unsupported LLMs, so Quivr should be able to as well with a little work.
It won't accept my OpenAI API key; no matter how many I create, the app says it's invalid.
Sorry, unrelated question: what software do you use to make the video on Quivr's website? 😃
That is a very insightful comment (ua-cam.com/video/1hZ7svDRA0o/v-deo.html). An old, plain UI can go a long way in making the experience smoother :)
It’s kind of hard to understand at a high level what it is.
First 😅
SECOONDDDDD 💪