Thank you for the video. Vote for the next video - Fully Local Multimodal RAG
useful videos. keep on uploading and make aussies proud of you
Awesome work Sam, Thank you
memory, LangChain agents, streaming UI next, please. Thanks for the very useful video!
Amazing tutorial, exactly what I was looking for! Running it with a few text documents, the results are great. Do you have any recommendations for making the QA faster? A different model or libraries?
Great video again. Question though: I can see you are more in favour of LangChain, but what are your thoughts on AutoGen and Teachable Agents for doing something similar? And in general, I suppose, your thoughts on AutoGen and its agentic model?
AutoGen is cool. I will make some videos about it at some point. I often use LangChain because it is easy and quick. I do like LlamaIndex as well and need to find some time to make vids about it as well.
Just wanted to give a big thumbs up to this, although I haven't yet watched the whole thing 😀. There are so many interesting things you can do with local RAG, and LangChain is very straightforward. I did something similar with Ollama's Llama 3 model. Very interested in trying the new Llama models that should be available soon.
LangChain and LlamaIndex are really just boilerplate IMHO; they create more problems than they solve with their over-abstraction. Can you show a vanilla example of how to do RAG?
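Not from the video, just a rough sketch of what a "vanilla", framework-free local RAG loop can look like, assuming the chromadb and ollama Python packages and that gemma2 and nomic-embed-text have already been pulled into Ollama (the model names and sample docs are only placeholders):
----------------------------------------------------
# Minimal RAG without LangChain/LlamaIndex: embed, store, retrieve, generate.
import chromadb
import ollama

client = chromadb.Client()                # in-memory vector store
col = client.create_collection("docs")

docs = [
    "Gemma 2 is an open-weights model from Google.",
    "Chroma is a local vector database.",
]
for i, doc in enumerate(docs):
    emb = ollama.embeddings(model="nomic-embed-text", prompt=doc)["embedding"]
    col.add(ids=[str(i)], embeddings=[emb], documents=[doc])

question = "What is Gemma 2?"
q_emb = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
hits = col.query(query_embeddings=[q_emb], n_results=2)
context = "\n".join(hits["documents"][0])  # retrieved text, not vectors

answer = ollama.generate(
    model="gemma2",
    prompt=f"Answer using only this context:\n{context}\n\nQuestion: {question}",
)
print(answer["response"])
----------------------------------------------------
The retrieval side (embedder plus vector store) and the generation side (the LLM) only ever exchange plain text, which is why the two models can be swapped independently.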
I find Chroma is not very suitable for local RAG. It sends telemetry data back to its devs; one needs to set anonymized_telemetry=False to keep it quiet. Also, running Ollama with some of the tools mentioned behind a firewall/proxy can be a challenge.
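For anyone hunting for that setting, a minimal sketch of where the flag goes, assuming the chromadb package (the ./chroma_db path is just an example; it can also be set via the ANONYMIZED_TELEMETRY environment variable):
----------------------------------------------------
import chromadb
from chromadb.config import Settings

# Create a local, persistent Chroma client with anonymized telemetry disabled.
client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False),
)
----------------------------------------------------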
I thought it was possible to switch it off completely.
There is a big choice of vector and hybrid local databases these days, including Postgres and Mongo.
I didn't know about the telemetry on Chroma; I will look into it. Thanks for letting me know.
I wonder if you could strip out the telemetry data with a prompt on the LLM?
Which vector DB do you suggest?
@@stanTrX I was hoping to get a better one suggested. I would prefer to use MariaDB, as I have it as a relational DB anyway on my server. Though, I've never seen any RAG examples with it.
Once Gemma 2 is downloaded, where can I find the directory in which the source files are located???
How does this compare to MS's recently open-sourced GraphRAG? BTW, there are GraphRAG-with-Ollama implementation tutorials (two different ways to do it: one is a "hack" that requires a change to the GraphRAG Python library to make it work with Ollama; the other requires LM Studio)... with two types of querying: "global", which always works fine, and "local", which often/usually fails (with various error messages, for various reasons).
Hey Sam! For now, Gemma 2 is still broken in Ollama, which doesn't yet include the latest llama.cpp fixes required.
It's about the tokenizer: <start_of_turn> and <end_of_turn> are interpreted as plain text instead of special tokens, and of course things don't really work as expected as a result.
I believe it'll be fixed in the next Ollama update though, very soon. But it's too early for Gemma 2 evaluations using Ollama at the moment, like the ones many people are running themselves or publishing in videos.
Interesting. I tried with and without the prompt format and both seemed OK. I will try it more with some other ideas this weekend.
@@samwitteveenai good to know!
Personally, for RAG applications I'm always unsure whether to use a chat template or not.
Typically, I set the task and the data to look at as the system prompt, and then have the standard user/assistant roles following the template.
But then, Gemma models don't support the system role at all so... 😅
For Ollama I am not sure if it operates differently when you call it as an API. I need to look into it. I saw the Ollama people were in the early-access group with me, so I figured they would have figured it out, but everything was very last minute for all of us this time round, so they may have gotten the quant versions much later.
@@samwitteveenai Ollama has chat and generate API endpoints.
For the chat endpoint you're expected to pass a list of messages with roles, and that'll be formatted according to the model's chat template.
For the generate endpoint, it's just text that'll be passed to the LLM.
Which one will answer your questions best, though, is worth looking into 😃
And generate has a raw option: when you set it to true, it doesn't apply any template and takes the input as is. Useful when you want to apply the template yourself.
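A quick sketch of the three call styles being described, assuming the ollama Python package and a pulled gemma2 model (the model name and prompts are just examples):
----------------------------------------------------
import ollama

# chat endpoint: role-tagged messages; Ollama applies the model's chat template.
chat_resp = ollama.chat(
    model="gemma2",
    messages=[{"role": "user", "content": "Summarise RAG in one sentence."}],
)
print(chat_resp["message"]["content"])

# generate endpoint: a plain prompt, still wrapped in the model's template.
gen_resp = ollama.generate(model="gemma2", prompt="Summarise RAG in one sentence.")
print(gen_resp["response"])

# generate with raw=True: no template is applied, so you supply the special tokens yourself.
raw_resp = ollama.generate(
    model="gemma2",
    prompt="<start_of_turn>user\nSummarise RAG in one sentence.<end_of_turn>\n<start_of_turn>model\n",
    raw=True,
)
print(raw_resp["response"])
----------------------------------------------------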
Hello, I'm French, sorry for the translation. Really good job! I have a question: how do you add PDF in addition to TXT at the top of your code? **/*.txt, *.pdf, or something else? Thank you.
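Not sure how the notebook in the video does it, but one common pattern, assuming LangChain's DirectoryLoader with pypdf installed for the PDF side (the "docs" folder and glob patterns are just placeholders):
----------------------------------------------------
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader

# Load the .txt and .pdf files separately, then combine them into one document list.
txt_docs = DirectoryLoader("docs", glob="**/*.txt", loader_cls=TextLoader).load()
pdf_docs = DirectoryLoader("docs", glob="**/*.pdf", loader_cls=PyPDFLoader).load()
documents = txt_docs + pdf_docs
----------------------------------------------------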
Excellent, tks!
Jeez, with the white background...
Dark mode has better contrast for many visually impaired viewers. Dark mode please.
Is it possible to run the embedding model on the CPU and the local LLM on the GPU?
What are the system requirements?
Do we need a GPU with certain size of VRAM?
I would also like to know this.
I was using an M2 Mac Mini with 32GB of RAM. It works on my MBA, though it's slower there.
I see that it works quite fast on the Mac Mini.
But what are the RAM requirements for the model and Chroma? Does it require a GPU for acceptable performance?
You've mentioned that the choice of embedder is important. As I understand it, the same vector dimensionality is not required, since the embeddings are used only during the embedding process and vector search. But what about "semantic" compatibility between the embedder and the LLM? I can imagine the embedder could map semantic meaning in its vector space differently from Gemma or Llama. Is it even possible to compare them, to ensure you use the best possible embedder for a given model?
You don't need any matching between the embedding model and the text-generation model. They each handle separate parts of the process. The embeddings generated by the embedder are never given to the text-gen LLM; they are only used for the vector store and retrieval.
I am using a Mac Mini with 32GB of RAM, so I'm not sure how well it works with low RAM etc.
@@samwitteveenai Yes, one of the neat things about RAG in general is that the interface is "English" (or any language the LLM supports) 😂
Thanks for showing Gemma 2 and Ollama. Would be nice to see it with Mesop. Maybe in combination with LangSmith for debugging?
Actually I made the Mesop version, but the streaming wasn't working with their chat UI. I need to look into it more.
Can you share the index code? I don't see it on GitHub.
Hey thanks for pointing this out. I have just added it
Hey Sam, can you explain why your prompt template always seems to have a different structure? By that I mean, in this case you wrote <start_of_turn>user at the start,
then towards the end you wrote <end_of_turn>. Does each LLM have its own way of writing its own prompt template? If so, what and where do you refer to when you want to do prompt engineering for an LLM you're using?
Great question. Yes, every LLM, or more specifically every fine-tune of an LLM, has a structure that its training data was given in during training. When we want to do inference, we have to match that structure to get the best results. In the past, if we didn't match the structure we would get garbage out; nowadays the models are getting so good that even when we don't match it they can return decent results. The structure normally has some special tokens that tell the model things like whether a line was said by the user or the assistant, and when a section starts or ends.
This structure can be very different from model to model, especially if they are made by different companies etc. That's why the Llama models have a different prompt template than Gemma etc.
You can usually find the prompt template listed in the model card for each model on Hugging Face. Hope that helps.
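One way to see a model's expected structure without hand-copying it from the model card, assuming the transformers package and access to the gated google/gemma-2-9b-it repo (any chat model repo works the same way):
----------------------------------------------------
from transformers import AutoTokenizer

# The tokenizer ships with the model's chat template, so it can render the prompt for you.
tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
messages = [{"role": "user", "content": "What is RAG?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the <start_of_turn>user ... <end_of_turn> structure Gemma expects
----------------------------------------------------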
@@samwitteveenai Thank you so much!
thanks!!
it doesn't work very well, but it is informative.
Thanks for sharing your experience.
I want to run this model on my computer, so I wrote a Modelfile like the one below:
----------------------------------------------------
FROM gemma-2-9b-it-Q6_K_L.gguf
TEMPLATE """<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
"""
PARAMETER stop "<end_of_turn>"
----------------------------------------------------
And to create the model in Ollama, I ran this command:
----------------------------------------------------
ollama create gemma-2-9b-it-Q6_K_L -f ~/gemma-2-9b-it-Q6_K_L/Modelfile
----------------------------------------------------
Then, to run the model, I ran this command:
----------------------------------------------------
ollama run gemma-2-9b-it-Q6_K_L:latest
----------------------------------------------------
Finally, I got an error message....
Error: llama runner process has terminated: signal: aborted (core dumped)
How did you manage to run this model on Ollama?
Thank you.