Looking forward to the RAG series 🎉
you're the real mvp... your videos are ace, you break down these other tools into low-level code.
Surely you could have an agent running Opus or GPT-4o perform the testing and generate a matrix for you against your expected answers.
The point is it's local: more secure (data isn't being sent out or used as training data) AND you're not paying for it. ... AnnnD, you can expand it into your own workflows etc.
@@EddieAdolf he's the real MVP for sure.
Hi, I like the content - very informative. I have an idea for the next video and a very good showcase: agent automation (Selenium-based) to log in to any portal or interact with any page, e.g. eBay, searching for a specific item (a book), but not via an API, just a tool based on the Selenium library or similar. That would set a foot in the door of RPA ;)
I like this idea
Did you get a bigger background to accommodate a bigger llama model? 😆 Ok, let me get serious and actually watch this thing ...🤣.
There are whole teams that fundraised on being able to do this very task.
And this guy gave them a run for their money.
@@MrAhsan99 it points to how overvalued some of these startups are
@@matten_zero absolutely
impressive work!
How is it supposed to determine which city? You didn't specify "in the UK", so is Birmingham the largest city north of London in the whole world?
I would hope it could figure it out. I'm going to try the same thing with GPT-3.5-Turbo and 4o.
For context, Perplexity has a $50M+ valuation.
Context windows shouldn't just be cut off. They should slide and summarize as they go.
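One way to sketch that sliding idea: when the history exceeds the budget, fold the oldest turns into a running summary instead of hard-truncating. This is just an illustration, not anyone's actual implementation; `summarize` is a hypothetical stand-in for a call into whatever LLM you're running.

```python
# Sliding context window sketch: oldest messages get compressed into a
# summary so the recent turns always fit the budget.

def summarize(text: str) -> str:
    # Placeholder: a real version would ask the model to compress `text`.
    return text[:100]

def slide_window(history: list[str], budget: int, summary: str = ""):
    """Drop the oldest messages into the summary until history fits."""
    while len(history) > 1 and sum(len(m) for m in history) > budget:
        oldest = history.pop(0)
        summary = summarize(summary + " " + oldest)
    return summary, history
```

The budget here is in characters for simplicity; a real version would count tokens with the model's tokenizer.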
Hi, I see you have been doing quite some LLMing and RAGging. I was just fiddling with it, and the problem is that sometimes it generates real garbage, like GitHub pages which don't exist, or lines and lines of nothing, or it starts repeating. How do you prevent that, or catch it and stop generating? If you have some examples or videos, that would be helpful.
If you're using open source LLMs, you need to be aware of the prompt formats and stop tokens. I have a video on deploying a basic llama 3 chatbot that goes into a bit more detail.
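To make the prompt-format and stop-token point concrete, here's a minimal sketch using the Llama 3 Instruct chat template (the special token strings come from Meta's published template; the truncation helper is just an illustration of catching runaway generation):

```python
# Build a Llama 3 Instruct prompt by hand and cut output at its
# end-of-turn token, which is what prevents the model from rambling
# into garbage or repeating itself.

def build_llama3_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

STOP_TOKENS = ["<|eot_id|>", "<|end_of_text|>"]

def truncate_at_stop(text: str, stops=STOP_TOKENS) -> str:
    """Cut generated text at the first stop token, if any appears."""
    cuts = [text.find(s) for s in stops if s in text]
    return text[: min(cuts)] if cuts else text
```

Most serving stacks let you pass these stop strings directly as a `stop` parameter, which is usually cleaner than post-hoc truncation.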
You should add Snowflake Arctic to the comparison! Apparently its 128 experts are less prone to hallucinating
I was wondering how a MoE architecture might do. I have Mixtral coming up, but will consider Snowflake Arctic too!
What temperature is it set to, and what quantized version are you running? The free version on Groq managed to get the final question right, though it struggled with the Aruba one. I'm sure with more tweaking we can get Llama 70B to do quite well on these tasks.
The temperature setting is 0 and the model is the 16-bit version. I think the Llama models are published in 16-bit anyway, so completely unquantized.
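For anyone wanting to reproduce this through Ollama rather than a cloud deployment (an assumption; the setup discussed here may differ), the relevant knobs look like this. Field names follow Ollama's `/api/generate` schema; note the default `llama3:70b` tag on Ollama is 4-bit quantized, so you'd need an fp16 tag to match the unquantized 16-bit weights mentioned above.

```python
# Sketch of an Ollama-style request that pins temperature to 0 for
# repeatable runs.
import json

payload = {
    "model": "llama3:70b",
    "prompt": "What is the largest city north of London in the UK?",
    "stream": False,
    "options": {
        "temperature": 0,   # greedy decoding: same output every run
        "num_ctx": 8192,    # Llama 3's native context length
    },
}
body = json.dumps(payload)  # POST this to the local /api/generate endpoint
```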
What GPU are you using? I followed your video before to deploy on RunPod, but I cannot connect to the host on port 8000. Or does it just take more time to start? Please let me know!
My bad, I just needed to wait a little bit.
I haven't gone back and looked at your implementation of your agent workflow, but since you're talking about restrictions in context windows with your scraping: are you using RAG with the large documents you're scraping?
From what I recall in the previous videos, the content is fed in full into the context, leaving the LLM to extract the information from the whole page.
That's why some pages don't fit in the 8K-token window.
RAG would work better by chunking and retrieving only the relevant text from the page, but it would also make the project's code quite a bit more complex, unless you rely on a framework.
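A rough sketch of that chunk-and-retrieve step. A real pipeline would score chunks with embeddings (e.g. from an embedding model served locally); here a plain word-overlap score stands in so the example is self-contained. Everything below is illustrative, not the project's actual code.

```python
# Chunk a scraped page with overlap, then keep only the chunks most
# relevant to the query, so the whole page never has to fit in context.

def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split into overlapping word-windows of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def score(query: str, chunk: str) -> float:
    # Stand-in for embedding similarity: fraction of query words present.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def top_chunks(query: str, page: str, k: int = 3) -> list[str]:
    chunks = chunk_text(page)
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
```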
@@supercurioTube aye, a RAG stage is needed then for large contexts or some form of hierarchical context summariser.
RAG would be better though
Yes, this is pretty much it. Would probably add some latency too because you would have to create the embeddings for each webpage each time you did a new search.
@@Data-Centric good point about the latency.
Ollama recently added the ability to keep several models loaded at the same time, which would help.
Otherwise swapping between the embedding model and an 8B LLM would slow things down significantly.
@@Data-Centric aye, but you could check first to see if it's already indexed