So far the most complete and clear LLM RAG walkthrough video that has ever existed on YouTube.
100%
00:00:49 Fix the model by creating a data pipeline to add context into the prompt.
00:01:33 Understand the paradigms of retrieval augmentation and fine-tuning for language models.
00:02:00 Learn about building a QA system using data ingestion and querying components.
00:02:07 Explore lower-level components to understand data ingestion and querying processes.
00:03:01 Address challenges with naive RAG applications, such as poor response quality.
00:04:02 Improve retrieval performance by optimizing data storage and pipeline.
00:04:14 Enhance the embedding representation for better performance.
00:04:45 Implement advanced retrieval methods like reranking and recursive retrieval.
00:05:18 Incorporate metadata filtering to add structured context to text chunks.
00:06:27 Experiment with small to big retrieval for more precise retrieval results.
00:07:14 Consider embedding references to parent chunks for improved retrieval.
00:09:31 Explore the use of agents for reasoning and more advanced analysis.
00:12:12 Fine-tune the RAG system to optimize specific components for better performance.
00:17:01 Generate a synthetic query dataset from raw text chunks using LLMs to fine-tune an embedding model.
00:17:12 Fine-tune the base model itself or fine-tune an adapter on top of the model to improve performance.
00:17:16 Consider fine-tuning an adapter on top of the model, as it has advantages such as not requiring the base model's weights to fine-tune and avoiding the need to reindex the entire document corpus when fine-tuning only the query side.
00:18:00 Explore the idea of generating a synthetic dataset using a bigger model like GPT-4 and distilling it into a weaker LLM like GPT-3.5 Turbo to enhance chain of thought, response quality, and structured outputs.
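For anyone wondering what the small-to-big idea at 00:06:27 and 00:07:14 might look like in code, here is a minimal sketch. It is not the speaker's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `build_index` and `small_to_big` are made up for illustration. The point is only that matching happens on small chunks while the larger parent chunk is what gets returned for synthesis.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(parents, small_size=6):
    """Split each parent chunk into small word windows that remember their parent id."""
    index = []
    for parent_id, text in enumerate(parents):
        words = text.split()
        for start in range(0, len(words), small_size):
            small = " ".join(words[start:start + small_size])
            index.append((embed(small), parent_id))
    return index


def small_to_big(query, parents, index, top_k=1):
    """Match the query against small chunks, then return the de-duplicated parent chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == top_k:
            break
    return results


parents = [
    "Chunk sizes matter for retrieval quality. Smaller chunks embed more precisely than larger ones.",
    "Agents can treat each document as a tool. They pick a tool for summarization or question answering.",
]
index = build_index(parents)
top = small_to_big("which tool for summarization", parents, index)
print(top[0])
```

The small window "for summarization or question answering." is what matches the query, but the whole second parent chunk is what would be handed to the LLM.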
🎯 Key Takeaways for quick navigation:
01:44 🧩 *The current RAG stack for building a QA system consists of two main components: data ingestion and data querying (retrieval and synthesis).*
03:08 🚧 *Challenges with naive RAG include issues with response quality, bad retrieval, low precision, hallucination, fluff in return responses, low recall, and outdated information.*
04:31 🔄 *Strategies to improve RAG performance involve optimizing various aspects, including data, retrieval algorithm, and synthesis. Techniques include storing additional information, optimizing data pipeline, adjusting chunk sizes, and optimizing embedding representation.*
06:50 📊 *Evaluation of RAG systems involves assessing both retrieval and synthesis. Retrieval evaluation includes ensuring returned content is relevant to the query, while synthesis evaluation examines the quality of the final response.*
08:30 🛠️ *To optimize RAG systems, start with "table stakes" techniques like tuning chunk sizes, better pruning, and using metadata filters integrated with vector databases.*
12:29 🧐 *Advanced retrieval methods, such as small-to-big retrieval and embedding a reference to the parent chunk, can enhance precision by retrieving more granular information.*
14:42 🧠 *Exploring more advanced concepts, like multi-document agents, allows for reasoning beyond synthesis, enabling the modeling of documents as sets of tools for tasks such as summarization and QA.*
16:23 🎯 *Fine-tuning in RAG systems is crucial to optimize specific components, such as embeddings, for better performance. It involves generating synthetic query datasets and fine-tuning on either the base model or an adapter on top of the model.*
18:15 📚 *Documentation on production RAG and fine-tuning, including distilling knowledge from larger models to weaker ones, is available for further exploration.*
Made with HARPA AI
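On the retrieval-evaluation point at 06:50, one common way to check whether returned content is relevant is to score a labeled query set with hit rate and mean reciprocal rank (MRR). The sketch below is my own illustration rather than anything shown in the talk: the query ids, chunk ids, and the assumption that each query has exactly one expected chunk are all hypothetical.

```python
def hit_rate(results, expected):
    """Fraction of queries whose expected chunk id appears anywhere in the retrieved list."""
    hits = sum(1 for qid, ranked in results.items() if expected[qid] in ranked)
    return hits / len(results)


def mean_reciprocal_rank(results, expected):
    """Average of 1/rank of the expected chunk (counted as 0 when it was not retrieved)."""
    total = 0.0
    for qid, ranked in results.items():
        if expected[qid] in ranked:
            total += 1.0 / (ranked.index(expected[qid]) + 1)
    return total / len(results)


# Hypothetical retrieval output: query id -> ranked chunk ids
retrieved = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c2", "c5", "c9"],
    "q3": ["c4", "c8", "c6"],
}
# Hypothetical ground truth: query id -> the chunk that should have been retrieved
expected = {"q1": "c1", "q2": "c2", "q3": "c0"}

hr = hit_rate(retrieved, expected)
mrr = mean_reciprocal_rank(retrieved, expected)
print(hr, mrr)
```

Synthesis evaluation (the quality of the final response) needs a separate check; these two numbers only cover the retrieval half.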
So far this is the best presentation on RAG I have come across in the last couple of months.
This is a great overview of the transformative impact of Large Language Models and the exciting developments around Retrieval Augmented Generation (RAG). Jerry Liu's talk seems like a must-watch for anyone interested in building and optimizing LLM-powered applications on private data. It's inspiring to see experts like Jerry, with his impressive background in AI research and engineering, sharing insights on how to tackle the challenges of productionizing RAG systems. Looking forward to exploring more at the AI Engineer World's Fair 2024!
Thank you very much for this. In this age of LLMs it is getting more and more important to be able to measure their accuracy and efficacy. I've been working with problems like this since the beginning of 2024 and it's been such an interesting topic to learn about.
Cheers and thx for the upload
Thank you not just for putting this together, but for making sense of it all! In 18 min!? Amazing!
I was thoroughly impressed by the depth of your insights and the clarity of your delivery. The ability of Jerry Liu to distill complex concepts into understandable terms was remarkable, and I particularly enjoyed how you illustrated the practical applications of RAG in various fields.
Would it be possible for you to share the slides from the Jerry Liu's presentation?
There wasn't anything filler. Down to the point from beginning to the end. He gave a similar talk at Silicon Valley DevFest AI Edition, I was impressed.
Your distilled video has almost no knowledge loss over hours of coursework. Great work !
I thoroughly enjoyed your presentation. Jerry Liu, thanks for the deep methods that can be applied to traditional RAG.
This is exactly what i needed, when I needed it. Big props!
Very deep talking! Really appreciate and learned a lot
I love Jerry's approach to identifying intuition and solution
Really nice presentation skills, Jerry!
Amazing video. Helped a lot !
Excellent presentation on RAG
short and sweet presentation. Very clear
Very nice presentation and very practical tips for enterprise RAGs
Thanks for Your hard-work. Really learned a lot
Thank you for this excellent presentation, very much appreciated
🎯 Key Takeaways for quick navigation:
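00:01 🎤 *Video introduction*
- Jerry introduces his company and today's topic: building production-ready RAG applications.
00:23 📚 *LLM use cases*
- Jerry mentions recent AI applications, including knowledge search, QA, conversational agents, workflow automation, and document processing.
01:03 🔍 *Two main approaches to getting LLMs to understand data*
- Retrieval augmentation: adding context from a data source to the language model's input prompt.
- Fine-tuning: embedding knowledge into the model by training its weights.
01:44 📊 *Building RAG*
- The RAG architecture consists of data ingestion and data querying, which covers retrieval and synthesis.
- Jerry recommends learning how to do data ingestion and querying to understand in depth how the components work.
03:08 🚧 *Challenges with RAG*
- Jerry describes RAG's performance challenges, including response quality, retrieval problems, stale data, and LLM issues.
- He points out problems that can arise during retrieval, such as low precision, hallucination, and low recall.
05:27 🧪 *Evaluating RAG systems*
- Discusses evaluation methods for RAG systems, including retrieval evaluation and synthesis evaluation.
- Stresses the importance of defining benchmarks to measure performance.
08:30 🧩 *Optimizing RAG systems*
- Jerry presents RAG optimization methods from basic to advanced, including tuning chunk sizes, metadata filtering, advanced retrieval, and agents.
16:23 🔄 *Fine-tuning and outlook*
- Discusses the potential benefits of fine-tuning LLMs, and the approach of using a weaker LLM to generate synthetic datasets to improve performance.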
Made with HARPA AI
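To make the metadata filtering item at 08:30 a bit more concrete, here is a small sketch of filtering chunks by structured metadata before ranking them. Everything in it is an assumption for illustration: the `Chunk` class, the `year` field, and the keyword-overlap score (a placeholder for vector similarity) are not from the talk or from any particular vector database.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [c for c in chunks if all(c.metadata.get(k) == v for k, v in required.items())]


def keyword_score(query, chunk):
    """Toy relevance score: number of shared words (placeholder for vector similarity)."""
    return len(set(query.lower().split()) & set(chunk.text.lower().split()))


def retrieve(query, chunks, top_k=1, **required):
    """Apply the metadata filter first, then rank the remaining candidates."""
    candidates = filter_chunks(chunks, **required)
    return sorted(candidates, key=lambda c: keyword_score(query, c), reverse=True)[:top_k]


chunks = [
    Chunk("Revenue grew because of cloud sales", {"year": 2021}),
    Chunk("Revenue fell because of supply issues", {"year": 2022}),
]
best = retrieve("why did revenue change", chunks, year=2022)
print(best[0].text)
```

Both chunks score the same on the text alone, so without the `year` filter the retriever has no way to prefer the 2022 chunk; that is the kind of structured context the metadata adds.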
This is an awesome video 🎉
Are there any take-aways here that can help an average user generate better results using a standard UI?
12:56 interesting expanding on smaller chunks
wow thanks for the presentation
He speaks like the guys in "The Californians". I keep expecting him to say "turn right on Ocean, left on Pico all the way to ...."
Awesome rundown!
Can anyone share the presentation link mentioned at 5:35?
Any luck on this?
Nope
docs.google.com/presentation/d/1GWjchMiY0LQ8Bc8e7NAkutOzpaTsfn487XHwbGIqKvo/mobilepresent?slide=id.p
ua-cam.com/video/ua93WTjIN7s/v-deo.htmlsi=Kp0VrpPkDuVJ_HGC it’s concerning none of you could find this
Was anyone able to open the Colab link that was mentioned in one of the slides? If yes, could you share the link, please?
The V stands for cmd/ctrl V
Thank you
I use the hyper-naive approach: Provide the LLM with all the knowledge keys in my MySQL DB and let it tell me which ones are most likely to be helpful for answering the current prompt. Then just load the entries based on the keys the LLM told me and inject them into the second prompt, which the LLM is then supposed to answer. (Yes, vector search would be way more fitting for this, but I'm a peasant and don't have the slightest clue how to implement it)
It's five lines of code in the LlamaIndex docs. Works well out of the box for simple data.
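If it helps, the core of vector search is small enough to sketch without any library. This is not the LlamaIndex code the reply refers to: the vocabulary-count `embed` function below is a toy stand-in for a real embedding model, and the keys and texts are invented to mimic rows in a key/value table like the MySQL setup described above.

```python
import math


def build_vocab(texts):
    """Map every distinct word in the corpus to a fixed vector position."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text, vocab):
    """Toy embedding: normalized word counts (assumption: a real model would go here)."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def search(query, rows, vocab, top_k=2):
    """Return the keys of the rows whose vectors are most similar to the query vector."""
    q = embed(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, vec)), key) for key, vec in rows]
    return [key for _, key in sorted(scored, reverse=True)[:top_k]]


# Hypothetical knowledge entries, keyed the way they might be in a database table
entries = {
    "refund_policy": "refunds are issued within 14 days of purchase",
    "shipping": "orders ship in two business days",
    "warranty": "hardware warranty lasts one year",
}
vocab = build_vocab(entries.values())
rows = [(key, embed(text, vocab)) for key, text in entries.items()]  # computed once, stored with each row
keys = search("refunds within 14 days", rows, vocab, top_k=1)
print(keys)
```

Compared with asking the LLM to pick keys, the vectors are computed once at ingestion time, so each query costs one embedding plus a similarity scan instead of a full extra LLM call.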
I am using the GPT-2 model. Its responses are correct, but the response time is about 10 seconds on my local PC and 35 seconds on the EC2 server. Can you tell me how to reduce the response time? You could share a server configuration, or recommend a good GPT-2 variant or a smaller model.
Great one!
impressed
RAG is an interesting idea. If the predictions are right and these models are only going to get better, wouldn’t it make sense to give them direct access to the embedding DB and let the model decide how best to handle retrieval rather than having the humans do it?
No, but that’s the whole point of human feedback and RLHF. It would be great to give the LLM full access to the DB, but then their coherent biases would eventually lead to overfitting.
can I have these slides somewhere ?
Compact info, thank you!
What is the process if I want to query chat data from cloud MongoDB using an LLM and RAG?
Can I get the presentation ?
More like this
I still haven't managed to find an argument for RAG over LoRA. RAG's biggest Achilles' heel is context size. It almost seems to me to be a band-aid, especially when a year from now context size may not even be an issue. We can spend months perfecting our RAG pipeline and end up throwing it all away a month later due to it being redundant.
Pretty sure RAG avoids hallucination much better than LoRA does. Fine-tuning is good for changing the language style but doesn’t necessarily work best when you're looking for specific info, from the way I understand it. Also, RAG allows you to plug in different data without having to go back and re-fine-tune your model with every update.
@namankapasi6463 I have noticed with LoRA you don't get back the specifics of the trained data, but rather an interpreted version of it (which in my experiments has been jaw-dropping). If RAG functions more like a search engine then I can see how these could both be useful. So my guess, after reading your reply, is LoRA would be suited to emulating specific writing styles and RAG would be good for technical data retrieval or for extracting paragraphs from text with references? Makes sense then, since you would probably only need to train in a specific writing style once.
Even so, when context size increases dramatically will we still use RAG and not just add the content into the main prompt as is? Or does the vector process make the entire process more efficient, regardless?
I mean, you can think of RAG as restricting your output to the data that you're giving it: the user makes a request to the model, the model looks at the vector database and responds from the database first. Not saying I’m an expert, but I'm 99% sure. Also, in regards to efficiency, higher context windows are expensive and repetitive, so I’d avoid them. Even though OpenAI caching is pretty good, this is not the case for a lot of open-source models.
@namankapasi6463 Ok great. Thanks for shedding some extra light here.
You need to do both for optimal performance. Everything you put in RAG should be data that may need to change in real time, e.g. price lists, spec sheets, latest instruction manuals, product updates, etc.
Most everything else you can fine-tune. However, if you plan on running sizable projects your fine-tuning could take days or even weeks. If you have to constantly adjust your fine-tuning, this is not very practical. Therefore you may wish to move part of your data into RAG.
Additionally, you need to play with chunks in order to better organize your training data. Of course, much depends on your project.
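On "playing with chunks": a simple place to start is a word-window splitter where chunk size and overlap are the two knobs to tune. This is a rough sketch under my own assumptions (word-based splitting, and the default values for `chunk_size` and `overlap` are arbitrary), not a recommendation from the talk.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into word windows of chunk_size that share `overlap` words with the previous window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(20))  # placeholder document of 20 words
chunks = chunk_words(text, chunk_size=8, overlap=2)
for chunk in chunks:
    print(chunk)
```

Re-running the same evaluation set over a few `chunk_size` and `overlap` settings is a cheap way to see which combination suits a given project.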
what music is that by the way?
this is pretty deep
All the documentation became obsolete in a couple of months. Since I can't find useful examples for the current version, I'm moving to LangChain.
All these videos today start with a cyberpunk theme music
When somebody who looks like 19 says that Information Retrieval is already one or two decades old, I feel so old 😂 Come on, Lucene is already more than 20 years old 😅
I just want an LLM to read my google docs and let me ask questions about stuff, then use it to write and add into my drive
Seems straightforward to me. Just encode your docs into vector embeddings. Then search for whatever you need, and you can use the information to write stuff by creating appropriate prompt templates depending on what you want it to write.
Search using any LLM. You can use openai or the ones on hugging face
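As a rough illustration of the prompt-template step, the sketch below stuffs already-retrieved chunks into a template before the text is sent to whichever LLM you use. The template wording, the `build_prompt` helper, and the `max_chars` budget are all my own assumptions; the retrieval itself and the actual LLM call are left out.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "If the context is not enough, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question, retrieved_chunks, max_chars=1000):
    """Number the retrieved chunks and add them to the template until the character budget is used up."""
    parts, used = [], 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the (hypothetical) context budget
        parts.append(entry)
        used += len(entry)
    return PROMPT_TEMPLATE.format(context="\n".join(parts), question=question)


# Hypothetical chunks that a vector search over your docs might have returned
prompt = build_prompt(
    "When is the report due?",
    ["The report is due on Friday.", "Lunch is at noon."],
)
print(prompt)
```

Swapping the template text is how the same retrieved context can be reused for answering, summarizing, or drafting new documents.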
@deeghalbhaumik3779 found LM Studio and embedding models. This is working now
Google’s NotebookLM does this exactly
Nice intro music 😂
I wish they didn't use the term QA for question answering and used "Q&A" instead. leads to a lot of confusion with those of us developing production grade systems that require quality assurance :)
Are you working on an AI-based quality assurance / quality audit system? Would love to connect and work together
Are the comments AI-generated? They seem like variants of the same glowing, effusive prompt.
Why does every tech bro speak as if every comment is cooler when in the tone of a question?
Llama Index has poor documentation despite claims to the contrary and causes dependency conflicts off the bat.
Million Things to do = initiative time before going to prod :-/
Feel like MSFT copilot is the RAG killer…
What are you trying to say?
AI basically consumes data like your body consumes a large cube of paneer, breaking it into smaller pieces and digesting it using stomach juices to know it is paneer. I think he is sharing tips so that the AI calls paneer paneer and not potato, bro.
Don't wear a hat next time, you didn't come to fashion show. These are serious world changing talks. I didn't get anything because of the hat 🙄
You have a very low IQ if a hat can throw you out this much
Short and Precise
thank you