How to evaluate an LLM-powered RAG application automatically.

  • Published Dec 3, 2024
  • Source code of this example:
    github.com/svp...
    Giskard library: github.com/Gis...
    I teach a live, interactive program that'll help you build production-ready machine learning systems from the ground up. Check it out here:
    www.ml.school
    To keep up with the content I create:
    • Twitter/X: / svpino
    • LinkedIn: / svpino

COMMENTS • 61

  • @aleksandarboshevski
    @aleksandarboshevski 8 months ago +14

    Hey Santiago! Just wanted to drop a comment to say that you're absolutely killing it as an instructor. Your way of breaking down the code and the whole process into simple, understandable language is pure gold, making it accessible for newcomers like me. Wishing you all the success and hoping you keep blessing the community with your valuable content!
    Aside from the teaching side, have you tried creating a micro-SaaS based on these technologies? It seems to me you are halfway there, and it could be a great opportunity to expand your business.

    • @underfitted
      @underfitted  8 months ago +1

      Thanks for taking the time and letting me know! I have not created any micro-SaaS applications, but you are right; that could be a great idea.

  • @TooyAshy-100
    @TooyAshy-100 8 months ago +5

    THANK YOU
    I greatly appreciate the release of the new videos. The clarity of the explanations and the logical sequence of the content are exceptional.

  • @liuyan8066
    @liuyan8066 8 months ago +1

    Glad to see you included pytest at the end; it is like a surprise dessert 🍰 after a great meal.
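
The pytest step this comment refers to can be sketched as a regression harness over the generated testset: every question/reference-answer pair becomes one check. Everything below is a hypothetical stand-in (the names `TESTSET`, `answer`, and `judge_matches` are invented for illustration; in the video the testset comes from Giskard and the judge is an LLM, not a word-overlap heuristic):

```python
# Minimal regression harness over a generated testset (illustrative only).

TESTSET = [
    {"question": "Where are the embeddings stored?",
     "reference_answer": "The embeddings are stored in Pinecone."},
    {"question": "Which library builds the pipeline?",
     "reference_answer": "The pipeline is built with LangChain."},
]

def answer(question: str) -> str:
    """Stand-in for the real RAG chain (retrieve context, call the LLM)."""
    canned = {c["question"]: c["reference_answer"] for c in TESTSET}
    return canned[question]

def judge_matches(candidate: str, reference: str) -> bool:
    """Stand-in for an LLM judge: a crude word-overlap check."""
    ref = {w.strip(".,").lower() for w in reference.split()}
    cand = {w.strip(".,").lower() for w in candidate.split()}
    return len(ref & cand) / len(ref) >= 0.5

def run_testset(answer_fn, testset, judge):
    """Return the questions the judge rejects; an empty list means all passed.
    Under pytest you would instead parametrize one test per case, so each
    failure is reported individually."""
    return [c["question"] for c in testset
            if not judge(answer_fn(c["question"]), c["reference_answer"])]

print(run_testset(answer, TESTSET, judge_matches))  # → [] when every case passes
```

With pytest, the same loop becomes a single `test_` function decorated with `@pytest.mark.parametrize("case", TESTSET)` that asserts the judge's verdict for each case.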

  • @mohammed333suliman
    @mohammed333suliman 8 months ago +3

    This is my first time watching your videos. It is great. Thank you.

  • @TPH310
    @TPH310 8 months ago +1

    We appreciate your work a lot, my man.

  • @AmbrishYadav
    @AmbrishYadav 4 months ago

    Thanks! Exactly what I was looking for. I've been cracking my head over how the hell to test a RAG system: how is the business going to give me 1000+ questions to test, and how can a human verify the responses? Top content.

  • @TheScott10012
    @TheScott10012 8 months ago +6

    FYI, keep an eye on the mic volume levels! Sounds like it was clipping

    • @underfitted
      @underfitted  8 months ago

      Thanks. You are right. Will adjust.

  • @peterhjvaneijk1670
    @peterhjvaneijk1670 4 months ago

    Love the video. Great breakdown. I would like to see more detail on the evaluation results (e.g., it is now 0.73; is that good? WTH...!?), on how tweaking the pipeline changes the eval results, and, for example, Ragas versus Giskard.

  • @tee_iam78
    @tee_iam78 5 months ago

    Thanks!

  • @dikshantgupta5539
    @dikshantgupta5539 6 months ago +1

    Oh man, the way you explained these complex topics is mind-blowing. I just wanted to say thank you for making these types of videos.

  • @tee_iam78
    @tee_iam78 5 months ago

    Superb video. Great content from start to finish. Thank you.

  • @maxnietzsche4843
    @maxnietzsche4843 5 months ago

    Damn, you explained each step really well! Love it!

  • @bald_ai_dev
    @bald_ai_dev 7 months ago +3

    Great stuff!
    What are your preferred open source alternatives to all tools used in this tutorial?

  • @VikasChaudhary-x1y
    @VikasChaudhary-x1y 6 months ago +3

    Hello Santiago, your explanation was thorough and I understood it really well. Now I have a question: is there any tool other than Giskard (one that is open source and does not require an OpenAI API key) to evaluate my LLM or RAG model?
    Thank you in advance 😊

  • @CliveFernandesNZ
    @CliveFernandesNZ 8 months ago +3

    Great stuff Santiago! You've used Giskard to create the test cases, and those test cases are themselves created using an LLM. In a real application, would we have to manually vet the test cases to ensure they themselves are 100% accurate?

  • @alextiger548
    @alextiger548 6 months ago

    Super important topic you covered here man!

  • @aliassim8774
    @aliassim8774 7 months ago +1

    Hey Santiago, thank you for this course, in which you explained all the concepts of RAG evaluation in a very clear way. However, I have a question about the reference answers. How were they generated, and based on what (is it an LLM)? If so, say we have a question that requires specific information that exists only in the knowledge base; how can another LLM generate such an answer? And how do we know that the reference questions are correct and are what we are looking for? Thank you in advance.

  • @litan5006
    @litan5006 3 months ago

    Great video on LLMs and RAG.

  • @MohammadEskandari-do6xy
    @MohammadEskandari-do6xy 8 months ago +1

    Amazing! Can you also explain how to do the same type of evaluation on Vision Language Models that use images?

  • @theacesystem
    @theacesystem 8 months ago +1

    Just awesome instruction, Santiago. I am a beginner, but you make learning digestible and clear! Sorry if this is an ignorant question, but is it possible to substitute FAISS, Postgres, MongoDB, Chroma DB, or another free open-source option for Pinecone to save money, and if so, which would you recommend for ease of implementation with LangChain?

    • @underfitted
      @underfitted  8 months ago

      Yes, you can! Any of them will work fine. FAISS is very popular.
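
The reason any of those stores can replace Pinecone is that the RAG pipeline only needs two operations from its vector store: add embedded documents, and return the ones nearest to a query embedding. A toy in-memory sketch of that interface (illustrative only; FAISS, Chroma, and pgvector implement the same idea with real indexes and real embeddings):

```python
import math

class TinyVectorStore:
    """Minimal in-memory vector store: add embedded texts, search by cosine
    similarity. Stands in for Pinecone/FAISS/Chroma in this sketch."""

    def __init__(self):
        self.items: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], text: str) -> None:
        self.items.append((embedding, text))

    def search(self, query: list[float], k: int = 2) -> list[str]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self.items, key=lambda it: cosine(it[0], query),
                        reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([1.0, 0.0], "doc about testing")
store.add([0.0, 1.0], "doc about deployment")
print(store.search([0.9, 0.1], k=1))  # → ['doc about testing']
```

In LangChain, swapping stores is typically just a change in how the vector store object is constructed; the retriever interface layered on top stays the same, which is why any of the options above will work.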

  • @kloklojul
    @kloklojul 4 months ago +3

    You are using an LLM to create a question, an LLM to generate another answer, and then you let an LLM evaluate both answers. But how do you evaluate the output of the initial tests? At that point you are trusting the facts of an LLM by trusting the answers of the LLM.

  • @arifkarim768
    @arifkarim768 6 months ago

    Explained amazingly.

  • @PratheekBabu
    @PratheekBabu 4 months ago

    Thanks for the amazing content. Can we use Giskard without an OpenAI key?

  • @sabujghosh8474
    @sabujghosh8474 8 months ago

    It's awesome. We need more walkthroughs of open-source models.

  • @AliMohammadjafari97
    @AliMohammadjafari97 28 days ago

    Thank you for your video; it is very helpful. How can we use Giskard if we want to use a local LLM in our RAG system, like Llama 3?

  • @francescofisica4691
    @francescofisica4691 6 months ago +1

    How can I use Hugging Face LLMs to generate the test set?

  • @dhrroovv
    @dhrroovv 5 months ago +1

    Do we need a paid subscription to the OpenAI APIs to be able to use Giskard?

  • @proterotype
    @proterotype 7 months ago

    This is so well done

  • @maxisqt
    @maxisqt 8 months ago +4

    So one thing you learn training ML models is that you don't evaluate your model on training data, and you are careful about data leakage. Here, you're providing Giskard your embedded documentation, meaning Giskard is likely using its own RAG system to generate test cases, which you then use to evaluate your own RAG system. Can you please explain how this isn't nonsense? Do you evaluate the accuracy of the Giskard test cases beyond the superficial "looks good to me" method that you claim to be replacing? What metrics do you evaluate Giskard's test cases against, since its answers are also subjective? You're just entrusting that subjective evaluation to another LLM.

    • @maxisqt
      @maxisqt 8 months ago

      Perhaps the purpose of testing in software development is different from testing in ML. In software engineering you're ensuring that changes made to a system don't break existing functionality; in ML you test on data your model hasn't trained on to prove it generalises to unseen, novel samples, since that's how it will have to perform in deployment. Maybe the tests here fit into the software engineering bucket, and therefore LLMs may be perfectly capable of auto-generating test cases; and since we aren't trying to test how well the generated material "generalises" (that doesn't make sense in this context), that's okay... I'm a little confused.

    • @maxisqt
      @maxisqt 8 months ago +1

      I’m new to gen ai, background in ML some years back, apologies if I come off hostile or jaded.

    • @mikaelhuss5080
      @mikaelhuss5080 8 months ago

      @@maxisqt I think these are good questions, actually. Maybe the way to think about RAG, at least in the present scenario, is that it is really a type of information retrieval and there is no need to generalise, as you say: we just want to be able to find relevant information in a predefined set of documents.

    • @u4tiwasdead
      @u4tiwasdead 8 months ago

      The way that frameworks like Giskard try to solve the problem of evaluating LLMs/RAG with LLMs that are not necessarily better than the ones being evaluated is through the way the test sets are generated.
      To give one example, the framework might ask an LLM to generate a question-and-answer pair, then ask it to rephrase the question to make it harder to understand without changing its meaning or its answer. It will then ask the LLM under test the harder version of the question and compare the response to the original answer. This can work even though their LLM is not necessarily more powerful than yours, because rephrasing an easy question into a hard one is an easier problem than interpreting the hard question.
      (A good analogy might be that a person can create puzzles that are hard to solve even for much smarter people by starting from the solution and working backwards to the question.)
      Note that the test data does not need to be perfect; it just needs to be generally better than the outputs we will get from our models/pipelines. The point of these tools is not to evaluate whether the outputs we are getting are actually true, but simply whether they improve when we make changes to the pipeline.
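
The scheme described in this reply can be sketched as a control-flow skeleton. All the callables below are trivial deterministic stand-ins (the real pipeline replaces each one with a prompted LLM call; none of these names come from Giskard's actual API):

```python
def evaluate_pipeline(docs, generate_pair, rephrase, pipeline, judge):
    """For each document: build an easy Q/A pair, harden the question,
    ask the system under test, and let a judge compare the response
    against the original (easy) answer. Returns the pass rate."""
    verdicts = []
    for doc in docs:
        question, reference = generate_pair(doc)
        hard_question = rephrase(question)
        candidate = pipeline(hard_question)
        verdicts.append(judge(candidate, reference))
    return sum(verdicts) / len(verdicts)

# Toy instantiation so the control flow runs end to end.
docs = ["Paris is the capital of France.", "FAISS is a vector index."]
rate = evaluate_pipeline(
    docs,
    generate_pair=lambda d: (f"Which fact is stated here: {d}", d),
    rephrase=lambda q: q.upper(),  # a 'harder' surface form of the question
    pipeline=lambda q: q.split(": ", 1)[1].capitalize(),  # toy system under test
    judge=lambda cand, ref: cand.strip(".").lower() == ref.strip(".").lower(),
)
print(rate)  # → 1.0 (the toy pipeline passes both cases)
```

The point of the sketch is the asymmetry the comment describes: `rephrase` only has to make an easy question harder, which is a cheaper problem than answering the hard question from scratch.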

    • @trejohnson7677
      @trejohnson7677 8 months ago +1

      Ouroboros

  • @sridharm4254
    @sridharm4254 6 months ago

    Very useful video. Thank you.

  • @StoryWorld_Quiz
    @StoryWorld_Quiz 8 months ago +1

    How does the GPT instance that generates the questions and answers know the validity of those answers? If they are actually accurate, why would you build the RAG system in the first place, when you could create a GPT instance that is accurate enough (using one simple prompt: 18:33, the agent description)? I don't understand; can someone explain, please? Do you see the paradox here?

    • @mehmetbakideniz
      @mehmetbakideniz 7 months ago

      Because GPT-4 is quite expensive, you wouldn't want to use it in production if GPT-3.5 or any other open-source model does the job correctly. This library uses GPT-4, as the best available LLM, to produce the reference RAG answers. That is why they use it as the test baseline: to see whether, for your specific application, a cheaper or free open-source model is more or less okay.

    • @mahimahesh1945
      @mahimahesh1945 1 month ago

      The primary purpose of Retrieval-Augmented Generation (RAG) is to enable the development of applications tailored to your enterprise. Foundational models like GPT-3.5 or GPT-4 are not specifically trained on your enterprise data, so to adapt an LLM effectively for your organization, RAG or fine-tuning may be necessary. This allows the model to interact with and utilize your enterprise data seamlessly.
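
The retrieval-then-augmentation loop this reply describes can be sketched in a few lines. The knowledge base and retriever here are hypothetical toys (keyword overlap instead of embeddings, and no actual LLM call):

```python
KNOWLEDGE_BASE = [
    "Refund requests must be filed within 30 days.",
    "Support is available Monday through Friday.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Toy retriever: rank documents by words shared with the question.
    Production systems rank by embedding similarity instead."""
    q_words = set(question.lower().replace("?", "").split())
    def overlap(doc: str) -> int:
        return len(set(doc.lower().rstrip(".").split()) & q_words)
    return sorted(KNOWLEDGE_BASE, key=overlap, reverse=True)[:k]

def build_prompt(question: str) -> str:
    """Augment the prompt with retrieved enterprise data, so the LLM can
    answer from context it was never trained on."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("When is support available?"))
```

The final step (not shown) sends that prompt to the model; the evaluation discussed in the video then checks whether the retrieved context and the model's answer line up with a reference.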

  • @gauravpratapsingh8840
    @gauravpratapsingh8840 5 months ago

    Hey, can you make a video that uses an open-source LLM to build a Q/A chatbot for a website page?

  • @aryamasingh3413
    @aryamasingh3413 2 months ago

    I loved it!

  • @fintech1378
    @fintech1378 8 months ago +1

    How new is this Giskard?

  • @AjayKumar-hs2li
    @AjayKumar-hs2li 27 days ago

    Amazing!

  • @theacesystem
    @theacesystem 8 months ago

    That's great. You rock!!!

  • @utkarshgaikwad2476
    @utkarshgaikwad2476 8 months ago +1

    Is it OK to use generative AI to test generative AI? What about the accuracy of Giskard? I'm not sure about this.

    • @underfitted
      @underfitted  8 months ago

      The accuracy is as good as the model they use (which is GPT-4). And yes, this is how you can test the result of a model.

  • @ergun_kocak
    @ergun_kocak 8 months ago

    This is gold ❤

  • @pratheekbabu272
    @pratheekbabu272 4 months ago

    Can you do one using Gemini Pro?

  • @caesarHQ
    @caesarHQ 8 months ago

    Hi, excellent tutorial; wouldn't anticipate any less. I ran your notebook with an open-source LLM; however, generating the test set with giskard.rag calls the OpenAI API (timestamp 19:11). Any workaround?

    • @underfitted
      @underfitted  8 months ago +1

      Giskard will always use GPT-4 regardless of the model you use in your RAG app.

  • @sbacon92
    @sbacon92 4 months ago

    What happens when you take away OpenAI and a module?
    Can you build this with a local model and your own code?

  • @datascienceandaiconcepts5435
    @datascienceandaiconcepts5435 3 months ago

    Nice explanation.

  • @JonathanLoscalzo
    @JonathanLoscalzo 5 months ago

    I think all the "AI experts" in the wild just "explain" common concepts of AI/LLM systems. It would be nice to dig into other aspects a bit more, like evaluation (a good choice here). It would be interesting to have some relevant courses on that. I know it is the secret sauce, but it could be useful.
    BTW, are you teaching causal ML in your course?

    • @underfitted
      @underfitted  5 months ago +1

      I'm not teaching causal ML, no. The program focuses on ML engineering.

    • @JonathanLoscalzo
      @JonathanLoscalzo 5 months ago

      @@underfitted I want to do it, but I don't have time. I hope there will be more cohorts in the near future.

  • @not_amanullah
    @not_amanullah 8 months ago

    Thanks ❤

  • @JTMoustache
    @JTMoustache 8 months ago

    Langchain sucks