Talk to your CSV & Excel with LangChain
- Published 4 Oct 2024
- Colab: drp.li/nfMZY
In this video, we look at how to use LangChain Agents to query CSV and Excel files. This allows you to have all the searching power of a tool like Pandas but done through natural language using an LLM to help.
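For readers following along, here is a rough sketch of what the CSV agent does under the hood: the file is loaded into a pandas DataFrame, and the LLM writes pandas expressions that a Python tool executes. The data and question below are made up; this is the hand-written equivalent of the code the model generates, not the LangChain API itself.

```python
import pandas as pd
from io import StringIO

# Made-up CSV standing in for the file you would pass to create_csv_agent
csv_text = """Name,Age
Alice,34
Bob,28
Carol,41
"""
df = pd.read_csv(StringIO(csv_text))

# For "how many people are older than 30?", the agent's scratch pad would
# produce and run a one-liner along these lines:
answer = int((df["Age"] > 30).sum())
print(answer)  # 2
```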
My Links:
Twitter - / sam_witteveen
Linkedin - / samwitteveen
Github:
github.com/sam...
github.com/sam...
#LangChain #BuildingAppswithLLMs - Science & Technology
You are producing great content that's showing me how to exploit GPT. Thanks.
A good idea for a new video might be LangFlow, a GUI-based tool for LangChain.
i was not aware of this -- cool!
Great stuff Sam. Quick question - How do we improve the model if it answers a question incorrectly? Is there a "training" mechanism or reward function to let them know it was incorrect?
(just seeing this now) Not really; you can fine-tune the LLM for this task, but that isn't a guarantee.
Good article with a workable example. Great work.
hey Sam, great video and content in general, just a quick question, how would you go about adding short term memory to a chain with Dataframe/CSV? The dataframe or csv agents have no parameter for MemoryBuffer. There are ways to read the csv or dataframe using a separate loader, but how do you incorporate it into a chain with an llm, prompt and most importantly, a memory buffer? I am trying to make it remember the questions I asked before (memory in the same chat instance, not historically - e.g. when you correct a question the llm does not understand, "I meant X")
Thanks much
Hey, I am also looking for similar functionality. Did you find anything for it? Apparently we can use the Conversational Memory Buffer, but it seems it doesn't integrate well with this csv_agent.
Hello guys. I am also working on a similar use case. Any solution you guys found?
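Since create_csv_agent exposes no memory parameter, one lightweight workaround for the "I meant X" problem above is to replay recent turns yourself, prepending them to each question before it reaches the agent. TinyMemory below is illustrative, not a LangChain class:

```python
class TinyMemory:
    """Keep the last k (question, answer) turns and prepend them to a query."""

    def __init__(self, k=3):
        self.turns = []  # list of (question, answer) pairs
        self.k = k       # how many recent turns to replay

    def add(self, question, answer):
        self.turns.append((question, answer))

    def wrap(self, question):
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns[-self.k:])
        return f"{history}\nQ: {question}" if history else question

mem = TinyMemory()
mem.add("How many rows are there?", "200")
# The follow-up now carries the context needed to resolve "I meant X":
print(mem.wrap("I meant only rows from 2020"))
```

You would then pass the wrapped string to the agent's run call instead of the raw question.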
Sam can you make a video showing how to get a reply as a Plotly chart? or a PyVis with networkx graph?
One of my previous vids showed getting replies as triples, which you can use in NetworkX. Might look at making something more advanced like that.
Great stuff Sam. Looks like those legacy Excel spreadsheets with macros and multiple indexes still require plenty of cleaning and preprocessing before we can use any agent on them
Yes treating the doc as a spreadsheet/table and not a csv file is actually quite different. The spreadsheet way is being baked into Google Sheets and Excel so I wonder how much of a market there is for an open source system. Would love to hear your opinion.
can please tell me how can we use pinecone into this to store memory
If you figured it out please tell me i am intrigued
Hi Sam,
Great and very helpful video, thanks.
I have a question.
My CSV has many columns, and there is another CSV that contains the definition of each column. How do I handle such a case and still be able to ask questions on the CSV?
Vikkas
You can try feeding that info in via the prompt. Just try to keep it concise.
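One concrete way to feed the data-dictionary CSV in via the prompt: build a short column glossary and prepend it to each question. The file contents below are made up for the sketch:

```python
import pandas as pd
from io import StringIO

# Stand-in for the second CSV that defines each column
dictionary_csv = "column,definition\nqty,Units sold per day\nrev,Revenue in USD\n"
defs = pd.read_csv(StringIO(dictionary_csv))

# Condense the dictionary into one compact line to keep token usage down
glossary = "; ".join(f"{r.column} = {r.definition}" for r in defs.itertuples())

question = "Which day had the highest rev?"
prompt = f"Column meanings: {glossary}\n{question}"
print(prompt)
```

The combined string is what you would hand to the agent in place of the bare question.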
What if the Excel file has multiple sheets? Will it work?
If you figured it out pls lmk
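One approach for multi-sheet workbooks: pd.read_excel(path, sheet_name=None) returns a dict of DataFrames, one per sheet, so you can split the workbook into per-sheet CSVs that the agent understands. A stand-in dict is used below so the sketch runs without a workbook on disk:

```python
import pandas as pd

# What pd.read_excel("report.xlsx", sheet_name=None) would give you:
# a {sheet_name: DataFrame} dict (data here is made up)
sheets = {
    "sales": pd.DataFrame({"region": ["EU", "US"], "total": [10, 20]}),
    "costs": pd.DataFrame({"region": ["EU", "US"], "total": [4, 7]}),
}

paths = []
for name, df in sheets.items():
    path = f"{name}.csv"           # one CSV per sheet
    df.to_csv(path, index=False)
    paths.append(path)
print(paths)  # ['sales.csv', 'costs.csv']
```

Each resulting CSV can then be handed to its own agent, or you can pick the relevant sheet per question.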
Hi Sam, thank you for this great tutorial. If possible, can you also show us how to use HuggingFace models for the csv agent? Also, do you have any recommendation which LLMs from Huggingface is great for this kind of task? Look forward to hearing from you soon.
Hey hi @TienPham-rx6gk
did you find any solution?
I am looking for an open source pre-trained model too which can do this task?
did you find any on hugging face?
Is it possible to use Matplotlib or Seaborn to display data visualizations as additional output after we query the data? Do you think this would work?
Yeah possibly better to try doing it as a custom tool with an OpenAI Function
Hey sam,
Great video! Can i achieve the same using Mistral or Llama 2?
With some of the finetunes of Mistral you should be able to get some OK results.
@@samwitteveenai thanks. Will try it out
Hey hi sam, I have one main question.
Is there any open-source model where I can do the same thing ?
or is there any open-source even close to doing what you have done here ? maybe I can fine tune and use that.
What is the strategy for handling a large CSV file, for example over 800K?
If you figured it out pls lmk
The videos are great. Very helpful. I've a question. After loading the csv file using CSVLoader, which custom chain/agent I can use? Can you share some insights on that? Share any reference/notebook if possible.
🎯 Key Takeaways for quick navigation:
00:00 🗂️ *Introduction to LangChain for querying CSV and Excel files*
- Overview of using LangChain with OpenAI models to extract data from CSV and Excel files.
01:25 🔒 *Security considerations for CSV agent*
- The CSV agent runs a Python agent under the hood, caution advised for prompt injection attacks.
02:22 🛠️ *Setting up the CSV agent with OpenAI language model*
- How to create a CSV agent and configure it to minimize hallucination by setting the temperature to zero.
03:48 📊 *Understanding the CSV agent's prompt and scratch pad*
- Explanation of the CSV agent's prompt structure and the use of a scratch pad for iterative language model calls.
05:14 🤔 *Asking the CSV agent simple and complex questions*
- Demonstrating the CSV agent's ability to answer simple queries like row counts and more complex ones involving data filtering.
07:32 🔄 *Using LangChain with Excel files and custom agents*
- Converting Excel files to CSV for use with LangChain and the possibility of creating custom agents for specific tasks.
09:22 🎓 *Conclusion and practical applications of LangChain*
- Summarizing the capabilities of LangChain for non-technical users to query data and the invitation for feedback and subscription.
Made with HARPA AI
Fantastic video, Sam. I’m going to try this but use a pdf instead.
I have some chat your docs vids coming, but they keep getting delayed by LLMs getting released every day :D
@@samwitteveenai are you just using pure langchain for it?
Great video Sam … I had one question - Could you please tell me how to change the agent.agent.llm_chain.prompt.template ? I will be very grateful to you if you can help me out as I am just starting to learn LangChain
could you make a video on how to correctly use a csv_agent in langchain with alpaca? I have tried the approach you showed with Alpaca and it doesn't seem to produce good results at all, so I would be curious to see how you go about it
Would it be capable of doing (complex) joins between SQL tables to answer arbitrary predicate logic questions using a database?
Probably not, but give it a few years. Scary.
To some extent if you make it aware of the tables, I've had more luck with text2sql
you can use the SQL Agent for that so you get SQL queries and not pandas etc. I might make a vid of that soon.
Hey, I ran into an issue which I found quite weird. create_csv_agent worked for me as in the video, but then suddenly I started getting an error while running the same code as before on the same file. The error was a token limit error. It's only a 157-row CSV file, and again, it worked before on the same file, but suddenly, even upon restarting the kernel and reloading everything, it will not query because of this error. Anyone run into this weird issue?
I have this issue as well but I have not been able to resolve it. Did you ever find a solution?
Good Video..
I have a doubt: you have taken a dataset where all columns are integers. What if the columns have strings or characters?
Great video. BTW, I could not extract the prompt from the agent using the code specified in this video. It was throwing an error.
how can i only print out the final answer?
set verbose = False
This can affect the jobs of many data workers and analysts. How can they best protect themselves?
I think, like many areas, the need for people with a surface amount of knowledge may decline, but there will still be a need for people with deep knowledge.
@@samwitteveenai How deep though? Didn't GPT 4 just pass a medical licensing exam with flying colors? I think you could potentially pivot into areas that have to do with AI, because undoubtedly many new jobs will be created from this. Many people will be left behind though.
This is exactly what I needed, but can I use something more secure than LangChain? For example, Voiceflow on top of ChatGPT? My customer is very sensitive about data protection. Thanks a lot for answering.
Could it be that the CSV agent always summarizes the text? I have this "Comment" field in my CSV, and when I asked for the value of that field in one of the rows, it returned a summary of that comment, not the comment itself 🤔.
The original comment: The products arrived in good condition, but the delivery was delayed more than expected and the customer service did not provide me with a clear solution regarding the matter.
The value returned by the agent: The products arrived in good condition, but the delivery was very slow.
Great Video. Your sessions are super
Thanks, I appreciate that!
Excellent video Sam. I too have a question: let's say I wanted to add to the csv_agent prompt, i.e. tell it how it should handle date periods like "last week", specifying it to use today as the end of the period and ignore all future dates. Is there any way to extend the csv_agent, or do you have to write a custom agent?
You could probably do this just by overwriting the Prompt to add it in there. See how I get the prompt to show what it is and then just assign it to that variable.
@@samwitteveenai thanks Sam, that exactly what I did! Appreciate you commenting back mate.
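The edit described in this thread is plain string surgery. The sketch below works on a stand-in template, since the real one comes from `agent.agent.llm_chain.prompt.template` in video-era LangChain and newer releases expose the prompt differently:

```python
# Stand-in for what agent.agent.llm_chain.prompt.template might contain
template = (
    "You are working with a pandas dataframe in Python.\n"
    "Question: {input}\n"
    "{agent_scratchpad}"
)

# Splice the date-handling rule in just before the question slot
date_rule = (
    "Interpret 'last week' as the 7 days ending today and ignore "
    "all future dates.\n"
)
template = template.replace("Question:", date_rule + "Question:")
# then assign it back: agent.agent.llm_chain.prompt.template = template
print("last week" in template)  # True
```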
Me too
I approached this slightly differently by converting CSV/Excel files into SQL tables(named by name of csv). Then using the SQL agent instead of CSV agent, as GPT is well-trained for SQL queries.
There is one downside: the SQL tables do not have the correct schema for the columns. Do you see any other issues arising out of it?
What do you mean it does not have the correct schema? All SQL columns have names and data types.
I think the key thing with all of these is to experiment and see what works best for you own situation. I may make a video of the SQL Agent as well, it is also very cool.
I would love to do this as well since I'm well versed in SQL and all our data is in SQL server. It would be nice to use Wolfram alpha or JavaScript libraries to generate charts or nice looking tables if the user of our chat bot wants it
@@samwitteveenai I'll be waiting for that 👍👍👍
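A minimal sketch of the CSV-to-SQL route described in this thread, using SQLite and pandas (data made up). Note that pandas does infer column types when writing, which covers the schema worry for simple cases:

```python
import sqlite3
import pandas as pd
from io import StringIO

# Stand-in CSV; in practice you would read the real file from disk
csv_text = "name,price\nwidget,9.5\ngadget,12.0\n"
df = pd.read_csv(StringIO(csv_text))       # price is parsed as float

conn = sqlite3.connect(":memory:")
df.to_sql("products", conn, index=False)   # table named after the file

# An SQL agent (or plain SQL) can now query the data directly:
rows = conn.execute("SELECT name FROM products WHERE price > 10").fetchall()
print(rows)  # [('gadget',)]
```

From here the table can be pointed at LangChain's SQL agent instead of the CSV agent.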
This is wonderful idea. How long would creating this take?
Hello, I am wondering about something: when we use a CSV agent, do we not need to use embeddings, a vector database, or memory? I am currently confused.
Great work. I had a question, What could be the problem that it only counts 5 records when I have 200?
It might be limited to only sending that many back to the LLM, not sure about this as I did it quite a while ago.
which LLM are you using?
gpt3.5 or gpt4
From memory that was davinci-003 or 3.5; the code should show it.
Hello there, thank you for this interesting video. I am trying to replicate this notebook, but I am getting errors when I try to view the agent prompt template using this line: agent.agent.llm_chain.prompt.template.
It looks like the library has changed considerably in the time since this video was posted
Any help would be appreciated to be able to do this step
Hi Sam. brilliant tutorial for doing exactly what the video title says. I do have a question, what actual LLM does the agent call when we simply say OpenAI(temperature=0) without specifying any model parameter?
Has the ChatGPT API become paid? It is showing that the limit has been reached. Do you face the same problem?
when I recorded that video (a few months ago) I think it was text-davinci-003, it is probably the same with ChatOpenAI being used for the other OpenAI models.
Sam, really great demonstration of LangChain CSV agents, but I am getting an OutputParserException while running the code in a notebook in VS Code to chat with my CSV file (not containing huge data, only 1 sheet of 22 rows), using langchain create_csv_agent with AzureOpenAI.
How can I solve this error? Sam, could you or anyone out there please give me the solution for this issue with a detailed explanation?
Please revert to me for more details on this.
Thanks.
they have updated LangChain so the code on this is about 1 year old unfortunately. I will try to make a new version of the video soon.
Very nice tutorial! Thanks! I have a question tho, how do we ask questions to multiple csv files? or even multiple csv files + some txt/pdf documents?
you can have multiple indexes and query each of them.
Thanks for the great video. I think you already have done pandasAI video. Would you recommend using that in place of an agent from langchain?
Good question. I think the Pandas AI is more if you are using it for personal use but LangChain if making an app for others etc. Can check the prompts from both and see what works best for you and use those as well.
How do I persist that CSV in a vector DB and get a similar kind of response? Please help.
What about a CSV file without using the CSV agent? Please help.
In a database of cars would LangChain be able to compare cars with everything about them (brand, series, model, HP, option list, etc) to another to give me a good comparison car for example a Mercedes A-Class to an Audi A3 or something like that.
Series and model would be inputs from me for which car could compare to what, and some of it it should solve itself by comparing body types etc., but the option list is not normalised across different car producers. Would vector embeddings be needed for that?
Or is a different model a better solution? For example BERT?
Would be grateful about a response, thank you.
This is wonderful. How long would creating this app take? you made it look easy!
writing the backend is not that complicated if you look at the Colab code I provided.
@@samwitteveenai thank you so much. Will check it out and get back to you. Again, thanks for sharing your knowledge.
Hi Sam.. Are you available for consultation?
How scalable is this to large data sets, or to databases with multiple tables?
Thanksss man, great vid
Please change the davinci model to a ChatGPT model (gpt-3.5-turbo) for this tutorial, as it is better and 10x cheaper.
How do I change the davinci model to gpt-3.5-turbo? When I pass the model_name='gpt-3.5-turbo' parameter to the create_csv_agent function I get an error. Could you teach me?
Can i give you my csv assignment?... I've to submit by tomorrow and I don't know how to do😢
Sam any idea how to have this on multiple csv files
Thank you for your informative video. I have a question for you.
I followed your method to conduct queries and responses for the product information in my online store's csv file. However, it consumed too many tokens for just a few questions, as shown below: text-davinci, 17 requests - 42,525 prompt + 2,142 completion = 44,667 tokens. I'm wondering if converting the csv file into embedded vector values could reduce the number of tokens used in queries. I'd like to know your opinion on what can be done when the tokens used for queries and responses are excessively high.
Interesting. What types of queries were you doing? If it was things like "list all the products" and that was more than 4k tokens, yes, you will have an issue; if it was just getting Pandas queries it shouldn't have that kind of issue. You are right, you could use a vector store and do it that way. I have a few videos showing things like that coming out soon.
Hey 구본천, great questions.
What have you found to be best for an optimal token consumption? I started using embeddings for questions but then got to know agents and started using them. Using this agent method and asking 5 questions on a 15,000 rows table, the consumption was $0.14 USD; not that optimal. Appreciate your feedback!
And thanks Sam Witteveen for such great content!
Looking for this solution
@samwitteveenai any documentation to achieve this?
@@samwitteveenai Sam, could you point me in the direction of your videos using a vector store with the pandas agent? Or indicate when you might have some videos out on it? I'm currently comfortable with the Pandas agent and adjusting the prompt but it gets expensive!
Hello Sam, when will you make a video about reading csv, pdf or txt data using free LLMs? It would be interesting to learn using alternatives to chatgpt/openai.
They need to be fine tuned or find prompts that can get them to stay consistent. most will not work for tools etc.
Hi Sam, I had one doubt: how can we chat with a .xlsx or .xls file?
Hi Sam, is there any open source LLM that we could use for the same??
What is the name of the OpenAI model you used inside this video ?
Awesome video!!
Can you/anyone guide me how to load CSV file for question answering using Dolly2.0 with langchain??
I wouldn't use Dolly as it is very out of date now and the LLaMA 2 models are much better.
Nice video!
I have a question though, is it possible replicate the code or the idea using a different LLM like Bloom, OPT or GPTNeoX?
Yes, but it won't work with the standard versions of those models because they don't do well with these tasks. I did one no-OpenAI vid and I plan another later this week, looking at what models can do what etc.
Hi @Sam - One more question: Can i refine the prompt of the agent?
Yes all the prompts you can change and should tune depending on the model you are using.
Great video! There's some notebook that show how use Alpaca Llama to talk to CSV or any other date file like Json?
I made one and it didn't work well out of the box, so I need to finetune an Alpaca to do it. Will try to do that this weekend.
@@samwitteveenai thanks so much!
How do we add past conversations as memory to the agent?
What would be cool would be if we could visualize the data using matplotlib
this is an interesting direction a few people have mentioned and since I suck at writing Matplotlib code I probably will look into it :D
Can we try to do something similar with Opensource LLMs alpacalora , gpt4all ?
been playing with this, no success so far but surely coming very soon.
good question I did try this on Alpaca and was hoping to show that as a follow up video but it wasn't good enough out of the box. That said it should be doable by finetuning the model first. I will have another go at it when I get some time.
@@samwitteveenai fine tune it on various pandas queries ?
Waiting for this. Will be fantastic.
Got tools and data QA working but the context size (2048) limits significantly the amount of text you can feed. And it's slow, even on 4bit. We need a non Llama based one for this to be useful.
Thanks for the great video, Sam. I was doing analytics on a pandas DF using the LangChain agent and came across the model’s tokens limit. Is there any way to overcome it?
you can use the 16k context model for 3.5-turbo which is 4x longer than the normal 3.5 model
@@samwitteveenai I'll try. Thank you again Sam!
How can we load multiple files ?
Can we use other language model like LLAMA or Alpace for reading csv like this?
most don't have enough reasoning for doing that.
Possible for the agent to query data from 2 csv files instead?
yes but will need to change some of the internal code
Awesome content! Simple and effective. Congrats :) (Small question: is it possible to use an alternative to OpenAI for this task? Some LLM providers such as SelfHostedPipeline or SelfHostedHuggingFaceLLM? Thanks in advance.)
Yes you can, but often models like Alpaca etc. weren't trained on instructions that allow this to work, so it would need finetuning.
@@samwitteveenai great to know, thanks. I am going to watch your finetuning video first :)
Hi @@samwitteveenai, do you have any tips / links on how to build instructions dataset from csv tables to finetune LLMS like Alpacas ?
Thank you :)
Nice explanation. Can you help me add this to a custom csv dataset.
custom csv should work just fine.
@@samwitteveenai yes I found that but how do access conversationbuffermemory with it
Does it send/upload your CSV data somewhere? I explicitly want to know about data privacy.
Not your full file, but if you use OpenAI like this then some of the data will be included in the prompt.
Could you do this but not using chatGPT? I would need to use a local LLM is that at all possible?
yes but you would probably need to finetune the local model for this task.
Any idea how to process multiple CSV/Excel files with it?
you could run it multiple times and then merge the outputs to a summary chain. This would require making a custom agent etc.
Hi @Sam Witteveen, I am getting a Rate Limit Error. Can you guide me on how to fix that?
That sounds like an OpenAI issue, leave it and try a bit later sometimes their API has issues
I have private documents (Excel & CSV) I can't share with OpenAI. Is there any way to do it as a private GPT?
Yes you can try some of the open source models. I am going to revisit this in some more vids soon.
@@samwitteveenai Many thanks Sam, that would change my life. I have plenty of CSV & Excel files, and existing LLMs like groovy and snoozy from gpt4all are unable to read CSV & Excel correctly. It would be great to have a tutorial video ☺️
is there a limit for the size of the csv file?
Possibly, but the way it works, as long as the CSV can be loaded into memory, pandas queries can be run on it.
what about large csv ?
could you make a video on using langchain and llama to connect llama to the internet? maybe using alpaca13b or alpaca7b?
I am looking into this, the challenge is to do it well LLaMa needs to be trained on a unique dataset. Still working on it
@@samwitteveenai i see. what about something like vicuna
Do I have to add an OpenAI API key myself?
yes you will need to
🔥🔥🔥🔥
I tried this but the result is not satisfactory
I need to revisit tabular data with these again soon, there are lots of new ways to approach it. I think this vid is close to a year old now
@@samwitteveenai Thank you so much for your reply, I really appreciate it, The issue seems to stem from LangChain's processing, where it embeds document data and searches for the closest matching data before reconverting it into text. This can lead to errors, particularly with logical answers. For instance, calculating the average expenses for specific categories like food is problematic. This is because the process requires access to the entire CSV dataset, and LangChain struggles to retrieve specific data if the corresponding keyword is missing from the CSV.
@@MrShivrajansingh Often for this kind of thing it is better to just treat the CSV as a SQL db and use the LLM to just write SQL queries.
Can this do graphing too?
Graphing in what sense? Plots? You could make an LLM write the code for a plot. If you are talking about knowledge graphs, then yes, but in a different way.
@@samwitteveenai So like if i ask " Make me a line plot showing the trend of xyz from 2005 - 2010 using the Plotly library" (assuming I have that data ofc!), I would want it to make me a line graph using Plotly
Nice content, I have few queries:
1. If I use an OpenAI API key, will my organization's data get exposed?
2. Can you make a video on how to develop a model to extract question answers from my organization's data (available in CSV and Excel format only)?
In my case i want to create the similar question answering bot or web app with my organization's data.
Anyone has any idea about that.
Anything you pass in the prompt will be data OpenAI has access to. So be careful.
@@samwitteveenai thanks for quick response
Can you guide me to create a ChatGPT-like chatbot to answer queries based on my Excel data?
@@satishkumar-ir9wy Hey, I am also looking to create a similar chatbot. Were you able to create one?
Is it possible with alpaca models?
not with the straight Alpaca model. I have tried it and didn't get good results. But I am working on a finetuned version of Alpaca to do it.
@@samwitteveenai i shall sub and eagerly wait for its arrival
How can I add custom template/prompts?
Lol now try doing it in typescript
I need a JSON response.
agent.agent.llm_chain.prompt.template
AttributeError: 'RunnableAgent' object has no attribute 'llm_chain'
This is over a year old; they have updated LangChain since then. I will make an update at some point.
did you ever find a way to fix this issue?