Great video! Always something new to learn!
This is really interesting but I have some concerns about this method, I'd love to hear what you think about them:
1. We are always sending the entire schema as context. If we want to have a large dataset connected to this "application", we will waste a ton of tokens on that. The agent that LangChain built slowly decides which tables might be relevant, thus reducing the amount of tokens used as context. How would you approach something like this?
2. Sometimes table and column names are not intuitive to the LLM, and without sampling the data it may make wrong assumptions about properties or values. The user then has to review the query and make sure it makes sense, which is exactly what we are trying to avoid by using AI for queries. What do you think about adding an intermediate step that samples the relevant data?
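Both concerns above can be illustrated with a small sketch. It is not the video's method: it uses plain sqlite3 instead of LangChain, the keyword matching in pick_tables is a deliberately naive stand-in for a real table-selection step, and the table names and the 'dlv'/'cnl' status codes are made up. LangChain's SQLDatabase appears to expose similar knobs (include_tables and sample_rows_in_table_info), but check the current docs before relying on them.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all user tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

def pick_tables(question, tables):
    """Naive filter: keep tables whose (roughly singular) name occurs in the question."""
    q = question.lower()
    hits = [t for t in tables if t.lower().rstrip("s") in q]
    return hits or tables  # fall back to the full schema if nothing matches

def describe_table(conn, table, n_rows=3):
    """Return the CREATE statement plus a few sample rows as SQL comments."""
    ddl = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()[0]
    cur = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (n_rows,))
    cols = [d[0] for d in cur.description]
    lines = [ddl, f"-- up to {n_rows} sample rows:", "-- " + " | ".join(cols)]
    lines += ["-- " + " | ".join(str(v) for v in row) for row in cur.fetchall()]
    return "\n".join(lines)

def build_context(conn, question, n_rows=3):
    """Schema context restricted to the tables that look relevant to the question."""
    tables = pick_tables(question, list_tables(conn))
    return "\n\n".join(describe_table(conn, t, n_rows) for t in tables)

# Made-up demo data: status codes that are not self-explanatory without samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
INSERT INTO customers VALUES (1, 'Ada');
INSERT INTO orders VALUES (10, 1, 'dlv'), (11, 1, 'cnl');
""")
context = build_context(conn, "How many orders were delivered?")
print(context)
```

Only the orders table ends up in the context here, and the sample rows let the model see that status holds short codes rather than words like "delivered".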
All these LLM-to-SQL demos are toys, not production ready. They are just not mature enough. The database itself needs to be well documented together with the business.
What has been your experience with text-to-pandas DataFrame? Is it better than text-to-SQL in terms of complexity?
Hi, great tutorial! How would you implement chat functionality, where you can ask follow-up questions?
Thanks! I would use ChatMessageHistory to manage the conversation and catch the traceback - this is needed for more advanced queries.
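A rough sketch of that idea, under some assumptions: a plain Python list stands in for ChatMessageHistory, sqlite3 stands in for the real database, and generate_sql is a hypothetical callable that would wrap the actual model call. fake_llm below only exists so the example runs without an API key.

```python
import sqlite3

def ask(conn, question, history, generate_sql, max_attempts=2):
    """Answer a question, keeping the conversation and feeding DB errors back to the model."""
    history.append(("user", question))
    for _ in range(max_attempts):
        sql = generate_sql(history)
        history.append(("assistant", sql))
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Put the error text into the history so the next attempt can correct it.
            history.append(("user", f"That query failed with: {exc}. Please fix it."))
    raise RuntimeError("no valid query after retries")

def fake_llm(history):
    """Stand-in for a model: wrong table name first, corrected once it sees an error."""
    if any("failed" in text for _, text in history):
        return "SELECT COUNT(*) FROM orders"
    return "SELECT COUNT(*) FROM order_table"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])

history = []
result = ask(conn, "How many orders are there?", history, fake_llm)
print(result)
```

Because the same history list is passed to every call, a follow-up question sees the earlier questions, queries and error messages.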
This is function calling, but instead of JSON you get a SQL query. Am I missing something?
That is one way to think of it. But in this case LangChain is handling the parsing of the LLM output (note the model.bind(stop=["\nSQLResult:"]) in the chain). When you generate SQL or any other code you'll find that the code is often returned in quotes or with some text explaining the code. The trick is to minimize this by parsing the output in a suitable way.
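To make "parsing the output in a suitable way" a bit more concrete, here is a minimal sketch of a post-processing step. extract_sql is a hypothetical helper, not something LangChain provides, and the cases it handles (a code fence, a leading "SQLQuery:" label, wrapping quotes, trailing text after "SQLResult:") are assumptions about what a model might return.

```python
import re

TICKS = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCE = re.compile(TICKS + r"(?:sql)?\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)

def extract_sql(text):
    """Pull a bare SQL statement out of a chatty model response."""
    # Drop anything the model wrote after the result marker.
    text = text.split("\nSQLResult:")[0]
    match = FENCE.search(text)
    if match:
        # Prefer the contents of a fenced code block if there is one.
        text = match.group(1)
    else:
        # Otherwise strip a leading label such as "SQLQuery:".
        text = re.sub(r"^\s*SQLQuery:\s*", "", text, flags=re.IGNORECASE)
    text = text.strip()
    # Remove one pair of wrapping quotes, but only if they match on both ends.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    return text

chatty = "Here is the query:\n" + TICKS + "sql\nSELECT 1;\n" + TICKS + "\nIt returns one row."
print(extract_sql(chatty))
```

The stop sequence in the chain already handles the "SQLResult:" case on the model side; the split here is only a fallback for responses that arrive without it.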
I am late to the game on this video. I have been working on a TextToSQL project. As in most of the examples I have seen, the LLM can understand the context of columns. In the project I am working on, the column names may give some hint of what the data is or how it is used. The schema I have has a date, a reference date, and a delivery date. Delivery date is obvious. There are other fields where the names are not indicative of the values. What happens when you have multiple tables with a large schema? My approach is to use the LLM only to build the SQL, not to synthesize an answer from the results, as the amount of data could be quite large.
You should be able to define some of the fields in the prompt with examples. That way the model can try to differentiate what each field means.
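One possible way to do that, as a sketch only: keep short notes per column and render them into the prompt next to the schema. The column names, meanings and example values below are invented for illustration, and build_prompt is a hypothetical helper, not part of the tutorial's chain.

```python
# Hypothetical column notes; in a real project these come from whoever knows the data.
FIELD_NOTES = {
    "orders.date": "day the order was placed, e.g. 2024-03-01",
    "orders.ref_date": "day the invoice was issued, e.g. 2024-03-03",
    "orders.delivery_date": "day the goods arrived, NULL if not delivered yet",
    "orders.status": "short code: 'dlv' = delivered, 'cnl' = cancelled",
}

def build_prompt(schema, field_notes, question):
    """Combine schema, per-column notes and the question into one prompt string."""
    notes = "\n".join(f"- {col}: {desc}" for col, desc in field_notes.items())
    return (
        "Write one SQL query that answers the question.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Column notes:\n{notes}\n\n"
        f"Question: {question}\nSQLQuery:"
    )

schema = "CREATE TABLE orders (id INTEGER, date TEXT, ref_date TEXT, delivery_date TEXT, status TEXT)"
prompt = build_prompt(schema, FIELD_NOTES, "How many orders were invoiced in March 2024?")
print(prompt)
```

With a large schema, the notes could be filtered to the tables that are actually sent as context so they do not add to the token cost.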
Where can we download the code file?
There's a link below the video to the Colab notebook with code and written tutorial including how to generate the ecom tables
I am unable to see the table names using "print(db.get_usable_table_names())". The database connected successfully, but it shows an empty array []. What should I do?
What happens if it drops a table while hallucinating?
Use a read-only role.
As mentioned, make sure to restrict access scope and permission.