Thanks James for elaborating on LangChain Memory. For the viewers, here are some 🎯 Key Takeaways for quick navigation:
00:00 🧠 Conversational memory is essential for chatbots and AI agents to respond coherently to queries in a conversation.
01:23 📚 Different memory types, like conversational buffer memory and conversational summary memory, help manage and recall previous interactions in chatbots.
05:42 🔄 Conversational buffer memory stores all past interactions in a chat, while conversational summary memory summarizes these interactions, reducing token usage.
14:13 🪟 Conversational buffer window memory limits the number of recent interactions saved, offering a balance between token usage and remembering recent interactions.
23:05 📊 Conversational summary buffer memory combines summarization and saving recent interactions, providing flexibility in managing conversation history.
We are also doing lots of workshops in this space, looking forward to talking more.
Super! For me, it is one of the best tutorials on this subject. Much appreciated, James.
thanks, credit to Francisco too for the great notebook
Thank you. I was way behind on LangChain and had no time to read the documentation. This video saved me a lot of time. Subscribed.
Another masterpiece of a tutorial. You’re an absolute gem James!
Things really seem to get interesting with the knowledge graph! Saving things that really matter, like relation context, along with a combination of the other methods, starts to sound very powerful. Add in some embedding/vectorDB and wow. The other commenter's idea about a system for bots evolving sentiment, or even personality, over time is worth thinking about as well.
yeah this is fascinating to me, looking forward to working on these
Very powerful!
Any idea or resources on how to add an embedding/vectorDB to this?
I would like this memory chatbot to be able to reference my own data stored in the vectorDB but I can't seem to make it work together.
Either the chatbot has memory OR it references the embeddings; I can't seem to combine the two.
@@Jordy-t8y It's done in video #9
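In the meantime, here is a rough sketch of one way to wire the two together with a ConversationalRetrievalChain (just a sketch, not the video's exact code; exact class names may differ between LangChain versions, and the sample documents are placeholders):

```python
# Sketch: conversational memory + retrieval over your own embedded data.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# build (or load) a vector store containing your own data
vectorstore = FAISS.from_texts(["your documents go here"], OpenAIEmbeddings())

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# memory holds the running chat history that is passed back into the chain each turn
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory,
)

print(qa({"question": "What does my data say about X?"})["answer"])
print(qa({"question": "And how does that relate to my previous question?"})["answer"])
```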
Great explanation of memory in LangChain; seeing the charts made it much clearer for me.
Cool! This video addressed the question that I had posed in your earlier (1st) video about the token size limitations due to adding conversational history. The charts provide a good intuition of the workings of the memory types. Two takeaways: 1. When to use which memory type. 2. How to do performance tuning for a chatbot app, given the overheads posed by token tracking, memory appending and so on.
If I understand the graphs correctly, what is represented is the tokens used per interaction; in the case of the Buffer Memory (the quasi-linear one), the 25th interaction is about 4k tokens. But the price (in tokens) of the whole conversation up to the 25th interaction is the sum of the prices of all the interactions up to the 25th. So basically the price of the conversation, in each case, is the area under the curves you showed, not the highest point reached. For the summarized conversations, with the flat tendency towards the end, it means the price just keeps adding almost the same number of tokens per new interaction, not that the price of the conversation has reached a cap.
If my math isn't off, that should be 25/2 * 4k = 12.5 * 4k = 50k tokens after 25 interactions. At $0.002 per 1k tokens (on turbo), that is $0.10, or one dime, for that whole conversation.
yeah your logic is correct, the graphs ended up like this because I wanted to show the limit of buffer memory (i.e. hitting the token limit) - we had intended to include cumulative total graphs but I didn't get time, planning on putting together a little notebook to show this in the coming days
token math checks out for me - it adds up quickly
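For anyone who wants to sanity-check it, here is a quick back-of-the-envelope version of that cumulative cost (assuming token usage grows roughly linearly to ~4k tokens by interaction 25, and gpt-3.5-turbo pricing of $0.002 per 1k tokens):

```python
# Rough cumulative-cost estimate for buffer memory: sum the per-interaction token counts
# (the area under the curve), then apply the per-1k-token price.
interactions = 25
max_tokens = 4000
price_per_1k = 0.002

tokens_per_interaction = [max_tokens * (i + 1) / interactions for i in range(interactions)]
total_tokens = sum(tokens_per_interaction)        # ~52k tokens, close to the 50k estimate above
total_cost = total_tokens / 1000 * price_per_1k   # ~$0.10

print(f"total tokens: {total_tokens:.0f}, total cost: ${total_cost:.2f}")
```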
Thanks for your content! looking forward to watching the knowledge graph video :)
Oh wow, you just destroyed my project lol. I gave ChatGPT long-term memory, autonomous memory store and recall, speech recognition, audio output, and self-reflection. Thought I was the only one working on stuff like this. Well, I'm basically trying to build a sentient agent, but I need vision. Hopefully GPT-4 is multimodal, because I'm struggling to give my project vision recognition.
yeah I think you might be in luck for multimodal GPT-4 :) - that's awesome though, I haven't done all of that yet, very cool!
Great work bro! Keep it up! 👍
Thanks for this content James, awesome!
you're welcome
Amazing Content
James - are you still planning to work on the KG video? Seems like a powerful method that solves for scale and token limits.
Check out David Shapiro’s latest approach with salient summarization when you get a chance. Essentially: The summarizer can more efficiently pick and choose which context to preserve if it is properly primed with specific objectives/goals for the information.
fascinating, love Dave's videos they're great!
Skimming through the docs, LangChain seems like a complicated abstraction around what's essentially auto copy and paste.
the simpler stuff yes, but they have some other things like knowledge graph memory + agents that I think are valuable
Great demo James
thanks Tommy I appreciate it!
Hi Sam, how do we keep the conversation context of multiple users on different devices separate?
In the scenario of conversational bots, how do we limit the token consumption of the entire conversation?
For example, once the consumption reaches 1,000, it will prompt that the tokens for this conversation have been used up.
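One possible way to wire that up (a sketch only, using LangChain's `get_openai_callback` for token counting; the 1,000-token budget and the `chat` helper are just illustrations of the idea):

```python
# Sketch: cap the total token spend of a single conversation at a fixed budget.
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.callbacks import get_openai_callback

llm = OpenAI(temperature=0)
conversation = ConversationChain(llm=llm)

budget = 1000     # total tokens allowed for this conversation
tokens_used = 0

def chat(user_input: str) -> str:
    global tokens_used
    if tokens_used >= budget:
        return "The tokens for this conversation have been used up."
    with get_openai_callback() as cb:   # counts prompt + completion tokens for this call
        reply = conversation.predict(input=user_input)
    tokens_used += cb.total_tokens
    return reply
```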
Thank you! Awesome work!! Appreciate it!
thanks!
James, thanks so much!
Just curious, what's the openAI cost to complete this course if you choose the pay as you go plan?
@jamesbriggs why are transformers stateless?
Great video! I love the graphs for token usage. I kept meaning to graph the trends myself, but I was too lazy!
I was talking to Harrison Chase as he was implementing the latest changes to memory, and it's had me thinking about other unique ways to approach it. I've been using different customized summarizers, and I can bring up any subset of the message history as I like, but I'm thinking also to include some way to flag messages as important or unimportant, dynamically feeding the history. I also haven't really explored my options in terms of local storage and retrieval of old chat history.
One note that I might make for the video too... I noticed you're using LangChain's usual OpenAI class and just adjusting your model to 3.5-turbo. My understanding is that we have been advised to use the new ChatOpenAI class for now when interacting with 3.5-turbo, since that's where they'll be focusing development and they can address changes there without breaking other stuff; this is necessary since the new model endpoint takes a message list as a parameter instead of a simple string.
dynamically feeding the memory sounds cool, would you do this explicitly or implicitly?
langchain moves super fast, I haven't seen the new ChatOpenAI class, thanks for pointing this out!
@@jamesbriggs My notion is to create a chat client where the bot controls the conversation, instead of the user, for the purpose of guided educational experiences - like a math lesson performed with the Socratic method, where you want to elicit the solution from the user rather than just provide it to them. I'm imagining I'll need an internal model of the user's cognition and an outline of the lesson, then I would implicitly determine the importance of any interaction or lesson detail by how logically connected it is to both, feeding only the immediately relevant context to the external-facing LLM. I'm really still brainstorming, and I just started a month-long vacation to play with the idea.
What if I want to use it with my own fine-tuned GPT-3.5 model?
How can I keep the conversation context of multiple users separately?
Can you please please please make a video on how to connect MongoDB with LangChain?
How can I use this conversational memory for a custom chatbot along with LangChain?
Do you have a substitute for LangChain?
you are awesome - thanks again!
Hi James, great video. This is probably a stupid comment but here goes… Could you not just ask the LLM to capture some key variables that summarise the completion for the prompt and then feed that (rather than the full conversation) as 'memory' for subsequent prompts? I'm imagining a 'ghost' question being added to each prompt, like 'Also capture key variables to summarise the response for future recall', and then this being used as the assistant message (per GPT-3.5 Turbo) rather than all of the previous conversation?
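For concreteness, that 'ghost question' idea might look roughly like this with the ChatCompletion endpoint (an untested sketch; the prompt wording, the VARIABLES: marker, and the parsing are just placeholders):

```python
# Sketch of the "ghost question" idea: ask the model to emit key variables with each answer,
# then carry only those variables forward as the memory for the next prompt.
import openai

memory_vars = ""   # compact state carried between turns instead of the full conversation

def chat(user_input: str) -> str:
    global memory_vars
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "assistant", "content": f"Key variables so far: {memory_vars}"},
        {"role": "user", "content": user_input + "\n\nAlso output a final line starting with "
            "'VARIABLES:' listing key variables that summarise this response for future recall."},
    ]
    res = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    text = res["choices"][0]["message"]["content"]
    answer, _, vars_line = text.partition("VARIABLES:")
    if vars_line:
        memory_vars = vars_line.strip()   # replace the memory with the latest key variables
    return answer.strip()
```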
How is the model able to judge whether it needs to come to the conclusion: "I don't know."
Love the video! Question about wanting to put this behind a UI, how hard would that process be?
How do I use memory with ChatVectorDBChain, where we can specify vector stores? Could you please give a code snippet for this? Thanks
Great content, thanks for that.
I'm working on a tweet summarization use case, but I don't want to break the overall corpus into pieces, build a summary for each one, and combine those summaries into a larger one. I want something more clever.
Suppose I have 10 tweets. 6 are related (same topics) and the last 4 are different from each other. I think I can build a better summary than the plain LangChain summary by only summarizing the 6 related tweets and adding the 4 raw tweets. This helps avoid losing context for the future.
I'm not sure how exactly to implement this, but possibly (rough sketch below):
1. embed the tweets
2. when looking to summarize, embed the current query and perform semantic search to identify tweets over a particular similarity threshold to return
3. summarize those retrieved tweets
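I haven't tried this end to end, but a minimal sketch of those three steps might look like the following (the embedding model, similarity threshold, and query are just assumptions to illustrate the idea):

```python
# Sketch: summarize only the tweets related to a topic, keep the unrelated ones raw.
import numpy as np
import openai

def embed(texts):
    res = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in res["data"]]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

tweets = ["tweet 1 ...", "tweet 2 ..."]   # your 10 tweets here
query = "the shared topic you want the summary to focus on"

tweet_vecs = embed(tweets)
query_vec = embed([query])[0]

threshold = 0.8   # illustrative; tune for your data
related = [t for t, v in zip(tweets, tweet_vecs) if cosine(query_vec, v) >= threshold]
others = [t for t, v in zip(tweets, tweet_vecs) if cosine(query_vec, v) < threshold]

# summarize only the related tweets
summary = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize these tweets:\n" + "\n".join(related)}],
)["choices"][0]["message"]["content"]

final_context = [summary] + others   # summary of related tweets + the raw unrelated ones
```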
These lectures are really helpful, thanks a lot!
Is there a way to use Conversational Memory along with VectorDBQA (generative question answering on a database)?
I swear you have the coolest shirts!
Make a drip video too! would watch !
Thanks Billy! A drip video??
How would I be able to use this with a Pinecone vector DB for context?
Helpful! Thanks
Does anyone know the difference between the run vs predict method? Cause they seem the same to me.
If there is a difference, which one is better?
thank you great topic
glad you liked it!
Can someone let me know where I can get an off-the-shelf LLM with long-term memory? I need it to be able to remember things I tell it, remember where I put stuff, etc. I don't mind paying for it.
Hey James, can you share the Collab notebook for this?
Yes it’s the chat notebook here github.com/pinecone-io/examples/tree/master/generation/langchain/handbook
Hi, great content, but the gpt-3.5 model already has its own conversation memory, so you can use that instead of davinci. It is also 10 times cheaper 😊
thanks for sharing, gpt-3.5-turbo is great! We do demo it in this video during the first example even :)
- the reason I share this tutorial anyway is because gpt-3.5-turbo is (using the direct openai api) restricted to the equivalent of `ConversationBufferMemory`, it doesn't do the summary, window, or summary + window memory types
We didn't really cover it here but there's also the knowledge graph memory, we'll cover that in the future
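For reference, a rough sketch of what the summary + buffer combination looks like with the turbo model in LangChain (a sketch only; class names may have moved between LangChain versions, and the 650-token limit is just an example value):

```python
# Sketch: gpt-3.5-turbo with LangChain's summary + buffer memory.
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryBufferMemory

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# keep the most recent ~650 tokens verbatim and summarize everything older
memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=650)

conversation = ConversationChain(llm=llm, memory=memory)
conversation.predict(input="Hi, I'm building a chatbot that has long conversations.")
conversation.predict(input="Remind me what I said I was building?")
```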
@@jamesbriggs I see, so even if we want to use the turbo model because it is cheaper than davinci, we would still want to explore one of these LangChain memory types?
@@jamesbriggs graph memory looks really interesting, would love to see it utilized with turbo or the ChatGPT API. Also wondering if/when OpenAI will start caching tokens for users on their end, meaning you would only pay for new data added to the conversation.
Hello James, this method would not work for chat models anymore, right? The code would have to be adjusted to work for the new chat models from langchain. Could you make a new video to cover that?
it works for normal LLMs, not for chatbot-only models - but yes I'll be doing another video on this
@@jamesbriggs awesome! Thank you so much for all the work you put in. You got me back to coding :)
Make a video on using this kind of long-term memory based chat for semantic search on local files like .txt please
planning to do it soon!
I know that OpenAI’s text embeddings measure the relatedness of text.
I am new to this field, so this question is probably trivial for some of you. Anyway, I was wondering if it is possible to use this technique with source code.
I was trying to figure out a way to analyse source code, but due to the token limitation, one way to save prior knowledge could have been this.
For example, if I have a list of source files, I can search for similarities within the list.
Any advice? Is it possible or I am just blathering on?
interesting question, I'm not sure as I haven't seen this done before, but generally speaking these language models are just as good (if not better) at generating good code as good natural language, so I'd imagine generating embeddings for code *might* work
For dealing with token limits, you can try comparing chunks of code, rather than the full code - if your use-case allows for that
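This is untested, but the chunk-level comparison could look roughly like this (the chunking strategy and file names are purely illustrative):

```python
# Sketch: embed chunks of source code and compare them by cosine similarity.
import numpy as np
import openai
from sklearn.metrics.pairwise import cosine_similarity

def chunk_code(source: str, max_chars: int = 1500):
    # naive chunking by character count; splitting on function/class boundaries would be smarter
    return [source[i:i + max_chars] for i in range(0, len(source), max_chars)]

def embed(texts):
    res = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in res["data"]])

code_a = open("module_a.py").read()   # illustrative file names
code_b = open("module_b.py").read()

vecs_a = embed(chunk_code(code_a))
vecs_b = embed(chunk_code(code_b))

# similarity between every chunk of file A and every chunk of file B
sims = cosine_similarity(vecs_a, vecs_b)
print(f"most similar chunk pair: {sims.max():.3f}")
```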
No example???
Make James Famous ....
He already is
Hello, this was interesting. I am currently developing a chatbot with LlamaIndex, with model_name="text-ada-001" or "text-davinci-003". So, based on thousands of documents (external data), the user will ask questions, and the chatbot must respond. When I tried it with just one document, the model performed well, but when I added another, the performance dropped. Could you please advise on a possible solution to this? Thank you in advance.
My documents are in PDF form.
The big problem is that so far I haven't found a solution that doesn't need the entire schema inserted into the prompt itself so that ChatGPT understands how to organize and structure the data.
To explain my need better: I extracted information from sales pages via web scraping, and I would like ChatGPT to organize the collected data based on my SCHEMA structure so that I can save it in the database with the fields I created.
I wouldn't want to add instructions on how to sort the data in the ChatGPT prompt every time.
The million-dollar question 😊: how do I "teach" the schema to ChatGPT only once and then validate infinite texts, without having to spend tokens inserting the schema into every prompt and without having to train the model via fine-tuning?
For this kind of question you should try more advanced LLM channels
So large language models are simply specialized transformer models, for words.
Stable Diffusion and all the others are specialized transformer models, for images.
Etc. Right now companies are developing their own specialized transformer models.
for large language models yes, they're essentially specialized and very large transformer models
Stable diffusion does contain a transformer or two in the pipeline, but the core "component" of it is the diffusion model, which is different. But the input to this diffusion model includes embeddings which are generated by something like CLIP (which contains a text transformer and vision transformer, ViT)
Generally yes, transformers are everywhere, with a couple of other models (like diffusers) scattered around the landscape
Yeah. I count the transformer and diffusion layers to be separate aspects of it but I see what you mean. It's getting so crazy.
It's not DIALOGUE, it's a SERIES of questions... the AI must dialogue like you do with a friend.
And yet ChatGPT needs some of this badly as I have seen it massively forget things that it said literally just one or two comments previously.
ChatGPT-4 charges high fees and people should not support it.
We should have a dedicated AI that summarizes old chats based on what you are talking about now and then gives back the less recent convos. A bit of both.
I think this is similar to the summary + buffer window memory?