Excellent as usual! For Phi3 3.8b latest it works fine with:
prompt_phi = PromptTemplate.from_template(
"""{context}
Human: {question}
AI:"""
)
Otherwise you will get validation errors.
All the best Sam!
I got validation errors with both Llama 3 and Phi-3. Worse, the LLM was answering wrong; it returned Alex.
Changing the prompt solved it. I tried Mistral v0.3, and it works too.
Sam, I wonder where you found the recommended prompt formats?
Also, I would appreciate a video on how you handle validation errors, as they may occur from time to time.
@@gregorychatelier2950 Hello, it seems the Llama 3 prompt format was altered a bit a few weeks after the model's release (according to Reddit). To be double-checked...
I would recommend adding 'Langchain' to the title of the video, as most of this is very LangChain-specific, for those specifically searching for that.
Very good point. Added! Thanks!
@@samwitteveenai Great experiment you're running there, but please consider using LM Studio's new CLI as well in your subsequent videos instead of Ollama all the time. Also, can you try using Anima's AirLLM library so you can run Llama 3 70B locally using layered inference?
I haven't heard of Anima's AirLLM library but will check it out
LM Studio isn't as 'open' as Ollama, so it would restrict the use cases to just personal use.
Thank you for doing this with Ollama, this was a really good explanation and helped me a lot!
Thank you for sharing a video on running Phi-3 locally with Ollama; I hope you come up with more such videos on using Ollama locally for different tasks. Please make more videos on Phi-3 and Llama 3 with Ollama.
Great video, very informative, and filled some gaps. Thank you
Thank you for the informative videos as always. One note: if you want to run things all locally and want a lot better throughput, running the models using vLLM and serving the API with vLLM's OpenAI-compatible server is definitely the way to go. If you have a 24 GB VRAM GPU like a 3090 or 4090, you can run a GPTQ or AWQ quantized model, or just the full FP16 model and serve a large number of concurrent clients. With batching, you can get thousands of tokens per second in aggregate for responses if you run a lot of parallel clients.
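For illustration, a minimal sketch of talking to such a server from Python (the launch command, model name, and port are assumptions, not something from the video). Assuming the server was started with something like python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct, any OpenAI client can point at it:

from openai import OpenAI

# vLLM's OpenAI-compatible server defaults to port 8000; no real API key is needed
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # whatever model the server was launched with
    messages=[{"role": "user", "content": "What is the weather like in Singapore?"}],
    temperature=0,
)
print(response.choices[0].message.content)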
Linux only, and I am not sure it has enough performance like that. Multiple API calls running continuously sounds great, I'm just not sure...
You can run the Llama 3 70B model with as little as 4 GB of GPU memory using Anima's AirLLM library, which enables layered inference.
@@marilynlucas5128 I've never used this library before, what kind of tokens per second speed can you get? For reference, using LLaMA-3 70B with exllamav2 quantization at 2.4bpw on a single 4090, you can get around 36 tokens/second. With 2x4090s and 5.0bpw quantization, you get around 18 t/s.
For local models I've found it's helpful to add extra context at the very end of the prompt, in the assistant reply section (not the instruction section), kicking things off with "Sure, here is your JSON:", then adding markdown syntax for preformatted text, and letting one of the end symbols be the final three backticks that close the markdown. It's also helpful to write a custom grammar (like with llama.cpp) to constrain output to a specific schema. (Depending on your setup this could slow down inference if the constrained generation part isn't running on the GPU.)
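As a concrete (hedged) illustration of the prefill trick, here is a rough sketch with llama-cpp-python; the model path and prompt layout are placeholders, not the video's setup:

from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=4096)

prompt = (
    "You are a weather assistant. Reply only with JSON matching "
    '{"location": string, "unit": string}.\n'
    "User: What is the weather in Singapore?\n"
    # Pre-fill the assistant turn so the model continues straight into the JSON
    "Assistant: Sure, here is your JSON:\n```json\n"
)

out = llm(prompt, max_tokens=128, stop=["```"])  # the closing backticks end generation
print(out["choices"][0]["text"])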
Ah, I really needed this; I kept feeling I wanted to learn function calling with Llama 3. It feels so good to use a local model with function calling, and LangChain made it really easy to do. I'd love to experiment with it now. Thank you so much for this video❤️❤️❤️ and thanks to LangChain for making function calling easy to do ❤️❤️❤️
The biggest issue with function calling is that the way everyone suggests using it is not very viable or economical if you want your model to choose one out of many functions to call. I'll elaborate: in order for the LLM to pick a function to use, you need to announce all those tools in advance and make sure it hasn't forgotten them if you're going into a multi-turn chat. This means more context will be used just to make the model aware of all these extra tools you want it to use, and less context will be available for responses. There probably needs to be some semantic router introduced in between to give the model only those tools which might be relevant to the current question.
100% my experience as well. In fact, I've only had success doing function calling by putting it at the individual run level rather than at the model level, and only calling a single function that will be needed.
I have a video doing exactly this with a library called semantic-router and CrewAI!
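For anyone curious, here is a rough sketch of that routing idea: embed the question and each tool description, then only bind the top-scoring tools. The embedding model name and tool list are just assumptions, not from the video:

import numpy as np
from langchain_community.embeddings import OllamaEmbeddings

emb = OllamaEmbeddings(model="nomic-embed-text")

tool_descriptions = {
    "get_current_weather": "Get the current weather in a given location",
    "search_web": "Search the web for up-to-date information",
    "run_sql_query": "Run a SQL query against the sales database",
}

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_tools(question, k=1):
    # score every tool description against the question and keep the top k
    q_vec = emb.embed_query(question)
    scored = [(cosine(q_vec, emb.embed_query(desc)), name)
              for name, desc in tool_descriptions.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

print(pick_tools("What's the weather in Singapore?"))  # expected: ['get_current_weather']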
Amazing video 🙏🏻
Currently using crewai
Thanks for the video! I'd like to see an example of using DSPy to optimize a local model so that it can use tools more reliably. I'm actually not sure if this would work but I'd like to find out. 😃
Would love it if you could explain this using the ollama Python package. As someone else said, this is very specific to LangChain, and I just can't find good information on how to use function calling with ollama.
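In the meantime, a hedged sketch of the no-LangChain route with the ollama Python package: describe the tool in the system prompt, force JSON output with format="json", and parse the reply yourself. The prompt wording and the tool JSON shape below are my own conventions, not an official API:

import json
import ollama

system = (
    "You can call the function get_current_weather(location, unit). "
    "If the user asks about weather, reply only with JSON like "
    '{"tool": "get_current_weather", "args": {"location": "...", "unit": "celsius"}}.'
)

resp = ollama.chat(
    model="llama3",
    format="json",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "What is the weather in Singapore?"},
    ],
)

# parse the structured call out of the model's JSON reply
call = json.loads(resp["message"]["content"])
print(call["tool"], call["args"])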
I wonder if the function tool structure can be passed in as a pydantic object like the other example?
Now how do I get the actual output from the function?
You check in code for the additional kwargs / tool calls; then, if present, you run the function and pass the returned data back to the model. The model will then respond to the original query with the data as added context (knowing the weather temperature in this example).
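A minimal, self-contained sketch of that loop (the tool schema and prompts are illustrative; forcing the function as in the longer example further down the thread may make the call more reliable):

from langchain_experimental.llms.ollama_functions import OllamaFunctions
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

def get_current_weather(location: str, unit: str = "celsius") -> str:
    return f"The current weather in {location} is 30 degrees {unit}."

weather_schema = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}, "unit": {"type": "string"}},
        "required": ["location"],
    },
}

tool_model = OllamaFunctions(model="llama3", format="json").bind_tools(tools=[weather_schema])
chat_model = ChatOllama(model="llama3")  # plain model for the final, natural-language reply

question = HumanMessage(content="What is the weather in Singapore?")
ai_msg = tool_model.invoke([question])

if ai_msg.tool_calls:  # the model asked for a tool
    call = ai_msg.tool_calls[0]
    result = get_current_weather(**call["args"])  # actually run the function
    final = chat_model.invoke([
        question,
        HumanMessage(content=f"Tool result: {result}. Use it to answer the original question."),
    ])
    print(final.content)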
Which version of Phi 3 are you using? I'm having trouble replicating your results for the structured_output example as Phi 3 is not returning any "tool_calls".
Very cool! I've been using the instructor library with Pydantic for structured output and had a lot of success on OpenAI models, but it didn't work very well with local LLMs. I'll definitely try out your approach!
I have found Phi-3 truly impressive for its size, getting good results even for general inquiries. I almost wonder if you could just use Phi-3 if you don't need a super refined response. It's so light on resources comparatively for an LLM.
Agree, it is a nice model, especially when you consider its size.
When I try to execute the code, it shows this error: langchain_community.chat_models.ollama.ChatOllama() got multiple values for keyword argument 'format' (type=type_error). Any solution, please? (I didn't change anything from the code in the GitHub link.)
I have been running gpt-pilot with Llama3-70b-instruct.Q5_K_M for a couple of weeks. The biggest problem I have, as far as I understand, is not function calling but rather the stability of the framework. It starts developing a bunch of files, but when I provide feedback, it may abandon the old files instead of correcting them, and starts creating a new set of files. Basically makes a mess.
Thank you so much. Super helpful.
How do I use ChatOllama along with function calling? I want to pass messages along with functions, the same as the OpenAI v1/chat/completions API provides.
How do I incorporate function calling with follow-up questions and memory? Say a user asks "what is the weather". The model should be able to ask "what place are you requesting it for", and say the user replies "California".
It should then make the function call with the mentioned arguments. Please let me know which direction I should look in to achieve this.
Wow Sam, this video is really helpful, but I am facing a challenge running it on a server: the response is not coming within 1 minute and I am getting a 504 Gateway Timeout error. I used the Ollama Docker image to install Ollama, but I am not able to find how to increase the gateway timeout to 10 minutes instead of the default 1 minute.
Can you please help if you have faced such an issue?
I see you use a Mac mini; could you talk more about which model and OS setup?
Thinking of fun things to do with my 2011 2 GHz i7 with 16 GB DDR3 RAM, a local something on my network if I could.
Thanks for sharing this. How can I use this JSON output function call format with the LangChain agent function calling framework, i.e. use llm.bind_tools to replace llm=ChatOpenAI()? Will this work? Thanks
Great video as always! In future videos, could you please show how to do this with Ollama and LangChain running on separate computers? I'd like to develop on a laptop or Colab with just inference running on my desktop PC. And since Ollama doesn't currently do API keys, how do we secure the inference server and access it from a Colab notebook?
Thanks!!
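Not a full answer on the auth side, but pointing LangChain at a remote Ollama box is just a base_url (the IP below is a placeholder). The server needs to listen beyond localhost (e.g. OLLAMA_HOST=0.0.0.0), and since the port is unauthenticated, an SSH tunnel or a reverse proxy that adds an auth header is the usual workaround:

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(
    model="llama3",
    base_url="http://192.168.1.50:11434",  # placeholder: your desktop PC running Ollama
)
print(llm.invoke("Say hi").content)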
Thanks for the code and the explanation. In order to be usable, you should be able to execute the function, feed the result of the function back into the history of the conversation, and then the LLM should be able to use that result to write the last message.
For instance, let's say the weather tool responds with just the temperature and nothing else; then the LLM should be able to respond back with 'in Singapore the current temperature is ...', and in the same language the user asked in.
Absolutely agreed. It seems to be very hard to find information on how to do exactly that. The Phi-3 chat template doesn't seem to introduce a dedicated role for a function call result. So if it seems to be the "user" replying with a function call result, why would the model figure that it needs to phrase that into a coherent message? Also, I fail to get sensible output when there is more than one function declared and the model is supposed to be free to use a tool or reply directly. Often, I get long chunks of what appears to be training data appended to the initial reply.
Very useful, thanks!
Function calling is very difficult. I am trying to do a POST API call with Llama3:8B, Ollama, and CrewAI. My use case is that I get a text string of OCR data, and then I need to map certain fields from that OCR to a JSON and send that JSON to the POST API to save the transaction. It is way, way difficult to build. But if the LangChain tooling can solve it the way closed models like GPT-4 can, then it can unlock good enterprise value.
What was the name of the paper that shifts the probabilities to make a JSON response more likely?
Can check it out here github.com/1rgs/jsonformer
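For context, jsonformer wraps a Hugging Face model and only samples the value tokens inside a fixed schema. Going from memory of its README, usage looks roughly like this (the model choice is just an example, not a recommendation):

from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_id = "databricks/dolly-v2-3b"  # example model; any causal LM should work
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
}

jf = Jsonformer(model, tokenizer, schema, "Report the current weather in Singapore:")
print(jf())  # returns a dict that matches the schema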
Great video. I was hoping this would give me a reason to try LangChain vs my own prompt/post-parsing for a web ui, but I'm actually getting better results than this demonstrates. I'm using llama3-8B via LM Studio. I think until these guys get their sh*t together and create a standard for output, this is going to be similar to the browser wars (standards). At the very least, they should all conform to current markdown standards or accept a config/spec for default output. Whoever comes out with an open source competitive model that does this is going to be the clear leader... for me anyway.
...And if such a model exists, please point me to it!! :)
Is it possible to fine-tune a small language model for function calling?
For example, if we look at BERT models that perform zero-shot classification, we can pass a set of labels to them, so maybe it is possible to use a similar approach to get a very performant model just for function calling, since LLMs are very large and almost always require a GPU. I know that Phi-3 is very small, but on my machine it takes about 3 GB of GPU memory.
Yes, it's very possible to do. The key is getting the dataset, and most people aren't making their datasets for this public.
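Purely as an illustration (not from any published dataset), a single training record for such a fine-tune could look roughly like this, with the target being the structured call rather than free text:

# Hypothetical record shape for a function-calling fine-tune; field names are made up.
example_record = {
    "system": "You have access to: get_current_weather(location, unit).",
    "user": "What's the weather in Singapore?",
    "assistant": {
        "tool_call": {
            "name": "get_current_weather",
            "args": {"location": "Singapore", "unit": "celsius"},
        }
    },
}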
Thanks for sharing. Could you point me to the next step, where the function is actually called and provides a natural-language response? I created a fake function to test it out:
def get_current_weather(location, unit="celsius"):
    return f"The current weather in {location} is 30°C."
Sorry in advance, as I suspect I lack LangChain knowledge on this.
If I get it right, what you call "function calling" is not calling the function, but just identifying whether a function should be called and producing a defined structure for it; then we have to handle the actual call afterwards. I hope I got it right. Here is an example I built to do this last step:
from langchain_experimental.llms.ollama_functions import OllamaFunctions
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool

def get_current_weather(location: str, unit: str = "celsius") -> str:
    """Returns weather in a given location

    Args:
        location: The city and state, e.g. San Francisco, CA
        unit: The unit of temperature, either celsius or fahrenheit
    """
    return f"The current weather in {location} is 30°C."

model = OllamaFunctions(
    model="llama3",
    keep_alive=-1,
    format="json",
)

model_with_tools = model.bind_tools(
    tools=[
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["location"],
            },
        }
    ],
    function_call={"name": "get_current_weather"},
)

response = model_with_tools.invoke([HumanMessage(content="what is the weather in Singapore?")])
print("Model response:", response)

# Execute the function based on the model's output
if response.tool_calls:
    for tool_call in response.tool_calls:
        if tool_call['name'] == 'get_current_weather':
            args = tool_call['args']
            result = get_current_weather(**args)
            print("Function result:", result)
Model response: content='' id='run-aa594ea7-e739-496b-a74b-592d2e08b00b-0' tool_calls=[{'name': 'get_current_weather', 'args': {'location': 'Singapore', 'unit': 'celsius'}, 'id': 'call_ec83fe9d46f649abbc5dc69651e3908d'}]
Function result: The current weather in Singapore is 30°C.
Yes, you are right; it is getting the LLM to tell us which functions to call and what args to send in with the function call.
Hi, is that because we try to give it more accurate and machine-readable input, so that the model does not have to 'think' too much, can follow the correct format like JSON and some basic function structure, and can also meet some complex requirements? That way is more efficient and energy-saving.
I hope you make a ReAct agent tutorial with OllamaFunctions..!
Hey, that's very helpful for understanding how to run these models locally.
Can you (or anyone) tell me how to actually execute the function call and pass that response back to the LLM? Is it possible without LangGraph?
I want the LLM to decide which tool to call; once it decides that, it should do entity extraction and then invoke the tool, which returns the answer back to the LLM, which gives it to the user. This was easy with AgentExecutor in the OpenAI examples.
Is a similar thing possible with Ollama?
did you figure it out?
Can I use function calling with llama.cpp?
In theory yes, but you might need to mess with how to get it to accept them, etc.
How can one pass multiple functions and let the model decide to use a particular one? Does it support multiple functions?
The bind function takes in an array of functions, so you can simply add the additional functions to the array, separated by commas, e.g. [f1, f2].
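A rough sketch of what that looks like with the OllamaFunctions example from this thread (the second tool and its schema are made up, and leaving out function_call= means the choice is not forced):

from langchain_experimental.llms.ollama_functions import OllamaFunctions

model = OllamaFunctions(model="llama3", format="json")

weather_tool = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]},
}
stock_tool = {
    "name": "get_stock_price",
    "description": "Get the latest stock price for a ticker symbol",
    "parameters": {"type": "object",
                   "properties": {"ticker": {"type": "string"}},
                   "required": ["ticker"]},
}

# bind both tools and let the model pick which one fits the question
model_with_tools = model.bind_tools(tools=[weather_tool, stock_tool])
print(model_with_tools.invoke("What is NVDA trading at right now?").tool_calls)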
But how do I use function calling along with chat messages like the user, system, and assistant roles?
At last someone finds a good use for agents: give them some tasks you want accomplished and let them loose overnight to use the internet :)
I don't understand how this demonstrated using agents overnight on the Internet? I'd really like to know how to do that. What did I miss?
@@willjohnston8216 Mr. Witteveen just mentioned this as a possible implication. I was just glad to see more people turning their minds to some real-world use cases for agentic flows, like giving your agent a topic and letting it research it, finding products/software you would never find in ads, doing some data gathering and processing for you, providing helpful summaries on hot topics you never have time to investigate properly yourself, etc. etc.
Well, actually, where are the functions? I only see a JSON string.
It all looked well and good until you try feeding a question into the 'agent' that doesn't relate directly to "get the current weather in a given location".
I thought the whole point of function calling/tooling was to present the LLM with the opportunity to use tooling if necessary.
Has someone tried to load the models with something other than Ollama, like the Hugging Face transformers pipeline? In other words, I would love to know how to run these models on Linux-based servers like Databricks, where I am unable to run the Ollama application in the background like on my Windows PC.
Ollama already supports windows
@@MavVRX For Linux-based servers like a Databricks server.
I made a Llama 3 review deep-dive video and show loading it in HF Transformers in a Colab there.
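For a rough idea (not the video's code), loading a model through the transformers pipeline and wrapping it for LangChain looks something like this. Llama 3 is gated on the Hub, so Phi-3 is used here as a stand-in, and device/quantization details are left out:

from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",  # stand-in; swap for a model you have access to
    trust_remote_code=True,
    max_new_tokens=256,
)
llm = HuggingFacePipeline(pipeline=pipe)
print(llm.invoke("Return a JSON object with the current weather fields for Singapore."))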
You are not using the latest version. It's now called "bind", not bind_tools.
I am using the latest langchain-experimental 0.58; bind is used in the main function-calling flow with proprietary models, but for OllamaFunctions they still have it as bind_tools. If I am missing something, send me a link.
Very cool, but this is kind of useless unless you can mix text responses and function calling with the same prompt.
hmmm. my hobby is pizza as well. :)
This is like the reverse of crypto mining. Lol😅
The class OllamaFunctions is in langchain_experimental.llms.ollama_functions, and I get a warning message that it will be removed, but they still haven't provided a replacement!