The BEST Open Source LLM? (Falcon 40B)
- Published Sep 27, 2024
- TII Call for Proposals with Falcon 40B: falconllm.tii....
Falcon Github samples: github.com/Sen...
TermGPT: • Letting GPT-4 Control ...
GPT-4 Overview: • Sparks of AGI? - Analy...
Neural Networks from Scratch book: nnfs.io
Channel membership: / @sentdex
Discord: / discord
Reddit: / sentdex
Support the content: pythonprogramm...
Twitter: / sentdex
Instagram: / sentdex
Facebook: / pythonprogramming.net
Twitch: / sentdex
Just realised for the first time that I'm watching your videos for work... I used to watch them for fun, and now I get paid to watch them!!! Feeling quite humble ☺
I would guess the reason why some of the more modern models at much lower parameter counts are performing better than GPT3/3.5 is because the latter were trained, pre the Chinchilla paper, on datasets that were too small in relation to their parameter counts. Prior to Chinchilla it was common to use a 2:1 ratio, compared to post-Chinchilla where 20:1 or 30:1 is now the norm.
thanks for this insight, man. it's a bit hard to navigate the field right now as everybody and their cat are publishing.
Hi,
Can you share any suitable resources for your statement so that I can explore it a little
@@Zaratch Search up ‘Chinchilla’s Wild Implications,’ I think it’s a good overview
wow nice info!
@@ZeroRelevance I'll look into it. Thank you. Apart from that, do you recommend anything which can help me make my way into a career in AI?
Stoked for the fine-tuning video, can’t wait
Holy shit, last time I watched you you were teaching me game development in your bedroom, now you're living in a data center
my comments would include - having the model run on much more data and much more recent data and also training the model on all your docs plus having more aggregated data and aggregated plugins - i think the main bottleneck for most open source LLMs is the amount of VRAM (GPU RAM) available - it would be nice to find ways around this via RAM disks or lower cost GPU cluster nodes - eventually we will see more GPUs with lots more RAM but it could take a while - lots of growth and interest in AI will help push things forward quickly - the hardware is catching up, it is just not quite there yet for the common man - in 10 years we will likely have quantum functions helping out and face a similar situation all over again but it is more than enough now to just enjoy what has been wrought and look at the constant daily improvement and be happy with that and not project too much
This is awesome! Not having to send all your data to openai is crucial for privacy reasons. Wonder how it performs in languages different than english...
Tell me if you're doing it pls, I read that it can't work for languages such as Indonesian, Malay and other Asian languages
@@fiqigading102 mb you could just translate to indo?
@@vasilylukichev-pp4sh hmmm, i think the auto translate output is not very good either. There is a good translator that uses AI but i think it's not free xD
🤔i think it would be neat to always have like 6 examples per prompt to get a good overview over a models capabilities
Thank you for your feedback
@@sentdex I agree with that guy
@youtubescroller886 lol 'that guy' 😅 1 post later and you already forgot the name 🤣
I too agree with that guy
@@jonathan-._.- I'm afraid you have to change your channel name to "that guy" now
Again as the comments suggest here.. can't wait for your Fine tuning video 😅
Can't wait for qlora fine tuning video!
The pace of development in LLMs is at light speed. My brain hurts trying to keep abreast of the myriad of applications and use cases. The opportunities are endless and all of this (general public use) happened, as you have stated, within the last 12 months or so. I have a feeling it's going to continue for the next 2-5 years. Hopefully, I'll still be employed as a DevOps engineer. LOL
I still remember how I started following you my dude … with tutorials on trading with python 😂 a long time ago
i was trying to use falcon 40b instruct on a 96 vCPU, 360 GB RAM and 4 NVIDIA T4 GPU machine, but it takes almost an hour to give a single output. can someone please tell me if there is something that I might be doing wrong for the inference time to be this high, or does it usually take this much time to run?
Great video! Do create a tut where we can finetune this model on our own custom dataset!
Part 10 of Neural Net from Scratch, about analytical derivatives??? Please bring the series back!
@21:30 is a spicy take , i'm in
UPS is delivering my $100 Nvidia P40 card tomorrow.. hoping I can make it run these models. Won't fit the 40b model tho.. maybe if I find one more card in my price range..
T40? I'm not familiar. Do you mean a T4? If you got a T4 for $100 that'd be awesome. With 16GB of memory, you could check out the Falcon 7B model.
@@sentdex no no, the older 24gb Nvidia Tesla p40, I'm just realizing that autocorrect changed it to T40.
@fuba44 oh wow. Are you running into any trouble with the pascal architecture?
@@sentdex so far, no issues. In fact I'm considering ordering a second card.
@@w花b maybe I should be more clear here, I had no issues with the architecture, but the card itself is not "consumer grade" so it is not just "plug and play". It comes fanless so I needed to buy a special squirrel cage fan and design/3d-print a custom adapter in order to cool the card. And it's not your standard PCIe power adapter, it's a special plug, so you need a special PCIe converter to power it. I found those hard to get where I live.
You ROCK!!! Love your Work!
I wonder how slow this would be on a Raspberry Pi 4.
You'd be wondering for quite some time I imagine :D
Hi thanks for sharing this content. Could you create a video in fine tuning this model or create chain of thoughts/ fine tuning for complex tasks on Falcon-40b??
Do you have any videos about fine-tune falcon model?
How do you fine tune? Also, if you wanted an API endpoint, how would you host it without breaking the bank? It seems like it would be more expensive than Open AI
Are these models capable enough to parse tabular data? Just like gpt turbo is after creating a csv agent?
Haven't tried, but I would expect it to. Do you have a simple example that I could test?
I don’t know, I shared the link with you, but it got deleted somehow.
@sentdex can you try building the csv agent using the open-source LLM models? That would be really a game changer since reading and analysing data via LLM would be something on another level.
Any chance you could cover possible techniques to running a model over multiple GPUs so that we could for example run 80 billion parameter models
The examples shown in the github include that. It's fairly trivial these days. Just let transformers lib automatically map to your devices pretty much.
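The automatic mapping mentioned here can be sketched roughly as follows. The model id and flags are assumptions on my part, and the heavy download only happens if you actually call the function:

```python
# Sketch of multi-GPU loading via transformers' automatic device mapping.
MODEL_ID = "tiiuae/falcon-40b-instruct"  # assumed Hugging Face repo id

def load_falcon(model_id: str = MODEL_ID):
    # Imports kept inside the function so nothing heavy runs at import time.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half-precision weights to save memory
        device_map="auto",           # shard layers across all visible GPUs
        trust_remote_code=True,      # early Falcon checkpoints shipped custom code
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

With `device_map="auto"`, the layers get placed on GPU 0, GPU 1, and so on, spilling into CPU RAM only if the GPUs fill up.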
What do you think is the best free open source language model today? thx a bunch!
Do you think you could do a video explaining hardware requirements and cpu vs gpu?
Also how does the bit size affect ram and performance?
For example, I'm considering buying more ram for my pc, 2 x 32gb for a total of 96gb (already have 32gb). But I have no idea if that would be enough for a 13-15b model (I would be running on cpu)
Unfortunately reddit is closed so I can't really ask these questions. But maybe this is an easy video for you if you were looking for content ideas.
FYI you want all of your RAM to be composed of the same size sticks and have an even number of them. There are weird technical reasons why, but having an uneven number of RAM sticks or having RAM sticks of different sizes causes performance losses. Anyways, you're going to need VRAM more than RAM, which means you need a specialty GPU with a lot of VRAM, something like an A6000. Most standard gaming GPUs don't have anywhere near enough VRAM until you get to the upper end of modern consumer cards like the RTX 4090, but even then it still has less VRAM than something like an A6000 (a 4090 has 24GB VRAM and an A6000 has 48GB; there are also other workstation cards cheaper than an A6000 which also have 48GB VRAM). Usually these types of GPUs are classified as "workstation" GPUs and can be very expensive. Even then you might still actually need 2 of them to get the 40B model running. You could probably get the 7B model running on a GPU with around 16GB VRAM though.
@@pbjandahighfive thanks man
Audio is fine for me.
Thank you for confirming!
Really cool thing ❤
This LLM is the best. He wears this necklace that resembles an egg with disoriented eyes and nose and mouth, looks a bit creepy, but he is a very good person.
Does conversation history work with Falcon40B Instruct, anyone tried?
I have a question. I currently have GPT4All on my PC. I'm using a simple install package that uses the CPU and system RAM. My PC is a workstation (custom build) running an Intel Xeon E5-2680 v4 CPU (14 cores / 28 threads) and 32GB of system RAM. I also have an RTX 2060 graphics card with 12GB of GDDR5 memory, but that would not seem needed for this framework I have installed. The largest chatbots available are 13B and have a minimum system requirement of 16GB RAM. I have dedicated 12 threads, which seems to be, functionally, just the same as when I had 8 dedicated to the chatbot. It typically responds faster than the time it takes me (usually) to ask it questions, but I notice it hallucinates.
My question is, would I be able to run this 40B local LLM (Falcon) with my current system and, if so, where would I download an Instruct LLM like this (that would run on my CPU / system RAM)?
Google GGML
Thanks sentdex! Is there an open source implementation for function-calling like openai's that works with falcon or any locally runnable model?
Also can you cover easy workflows, for finetuning the instruct model to work with public/private self-curated big data such as pdfs or large text files
I think the closest to it would be either: Doing it yourself with prompting or possibly LangChain Agents. I may eventually cover Langchain agents for this reason, but I am still undecided if I want to use that or just do it myself.
@@sentdex I also have some nice ideas to try, but regarding function-calling, I've seen things like LMQL and glimpses of others that manage to solve LLMs' formatting issues and prepare data for API calls, but openai's way is by far the most simple and elegant. I just hope to hear soon that someone cracks it and generalizes it to accept and use any model. If you hear something please let us know!
Thanks for the reply 💛
what is this lambda platform you talk about?
okay so yeah... better get my act together putting in a proposal there. My guess is they will need to see an MVP.
Neat outro
Good stuff
I don't know about GPT4 not making mistakes as often as Falcon. I do not remember a single time when any code came out with no syntax errors, or ran on the first go
Anyone know what RBRMs are at 9:40? Haven’t been able to Google it successfully.
Edit: Rule-Based Reward Models!
Possible to use with functions? I want to extract json data from blocks of text. I have 8mm records, so open ai will be too expensive
if you could do a tutorial on fine tuning the new version of it
thank you for your videos
Just need a larger context window. 8k would be nice; ~6000 tokens is about the most those of us with 24GB cards can push now. Nvidia and AMD need to give us more vram.
can we fine tune this model to create custom embeddings ?
Yes
I just tried asking ChatGPT that practicing law question, and it got it right. I'm on the free plan, so that would still be 3.5, right?
But I can't implement it using code, I'm getting an error. Can you share any working code or method?
Can these models access the internet like bing gpt? If not, how will that be possible?
What is your opinion about mpt-30B
This guy was into neural nets even before it all blew up)
@sentdex - could you pls try a similar vide on MPT-30 by MosaicML #llm #ai #mpt30b #mosaicml
aaaa you are still alive :D
Can you do a video about save the models and training data locally and running them from your own GPUs? Lots of people have GPUs, few people want to pay hourly for cloud services..
The code shown here is mostly me running it locally. Used cloud too to speed it up but the code is identical. You may just not have enough gpu memory locally to run these models. Otherwise, might I ask what you're having trouble with locally?
@@sentdex I wonder if AMD would sponsor if you showed a video chaining 4 Radeon RX 7900 XTXs together and loading a 40-50bil parameter model =).
Will this run on an nvidia p40?
can i run the 40b falcon on my ryzen 9 5900x and rtx 3090
No. It would be too slow.
Can you do a comparison to mpt30b
Now I just need to put a couple A100s on layaway.
hahahahahaha yes man
i feel you so much
Have you tried anything censorship related with the model? There were some posts about it acting weird about Saudi politics and LGBTQ topics.
I've not seen those rumors, but I encourage you to try the model for yourself. I don't find that to be even remotely true. This model feels like one of the most un-manipulated general purpose models that I've used so far.
I've tested that quite a bit and it's not censoring. It has no preference for the muslim religion or any certain politics.
The Instruct models are "safety" flavored, that's because they contain OpenAI data. The foundation models show no artificial biases.
What would Jordi do?
This was wonderful news. Gpt4 is declining
How much did they pay for this promo!?
Making a mixture model with falcon might get it closer to gpt4
Is it true that the Falcon models won't usually say anything negative about the UAE or is that just a rumor? The only official word on censorship is that they removed adult content and machine generated text from its training data. (Also, how did they identify machine generated text? That's known to be an extremely hard problem.)
When I use and work with GPT-4, I think it's relatively clear there's what I would call "heavy" moderation from OpenAI being applied. When I work with Falcon 40B, I do not notice any such thing. Even on topics you might suspect there could be moderation applied, I do not find any examples, so I would say that's just false. Beyond that, it's a totally open model, so any real moderation/censorship effort would be kind of pointless since users have the power to modify weights w/ fine-tuning.
As for identifying machine generated text, I think we probably have to wait for their paper to be released to learn more about that. It's certainly going to be a problem going forward. I am very curious to hear more about how they curated data too.
Falcon was created by young Algerian geniuses, and since the Emirates regime is a criminal, conspiring and oppressive regime that is hostile to Algeria, the Emirati regime will be exposed soon, God willing.
As a researcher, I’ve found Star coder and star chat (beta) to be very effective instruction tuned models, even for NL.
In general, how have you found the comparison of LLaMA vs star coder/chat vs falcon?
Also on inference times given the flash attention in the huggingface models.
I've been staying away from the legal vagueness that is llama. I think Facebook should have gone about the release of those models much differently. I've only lightly played with a few llama models, but haven't put much effort into actually working deeply with them due to licensing questions. There are "work arounds" that it appears Facebook is choosing to allow, but tides can change at any time and I don't like that.
As for StarCoder, I used it a bit and didn't really find it to be good enough. Do you think StarCoder is better than the results here for coding? Have you some examples from StarCoder that you think show it's exceptional? At least at the moment, I am mostly thinking in terms of my TermGPT project, so I need mostly code, but also need a decent language understanding and also system administration in general. So far, I find Falcon 40B to be the best here, but everything is purely anecdotal. I don't need everything to be contained in one model either, but I am seriously considering putting my stakes down in Falcon 40B, probably fine-tuning it slightly and going full steam ahead.
I’m a research student from Melbourne and would just like to know if Falcon is a censored LLM like ChatGPT? I’m trying to study the bias of LLMs and whether they are impartial with their answers
Too bad the sequence length is just 2048
11:00
HA the scifi trope of making computers autistic is dead wrong.
Hey. Please increase the volume of future videos if possible :)
AGI is impossible to realize and the sooner we realize that the better
5th
ehehheeheheh
Absolutely massive!
Pop filter bro...
Funded by the UAE? LOL
all worthless without book-size context limits.
I tried it and wasn't impressed at all
Hi, model 40b-instruct works on my 3090 with torch.bfloat16. It takes about 23GB VRAM and 62GB RAM. Am I doing anything wrong? 😁
Yes, you're supposed to fit the entire model on your GPU (the VRAM). Otherwise it would take way too long. How long does it take to process a single token? You need something like 45-50 GB memory for the falcon 40b-instruct even at 8-bit precision. Try the 7b model instead, and test 8bit and bfloat16.
@@Woollzable Hi, meanwhile my ssd died so I no longer have this model and it takes too long to download. I would rather test llama-2, which should be better than falcon in every test. (see The Impact of chatGPT talks (2023) - Keynote address by Prof. Yann LeCun)
Which Llama-2 are you referring to? You don't have enough memory for some Llama-2 models, only the Llama-2 7b (at both 8-bits & 16-bits) model and Llama-2 13b (at 8 bits)
Let's break it down. The size of the model in RAM memory is very roughly:
Let's say the model has 7 billion parameters:
32-bit: 7 billion parameters * 4 = 28 GB GPU RAM minimum. (not used for inference)
16-bit: 7 billion parameters * 2 = 14 GB GPU RAM minimum.
8-bit: 7 billion parameters * 1 = 7 GB GPU RAM minimum. That's because 8 bits = 1 byte.
All of this is only for inference, not for training/fine-tuning. For training you need at minimum double the RAM of inference (just using the model without training/fine-tuning).
32-bit is not used for inference. Usually the highest is 16-bit, but for training/fine-tuning you might need higher precision, such as automatic mixed precision (mixing 16-bit and 32-bit).
Also, your CPU RAM is not that important, you can fit part of the model in the GPU RAM and part of the model in the CPU RAM and then do a memory swap, but that is extremely slow and not recommended, but it might work in some cases.
Now think about Llama-2 13b which has 13 billion parameters. Do the same calculations as above but replace the 7 billion with 13 billion. You can probably run the Llama-2 13b at 8-bit.
Finally, think about Llama-2 70b which has 70 billion parameters. You do the same calculations :D
I also forgot to tell you that the lower precision you go to, the worse the performance/accuracy. 16-bit is better than 8-bit which is better than 4-bit etc. Quantization (going to lower bits) is done to save memory but at the cost of accuracy.
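The back-of-the-envelope math in this thread can be captured in a tiny helper (decimal GB, weights only; this is a sketch, not exact accounting, since activations and the KV cache add more on top):

```python
def inference_gb(n_params: float, bits: int) -> float:
    """Rough GPU memory (decimal GB) to hold the weights for inference."""
    bytes_per_param = bits / 8
    return n_params * bytes_per_param / 1e9

def training_gb(n_params: float, bits: int) -> float:
    """Crude lower bound from the comment above: at least double inference."""
    return 2 * inference_gb(n_params, bits)

# 7B model: 28 GB at 32-bit, 14 GB at 16-bit, 7 GB at 8-bit
```

Plugging in 13B or 70B reproduces the scaling described above for the larger Llama-2 variants.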
@@hakoren4444 Thanks for the explanation. I do time series regression where there is no problem fitting the model in memory :D due to the small amount of data.
Very cool. I got the Falcon-7b-instruct model working on my home PC while I watched your video. Only took about 5-10 mins to get it all going. Inference works well on my RTX 4080 (16GB) GPU too. As soon as I load the model using torch.bfloat16, the transformers library allocated ~14GB of GPU memory, but it works really well!
I'm going to have to replace my LLM app development with this local endpoint to save cost on OpenAI API calls ;) I wonder if that's a thing, a local dev loop pointing at a smaller, locally hosted LLM, and then when pushed to production, a large model or hosted endpoint, a la GPT-4. Depending on how you use the LLM in your application, I can imagine this could possibly lead to a whole new class of heisen-like bugs. Interesting to think about.
Great vid btw. I like how you keep things simple, and high-level. This is the perfect level of depth/complexity for video.
Could you share some metrics on your inference speed?
@@Moonz97 Using the text-generation pipeline I'm getting about 10-12 tokens/second on my RTX 4080.
@@snarkyboojum that is actually a really good number, isn't it? When I did my project with GPT3.5 they allowed 3 prompts a minute even though I was paying for prompts.
how is it going now?
Audio is fine for me.
If you had a bigger microphone, it could cover your whole face.
The dream!
not gonna continue neural networks from scratch series?
We will.
would be awesome to see you finetune the model. do you know if something like LORA could work to reduce the cost of fine-tuning?
We'll almost certainly be using lora to fine tune falcon 40b, so stay tuned for that. I am very curious to see how well it works in practice. I think the hardest part right now though is gathering quality fine-tuning data. It's easy to think of 10 decent samples. It's quite hard to think of a thousand haha.
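For context, a LoRA setup mostly boils down to a handful of hyperparameters. Here is a sketch with illustrative values (my assumptions, not the actual config from the video) of the kind of dict you would hand to `peft.LoraConfig`:

```python
def falcon_lora_hparams() -> dict:
    # Illustrative values only; in practice you'd tune these per dataset.
    return {
        "r": 16,                    # rank of the low-rank adapter matrices
        "lora_alpha": 32,           # scaling applied to adapter updates
        "lora_dropout": 0.05,       # dropout on the adapter path
        "target_modules": ["query_key_value"],  # Falcon's fused attention projection
        "task_type": "CAUSAL_LM",
    }

# e.g. peft.LoraConfig(**falcon_lora_hparams()), then peft.get_peft_model(model, cfg)
```

Only the adapter matrices train, which is why LoRA cuts the fine-tuning memory bill so dramatically compared to full fine-tuning.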
@@sentdex I was able to successfully fine tune the model using qlora, I think it took >160gb of VRAM. But even just for 10 minutes of training (I only had that amount of credits) I was able to get much better results than fine tuning falcon 7b for 10+ hours.
@@macklinrw what service did you use?
@@sentdex Crowdsource it from your 1.2M followers on YouTube ;)
@@sentdex take a look at the LIMA dataset (Less Is More for Alignment). I suggest you individually send each example to gpt3.5 and rephrase it to explain step by step/verify/change style to create a perfect dataset; I had great success with llama 7b this way, fine tuning on curated LIMA
gpt often makes mistakes. Sometimes it becomes stupidity and not artificial intelligence. I haven't tried Falcon yet. Is it better?
Would you be so kind as to refer me to a video explaining, at a non - computer person level, how to set this up?
Do you think the 7B model can be fine tuned to auto-completing code and be used as a local and good substitute for co-pilot? (for those who have the required compute power, which I don't :D)
Possibly fine-tuned, but there might be better models at that size for just auto-completing, like replit/replit-code-v1-3b. If you had an awesome fine-tuning dataset though, I imagine Falcon 7B could be quite good at this task.
Try asking time differences. If it's 1:00am in Tokyo, what time is it in London. Amazing how few LLM's get this right.
Its all fun and games, but can it predict the future? Otherwise what is the point.
Not inventing the future by controlling data, but purely predicting it by the inputs of the world.
Amazing vid.
One question, tho, is base ChatGPT actually 175B? Was it confirmed by anyone? I mean, the "original default" version probably was somewhere around that number of params. However, since they introduced the "turbo" version, I feel like they just scaled it down. It feels to me that it actually got dumber in some instances, and additionally, how would they actually speed it up if the underlying architecture is still GPT-3.5?
I definitely do agree tho that the Falcon 40B and LLaMA-65B "feel" more knowledgeable than 3.5 from my experience, with LLaMA slightly outperforming Falcon. This is all subjective of course and it depends on what your use case is. This ties neatly into the final observation.
The coding part of these models is still FAR from what I could get even with 3.5. This might change, however, if we finetune the base models to act as a sort of agents for specific tasks since the models are ours to modify.
I tried playing with LoRA / QLoRA, but I couldn't achieve any good results for some reason (LLaMA models). I tried replicating early Alpaca training, and it all flopped. There are probably some errors in the code I can not seem to recognize...
As for Falcon, it just takes a huge amount of time, and unfortunately, I can not afford not to use my PC for more than a day or two, so I didn't have a chance to play with it.
I think I heard GPT 3.5 is actually smaller than GPT3, but I might be mistaken, and that'd be the "turbo" version you mentioned that does indeed feel a little scaled down, even since init release. It's hard to really know when so much is kept "secret." It becomes especially problematic as we learn that these models are used on their own outputs too in some, or maybe all, cases to further "improve" outputs. I sure wish OpenAI was a tiny bit more open :D
As for fine-tuning, stay tuned. We're going to almost certainly make use of LoRA/QLoRA.
@@sentdex Would love to see your take on LoRAs. Can't wait!
Isn't the model fully deterministic if you use the exact same seed and weights are exactly the same for each prompt?
Aren’t all current ai deterministic given identical inputs?
It depends how you frame things. If you have identical input w/ identical seed, yes, a frozen model is very much deterministic.
In practice, with LLMs, and natural language input, however, your input is going to nearly infinitely vary. As such, your outputs will too, so "natural language" input applications are by nature going to be treated as non-deterministic.
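The frozen-model determinism point can be shown with a toy stand-in for token sampling, using only the standard library (this is the principle, not the real decoder):

```python
import random

def sample_token(weights, seed):
    # Draw one "token id" from a categorical distribution with a frozen seed.
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# identical input + identical seed -> identical output, every run
```

Vary the prompt (the weights) or the seed and the output can change, which is exactly why natural-language applications are treated as non-deterministic in practice.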
@@iansharoo2 no
It is not 100% open source
60% of the model only.
RunPod is another solid alternative to Lambda to run these models :)
I've seen a few people using runpod. I think the ~concept~ is fantastic. I tried once to dive in but all the abstractions were kind of annoying. I plan to revisit it because I want it to work for my needs, but we'll see haha. Used right, it could be potentially even cheaper than Lambda and even simpler to use possibly.
@@sentdex curious what you mean by abstractions in the context of the platform?
@@merrell_io Essentially it seemed quite challenging to integrate runpod into a larger application that might be powered by an LLM. This is in particular to the "serverless GPU," as that's what I was looking at and think I would want. I cannot remotely speak on runpod intelligently though and forget the precise details where I formed that opinion, which is why I plan to dive in again. I happened across it while also looking into langchain, and I need to probably look into runpod on its own entirely first lol
you want to give GPT4 a terminal huh lol
Low volume
Please fix it from the next one!
Thanks!
@@deonex4993 Still not near as loud as the other videos.
Hmm. To be honest I am not noticing any audio level issues here. This has been a problem with some previous videos, but I now actually have decibel checks in my workflow for producing a video because of those issues in the past and I find the level here to be as expected.
I really hope an llm as powerful as gpt4 becomes available open source soonish.. having an llm running on an engineering business's own server would allow for safer use.. without sharing sensitive information with an outside server
GPT4 looks like it is 8 models running in parallel at the moment with only 220B parameters each. GPT4-level performance will be a while away, but it seems like performance scales up as better data is made, and models can be scaled down while maintaining performance in this way.
The future is bright
This is a serious disaster for this world
1st
What an amazing topic about the open-source model Falcon 40B. A very important point you make is that it is "YOURS" (the model).
how about wizardcoder? It seems like wizardcoder might turn out to be a better coding LLM than falcon 40B?
noob question: why is hugging face benchmarking important?
Great videos lately!
Thanks!
its actually because i am a magic fairy which makes it work
If you constrain the logit selection to the user input context, after subsequent autoregressive updates, the model performance shoots up
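One way to read that suggestion is masking the logits so only tokens that appear in the user's input can be sampled. A toy sketch of the idea (purely my interpretation of the comment, with made-up names):

```python
def mask_logits_to_context(logits, context_token_ids):
    # Drive every logit outside the user's input vocabulary to -inf,
    # so those tokens can never be sampled.
    allowed = set(context_token_ids)
    return [x if i in allowed else float("-inf")
            for i, x in enumerate(logits)]
```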
I was just playing around with Falcon chat a day ago, it's pretty good and awesome for being open source
how is Falcon performing in other languages?
how do i concretely do finetuning?
Please finish your neural network from scratch series, there's only one more episode needed to finish it and it would help so many people. It's the only good series I found on YouTube that explains it clearly.
I followed it all the way through and it took me absolutely ages to figure out back propagation, there were so many tiny questions I had that could have saved hours if it was just explained through an example. And once I got it working, I thought it was wrong because the network cost was decreasing but the accuracy stayed the same, and it took me forever to realize that it was a combination of the size, learning rate, and number of epochs that caused that, not my code.
And please finish it, it would help so many people who are trying to learn machine learning fundamentals. Anyone who has made it to the final episode is looking to learn, and will find it extremely useful
Any recommendations on SQL and ability to answer questions from multiple tables and plot graphs let's say from a CRM dataset?
Langchain