The Best Tiny LLMs
- Published 15 May 2024
- ➡️ Trelis Function-calling Models (incl. Trelis Tiny): trelis.com/function-calling/
➡️ ADVANCED-inference Repo: trelis.com/enterprise-server-...
➡️ ADVANCED-fine-tuning Repo: trelis.com/advanced-fine-tuni...
➡️ One-click Fine-tuning & Inference Templates: github.com/TrelisResearch/one...
➡️ Trelis Newsletter: Trelis.Substack.com
➡️ Trelis Resources and Support: Trelis.com/About
Affiliate Links (support the channel):
- Vast AI - cloud.vast.ai/?ref_id=98762
- RunPod - tinyurl.com/4b6ecbbn
Resources:
- Slides: tinyurl.com/4kvnu4ad
- Chat fine-tuning datasets: huggingface.co/collections/Tr...
- One-click LLM templates: github.com/TrelisResearch/one...
Models:
- DeepSeek Coder 1.3B: huggingface.co/deepseek-ai/de...
- Phi 2: huggingface.co/microsoft/phi-2
- TinyLlama: huggingface.co/TinyLlama/Tiny...
- Trelis Tiny: huggingface.co/Trelis/Tiny
Repo Access (purchase includes lifetime access to improvements):
- ADVANCED Fine-tuning: trelis.com/advanced-fine-tuni...
- ADVANCED Inference: trelis.com/enterprise-server-...
Chapters:
0:00 Best Small Language Models
0:19 Video Overview
1:23 Benefits of Tiny LLMs
2:09 Fine-tuning and Inference Repo Overviews
4:28 Performance Comparison - TinyLlama, DeepSeek Coder and Phi 2
16:21 Fine-tuning Tiny Language Models
33:55 Function-calling quantized models with llama.cpp
44:44 Challenges and Tricks - Function-calling with Tiny Models
1:00:00 What are the best Tiny Language models?
Reminder: Be careful when using private keys (e.g. OpenAI or HuggingFace). If they are exposed, make sure to rotate them, as I do after each video. - Science & Technology
** GGUF Not working **
Towards the end of the video, I state that the issue with function calling via GGUF is due to the prompt format. However, the real issue is that the GGUF model (unlike the base model) responds with malformed JSON objects.
There appears to be an issue with the GGUF quantization that I need to resolve. I'll update here once it's resolved.
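Since the symptom described above is malformed JSON rather than a prompting problem, one quick way to catch it when testing a quantized model is to validate each response before dispatching the call. A minimal sketch (the `{"name": ..., "arguments": {...}}` schema here is just an illustrative assumption, not the exact format used in the video):

```python
import json

def is_valid_function_call(response: str) -> bool:
    """Return True if the model's raw text response parses as a JSON
    object with the fields a function-call handler would expect."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return False
    # Assumed schema: {"name": <str>, "arguments": <dict>} (illustrative only)
    return (
        isinstance(payload, dict)
        and isinstance(payload.get("name"), str)
        and isinstance(payload.get("arguments"), dict)
    )

good = '{"name": "get_weather", "arguments": {"city": "Dublin"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Dublin"'  # truncated
print(is_valid_function_call(good))  # True
print(is_valid_function_call(bad))   # False
```

Running the same prompts against the base model and the GGUF quant through a check like this makes it easy to confirm whether the quantization step is what breaks the JSON.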
Much appreciated.
Most non-professional AI enthusiasts only have 16 GB of non-upgradable memory.
As usual pure 🔥. Thanks for putting the time and energy into your outstanding didactic videos🙌🏾
Tiny LLMs = Tiny Large LMs = (Tiny Large) LMs = LMs
😂 Keep an eye out for an upcoming video on Large TLLMs
Medium Language Models
... whoa
You post the most valuable content on AI/ML.
Thank you so much! Your research is highly appreciated, and this video answers the feasibility question mark in my mind! Looking forward to digging into your company and vids. 👍👍👍🎆
Thanks for the amazing content! Based on my downloads of various quantised models from TheBloke, 5-bit quantisation would seem to be the sweet spot if you want reduced memory usage but still care about quality.
Yeah - it's not as though there is one point where perplexity suddenly drops off. Very roughly, I'd say 8-bit, but yeah, some 6- or 5-bit quants are good too.
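The memory side of that trade-off is easy to back-of-envelope. A rough sketch (weights only; GGUF quant types carry slightly higher effective bits per weight, and KV cache and runtime overhead are not included):

```python
def quant_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate for a quantized model.
    Ignores KV cache, activations, and per-block quantization overhead."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 1e9

# A 7B model at different quantization levels (weights only):
for bits in (16, 8, 5, 4, 2):
    print(f"{bits}-bit: ~{quant_memory_gb(7, bits):.1f} GB")
```

So a 7B model goes from ~14 GB at 16-bit to ~3.5 GB at 4-bit, which is why 4-5 bit quants are the practical floor for 16 GB laptops once you budget for the OS and KV cache.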
Another great video.
Awesome video. Phi 2 is now available for commercial use under the MIT License.
Man, how do you come up with ideas for new videos?! This is pure gold! Would you consider doing something with non-English languages (considering Europe has a nice mixture of those)? I'm wondering if this is even something I should be thinking about when fine-tuning open LLMs...
Cheers! What language? And what topic?
Cheers. I am thinking about French, German, Italian, Spanish and Polish (and English, obviously). But even a clue on how to deal with one extra language would be nice. I don't speak these languages, which makes it even more "fun". I have a custom dataset of 1000 FAQ-style question/answer pairs, currently in English, so that would be an example use case to play with.
Damn, did my reply not show up here? I must be losing my mind... In general I deal with English, but also German, French, Italian, Spanish and Polish. Even seeing how to fine-tune for one non-English language could be very interesting - what are some best practices, limitations, etc. I do have a custom dataset of ~1000 FAQ-style question/answer pairs (upsampled by GPT-4 from the original ~150 questions/answers).
Great insights. Would low-rank training be useful for narrow tasks, like text classification for example?
Yes! Very effective for training for classification - the basic premise is the same as training for function calling (take a look at the recent vid and also the older vid on structured responses).
🎯 Key Takeaways for quick navigation:
11:10 🔄 *Sequence Reversal Challenge:*
- TinyLlama struggles even with a sequence of two, providing an incorrect response.
- The Phi model manages to reverse a two-token sequence correctly.
- DeepSeek Coder excels, correctly reversing sequences of up to three tokens.
13:25 💻 *Code Generation Performance:*
- TinyLlama produces incorrect Python code for prime number generation.
- The Phi model generates a shorter piece of code, but it fails to produce the correct output.
- DeepSeek Coder impressively generates accurate Python code for prime number generation.
15:13 🔑 *Pass Key Retrieval Challenge:*
- TinyLlama fails to retrieve the pass key correctly.
- The Phi model successfully retrieves the pass key from the context.
- DeepSeek Coder, due to its instruct nature, refuses to retrieve the pass key, limiting its applicability.
19:52 🔄 *Fine-tuning Tiny Models with Low-Rank Adaptation (LoRA):*
- For tiny models, using a small rank in low-rank adaptation may leave too few trainable parameters for effective training.
- Adjusting the LoRA rank for tiny models is crucial to prevent underfitting and improve adaptation to the training data.
22:17 🚀 *Function-Calling Fine-tuning with DeepSeek Coder:*
- Utilizing advanced fine-tuning scripts for function calling with the DeepSeek Coder model.
- Exploring the importance of adjusting LoRA parameters for tiny models during function-calling fine-tuning.
- The video emphasizes the practical implementation of these concepts using RunPod and uploading the relevant scripts.
22:57 🛠️ *Connected Weights & Biases for model training tracking; loaded a base chat model fine-tuned from DeepSeek.*
25:17 📊 *Explained LoRA's low-rank adaptation for training attention and linear layers in language models.*
27:26 🔄 *Discussed setting the rank and alpha parameters in LoRA for adjusting learning rates during training.*
29:21 🧠 *Demonstrated fine-tuning with LoRA on a function-calling dataset and emphasized the importance of training enough parameters.*
30:45 🤖 *Compared model responses before and after fine-tuning with LoRA, showcasing improved structure and functionality.*
32:37 💡 *Advised checking the total number of trainable parameters during training to ensure an adequate level for fine-tuning, especially with tiny models.*
34:57 🧐 *Explored two approaches for tiny language models: quantizing a larger model like OpenChat, and fine-tuning a tiny model like Trelis Tiny.*
36:34 📏 *Examined quantization options for the OpenChat model, highlighting performance trade-offs based on bit precision.*
38:38 📊 *Demonstrated the OpenChat model's performance at 2-bit quantization using llama.cpp, acknowledging the need for sufficient VRAM.*
43:57 📉 *Tested 2-bit and 4-bit quantized OpenChat models in llama.cpp, revealing limitations in handling function responses on a laptop.*
45:38 🚀 *Introduced fine-tuning challenges with tiny models for function calling, emphasizing the weakness of non-coding models like TinyLlama or Phi in this context.*
46:07 🚀 *Fine-tuning tiny LLMs for function calling can be challenging, as responses may be verbose and hard to control.*
46:21 🎛️ *DeepSeek is recommended as a base for function-calling fine-tunes, but chained function calling is difficult due to statistical distribution challenges.*
47:04 🔄 *Limiting the model to a single function call and preventing recursive calls can be achieved with careful implementation during inference.*
48:26 🛠️ *Techniques for better responses include providing start-of-response prompts and using information from the JSON object returned by the function.*
50:26 🚀 *Trelis Tiny, based on DeepSeek, is showcased for function calling with high token generation speed; it's suitable for utility purposes.*
56:50 🧠 *Demonstrations using TGI (Text Generation Inference) and llama.cpp for function calling, emphasizing the importance of fine-tuning and manual tweaks for optimal results.*
57:46 🔧 *The ADVANCED Inference repo includes tweaks, like a parameter preventing recursive calls in tiny models, enhancing model responsiveness.*
01:00:21 📚 *Helper text and logic adjustments play a crucial role in handling function-calling challenges, allowing tiny models to answer both normal and function-based questions.*
01:01:40 🌐 *Trelis Tiny is recommended for utility purposes, especially function calling, offering single-function-call capability and short normal responses.*
Made with HARPA AI
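The takeaways above repeatedly stress checking the number of trainable parameters when applying LoRA to a tiny model. The count is easy to compute by hand: each adapted weight matrix W of shape (d_out, d_in) gains two factors, A (r x d_in) and B (d_out x r), so it contributes r * (d_in + d_out) trainable parameters. A sketch with illustrative shapes (the hidden size, layer count, and choice of projections below are assumptions, not exact figures for any model in the video):

```python
def lora_trainable_params(layer_shapes, rank):
    """Count trainable parameters added by LoRA adapters.
    Each adapted matrix W (d_out x d_in) gains A (rank x d_in)
    and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for (d_out, d_in) in layer_shapes)

# Hypothetical ~1B-parameter model: hidden size 2048, 24 layers,
# adapting only the square q/k/v/o attention projections.
d = 2048
shapes = [(d, d)] * 4 * 24  # 4 projections per layer, 24 layers

for r in (2, 8, 32):
    n = lora_trainable_params(shapes, r)
    print(f"rank {r}: {n / 1e6:.1f}M trainable params")
```

At rank 2 this hypothetical setup trains under a million parameters, which illustrates the underfitting risk the video flags: for tiny models you may need a larger rank (or to adapt more layers) to get enough trainable capacity.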
Do you know if the ADVANCED Inference repo supports native logit biasing and constrained generation via the API?
Nice video as always. For function calling I am using NexusRaven V1 on a 1070 Ti and I think it's better than GPT-4.
PS: I'm using Ollama for inference.
thanks for the tips, I'll dig in on those
@@TrelisResearch It's super fast
UPDATE: Phi-2 is now available - incl. for commercial use - under an MIT license!
I have a 2018 16-inch with an x86 chip but 32 GB of RAM, and I can run the hell out of Solar, DeepSeek and Mixtral simultaneously with Ollama, or one at a time with Jan (slower).
It would be nice to have a video about Groq (not Grok), but I don't know how much info is around at the moment.
yeah, I'm kind of tracking it, but they don't give a way to run inference on a custom model yet afaik. Once they do, I think that would def be interesting
Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset containing images and text, for relation extraction or another specific task? Could you do it using an open-source multimodal LLM and multimodal datasets (like Video-LLaMA's), so anyone can further their experiments with the help of your tutorial? Could you also talk about how to boost the performance of the fine-tuned model using prompt tuning in the same video?
yeah I wanna do a vid on multi-modal. I tried out LLaVA and was unimpressed by its performance versus OpenAI, so I thought I'd delay a little. I'll revisit soon
Let's agree they are called SLMs, as in SLiM - unless you want to start using the metric system of pico, nano, micro, milli, etc 😄 In 5 years, "big" as in "big data" will be considered small compared to the biggest.
Hey, I have a question. I trained a tokenizer, changing the tokenizer's vocabulary size, then did PEFT + QLoRA (embedding, lm_head and QKV) fine-tuning. But the model does not perform well. Is it because of a lack of data? Or because I changed the dimensions?
I'd need more info to say...
- What kind of dataset were you using and training for what application?
- Did you merge the LoRA onto the base model you trained? (you have to be careful not to lose the updated embed and lm head layers).
- When changing the embedding settings, you have to update both the tokenizer and the model.
The best video for all of this is the one I did on Chat Fine-tuning.
@@TrelisResearch okay, I will watch the video and come back to you, thanks 🙏
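On the embedding point in the thread above: when you extend a tokenizer's vocabulary, the model's embedding table (and lm_head) must grow to match, otherwise the new token IDs index past the end of the table. A toy sketch with plain Python lists to show the idea (shapes and init scale are illustrative; in the transformers library the equivalent one-liner is `model.resize_token_embeddings(len(tokenizer))`):

```python
import random

def resize_embeddings(embed_table, new_vocab_size, dim):
    """Grow an embedding table (list of row vectors) to match an
    extended vocabulary. New rows get small random init, mirroring
    what frameworks do when resizing token embeddings."""
    while len(embed_table) < new_vocab_size:
        embed_table.append([random.gauss(0, 0.02) for _ in range(dim)])
    return embed_table

old_vocab, dim = 4, 3
table = [[0.0] * dim for _ in range(old_vocab)]

# Suppose the retrained tokenizer added 2 tokens (IDs 4 and 5):
table = resize_embeddings(table, new_vocab_size=6, dim=dim)
print(len(table))  # 6 rows: lookups for the new token IDs now succeed
```

If the resize is skipped (or the resized embed/lm_head layers get dropped when merging a LoRA back onto the base model), the new tokens are effectively untrained or unreachable, which can look exactly like "the model does not perform well".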
Hi Ronan, could you please do a tutorial on GuardRails?
interesting idea, let me add that to the list of potential vids
Great video guys! Can someone help me understand when you use just the LoRA adapter weights for inference and when you merge the LoRA weights into the original model?
generally it's best to merge because inference is slower unmerged (there's an extra addition step to apply the adapter).
The reason not to merge is that you can store the adapter (which is small) separately [if that's useful].
@@TrelisResearch thanks for your reply. got it. please continue with your content, it has helped me a lot
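The merge discussed above is just folding the adapter into the base weight: W' = W + (alpha/rank) * B @ A. After that, inference needs a single matmul per layer instead of the base matmul plus the adapter's two small matmuls and an add. A tiny pure-Python sketch with made-up numbers:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def merge_lora(W, A, B, alpha, rank):
    """Fold a LoRA adapter into the base weight:
    W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Illustrative 2x2 base weight with a rank-1 adapter (assumed values):
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]         # rank x d_in
B = [[0.5], [0.25]]      # d_out x rank
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.5, 2.0]]
```

In the PEFT library this corresponds to calling `merge_and_unload()` on the adapted model; keeping the adapter unmerged is mainly useful because the A and B factors are tiny to store and swap.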
What's the best way to get clients for these types of solutions?
Howdy, Are you asking how to come up with applications for tiny LLMs? i.e. use cases/markets where having tiny llms is useful?
Can any be loaded on an iPhone?
In principle yes, although I haven’t dug into that yet. I’ll add to my potential videos list
I'm using MLC Chat / ChatterUI on Android, but I think they have iPhone versions too.
You remind me of Andrej Karpathy.
Even a second-hand GTX 1070 laptop would be able to handle the 4-bit quantised variant.
In the meantime, the Phi 2 model's licensing changed to a permissive one.
Do you find Mozilla's Llamafile project interesting or useful? As someone who dabbles, I'm still not sure how to think about it.
Thanks for sharing. I just had a look and it looks like a strong option to get a chat going. Would be nice if they added Phi as an option. As you saw in this vid, a 4-bit quant is still too big for my machine.
Btw, when llama.cpp is installed and you run ./server, there's also a simple chat interface on the localhost port.