The Best Tiny LLMs

  • Published 15 May 2024
  • ➡️ Trelis Function-calling Models (incl. Trelis Tiny): trelis.com/function-calling/
    ➡️ ADVANCED-inference Repo: trelis.com/enterprise-server-...
    ➡️ ADVANCED-fine-tuning Repo: trelis.com/advanced-fine-tuni...
    ➡️ One-click Fine-tuning & Inference Templates: github.com/TrelisResearch/one...
    ➡️ Trelis Newsletter: Trelis.Substack.com
    ➡️ Trelis Resources and Support: Trelis.com/About
    Affiliate Links (support the channel):
    - Vast AI - cloud.vast.ai/?ref_id=98762
    - RunPod - tinyurl.com/4b6ecbbn
    Resources:
    - Slides: tinyurl.com/4kvnu4ad
    - Chat fine-tuning datasets: huggingface.co/collections/Tr...
    - One-click LLM templates: github.com/TrelisResearch/one...
    Models:
    - DeepSeek Coder 1.3B: huggingface.co/deepseek-ai/de...
    - Phi 2: huggingface.co/microsoft/phi-2
    - TinyLlama: huggingface.co/TinyLlama/Tiny...
    - Trelis Tiny: huggingface.co/Trelis/Tiny
    Repo Access (purchase includes lifetime access to improvements):
    - ADVANCED Fine-tuning: trelis.com/advanced-fine-tuni...
    - ADVANCED Inference: trelis.com/enterprise-server-...
    Chapters:
    0:00 Best Small Language Models
    0:19 Video Overview
    1:23 Benefits of Tiny LLMs
    2:09 Fine-tuning and Inference Repo Overviews
    4:28 Performance Comparison - TinyLlama, DeepSeek Coder and Phi 2
    16:21 Fine-tuning Tiny Language Models
    33:55 Function-calling quantized models with llama.cpp
    44:44 Challenges and Tricks - Function-calling with Tiny Models
    1:00:00 What are the best Tiny Language models?
    Reminder: Be careful when using private keys (e.g. OpenAI or HuggingFace). If they are exposed, make sure to rotate them, as I do after each video.
  • Science & Technology

COMMENTS • 50

  • @TrelisResearch
    @TrelisResearch  3 months ago

    ** GGUF Not working **
    Towards the end of the video, I state that the issue with function calling with GGUF is due to the prompt format. However, the issue is that the GGUF model (unlike the base model) is responding with malformed JSON objects.
    There appears to be an issue with the GGUF quantization that I need to resolve. I'll update here once resolved.
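
    For anyone hitting the same thing, a minimal sketch of the failure mode (the raw_response string here is hypothetical): a well-formed function call should parse with json.loads, and the quantized model's output doesn't.

      import json

      # Hypothetical raw output from the quantized model's function-call
      # response; note the missing closing brace - the kind of malformed
      # JSON described above.
      raw_response = '{"name": "get_weather", "arguments": {"city": "Dublin"}'

      try:
          call = json.loads(raw_response)
          print("well-formed function call:", call)
      except json.JSONDecodeError as err:
          print("malformed JSON from model:", err)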

  • @sillystuff6247
    @sillystuff6247 4 months ago +10

    Much appreciated.
    Most non-professional AI enthusiasts only have 16 GB of non-upgradable memory.

  • @AP-hv5dh
    @AP-hv5dh 4 months ago +2

    As usual, pure 🔥. Thanks for putting the time and energy into your outstanding didactic videos 🙌🏾

  • @agusavior_channel
    @agusavior_channel 4 months ago +21

    Tiny LLMs = Tiny Large LMs = (Tiny Large) LMs = LMs

  • @gautammandewalker8935
    @gautammandewalker8935 1 month ago

    You post the most valuable content on AI/ML.

  • @eugenetapang
    @eugenetapang 4 months ago +1

    Thank you so much! Your research is highly appreciated, and this video resolves the feasibility question in my mind! Looking forward to digging into your company and vids. 👍👍👍🎆

  • @footube3
    @footube3 4 months ago +2

    Thanks for the amazing content! Based on my downloads of various quantised models from TheBloke, 5-bit quantisation would seem to be the sweet spot if you want reduced memory usage but still care about quality.

    • @RonanMcGovern
      @RonanMcGovern 4 months ago +1

      Yeah - it's not as though there is one point where perplexity suddenly drops off. Very roughly, I'd say 8-bit, but yeah, some 6- or 5-bit quants are good too.
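
      To put rough numbers on that, a quick back-of-the-envelope sketch (assuming a hypothetical 7B-parameter model, and ignoring the per-block scale data in real GGUF files, which adds a little on top):

        # Rough memory footprint of the weights alone at each quantization level.
        params = 7e9  # hypothetical 7B-parameter model

        for bits in (8, 6, 5, 4, 2):
            gb = params * bits / 8 / 1e9  # bits -> bytes -> GB
            print(f"{bits}-bit: ~{gb:.1f} GB")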

  • @nirsarkar
    @nirsarkar 4 months ago +1

    Another great video.

  • @maslaxali8826
    @maslaxali8826 4 months ago

    Awesome video. Phi 2 is now available for commercial use under the MIT License.

  • @alchemication
    @alchemication 4 months ago +3

    Man, how do you come up with ideas for the new videos?! This is pure gold! Would you consider doing something with non-English languages (considering Europe has a nice mixture of those)? I'm wondering if this is even something I should be thinking about when fine-tuning open LLMs...

    • @TrelisResearch
      @TrelisResearch  4 months ago +1

      Cheers! What language? And what topic?

    • @alchemication
      @alchemication 4 months ago

      Cheers. I am thinking about French, German, Italian, Spanish and Polish (and English, obviously). But even a clue on how to deal with one extra language would be nice. I don't speak these languages, which makes it even more "fun". I have a custom dataset of 1000 FAQ-style question/answer pairs, currently in English, so that would be an example use case to play with...

    • @alchemication
      @alchemication 4 months ago

      Damn, did my reply not show up here? I must be losing my mind... In general I deal with English, but also German, French, Italian, Spanish and Polish. Even seeing how to fine-tune for one non-English language could be very interesting: what are some best practices, limitations, etc. I do have a custom dataset of ~1000 FAQ-style question/answer pairs (upsampled by GPT-4 from the original ~150 questions/answers).

  • @johnade-ojo2917
    @johnade-ojo2917 4 months ago

    Great insights. Would low-rank training be useful for narrow tasks like text classification, for example?

    • @TrelisResearch
      @TrelisResearch  4 months ago +1

      Yes! Very effective for training for classification - the basic premise is the same as training for function calling (take a look at the recent vid and also the older vid on structured responses).
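
      As a rough illustration, a minimal sketch with the Hugging Face peft library (the base model, label count and hyperparameters are placeholder choices, not a recipe):

        from peft import LoraConfig, get_peft_model
        from transformers import AutoModelForSequenceClassification

        # Placeholder base model with a binary classification head.
        base = AutoModelForSequenceClassification.from_pretrained(
            "deepseek-ai/deepseek-coder-1.3b-base", num_labels=2
        )

        config = LoraConfig(
            r=16,  # rank; tiny models may need a larger r, as discussed in the video
            lora_alpha=32,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            task_type="SEQ_CLS",
        )
        model = get_peft_model(base, config)
        model.print_trainable_parameters()  # sanity-check the trainable count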

  • @renancidale9210
    @renancidale9210 4 months ago +1

    🎯 Key Takeaways for quick navigation:
    11:10 🔄 *Sequence Reversal Challenge:*
    - TinyLlama struggles even with a sequence of two, providing an incorrect response.
    - The Phi model manages to reverse a two-token sequence correctly.
    - DeepSeek Coder excels, correctly reversing sequences of up to three tokens.
    13:25 💻 *Code Generation Performance:*
    - TinyLlama produces incorrect Python code for prime number generation.
    - The Phi model generates a shorter piece of code, but it fails to produce the correct output.
    - DeepSeek Coder impressively generates accurate Python code for prime number generation.
    15:13 🔑 *Pass Key Retrieval Challenge:*
    - TinyLlama fails to retrieve the pass key correctly.
    - The Phi model successfully retrieves the pass key from the context.
    - DeepSeek Coder, due to its instruct nature, refuses to retrieve the pass key, limiting its applicability.
    19:52 🔄 *Fine-tuning Tiny Models with Low-Rank Adaptation (LoRA):*
    - For tiny models, using a small rank in low-rank adaptation may result in too few parameters for effective training.
    - Adjusting the LoRA rank for tiny models is crucial to prevent underfitting and improve adaptation to the training data.
    22:17 🚀 *Function-Calling Fine-tuning with DeepSeek Coder:*
    - Utilizing the advanced fine-tuning scripts for function calling with the DeepSeek Coder model.
    - Exploring the importance of adjusting LoRA parameters for tiny models during function-calling fine-tuning.
    - The video emphasizes the practical implementation of these concepts using RunPod and uploading the relevant scripts.
    22:57 🛠️ *Connected Weights & Biases for model training tracking; loaded a base chat model fine-tuned from DeepSeek.*
    25:17 📊 *Explained LoRA's low-rank adaptation for training attention and linear layers in language models.*
    27:26 🔄 *Discussed setting the rank and alpha parameters in LoRA for adjusting learning rates during training.*
    29:21 🧠 *Demonstrated fine-tuning with LoRA on a function-calling dataset and emphasized the importance of training enough parameters.*
    30:45 🤖 *Compared model responses before and after fine-tuning with LoRA, showcasing improved structure and functionality.*
    32:37 💡 *Advised checking the total number of trainable parameters during training to ensure an adequate level for fine-tuning, especially with tiny models.*
    34:57 🧐 *Explored two approaches for tiny language models: quantizing a larger model like OpenChat, and fine-tuning a tiny model like Trelis Tiny.*
    36:34 📏 *Examined quantization options for the OpenChat model, highlighting performance trade-offs based on bit precision.*
    38:38 📊 *Demonstrated performance of the OpenChat model at 2-bit quantization using llama.cpp, acknowledging the need for sufficient VRAM.*
    43:57 📉 *Tested 2-bit and 4-bit quantized OpenChat models in llama.cpp, revealing limitations in handling function responses on a laptop.*
    45:38 🚀 *Introduced fine-tuning challenges with tiny models for function calling, emphasizing the weakness of non-coding models like TinyLlama or Phi in this context.*
    46:07 🚀 *Fine-tuning tiny LLMs for function calling can be challenging, as responses may be verbose and hard to control.*
    46:21 🎛️ *DeepSeek is recommended for fine-tuning models for function calling, but chained function calling is difficult due to statistical distribution challenges.*
    47:04 🔄 *Limiting the model to a single function call and preventing recursive calls can be achieved by careful implementation during inference (see the sketch after this list).*
    48:26 🛠️ *Techniques for better responses involve providing start-of-response prompts and utilizing information from the JSON object returned by the function.*
    50:26 🚀 *Trelis Tiny, based on DeepSeek, is showcased for function calling with high token generation speed; it's suitable for utility purposes.*
    56:50 🧠 *Demonstrations using TGI (Text Generation Inference) and llama.cpp for function calling, emphasizing the importance of fine-tuning and manual tweaks for optimal results.*
    57:46 🔧 *The ADVANCED Inference repo includes tweaks, like a parameter preventing recursive calls in tiny models, enhancing model responsiveness.*
    01:00:21 📚 *Helper text and logic adjustments play a crucial role in handling function-calling challenges, allowing tiny models to answer both normal and function-based questions.*
    01:01:40 🌐 *Trelis Tiny is recommended for utility purposes, especially function calling, offering single function call capability and short normal responses.*
    Made with HARPA AI
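
    A minimal sketch of the single-function-call idea from 47:04 (generate and run_function are hypothetical stand-ins for the model call and the tool executor, not functions from the Trelis repos):

      import json

      def answer_with_single_call(prompt, generate, run_function):
          response = generate(prompt, allow_functions=True)
          try:
              call = json.loads(response)  # the model emitted a function call
          except json.JSONDecodeError:
              return response              # plain-text answer, nothing to run
          result = run_function(call)
          # Re-prompt with functions disabled, so the model must summarise the
          # result in natural language instead of recursing into another call.
          followup = prompt + response + "\nFunction returned: " + json.dumps(result)
          return generate(followup, allow_functions=False)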

  • @CrypticConsole
    @CrypticConsole 4 months ago

    Do you know if the ADVANCED Inference repo supports native logit biasing and constrained generation via the API?

  • @todordonev
    @todordonev 4 months ago

    Nice video as always. For function calling I am using NexusRaven V1 on a 1070 Ti, and I think it's better than GPT-4.
    PS: I'm using Ollama for inference.

    • @TrelisResearch
      @TrelisResearch  4 months ago

      thanks for the tips, I'll dig in on those

    • @todordonev
      @todordonev 4 months ago

      @TrelisResearch It's super fast.

  • @TrelisResearch
    @TrelisResearch  4 months ago

    UPDATE: Phi-2 is now available - incl. for commercial use - under an MIT license!

    • @truehighs7845
      @truehighs7845 3 months ago +1

      I have a 2018 16-inch with x86 but 32 GB of RAM, and I can run the hell out of Solar, DeepSeek and Mixtral simultaneously with Ollama, or one at a time with Jan (slower).

  • @sherpya
    @sherpya 2 months ago

    A video about Groq (not Grok) would be nice, but I don't know how much info is out there at the moment.

    • @TrelisResearch
      @TrelisResearch  2 months ago

      yeah, I'm kind of tracking it, but they don't give a way to run inference on a custom model yet AFAIK; once they do, I think that would def be interesting

  • @thisurawz
    @thisurawz 4 months ago

    Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset containing images and text, for relation extraction or another specific task? Could you use an open-source multimodal LLM and open multimodal datasets, so anyone can extend the experiments with the help of your tutorial? Could you also talk about how to boost the performance of the fine-tuned model using prompt tuning in the same video?

    • @TrelisResearch
      @TrelisResearch  4 months ago +1

      yeah I wanna do a vid on multi-modal. I tried out LLaVA and was unimpressed by performance versus OpenAI, so I thought I would delay a little bit. I'll revisit soon

  • @easyaistudio
    @easyaistudio 4 months ago +1

    Let's agree they are called SLMs, as in SLiM, unless you want to start using the metric system of pico, nano, micro, milli, etc. 😄 In 5 years, "big" as in "big data" will be considered small compared to the biggest.

  • @webizfabulous2535
    @webizfabulous2535 4 months ago +1

    Hey, I have a question. I trained a tokenizer, changing the tokenizer's length, then did PEFT + QLoRA fine-tuning (embedding, lm_head and QKV). But the model does not perform well. Is it because of a lack of data? Or because I changed the dimensions?

    • @TrelisResearch
      @TrelisResearch  4 months ago

      I'd need more info to say...
      - What kind of dataset were you using, and training for what application?
      - Did you merge the LoRA onto the base model you trained? (You have to be careful not to lose the updated embed and lm_head layers.)
      - When changing the embedding settings, you have to update both the tokenizer and the model.
      The best video for all of this is the one I did on Chat Fine-tuning.
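
      On the last two points, a minimal sketch with transformers and peft (the model name and added token are placeholders):

        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import LoraConfig

        tokenizer = AutoTokenizer.from_pretrained("my-base-model")
        model = AutoModelForCausalLM.from_pretrained("my-base-model")

        # Keep the model's embedding matrix in sync with the enlarged tokenizer.
        tokenizer.add_special_tokens({"additional_special_tokens": ["<fn_call>"]})
        model.resize_token_embeddings(len(tokenizer))

        # Save embed_tokens and lm_head in full alongside the adapter, so the
        # updated layers aren't lost when merging back onto the base model.
        config = LoraConfig(
            r=16,
            target_modules=["q_proj", "k_proj", "v_proj"],
            modules_to_save=["embed_tokens", "lm_head"],
        )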

    • @webizfabulous2535
      @webizfabulous2535 4 months ago

      @TrelisResearch Okay, I will watch the video and come back to you, thanks 🙏

  • @VijayDChauhaan
    @VijayDChauhaan 4 months ago

    Hi Ronan, could you please do a tutorial on GuardRails?

    • @TrelisResearch
      @TrelisResearch  4 months ago +2

      interesting idea, let me add that to the list of potential vids

  • @anirudhsarma937
    @anirudhsarma937 4 months ago

    Great video, guys. Can someone help me understand when you use just the LoRA adapter weights for inference, and when you merge the LoRA weights into the original model?

    • @TrelisResearch
      @TrelisResearch  4 months ago

      Generally it's best to merge, because inference is slower unmerged (there's an extra addition step to apply the adapter).
      The reason not to merge is that you can store the adapter (which is small) separately [if that's useful].
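
      For reference, merging with peft looks roughly like this (the model and adapter paths are placeholders):

        from peft import PeftModel
        from transformers import AutoModelForCausalLM

        base = AutoModelForCausalLM.from_pretrained("my-base-model")
        model = PeftModel.from_pretrained(base, "my-lora-adapter")

        merged = model.merge_and_unload()  # folds the adapter into the base weights
        merged.save_pretrained("my-merged-model")  # no extra addition step at inference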

    • @anirudhsarma937
      @anirudhsarma937 4 months ago

      @TrelisResearch Thanks for your reply, got it. Please continue with your content; it has helped me a lot.

  • @JordanKaufman
    @JordanKaufman 4 months ago

    What's the best way to get clients for these types of solutions?

    • @TrelisResearch
      @TrelisResearch  4 months ago

      Howdy, are you asking how to come up with applications for tiny LLMs? I.e. use cases/markets where having tiny LLMs is useful?

  • @googleSux
    @googleSux 3 months ago

    Can any be loaded on an iPhone?

    • @TrelisResearch
      @TrelisResearch  3 months ago

      In principle yes, although I haven't dug into that yet. I'll add it to my potential videos list.

    • @Tofu3435
      @Tofu3435 2 months ago

      I'm using MLC Chat / ChatterUI on Android, but I think they have iPhone versions too.

  • @fontende
    @fontende 4 months ago

    You remind me of Andrej Karpathy.

  • @konstantinlozev2272
    @konstantinlozev2272 4 months ago

    Even a second-hand GTX 1070 laptop would be able to handle the 4-bit quantised variant.

  • @sherpya
    @sherpya 2 months ago

    In the meantime, the Phi-2 model has changed its licensing to a permissive one.

  • @AlexBerg1
    @AlexBerg1 4 months ago +1

    Do you find Mozilla's Llamafile project interesting or useful? As someone who dabbles, I'm still not sure how to think about it.

    • @TrelisResearch
      @TrelisResearch  4 months ago

      Thanks for sharing. I just had a look and it looks like a strong option to get a chat going. Would be nice if they add Phi as an option. As you saw in this vid, a 4-bit quant is still too big for my machine.
      Btw, when llama.cpp is installed and you run ./server, there's also a simple chat interface on the localhost port.
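
      You can also hit that server programmatically; a minimal sketch, assuming the default port 8080 and a GGUF model already loaded via ./server -m model.gguf:

        import requests

        # POST to llama.cpp's /completion endpoint on the local server.
        resp = requests.post(
            "http://localhost:8080/completion",
            json={"prompt": "List three uses for tiny LLMs.", "n_predict": 64},
        )
        print(resp.json()["content"])  # generated text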