Fine-tuning Language Models for Structured Responses with QLoRa

  • Published 2 Oct 2024

COMMENTS • 55

  • @TrelisResearch
    @TrelisResearch  1 year ago +4

    UPDATE (ADVANCED):
    - The ADVANCED portion of the video, and notebook, shows left-padding. I have since switched to right padding of sequences. This seems more robust because the beginning-of-sequence token (<s>) is always tokenized as one token when it sits at the start of an input (i.e. with right padding). By contrast, after pad tokens, the tokenizer often tokenizes the beginning-of-sequence token as three tokens, which can lead to misalignment of the loss mask and attention mask, as well as unknown tokens.

    • @hariduraibaskar9056
      @hariduraibaskar9056 10 months ago

      so do we change to right padding now?

    • @TrelisResearch
      @TrelisResearch  10 months ago +1

      @hariduraibaskar9056 Yup, the script now uses right padding. I think it's a bit more robust when fine-tuning for structured responses. For doing unsupervised training/pre-training, left padding is probably better because you don't want unfinished sentences ending with an end-of-sequence token.
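
      For concreteness, a minimal sketch of that right-padding setup with the Hugging Face tokenizer (the repo name is just the model from this video; the actual course scripts may differ):

      ```python
      from transformers import AutoTokenizer

      # Illustrative: any Llama-2-style tokenizer works (this repo is gated).
      tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

      # Right padding keeps the <s> (BOS) token at the very start of the input,
      # where it is reliably tokenized as a single token.
      tokenizer.padding_side = "right"

      # Llama 2 ships without a pad token; reusing EOS is a common workaround.
      if tokenizer.pad_token is None:
          tokenizer.pad_token = tokenizer.eos_token

      batch = tokenizer(
          ["short example", "a somewhat longer example"],
          padding=True,
          return_tensors="pt",
      )
      ```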

  • @CurtKeisler
    @CurtKeisler 1 year ago +5

    2^4 = 16 (not 32)
    2^32 = 4,294,967,296

  • @mohammadkhair7
    @mohammadkhair7 1 year ago +4

    What an amazing and insightful tutorial, with a detailed review of LLaMA-2 fine-tuning and inference, all extended with function calling. Much appreciated!
    I will be supporting this channel and advertising it. Well done.

  • @ghrasko
    @ghrasko 9 months ago +1

    Hi,
    somewhere around 38:00 in the video you start working on a Colab notebook, "QLoRa Training for Small Datasets". I have purchased the Advanced Fine Tuning package, but I can't find it there.

    • @TrelisResearch
      @TrelisResearch  9 months ago

      Howdy! Yeah, the latest version of that script is the function calling notebook in the function calling branch.

  • @Wanderlust1342
    @Wanderlust1342 1 year ago +2

    Excellent stuff! Can you explain the max_steps arg? If, let's say, I set it to 1000, would that mean we will be looking at the first 1000 batches?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      Howdy! Yes, if you have a batch size of 1 (and gradient accumulation of 1, I think). If your batch size is 2, then 1000 steps would cover 2000 rows of data.
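
      In other words, rows seen = max_steps × batch size × gradient accumulation steps. A minimal sketch with the standard transformers TrainingArguments (the parameter names are the real library ones; the values are just this thread's example):

      ```python
      from transformers import TrainingArguments

      args = TrainingArguments(
          output_dir="out",
          max_steps=1000,                 # optimizer steps
          per_device_train_batch_size=2,  # rows per forward pass
          gradient_accumulation_steps=1,  # forward passes per optimizer step
      )

      # Rows of data seen: 1000 * 2 * 1 = 2000 (times the device count, if several GPUs).
      ```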

  • @emrahe468
    @emrahe468 1 year ago +3

    Thanks for the nice tutorial; this helped me greatly in running the fine-tuned model.

  • @colkbassad
    @colkbassad 9 months ago +1

    Please keep up the content; you're very gifted at teaching and presenting clearly. I'm very interested in (read: obsessed with) function calling on local LLMs. My goal is a solution that doesn't require a network connection. I've found mistral-7b is the best trade-off between hardware requirements and inference reliability. It tends to fall apart with more functions, though.
    I try to keep things simple by grouping my function descriptions into areas of responsibility (e.g. map-navigation and map-styling) and having a main agent that decides what the user is trying to do. Then, based on the choice from the main agent, I invoke the sub-agent that has the relevant function descriptions.
    It seems to help keep the model focused and is more efficient with the context window. I even get promising results with the default instruct version, but I'm very interested in fine-tuning for my use case. I tried NexusRaven 13B and it works well, but it runs too slowly on my A5000 laptop. Do you think this is worth pursuing? Can you recommend some of your gated content, given what I'm up to?
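
    A minimal sketch of the two-stage routing pattern described above; all names here (call_llm, SUBAGENT_FUNCTIONS, handle) are illustrative, not from any library:

    ```python
    # Each sub-agent only ever sees the function descriptions for its own area,
    # which keeps prompts short and the model focused.
    SUBAGENT_FUNCTIONS = {
        "map-navigation": [
            {"name": "pan_to", "description": "Pan the map to a named location"},
        ],
        "map-styling": [
            {"name": "set_theme", "description": "Switch the map colour theme"},
        ],
    }

    def call_llm(prompt, functions=None):
        """Placeholder for a call to a local model such as mistral-7b."""
        raise NotImplementedError

    def handle(user_message):
        # Stage 1: the main agent only picks an area of responsibility.
        area = call_llm(
            "Which of these areas does the request belong to: "
            f"{list(SUBAGENT_FUNCTIONS)}? Request: {user_message}"
        )
        # Stage 2: the sub-agent sees only the relevant function descriptions.
        return call_llm(user_message, functions=SUBAGENT_FUNCTIONS[area])
    ```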

    • @TrelisResearch
      @TrelisResearch  9 months ago

      Howdy @colkbassad! Sure, a few ideas:
      1. Have a look at this function calling video: ua-cam.com/video/hHn_cV5WUDI/v-deo.htmlsi=6woUAR2XFGnQzWdB . In case you haven't seen it already.
      2. Yes, Mistral v0.1 and v0.2 are (perhaps oddly?) only ok at function calling. By far the best value model I've tested (in terms of capabilities per model size) is OpenChat 3.5 (huggingface.co/Trelis/openchat_3.5-function-calling-v3). It's also demo'd in the video above.
      3. Actually, models that are good at code are great at function calling. There are v2 function calling models on HuggingFace under Trelis for all DeepSeek model sizes, and they are strong. The drawback is that coding models are not as strong on non-code, non-function-calling chats.

  • @MrSCAAT
    @MrSCAAT 8 months ago

    Great Work

  • @AbhijeetTamrakar-k4l
    @AbhijeetTamrakar-k4l 9 months ago +1

    How do you decide the r=16 and lora_alpha=32 in the LoraConfig?

    • @TrelisResearch
      @TrelisResearch  9 months ago

      It's empirical; an r of 8 or 16 with an alpha of about 32 tends to work well.
      r is the rank of the adapter matrices, so the adapters are of size r × embedding_size. If you make r as big as the embedding size, then the adapters are pointless because they are just as big as the weight matrices. The whole idea is to train smaller adapters.
      So you want r to be a lot smaller than the embedding size, but you also want it big enough that the adapter matrices can retain some info.
      Meanwhile, alpha * learning_rate / r is the effective learning rate used for the adapters, so you always want alpha to be some multiple or fraction of r that is not too far from 1 (i.e. having alpha be four times r is fine). Keep an eye out for a new vid on tiny models where I talk about sizing r.
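
      For concreteness, those numbers might sit in a peft LoraConfig like the sketch below (the target module names are Llama-style and vary by model):

      ```python
      from peft import LoraConfig

      lora_config = LoraConfig(
          r=16,           # rank of the adapter matrices (each r x embedding_size)
          lora_alpha=32,  # effective adapter LR scales as lora_alpha * lr / r
          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention
          lora_dropout=0.05,
          bias="none",
          task_type="CAUSAL_LM",
      )
      ```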

    • @AbhijeetTamrakar-k4l
      @AbhijeetTamrakar-k4l 9 months ago +1

      ​@@TrelisResearch Sure, Thanks for the insights!
      Also, have you tried fine-tuning vision models? What are your views on them?
      Also, I have watched most of your videos from the beginning. If I want to learn things in depth, could you suggest ways to do that?

    • @TrelisResearch
      @TrelisResearch  9 months ago

      @@AbhijeetTamrakar-k4l you're then at the point where you should start reading papers! Attention Is All You Need, LoRA, AWQ, GPTQ, LIMA, and many more!

  • @WinsonDabbles
    @WinsonDabbles 9 months ago +1

    I started watching all your videos! I couldn't find one that explained how to create fine-tuning datasets from your own personal or company data, though; only existing datasets created by people on HF. Have you any tips or ideas? Happy to pay for these tips/info. Good job! Enjoying every single one I have watched.

    • @TrelisResearch
      @TrelisResearch  9 months ago +1

      Ah yeah, you need to go to the three videos entitled "Fine-tuning versus embeddings" - that's where I build a custom dataset (touch rugby rules).

    • @WinsonDabbles
      @WinsonDabbles 9 months ago

      @@TrelisResearch amazing! I’ll go watch it! Thank you! Keep killing it man!

  • @ayoubelmhamdi7920
    @ayoubelmhamdi7920 10 months ago +1

    You start writing the code word by word, but then you skip the writing and copy-paste instead, and I cannot keep following once it's copy-pasting. 😢😂

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Howdy, are you saying that I'm going too fast with the explanation?
      If so, thanks for the feedback; I'll keep that in mind.
      Otherwise, let me know what you mean. Cheers

    • @ayoubelmhamdi7920
      @ayoubelmhamdi7920 10 months ago +1

      @@TrelisResearch
      To be fair, I can follow none of the programming people I watch except @tsoding. Why? Because he codes apps from scratch: every idea begins with writing "Hello world," and the project is then built up to the end, without any copy-paste.
      When you try to code faster, the code becomes very complex for me.

    • @TrelisResearch
      @TrelisResearch  10 months ago

      @@ayoubelmhamdi7920 thanks for the comment. My target audience here is devs at the intermediate to advanced stage of coding. However, I think you still make a good point, and there's an opportunity for me to be clearer in showing the steps. Thanks for the feedback!

    • @TemporaryForstudy
      @TemporaryForstudy 7 months ago

      @@TrelisResearch I also felt the same. My advice is that if you have two monitors, write the code first and then record the video on the second monitor. Try to explain everything; since you already have the complete code on the second monitor, you can read from there as well.

  • @damonpalovaara4211
    @damonpalovaara4211 6 months ago

    I've been researching ternarization of weights (-1, 0, 1), which reduces model size down to 2 bits per weight, and compresses down to 1.58 bits per weight for transfers.

    • @TrelisResearch
      @TrelisResearch  6 months ago

      me too!
      I want to do a vid, but libraries aren't mature just yet and no one has released their weights (and just quantizing down doesn't work well).

    • @damonpalovaara4211
      @damonpalovaara4211 6 months ago

      @@TrelisResearch I'm working on a technique that ternarizes using gradient descent, via a smooth-ternarize function.

    • @damonpalovaara4211
      @damonpalovaara4211 6 months ago

      @TrelisResearch I also found a technique for efficiently storing 5 weights in a single byte, making the weights take up 1.6 bits each.
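
      That packing works because 3^5 = 243 fits in one byte (≤ 256), so five ternary weights cost 8/5 = 1.6 bits each. A small illustrative sketch (not from any released library):

      ```python
      def pack5(weights):
          """Pack five ternary weights (-1, 0, 1) into a single byte (0..242)."""
          assert len(weights) == 5
          value = 0
          for w in weights:
              value = value * 3 + (w + 1)  # map -1/0/1 -> base-3 digits 0/1/2
          return value

      def unpack5(value):
          """Recover the five ternary weights from one packed byte."""
          weights = []
          for _ in range(5):
              weights.append(value % 3 - 1)
              value //= 3
          return weights[::-1]

      assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]
      ```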

  • @vaaaliant
    @vaaaliant 1 year ago +1

    Great start to the video; you explaining everything is great. But once you start running the functions and they don't run, you sort of lose me.

    • @TrelisResearch
      @TrelisResearch  1 year ago +1

      💯 That's good and fair feedback. I've been working since then to trim content and organise it better. A new vid on this topic is upcoming.

    • @vaaaliant
      @vaaaliant 1 year ago

      @@TrelisResearch Great, I'll subscribe and look forward to your future content. Keep it up!

  • @tomhavy
    @tomhavy 1 year ago +1

    Hell yes, thanks a lot for this!

  • @YuCai-v8k
    @YuCai-v8k 1 year ago +1

    great

  • @AbhijeetTamrakar-k4l
    @AbhijeetTamrakar-k4l 9 months ago

    Regarding the size of meta-llama/Llama-2-7b-chat-hf:
    It has two .safetensors files.
    The cumulative size is around 14GB.
    Q1) As you mentioned, the model is nothing but the weights, so these .safetensors files are the weights, right?
    Q2) As you explained, the size of the model should be 28GB, but the files are 14GB. So, are the weights in a 16-bit data type?

    • @TrelisResearch
      @TrelisResearch  9 months ago

      1) Yes, safetensors is a file format that is quicker to load than PyTorch's pickle-based .bin format.
      2) Yes, in 32-bit the model would be about 28 GB, so the 14 GB files are 16-bit weights.
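
      Spelled out, the arithmetic is just parameter count times bytes per weight (approximate, ignoring non-weight overhead):

      ```python
      params = 7e9  # approximate parameter count of Llama-2-7B

      print(params * 4 / 1e9)  # 32-bit: 4 bytes/weight -> 28.0 GB
      print(params * 2 / 1e9)  # 16-bit: 2 bytes/weight -> 14.0 GB
      print(params / 2 / 1e9)  # 4-bit (QLoRA base): 0.5 bytes/weight -> 3.5 GB
      ```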

    • @AbhijeetTamrakar-k4l
      @AbhijeetTamrakar-k4l 9 months ago

      @@TrelisResearch Thanks. Also:
      How do you identify the target modules for training? Llamas have one set, and other models might have a different one.

    • @TrelisResearch
      @TrelisResearch  9 months ago

      @@AbhijeetTamrakar-k4l just run "print(model)" to see the list of modules. Generally you want to train attention, and optionally you can also train up_proj, down_proj, and gate_proj. See the chat fine-tuning video for more discussion.
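
      For example (a sketch; any causal LM works, the gated repo name is just the one from this thread):

      ```python
      from transformers import AutoModelForCausalLM

      model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

      # Prints the module tree; for Llama you will see q_proj, k_proj, v_proj,
      # o_proj (attention) and gate_proj, up_proj, down_proj (MLP).
      print(model)
      ```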

  • @额哈哈-b1q
    @额哈哈-b1q 7 months ago

    2^4 is 16 (at 04:15)?

  • @vent_srikar7360
    @vent_srikar7360 7 months ago

    How do I fine-tune with my own data, though?

    • @TrelisResearch
      @TrelisResearch  7 months ago +1

      Have a look at the embeddings vs fine-tuning videos.

  • @ArunKumar-bp5lo
    @ArunKumar-bp5lo 11 months ago

    thanks so much

  • @gustavofelicidade_
    @gustavofelicidade_ 10 months ago

    Subscribed!

  • @ambakumari4058
    @ambakumari4058 10 months ago

    Is a GPU necessary for this?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      You could run on CPU, but it will be really, really slow unless you're training a very small model like TinyLlama or DeepSeek 1.3B.

  • @prestonmccauley43
    @prestonmccauley43 1 year ago +2

    Overall this was a really good video, with a lot of good detail and explanation. I think you overcomplicated it a bit by using functions, though. A lot of us are just looking for how to use a non-function dataset - for example, three columns of data in a dataset with an instruct model. How much data is necessary? Your technique of padding with a custom pad token looks like a good process, but can we get a simple example as well?

    • @TrelisResearch
      @TrelisResearch  1 year ago +1

      Regarding data quantity - it's hard to generalise. Probably the larger the model, the less data is required (as the model already has more of a statistical base). My experience is that 50-100 datapoints can be enough.
      One specific example: in the case of function calling, it's important to train with some examples where the model is given functions in the prompt, but the prompt does not require a function call. This avoids the model being led to believe that the presence of functions means they must be used.
      So one has to consider the edge cases.
      BTW, what are some examples of fine-tuning datasets that would be useful to show? What would you want to fine-tune the model for?

    • @prestonmccauley43
      @prestonmccauley43 1 year ago +1

      I would gladly share a video focusing on QLoRA with a small, non-function dataset. I'm a teacher as well, and I've been watching 50 or so YouTube videos in the past several weeks. That is clearly the most significant gap. I get PEFT, QLoRA, LoRA, models, hyperparameters - everyone is missing the AI data science part :) @@RonanMcGovern - I would even pay for that: a great Colab, a tutorial, or one using tools like Axolotl or text-gen.

    • @TrelisResearch
      @TrelisResearch  1 year ago

      @@prestonmccauley43 great stuff, any feedback on the Colab templates in the video description is welcome. BTW, the free one is for fine-tuning on a simple dataset (not function calling).
      Just a note on your earlier question about the amount of data required - probably more important are a) the quality of the data and b) being very exact with the attention and loss masks.
      BTW, if you have a very small dataset, you can consider just putting it into the system message; that is quicker and can be as good as fine-tuning.
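
      A sketch of that "small dataset in the system message" shortcut; the facts and the message format here are illustrative:

      ```python
      # Instead of fine-tuning on a handful of rows, inline them as context.
      facts = "\n".join([
          "A touch rugby team has six players on the field.",
          "A touchdown is scored by placing the ball on or over the scoreline.",
      ])

      messages = [
          {"role": "system", "content": f"Answer questions using these facts:\n{facts}"},
          {"role": "user", "content": "How many players are on the field in touch rugby?"},
      ]
      # Pass `messages` to any chat model via its usual chat template / API.
      ```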