Prepare Fine-tuning Datasets with Open Source LLMs

  • Published 27 May 2024
  • Runpod Affiliate Link tinyurl.com/yjxbdc9w
    Advanced Fine-tuning and Data-preparation Scripts (Lifetime Membership)
    Fine-tune LLMs for style, content or structured responses...
    Learn More: trelis.com/advanced-fine-tuni...
    LLM Server Setup Repo Access (Lifetime Membership)
    - Video: Run Llama 2 on AWS: • Deploy Llama 2 for you...
    - Video: Deploy a Llama API in 5 clicks: • Deploy an API for Llam...
    - Learn more: trelis.com/enterprise-server-...
    Chapters:
    0:00 Preparing data for fine-tuning
    0:37 Video overview
    1:04 Accessing the GitHub Repo w/ data preparation scripts
    2:42 Q&A Dataset preparation using Llama 2 70B and chat-ui
    7:29 How to set up a Llama 2 API for 70B
    8:45 Using a Llama 2 API to prepare a Q&A dataset for fine-tuning
    12:22 Pro tips for preparing fine-tuning datasets
  • Science & Technology

COMMENTS • 24

  • @nkhuang1390 • 7 months ago • +8

    I purchased full access to your repo because I love and want to support the work you are doing. Some of the clearest and most articulate explanations of embeddings, fine-tuning, supervised vs. unsupervised methods, and data prep. Keep it up!

  • @carthagely122 • 4 months ago

    Thank you very much

  • @unshadowlabs • 7 months ago • +2

    Great video! How are you chunking the text: by paragraph, sentence, word, character, etc.? Are you using any overlap in the chunks? Have you tested your system with a smaller Llama 2 model? What type of results would one get from a Llama 2 13B, or even a 7B that could possibly be run from home?

    • @TrelisResearch • 7 months ago

      Howdy!
      Here, I chunk into 500- or 750-token chunks. If the chunks are too small, the cropped sentence at the end has too much effect and you get hallucination. If the chunks are too big, you'll get too many questions (and LLMs often can't respond consistently with very long lists of questions).
      Check out my supervised fine-tuning video; that's done on 13B. With enough data, you can get to reasonable quality. 7B is tough unless you have a lot of data (or are fine-tuning for structured responses).
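The chunking described in this reply can be sketched roughly as below. The 500-token size and the overlap idea come from the thread; the whitespace `split()` is a stand-in for a real tokenizer, and `chunk_tokens` is an illustrative name, not a function from the repo:

```python
def chunk_tokens(text, chunk_size=500, overlap=0):
    """Split text into chunks of roughly chunk_size tokens.

    Whitespace tokens stand in for real tokenizer tokens; in practice
    you would tokenize with the model's own tokenizer. overlap must be
    smaller than chunk_size.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reached the end
    return chunks

# 1200 tokens with 500-token chunks and a 50-token overlap -> 3 chunks
doc = ("word " * 1200).strip()
print(len(chunk_tokens(doc, chunk_size=500, overlap=50)))  # -> 3
```

Overlap keeps the cropped sentence at a chunk boundary from being lost entirely, at the cost of some duplicated text between neighbouring chunks.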

    • @unshadowlabs • 7 months ago

      @@TrelisResearch Thanks for the reply. I watched the whole series after I posted this. Very good series! :) What are your thoughts on using a 7B model just for the Q&A creation, and then fine-tuning the larger 70B model on that? Is there any benefit to using such a large model for the Q&A creation step?

    • @TrelisResearch • 7 months ago • +1

      @@unshadowlabs Yeah, I think you need to use a big model for Q&A because you don't want hallucination in the Q&A set; data quality is crucial, and 7B hallucinates too much.

  • @devtest202 • 2 months ago

    Hi, thanks!! A question about a model for which I have more than 2,000 PDFs. Do you recommend improving the handling of vector databases? When do you recommend fine-tuning and when do you recommend a vector database?

    • @TrelisResearch • 2 months ago

      Start with a vector database unless a) you need low latency and short prompts, or b) you want to do structured generation. Fine-tuning may give a small boost, but embeddings will be best.
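The retrieval side of the vector-database option can be sketched as below. The bag-of-words "embedding" and all names here are toy stand-ins, assumed for illustration; a real setup would use an embedding model and a vector store rather than this in-memory list:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts. A real pipeline would call
    an embedding model here and persist the vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "fine-tuning changes the model weights",
    "a vector database retrieves relevant chunks at query time",
    "structured generation constrains the output format",
]
index = [(d, embed(d)) for d in docs]  # the "vector database"

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

print(retrieve("which chunks does the database retrieve?"))
```

The retrieved chunks are then pasted into the prompt at query time, which is why this route means longer prompts and higher latency than a fine-tuned model answering directly.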

  • @user-iz1mh5cd2q • 3 months ago

    You used plain text for the dataset; is it better than the JSON format? When should you choose one or the other? Thanks for the video!

    • @TrelisResearch • 3 months ago • +1

      Well, if you have JSON available to start, that's going to be even easier to process and modify to meet your needs. Plain text is the hardest, as there is no structure to go on.
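To make the contrast concrete, here is a minimal sketch of a structured Q&A dataset in JSONL (one JSON object per line). The "prompt"/"completion" field names are an assumption for illustration, not the repo's exact schema:

```python
import json

# Two toy records; a real dataset would have many more.
records = [
    {"prompt": "What does fine-tuning change?",
     "completion": "The model weights."},
    {"prompt": "What does a vector database store?",
     "completion": "Embeddings of text chunks."},
]

# Write one JSON object per line (JSONL), then read it back.
with open("qa_dataset.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

with open("qa_dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # -> 2
```

With plain text, you would first have to recover this structure yourself (e.g. by chunking and generating Q&A pairs), which is exactly the extra work the reply is pointing at.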

  • @MarxOrx • 7 months ago • +1

    Hi, I just paid for access to the repo for this video, but I wasn't aware of the option to buy access to all projects. Is there any way to pay the difference and upgrade? How can I get in touch with you for that? Love the work, btw!

    • @TrelisResearch • 7 months ago

      Howdy, everyone gets emailed a receipt, so you can just respond to that email!

  • @el.kochevnik • 7 months ago • +1

    Great 🤠

  • @HemangJoshi • 3 months ago

    I want to fine-tune on my code. Each project I want to fine-tune on has multiple folders and files. Can the private repo handle that? Basically, I want to fine-tune on my coding projects.

    • @TrelisResearch • 3 months ago

      Yes, this can work. If you're dealing with a file structure, you may want to decide which files to include and then flatten them into one single .txt file. It can also help to include the directory structure within that txt file, so the LLM knows what it's looking at.
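The flattening step described in this reply can be sketched roughly as follows; the function name, delimiters, and default extensions are illustrative assumptions, not code from the repo:

```python
from pathlib import Path

def flatten_project(root, out_path, extensions=(".py", ".md")):
    """Flatten selected files under root into one .txt, with a
    directory listing at the top so the model knows the layout."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.suffix in extensions)
    with open(out_path, "w") as out:
        # Directory structure header first.
        out.write("# Directory structure\n")
        for p in files:
            out.write(f"# {p.relative_to(root)}\n")
        # Then each file's contents, delimited by its relative path.
        for p in files:
            out.write(f"\n===== {p.relative_to(root)} =====\n")
            out.write(p.read_text())
    return len(files)
```

Filtering by extension keeps binaries and build artifacts out of the training text, and the path delimiters preserve the per-file boundaries that a single flat file would otherwise lose.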

  • @babyfox205 • 2 months ago

    Is "Context" a keyword that this specific model knows? How would it notice it after the blob of text?

    • @TrelisResearch • 2 months ago

      It should know "Context" like any other English word, and it will also have seen training data showing what that refers to.

  • @GrahamAndersonis • 3 months ago

    On Runpod, how do I get/amend the Llama 70B API by TrelisResearch template to work with an exposed TCP port?
    The connection is refused both in the terminal and in VS Code (preferred).
    Other templates work fine.
    Doesn't work: SSH over exposed TCP (supports SCP & SFTP).
    Works: the basic SSH terminal (no support for SCP & SFTP).
    The basic SSH terminal is not going to work with VS Code, to my knowledge.
    Perhaps there is a way to edit the templates for these containers so they can work with VS Code?
    I'm really looking forward to digging into your tutorials :)

    • @sagardesai1253 • 3 months ago

      Hello @GrahamAndersonis,
      out of the box, the Debian image does not come with SSH installed.
      1. In the Runpod image, you have to pass the public key, as well as TCP port 22.
      2. Run the following commands in the basic terminal:
      ####
      # Update package lists for upgrades and new package installations
      apt update;
      # Install OpenSSH server in a non-interactive mode to avoid prompts and questions during installation
      DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;
      # Start the SSH service to enable remote connections
      service ssh start;
      ####
      3. After this, the Runpod instance will have SSH available to connect.
      4. Use VS Code's Remote extension to connect to Runpod as a remote server.
      5. This gives you SCP and SFTP support.

    • @TrelisResearch • 3 months ago • +1

      Hi Graham, yeah, I had this issue too and will post a workaround shortly. Ultimately the image would need to be updated for a permanent fix (but I don't control that image).

    • @GrahamAndersonis • 3 months ago

      @@TrelisResearch fantastic work

  • @enriquecolladofernandez8758 • 8 months ago • +1

    cheeeeez u give it to me man !