Prepare Fine-tuning Datasets with Open Source LLMs
- Published 27 May 2024
- Runpod Affiliate Link tinyurl.com/yjxbdc9w
Advanced Fine-tuning and Data-preparation Scripts (Lifetime Membership)
Fine-tune LLMs for style, content or structured responses...
Learn More: trelis.com/advanced-fine-tuni...
LLM Server Setup Repo Access (Lifetime Membership)
- Video: Run Llama 2 on AWS: • Deploy Llama 2 for you...
- Video: Deploy a Llama API in 5 clicks: • Deploy an API for Llam...
- Learn more: trelis.com/enterprise-server-...
Chapters:
0:00 Preparing data for fine-tuning
0:37 Video overview
1:04 Accessing the GitHub Repo w/ data preparation scripts
2:42 Q&A Dataset preparation using Llama 2 70B and chat-ui
7:29 How to set up a Llama 2 API for 70B
8:45 Using a Llama 2 API to prepare a Q&A dataset for fine-tuning
12:22 Pro tips for preparing fine-tuning datasets
Category: Science & Technology
I purchased full access to your repo because I love and want to support the work you are doing. Some of the clearest and most articulate explanations of embeddings, fine-tuning, supervised vs. unsupervised methods, and data prep. Keep it up!
Appreciate that! Many thanks
Thank you very much
Great video! How are you chunking the videos: by paragraph, sentence, word, char, etc.? Are you using any overlap in the chunks? Have you tested your system with a smaller Llama 2 model? What type of results would one get from, say, a Llama 2 13B, or even a 7B that could possibly be run from home?
Howdy!
Here, I chunk into 500- or 750-token chunks. If the chunks are too small, then the cropped sentence at the end has too much effect and you get hallucination. If the chunks are too big, then you'll get too many questions (and LLMs often aren't able to respond consistently to very long lists of questions).
Check out my supervised fine-tuning video; that's done on 13B. With enough data, you can get to reasonable quality. 7B is tough unless you have a lot of data (or are fine-tuning for structured responses).
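The token-window chunking described above can be sketched as follows. This is a minimal illustration, not the repo's actual script: a whitespace split stands in for the model's real tokenizer (in practice you would use the Llama tokenizer's encode()), and the chunk size of 500 matches the figure mentioned in the reply.

```python
# Minimal sketch of fixed-size token chunking (assumption: whitespace split
# approximates tokenization; swap in a real tokenizer for production use).
def chunk_text(text, chunk_size=500):
    """Split text into consecutive chunks of roughly chunk_size tokens."""
    tokens = text.split()  # placeholder tokenizer
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

doc = ("lorem " * 1200).strip()
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks: 500 + 500 + 200 tokens
```

A real pipeline might also add overlap between chunks so sentences cropped at a boundary appear whole in the next chunk.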
@@TrelisResearch Thanks for the reply. I watched the whole series after I posted this. Very good series! :) What are your thoughts on using a 7B model just for the Q&A creation, and then fine-tuning the larger 70B model on that data? Is there any benefit to using such a large model in the Q&A creation step?
@@unshadowlabs Yeah, I think you need to use a big model for Q&A because you don't want hallucination in the Q&A set; data quality is crucial, and 7B hallucinates too much.
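The Q&A-creation step discussed here amounts to prompting the large model with each chunk and asking for question-answer pairs. A sketch of what such a prompt might look like is below; the build_qa_prompt helper and its exact wording are assumptions, not the repo's actual script.

```python
# Hypothetical prompt builder for Q&A dataset generation (the helper name
# and prompt wording are illustrative assumptions).
def build_qa_prompt(chunk, n_questions=5):
    return (
        f"Context:\n{chunk}\n\n"
        f"Write {n_questions} question-and-answer pairs that can be answered "
        "using only the context above. Format each pair as:\n"
        "Q: <question>\nA: <answer>"
    )

prompt = build_qa_prompt("Llama 2 comes in 7B, 13B and 70B sizes.", n_questions=3)
print(prompt.splitlines()[0])  # Context:
```

The resulting prompt would then be sent to the 70B model's API, and the "Q:"/"A:" pairs parsed out of its response.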
Hi, thanks!! A question about a model for which I have more than 2,000 PDFs. Do you recommend improving the handling of vector databases? When do you recommend fine-tuning and when do you recommend a vector database?
Start with a vector database unless a) you need low latency and short prompts, or b) you want to do structured generation. Fine-tuning may give a small boost, but embeddings will be best.
You used plain text for the dataset. Is it better than the JSON format? When would you choose one or the other? Thanks for the video!
Well, if you have JSON available to start with, that's going to be even easier to process and modify to meet your needs. Plain text is hardest, as there is no structure to go on.
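To illustrate the difference: once Q&A pairs exist, writing them out as structured JSONL records is straightforward. The "prompt"/"completion" field names below are an assumption; match whatever schema your fine-tuning script expects.

```python
import json

# Sketch: converting Q&A pairs into JSONL training records (field names
# "prompt"/"completion" are an assumed schema).
def pairs_to_jsonl(pairs):
    return "\n".join(
        json.dumps({"prompt": q, "completion": a}) for q, a in pairs
    )

jsonl = pairs_to_jsonl([
    ("What sizes does Llama 2 come in?", "7B, 13B and 70B."),
])
print(jsonl)
```

Going the other way, from unstructured plain text to records like these, is the hard part the reply refers to.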
Hi, I just paid for the access to the repo of this video, but I wasn't aware of the option to buy access to all projects in the repo, Is there any way to pay the difference and upgrade? how can I get in touch with you for that? love the work btw!
Howdy, everyone gets emailed a receipt, so you can just respond to that email!
Great 🤠
I want to fine-tune on my code. I have multiple folders and files in each project that I want to fine-tune on. Can this private repo work for that? Basically, I want to fine-tune on my coding projects.
Yes, this can work. If you're dealing with a file structure, you may want to decide which files to include and then flatten them into one single .txt file. It can also help to include a directory structure within that .txt file so the LLM knows what it's looking at.
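The flattening step described above could be sketched like this: walk the project, list the directory structure first, then append each file's contents under a header. The extension filter and header format are assumptions; adapt them to your project.

```python
import os
import tempfile

# Sketch: flatten a project into one text blob, directory listing first so
# the LLM knows what it's looking at (header format is an assumption).
def flatten_project(root, extensions=(".py",)):
    tree, bodies = [], []
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            tree.append(rel)
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                bodies.append(f"### FILE: {rel}\n{f.read()}")
    return ("# Directory structure\n" + "\n".join(tree)
            + "\n\n" + "\n\n".join(bodies))

# Demo on a throwaway directory:
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "main.py"), "w", encoding="utf-8") as f:
    f.write("print('hello')\n")
flat = flatten_project(demo)
print(flat.splitlines()[0])  # Directory structure header
```

The resulting .txt file can then be chunked and fed into the same data-preparation pipeline as any other plain-text source.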
is "Context" a keyword which this specific model knows? how would it notice it after the blob of text
It should know Context like any other english word and also have seen training data of what that refers to.
On Runpod, how do I get/amend the Llama 70B API by TrelisResearch template to work with an exposed TCP port?
The terminal says the connection is refused, both in the terminal and in VS Code (preferred).
Other templates work fine.
Doesn't work: SSH over exposed TCP (supports SCP & SFTP).
Works: the basic SSH terminal (no support for SCP & SFTP).
The basic SSH terminal is not going to work with VS Code, to my knowledge.
Perhaps there is a way to edit the templates for these containers so they can work with VS Code?
I'm really looking forward to digging into your tutorials :)
Hello @GrahamAndersonis,
Out of the box, Debian Linux does not come with SSH installed.
1. In the Runpod image, you have to pass the public key, as well as TCP port 22.
2. Run the following commands in the basic command prompt:
####
# Update package lists for upgrades and new package installations
apt update;
# Install the OpenSSH server non-interactively to avoid prompts during installation
DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;
# Start the SSH service to enable remote connections
service ssh start;
####
3. After this, the pod will have SSH available to connect to.
4. Use VS Code's Remote - SSH extension to connect to the pod as a remote server.
5. This will have SCP and SFTP enabled.
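With SSH running in the pod, an entry like the following in ~/.ssh/config lets VS Code's Remote - SSH extension pick up the connection. The host alias, IP, port, and key path are placeholders; use the public IP and exposed TCP port that Runpod shows for your pod.

```
Host runpod-llama
    HostName <pod-public-ip>
    Port <exposed-tcp-port>
    User root
    IdentityFile ~/.ssh/id_ed25519
```

In VS Code, "Remote-SSH: Connect to Host..." should then list runpod-llama.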
Hi Graham, yeah, I had this issue too and will post a workaround shortly. Ultimately the image would need to be updated for a permanent fix (but I don't control that image).
@@TrelisResearch fantastic work
cheeeeez u give it to me man !
😂❤️