Prepare Fine-tuning Datasets with Open Source LLMs
- Published 27 May 2024
- Runpod Affiliate Link tinyurl.com/yjxbdc9w
Advanced Fine-tuning and Data-preparation Scripts (Lifetime Membership)
Fine-tune LLMs for style, content or structured responses...
Learn More: trelis.com/advanced-fine-tuni...
LLM Server Setup Repo Access (Lifetime Membership)
- Video: Run Llama 2 on AWS: • Deploy Llama 2 for you...
- Video: Deploy a Llama API in 5 clicks: • Deploy an API for Llam...
- Learn more: trelis.com/enterprise-server-...
Chapters:
0:00 Preparing data for fine-tuning
0:37 Video overview
1:04 Accessing the GitHub Repo w/ data preparation scripts
2:42 Q&A Dataset preparation using Llama 2 70B and chat-ui
7:29 How to set up a Llama 2 API for 70B
8:45 Using a Llama 2 API to prepare a Q&A dataset for fine-tuning
12:22 Pro tips for preparing fine-tuning datasets
Category: Science & Technology
I purchased full access to your repo because I love and want to support the work you are doing. Some of the clearest and most articulate explanations of embeddings, fine-tuning, supervised vs. unsupervised methods, and data prep. Keep it up!
Appreciate that! Many thanks
Thank you very much
Great video! How are you chunking the videos: by paragraph, sentence, word, char, etc.? Are you using any overlap in the chunks? Have you tested your system with a smaller Llama 2 model? What type of results would one get from, say, a Llama 2 13B, or even a 7B that could possibly be run from home?
Howdy!
Here, I chunk into 500- or 750-token chunks. If the chunks are too small, then the cropped sentence at the end has too much effect and you get hallucination. If the chunks are too big, then you'll get too many questions (and LLMs often aren't able to respond consistently to very long lists of questions).
Check out my supervised fine-tuning video; that's done on 13B. With enough data, you can get to reasonable quality. 7B is tough unless you have a lot of data (or are fine-tuning for structured responses).
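The token-window chunking described above can be sketched as follows. This is a minimal illustration, not the repo's actual script: a whitespace split stands in for the model's real tokenizer (in practice you would use the Llama tokenizer's encode()), and the chunk size of 500 matches the figure mentioned in the reply.

```python
# Minimal sketch of fixed-size token chunking (assumption: whitespace split
# approximates tokenization; swap in a real tokenizer for production use).
def chunk_text(text, chunk_size=500):
    """Split text into consecutive chunks of roughly chunk_size tokens."""
    tokens = text.split()  # placeholder tokenizer
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

doc = ("lorem " * 1200).strip()
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks: 500 + 500 + 200 tokens
```

A real pipeline might also add overlap between chunks so sentences cropped at a boundary appear whole in the next chunk.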
@@TrelisResearch Thanks for the reply. I watched the whole series after I posted this. Very good series! :) What are your thoughts on using a 7B model just for the Q&A creation, and then fine-tuning the larger 70B model on that data? Is there any benefit to using such a large model in the Q&A creation step?
@@unshadowlabs Yeah, I think you need to use a big model for Q&A because you don't want hallucination in the Q&A set; data quality is crucial, and 7B hallucinates too much.
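The Q&A-creation step discussed here amounts to prompting the large model with each chunk and asking for question-answer pairs. A sketch of what such a prompt might look like is below; the build_qa_prompt helper and its exact wording are assumptions, not the repo's actual script.

```python
# Hypothetical prompt builder for Q&A dataset generation (the helper name
# and prompt wording are illustrative assumptions).
def build_qa_prompt(chunk, n_questions=5):
    return (
        f"Context:\n{chunk}\n\n"
        f"Write {n_questions} question-and-answer pairs that can be answered "
        "using only the context above. Format each pair as:\n"
        "Q: <question>\nA: <answer>"
    )

prompt = build_qa_prompt("Llama 2 comes in 7B, 13B and 70B sizes.", n_questions=3)
print(prompt.splitlines()[0])  # Context:
```

The resulting prompt would then be sent to the 70B model's API, and the "Q:"/"A:" pairs parsed out of its response.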
Hi, thanks!! A question about a model for which I have more than 2,000 PDFs. Do you recommend improving the handling of vector databases? When do you recommend fine-tuning and when do you recommend a vector database?
Start with a vector database unless a) you need low latency and short prompts, or b) you want to do structured generation. Fine-tuning may give a small boost, but embeddings will be best.
You used plain text for the dataset. Is it better than the JSON format? When would you choose one or the other? Thanks for the video!
Well, if you have JSON available to start with, that's going to be even easier to process and modify to meet your needs. Plain text is hardest, as there is no structure to go on.
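To illustrate the difference: once Q&A pairs exist, writing them out as structured JSONL records is straightforward. The "prompt"/"completion" field names below are an assumption; match whatever schema your fine-tuning script expects.

```python
import json

# Sketch: converting Q&A pairs into JSONL training records (field names
# "prompt"/"completion" are an assumed schema).
def pairs_to_jsonl(pairs):
    return "\n".join(
        json.dumps({"prompt": q, "completion": a}) for q, a in pairs
    )

jsonl = pairs_to_jsonl([
    ("What sizes does Llama 2 come in?", "7B, 13B and 70B."),
])
print(jsonl)
```

Going the other way, from unstructured plain text to records like these, is the hard part the reply refers to.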
Hi, I just paid for the access to the repo of this video, but I wasn't aware of the option to buy access to all projects in the repo, Is there any way to pay the difference and upgrade? how can I get in touch with you for that? love the work btw!
Howdy, everyone gets emailed a receipt, so you can just respond to that email!
Great 🤠
I want to fine-tune on my code. I have multiple folders and files in each project that I want to fine-tune on. Can this private repo work for that? Basically, I want to fine-tune on my coding projects.
Yes, this can work. If you're dealing with a file structure, you may want to decide which files to include and then flatten them into one single .txt file. It can also help to include a directory structure within that .txt file so the LLM knows what it's looking at.
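The flattening step described above could be sketched like this: walk the project, list the directory structure first, then append each file's contents under a header. The extension filter and header format are assumptions; adapt them to your project.

```python
import os
import tempfile

# Sketch: flatten a project into one text blob, directory listing first so
# the LLM knows what it's looking at (header format is an assumption).
def flatten_project(root, extensions=(".py",)):
    tree, bodies = [], []
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            tree.append(rel)
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                bodies.append(f"### FILE: {rel}\n{f.read()}")
    return ("# Directory structure\n" + "\n".join(tree)
            + "\n\n" + "\n\n".join(bodies))

# Demo on a throwaway directory:
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "main.py"), "w", encoding="utf-8") as f:
    f.write("print('hello')\n")
flat = flatten_project(demo)
print(flat.splitlines()[0])  # Directory structure header
```

The resulting .txt file can then be chunked and fed into the same data-preparation pipeline as any other plain-text source.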
is "Context" a keyword which this specific model knows? how would it notice it after the blob of text
It should know Context like any other english word and also have seen training data of what that refers to.
On Runpod, how do I get/amend the Llama 70B API by TrelisResearch template to work with an exposed TCP port?
The terminal says the connection is refused, both in the terminal and in VS Code (preferred).
Other templates work fine.
Doesn't work: SSH over exposed TCP (supports SCP & SFTP).
Works: the basic SSH terminal (no support for SCP & SFTP).
The basic SSH terminal is not going to work with VS Code, to my knowledge.
Perhaps there is a way to edit the templates for these containers so they can work with VS Code?
I'm really looking forward to digging into your tutorials :)
Hello @GrahamAndersonis,
Out of the box, Debian Linux does not come with SSH installed.
1. In the Runpod image, you have to pass the public key, as well as TCP port 22.
2. Run the following commands in the basic command prompt:
####
# Update package lists for upgrades and new package installations
apt update;
# Install the OpenSSH server non-interactively to avoid prompts during installation
DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;
# Start the SSH service to enable remote connections
service ssh start;
####
3. After this, the pod will have SSH available to connect to.
4. Use VS Code's Remote - SSH extension to connect to the pod as a remote server.
5. This will have SCP and SFTP enabled.
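With SSH running in the pod, an entry like the following in ~/.ssh/config lets VS Code's Remote - SSH extension pick up the connection. The host alias, IP, port, and key path are placeholders; use the public IP and exposed TCP port that Runpod shows for your pod.

```
Host runpod-llama
    HostName <pod-public-ip>
    Port <exposed-tcp-port>
    User root
    IdentityFile ~/.ssh/id_ed25519
```

In VS Code, "Remote-SSH: Connect to Host..." should then list runpod-llama.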
Hi Graham, yeah, I had this issue too and will post a workaround shortly. Ultimately the image would need to be updated for a permanent fix (but I don't control that image).
@@TrelisResearch fantastic work
cheeeeez u give it to me man !
😂❤️