How to make a custom dataset like Alpaca7B
- Published 15 Nov 2024
- Colab: drp.li/hrPbE
In this video, I walk through making your own custom dataset in the style of the Alpaca dataset: starting from human-generated seed data and using it to generate synthetic data with GPT-3 (a rough code sketch of the idea follows the links below).
For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: / samwitteveen
Twitter: / sam_witteveen
My Links:
Linkedin: / samwitteveen
Github:
github.com/sam...
github.com/sam...
#largelanguagemodels
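For a concrete picture of the generation step described above, here is a minimal sketch using the OpenAI Python client. The model name, prompt wording, and seed task are illustrative assumptions, not the exact ones used in the video or in the original Alpaca pipeline (which used text-davinci-003).

```python
# Sketch of Alpaca-style synthetic data generation: show a model a few
# human-written seed tasks and ask it to produce new instruction examples.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

seed_tasks = [
    {"instruction": "Give three tips for staying healthy.", "input": "",
     "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep enough."},
]

prompt = (
    "You are generating a dataset of instruction-following examples.\n"
    "Here are example tasks:\n"
    + "\n".join(json.dumps(t) for t in seed_tasks)
    + "\nWrite 5 new, diverse tasks in the same JSON format, one per line."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; Alpaca used text-davinci-003
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
)

# Keep only lines that parse as well-formed records.
for line in response.choices[0].message.content.splitlines():
    line = line.strip()
    if line.startswith("{"):
        try:
            example = json.loads(line)
            print(example["instruction"])
        except (json.JSONDecodeError, KeyError):
            pass  # skip malformed generations
```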
It’s an awesome time to be alive if you’re into this sort of thing.
totally!!
I am very glad to have found your channel as a beginner in ML. You make a topic like this understandable for everybody.
Hopefully in the future you'll do a tutorial on how to make a dataset. Many thanks.
Thanks and welcome
It might be a good idea to fine-tune the LLaMA 30B model on very specific tasks. For example, using a dataset of JUST Python code prompts with responses from GPT-4, and then filtering the responses simply on "do they actually execute when you run them, or do they give an error?". Maybe add at most one correction attempt for every prompt and use the corrected version if it runs. Only train on code that actually runs.
This is certainly possible. I have trained the 30B privately, but the biggest issue with the 30B is that it is much harder for people to run without a rather large GPU system behind it. Certainly, training these models on specialized datasets is something we are working on a lot.
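A minimal sketch of that "does it actually run" filter, purely as an illustration; the helper name and the example records are made up, and note it executes generated code with no sandboxing.

```python
# Keep only generated Python snippets that run to completion without an error.
# WARNING: this executes untrusted generated code; wrap it in a container or
# only use it on generations you are comfortable running.
import os
import subprocess
import sys
import tempfile

def runs_cleanly(code: str, timeout: int = 10) -> bool:
    """Return True if the snippet exits with status 0 within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Made-up examples standing in for GPT-4 generations.
examples = [
    {"prompt": "Write a function that adds two numbers.",
     "code": "def add(a, b):\n    return a + b\n\nprint(add(2, 3))"},
    {"prompt": "Broken example.",
     "code": "print(undefined_variable)"},
]
clean = [ex for ex in examples if runs_cleanly(ex["code"])]
print(f"kept {len(clean)} of {len(examples)} examples")
```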
@@samwitteveenai Is it possible to run 30B with dual RTX 4090 GPUs?
@@samwitteveenai I am also interested in the hardware requirements for each of the models. I was planning on getting some hardware and this would be very insightful.
@@lovemonkey9982 How much VRAM do you have?
Would love to see a chat-based system example where the model retains the context.
That would require something like sleep 😀
Someone needs to train Alpaca with GPT-4 self-instruct
I am, and I'm also using the LLaMA 65B instead of the 7B.
@@OuranianCyclops That's awesome, what GPU are you using?
@@OuranianCyclops Could you explain how? Thanks!
@@lovemonkey9982 8 H100 GPUs, but that's because I already use this machine for my business. I haven't tested it on a standard 4090, for example, since it's already trained.
@@OuranianCyclops lucky you.
Would love to see some training on specific tasks like turning unstructured to structured data etc
This is an interesting one. Most datasets for this kind of thing are not public and relate to a very specific task. I will look around and see what I can find.
Thank you so much! This video made me understand Datasets MUCH better!
I'd love to see a video about integrating it into various software and OSes to AI-boost the tools and systems we already use, like what Microsoft and Google did recently with their online tools.
I will show how to use it in LangChain in an upcoming video
What do I need to get one of these onto my computer?
Mostly you will need a pretty powerful GPU card.
@@samwitteveenai RTX30.. or RTX40.. to make a dataset for question & answer?
Hi Sam, thanks for making this video. Great channel. I would like to ask a noob question, after generating my own dataset, what needs to be done next to fine tune the alpaca model?
You will need to load the model, set it up, and run a training. Check out the video on fine-tuning Alpaca.
Bro, can we meet up on Discord? I have questions about generating a dataset.
Could you make a video on making Alpaca with GPT-4 and Meta's 65B?
Hi Sam, thanks for the amazing tutorial. Assume I have a CSV file where the instructions need to be different, but I only have input and output with no instructions. How do I generate instructions for those datasets? Looking forward to hearing from you. Thanks,
Andy
Thanks for the video. I wanted to understand the difference between prompt engineering and agent creation (with LangChain, for instance) vs. creating a full-on dataset and training a model. What are the main differences, and what are the ups and downs of each?
I am wondering how to create a dataset from PDF files. My department is related to aviation and we have lots of PDF files. I only need to convert them to the required format for Alpaca, or create a dataset from the PDF files. Honestly I don't know which one fits my situation better; I feel stuck. What do you think about this, sir?
For the instruction fine-tuning you would want them in some kind of question/answer pairs.
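For anyone unsure what those pairs look like: each Alpaca record is a JSON object with instruction, input, and output fields. Here is a made-up, aviation-flavoured entry written out from Python; the content is purely illustrative.

```python
# One made-up Alpaca-style record; the three field names match the real
# Alpaca dataset, while the aviation content is invented for illustration.
import json

example = {
    "instruction": "Summarise the pre-flight check described in the text.",
    "input": "Before each flight, remove the pitot tube cover and inspect it for damage.",
    "output": "Remove the pitot tube cover before every flight and check it for damage.",
}

# Alpaca stores its ~52k records as a single JSON list in one file.
with open("my_dataset.json", "w") as f:
    json.dump([example], f, indent=2)
```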
Really great video. What are the instructions to actually train the model? At the end of the video I believe you generate the data to train the model, but how would you actually do it? And would you need a super powerful GPU to do it?
Check out my other video about fine-tuning LLaMA to create an Alpaca model.
@@samwitteveenai Thanks, is this the video? ua-cam.com/video/JzBR8oieyy8/v-deo.html
@@frankvitetta No, this one: ua-cam.com/video/LSoqyynKU9E/v-deo.html
@@samwitteveenai thank you sooo much !
So in layman's terms, what the Alpaca creators did was create a dataset using the OpenAI API, then fine-tune the LLaMA model on that dataset.
Hi, I am loving the videos you are making and I am learning lots. Could you make a video on how to do instruction fine-tuning on a new dataset, for example with ChatGPT-4? 🙂👍
Hi Sam, I was wondering whether it's possible to make a dataset out of my personal website. In that case I would not want to input self-written content and then augment the data to generate the dataset, but rather use the website's data directly. How should I go about doing so? I hope my question is clear. Many thanks. :)
you should ask gpt to write a web crawler
0:05 you are making amazing videos thank you
Thanks much appreciated
Excellent
Can you share the link from Alpaca as well? This is a great starting tutorial. Do you cover the part where you actually create the model from your own custom data set?
see the training Alpaca video, but the code is possibly out of date by now.
I’d be really fascinated to see if you could not just train a model for a particular business or service, but to respond like a particular person. Eg, could you train it to talk like you? Request your message logs from Facebook over the last 10+ years and use that as training data? Could we actually do a Black Mirror “Be Right Back”?
people are already doing exactly that, look up Project December
Ah, you are also thinking of cloning yourself; nice to know I'm not alone 😂
This is certainly possible. Training these models on specialized datasets is something we are working on a lot.
@@samwitteveenai Keep us posted, it is a game changer. Thanks for confirming it is possible, I suspected it was (I'm an outsider, a simple wagie proompter, who dreams of automating my life in order to play Zelda, FFXVI, and Diablo IV instead of slaving this summer... I have a dream...)
@@samwitteveenai Would love a video on that! Specifically, providing an already performant model with some extra files containing additional information for a specific use. I was also wondering: how hard would it be to give it more permissions over a machine? I know there is work to do on the software side, but how could you train a model to make it know that it has control (for example, opening apps etc.)?
Keep these wonderful videos coming!
Very cool stuff! Just found your channel and you're killing it :)
Great content, m8! Can this model take in unstructured text as well (e.g. in this example, the About page of a business) and use that to answer customer questions?
Yes, but to get good results you would need to fine-tune the model for that.
I'm not sure I fully understand training this... are datasets basically question/answer pairs? Or can I hand it a book on data structures and have it learn the info?
They are question/answer or task/response style data.
I know you just explained how you trained the model, but is there a tutorial anywhere that goes into depth: how to add a dataset for training, and how to use it once it's trained?
Great video! I was wondering if it is possible to generate synthetic data without the use of the OpenAI API?
Yes, there are other ways. You could look at other LLM providers, or use an open source model with filtering. A lot of it comes down to your creativity. Perhaps I will make a video for this in the next few weeks.
@@samwitteveenai Nice ! Thank you
@@samwitteveenai That'd be greatly appreciated! I guess the challenge now for open source LLM providers is to match the quality level of OpenAI; likely even more filtering would be required with these models?!
Thanks for all those great videos BTW!
@@samwitteveenai I am looking to do the same, to create the dataset but without using an OpenAI API key. Is your video available? Can you please provide the link? Many thanks in advance.
Sometimes I'm a little confused as to what constitutes training and what constitutes fine-tuning. It seems in the comments people also mix this up? Would you mind elaborating?
Also a point: it says num_CPU. Did you train / fine-tune this on a CPU?
This was just creating the dataset, not doing the fine-tuning. The fine-tuning uses a GPU; I have another video walking through that. Fine-tuning is tuning a model for a specific downstream task, and it is technically a form of training. Training and pre-training for LLMs generally refer to training in a self-supervised way over a very large amount of tokens, to get the model to "understand" language in general but not for a specific task.
Do you have any idea how to transform a PDF document into such a dataset without too much manual work? That means we need to generate Q&A on specific document sentences...
It would really depend on how the PDF is set up and what data is in it. Do you have a specific example?
@@samwitteveenai How do you collect data for LoRA training?
How much did it cost though?
Is it possible to finetune a model specific for writing fantasy adventure novels?
Yes, totally, as long as you get it into this format.
So could I make this imitate certain writing styles for fantasy novels? Seems better than paying to do so lol
Yeah, you would just need to train it on a dataset like that.
I'm trying to fine-tune Guanaco 13B, an Alpaca 13B based on Alpaca-LoRA but in Spanish. I want to set up instruction codes for every instruction, so that if I want to open some app, it gives me a code that I receive and use to execute a Python function. This is because in Spanish we have a lot of words for the same things, so it's complex to cover every possibility in a single if.
And I also want it to respond to the name "Emilia"
With such a relatively small dataset, I'm a little confused as to why the model wouldn't use lemmatisation over stemming; would this not have provided a higher accuracy rate because of its 'canonical' dictionary-based approach? Listening to OpenAI's Chief Scientist last week, it's obvious that OpenAI models of the near future will be based on much smaller datasets. Or am I missing the point?
The pretraining of the base model means it doesn't need traditional NLP techniques and allows us to fine-tune it with a relatively small dataset. That said, I am pretty sure OpenAI is using datasets bigger than this themselves.
@@samwitteveenai Thank you Sam.
I want to use the Raiders of the Lost Kek dataset to see how chatgpt4all would be uncensored; how do I go about this?
Make your dataset, then use the fine-tuning colab and video.
Can we use a bigger model for fine-tuning?
Yes you can, but the challenge becomes being able to fit the model in the VRAM of your GPU card; this is where multiple cards come in. If you want to try something bigger, you can try the T5 and Flan models, which go up to 11B and 20B in size.
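Not something covered in the video, but one common way to squeeze a larger checkpoint into limited VRAM is 8-bit loading with bitsandbytes through transformers. A rough sketch, with the model name as a placeholder example:

```python
# Sketch: load a model in 8-bit so it fits in less VRAM.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "openlm-research/open_llama_7b"  # example checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # roughly halves VRAM vs fp16
    device_map="auto",  # spread layers across available GPUs (and CPU if needed)
)
```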
Can I train it on premade datasets, like Wikipedia for example?
Yes, but you will probably need to do some preprocessing to get the best results. That said, most of these big models will have had Wikipedia in their training data already.
Is it possible just to add a lot of data, without "instruction"/"instance" examples?
Yes, but you will want to think about how it would be used/conditioned in the model. Doing this, it will just predict the next word/token, so how would you want it to generate what you are after?
I want to make a custom dataset for k8s-specific questions. How do I make sure that the AI outputs code blocks?
Code blocks in other models are often wrapped in a special token or three backticks, etc. You could do it like that (a small example is below), but it's probably better to use a model more focused on code pretraining.
@@samwitteveenai Nice, ok. What models do you recommend for code?
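To make the formatting point concrete, here is a made-up training record whose output wraps the answer in a fenced block; the Kubernetes manifest is illustrative only.

```python
# Illustration of a training record whose output wraps code in a fenced block
# (three backticks), so the fine-tuned model learns to emit fenced blocks too.
fence = "`" * 3  # three backticks, built here to keep this snippet readable

manifest = """apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2"""

example = {
    "instruction": "Show a Kubernetes manifest for an nginx Deployment with 2 replicas.",
    "input": "",
    "output": f"{fence}yaml\n{manifest}\n{fence}",
}
print(example["output"])
```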
What happens if you use a dataset in a different language?
It seems it will work for some languages but probably not as well as English. Someone translated the dataset into Portuguese and apparently it worked, so worth a try.
Do you know how the Stanford folks avoid the legal issue? OpenAI clearly says NO training using GPT output in their terms and conditions.
This is a good point. Perhaps because they are Stanford. :D
@@samwitteveenai FTX CEO Sam Bankman-Fried's parents, Joseph Bankman and Barbara Fried, are both professors at Stanford University. So take a guess whether he gets a penalty :))))).
@@samwitteveenai They clearly say in the Alpaca license terms that commercial use is not allowed, for several reasons. The first reason is LLaMA's own non-commercial-use license, and the second is OpenAI's clause that prohibits using their models to train other models that compete with their services.
@@EcommerceGrowthHacker Then what are we gonna do? Any suggestion :)
Hi, newbie here. David Shapiro says Alpaca 7B is an important breakthrough in generative AI. That's when I found your video.
Can you explain what the difference is between training the model vs normal fine-tuning? I see the processes are quite similar, where you feed it a lot of data.
thank you. 🙏
If you mean the difference between fine-tuning the model and LoRA:
LoRAs are trained weights that get inserted into the original model at run time to get the behavior that you want,
whilst fine-tuning is training and changing the weights of the whole original model.
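As a rough illustration of the LoRA route (not code from the video), this is roughly what attaching adapters with the PEFT library looks like; the checkpoint name and hyperparameters are just examples.

```python
# Sketch: wrap a base model with LoRA adapters using the PEFT library.
# Requires: pip install peft transformers
# Only the small adapter matrices are trained; the original weights stay frozen.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_7b")  # example only

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters are trainable
```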
Is there a way to do this without OpenAI?
You can use an open source model, but the results probably won't be as good.
@@samwitteveenai For sure! Having said that, the community should rally around open source and make it better than OpenAI.
@@samwitteveenai Any video, tutorial, or link for the same? Thanks.