How to make a custom dataset like Alpaca7B
- Published 15 Nov 2024
- Colab: drp.li/hrPbE
In this video, I walk through making your own custom dataset in the style of the Alpaca dataset: starting from human-generated seed data and using it to generate synthetic data with GPT-3 (a rough code sketch of the idea follows the links below).
For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: / samwitteveen
Twitter: / sam_witteveen
My Links:
Linkedin: / samwitteveen
Github:
github.com/sam...
github.com/sam...
#largelanguagemodels
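For a concrete picture of the generation step described above, here is a minimal sketch using the OpenAI Python client. The model name, prompt wording, and seed task are illustrative assumptions, not the exact ones used in the video or in the original Alpaca pipeline (which used text-davinci-003).

```python
# Sketch of Alpaca-style synthetic data generation: show a model a few
# human-written seed tasks and ask it to produce new instruction examples.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

seed_tasks = [
    {"instruction": "Give three tips for staying healthy.", "input": "",
     "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep enough."},
]

prompt = (
    "You are generating a dataset of instruction-following examples.\n"
    "Here are example tasks:\n"
    + "\n".join(json.dumps(t) for t in seed_tasks)
    + "\nWrite 5 new, diverse tasks in the same JSON format, one per line."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; Alpaca used text-davinci-003
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
)

# Keep only lines that parse as well-formed records.
for line in response.choices[0].message.content.splitlines():
    line = line.strip()
    if line.startswith("{"):
        try:
            example = json.loads(line)
            print(example["instruction"])
        except (json.JSONDecodeError, KeyError):
            pass  # skip malformed generations
```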
It’s an awesome time to be alive if you’re into this sort of thing.
totally!!
I am very glad to have found your channel as a beginner in ML. You make a topic like this understandable for everybody.
Hopefully in the future you'll do a tutorial on how to make a dataset. Many thanks.
Thanks and welcome
It might be a good idea to fine-tune the LLaMA 30B model on very specific tasks. For example, using a dataset of JUST Python code prompts with responses from GPT-4, and then filtering the responses simply on "do they actually execute when you run them, or do they give an error?". Maybe add at most one correction attempt for every prompt and use the corrected version if it runs. Only train on code that actually runs.
This is certainly possible. I have trained the 30B privately, but the biggest issue with the 30B is that it is much harder for people to run without a rather large GPU system behind it. Certainly, training these models on specialized datasets is something we are working on a lot.
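A minimal sketch of that "does it actually run" filter, purely as an illustration; the helper name and the example records are made up, and note it executes generated code with no sandboxing.

```python
# Keep only generated Python snippets that run to completion without an error.
# WARNING: this executes untrusted generated code; wrap it in a container or
# only use it on generations you are comfortable running.
import os
import subprocess
import sys
import tempfile

def runs_cleanly(code: str, timeout: int = 10) -> bool:
    """Return True if the snippet exits with status 0 within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Made-up examples standing in for GPT-4 generations.
examples = [
    {"prompt": "Write a function that adds two numbers.",
     "code": "def add(a, b):\n    return a + b\n\nprint(add(2, 3))"},
    {"prompt": "Broken example.",
     "code": "print(undefined_variable)"},
]
clean = [ex for ex in examples if runs_cleanly(ex["code"])]
print(f"kept {len(clean)} of {len(examples)} examples")
```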
@@samwitteveenai Is it possible to run 30B with dual RTX 4090 GPUs?
@@samwitteveenai I am also interested in the hardware requirements for each of the models. I was planning on getting some hardware and this would be very insightful.
@@lovemonkey9982 How much VRAM do you have?
Would love to see a chat-based system example where the model retains the context.
That would require something like sleep 😀
Someone needs to train Alpaca with GPT-4 self-instruct
I am, and I'm also using the LLaMA 65B instead of the 7B.
@@OuranianCyclops That's awesome, what GPU are you using?
@@OuranianCyclops Could you explain how? Thanks!
@@lovemonkey9982 8 H100 GPUs, but that's because I already use this machine for my business. I haven't tested it on a standard 4090, for example, since it's already trained.
@@OuranianCyclops lucky you.
Would love to see some training on specific tasks like turning unstructured to structured data etc
This is an interesting one. Most datasets for this kind of thing are not public and relate to a very specific task. I will look around and see what I can find.
Thank you so much! This video made me understand Datasets MUCH better!
I'd love to see a video about integrating it into various software and OSes to AI-boost the tools and systems we already use, like what Microsoft and Google did recently with their online tools.
I will show how to use it in LangChain in an upcoming video
What do I need to get one of these onto my computer?
Mostly you will need a pretty powerful GPU card.
@@samwitteveenai RTX30.. or RTX40.. to make a dataset for question & answer?
Hi Sam, thanks for making this video. Great channel. I would like to ask a noob question, after generating my own dataset, what needs to be done next to fine tune the alpaca model?
You will need to load the model, set it up, and run a training. Check out the video on fine-tuning Alpaca.
Bro, can we meet up on Discord? I have questions about generating a dataset.
Could you make a video on making Alpaca with GPT-4 and Meta's 65B?
Hi Sam, thanks for the amazing tutorial. Assume I have a CSV file where the instructions need to be different, but I only have input and output with no instructions. How do I generate instructions for those datasets? Looking forward to hearing from you. Thanks,
Andy
Thanks for the video. I wanted to understand the difference between prompt engineering and agent creation (with LangChain, for instance) vs. creating a full-on dataset and training a model. What are the main differences, and what are the ups and downs of each?
I am wondering how to create a dataset from PDF files. My department is related to aviation and we have lots of PDF files. I only need to convert them to the required format for Alpaca, or create a dataset from the PDF files. Honestly I don't know which one fits my situation better; I feel stuck. What do you think about this, sir?
For the instruction fine-tuning you would want them in some kind of question/answer pairs.
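For anyone unsure what those pairs look like: each Alpaca record is a JSON object with instruction, input, and output fields. Here is a made-up, aviation-flavoured entry written out from Python; the content is purely illustrative.

```python
# One made-up Alpaca-style record; the three field names match the real
# Alpaca dataset, while the aviation content is invented for illustration.
import json

example = {
    "instruction": "Summarise the pre-flight check described in the text.",
    "input": "Before each flight, remove the pitot tube cover and inspect it for damage.",
    "output": "Remove the pitot tube cover before every flight and check it for damage.",
}

# Alpaca stores its ~52k records as a single JSON list in one file.
with open("my_dataset.json", "w") as f:
    json.dump([example], f, indent=2)
```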
Really great video. What are the instructions to actually train the model? At the end of the video I believe you generate the data to train the model, but how would you actually do it? And would you need a super powerful GPU to do it?
Check out my other video about fine-tuning LLaMA to create an Alpaca model.
@@samwitteveenai Thanks, is this the video? ua-cam.com/video/JzBR8oieyy8/v-deo.html
@@frankvitetta No, this one: ua-cam.com/video/LSoqyynKU9E/v-deo.html
@@samwitteveenai thank you sooo much !
So in layman's terms, what the Alpaca creators did was create a dataset using the OpenAI API, then fine-tune the LLaMA model on that dataset.
Hi, I am loving the videos you are making and I am learning lots. Could you make a video on how to do instruction fine-tuning on a new dataset, for example with ChatGPT-4? 🙂👍
Hi Sam, I was wondering whether it's possible to make a dataset out of my personal website. In that case I would not want to input self-written content and then augment the data to generate the dataset, but rather use the website's data directly. How should I go about doing so? I hope my question is clear. Many thanks. :)
you should ask gpt to write a web crawler
0:05 you are making amazing videos thank you
Thanks much appreciated
Excellent
Can you share the link from Alpaca as well? This is a great starting tutorial. Do you cover the part where you actually create the model from your own custom data set?
see the training Alpaca video, but the code is possibly out of date by now.
I’d be really fascinated to see if you could not just train a model for a particular business or service, but to respond like a particular person. Eg, could you train it to talk like you? Request your message logs from Facebook over the last 10+ years and use that as training data? Could we actually do a Black Mirror “Be Right Back”?
people are already doing exactly that, look up Project December
Ah, you are also thinking of cloning yourself; nice to know I'm not alone 😂
This is certainly possible. Training these models on specialized datasets is something we are working on a lot.
@@samwitteveenai Keep us posted, it is a game changer. Thanks for confirming it is possible, I suspected it was (I'm an outsider, a simple wagie proompter, who dreams of automating my life in order to play Zelda, FFXVI, and Diablo IV instead of slaving this summer... I have a dream...)
@@samwitteveenai Would love a video on that! Specifically, providing an already performant model with some extra files containing additional information for a specific use. I was also wondering: how hard would it be to give it more permissions over a machine? I know there is work to do on the software side, but how could you train a model to make it know that it has control (for example, opening apps etc.)?
Keep these wonderful videos coming!
Very cool stuff! Just found your channel and you're killing it :)
Great content, m8! Can this model take in unstructured text as well (e.g. in this example, the About page of a business) and use that to answer customer questions?
Yes, but to get good results you would need to fine-tune the model for that.
I'm not sure I fully understand training this... are datasets basically question/answer pairs? Or can I hand it a book on data structures and have it learn the info?
They are question/answer or task/response style data.
I know you just explained how you trained the model, but is there a tutorial anywhere that goes into depth: how to add a dataset for training, and how to use it once it's trained?
Great video! I was wondering if it is possible to generate synthetic data without the use of the OpenAI API?
Yes, there are other ways. You could look at other LLM providers, or use an open source model with filtering. A lot of it comes down to your creativity. Perhaps I will make a video for this in the next few weeks.
@@samwitteveenai Nice ! Thank you
@@samwitteveenai That'd be greatly appreciated! I guess the challenge now for open source LLM providers is to match the quality level of OpenAI; likely even more filtering would be required with these models?!
Thanks for all those great videos BTW!
@@samwitteveenai I am looking to do the same, to create the dataset but without using an OpenAI API key. Is your video available? Can you please provide the link? Many thanks in advance.
Sometimes I'm a little confused as to what constitutes training and what constitutes fine-tuning. It seems in the comments people also mix this up? Would you mind elaborating?
Also a point: it says num_CPU. Did you train / fine-tune this on a CPU?
This was just creating the dataset, not doing the fine-tuning. The fine-tuning uses a GPU; I have another video walking through that. Fine-tuning is tuning a model for a specific downstream task, and it is technically a form of training. Training and pre-training for LLMs generally refer to training in a self-supervised way over a very large amount of tokens, to get the model to "understand" language in general but not for a specific task.
Do you have any idea how to transform a PDF document into such a dataset without too much manual work? That means we need to generate Q&A on specific document sentences...
It would really depend on how the PDF is set up and what data is in it. Do you have a specific example?
@@samwitteveenai How do you collect data for LoRA training?
How much did it cost though?
Is it possible to finetune a model specific for writing fantasy adventure novels?
Yes, totally, as long as you get it into this format.
So could I make this imitate certain writing styles for fantasy novels? Seems better than paying to do so lol
Yeah, you would just need to train it on a dataset like that.
I'm trying to fine-tune Guanaco 13B, an Alpaca 13B based on Alpaca-LoRA but in Spanish. I want to set up instruction codes for every instruction, so that if I want to open some app, it gives me a code that I receive and use to execute a Python function. This is because in Spanish we have a lot of words for the same things, so it's complex to cover every possibility in a single if.
And I also want it to respond to the name "Emilia"
With such a relatively small dataset, I'm a little confused as to why the model wouldn't use lemmatisation over stemming; would this not have provided a higher accuracy rate because of its 'canonical' dictionary-based approach? Listening to OpenAI's Chief Scientist last week, it's obvious that OpenAI models of the near future will be based on much smaller datasets. Or am I missing the point?
The pretraining of the base model means it doesn't need traditional NLP techniques and allows us to fine-tune it with a relatively small dataset. That said, I am pretty sure OpenAI is using datasets bigger than this themselves.
@@samwitteveenai Thank you Sam.
I want to use the Raiders of the Lost Kek dataset to see how chatgpt4all would be uncensored; how do I go about this?
Make your dataset, then use the fine-tuning colab and video.
Can we use a bigger model for fine-tuning?
Yes you can, but the challenge becomes being able to fit the model in the VRAM of your GPU card; this is where multiple cards come in. If you want to try something bigger, you can try the T5 and Flan models, which go up to 11B and 20B in size.
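Not something covered in the video, but one common way to squeeze a larger checkpoint into limited VRAM is 8-bit loading with bitsandbytes through transformers. A rough sketch, with the model name as a placeholder example:

```python
# Sketch: load a model in 8-bit so it fits in less VRAM.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "openlm-research/open_llama_7b"  # example checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # roughly halves VRAM vs fp16
    device_map="auto",  # spread layers across available GPUs (and CPU if needed)
)
```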
Can I train it on premade datasets, like Wikipedia for example?
Yes, but you will probably need to do some preprocessing to get the best results. That said, most of these big models will have had Wikipedia in their training data already.
Is it possible just to add a lot of data, without "instruction"/"instance" examples?
Yes, but you will want to think about how it would be used/conditioned in the model. Doing this, it will just predict the next word/token, so how would you want it to generate what you are after?
I want to make a custom dataset for k8s-specific questions. How do I make sure that the AI outputs code blocks?
Code blocks in other models are often wrapped in a special token or three backticks, etc. You could do it like that (a small example is below), but it's probably better to use a model more focused on code pretraining.
@@samwitteveenai Nice, ok. What models do you recommend for code?
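To make the formatting point concrete, here is a made-up training record whose output wraps the answer in a fenced block; the Kubernetes manifest is illustrative only.

```python
# Illustration of a training record whose output wraps code in a fenced block
# (three backticks), so the fine-tuned model learns to emit fenced blocks too.
fence = "`" * 3  # three backticks, built here to keep this snippet readable

manifest = """apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2"""

example = {
    "instruction": "Show a Kubernetes manifest for an nginx Deployment with 2 replicas.",
    "input": "",
    "output": f"{fence}yaml\n{manifest}\n{fence}",
}
print(example["output"])
```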
What happens if you use a dataset in a different language?
It seems it will work for some languages but probably not as well as English. Someone translated the dataset into Portuguese and apparently it worked, so worth a try.
Do you know how the Stanford folks avoid the legal issue? OpenAI clearly says NO training using GPT output in their terms and conditions.
This is a good point. Perhaps because they are Stanford. :D
@@samwitteveenai FTX CEO Sam Bankman-Fried's parents, Joseph Bankman and Barbara Fried, are both professors at Stanford University. So take a guess whether he gets a penalty :))))).
@@samwitteveenai They clearly say in the Alpaca license terms that commercial use is not allowed, for several reasons. The first reason is LLaMA's own non-commercial-use license, and the second is OpenAI's clause that prohibits using their models to train other models that compete with their services.
@@EcommerceGrowthHacker Then what are we gonna do? Any suggestion :)
Hi, newbie here. David Shapiro says Alpaca 7B is an important breakthrough in generative AI. That's when I found your video.
Can you explain what the difference is between training the model vs normal fine-tuning? I see the processes are quite similar, where you feed it a lot of data.
thank you. 🙏
If you mean the difference between fine-tuning the model and LoRA:
LoRAs are trained weights that get inserted into the original model at run time to get the behavior that you want,
whilst fine-tuning is training and changing the weights of the whole original model.
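As a rough illustration of the LoRA route (not code from the video), this is roughly what attaching adapters with the PEFT library looks like; the checkpoint name and hyperparameters are just examples.

```python
# Sketch: wrap a base model with LoRA adapters using the PEFT library.
# Requires: pip install peft transformers
# Only the small adapter matrices are trained; the original weights stay frozen.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_7b")  # example only

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters are trainable
```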
Is there a way to do this without OpenAI?
You can use an open source model, but the results probably won't be as good.
@@samwitteveenai For sure! Having said that, the community should rally around open source and make it better than OpenAI.
@@samwitteveenai Any video, tutorial, or link for the same? Thanks.