How to make a custom dataset like Alpaca7B

  • Published 15 Nov 2024
  • Colab: drp.li/hrPbE
    In this video, I go through making your own custom dataset in the style the Alpaca dataset was made: starting from human-generated seed data and using it to generate synthetic data with GPT-3. A rough sketch of the record format and generation loop follows the links below.
    For more tutorials on using LLMs and building Agents, check out my Patreon:
    Patreon: / samwitteveen
    Twitter: / sam_witteveen
    My Links:
    Linkedin: / samwitteveen
    Github:
    github.com/sam...
    github.com/sam...
    #largelanguagemodels
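
    A minimal sketch of the approach described above (not the video's actual Colab code): human-written seed records in the Alpaca instruction/input/output format are fed to GPT-3 to generate new synthetic records. It assumes the pre-1.0 openai Python client and the GPT-3 era text-davinci-003 model; the prompt wording and the seed_tasks name are illustrative.

    import json
    import openai

    openai.api_key = "sk-..."  # your OpenAI key

    # Human-written seed tasks in the Alpaca record format.
    seed_tasks = [
        {"instruction": "Classify the sentiment of this review.",
         "input": "The food was cold and the service was slow.",
         "output": "Negative"},
    ]

    PROMPT = (
        "Come up with 5 new task examples as JSON objects with the keys "
        "instruction, input and output. Use the examples below as a guide "
        "for style and format.\n\n{examples}\n\nNew tasks:"
    )

    def generate_synthetic(seeds):
        """Ask GPT-3 for new records modeled on the seed examples."""
        examples = "\n".join(json.dumps(t) for t in seeds)
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=PROMPT.format(examples=examples),
            max_tokens=512,
            temperature=0.7,
        )
        return resp.choices[0].text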

COMMENTS • 113

  • @____2080_____ • 1 year ago +1

    It’s an awesome time to be alive if you’re into this sort of thing.

  • @jayhu6075 • 1 year ago +2

    I am very glad to have found your channel as a beginner in ML. You make a topic like this so understandable for everybody.
    Hopefully in the future there will be a tutorial on how to make a dataset. Many thanks.

  • @SpricesExist • 1 year ago +17

    It might be a good idea to fine-tune the LLaMA 30B model on very specific tasks. For example, using a dataset of JUST Python code prompts with responses from GPT-4, and then filtering the responses based on a simple "do they actually execute when you run them, or do they give an error". Maybe add a max of 1 correction attempt for every prompt and use the corrected version if it runs. Only train on code that actually runs.
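
    A rough sketch of that execute-and-filter idea, assuming the generated snippets are plain Python (candidates is a hypothetical list of prompt/code pairs; real use should sandbox untrusted model output rather than run it directly):

    import subprocess
    import sys

    candidates = [("Print hello world", "print('hello world')")]  # hypothetical

    def runs_cleanly(code: str, timeout_s: int = 10) -> bool:
        """True if the snippet exits without an error in a fresh subprocess."""
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    # Keep only the prompt/code pairs whose code actually runs.
    filtered = [(p, c) for p, c in candidates if runs_cleanly(c)]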

    • @samwitteveenai • 1 year ago +4

      This is certainly possible. I have trained the 30B privately, but the biggest issue with the 30B is that it is much harder for people to run without a rather large GPU system behind it. Certainly, training these models on specialized datasets is something we are working on a lot.

    • @lovemonkey9982 • 1 year ago +2

      @samwitteveenai Is it possible to run 30B with dual RTX 4090 GPUs?

    • @leonelsamayoa3260 • 1 year ago

      @samwitteveenai I am also interested in the hardware requirements for each of the models. I was planning on getting some hardware, and this would be very insightful.

    • @zachadolphe3633 • 11 months ago

      @lovemonkey9982 How much VRAM do you have?

  • @shuvojyotirakshit5808 • 1 year ago +7

    Would love to see a chat-based system example where the model retains the context.

    • @janskacel9480 • 1 year ago

      That would require something like sleep 😀

  • @Vincent-mx4rk • 1 year ago +20

    Someone needs to train Alpaca with GPT-4 self-instruct

    • @OuranianCyclops • 1 year ago +7

      I am, and I'm also using the LLaMA 65B instead of the 7B.

    • @lovemonkey9982 • 1 year ago

      @OuranianCyclops That's awesome. What GPU are you using?

    • @thenbaplayer9485 • 1 year ago

      @OuranianCyclops Could you explain how? Thanks!

    • @OuranianCyclops • 1 year ago +1

      @lovemonkey9982 8 H100 GPUs, but that's because I already use this machine for my business. I haven't tested it on a standard 4090, for example, since it's already trained.

    • @lovemonkey9982 • 1 year ago

      @OuranianCyclops Lucky you.

  • @JOHNSMITH-ve3rq • 1 year ago +4

    Would love to see some training on specific tasks, like turning unstructured data into structured data, etc.

    • @samwitteveenai • 1 year ago +2

      This is an interesting one. Most datasets for this kind of thing are not public and relate to a very specific task. I will look around and see what I can find.

  • @rimpuru • 1 year ago

    Thank you so much! This video made me understand Datasets MUCH better!

  • @galgrunfeld9954 • 1 year ago +2

    I'd love to see a video about integrating it into various software and OSes to AI-boost tools and systems we already use, like what Microsoft and Google did recently with their online tools.

    • @samwitteveenai • 1 year ago +1

      I will show how to use it in LangChain in an upcoming video.

  • @BECHEEKHA • 1 year ago

    What do I need to get one of these onto my computer?

    • @samwitteveenai • 1 year ago +1

      Mostly you will need a pretty powerful GPU card.

    • @jayhu6075 • 1 year ago

      @samwitteveenai An RTX 30-series or RTX 40-series card to make a dataset for question & answer?

  • @auntiedrummer • 1 year ago +1

    Hi Sam, thanks for making this video. Great channel. I would like to ask a noob question: after generating my own dataset, what needs to be done next to fine-tune the Alpaca model?

    • @samwitteveenai • 1 year ago +1

      You will need to load the model, set it up, and do a training run. Check out the video on fine-tuning Alpaca.

    • @academai11 • 1 year ago

      Bro, can we meet up on Discord? I have questions about generating datasets.

  • @thenbaplayer9485 • 1 year ago +3

    Could you make a video on making Alpaca with GPT-4 and Meta's 65B?

  • @andy111007 • 1 year ago

    Hi Sam, thanks for the amazing tutorial. Assume I have a CSV file where the instructions need to be different, but I only have input and output with no instructions. How do I generate instructions for those datasets? Looking forward to hearing from you. Thanks,
    Andy

  • @othmankabbaj9960 • 1 year ago

    Thanks for the video. I wanted to understand the difference between prompt engineering and agent creation (with LangChain, for instance) vs. creating a full-on dataset and training a model. What are the main differences, and what are the ups and downs of each?

  • @batuhanbayraktar337 • 1 year ago +1

    I am wondering how we create a dataset from PDF files. My department is related to aviation, and we have lots of PDF files. Should I just convert them to the required format for Alpaca, or create a dataset from the PDF files some other way? Which one fits my situation better, I honestly don't know. I feel stuck. What do you think about this, sir?

    • @samwitteveenai • 1 year ago

      For the instruction fine-tuning you would want them in some kind of question/answer pairs.
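
      One possible shape for that PDF-to-pairs pipeline (an assumption, not the video's code): extract text with the pypdf library, chunk it, and ask the same GPT-3 style model used above to write question/answer pairs per chunk. The chunk size and prompt wording are arbitrary choices.

      from pypdf import PdfReader
      import openai

      def pdf_to_chunks(path: str, chars: int = 2000) -> list[str]:
          """Flatten a PDF to text and split it into fixed-size chunks."""
          text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
          return [text[i:i + chars] for i in range(0, len(text), chars)]

      def qa_pairs_for(chunk: str) -> str:
          """Ask the model for Q/A records about one chunk of PDF text."""
          prompt = (
              "Write 3 question/answer pairs as JSON objects with the keys "
              f"instruction, input and output, about this text:\n\n{chunk}"
          )
          resp = openai.Completion.create(
              model="text-davinci-003", prompt=prompt, max_tokens=512)
          return resp.choices[0].text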

  • @frankvitetta • 1 year ago +1

    Really great video. What are the instructions to actually train the model? At the end of the video I believe you generate the data to train the model, but how would you actually do it? And would you need a super powerful GPU to do it?

    • @samwitteveenai • 1 year ago +1

      Check out my other video about fine-tuning LLaMA to create an Alpaca model.

    • @frankvitetta • 1 year ago

      @samwitteveenai Thanks, is this the video? ua-cam.com/video/JzBR8oieyy8/v-deo.html

    • @samwitteveenai • 1 year ago +1

      @frankvitetta No, this one: ua-cam.com/video/LSoqyynKU9E/v-deo.html

    • @frankvitetta • 1 year ago

      @samwitteveenai Thank you sooo much!

  • @bharatk6790 • 1 year ago +1

    So in layman's terms, what the Alpaca creators did was create a dataset using the OpenAI API, then fine-tune the LLaMA model on that dataset.
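
    For reference, this is roughly the template the Alpaca project uses to render each dataset record into the single string the model is fine-tuned on (paraphrased from memory; check the Stanford Alpaca repo for the exact wording and the no-input variant):

    TEMPLATE = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n"
        "### Input:\n{input}\n\n"
        "### Response:\n{output}"
    )

    record = {
        "instruction": "Summarize the text.",
        "input": "LLaMA is a family of foundation language models released by Meta.",
        "output": "Meta released the LLaMA family of foundation language models.",
    }
    training_string = TEMPLATE.format(**record)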

  • @8888-u6n • 1 year ago +2

    Hi, I'm loving the videos you are making and I am learning lots. Could you make a video on how to do instruction fine-tuning on a new dataset, for example with GPT-4? 🙂👍

  • @Purulence-bw7nt • 1 year ago +5

    Hi Sam, I was wondering whether it's possible to make a dataset out of my personal website? In that case I would not want to input self-written content and then augment the data to generate the dataset, but rather use the website's data directly. How should I go about doing so? I hope my question is clear. Many thanks. :)

    • @underscore. • 1 year ago

      You should ask GPT to write a web crawler.
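
      A toy version of such a crawler, assuming the requests and BeautifulSoup libraries (deliberately minimal: no robots.txt handling, politeness delays, or error handling):

      import requests
      from urllib.parse import urljoin, urlparse
      from bs4 import BeautifulSoup

      def crawl(start_url: str, max_pages: int = 20) -> dict[str, str]:
          """Collect visible text from pages on one site, keyed by URL."""
          domain = urlparse(start_url).netloc
          to_visit, seen, pages = [start_url], set(), {}
          while to_visit and len(pages) < max_pages:
              url = to_visit.pop()
              if url in seen:
                  continue
              seen.add(url)
              soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
              pages[url] = soup.get_text(separator=" ", strip=True)
              for link in soup.find_all("a", href=True):
                  nxt = urljoin(url, link["href"])
                  if urlparse(nxt).netloc == domain:  # stay on the same site
                      to_visit.append(nxt)
          return pages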

  • @silvacarl • 1 year ago +1

    0:05 You are making amazing videos, thank you!

  • @ikjb8561 • 1 year ago

    Can you share the link from Alpaca as well? This is a great starting tutorial. Do you cover the part where you actually create the model from your own custom data set?

    • @samwitteveenai • 1 year ago +1

      See the training Alpaca video, but the code is possibly out of date by now.

  • @lordofudead • 1 year ago +11

    I’d be really fascinated to see if you could not just train a model for a particular business or service, but to respond like a particular person. Eg, could you train it to talk like you? Request your message logs from Facebook over the last 10+ years and use that as training data? Could we actually do a Black Mirror “Be Right Back”?

    • @RexelBartolome • 1 year ago

      People are already doing exactly that, look up Project December.

    • @anonymuzz5102 • 1 year ago +2

      Ah, you are also thinking of cloning yourself; nice to know I'm not alone 😂

    • @samwitteveenai • 1 year ago +1

      This is certainly possible. Training these models on specialized datasets is something we are working on a lot.

    • @anonymuzz5102 • 1 year ago +3

      @samwitteveenai Keep us posted, it is a game changer. Thanks for confirming it is possible; I suspected it was. (I'm an outsider, a simple wagie proompter, who dreams of automating my life in order to play Zelda, FFXVI, and Diablo IV instead of slaving this summer... I have a dream...)

    • @lainchanzzzzzz • 1 year ago +1

      @samwitteveenai Would love a video on that! Specifically, providing an already performant model with some extra files containing additional information for a specific use. I was also wondering: how hard would it be to give it more permissions over a machine? I know there is work to do on the software side, but how could you train a model to make it know that it has control (example: opening apps, etc.)?

  • @kennethleung4487 • 1 year ago

    Keep these wonderful videos coming!

  • @houbenbub • 1 year ago

    Very cool stuff! Just found your channel and you're killing it :)

  • @babakardiop7071 • 1 year ago

    Great content m8! Can this model take in unstructured text as well (e.g. in this example, the About page of a business) and use that to answer customer questions?

    • @samwitteveenai • 1 year ago

      Yes, but to get good results you would need to fine-tune the model for that.

  • @omegablast2002 • 1 year ago

    I'm not sure I fully understand training this... are datasets basically question/answer pairs? Or can I hand it a book on data structures and have it learn the info?

    • @samwitteveenai • 1 year ago

      They are question/answer or task/response style data.

  • @rverm1000 • 1 year ago

    I know you just explained how you trained the model, but is there a tutorial anywhere that goes into depth: how to add a dataset for training, and how to use it once it's trained?

  • @demayaaron6107 • 1 year ago +2

    Great video! I was wondering if it is possible to generate synthetic data without the use of the OpenAI API?

    • @samwitteveenai • 1 year ago +3

      Yes, there are other ways. You could look at other LLM providers, or use an open-source model with filtering. A lot of it comes down to your creativity. Perhaps I will make a video on this in the next few weeks.

    • @demayaaron6107 • 1 year ago

      @samwitteveenai Nice! Thank you.

    • @msachet • 1 year ago

      @samwitteveenai That'd be greatly appreciated! I guess the challenge now for open-source LLM providers is to match OpenAI's quality level; likely even more filtering would be required with these models?!
      Thanks for all those great videos BTW!

    • @nikk6489 • 1 year ago

      @samwitteveenai I am looking for the same: creating the dataset without using an OpenAI API key. If your video is available, can you please provide the link? Many thanks in advance.

  • @BennySalto • 1 year ago

    Sometimes I'm a little confused as to what constitutes training and what constitutes fine-tuning. It seems people in the comments also mix this up? Would you mind elaborating?
    Also a point: it says num_CPU. Did you train/fine-tune this on a CPU?

    • @samwitteveenai • 1 year ago

      This was just creating the dataset, not doing the fine-tuning. The fine-tuning uses a GPU; I have another video walking through that. Fine-tuning is tuning a model for a specific downstream task; it is technically a form of training. Training and pre-training for LLMs generally refer to training in a self-supervised way over a very large number of tokens, to get the model to "understand" language in general but not for a specific task.

  • @yangfuye5935 • 1 year ago +1

    Do you have any idea how to transform a PDF document into such a dataset without too much manual work? That means we need to generate Q&A on specific document sentences...

    • @samwitteveenai • 1 year ago

      It would really depend on how the PDF is set up and what data is in the PDF. Do you have a specific example?

    • @leemark7739 • 1 year ago

      @samwitteveenai How do you collect data for LoRA training?

  • @РыгорБородулин-ц1е

    How much did it cost though?

  • @twinstars8812 • 1 year ago

    Is it possible to finetune a model specific for writing fantasy adventure novels?

    • @samwitteveenai • 1 year ago

      Yes, totally, as long as you get it into this format.

  • @henrymetzger9951 • 1 year ago

    So could I make this imitate certain writing styles for fantasy novels? Seems better than paying someone to do so lol

    • @samwitteveenai • 1 year ago

      Yeah, you would just need to train it on a dataset like that.

  • @Zumito • 1 year ago +1

    I'm trying to fine-tune Guanaco 13B; it's an Alpaca 13B based on Alpaca-LoRA, but in Spanish. I want to set up instruction codes for every instruction, so that if I want to open some app, it gives me a code that I receive and execute with a Python function. This is because in Spanish we have a lot of words for the same things, so it's complex to cover every possibility in a single if.

    • @Zumito • 1 year ago +1

      And I also want it to respond to the name "Emilia".

  • @shanesteven4578 • 1 year ago

    With such a relatively small dataset, I'm a little confused as to why the model wouldn't use lemmatisation over stemming; would this not have provided a higher accuracy rate because of its canonical dictionary-based approach? Listening to OpenAI's Chief Scientist last week, it's obvious that OpenAI models of the near future will be based on much smaller datasets. Or am I missing the point?

    • @samwitteveenai • 1 year ago +1

      The pretraining of the base model means it doesn't need traditional NLP techniques and allows us to fine-tune it with a relatively small dataset. That said, I am pretty sure OpenAI is using datasets bigger than this themselves.

    • @shanesteven4578 • 1 year ago

      @samwitteveenai Thank you Sam.

  • @AndreYaniv1 • 1 year ago

    I want to use the Raiders of the Lost Kek dataset to see how GPT4All would be uncensored; how do I go about this?

    • @samwitteveenai • 1 year ago

      Make your dataset, then use the fine-tuning Colab and video.

  • @microgamawave • 1 year ago

    Can we use a bigger model for fine-tuning?

    • @samwitteveenai • 1 year ago +1

      Yes you can, but the challenge becomes being able to fit the model in the VRAM of your GPU card; this is where multiple cards come in. If you want to try something bigger, you can try the T5 and Flan models, which go up to 11B and 20B in size.
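
      As a rough rule of thumb (mine, not from the video), the weights alone need parameter count times bytes per parameter, ignoring activations and KV cache:

      def weight_vram_gb(params_billions: float, bytes_per_param: int) -> float:
          """Lower-bound VRAM for just the weights, in GiB."""
          return params_billions * 1e9 * bytes_per_param / 1024**3

      for name, b in [("7B", 7), ("13B", 13), ("30B", 30), ("65B", 65)]:
          print(name,
                f"fp16 ~{weight_vram_gb(b, 2):.0f} GB",
                f"int8 ~{weight_vram_gb(b, 1):.0f} GB")
      # A 30B model in fp16 is ~56 GB of weights, which is why it won't fit
      # on a single 24 GB card; 8-bit (~28 GB) can be split across two.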

  • @tray84 • 1 year ago

    Can I train it on premade datasets, like Wikipedia for example?

    • @samwitteveenai • 1 year ago

      Yes, but you will probably need to do some preprocessing to get the best results. That said, most of these big models will have had Wikipedia in their training data already.

  • @eduardmart1237 • 1 year ago

    Is it possible just to add a lot of data, without "instruction"/"instances" examples?

    • @samwitteveenai • 1 year ago

      Yes, but you will want to think about how it would be used/conditioned in the model. Doing this, it will just predict the next word/token, so how would you want it to generate what you are after?

  • @Shabasky1 • 1 year ago

    I want to make a custom dataset for k8s-specific questions. How do I make sure that the AI does code blocks?

    • @samwitteveenai • 1 year ago +1

      The code blocks in other models are often wrapped in a special token or three backticks, etc. You could do it like that, but it's probably better to use a model more focused on code pretraining.
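
      A sketch of what one such record could look like, with the answer wrapped in triple backticks as suggested above (the YAML content is just an illustration):

      import json

      record = {
          "instruction": "Write a Kubernetes manifest for an nginx Deployment with 3 replicas.",
          "input": "",
          "output": (
              "```yaml\n"
              "apiVersion: apps/v1\n"
              "kind: Deployment\n"
              "metadata:\n"
              "  name: nginx\n"
              "spec:\n"
              "  replicas: 3\n"
              "  selector:\n"
              "    matchLabels: {app: nginx}\n"
              "  template:\n"
              "    metadata:\n"
              "      labels: {app: nginx}\n"
              "    spec:\n"
              "      containers:\n"
              "      - name: nginx\n"
              "        image: nginx:1.25\n"
              "```"
          ),
      }
      print(json.dumps(record, indent=2))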

    • @Shabasky1 • 1 year ago

      @samwitteveenai Nice, ok. What models do you recommend for code?

  • @Xenon0000000 • 1 year ago

    What happens if you use a dataset in a different language?

    • @samwitteveenai • 1 year ago +1

      It seems it will work for some languages, but probably not as well as English. Someone translated the dataset into Portuguese and apparently it worked, so it's worth a try.

  • @biiigdaaaddy • 1 year ago

    Do you know how the Stanford folks avoid the legal issue? OpenAI clearly says NO training using GPT output in their terms and conditions.

    • @samwitteveenai • 1 year ago +1

      This is a good point. Perhaps because they are Stanford. :D

    • @bandui4021 • 1 year ago

      @samwitteveenai FTX CEO Sam Bankman-Fried's parents, Joseph Bankman and Barbara Fried, are both professors at Stanford University. So take a guess whether he gets a penalty :))))).

    • @EcommerceGrowthHacker • 1 year ago

      @samwitteveenai They clearly say in the Alpaca license terms that commercial use is not allowed, for three reasons. The first reason is LLaMA's own non-commercial-use license, and the second one is OpenAI's clause that prohibits using their models to train other models that compete with their services.

    • @nikk6489 • 1 year ago

      @EcommerceGrowthHacker Then what are we gonna do? Any suggestions? :)

  • @kimie126 • 1 year ago +2

    Hi, newbie here. David Shapiro says Alpaca 7B is an important breakthrough in generative AI; that's when I found your video.
    Can you explain the difference between training the model vs. normal fine-tuning? I see the processes are quite the same, where you feed it a lot of data.
    Thank you. 🙏

    • @GyroO7 • 1 year ago

      If you mean the difference between fine-tuning the model and LoRA:
      LoRAs are trained weights that get inserted into the original model at runtime to get the behavior that you want,
      whilst fine-tuning trains and changes the weights of the whole original model.

  • @ikjb8561 • 1 year ago

    Is there a way to do this without OpenAI?

    • @samwitteveenai • 1 year ago +1

      You can use an open-source model, but the results probably won't be as good.

    • @ikjb8561 • 1 year ago

      @samwitteveenai For sure! Having said that, the community should rally around open source and make it better than OpenAI.

    • @nikk6489 • 1 year ago

      @samwitteveenai Any video, tutorial, or link for the same? Thanks.