Beautiful..Beautiful..with lots of learning..u learnt the art
Thanks for this! I watched Sam's video and was starting to figure out how to use it for flan-ul2 but got very confused with the modules mapping. This video really helped! Only have 23 minutes left of Google Colab Pro A100 access but at least I got it running before I ran out of time! Next I can play with smaller models till I get my gpu time next month. Currently Flan-UL2 is using 35.6 GB of gpu space so just fits!
Great video, please share the notebook in the video
Fantastic tutorial!
Thank you very much for this wonderful explanation .
Big help. Thank U for sharing
Very good callout about not just blindly trusting any model or adapters you find on Huggingface. Much better to do the training yourself, so you have control over the data it was trained on.
Edit: well, for the stuff you can realistically do yourself. I know *I'm* not planning on training GPT-4 from scratch 😂
Right on!
The colab in the video description is different than the one in the video. Could you please share the colab in the video?
Very good video. Thanks for sharing it with us.
Glad you enjoyed it
Can I please get the link for the original Colab that was used in the video?
Nice tutorial, pls share notebook.
Can you walk through the differences between the big LLM’s for example the dataset they were trained on? If I want to fine tune a model I’d like to understand more about the base model to ensure matching and I’m not sure the differences based on model card.
Currently I counted more than 140 LLM variations and sizes. The best way I currently use is simply to read the accompanying scientific publication to understand anything about the specific model. There you have between 20 to 147 pages per model.
Llama was trained on more tokens than the open source alternatives like gpt-j, that's why it's an appealing model to fine tune.
The number of tokens a model was trained on is not indicative of its performance, just of the amount of input.
@@code4AI Older models like GPT-J or BLOOM were undertrained. One of the findings of the paper is that model performance keeps improving past the 1T token count. In short, LLaMA was trained for more epochs. And yes, that does correlate with better performance.
So it is the number of epochs for you?
@@code4AI not sure I understand. :) but basically token count for training and number of epochs a model is trained on are both measures essentially of how much pre training has been done.
you are absolutely right. It is an indication of how much pre-training has been done, but tells you nothing, if you compare different LLM architectures, about the quality of pre-training. Just more (like epochs or input tokens) does not mean better - if you compare different smart transformer, GPT and RL architectures.
Is it possible to give advice to someone who wants to train a model, but his level is beginner in programming? I spent a week trying to understand what is going on, but every time I delve deeper into scientific papers
I wrote about it in the community tab. great question.
Sir, what is hyperparameter "alpha" for LoraConfig? How do we comprehend "The hyperparameter used for scaling the LoRA reparametrization."? Thank you sir.
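As a rough illustration of what "alpha" does: in standard LoRA the low-rank update B·A is multiplied by `lora_alpha / r` before it is added to the frozen weight. The sketch below uses plain Python lists instead of PEFT; the helper names (`matmul`, `lora_delta`) and the tiny matrices are made up for the example and are not taken from the video.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A, lora_alpha, r):
    """Return the scaled low-rank update (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    return [[scale * value for value in row] for row in matmul(B, A)]


# Toy adapter of rank r = 2 for a 3x3 weight matrix (values are arbitrary).
r = 2
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (3, r)
A = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.0]]    # shape (r, 3)

delta_alpha_2 = lora_delta(B, A, lora_alpha=2, r=r)    # scale = 2 / 2 = 1.0
delta_alpha_16 = lora_delta(B, A, lora_alpha=16, r=r)  # scale = 16 / 2 = 8.0
```

So with `r` fixed, a larger `lora_alpha` makes the adapter's contribution to the weights proportionally larger, which is why it is often described as behaving a bit like a multiplier on the adapter's learning rate.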
Hello mate, I lovvveedd your tutorial series ❤
I have a question: I am trying to fine-tune "GPT-J" on my private data. I have multiple documents, all in raw text. So, as the example goes, we will convert them into a Hugging Face dataset and then train the model.
My doubt is:
I mean, during the training, how should I structure my prompt?
Should I just give the raw text as-is?
or
I should do some prompt engineering like: Context:{} Question:{} Answer:{} to the model?
Will you please shed some light on this?
Thank you very much!
When you fine-tune (!) an LLM like GPT-J on your data, you need to work with a DataCollator, like in my code. For further details see here: huggingface.co/docs/transformers/main_classes/data_collator#data-collator
I would recommend you use the following DC_for_LM: huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForLanguageModeling
What you mention in the second half of your comment as a prompt is ICL: in-context-learning.
1- With ICL you do not change the weights of your transformer model (LLM), with
2- fine-tuning you change all weights of your transformer model and
3- with adapter-tuning you insert additional trainable tensors in your transformer.
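To make the collator step a bit more concrete, here is a simplified pure-Python sketch of what a causal-LM collator (such as `DataCollatorForLanguageModeling` with `mlm=False`) roughly does to a batch: pad the sequences to equal length and copy the input ids into the labels, with padded positions set to -100 so the loss skips them. The function name `collate_for_causal_lm`, the pad id 0, and the token ids are assumptions for illustration; this is not the library's actual code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_for_causal_lm(examples, pad_token_id=0):
    """Pad tokenized examples to one length and build labels for causal LM training."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; padding gets IGNORE_INDEX so it adds no loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch


# Two already-tokenized examples of different length (ids are arbitrary).
batch = collate_for_causal_lm([[11, 12, 13], [21, 22]])
```

The point is that for plain causal language modeling the raw text itself is the training target, so no separate label column is needed.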
@@code4AI Thank you for your response. I have followed your video on the difference between "pre-training, fine-tuning and ICL". When I use ICL for question answering, the prompt often becomes so big that it exceeds the total token limit of the model. I also get CUDA out-of-memory errors for the bigger prompts.
For that reason I was thinking to fine tune the model for the QA task for my private documents.
Now, in your case you are showcasing the Quotes dataset. So that is more of a "completion" task. While I am looking for the Question Answering task (generative question answering) so I think the way we prepare our dataset will be different (because the task is QA) than the quote completion one (from the video).
So, as I was asking: in what format should I prepare my dataset for question answering? Is it okay to just pass the text documents as they are, or is some different formatting required?
Will you please shed some light on this? I really appreciate your work and the way you explain the process ♥
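One common way to prepare data for generative QA fine-tuning (an assumption here, not something confirmed in the video) is to turn every example into a single text string with a fixed template, such as the Context/Question/Answer layout mentioned earlier in this thread, and to reuse the same template without the answer at inference time. The function names, the sample record, and the choice of `<|endoftext|>` as the end-of-sequence string are hypothetical; in practice the end-of-sequence string would come from the tokenizer.

```python
TEMPLATE = "Context: {context}\nQuestion: {question}\nAnswer:"


def build_prompt(context, question):
    """Prompt used both at inference time and as the prefix of each training text."""
    return TEMPLATE.format(context=context.strip(), question=question.strip())


def build_training_text(record, eos_token=""):
    """Full training string: prompt, then the answer, then an end-of-sequence marker."""
    prompt = build_prompt(record["context"], record["question"])
    return prompt + " " + record["answer"].strip() + eos_token


# Hypothetical record; real data would come from the private documents.
record = {
    "context": "The warranty lasts two years.",
    "question": "How long is the warranty?",
    "answer": "Two years.",
}
text = build_training_text(record, eos_token="<|endoftext|>")
```

Passing raw documents as-is teaches the model the wording of the documents, while a template like this teaches it the question-then-answer pattern, so which one fits depends on what the model should do afterwards.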
Very informative!!!! Does fine-tuning with QLoRA/LoRA support this kind of dataset? If not, what changes should I make in my output dataset?
Review(col1)
Nice cell phone, big screen, plenty of storage. Stylus pen works well.
Analysis(col2)
[{"segment": "Nice cell phone", "Aspect": "Cell phone", "Aspect Category": "Overall satisfaction", "sentiment": "positive"}, {"segment": "big screen", "Aspect": "Screen", "Aspect Category": "Design", "sentiment": "positive"}, {"segment": "plenty of storage", "Aspect": "Storage", "Aspect Category": "Features", "sentiment": "positive"}, {"segment": "Stylus pen works well", "Aspect": "Stylus pen", "Aspect Category": "Features", "sentiment": "positive"}]
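LoRA and QLoRA only change which weights get trained, so the column layout matters less than ending up with one text string per row that the tokenizer can consume. A minimal sketch, assuming a causal LM and a made-up prompt layout: serialize the Analysis column to a JSON string and join it with the Review text. The name `row_to_text` and the `### Review:` / `### Analysis:` markers are hypothetical, and the analysis list is shortened to two segments.

```python
import json


def row_to_text(review, analysis):
    """Join one (review, analysis) row into a single training string."""
    target = json.dumps(analysis, ensure_ascii=False)  # structured column -> JSON text
    return f"### Review:\n{review}\n### Analysis:\n{target}"


review = "Nice cell phone, big screen, plenty of storage. Stylus pen works well."
analysis = [
    {"segment": "Nice cell phone", "Aspect": "Cell phone",
     "Aspect Category": "Overall satisfaction", "sentiment": "positive"},
    {"segment": "big screen", "Aspect": "Screen",
     "Aspect Category": "Design", "sentiment": "positive"},
]

text = row_to_text(review, analysis)

# Round-trip check: the JSON part of the string can be parsed back.
parsed_back = json.loads(text.split("### Analysis:\n", 1)[1])
```

Keeping the target as valid JSON (straight quotes, consistent keys) makes it possible to parse the model's generated output the same way after fine-tuning.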
Hi,
Thanks for the video. This video is very helpful. I was trying to execute the notebook given in the description. I am facing session crash issue in the free tier of google colab while loading the shards. Can you please help me?
Hello, this training took around 36 minutes. If we had more steps or more data, it would take more. I have seen some "accelerator" term. Does it increase the training speed? Should we use that? Any guidance on that? Please share, thanks.
In one of my next videos I'll use Huggingface Accelerate in my code ... Good point!
@@code4AI So, it does speed up the process, doesn't it?
as always, it depends. If you optimize it for your distributed hardware infrastructure, and you use the optimal data handling, and you choose the right compiler, and your code is really optimized for the task at hand, then you could be in for a surprise.
@@code4AI If possible would you please cover the "Generative Question Answering" on the private documents? So that covers fine tuning + accelerate! Thanks 💖
like like like like like like like like like like like like like like like like like