Appreciate the illustration of the difference between base and chat models.
Thanks! Yeah, I was thinking maybe it was overkill, but then felt it was important.
Excellent and well presented. Different from other fine-tuning tutorials. I appreciate that you chose an unfamiliar topic, "Touch Rugby", that the model has no knowledge about; it's interesting to see how it progresses. Great job.
Cheers thanks
Going to an interview right now. This has been really helpful for remembering and learning the logic. Thank you so much❤
Keep up the good tutorials. You are doing a really good job.
Enjoyed your series of tutorials. Thanks!
I just wanted to let you know that these videos are really fantastic compared to many of the other ones I've seen. I really appreciate it!
awesome job explaining everything in extra detail
From other sources I hear that fine-tuning is used not for teaching knowledge, but for improving process following or output format (if prompting is not giving satisfactory results). They suggest using RAG for knowledge improvement. What are your thoughts on this?
Yes, Retrieval Augmented Generation (RAG) is one way to go (as in the first part of the video). The other is fine-tuning (parts 2 and 3). As you can see in this series, neither RAG nor fine-tuning is perfect... Depending on the use case, one or the other may be better, or sometimes both in combination. I would recommend always trying RAG first and only then considering fine-tuning.
Thanks for this video as I keep coming back to it. Here is a scenario.
If we have a working RAG setup with an open-source model (let's say Gemma-7B) and a fixed corpus (let's say a PDF book), would it be a good choice to replace the open-source model with a version fine-tuned on my corpus?
I guess it should work better than simple RAG. But now I have a second question. For fine-tuning such a model, should I have a dataset with three columns (Chunk-of-Text, Question, Answer)? Or should I have two columns (Prompt, Response), with the prompt including the chunk of text as context as well?
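To make the two layouts concrete, here is a purely illustrative sketch; the column names and the rule text are invented for this example, not taken from the video:

```python
# Option A: three columns, keeping the source chunk separate.
row_three_col = {
    "chunk_of_text": "Rule 7.1 (illustrative): each team fields six players.",
    "question": "How many players does each team field?",
    "answer": "Six players per team.",
}

# Option B: two columns, folding the chunk into the prompt as context.
# Most (prompt, response) trainers expect this flattened layout.
row_two_col = {
    "prompt": (
        "Context: " + row_three_col["chunk_of_text"] + "\n"
        "Question: " + row_three_col["question"]
    ),
    "response": row_three_col["answer"],
}
print(row_two_col["prompt"])
```

Either layout carries the same information; Option B just pre-bakes the prompt string that the trainer will see.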
Well done!
Purchased the notebook, hopefully it will help support the channel.
Quick question: in "prepare_dataset", response_lengths and input_lengths are lists of ints, which then gives an error in TextDataset's __getitem__.
This is because the "idx" parameter in __getitem__ is a list (of size batch_size), not an int, so
self.input_lengths[idx]
gives """TypeError: list indices must be integers or slices, not list""".
Even with batch_size=1, the "idx" parameter is a list with one element and I still get the same error. Am I doing something wrong?
Or should input_lengths be turned into tensors to support a list idx? Thank you in advance.
Yes, another user has this issue too. Oddly, when I run the notebook, idx is an int, not a list.
That said, can you try this fix:
```
def __getitem__(self, idx):
    # print(f"__getitem__ called with index {idx}")
    if isinstance(idx, int):
        item = {key: torch.tensor(val[idx]).clone().detach()
                for key, val in self.encodings.items()}
        response_start_position = self.input_lengths[idx]
        response_end_position = self.input_lengths[idx] + self.response_lengths[idx]
    elif isinstance(idx, list):
        # Wrap each entry in torch.tensor (as in the int branch) before stacking
        item = {key: torch.stack([torch.tensor(val[i]).clone().detach() for i in idx])
                for key, val in self.encodings.items()}
        response_start_position = [self.input_lengths[i] for i in idx]
        response_end_position = [self.input_lengths[i] + self.response_lengths[i] for i in idx]

    ## Prior code, which assumed idx is an integer, not a list:
    # item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
    # # Calculate the start and end positions of the response
    # response_start_position = self.input_lengths[idx]
    # response_end_position = self.input_lengths[idx] + self.response_lengths[idx]

    # Set labels to be the same as input_ids
    item["labels"] = item["input_ids"].clone()
```
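To sanity-check that kind of int-vs-list handling in isolation, here is a self-contained toy version; the encodings layout and the return values are assumptions for illustration, not the notebook's exact code:

```python
import torch

class ToyTextDataset(torch.utils.data.Dataset):
    """Toy stand-in for a TextDataset: pre-tokenized encodings plus
    per-example prompt (input) and response lengths."""
    def __init__(self, encodings, input_lengths, response_lengths):
        self.encodings = encodings          # dict: key -> list of token-id lists
        self.input_lengths = input_lengths
        self.response_lengths = response_lengths

    def __getitem__(self, idx):
        if isinstance(idx, int):
            item = {key: torch.tensor(val[idx])
                    for key, val in self.encodings.items()}
            start = self.input_lengths[idx]
            end = start + self.response_lengths[idx]
        elif isinstance(idx, list):  # the case that previously raised TypeError
            item = {key: torch.stack([torch.tensor(val[i]) for i in idx])
                    for key, val in self.encodings.items()}
            start = [self.input_lengths[i] for i in idx]
            end = [s + self.response_lengths[i] for s, i in zip(start, idx)]
        item["labels"] = item["input_ids"].clone()
        return item, start, end

    def __len__(self):
        return len(self.input_lengths)

# Two fake examples of 4 tokens each: 2 prompt tokens + 2 response tokens.
ds = ToyTextDataset({"input_ids": [[1, 2, 3, 4], [5, 6, 7, 8]]},
                    input_lengths=[2, 2], response_lengths=[2, 2])
item, start, end = ds[0]          # int index: single example
batch, starts, ends = ds[[0, 1]]  # list index: stacked batch
print(start, end, starts, ends)   # 2 4 [2, 2] [4, 4]
```

Running both branches like this makes it easy to confirm which index type your DataLoader is actually passing.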
Would it be better to use Orca-style prompts for the Q/A dataset?
Yes, using Orca tricks (like including system prompts to think step by step when generating the prompt-response pairs) makes a lot of sense for data preparation.
arxiv.org/pdf/2306.02707.pdf
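As a rough sketch of what that looks like in data prep; the field names and the system prompt text here are illustrative, not taken from the Orca paper or the notebook:

```python
# Sketch of an Orca-style (explanation-tuned) training record: the system
# prompt asks for step-by-step reasoning, so collected responses contain
# the reasoning trace rather than only the final answer.
SYSTEM_PROMPT = (
    "You are a rules expert for a sport. Think step by step and justify "
    "your answer from the rule book before giving a final answer."
)

def make_orca_example(question: str, answer_with_reasoning: str) -> dict:
    # Two-column (prompt, response) layout with the system prompt folded in.
    return {
        "prompt": f"{SYSTEM_PROMPT}\n\nQuestion: {question}",
        "response": answer_with_reasoning,
    }

example = make_orca_example(
    "How many players per side are on the field?",
    "The rule book fixes the on-field team size; substitutes wait off the "
    "field. Therefore the answer is the stated team size.",
)
print(example["prompt"])
```

The point is that the response column stores the explanation plus the answer, so the fine-tuned model learns to reason before answering.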
Great content, really helpful. Quick question: if the model doesn't have access to the whole rule book documents, how could it reason to answer questions other than the ones given in the training data?
Yeah, if the model hasn't been trained on the full rule book, then (unless the base model already has the knowledge) there's no way to answer questions correctly other than those in the training data.
Really informative, your channel should get wider traction.
Thanks, really appreciate that
Great content, appreciated 👏🏼
There are many QLoRA tutorials out there, including some "official" ones, but I haven't seen the mask handling you describe in them; they use SFTTrainer. What you present with the custom trainer makes perfect sense. Does this mean the rest of the tutorials miss the mask handling, or is it baked into SFTTrainer already?
The SFTTrainers are good, and they have various options for masking. The reason I make custom trainers is that I find I always need to manually inspect the tokens. It's hard to inspect both the tokens and the losses in SFTTrainer, which makes it hard to troubleshoot when things go wrong (which, for me, they always do for a bit before I get things working).
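For context, the mask handling under discussion usually amounts to setting prompt-token labels to -100, the index that PyTorch's cross-entropy loss ignores, so loss is computed only on the response tokens. A minimal sketch, with names assumed rather than taken from the notebook:

```python
import torch

IGNORE_INDEX = -100  # targets with this value are skipped by CrossEntropyLoss

def build_masked_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Copy input_ids into labels, then mask the prompt portion so that
    loss is only computed on the response tokens."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels

# 6-token example: a 4-token prompt followed by a 2-token response.
input_ids = torch.tensor([101, 7592, 2088, 102, 2023, 2003])
labels = build_masked_labels(input_ids, prompt_len=4)
print(labels.tolist())  # [-100, -100, -100, -100, 2023, 2003]
```

Printing the labels next to the decoded tokens is exactly the kind of manual inspection that is easier in a custom trainer than inside SFTTrainer.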
@@TrelisResearch
Much appreciated, thanks for the quick response.
@@TheRealRoySadaka btw, the other thing I found hard to do with SFTTrainer is training the model to emit a stop token. One nuance there is that these special tokens often accidentally get tokenized differently depending on what comes before or after them. This is particularly an issue with [INST], which is not in the tokenizer vocab and tends to get tokenized differently depending on what is nearby.
What do you think of using embeddings+supervised fine tuning? Thanks
Going into this, I thought embeddings would be better and more robust. However, even with embeddings there can be inaccuracies, even with GPT-4. Another learning is that everything depends on model strength; you really need a big model to do well with either approach. Broadly, I would think: