Thanks for the video, really interesting. I am fine-tuning code LLMs and this was helpful. As an aside, I find it strange that token prediction is able to do this, since the LLM needs to "plan" that a function will be needed, declare it, and then use it in exactly the right way.
Can you add a video on fine-tuning StarCoder for auto-completion rather than instructions, like they show in their repo?
@code_your_own_AI - Hi - are you going to finish the Colab and show the results to prove it works? And provide a link to it? Thank you!
Amazing video, thank you for sharing all of this knowledge with us! These fine-tuned generative code systems can be the next level of programming.
Your explanation is awesome!! Thanks for sharing these videos
You are welcome.
@code_your_own_AI - Thank you so much for this great video! I am in the middle of preparing the dataset for fine-tuning and wanted to refer to the dataset you had to prepare and preprocess for your PyTorch code assistant. It would really help me to understand whether there is any character limit that I should be aware of while creating the dataset.
I also wanted to know the format in which 'Instruction', 'Input', and 'Output' should be presented: will it be JSONL, .txt, or .csv? Also, for using StarCoder, is there a character/token limit that I need to follow for each example?
Did you get a reply? Looking to do the same.
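Not confirmed by the video, but as a minimal sketch of one common convention: instruction-tuning examples are often stored as JSONL, one JSON object per line. The lowercase field names below simply mirror the 'Instruction', 'Input', 'Output' naming from the question, and the two records are made-up placeholders, not the dataset used for the PyTorch code assistant. Any character or token limit depends on the model and training setup and is not covered here.

```python
import json

# Hypothetical examples; the keys mirror the 'Instruction' / 'Input' / 'Output'
# naming from the question and are an assumption, not a confirmed schema.
records = [
    {
        "instruction": "Write a PyTorch function that returns the mean of a tensor.",
        "input": "",
        "output": "def tensor_mean(t):\n    return t.mean()",
    },
    {
        "instruction": "Explain what this line does.",
        "input": "x = x.view(-1, 10)",
        "output": "It reshapes x so that the last dimension has size 10.",
    },
]


def to_jsonl(rows):
    """Serialize a list of dicts to JSONL text: one JSON object per line."""
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return "\n".join(json.dumps(row) for row in rows) + "\n"


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
round_trip = from_jsonl(jsonl_text)
```

JSONL keeps multi-line code in the output field intact because newlines are escaped inside each JSON string, which is harder to guarantee with plain .txt or .csv.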
Thank you for the video! Can this be used to fine-tune for other languages such as Lisp or Haskell? Or should I pre-train a new model from scratch?
Wow :D What a cliff hanger at the end^^
Sir, I just wanted to know: can I fine-tune with predefined code and its prompts for better code generation? If yes, how should I proceed?
I don't see any links in the description....
Yes Please
Can I do a code refactoring task with StarCoder? How can I prepare my dataset for that task?
Nice video! Is the outlier in part 3?
of course!
Will this work on M2 CPU?
I see more and more solutions for Apple silicon on Reddit and Hacker News, but I have no empirical data on stability or performance.