Thank you for sharing the video!
I believe an important point that may have been overlooked in the context of model merging is the necessity for the models to remain within the same optimization basin.
For instance, if the models are fine-tuned for too long and diverge significantly, weight averaging could result in a collapse in performance instead of an improvement.
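To make the point concrete, here is a minimal sketch of uniform weight averaging between two fine-tunes of the same base checkpoint. The function name and checkpoint paths are illustrative, and the caveat above applies: if the two models have drifted into different loss basins, the averaged model can underperform both.

```python
# Minimal sketch of uniform weight averaging ("model soup") between two
# fine-tunes of the same base model. Assumes both checkpoints share the
# same architecture and parameter names.
import torch

def average_state_dicts(state_dict_a, state_dict_b, alpha=0.5):
    """Return a new state dict that interpolates between two checkpoints."""
    merged = {}
    for name, tensor_a in state_dict_a.items():
        tensor_b = state_dict_b[name]
        if tensor_a.dtype.is_floating_point:
            merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            # Integer buffers (e.g. position ids) are copied, not averaged.
            merged[name] = tensor_a.clone()
    return merged

# Usage (hypothetical checkpoint paths):
# sd_a = torch.load("finetune_math.pt", map_location="cpu")
# sd_b = torch.load("finetune_code.pt", map_location="cpu")
# model.load_state_dict(average_state_dicts(sd_a, sd_b, alpha=0.5))
```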
Thanks Julien, another good video explaining model merging strategies. It just blew my mind when I heard Maxime Labonne talk about it at a conference. I am guessing the hyperscalers and NVDA are not hyping up this technique since there is no need for accelerated compute. :) Is this still research? Have you seen a practical implementation of it? Why are SLMs more hyped than merging LLMs?
Thank you for responding.
Merging is still an active research field, but great production models are built with it, such as Google Gemma 2 and of course the Arcee models. Merging and SLMs are a great fit because we have so many models to choose from. LLMs are much, much more expensive to build...
That was great and it helped me so much! Is there any possibility of getting the presentation slides?
Hi, you can find the slides on Slideshare at fr.slideshare.net/slideshow/julien-simon-deep-dive-model-merging/270921708
Great video, thank you! What I didn't quite grasp is this: let's say I'm merging two models, one trained on maths and the other trained on coding. Do we expect the merged model to perform at a high level on both tasks?
Yes, that's the expectation :)
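As an illustration of how two specializations can be combined, here is a hedged sketch of task arithmetic: each fine-tune's delta from the shared base checkpoint (its "task vector") is added back onto the base. The scaling factor `lam` is a hypothetical knob you would tune on held-out data; this is not the mergekit API, just the idea in plain PyTorch.

```python
# Illustrative sketch of task arithmetic for two fine-tunes (e.g. math
# and code) that share the same base checkpoint.
import torch

def task_arithmetic_merge(base_sd, math_sd, code_sd, lam=0.5):
    merged = {}
    for name, base_w in base_sd.items():
        if base_w.dtype.is_floating_point:
            delta_math = math_sd[name] - base_w   # math task vector
            delta_code = code_sd[name] - base_w   # code task vector
            merged[name] = base_w + lam * (delta_math + delta_code)
        else:
            merged[name] = base_w.clone()
    return merged
```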
I can't help wondering if there is an experiment that really explores these techniques fully, like applying them to all kinds of models or combining different methods together.
Check out arcee.ai, their platform is definitely going that way.
@juliensimonfr Thanks for your answer!! I've found some interesting blogs about it!
It wasn't clear to me whether these methods first do some kind of sorting by weight and connectivity similarity across layers. I can imagine that when merging models that were fine-tuned from the same base checkpoint, we can proceed without sorting. But if we trained two models from different random initializations, we would need to sort them by similarity first.
In any case, has there been any research into this?
Not sure what you mean by sorting. Most methods require that merged models share the same base architecture. Frankenmerging is different and you need to pick which layers come from which model.
@juliensimonfr Sorry, I meant that not only should they share the same architecture, but they should also share the same initial pretraining and weights.
Two models with the same architecture but trained from scratch with different initial randomized weights would not merge very well.
Unless, that is, some analysis is done on the layers of the two models to find similar weight and connectivity distributions scattered across different parts of each (assuming the patterns found are the same or very similar), and they are then somehow ordered by similarity before merging.
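For readers curious about this "sorting" idea, below is a toy sketch in the spirit of permutation-matching approaches such as Git Re-Basin, applied to a single hidden layer: units of model B are matched to the most similar units of model A with the Hungarian algorithm, permuted, and then averaged. All names and shapes are illustrative, and real models require the permutation to be propagated consistently through every layer.

```python
# Toy sketch of neuron permutation matching before averaging, for one
# hidden layer of an MLP with weights w1 (into the layer) and w2 (out of it).
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_average_hidden_layer(w1_a, w1_b, w2_a, w2_b):
    # w1_*: (hidden, in) weights into the hidden layer
    # w2_*: (out, hidden) weights out of the hidden layer
    similarity = w1_a @ w1_b.T                              # (hidden_a, hidden_b)
    _, perm = linear_sum_assignment(-similarity)            # maximize similarity
    w1_b_aligned = w1_b[perm, :]                            # reorder B's hidden units
    w2_b_aligned = w2_b[:, perm]                            # apply same permutation downstream
    w1_merged = 0.5 * (w1_a + w1_b_aligned)
    w2_merged = 0.5 * (w2_a + w2_b_aligned)
    return w1_merged, w2_merged
```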
@juliensimonfr The question came up because I'm currently training two models from scratch, with different initial randomized weights, on the same X data but different Y data each, and I'm curious about any research done on merging these two into a single model with both Ys as multi-headed outputs.
Thank you for this video. I gotta give this a try 🙂
You're welcome, and yes, you should :)
Thanks for the fantastic video. Loved how you simplified almost all the methods to merge the models!
Glad it was helpful!
Super interesting, Julien, thanks a lot for sharing.
Glad you enjoyed it
Hey @Julien, great video. I have a question regarding the scale factor in the TIES method. How do we determine the scale factor?
Thank you. It's up to you, depending on how much you want to "influence" the base model. mergekit has a parameter called 'density': the fraction of weights in differences from the base model to retain. Example at github.com/arcee-ai/mergekit/blob/edd3817e4a470c7a959ef4c505f52a650a46ff07/examples/ties.yml
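To show where `density` and the scale factor fit, here is a simplified, illustrative sketch of a TIES-style merge for a single tensor: trim each task vector to its largest-magnitude `density` fraction, elect a majority sign per parameter, average only the agreeing deltas, and scale the result. This is a sketch of the idea under those assumptions, not mergekit's actual implementation.

```python
# Simplified TIES-style merge of one weight tensor across several fine-tunes.
import torch

def trim(delta, density):
    """Zero out all but the top `density` fraction of entries by magnitude."""
    k = max(1, int(density * delta.numel()))
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    return torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))

def ties_merge_tensor(base_w, task_ws, density=0.5, scale=1.0):
    deltas = [trim(w - base_w, density) for w in task_ws]
    stacked = torch.stack(deltas)                   # (num_models, ...)
    elected_sign = torch.sign(stacked.sum(dim=0))   # majority sign per parameter
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    total = (stacked * agree).sum(dim=0)            # sum of agreeing deltas only
    count = agree.sum(dim=0).clamp(min=1)
    return base_w + scale * total / count           # scaled disjoint mean
```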
This is a random comment to boost your channel. Thank you.
LOL, thank you.
Nice.
Thanks!