2:20 “only 681B parameters”, what a time to be alive....
Superficial comment, but when you record your audio, put a couch in the room or throw up some sound blankets. I can hear the room much more than I should. I only comment this because I love what you put out and there’s an easy win on audio quality. Your mic is decent, the room needs some work.
I'm in a temporary location while traveling; I'll be back in my studio with proper audio dampening soon, but thanks for the tip!
What’s most interesting about frankenmerging isn’t necessarily scaling beyond scaling laws (though stuff like this is great fun, too!). Rather, I notice that in creative writing a lot of people report great results with smaller models, for whatever reason. It might be that the extra parameters act like a longer “barrel” on a shotgun, limiting the spread of possible outputs, or it might be that smaller models can receive more training relative to their size (there could be important changes like grokking that only occur in extreme training scenarios), but regardless, I’ve seen multiple subjective reports like that.
With that in mind, I think frankenmerges might have more success with dedicated fine-tunes, continued pre-training, and the like on smaller models (in the 7-13B range).
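For anyone unfamiliar with the technique, a frankenmerge basically stacks duplicated ranges of a model's decoder layers into a deeper network. Here's a minimal sketch of the idea, assuming a LLaMA-style architecture where the blocks live in `model.model.layers`; the checkpoint name and layer ranges are placeholders, and real merges are normally done with a tool like mergekit (which also handles details this sketch ignores, like per-layer cache indices):

```python
# Minimal frankenmerge sketch: stack overlapping slices of the decoder stack
# to build a deeper model. Assumes a LLaMA-style architecture; the checkpoint
# name and layer ranges below are placeholders, not a recipe.
import copy
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("some-7b-base", torch_dtype=torch.bfloat16)

# e.g. layers 0-23 followed by layers 8-31 turn a 32-layer model into 48 layers
slices = [(0, 24), (8, 32)]
new_layers = torch.nn.ModuleList()
for start, end in slices:
    for layer in base.model.layers[start:end]:
        new_layers.append(copy.deepcopy(layer))

base.model.layers = new_layers
base.config.num_hidden_layers = len(new_layers)
print(f"frankenmerged depth: {len(new_layers)} decoder layers")
```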
No way I could even run this with 30 3090s 😂
Not quite!
You can run a 4-bit quant of this on 20 3090s though
I would not be surprised if Zuck publishes a 1T+ parameter open-source model, since he will have 100k H100s available for training very soon... 😁
The H100s are delayed by Nvidia, by the way. Not only for Meta; it seems they're having production issues in general...
7:50 ollama is just upgrading their 405B quants rn...
I was able to run the 405B GGUF on CPU; it requires just 465GB of RAM at the best Q8 quality (which I have across 12 slots on a server board). The biggest and slowest problem was loading all 465GB from an ordinary HDD (~700Mb/s transfer speed); that takes about 30 minutes, but once it's in RAM it runs at about 1 token/s, and subsequent loads are much faster. Unfortunately my only 512GB SSD is busy at the moment, so the hard drive is the only option for now, but also the slowest.
Time to upgrade to NVME!
@@aifluxchannel this is why I'm considering PCIe NVMe, which would bump read speeds to 30,000+ MB/s, but the CPU will still be the bottleneck... 1 t/s? Really? Better to just use Grok 🥲
@@gileneusz terabyte SSDs are unaffordable at current prices, especially NVMe. New hard drives are really great for keeping a dozen models around, but SATA is a bottleneck.
@@fontenbleau if you RAID them, you can get better speeds even with SATA
@@gileneusz I see the only solution in distributed computation; there are some amateur projects on GitHub which support GGUF processing at scale, where I could combine 2 home servers with CPUs + GPUs.
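Rough back-of-the-envelope math for the numbers in this thread. The bits-per-weight figures and drive speeds below are approximate assumptions (not exact llama.cpp accounting), but they land close to the ~465GB / ~30 minute HDD load described above:

```python
# Approximate GGUF size and cold-load time for a 405B model under different
# quants and drive speeds. Bits-per-weight are rough (Q8_0 ~8.5 bits incl.
# block scales, Q4_K_M ~4.8); real runs also need extra RAM for the KV cache.
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # 1e9 params cancels 1e9 bytes/GB

def load_minutes(size_gb: float, read_mb_per_s: float) -> float:
    return size_gb * 1000 / read_mb_per_s / 60

for quant, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    size = gguf_size_gb(405, bits)
    hdd = load_minutes(size, 250)     # typical HDD sequential read
    nvme = load_minutes(size, 7000)   # single PCIe 4.0 NVMe drive
    print(f"{quant}: ~{size:.0f} GB on disk, HDD ~{hdd:.0f} min, NVMe ~{nvme:.1f} min")
```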
1T model is no joke
I don't think we'll see GPUs with lots of VRAM any time soon, because the server RAM I bought was overheating to 90°C at idle. The stamped warning "Danger. Surface hot" is no joke; I've managed to get it down to 60°C.
What a good time to make and sell memory
Very good time for SK Hynix!
not the first 1T+ parameter model
It's a merge, so likely done before. What model are you referring to?
@@aifluxchannel GPT-4 is estimated to have roughly 1.8 trillion parameters. Source: George Hotz, and confirmed by others.
High temps may have led to exploring the mini version.
Can I run like a trillion models while hopping like a Llama? You bet your Strawberry I can, like it's Automatic!!!!(11111)
hahaha
Everyone keeps talking about how much better every new LLM release is, but when you use them they all feel the same, especially for coding. I only noticed a real difference between GPT-3.5 and GPT-4.
Everyone has different use-cases; for me, I gauge how much these models can accomplish in a single shot and how much (valid) code they can produce in just a few prompts. What do you generally use these models for?
@@aifluxchannel I use Perplexity and play around with the latest models for coding tasks; for summarizing YouTube videos or text I use local models, but as I said, I stopped noticing differences after GPT-4.
Not even with Claude 3.5?
@@robd7724 true, unnerfed Claude 3.5 is definitely better
@@Zale370 yeah, even Gemma 2 2B is good for summarizing text. On llama.cpp, I get 100 t/s on an RTX 3060 Ti.
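For what it's worth, local summarization like that is only a few lines with llama-cpp-python. A minimal sketch; the GGUF filename and input file are placeholders, and `n_gpu_layers=-1` just offloads every layer to the GPU:

```python
# Minimal local-summarization sketch with llama-cpp-python and a small GGUF
# (e.g. a Gemma 2 2B instruct quant). Filename and settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-2b-it-Q4_K_M.gguf",  # hypothetical local filename
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to the GPU; a 2B quant fits on a 3060 Ti
)

text = open("transcript.txt", encoding="utf-8").read()
resp = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": f"Summarize the following in five bullet points:\n\n{text}"}],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])
```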
256Gb DDR5 ECC*4+1
Maybe enough ;)
Heugdhdhd
GG
Can you review SicariusSicariiStuff/2B_or_not_2B ? Many ppl hyped the fk out of this model
🤗
What is this model intended for?
@@aifluxchannel It's a new model designed for use on edge devices, such as smartphones, with a focus on uncensored creative writing and general assistant tasks. This model is highly compliant with user requests, unlike the base Gemma it was built on, which was quite disappointing due to its tendency to moralize and refuse even trivial queries.
Please feel free to test it at your convenience, and do not hesitate to reach out if you have any further questions or require additional information.