If anything, more efficient AIs running on consumer hardware would increase demand, lol. Look up Jevons paradox: the steam engine's efficiency increased the demand for coal, it didn't decrease it.
It just gives large companies more reason to buy more hardware to take advantage of scaling, and gives consumers more reason to buy GPUs, since they can finally take advantage of them.
Coal companies benefitted from the steam engine's efficiency driving up demand; Nvidia will likewise benefit from the efficiency of quantization.
That's very true. We are at a huge deficit of GPUs right now, even for consumers just interested in gaming. Also, what happens when LLMs start penetrating games, with realistic NPCs? It will be the next big thing; demands on VRAM will skyrocket, and it'll become an expected feature in every big game eventually. Having hard-scripted NPCs will be viewed the way we now view Atari games. The main thing stopping this right now is hardware cost; the model performance is already there (open source too).
I completely agree with this line of logic. This just indirectly makes Nvidia GPUs 8-10x faster working _in tandem_ with 1-bit quantized models at scale.
But if the barrier gets low enough, it would just be an IP block in an ARM chip, sold for pennies.
Coal is a natural resource; chips are man-made...
That's true over the long run, but if you can run far more capable AIs locally, there's less incentive or need to upgrade your hardware, at least until the next big thing comes along.
We are in the early days of AI, but once models get good enough for many tasks, having something better becomes far less of an incentive to upgrade when good enough is fine for most people.
But I don't think we are at that stage yet; there's so much potential in AI and much more to come. Still, if I can run a much better model locally than I could, say, a year ago, there's less incentive for me to upgrade.
Great point. They would probably now shift to even bigger models while we get a handle on the smaller ones.
Without understanding the paper, I do understand the quantization, and it's hard for me to believe they could go down to 1 bit without drastically losing quality, considering even 4-bit models are pretty bad compared to full-precision models. Even 4-bit starts to feel like the AI has been given a lobotomy.
As mentioned in the video, the 4-bit quantized models you have used are post-training quantization. Every corresponding weight in the quantized model has the same numerical value as the original, just rounded off to the closest number that can be represented with the lower-precision encoding. This paper instead requires training the model natively as a quantized model; the values of the new weights likely have no correlation with the originals. Essentially, this model is a completely different architecture that is trained with a similar regime to the original model.
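For anyone who wants to see what that round-off looks like, here is a minimal Python sketch of naive round-to-nearest post-training quantization. It only illustrates the rounding idea, not the actual scheme any particular tool uses (real methods like GPTQ or AWQ are considerably smarter); the function name and the single per-tensor scale are my own assumptions.

```python
import numpy as np

def quantize_ptq(weights: np.ndarray, bits: int = 4):
    """Toy round-to-nearest symmetric post-training quantization."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 representable magnitudes for 4-bit
    scale = np.abs(weights).max() / levels    # one scale for the whole tensor (assumption)
    q = np.clip(np.round(weights / scale), -levels, levels).astype(np.int8)
    return q, scale                           # dequantize as q * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_ptq(w, bits=4)
print("max rounding error:", np.abs(w - q * s).max())
```

Every quantized weight is still just the original value snapped to the nearest representable level, which is exactly why quality drops as the bit count shrinks.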
@@GeorgeXian You are on the right track, but it's all a bit more nuanced than this (pun intended). For starters, they need 2 bits to simulate 1 trit (1.58 bits needs to be rounded up), so it's fairer to see this as a 2-bit model. Secondly, they train "quantization aware", which is not the same as training with quantized trits (2-bit) directly. The lowest they can reasonably go during training is 8-bit floats, because backprop using gradient descent falls apart with integers and low-bit numbers. So they basically train two networks side by side and transfer the knowledge from the higher-bit model to the lower-bit model.
@@holthuizenoemoet591 Interesting, in some ways it could actually be more memory intensive for training.
@@GeorgeXian Yes, it's a tradeoff. However, it's still very impressive, because the resulting network has all the benefits you described: sum-only matrix multiplication, lower energy consumption, smaller model size, etc. There are actually two more advantages: no rounding errors, and free sparsity from the zero weights.
Nevertheless, I wouldn't write off Nvidia just yet; they are also pushing AI further with their own research.
Counterintuitively, reducing the number of bits can increase the quality of a neural network, as it is mathematically similar to introducing noise: adding random noise is a common technique to put a network through its paces and make training harder so it learns more. While you are 'giving it a lobotomy' in the sense that you're giving it a hard kick it didn't expect, that kick builds resilience.
Will you be doing more videos in this sort of area on LLM/Machine Learning related research/papers etc? Thanks
I can only try. My background is in software engineering rather than mathematics, so some of the concepts easily go over my head. This paper definitely hit the mark for being understandable to most people with a tech background, yet it had an unexpected conclusion with shocking ramifications for the AI community if it scales to larger models as claimed. Many papers are confirmation studies with nothing interesting to report.
Such a nice explanation of 1-bit large language models.
This reminds me of something Carl Sagan said, "Extraordinary claims require extraordinary evidence". It would be extraordinary if true, but 1 bit? I am sure it will be tested. I feel like 8 bits is the sweet spot, but IDK.
Subscribed. Waiting for that update!
Do you have a link to your video on embedding spaces? It wasn't obvious from a scan of your channel.
I haven't produced it yet. That's why I mentioned in the video that I wished this paper had dropped a couple of weeks later, but I wanted to share my opinion on it quickly.
I've looked into this a lot while attempting to wrap my head around further optimizations of ternary-based neural networks. The problem lies in the fact that you're working with quantized values - i.e., values where it is difficult or close to impossible to represent a vector in a continuous space. When the models are trained, they're essentially using multivariate calculus to *predict* which change in the weights - the vectors - will get the model closer to a local minimum given the input on the training run (and technically this process is batched, which complicates things a bit more by spreading this 'adjustment' across multiple observations, but that's beside the main point). In other words, if the corrections you make to the model at each step are essentially required to 'snap' to a very, very small range of discrete values, the local minimum will be a lot more difficult to find.
Contrast this with a situation in which you're able to use 16-bit or 32-bit floats to represent extremely accurate vectors 'pointing' in highly specific directions in vector space. Any nudges can be very small and represent averages across several training steps in each batch, and those nudges make it far more likely that the model will successfully find a comfortable niche that fits the data well in a generalized way.
After all of the specifics have been worked out in training - i.e., once the training data have been accurately approximated by the model - THEN you can quantize everything, because you've already worked out the details. Without that specificity, there's no guarantee you will ever find the right local minimum. The representation just may not be accurate enough, and it's very difficult to perform the same kinds of continuous calculations on discrete values - at least, not without pulling the values into continuous space, performing the multivariate calculus, and only then converting back into a quantized system. It just doesn't have the level of resolution required for the training 'search' process to find the right fit. At least... that's the theory. ;)
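A toy numeric illustration of that 'snapping' problem (made-up numbers, not from the paper): if the weight must round back to {-1, 0, 1} after every update, small gradient nudges are simply erased, whereas a full-precision master weight accumulates them.

```python
# Made-up numbers, just to illustrate the snapping problem described above.
lr, grad, steps = 0.01, -1.0, 80

w_snapped, w_master = 0.0, 0.0
for _ in range(steps):
    # snap back to the nearest of {-1, 0, +1} after every update
    w_snapped = round(min(max(w_snapped - lr * grad, -1.0), 1.0))
    # keep a full-precision copy and let the small updates accumulate
    w_master = min(max(w_master - lr * grad, -1.0), 1.0)

print(w_snapped)        # prints 0: every 0.01 nudge was rounded away
print(round(w_master))  # prints 1: the accumulated nudges eventually cross the threshold
```

This is essentially why the quantization-aware training described further down the thread keeps a high-precision copy of the weights during training.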
Hey, just wanted to compliment your clarity of presentation. You’ve got the core of a good channel. Clarify the niche you want to go after so I can understand how it helps me out as a viewer, and you’ll skyrocket.
Any tips you have for me in terms of niche? My channel is definitely pivoting toward the AI tech space. I can't say I have fully decided on whether I want to dig deeply in the math or code of AI or explain things at a higher level like an industry analyst. At the moment, my best performing videos have been where I have been an industry analyst.
I heard yesterday about some people selling their stock in Nvidia. Makes you wonder if word of this has gotten around. Between this and Groq inference accelerator cards, who needs Nvidia?
Nvidia is in the same state that Nortel was in during the 1990s; it's only a matter of time until it all comes crashing down.
@@ireallyreallyreallylikethisimg Hardware has always proven a less defensible niche than software and services. You would think it might be the other way around.
NVIDIA is overhyped for sure. They have drawn too much attention to themselves. However, this research for now is really about optimising AI for processing, which benefits those using NVIDIA chips too.
What is likely happening right now is OpenAI retraining GPT-4 with this method, or they have already trained their turbo models this way (closed-source model, so we can't know for sure).
I am keeping an eye on Groq. NVIDIA has a little bit of buddy protection from the major tech companies for now, but if they grow too arrogant, some new dedicated chip will be announced to stab NVIDIA in the back when they least expect it.
@@GeorgeXian Friend, this 1.58-bit paper and others benefit Intel, because Intel is the only GPU company that can process in two bits - not AMD, not NVIDIA.
@@BienestarMutuo Can you elaborate? How would that be faster than larger fixed-point data types on current architectures?
Hey, loved this! Been thinking a lot about how to run these models on local hardware. Could you also cover 'LLM in a Flash', the research paper by Apple addressing this issue?
Link to research paper?
This is really cool! Do you know where the model can be installed and how to run it on ollama?
It's not available yet, it's also only been done in the 3B parameter version. Not a useful model as it stands at the moment. We can hope!
While I'm excited, no one has released a model that uses this paper yet. Hope it happens soon.
Me too!
Great vid. BTW, it HAS been done before; there is a BinaryBERT model that uses trits on the backend.
Isn't BERT just an embedding model?
@@GeorgeXian That doesn't do it justice, but it's not a full generative LLM.
That's a great video. Thank you very much. Let's see what this new method will bring :)
And let's also hope that SLI will make a return :D :D
Totally agree with you. What could also get interesting is that non-batched LLM inference is light on compute and almost totally memory-bandwidth bound. And if you compare an Apple M2 Ultra with its 1024-bit memory bus (and huge RAM) against Nvidia, it does not compare too badly on inference. In prompt processing, however, the 4090 is around 10x faster. If compute can be reduced, a broader memory bus (much cheaper than Nvidia's VRAM) gets very interesting. The reduced size is an additional benefit, because it means fewer transfers from memory. Llama.cpp is already doing great work on SOTA quantization down to 2 bits. I'm looking forward to seeing whether they manage to support the 1.58-bit algorithms (and reduce the math).
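To make the memory-bandwidth point concrete, here is a rough back-of-envelope sketch. My own simplifying assumptions: single-stream decoding, every weight streamed from memory once per token, KV cache and activations ignored, and the 70B / 800 GB/s numbers are purely illustrative.

```python
def tokens_per_second(params_billion: float, bits_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    """Crude upper bound: every weight is read from memory once per generated token."""
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative numbers only: a 70B-parameter model on an ~800 GB/s memory bus.
for bits in (16, 4, 1.58):
    print(f"70B @ {bits:>5} bits: {tokens_per_second(70, bits, 800):6.1f} tok/s ceiling")
```

Under this crude model, going from 16-bit to ternary weights raises the decode-speed ceiling roughly 10x for the same memory bus, which is why a wide bus plus aggressive quantization is such an interesting combination.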
I didn't think state-of-the-art needed its own acronym! You sound knowledgeable on this subject! What's your experience with AI models?
@@GeorgeXian I replied with links and YouTube has hidden my two replies. More explanations there, plus information on how you can already run mixtral8x7B and llama2-70b on your 4090 now.
Yes, but it's not 2-bit, it's ternary, because ternary is self-pruning.
Ternary with FPGA: arxiv.org/pdf/1609.00222.pdf
Trained Ternary Quantization: arxiv.org/pdf/1612.01064.pdf
Binarized: arxiv.org/pdf/1602.02830.pdf
I am trying to wrap my head around this. And yes, the matrix math simplification method makes a lot of sense. But what I can't understand is why NVDA wouldn't also start using these 1-bit LLMs. It seems to be more of a "software" approach, rather than something baked into firmware... so you could take a Blackwell chip, use 1-bit LLMs on it, and have amazing computational power, right?
Nvidia can take advantage of this system, as mentioned in the video. It's just that they can't be faster than an ASIC purpose-built to perform this operation.
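To see why the matrix math gets so much simpler, here is a small sketch (illustrative only, not from the paper's code) of a matrix-vector product where every weight is -1, 0 or +1: each multiply collapses into an add, a subtract, or nothing at all.

```python
import numpy as np

def ternary_matvec(W_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product where W only contains {-1, 0, +1}:
    adds for +1 weights, subtracts for -1 weights, skips the zeros."""
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
x = np.random.randn(8).astype(np.float32)
print(np.allclose(ternary_matvec(W, x), W @ x))   # matches the ordinary matmul
```

Dedicated silicon can exploit this directly with adder trees and skip the multiplier arrays entirely, which is the ASIC advantage mentioned above.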
Gotta love that “price is right” jingle😂
very interesting, thanks for sharing
Stumbled upon this vid on my feed. I am VERY CLEARLY not informed at all on what you're talking about. Any tips to start out on the technical side of what you're talking about? I would say I have decent knowledge in tech all around, way above average.
Thanks for the compliment! My industry experience is in software engineering, though my undergrad was in Mechatronics - it's the latter that gave me the linear algebra and computer hardware background presented in this video. I do a lot of my learning of machine learning theory by asking ChatGPT questions. I've been doing that recently to aid me in building apps that integrate AI/ML technologies.
@@GeorgeXian Any idea which channels/sources i should start learning about the technical side of AI/ML from? I have AI/ML in my next semester but I've often found that college doesn't care about your foundation but it cares more about being able to claim that they "taught" you a certain software and hand you the degree. You have a pretty wide range of skills and experience so I thought you would be the right person to ask for pointers and stuff
@@cyclicwarrior2570 There's a video series on neural networks that really helped cement my understanding of how neural networks work: www.3blue1brown.com/topics/neural-networks
The first video covers how neural networks are just matrix multiplications. They used a very basic OCR neural network as a case study. It's easy to explain how the input is transformed into a vector for those (relatively) simple neural networks.
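For a flavour of what "a layer is just a matrix multiplication" means, here is a minimal NumPy sketch of a single dense layer in the spirit of that OCR example. The 784-input / 10-output sizes are my assumption, mirroring the usual 28x28 digit setup, and the random weights are just placeholders.

```python
import numpy as np

# One dense layer: flatten a 28x28 image into a 784-vector, multiply by a weight
# matrix, add a bias, then squash the 10 outputs into probabilities.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 784)) * 0.01   # 10 output neurons, one per digit
b = np.zeros(10)
x = rng.random(784)                         # stand-in for a flattened input image

logits = W @ x + b                          # the matrix multiplication is the layer
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the 10 classes
print(probs.round(3))
```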
@@GeorgeXian Thanks for helping me, bro. Waiting for your next vid.
I am now subscribed and I liked your video!!!
Nvidia A100 tensor cores have a binary mode that does exactly the acceleration you are talking about. However, they removed it from later generations such as the H100. Seems like a mistake.
Also, the 1-bit Nvidia approach is not suited to the 1.58-bit ternary approach that a later paper has suggested.
I'm not too familiar with how tensor cores are optimized for each encoding. Surely they have an 8-bit fixed-point mode that operates faster than the floating-point modes.
If anything, we will see a push for 1T-parameter models and the like, and much bigger models in general, because we don't know what happens as models get bigger.
I think they used full floating-point numbers for training and then quantized the matrices.
Can you elaborate? What you have mentioned sounds to me like post-training quantization - which is how models are quantized at the moment. This paper describes training models from scratch as quantized models - the backpropagation itself decides whether a particular weight is -1, 0 or 1.
There are no savings for training. A 16-bit set of weights has to be kept in memory to accumulate gradients. They are quantized at every forward pass to -1, 0, 1 and used for the forward and backward calculations, which target the 16-bit weights. This is called QAT, and it does produce a model that can be run at 1.58 bits. However, you have to keep the 16-bit weights if you want to continue training. Still amazing, but we still need the big boys to produce the foundation models. Let's just hope they use this QAT method going forward so the models come quantized by default. @GeorgeXian
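For anyone curious what that looks like in code, here is a minimal sketch of the general quantization-aware-training idea described above: snap the full-precision master weights to {-1, 0, +1} in the forward pass and let gradients flow back to the master copy via a straight-through estimator. The per-tensor absolute-mean scaling and the class name are my assumptions, not the exact BitNet recipe.

```python
import torch

class TernaryLinear(torch.nn.Linear):
    """Quantization-aware linear layer (sketch): full-precision master weights are
    snapped to {-1, 0, +1} in the forward pass, and a straight-through estimator
    lets gradients flow back to the master weights as if no rounding happened."""
    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)          # per-tensor scale (assumed scheme)
        w_q = (w / scale).round().clamp(-1, 1) * scale  # ternary weights, rescaled
        w_ste = w + (w_q - w).detach()                  # forward sees w_q, backward sees w
        return torch.nn.functional.linear(x, w_ste, self.bias)

layer = TernaryLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()        # gradients accumulate on the full-precision layer.weight
```

This is also why there are no memory savings during training: the 16-bit (here 32-bit) master weights never go away until training is finished.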
@@JohnDoe-lg6dj OK, looks like I'll have to do a deep dive on the training regime they're using. I figured that if they did manage to save memory during training, they'd definitely have mentioned it in the paper.
IDK, it's crazy how much technology isn't being used: fuel-injected auto-detonating gasoline engines, heat-recirculating ICEs, true geared CVTs, metabolism-slowing life extension, thorium nuclear reactors, self-powering heat engines inside air conditioners.
Very interesting running the BitNet examples on a Pi 5. The matrix outputs are 1, 0, -1, -0. Not sure what -0 means, but 1, 0, -1 reminds me of those old Russian ternary computers. I like to think of these as yes, no, maybe. I wonder if it could speed up Stable Diffusion even more? Running AI on an SBC is a game changer. I use this Pi 5 as my home desktop PC now.
Yes, it’s actually a 1-trit LLM, not 1-bit. The specialised hardware will be built around optimised ternary adders.
@@GeorgeXian Could fake it with 2 bits: 1, 0, -1, -0. I got interested in ternary a few years back and some FPGAs can do it. It reminds me of fuzzy logic and quantum computers. That reminds me to check Intel Arc GPU bit ops.
@@babbagebrassworks4278 It could be that 2-bit emulation of 1 trit is faster than a native ternary processor, given that we've had decades of experience building binary computers. Without a deep understanding of how chip manufacturing works, it's hard to say.
@@GeorgeXian Those old Russian computers used negative voltages. While it could be done, most semiconductor technology is 0 or x volts. x could be up to 15 volts for CMOS when I started in electronics; now it is down to about 0.9 volts. Going any lower, quantum effects start to mess things up. Memistor arrays could be used for analog computing, and it seems some noise in the system helps AI. I have been checking NPU chips to see what their lowest-level math is, 4-bit so far. YOLO is fast because it uses binary neural networks (BNNs).
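On the "fake it with 2 bits" idea from this thread, here is a minimal packing sketch. The encoding choices are my own, not any standard format: each ternary weight takes a 2-bit code, so four weights fit in a byte; denser schemes, such as packing five trits per byte in base 3, get closer to the theoretical 1.58 bits per weight.

```python
ENC = {-1: 0b10, 0: 0b00, 1: 0b01}   # 2-bit codes; 0b11 is simply unused
DEC = {v: k for k, v in ENC.items()}

def pack_trits(weights):
    """Pack ternary weights four-per-byte, 2 bits each."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= ENC[w] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack_trits(packed, n):
    return [DEC[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]

w = [1, -1, 0, 0, -1, 1, 1, 0]
assert unpack_trits(pack_trits(w), len(w)) == w   # round-trips at 2 bits per weight
```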
Quantized models are the future. Also, a solution to ridiculously priced GPUs is badly needed, like neural processing units (NPUs). We need dedicated AI processors that do away with CUDA cores.
What camera do you use?
Sony A7C.
Has no one tried analog op-amp-based multipliers?
Wow! This sounds very significant. Looks like aspiring hardware manufacturers should start designing hardware and the related libraries at full throttle.
But if they want any chance against Nvidia, they've got to build scalable hardware for both end users and enterprises. Making enterprise-only things like Groq will never make a dent, even in the enterprise market.
This optimization could lead to another GPU shortage (for consumer GPUs). Back then it was due to crypto; now a bunch of AI startups and average companies could use an RTX 4080 for their businesses.
That would be unfortunate. However, in reality it's cheaper for a startup to rent a GPU cluster to run their AIs. With 1-bit, the rental costs will be cheaper for a given model size.
@@GeorgeXian Fair point. But there's also another hypothetical concern from that perspective. GPU cluster renting companies could use consumer-grade GPUs for a cheaper alternative for those who want it. However, progress is progress, and this optimization could lead to many great things.
When I saw another video about the paper, I couldn't help thinking it might be a joke. But who knows.
The big caveat of this paper is that the largest model they trained far enough to compare against the original was the 3-billion-parameter variant. Matching the performance of such small models is a low bar. They are projecting that output quality scales with parameter count just like the original. However, if my understanding is correct, the VRAM requirements for training the 1-bit model should be dramatically lower than the original, so it baffles me that they didn't even try fully training the 7-billion-parameter variant.
It's not a joke. 2.5-bit state-of-the-art quantization has already worked great in llama.cpp since early February. Yes, it degrades model quality. But a 2.5-bit quantized model that is 2x larger (e.g. Llama 13B vs 7B) still has higher quality than an unquantized smaller model, and it runs much faster and with less memory. Looking forward to when the 1.58-bit paper gets implemented and reduces the needed compute horsepower. There is crazy innovation going on.
If you search for "digital signal processing" 1-bit digital filter, you'll find many papers about it. The concept there is the same: implement the digital filter without the need for hardware multipliers, only additions. I kind of assume that even hardware multipliers use this concept 😂.
I wonder whether these improvements would mean that AMD GPUs could enter the race.
Optimizations won't kill Nvidia. OpenAI will need bigger GPUs either way, because they just want to train bigger and bigger models. Also, a lot of the vendor lock-in happens in Nvidia's software stack, not their hardware.
However, now every chip company has an equal opportunity to build a new software stack, without as severe a handicap as before.
This is creepy. I’m skeptical.
🤔 I wonder if a person couldn't get the kernels (if indeed you would even call them that with such an architecture) embedded into an FPGA for a proof of concept of how efficient dedicated hardware would be for this method.
I want to say Groq has hardcoded fixed-function hardware that's insanely efficient for its process node (14nm I think, compared to Hopper...5nm maybe?), and while FPGAs aren't quite as efficient as ASICs in terms of price-to-performance, they're still quite a bit more powerful than GPUs for the same silicon in areas like this, from what I've seen.
My intuition is that you'd probably need to network several of them together to get to any reasonable size of model, but once you did, the bandwidth would be honestly insane, and the hardware would be quite scalable.
Yeah, really keen to see that. You'd probably need to chain many, many FPGA chips together to run any useful model. This is why we need to democratize AI.
Groq speeds with consumer hardware. One can dream, right?
The future of AI training is analog ICs; that is the only way to democratize them.
Great stuff :)
Subscribed ❤
Oxen-AI has a good vid on running it and even a github repo...
good content
There is no foundation model of this size for BitNet. The authors trained only a 3B-parameter model; how it will behave at 70B, no one knows. Models are so weak at those small sizes.
Another problem is that even 72B-parameter models are still fairly weak; it looks like for anything useful you need at least Grok size (314B parameters), or maybe at least the ~250B of GPT-3.5.
All this means you will still need a powerful GPU to run a useful model that can handle real-life tasks; maybe just one A100 80GB will be enough instead of two or eight.
Yeah, I did write a comment that they only trained up to 3B parameters. Very low bar; barely a usable model.
I feel like the channel author is lacking a lot of context around NVIDIA's position right now to be able to make these statements.
Care to elaborate? I obviously don't have any insider knowledge of Nvidia, but surely Nvidia is keeping tabs on dedicated AI chip efforts, as those could upset its dominance in the AI sector or at least dent its share price.
@@GeorgeXian Nvidia is actively helping their customers design chips to replace theirs. They're in a non-zero-sum space right now.
This addition-on-fixed-point-numbers thing is awesome. I have been thinking about what will happen when this is proven: will old Pentium 4s plugged into janky pirate motherboards, like bitcoin miners and graphics card rigs, come about? The thing about ALL the 'old' processors just lying around is that they take a lot of power, but a GHz is a GHz and a core is a core (when we are talking about just matrix addition). But yeah, all that electricity. It's cool when you start thinking about the number of radios that exist (billions and billions) in old smartphones that lower-spec, internet-of-things-type, AI-driven multi-computation could perhaps use. There are so many antennas/radios in laptops and phones all over the earth.
No one needs to be scared at all of better AI performance... Also, Llama 30B has horrible performance on my RTX 4090 compared to Llama 13B.
good video
AI won't be real for the SMB market for 5 years; it is going to take that long, really.
China has a chip (Acell) that is 3000x faster than an A100 and uses 500x less power.
Tell me more, is there a link for more info?
Wow, do they sell it?