If it ends up being true (I have my doubts, but regardless), can we even make models smaller after this? I don't think it's really possible. I think maybe the next step is trying to trim parameters; I think I saw this being talked about once, just trying to find parameters that don't contribute much, but it would be computationally expensive.
From 16-bit float to 1-bit ternary... it brings back memories of going from CD (16-bit int) to SACD (1-bit DSD/SDM). Could a text-to-sound model with 1.58-bit quantization output DSD directly?
Isn't there some kind of restriction on sale of high-performance GPU/LPU hardware to China? That might be fueling this. It might also be cause for doubt. I hope it's true though!
If the 1.58-bit model is outperforming the traditional model on current hardware, what could it do with optimised hardware? I would be interested to know whether it is also quicker (and cheaper) to train. Possible energy/environmental benefits too? Could it make putting LLMs in phones and small, portable devices more doable?
It should be possible to do the calculations with mostly analogue circuits, just a couple of transistors per weight really, one layer adding and one layer subtracting (or doing nothing). Like a 10000x10000 weight layer update in one ns with very little energy.
How does this differ from all the other quantization methods that we have used so far? I have worked with 2-bit, 4-bit, 5-bit, and 6-bit quantized versions. Is this a completely new approach?
If this is true, ternary computing just found its application, finally. But I doubt this representation is useful on current hardware, or faster than 8 or 16 bit integer. Well, unless the matrix library does some mad bit banging and masking.
Are we going to hit a wall with LLMs? Garbage in, garbage out; we need a system that verifies the AI is not corrupting itself, plus we need more energy to feed this monster. Is the Singularity near or far?
Doesn't this do to an LLM's computation roughly what using UMAP to reduce high-density embeddings to 3 dimensions for visualization does in topic modeling with classic ML methods? I always use the HDBSCAN output for hierarchical and algorithmic insights before the 3D visualization, because the dense parameters give more precise but harder-to-handle high-density data. The 3D data is still VERY usable and holds most of the understanding from the high-density version. Is this comparable? Nice to know after struggling with the frustration of NVIDIA CUDA updates dragging on...
🎯 Key Takeaways for quick navigation:
00:00 🔄 The video introduces changes in the LLM (large language model) world, highlighting the transition from traditional deep learning models to ones that no longer require GPUs for high-performance matrix multiplication.
00:43 🔍 The discussion introduces the concept of 1-bit LLMs, suggesting a move towards more efficient computational models that retain performance parity with current LLMs but at a reduced computational cost.
01:11 🧮 Explains the shift from 32-bit or 16-bit floating-point representations to a ternary system (using -1, 0, 1) for model parameters, significantly simplifying the computational process by eliminating the need for multiplication.
03:31 🆕 Introduces the "B 1.58 model," which uses ternary values instead of binary, enhancing learning capabilities and performance by incorporating a zero value alongside -1 and 1.
05:08 💡 Discusses the potential for new hardware development optimized for the ternary computation model, suggesting a significant shift away from GPU reliance and towards more specialized computing solutions.
06:02 🚀 Highlights the paper's assertion that the 1.58-bit LLM architecture offers comparable accuracy to traditional models while improving latency, memory efficiency, throughput, and energy consumption.
07:12 📈 Provides evidence of the new model's effectiveness through comparisons with the Llama LLM architecture, showing equal or better performance on various metrics, including perplexity and downstream task performance.
09:58 🎛️ Elaborates on the technical implementation of the 1.58-bit LLM, retaining the Transformer architecture but altering the numerical representation and computational approach within the model.
11:48 🌍 Suggests a significant impact on the scalability and application of LLMs across different hardware platforms, including mobile and edge computing, due to the reduced computational requirements.
13:11 📉 Concludes with the potential for dramatic improvements in hardware efficiency and cost reduction for deploying large-scale LLMs, due to the shift to a 1.58-bit computational model.
Made with HARPA AI
@@1littlecoder Thanks for the reply (and for this very nice high-level summary video). I looked into the paper; the closest they come to discussing training is to mention that other groups have used post-training quantization (which is the most natural guess), but then they criticize such methods and don't say what they do instead (a little bit suspicious, but maybe being guarded is normal with so much money at stake). Clearly, to make such a big change of a 1 to a 0 or a -1 at once, the decision to change must be based on many training samples somehow. The best way I can think of to do this, without storing a hidden floating-point value for each parameter, is a Monte Carlo method (each training sample suggests some small direction for each trit to change in, and RNG is used to accept the change probabilistically, so that on average the trit parameters are being sampled from a relevant distribution that takes into account all the training samples). Just a guess!
@@1littlecoder Ah that's a serendipitous recommendation, I knew the CEO Verdon from his days as a graduate student in quantum computing. I might actually get in touch to help their efforts...
Great videos ❤ There goes Sama's $7T, unless this can't be used in training, like quantization. I have been using 2-bit XXS quant models; they are accurate enough, and given that you can build models from the ground up to be 1-bit or 1.58-bit compatible, I think this is revolutionary. I'm not sure we won't need GPUs at all, even with just the linear mults. I did miss this one with all the noise too.
Hey, where do you get all these papers? How do you stay informed with all the new important papers? I'd like to start reading some specific to my work (data science)
@@MichaelBarry-gz9xl Hey, I checked your comments on YouTube and they're good, and you seem to be one of the few knowledgeable people out here. I'm working on something interesting but want to verify it with you. If possible, can we connect on Discord? Thanks!
I checked the paper for a sec and found these models actually eat more memory than 4-bit quantized ones and don't offer much of a speedup compared to them either. Don't know where the memory inefficiency comes from and whether it could be fixed. If this is the best it can do, it's quite Zzz
Huh? Did we read the same paper? I did my own calculations in my head before reading the paper, and the paper confirmed what I suspected. This architecture can fit around 40B parameters on a little 8GB card
@@MichaelBarry-gz9xl "BitNet b1.58 1.3B 1.14 (2.93x) 0.97 (1.67x) 11.29" taking 1.14 GB to run a 1.3B model gives us 8 / 1.14 = 7.017 * 1.3 = 9.1B for an 8GB card, assuming of course 0 bytes being used for stuff like display drivers. However, we'll also need some more memory for our context window.
@@Alice_Fumo the smaller models are larger than expected yes I noticed that too and perhaps its head which is 8 bits? But it levels of add or gets bigger and reaches the expected 1.58 bits. Check the chart that compares the size using the log scale on the left. 70B was around 10Gb, and the 7B was around 10Gb or less. I'm going off memory, but everything seemed as I expected it to be (with the exception of the smaller models)
@@MichaelBarry-gz9xl Ah, you're right, I kinda missed that graph. This would put the memory consumption of 70B models at ~20GB (they only gave the ratio for that one, which is why I overlooked it), which is actually within the realm of consumer GPUs (although barely). Most interestingly, it would let one run Mixtral (or similar) on a single GPU.
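For anyone who wants to redo that back-of-envelope math, here is a rough sketch (my own numbers, not figures from the paper; it counts only the ternary weights and ignores activations, KV cache, embeddings and any higher-precision layers):

```python
def ternary_weight_gib(n_params_billion, bits_per_weight=1.58):
    """Lower bound on weight storage for a ternary model: parameters times
    ~1.58 bits each, converted to GiB. Real deployments need extra memory."""
    bits = n_params_billion * 1e9 * bits_per_weight
    return bits / 8 / 2**30

for n in (1.3, 7, 70, 120):
    print(f"{n:>5}B params -> ~{ternary_weight_gib(n):.1f} GiB of weights")
# 1.3B -> ~0.2 GiB, 7B -> ~1.3 GiB, 70B -> ~12.9 GiB, 120B -> ~22.1 GiB,
# which also fits the ~40B-parameters-on-an-8GB-card estimate discussed above.
```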
@@MichaelBarry-gz9xl I still do not quite get how it would be matching performance with the 32 bit variants? I'm just an undergrad student, so I'm probably missing a lot of the necessary details. If you could explain it, or guide me to some resources that could, I would be grateful.
I like this title more, it's a lot less click-baity. ;-) Though I still wonder if this even changes anything at all. If the paper is already from October last year, it's old by today's standards. So there are probably good reasons why it has not been adopted yet.
Pretraining is expensive and takes time; not many people can afford it. Hence, out of the thousands of models on the hub, it's basically just LLaMA and Mistral. The original paper was good, but the accuracy wasn't there. Now the accuracy is there, and I think companies such as Meta, maybe even Microsoft as they played a part in this research, will be scrambling to make bigger models with this. But it takes a long time and a lot of money.
@@MichaelBarry-gz9xl What do you mean by "the accuracy wasn't there. Now the accuracy is there"? I don't think there is new hardware or software with increased "accuracy" since October 2023. Also, the most time-consuming part of training for these companies is most likely the preparation of the datasets. Apart from the biggest models, training takes only days or weeks with the hardware that these big tech companies have. For the training datasets they could just use the existing ones. If, on the other hand, this 1.58BitNet approach allows for training smaller models like 7B from scratch with more affordable hardware, or if there is a way to "compress" existing models by converting or re-training them into the new format, one would expect that there would already be some open-source examples floating around.
Reading this paper, they got the idea to also include zero as a third value, instead of just 1 and -1, to get to 1.58 bits. The whole idea is to get rid of multiplication: multiplication by 1 leaves the value the same (we just keep track of the sign), and 0 has the property of remaining zero. Still, there is another possibility: to use "signed 0" as a value too, so we get -1, 1, -0, +0. We get 2 bits instead of 1.58, but with the same computational effort. The idea of signed zero is a "zero with memory": with the sign we keep the memory of where it came from, the positive side or the negative side. en.wikipedia.org/wiki/Signed_zero
Great video bro. Actually, I'm doing a project in AI which converts a UI design into front-end code. Can you upload videos regarding this? It will be very useful for my project. Thanks in advance.
Question: If word2vec is able to represent semantically similar words based on them existing near each other in an n-dimensional vector space, how can it achieve the same result if there are only three possible positions in each dimension? Or are the vectors produced by word2vec solely the data that is fed to the model, but not the model weights themselves? Does this also have anything to do with why the activations are 8-bit, or is that also unrelated?
Consider that embeddings have 512 to 1024 or even more dimensions; that means with just 1, -1, and 0 you can represent 3^1024 unique elements in an embedding. That's more than the number of atoms in the universe, AFAIK. I am unsure if this produces embeddings in 1.58 bits, but that is still a very large space to generate information over.
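A quick sanity check of that comparison (just arithmetic, nothing from the paper):

```python
import math

log10_value = 1024 * math.log10(3)        # 3**1024 expressed in powers of ten
print(f"3**1024 ≈ 10^{log10_value:.1f}")  # ≈ 10^488.6
print("versus roughly 10^80 atoms in the observable universe")
```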
The main issue here is these are tiny models they are playing with. There is no proof that this technique scales to anything above (in terms of perplexity/quality). It reminds me of RetNet, which was also supposed to be a big breakthrough and hasn't been released with any open weights since.
Can this method be used only for inference, or for training as well? If I understand correctly, the hardware requirements would remain the same for training.
I thought that with quantisation the model performance would drop. Maybe the memory requirement has gone up as a trade-off? The matrix dimensions would have to increase so that all of the information can still be stored in some way. I'm no expert in any way, it just feels like it's not that simple...
9:20 I thought so too, but... no? The memory footprint is clearly smaller. I'm genuinely shocked, and it seems to decrease the relative memory requirements more the bigger the model is. So we get smaller LLMs (not parameter-wise, but storage- and RAM-wise) with pretty much no loss in quality (and even some gain), with WAY faster inference times. This truly will change everything. Imagine this on Mixtral.
@@maxieroo629 Perhaps the math performed during quantization simply produces less relevant values than the ones decided by this 1.58-bit (rounding?) rule? When you're quantizing you may inadvertently keep less statistically relevant information, or values that amount to less statistically relevant information during inference, whereas this technique performs a similar function at a different time that just happens to produce results similar to the original... or perhaps the paper was run through Google's marketing team and thus the entire thing is bogus 😅 In any case - well, other than if Google really was involved 😂 - I can't wait to check this out on some local 1.58-bit Mistral models!
To all the people who keep saying today's AI is over-hyped: they should see this, and how fast the research is advancing. If today's AI can be this smart, I can't even imagine how advanced, agile and lightweight future AI will be, as if it could already be embedded in small, low-powered devices. Insane... That way we could all have decentralized ownership, like David Shapiro said: no need to carry a mobile phone when you want to take a picture, just ask the AI - it's embedded all over the place - and it will bring up all the candid shots. Or when you want to talk with someone, just go to a nearby embedded AI and access its network to talk to someone you know; maybe he's on a yacht, picking up your call on the yacht's windshield, and while talking he wants to grab a beer, so he just goes below and can keep talking to you, because AI is embedded all over the place.
This essentially creates different "engines"... like cars, there are V4s and V12s; you drive what you can afford! Currently they're all too expensive lol
To address the core of the confusion: 3 values can absolutely be represented with 1 bit. 1 ternary bit holds 3 values. It's only binary that holds 2 values. But because binary is so ubiquitous people often assume it is the only game in town. The moral of the story, and the thing to remember is that binary is not the only game in town. You can have any number of values in your bits (so long as you are a chip manufacturer). The values themselves are not the bits, they are properties of the bits. And a bit can have as many properties/states/values as the manufacturer desires.
All I have to say is that this is in no way a 1-bit technology like you make it out to be, or even 1.58, because a bit is either 0 or 1. You cannot make half a bit; or you can, but then you need 2 bits to say that it is a half, or that it is positive or negative. If you say the values are -1, 0, 1, you need 2 bits to represent that, there is no other way to do it.

I see another comment below where people are talking about compression and whatnot, but if you give an AI a 1-hour 1GB video to learn from, and the same video in MP4 at 300MB, the AI will learn the same thing. It will not grow by 1GB or by 300MB; it learns whether the video contains things you already taught it or not, for example whether cats or people show up in it. It does not compress the video and put it in another file somewhere, like most people seem to think it works. Most AIs are trained on terabytes of data; if it worked like that, we would never be able to run them on our basic PCs. Even the 70B-parameter models don't reach terabytes in size.

It has nothing to do with compression of images or whatever. It works the way your brain does: if you see a new movie, you will remember some scenes that clicked with you or shocked you, but you will probably not remember the color of the dress the actress was wearing. You will not compress the video in your head; what you will retain are the things your brain already has connections for from other movies or real-life situations, or truly new things that surprise you or have an impact on you, so your brain remembers them, but not pixel by pixel; it is based on parameters you already know. Like LLMs.

At least this is how I see it, but maybe I'm ignorant; I speak from the little experience I have with LLMs and coding. And that is the reason I don't understand why you are calling this a 1-bit model. It is possible to do, but you would need a way bigger number of parameters to make it learn anything if the only options you have are 0 or 1 to tune its learning in each neuron, I think. It's like saying I will make an AI that learns the full 256-value range of the red channel in RGB with 1 bit: you can do it, but it needs at least 256 one-bit parameters to detect all 256 possible values, or at least half of that, since it is in bits, 0 and 1.

If there are any experts who can show me where my thinking is wrong, I will be glad to read your opinion and learn from it. Like I said, I speak from the little I know. And for me it is weird seeing people say 1.58 bits, because that is not possible: bits are whole numbers, there are no decimals in bits, unless you reserve another one to make it a decimal, and even then it can only be a half, 0.5, or positive/negative, 1 or -1, and for that you need 2 bits like I said, not 1.
Nothing is free. I have seen attempts to take 32bit models and quantize them to two bit. They produce outputs, but not particularly good outputs; typically that drastic of a quantization is inferior to a smaller model that is not quantized so severely but they do still function. I have to wonder how the output of this approach differs from what would result from a model that was trained as a 2 bit model from the start.
Very interesting. I think I'll need to read the paper though. For the neural net to be able to approximate any function, I would have thought you need at least some form of multiplication for it to be a non-linear model (as in non-linear classification thresholds or non-linear regressions). If they are not multiplying values with weights, then I would at least expect that some form of multiplication is being done in the activation function.
The activation function is ReLU, so that's not multiplication, but it's still nonlinear enough. The attention normalization step would still need normal multiplication, for instance.
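A tiny worked example of that point (my own illustration, not from the paper): with only +1/-1 weights and ReLU, a two-unit hidden layer already computes |x|, which no purely linear model can, so ternary weights do not kill nonlinearity.

```python
def relu(v):
    return max(0.0, v)

def abs_via_ternary_net(x):
    """Hidden weights +1 and -1, output weights +1 and +1:
    relu(x) + relu(-x) == |x|, built without any real multiplication."""
    h1 = relu(x)     # hidden weight +1: pass the input through
    h2 = relu(-x)    # hidden weight -1: negate the input
    return h1 + h2   # output weights +1, +1: just add

print([abs_via_ternary_net(x) for x in (-2.0, -0.5, 0.0, 3.0)])  # [2.0, 0.5, 0.0, 3.0]
```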
You scale up the parameters and let the network decide the precision attributed to each feature. Ternary gradients should work just as well if you up the parameter count, but it removes redundant calculation. It will take longer to train on hardware not designed for it, but will be way faster on hardware that is.
It’s better in seemingly all ways to previous quantized models. This isn’t a model specifically though, it’s a new quantization method being demonstrated with Llama. It means any LLM using this method can have the full performance of a non-quantized model (16 bit), with a size smaller than other quantizations (in storage, and in ram during inference) and a massive inference speed increase. I genuinely can’t see the drawback here
This sounds like an investment scheme. Nothing in this video explains how the resolution can be reduced from 16-bit int/float to only 1.58-bit ternary. Either they didn't need that resolution in the first place, this whole time, implying that the whole industry was dumb (unlikely), or the conversion algorithm is creating many output matrices for each input matrix to make up the difference, which would just be re-representing each big word with several small words.
For those curious where the 1.58 comes from: it's log(3)/log(2) = 1.5849625. Basically, if you have a long sequence of random three-state values, you can represent it with no less than 1.58 bits per three-state value.
Thanks! I thought it was the square root of five halves.
Actually, you need to explain a bit more why that is. Modern computers use 1 bit to represent 0 and 1, and 2 bits to represent -1. If you generate a random sequence of these 3 possible integers, of which 2 take 1 bit and the other takes 2 bits, the average number of bits used would eventually converge towards log(3)/log(2).
Edit: If you are curious why it's log 3 / log 2, base-10 logs that is: because we use the decimal system and computers use binary, the number of bits required to store one of N possible values is log base 2 of N. Since we are using 3 states, or 3 unique digits, it's log_2(3), which equals log 3 / log 2.
@@swarnavasamanta2628 So if I understand you correctly, you have 1 bit to represent one number, and 2 bits to represent the other two numbers? So for example, if you see a 1, you stop, but if you see a 0, then you check for 00 or 01? Giving you 3 options stored in a compact way?
@@jarrod752 You have 2 bits to represent -1 and 1 bit to represent 0 or 1. So you need 2 bits to store whenever the weight is -1 and 1 bit to store the 0 and 1 weights
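To make those numbers concrete, here is a small sketch (my own illustration, not from the paper): a 1-or-2-bit prefix code like the one described above actually averages about 1.67 bits per weight for uniformly random weights, while packing trits in blocks (for example 5 trits per byte, since 3^5 = 243 fits in 256) gets close to the log2(3) ≈ 1.585 lower bound.

```python
import math
import random

print(f"log2(3) = {math.log2(3):.7f}")   # 1.5849625, the information in one trit

# Prefix code along the lines discussed above: 0 -> '0', +1 -> '10', -1 -> '11'.
CODE = {0: "0", 1: "10", -1: "11"}
weights = [random.choice([-1, 0, 1]) for _ in range(100_000)]
naive_bits = sum(len(CODE[w]) for w in weights)
print(f"prefix code: {naive_bits / len(weights):.3f} bits/weight")   # ~1.667

# Block packing: 5 trits fit in one byte because 3**5 = 243 <= 256.
def pack5(trits):
    """Encode 5 ternary weights (-1/0/1) as one base-3 number in a single byte."""
    value = 0
    for t in trits:
        value = value * 3 + (t + 1)   # map -1/0/1 -> 0/1/2
    return value

packed = bytes(pack5(weights[i:i + 5]) for i in range(0, len(weights), 5))
print(f"5-per-byte packing: {8 * len(packed) / len(weights):.3f} bits/weight")  # 1.600
# Bigger blocks approach the bound: 1000 trits need only ceil(1000 * log2(3)) = 1585 bits.
```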
I wish they would recognise that they're talking about balanced ternary, rather than forcing a three-valued number into a binary format.
Summary:
The video introduces a groundbreaking advancement in the field of Large Language Models (LLMs) by presenting the concept of 1.58-bit LLMs, a significant departure from traditional 32-bit or 16-bit floating-point representations used in these models. This new approach utilizes ternary values (-1, 0, +1) for model parameters, drastically simplifying the computational operations needed to run these models.
Traditionally, LLMs rely on complex matrix operations, which are computationally intensive and require high-performance hardware like GPUs, optimized for such tasks through technologies like NVIDIA's CUDA. However, the 1.58-bit LLMs leverage simple arithmetic operations, reducing the need for specialized hardware and potentially lowering energy consumption and operational costs.
This method significantly cuts down computational complexity, allowing advanced models to run on less powerful hardware, even devices without GPUs. It suggests a shift towards more sustainable AI technology use, with reduced environmental impact due to lower energy requirements. Moreover, it opens up avenues for hardware innovation, with the potential development of new devices optimized for ternary computation rather than complex floating-point matrix calculations.
This advancement is not just a technical feat; it represents a shift towards making high-performance AI more accessible and cost-effective, paving the way for future innovations in AI hardware and software. The 1.58-bit LLMs maintain performance levels comparable to traditional models while offering improvements in speed, memory usage, throughput, and energy efficiency. This development could redefine how LLMs are scaled and trained, offering a new paradigm for AI model deployment that is both high-performing and environmentally conscious.
I wonder how fast this method would run on a Groq chip? Strange times we live in.
Which website are you using for YT video summarization?
I just fed the entire YouTube transcript into GPT-4.5 and told it to summarize @@__________________________6910
That's not a summary, that's a transcript 😂
@@Custodian123 a good one 😛
Bro. This is actually game changing! Why is this not a popular technique yet?
Oh. It is new. Crazy stuff
arXiv papers are not peer-reviewed.
Like all vaporware, because it doesn't work.
The research only came out yesterday, and it's a preprint.
I'm guessing as a research paper, it'll either be implemented into everything in 2 weeks or it won't work.
Why not address the elephant in the room? The 1.58-bit model cannot store the same amount of information as 16-bit. So why is it performing at an equal level? Very fishy. Metrics gamed? Most probably.
Good point. No idea what any of this means, anyway.
Density. Entropy. Compression. etc
I am also super skeptical about this. Some papers show that pruning the network may give some performance boost, but the performance degrades when too many weights are set to zero. This is like pruning and quantization at the same time, which would likely work in a very controlled manner, but it sounds too good to be true in this paper.
It may be that our training algorithms cannot efficiently use the data resolution of floating point; the size of floating-point numbers is way too large for the proper atomic level of this kind of information. Think of it this way: after you get to TV screens with 16-bit colors, do you actually get much more by switching to 32-bit color? You mostly get much bigger files and much slower processing.
My thoughts exactly. I’d imagine these 1.58 bit models would be more prone to catastrophic forgetting as well. They can’t hold nearly as much information.
So basically the main idea is to get rid of multiplication: multiplication by 1 leaves the value the same, by -1 the same but with the sign flipped, and multiplication by zero is zero. So 1 is obvious; 0 is interesting, as it allows additional complexity to be encoded in the neural network.
I expect they could rediscover some bitwise-hack techniques from the era of early 3D gaming, using bitwise operations instead of multiplication. That way you get the same efficiency from not doing multiplication, but the complexity goes up significantly.
I do like the quantization though, as it introduces some noise into the neural network, and by doing so you probably get some interesting results, since the NN can outperform the original full-precision model on data the original was not trained for.
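A toy sketch of what that buys you (my own illustration in plain Python, not the paper's kernel): with weights restricted to -1, 0, +1, every dot product reduces to additions, subtractions and skips.

```python
def ternary_matvec(W, x):
    """Multiply a ternary weight matrix W (entries -1, 0, +1) by a vector x
    using only additions and subtractions - no multiplications at all."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:        # +1: add the activation
                acc += xi
            elif w == -1:     # -1: subtract it
                acc -= xi
            # w == 0: skip entirely - this is where the zero/sparsity value pays off
        out.append(acc)
    return out

W = [[1, 0, -1],
     [0, 1, 1]]
x = [0.5, -2.0, 3.0]
print(ternary_matvec(W, x))   # [-2.5, 1.0]
```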
It's so hard to believe that 3 possible parameter values can capture the patterns and signal from the data as well as 16-bit floats. Mind-blowing stuff for sure.
that's what i want to know, how are they doing it?
@@arkaprovobhattacharjee8691 They have released the technical paper, so you can freely check it.
If you understand Digital-to-Analog conversion techniques, it is trivial. In fact, you wouldn't need a negative one value either, but they have yet to figure that part out.
Exactly, a lot of information gets lost. So you would assume that the number of nodes (parameters) would need to go up.
Maybe the trick is that with ternaries there's now a convenient way to represent sparsity (the zero value), which I heard is pretty important in neural networks.
This isn't even a clickbait title. This is an improvement on the original BitNet paper, and it's actually a huge deal. Very. The original method would reduce performance by almost half. We've been working with it for our own LM architecture, trying to figure out clever ways to mitigate those issues to a degree, which required a whole new activation function for sparse representations, increasing nonlinearity... blah blah lol.
This improvement seems to retain the majority of the performance. Wow. Just going off the video screenshot; I have to dig into this paper now lol. Listen, we won't have GPT-4+ models locally without super-efficient low-bit-depth quantization. This is more holistic since you have to pretrain. Wow.
Also, generating weights directly is more feasible at this bit depth; we believe that conditional weight generation is going to change deep learning forever. I know I sound crazy, but even compute won't be an advantage soon. We are going to change that. I know I sound crazy... but remember I said this: within a year you won't have to pay hundreds of thousands or millions to pretrain a model; simply bring your dataset, model code, and hyperparameters. That's your prompt, by the way. This year is going to be insane for everyone.
Bitwise Mamba + bitwise inference for the win.
I was thinking mamba.
+ bitmap which is irrelevant but f it were in a simulation
+ analog inference chips.
@@minimal3734 why would you use analog inference chips for 1.58 bits 💀
I'm surprised it took so long to study this. Years ago, I realized you could model combinational logic as a "neural net". The weights would have to be +1, 0 (not connected), and -1 (negated). The gates would be like neurons with various activation functions.
I never wrote about it because I figured surely someone in the field had already written a paper on the concept. It seems so obvious.
Since I missed the boat on this discovery, I might as well share some other insights I think are rather obvious, but haven't seen in literature (though I haven't looked very hard):
1) I bet the majority of training can be done at low precision too.
Since weights start out highly randomized, they (almost by definition) don't carry much information. One should be able to train a 1.58-bit (trit) model until things settle, then add a bit of precision and continue training until the model parameters settle, then add another bit, and so on until you reach the desired performance.
It makes little sense to train at high precision if you're going to throw away most of that precision when it comes time for inference anyway.
I don't know how much precision is required for meta-parameters like momentum, but it shouldn't be that much more than the actual parameters themselves.
2) I think the field is missing a key fundamental element besides neurons and weights: delays.
AFAIK, delays have yet to be explicitly modeled in ANNs. I think it would help at a theoretical level for understanding RNNs. I think we might gain a lot from modeling neuron inputs as adaptive FIR filters instead of single-weighted, single-valued signals.
Digital Signal Processing engineers have a whole toolbox of techniques based on explicitly modeling delays.
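For what it's worth, idea 2) is easy to sketch (a toy illustration of the commenter's suggestion, assuming NumPy; not something from the paper): instead of a single weight per input, each input gets a short vector of taps over its recent history.

```python
import numpy as np

def fir_neuron_input(x_history, taps):
    """Treat one neuron input as an FIR filter: a weighted sum over the last
    len(taps) samples of that input, rather than one weight times one value."""
    recent = x_history[-len(taps):][::-1]   # newest sample first
    return float(np.dot(taps, recent))

x_history = np.array([0.1, 0.4, -0.2, 0.7])   # oldest ... newest
taps = np.array([0.5, 0.3, -0.1])             # tap 0 weights the newest sample
print(fir_neuron_input(x_history, taps))      # 0.5*0.7 + 0.3*(-0.2) - 0.1*0.4 = 0.25
```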
i wish they released at least one model to test it out lol
They are planning on releasing the models for research
Seems huge! Can't wait to run 70B+ on my phone with millisecond response times
Assuming you have 13.8GB of free memory on your phone to run it of course!
@@footube3 that's not a large amount
@@TragicGFuel He said memory not storage.
My Xiaomi Mi 10 Ultra has 16GB of RAM, of which 12GB is literally always free, because that's an insane amount for a phone... but there have been phones like that for years now!
I did actually try to run llama.cpp on my phone a while back - there was a project that compiled it natively for Android, but I couldn't get it to not crash. I could have tried compiling it myself, but I got bored and figured it would probably be way too slow with a phone CPU anyway
When accuracy is so sensitive even to quantization, it's hard to grasp how this makes things better and faster! Every upgrade has a price tag, so what's the catch here? I'm curious to give their model a go, hopefully they add it to Hugging Face.
This feels like the signal to sell NVDA while they're at their high, before the shift occurs where their GPUs are no longer essential for AI.
NVIDIA is in an excellent position to optimize their GPUs for this. It will actually save them money. NVIDIA will be just as excited about this as we are. The good thing is that so too will RISC-V, Arm, Intel, AMD, etc.
@@MichaelBarry-gz9xl Yeah, you could certainly be right. I wondered if that will be how it plays out too. My gut says it'll be a matter of how adaptive NVIDIA is and how quickly they'll be able to pivot from their existing momentum (which is substantial) in the whole CUDA stack with the FP16/etc. matrix math to doing custom circuits (ASICs, like what Groq is doing). I am also a bit confused about this, because it seems like there is still a need to train using the existing FP16 math and that this binary-ish technique is more for the inference stage, after quantizing an FP16 model down to a model with 1/0/-1, or at least that's my read on it. If that's so, then Groq and Cerebras and others who are well down the custom-ASIC path may be better positioned to pivot to ASICs for this binary-like math, specializing in inference (which is the larger market, I believe) and leaving the training phase to NVIDIA hardware (or similar).
There are multiple reasons to sell if you've been holding for a while. Regardless of what architecture pulls ahead, nvidia will have more competition soon. We may also be in a speculative bubble, since the investor frenzy is banking on a wildly optimistic economic transformation due to AI, and that may not come to pass.
This is a good thing for both top-end and consumer-level AI. The top end will always use the extra headroom to improve the quality of the model. Consumers will finally get good models that they can run on a regular PC, perhaps opening the window for games running AI locally.
Wooo, this sounds pretty cool, and it may make CPU inference on low-power devices much more realistic. The Groq folks probably aren't too happy about this (though let's be real, this will make Groq's already incredible performance even more incredible).
The extreme wording of the title is warranted here if you didn’t know about the October paper, which I did not. So, thanks for the video!
Elaborate please?
@MrSur512 there was a paper published in October about the previous version of this named Bitnet
the idea of having integers as weights made sense to me for a while. but my man 😂only using -1, 0, and 1 is very cool
but still in abstraction it doesnt feel reliable somehow😂
@EobardUchihaThawne I don't know sh-t about the human mind *but* in whatever way it functions it can probably be abstracted to something simpler than floating points. Probably much more along the lines of two or three values like -1, 0, 1.
Why? The human mind is "analogue", it is not discrete, so it has "quantum level" precision... @tVonDudler
"It (BitNet b1.58) matches the full-precision Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance"
I don't buy this at all. I will only believe it when I have a model like that running locally in my computer.
This is huge! Thanks for bringing it to our attention ❤
1.58 = Log 3 in base 2. (corrected as below... Uh-doiii!)
log_2(3) or log(3)/log(2) actually. But good catch anyway! This is related to entropy. This BitNet shifts from a binary to a ternary representation. So if you have a sequence of values with 3 possible states each, you need about 1.58 bits per value to encode them efficiently without loss of information.
Ternary is apparently a very old, alternative number system, which permits decision-making with the use of -1, 0, and 1, unlike binary's 0 and 1. Apparently a computer called Setun, which was based on it, was built by the Soviets in the 1950s. It makes me wonder what other innovations might be possible if we were to look at such obscure and potentially unexplored ideas.
Ternary is difficult to make into logical gates.
These ternaries are still represented by binary bits here.
That it took so long to find out that a 1-bit approach is viable, suggests that it was very counter-intuitive, so it is good that we eventually found out, because in terms of hardware, this is revolutionary.
It also possibly sheds some light on how neurons work, where the complexity may just be a consequence of the biology, but unnecessary in that it does not add to the performance, but (and this is key) nor does it get in the way. So maybe we can simplify a neuron model along the 1-bit lines.
This is truly a big change, ty for the info
Thank you for sharing! This process will revolutionize local open source AI!
Amazing!! You are always very much up to date with the research work in this AI field.
This is fantastic news for people running local open source models, if the performance translates.
We have to wait until a Company will implement this technique. Very promising. Thanks for the update on new llm technique.
This is phenomenal
Some earlier tensor cores of nVidia GPUs (Turing, Ampere A100) have a 1 bit matrix multiplication mode. It was labeled as experimental in the documentation. Not sure if they kept this in later hardware revisions.
I still couldn't understand the main thing. In this work, was the original model trained in 16 bits and then quantized to -1, 0, 1, or did they learn to train the model directly in -1, 0, 1, managing to do full backpropagation in this representation, etc.?
From what I understand, you get the full benefits only if you train at 1.58 bits. If you do post-training quantization you get what you get: lower precision that scales down through 8-bit, 4-bit, 2-bit, 1-bit.
When you backpropagate, you use the fp16 values as a reference but tell the ternary model to only use 1, 0, or -1 (paraphrasing someone else's comment from another video).
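A minimal sketch of that trick, often called a straight-through estimator (my own toy version in NumPy; the paper's BitLinear layer also rescales the weights and quantizes activations, which this leaves out, and the 0.05 threshold here is an assumption):

```python
import numpy as np

def ternarize(w_fp, threshold=0.05):
    """Snap latent full-precision weights to -1/0/+1 (threshold assumed here;
    the actual scheme in the paper is based on the mean absolute weight)."""
    q = np.zeros_like(w_fp)
    q[w_fp > threshold] = 1.0
    q[w_fp < -threshold] = -1.0
    return q

rng = np.random.default_rng(0)
w_fp = rng.normal(0.0, 0.1, size=(4, 8))   # latent fp "shadow" weights kept during training
x = rng.normal(size=8)
target = rng.normal(size=4)

for step in range(200):
    w_q = ternarize(w_fp)          # the forward pass only ever sees -1/0/+1
    err = w_q @ x - target
    # Straight-through estimator: pretend d(w_q)/d(w_fp) = 1, so the gradient
    # computed w.r.t. the ternary weights is applied directly to the fp shadow weights.
    w_fp -= 0.01 * np.outer(err, x)

print(ternarize(w_fp))             # what you'd actually ship for inference
```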
Great video. Very cool stuff
If this is indeed real, that's a game changer. But how can the model's resolution be maintained with so much compression? LLMs are naturally lossy, but this takes it to the extreme.
From what I understand, this isn't a compression method. It's an alternative way of encoding the parameters and of processing them during training and inference that makes a more efficient use of the available memory space. Models must be trained from scratch using this new encoding scheme.
I file this under "I'll believe it when I see it". I find it suspicious that they only show scores for tiny models (up to 3B), but they have tokens/s and model size for bigger 1.58-bit models (up to 70B).
Great explanation! It will be great if you can run a more in depth explanation of the paper and share some resources contrasting conventional matrix multiplication with the 1Bit approach. Thanks a lot!
Did they provide a real example with tests? This is going to be great for low-power computational devices like the RPi and so on...
There is a lot of potential here. Originally, I thought this was a completely Ternary process. BUT, I doubt it.
So this model is probably using a similar method to "BitNet: Scaling 1-bit Transformers for Large Language Models," which explains: "BitNet employs low-precision binary weights and quantized activations, while maintaining high precision for the optimizer states and gradients during training."...
In other words, this method is not a completely ternary process. It is still using high precision for gradients etc.
(If it is using all Ternary, then please explain how it is avoiding the same issues as over quantization)
This is huge and actually shocking... But, how can the 3 values (-1, 0, 1) do the same work as the 16-bit ones without losing any precision?
Also, wouldn't this mean that something on the level of GPT-4 will work with a single powerful GPU (something like a 3090) if it uses this 1.58-bit form, as it will scale down massively both in size and required computational power?
edit: A deeper dive on this if possible will be helpful.
Yes. 120B parameters on a 3090. Hook a few up and you can absolutely give GPT-4 a run for its money. Don't forget GPT-5 will use this tech, and probably improve it and keep the improvements secret.
For speed:
Increase the quantity of calculations
and
Lower the complexity of calculation?
For model size:
The model is highly compressed because the values are highly compressible (only three options)?
So maybe the model is bigger in terms of overall weights / information held in the model but much lower in disk size?
Disclaimer: I have no idea wtf I'm talking about.
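That intuition about disk size is roughly right: with only three possible values, each weight carries at most about 1.58 bits of information, so the weights pack very tightly. As a purely illustrative sketch (not how the paper stores anything), packing five ternary values into one byte already gets you to 1.6 bits per weight:

```python
def pack_trits(trits: list[int]) -> bytes:
    """Pack weights from {-1, 0, +1} five-per-byte, since 3**5 = 243 <= 256."""
    assert len(trits) % 5 == 0
    out = bytearray()
    for i in range(0, len(trits), 5):
        value = 0
        for t in reversed(trits[i:i + 5]):
            value = value * 3 + (t + 1)      # map -1, 0, +1 -> 0, 1, 2
        out.append(value)
    return bytes(out)

def unpack_trits(data: bytes, n: int) -> list[int]:
    trits = []
    for b in data:
        for _ in range(5):
            trits.append(b % 3 - 1)          # map 0, 1, 2 -> -1, 0, +1
            b //= 3
    return trits[:n]
```

The parameter count stays the same; only the storage per weight (and the arithmetic needed per weight) shrinks.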
no drawback?
None that I can see. It seems better in every single way: 5x smaller, 2x context length, 4x lower latency, 11x greater batch sizes, several orders of magnitude less energy required, better scaling law. It needs retraining from scratch, so it'll be the good part of a year before we get some decent models to play with. It will also spur on the development of new hardware. So we'll all be wanting to buy new hardware, which will make it even better.
I think using -1, 0, 1, 2 with 2 total bits would be a little smarter, but I see why using only one bit is easier, because multiplication on a single bit is just AND.
Wow! This takes LLM design out of the realm of high-level abstraction and pulls it down to the hardware level. Looks like old C++ guys like me are about to become relevant again. If the performance of these models can match that of FP16 on the output side, then this is truly game-changing. The cost of building out AI infrastructure would drop by several orders of magnitude. All we need to do is develop efficient libraries for performing matrix math using 2-bit unsigned integers.
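To make the "no multiplication" point concrete, here is a deliberately naive NumPy sketch: with weights restricted to -1, 0 and +1, each output element is just a sum of some inputs minus a sum of others. A real kernel would work on packed 2-bit values with vectorized add/subtract instructions, so treat this as a toy illustration rather than anything from the paper.

```python
import numpy as np

def ternary_matvec(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W @ x where W contains only -1, 0, +1; no multiplications needed."""
    y = np.zeros(w.shape[0], dtype=x.dtype)
    for i, row in enumerate(w):
        y[i] = x[row == 1].sum() - x[row == -1].sum()   # add, subtract, or skip
    return y

# Tiny usage example
w = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([2.0, 3.0, 5.0])
print(ternary_matvec(w, x))   # [-3.  8.]
```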
Not only that but it now seems logical to ditch the tokenizers and start using binary. That way we can train the models on existing binaries. Text in executable out and vice versa. Think about the ramifications. In the future everything will be open source, whether we like it or not.
@@MichaelBarry-gz9xl By binary in and text out, I assume you mean binary in and source code out; that is an intriguing proposition.
People have already predicted that AI spells the end for programmers. That is not a question of if but when. Anyone want to start a Job Death Pool?
@@walterbaltzley4546 I mean compiled code out. And in. Source code sure, but we can already do that. Imagine "feeding" it a copy of Microsoft Windows and then saying I want this but change X, Y, Z. Out comes the binaries, the source code, the documentation, everything. Next just say, change it around slightly so it doesn't infringe copyright. Boom.
@@walterbaltzley4546 it's not the end of programmers, that's like saying I already have an iPhone so I don't need apple anymore. Sure but you want updates and improvements and someone to blame when it goes wrong.
@@walterbaltzley4546 I think the best way to look at the future of AI is: ANY to ANY modality. I.E, voice in Blockbuster movie out. Video in poem out. Picture in software out. Text in video game out. Video game in movie out. And so on. ANY to ANY, is the way to see it. Also think in terms of millions and then billions and then trillions, and so on, of tokens per second. Now you should see where this is going.
If it ends up being true (I have my doubts, but regardless), can we even make models smaller after this? I don't think it's really possible. I think maybe the next step is trying to trim parameters; I think I saw this being talked about once, just trying to find parameters that don't contribute much, but it would be computationally expensive.
Very nice, thank you very much.
From 16-bit float to 1-bit ternary brings back memories, from CD (16-bit int) to SACD (1-bit DSD/SDM)... Could a text-to-sound model with 1.58-bit quantization output DSD directly?
Isn't there some kind of restriction on sale of high-performance GPU/LPU hardware to China? That might be fueling this. It might also be cause for doubt. I hope it's true though!
I asked Gemini for about 2 hours how it works. It was mostly unsure.
Please provide transcripts to study your videos alongside Gemini. Thank you!
Couldn't help but think of quantum computers when seeing the -1, 0, 1 values. Is there a possibility for this?
If the 1.58bit model is out performing the traditional model on current hardware what could it do with optimised hardware? Would be interested to know also if it is quicker (and cheaper) to train. Possible energy/environmental benefits too? Could it make putting LLMs in phones and small, portable devices more doable?
It should be possible to do the calculations with mostly analogue circuits, just a couple of transistors per weight really, one layer adding and one layer subtracting (or doing nothing). Like a 10000x10000 weight layer update in one ns with very little energy.
In a computer everything is just 0 or 1. An FP16 is just 16 binary digits.
How does this differ from all the other quantization methods that we have used so far? I have worked with 2-bit, 4-bit, 5-bit, and 6-bit quantized versions. Is this a completely new approach?
Quantization takes a model created in fp16 and "compresses" it down to 6,5,4,3,2 bits. This technique requires one to *train* the model at 1.58 bits.
So what happens to 1bit Mamba models
Wish I could give it two thumbs up. This is almost too good to be true.
Great explanations
waw!!! thanks for this amazing information! :)
If this is true, ternary computing just found its application, finally.
But I doubt this representation is useful on current hardware, or faster than 8 or 16 bit integer. Well, unless the matrix library does some mad bit banging and masking.
thanks for sharing!
Are we going to hit a wall with LLM? Garbage in garbage out, we need a system that verifies the AI is not corrupting itself, plus we need more energy to feed this monster. Is the Singularity near or far?
Doesn't this do the same thing to an LLM's computation that running UMAP on high-dimensional embeddings down to 3 dimensions for visualization does in topic modeling with classic ML methods? I always use the HDBSCAN output for hierarchical and algorithmic insights before 3D visualization, because the high-dimensional parameters give more precise but harder-to-handle data. The 3D data is still VERY usable and holds most of the understanding from the high-dimensional version. Is this comparable?
Nice to know after struggling with the frustration of Nvidia CUDA update dragging..
🎯 Key Takeaways for quick navigation:
00:00 *🔄 The video introduces changes in the LLM (large language model) world, highlighting the transition from traditional deep learning models to ones that no longer require GPUs for high-performance matrix multiplication.*
00:43 *🔍 The discussion introduces the concept of 1bit LLMs, suggesting a move towards more efficient computational models that retain performance parity with current LLMs but at a reduced computational cost.*
01:11 *🧮 Explains the shift from 32-bit or 16-bit floating-point representations to a ternary system (using -1, 0, 1) for model parameters, significantly simplifying the computational process by eliminating the need for multiplication.*
03:31 *🆕 Introduces the "BitNet b1.58" model, which uses ternary values instead of binary, enhancing learning capabilities and performance by incorporating a zero value alongside -1 and 1.*
05:08 *💡 Discusses the potential for new hardware development optimized for the ternary computation model, suggesting a significant shift away from GPU reliance and towards more specialized computing solutions.*
06:02 *🚀 Highlights the paper's assertion that the 1.58 bit LLM architecture offers comparable accuracy to traditional models while improving latency, memory efficiency, throughput, and energy consumption.*
07:12 *📈 Provides evidence of the new model's effectiveness through comparisons with the Llama LLM architecture, showing equal or better performance on various metrics, including perplexity and downstream task performance.*
09:58 *🎛️ Elaborates on the technical implementation of the 1.58 bit LLM, retaining the Transformer architecture but altering the numerical representation and computational approach within the model.*
11:48 *🌍 Suggests a significant impact on the scalability and application of LLMs across different hardware platforms, including mobile and edge computing, due to the reduced computational requirements.*
13:11 *📉 Concludes with the potential for dramatic improvements in hardware efficiency and cost reduction for deploying large-scale LLMs, due to the shift to a 1.58 bit computational model.*
Made with HARPA AI
How do you train such models? I mean, changing a parameter from 1, to 0, to -1 is a big discrete step. How do you know when it's time to change it?
That's where it'll be interesting to see if the authors share the code. They haven't yet.
@@1littlecoder Thanks for the reply (and for this very nice high-level summary video). I looked into the paper, the closest they come to discussing training is to mention that other groups have used post-training quantization (which is the most natural guess), but then they criticize such methods and don't say what they do instead (a little bit suspicious, but maybe being guarded is normal with so much money at stake).
Clearly, to make such a big change of a 1 to 0 or -1 at once, the decision to change must be based on many training samples somehow. The best way I can think of to do this, without storing a hidden floating-point value for each parameter, is a Monte Carlo method (each training sample suggests some small direction for each trit to change in, and RNG is used to accept the change probabilistically, so that on average the trit parameters are being sampled from a relevant distribution that takes into account all the training samples). Just a guess!
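Sketching that Monte Carlo guess in code, with the caveat that this is the commenter's speculation rather than anything from the paper, and the function name is made up:

```python
import random

def stochastic_ternary_step(trit: int, delta: float) -> int:
    """Move a weight in {-1, 0, +1} by a small real-valued delta, on average.

    With probability |delta| the trit takes a unit step in the direction of
    delta (clipped to the ternary range); otherwise it stays put.
    """
    if random.random() < min(abs(delta), 1.0):
        trit += 1 if delta > 0 else -1
    return max(-1, min(1, trit))
```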
@@iyziejane Given your interest, read up on Extropic. They're proposing a new chip and compute!
@@1littlecoder Ah that's a serendipitous recommendation, I knew the CEO Verdon from his days as a graduate student in quantum computing. I might actually get in touch to help their efforts...
@@iyziejane woah; that's so nice of you! Thanks for sharing this!
Great videos ❤ There go Sama's $7T, unless this can't be used in training the way quantization can't. I have been using 2-bit XXS quant models; they are accurate enough, and given you can build models from the ground up to be 1-bit or 1.58-bit compatible, I think it's revolutionary. I'm not sure we won't need GPUs at all, even with only linear additions. I did miss this one with all the noise too.
Hey, where do you get all these papers? How do you stay informed with all the new important papers? I'd like to start reading some specific to my work (data science)
My main source is this guy on Twitter - twitter.com/_akhaliq (Very high signal to noise ratio)
Hugging Face Daily Papers. Alternatively just filter the arxiv to only include keywords such as LLM
@@MichaelBarry-gz9xl Hey, I checked your comments on YouTube and they're good, and you seem to be one of the few knowledgeable people out here. I'm working on something interesting but want to verify it with you. If possible, can we connect on Discord? Thanks!
One network with many hidden layers = deep neural network
If I were CEO of Intel I would be working on this right now.
I checked the paper for a sec and found these models actually eat more memory than 4-bit quantized ones and don't offer much of a speedup compared to them either.
Don't know where the memory inefficiency comes from and whether it could be fixed.
If this is the best it can do, it's quite Zzz
Huh? Did we read the same paper? I did my own calculations in my head before reading the paper, and the paper confirmed what I suspected. This architecture can fit around 40B parameters on a little 8GB card
@@MichaelBarry-gz9xl
"BitNet b1.58 1.3B 1.14 (2.93x) 0.97 (1.67x) 11.29"
taking 1.14 GB to run a 1.3B model gives us
8 / 1.14 = 7.017 * 1.3 = 9.1B for an 8GB card, assuming of course 0 bytes being used for stuff like display drivers. However, we'll also need some more memory for our context window.
@@Alice_Fumo The smaller models are larger than expected, yes, I noticed that too; perhaps it's the head, which is 8 bits? But it levels off as the model gets bigger and approaches the expected 1.58 bits. Check the chart that compares the size using the log scale on the left. The 70B was around 10 GB, and the 7B was around 10 GB or less. I'm going off memory, but everything seemed as I expected it to be (with the exception of the smaller models).
@@MichaelBarry-gz9xl ah, you're right. I kinda missed that graph.
This would put the memory consumption of 70B models at ~20 GB (they only gave the ratio for that one, which is why I overlooked it), which is actually within the realm of consumer GPUs (although barely).
Most interestingly, it would let one run Mixtral (or similar) on a single GPU.
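For anyone who wants to sanity-check those sizes, here is a back-of-envelope helper (purely illustrative: it counts only the ternary weights and ignores embeddings, activations and the KV cache, which stay at higher precision):

```python
def ternary_weight_gib(n_params: float, bits_per_weight: float = 1.58) -> float:
    """Theoretical weight storage in GiB for a ternary model."""
    return n_params * bits_per_weight / 8 / 2**30

for n in (7e9, 70e9, 120e9):
    print(f"{n / 1e9:.0f}B params -> ~{ternary_weight_gib(n):.1f} GiB of weights")
# 7B   -> ~1.3 GiB
# 70B  -> ~12.9 GiB
# 120B -> ~22.1 GiB
```

Real deployments would sit somewhat above these floors because of padding to 2-bit storage and the non-weight state mentioned above.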
@@MichaelBarry-gz9xl I still do not quite get how it would be matching performance with the 32 bit variants?
I'm just an undergrad student, so I'm probably missing a lot of the necessary details.
If you could explain it, or guide me to some resources that could, I would be grateful.
This seems to be a game changer
I like this title more, it's a lot less click-baity. ;-) Though I still wonder if this even changes anything at all. If the paper is already from October last year, it's old by today's standards. So there is probably a good reason why it has not been adopted yet.
when compared with the previous one?
@@1littlecoder Yes, and since people were discussing this, I thought I'd add my 2 cents of feedback.
Thank you, appreciate it!
Pretraining is expensive and takes time, not many people can afford it. Hence out of the thousands of models on the hub, it's basically just LLaMA and Mistral. The original paper was good, but the accuracy wasn't there. Now the accuracy is there and I think companies such as meta, maybe even Microsoft as they played a part in this research, will be scrambling to make bigger models with this. But it takes a long time and a lot of money
@@MichaelBarry-gz9xl What do you mean by "the accuracy wasn't there. Now the accuracy is there"? I don't think there is new hardware or software with increased "accuracy" since October 2023. Also, the most time-consuming part of training for these companies is most likely the preparation of the datasets. Apart from the biggest models, training takes only days or weeks with the hardware that these big tech companies have. For the training datasets they could just use the existing ones. If, on the other hand, this 1.58-bit BitNet approach allows training smaller models like 7B from scratch with more affordable hardware, or if there is a way to "compress" existing models by converting or re-training them into the new format, one would expect that there would already be some open source examples floating around.
Reading this paper, they got the idea to also include zero as a third value alongside 1 and -1, which gets you to 1.58 bits. The whole idea is to get rid of multiplication: multiplication by 1 is the identity, so we just keep track of the sign, and 0 has the property of staying zero.
Still, there is another possibility: to use a "signed 0" as a value too. So we have -1, 1, -0, +0. We get 2 bits instead of ~1.58, but with the same computational effort. The idea of signed zero is a "zero with memory": the sign remembers whether it came from the positive side or the negative side.
en.wikipedia.org/wiki/Signed_zero
GTC is coming up, Nvidia sessions should be online soon. Ive been checking them out more and more and Remix tech is raging.
Great Video bro. Actually Im doing a project in AI which converts an UI design into front end code. Can you upload videos regarding this. It will be very useful for my project. Thanks in advance
So, going from "All You Need is Attention" to "All You Need is 1.58 Bits"?
At least there's no "shocking" this time.
I haven't used that word in my title at least in my last 10 to 15 videos that I could verify
Wes Roth only uses clickbait ("shock", "entire industry shocked"), same as Matt.
But 1 little coder didn't use those words.
How ironic, crypto bros used to use that.
shocking truly lol
Just wonder if anyone can reproduce their result lol
Question: If word2vec is able to represent semantically similar words based on them existing near each other in an n dimensional vector space, how can it achieve the same result if there are only three possible positions in each dimension, or are the vectors produced by word2vec solely the data that is fed to the model, but not the model weights itself?
Does this also have anything to do with why activations are 8 bit, or is that also unrelated?
Consider that embeddings have 512 to 1024 or even more dimensions; that means with just 1, -1, and 0, you can represent 3^1024 unique elements in an embedding. That's more than the number of atoms in the universe, afaik. I am unsure if this produces embeddings in 1.58 bits, but that is still a very large space to encode information over.
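A quick check on that magnitude claim, using the commonly quoted ~10^80 atoms in the observable universe for comparison:

```python
import math

digits = 1024 * math.log10(3)
print(f"3**1024 ~ 10**{digits:.0f}")   # 3**1024 ~ 10**489, vastly more than ~10**80
```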
One more step towards the merging of AI neural networks and quantum annealing style computing...
This is groundbreaking 😮
Thank You
The main issue here is these are tiny models they are playing with. There is no proof that this technique scales to anything above (in terms of perplexity/quality). It reminds me of RetNet, which was also supposed to be a big breakthrough and hasn't been released with any open weights since.
Can this method be used only for inference, or for training as well? If I understand correctly, the hardware requirements would remain the same for training.
Is this in any way related to neuromorphic/analogue computers?
I don’t know why they ever used floats. Seems like complete overkill. I suspect just because GPUs were originally used for gaming.
What is the equivalent CPU to a 3090 GPU for this? Or do you still need a GPU for parallel processing?
I thought with quantisation the model performance would drop. Maybe the memory requirement has gone up as a trade off?
The matrix dimension has to increase to have all of the information be stored in some way.
I'm not an expert in any way, it just feels like it's not that simple...
9:20 I thought so too, but…. No? The memory footprint is clearly smaller. I’m genuinely shocked and it seems to decrease the relative memory requirements more the bigger the model size is.
So we get smaller llms (not parameter wise, but storage and ram wise) with pretty much no loss in quality (and even some gain) with WAY faster inference times. This truly will change everything. Imagine this on mixtral
@@maxieroo629 if true, it will certainly be shocking
No, it's just that the current models are really really bad, like incredibly inefficient. That's being rectified little by little
@@maxieroo629 Perhaps the math performed during quantization simply produces less relevant values than are decided by this 1.58-bit (rounding?) rule? When you're quantizing you may inadvertently keep less statistically relevant information, or values that amount to less statistically relevant information during inference, whereas this technique performs a similar function at a different time that just happens to produce similar results to the original... or perhaps the paper was run through Google's marketing team and thus the entire thing is bogus 😅
In any case - well, other than if Google really was involved 😂 - I can’t wait to check this out on some local 1.58 bit mistral models!
How would you get the signed bit (+/-) "without multiplying" with the weights?
From what I understood, it's a ternary representation {-1, 0, 1}, so 3 base elements.
Because it is technically a 2-bit representation, inverting the first bit allows you to 'multiply' it by (-1).
To all the people who keep saying today's AI is over-hyped: they should see this, and how fast research is advancing. If today's AI can already be this smart, I can barely imagine how advanced, agile and lightweight future AI will be, as if it could already be embedded in small, low-powered devices. Insane... That way we could all have decentralized ownership, like David Shapiro said. No need to carry a mobile phone when you want to take a picture: just ask the AI, it's embedded all over the place, so it will bring up all the candid shots. Or when you want to talk to someone, just go to a nearby embedded AI and access its network to reach someone you know. Maybe he's on a yacht, picking up your call on the yacht's windshield, and while talking he wants to grab a beer, so he goes below deck and can keep talking to you, because AI is embedded everywhere.
This essentially creates different "engines" ... like cars there are v4 and v12, you drive what you can afford! Currently they're all too expensive lol
if one used this with current hardware, what would happen? Would it be quicker?
They used current hardware to build it and test it. Yes it's a lot faster, but it could be faster still
3 values cannot be represented using 1 bit
No, you'd need about 1.58 bits.
It's not binary, it's ternary. That's why new hardware will be more efficient. In binary terms this takes up 1.58 bits, not 1
If only the paper authors had thought of that, they could have saved a lot of time!
To address the core of the confusion: 3 values can absolutely be represented with one digit, it just can't be a binary digit. One ternary digit (a trit) holds 3 values; only a binary digit is limited to 2. But because binary is so ubiquitous, people often assume it is the only game in town. The moral of the story, and the thing to remember, is that binary is not the only game in town. You can have any number of values per digit (so long as you are a chip manufacturer). The values themselves are not the storage elements; they are states of the storage element, and a storage element can have as many states as the manufacturer designs it to have.
All I have to say is that this is in no way a 1-bit technology like you make it out to be, or even 1.58, because a bit is either 0 or 1. You cannot make half a bit; or you can, but you need 2 bits to say that it is a half, or that it is positive or negative. If you say it is -1, 0, 1, you need 2 bits to represent that, no other way to do it.
I see another comment below where people are talking about compression and whatnot, but if you give an AI a 1-hour 1 GB video to learn from, and the same video in MP4 at 300 MB, the AI will learn the same thing. The model will not grow by 1 GB or by 300 MB; it learns whether the video contains the things you already taught it or not, for example whether cats or people show up in the video.
It does not compress the video and put it in another file somewhere, like most people seem to think it works. Most AIs are trained on TBs of data; if it worked like that, we would never be able to run them on our basic PCs. Even the 70B-parameter models don't get to terabytes in size. And it has nothing to do with compression of images or whatever.
It works the way your brain does: if you see a new movie, you will remember some scenes that clicked with you or shocked you, but you will probably not remember the color of the dress the actress was wearing. You will not compress the video in your head; what you retain are the things your brain already has connections for from other movies or real-life situations, or truly new things that surprise you or have an impact on you, so your brain remembers them, but not pixel by pixel. It is based on parameters you already know. Like LLMs.
At least this is how I see it, but maybe I'm ignorant. I speak from the little experience I have with LLMs and coding. And that is the reason I don't understand why you are calling this a 1-bit model. It is possible to do, but you will need a far bigger number of parameters to make it learn anything if the only options you have are 0 or 1 to tune the learning in each neuron, I think.
It's like saying I will make an AI that learns the full 256-value range for the red channel in RGB with 1 bit. You can do it, but it needs at least 256 one-bit parameters to detect all 256 possible values, or at least half of that, since it is in bits: 0 and 1.
If there are any experts who can show me where my thinking is wrong, I will be glad to read your opinion and learn from it. Like I said, I speak from the little I know. And to me it's weird seeing people say 1.58 bits, because that is not possible: bits are whole units, there are no decimals in bits, unless you reserve another one to make it fractional, and even then it can only be a half, 0.5, or positive/negative, 1 or -1, but for that you need 2 bits like I said, not 1; it's impossible.
Hello, what about accuracy in predicting tokens? Even with traditional 16-bit FP there is a slight accuracy drop-off when selecting tokens.
Nothing is free. I have seen attempts to take 32-bit models and quantize them to 2-bit. They produce outputs, but not particularly good outputs; typically that drastic a quantization is inferior to a smaller model that is not quantized so severely, but they do still function. I have to wonder how the output of this approach differs from what would result from a model that was trained as a 2-bit model from the start.
SHOCKS THE WORLD
Very interesting. I think I'll need to read the paper though. For the neural net to be able to approximate arbitrary functions, I would have thought you need at least some form of multiplication for it to be a non-linear model (as in non-linear classification thresholds or non-linear regressions). If they are not multiplying values with weights, then I would at least expect some form of multiplication to be done in the activation function.
Optimizing the activations are being left for future work
The activation function is ReLU, so that's not multiplication, but it's still nonlinear enough. The attention normalization step would still need normal multiplication, for instance.
Nice paper
Everyone, make this viral so it gets more attention. Such research is the future of AI. Gooooooooo!
How does one get reasonable gradients on 1/1.58 Bits?
You scale up the parameters and let the network decide the precision attributed to each feature. Ternary gradients should work just as well if you up the parameters, but it removes redundant calculation. It will take longer to train on hardware not designed for it, but will be way faster on hardware that is.
My head is spinning; how does it perform like 16-bit?
reminds me of pcm vs dsd in audio
how does it compare to quantized models?
It's better in seemingly all ways than previous quantized models. This isn't a model specifically, though; it's a new quantization method being demonstrated with Llama. It means any LLM using this method can have the full performance of a non-quantized model (16-bit), with a size smaller than other quantizations (in storage, and in RAM during inference) and a massive inference speed increase. I genuinely can't see the drawback here.
Thank you. Thanks.
This sounds like an investment scheme. Nothing in this video explains how the resolution can be reduced from 16-bit int/float to only 1.58-bit ternary. Either they didn't need that resolution in the first place, this whole time, implying that the whole industry was dumb (unlikely), or the conversion algorithm is creating many output matrices for each input matrix to make up the difference, which would just be re-representing each big word with several small words.