With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language.
To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
Anecdote: I was once working on an oil-rig single-board computer ... which was simply too weak for what it needed to do.
After about 4 weeks and a major software rewrite it was ALMOST usable but not quite ... I had run out of road.
By chance, one of the hardware techies came by and saw my despair. He told me to wait a moment and then returned with a replacement processor chip which was maybe 5x faster than the original. Problem solved! Apparently he was an official beta tester of Motorola CPUs!
Moral of the story: we may need to optimise today, but hardware and software will get better and faster tomorrow!
Re: the need for all this compression work: I used to work for BMW Research, and I found most engineers aim for small, fast code. This is an ingrained habit which is probably not really needed: Nvidia is about to release a powerful $3000 retail AI processor box which can do what a car needs without needing tiny models. In quantity, Mercedes could probably get a smaller version for maybe $2000. That is not a huge sum, especially for fancy cars.
It feels like gene editing. Interesting video
Excellent - I have been waiting for this video for months!
Everyone is aiming for HUGE models ... but tiny models will have great opportunities too!
I have tried machine-control LLMs where I get the LLM to emit and process special strings such as [1,5], which are picked up or emitted by a C wrapper to interface with I/O. That said, cars might need medium-size rather than tiny models. The tiny models will be needed for toasters and cookers.
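A minimal sketch (in C, since the comment mentions a C wrapper) of how such bracketed control strings could be picked up from the model's output; the [channel,value] interpretation and the set_output() hook are illustrative assumptions, not the commenter's actual interface:

#include <stdio.h>
#include <string.h>

/* Hypothetical I/O hook: write 'value' to output 'channel'. */
static void set_output(int channel, int value)
{
    printf("I/O: channel %d <- %d\n", channel, value);
}

/* Scan LLM output text for bracketed control strings like "[1,5]"
   and forward each one to the I/O layer. */
static void handle_llm_output(const char *text)
{
    const char *p = text;
    int channel, value;

    while ((p = strchr(p, '[')) != NULL) {
        if (sscanf(p, "[%d,%d]", &channel, &value) == 2)
            set_output(channel, value);
        p++;
    }
}

int main(void)
{
    handle_llm_output("Setting heater: [1,5] and fan: [2,3].");
    return 0;
}

The same loop works in the other direction: the wrapper can inject strings of the same format into the model's context to report sensor readings back.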
Very cool video, thanks!!! A few thoughts (I'm not an automotive expert at all, but have a tiny bit of embedded background): Q4_0 can be accelerated nicely on current Arm CPUs (llama.cpp was able to speed up PP/prompt-processing 2-3x and also TG a bit by using special Arm CPU instructions for GEMM/GEMV operations), while a GPU/NPU helps little for SLM inference, as TG/token-generation is mainly limited by RAM bandwidth (a rough back-of-envelope sketch follows below). A current NVIDIA Jetson Orin Nano 4GB embedded module (50GB/s memory bandwidth, with its 64-bit bus) would be a platform with approximately the performance constraints mentioned for SLM token generation in this paper. To me, it does NOT seem like old/cheap hardware. Its GPU could be used for other tasks, e.g. vision. Probably the SLM runs on hardware dedicated to "dashboard"-centric features - a Jetson Orin Nano would already be a VERY luxurious processor for this.
Servus aus Wien
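A rough back-of-envelope sketch of the bandwidth argument above: token generation has to stream roughly all model weights from RAM once per generated token, so memory bandwidth caps the token rate regardless of compute. The 1B-parameter model size and the ~4.5 bits/weight for Q4_0 are illustrative assumptions; the 50 GB/s figure is the Jetson Orin Nano bandwidth quoted in the comment:

#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not figures from the video/paper:
       a 1B-parameter SLM quantised to Q4_0 (~4.5 bits per weight
       including block scales), running from a 50 GB/s memory bus. */
    double params          = 1.0e9;   /* model parameters          */
    double bits_per_weight = 4.5;     /* approx. Q4_0              */
    double bandwidth_bps   = 50.0e9;  /* bytes per second          */

    /* Each generated token streams (roughly) all weights once. */
    double model_bytes  = params * bits_per_weight / 8.0;
    double tokens_per_s = bandwidth_bps / model_bytes;

    printf("Model size     : %.2f GB\n", model_bytes / 1e9);
    printf("TG upper bound : ~%.0f tokens/s (bandwidth only)\n",
           tokens_per_s);
    return 0;
}

A bigger model or slower memory scales this estimate linearly, which is why adding GPU/NPU compute alone does not raise the token rate much.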
Thanks for sharing!
Great video! Can you add the link to the paper(s) in the description for these types of videos, please?
Interesting concept!
Tiny LMs? Yes please!
From a Lossless (~1.5:1) Compression Algorithm for Llama2 7B Weights to Variable Precision, Variable Range, Compressed Numeric Data Types for CNNs and LLMs on vixra (re-arrange the letters)
They should just train a model from scratch. They have the money to do it and it’s so domain specific. I’m not sure what the hell they’re doing.
Yeah, this is weird but this was a good experiment tbh.
It's just sad. German bean-counter culture at the big autos. Meanwhile, look at what Tesla and the Chinese are doing.