The smaller the model, the quicker the first token, the more tokens/sec, and the less electricity used. According to Andrew Ng, using an LLM in an agentic workflow can improve the performance more than increasing the size of the LLM. Also, retrieving a previous response that has been fact checked can be faster than generating a new hallucinated response.
Qualcomm AI hub should measure watt hours of electricity used in a typical hour of operation, in addition to the time to first token of output, and tokens per second of output for each optimized model.
Once an application works at all, it can typically be optimized to run faster, and to run on less powerful hardware such as smartphones. Developers of hardware, such as Qualcomm, should provide the means of optimizing software for their hardware.
The next thing I would like to see is a digital agent running on my smartphone that is part of an open source and decentralized global platform for collective human and digital intelligence.
The smaller the model, the quicker the first token, the more tokens/sec, and the less electricity used. According to Andrew Ng, using an LLM in an agentic workflow can improve the performance more than increasing the size of the LLM. Also, retrieving a previous response that has been fact checked can be faster than generating a new hallucinated response.
Qualcomm AI hub should measure watt hours of electricity used in a typical hour of operation, in addition to the time to first token of output, and tokens per second of output for each optimized model.
Once an application works at all, it can typically be optimized to run faster, and to run on less powerful hardware such as smartphones. Developers of hardware, such as Qualcomm, should provide the means of optimizing software for their hardware.
The next thing I would like to see is a digital agent running on my smartphone that is part of an open source and decentralized global platform for collective human and digital intelligence.