For measuring relative performance, it is wrong to do a per MHz calculation. The only metric that should matter is the total time needed to run the same application on both processors. A more complicated ISA means clock speeds will be reduced (which gives better per MHz performance), but that does not mean the processor is faster
RISC-V, ARM and others like MIPS which I have plenty of experience with are just architectures; the chips you can buy are implementations of these architectures. First thing to notice from a 30,000ft perspective is that 64-bit ARM, MIPS and RISC-V are surprisingly similar. In the past CPU architects were more adventerous. These days no more bat crazy shit like segments (x86) or register windows (SPARC, and totally batshit crazy on IA-64). MIPS is an early but well designed RISC architecture; 64-bit ARM (which fortunately is rather dissimilar to the 32-bit ARM architecture) is surprisingly similar. Which is unsurprising because one of the architects used to work for MIPS. And RISC-V was designed by the fathers of SPARC and MIPS. So are they all the same? Not quite but 64-bit ARM and RISC-V benefit significantly from hindsight. Now, once you take things to the limit things will be different. RISC-V's smaller footprint allows fitting more cores running at a higher clockrate on a die. It barely matters for the birdseed class of microcontrollers that's polluting most PCBs ;- So for most uses architecture doesn't matter - software does. That's where ARM is very well supported, MIPS is well established and RISC-V is still catching up. That said, the folks behind RISC-V is smart and have impressed me by what they have achieved and in my discussion so they're going to close that gap. Plus hgiher end implementations are going to show up. Being a truely open architecture however the RISC-V market can be as confusing as a ant pile - or open source in general ;-)
@@markhaus Yes and no. Availabillity of documentation is greatly simplyfying a port to a new architecture, especially when there is already a port to a simila architecture. It still remains a major underrtaking in terms of manhours required. Been there, done that. Three times 🙂 As far as software development is concerned, x86, MIPS, RISC-V, MIPS, ARM are open enough to aloo development of dcent software. IA-64 was special in that it's performance characteristics are ... complex. without NDA or possibly even being Intel it was not possible to certain software including high-end compilers. The level beyond that is licensing for the architecture itself to develop a core by anybody who wants to. RISC-V didnt really inovate there, there have been other such public domain or similarly unrestricted architectures before. But they were the first polished architecture with academic and industry acceptance, documentation and very liberal licensing on top. It's this mix (and probably a few more things on top) which made the rise of RIsc-V possible.
It’s not just about the people who designed the RISC-V ISA it is also about the people that implement it, they are the ones that actually determine the efficiency and clock speeds and similar.
@@markhaus Yes and no. Programming specs are open for most architectures though the degree of detail varies. On the example of Intel, Intel CPUs accept only a signed blob as microcode and the microcode programming interface is not even documented. Probably few users care. More painful was Intel's attitude towards protecting the IA-64. The Merced documentation covers four thick books printed on thin paper and is also available for download (1-click sign-away your soul acceptance of terms required ;-) but errata required and NDA and certain very deep secrets were only available under terms of a much stricter NDA - the most restrrictive I've ever seen. Finally some further aspects such as deep details on performance aspects of the pipeline which are essential for the ijmplementation of top notch code generators and compilers were not available outside of Intel at all. I don't want to single out Intel but just picken them as an illustrative example Companies very in their degree of paranoida, protectiveness and openness and corporate history and experiences are part of that. RISC-V may be an open architecture. That means the architecture is open. It does not mean an actual implementation is open. It is possible to implement a fully RISC-V-compliant processor under the terms of the RISC-V licensing conditions - great. Yet iI can keep the implementation as closed as a traditional microprocessor implementation from companies such as Motorola, IBM, Intel, AMD, HItachi, MIPS, ARM etc. The result may be something that executes RISC-V code just fine yet for certain aspects such as performance has to be treated as opaque, as a black box. As somebody who has ported Linux to MIPS it's been occasinally helpful to have access to the folks who did all the mental heavy lifting and wrote the specs. With a RISC-V compaliant core one may or may not have the same kind of access. for a particular project.
@@conorstewart2214 While this is correct one should consider the architecture defiinition of any processor architecture as something that sets the absolute limits of what's possible. A good implementation can reach for near 100% of that; a bad one will stay well below. With my MIPS experience I found the size of early MIPS and RISC-V cores which are somewhat comparable And the RISC-V implementation is much smaller in terms of transistors / gates which is an indication of how polished the architecture is. Ok,RISC-V had the benefit modern software tools to aid the implementation. Such comparisons across decades are bound to limp somewhat. An interesting aspect is how the RISC-V architecture is made up of several optional parts. Just to pick one example, an implementation does not need to have a multiplication or division instruction. They were looking at other architectures' pain points. MIPS was born as a super-fast RISC processor for super-mini computers, later workstations and servers. Nobody early on thoght of embedded computing. Such omissions are hard to rectify lateron in a clean manor. One point where RISC-V is brutally efficient is cost due to absence of licensing fees for the architecture itself. To some users that's the #1 aspect that matters.
One more comment. Most processor manufacturers produce a spec called DMIPS/MHz, or millions of integer calculations per megahertz clock speed. This allows you to do a clock for clock comparison between parts.
Let me wind back the clock to the mid-80s to point you at the horrors of the Dhrystone benchmark which back then was more or less the canonical benchmark for integer performance. Even in the best case Dhrystone results didn't represent real world performance very well. Dhrystone wasn't only ignoring fp math entirely, its results also got more and more comically absurd as architectures got more sophisticated (caches and out-of-order made a giant difference) but also as compilers improved and started to "optimize away" part of dhrystone. The peak was rached when certain compilers started to recognice Dhrystone and applied Dhrystone-specific optimizations for almost arbitrary benchmark results - whatever marketing orders ;-) It gives me headaches to see parts of the industry are still using DMIPS decades after it's been throughly proven to be rubbish. (It seems many folks don't know these days - the D in DMIPS stands for Dhrystone).
It's an artificial value though and not very helpful for real world performance comparisons. I know this because Qualcomm quoted quite a high DMIPS for their Krait CPU core back during the ARMv7-A generation, and it routinely got thrashed by the lower DMIPS rated Cortex-A9 based SoCs in actual performance.
I was surprised that the now somewhat venerable Black Pill did so well in these tests against the newer upstarts, especially in power consumption and power efficiency. Thanks Gary!
Not just "black pill" but stm32f401 or 411. Today one mc on black pcb tomorrow another... And f411 has 12.7mA at 100MHz core with periph disabled. Not all periph is need to be on. I have doubts about this video test 20mA. The Chinese have a lot of analogues stm32. And for example CH32V203 (f103 clone with riscv core) has 8mA at 144MHz. CH32V30x ( riscv core with fpu) 12mA at 144MHz. And they have ever tssop20 case. As f103 clone CAN onboard, that f411 doesn't have. And 307 has Ethernet, 208 has bluetooth + Ethernet and 2.2$ in my local store. I have not been interested in buying stm32 for a long time. Only Chinese only like stm32 has CH32, HK32, AT32, GD32 and so on.
I appricate that Gary is right here that RISC-V is not yet *as* effecient as but I'm very impressed that RISC-V is already *almost* as efficent as ARM with for the same processes being run 1.36mWh compaired to the equivlent ARM board getting 1.31mWh and even compaired to the *much* more established Pi Pico, it's only 8% less effient (than the Pico). Obviously being almost 89% less efficient than the Blackpill isn't ideal for RISC-V but this is still early days for it compared to ARM and just with there being so many less RISC-V processors produced vs ARM, I don't think you can expect it to be beating the leaders of the pack in ARM just yet. Maybe when there are as many models of RISK-V processor as ARM processors the leader will be arm. Maybe with more time for tuning, the leader of the RISC-V pack will beat the leader of the ARM pack; even with less models out there. Encouraging stuff. Stating my bias: I want RISC-V to succeed as I think open source is the way forward and garding "intelectual property" like dragons over gold, is holding humanity back. Thanks for the interesting video Gary!
I agree. RISC-V is not there yet but made a very good showing being the new kid on the block. ARM has been at this game for decades. It is unrealistic to expect the new kid to outperform the veteran. ARM has been optimized over decades. RISC-V has to pay its dues to take the crown. I am strong RISC-V advocate. I look at this as there is plenty of room for RISC-V to improve. The ground to cover in some areas are not that great to close the gap.
@@xade8381 that's not correct. ARM started to be designed in 1983 and the first chips and boards were in 1986. ARM the company started in 1991, when there were already 100,000 ARM-based Archimedes PCs in use. RISC-V started to be designed in Berkeley university in 2010 (27 years after ARM), the initial frozen spec was published in 2014, the first board you could buy commercially from the first RISC-V company was in 2016 (30 years after ARM).
@@xade8381 RISC-V was still an educational tool for years though, with zero plans for reaching any sort of market. Whereas ARM was made from the very beginning as a commercial ISA, and is 30 years older to boot. Not very comparable.
Great work as always. Benchmarking is always a can of worms because it is as dependent on the application as it is on the processor. Do you need fast integer? Fast interrupt response? Floating point? DMA? If you used newer M3 and M4 parts they would have performed much better even in this integer-only test both with regard to processing speed and power consumption given that they’re built on *much* newer process nodes. And a recent STM32 M7 would’ve blown everything else out of the water.
Why are so many of the commenters here so obsessed with process node? It strikes me that many (not aimed at you in particular Mark, sorry) may just be reciting jargon without understanding it. Even a very old node such as 180nm is good enough for making a 300+ MHz chip (e.g. the SiFive FE-310 on many RISC-V microcontroller boards) which is plenty for anything in this test. Smaller process nodes do allow higher clock speeds, but if you're not USING that ability then they are not just a waste of money in the much more expensive design and manufacturing process, but they may actively be WORSE because of things such as higher leakage current when operated at low clock speeds or in low power sleep modes. It's also a complete waste when you're making a simple stand-alone chip such as a microcontroller with a small core and a small amount of SRAM because even with the old nodes you end up with the actual processor&memory being a tiny little square inside a huge bit of silicon with the I/O pin pads taking up 90% or 99% of the extremely expensive small process node chip area. The default assumption unless you're a real expert should be that the manufacturer has chosen the best process node to optimise what they want to achieve with their chip.
We are familiar with needing to sample the test code many times to generate benchmark results which are not misleading, but it is also essential to sample different kinds of test code, to not be misled even by random compiler differences on each bit of code tested. With the performance results between the esp-c and the black pill coming within 1% of each other, that suggests the test was entirely memory bound on those systems and the systems share very similar memory systems. Multiple programs need to be benchmarked for a picture to emerge.
@@BruceHoult In general I would think that a smaller feature size would mean less parasitic capacitance, but I didn't think about leakage current. Is that from quantum tunneling? I wonder where the sweet spot is for that. But there's also the matter of different topologies like finfet and gaa, that might reduce the switching current. Mostly I think it's an economic decision. Everybody wants better speed and battery life, but how much are they willing to pay for it? For a computer that only runs a single program continuously, all you need is "good enough". Microcontrollers often have external power anyway. The main concern vis a vis power consumption is cooling.
I just bought my first RISC-V chip, an esp32-c3 from adafruit. Mostly bought it to learn RISC-V Assembly. Generally want to learn AVR, ARM and RISC-V Assembly.
Sounds like a great plan. All are good ISAs. If you have any questions the Reddit /r/asm forum is pretty good for any ISA, and /r/avr and /r/riscv are helpful too. Sadly, /r/arm seems dead and/or non-technical.
A nice explanation as always. But I'm missing the sleep current for the different boards. It would be intresting to see how they perform compared to eachother. It is more if a comparison between MCU brands than core architechture, but still! :D
A huge factor for efficiency is compiler quality which grows with age. The major design differences ariund efficiency is stuff like dark silicon for common tasks and SIMD engine implementation plus caches.
Crazy to use only a single RISC-V board as representing a whole ISA. Obviously not all ARM cores or boards are created equal, and neither are all RISC-V cores or boards. Espressif doesn't even say in their datasheet what RISC-V core it uses. Crazy also not to include Sipeed Longan Nano ($4.80, 108 MHz, been around for three years), some Bouffalo lab BL602 board (similar price to ESP32s, we know it uses a SiFive core) or even extend the price limit a fraction to include a K210 board (dual core 400 MHz 64 bit) such as Maix Bit. Still, it is interesting to see that from the same chip/board manufacturer the RISC-V does in fact give better performance per MHz and per Watt than what they were using before. A really interesting test would be the Longan Nano (GD32VF103 clone of an STM32 but with a RISC-V core) vs either a GD32F103 (same manufacturer STM32 clone with a real licensed ARM core) and/or a real STM32F103.
It’s always strange being reminded that people think RISC-V is inherently more efficient than ARM. That’s not why people like the architecture. It’s an open standard, whereas ARM is proprietary. Anyone who can make a chip can make and innovate a RISC-V design, not the case with ARM. That being said, this was nice to see. I’m sure it has a lot of people blackpilled now.
11:20 how can current stay the same? I have searched many and many voltage regulators. They all come with no load or quiescent current. If you don't use a voltage regulator, it must use less current. This is not complicated.
Good video. 13:59: Board A uses 20mA·26s = 0.52 Coulomb = 3.2448·10²¹ electrons to accomplish the task, and Board B uses 51mA·18s = 0.918 Coulomb = 5.72832·10²¹ electrons, so Board A peruses only ~57% of the electrons that Board B uses. Therefore A is more efficient.
One other important factor on powe consumption, is how low ot can get when not doing much. This can be very important when making battery operated things. For example, the Nordic nRF52840 can gwt down to less than 5 mA with BLE running, and lower in LP modes, while I have not gotten the ESP32-C3 down to less than the same 38mA with BLE on, even when the CPU is just waiting. I'm working on this, trying to get the C3 to run more efficiently, because it is otherwise a very affordable and capable module.
I really love the black pill , actually currently I’m working on project using it (STM32F411CE), so glad to hear it did will in the benchmark, but Gray I have question Did u write the program for each board in assembly or C ? In case the answer C , then What compiler did u use for each one? I hope didn’t throw up a lot of questions 😅😅. Amazing work man, thanks a lot for this benchmark and I hope see more of them!!
Gonna take a while for the RISC-V manufacturers to figure out how to design really great chips with it, but there's no reason not to expect it will be roughly the same as ARM in the long run, just with an open ISA which is an absolute win on its own. Hobbyists who aren't trying to squeeze every last bit of performance and efficiency out of their projects should support RISC-V to help it along and encourage faster development. It's already outpacing ARM's development, which was already quite rapid.
The really interesting bit is RV32E with Zc* extensions. It essentially repurposes a bunch of the the floating point compressed instructions allowing a 16-bit only CPUs with half the registers. That'll be a tiny core.
@@mrbigberdminor nit: they still have to implement the 32-bit instructions. there's no base isa that's compressed-only. there are a few hoops to jump through, but it's not hard for an individual to contribute. and i've been tempted to contribute for months now. (i have a number of ideas i'd like to fling their way already.)
@@dead-claudia That's not strictly necessary as the 16-bit only format is Turing Complete. There are still 10-ish opcodes still left and a couple of them could be broken down further to provide a few more 2-reg instructions. Most importantly, a 16-bit only extension would allow the use of the 11 top-level opcode space. This would increase total instruction space by 25% in 16-bit only designs and that would give enough space for a massive 64 2-register opcode space using the CA instruction format. That's enough to add in more branch instructions, A, B, M, CSR, Zicond, Zacas, etc. Going further, the E-series only has access to 16 registers. Reclaiming those bits for CR gives 2 extra instruction bits (4x as many instructions). CI doubles its available instruction space too. This would open a path to add a basic Vector/DSP extension too.
Well... There are certain extras in your core implementation that will make a difference; stuff like the different caches and the coherency mechanism, the branch predictor, cpu internal bus and the bus arbiters, there's just so many extra internals that are all abstracted away in complex logic. Some of that complex logic is just more appropriate to implement in another program, i think some of the cpu caches are governed by a whole other "management engine" that runs its own firmware to keep track of the bits in the cache....
Back in uni, I still remember the active power (total power - leakage current power) is proportional to square of frequency. Can we use it to extrapolate the power usage of the pi to 160 or 240 mhz?
Active power is proportional to Vcore^2 * frequency. Not frequency squareq but just frequency multiplied by core voltage squared. You may get around frequency squared when cores are pushed harder than above mentioned microcontrollers (not as hard as full boost latest Intel or AMD chips, frequency still has to be supported by changing core voltage).
@@leonardosabino2002 i literally wrote the very same formula: Vcore^2 * frequency power is proportional to frequency and to voltage squared. Power scaling does also resemble frequency squared at some part of volt-frequency curve (probably 0,7 to 1V region for latest chips)
@@volodumurkalunyak4651 Not the same formula. Look again, it's the -frequency that's squared.- EDIT: I just looked up the formula, looks like voltage squared is correct. Sorry about that.
In general, performance of risc-5 is not up to the standards of ARM. Full stop. But this battle does not stop today. ARM just announced, that they will charge their customers in future based on the device prices instead for IP. That will drive the research in the area of Risc-V up. I expect the Risc-V to become a contender in the Mobile Phone space (low end) in about 3 years and in the high end market in 6-7 years.
@@michaelkaercher I am waiting and drinking my tea while the attention trolls on UA-cam keep asking me for all the love they didn't get from their Moms. :-)
As I watch, I get questions, and as soon as they pop into my mind, Gary already responds to them. It's rare that a tech video is this well thought out and structured this well!
Very interesting article. I have always thought about how RISC-V would be compared to ARM. Do you have similar comparising for enterprise chips too? comparing RISC-V with x86 (Intel/AMD) and perhaps also including ARM?
Are all these microcontrollers fabbed on same process node (and by same manufacturer), for example fabbing m4 or 40nm and 20nm will differ in performance and power efficiency.
Hi Gary, it would be really interesting to have you do a Intel Atom/E-core (Alderlake/Gracemont) architectural deep dive video, and a comparison to Arm/Risc-v.
Ah, it can't be compared. Atom is an x86 cpu, and it depends on how the cache and fsb are set. Arm chips usually operate at max 0.5 volt atom can go up.to 2 volt on turbo so is definitely diffrent clas of cpu ah why not against an ia64 cpu lol 😆 wanna se that race 😆
@@adriancoanda9227 In the end what makes the difference between slow and fast, is 99% software. I would win that race if I am the programmer : would use inline assembly, lookout tables with pre-computed values, would not miss the caches with visibility list, local goto.... Sofware always wins.
@MarquisDeSang not always. It still needs hardware to run on y saw some remastered games to be used via the browser chromebook target it ah tnd that launcher looped 2 gb of data but target just on cpu core so the loading screen took 10 minutes search for ah y like to see you in quantum pc your thinking won't apply there cause it is not a digital cpu is analog and capable of insane parallel computing and it exists already a portable one withou a transmission it won't run any apps like you are used to, 😉
The difference is that STM32 is from a company, i.e. STMicroelectronics, to support the ARM chips they make. RISC-V is an architecture, so you need to pick a company and see what it provides. The Espressif chips seems to have a mature development system for all their processors including Arduino support.
What about DSP math tools, QMath, and feature like this, are they included in the Espressif ?, is there nice youtube session to give me overview. I review that last time (1-2 year ago) but not confident with IDE setup that seen so far
You appear to have detailed requirements, and I'm not sure I can provide a thorough response that fully addresses all of your questions. It might be best for you to reassess these platforms to determine if they align with your needs.
So - I've implemented an ARM and RISC-V architectures "on paper", and RISC-V is simpler in ways that pay. There's only about a dozen basic choices that even *can* be optimized out in the core ISA yielding an architecturally pure ISA. My less favorite parts: * The opcode, funct3, and funct7 not being unified in decode step. * The LU opcodes not mapping to truth table means LU operations are not simple 4BD decodes with add and mul being 2x4BD+1 decodes. AFAIK no commercially available ISA has ever achieved this, but it's been discussed widely in academic circles. ARM though has for example the Java bit which *halves* the available opcode range, and is AFAIK based on an earlier RISC platform with some commercial extensions. And sure, some of that makes it fast, but it's going to be less efficient per wire than a cleaner ISA. There's actually tons of details I don't like in it.
May have already been mentioned, but Amps != power. When you change the input voltage to 3.3V and the current doesn't change, that indicates a change in power. I'm not fasmiliar with these boards, so IDK what the initial input voltage as, but if we assume 5V, and the current doesn't change when switching to 3.3V, then that is a 34% decrease in power.
About the ESP32 power consumption when powered by 3V3: cheap LDOs like the AMS11x consume quite a lot of power when only voltage is applied at the out pin. Is in the range of 3-10mA
@@GaryExplains Yes cheap LDO Voltage Regulator like the AMS1117 consume power even when no voltage is converted by it. this is called "Quiescent Current" and can be found in all LDO datasheets. For the AMS its between 3 - 10mA and is the main reason why cheap EPS32 boards consume above 1mA when in deep sleep. There are better one available but they cost 60 cents not 6 cents. I had to find this out the hard way when when designing a battery powered ESP32-S3 board. Its enough to have 3V3 on the output pin of the LDO for this current to flow from the 3V3 output to GND through the LDO chip. its a kind of leak current. an easy way to fix this is to just desolder the LDO and power it directly with 3.3V on the 3.3V pin
Were the connectors taken into account? USB C has transfer rates close to 10Gbps while micro usb is pushing over 450 Mbps. Then as far as power, USB-C handle nearly an order of magnitude power than the micro usb at 100W. Just curious.
As I see it: Arm's been around for a while. It's had an awful lot of work put into its efficiency, power, etc. over the decades. RISC-V is new, and there isn't a lot of money in perfectly optimizing it (yet). The fact that it is at all competitive now is a good sign for things to come, but it's gonna need more time, work, and support to be fully realized in this regard.
I would love to see the xiao nrf52840 board or equivalent, put to the test as it is running at 64 mhz. This is the microcontroller used on a lot of smartwatches. Plus it would also be interesting to see the boards already test, retested at lower clock speeds, if that option is available. I know some esp32 can have the clock lowered. For pure power efficiency, I believe lower clock speed tends to be more power efficient for the same work done, as power usage tends to go up on an exponential scale, whereas processing power for the same processor goes up linearly. If the amps are the same at 3.3 and 5, then it is using an inefficient regulator to drop the voltage. Just curious, did you calculate the power efficiency using 3.3 or 5 volts? I am not a fan of any architect as I just use whatever is better suited to the task. Of course having one that does it all would be nice and save having to learn all the differences, but now that assembly language is rarely used, it is not like having to learn an entirely new instruction set. By the way, if anyone gets a xiao nrf52840, if they say double click the button beside the usb c, the double click speed is a bit slower double click than I was used to. Took me a lot of tries to get it right. Luckily someone mentioned doing a slow double click somewhere.
2 роки тому+6
One detail that you missed is that the Pico and Pico W do not have a linear regulator; they have an on-board buck-boost switching power supply. Current consumption will not be constant; it will go up as voltage decreases.
Not sure if this is a valid question, but here goes. Based in these clock speeds, could one of these chips act as a processor in a micro DOS or Windows environment? Thinking kiosk that runs a corporate webpage and allows customer data entry or order entry on-site. Or tiny web book or a tablet just for web or ebook reader where its mostly text. I know that the Pi, which is more powerful and has a video decoder is slow at video and graphics. Just thinking that if not much computing power was needed, you could pair with a mid power graphics chip for running the display and decoding video streams. Then maybe you get TVs with minor computing and networking power. Or is this how they are making smart TVs?
Hopefully you look into purchasing the DeepComputing/Xcalibyte ROMA RISC-V laptop (or a related RISC-V laptop or desktop) for a future video, but much more refinement will probably be necessary for it to reach it's full potential.
Useful test but not good test on the topic of CPU core efficiency for several reasons: 1. likely system bus speed differences between these (system bus interfaces to on-chip SRAM) obfuscate differences between true CPU core performance/MHz/Watt unless you downclocked all of them to lowest common denominator system bus speed, 2. differences in flash memory/prefetchers further obfucate CPU core performance unless you ran the benchmark from RAM and even then some like M3/M4 could use dual-buses 1 for data and 1 for instructions making it unfair, 3. finally at least some of these probably manufactured on different process nodes
@@GaryExplains Actually I didn't finish watching when I commented, I see you ran all of them at 1MHz later to level the playing field and I assume system bus was dropped to 1MHz also and that's a first important step. I would run all of these CPU cores at the system bus speed of the lowest common denominator system bus speed. The second step is to link to run the code out of SRAM instead of Flash on all of them. That's probably the best you can do to isolating core performance efficiency.
Gary, there was an article that i read about a week age that Apple may be shifting away from ARM to RISC-V. What do you think that Apple will switch to RISC V or continue with ARM for the time being?
If we read the same article it says that Apple is using RISC-V for some of its small co-processors, that is all. It is a good engineering choice, if it has to design bespoke hardware blocks then RISC-V is a workable solution.
@@GaryExplains Maybe but that Article had some text about moving to RISC-V that Apple might be considering. Moving to RISC-V would benefit Apple in long-term as they wouldn't have to keep paying ARM for royalties or whatever deal they have with ARM. What's your take on this?
You mentioned that you encryption algorithms don't use floating point or integer division, but does use bit manipulation. I'll ask if it also uses integer multiplication, because multiplication by default comes in the same extension as division, but was also made available on its own as Zmmul. Bit manipulation instructions beyond basic bitwise logic are also their own extension B and its parts. Did the RISC-V processors used support these extensions, and if so did you tell the compiler to use them when compiling your code?
Gary did you check the real clock speed of the RP2040. The maximum clock speed is 133 MHz but in the SDK it is set to 120 MHz because it is easier to get the correct clock for peripherals like the USB. Check SystemCoreClock in the SDK. Are you running the test from RAM or XIP? You probably see a difference here.
There is actually even more fun stuff here: The chip has 2 PLLs; it sets one to 48MHz for USB, and one to 125MHz for CPU and bus clock. 125 is also much more manageable to get useful clocks for other peripherals as you said. You can also push the pico MUCH further than what it is specced for. I have run complex programs with PIO and PWM at 300MHz just fine running from RAM, and ~250MHz when running from XIP.
@@GaryExplains Thank you for your response. By "specifically tailored for RISC-V", I meant a version of the Python interpreter that's been optimized to run on RISC-V architectures, taking advantage of its specific features and instructions. Just as we have optimized versions or builds of software for different platforms or architectures (e.g., ARM, x86), I was wondering if there's an equivalent for RISC-V. Essentially, Any version of Python that might offer similar performance or other benefits when running on a RISC-V system.
Hmmm... I am not sure that Python has special optimizations for different architectures. I just downloaded the Python source code and I see very little code that is optimized for say SSE3 or SSE4 or AVX. There isn't much assembly language either. I see a little bit of x86 ASM code in one of the math libraries, but there isn't an equivalent for ARM64. It is just C code in general. 🤷♂️
A decent video. And yes, instruction set architectures don't largely impact power efficiency. Hardware implementation however impacts efficiency far more. But there is nuances on the ISA level that sets limits for actual implementations of the ISA. Be it limits on minimum transistor count, power efficiency, peak clock speed, etc. Sometimes one has to trade one aspect for another. As an example, a resource efficient architecture using few transistors will generally not offer all that great peak performance. While a more peak performance oriented ISA will tend to be hard to build with few resources. Power efficiency is meanwhile largely decoupled from this view of complexity, since power efficiency is more about how well a given piece of software can make use of the architecture provided. It is oftentimes better for efficiency to have dedicated instructions for complex tasks, but what tasks to choose is a debatable subject in itself. If one throws in everything but the kitchen sink, then it is often far from trivial to make an efficient hardware implementation of it in practice. In short, designing an ISA is all about compromises to reach a prespecified goal. And then make a good hardware implementation of that along the way. Then it is up to the market to find/make applicable software for it.
I'd say that when it comes to efficiency, a number of major interest is how much power the chip/board burns while idle. Typical MCU systems are not to crunch numbers, but for control purposes. Numbers when ready to respond to Wifi may be the most interesting, but of course there are also applications not needing wifi while waiting to do a bit of work. As usual, comparing MHz across architectures is not useful, a more realistic yardstick could be a "maximally trivial" task like how fast it can count.
It's not that surprising that the winners are the ones with faster clocks especially when in dual-core configurations the second core is just set idle (why?) This test was designed to benefit single cored and higher clocked processors. Multicore processors are known to have better performance at much lower clocks frequencies and consequently more energy efficient. I'm having a very hard time to understand what's the motivation here. However, obviously the performance is never about a difference in ISAs, especially if all of them are RISC architectures. Put some CISC ISAs in the mix and you will see huge differences, though. Also chips that run at faster clocks generally consume more energy. That's the whole deal about multicore architectures, to have high performance with low frequencies. The whole deal of ARM processors and their use on mobile devices is exactly that. No surprise here either, since this is obvious. The only surprise is to see a processor running at 72 MHz consuming more than one at 160 MHz. I think this must come from the fact that you are measuring power at the board level, not at the processor itself, otherwise we would see this reversed. Now about RISC-V. There is no way RISC-V processors could compete at any level with ARM processors that have been far and wide used in smartphones. RISC-V processors are new kids in the block that are running far behind. RISC-V is still lacking the support needed to be better than ARM processors. But there is a huge advantage of RISC-V, though, that cannot be measured for now. It's potentially much cheaper to produce RISC-V than ARM processors, since it is an open and free ISA. However, we cannot still see the advantage in prices because they are still not produced in high volume. Volume production is everything in chips prices. But we can expect to see much cheaper RISC-V processors in the future to the point of beating ARM processors prices. I think that's where the RISC-V will position itself as a competitive ISA.
@@GaryExplains: Thanks for your comment, Gary. You are probably referring to the hypothetical comparison if the processors were all running at 1MHz. You know that just multiplying the time by the clock frequency is not a very accurate performance indicator. I am looking forward to see the comparison between dual-core and single core processors. For the kind of comparison (very repetitive and computing intensive tasks) you are doing, you would generally be better with dual-cores. However, that's not always true. As I stated in another comment in another video, modern architectures have lots of intrinsic parallelism (that translates into several instructions executed per cycle) that simply don't work when you impose atomic execution to synchronize threads. That benefits single cores better than multicores. In my estimate, to start having clear cut better performance in multicore you need at least 8 cores, unless you don't use atomic operations. That's the reason smartphones have dedicated cores for certain activities, because in this way you don't need synchronization. The advantage of these configurations is simplicity, you don't need load balancing. But the problem is that you will have most cores idle if their correspondent activities are not taking place.
Except for the raw performance test (ie how many Ms to complete the task), none of the higher clock speed microcontrollers won. As for the clock frequency, in my previous video I actually changed the clock speed, and while performance isn't perfectly linear it is quite close, certainly close enough to make meaningful comparisons.
@@GaryExplains : Thanks. I didn't see your previous video, so I just assumed you multiplied the frequency by the time. It seems I will have to see this video again to understand what you mean with "raw performance". I probably overlooked that. Sorry.
The ISA does make a small difference, and the fetch-decode speed was a large factor up until high mHz and pipelined branch prediction. A clean(ish) slate approach to both the ISA and IMPLEMENTATION SPECIFICATIONS of Risc-V working in tandem is what gives RISC-VECTOR the edge. -- A proper Vector Processing specification instead of SIMD (an ISA DISASTER that should have stopped at SSE4 on the X86, and should never have been introduced into the ARM ISA... A Vector processor would have been vastly preferable and the tech was well proven., -- A major benefit is to combine CPU + GPU programming into one (much more) bare metal ISA for both, eliminating a ton of API translations and JIT compilation, Short but quite efficient or very long and perhaps more efficient pipelines can be experimented with by LOTS MORE CHIP (PART) DESIGNERS, while developers get a STANDARDISED ISA.. -- Bare metal GPU Compute will be much EASIER. Integrated graphics and general purpose Vector processing compliment each other, but software-only graphics systems using just the vector processor and a few CPU cores could be more efficient and good enough for web + office..
@@GaryExplains .. not yet, and hopefully never! I agree on the microcontroller front RISC-V is no better than the Pi Pico spec. It's also less RISC than the pico.. The low end RISC-V spec now includes basic, MMX level integer SIMD, probably FP SIMD when it's finalised then extended, so quite bloated compared to Pi Pico ISA. -- I'm an ARM fan but think the High End RISC-V spec is a better idea (Vectors vs fixed sized SIMD).. Risc-V is an ARM killer, X86 never was... ARM is still the most likely X85 killer but Intel and AMD will probably race to replace X86 with native Risc-V and emulated X86. 10s to 100s of smaller SOC designers and manufacturers will obviously also prefer Risc-V. -- Sadly ARM's days are numbered. It may well have to abandon its ISA and many core implementation details when it too goes Risc-V.. Open Standards are very powerful forces.. Look at IBM PC, HTML + CSS, Unicode. For better or or worse, these royalty-free technologies alays dominate. -- I actually prefer 2 byte opcode ISAs using a few tricks and vector processing over SIMD. de-bloats the cache and pipeline. Risc-V is getting more bloated despite its lack of SIMD. Too many cooks spoiling the broth will be the reason Risc-V fails, if it does, which it probably won''t. A (US) Big Boy could buy out the project I suppose, and ruin or bury it, but that's unlikely too.
I think it quite surprising that a 13 year old design stands up so well. I would suspect that if the power saving features of more modern ARM processor designs were to be exploited for a micro-controller SoC, then it might do better still. However, presumably the priority has switched to producing much more powerful, low-power architectures for use in servers, laptops and the like. producing the ultimate in low power micro-controllers is probably not a priority as these things are rarely required to do heavy number crunching.
Great video! I'd love to see one where you analyze just power efficiency. I use microcontrollers around my house to monitor just about everything. I'd love to know which would last the longest on a battery. They need WIFI so they can report in. But my requirements use very little processing. Just check the sensor and report in. Thanks!
You aren't doing anything around your house that requires more than a lemon battery's worth of power. What you would need, though, are low power drivers for your network, which are hard to get, it seems. Just use whatever works and plug it into the wall. Who cares about a couple of Watts of extra power consumption.
I would have loved to see more different benchmarks hitting different areas of the MPUs, since concluding based on one very specific crypto-benchmark not even using floats seems quite off to me...
LOL, other people complained when they thought I was using floats (as some MCU's don't have an FPU). I just can't win. UA-cam comments for the victory! 🤪
Outside of very specialised areas, almost no software uses floating point on desktop computers, let alone on microcontrollers! I've been programming professionally for 40 years and 99% of C programs I work on don't even have the word "float" or "double" in them. Gary's previous "Primes by division" benchmark was quite unrepresentative of normal programs, but this one sounds pretty good (I don't know if the actual source code is available?) so I for one applaud this change.
@@BruceHoult "almost no software uses floating point on desktop computers" u wot mate ? Browsers and games are "almost nothing" ? Though to be fair, I don't know much about other software, but I'd be surprised if these would be the only major ones. Still, I'd also say it's kind of irrelevant what desktop-level software use and then compare to what MCU-level software uses.
@@Winnetou17 "outside of very specialised areas". Games and browsers are specialised. A lot of people run them, it's true, but they constitute a very small proportion of the lines of code written or programmers employed.
You are comparing what against what exactly? For ARM, there are many different versions of the ISA (instruction set architecture). v7, v8, THUMB, THUMB2 being only the major families. Let's say you take the latest and greatest of these: that would be ARMv8 with Thumb 2 instructions. For RISC-V, the situation is more clear: There is the RV32I (32 bit) and RV64I (64 bit), with I = basic/integer, and extensions M (multiply/divide), A (atomic operations), F (floating point), D (double precision). Collectively IMAFD is called G. There are compressed instructions of the I set, called C. Then there is the V extension for "vector". Also there is the H extension for "hypervisor" I think that when comparing ISA's it would be fair to compare ARMv8+THUMB2 with RV64GCVH. Now of course, somewhat decent RISC-V boards are coming available just about now, and efficient CPUs with the ARMv8+THUMB2 are now on the verge of beating Intel/AMD in laptops and servers. So it is just not fair to compare the current implementions of both instruction set families. You can compare code side: RISC-V linux executables are smaller than both x86_64 and ARMv8 for the programs I compared: ls, mv, cp, sshd, gzip. This is in contrast with what everybody claimed: C programs should be bigger when compiled to RISC-V machine language because it is RISC and the other two are CISC. Well, ARMv8 is technically RISC, I read, but compared to RISC-V the language is huge. However, code size is vanishingly small compared with data even on a Windows system. Still, RISC-V Linux has consistently about 10% to 20% smaller executables. You could also count the number of instructions executed for a certain task, say sorting an array, or compressing a file, or computing something scientific and massively parallel. Then you can compare the number of instructions used in RISC-V vector extensions against ARMv8 Thumb2 instructions. Still there is a caveat: RISC-V V extension is vector length independent. Newer chips can run the same binary more efficiently when it has a larger vector length. You can do normal performance benchmarks but then you are comparing hardware implementations, not the ISA's.
@@GaryExplains You are partly right. I actually did watch it before but I more or less forgot. I now watched it again. My issue remains though: If you are comparing the efficiency of an ISA to another ISA, that is really hard, I think. It depends on the qualify of your assembly program if you are programming that directly. Or if writing C, it depends on the quality of the compiler. The compilers for RISC-V may not be as mature as those for other archs. Especially for critical fast code using the vector instructions. So you can count cycles for instance, and see in how many cycles each arch can get a certain task done. Still not really fair: CISC can do more in less cycles presumably, although x86_64 instructions can take 10's of cycles and RISC-V does 1 cycle per most instructions and maybe 3-4 for difficult ones. Anyway, I have always wanted to start writing assembly, but always found the ISAs way too complicated. Including the various ARM ISAs. My last real experience was with the 6502 (C64 days), and I only tried it when those days where almost over. But now there is this new promising ISA that is simple enough for me to learn Assembly from scratch. So I am exited for it and I want the platform to succeed. I have a Milk-V Mars on my desk but have not been able to boot it from a eMMC card yet. I also have a Milk-V Jupiter on order which has the vector RVV1.0 extension. And I have pre-ordered four of the Milk-V Oasis boards with the sg2380 chipset. I have tried some assembly in an RISC-V qemu machine running Ubuntu that works surprisingly well. Anyway, how would you go about comparing the relative efficiency of two ISA families? Can it be done?
The comparison is not with new hardware. The visionfive 2 board looks to be 4 core risc v and by having a risc instruction set allows for better parallel processing, making the possibility of higher efficiency. The ability to boot from an nvme and the concurrent processing will need better coding , to achieve faster processing .
@@GaryExplains just a big improvement on visionfive 2 board efficiency’s. Not Risc-v specific. Currently no soc boards have nvme boot up and processing, not even raspberry pi.
While I agree that it isn't necessarily linear, as far as I know that is only if the voltage changes with the frequency. In my testing I didn't only use extrapolation, I did clock them (where possible) at the same freq and the results correlated with my extrapolations.
Out of curiosity how would the old ATmdga328p fair in such a comparison. Max 20Mhz , very very old node (I think I once looked it up and it was still in the micrometer range).
The silicon fab processor node tech used to make the chips plays a huge role in their efficiency. It would be good to include fab node info in the comparison data.
Indeed, it is something I will note for future videos. As for this video the key is that the Arm Cortex-M4 is using 90nm and the RISC-V ESP32-C3 is on 40nm, which makes the performance of the RISC-V processor even worse.
I have to say that only the last plot (mWh to the task) makes at least some sense... But in general I would say that you can not generalize these boards and compare them directly. MHz is not linear to power comsumption. It's quite simple: The esp32 boards can run at 240MHz and are there for the fastest. It does not matter if the M4 can "compute more per MHz", if it is capped at 100Mhz and therefore is still slower to do the task... If you are looking at power efficiency you probably do not need those high clock speeds anyway. You can power down the Modem of the ESP and that will cut down the power substantialy. If you want to compare the ESP32 to the M4, you should clock down the ESP to comparable levels and run the tests again.
Hmmm... If you look at my previous video about microcontrollers you will see that I actually did change the clock speeds. While it isn't linear it is very close.
Arm is a risc chip. Also, it stands for reduced instruction set. Actually, you will nrrd to have the same motherboard with a socket mount in order to exclude other factors in the testing, but even then the fastest chip was at 240 mhz y won't se where those can make a use maybe in remote controls, elsewhere those are to slow, or use them I a insane cluster 999999999999x cluster but you will need a dam fast cluster management running within the firmware
Why not transistor count instead of energy used, too many variables. Assuming transistor numbers usually correlate to cost ultimately… to show what architecture more efficient for the theoretical cost of production (if they were same fab, same node)
Transistor count doesn't correlate in any meaningful way. It won't help you decide what size battery to use etc. Power usage is the most important thing, everything else is just statistics.
clock speed scaling is definitely not linear enough to fix afterwards, you should down clock all of them to the same speed, if you want compare at the same speed...
Clock speed scaling is linear on microcontrollers. They are in-order and deterministic. Plus I did actually change the clock speed on many of the units to check that, and it is.
I was just thinking the current measurements aren't very useful because of all the extra stuff on a lot of those boards. Plus the esp32 are not known for low power. You would have to compare active current with the idle current of each board.
The tricky thing with a delta number is that a CPU can never actually be idle. Even doing nothing is still looping and reading instructions waiting to no longer be "idle". To help in this situation there are two general solutions. 1. Lower the clock frequency and the voltage. This is something that smartphones and laptops do. 2. Put the CPU to sleep, this is a feature MCUs tend to have and it is similar to 1 but not dynamic.
@@GaryExplains thanks for replying. The motivation for the delta is to see the difference between the dynamic power consumption of the cpu architectures. I take the point that the cpu is never really idle, but I the case of MCUs, it should be at least the cores are idle, or running noops. I think the data would be interesting nevertheless. Idle power in itself would be interesting, so all 3 data points tells a story, idle, full load, and 'full load - idle'. Its quite surprising that a 22 year old design/process can still beat a 2 year old one.
I guess Gary, you really should put out a video series explaining the differences between ISAs, microarchitecture, process node etc. to the general public, as I have watched many people are disagreeing with you on various issues. I think this video series will work as prelude to ARM vs RISCC-V video BWT i also felt that I need some more help 😅😅😅😅 on this. Thank you
First off, nice that someone takes the time to do benchmarks; we can really use some more of that. However, I also think any benchmark that leaves out the different basic types is inherently flawed. An int32 benchmark is nice for pure int32 operations, but it still tells me nothing about int64, float32 and float64. For example, the ESP32 has an FPU for float32, but not for float64. It also leaves out any peripherals - but that's okay (if you need a certain peripheral you should just select on that)... For example, I have a few ESP-S2's here that use the TinyUSB stack. They are great, but whenever you feel like using the native USB in instead of the hardware uart, it starts to eat up your cpu cycles like cookie monster... it'll be the same story for the RP I suspect. Especially float can give very nasty surprises, I suspect it will be the same in terms of power consumption / efficiency.
I think the general wisdom is that floating point code accounts for less than 1% of microcontroller code. So doing a test that focusses on floating point is inherently flawed.
@@GaryExplains Where did you get that "general wisdom"? I know I've never seen it in my 30+ years of professional software engineering... Not saying it's incorrect, but in my experience it very much depends on the application how much floats are being used... Source? But even if it is correct, I don't think you understand how bad it really is. I actually did some benchmarks on the esp32 a while back, because I couldn't make heads or tails of the performance numbers. It has roughly 600 MIPS and just 1 MFLOPS (!) for common operations. That means that even if only 0.2% of your code is using floating point, it will consume 50% of your cpu power. It's that bad...
When I say general wisdom, I mean general wisdom, there isn't a particular source. However over the years I have seen multiple presentations that analyze real-world code and FP code is minimal, certainly on microcontrollers. That is why some microcontrollers don't even include an FPU, not needed really.
@@GaryExplains Right, and as I said, I'm no amateur, and I've seen a lot of issues with FP over the years. At the end of the day it doesn't matter what the exact percentage is: since FP is so much slower than integer operations (for obvious reasons), the effects on the application as a whole are still significant. Whether or not FP is required for applications at all is a totally different discussion. Again, such discussion is eventually irrelevant; the fact is that regardless if it's a good idea or not, people use it for everything from motion control to PID loops and from UI's to signal processing. That is why there's a tendency for vendors to add an FPU: because it is needed. ESP, STM32F4 seem to agree with me. The RP2040 does not have one.
I'm glad you addressed the point about WiFi on/off not making a difference, although I'd like to ask about those mWh numbers - you said for the ESP32 that it's the same current draw whether you're supplying 3.3V directly or 5V, so which voltage are these energy numbers for?
They are for 5v. But they are all 5v (i.e. for all the boards). I have the 3.3v numbers are well, but of course it changes nothing, just smaller numbers.
@@AbelShields some chips use lineal voltage regulator (5V to 3.3V) some - switching voltage regulator (at least Raspberry Pi Pico with an RP2040). Switchers waste way less power (probably 92% efficiency for regulator, 94% efficient reverse voltage protection, 85% in total vs 64-66% in total with lineal one)
Although the Arduino IDE hardware abstraction does a good job of providing a common programming interface it is not really a good platform for performance comparisons. Some of these chips have a lot of functionality to improve performance per watt which isn't supported by Arduino HAL and the HAL has to do a lot more work with some architectures slowing down performance too. That said it is clear that the now ancient ARM architectures still hold up extremely well to the modern competition.
"Some of these chips have a lot of functionality to improve performance per watt which isn't supported by Arduino HAL" - Could you please give me some examples.
@@GaryExplains You can shut down the ESP32s entire radio circuitry if you have access to the low level registers. This saves a lot of power even when the radio isn't being used. If you have access to clock multipliers on the STM chips you can tune them to give lower power consumption too. Your encryption algorithm may be able to take advantage of encryption hardware on some of the chips which would make a big difference but the HAL won't necessarily take advantage if it.
Well, you can shutdown the entire radio circuity using the Arduino HAL. In fact I tried that, and said so in the video. Switching on low-power idle modes isn't relevant to this test. Also I used my encryption algorithm as a example of a heavy CPU load, it doesn't matter that it is about encryption. In my previous video I used finding primes and in my next I might use nqueens. It isn't about using special HW encryption blocks, but about testing the CPU.
@@GaryExplains Turning off the radio is not a low power idle mode it is just turning off the WIFI circuitry when the application doesn't require it. The rest of the chip runs at full speed and full power. It gives a better apples to apples comparison when testing say STM chips with ESP. Like when people compare the PI Pico to others ignoring the programmable IO which is it's most unique and powerful feature.
Hmmm... I seem to be repeating myself, one more go I guess: You can shutdown the entire radio circuity using the Arduino HAL. In fact I tried that, and said so in the video.
Some feedback: - Current draw is not exactly directly proportional to clock frequency, for instance at lower frequencies, efficiency can be worse because there is some "idle current" that doesn't change much and becomes more important relative to the clock based current. So I think it would be better to set the clock frequency of the MCU at the same speed, and do the same tests at different clock speed (because they might have different sweet spots). - If the goal is to compare architecture and not simply the MCUs, I think this is only a fair comparison if the chips are manufactured using the same technology node, I do not know if it is the case. - I think measuring the board current instead of the MCU current is not great either, I don't know for those specific circuits, but there are many ICs which easily consume a few mA doing nothing, some of them even when they are "turned off" (shutdown current in datasheets is usually low, but not always). One way to measure just the MCU current would be to completely remove other circuits from the board (yes, it's more challenging, and destructive to the board).
Some feedback on your feedback: - I did that in the previous video on MCU power efficiency. - The goal was to show the current state of RISC-V MCUs and to debunk the myth that just because a processor is RISC-V, it somehow means it is inherently better. - I covered that in the video and made the same point myself, did you miss that segment?
@@GaryExplains Thanks for your reply, I had not seen the other video. Your graph at around 12 mins shows what I mean. For instance, at 240 MHz, rpico consumes 0.16 mA/MHz, while at 50MHz, it consumes 0.26 mA/MHz. Similar results are seen for ESP32. If it was linear, it would be the same number. That's actually a larger difference than I thought it would be. It is counterintuitive, but I believe MCUs tend to be more efficient at higher clock speed (likely up to a certain threshold). Hence, comparing the energy usage at different clock speed seems to favor the boards running at higher clock speeds. If the goal is simply to show that a risc-v chip can be less efficient than an arm processor, it is achieved, but then IMO, the title "Arm vs RISC-V? Which One Is The Most Efficient?" is a tad misleading, I was hoping to get a comparison of efficiency of risc-v compared to ARM, which would need to control the other parameters (especially the technology node, since it is likely a huge factor). Still an interesting video nonetheless. You did mention it in the video that you measure the board current. Depending on what's on the board this may have a huge impact. I now had a quick look at some schematics and it looks like the boards are quite bare (though I'm not sure what's the exact board you use in some cases), so it may not be that important in the end. One thing I noted though is that most board use an LDO while the Pico apparently uses a DC/DC converter. Boards that use an LDO should indeed have the same current going in 5V as in 3V, however this should not be the case for the DC/DC converter. Efficiency of those LDO is 3.3/5 ~= 65%, while efficiency of the DC/DC converter of the pico is mentioned "up to" 90% (though this varies with consumption). This is an advantage towards the pico board, not related to architecture. If you indeed measure the same current when supplying the pico from 3.3V, it is either because the efficiency of the DC/DC converter is actually 65% as well, or because there is some leakage to the DC/DC converter when there is a voltage applied to its output while its input is floating (which is possible since it is likely not an intended use case). Just to make it clear, I just wanted to provide some constructive feedback, I'm subscribed and enjoy watching some of your videos, I hope this doesn't come off as arrogant.
@@FranzzInLove I agree; if the goal was indeed to compare the efficiency of Arm vs RISC-V, the best way to do it (aside from getting two different chips that are identical, apart from the CPU core - so same node, same class, same memory, same speeds etc.) would be to record the actual number of instructions executed for a given benchmark - i.e. the _dynamic instruction count_. This is the only meaningful number to look at when comparing one ISA vs another. Otherwise you're just comparing chip vs chip. And the direct comparison of cycle counts that was done in this video isn't realistic either, for the exact reason that Gary actually explained just before showing the comparison; memory systems are running slower than the cores themselves and often have a somewhat fixed latency when reading data (and instructions), so you'll typically waste more cycles waiting for memory when running the CPU at a higher frequency. So Gary: nice try and I really appreciate that you focus a bit on my field (MCUs) as well, but for this particular comparison it could've been a bit better - at least from a "comparing ISAs" point of view, from a "comparing MCUs" point of view it was great! :)
minor but architecture can 100% be relevant for effeciency, speed, etc. Yes, a good x86 implementation can always sip power in comparison to a shit ARM implementation, but that doesn't mean implementation is all that matters. A slow algorithm on a super computer will outpace a fast algorithm on a microcontroller, but that doesn't mean picking the right algorithm doesn't matter, it just means that it's not the sole deciding factor. These architectures were invented to solve specific problems and to suggest that architecture is irrelevant is really just disengenous. No, the differences won't be direct, but the architecture influences the implementation; different architectures lend themselves better or worse to different designs, and some designs are better in some functionality than others. Intel was *_far_* ahead of AMD for a good long while, but then AMD started going batshit and putting dozens of cores on their CPUs and now at the ultra-high performance they're pretty unmatched. In single core they still lag a tiny bit IIRC, but in multicore it's real hard to beat 16, 32, 64, 128 seperate cores. Speed isn't just an RPG stat, there is a *_lot_* of nuance and 'speed' is really just the composite of how fast it can go and how easily it can go that fast. If your chip is the fastest thing in the world, but it takes 500x more work to develop for, it'll never take off. (outside of niche use cases of course) On the other hand, if your chip is 25% faster and a drop-in replacement, it'll spread like wildfire. One thing I really think RISC-V needs to work on is making sure that they go out of their way to make cross-compilation as easy as possible; that or invent a damn good emulation suite. (but only Apple has really ever pulled off performant cross-architecture support AFAIK. I hear a few projects are getting pretty good, but I've never heard of one *_really_* bridging the gap outside of Apple) A new architecture just can't demand people spend time porting their software unless they have something *_really_* good to offer, and really RISC is more of just an incremental improvement than anything.
Nice. You mentionned design importance but, to reiterate, the designer of the microcontroller is important here. I think your results show STMicro expertise.
Its kinda wrong to average out the performance. The esp32, esp32-s2 and esp32-c3 have an adjustable clock ( 80 Mhz,160 MHz and 240 Mhz). The newer Esp32-s3 can go as low as 10 Mhz. You can set the esps to 160 Mhz to compare to each other. You can also average the time it takes for fixed set of operations etc
Interesting video :) I do not think it is enough to just to power the 3.3v rail since there are other onboard electronics which also require a power (usb to serial converter) on the esp32 chip. It could have been interesting to see it compared to the datasheet :)
Unfortunately, there is no way to quantify which ISA is more efficient based on random boards from different manufacturers. There are too many variables in this equation to drive any meaningful data from these tests. One would need to custom engineer their own hardware while keeping CPU design really close to each other to be able to accurately quantify this
Gary, M4 has an FPU speced on core. C3 has cryptographic modules. I'm very impressed with the quite new C3's placing on the list, but do you know if the C3 cryptographic processing components were used in your compiled code? This influences your results quite significantly.
The cryptographic co-processor in the C3 accelerate very specific algos (SHA and AES), and need to be expressly enabled in code through C headers as well as through the NVM configuration. His crypto algorithm is very custom(?), so I doubt it can even take advantage of the co-processor, let alone the fact that putting code forward that used co-processor on the C3 would kill compilation for all the other chips, since the header would have definitions for C3 specifics.... unless of course Gary was a complete A-hole and put #IF guards around that part of the code. (which would absolutely give the C3 and advantage.)
2010 does indeed seem 23 years ago
🤦♂️😜 Darn. That was a stupid mistake! But I think you are right it feels like soooo long ago!
Apple's first iPad:
April 2010
@@GaryExplains Future proofing the video, that's all
@@ZDevelopers Yea that's what I was going to say
It’s so sad hear that 😢
For measuring relative performance, it is wrong to do a per MHz calculation. The only metric that should matter is the total time needed to run the same application on both processors. A more complicated ISA means clock speeds will be reduced (which gives better per MHz performance), but that does not mean the processor is faster
RISC-V, ARM and others like MIPS which I have plenty of experience with are just architectures; the chips you can buy are implementations of these architectures. First thing to notice from a 30,000ft perspective is that 64-bit ARM, MIPS and RISC-V are surprisingly similar. In the past CPU architects were more adventerous. These days no more bat crazy shit like segments (x86) or register windows (SPARC, and totally batshit crazy on IA-64). MIPS is an early but well designed RISC architecture; 64-bit ARM (which fortunately is rather dissimilar to the 32-bit ARM architecture) is surprisingly similar. Which is unsurprising because one of the architects used to work for MIPS. And RISC-V was designed by the fathers of SPARC and MIPS.
So are they all the same? Not quite but 64-bit ARM and RISC-V benefit significantly from hindsight.
Now, once you take things to the limit things will be different. RISC-V's smaller footprint allows fitting more cores running at a higher clockrate on a die. It barely matters for the birdseed class of microcontrollers that's polluting most PCBs ;-
So for most uses architecture doesn't matter - software does. That's where ARM is very well supported, MIPS is well established and RISC-V is still catching up. That said, the folks behind RISC-V is smart and have impressed me by what they have achieved and in my discussion so they're going to close that gap. Plus hgiher end implementations are going to show up. Being a truely open architecture however the RISC-V market can be as confusing as a ant pile - or open source in general ;-)
RISCV having open specs should make software catching up easier than for arm no?
@@markhaus Yes and no. Availabillity of documentation is greatly simplyfying a port to a new architecture, especially when there is already a port to a simila architecture. It still remains a major underrtaking in terms of manhours required. Been there, done that. Three times 🙂
As far as software development is concerned, x86, MIPS, RISC-V, MIPS, ARM are open enough to aloo development of dcent software. IA-64 was special in that it's performance characteristics are ... complex. without NDA or possibly even being Intel it was not possible to certain software including high-end compilers.
The level beyond that is licensing for the architecture itself to develop a core by anybody who wants to. RISC-V didnt really inovate there, there have been other such public domain or similarly unrestricted architectures before. But they were the first polished architecture with academic and industry acceptance, documentation and very liberal licensing on top. It's this mix (and probably a few more things on top) which made the rise of RIsc-V possible.
It’s not just about the people who designed the RISC-V ISA it is also about the people that implement it, they are the ones that actually determine the efficiency and clock speeds and similar.
@@markhaus Yes and no. Programming specs are open for most architectures though the degree of detail varies. On the example of Intel, Intel CPUs accept only a signed blob as microcode and the microcode programming interface is not even documented. Probably few users care. More painful was Intel's attitude towards protecting the IA-64. The Merced documentation covers four thick books printed on thin paper and is also available for download (1-click sign-away your soul acceptance of terms required ;-) but errata required and NDA and certain very deep secrets were only available under terms of a much stricter NDA - the most restrrictive I've ever seen. Finally some further aspects such as deep details on performance aspects of the pipeline which are essential for the ijmplementation of top notch code generators and compilers were not available outside of Intel at all.
I don't want to single out Intel but just picken them as an illustrative example Companies very in their degree of paranoida, protectiveness and openness and corporate history and experiences are part of that.
RISC-V may be an open architecture. That means the architecture is open. It does not mean an actual implementation is open. It is possible to implement a fully RISC-V-compliant processor under the terms of the RISC-V licensing conditions - great. Yet iI can keep the implementation as closed as a traditional microprocessor implementation from companies such as Motorola, IBM, Intel, AMD, HItachi, MIPS, ARM etc. The result may be something that executes RISC-V code just fine yet for certain aspects such as performance has to be treated as opaque, as a black box.
As somebody who has ported Linux to MIPS it's been occasinally helpful to have access to the folks who did all the mental heavy lifting and wrote the specs. With a RISC-V compaliant core one may or may not have the same kind of access. for a particular project.
@@conorstewart2214 While this is correct one should consider the architecture defiinition of any processor architecture as something that sets the absolute limits of what's possible. A good implementation can reach for near 100% of that; a bad one will stay well below.
With my MIPS experience I found the size of early MIPS and RISC-V cores which are somewhat comparable And the RISC-V implementation is much smaller in terms of transistors / gates which is an indication of how polished the architecture is. Ok,RISC-V had the benefit modern software tools to aid the implementation. Such comparisons across decades are bound to limp somewhat.
An interesting aspect is how the RISC-V architecture is made up of several optional parts. Just to pick one example, an implementation does not need to have a multiplication or division instruction. They were looking at other architectures' pain points. MIPS was born as a super-fast RISC processor for super-mini computers, later workstations and servers. Nobody early on thoght of embedded computing. Such omissions are hard to rectify lateron in a clean manor.
One point where RISC-V is brutally efficient is cost due to absence of licensing fees for the architecture itself. To some users that's the #1 aspect that matters.
One more comment. Most processor manufacturers produce a spec called DMIPS/MHz, or millions of integer calculations per megahertz clock speed. This allows you to do a clock for clock comparison between parts.
Let me wind back the clock to the mid-80s to point you at the horrors of the Dhrystone benchmark which back then was more or less the canonical benchmark for integer performance. Even in the best case Dhrystone results didn't represent real world performance very well. Dhrystone wasn't only ignoring fp math entirely, its results also got more and more comically absurd as architectures got more sophisticated (caches and out-of-order made a giant difference) but also as compilers improved and started to "optimize away" part of dhrystone. The peak was rached when certain compilers started to recognice Dhrystone and applied Dhrystone-specific optimizations for almost arbitrary benchmark results - whatever marketing orders ;-)
It gives me headaches to see parts of the industry are still using DMIPS decades after it's been throughly proven to be rubbish.
(It seems many folks don't know these days - the D in DMIPS stands for Dhrystone).
@@ralfbaechle Thank you for this informative comment. One learns something new everyday.
It's an artificial value though and not very helpful for real world performance comparisons.
I know this because Qualcomm quoted quite a high DMIPS for their Krait CPU core back during the ARMv7-A generation, and it routinely got thrashed by the lower DMIPS rated Cortex-A9 based SoCs in actual performance.
I was surprised that the now somewhat venerable Black Pill did so well in these tests against the newer upstarts, especially in power consumption and power efficiency. Thanks Gary!
Not just "black pill" but stm32f401 or 411. Today one mc on black pcb tomorrow another...
And f411 has 12.7mA at 100MHz core with periph disabled. Not all periph is need to be on. I have doubts about this video test 20mA. The Chinese have a lot of analogues stm32. And for example CH32V203 (f103 clone with riscv core) has 8mA at 144MHz. CH32V30x ( riscv core with fpu) 12mA at 144MHz. And they have ever tssop20 case. As f103 clone CAN onboard, that f411 doesn't have. And 307 has Ethernet, 208 has bluetooth + Ethernet and 2.2$ in my local store. I have not been interested in buying stm32 for a long time. Only Chinese only like stm32 has CH32, HK32, AT32, GD32 and so on.
When I studied computer science Risc-V was my favorite to program. Good to see they are now doing a comeback.
I appricate that Gary is right here that RISC-V is not yet *as* effecient as but I'm very impressed that RISC-V is already *almost* as efficent as ARM with for the same processes being run 1.36mWh compaired to the equivlent ARM board getting 1.31mWh and even compaired to the *much* more established Pi Pico, it's only 8% less effient (than the Pico). Obviously being almost 89% less efficient than the Blackpill isn't ideal for RISC-V but this is still early days for it compared to ARM and just with there being so many less RISC-V processors produced vs ARM, I don't think you can expect it to be beating the leaders of the pack in ARM just yet. Maybe when there are as many models of RISK-V processor as ARM processors the leader will be arm. Maybe with more time for tuning, the leader of the RISC-V pack will beat the leader of the ARM pack; even with less models out there. Encouraging stuff.
Stating my bias: I want RISC-V to succeed as I think open source is the way forward and garding "intelectual property" like dragons over gold, is holding humanity back.
Thanks for the interesting video Gary!
You're in the realm of potential compiler optimizations... And which process node these chips are made of...
arm & risc-v are nearly of same age.
Sadly, only ARM got attention at that time.
I agree. RISC-V is not there yet but made a very good showing being the new kid on the block. ARM has been at this game for decades. It is unrealistic to expect the new kid to outperform the veteran. ARM has been optimized over decades. RISC-V has to pay its dues to take the crown. I am strong RISC-V advocate. I look at this as there is plenty of room for RISC-V to improve. The ground to cover in some areas are not that great to close the gap.
@@xade8381 that's not correct. ARM started to be designed in 1983 and the first chips and boards were in 1986. ARM the company started in 1991, when there were already 100,000 ARM-based Archimedes PCs in use. RISC-V started to be designed in Berkeley university in 2010 (27 years after ARM), the initial frozen spec was published in 2014, the first board you could buy commercially from the first RISC-V company was in 2016 (30 years after ARM).
@@xade8381 RISC-V was still an educational tool for years though, with zero plans for reaching any sort of market. Whereas ARM was made from the very beginning as a commercial ISA, and is 30 years older to boot. Not very comparable.
Its not about performance only!!!! The biggest thing RISCV is OPEN SOURCE processor...
Really? You understand that only the document describing the instruction set is open source. What advantage does that give consumers?
@@GaryExplains Pls read (even in google) why riscv and open instruction set is so important.
@@mementomori1868 😂 Or please watch my videos as I have several about RISC-V and what it really is.
Great work as always. Benchmarking is always a can of worms because it is as dependent on the application as it is on the processor. Do you need fast integer? Fast interrupt response? Floating point? DMA? If you used newer M3 and M4 parts they would have performed much better even in this integer-only test both with regard to processing speed and power consumption given that they’re built on *much* newer process nodes. And a recent STM32 M7 would’ve blown everything else out of the water.
Why are so many of the commenters here so obsessed with process node? It strikes me that many (not aimed at you in particular Mark, sorry) may just be reciting jargon without understanding it. Even a very old node such as 180nm is good enough for making a 300+ MHz chip (e.g. the SiFive FE-310 on many RISC-V microcontroller boards) which is plenty for anything in this test. Smaller process nodes do allow higher clock speeds, but if you're not USING that ability then they are not just a waste of money in the much more expensive design and manufacturing process, but they may actively be WORSE because of things such as higher leakage current when operated at low clock speeds or in low power sleep modes. It's also a complete waste when you're making a simple stand-alone chip such as a microcontroller with a small core and a small amount of SRAM because even with the old nodes you end up with the actual processor&memory being a tiny little square inside a huge bit of silicon with the I/O pin pads taking up 90% or 99% of the extremely expensive small process node chip area. The default assumption unless you're a real expert should be that the manufacturer has chosen the best process node to optimise what they want to achieve with their chip.
We are familiar with needing to sample the test code many times to generate benchmark results which are not misleading, but it is also essential to sample different kinds of test code, to not be misled even by random compiler differences on each bit of code tested. With the performance results between the esp-c and the black pill coming within 1% of each other, that suggests the test was entirely memory bound on those systems and the systems share very similar memory systems. Multiple programs need to be benchmarked for a picture to emerge.
@@BruceHoult In general I would think that a smaller feature size would mean less parasitic capacitance, but I didn't think about leakage current. Is that from quantum tunneling? I wonder where the sweet spot is for that. But there's also the matter of different topologies like finfet and gaa, that might reduce the switching current. Mostly I think it's an economic decision. Everybody wants better speed and battery life, but how much are they willing to pay for it? For a computer that only runs a single program continuously, all you need is "good enough". Microcontrollers often have external power anyway. The main concern vis a vis power consumption is cooling.
1:10 I did, RV32I in fact! although i had a hard drive failure so now it's abandoned...
I just bought my first RISC-V chip, an esp32-c3 from adafruit. Mostly bought it to learn RISC-V Assembly.
Generally want to learn AVR, ARM and RISC-V Assembly.
Sounds like a great plan. All are good ISAs. If you have any questions the Reddit /r/asm forum is pretty good for any ISA, and /r/avr and /r/riscv are helpful too. Sadly, /r/arm seems dead and/or non-technical.
A nice explanation as always. But I'm missing the sleep current for the different boards. It would be intresting to see how they perform compared to eachother. It is more if a comparison between MCU brands than core architechture, but still! :D
A huge factor for efficiency is compiler quality which grows with age.
The major design differences ariund efficiency is stuff like dark silicon for common tasks and SIMD engine implementation plus caches.
Crazy to use only a single RISC-V board as representing a whole ISA. Obviously not all ARM cores or boards are created equal, and neither are all RISC-V cores or boards. Espressif doesn't even say in their datasheet what RISC-V core it uses. Crazy also not to include Sipeed Longan Nano ($4.80, 108 MHz, been around for three years), some Bouffalo lab BL602 board (similar price to ESP32s, we know it uses a SiFive core) or even extend the price limit a fraction to include a K210 board (dual core 400 MHz 64 bit) such as Maix Bit. Still, it is interesting to see that from the same chip/board manufacturer the RISC-V does in fact give better performance per MHz and per Watt than what they were using before. A really interesting test would be the Longan Nano (GD32VF103 clone of an STM32 but with a RISC-V core) vs either a GD32F103 (same manufacturer STM32 clone with a real licensed ARM core) and/or a real STM32F103.
Fantastic succinct but thorough coverage. I can tell a lot of work went into this great video. Subscribed!
Isn't the manufacturing process(how many nm) a major factor in power consumption?
It’s always strange being reminded that people think RISC-V is inherently more efficient than ARM. That’s not why people like the architecture. It’s an open standard, whereas ARM is proprietary. Anyone who can make a chip can make and innovate a RISC-V design, not the case with ARM.
That being said, this was nice to see. I’m sure it has a lot of people blackpilled now.
"Anyone who can make a chip..." Really? I wouldn't even know where to start.
@@GaryExplains I was referring to the legality, mostly.
The architecture is open, you don’t need to pay for a license to make RISC-V chips.
11:20 how can current stay the same? I have searched many and many voltage regulators. They all come with no load or quiescent current. If you don't use a voltage regulator, it must use less current. This is not complicated.
Good video. 13:59: Board A uses 20mA·26s = 0.52 Coulomb = 3.2448·10²¹ electrons to accomplish the task, and Board B uses 51mA·18s = 0.918 Coulomb = 5.72832·10²¹ electrons, so Board A peruses only ~57% of the electrons that Board B uses. Therefore A is more efficient.
One other important factor on powe consumption, is how low ot can get when not doing much. This can be very important when making battery operated things. For example, the Nordic nRF52840 can gwt down to less than 5 mA with BLE running, and lower in LP modes, while I have not gotten the ESP32-C3 down to less than the same 38mA with BLE on, even when the CPU is just waiting. I'm working on this, trying to get the C3 to run more efficiently, because it is otherwise a very affordable and capable module.
Literally searched this a few days ago with all the news about RISC V Vs ARM. And there was no video. Thank you for this one.
What news are you referring to? Also, did you see this video of mine? ua-cam.com/video/GyWyikB2hFs/v-deo.html
@@GaryExplains Talking about an efficiency specific comparison.
Do you have thoughts about potential of Risc-V? I was curious about any production difference, like applied node size.
I talk about RISC-V's potential in my RISC-V series.
@@GaryExplains indeed you did! Quite a few as well ua-cam.com/play/PLxLxbi4e2mYFTkLsNYqWLrSQZtLB94wnY.html
I really love the black pill , actually currently I’m working on project using it (STM32F411CE), so glad to hear it did will in the benchmark, but Gray I have question
Did u write the program for each board in assembly or C ?
In case the answer C , then What compiler did u use for each one?
I hope didn’t throw up a lot of questions 😅😅.
Amazing work man, thanks a lot for this benchmark and I hope see more of them!!
Gonna take a while for the RISC-V manufacturers to figure out how to design really great chips with it, but there's no reason not to expect it will be roughly the same as ARM in the long run, just with an open ISA which is an absolute win on its own. Hobbyists who aren't trying to squeeze every last bit of performance and efficiency out of their projects should support RISC-V to help it along and encourage faster development. It's already outpacing ARM's development, which was already quite rapid.
The really interesting bit is RV32E with Zc* extensions. It essentially repurposes a bunch of the the floating point compressed instructions allowing a 16-bit only CPUs with half the registers. That'll be a tiny core.
@@mrbigberdminor nit: they still have to implement the 32-bit instructions. there's no base isa that's compressed-only.
there are a few hoops to jump through, but it's not hard for an individual to contribute. and i've been tempted to contribute for months now. (i have a number of ideas i'd like to fling their way already.)
@@dead-claudia That's not strictly necessary as the 16-bit only format is Turing Complete.
There are still 10-ish opcodes still left and a couple of them could be broken down further to provide a few more 2-reg instructions.
Most importantly, a 16-bit only extension would allow the use of the 11 top-level opcode space. This would increase total instruction space by 25% in 16-bit only designs and that would give enough space for a massive 64 2-register opcode space using the CA instruction format. That's enough to add in more branch instructions, A, B, M, CSR, Zicond, Zacas, etc.
Going further, the E-series only has access to 16 registers. Reclaiming those bits for CR gives 2 extra instruction bits (4x as many instructions). CI doubles its available instruction space too. This would open a path to add a basic Vector/DSP extension too.
How do these compare with x86 chips, specifically in running programs that are designed for x86?
x86 chips can only run Arm and RISC-V programs using emulation. The opposite is also true.
Well... There are certain extras in your core implementation that will make a difference; stuff like the different caches and the coherency mechanism, the branch predictor, cpu internal bus and the bus arbiters, there's just so many extra internals that are all abstracted away in complex logic. Some of that complex logic is just more appropriate to implement in another program, i think some of the cpu caches are governed by a whole other "management engine" that runs its own firmware to keep track of the bits in the cache....
Back in uni, I still remember the active power (total power - leakage current power) is proportional to square of frequency. Can we use it to extrapolate the power usage of the pi to 160 or 240 mhz?
Or better still watch my previous video on this topic where I actually changed the clock speed of the Pico and measured the power usage.
Active power is proportional to Vcore^2 * frequency. Not frequency squareq but just frequency multiplied by core voltage squared.
You may get around frequency squared when cores are pushed harder than above mentioned microcontrollers (not as hard as full boost latest Intel or AMD chips, frequency still has to be supported by changing core voltage).
@@volodumurkalunyak4651 The formula I remember from university is proportional to voltage and to frequency squared (P ∝ V * f^2).
@@leonardosabino2002 i literally wrote the very same formula:
Vcore^2 * frequency
power is proportional to frequency and to voltage squared.
Power scaling does also resemble frequency squared at some part of volt-frequency curve (probably 0,7 to 1V region for latest chips)
@@volodumurkalunyak4651 Not the same formula. Look again, it's the -frequency that's squared.-
EDIT: I just looked up the formula, looks like voltage squared is correct. Sorry about that.
In general, performance of risc-5 is not up to the standards of ARM. Full stop. But this battle does not stop today. ARM just announced, that they will charge their customers in future based on the device prices instead for IP. That will drive the research in the area of Risc-V up. I expect the Risc-V to become a contender in the Mobile Phone space (low end) in about 3 years and in the high end market in 6-7 years.
ARM has not announced anything of the sort. You are repeating a rumor published by the FT.
@@GaryExplains It came from Softbank, the owner of ARM. Let us wait and drink tea. Maybe it is a hoax.
Again, nothing official has been said by Softbank or Arm.
Let us wait and drink tea. Btw. Enjoying most of your content. Great channel.
@@michaelkaercher I am waiting and drinking my tea while the attention trolls on UA-cam keep asking me for all the love they didn't get from their Moms. :-)
As I watch, I get questions, and as soon as they pop into my mind, Gary already responds to them. It's rare that a tech video is this well thought out and structured this well!
the one and only legendary Gary Explains.
00:50 My Samsung Phone selfactivates Bixby starting between now and next 10 seconds. HOW and WHY?
Very interesting article. I have always thought about how RISC-V would be compared to ARM.
Do you have similar comparising for enterprise chips too? comparing RISC-V with x86 (Intel/AMD) and perhaps also including ARM?
I think 'process node' probably has a huge influence
You also need to consider the code density. The firmware binary size is usually smaller using ARM cortex compare to RISC-V or ESP32.
There is also a compact version of arm called thumb which offers higher code density.
update from the future: the code density is starting to change in risc-v's favor as compressed instruction support is maturing.
Are all these microcontrollers fabbed on same process node (and by same manufacturer), for example fabbing m4 or 40nm and 20nm will differ in performance and power efficiency.
The Arm one is on 90nm, the RISC-V on 40nm.
Thanks Gary for this wonderful comparison! Greatly appreciated!
My pleasure!
Hi Gary, it would be really interesting to have you do a Intel Atom/E-core (Alderlake/Gracemont) architectural deep dive video, and a comparison to Arm/Risc-v.
They are too fare away to be compared.
e-cores aren't efficient whatsoever
Ah, it can't be compared. Atom is an x86 cpu, and it depends on how the cache and fsb are set. Arm chips usually operate at max 0.5 volt atom can go up.to 2 volt on turbo so is definitely diffrent clas of cpu ah why not against an ia64 cpu lol 😆 wanna se that race 😆
@@adriancoanda9227 In the end what makes the difference between slow and fast, is 99% software. I would win that race if I am the programmer : would use inline assembly, lookout tables with pre-computed values, would not miss the caches with visibility list, local goto.... Sofware always wins.
@MarquisDeSang not always. It still needs hardware to run on y saw some remastered games to be used via the browser chromebook target it ah tnd that launcher looped 2 gb of data but target just on cpu core so the loading screen took 10 minutes search for ah y like to see you in quantum pc your thinking won't apply there cause it is not a digital cpu is analog and capable of insane parallel computing and it exists already a portable one withou a transmission it won't run any apps like you are used to, 😉
That was very interesting to see, and I liked the different ways you picked to look at the question.
I'm more interested on what the best ide to use and library for risc-5. Are they up to stm32 level of library support however buggy they may be.
I mean in context of cmsis which includes dsp and many things
The difference is that STM32 is from a company, i.e. STMicroelectronics, to support the ARM chips they make. RISC-V is an architecture, so you need to pick a company and see what it provides. The Espressif chips seems to have a mature development system for all their processors including Arduino support.
Also, CMSIS is an Arm thing, from Arm itself.
What about DSP math tools, QMath, and feature like this, are they included in the Espressif ?, is there nice youtube session to give me overview. I review that last time (1-2 year ago) but not confident with IDE setup that seen so far
You appear to have detailed requirements, and I'm not sure I can provide a thorough response that fully addresses all of your questions. It might be best for you to reassess these platforms to determine if they align with your needs.
So - I've implemented an ARM and RISC-V architectures "on paper", and RISC-V is simpler in ways that pay. There's only about a dozen basic choices that even *can* be optimized out in the core ISA yielding an architecturally pure ISA. My less favorite parts:
* The opcode, funct3, and funct7 not being unified in decode step.
* The LU opcodes not mapping to truth table means LU operations are not simple 4BD decodes with add and mul being 2x4BD+1 decodes. AFAIK no commercially available ISA has ever achieved this, but it's been discussed widely in academic circles.
ARM though has for example the Java bit which *halves* the available opcode range, and is AFAIK based on an earlier RISC platform with some commercial extensions. And sure, some of that makes it fast, but it's going to be less efficient per wire than a cleaner ISA. There's actually tons of details I don't like in it.
May have already been mentioned, but Amps != power. When you change the input voltage to 3.3V and the current doesn't change, that indicates a change in power. I'm not fasmiliar with these boards, so IDK what the initial input voltage as, but if we assume 5V, and the current doesn't change when switching to 3.3V, then that is a 34% decrease in power.
Yes, of course, but that doesn't change the relative results, does it. What exactly is the point you are making?
@@GaryExplains yes it does change relative results. Rpi pico does have a switching regulator not lineal one that outher boards have.
Hello from Tennessee, Mr. Simms. Love your channel. Thanks for the video.
About the ESP32 power consumption when powered by 3V3: cheap LDOs like the AMS11x consume quite a lot of power when only voltage is applied at the out pin. Is in the range of 3-10mA
And on the 3.3v pin?
@@GaryExplains Yes cheap LDO Voltage Regulator like the AMS1117 consume power even when no voltage is converted by it. this is called "Quiescent Current" and can be found in all LDO datasheets. For the AMS its between 3 - 10mA and is the main reason why cheap EPS32 boards consume above 1mA when in deep sleep. There are better one available but they cost 60 cents not 6 cents. I had to find this out the hard way when when designing a battery powered ESP32-S3 board. Its enough to have 3V3 on the output pin of the LDO for this current to flow from the 3V3 output to GND through the LDO chip. its a kind of leak current. an easy way to fix this is to just desolder the LDO and power it directly with 3.3V on the 3.3V pin
Thanks for the info, very helpful. 👍
@@GaryExplains no problem ;] power consumption is a bitch x]
Were the connectors taken into account? USB C has transfer rates close to 10Gbps while micro usb is pushing over 450 Mbps. Then as far as power, USB-C handle nearly an order of magnitude power than the micro usb at 100W. Just curious.
The test didn't use the USB ports.
As I see it: Arm's been around for a while. It's had an awful lot of work put into its efficiency, power, etc. over the decades. RISC-V is new, and there isn't a lot of money in perfectly optimizing it (yet). The fact that it is at all competitive now is a good sign for things to come, but it's gonna need more time, work, and support to be fully realized in this regard.
Top notch work ! Thanks for the video :)
Glad you liked it!
I would love to see the xiao nrf52840 board or equivalent, put to the test as it is running at 64 mhz. This is the microcontroller used on a lot of smartwatches.
Plus it would also be interesting to see the boards already test, retested at lower clock speeds, if that option is available. I know some esp32 can have the clock lowered. For pure power efficiency, I believe lower clock speed tends to be more power efficient for the same work done, as power usage tends to go up on an exponential scale, whereas processing power for the same processor goes up linearly.
If the amps are the same at 3.3 and 5, then it is using an inefficient regulator to drop the voltage. Just curious, did you calculate the power efficiency using 3.3 or 5 volts? I am not a fan of any architect as I just use whatever is better suited to the task. Of course having one that does it all would be nice and save having to learn all the differences, but now that assembly language is rarely used, it is not like having to learn an entirely new instruction set.
By the way, if anyone gets a xiao nrf52840, if they say double click the button beside the usb c, the double click speed is a bit slower double click than I was used to. Took me a lot of tries to get it right. Luckily someone mentioned doing a slow double click somewhere.
One detail that you missed is that the Pico and Pico W do not have a linear regulator; they have an on-board buck-boost switching power supply. Current consumption will not be constant; it will go up as voltage decreases.
you should also list the process node for the processor, it also really affects efficiency.
Yes it does and what is shocking is that the Arm chips were on the older process nodes, making the RISC-V even worse .
Not sure if this is a valid question, but here goes. Based in these clock speeds, could one of these chips act as a processor in a micro DOS or Windows environment? Thinking kiosk that runs a corporate webpage and allows customer data entry or order entry on-site. Or tiny web book or a tablet just for web or ebook reader where its mostly text. I know that the Pi, which is more powerful and has a video decoder is slow at video and graphics. Just thinking that if not much computing power was needed, you could pair with a mid power graphics chip for running the display and decoding video streams. Then maybe you get TVs with minor computing and networking power. Or is this how they are making smart TVs?
Hopefully you look into purchasing the DeepComputing/Xcalibyte ROMA RISC-V laptop (or a related RISC-V laptop or desktop) for a future video, but much more refinement will probably be necessary for it to reach it's full potential.
Useful test but not good test on the topic of CPU core efficiency for several reasons: 1. likely system bus speed differences between these (system bus interfaces to on-chip SRAM) obfuscate differences between true CPU core performance/MHz/Watt unless you downclocked all of them to lowest common denominator system bus speed, 2. differences in flash memory/prefetchers further obfucate CPU core performance unless you ran the benchmark from RAM and even then some like M3/M4 could use dual-buses 1 for data and 1 for instructions making it unfair, 3. finally at least some of these probably manufactured on different process nodes
How would you suggest I resolve those issues?
@@GaryExplains Actually I didn't finish watching when I commented, I see you ran all of them at 1MHz later to level the playing field and I assume system bus was dropped to 1MHz also and that's a first important step. I would run all of these CPU cores at the system bus speed of the lowest common denominator system bus speed. The second step is to link to run the code out of SRAM instead of Flash on all of them. That's probably the best you can do to isolating core performance efficiency.
Gary, there was an article that i read about a week age that Apple may be shifting away from ARM to RISC-V. What do you think that Apple will switch to RISC V or continue with ARM for the time being?
If we read the same article it says that Apple is using RISC-V for some of its small co-processors, that is all. It is a good engineering choice, if it has to design bespoke hardware blocks then RISC-V is a workable solution.
@@GaryExplains Maybe but that Article had some text about moving to RISC-V that Apple might be considering. Moving to RISC-V would benefit Apple in long-term as they wouldn't have to keep paying ARM for royalties or whatever deal they have with ARM. What's your take on this?
No, that part was just pure speculation because otherwise it would be a boring article and no one would read it.
@@GaryExplains ok, thanks for clarifying.
You mentioned that you encryption algorithms don't use floating point or integer division, but does use bit manipulation. I'll ask if it also uses integer multiplication, because multiplication by default comes in the same extension as division, but was also made available on its own as Zmmul. Bit manipulation instructions beyond basic bitwise logic are also their own extension B and its parts.
Did the RISC-V processors used support these extensions, and if so did you tell the compiler to use them when compiling your code?
4K fits in the cache. What about external memory access speew?
Gary did you check the real clock speed of the RP2040. The maximum clock speed is 133 MHz but in the SDK it is set to 120 MHz because it is easier to get the correct clock for peripherals like the USB. Check SystemCoreClock in the SDK. Are you running the test from RAM or XIP? You probably see a difference here.
There is actually even more fun stuff here:
The chip has 2 PLLs; it sets one to 48MHz for USB, and one to 125MHz for CPU and bus clock. 125 is also much more manageable to get useful clocks for other peripherals as you said.
You can also push the pico MUCH further than what it is specced for. I have run complex programs with PIO and PWM at 300MHz just fine running from RAM, and ~250MHz when running from XIP.
Is there a fully stable and official Python interpreter specifically tailored for RISC-V?
,
What do you mean by "specifically tailored for RISC-V"? What alterations do you want in this RISC-V specific version?
@@GaryExplains Thank you for your response. By "specifically tailored for RISC-V", I meant a version of the Python interpreter that's been optimized to run on RISC-V architectures, taking advantage of its specific features and instructions. Just as we have optimized versions or builds of software for different platforms or architectures (e.g., ARM, x86), I was wondering if there's an equivalent for RISC-V. Essentially, Any version of Python that might offer similar performance or other benefits when running on a RISC-V system.
Hmmm... I am not sure that Python has special optimizations for different architectures. I just downloaded the Python source code and I see very little code that is optimized for say SSE3 or SSE4 or AVX. There isn't much assembly language either. I see a little bit of x86 ASM code in one of the math libraries, but there isn't an equivalent for ARM64. It is just C code in general. 🤷♂️
@@GaryExplains Thanks Gary.
Languages such as Python defeat the purpose of efficiency.
A decent video.
And yes, instruction set architectures don't largely impact power efficiency. Hardware implementation however impacts efficiency far more.
But there is nuances on the ISA level that sets limits for actual implementations of the ISA. Be it limits on minimum transistor count, power efficiency, peak clock speed, etc. Sometimes one has to trade one aspect for another.
As an example, a resource efficient architecture using few transistors will generally not offer all that great peak performance. While a more peak performance oriented ISA will tend to be hard to build with few resources. Power efficiency is meanwhile largely decoupled from this view of complexity, since power efficiency is more about how well a given piece of software can make use of the architecture provided. It is oftentimes better for efficiency to have dedicated instructions for complex tasks, but what tasks to choose is a debatable subject in itself. If one throws in everything but the kitchen sink, then it is often far from trivial to make an efficient hardware implementation of it in practice.
In short, designing an ISA is all about compromises to reach a prespecified goal.
And then make a good hardware implementation of that along the way.
Then it is up to the market to find/make applicable software for it.
I'd say that when it comes to efficiency, a number of major interest is how much power the chip/board burns while idle. Typical MCU systems are not to crunch numbers, but for control purposes. Numbers when ready to respond to Wifi may be the most interesting, but of course there are also applications not needing wifi while waiting to do a bit of work.
As usual, comparing MHz across architectures is not useful, a more realistic yardstick could be a "maximally trivial" task like how fast it can count.
It's not that surprising that the winners are the ones with faster clocks especially when in dual-core configurations the second core is just set idle (why?) This test was designed to benefit single cored and higher clocked processors. Multicore processors are known to have better performance at much lower clocks frequencies and consequently more energy efficient. I'm having a very hard time to understand what's the motivation here. However, obviously the performance is never about a difference in ISAs, especially if all of them are RISC architectures. Put some CISC ISAs in the mix and you will see huge differences, though.
Also chips that run at faster clocks generally consume more energy. That's the whole deal about multicore architectures, to have high performance with low frequencies. The whole deal of ARM processors and their use on mobile devices is exactly that. No surprise here either, since this is obvious. The only surprise is to see a processor running at 72 MHz consuming more than one at 160 MHz. I think this must come from the fact that you are measuring power at the board level, not at the processor itself, otherwise we would see this reversed.
Now about RISC-V. There is no way RISC-V processors could compete at any level with ARM processors that have been far and wide used in smartphones. RISC-V processors are new kids in the block that are running far behind. RISC-V is still lacking the support needed to be better than ARM processors. But there is a huge advantage of RISC-V, though, that cannot be measured for now. It's potentially much cheaper to produce RISC-V than ARM processors, since it is an open and free ISA. However, we cannot still see the advantage in prices because they are still not produced in high volume. Volume production is everything in chips prices. But we can expect to see much cheaper RISC-V processors in the future to the point of beating ARM processors prices. I think that's where the RISC-V will position itself as a competitive ISA.
I am planning a dual core follow up video. Also the devices with a higher clock speed didn't "win".
@@GaryExplains: Thanks for your comment, Gary. You are probably referring to the hypothetical comparison if the processors were all running at 1MHz. You know that just multiplying the time by the clock frequency is not a very accurate performance indicator.
I am looking forward to see the comparison between dual-core and single core processors. For the kind of comparison (very repetitive and computing intensive tasks) you are doing, you would generally be better with dual-cores. However, that's not always true. As I stated in another comment in another video, modern architectures have lots of intrinsic parallelism (that translates into several instructions executed per cycle) that simply don't work when you impose atomic execution to synchronize threads. That benefits single cores better than multicores. In my estimate, to start having clear cut better performance in multicore you need at least 8 cores, unless you don't use atomic operations. That's the reason smartphones have dedicated cores for certain activities, because in this way you don't need synchronization. The advantage of these configurations is simplicity, you don't need load balancing. But the problem is that you will have most cores idle if their correspondent activities are not taking place.
Except for the raw performance test (ie how many Ms to complete the task), none of the higher clock speed microcontrollers won. As for the clock frequency, in my previous video I actually changed the clock speed, and while performance isn't perfectly linear it is quite close, certainly close enough to make meaningful comparisons.
@@GaryExplains : Thanks. I didn't see your previous video, so I just assumed you multiplied the frequency by the time. It seems I will have to see this video again to understand what you mean with "raw performance". I probably overlooked that. Sorry.
Also, I don't think MCUs have much in the way of ILP, and certainly not out of order execution.
V. Informative. Tq :)
The ISA does make a small difference, and the fetch-decode speed was a large factor up until high mHz and pipelined branch prediction. A clean(ish) slate approach to both the ISA and IMPLEMENTATION SPECIFICATIONS of Risc-V working in tandem is what gives RISC-VECTOR the edge.
--
A proper Vector Processing specification instead of SIMD (an ISA DISASTER that should have stopped at SSE4 on the X86, and should never have been introduced into the ARM ISA... A Vector processor would have been vastly preferable and the tech was well proven.,
--
A major benefit is to combine CPU + GPU programming into one (much more) bare metal ISA for both, eliminating a ton of API translations and JIT compilation, Short but quite efficient or very long and perhaps more efficient pipelines can be experimented with by LOTS MORE CHIP (PART) DESIGNERS, while developers get a STANDARDISED ISA..
--
Bare metal GPU Compute will be much EASIER. Integrated graphics and general purpose Vector processing compliment each other, but software-only graphics systems using just the vector processor and a few CPU cores could be more efficient and good enough for web + office..
You think microcontrollers have branch prediction?
@@GaryExplains .. not yet, and hopefully never! I agree on the microcontroller front RISC-V is no better than the Pi Pico spec. It's also less RISC than the pico.. The low end RISC-V spec now includes basic, MMX level integer SIMD, probably FP SIMD when it's finalised then extended, so quite bloated compared to Pi Pico ISA.
--
I'm an ARM fan but think the High End RISC-V spec is a better idea (Vectors vs fixed sized SIMD).. Risc-V is an ARM killer, X86 never was... ARM is still the most likely X85 killer but Intel and AMD will probably race to replace X86 with native Risc-V and emulated X86. 10s to 100s of smaller SOC designers and manufacturers will obviously also prefer Risc-V.
--
Sadly ARM's days are numbered. It may well have to abandon its ISA and many core implementation details when it too goes Risc-V.. Open Standards are very powerful forces.. Look at IBM PC, HTML + CSS, Unicode. For better or or worse, these royalty-free technologies alays dominate.
--
I actually prefer 2 byte opcode ISAs using a few tricks and vector processing over SIMD. de-bloats the cache and pipeline. Risc-V is getting more bloated despite its lack of SIMD. Too many cooks spoiling the broth will be the reason Risc-V fails, if it does, which it probably won''t. A (US) Big Boy could buy out the project I suppose, and ruin or bury it, but that's unlikely too.
I think it quite surprising that a 13 year old design stands up so well. I would suspect that if the power saving features of more modern ARM processor designs were to be exploited for a micro-controller SoC, then it might do better still. However, presumably the priority has switched to producing much more powerful, low-power architectures for use in servers, laptops and the like. producing the ultimate in low power micro-controllers is probably not a priority as these things are rarely required to do heavy number crunching.
Great video! I'd love to see one where you analyze just power efficiency. I use microcontrollers around my house to monitor just about everything. I'd love to know which would last the longest on a battery. They need WIFI so they can report in. But my requirements use very little processing. Just check the sensor and report in. Thanks!
You aren't doing anything around your house that requires more than a lemon battery's worth of power. What you would need, though, are low power drivers for your network, which are hard to get, it seems. Just use whatever works and plug it into the wall. Who cares about a couple of Watts of extra power consumption.
I would have loved to see more different benchmarks hitting different areas of the MPUs, since concluding based on one very specific crypto-benchmark not even using floats seems quite off to me...
LOL, other people complained when they thought I was using floats (as some MCU's don't have an FPU). I just can't win. UA-cam comments for the victory! 🤪
Outside of very specialised areas, almost no software uses floating point on desktop computers, let alone on microcontrollers! I've been programming professionally for 40 years and 99% of C programs I work on don't even have the word "float" or "double" in them. Gary's previous "Primes by division" benchmark was quite unrepresentative of normal programs, but this one sounds pretty good (I don't know if the actual source code is available?) so I for one applaud this change.
@@BruceHoult "almost no software uses floating point on desktop computers" u wot mate ? Browsers and games are "almost nothing" ? Though to be fair, I don't know much about other software, but I'd be surprised if these would be the only major ones. Still, I'd also say it's kind of irrelevant what desktop-level software use and then compare to what MCU-level software uses.
@@Winnetou17 "outside of very specialised areas". Games and browsers are specialised. A lot of people run them, it's true, but they constitute a very small proportion of the lines of code written or programmers employed.
Bruce, the code to Oceantoo is in my GitHub repo, there is also an accompanying video here on this channel.
You are comparing what against what exactly?
For ARM, there are many different versions of the ISA (instruction set architecture). v7, v8, THUMB, THUMB2 being only the major families.
Let's say you take the latest and greatest of these: that would be ARMv8 with Thumb 2 instructions.
For RISC-V, the situation is more clear:
There is the RV32I (32 bit) and RV64I (64 bit), with I = basic/integer, and extensions M (multiply/divide), A (atomic operations), F (floating point), D (double precision).
Collectively IMAFD is called G.
There are compressed instructions of the I set, called C.
Then there is the V extension for "vector".
Also there is the H extension for "hypervisor"
I think that when comparing ISA's it would be fair to compare ARMv8+THUMB2 with RV64GCVH.
Now of course, somewhat decent RISC-V boards are coming available just about now, and efficient CPUs with the ARMv8+THUMB2 are now on the verge of beating Intel/AMD in laptops and servers.
So it is just not fair to compare the current implementions of both instruction set families.
You can compare code side: RISC-V linux executables are smaller than both x86_64 and ARMv8 for the programs I compared: ls, mv, cp, sshd, gzip. This is in contrast with what everybody claimed: C programs should be bigger when compiled to RISC-V machine language because it is RISC and the other two are CISC. Well, ARMv8 is technically RISC, I read, but compared to RISC-V the language is huge.
However, code size is vanishingly small compared with data even on a Windows system. Still, RISC-V Linux has consistently about 10% to 20% smaller executables.
You could also count the number of instructions executed for a certain task, say sorting an array, or compressing a file, or computing something scientific and massively parallel. Then you can compare the number of instructions used in RISC-V vector extensions against ARMv8 Thumb2 instructions. Still there is a caveat: RISC-V V extension is vector length independent. Newer chips can run the same binary more efficiently when it has a larger vector length.
You can do normal performance benchmarks but then you are comparing hardware implementations, not the ISA's.
Why do I get the feeling that you didn't even watch the video? 😭
@@GaryExplains You are partly right. I actually did watch it before but I more or less forgot. I now watched it again.
My issue remains though:
If you are comparing the efficiency of an ISA to another ISA, that is really hard, I think. It depends on the qualify of your assembly program if you are programming that directly. Or if writing C, it depends on the quality of the compiler.
The compilers for RISC-V may not be as mature as those for other archs. Especially for critical fast code using the vector instructions.
So you can count cycles for instance, and see in how many cycles each arch can get a certain task done.
Still not really fair: CISC can do more in less cycles presumably, although x86_64 instructions can take 10's of cycles and RISC-V does 1 cycle per most instructions and maybe 3-4 for difficult ones.
Anyway, I have always wanted to start writing assembly, but always found the ISAs way too complicated. Including the various ARM ISAs. My last real experience was with the 6502 (C64 days), and I only tried it when those days where almost over.
But now there is this new promising ISA that is simple enough for me to learn Assembly from scratch. So I am exited for it and I want the platform to succeed.
I have a Milk-V Mars on my desk but have not been able to boot it from a eMMC card yet. I also have a Milk-V Jupiter on order which has the vector RVV1.0 extension.
And I have pre-ordered four of the Milk-V Oasis boards with the sg2380 chipset.
I have tried some assembly in an RISC-V qemu machine running Ubuntu that works surprisingly well.
Anyway, how would you go about comparing the relative efficiency of two ISA families? Can it be done?
How did you ensure that wifi and bluetooth radios did not affect the power measurements?
The comparison is not with new hardware. The visionfive 2 board looks to be 4 core risc v and by having a risc instruction set allows for better parallel processing, making the possibility of higher efficiency. The ability to boot from an nvme and the concurrent processing will need better coding , to achieve faster processing .
What has booting from nvme got to do with the efficiency of RISC-V?
@@GaryExplains just a big improvement on visionfive 2 board efficiency’s. Not Risc-v specific. Currently no soc boards have nvme boot up and processing, not even raspberry pi.
Nvme boot doesn't improve efficiency, it improves IO performance, which isn't related to RISC-V in any way.
Also, I have a VisionFive 2 board, and looking at it there doesn't seem to be support to boot from NVME.
waited for this for so long ... thanks to mr.sims for making it happen finally. thank you
You should put "power efficient" in the title. They are both pretty inefficient in terms of memory usage.
frequency scaling with power usage isn't linear, its exponential, its better to have all of them at the same clock frequency
While I agree that it isn't necessarily linear, as far as I know that is only if the voltage changes with the frequency. In my testing I didn't only use extrapolation, I did clock them (where possible) at the same freq and the results correlated with my extrapolations.
Out of curiosity how would the old ATmdga328p fair in such a comparison. Max 20Mhz , very very old node (I think I once looked it up and it was still in the micrometer range).
No comparision. Atmega328 is an 8-bit processor.
The silicon fab processor node tech used to make the chips plays a huge role in their efficiency. It would be good to include fab node info in the comparison data.
Indeed, it is something I will note for future videos. As for this video the key is that the Arm Cortex-M4 is using 90nm and the RISC-V ESP32-C3 is on 40nm, which makes the performance of the RISC-V processor even worse.
@@GaryExplains Wow that's very telling. Thanks Gary!
Also the area of the chip also should be a criteria
Consider ploting these chips as a chart Power_consumption vs Time_of_execution.
By doing that, we will see the best over all chip.
I have to say that only the last plot (mWh to the task) makes at least some sense... But in general I would say that you can not generalize these boards and compare them directly. MHz is not linear to power comsumption. It's quite simple: The esp32 boards can run at 240MHz and are there for the fastest. It does not matter if the M4 can "compute more per MHz", if it is capped at 100Mhz and therefore is still slower to do the task... If you are looking at power efficiency you probably do not need those high clock speeds anyway. You can power down the Modem of the ESP and that will cut down the power substantialy. If you want to compare the ESP32 to the M4, you should clock down the ESP to comparable levels and run the tests again.
Hmmm... If you look at my previous video about microcontrollers you will see that I actually did change the clock speeds. While it isn't linear it is very close.
Arm is a risc chip. Also, it stands for reduced instruction set. Actually, you will nrrd to have the same motherboard with a socket mount in order to exclude other factors in the testing, but even then the fastest chip was at 240 mhz y won't se where those can make a use maybe in remote controls, elsewhere those are to slow, or use them I a insane cluster 999999999999x cluster but you will need a dam fast cluster management running within the firmware
Why not transistor count instead of energy used, too many variables. Assuming transistor numbers usually correlate to cost ultimately… to show what architecture more efficient for the theoretical cost of production (if they were same fab, same node)
Transistor count doesn't correlate in any meaningful way. It won't help you decide what size battery to use etc. Power usage is the most important thing, everything else is just statistics.
clock speed scaling is definitely not linear enough to fix afterwards, you should down clock all of them to the same speed, if you want compare at the same speed...
Clock speed scaling is linear on microcontrollers. They are in-order and deterministic. Plus I did actually change the clock speed on many of the units to check that, and it is.
It would be interesting to know also the idle power consumption, it would give an idea of how the boards would behave when powered with a battery.
I was just thinking the current measurements aren't very useful because of all the extra stuff on a lot of those boards. Plus the esp32 are not known for low power. You would have to compare active current with the idle current of each board.
Yes the delta power should show the true cpu energy used for the benchmark, maybe Gary can follow up?
The tricky thing with a delta number is that a CPU can never actually be idle. Even doing nothing is still looping and reading instructions waiting to no longer be "idle". To help in this situation there are two general solutions. 1. Lower the clock frequency and the voltage. This is something that smartphones and laptops do. 2. Put the CPU to sleep, this is a feature MCUs tend to have and it is similar to 1 but not dynamic.
@@GaryExplains thanks for replying. The motivation for the delta is to see the difference between the dynamic power consumption of the cpu architectures. I take the point that the cpu is never really idle, but I the case of MCUs, it should be at least the cores are idle, or running noops. I think the data would be interesting nevertheless. Idle power in itself would be interesting, so all 3 data points tells a story, idle, full load, and 'full load - idle'. Its quite surprising that a 22 year old design/process can still beat a 2 year old one.
I will look into this more and see if it is interesting enough for a follow up video...
I guess Gary, you really should put out a video series explaining the differences between ISAs, microarchitecture, process node etc. to the general public, as I have watched many people are disagreeing with you on various issues. I think this video series will work as prelude to ARM vs RISCC-V video BWT i also felt that I need some more help 😅😅😅😅 on this.
Thank you
First off, nice that someone takes the time to do benchmarks; we can really use some more of that. However, I also think any benchmark that leaves out the different basic types is inherently flawed. An int32 benchmark is nice for pure int32 operations, but it still tells me nothing about int64, float32 and float64. For example, the ESP32 has an FPU for float32, but not for float64. It also leaves out any peripherals - but that's okay (if you need a certain peripheral you should just select on that)... For example, I have a few ESP-S2's here that use the TinyUSB stack. They are great, but whenever you feel like using the native USB in instead of the hardware uart, it starts to eat up your cpu cycles like cookie monster... it'll be the same story for the RP I suspect.
Especially float can give very nasty surprises, I suspect it will be the same in terms of power consumption / efficiency.
I think the general wisdom is that floating point code accounts for less than 1% of microcontroller code. So doing a test that focusses on floating point is inherently flawed.
@@GaryExplains Where did you get that "general wisdom"? I know I've never seen it in my 30+ years of professional software engineering... Not saying it's incorrect, but in my experience it very much depends on the application how much floats are being used... Source?
But even if it is correct, I don't think you understand how bad it really is. I actually did some benchmarks on the esp32 a while back, because I couldn't make heads or tails of the performance numbers. It has roughly 600 MIPS and just 1 MFLOPS (!) for common operations. That means that even if only 0.2% of your code is using floating point, it will consume 50% of your cpu power. It's that bad...
When I say general wisdom, I mean general wisdom, there isn't a particular source. However over the years I have seen multiple presentations that analyze real-world code and FP code is minimal, certainly on microcontrollers. That is why some microcontrollers don't even include an FPU, not needed really.
@@GaryExplains Right, and as I said, I'm no amateur, and I've seen a lot of issues with FP over the years. At the end of the day it doesn't matter what the exact percentage is: since FP is so much slower than integer operations (for obvious reasons), the effects on the application as a whole are still significant.
Whether or not FP is required for applications at all is a totally different discussion. Again, such discussion is eventually irrelevant; the fact is that regardless if it's a good idea or not, people use it for everything from motion control to PID loops and from UI's to signal processing.
That is why there's a tendency for vendors to add an FPU: because it is needed. ESP, STM32F4 seem to agree with me. The RP2040 does not have one.
I'm glad you addressed the point about WiFi on/off not making a difference, although I'd like to ask about those mWh numbers - you said for the ESP32 that it's the same current draw whether you're supplying 3.3V directly or 5V, so which voltage are these energy numbers for?
They are for 5v. But they are all 5v (i.e. for all the boards). I have the 3.3v numbers are well, but of course it changes nothing, just smaller numbers.
@@GaryExplains so all these chips run at 3.3V natively? Fair enough
@@AbelShields some chips use lineal voltage regulator (5V to 3.3V) some - switching voltage regulator (at least Raspberry Pi Pico with an RP2040). Switchers waste way less power (probably 92% efficiency for regulator, 94% efficient reverse voltage protection, 85% in total vs 64-66% in total with lineal one)
Although the Arduino IDE hardware abstraction does a good job of providing a common programming interface it is not really a good platform for performance comparisons. Some of these chips have a lot of functionality to improve performance per watt which isn't supported by Arduino HAL and the HAL has to do a lot more work with some architectures slowing down performance too. That said it is clear that the now ancient ARM architectures still hold up extremely well to the modern competition.
"Some of these chips have a lot of functionality to improve performance per watt which isn't supported by Arduino HAL" - Could you please give me some examples.
@@GaryExplains You can shut down the ESP32s entire radio circuitry if you have access to the low level registers. This saves a lot of power even when the radio isn't being used. If you have access to clock multipliers on the STM chips you can tune them to give lower power consumption too. Your encryption algorithm may be able to take advantage of encryption hardware on some of the chips which would make a big difference but the HAL won't necessarily take advantage if it.
Well, you can shutdown the entire radio circuity using the Arduino HAL. In fact I tried that, and said so in the video. Switching on low-power idle modes isn't relevant to this test. Also I used my encryption algorithm as a example of a heavy CPU load, it doesn't matter that it is about encryption. In my previous video I used finding primes and in my next I might use nqueens. It isn't about using special HW encryption blocks, but about testing the CPU.
@@GaryExplains Turning off the radio is not a low power idle mode it is just turning off the WIFI circuitry when the application doesn't require it. The rest of the chip runs at full speed and full power. It gives a better apples to apples comparison when testing say STM chips with ESP. Like when people compare the PI Pico to others ignoring the programmable IO which is it's most unique and powerful feature.
Hmmm... I seem to be repeating myself, one more go I guess: You can shutdown the entire radio circuity using the Arduino HAL. In fact I tried that, and said so in the video.
great video, thank you
This is good for RISC-V. It is comparing ARM which has been around (and refined) for decades with RISC-V which is quite new.
Some feedback:
- Current draw is not exactly directly proportional to clock frequency, for instance at lower frequencies, efficiency can be worse because there is some "idle current" that doesn't change much and becomes more important relative to the clock based current. So I think it would be better to set the clock frequency of the MCU at the same speed, and do the same tests at different clock speed (because they might have different sweet spots).
- If the goal is to compare architecture and not simply the MCUs, I think this is only a fair comparison if the chips are manufactured using the same technology node, I do not know if it is the case.
- I think measuring the board current instead of the MCU current is not great either, I don't know for those specific circuits, but there are many ICs which easily consume a few mA doing nothing, some of them even when they are "turned off" (shutdown current in datasheets is usually low, but not always). One way to measure just the MCU current would be to completely remove other circuits from the board (yes, it's more challenging, and destructive to the board).
Some feedback on your feedback:
- I did that in the previous video on MCU power efficiency.
- The goal was to show the current state of RISC-V MCUs and to debunk the myth that just because a processor is RISC-V, it somehow means it is inherently better.
- I covered that in the video and made the same point myself, did you miss that segment?
@@GaryExplains Thanks for your reply, I had not seen the other video. Your graph at around 12 mins shows what I mean. For instance, at 240 MHz, rpico consumes 0.16 mA/MHz, while at 50MHz, it consumes 0.26 mA/MHz. Similar results are seen for ESP32. If it was linear, it would be the same number. That's actually a larger difference than I thought it would be. It is counterintuitive, but I believe MCUs tend to be more efficient at higher clock speed (likely up to a certain threshold). Hence, comparing the energy usage at different clock speed seems to favor the boards running at higher clock speeds.
If the goal is simply to show that a risc-v chip can be less efficient than an arm processor, it is achieved, but then IMO, the title "Arm vs RISC-V? Which One Is The Most Efficient?" is a tad misleading, I was hoping to get a comparison of efficiency of risc-v compared to ARM, which would need to control the other parameters (especially the technology node, since it is likely a huge factor). Still an interesting video nonetheless.
You did mention it in the video that you measure the board current. Depending on what's on the board this may have a huge impact. I now had a quick look at some schematics and it looks like the boards are quite bare (though I'm not sure what's the exact board you use in some cases), so it may not be that important in the end. One thing I noted though is that most board use an LDO while the Pico apparently uses a DC/DC converter. Boards that use an LDO should indeed have the same current going in 5V as in 3V, however this should not be the case for the DC/DC converter. Efficiency of those LDO is 3.3/5 ~= 65%, while efficiency of the DC/DC converter of the pico is mentioned "up to" 90% (though this varies with consumption). This is an advantage towards the pico board, not related to architecture. If you indeed measure the same current when supplying the pico from 3.3V, it is either because the efficiency of the DC/DC converter is actually 65% as well, or because there is some leakage to the DC/DC converter when there is a voltage applied to its output while its input is floating (which is possible since it is likely not an intended use case).
Just to make it clear, I just wanted to provide some constructive feedback, I'm subscribed and enjoy watching some of your videos, I hope this doesn't come off as arrogant.
@@FranzzInLove I agree; if the goal was indeed to compare the efficiency of Arm vs RISC-V, the best way to do it (aside from getting two different chips that are identical, apart from the CPU core - so same node, same class, same memory, same speeds etc.) would be to record the actual number of instructions executed for a given benchmark - i.e. the _dynamic instruction count_.
This is the only meaningful number to look at when comparing one ISA vs another. Otherwise you're just comparing chip vs chip. And the direct comparison of cycle counts that was done in this video isn't realistic either, for the exact reason that Gary actually explained just before showing the comparison; memory systems are running slower than the cores themselves and often have a somewhat fixed latency when reading data (and instructions), so you'll typically waste more cycles waiting for memory when running the CPU at a higher frequency.
So Gary: nice try and I really appreciate that you focus a bit on my field (MCUs) as well, but for this particular comparison it could've been a bit better - at least from a "comparing ISAs" point of view, from a "comparing MCUs" point of view it was great! :)
On the eficient test, what data was used and where it was stored
It might be better to run multiple types of programs bc different ones may compute using different power drawing
Thanks Gary, now i must think about an application that does allot of calculations.
I just created a NAS with a Raspberry Pi 4B and an external USB HDD. This would be a good application to verify an SBC is useful.
minor but architecture can 100% be relevant for effeciency, speed, etc. Yes, a good x86 implementation can always sip power in comparison to a shit ARM implementation, but that doesn't mean implementation is all that matters. A slow algorithm on a super computer will outpace a fast algorithm on a microcontroller, but that doesn't mean picking the right algorithm doesn't matter, it just means that it's not the sole deciding factor. These architectures were invented to solve specific problems and to suggest that architecture is irrelevant is really just disengenous.
No, the differences won't be direct, but the architecture influences the implementation; different architectures lend themselves better or worse to different designs, and some designs are better in some functionality than others. Intel was *_far_* ahead of AMD for a good long while, but then AMD started going batshit and putting dozens of cores on their CPUs and now at the ultra-high performance they're pretty unmatched. In single core they still lag a tiny bit IIRC, but in multicore it's real hard to beat 16, 32, 64, 128 seperate cores. Speed isn't just an RPG stat, there is a *_lot_* of nuance and 'speed' is really just the composite of how fast it can go and how easily it can go that fast. If your chip is the fastest thing in the world, but it takes 500x more work to develop for, it'll never take off. (outside of niche use cases of course) On the other hand, if your chip is 25% faster and a drop-in replacement, it'll spread like wildfire.
One thing I really think RISC-V needs to work on is making sure that they go out of their way to make cross-compilation as easy as possible; that or invent a damn good emulation suite. (but only Apple has really ever pulled off performant cross-architecture support AFAIK. I hear a few projects are getting pretty good, but I've never heard of one *_really_* bridging the gap outside of Apple) A new architecture just can't demand people spend time porting their software unless they have something *_really_* good to offer, and really RISC is more of just an incremental improvement than anything.
Nice. You mentionned design importance but, to reiterate, the designer of the microcontroller is important here. I think your results show STMicro expertise.
Its kinda wrong to average out the performance. The esp32, esp32-s2 and esp32-c3 have an adjustable clock ( 80 Mhz,160 MHz and 240 Mhz). The newer Esp32-s3 can go as low as 10 Mhz. You can set the esps to 160 Mhz to compare to each other. You can also average the time it takes for fixed set of operations etc
But the point is the power efficiency per MHz, which is what I showed. I don't think you understood the video.
Interesting video :) I do not think it is enough to just to power the 3.3v rail since there are other onboard electronics which also require a power (usb to serial converter) on the esp32 chip. It could have been interesting to see it compared to the datasheet :)
Both are amazing and exciting alternatives to X86 💪🙏
Hmmm ... Efficiency is largely dependent on the implementation and which extensions are used...
Did I not say that?
@@GaryExplainsyeah, sometimes I type out my thoughts before I watch the whole video. You did great.
Can't you measure the current directly from the VCC and GND pins?
I don't feel like I got an answer. Also you did not list what the architecture (risc or arm) was in the graphs.
The processor is shown along the x axis.
It's pretty quiet on the Speedtest G channel. Any plans for new speed tests?
Sadly, no.
Unfortunately, there is no way to quantify which ISA is more efficient based on random boards from different manufacturers. There are too many variables in this equation to drive any meaningful data from these tests. One would need to custom engineer their own hardware while keeping CPU design really close to each other to be able to accurately quantify this
Gary,
M4 has an FPU speced on core. C3 has cryptographic modules. I'm very impressed with the quite new C3's placing on the list, but do you know if the C3 cryptographic processing components were used in your compiled code? This influences your results quite significantly.
The cryptographic co-processor in the C3 accelerate very specific algos (SHA and AES), and need to be expressly enabled in code through C headers as well as through the NVM configuration. His crypto algorithm is very custom(?), so I doubt it can even take advantage of the co-processor, let alone the fact that putting code forward that used co-processor on the C3 would kill compilation for all the other chips, since the header would have definitions for C3 specifics.... unless of course Gary was a complete A-hole and put #IF guards around that part of the code. (which would absolutely give the C3 and advantage.)
I do wonder how much current ran through them at the same clock speed.