The Magic Of ARM w/ Casey Muratori
- Published 25 Nov 2024
- Recorded live on twitch, GET IN
Guest
Casey Muratori | x.com/cmurator...
My Stream
/ theprimeagen
Best Way To Support Me
Become a backend engineer. It's my favorite site
boot.dev/?prom...
This is also the best way to support me: support yourself by becoming a better backend engineer.
MY MAIN YT CHANNEL: Has well edited engineering videos
/ theprimeagen
Discord
/ discord
Have something for me to read or react to?: / theprimeagen
Kinesis Advantage 360: bit.ly/Prime-K...
Get production ready SQLite with Turso: turso.tech/dee...
I love Casey: well spoken, knowledgeable, easy to follow even for a non-native English speaker (edit: I am NOT a native English speaker, sorry for the confusion). Technical enough yet relatively easy to understand
Smart yet humble, good combo and makes for good teachers
@@pablomelana-dayton9221 he's not very humble
yeah i like him for another reason too
@@mattmurphy7030 he actually is.
??? Hearing this guy has been the most infuriating experience this week. He just CAN'T get to the point, holy... he kept rambling. I'm at minute 21 of the video and he STILL hasn't gotten to the point he started making at minute 3. He reminds me of the boomer engineers I work with who just ramble and complain and never get anything done.
Hour and a half with Casey? YES!
You sound like an anime girl and I'm all for it 👍
Casey is the best. He's forgotten more than I know. And I'm just a bit behind him, staring down 30 years as a Software Engineer.
I am in awe of how verbally articulate he is over such a wide range of knowledge, in depth. Both wide and deep knowledge + articulate is a very rare gift and puts you at the top of the top in Engineering.
I've had the good fortune of working directly with several "Distinguished Engineers" over my career, and Casey has all of the same qualities.
Humble, incredibly articulate at a very detailed level across a wide range of subjects, doesn't talk in absolutes and knows to mention some of the tradeoffs, and knows when he is getting into areas where he might lean on someone else for specific expertise.
They are the best people to work with, and they know how to work with people at different levels without being patronizing or making you feel imposter syndrome.
Casey is definitely in that class of Engineering, and it's always a treat how well he and Prime work together despite coming from very different backgrounds.
Well done as always, gentlemen! I learned so much from this video that I had to come back and edit my original comment to add much more.
I could listen to Casey talk for DAYS and not be bored
DA : Once you know the stuff, you will get bored. It's like a machine on repeat.
@@RealGrandFail I feel like once you know the stuff, the joy comes from teaching others!
@@grimm_gen totally agree 💯
mollyrocket is his YouTube handle (his wife writes children's novels, I think, if I remember the lore correctly?), he has several amazing vids on there!
Casey is my favorite of your guests. Always love when he's on
Casey is such a great guest! I always learn so much when I watch these videos
As an embedded engineer, this was so great to listen to. It's hard to find good content in the embedded domain.
I remember back when I was in high school trying to get into game dev, I found Casey's GJK video. Reading the paper was way over my head with academic language and math symbols, but his walkthrough helped me implement it and EPA. It really helped me see that stuff that seemed untouchable (papers, cryptic code, abstract math symbols) was understandable if you broke it down, took it step by step, and tried to visualize it.
I wish I had more teachers like him back in school, or more material like his available back then. Kids these days are really lucky to have content like this available almost effortlessly
It's both a blessing and a curse. Great learning materials are out there and readily available if you know where to look, but knowing where to look is the hard part, with low quality or outright hostile content often winning at SEO and pushing down the gems.
The issue of junk search results is only growing, hopefully soon we get hypergoogle.
Wanted to put this on as background; turns out I can sit on my toilet for a whole 1.5 hrs just listening to this.
Very informative! Thank you Primeagen and Casey!
People seem to forget that both Intel and AMD had RISC CPUs already in the early '90s. One of Sega's most popular arcade games used the Intel i960 (Sega Rally, yeaaaahhhh)
True. I still have an i860 in my NeXTcube. At some point, Intel also made an ARM CPU: the XScale.
@@MartialBoniou oh, a NeXT cube :O I love that design. Whenever I did a drawing of a computer I made it look like a NeXT Cube :) I had forgotten about the XScale actually lol :)
OMG, we only have the i9 today and there was already an i960 in the '90s
This was a great talk from Casey, especially off the top of his head. There is one thing I would like to add about "the ARM ISA": there is not only one but a bunch of them. The most important ones are Cortex-A, -M, and -R. Their main difference is how you attack performance requirements from a (discrete) math point of view.
Cortex-A is the general-compute approach. They are designed to run an OS and are used as CPUs in phones, mobile, or AI clusters. Their goal is pure compute power, even at the cost of determinism or safety, with things like branch prediction, chunkwise caching, etc.
Cortex-R is for real-time applications like the ABS/ESP in a car, a flight controller, or the primary control of a power/production plant. They are designed to guarantee a computation within a certain timeframe, provide redundancy, private memory for certain things, etc.
Cortex-M is for microcontrollers. In very broad terms they are a hybrid of R and A. They can offer a few real-time features while still doing some general compute when necessary. They are a great choice for a car door with the window control and a few buttons.
Intel used to have different sets (x8150, x82, etc.) but the portfolio narrowed down to what is known as x86 today, while ARM diversified from the original ARMv1/ARMv2 chip. They are also roughly the same age; they just grew up in different industries.
Ian Cutress did an interview with Jim Keller and has a clip that would make a great supplement to this titled "Jim Keller: Arm vs x86 vs RISC-V - Does it Matter?".
Casey just seems like such a wonderful human being.
Casey is better than wikipedia
No doubt
Most things are
@@grendel_eoten nah, wikipedia is way better than e.g. most social media, including youtube comments. wikipedia is also way better than many youtube videos, especially when it comes to stuff like accuracy
@@asdfghyter Get a degree in aerospace engineering and try to use Wikipedia for anything related.
So the processors that fetch multiple instructions in one cycle are called superscalar, and they can be either in-order or out-of-order execution. When out of order, instructions undergo register renaming (using a map and a free list of physical registers) to remove all dependencies other than true dependencies, and get dispatched into a buffer (the Register Update Unit) where they wait until their operands are ready. A group of instructions gets picked from this RUU and executed once all the dependencies are resolved. Then there is an in-order commit for the instruction at the head of the RUU. So we get in-order dispatch, out-of-order execution, and in-order commit.
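To make the map + free list part concrete, here's a minimal toy sketch in C (the table sizes, names, and the missing retire/free stage are all made up for illustration; no real core is this simple):

```c
/* Toy register renamer: architectural registers map to physical
   registers via a map table; a free list supplies fresh ones.
   Writing a register gets a new physical name, so false (WAW/WAR)
   dependencies disappear; only true (RAW) dependencies remain. */
#include <stdio.h>

#define NUM_ARCH 4   /* architectural registers r0..r3 */
#define NUM_PHYS 16  /* physical register file         */

static int map[NUM_ARCH];               /* arch -> phys mapping    */
static int free_list[NUM_PHYS], nfree;  /* stack of free phys regs */

static void rename_init(void) {
    nfree = 0;
    for (int p = NUM_PHYS - 1; p >= NUM_ARCH; p--) free_list[nfree++] = p;
    for (int a = 0; a < NUM_ARCH; a++) map[a] = a;  /* identity at reset */
}

/* Rename "rd = rs1 op rs2": sources read the CURRENT mapping, then
   the destination gets a brand-new physical register. (A real core
   frees the old mapping at retire; omitted here.) */
static void rename_instr(int rd, int rs1, int rs2) {
    int p1 = map[rs1], p2 = map[rs2];
    int pd = free_list[--nfree];        /* allocate from free list */
    map[rd] = pd;
    printf("r%d = r%d op r%d   -->   p%d = p%d op p%d\n",
           rd, rs1, rs2, pd, p1, p2);
}

int main(void) {
    rename_init();
    /* Two writes to r0: without renaming the 2nd must wait (WAW);
       with renaming they target different physical regs and can be
       in flight simultaneously. */
    rename_instr(0, 1, 2);   /* r0 = r1 op r2 */
    rename_instr(0, 2, 3);   /* r0 = r2 op r3, independent now */
    return 0;
}
```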
One big extra power burn with x86-64 devices is that the platform is desktop and laptop with expandable RAM. You need more voltage to drive big RAM sticks that sit further away. ARM, meanwhile, has always been on embedded with soldered-down RAM. Intel just demonstrated with Lunar Lake chips, with RAM soldered on the laminate, that saving the memory-controller voltage puts them a LOT closer to Apple Silicon in terms of performance per watt. You could bucket a big thing like RAM config into Casey's business explanation. REALLY good explanation from Casey!
ARM was developed as a desktop CPU though, and that's where it started. On the desktop.
@@-_James_- thanks for the correction. It wasn't until 1992 that the Apple Newton was a mobile device with an ARM CPU in it.
To be fair, mobile Atom CPUs used in cellphones of the era were using embedded DRAM too.
First there are cores developed by ARM UK and GPUs developed by ARM Norway, then there are third party designs, by Qualcomm and Apple.
@@Loanshark753 Intel had some ARM designs for a while too after they acquired them from DEC.
That was FANTASTIC!!! Pretty nostalgic too. I was lucky enough to build my 286, 386, & 486 computers back in the day when they came out. If they'd kept that naming convention, I wonder if the latest computer would be a 10086 or 20086 by now... I totally had an assembly course in college. It's good to know "nobody" writes that stuff nowadays. If you still do, then consider yourself nobody.
You'd be happy to know that there is a new Intel 285 chip coming out soon! The Core Ultra 9 285 has 24 cores and is among the highest tier of the upcoming Arrow Lake chips.
@@r.k.vignesh7832 Yikes! I almost thought they went backwards. 286 was short for the 80286 processor... looks like the 285 is short for 285K (285,000). Not sure if those numbers are a true apples to apples comparison but at least they are headed in the right direction. 😅
@@Angel-Fish The K is used to distinguish chips w/ unlocked multipliers from the standard ones. There will also be a 285 non-K. This would have been the Core i9 15900(K) with last year's naming scheme, but they changed it for some reason. Probably to confuse us even more.
I had no idea Godbolt was named after a Mr. Godbolt!!!! He just took the #1 spot on the "best surnames of all time" list from my friend Mr. Goldhammer
Casey’s performance aware programming course is so rad, this dude rules
You guys really need to just start a podcast. The chemistry is great, Casey is a blackhole of knowledge and Prime keeps the mood lighthearted and fun.
This was a great one. I spent thousands of hours programming the 6502, M68000, and M68020 back in the ’80s and ’90s. It was a lot of fun, but nowadays I’m quite happy to be coding in higher-level languages, especially my favourite - Clojure. Still, I sometimes miss the days of programming in Assembly and C. There was something special about having complete control over everything running on the machine.
Yep, past few years, been filling in and expanding knowledge and capability in assembly, for fun
Assembly is still pretty fun, it's just a lot of instructions to keep track of. I've messed around with doing a basic X11 hello world and it was almost 1000 lines
Same for me... 6502 and 68000. I still prefer lower-level coding. Most of my work is with legacy C code and C++
It’s time to reboot the “Jeff and Casey” show with the new “Prime and Casey” show.
I would love to see Jeff interact with Prime too. And throw in Jon Blow there too.
Great talk! A good follow-up topic might be the memory model differences because (1) it's one of the major differences an actual programmer might hit when porting code from x86 to ARM, and (2) I would imagine it has power consumption implications since x86 chips are required to do more possibly useless work to keep caches coherent.
I love the way Casey explains stuff. I learned so much just from his preamble.
"I can't believe we're doing all of this just to run JavaScript"
lmao
I didn't think much about ARM until I had to program data transfer using DMA. The ARM DMA subsystem is a marvel to behold, a fine piece of art.
I've been following Casey since he started Handmade Hero and I love the dynamic between you two.
Power consumption is a byproduct of the electronics design (transistor architecture) and NOT ANY firmware or software characteristics. That's why the first ARM chip just happened to be able to operate using stray electric currents from peripheral components on the PCB. That wasn't on purpose but something that was discovered by accident. Well, that sort of discovery now becomes a desired "feature" to pursue on purpose and here we are.
That is true. However, energy = power × time. So if a process takes longer to execute, it can consume more energy even at a lower power draw. So for a particular application, a lower-power device is not guaranteed to be more energy efficient.
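Quick worked example with made-up numbers: E = P × t, so a 5 W chip that needs 10 s for a job burns 5 × 10 = 50 J, while an 8 W chip that finishes the same job in 5 s burns 8 × 5 = 40 J. The "higher-power" chip actually wins on energy.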
i am 30 minutes in and i think i can listen to casey 10 hours. 👍🏽
Casey has literally flipped my approach to web performance on its head. Love it!
0:58 Prime being hilarious while ruffling a lot of feathers completely on accident.
The risc-v guys really don’t like being called cisc even though it essentially turns into one the moment you include any of the common high perf extensions.
I think there's not really a solid boundary between RISC and CISC, but I reckon RISC-V at least does it well by splitting the entire ISA into extensions which each have an individual purpose, as opposed to having extensions hacked on with new versions or whatever. I believe the beauty of RISC-V is that you can create tailored chips for a specific application. For example, you might slap on a bunch of vector and parallelisation extensions but leave out stuff like atomics to get a low-power, efficient GPU (ofc the technology isn't really developed to that point, but that's the theory anyway). So RISC-V is really good for specialised chips, as opposed to desktop CPUs, which are pretty much always going to devolve into CISC at some point anyway
Used the BBC micro B at school.... It was the business.... The RISC based Archimedes was on the horizon and it was truly from another universe 😊. It was so far ahead it was indescribable in the late 80s... It was a jump from 8 bit to 32... That's pretty massive.... Price tag to match.....
Love The Primeagen’s priorities on display! ❤
Finally got time to sit and watch this. I absolutely love these chats with Casey, I always learn so much. He is an amazing teacher and I'm glad there are people out there like him. I'm so glad Prime has him on and that Casey wants to be on as well. Can't wait for the next lesson.
Low level programming but in simple language.
What a treat!
❤❤❤
I wouldn't say LLL talks in an overly complicated way
Another Casey video, this is just what I needed to make my day.
The amount of preamble here was v precisely calibrated - I’ve never looked at assembly at all, but followed every point made, expertly done!!!
Thank you Casey, it's always a treat to learn from you.
Thank you for going slowly to make sure that you don't leave anyone behind, Casey! Thank you!
Thank you for introducing the Godbolt compiler explorer for those of us that didn't know it. Having done some x86, PIC, and other chip assembly programming in school long, long ago (that I hardly remember), this is a great primer for demystifying low-level instructions. There is a small hang-up I'd love to get his take on for clarity: I seem to recall that x86 had a much, much larger instruction set, with machine instructions that would take 10-20 cycles to execute, while the more basic (Motorola etc.) chips did not; the more basic chips used, AFAIR, only the accumulator to perform operations (with few exceptions), while x86 allowed a subset of instructions to perform operations entirely within CPU registers without touching the accumulator. Even ops like addition directly to memory locations were possible (beyond the CPU registers), whereas basic chips would have to move those values from memory to registers, perform the add, and then move the result back from the accumulator to the original memory location.
All this to say, the idle power draw of the extra transistors x86 needs to perform ops on so many working registers was significantly higher, and as a result the x86 arch was not as power efficient over the long periods where it doesn't use those extra functions. Is that still the case, or is the ARM arch now as "bloated" as x86, with a transistor count in the same ballpark order of magnitude?
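To illustrate the memory-operand point from above, here's a hedged sketch; the assembly in the comments is hand-written for illustration, not verified compiler output:

```c
/* The same C statement can compile to a single x86 read-modify-write
   instruction with a memory operand, while a pure load/store
   (accumulator-style) machine needs three separate steps. */
long counter;

void bump(long x) {
    counter += x;
    /* x86-64 (memory operand allowed):
           add qword ptr [rip + counter], rdi
       load/store style (classic RISC / accumulator machines):
           load  r1, [counter]
           add   r1, r1, r0
           store [counter], r1                                   */
}
```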
Man. This guy is so good at explaining things that even someone such as myself that doesn't code can understand.
Casey is so powerful that Flip actually zoomed in when he said it.
As someone that did some ARM assembly writing for learning and such, this was really cool to listen to.
Ex-system architect here. Instruction sets are not the issue; it's how the design is architected. As an ex-BIOS engineer who worked on APM and ACPI and later specialized in power management on ARM devices, it's just night and day how differently the two architectures approach design.
One example of why instructions don't matter: when I was a BIOS engineer, I worked in x86 asm. When I worked on ARM, I mostly used C/C++. Only on rare occasions did I have to use JTAG and debug in asm, and that's almost never the issue.
On the power implementation approach, for x86 it's almost an afterthought. The ARM platforms I worked on literally think of every possible way to try to improve power in every iteration.
Great to come across someone who’s really familiar with it. For Intel - WHY is it an afterthought? Don’t they have as much to gain from the same?
But by the original notion - isn’t it expensive to run all this fancy decode outside the core when modern compilers just aren’t using the breadth of x86? Surely that’s a whole bunch of transistors ARM just doesn’t need to contend with?
@@Freshbott2 I didn't work for Intel but I suspect it's purely politics. They had an ARM license back in the day when they did the PXA270, and they know how it works. The fact that they sold it off and didn't apply much of it to their own architecture (at least from an external POV) suggests they just didn't care for it enough. I'd assume they were making so much money on the server side that they just didn't care about the ARM threat.
On fancy decode, it's not that expensive to run outside (just think how mobile works). It's also not that complex to add support to compilers (maybe back in the day, if they had to add it to gcc). Or the ISA can have prefetch-style hints so it sort of knows a certain kind of workload needs to be offloaded to the correct component/core.
Again, just think how mobile works. It has all the features of a pc in a SoC.
About the ARM chip using basically zero power: if memory serves, the anecdote is that the input power of the clock signal for the display was enough to power the rest of the chip
That's how I remember it. Or was it current on the data pins? Something like that. Not electric fields though, never heard of that. And doesn't really make sense, either. :D
@@ControversialOpinion input signals in general most likely yeah, might have a variation of which input depending on where you heard it from haha
It was voltage leakage from the support chips that provided enough power for the first ARM samples to run without any dedicated power supply of their own.
Lmao prime bailing to deal w the kid is brilliant. Love it
Casey is right that it's not the ISA that's mostly affecting efficiency. Intel Lunar Lake is an example of how x86 can match or even beat ARM in terms of low power - while keeping backwards compatibility.
Intel and AMD just needed to prioritize low power and Apple + Qualcomm finally gave them a real reason to.
Lunar Lake has similar performance, heat, and battery-runtime numbers to the M3 and Snapdragon. See Just Josh's Lunar Lake video for more about this.
However, ARM is still better positioned, since it's more open and more competition is happening there to get the best performance per watt.
For the variable-length instruction decoding on Intel, the CPU doesn't necessarily need to decode what the compiler generated; it can theoretically decode something else.
The CPU executes what is in the instruction cache, and the move from memory to instruction cache is slow. In theory you could remove variable-length instructions on the fetch into the instruction cache and give the CPU fixed-length microcode instructions.
That has cons. Intel CPUs are designed to execute legacy x86 instructions, and these are inherently variable length. Converting instructions into fixed-length microcode would require a significant architecture overhaul, impacting compatibility with existing software and instructions. Intel CPUs already have optimizations like the micro-op cache: it holds decoded uops for reuse, reducing the need to repeatedly decode instructions from memory. This already achieves a similar goal of reducing decoding overhead by reusing pre-decoded instructions.
> For the variable length instruction decoding on Intel, the CPU doesn’t necessarily need to decode what the compiler generated, it can theoretically decode something else.
No. The incoming instruction stream, regardless of whether it is variable or fixed length, has to be decoded as is.
> The CPU executes what is in instruction cache and the move from memory to instruction cache is slow.
As slow as the memory system can operate at, provided that software does not interfere by making things worse - which sadly is a common case. Without reuse caching is not faster than directly running off memory.
> In theory you could remove variable length instructions on the fetch to instruction cache and give the CPU fix length microcode instructions.
In practice this is what various platforms did and continue to do in various forms for several decades. What gets fed into the core from the instruction stream perspective is very different to what is actually being acted upon internally.
i 💜 Casey Muratori's deep dives
Love to see Casey, please come on more often!
x86 is like utf8 and ARM is like utf16
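To stretch the joke into an actual sketch (toy C, not real decoder code): with a variable-length encoding you can't locate element N without walking everything before it, which is exactly what makes wide x86 decode hard; fixed-width elements can be located and decoded independently:

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-8-style: the length of each element depends on its first byte,
   so locating codepoint n means scanning 0..n-1 first (assumes
   well-formed lead bytes; purely illustrative). */
size_t utf8_offset_of(const uint8_t *s, size_t n) {
    size_t off = 0;
    while (n--) {
        uint8_t b = s[off];
        off += (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
    }
    return off;
}

/* Fixed-width style (like a fixed 4-byte ISA): the offset of element
   n is just n * 4, with no dependency on earlier bytes, so many
   elements can be "decoded" in parallel. */
size_t fixed32_offset_of(size_t n) {
    return n * 4;
}
```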
As someone from, like, a data science/machine learning background, I always have no idea where Casey is going, but I always love to come along on the adventure and I always learn something new. Pulling up the web tool and following and playing along really helps with this video!
Casey's channel is "Molly Rocket" btw; it always escapes my brain and then I remember. In case you are looking for it u.u
Casey is the GOAT. I can't get enough
ANOTHER CASEY VIDEO!!! ❤🎉
You should have someone on to talk about the difference in memory models (x86 strong, arm/riscv weak).
Also worth touching on how the C11 memory model's adoption has made far more software compatible with weak memory models.
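A minimal C11 sketch of that, assuming only <stdatomic.h>: the same source is correct under both models, but on x86 (strong/TSO) the release/acquire pair costs almost nothing, while on ARM (weak) the compiler has to emit explicit ordering instructions:

```c
#include <stdatomic.h>

int payload;           /* plain data                */
atomic_int ready = 0;  /* flag guarding the payload */

void producer(void) {
    payload = 42;                                 /* write the data */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);  /* publish it     */
}

int consumer(void) {
    /* acquire pairs with the release above */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                          /* spin */
    return payload;  /* guaranteed to see 42 on both x86 and ARM */
}
```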
1:17:08 I remember when Intel introduced new instructions specifically for XML parsing. I would not be surprised if we see JSON parsing instructions in the next i9 or something.
EDIT: I exaggerated quite a bit: SSE4.2 text processing instructions are general purpose, not intended for XML processing only.
Seriously? Tried to google it to find what instructions do this but found nothing. Do you have sources?
@@poteitogamerbr2927 SSE4.2 text processing instructions: PCMPESTRI, PCMPESTRM, PCMPISTRI and PCMPISTRM. I guess when they were introduced, XML was the new hotness, and these were marketed accordingly. Looks like they actually are general purpose and can be used for JSON processing too.
@@KvapuJanjalia Those are really just for string searching. You can use them to implement for example strpbrk. And they have a variant for null-terminated strings.
@@KvapuJanjalia thanks, it seems very cool. I wonder if compilers like gcc actually optimize, say, C code into those instructions, since they are very specific, or whether you must call them directly.
@@poteitogamerbr2927 that might depend on a couple of things.
As far as I understand, if it's a fairly widely supported instruction then your compiled binary may contain it with a fallback for a chip that doesn't support it.
If it's quite specific you might need to let the compiler know through flags to include it.
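For anyone curious, a hedged sketch of calling one of those via the _mm_cmpistri intrinsic (strpbrk-style search in a single 16-byte chunk; compile with -msse4.2; real code would also have to handle strings longer than 16 bytes and page-boundary loads):

```c
#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stdio.h>

/* Return the index of the first byte in `chunk16` that appears in the
   delimiter set, or 16 if none; implicit-length mode stops at NULs. */
int first_delim_index(const char *chunk16) {
    static const char set_bytes[16] = ",:{}[]\"";  /* NUL-padded set */
    const __m128i set  = _mm_loadu_si128((const __m128i *)set_bytes);
    const __m128i data = _mm_loadu_si128((const __m128i *)chunk16);
    /* PCMPISTRI: scan `data` for any byte present in `set` */
    return _mm_cmpistri(set, data,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_POSITIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT);
}

int main(void) {
    char buf[16] = "abc:def";                 /* rest zero-filled */
    printf("%d\n", first_delim_index(buf));   /* prints 3 (the ':') */
    return 0;
}
```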
I can’t shake the feeling that this discussion becomes second guessing after some 40 mins. It’d be good to invite Jim Keller on the show.
In the early '80s, if you had an 8-bit Commodore (VIC-20/64) with a MOS 6510 and your programs had to run at full speed, there was nothing but assembler.
I bought "Creating Arcade Games on the Commodore 64," and I think I also bought a machine language book, too. Sadly, I didn't get very far with either book. But, I remember the excitement I had finding out that books like that existed, because I really wanted to program games. Too bad I didn't have the skills that others did.
@@michaelday341 Basic was better than nothing.
Exactly this got me into assembler on the C64. Pure performance poverty 😂 Not even a compiler. Just writing code directly in my Power Cartridge monitor.
I loved this talk, I learnt a ton, and it helped me understand everything so much better
this is absolutely fantastic! very informative!
thank you, finally something besides the typical ARM RISC copypasta that's 30 years out of date.
Godbolt sounds like a man who is blazingly fast!
Fun fact: the A in ARM originally stood for Acorn, the makers of the BBC Micro... The first ARM chips were literally Acorn asking how they could make a sequel to the BBC Micro [or one of its successors; I'm not British, or a computer historian for that matter] 😊
Legendary video with a mandatory algorithm boosting comment from me.
SO glad this is finally up. ARM is on my "to RUN" list. It's apparently effective at reading malware. I've been spoiled by Lua, Python, JavaScript and so on.
56:00 A guy in a documentary I saw said they forgot to connect the Vcc rail, but the first Acorn RISC Machine chip was able to run on the currents passing through the pull-up resistors (stuff that stabilizes bus state).
Love it, the content we need. Thx ❤
Lot of knowledge and history here! Sounds like ARM instructions are a better design, I'll keep it in mind
About the ARM no-power anecdote: there is an interview with one of the engineers who worked on the first ARM chip at Acorn (ARM used to stand for Acorn RISC Machine) in which he explains that when they first tested the prototype, they measured 0 mA flowing into the power rails. They soon realized the power rails were disconnected, but the chip was working anyway because current was flowing in through other pins on the package. It doesn't mean the chip used virtually no power, only that it used so little that the input signals and capacitors alone were enough, without the power rail connected to anything.
56:00 There's a great 3-parter video interview with Sophie Wilson on channel "Charbax".
If I remember correctly, she talks about the low power ARM stuff in one of those.
Well... x86 has around 1600 instructions, ARM around 150, and RISC-V (GC) around 40... but that's not the sole deciding thing. On RISC-V the instructions are no longer human readable (if that's even possible) in their hexadecimal form; they're optimized so the instruction decoding logic can be as simple as it could get (see the sketch below). So if we compare those, compare comparable things.
But other than that detail, fantastic video and great knowledge shared by Casey! Thank you very much!
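To the simple-decode point: every RV32I instruction is 32 bits with fields at fixed positions, so a decoder is just masks and shifts. A small sketch for the I-type format (e.g. ADDI):

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the RV32I I-type fields of a 32-bit instruction word. */
void decode_itype(uint32_t insn) {
    uint32_t opcode =  insn        & 0x7F;  /* bits  6:0  */
    uint32_t rd     = (insn >> 7)  & 0x1F;  /* bits 11:7  */
    uint32_t funct3 = (insn >> 12) & 0x07;  /* bits 14:12 */
    uint32_t rs1    = (insn >> 15) & 0x1F;  /* bits 19:15 */
    int32_t  imm    = (int32_t)insn >> 20;  /* bits 31:20, sign-extended
                                               (assumes arithmetic shift) */
    printf("opcode=0x%02X rd=x%u funct3=%u rs1=x%u imm=%d\n",
           opcode, rd, funct3, rs1, imm);
}

int main(void) {
    /* addi x1, x2, -5: imm=0xFFB, rs1=2, funct3=0, rd=1, opcode=0x13 */
    uint32_t addi = (0xFFBu << 20) | (2u << 15) | (0u << 12)
                  | (1u << 7) | 0x13u;
    decode_itype(addi);
    return 0;
}
```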
i cant get enough of casey talking about computers ❤
If you get Casey on again for a similar topic, I think reading through and discussing David Chisnall's article "There's No Such Thing as a General-Purpose Processor: And the belief in such a device is harmful" would be interesting -- he goes into things like the energy impact of complex decoding machinery.
"I only look at it occasionally" lol after that knowledge bomb
I adore the Casey streams and the rabbitholes²
Maybe I do not understand all the details, but I think the memory model is way more important in the limit. x86 is way more restrictive in how it can reorder memory accesses (for atomic operations it behaves close to memory_order_seq_cst); in spirit it is very similar to the GIL in Python. ARM is free to do way more reordering, and given how slow memory access is, I can see how this difference can bring a substantial edge in performance.
I learn so much from this, quality content
By the time I reached 29:21 I was absolutely excited. This is the most interesting stuff.
i love hearing casey talk about anything
Love these videos with Casey
It is lower power because there is a lower count of transistors AND there are fewer switching transitions per productive computation. Initially, anyway. Then, yes, the trend to lower voltages and the physical layout of transistors. Still, those initial design constructs count. Also, switching to Thumb mode is a way to power down extra circuitry in the chip. Power is burned when a transistor transitions.
2:27 1500-3600 instructions vs 240 instructions... yeah, real hard to justify "reduced" instruction set
I loved coding ARM assembler: just add conditional suffixes to any instruction to skip or execute it (B = branch... BLE, BEQ...), so simple. It avoids branch hell in CISC (see the sketch below); Wilson did a top job on the instruction set. 54:25 not quite true: on the older CPUs, having to wait 11 cycles for a single op to complete was not uncommon, especially when you had NOP instructions.
> Wilson created the ARM instruction set to be the "programmer's dream wish list". Hauser did the layout for ARM1.
> The first ARMs had minimal microcode.
> Lower IPC: RAM at the time, on other processors, was waiting for those 11 cycles to complete before getting the next op.
> Optimising your compiler for ARM: easy peasy. Optimising for x86, with those dozens of sets of extensions: like playing hopscotch in a minefield. This is why AMD & Intel came together; the problem has gotten that bad.
> The tale of zero watts is true, kinda. When they measured the power draw they were very worried about power and heat, because if it was high it would require a much more expensive ceramic housing. Anyway, the digital meter measured 0000 watts. True story. After some investigation, they found that power was actually leaking in via the address bus, and that was enough to run the CPU at full speed; the CPU power pin read 0 V because of a board defect.
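A small illustration of those conditional suffixes vs. a branch (the assembly in the comments is hand-written classic-ARM/x86 style for illustration, not verified compiler output):

```c
/* Conditionally incrementing a value: classic (pre-AArch64) ARM can
   predicate the add itself instead of branching around it. */
int bump_if_equal(int a, int b, int x) {
    if (a == b)
        x += 1;
    return x;
    /* classic ARM, predicated, no branch:
           cmp   r0, r1
           addeq r2, r2, #1   ; executes only if the EQ flag is set
           mov   r0, r2
       typical x86, branch (or cmov) required:
           cmp  edi, esi
           jne  .skip
           add  edx, 1
       .skip:
           mov  eax, edx                                          */
}
```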
Love the opening 😂
Thank you, Casey.
4:34 to 4:40 - That's when Prime started getting off track and Casey canceled that tangent in a heartbeat.
Anyone interested, check out the British film Micro Men (available on YouTube legitimately). It covers the story of the companies and engineers behind Acorn and Sinclair, and the development of the BBC Micro, Spectrum, and Electron (Clive Sinclair, Chris Curry, Steve Furber and Hermann Hauser).
That was great. I learned a lot - thank you!
54:23 At minute 54 and orthogonal memory access has not been mentioned. 1:01:09 Besides the decoders: the orthogonal memory access modes are why x86 needs more transistors to implement.
"Read assembly language fluently"
Prime is the best talk show host!
love Casey! he has a big brain
What I take from this is that x86 comes from a very old place where instructions didn't take more than 2 bytes, but as time went by, the need for bigger instructions led to a solution built for backward compatibility, which made instructions take more clock cycles just to figure out what you're trying to do. ARM, on the other hand, decided (probably thanks to experience) to keep a fixed size for instructions, at a size they thought would be enough, thus making them all take the same time to decode, which would be (I would assume) 1 clock cycle.
The other thing I take from this is that there's not a big necessity for better CPUs, and the companies are relying on programmers wasting resources so that people need better products due to that inefficiency, so they can keep the marketing going, which is... concerning.
More Casey please!!
The original TDP design goal for the first ARM chip was under 1 watt. This was a design goal to keep the packaging cost down; it's covered in a few interviews with the original designers. I think they wanted a plastic package vs. a ceramic one: the ceramic package would cost $10 to manufacture vs. $1 for the plastic. Market forces pushing later designs is probably correct; they had to design each chip to come in under an already low bar. Intel, on the other hand, could slip by a few watts or just bin the chip at a higher TDP, as long as performance was somewhat in line for that TDP.
Casey
It's really good in C and C++ if you have the right tools (although even stupid gdbgui can align source lines with the assembly it's producing nowadays, and simply gives you an "optimized out" message when the compiler has done so) and your compiler knows how to output symbols for them (most tools just use either DWARF or sometimes STABS). You can usually get a general idea of what's happening and eventually learn to see when the codegen did something stupid.
C and C++ are both low-level languages that were designed to abstract assembly. So you don't want something interpreted, because then you're actually profiling the interpreter, not the code you wrote, and you probably want something that is fairly low level and doesn't abstract away too much, or you won't be able to match your block of code to the assembly output.
When is it good? Generally I'd argue any time you're working on a platform with limited resources, or you're interested in low-level performance for some reason: there's a piece of code you need to run at scale, and avoiding the need to double your infrastructure makes optimising that hot loop worth doing.
Because it's always a tradeoff. I know Prime and Casey like to say it always matters to optimise, but often the extra cost of the hardware required because the most optimal language was not chosen is a fraction of the cost of the engineer able to do said optimisation.
Great stuff! More Casey please :)
I suspect that things like the branch predictor, instruction pipelining, and other chip designs/architectures would affect power more than the ISA. If you want the instructions to run faster you need to do more work on the CPU that may consume more power. If you want to use less power you are constrained in the kinds of CPU-level optimizations you can do.
Love these discussions
It's amazing that we get to tap into this man's knowledge for free