are stack based vms really slower?

  • Published Oct 17, 2024

COMMENTS • 66

  • @joseph199627
    @joseph199627 1 year ago +68

    Very interesting, I imagine it took a lot of head scratching to really understand what was going on with those experiments. Great video

    • @leddoo
      @leddoo  1 year ago +15

      thanks!
      yeah, took me about a week to figure out what was going on :D
      initially, i thought it was just about the number of memory accesses, but things were just not adding up.
      the experiments from the video are the ones that convinced me it was about dependencies. but yeah, there were more :D

  • @SaHaRaSquad
    @SaHaRaSquad 1 year ago +13

    I once wrote a small stack-based vm but had two bits left over in the encoded instructions, so I decided to just have two independent stacks and the last two bits determined which stack would be read for arguments and which would receive the result.
    I didn't do a lot with it so I'm not sure about how this affected performance, but I think it could allow quite some optimizations.
    Also I saw a talk about a very small chip running Forth code, and it used a stack as well as two registers, so a bit of extra space for temporary values really pays off I guess.
    Oh, and because I saw your lisp-based syntax: It is surprisingly easy to write a simple Lisp-to-Lua transpiler. Lua lets you use most things as expressions and also supports tail-call optimization, so you can basically map Lisp expressions 1:1 to equivalent Lua code.
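The two-stack idea above can be sketched in a few lines. This is a hypothetical reconstruction (the original VM's encoding isn't shown in the comment), with the two selector bits modeled as explicit arguments rather than packed into the opcode:

```python
# Hypothetical sketch of a two-stack VM: the low two bits of each
# instruction pick which stack supplies operands and which receives
# the result. Here those bits are passed as explicit arguments.
class TwoStackVM:
    def __init__(self):
        self.stacks = ([], [])

    def push(self, which, value):
        self.stacks[which].append(value)

    def add(self, src_bit, dst_bit):
        # src_bit selects the stack whose top two values are popped;
        # dst_bit selects the stack that receives the sum.
        src = self.stacks[src_bit]
        b, a = src.pop(), src.pop()
        self.stacks[dst_bit].append(a + b)

vm = TwoStackVM()
vm.push(0, 2)
vm.push(0, 3)
vm.add(src_bit=0, dst_bit=1)  # pop from stack 0, push result to stack 1
assert vm.stacks == ([], [5])
```

Routing results to the other stack is what could enable optimizations: a value parked on the second stack survives operations on the first without any explicit stack shuffling.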

  • @isotoxal
    @isotoxal 1 year ago +19

    You are now in the list of my favourite youtubers

    • @leddoo
      @leddoo  1 year ago +5

      damn, big words!
      thanks man 😎

  • @JoeHinkle11
    @JoeHinkle11 1 year ago +19

    I haven't seen many channels focus on programming language development like yours. Amazing job! Also, I love your animations. Really makes it clear what's happening when you explain concepts

  • @devtadeo
    @devtadeo 1 year ago +87

    Me wondering why he uses a knife to point around

    • @pp_up
      @pp_up 4 months ago +1

      It's because... it's _pointy_

  • @randomstuffgamess
    @randomstuffgamess 1 year ago +16

    Awesome video! I really like the 'modern cpus go brr (if you let them)' section. And the animations really helped with clarity. How did you make them? And your explanations are really on point.

    • @leddoo
      @leddoo  1 year ago +4

      thanks! the stack & cpu were animated using motion canvas (i think i left links to the components in the description, but they're not updated to the latest version of motion canvas)

  • @OMFGxIamxAxNinja
    @OMFGxIamxAxNinja 1 year ago +8

    Wow, this just popped up in my feed, but I really enjoyed this video! You made it really interesting!

  • @JoStro_
    @JoStro_ 1 year ago +4

    fascinating, i've never considered this aspect of performance before.

  • @shilangyu
    @shilangyu 1 year ago +5

    Subscribed. High quality content and touching exactly the topics I love.

  • @AK-vx4dy
    @AK-vx4dy 1 year ago +4

    Nicely explained, good work !! Keep explaining :)

  • @j-r-hill
    @j-r-hill 1 year ago +4

    Strongly recommend looking into work done by Anton Ertl if you're interested in stack machine compiler optimizations

  • @thomasmewily4012
    @thomasmewily4012 1 year ago +4

    This is exactly the channel I needed: clean, concise, and really good comparisons and explanations of Stack VM vs Register VM, thank you!
    Just one question: a stack VM is like having an infinite number of registers, but with a register VM you usually get a limited number of registers.
    What happens if you use more variables than the number of available registers in a register VM?
    And how does the VM handle that, or even your computer if you compile it instead of interpreting it?

    • @leddoo
      @leddoo  1 year ago +4

      the register vm in my scripting language is currently limited to 255 registers (one byte per operand).
      so the compiler would raise an error (currently it just panics, lol).
      i may add instructions to use more than 255 registers later, not sure.
      python has a special instruction for when there are more than 255 local variables (docs.python.org/3/library/dis.html#opcode-EXTENDED_ARG )
      in general, compilers tend to have all kinds of limits that you never notice during regular use (like max number of nested blocks, etc).
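leddoo's EXTENDED_ARG link can be demonstrated directly: building a function with more than 255 locals forces CPython to emit the escape opcode. The generated names `v0`..`v299` are invented for the demo:

```python
import dis

# Build a function with 300 locals so some LOAD_FAST/STORE_FAST
# indices exceed 255 (one byte) and CPython must emit EXTENDED_ARG.
src = "def many():\n"
src += "".join(f"    v{i} = {i}\n" for i in range(300))
src += "    return v299\n"
ns = {}
exec(src, ns)
many = ns["many"]

assert many.__code__.co_nlocals == 300
# the disassembly now contains EXTENDED_ARG prefixes for indices > 255
assert "EXTENDED_ARG" in dis.Bytecode(many).dis()
assert many() == 299
```

So CPython's "limit" is soft: operands stay one byte, and EXTENDED_ARG prefixes widen them only when needed.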

    • @thomasmewily4012
      @thomasmewily4012 1 year ago +1

      ​@@leddoo Yeah, I was thinking about tricky/edge cases
      Thanks for the link btw ^^

    • @oblivion_2852
      @oblivion_2852 1 year ago +2

      Funnily enough, I came across the max number of arguments in a Java function call because I was transpiling from a language that had arbitrary-length argument lists. It was a really dumb concat function that had 700 arguments, but that day I learnt that Java only supports 254 arguments (since the object itself is always passed as an argument)

    • @thomasmewily4012
      @thomasmewily4012 1 year ago

      @oblivion8459
      Wow, I wonder in what context you managed to have a concat function with more than 700 arguments? (Maybe the function call was written by another program.)
      Looks like you hit some cool hidden technical details too. From memory, in C the limit is at least 127 arguments.

    • @thomasmewily4012
      @thomasmewily4012 1 year ago

      @oblivion8459 The translated function could also take a single list or something iterable that contains the variadic arguments, but I guess it's better not to handle the case and see where the weird code is

  • @kevinrobben1999
    @kevinrobben1999 1 year ago +2

    Learned so much from your videos! Thank you, subbed

  • @Diamonddrake
    @Diamonddrake 1 year ago +1

    Very well explained!

  • @Israel220500
    @Israel220500 1 month ago

    What program do you use to generate the reports on CPU time and microarchitecture pipeline usage?

  • @luna010
    @luna010 1 year ago +2

    great video. I'd be curious to see how your scripting language compares to VFX Forth (which is the fastest Forth compiler/interpreter afaik).

  • @JH-pe3ro
    @JH-pe3ro 5 months ago

    There's a strong path-dependency to how we've ended up in "registers are faster", I think (and I was aware of that before the video, though it's a great explanation of why); register architectures are intuitive when computers are viewed in terms of arithmetic processing, and they map well to local variables. Therefore we have hardware that engages with that, provides a lot of enumerated registers, and asks for languages to utilize all of them, and that has a consequence on VM design, since the VM's goal is to remap program description to efficient hardware utilization: if it looks stack-y, then the JIT rewrites it to register-y code.
    On the other hand, the CPU architecture can avoid this and do something like Minimal 64x4, a non-pipelined hobby computer using discrete logic: it achieves very good performance at a relatively low clock rate by reducing the register set and optimizing the instruction set around zero page memory (a construct which exists on 6502 as well, but is used supplementally to registers), thus algorithm descriptions are smaller and the total sequential work is smaller than on other 8-bits, despite having a lower "MIPS" rating. I believe the distinction between VMs would be relatively less important on Min 64x4 because the stack VM's instruction set would also benefit from the ability to code random access efficiently.
    I've started to think that local variables are worth questioning, though. They serve a certain "black-boxing" purpose of presenting short routines as a matter of loading up the registers in an initial state, doing the computation, and then writing that back using the stack as a protocol between that routine and the rest of the program. But the usual alternative to a stack protocol is a buffer, and buffers have certain benefits in terms of coordination of resource use. If the buffer were implemented like a ring buffer and the computer optimized around that, it presents ideas like the Mill computer's "belt", an interesting concept, albeit one that sadly still hasn't shipped anything.

  • @flareflo362
    @flareflo362 1 year ago +3

    What are the various profiling tools used here? They look really neat!

    • @leddoo
      @leddoo  1 year ago +2

      intel's vtune (on windows, i'm using the msvc integration)
      and apple's instruments (for the macos stuff)

  • @evanfreethy2574
    @evanfreethy2574 1 year ago

    This is my favorite type of video 🎉

  • @Richard_Knusprig
    @Richard_Knusprig 1 year ago +3

    I really like your editing style and how you ... aaahm ..., anyway.

    • @leddoo
      @leddoo  1 year ago +1

      asufutimaehaehfutbscueme, anyway

  • @dimitar.bogdanov
    @dimitar.bogdanov 1 year ago +2

    Really interesting!

  • @aredrih6723
    @aredrih6723 1 year ago +2

    I'm more familiar with the JVM, and in its stack machine the locals (which include arguments) and the stack are kept separate.
    The stack starts out empty and you have to load values from the locals.
    It's a bit as if there were a shelf alongside the stack, with instructions to push the value of a local and instructions to save the current top value to a local.
    Basically, these stack machines _have_ registers, they just allow 2 operations on them.
    I'm not sure how python does it differently or if I'm misinterpreting the visualisation.
    So does python really load all the arguments at the bottom of the stack?
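The separate-locals layout described above can be sketched with an invented Python stand-in (not real JVM code): locals live in their own array and the operand stack starts empty, with explicit load/store instructions moving values between them.

```python
# Minimal sketch of a JVM-style frame: arguments occupy the first
# slots of a separate locals array; the operand stack starts empty.
class Frame:
    def __init__(self, args, max_locals):
        self.locals = list(args) + [None] * (max_locals - len(args))
        self.stack = []

    def load(self, index):   # like JVM iload: local -> stack top
        self.stack.append(self.locals[index])

    def store(self, index):  # like JVM istore: stack top -> local
        self.locals[index] = self.stack.pop()

    def add(self):           # like JVM iadd: pop two, push sum
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(a + b)

# compute locals[2] = locals[0] + locals[1]
f = Frame(args=[2, 3], max_locals=3)
f.load(0); f.load(1); f.add(); f.store(2)
assert f.locals == [2, 3, 5] and f.stack == []
```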

    • @leddoo
      @leddoo  1 year ago +1

      yes, python puts the locals at the bottom of the stack frame and loads them to the top for operations. (i can recommend messing around with python's `dis` module to see how python implements language constructs in bytecode!)
      from what i've read on wikipedia, the stack in the JVM is more of an abstraction though. the jit turns it into regular native register code. similar to how WASM is stack based, but wasm runtimes convert the code to native register code before execution.
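The `dis` experiment leddoo suggests takes only a few lines (the function name `f` here is arbitrary): parameters live in the locals area of the frame and are copied to the stack top with LOAD_FAST (newer CPython versions may fuse consecutive loads into e.g. LOAD_FAST_LOAD_FAST).

```python
import dis

def f(a, b, c):
    return (a + b) + c

# CPython keeps parameters in the locals area of the frame and
# loads them onto the evaluation stack before operating on them.
ops = [i.opname for i in dis.get_instructions(f)]
assert any(op.startswith("LOAD_FAST") for op in ops)
print(ops)
```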

    • @aredrih6723
      @aredrih6723 1 year ago

      @@leddoo Hmm, I did a quick test by making an identity function:
      def ident(x): return x
      It compiled to "load_fast" and "return_value".
      I used the command `python3 -m compileall` to get it compiled, removed the "load_fast" from the compiled function (couldn't find how to load arbitrary bytecode) and added a nop (bytecode 9; for op and arg) after the return to keep alignment (no clue how the format works, I just searched for bytes matching the disassembly and replaced them).
      Then I tried to load the module. It worked.
      Then I tried to call the function. It SIGSEGVs.
      I'll need to dig deeper to find out if my "unconventional" edits are the cause or if the function needs the "load_fast" to put a value on the stack.
      Or I might be completely misunderstanding what you mean by "stack frame". Maybe what you call "stack frame" is what Java calls a "slot" and wasm a "local".

    • @leddoo
      @leddoo  1 year ago

      @@aredrih6723 yeah, load_fast loads the value to the top, return_value returns that value.
      maybe it crashes, cause python does not expect the parameters to be popped?
      by stack frame i mean the area of the stack, reserved for the function invocation. (en.wikipedia.org/wiki/Call_stack#Structure )
      you know, actually i'm not sure whether parameters are on the stack in python (that's what i do in my VM). so maybe it crashes, cause return_value tries to pop from an empty stack?

    • @aredrih6723
      @aredrih6723 1 year ago +1

      @@leddoo So, I tried digging a bit and dis.code_info seems to confirm the existence of locals (which would be registers only supporting writing to and reading from the stack).
      I also saw that python encodes the (I assume maximum) size of the stack and the number of locals.
      (also, "def f(a, b): return 0" is marked as having 2 locals and a stack of 1)
      Now, storing the arguments at the bottom of the stack is a possible implementation, but then you'd either need to copy elements from an absolute index, or every load would depend on the current size of the stack, which has to be tracked in either case.
      Still, it seems to allow for some neatly optimized code so far, so I'm curious how the vm will mature.
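The `dis.code_info` observation above is easy to reproduce (the exact report format varies across CPython versions):

```python
import dis

def f(a, b):
    return 0

# the code object records the number of locals and the maximum
# operand-stack depth separately, as the comment above observed
assert f.__code__.co_nlocals == 2
print(dis.code_info(f))  # includes lines like "Stack size" and variable names
```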

    • @leddoo
      @leddoo  1 year ago

      @@aredrih6723 in my vm, the call instruction currently takes a variable number of register operands & copies those into the newly created stack frame (so params are the bottom `n` registers upon entry, but the callee is free to modify them).
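A hypothetical sketch of the calling convention leddoo describes (all names here are invented): the call instruction copies the chosen caller registers into the bottom of a fresh frame, so parameters are simply the first n registers on entry, and the callee may overwrite them freely.

```python
# Sketch of a register-VM call: copy the selected caller registers
# into the bottom of the callee's new frame.
def call(func_code, caller_regs, arg_indices, num_regs):
    # new frame: arguments first, then zeroed scratch registers
    frame = [caller_regs[i] for i in arg_indices]
    frame += [0] * (num_regs - len(arg_indices))
    return func_code(frame)

def add_fn(regs):
    # callee is free to clobber its parameter registers
    regs[2] = regs[0] + regs[1]
    return regs[2]

caller = [10, 20, 30, 40]
result = call(add_fn, caller, arg_indices=[1, 3], num_regs=3)
assert result == 60  # 20 + 40, copied from caller registers 1 and 3
```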

  • @cemgecgel4284
    @cemgecgel4284 1 year ago

    Great video!

  • @otesunki
    @otesunki 1 year ago +1

    this makes me curious:
    is it faster to push a dummy value like 0, then add the top two elements and copy over the dummy value?

    • @leddoo
      @leddoo  1 year ago +2

      hmm, not sure what you mean 🤔
      my takeaway from that video was: if two "instructions" access the same memory location, they can interfere. (a "read after read" is fine, but "read after write" causes delays)

  • @treelibrarian7618
    @treelibrarian7618 1 year ago

    pretty sure the reason the "smart" version is slower is the rotation - since that would have to read and write every entry in the stack memory to shuffle them down by one. better would probably be:
    Load { src: a },
    Load { src: b },
    Store { dst: a },
    Add,

  • @u9vata
    @u9vata 1 year ago

    I understood this before the vid, being a hiperf coding guy... But I am still not totally sure how well an optimizer can optimize stack based code to avoid these. I used to like Factor for example, and although I never used it for hiperf code like I did with C++, they claim to have a good stack-based optimizer...
    I mean, for example, a stack based language optimizer could spot that your load + dup pattern could be exchanged for two separate loads - I see no reason why the optimizer could not have this rule, and this is just an example.
    Generally informative video though. I did not know about the tool that shows that "green / red" thing for pipelining. What is the name of it? Ah never mind... its vtune. I prefer perf and such tools generally.

  • @nahblue
    @nahblue 1 year ago

    Nice! Does anyone have a link for the pipeline diagram tool?

    • @leddoo
      @leddoo  1 year ago

      link in the description :P
      framework is motion canvas.
      script is not ported to latest version (of mc) however.

  • @blinking_dodo
    @blinking_dodo 1 year ago +2

    This is interesting...

  • @hfspace
    @hfspace 1 year ago

    on the other hand these smart instructions take up less space in the cpu even if they are dependent, since they only occupy one "execution" slot. So in principle you have more capacity to execute other, completely independent instructions. This seems to be a trade-off here, or am i missing something?

    • @hfspace
      @hfspace 1 year ago

      i guess you need the product of occupied execution slots and the time needed to execute to compare different instructions in absolute terms.

  • @bernatrosello4375
    @bernatrosello4375 1 year ago

    I don't think the "parallel" vs sequential argument is necessarily right. Even if you're running both versions, using more/fewer variables gives you register usage hazards, which are going to slow you down significantly on a segmented (pipelined) datapath CPU.

    • @leddoo
      @leddoo  1 year ago +1

      hmm, which part of the video are you referring to? and which registers - cpu regs or vm regs?

  • @OhadLutzky
    @OhadLutzky 1 year ago +1

    That is super cool, and seems highly relevant to Java as well, which people like to at least pretend can be performant. (JVM is stack-based, Dalvik is register-based)

    • @leddoo
      @leddoo  1 year ago +2

      well kinda :D
      wasm is also stack based, but close to native in perf.
      the question is whether the *interpreter* is stack based.
      java is typically jitted, so it runs natively using the physical registers.

  • @nexovec
    @nexovec 1 year ago

    I'm pretty sure not padding every stack frame with 200 bytes helped in making your interpreter faster than python 😂

  • @elarajade7893
    @elarajade7893 1 year ago +1

    First ✌

  • @themcchuck8400
    @themcchuck8400 1 year ago

    You really need to work on your audio levels. They're all over the place. Keep the microphone a steady distance from your mouth for the entire session, and keep your mouth pointed at the microphone.

  • @HXMCPP
    @HXMCPP 1 year ago

    what do you use to benchmark it? 6:44

    • @leddoo
      @leddoo  11 months ago

      i use vtune on windows (the msvc extension) and instruments for macos.

    • @HXMCPP
      @HXMCPP 11 months ago

      does it work with amd ? @@leddoo

    • @leddoo
      @leddoo  11 months ago

      @@HXMCPP i don't think so. amd has "amd uprof", but i'm not familiar with it.

  • @kfsone
    @kfsone 1 year ago

    For some reason people look at dis output and always miss the elephant in the room.
    2 0 LOAD_FAST 0 (a)
    2 LOAD_FAST 1 (b)
    4 BINARY_ADD
    6 LOAD_FAST 2 (c)
    8 BINARY_ADD
    10 LOAD_FAST 3 (d)
    12 LOAD_FAST 4 (e)
    14 BINARY_ADD
    16 BINARY_ADD
    18 STORE_FAST 5 (r)
    You're looking at the center column and not the far right.
    2 0 LOAD_FAST 0 (this_couldnt_possibly_be_slower)
    2 LOAD_FAST 1 (could_it)
    4 BINARY_ADD
    6 LOAD_FAST 2 (just_changing_the_variable_names)
    8 BINARY_ADD
    10 LOAD_FAST 3 (couldnt_make_it_slower)
    12 LOAD_FAST 4 (right)
    14 BINARY_ADD
    16 BINARY_ADD
    18 STORE_FAST 5 (i_mean_theyre_just_hash_lookups_arent_they)
    def f(a, b, c, d, e): r = (a + b) + c + (d + e)
    %timeit f(7, 14, 21, 27, 351)
    from os import *
    from sys import *
    from collections import *
    from itertools import *
    from functools import *
    def well_what_about_this(this_couldnt_possibly_be_slower, could_it, just_changing_the_variable_names, couldnt_make_it_slower, right, i_mean_theyre_just_hash_lookups_arent_they):
        i_mean_theyre_just_hash_lookups_arent_they = (this_couldnt_possibly_be_slower + could_it) + just_changing_the_variable_names + (couldnt_make_it_slower + right)
    %timeit well_what_about_this(7, 14, 21, 27, 351)
    (bonus points if you figure out why the imports)
    (bonus points if you figure why the imports)

    • @kfsone
      @kfsone 1 year ago

      In my testing, "42 + 27 + 351" takes 25ns. Calling "adder(42, 27, 351)" takes 75ns. Changing adder to "adder(first, second, third)" and the body to just "result = (first + second) + third" adds 1.1-1.6ns to each call.
      Well, 1.2ns on top of 76? No. 50ns of that is the overhead of PyCall; the actual instructions are only 25 + 1.2ns, so those few extra letters in the names accounted for 5% of the cost for that one line of 2 additions.
      And *that* is just for top-level local parameters.

    • @kfsone
      @kfsone 1 year ago

      ua-cam.com/video/vobGqYSvHCU/v-deo.html

    • @leddoo
      @leddoo  1 year ago +2

      i actually have no idea what you’re talking about 🤔
      the length of the variable names doesn’t matter at runtime. they’re turned into indices. the last column is just for the programmer.
      you seem to make some other point about call overhead, which is something that can matter, but in this video, the function call overhead was insignificant. and there wasn’t any python in this vid.
      i’m just very confused.

    • @0LoneTech
      @0LoneTech 1 year ago

      Those indices are the second rightmost column (outside the parentheses) and the reason the accesses are labeled "fast". The imports certainly increase the odds of hash collisions, but that only affects the function lookup. Long names take (usually insignificantly) more time to parse, hash and intern (all during compile time), but once interned they can be matched by id.
      You might want to rerun those timeit tests a handful of times; in my tests the differences are entirely noise and swing either way.
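leddoo's point that names become indices can be checked directly with `dis`: two functions that differ only in identifier length compile to identical opcode/argument sequences.

```python
import dis

def short(a, b):
    return a + b

def long_names(this_couldnt_possibly_be_slower, could_it):
    return this_couldnt_possibly_be_slower + could_it

# The bytecode refers to locals by index; the names shown in the
# right column of `dis` output exist only for the reader.
ops_short = [(i.opname, i.arg) for i in dis.get_instructions(short)]
ops_long = [(i.opname, i.arg) for i in dis.get_instructions(long_names)]
assert ops_short == ops_long
assert short(2, 3) == long_names(2, 3) == 5
```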