That's a very interesting topic of which i didn't have even a slightest clue! Before watching i didn't even expect to see such low level stuff, but i'll surely have embroidered in my mind the following: 1. CPU can and most of the time does instruction pipelining. Thanks to this, CPU can also increment PC (program counter) parallel to executing current instruction. 2. CPU can do the upper only if it knows what instructions it has to do, so it does branch prediction. As in the video, this technique can be benefitial if CPU is right about predicted branches, otherwise - not. So, usual switch statement is not good enough because it is only one branch, more chaotic that can spit out only common parts of branches for the CPU. And here comes to the play threaded dispatch which provides more information to the branch predictor which instructions follow which 3. Also, compilers are usually aware of tail calls and can use them when all the conditions for this are met. The main ones (which were noted in the vid) is that callee has the same signature as caller and neither of them has destructors.
The load for the next instruction could potentially happen at the start of the previous one (except for non-conditional jump) which might help target prediction as the load has a higher chance of being known once it's ready to execute. Would be interesting to experiment with. I'm surprised the compiler isn't already moving that load for the jump target up higher, it might think there is a chance of aliasing I haven't looked deeply at the code.
I am wondering... is it possible maybe to do all/most opcodes branchless??? I am thinking maybe it is possible to come up with bytecode whose interpreter can be written mostly branchless... Also a bit sad that regular if statement logic was not measured... that is the non-binary-search, non-jump-table variant if all ops maybe fit in the ICACHE. Also what I would do is to branch first based on instuction just incrementing PC or being a jump instruction... because this is very easy to predict and you can win ++pc being always in ICACHE and pipeline-able this way while branching being "slow path". Maybe would hurt perf for very tight loops of the interpreted bytecodes - but help regular loops.
I still wonder if jumps between direct code don't gave a best chance for branch predictor but i don't know internals of today CPUs. It may pollute I-CACHE but give more seprated places of jump wich could be predicted seperately. On old Arm you could use conditionals in instructions instead of jumps, but in AArch64 they abandoned this feature if i rember correctly I think x64 also has conditional mov now. If code is so small like in this example then maybe better execute all code without branches but only store result(s) conditionaly
If there are some architectures that don't support tail call optimization and that's why GCC is refusing to add an annotation like musttail attribute, then why not just have the compiler refuse to compile code with those annotations for those architectures?! Sounds like an extremely weak argument to not implement it.
Does anyone understand how the direct-threaded dispatch is implemented using tail calls? Specifically, how does the instruction pointer store the function address itself, does that not require that the instruction stream be generated in the binary itself? Or there is a post-processing step that overrwrites all the opcodes with the actual function addresses?
> Specifically, how does the instruction pointer store the function address itself, Yes. You either need to generate the bytecode in the same binary directly, or you have a post-processing step. Either one works. Note that the post-processing step just walks over the bytecode once on startup, so there is no performance impact on the dispatch during the post-processing.
I'll add some context: direct threading was invented for FORTH, for which a lot of implementations are "live-compiler" systems - ie the compiler is incremental and knows where each function/basic op lives in memory; when you define your new function, it goes into the same image in the same address space. FORTH doesn't really do intermediate bytecode, and the artifact of code generation is a single big image blob.
That's a very interesting topic of which i didn't have even a slightest clue!
Before watching i didn't even expect to see such low level stuff, but i'll surely have embroidered in my mind the following:
1. CPU can and most of the time does instruction pipelining. Thanks to this, CPU can also increment PC (program counter) parallel to executing current instruction.
2. CPU can do the upper only if it knows what instructions it has to do, so it does branch prediction. As in the video, this technique can be benefitial if CPU is right about predicted branches, otherwise - not. So, usual switch statement is not good enough because it is only one branch, more chaotic that can spit out only common parts of branches for the CPU. And here comes to the play threaded dispatch which provides more information to the branch predictor which instructions follow which
3. Also, compilers are usually aware of tail calls and can use them when all the conditions for this are met. The main ones (which were noted in the vid) is that callee has the same signature as caller and neither of them has destructors.
The load for the next instruction could potentially happen at the start of the previous one (except for non-conditional jump) which might help target prediction as the load has a higher chance of being known once it's ready to execute. Would be interesting to experiment with.
I'm surprised the compiler isn't already moving that load for the jump target up higher, it might think there is a chance of aliasing I haven't looked deeply at the code.
I am wondering... is it possible maybe to do all/most opcodes branchless??? I am thinking maybe it is possible to come up with bytecode whose interpreter can be written mostly branchless...
Also a bit sad that regular if statement logic was not measured... that is the non-binary-search, non-jump-table variant if all ops maybe fit in the ICACHE.
Also what I would do is to branch first based on instuction just incrementing PC or being a jump instruction... because this is very easy to predict and you can win ++pc being always in ICACHE and pipeline-able this way while branching being "slow path". Maybe would hurt perf for very tight loops of the interpreted bytecodes - but help regular loops.
I still wonder if jumps between direct code don't gave a best chance for branch predictor but i don't know internals of today CPUs.
It may pollute I-CACHE but give more seprated places of jump wich could be predicted seperately.
On old Arm you could use conditionals in instructions instead of jumps, but in AArch64 they abandoned this feature if i rember correctly
I think x64 also has conditional mov now. If code is so small like in this example then maybe better execute all code without branches but only store result(s) conditionaly
If there are some architectures that don't support tail call optimization and that's why GCC is refusing to add an annotation like musttail attribute, then why not just have the compiler refuse to compile code with those annotations for those architectures?! Sounds like an extremely weak argument to not implement it.
Does anyone understand how the direct-threaded dispatch is implemented using tail calls? Specifically, how does the instruction pointer store the function address itself, does that not require that the instruction stream be generated in the binary itself? Or there is a post-processing step that overrwrites all the opcodes with the actual function addresses?
> Specifically, how does the instruction pointer store the function address itself,
Yes.
You either need to generate the bytecode in the same binary directly, or you have a post-processing step. Either one works. Note that the post-processing step just walks over the bytecode once on startup, so there is no performance impact on the dispatch during the post-processing.
Thanks @@foonathan
I'll add some context: direct threading was invented for FORTH, for which a lot of implementations are "live-compiler" systems - ie the compiler is incremental and knows where each function/basic op lives in memory; when you define your new function, it goes into the same image in the same address space. FORTH doesn't really do intermediate bytecode, and the artifact of code generation is a single big image blob.
TLDR; 44:41 he uses Arch.