So instead of having it do nothing on its own, specifically telling it to do nothing is faster. Technology!
We could do nothing, or we could do nothing FASTER because THE KING told us to do nothing!
@@TheDoomerBlox Da f... good analogy, I can apply to my job:
I can be lazy at work, or the boss can give me the day free... 🤔
Way faster go home...
I once heard a programmer friend of mine say _"I hate computers, because they will do exactly what I tell them to do, rather than what I want them to do."_
if you tell the processor to do nothing it will just calmly do nothing, instead of doing nothing nervously, wondering if you will come in and tell it to do something
I was so proud of myself for figuring it out early around the 1:00 minute mark. I was like, "He's about to say that telling it to do nothing is faster than letting it do nothing" *insert shocked face*.
Mario gains 4 FPS by doing absolutely nothing
L is real
In case you didn't know: you become a more effective, creative, and content human being by spending enough time doing absolutely nothing.
He’s saying commands not comments took me a minute
thanks for this lmao
ohhhh i was going to say ... what're they compiling with that leaves in comments?! but i see
Where is Kaze from?
Or is that a dialect?
@@daskampffredchen He is German, I am pretty sure, so English is not his first language. Bear with him haha
Ohhhhhhh
clears things up. I was so confused why comments are executed lol
if someone told me to do nothing 44 times in a row I'd probably start working just to spite them too
Do nothing
I'd post this comment 43 more times but that would definitely get my account blocked
@@nodrance do nothing
You really ought to do nothing
Do nothing. That's an order!
do nothing
So, basically: instead of pressing the button once to call the elevator, then patiently waiting, you mash the button continuously until the elevator finally arrives.
It's more like there's a train that counts people as they pass through a turnstile, and the train only leaves when enough people are on board. Instead of waiting around for more people to show up so the train leaves with a full load when it's most efficient, you turn around and spin the turnstile really fast with both hands, making it leave instantly because you know you're the last person who needs to get on for a while.
See it this way:
Imagine a train that will leave the station only when full.
You don't want to wait and you have money... so you buy ALL the other seats.
Train has departed, even if it had empty (but paid) seats.
By sending 44 commands (double the buffer size of 22), you ensure the buffer is filled twice and flushed. This guarantees the Z-Buffer FillRect and subsequent FillRect operations execute immediately. This also gives the RSP enough time to load the next stage of graphics data (vertices, matrices, lighting overlays) without stalling the RDP.
While making sure the elevator gets to your floor and back before the next person arrives, else you riding the elevator is, overall, slower.
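The "fill the buffer so it flushes" explanation above can be sketched as a toy model in C. This is only an illustration of the idea as described in the comments: the 22-command capacity is the figure quoted above and is not verified here, and `toy_fifo`, `toy_push`, and `fill_released` are hypothetical names, not part of any N64 SDK.

```c
#define TOY_BUF_CAPACITY 22 /* figure quoted in the comment above; unverified */

/* Toy consumer: buffered commands are only released in full batches. */
struct toy_fifo {
    int pending;  /* commands sitting in the buffer */
    int released; /* commands handed on for execution */
};

/* Queue one command; release the whole batch once the buffer is full. */
static void toy_push(struct toy_fifo *f)
{
    f->pending++;
    if (f->pending == TOY_BUF_CAPACITY) {
        f->released += f->pending;
        f->pending = 0;
    }
}

/* Returns 1 if one real command followed by `padding` no-ops
 * has been released for execution, else 0. */
static int fill_released(int padding)
{
    struct toy_fifo f = {0, 0};
    toy_push(&f);                     /* the expensive fill command */
    for (int i = 0; i < padding; i++)
        toy_push(&f);                 /* no-op padding */
    return f.released > 0;
}
```

In this model `fill_released(0)` is 0 (the fill command just sits in the buffer), while `fill_released(21)` is 1 because the padding completes a full batch. The model says nothing about why 44 rather than 22 is the sweet spot on real hardware.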
When this is all done I hope we can get a compilation of the silliest optimizations you found while developing this game.
I would love to see that.
Poor man's flush command
From all the comments, I liked yours the most
This is literally the **waits faster** song: Wii Shop Channel [Eurobeat Remix]
aka ua-cam.com/video/gmaIv5JaryU/v-deo.htmlsi=g8DvYSsoMmDbECDN
If I weren't a programmer myself, I would assume this to be an April fools joke.
This is amazing.
Kaze when Simpleflips says "Bomb Ombs" : grr😡
Kaze when he says "Communds" : 😴
The subtitles kept saying "comments" and I was so confused because it sure looked like function calls to me!
@@diggoran It's most likely a function-like macro since there is at least 1 hanging out on the left-hand side in the search. It's probably getting replaced by either a function call or another function-like macro. (Misread one of the macros on the left and thought it was defining gsDPNoOp, but it was defining gsSPNoOp, so got a bit confused there :p)
@@diggoran I thought maybe the gsDPNoOp function contained comments or something until I looked into the comments
@@Halfbit_0 it's a microcode command, not a function call
cant wait to have 500 bob omb battlefields in one place to be rendered
why does that sound kinda sick actually
both technically and in practice
A playable version of those 100 AIs play Super Mario Bros videos
Mario 64 at 60
Return to Yoshi island at 60
Crazy how far will this engine go to change n64 games forever 🎉
Taking into account that Mario 64 is the base for the SDK and the sample engine, all the games of the first two years use it directly or with light modifications; if people, or at least AI, got involved, it would be possible to improve all those games.
I would love to see Rare's N64 games run at 60fps.
@@Kruegernator123 literally 🗣️
sm64 would probably run at 100+ fps by the time rtyi is done
@@tl1882 pretty sure the N64 is hard capped at 60 on hardware at least. Though perhaps emulators would be able to push beyond that?
Sorta sounds like how command buffering works in OpenGL. The commands go into a buffer which isn't actually sent to the GPU until it's full, unless you call glFlush to force it. In your case, the GPU is idling while waiting for more commands, so you need to crap out a ton of NOPs in order to flush the command buffer.
So if you imagine that the RDP is a bus, it's faster to have 41 bus stops in the same location rather than to have the driver drive aimlessly until the next bus stop is chosen?
It’s more like telling the bus driver that nobody else is coming at each bus stop (confirming that there should be 41 empty seats). That lets the driver start driving right away instead of looking around for anyone remotely nearby that might want to take a bus in the future.
yeah exactly. and it's noteworthy that the first passenger is REALLY fat, like GIGA FAT because it's the zbuffer clear command clearing the entire zbuffer. so it doesn't matter that 41 seats are empty, driving this one dude is enough work
Farting 44 times instead of sharting once, genius.
Got extra 4 doodooframes per kakasecond
When "I will wait for the bus faster" actually works
“Command” in your accent sounds a lot like “comment” to me, for which I was thoroughly confused for a bit
It's not an accent. He just didn't know how to pronounce it.
I like how some of the simplest solutions offer great rewards. This is fantastic!
Kaze as a manager would be scary
I told them to do nothing and they finished their work early
i remember someone else talking about how they put a noop which significantly increased performance. such a strange optimization
maybe it forced a big function to not be inlined? curious why that would help, funny either way though
something something loop alignment, maybe?
I've had a situation writing GPU assembly code where an odd or even number of instructions in a particular section had a 2-3% impact on performance so I just padded it out with a NOP. This only happened when the entire program didn't fit within the instruction cache, so it probably had something to do with cache line alignment on branches.
Very cool! Not the kind of optimization you can do in most situations, as this requires intimate knowledge of how other processes work! My best analogy for how it works:
I could ask my assistant if the line at the coffee shop is empty, where they wait until it is and then report back so I can walk there to get a coffee. OR I could just KNOW that the line is empty every day at 9:45 and start walking there then :)
(2 trips after the line empties [assistant walking back and me walking there] VS 0 trips [I start walking before the line is empty and arrive right when the last person is served])
Is it right to assume that by reducing these noop calls to a single "pls start working" call, you'd get a minimal increase in performance, too?
yes
reminds me of what yuzu did with vulkan maximum clocks which makes amd gpus not use low power, which is what they do more than they should. In short, amd gpus need 'do nothing' vulkan commands to increase their performance (clocks and subsequent power draw), nvidia can't relate
Ok that is absurd, but hey anything's possible now.
I can't wait to see this scene hitting 60fps. Keep up the good job Kaze ♥️
Nah, he says at the end that it's impossible. Maybe he'll pull another trick like this out of the hat, but it doesn't seem plausible.
He already found a way to double the memory, so could it go up to 110fps? But that can only be applied once the game is finished, and emulators also struggle with it...
@@eduardoanonimo3031 Emulators must adapt to whatever the hardware can do, and not the other way around.
@@eduardoanonimo3031 if anyone can do it, Kaze will
@@jimbobcheezeburger2020 No he won't, you can't optimize microseconds, you need to optimize the milliseconds. Why draw every pixel per frame?
@googleuser4720 you gotta see the vision my friend 👍
rdp: *does nothing slowly*
kaze: shoop da NoOp
Loving the late 2000's vibes from this comment
Next up: N64 multithreading microtasks instead of noop busy waiting
It can't get more silly, right??? 😂
A classic solution from the Atari days! Sweet!
Damn, I have been playing your hacks with my friend for years now, didn't realize you're a fellow German. Keep it up! We love your work. She's exclusively playing Mario ROM hacks and yours are just some of the best.
I was scared to click this video but I trusted the algorithm and it delivered
He can't keep getting away with this!!!
the audio stutter on all of your clips adds to the insanity of this shit
We will reach n64 singularity soon ❤
It Just Works.
Programmer: (tells me to do nothing 22 times)
Me, a little buffer boy: “Say that again.”
I read silent optimization, and the video still made sense lol
First time in history SPAM works on a technical level.
IIRC the C compiler is smart enough to unroll for loops, making this code simpler without sacrificing performance. Worth a shot
He claimed he was never going to hit 60 FPS for the main game, but maybe for the starting area...I'm starting to think he's wrong. I'm starting to think there's more juice to squeeze out of the machine and that he's going to keep doing so until he hits 60.
This reminds me of hitting the scanlines in 6502 assembler for the NES to get the most stable framerate by "wasting" CPU cycles until the next frame. TwT
Love it! Hope you'll eventually become interested in making this game run in 3D SBS stereoscopic, that would be a first on the N64 I believe. Keep up the great work!
basically flushing the buffer
Got it, it's waiting to do whatever it's told to (which is when the buffer is filled). If we fill the buffer with "trust me, you don't have to do anything" instead of waiting, it then just does the thing it's meant to.
I'd guess it alters the timing in just the right way, and instead of having a pipeline stall it ends up finishing the noops at the right time, when it is ready for the next command to start. The system might be very bad at recovering from a pipeline stall, so padding with noops prevents attempting to execute the next command before the system has finished the previous command and is ready to actually execute the next one. For example, if a pipeline stall costs 50 clocks to recover, then 44 noops is an improvement if it prevents the stall.
ive explained why it works in the video and this explanation is not it unfortunately
Absolutely brilliant.
Super Mario 64 must be one of the most studied computer programs of all time.
wait what if you do it more than 44?
then you lose perf
god, I love technology
Incredible, but wouldn’t it be more efficient with a for loop or a goto?
that is all a single statement.
look more closely.
@@chri-k That's not how it compiles.
@@chri-k Think about machine code.
@@chri-k The best way to temper a C program is to use a sleep function from the embedded library or the C11 thrd_sleep.
@@StEvUgnIn it's basically an array of instructions for an interpreter.
this is not code. this is data.
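The "this is data, not code" point in the replies above can be illustrated with a small C sketch. A display list is a static array of command words that an interpreter walks, so a `for` loop cannot appear inside its initializer; repetition has to happen at compile time, for example with the preprocessor. Everything here is a hypothetical stand-in: `ToyCmd`, the opcode values, `toyNoOp()` and the `REP*` helpers are made up for illustration and are not the real `gsDPNoOp` macro or the SDK's command encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit command word; opcode lives in the top byte. */
typedef uint64_t ToyCmd;

#define TOY_OP_NOOP 0x00u /* made-up opcode values */
#define TOY_OP_FILL 0x01u
#define TOY_OP_END  0xFFu

#define toyNoOp()    ((ToyCmd)TOY_OP_NOOP << 56)
#define toyFill(arg) (((ToyCmd)TOY_OP_FILL << 56) | (ToyCmd)(arg))
#define toyEnd()     ((ToyCmd)TOY_OP_END << 56)

/* Compile-time repetition: expands to comma-separated initializers. */
#define REP2(x)  x, x
#define REP4(x)  REP2(x), REP2(x)
#define REP8(x)  REP4(x), REP4(x)
#define REP32(x) REP8(x), REP8(x), REP8(x), REP8(x)
#define REP44(x) REP32(x), REP8(x), REP4(x)

/* The list itself is pure data: one fill, 44 no-ops, one terminator. */
static const ToyCmd toy_list[] = {
    toyFill(0u),
    REP44(toyNoOp()),
    toyEnd(),
};

/* A minimal "interpreter" pass: walk the data and count the no-ops. */
static size_t toy_count_noops(const ToyCmd *list)
{
    size_t n = 0;
    for (; (*list >> 56) != TOY_OP_END; list++)
        if ((*list >> 56) == TOY_OP_NOOP)
            n++;
    return n;
}
```

Because the array is filled in at compile time, the optimizer has no loop to unroll and nothing to remove: the 44 entries are simply bytes in the binary, which is why writing them out by hand and generating them with a macro come to the same thing.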
bro literally said "hold up" 44 times
Are there any other consoles you would be interested in learning how to program? I assume the ps1 and ps2 would be easier to learn since they are also MIPS-based like the n64, though the ps2 is reportedly harder to program than even the n64.
To be honest, kaze could easily write more portable code for other platforms with the knowledge he has. It's just when trying to get every ounce of performance from a piece of hardware that deep knowledge about the system architecture is required, in which case something mips-based would be easier to adapt to.
"This won't fool compiler when it gets smarter. Oh well..."
this is data, the compiler can't edit this
@@KazeClips This is a reference to a comment in the leaked Windows XP source code that appears alongside a similar noop command.
Physically shuffle the nulls around to shape a donut.
But why not use a loop? Is it because of the compiler?
At this point you're entering black magic of programming territory!
This is the Arnold method.. "sleep fastah!"
does ... doesn't it work with a for loop?
Why not use a for loop?
gsDPNoOp vs (evil) gsDPNoOp
The lost art of inserting NOP sleds into your program
Hey Kaze,
Why don’t you show the most expensive location as well?
Will 56fps ever be beaten?
Summoning Salt music starts playing.
1:36 I think it's time for a styling IDE that could hide eyesores away from devs. Maybe even embed documents in it.
Why not put it in a loop
This man is not going to stop until SM64 runs at 64FPS!!!!
Does the compiler not auto unroll the loop if you just specify it to be a for loop that goes 41 times?
beautiful
Hilarious, and Genius
wonder what the "good reason" for that is
nop to go fast is not the optimization I was expecting
Do Nothing (220ms)
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
Don't
we're giving the rdp leisure time so it can work faster
_"But it works."_
-Literally any programmer.
Huh. So this is an in-between solution until your colleague can implement a microcode solution to this problem?
This seems quite trivial to write a macro for.
i dont see the point of a macro
Given the insane level of optimization you achieved, instead of 60fps I'd prefer 30fps but with more complex 3D models, especially for the environments. 60fps for a N64 platform aren't that much of an added value IMHO, while an outstanding graphics for the console standards would be breathtaking. I don't know if it's possible or if you'd like to consider that.
i target 30 on real hardware with more fidelity. this is just my test area that was made before all these optimizations.
@@KazeClips Thanks! I didn't know that.
Faster is always better
Nah, 30 fps is junk for cinematic games where gameplay doesn't matter. Stay away from that as much as possible. 60 fps was always the golden standard for gaming for a reason.
@@Rafa-Silva-Alt Super Famicom games at least used to run logic and basic movement at 60fps yeah
I wish that guy would work on N64 emulator, or porting Perfect Dark to Gamecube
Ah yes explicitly doing nothing instead of doing nothing is faster
Why did you type out that function several times instead of writing it into a for loop?
its not a function, this is data
If it's wrong, and it works, then it isn't wrong.
Stupid solutions are the best solutions
But...why?
Could Mario 64 be run at 60 fps through an emulator today?
yeah you can even run at 4000fps if you want
@ how
@@nuts5388 unlimit fps
@ Alr thanks
1.19ms per frame left, best of luck.
Infinite fps hack 🤑
ah yes, the No0p optimization
Reminds me of cache warming
So you DO document the weird/fast code you make... or maybe the other guy made that comment. Thank the lords I finally saw a comment in your code!
lesgoooooo
Wtf is RDP?
Part of the N64's GPU, which is split into two chips, the RSP and the RDP. They are the Reality Signal Processor and the Reality Display Processor respectively. The former calculates each 3D scene and the latter uses the results to actually draw the pixels.
Dear clean code enthusiasts, 🤲🌈❤️
Are loops allowed in your code here or am I missing something
Amazing xD
This is hilarious
it has 888 likes
that's my birthday :]
wait faster!!
So it's a win by doing nothing
Literally doing more of nothing
*miliseconds
*commands
*bronounsiazion
no, i am talking about microseconds here.
This development would be better spent on a new game with a modern story and style, because Mario has already been made for this device
The Nintendo 64 is a silly platform