Thanks for the video, I can advise you not to make a greedy mesh for each type of block, but to make for all complete solid blocks, and then transfer to the GPU data structure with the help of which you can calculate the block type and texture by pixel position, it will simplify the mesh many times as well as the algorithm itself.
this is insane! i have my own culled and greedy meshing implementations and i know they're not the fastest, but i'd never have thought it could get THIS fast. you could literally remesh every chunk every frame with this and still get good fps, which is mind-boggling. good job with the explanations, too.
When you explained the part in 14:40, where you explained how to find the faces looking right just by modify an interger, I was so surprise at how simple it is an yet amazingly complex
I now fully understand the concepts used to achieve such high performance. I also fully understand that if I were to try to write it. Every line of code would have an off-by-one error.
i love everything about this your awkward presentation, the handdrawn sketches, the weird pronounciation, the focus on speed, your manbun, your long hair that makes you look like a metalhead, the jokes, the effort you made, everything, everything in this video is just *right* .
This is incredible! I did something extremely close to this for counting strings within DNA sequences and got immense speedup. Binary manipulation is insanely speedy if you can comprehend it. Great job figuring this out and explaining it.
CHUNK_SIZE_P3 = 34*34*34, so the size of axis_cols is 3*34*34*34*64 = 7,546,368 bits. Additionally, we need twice that for col_face_masks, giving ~2.83MB. Honestly, I don't know whether this will fit on the stack or not. Maybe someone else can provide additional information?
@@0x4849 I believe max stack size can be changed when compiling, but the default is usually not very large. I would instead preallocate the vector once and then always use that one instance
@@0x4849 I had a program where I had an array of 5 Mb. So 2.83MB should be feasible. Also, the memory can be static. We really just need to benchmark the approaches and choose the best one
2.83mb should be feasible. I had a program that used 5mb for a stack array 😅. The memory can also be shared between calls whatever it's stack or heap based. Different approaches should be benched and there should be a room for improvement
One final thought, if you use 30x30 chunks you can fit the left/right neighbors into a 32bit int rather than expanding to 64. It'd halve your memory bandwidth requirements at minimum and if you use SIMD it will let you double the number of calculations performed per cycle.
Epic games had a wonderful talk about nanite, and the part that blowed my mind is: Gpu’s are very slow at rendering extremely small triangles. So what they did? They just wrote a SOFTWARE RASTERIZER, that is faster than the hardware one I think when the size of a triangle is less then 40*40 pixels. The difference is really impactfull, and they showed the code and implementation for everybody to use it!
I know this is an old comment, but this reminds me of when the youtuber Gamehut (who worked at telltale games/sega in the 90's) would do a lot. He would frequently have some form of technical problem where the hardware wasn't capable of doing what he wanted, so he just wrote the functionality in software.
Oh my god I wish i had this video like 2 months ago when i was trying to write a greedy mesher. Thank you so much for this resource! Will definetly save it for the future!!
The brings back memories. I recognized some code I wrote about 15 years ago after being blown away by Minecraft. The & operation on the shifted bits specifically. The merging of the meshes was clever and much better than what I ever came up with. I put it all in a fancy octree though so I only rendered on-screen chunks. I hit a brick wall getting the lighting to work on merged meshes and it all fell apart once I had more than, say, 6 block types. Your code will do so too. But it's a great exercise and good job. A more modern way would no doubt be raycasting, there are many more triangles than pixels on the screen if you scale things up and it parallelizes better. Nice video, keep them coming!
This video is honestly so well explained and even though I don't know anything about voxel engines or game development I was able to understand it. This is probably one of the best resources for making a voxel engine. If I ever make one, I'll probably take a look at this again, thanks for your amazing work!
When you stare at engine development - graphics programming stares back. Takes like 15 passes of starting to learn the basics, to learn the basics. And this wasn't a typo
I haven’t tried greedy meshing but I’ve seen some demos of greedy meshes where it doesn’t care about block type, it constructs the triangles while remembering where the different block types are, so it’s possible to make the greedy meshes not slow down when you increase the block type count
Use an array for the data instead of a vector (since you know exactly how many entries it will contain) and it should have essentially zero allocation time since it'll be allocated on the stack instead of the heap
It doesn't fit on the stack on linux it's to large! But here is the funny thing... Someone noticed I'm allocating WAY more memory than was actually used. And now it does fit on the stack :) So I've changed it. The performance difference was only minimal though.
I am building a voxel engine as my senior design project for school and I am about to start greedy meshing. This video helped so much, my mind is blown by that face culling technique and I can't wait to try it because right now that is a massive slowdown in my chunk generation.
You can combine different block types like grass & dirt into the same triangle. You would obviously still need to differentiate between them, which can be done in the fragment shader. Does introduce a lookup in the fragment shader but it may be very worth it if you have a lot of block types or if greedy meshing is your bottleneck
Man amazing video. You made it sound so hard but I feel like I grasp all of it pretty well. The visuals make it soo much easy to follow, hats off to your work.
Several people have mentioned looking into SIMD optimization, but a few other ideas: 1) Using a fixed sized allocation instead of a Vec since the size is known. Not sure whether the entire arrays would fit on the stack but if so that may provide several speed improvements over a Vec on the heap. 2) It might be possible to combine both positive and negative edge detection into a single operation by using an XOR, but would require a slightly different method of iterating over them to pass into the greedy meshing. 3) Your structure for axis_cols has the data for each grid separated, a format similar as such: (y1, y2, y3, y4... x1, x2, x3, x4... z1, z2, z3, z4...) this means when setting the values you're writing into separate parts of the vec that might be far enough from each other to cause frequent cache misses. A layout where the three axis all are interwoven beside beside each others might be faster, such as (y1, x1, z1, y2, x2, z2, y3, x3, z3, y4, x4, z4...) 4) It would require a bit of rework but this seems very reasonably practical for a compute shader. 5) Would take a fair amount of work, but rethinking how you store the actual voxel data in general may make it faster to convert. 6) Again it would be a change in direction, but there are approaches people have taken where you can greedy mesh any flat surface, regardless of different types of blocks. The way that achieve this is usually to pack the color data for the chunk into a 3D texture and use it in the material/shader for the chunk mesh, then, rather than each triangle having a color, the fragment shader can use world coordinates to query from the color data as a 3D texture at the position of the face. Allowing a single triangle to have multiple colors on it. This makes the fragment shader slightly more complex, but in most examples of people using this technique it tends to improve performance in both rendering and construction because it can result in a massive reduction of polygons, especially as you add more and more materials.
I'm making a game that has voxels and I implemented this also using Rust and something very similar to the slow approach. This video came out with such a great timing. Thanks for sharing this. Maybe it can be even faster if SIMD or parallelization are included? 😁
It's actually genuinely fun if you like puzzles. A lot of it is figuring out how to visualize it so you can figure out what's going on because the final product is always undecipherable (at least for me).
It's honestly amazing how many usecases there are for bitwise operations, I think at least some understanding, even if only basic, should be a core skill of any serious developer.
Love this. I remember building a 2048 AI and going from loop-type grid transformations to bitwise operations. Bitwise stuff is hard to grok but there are sooo many orders of magnitude of improvement and it's so satisfying :)
I love how other SQL devs look at me when I explain my stored procs that utilize bitmap logic to be a million times faster than the naive approach to the same problem.
@@DanKaschel Calculating using the bitwise code and returning the final result set in postgres put less load on the postgres server than serving the data it's based on to application code, which then had to run the calculations. This is true quite often for OLAP style workloads.
@@DanKaschel a 10 line comment was enough for my colleague to understand and confidently make adjustments for a new requirment. Bitwise code isn't magic or that hard to do when you know the incoming data, the result, the intended behavior and you've got the code in front of you. And to go from a batch job ran once a month to an on demand real time task is quite important when the report directly generates revenue for it's users with more benefit being reaped the fresher the data being presented is.
It's like a gift for me. It was a problem no matter how much I optimized it before but now I have no problem loading and rendering faster than before. thanks for your video
If your friends CPU has significantly larger L2 or L3 cache, the performance difference could perhaps be cache misses? Aligning data for CPU cache optimization is another beast to tackle though 😅
I dont watch your videos (but still subscribed (I want to learn rust&bevy some day)), but every time I see your videos it feels like a new scientific experiment.
Lmao I wrote an algorithm yesterday for greedy meshing which does a bunch of neighbour checks for each block and then creates a bit mask from that. Definitely stealing the bitshifted comparison optimisation. This must be the most amazingly timed video I've ever seen.
Fantastic video! I very much appreciate all the work you put into these. You have a gift for explaining things. Can’t wait for you to publish a game. Make sure to introduce it in a video, pls 😊
At least on x86, bitwise operations like count trailing/leading zeros are only available on the more recent AVX-512 processors, so adding SIMD might make it faster of their friend's CPU, but could actually make it slower in parts on their own. There are probably some places where it'd be beneficial anyways though.
I have an exam in relational databanks in 3 days. And this video and the bitwise manipulation really, really didnt help me at all. But its very cool stuff and im looking forward to try something like this as soon as i finished my exam
You could further speed this up by making use of SIMD as you can do all these bit-wise operations on several values at once reducing your need to loop as many times. The XMM registers are 128bit wide allowing you to stuff 4 32bit values into a register would allow you to calculate 4 values in one operation.
Just want you to kmow that this video was so good that at 11:27 there was a solid 5 seconds where I actually scrambled to rewind the video to try to desperately see the code
Oh! Great catch. Initially in my rendering I've used 64 bitmasks, because my chunks (not rendering chunks) were always 4x4x4 voxels. Tho I haven't implemented a greedy meshing, because I need to support much more than a solid block, so different shapes etc. End up with custom rasterizer.
The reason, why the code was equally fast is, that bit operations are single instructions. The reason, why new CPUs are usually faster, is other than clock speeds, which are generally about the same, because chip engineering hit a limit on that one (any faster and CPUs melt basically). However, I am not sure, why the standard mesher is faster. But I am suspecting CPU cache size, since improvement in pipelining or superscalar execution would also lead to improvements in the first comparison and the clock speed is about the same, as I mentioned before
15:04 this was the point I verbally said “this guys psychotic” but in a good way. This is a crazy way to think about this data but it makes so much sense! Good work man!
Coincidentally I just implemented nearly the same thing a week ago, though I support octree blocks so it's a bit more involved, but cool to compare implementations. I made use of xor to detect my faces, never thought of just flipping the neighbor... My meshing ended up about 50% faster somehow after implementing it, even though it feels like more work is being done
The expression he did for his mesher is actually one half of an xor (A xor B = A*(!B) + (!A)*B), and since CPUs have built-in support for all binary operations, your algorithm does the work at once instead of going through it twice by choosing the two paths at once. The only caveat here being two bits are on instead of one, but that difference is irrelevant as they are guaranteed to be next to each other.
Interestingly I tried switching my system to just flipping bits instead of xor, I was already flipping the bits for another part so surely it should be free gains. Weirdly, it ended up very slightly slower which is perplexing. I don't think it's worth diving into it enough to find out why or what changes the compiler has here, but thought I'd note what I found...
Pretty cool project and interesting technique and considering that it is within the context of a voxel engine, it is very fitting to say the least. Outside of that, if one isn't referring to a pure voxel engine; there are many other interesting techniques that are also very interesting. These range from: Quad and Octrees, The use of Quaternions, Octonions, Sparse Matrices, FFTs (Fast Fourier Transforms) and their inverses, Instancing through referencing, other types of procedural generation techniques, Perlin Noise Sampling, Batch Processing through a priority queue, Fractal Generation, and so much more. Don't forget, you can also do a lot of interesting things with a plethora of available techniques not just within the engine itself, but also within the context of shaders. The Graphics Pipeline in general is a very interesting field of study.
Fun fact, you can further reduce polygon count by allowing polygons cross (each other in) the same block type or culled space. z-fighting is not an issue as it is the same texture or not visible. I have demo and thesis on this. The following "donut" example requires only a single polygon :D 01110 01x10 01110
Thanks to practicing image manipulation in JS, this was surprisingly easy to understand and clicked right away for me. 1D data models and traversal is not simple, so I understand your pain.
Thanks from a godot developer (csharp) this is very useful there as well since bitwise operations work very similarly and especially with multimesh instancing! Cheers :)
bitwise operators are basically universal, they aren't language specific, you can do them in every language I know of. So very useful and easily transferable skill to know.
This is cool, I remember playing with greedy mesher but end up going back to a traditional one because I didn't find a good way to get rid of T-junction artifacts
This is awesome! I'm looking forward to attempting to implement this myself. I'd love it if you would cover ambient occlusion in the future or at least provide some resources for where you learned about it
You can go farther. In my greedy mesher I store both block and ambient-occlusion lighting in textures, as bytes, using a bin-packing algorithm. One 2kx2k texture has always been enough, but I also added the ability to track which is needed by each chunk in case I needed many. This is particularly useful in city-like terrain where the geometry has a lot of flat faces made up of different types of blocks.
Amazing to see the level of performance you can get out of using the binary representation and this has me wondering if I can use any in my own projects. I suspect I will need something similar to create an AI navmesh in the near future. Fantastic video once again TanTan!
This voxel engine looks incredibly advanced and would make a brilliant base for games! For future videos I'd love to see you implement a scripting language into an existing Rust project like Lua or Angelscript.
I tried out the mesher in my own project with a different chunk storage scheme. I ran benchmarks on my project and got around 500 µs (microseconds) per chunk. When I ran your benchmark I was getting around 32 µs. Initially I thought it was just inefficiencies in retrieving voxel data since I'm using palette based compression. After some more testing I found that if I use your chunk generation code the benchmark result was around 50 µs. Turns out there's only a few solid voxels in the benchmark chunk which is why it runs much faster. The first chunk I tested/benchmarked had solid voxels in a sphere shape. Still my voxel retrievel from the chunk is significantly slower then simply indexing into a Vec. Mainly because I use bitpacking for the indices.
I think your memory layout can probably benefit a little from ordering by which stage of execution needs the direction. The newer CPU has a better cache and branch prediction, and that's why it outperformed on the culled renderer. The greedy mesher caused it to stall more, so these advantages were likely nullified. Your greedy mesher seems to tax the cache a lot - your blocks are 32x32x32 bits (right?), and interleaved in triples; but it could be better to somehow make them fit into 64x8=8x8x8 bits (64 bytes, a cache line) for each of their core operations. So, making the block smaller by a factor of 2 to 4 in every dimension AND ordering the memory according to execution order / ordering execution to be doing operations on directly adjacent memory blocks directly after one another could probably give you another big boost.
I wonder if the data layout could be improved as it looked like you use sequences of array indices that are far apart from another. Depending on how large the data is, this could theoretically lead to cache misses as not the entire array is loaded into the cache at the same time. But it's only 0.8% of runtime and the Compiler probably already optimizes this. But if there was a slowdown caused by cache misses, improving the data layout could speed up the code a lot
Excellent video. The animations are easy to follow along with. Thanks for sharing. I'm curious about the method you used to profile your code to determine the execution time of various sections. I didn't see any particular video in your catalog that seemed to cover this, so perhaps a "How Tantan profiles his Rust code" could be an idea for another video.
Idk if this would be any better, but I'd like to see an algorithm that doesn't mesh it at all, but uses the GPU to figure it out. Basically, you have a 32x32x32 chunk, with 32 quads from each of 3 directions, all facing the camera. When it comes time to render, you send those quads to the GPU along with an array with the 32768 voxel types, and the GPU would be able to draw each quad with the respective colors. To draw it in the correct layer position, draw all of the chunks that the camera isn't in order from furthest to closest, and do the same with the quads.
a tip go further decrease the data creation time: You're always creating and releasing memory with those Vec's. You should find a way to allocate memory once and reuse it instead. Also, don't use the stack because it can heavily limit you.
I think this can be made even faster using SIMD-instructions. Most of these problems are similar to problems in parsing where I know that those instructions can make a big difference. Especially in that data preparation step.
I imagine, there should be a way to optimise the procedure at ~3:35 so that the whole meshing step is complete in the smallest number of expansions possible (for the general case).
Very cool. I bet you can double the performance with some tweaks to how you manage memory. I see a lot allocations happening in loops when you could make 1 allocation outside the loop and reuse the variable for each cycle of the loop.
Brilliant! how does splitting data by block type affect the memory footprint as more types are added tho? Is there an optimal sized chunk to limit the unique block types that can occur within each versus the number of iterations to cover every chunk?
Would it be possible to bitmask the x, y and z coords somewhere? So as you can calculate the face-5ets in one pass, without having to triple loop and storing the bite set for it somewhere external to the loop? Whether it's worth it or not is down to testing performance but is it even possible? Could it be possible to use the remaining space in the byte to store block type data too? (IE: air/dirt/grass etc.) it would grow as you add more block types, but if you're taking up 3, for xyz, perhaps possible to find a way for the remaining 5 to be useful for each parse?
Hi! I wrote the "Binary greedy meshing" algorithm. Very cool to see this video on my youtube frontpage today, I love your video and explanations :)
glad to see people are finally seeing this now! It definitely deserves more attention
I was skeptical at first, but after some digging, damn you really are the guy that wrote the mesh algorithm 4 years ago. Nice!
@@tisaconundrum always has been
cosmos: casually has a 20 microsecond binary greedymesher.
You just casually made a spatially mapped datamodel lol
What part of this video was casual lol
@@daddy7860 The way he explained it felt like a friend explaining something to me rather than a teacher explaining.
yep. just to remake minecraft. this is what people do on the internet.. its awesome.
@@notthetruedm the best way to learn
@@Pockeywn voxel games existed before and after minecraft, not every voxel game is a minecraft clone
Thanks for the video, I can advise you not to make a greedy mesh for each type of block, but to make for all complete solid blocks, and then transfer to the GPU data structure with the help of which you can calculate the block type and texture by pixel position, it will simplify the mesh many times as well as the algorithm itself.
ah, this makes sense
I'd love to see the speed comparison on that, sounds promising!!
Is it really faster to do that lookup in the fragment shader than it is to store it in the vertex data or look it up in the vertex shader?
@@CaptTerrific Saves a hashmap entry access for every voxel. bets on 4x speed.
Same for lighting and ambient occlusion
this is insane! i have my own culled and greedy meshing implementations and i know they're not the fastest, but i'd never have thought it could get THIS fast. you could literally remesh every chunk every frame with this and still get good fps, which is mind-boggling. good job with the explanations, too.
When you explained the part in 14:40, where you explained how to find the faces looking right just by modify an interger, I was so surprise at how simple it is an yet amazingly complex
My thought right there was "oh, is this edge detection?" It was a really intuitive explanation
I now fully understand the concepts used to achieve such high performance. I also fully understand that if I were to try to write it. Every line of code would have an off-by-one error.
i love everything about this
your awkward presentation, the handdrawn sketches, the weird pronounciation, the focus on speed, your manbun, your long hair that makes you look like a metalhead, the jokes, the effort you made, everything, everything in this video is just *right* .
This is incredible! I did something extremely close to this for counting strings within DNA sequences and got immense speedup. Binary manipulation is insanely speedy if you can comprehend it. Great job figuring this out and explaining it.
You can speed up the data setup part by using stack arrays instead of using "Vec"s
oh yeah definitely since the array size is a known value, and doesn't need to be resized. and is small enough to fit into stack.
CHUNK_SIZE_P3 = 34*34*34, so the size of axis_cols is 3*34*34*34*64 = 7,546,368 bits. Additionally, we need twice that for col_face_masks, giving ~2.83MB. Honestly, I don't know whether this will fit on the stack or not. Maybe someone else can provide additional information?
@@0x4849 I believe max stack size can be changed when compiling, but the default is usually not very large.
I would instead preallocate the vector once and then always use that one instance
@@0x4849 I had a program where I had an array of 5 Mb. So 2.83MB should be feasible. Also, the memory can be static. We really just need to benchmark the approaches and choose the best one
2.83mb should be feasible. I had a program that used 5mb for a stack array 😅. The memory can also be shared between calls whatever it's stack or heap based. Different approaches should be benched and there should be a room for improvement
One final thought, if you use 30x30 chunks you can fit the left/right neighbors into a 32bit int rather than expanding to 64. It'd halve your memory bandwidth requirements at minimum and if you use SIMD it will let you double the number of calculations performed per cycle.
Epic games had a wonderful talk about nanite, and the part that blowed my mind is: Gpu’s are very slow at rendering extremely small triangles. So what they did? They just wrote a SOFTWARE RASTERIZER, that is faster than the hardware one I think when the size of a triangle is less then 40*40 pixels. The difference is really impactfull, and they showed the code and implementation for everybody to use it!
I know this is an old comment, but this reminds me of when the youtuber Gamehut (who worked at telltale games/sega in the 90's) would do a lot. He would frequently have some form of technical problem where the hardware wasn't capable of doing what he wanted, so he just wrote the functionality in software.
Oh my god I wish i had this video like 2 months ago when i was trying to write a greedy mesher. Thank you so much for this resource! Will definetly save it for the future!!
The brings back memories. I recognized some code I wrote about 15 years ago after being blown away by Minecraft. The & operation on the shifted bits specifically. The merging of the meshes was clever and much better than what I ever came up with. I put it all in a fancy octree though so I only rendered on-screen chunks. I hit a brick wall getting the lighting to work on merged meshes and it all fell apart once I had more than, say, 6 block types. Your code will do so too. But it's a great exercise and good job. A more modern way would no doubt be raycasting, there are many more triangles than pixels on the screen if you scale things up and it parallelizes better. Nice video, keep them coming!
This video is honestly so well explained and even though I don't know anything about voxel engines or game development I was able to understand it. This is probably one of the best resources for making a voxel engine. If I ever make one, I'll probably take a look at this again, thanks for your amazing work!
When you stare at engine development - graphics programming stares back. Takes like 15 passes of starting to learn the basics, to learn the basics. And this wasn't a typo
I haven’t tried greedy meshing but I’ve seen some demos of greedy meshes where it doesn’t care about block type, it constructs the triangles while remembering where the different block types are, so it’s possible to make the greedy meshes not slow down when you increase the block type count
Use an array for the data instead of a vector (since you know exactly how many entries it will contain) and it should have essentially zero allocation time since it'll be allocated on the stack instead of the heap
It doesn't fit on the stack on linux it's to large!
But here is the funny thing... Someone noticed I'm allocating WAY more memory than was actually used.
And now it does fit on the stack :) So I've changed it.
The performance difference was only minimal though.
@@Tantandev you could also go full Kaze Emmanuar and save 0.5µs by rewriting the thing differently 5 times
I am building a voxel engine as my senior design project for school and I am about to start greedy meshing. This video helped so much, my mind is blown by that face culling technique and I can't wait to try it because right now that is a massive slowdown in my chunk generation.
You can combine different block types like grass & dirt into the same triangle. You would obviously still need to differentiate between them, which can be done in the fragment shader. Does introduce a lookup in the fragment shader but it may be very worth it if you have a lot of block types or if greedy meshing is your bottleneck
Man amazing video. You made it sound so hard but I feel like I grasp all of it pretty well. The visuals make it soo much easy to follow, hats off to your work.
Several people have mentioned looking into SIMD optimization, but a few other ideas:
1) Using a fixed sized allocation instead of a Vec since the size is known. Not sure whether the entire arrays would fit on the stack but if so that may provide several speed improvements over a Vec on the heap.
2) It might be possible to combine both positive and negative edge detection into a single operation by using an XOR, but would require a slightly different method of iterating over them to pass into the greedy meshing.
3) Your structure for axis_cols has the data for each grid separated, a format similar as such: (y1, y2, y3, y4... x1, x2, x3, x4... z1, z2, z3, z4...) this means when setting the values you're writing into separate parts of the vec that might be far enough from each other to cause frequent cache misses. A layout where the three axis all are interwoven beside beside each others might be faster, such as (y1, x1, z1, y2, x2, z2, y3, x3, z3, y4, x4, z4...)
4) It would require a bit of rework but this seems very reasonably practical for a compute shader.
5) Would take a fair amount of work, but rethinking how you store the actual voxel data in general may make it faster to convert.
6) Again it would be a change in direction, but there are approaches people have taken where you can greedy mesh any flat surface, regardless of different types of blocks. The way that achieve this is usually to pack the color data for the chunk into a 3D texture and use it in the material/shader for the chunk mesh, then, rather than each triangle having a color, the fragment shader can use world coordinates to query from the color data as a 3D texture at the position of the face. Allowing a single triangle to have multiple colors on it. This makes the fragment shader slightly more complex, but in most examples of people using this technique it tends to improve performance in both rendering and construction because it can result in a massive reduction of polygons, especially as you add more and more materials.
I'm making a game that has voxels and I implemented this also using Rust and something very similar to the slow approach. This video came out with such a great timing.
Thanks for sharing this.
Maybe it can be even faster if SIMD or parallelization are included? 😁
Looking at this made me realise that I clearly need to lurn bitwise manipulation
It's actually genuinely fun if you like puzzles. A lot of it is figuring out how to visualize it so you can figure out what's going on because the final product is always undecipherable (at least for me).
It's honestly amazing how many usecases there are for bitwise operations, I think at least some understanding, even if only basic, should be a core skill of any serious developer.
if you know all these quirky things you can figure out pretty quickly that an odd number is determined by its first bit
@@memes_gbc674 assuming you understand endianness and therefore which bit is "first"
@@DanKaschel that too
This is some Big Brain Calculation right here, great video Tantan!
We miss you TanTan :(
What happened?
Great video! I wrote a basic algorithm for doing this on culled meshes, but I'm glad to see it's possible with greedy meshes too!
I like how you mostly pronounce "Chunk" as "Shunk", always made me smile :D
Thank you Tantan for somehow releasing a video on the exact topic I was worried about for my next project, very cool.
Love this. I remember building a 2048 AI and going from loop-type grid transformations to bitwise operations. Bitwise stuff is hard to grok but there are sooo many orders of magnitude of improvement and it's so satisfying :)
I love how other SQL devs look at me when I explain my stored procs that utilize bitmap logic to be a million times faster than the naive approach to the same problem.
@@magfal umm. I think I'd do the same if a colleague said they were using bit manipulation in a stored proc
@@DanKaschel
Calculating using the bitwise code and returning the final result set in postgres put less load on the postgres server than serving the data it's based on to application code, which then had to run the calculations.
This is true quite often for OLAP style workloads.
@@magfal that is true, but it'd have to be pretty niche before performance trumped maintainability
@@DanKaschel a 10 line comment was enough for my colleague to understand and confidently make adjustments for a new requirment.
Bitwise code isn't magic or that hard to do when you know the incoming data, the result, the intended behavior and you've got the code in front of you.
And to go from a batch job ran once a month to an on demand real time task is quite important when the report directly generates revenue for it's users with more benefit being reaped the fresher the data being presented is.
What a coincidence, I currently need a good voxel algorithm for my project :D Will definitely look into it! thanks
this videos singlehandedly makes me wanna try to make a 3D game from scratch
Now i understood why this video take a while to be made!
This was a really good video!
It's like a gift for me.
It was a problem no matter how much I optimized it before
but now I have no problem loading and rendering faster than before. thanks for your video
Woah, love the bit shift and negation. That's a great way to generate the culling indices instead of iterating through every single block.
3:00 as I always say "paint is the most important software for software developers"
If your friends CPU has significantly larger L2 or L3 cache, the performance difference could perhaps be cache misses? Aligning data for CPU cache optimization is another beast to tackle though 😅
I think it's possible to enable larger memory pages in some compilers.
I dont watch your videos (but still subscribed (I want to learn rust&bevy some day)), but every time I see your videos it feels like a new scientific experiment.
Lmao I wrote an algorithm yesterday for greedy meshing which does a bunch of neighbour checks for each block and then creates a bit mask from that.
Definitely stealing the bitshifted comparison optimisation. This must be the most amazingly timed video I've ever seen.
I came from Dani's video, and I'm glad I did :) Great video!
phenomenal video explaining this. you are very good at explaning these topics
nice explantion and walk through code. i loved that. thanks
Fantastic video! I very much appreciate all the work you put into these. You have a gift for explaining things. Can’t wait for you to publish a game. Make sure to introduce it in a video, pls 😊
1:03 missed opportunity for the vsauce intro ost
I would love to see a full bevy tutorial on your channel
You're using bitwise operations to calculate binary derivatives. That's dope :')
Now do it with SIMD
At least on x86, bitwise operations like count trailing/leading zeros are only available on the more recent AVX-512 processors, so adding SIMD might make it faster of their friend's CPU, but could actually make it slower in parts on their own. There are probably some places where it'd be beneficial anyways though.
@@angeldude101 you sure about that? Let me check... FELIIIIIX, WE NEED YOUR SITE AGAIN
Someone on the internet makes a fast greedy voxel mesher.
*distant horizons 2.0 has entered chat
I have an exam in relational databanks in 3 days. And this video and the bitwise manipulation really, really didnt help me at all. But its very cool stuff and im looking forward to try something like this as soon as i finished my exam
You could further speed this up by making use of SIMD as you can do all these bit-wise operations on several values at once reducing your need to loop as many times. The XMM registers are 128bit wide allowing you to stuff 4 32bit values into a register would allow you to calculate 4 values in one operation.
True, but I typically try to avoid SIMD because of portability.
I'm not sure what Tantan is expecting to support.
Just want you to kmow that this video was so good that at 11:27 there was a solid 5 seconds where I actually scrambled to rewind the video to try to desperately see the code
Code's in the description
Wow man, mad props. That was some heavy stuff and yiu actually explained it extremely well. Thanks, and keep up the good work!
BLAZINGLY FAST
Amazing work, You can make this dramatically faster using SIMD now that its a bitwise op game
Oh! Great catch. Initially in my rendering I've used 64 bitmasks, because my chunks (not rendering chunks) were always 4x4x4 voxels. Tho I haven't implemented a greedy meshing, because I need to support much more than a solid block, so different shapes etc. End up with custom rasterizer.
You succeeded very well in explaining something complex in a simple manner! Well done!
The reason, why the code was equally fast is, that bit operations are single instructions. The reason, why new CPUs are usually faster, is other than clock speeds, which are generally about the same, because chip engineering hit a limit on that one (any faster and CPUs melt basically).
However, I am not sure, why the standard mesher is faster. But I am suspecting CPU cache size, since improvement in pipelining or superscalar execution would also lead to improvements in the first comparison and the clock speed is about the same, as I mentioned before
15:04 this was the point I verbally said “this guys psychotic” but in a good way. This is a crazy way to think about this data but it makes so much sense! Good work man!
WOW you did a really phenomenal job at explaining your algorithm
Coincidentally I just implemented nearly the same thing a week ago, though I support octree blocks so it's a bit more involved, but cool to compare implementations. I made use of xor to detect my faces, never thought of just flipping the neighbor... My meshing ended up about 50% faster somehow after implementing it, even though it feels like more work is being done
The expression he did for his mesher is actually one half of an xor (A xor B = A*(!B) + (!A)*B), and since CPUs have built-in support for all binary operations, your algorithm does the work at once instead of going through it twice by choosing the two paths at once. The only caveat here being two bits are on instead of one, but that difference is irrelevant as they are guaranteed to be next to each other.
Interestingly I tried switching my system to just flipping bits instead of xor, I was already flipping the bits for another part so surely it should be free gains. Weirdly, it ended up very slightly slower which is perplexing. I don't think it's worth diving into it enough to find out why or what changes the compiler has here, but thought I'd note what I found...
Pretty cool project and interesting technique and considering that it is within the context of a voxel engine, it is very fitting to say the least.
Outside of that, if one isn't referring to a pure voxel engine; there are many other interesting techniques that are also very interesting. These range from:
Quad and Octrees, The use of Quaternions, Octonions, Sparse Matrices, FFTs (Fast Fourier Transforms) and their inverses, Instancing through referencing, other types of procedural generation techniques, Perlin Noise Sampling, Batch Processing through a priority queue, Fractal Generation, and so much more. Don't forget, you can also do a lot of interesting things with a plethora of available techniques not just within the engine itself, but also within the context of shaders. The Graphics Pipeline in general is a very interesting field of study.
Loved the Flight of the Conchords reference
Fun fact, you can further reduce polygon count by allowing polygons cross (each other in) the same block type or culled space.
z-fighting is not an issue as it is the same texture or not visible. I have demo and thesis on this.
The following "donut" example requires only a single polygon :D
01110
01x10
01110
While Z-fighting isn't an issue, you still then may have to deal with overdraw.
@@DreadKyller yup, it's a tradeoff.
That is cool revelation and use of bits.
Now SIMD it!
Granted LLVM might auto-vectorize it already, but you might be able to optimize even more by doing it manually :)
Thanks to practicing image manipulation in JS, this was surprisingly easy to understand and clicked right away for me. 1D data models and traversal is not simple, so I understand your pain.
Thanks from a godot developer (csharp) this is very useful there as well since bitwise operations work very similarly and especially with multimesh instancing! Cheers :)
bitwise operators are basically universal, they aren't language specific, you can do them in every language I know of. So very useful and easily transferable skill to know.
This is cool, I remember playing with greedy mesher but end up going back to a traditional one because I didn't find a good way to get rid of T-junction artifacts
Why can't the mesher be happy with what it has
because it runs out
This is awesome! I'm looking forward to attempting to implement this myself. I'd love it if you would cover ambient occlusion in the future or at least provide some resources for where you learned about it
You explained this really well! Thanks!
Really cool. I wonder how this would relate to the optimizations that Vercidium uses to get voxel rendering running at a claimed 12000 fps.
Very nice work, and mathematical ideas!
Good inspiration...
You can go farther. In my greedy mesher I store both block and ambient-occlusion lighting in textures, as bytes, using a bin-packing algorithm. One 2kx2k texture has always been enough, but I also added the ability to track which is needed by each chunk in case I needed many.
This is particularly useful in city-like terrain where the geometry has a lot of flat faces made up of different types of blocks.
God damn bitwise wizards, I really have to learn how to use that stuff, because in theory I understand it, but I don't know how to use it
My osrs hydra ptsd came back when your music started playing around 2:25
Amazing to see the level of performance you can get out of using the binary representation and this has me wondering if I can use any in my own projects. I suspect I will need something similar to create an AI navmesh in the near future.
Fantastic video once again TanTan!
please would you turn the in-game render distance up as far as you can and show us how fast it runs?!
This voxel engine looks incredibly advanced and would make a brilliant base for games! For future videos I'd love to see you implement a scripting language into an existing Rust project like Lua or Angelscript.
I tried out the mesher in my own project with a different chunk storage scheme. I ran benchmarks on my project and got around 500 µs (microseconds) per chunk. When I ran your benchmark I was getting around 32 µs. Initially I thought it was just inefficiencies in retrieving voxel data since I'm using palette based compression. After some more testing I found that if I use your chunk generation code the benchmark result was around 50 µs. Turns out there's only a few solid voxels in the benchmark chunk which is why it runs much faster. The first chunk I tested/benchmarked had solid voxels in a sphere shape. Still my voxel retrievel from the chunk is significantly slower then simply indexing into a Vec. Mainly because I use bitpacking for the indices.
I think your memory layout can probably benefit a little from ordering by which stage of execution needs the direction.
The newer CPU has a better cache and branch prediction, and that's why it outperformed on the culled renderer. The greedy mesher caused it to stall more, so these advantages were likely nullified.
Your greedy mesher seems to tax the cache a lot - your blocks are 32x32x32 bits (right?), and interleaved in triples; but it could be better to somehow make them fit into 64x8=8x8x8 bits (64 bytes, a cache line) for each of their core operations.
So, making the block smaller by a factor of 2 to 4 in every dimension AND ordering the memory according to execution order / ordering execution to be doing operations on directly adjacent memory blocks directly after one another could probably give you another big boost.
This is awesome! i love to see the Rust code and see how you are using Rust. As a new to low level programmer, this helps a ton! keep up the content!
This is a beautiful piece of engineering😍😍
I wonder if the data layout could be improved as it looked like you use sequences of array indices that are far apart from another. Depending on how large the data is, this could theoretically lead to cache misses as not the entire array is loaded into the cache at the same time. But it's only 0.8% of runtime and the Compiler probably already optimizes this. But if there was a slowdown caused by cache misses, improving the data layout could speed up the code a lot
i barely understood this but great video! gonna return to this video one day
Excellent video. The animations are easy to follow along with. Thanks for sharing.
I'm curious about the method you used to profile your code to determine the execution time of various sections. I didn't see any particular video in your catalog that seemed to cover this, so perhaps a "How Tantan profiles his Rust code" could be an idea for another video.
Now write it in SIMD using WIDE bit registers. imagine what you could do with 4x256 bits :P
My goodness I don't think the world is prepared for that much power...
Buffers can be prealocated and reused, that should speed up a bit more
Idk if this would be any better, but I'd like to see an algorithm that doesn't mesh it at all, but uses the GPU to figure it out.
Basically, you have a 32x32x32 chunk, with 32 quads from each of 3 directions, all facing the camera. When it comes time to render, you send those quads to the GPU along with an array with the 32768 voxel types, and the GPU would be able to draw each quad with the respective colors.
To draw it in the correct layer position, draw all of the chunks that the camera isn't in order from furthest to closest, and do the same with the quads.
Yes! He's back! Let's goooo!
This was so clearly explained! You finally made me understand a usecase for bit-shifting!
a tip go further decrease the data creation time: You're always creating and releasing memory with those Vec's. You should find a way to allocate memory once and reuse it instead. Also, don't use the stack because it can heavily limit you.
Babe wake up another tantan video has dropped
I think this can be made even faster using SIMD-instructions. Most of these problems are similar to problems in parsing where I know that those instructions can make a big difference. Especially in that data preparation step.
Thak You! That video and research are so usefull!
After watching your video, my greedy mesher looks so sloooooow :c
2:49 but acerola... have is your culled mesher optimized as much yet?
I imagine, there should be a way to optimise the procedure at ~3:35 so that the whole meshing step is complete in the smallest number of expansions possible (for the general case).
This is the same algorithm as bitmap edge detection. Shift-not-and-ing is really common in other applications 🤙
You love to see it! I've also been optimizing the Rust code of my Chess engine, although this seems exponentially more complicated 😅
Very cool. I bet you can double the performance with some tweaks to how you manage memory. I see a lot allocations happening in loops when you could make 1 allocation outside the loop and reuse the variable for each cycle of the loop.
Brilliant! how does splitting data by block type affect the memory footprint as more types are added tho? Is there an optimal sized chunk to limit the unique block types that can occur within each versus the number of iterations to cover every chunk?
Would it be possible to bitmask the x, y and z coords somewhere? So as you can calculate the face-5ets in one pass, without having to triple loop and storing the bite set for it somewhere external to the loop? Whether it's worth it or not is down to testing performance but is it even possible?
Could it be possible to use the remaining space in the byte to store block type data too? (IE: air/dirt/grass etc.) it would grow as you add more block types, but if you're taking up 3, for xyz, perhaps possible to find a way for the remaining 5 to be useful for each parse?