He got 37 bytes because he's storing the UUID as a null-terminated string. Not sure if that's the native implementation or what but it's strange to me to think that it would be.
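A quick stdlib check of the two representations the thread is arguing about (the extra 37th byte in the video is presumably per-value overhead of the text column type, not part of the UUID itself):

```python
import uuid

u = uuid.uuid4()

text_form = str(u)      # e.g. "550e8400-e29b-41d4-a716-446655440000"
binary_form = u.bytes   # the raw 128-bit value

print(len(text_form))    # 36 characters (32 hex digits + 4 hyphens)
print(len(binary_form))  # 16 bytes
```

So storing the text form costs more than twice as much per value as storing the raw bytes, before any index overhead.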
@@akam9919 I have a storage issue like that: I run a site called Skyblock finance and it has over 1 billion entries in its prices table. I optimized that table to consume as little storage as possible, and yet my database is already over 150GB, which is a LOT for a self-hosted hobby project
pg_column_size is a Postgres function. Postgres has the UUID column type built in, which does not waste memory on the text representation. MySQL does not, but it has binary columns and conversion functions to convert a UUID in string representation to binary (and back). Of course one does not store a UUID as a string, same as not storing a date in a varchar.
yeah, misleading video or simply ignorant. The article he is reading itself says that GUIDs are 128 bits (16 bytes) unless you take the naive approach, but he goes ahead and stores them as 37-byte strings anyway, which nobody will do, instead of showing the right way to people who may be interested in the trade-offs. After he was done with his little test the video was over, to say nothing of the pros and cons beyond size taken up on disk. I don't watch this guy much, but when I do, his videos often fall short like this. Another students-teaching-students channel
Problem 1 is solved by UUID Version 7 which is time-based. Problem 2 is partially solved by simply a more efficient storage, because obviously UUID requires only 16 bytes, but for some reason you store it as 37 bytes. Does your SQL engine waste storage space? Well there's a problem with your RDBMS, not with UUIDs. Also in real DBs there usually are many more columns besides the ID, so the UUID overhead is negligible. Any other problems maybe?
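For anyone curious what "time-based" means here, a UUIDv7-style id can be sketched with just the stdlib. This is a simplified construction following the RFC 9562 bit layout, not a production generator:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Sketch of a UUIDv7: 48-bit Unix-ms timestamp, then random bits,
    with the version and variant fields set per RFC 9562."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    value = (ts_ms & (2**48 - 1)) << 80 | rand
    # force version 7 (bits 76-79) and the RFC variant (bits 62-63)
    value &= ~(0xF << 76)
    value |= 0x7 << 76
    value &= ~(0x3 << 62)
    value |= 0x2 << 62
    return uuid.UUID(int=value)
```

Because the top 48 bits are a millisecond timestamp, later ids sort after earlier ones, which keeps B-tree inserts append-mostly instead of scattering them across pages.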
It's not about size but the performance of inserts, updates and deletes, which get affected if the table has an index on those columns. Using ULID instead works better. I'd say avoid the UUID data type until it's really required
Since the demo uses Postgres (which has a native support for UUID type, unlike MySQL), if you create the test table using the `UUID` column type instead of `TEXT`, you will find that `pg_column_size(id)` will return 16 bytes instead of 37, which is smaller than Cuid (26 bytes when stored as string).
Sure, UUIDs take a larger amount of space, but they allow you to scale horizontally across multiple databases without worrying about id collisions, which happen with integer auto increment. They are also easy to manipulate because there are plenty of libraries supporting them across multiple languages. If you plan early on horizontal scaling and don't want the hassle of either building your own collision-free distributed ID generator or migrating the database id type later, UUIDs offer a great trade-off at the cost of storage.
I can't believe I am reading that UUIDs take a lot of space. People sound like the gray hairs talking about wasting all those bits when you could use a byte instead of 4 bytes. Or the really crazy ones encoding booleans in bits using bitwise operations to get the flags.
@@JacobSantosDev depends on the application. For most applications it is not significant, but for a large-scale key-value store where the values are small it matters. Having a 4x or 8x difference in size makes a huge difference when scaling to millions. For most use cases though, not a big deal.
@@TheShynamo I disagree. Rarely have I seen a table with only a single primary key column. Plan for the future. I can't tell you how many times I tried the old "use the data type that uses the least memory and change when needed" approach. It is a lot harder to change with SQL and doesn't matter with NoSQL. I used to bigint all primary keys. Now I just make everything UUID. Set and forget. The advantages of UUIDs, especially when working with millions or 100s of millions of rows, are worth it. Who is sorting by the primary key anyway? I guess that could take the place of created-date fields, but created and updated are more meaningful than some number that gets bigger. I would rather sort on created or updated if I am going to have them anyway.
@@JacobSantosDev "Or the really crazy ones encoding booleans in bits using bitwise operations to get the flags." Doing this can be extremely convenient depending on the use case. Especially when working with groups of states.
Another drawback of using UUIDs as the _PK_ is that they usually appear in multiple tables, which inflates disk usage even more than in the simple example presented in the video. Plus, the random nature of UUIDs makes them very hard to compress (either natively like in MySQL, or in the filesystem, or on the NAS/SAN). In the end this all means not only more disk usage, but also more I/O (in an enterprise setting often network I/O) and more cache evictions; the performance impact compounds well beyond linear. So if you really need UUIDs in your application, put them in a separate table and only join them in when required (i.e. when resolving API calls, or providing output to the API), but do all internal relations with simple integer PKs.
In my new project, I resolved to create an extra `publicId` column using UUID only if the record is exposed to a client. Dunno if it's really intuitive, because if I were to adopt microservices, I would eventually run into key conflicts for PKs stored as int.
For a million-row database, 147MB is absolutely nothing I'm even going to think about. It has no bearing on whether I would or wouldn't choose UUIDs. But the benefit of globally unique ids across a database cluster is pretty great.
This video should be retitled "The Problem With Using String-Formatted UUIDv4s". The first problem that the blog post mentions is valid when using UUIDv4 as a primary key. However, the solution is simple: don't use a UUIDv4. Use a UUIDv1 (or better yet, the upcoming UUIDv6) which is time-based and is therefore implicitly ordered instead of random. The second problem is valid, but negligible. Using 128 bits instead of 32 for your key is chump change compared to the size of the rest of the columns. You would need to be operating at the scale of the likes of Microsoft, Amazon, and Twitter before that kind of difference starts to matter (and frankly, if you're operating at those scales, 32 bits for your primary key isn't going to cut it anyway). But beyond that, the ID size should've clocked in at 16 bytes, not 37, which suggests you stored the UUID as a CHAR(36) even though Postgres had native support for the binary UUID data type. This is confusing, especially considering the article itself mentioned storing UUIDs in 128-bit binary format as the better option.
Sure, UUIDs are bigger in memory, and yes, database size and performance will be affected. For me, it's just a compromise for better security (no brute force, no guessing the total entity count, ...). I imagine that if you want to optimize queries you could use "internal" (int) and "external" (UUID) IDs, but I'm not sure the balance would be better (even bigger disk space used, more complex queries, ...). What do you suggest as an alternative? Random numbers? Another string algorithm?
The biggest advantage with UUIDs for me is that the ID can be generated client side, or anywhere really. It makes working with distributed systems easier. The insert performance is actually the most important negative point for me. And I'd assume that it would become worse the larger your table is.
Well, wouldn't you want to validate the UUID on the server side anyway? What is the significant advantage of being able to generate them anywhere? (Genuine question)
@@Dom-zy1qy the advantage I can think of is keeping data relations intact, especially if you have a complex data structure split into multiple tables that needs to be created by clients
@@Dom-zy1qy There are probably not too many scenarios, but one I have in mind is offline-first apps. Of course, ID would be validated on the server side, but client-side generated ID would be used prior to that.
UUIDs also have a big impact on the number of keys in an index page. The more keys you can fit in each page, the less I/O is required to execute a search. This is compounded if the primary key is also used as a foreign key in detail tables, each of which will need an index on that same key. I use both int/long for the keys and a ULID column to expose in the API. Of course, it needs an index, but only one.
Insert performance will suffer, but not in the way you suspect: 4x more space per value means 4x fewer values per page or B+Tree node, so an index containing UUIDs will advance to a deeper level sooner. But this is a logarithmic relation, and in practice it means that very big tables will slow down a little sooner. Using UUID as a primary key should depend on needs like security (inability to predict the next value) or split databases (horizontal shards or clients) where the id must be generated locally. In many cases the int-vs-UUID difference in the whole row size is negligible. Also, if we're talking about very big tables, a 4-byte int may not be enough (in some circumstances), so a future-proof long (8 bytes) is often used, which diminishes the int's advantage.
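The logarithmic relation can be put into rough numbers. A back-of-the-envelope estimate, assuming 8KB pages and a guessed fixed per-entry overhead (both assumptions, real engines vary):

```python
import math

def btree_height(n_rows: int, key_bytes: int, page_bytes: int = 8192,
                 per_entry_overhead: int = 12) -> int:
    """Rough B-tree height estimate: fanout = index entries per page."""
    fanout = page_bytes // (key_bytes + per_entry_overhead)
    return math.ceil(math.log(n_rows, fanout))

print(btree_height(10**6, 4))    # 4-byte int keys, 1M rows -> 3 levels
print(btree_height(10**6, 16))   # 16-byte binary UUID keys -> 3 levels
print(btree_height(10**9, 4))    # 1B rows -> 4 levels
print(btree_height(10**9, 16))   # 1B rows -> 4 levels
```

Under these assumptions the tree heights are often identical at common table sizes; the wider key mostly shifts the row count at which the tree gains its next level, which matches the "slows down a little sooner" point above.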
@@TapetBart I didn't know that someone goes to such lengths 😉 if you know the data, you can optimize storage to the point of pain or diminishing returns. Everything is turned on its head these days.
@@TapetBart it depends on how you access the data, but the gain is impressive indeed. Long ago I tried a similar method with differential sending of files over a modem, where big chunks stayed unchanged.
There are pros and cons for using uuids or integers. Thus, when I start a new project I always toss a coin to determine which half of programmers I'm going to piss off Kappa
You don't even need to use another database... Which also uses b-trees because they are awesome and highly optimized for indexes. There are index types that don't use b-trees that you should use for such indexes. UUIDs are an absolute plus when doing replication.
The UUIDs are using 36 bytes because you're not using Postgres's UUID datatype, which stores them as binary (16 bytes) instead of text. That would have halved your index size and improved insertion performance. Also using UUIDv7 instead of UUIDv4 will further improve query/insertion performance because they're time ordered.
The way I see it, we almost always need a UUID so our code knows which data is which, making the UUID part of the data architecture designed for the software, since the default APIs are usually REST APIs. So in the end we're going to store UUIDs in the database anyway, often in several tables when the same data is split up. And if you use integers but still need to keep the UUID to link the data between tables, it's cheaper to store only the UUID as the primary key, since that's one less column. The only scenario where I see integers fitting as primary keys is for things that can be generalized, like country names: things that aren't owned by a specific user of your software. But in that case the table will be very small and will rarely need to scale, which makes you wonder whether that over-engineering of the table design will actually have a visible impact at all.
If you are watching this video, you are probably using Next.js to render your JavaScript project, which is likely hosted on Vercel and also utilizing an ORM built with JavaScript. Performance may not be a top priority for you. In fact, it might be the least of your concerns.
The real issue with using integers is the incrementing and allocation of that integer. Any time something needs to count and hand out these values, there is ultimately a single point of failure. We generally design systems for high availability, redundancy and global replication these days. Even with automated failover, a single point of failure isn't really acceptable in most of the applications I work on.
You see only the cost of storage. But there is a significant cost in CPU/memory. Sure, on a small project you'll barely feel it. But as soon as your project scales up, you'll have to refactor it. A wise software engineer would tell you that if it's cheap to do it right from the start, you should.
Integer / 32-bit is not a good example, because if your data fits into an int you don't have a problem anyway. Data locality is critical, therefore UUIDv7 is important. And for storage size: scale horizontally with your storage, then data at rest doesn't matter; just be smart with indexes or sharding.
Whenever I need random IDs, I use the YouTube approach. I create an ID field that is exactly 11 characters long and just check if an ID is already taken before inserting a new record. Tom Scott has a great video explaining how it works.
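That approach is easy to sketch. Assuming a YouTube-like 64-character alphabet (the exact alphabet is an assumption), the collision check almost never loops, because the space is 64^11 ≈ 7.4e19 values:

```python
import secrets
import string

# 64 symbols, similar to the alphabet YouTube video ids appear to use
ALPHABET = string.ascii_letters + string.digits + "-_"

def new_video_id(taken: set[str]) -> str:
    """Draw random 11-char ids until one is free; with 64^11 possible
    values, a retry is essentially never needed in practice."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(11))
        if candidate not in taken:
            taken.add(candidate)
            return candidate

existing: set[str] = set()
print(new_video_id(existing))  # an 11-character id, e.g. "dQw4w9WgXcQ"-shaped
```

In a real database the `taken` set would be a unique index, and the insert would retry on a duplicate-key error instead of checking a set in memory.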
Comparing UUID and INT types is like comparing apples and pears. If you look specifically at performance and space, of course the auto-increment integer definition seems more useful. But it doesn't take you further than making database-oriented software. If you want to abstract your software from the persistent storage (RDBMS), you have to bear other costs, and then the gains far outweigh this comparison chart. What I am saying is that it is best to decide in accordance with the software and system architecture.
Everybody who uses UUIDs should know that this is a tradeoff between performance and security. And, as some people mentioned, a UUID stored in the proper type of column (uniqueidentifier in MSSQL, uuid in PostgreSQL and MariaDB) uses 16 bytes instead of 37. It can also be stored as a 16-byte binary string in MySQL and converted to and from a string with built-in functions.
lol, well, as most tables will have 30, 40, 50, 100 columns, the table size will in the end be 1000 MB (with UUIDs) compared to 970 MB (with integers). Of course people will argue about it, but I just say "come on..."
If you have 100 columns in a relational DB you should take a serious look at your data model. And if you just need to store some unrelated datasets, a relational DBMS isn't the best choice in the first place.
Didn't even mention that if your primary key is "clustered", it'll cause rows on disk to be reorganised when a new row is inserted, because the clustered index defines the order the rows are stored in on disk. With an auto-incrementing integer type this isn't a problem, as new rows always go at the bottom of the table.
They seem to completely miss the purpose of a UUID, which is "Universally Unique", meaning an ID can be created anywhere. If you are just inserting, then an auto-increment id would be better, but if the object needs to be created outside of the app, there needs to be a way to get a unique ID. Firebase has some functionality for this, to get a new incrementing unique id from the db first, but that's a complex solution.
Yes. Quadrupling your index size will cause a lot more cache evictions (disk cache, DBMS cache, as well as CPU cache). Add to that the additional I/O required to (re-)fetch the larger data from storage, and the performance impact is worse than linear.
Yes, that's the better solution if you really need UUIDs. Preferably even put the UUIDs into a separate table and only join it when required (e.g. when communicating data to the outside world).
4-byte vs 37-byte ID values (9.25x larger) with a million rows resulting in 21MB vs 73MB (only 3.47x larger) was a bigger difference than you expected? I'm more shocked the id column is consuming 21MB instead of 3.8MB for the integer. Why is there 18MB of fluff? That's 22 bytes per row for the integer. 35MB is the pure size of the UUIDs, which suggests 39 bytes of fluff per row for the UUIDs, for a total of 76 bytes per row. My questions are these: why is there 22+ bytes of extra data attached to each index row, and why is the amount of fluff different? Why is it 39 bytes for the UUID index and 22 bytes for the integer (again, those are bytes in ADDITION to the actual bytes of the index value)?
Good summary video. In the case of the MySQL storage engine (InnoDB), the PK gets stored 6 times in a page for a single row, not including potential FKs. It works out to about 60% of storage space going just to storing PKs. I personally still would not use UUIDs in my MySQL database as a primary key. Database thrashing and inflated storage are killers.
Ones I can think of: users can estimate how many rows you have, and users can easily guess existing records, which on public data can be risky (e.g. if they create a "share url" with id 3050, they can easily see someone else's 3049). Generally anything that requires the creator to be "logged in" or verified is safe to use an internal ID, and anything that is public, like a share url, you'll want a UUID or at least a randomised value. Good luck, I'd love to hear more disadvantages too!
@@oSpam but then again users shouldn't have direct access to your internal DB structure anyways, regardless of logins. The API should _never_ expose internal keys.
@@stephanweinberger we're talking about APIs, not interfaces with closed backends. Closed backends won't ever give non-logged-in users access anyway, so it doesn't matter. Data that's publicly available is what I mean by "public data"
@@oSpam _especially_ then you don't want to leak internal data, do you? So you need to devise a method for identifying data externally in any case and map whatever identifiers you use (and UUIDs may be a way to do that) to your internal keys. That doesn't mean that you database has to use the same identifiers as PKs internally (and it probably shouldn't).
@@stephanweinberger but internal data will all have permission checks, so what's the point in coming up with random ids? If you're suggesting not using perm checks and just relying on random IDs/UUIDs, in my opinion that's worse. I don't see what you're trying to get at
It’s important to look at the bigger picture here. A real world application is going to have lots of foreign keys - so a table may have two, three, four or even more UUID values per record. That’s going to make storage costs mount up pretty quickly. I liked the suggestion another commenter has about using an internal integer for those joins but exposing a public UUID in your APIs to avoid the flaws of ints - that would keep table sizes lower, so long as you’re careful with the API design.
@@CatDevz hmm, let me try to count it again. 37 characters, trim the terminator and it's 36; drop the 4 hyphens and you're left with 32 hex characters. 2 hex characters compress to 1 byte, so 32/2 = 16 bytes. my bad
Can we do one thing:
1. Use a primary int ID (indexed)
2. Encrypt the primary ID (say, encrypted_id)
3. Store the encrypted IDs in Redis, mapped to the integer IDs
4. When the REST API is called with an encrypted id, fetch the int id from Redis and perform the SQL query
The question is how many tables we do this for. The answer is wherever we show data publicly; for internal use it should not be an issue. Or, instead of Redis, we can simply decrypt the id (should be quick, I believe) and then query the DB
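The "simply decrypt the id" variant can even skip Redis entirely. A hedged, stdlib-only sketch using a small Feistel permutation over 32-bit ids (the secret, round count and helper names are placeholders for illustration, not vetted cryptography):

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # hypothetical key; keep it out of the client

def _round(half: int, round_no: int) -> int:
    # keyed 16-bit round function derived from HMAC-SHA256
    digest = hmac.new(SECRET, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big")

def encrypt_id(n: int, rounds: int = 4) -> int:
    """Feistel permutation over 32-bit ids: reversible, looks random."""
    left, right = n >> 16, n & 0xFFFF
    for r in range(rounds):
        left, right = right, left ^ _round(right, r)
    return (left << 16) | right

def decrypt_id(token: int, rounds: int = 4) -> int:
    """Run the rounds backwards to recover the original id."""
    left, right = token >> 16, token & 0xFFFF
    for r in reversed(range(rounds)):
        left, right = right ^ _round(left, r), left
    return (left << 16) | right
```

Because the permutation is its own lookup table, there is no per-id state to store; the API decrypts the external token back to the int PK on every request.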
The difference in the example is 100 MB. This is literally nothing in large distributed systems where UUIDs are needed. It's like a few high-res photos on Instagram which are loaded when you scroll for a few seconds.
I am not a fan of exposing the int ID, but doesn't having a PK int ID plus a public_id UUID take up even more space? If so, does that mean having the UUID as PK is the better option?
this is pointless, since in the real world you aren't just storing a single id in every table. A frickin 128-bit id column will have almost no impact compared to the varchar(255) of the many other columns like name, address, description, message. Not to mention you still need to store some kind of slug or other identifier in your table anyway; you wouldn't want your web url to be something like /post/[incremental id], since that can be guessed easily, making a huge potential security issue
I think int ids can create problems in React. If you map and render two tables in the same component, then how do we pass a unique key? I would have to move everything into its own separate component.
The key prop identifies the component from its siblings. It can be either a number or a string; it really doesn't matter, as long as no sibling has the same key. If you have auto-incrementing keys on your backend, you probably never encounter the same key twice under a single component.
Wow, high effort video where we carefully consider the pros and cons of different index datatypes... nah, just tell the other kids uuid = bad because size 💩
No way? You mean that a representation of a larger space takes more memory? Who would have known? Something that can store 2^128 items vs 2^32 items takes more memory. What a big surprise.
It depends on the use-case. Not every table object requires a UUID, and most applications aren't even big enough to leverage them. The biggest benefit is having them as an identifier to the outside world. No numeric id guessing 🙃 and before table size becomes a factor to think about... have a look at your schemas for wasteful definitions.
Doesn't seem too relevant. With ints having 4-byte width and UUIDs having 37-byte width, that comes to about 10 times the size. Yet the total size of the UUID table was only about 2.5 times the size of the int table, which suggests that either the UUID storage is optimised, or some other data is using space, or both. And then again, 260 MB for a million rows: I can live with that
I think this test shows literally nothing. A table entry most of the time has multiple columns with values, so the percentage taken by the UUID would be way smaller than in your example. And second, even taking your example: say we have 1 billion rows, that would be about 6 - 15GB. Considering that most low-tier VPSes have at least 10GB of storage, and that most SaaS companies never reach such a big number of entries, it's even less relevant. Additionally, 1 billion rows can't run performantly on a 0.5 vCPU VPS anyway, so you would need a beefier machine, which would cost more, I guess somewhere between 20 - 100$ depending on your preferences, and those come with a lot more storage. So storage is only problematic when you have trillions of rows, or use a service like PlanetScale where storage is pricey
How many of us with their hobby projects and/or 2 users in production would even care? We need to not overengineer every solution. Of course you will have downsides with every approach you take. Let's be real: It's ok to fail, it's ok to not get it right the first time, it's ok to iterate and improve.
One huge problem with int ids, if you delete a record, you have a gap 1, 2, 4, 5. Which triggers my OCD.
What about soft deletes then?
@@christoferluiz365 Soft deletes:
1. Require space for all "deleted" items in your database.
2. Users can ask to _completely_ delete their data from your database according to GDPR law. "Right to be forgotten".
😂😂😂
In my 20 years, I have never seen security and performance in the same sentence.
And here it is.
I thought so as well.
Josh, one small thing that ended up being a bit misleading (accidentally) is the total table sizes.
Since you had a 2-column table with a relatively short string value, it makes the id size look far more impactful than it would be in a real-world production table. The absolute difference you're showing is accurate, but the relative difference isn't even close. A 1M-row table would take up multiple GBs of disk. A 40MB difference at that scale is proportionally tiny. Not really worth thinking about if you believe CUIDs come with some sort of DX benefit, better slugs, etc.
Yes, this is exactly what I was thinking when watching the video. Just one column on the table heavily skews the size of the table towards the size of the IDs. Besides, a single-column table is nowhere near a good example of a table on a production database.
Yeah this was my immediate thought too. Based on the arguments provided, this is only worth thinking about at scale.
True. But what about foreign key tables for joining or just storing them in other tables some other way? If you have many of them, you’re increasing storage even more. You’re never going to have just one table.
Yes .. point 1 is a lot more important than point 2. The size difference is completely negligible
This says it all
How to solve problems with UUID
1. Don’t use MySQL, use for example PostgreSQL instead
2. Don’t use btree indexes - use hash ones
1M rows with UUIDs take 36MB. That’s NOT a lot these days.
What's the main problem with a serial/int PK: you can't use a shared resource for your developers, for example Redis. With UUIDs you won't have collision issues
On 2. you have to know that you'll never need to sort or access more than one record at a time. Works in a lot of cases, but if you have say an iot use case with multiple rows per device then it doesn't.
Bruh just use ULID
I was shocked that they said to store the UUID as a char(36) when really the DB should just have a native UUID type...
I can't agree more with point 1 lol. Just use PostgreSQL.
@@valerianmp or just use ULID instead of the outdated UUID. It's smaller, sortable and doesn't cause issues with any DB
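For reference, a ULID is simple enough to generate with the stdlib. A rough sketch of the spec's layout (48-bit millisecond timestamp + 80 random bits, Crockford base32):

```python
import os
import time

# Crockford base32 alphabet: no I, L, O or U, and in ascending ASCII order,
# so lexicographic string order matches numeric (i.e. creation-time) order
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    """128 bits (48-bit Unix-ms timestamp + 80 random bits) encoded as
    26 base32 characters; later ulids sort after earlier ones."""
    value = (int(time.time() * 1000) << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

The 26-character string form is bigger than a binary UUID (16 bytes), but it is URL-safe and sortable, which is the usual selling point.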
UUID size killed the free plans on planetscale?
funny af
LOL
Yes. It does. :D
No free plan
The benefits of UUIDs or similar still outweigh the smaller size of integers, by a lot. Like, security, scalability, flexibility, traceability. Paying a minimal amount of money more to store UUIDs instead of integers gets you so many more benefits. Therefore, a price worth paying.
scalability? more like the opposite
@@furycorp global uniqueness, elimination of central bottlenecks, ease of data merging and replication and handling massive amounts of data are only some benefits of UUIDs in regards to scalability.
@@furycorp try to make an integer autoincrement column in a distributed system with no single point of failure allowed
@@furycorp yeah you never dealt with scale...
The importance of this comment cannot be overstated.
I use both. Primary key is an incremental int and then have a UUID column. All foreign keys are using the PK while the api layer and outside world are only aware of the UUID.
You can then add indexing and caching to mitigate database load.
Storage is so cheap now I never think about it. I just monitor index fragmentation and query times
wow enjoying the best of two worlds, i do this also...
Especially since an additional 4- or 8-byte integer isn't really a big deal when you've already committed to spending the 35+ bytes for a UUID.
Also: you can put all the uuids in a separate table that only gets joined in when you actually need it.
Can you redo this video without the obvious mistake of storing UUID as a string (36 (37?) bytes) instead of its binary form (16 bytes)?
But that would require Josh to understand what he's talking about instead of just repeating blog posts in video form! He put in so much effort too...
In the grand scheme of things 64MB and 104MB is almost nothing, the other columns will add way more size. MySQL issue tbh
To be frank, most of those numbers aren't all that impactful, even at large scale. Sure, on paper it's a rather large loss, but practically speaking, I'd say the benefits outweigh the disadvantages.
At very large scale it might add up, as it grows by gigabytes. But even then it wouldn't be necessary to keep all the rows in RAM at all times (which is the costly part); the less frequently accessed data can be stored on SSD/HDD, which still takes up some space but costs next to nothing in comparison.
He did use Postgres in his examples though...
In MySQL the difference is actually smaller nowadays, because of native table page compression (which is another point where UUIDs are detrimental, because of their random nature).
Also: the size quickly adds up (1M rows is a rather small table), especially as id columns tend to be used in multiple locations to enable table joins. And all those data need to be read from disk; and the larger the database grows, the less likely it is to fit into the OS & DBMS's cache.
@@Azoraqua I wouldn't use "gigabytes" and "very large scale" in the same sentence :-)
@@stephanweinberger I mean, gigabytes is quite all-fitting, it includes Fortune 500 companies and to an extent FAANG; Mostly meaning the average non-FAANG larger company
But UUIDs, being random, don't play well with indexes.
A UUID is 128 bits, which is 16 bytes. If you really try to make it bigger than that by using weird database column types you'll get larger columns, but Postgres, which you use as your example, has a UUID type that does the string-to-binary conversion and all that, and only uses the 16 bytes it needs to store a UUID value.
You can of course achieve the same efficient storage in any database that supports byte array types by storing the UUID as a [16]byte and properly encoding/decoding it client side.
I would also add that if you are really dealing with a storage issue where you literally do not have enough space to store a freaking UUID, you probably have a bigger problem on your hands. You would have to reach roughly the size of Google before you start worrying about that.
He got 37 bytes because he's storing the UUID as a null-terminated string. Not sure if that's the native implementation or what but it's strange to me to think that it would be.
@@akam9919 I have a storage issue like that, I run a site called Skyblock finance and it has over 1 billion entries in its prices table. I optimized that table to consume as little storage as possible and yet my database is already over 150GB which is a LOT for a self-hosted hobby project
pg_column_size is a Postgres function. Postgres has the UUID column type built in, which does not waste space on the text representation.
MySQL does not, but it has binary types and conversion functions to convert a UUID in string representation to binary (and back).
Of course one does not store a UUID as a string, the same way one does not store a date in a varchar.
Too bad he didn't include this in the test, e.g. using uuid-ossp in Postgres.
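To illustrate what that conversion buys you, here's the size difference between the text and binary forms of the same UUID, sketched with Python's stdlib uuid module (this mirrors what MySQL's UUID_TO_BIN()/BIN_TO_UUID() or Postgres's native uuid type do internally):

```python
import uuid

u = uuid.uuid4()

text_form = str(u)      # '550e8400-e29b-...' style: 36 characters
binary_form = u.bytes   # the same value as 16 raw bytes

assert len(text_form) == 36
assert len(binary_form) == 16

# The conversion is lossless in both directions.
assert uuid.UUID(bytes=binary_form) == u
assert uuid.UUID(text_form) == u
```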
I had no idea. Thanks for pointing that out.
This!
Be me: who stores UUID in char(36).
Yes i am sad. No i don't care.
Yeah, a misleading video, or simply ignorant. The article he is reading itself says that GUIDs are 128 bits (16 bytes) unless you take the naive approach, but he goes ahead and stores them as 37-byte strings anyway, which nobody will do, instead of showing people who may be interested in the trade-offs the right way. After he was done with his little test the video was over, to say nothing of the pros and cons beyond size taken up on disk. I don't watch this guy much, but when I do, his videos often fall short like this. Another students-teaching-students channel.
Problem 1 is solved by UUID Version 7 which is time-based.
Problem 2 is partially solved simply by more efficient storage, because a UUID obviously requires only 16 bytes, yet for some reason you store it as 37. Does your SQL engine waste storage space? Then that's a problem with your RDBMS, not with UUIDs. Also, real DBs usually have many more columns besides the ID, so the UUID overhead is negligible.
Any other problems maybe?
If you are worried about UUID size in your database you can afford a team to fix it
Me with a hobby project that has over a billion rows in one of its tables lol
@@FlorianWendelborn Sure, but do you really need a UUID for these rows or would a 64-Bit integer suffice?
It's not about size but about the performance of inserts, updates, and deletes, which get affected if the table has an index on those columns. Using ULID instead works better. I'd say avoid the UUID data type until it's really required.
Since the demo uses Postgres (which has a native support for UUID type, unlike MySQL), if you create the test table using the `UUID` column type instead of `TEXT`, you will find that `pg_column_size(id)` will return 16 bytes instead of 37, which is smaller than Cuid (26 bytes when stored as string).
Sure, UUIDs take up a lot of space, but they allow you to scale horizontally across multiple databases without worrying about id collisions, which happen with integer auto-increment.
They are also easy to work with because there are plenty of libraries for them across multiple languages.
If you plan early on horizontal scaling and don't want the hassle to either build your own collision-free distributed ID generator or migrate the database id type, UUIDs offer a great trade off at the cost of storage.
I can't believe I am reading that UUIDs take a lot of space. People sound like the gray hairs talking about wasting all those bits when you could use a byte instead of 4 bytes. Or the really crazy ones encoding booleans in bits using bitwise operations to get the flags.
@@JacobSantosDev Depends on the application.
For most applications it is not significant, but for a large scale key-value store where the values are small it matters. Having a x4 or x8 difference in size makes a huge difference when scaling to millions.
For most use cases though, not a big deal.
@@TheShynamo I disagree. Rarely have I seen a table with only a single primary key column. Plan for the future. I can't tell you how many times I tried the old advice of using the data type that uses the least memory and changing it when needed. It's a lot harder to change with SQL and doesn't matter with NoSQL. I used to bigint all primary keys. Now I just make everything UUID. Set and forget.
The advantages of UUID, especially when working with millions or 100s of millions of rows are worth it. Who is sorting using the primary key anyway? I guess that could take the place of created date fields. Created and updated are more meaningful than some number that gets bigger. I would rather sort on created or updated if I am going to have them anyway.
@@JacobSantosDev "Or the really crazy ones encoding booleans in bits using bitwise operations to get the flags."
Doing this can be extremely convenient depending on the use case. Especially when working with groups of states.
@@prima_ballerina also useful for genetic algorithms but I was speaking specifically of code and not encodings for storage, network, etc.
Another drawback of using UUIDs as the _PK_ is that these usually appear in multiple tables, which inflates disk usage even more than in the simple example presented in the video.
Plus, the random nature of UUIDs makes them very hard to compress (either natively like in MySQL, or in the filesystem, or on the NAS/SAN).
In the end this all means not only more disk usage, but also more I/O (in an enterprise setting often network I/O) and more cache evictions - the performance impact isn't just linear, but exponential.
So if you really need UUIDs in your application, put them in a separate table and only join them in when required (i.e. when resolving API calls, or providing output to the API), but do all internal relations with simple integer PKs.
In my new project, I resolved to creating an extra `publicId` column using UUID only if it's exposed to a client. Dunno if it's really intuitive, because if I were to adopt microservices, I would eventually run into key conflicts for PKs as int.
For a million-row database, 147MB is absolutely nothing I'm even going to think about. It has no bearing on whether I would or wouldn't choose UUIDs. But the benefit of globally unique ids across a database cluster is pretty great.
This video should be retitled "The Problem With Using String-Formatted UUIDv4s".
The first problem that the blog post mentions is valid when using UUIDv4 as a primary key. However, the solution is simple: don't use a UUIDv4. Use a UUIDv1 (or better yet, the upcoming UUIDv6) which is time-based and is therefore implicitly ordered instead of random.
The second problem is valid, but negligible. Using 128 bits instead of 32 for your key is chump change compared to the size of the rest of the columns. You would need to be operating at the scale of the likes of Microsoft, Amazon, and Twitter before that kind of difference starts to matter (and frankly, if you're operating at those scales, 32 bits for your primary key isn't going to cut it anyway). But beyond that, the ID size should've clocked in at 16 bytes, not 37, which suggests you stored the UUID as a CHAR(36) even though Postgres had native support for the binary UUID data type. This is confusing, especially considering the article itself mentioned storing UUIDs in 128-bit binary format as the better option.
Sure, UUIDs are bigger in memory, and yes, database size and performance will be affected. For me it's just a compromise for better security (no brute force, no guessing the total entity count, ...). I imagine that if you want to optimize queries you could use "internal" (int) and "external" (UUID) IDs, but I'm not sure the balance would be better (even more disk space used, more complex queries, ...).
What do you suggest as an alternative? Random numbers? Another string algorithm?
Snowflake ID could be an alternative
@@nickonos Nice, I discovered something, thanks for the tip!
@@nickonosLooks interesting
UUID v7 is already time-incremented and precise to 1 millisecond, meaning conflicts are practically impossible and it's already sorted by time :)
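For illustration, a rough UUIDv7 generator following the RFC 9562 layout (48-bit unix-millisecond timestamp up front, then version, variant, and random bits). This is a sketch, not a production implementation:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Rough UUIDv7: 48-bit unix-ms timestamp, version/variant bits, 74 random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80        # timestamp in the top 48 bits
    value |= 0x7 << 76                             # version = 7
    value |= ((rand >> 68) & 0xFFF) << 64          # 12 random bits (rand_a)
    value |= 0b10 << 62                            # RFC 4122 variant
    value |= rand & ((1 << 62) - 1)                # 62 random bits (rand_b)
    return uuid.UUID(int=value)

a = uuid7()
time.sleep(0.002)   # at least one millisecond later...
b = uuid7()
assert a < b        # ...so the later id sorts after the earlier one
assert a.version == 7
```

Because the timestamp sits in the most significant bits, later ids compare greater, which is exactly what keeps B-tree inserts appending near the right edge instead of splattering randomly.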
I use Laravel's Ordered::uuid I guess it's using v8 or something
The biggest advantage with UUIDs for me is that the ID can be generated client side, or anywhere really. It makes working with distributed systems easier.
The insert performance is actually the most important negative point for me. And I'd assume that it would become worse the larger your table is.
Use ULID, same but better
Well wouldn't you want to validate the uuid anyways on server side? What is the significant advantage with being able to generate them anywhere? (Genuine question)
@@Dom-zy1qy The advantage I can think of is relating data, especially if you have a complex data structure spread across multiple tables that clients need to assemble.
@@Dom-zy1qy There are probably not too many scenarios, but one I have in mind is offline-first apps. Of course, ID would be validated on the server side, but client-side generated ID would be used prior to that.
UUIDs also have a big impact on number of keys in an index page. The more keys you can fit in each page the less i/o required to execute a search. This is compounded if the primary is also used as a foreign key in details tables. Each of which will need an index on that same key. I use both int/long for the keys and a Ulid column to expose in the API. Of course, it needs an index but only one.
Insert performance will suffer, but not in the way you suspect: 4x more space per value means 4x fewer values per page or B+tree node, so an index containing UUIDs will advance to a deeper level sooner. But this is a logarithmic relation, and in practice it means very big tables slow down a little sooner.
Using UUIDs as primary keys should depend on needs like security (inability to predict the next value) or split databases (horizontal shards or per-client) where the id must be generated locally.
In many cases the whole-row size difference between int and UUID is negligible.
Also, if we're talking about very big tables, a 4-byte int may not be enough (in some circumstances), so a future-proof long (8 bytes) is often used, which diminishes its advantage.
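A back-of-envelope check of that logarithmic relation. The per-page fanouts are assumptions (roughly an 8 KB page holding bigint vs. UUID keys plus per-entry overhead), not measured values:

```python
import math

def btree_depth(rows: int, keys_per_page: int) -> int:
    """Levels needed so that keys_per_page ** depth >= rows."""
    return math.ceil(math.log(rows) / math.log(keys_per_page))

INT_FANOUT = 400    # assumed: ~8 KB page / ~20 B per bigint entry
UUID_FANOUT = 250   # assumed: ~8 KB page / ~32 B per uuid entry

# At 20M rows the int index still fits in 3 levels; the UUID index needs 4.
assert btree_depth(20_000_000, INT_FANOUT) == 3
assert btree_depth(20_000_000, UUID_FANOUT) == 4

# But the growth is logarithmic: at 100M rows both need 4 levels.
assert btree_depth(100_000_000, INT_FANOUT) == 4
assert btree_depth(100_000_000, UUID_FANOUT) == 4
```

So the wider key reaches each extra level sooner, but both indexes stay within a level or so of each other at any realistic table size.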
@@TapetBart I didn't know that someone went to such lengths 😉 If you know the data you can optimize storage to the point of pain or diminishing returns. Everything is on its head these days.
@@TapetBart It depends on how you access the data, but the gain is impressive indeed.
Long ago I tried a similar method with differential sending of files over a modem, where big chunks stayed unchanged.
There are pros and cons for using uuids or integers. Thus, when I start a new project I always toss a coin to determine which half of programmers I'm going to piss off Kappa
I use Postgresql daily, and a UUID there is stored natively and treated as a binary type
Just use a different database
You don't even need to use another database... Which also uses b-trees because they are awesome and highly optimized for indexes. There are index types that don't use b-trees that you should use for such indexes.
UUIDs are an absolute plus when doing replication.
The UUIDs are using 36 bytes because you're not using Postgres's UUID datatype, which stores them as binary (16 bytes) instead of text. That would have halved your index size and improved insertion performance. Also using UUIDv7 instead of UUIDv4 will further improve query/insertion performance because they're time ordered.
The way I see it, almost every time we need a UUID to let our code know which data is which, making the UUID part of the data architecture designed for the software, since the default APIs are usually REST APIs. So in the end we're going to need to store UUIDs in the database anyway, and often the same data is split across distinct tables.
And if you try to use integers but still need to keep the data's UUID to be able to link the data between tables, it's cheaper to store only the UUID as the primary key, since you have one less column in that case.
The only scenario where I see integers fitting better than UUIDs as primary keys is for things that can be generalized, like country names: things that no specific user of your software owns. But in that case the table will be very small and won't need to scale often, which makes you wonder whether that over-engineering of the table design will actually have a visible impact.
Seems like a good fit for UUID v5 where you generate it from the primary integer key only when necessary.
There is also UUID v7, so your problem 1 wouldn't be even a thing anymore.
If you are watching this video, you are probably using Next.js to render your JavaScript project, which is likely hosted on Vercel and also utilizing an ORM built with JavaScript. Performance may not be a top priority for you. In fact, it might be the least of your concerns.
Issue really in using integer is the incrementing and allocation of that integer. Any time something needs to count and hand out these values there is ultimately a single point of failure. We generally design systems for high availability and redundancy these days and for global replication. Even with automated failover a single point of failure isn’t really acceptable in most of the applications I work on these days.
I don't really care about either of these things, storage is cheap enough nowadays and whether insert performance matters depends on the project.
You see only the cost of storage. But there is a significant cost in CPU/memory. Sure, on a small project you'll barely feel it. But as soon as your project scales up, you'll have to refactor it. A wise software engineer would tell you that if it's cheap to do it right from the start, you should.
Integer / 32-bit is not a good example, because if your stuff fits into an int, you don't have a problem anyway.
Data locality is critical, therefore UUIDv7 is important.
And for storage size: scale your storage horizontally, then data at rest doesn't matter; just be smart with indexes or sharding.
Whenever I need random IDs, I use the YouTube approach. I create an ID field that is exactly 11 characters long and just check whether an ID is already taken before inserting a new record.
Tom Scott has a great video explaining how it works.
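A sketch of that approach: 11 URL-safe characters carry 64 random bits (8 bytes base64url-encoded with the padding stripped), which is the same shape as a YouTube video ID. The collision check is stubbed with a set here; in a real app it would be a DB uniqueness lookup:

```python
import base64
import os

def short_id() -> str:
    """11 URL-safe characters encoding 64 random bits."""
    return base64.urlsafe_b64encode(os.urandom(8)).decode().rstrip("=")

def new_unique_id(taken: set[str]) -> str:
    # Stand-in for "check if an ID is already taken before inserting".
    while (candidate := short_id()) in taken:
        pass
    return candidate

sid = short_id()
assert len(sid) == 11
```

With 2^64 possibilities, the retry loop almost never fires until you have billions of rows.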
Comparing UUID and INT types is like comparing apples and pears. If you look specifically at performance and space, of course the auto-increment integer definition seems more useful. But it doesn't take you further than making database-oriented software. If you want to abstract your software from the persistent storage (RDBMS), you have to bear other costs. Then the gains far outweigh this comparison chart. What I am saying is that it is best to decide in accordance with the software and system architecture.
Everybody who uses UUIDs should know that this is a tradeoff between performance and security. And, as some people mentioned, a UUID stored in the proper type of column (uniqueidentifier in MSSQL, uuid in PostgreSQL and MariaDB) uses 16 bytes instead of 37. It can also be stored as a 16-byte binary string in MySQL and converted from and to string with built-in functions.
lol, well, as the most tables will have 30, 40, 50, 100 columns, the table size will be in the end 1000 MB (with uuids) compared to 970 MB (with integers). Of course, people would argue about it, but I just say "come on..."
If you have 100 columns in a relational DB you should take a serious look at your data model.
And if you just need to store some unrelated datasets, a relational DBMS isn't the best choice in the first place.
Use ULIDS instead
Any idea how to use it with drizzle?
You can use the $default to return your id in drizzle
or UUID v7
@@formula-box so no native support from drizzle? we'll have to resort to a generator function?
I guess the “in MySQL” is the important part
Didn't even mention that if your primary key is "clustered", it'll cause the table on disk to be reorganised every time a new row is inserted. This is because the clustered index defines the order the rows are stored on disk. With an auto-incrementing integer type this isn't a problem, as new rows always go onto the bottom of the table.
What about the builtin UUID data type, how efficient is that?
Great video, thank you!
They seem to completely miss the purpose of UUID which is "Universally Unique", meaning the purpose is that an ID can be created anywhere. If you are just inserting then an auto increment id would be better, but if the object needs to be created outside of the app then there needs to be a way to get a unique ID. Firebase has some functionality for this, to get a new incrementing unique id from the db first, but it's just a complex solution.
What about RAM size? When you index these, UUID and integer speed is almost the same, but RAM usage is not. This also affects scalability.
Yes. Quadrupling your index size will cause a lot more cache evictions (disk cache, DBMS cache, as well as CPU cache). Add to that the additional I/O required to (re-)fetch data (of increased size) from storage, and the performance impact isn't just linear but exponential.
But what's wrong with having an int as the PK index and another, let's say, userId UUID column?
Yes, that's the better solution if you really need UUIDs. Preferably even put the UUIDs into a separate table and only join it when required (e.g. when communicating data to the outside world).
4byte vs 37byte ID value (9.25x larger) with a million rows resulting in 21MB vs 73MB (only 3.47x larger) was a bigger difference than you expected?.. I'm more shocked the id row is consuming 21MB instead of 3.8MB on the integer. Why is there 18MB of fluff? So that's 22bytes per row on the integer. 35MB is the pure size for the UUIDs, which suggests 39bytes of fluff per row for the UUIDs for a total of 76bytes per row.
My questions are these:
Why is there 22+bytes of extra data attached to that index value row, and why is the amount of fluff different? Why is it 39 bytes for the UUID index and 22 bytes for the integer (again, those are values in ADDITION to the actual bytes for the value of the index)
He was wrong on the size of UUIDs. They are 16 bytes, not 37. He was storing them as a string which isn't really what you're supposed to do lol.
Good summary video. In the case of MySQL storage engine (innodb) the PK gets stored 6 times in a page for a single row, that is not including potential FK. It works out to about 60% of storage space just on storing PK.
I personally still would not use UUID in my MySQL database as a primary key. Database thrashing and inflated storage is killer.
Are there any disadvantages to using the usual id method?
Ones I can think of: users can estimate how many rows you have, and users can easily guess existing records, which can be risky for public data (e.g. if they create a "share url" with id 3050 they can easily see someone else's 3049). Generally, anything that requires the creator to be "logged in" or verified is safe to use an internal ID, and anything public, like a share url, should use a UUID or at least a randomised value.
Good luck, I'd love to hear more disadvantages too!
@@oSpam but then again users shouldn't have direct access to your internal DB structure anyways, regardless of logins. The API should _never_ expose internal keys.
@@stephanweinberger We're talking about APIs, not interfaces with closed backends. Closed backends won't ever give non-logged-in users access anyway, so it doesn't matter. Data that's publicly available is what I mean by "public data".
@@oSpam _especially_ then you don't want to leak internal data, do you? So you need to devise a method for identifying data externally in any case and map whatever identifiers you use (and UUIDs may be a way to do that) to your internal keys. That doesn't mean that you database has to use the same identifiers as PKs internally (and it probably shouldn't).
@@stephanweinberger but internal data will all have permission checks so what’s the point in coming up with random Ids. If you’re suggesting not using perm checks and just using random IDs/UUID in my opinion that’s worse. I don’t see what you’re trying to get at
How does the UUID column type fare in PostgreSQL?
Theo just recorded the same video last stream lol
Can't find it on YT
@@hofimastah he didn't release it as a separate video yet, but you can find it in his latest stream vod
@@ronanru Thanks. Can't find the stream. Maybe he will publish a video on YouTube.
This guy literally bases his channel on stealing Theo's content
It’s important to look at the bigger picture here. A real world application is going to have lots of foreign keys - so a table may have two, three, four or even more UUID values per record. That’s going to make storage costs mount up pretty quickly. I liked the suggestion another commenter has about using an internal integer for those joins but exposing a public UUID in your APIs to avoid the flaws of ints - that would keep table sizes lower, so long as you’re careful with the API design.
In addition, the row's primary key is stored as part of each index over the table. So the size goes up again for each index you add.
if you care that much about the size, then you can just simply convert the uuid 37 bytes char form to its 10 bytes binary form
@@aldi_nh 16 bytes
@@CatDevz hmm, let me try to count it again. 37 hexadecimal characters, trim the endline its 36 bytes. 2 hex characters compressed to 1 byte, so 36/2 = 18 bytes.
my bad
@@aldi_nh It's 16 bytes, or 128 bits, to store a UUID. You didn't consider the dashes which make up 4 characters in a UUID string.
Space has never been an issue for us; we use our internal storage on the server, along with backup storage that takes over if there is any failure.
Can we do one thing:
1. Use a primary INT (indexed)
2. Encrypt the primary ID (say, encrypted_id)
3. Store the encrypted IDs in Redis with a mapping to the integer IDs
4. When the REST API is called with an encrypted id, we fetch the id from Redis and perform the SQL query
The question is how many tables we do this for; the answer is wherever we show data publicly. For internal use it shouldn't be an issue.
Or instead of Redis, we can simply decrypt the id (should be quick, I believe) and then query the DB
The difference in the example is 100 MB. This is literally nothing in large distributed systems where UUIDs are needed. It's like a few high-res photos on Instagram which are loaded when you scroll for a few seconds.
Insertion performance and index fragmentation really IS the main problem - not storage.
This video is very interesting and informative. I guess using MySQL at scale may not be a good idea.
I was also thinking this; that's why I prefer integer IDs. Way simpler to implement, and less of a performance hit.
Hey Josh, why do you say Planetscale used to be your favorite database provider? What changed and what would you recommend now?
Well use UUID with Cassandra or Scylla. Those highly distributed systems are pretty good with UUIDs
I am not a fan of exposing the int ID, but doesn't having a PK int ID plus a public_id UUID take up even more space? If so, does it then mean having the UUID as PK is a good option?
Why care about UUID size when we sometimes store a profile image for each user (unless we are using Google auth)?
This is pointless, since in the real world you aren't just storing a single id in every table.
A 128-bit id column will have almost no impact next to the many varchar(255) columns like name, address, description, message.
Not to mention you still need a way to store some kind of slug or other identifier in your table anyway; you wouldn't want your web url to be something like /post/[incremental id], since that can be guessed easily, creating a huge potential security issue.
Josh turning into Theo (reviewing blogs) . Not that i hate it, but its funny how all developers are coming to same content creation path
Probably because developers doing weird things on camera is rare, except probably joma and techlead, to make a reaction video 😂
Is this just for MySQL or PostgresSQL too?
mysql things, pg has uuid built in
I think int IDs can create problems in React. If you map and render two tables in the same component, then how do we pass a unique key? I'd have to move everything into its own component.
The key prop identifies the component from its siblings. It can be either a number or a string; it really doesn't matter, as long as no sibling has the same key. If you have auto-incrementing keys on your backend, you probably never encounter the same key twice under a single component.
How about as a non-primary key?
You wouldn't be constantly querying that value and indexing like a psycho but it's there when you need it
What about NANOiDs?
Wow, high effort video where we carefully consider the pros and cons of different index datatypes... nah, just tell the other kids uuid = bad because size 💩
Why would you even use UUIDs in your database in the first place?
No way? You mean that a representation of a larger space takes more memory? Who would have known? Something that can represent 2^128 items takes more memory than something that can represent 2^32 items. What a big surprise.
It depends on the use case. Not every table object requires a UUID, and most applications aren't even big enough to leverage them. The biggest benefit is having them as an identifier for the outside world: no numeric id guessing 🙃 And before table size becomes a factor to think about, have a look at your schemas for wasteful definitions.
I would have preferred a chart to visualize those numbers. Easier to compare
I store Mongo Object Ids as UUID in my postgres as binary array
Doesn't seem too relevant. With ints being 4 bytes wide and UUIDs 37, that's about 10 times the size. Yet the total size of the UUID table was only about 2.5 times the size of the int table, which suggests that either the UUID storage is optimised, or some other data is using space, or both. And then again, 260 MB for a million rows? I can live with that.
MongoDB having a sip of coffee.
What is the solution to this then?
(why would you use them as PRIMARY keys?)
what about postgres?
PostgreSQL has the uuid type built in, so most of these claims are not true for Postgres.
I just use integers everywhere, never had a problem
This info is very useful thanks
Looks like the problem with UUIDs is that they provide too much entropy, and in this context that's very wasteful.
To summarize: it’s related to indexing
I think this test shows literally nothing. A table entry most of the time has multiple columns with values, so the percentage taken by the UUID would be way less than in your example. And second, even taking your example: say we have 1 billion rows, then this would be about 6-15GB. Considering that most low-tier VPSes have at least 10GB of RAM, and that most SaaS companies never reach such a big number of entries, it's even less relevant. Additionally, 1 billion rows can't run performantly on a 0.5 vCPU VPS anyway, so you would need a beefier machine, which would cost somewhere between $20 and $100 depending on your preferences, and those come with a lot more storage. So storage is only problematic when you have trillions of rows, or use a service like PlanetScale where storage is pricey.
First problem is the biggest, no need to check 2nd, it's not important
Yeah, let's worry about UUIDs used as IDs and skip those huge varchar/json(b) columns; IDs are clearly what matters most for optimizing storage use :)
What headphones are you using? They look sick
That's Sony WH1000XM5
Should be the sony wh-1000xm5
They’re ok. Not what they’re hyped out to be though.
Did this guy just use a string column to store a UUID in Postgres instead of just using a UUID column?
And ... why is this surprising?
Bro is waffling about 146MB, wtf
So wtf do we do?
Don't use mysql if you plan to scale 😅
me and my homies hate planetscale.
Same. My ass is too lazy to migrate my old portfolio projects off it so they're just dead now.
How many of us with their hobby projects and/or 2 users in production would even care?
We need to not overengineer every solution. Of course you will have downsides with every approach you take.
Let's be real: It's ok to fail, it's ok to not get it right the first time, it's ok to iterate and improve.
I only use vector and document dbs, why should I care?
You say "4X BIGGER!?!?" but i hear "100mb for a million records? That's peanuts".
The SaaS company I work for is racking millions by using UUIDs. So yeah UUIDs 👍
You're comparing the binary size of an int with the text size of a UUID? That's a pretty shitty comparison, don't you think?
Nice shirt
lets use nanoid!
Nice thing to know
Storage is the least of the problems. If storage were a big deal then you wouldn't be able to make these nice videos 😉