UUID vs INT: What’s Better For Your Primary Key?

Database Star

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 15 вер 2024

КОМЕНТАРІ • 115

@Pyrazahn День тому ⁺²³
1:42 if users can change the ID in a URL to access data they are not allowed to, then the fix isn't to switch from int to uuid. You need to implement proper permission checks before serving the request.
There are scenarios where non-guessable IDs won't protect you. For example copy&pasting - most normal users have no idea which part of a URL contains critical data so they will just send the whole thing to a colleage/friend. Or someone bookmarks a page and later their access role is revoked.
@BW022 День тому ⁺¹
Bingo. If you can change ID fields, why can't you change uuid fields? It seems you have the same issue. Correct way is for the middleware or DB to prevent calls in using IDs which you aren't supposed to access. Also, in many DBs, IDENTITY fields are read only.
@DatabaseStar День тому
Great point (and probably one I didn't make that well in my video). It should not be possible for a user to change an ID in a URL to see things for another account or record that they are not allowed to se.
@MissMyMusicAddiction 9 днів тому ⁺²³
we had this debate in another database forum. the consensus was the combo int PK/FK, and UUID for API's.
@DatabaseStar 8 днів тому ⁺²
Good to know! That's a good option.
@mind.journey 8 днів тому ⁺³
I've been doing this on my own without knowing it was an established thing. You really get the best of both worlds
@MissMyMusicAddiction 8 днів тому ⁺¹
@@DatabaseStar this was literally what you suggested, so, this is more like +1 than a suggestion
@MissMyMusicAddiction 8 днів тому
@@mind.journey its more complex, but yes, otherwise best of both worlds
@danielreis58 7 днів тому ⁺²
But if I use both, for example, a users table:
id: bigint,
uuid: uuid,
name: varchar(255)
Considering that I will only expose the uuid to the API, I would have
GET: users/:uuid
Which would be the query
SELECT * FROM users WHERE uuid = param.uuid
If I have a table with tens of millions of users, I would have to create an index for this query, right? Then we fall into the disadvantage of uuid, which is index fragmentation. Since we always will have to search by uuid, wouldn't it be better to use only it as the primary key? Does the performance in joins justify having both, even if they take up more space?
@walkingradiance9556 12 днів тому ⁺⁷
For the sorting pros of integer primary keys, I always add 3 columns at the end--note as a text field for when something needs to be clarified about the data record, date_created, and date_updated. date_created will be set by default to something like systimestamp and date_updated will update to the date time stamp when the record is updated. a trigger will fire with an on update command clause. then in Oracle I add a virtual date_created_or_updated where its date_created if no updates have been before, and date_updated if updates have been performed. I put date_updated to null at the beginning. That way you can sort to see when the records were created like sorting the integer primary keys.
@DatabaseStar 9 днів тому
That's true, using dates when the record was created or updated is a good way to see how old records are or to sort by when they were created.
@angrydachshund 11 днів тому ⁺²³
Another con of uuid4 as your primary key: Normally you will cluster on primary key, but if the PK is a uuid4 then your table will be constantly fragmenting.
@DatabaseStar 9 днів тому ⁺²
That's true, good point.
@avwie132 3 дні тому ⁺²
Which doesn’t matter in most cases
@angrydachshund 3 дні тому
@@avwie132 Sure, but you'd better be sure it doesn't matter in each case. Every PK creates an index for itself, and so a uuid4 PK's index will be constantly page splitting, even if you mark the table nonclustered.
@Joooooooooooosh 2 дні тому
True, but MSSQL has a sequential GUID function that guarantees the GUID is incrementing by packing a time based section at the beginning.
@angrydachshund 2 дні тому
@@Joooooooooooosh TIL
@walkingradiance9556 12 днів тому ⁺⁶
I used integers with MySQL along, and then I was learning Postgres and I learned about UUIDs and I learned the advantages of UUIDs and I became converted/converted to UUIDs.
@DatabaseStar 9 днів тому
Good to know! My journey was similar.
@luqmansen 9 днів тому
Just to be aware of when using uuid with mysql (innoDb engine), it destroys the write performance
@database_tips_tricks 13 днів тому ⁺⁵
Brilliant, I been looking for this info for years, uuid much better,no sequence naming,no central authority to control sewiences
@edgeeffect 11 днів тому ⁺⁵
Beware "better" is relative. What is "good" or "bad' always depends on context.
@ramireznoy 11 днів тому ⁺⁶
Then you are probably doing it wrong.
UUIDs are not a replacement for sequences. Just use them if you want/need to expose entities. So your api uses UUIDs as parameters instead of actual valid ids.
But using integers as primary keys is the most performant way to handle relationships, no question about that.
@DatabaseStar 9 днів тому
Glad you liked it!
@Mirage2020 9 днів тому
@@ramireznoy UUID are better as PK for distributed databases because you avoid collisions between those
@ramireznoy 9 днів тому
@@Mirage2020 If your distributed model allows for collisions, the issue is the model, not the PK.
If you need to put the entities together, then your DB is not distributed.
But even if you absolutely realize integers are not the best solution in your very specific case, not even composite keys, UUIDs are no replacement for integers. Those are just two totally separated use cases still.
@matrixevo88 13 днів тому ⁺⁴
Thanks! I was just querying a database that uses UUIDs as primary keys. I didn't previously know that they are not generated incrementally.
@DatabaseStar 11 днів тому
Glad it helped!
@dany_fg День тому ⁺¹
what I did is retain the default integer id and add another unique field of uuid v4.
this way a merger/sorting can happen and you can give the user at some end point the uuid as an id to reduce security risks.
the only big con about this is the memory usage (both in runtime and storage).
@squintps День тому ⁺²
They are separate concerns, that should not be comingled. Primary keys are there for referential integrity within your database, and should be int for maximum performance. UUIDs are there to uniquely identity a records at higher application layers, or across databases , or for replication - so they should be added as an additional column when needed.
@DatabaseStar День тому
Good point. From my research, and other comments, a valid approach is to use both the INT PK and a UUID (or similar) in the way you have mentioned.
@dyto2287 День тому ⁺¹
@@DatabaseStar Using both Int PK & UUID has some pros but it has some cons of both. It increases row size significantly and looses advantage of preparing data & references before performing actual writes. Also, using ORM or generating models from SQL in application code will require to additionally worry about hiding integer primary keys in API responses. I would rather pick one depending on the table and the amount of writes & reads it will receive.
@squintps 23 години тому
@@dyto2287 If your ORM entities are somehow making it all the way to your API controllers where you have to hide your primary keys, there's something terribly wrong with your application architecture. ORM entities should not make it past your repository layer, and the rest of the application should be completely isolated from having to deal with database concerns such as primary keys and referential integrity.
@dyto2287 13 годин тому
@@squintps In a perfect world yes but in reality when you need to build fast and ship features too much of isolation and guardrails will bite you in the ass slowing down all development. Don't over-complicate stuff before it's absolutely necessary.
@FINALLYQQQQAVAILABLE 2 дні тому ⁺²
3:02 Frequency of collisions depends on details of how UUIDs are generated. Actual probability might be much higher than a theoretical probability that assumes a perfectly distributed random variable. I've seen some very poor implementations. Having some (very rare) collisions is not a problem if a collision doesn't lead into a catastrophe. But a software engineer has to make sure it does not. Sometimes you just have to be sure that the IDs will always be unique and no collision will ever happen.
@DatabaseStar День тому
Great point, thanks for sharing!
@sarkedev 6 днів тому ⁺⁴
1:01 primary keys always have indexes, you don't need to make a separate one.
@DatabaseStar 6 днів тому
That's a good point, yes they do.
@walkingradiance9556 9 днів тому ⁺¹
One thing that relates to a different discussion which is a use a surrogate key not a name or something for the primary key. So say I have my table city with the column city that stores the name of the city and city_id which stores the primary key as an automatically generated UUID (which implementation will differ in MySQL--bin, Postgres/PostgreSQL--UUID, Microsoft (MS) SQL Server--can't remember how its handled, and Oracle--raw(16)). So the surrogate key means we can update the city if we need to. However, I think there should be a term like functional primary key where its unique and not null. Its not the primary key because we want a value that is independent of the primary key, but its important enough and unique enough that it is commonly used to identify something. Examples of "functional primary keys"/natural keys include first name, last name, email address, phone number, physical address, and name. Each of these should be mapped 1 to 1 to the entity. So if I know your first and last name that should map to 1 person. I say should because maybe 2 people have the same first and last name. We could say well then your middle name has to be unique. Well then the father and the son have the same first, middle, last. well then add the prefix like John James Smith 7 is son of John James Smith 6. An email address should have a one to one relationship with a person, but this is often not the case where you multiple email addresses, phone numbers, physical addresses I guess if you own 2 houses, names, etc. The point I'm getting at is I think with a surrogate key there is a loss of how the functional primary key should have a 1 to 1 relationship and be unique and not null, so a new concept like functional primary key could be introduced. It could be that said that its functionally primary, even though its not really primary because if its really primary it should have no relation to the input data. I hope that made sense.
@DatabaseStar 8 днів тому
Good point about the functional primary key and how it's often unique. It's a good idea to separate it from the table primary key. I've heard the term "business key" used to refer to this, or even a "candidate key".
@walkingradiance9556 8 днів тому
@@DatabaseStar I think candidate key is something that could be a primary key. I think there's a difference between candidate key and "functional primary key".
@pravinprince3221 11 днів тому ⁺¹
Thank you for the wonderful video sir, it is so helpful for my team and thanks again sir
@DatabaseStar 11 днів тому
Glad it helps!
@IchwanulFadhli 5 днів тому ⁺⁴
My personal approach would be like this. Every time I make a new table, my primary key is an integer. But when I need to expose the data to the application, I don't expose the primary key. That's really bad! So instead I make another column just to store UUID and I set the default value to "UUID()" so that every time I insert a new data, an UUID will automatically be generated. But my traditional method use varchar to store UUID. I might need to change it in the future since you said it would be better to store it in binary.
Nevertheless, by doing this I can expose the data just by exposing the UUID.
Let me know if there's a better approach than mine, cheers.
@DatabaseStar 4 дні тому ⁺¹
That sounds like a good approach!
@IchwanulFadhli 4 дні тому
@@DatabaseStar Thanks a lot!
@sangramjitchakraborty7845 День тому
So do you index on your UUID column in addition to the PK? I imagine it would be slow to retrieve records based on the UUID?
@IchwanulFadhli 23 години тому
@@sangramjitchakraborty7845 No, it's still on the integer
@walkingradiance9556 12 днів тому ⁺³
I would edit the answer to it depends, but in general use UUIDs or something like that.
@edgeeffect 11 днів тому ⁺¹
You get my "thumbs up" for saying "it depends".
@DatabaseStar 9 днів тому
That's true, it does depend.
@io_inc 11 днів тому ⁺¹
Complex scenario made simple... i shall subscribe dear sir & thank you!
@DatabaseStar 11 днів тому
Thanks!
@bitpilot79 13 днів тому
Thank you, very useful video.
@DatabaseStar 11 днів тому
Glad it was helpful!
@gerardvanwilgen9917 10 днів тому ⁺²
In short, use integers when you can, and UUIDs when you can't.
@DatabaseStar 9 днів тому ⁺¹
Good summary!
@Grumpicles 16 годин тому
Integer IDs are NOT a security risk, if proper application security is implemented.
I should be able to take any ID - regardless of the type of value - and not be able to access the record unless I'm specifically authorised.
And anyone who thinks having a non-guessable ID is a "layer" of security is fooling themselves with "security through obscurity".
@michelvandermeiren8661 11 днів тому ⁺³
By far I prefer int
@DatabaseStar 9 днів тому
Good to know!
@walkingradiance9556 12 днів тому
You could make gen_random_uuid the default instead of putting it in the insert statements at 7:55.
@DatabaseStar 9 днів тому
That's true, I could do that.
@JohnSmith-op7ls 2 дні тому ⁺⁴
Just use a pseudo-random, incrementing number algo like Snowflake. All the benefits of an auto incrementing int without the problems of being able to guess the next one in sequence, worrying about collisions in a distributed setup and much more compact than a GUID, unless you opt for an int64 but you still get the randomness, incrementing, better DB clustering.
Because Snowflake and similar algos are time based incremental, even on a distributed setup you have built in time ordering so you can sort on the PK for newest/oldest time ordering which might save you from needing a date/time column if you don’t need one for other reasons.
@eng3d 2 дні тому ⁺¹
Ditto, Snowflake is cool
@hi_im_julian_kirsch День тому ⁺¹
Thanks for this tip. I‘ve only come across snowflake ids when working with Discords API but deemed it unremarkable.
As a side question: Do you know where to get the „machine id“ from? Is it a „random“ static id assigned per backend node?
@JohnSmith-op7ls День тому
@@hi_im_julian_kirsch Depends on your architecture.
If you don’t have a distributed system then the machine ID can just be some static value you hard code into the back end.
If you have multiple back end components that need IDs such as multiple web services or cloud functions, or just multiple instances of the same web service for load balancing or fault tolerance, it’s probably best to consolidate your ID generation into a dedicated service or function. Each one of these would have a static ID as the machine ID.
If you need super fast ID generation you can even have your ID generator component generate a pool of IDs to pass back to whatever is using them which would cache them and replenish the cache before it got too low, saving the overhead of doing a call for each ID.
Like you could grab 1000 or 10000 at a time.
@DatabaseStar День тому
Great point! Snowflake seems like a good approach, and in my research, many people were also recommending Snowflake.
@walkingradiance9556 12 днів тому ⁺¹
I don't quite understand the disadvantage of integers and the advantage of UUIDs about how integers have to be generated by the database and UUIDs can be generated by the client. I understand, kind of.
@someoneanonymous6707 12 днів тому ⁺²
The idea is that if you use incremental int, you send the request to the backend, in which in turns does some computations and sends the insert sql query to the db, which then checks which integer is next and inserts that (for example id column) and then gives it back to the backend which then it gives it back to you (something like user created with id=42)
In the case of uuid the frontend randomly generates the user uuid as the id column and then sends it to the backend to be put in the db without db need to do anything. Meaning you know in advance the user id without even hitting the database and you dont need db to respond to proceed with what you are doing
@DatabaseStar 9 днів тому
The comment from "someoneanonymous6707" is right.
The integer IDs need to be generated by the database because that's where the record of available integers is stored (e.g. within a sequence). With a UUID, it could be calculated by the database, or could be calculated by any system that interacts with it.
@walkingradiance9556 12 днів тому ⁺²
I think one disadvantage of UUIDs is that they are harder to remember. For the possible security risk of integers, at Marshall University, where I'm a student, I have a student ID that starts with 901 or 903 like 901 234 567. 901 234 567 is a lot easier to remember than 32d7169d-fb3f-4543-8148-94a75641bc55. If I have to remember a value to repeat to someone, I would prefer 901 234 567 to 32d7169d-fb3f-4543-8148-94a75641bc55. Memorizing 32d7169d-fb3f-4543-8148-94a75641bc55 is doable, but not easy.
@edgeeffect 11 днів тому ⁺⁴
UUIDs are for computers not machines... if any system requires people to remember arcane identifiers... it's an awful system.
@DatabaseStar 9 днів тому ⁺¹
Good point, and like @edgeeffect mentioned, humans aren't supposed to remember UUIDs.
The primary key may not even need to be shown to the user, even if it's an integer.
A student table may have a primary key field, which can be either INT or UUID. It could also have a separate student_id field, which is an integer, and can be used when making enquiried about your record.
@tubero911 День тому
Having a predictable PK value can also be a security hazard. It makes your item keys guessable. This is a weakness of auto-incrementing values and a strength of UUIDs. Integer PKs can also potentially make future table merges and table migrations a challenge whereas UUIDs never pose such a challenge (they are globally unique and you will not have conflicts).
@AhmedShoulah 12 днів тому
Great as usual
what if we want using uuid as invoice no ?
it will be more complicated but as pros it will be scalable and avoid conflicts across store branches
@DatabaseStar 9 днів тому ⁺²
Thanks! We wouldn't want to make the UUID an invoice number as they represent different things and the UUID is hard to remember.
The invoice number could (and should) be a separate field.
@bob_kazamakis День тому
I don’t get how little a role ULID played in this video, especially with the conclusion still mentioning v7 as an alternative even though ULID is more compact
@DatabaseStar День тому
Good point. I mentioned it briefly as I had covered the concept and differences of UUID earlier in the video, but ULID is a good option.
@iCrimzon 3 дні тому
Which I use depends on which DB i use, mySQL i use int, cassandra i use uuid
@DatabaseStar День тому
Good to know!
@RonaldoArch 12 днів тому
thanks
@DatabaseStar 11 днів тому
You’re welcome
@walkingradiance9556 12 днів тому
I think UUIDs can be better with sharding replication and backup and migration. I think that's right I could be wrong.
@DatabaseStar 9 днів тому
That's a good point.
@glensmith491 9 днів тому
Unless someone changed the underlying algorithm for auto generated keys in the last decade, int does not garuntee any ordering. In transaction databases with more than a couple of hundred simultaneous users, it is almost certainly not a good assumption to rely on unless you don't care about bottlenecking inserts due to keys.
@DatabaseStar 8 днів тому
When you say "ordering" do you mean "working out the order that data was inserted"? If so, then yes that's right, there could be situations where auto-incremented integers are calculated but inserted in a different order.
If you mean "generated in an ordered sequence of numbers", then many databases do this, using things such as sequence objects.
@teddykayy Годину тому
Int
@donmorris4506 9 днів тому
Nanoid my friend.
@DatabaseStar 8 днів тому
Thanks!
@scott_itall8638 3 дні тому
ULID’s
@DatabaseStar День тому
Yeah they are a good option.
@geemov114 День тому
A good solution is tsid
@DatabaseStar День тому
Good to know!
@lindor941 2 дні тому ⁺¹
Just don't expose your PK/FKs. Always found it bad practice that REST does this. If you need keys on the client side, you're doing something wrong.
@DatabaseStar День тому
Good point!
@walkingradiance9556 12 днів тому ⁺¹
I would say that UUIDs are better. I would say short answer use UUIDs and you won't regret it. if you start using integer keys, you might regret it and have to change to UUIDs later on.
@someoneanonymous6707 12 днів тому ⁺³
Is not as easy as that. The speedup you get from sequential numbers are not that marginal as you would expect especially if you have big monolithic apps. Another thing is space itself. I would actually say you start with sequential first then go UUID. As you probably start with a monolithic app first and then you can split it and make it distributed where the UUID are a bit more used
@edgeeffect 11 днів тому
Beware of thinking anything is "better" in all situations. Everything depends on context.
@DatabaseStar 9 днів тому
Good point!
@wildtomi23 10 днів тому ⁺²
man.. if someone is using an auto incrementing integer as a key in database, it is not a database guy, it is a butcher...
@DatabaseStar 9 днів тому
Hah! An auto incrementing integer for a primary key is valid, but it does have some disadvantages.
@wildtomi23 9 днів тому
@@DatabaseStar I am not sure if you are just trolling or you really do mean it... Auto incrementing integer never is a valid key in a database.. it is a poor man hack for ppl who doesn't know what they are doing.. it is a time based pseudo random value (and badly random) which entangles all inserts into the database.. Just imagine what happens if two ppl are concurrently inserting bunch of rows.. Try to imagine what happens if you have a distributed database.. they get interlocked (or deadlocked) to not get collisions.. such "key" has no relation to the data, so in case of any failure you are not able to recover or reconstruct it... if you need something to index the rows why would you do that by a random value ? Use a hash function if you have to, but any competent db engine does that (and more) for you on the background... and if you are more formal just check what adding of a random numbers to the data does to db normal forms.. maybe you can get ok with atomicity but any higher NF than that breaks instantly... and btw if you really don't have any natural key in the db structure (often because of a bad db model..) then maybe just don't use any key..
tbh uuids don't solve all the issues but at least they drop the time-based randomness for much better randomness, and drop the "incrementing" constrain.. still they are not a true "keys" because they are only statistically unique.. but if done right it is usually good enough.. if you extend "uuids concept" to equivalency classes of uuids it is almost perfect (but well...🤣)
@wildtomi23 9 днів тому
@@DatabaseStar sorry man just wrote you an answer why it is never a valid key (rumbling about time based pseudo randomness, interlocking issues, normal forms breaking, recovery hell, performance hit, concurrency problems.. you name it) but I accidentally deleted that comment and even I am still wondering how useful can be an index on a random value I don't wanna write that again..🥴
@DatabaseStar 7 днів тому
No problem, I still see your comment!
I wasn't trolling - I actually believe that auto-incrementing integers are a good choice for primary keys. But they aren't perfect, as mentioned in the video.
I don't think an auto-incrementing integer is a time-based pseudo-random value. It uses a database feature, such as a sequence, to retrieve the next integer value each time it is called. For example, the first value is 1, the next value is 2, and so on. There's nothing random about it.
Generally, these approaches work for multiple transactions. For example, the numbers generated by an Oracle sequence are independent of transactions, so you won't get collisions.
I don't see how the format of the primary key relates to normal forms though.
You're right that UUIDs can help. They are better for distributed databases and can help avoid locking and dependencies on the database, as mentioned in the video.
@wildtomi23 6 днів тому
@@DatabaseStar Well, second NF basically says that attributes in a row have to depend only on the value of the key... How do they depend on a value of that assigned integer? Someone in the comments tried to fixed that with adding timestamps not realizing that timestamps would be actually a better key...
If several ppl are inserting many rows at once (concurrently) how could they predict (if those are not random) the assigned values?
You say that each time you call a "db feature" it returns different number... but in db you are not doing things sequentially, db tries to reorder data for efficiency and work with them in bulk... but to avoid a collision each part of the db structure (imagine a cluster) needs to synchronize (and wait) and then assign almost one by one those integers...
I mean ok, in the last century there were valid reasons to use such things.. like memory constrains, inefficient db engines (mysql anyone?) and there were a tiny amounts of data in those databases (comparing to today), not having to deal with parallel things and so on.... but today? buy ram, use smart db engine... and focus on proper data structure.. after few years and thousand times more data then planned, you will be happy you haven't had used those autoincrements...

Наступне

Автоматичне відтворення