The Lesson About GUID IDs I Learned the Hard Way

Zoran Horvat

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 21 гру 2024

КОМЕНТАРІ • 239

@FBarbarian 9 місяців тому ⁺⁵⁵
This channel is proof that sometimes the most valuable advice is not on some big channel with millions of subscribers, but on a small channel like this. I cannot express how grateful I am for all your content and advice. You are extremely valuable to the programing community. Thank you so much.
@zoran-horvat 9 місяців тому ⁺¹⁰
Hey, don't speak like that! I want my channel to grow, too :)
@FBarbarian 9 місяців тому ⁺⁴
@@zoran-horvat I really hope you become as successful as you want to be on youtube! I was just thinking that the demographic for advanced programming discussions is not large. That's why we see channels who explain the basics receive way more views than channels who showcase advanced knowledge. If channel views were proportional to the knowledge of the creator I'm sure you'd be in the top 10 UA-camrs already. Anyway, I was just trying to express my gratitude for your videos and everything you teach us. Agan. Thank you!
@adam-xt8te 7 місяців тому ⁺¹
@@zoran-horvat You have my sub mister... and epic name
@codingbloke 9 місяців тому ⁺³¹
Thanks Zoran. A note though that there a number of advantages to a clustered index. Some watching this video might be led to believe that its about insertion performance but there is more to it. In fact I would suggest its about read performance. Many large tables represent transactions (e.g. an order), over time the older transactions are rarely accessed, its the active current ones that accessed concurrently. Having these records physically "clustered" greatly increases the chances that they are already in memory when accessed. It also reduces the read effort a little because the leaf index pages *are* the record pages.
So whether to use a clustered index or not is decision to made carefully and not just on the basis that a developer may prefer to allocate the ID themselves rather than let the DB to it. That said there are approaches that give the best of both worlds, that is, keep the the index clustered and generate keys in the application that highly likely to be sequential.
@zoran-horvat 9 місяців тому ⁺⁵
You have good points here, thanks! Especially the one on reading the data sequentially.
@mannetjie3704 9 місяців тому
@@zoran-horvat sql azure forces you to add a clustered index on your table(s). that said, it doesn't have to be the primary key :)
@zoran-horvat 9 місяців тому
@@mannetjie3704 I am not sure that SQL Azure really forces you to use the clustered index. I'd rather say it takes it as a default, which is the same as in SQL Server. Are you sure about that?
And yes, you can build a clustered index on any column you wish, even the nullable and non-unique columns are susceptible to clustered indexes. The rule that makes sense is that one table can only have one clustered index.
@cdoubleplusgood 9 місяців тому
It depends on the use case. Seeking a clustered index saves the extra key lookup in the main table, but the table itself being the index also means that typically fewer rows fit in one page, and as a result, a clustered index seek causes more IO than a seek in a "lean" index.
The best case is when you have a "lean" index with few rows that is still covering. That might not be possible for every use case.
@zoran-horvat 9 місяців тому ⁺¹
@@cdoubleplusgood That is true, but this video is not about removing a clustered index. It is about a bug that may happen when there is a clustered index. To fix the bug, you must either guarantee the sequence or drop the clustered index. That is the message.
Looking from the other end, then if the clustered index is required (e.g. for the performance benefits it offers in some scenarios), then we must be very careful when generating ID values.
@billy65bob 9 місяців тому ⁺¹⁵
Protip for people who want to drop primary keys:
If you make a change to it in SSMS' Table designer, you can have it generate a script, and it will output all the corresponding drop/create statements for you.
I use this a lot when I clean up databases, as it's often faster to reverify foreign keys (minutes) than to leave them active during mass data operations (days).
Especially if you have foreign keys that aren't a primary predicate in your indexes, which is often the case.
@zoran-horvat 9 місяців тому
That is a very useful tip, thank you!
@billy65bob 9 місяців тому
If I had a wishlist for SQLServer, right at the very top would be some sort of mechanism that optionally deferred validating referential integrity until commit.
@zoran-horvat 9 місяців тому
@@billy65bob That is an interesting thought. Do you know of some other database that can do that already?
@billy65bob 9 місяців тому
@@zoran-horvat I would not be surprised to learn that OracleDB does.
But some googling suggests that both OracleDB and PostreSQL supports 3 levels of it on all constraints.
Those levels being Never (e.g. UNIQUE), optionally during Statements (e.g. bulk insert with FOREIGN KEY to self), and Until Commit.
With the level being set when the constraint is defined.
@lettuceturnipthebeets790 9 місяців тому ⁺¹⁹
love your sense of humor Zoran! 😊
@torrvic1156 9 місяців тому ⁺³
Thank you Zoran! Nice video. I am always trying to avoid usage of this GUIDs and use ints but sometimes I’ll have to use this.
@jordanfarr3157 9 місяців тому ⁺⁴
I cannot believe how much I learned in 15 minutes. This is one of your best.
@zoran-horvat 9 місяців тому ⁺¹
Glad to hear that.
@arkord76 8 місяців тому ⁺²
Thank you very much for this video, Zoran! I was not aware about the nature of clustered indices. The explanation helped me to revise the DB configuration on a current project I am working on. Very helpful!
@OscarAgreda 4 місяці тому ⁺¹
Thank you for making this topic accessible! The explanation on the performance pitfalls of using GUIDs with clustered indexes was particularly insightful.
@esra_erimez 9 місяців тому ⁺⁵
The brilliance of this video cannot be overstated. Well done.
@geofftnz 4 місяці тому ⁺¹
A lot of data to move when inserting rows in the middle of the clustered index -> this is just a page split, and only then when a page is full. Not super-ideal, but there is an upper-bound to the cost.
@GiovanniCostagliola 9 місяців тому ⁺³
Thank you for sharing, Zoran.
As you're aware, removing the cluster primary key entails additional overhead for each key retrieval. While this action resolves index fragmentation, it introduces an I/O penalty with each key access.
An alternative solution I propose is the adoption of ULID or UUIDv7 (with a personal preference for the former). These options ensure the monotonicity of generated GUIDs. This quality not only renders clustered primary keys feasible and time-sequential processing effective, but also enriches the ID with additional time-information that could prove beneficial in some scenarios.
You would saved a lot of migrations ;)
@zoran-horvat 9 місяців тому ⁺¹
Yes, the use of a UUID standard that guarantees sequence is preferable, but be careful if the system is truly distributed. The monotonicity could fall off node clocks are out of sync.
On the other hand, a clustered index forgives small discrepancies, i.e. the performance loss will usually be immeasurable if it only has to move a few rows at worst. That is another hidden gem of clusters indexes.
@mangodude-nq6su 9 місяців тому ⁺²
Your voice and manner of talking are amazing. Thanks for valuable advice
@zhenglaowang8489 9 місяців тому ⁺¹
I just want to emphasize how professional and helpful this video was. Please keep it up and thank you. 👏
@TimSchmidt-k4u 3 місяці тому ⁺¹
I am entranced by your voice, @zoran-horvat! It was like you were telling me a story. I could listen to you all day long.
Also, great informational video :)
@zbaktube 9 місяців тому ⁺³
Could the original setup work, but with Ulid instead of Guid? So we can keep the clustered index?
@nessitro 2 місяці тому ⁺¹
Clear and concise explanation; best video I've seen in quite a while 💯
@manmohanmundhraa3087 2 місяці тому
In summary, for guid type primary key column it should not be cluster index. Because sql server will generate sequential guid which causes performance issue. And this performance issue is nothing to do with entity framework. Also it doesnot matter whether app or db generate guid.
Curious to know whether all kind of db have this kind of issue or not? Let me know if i misunderstood any part.
@zoran-horvat 2 місяці тому
@@manmohanmundhraa3087 The problem is universal - a clustered index requires a sorted key sequence to remain performant.
@wsollers1 9 місяців тому ⁺¹
I love these videos. While I don't code in c# I do find them remarkably informative and useful.
@AlexUkrop 9 місяців тому ⁺¹
Really useful as I created GUID PK clustered as SQL suggested by default wrongly. Thanks a lot.
@tomtoups 5 місяців тому
Following the advice of this video, I asked the DBAs about this at my work, and I was surprised to learn that they don't rebuild indexes in production. They said index fragmentation has a negligible performance impact when the storage medium is non mechanical. Statistics, however, are still important. So random guids may take up more space, but they don't think it's a big issue as long as you give the index plenty of fill factor
@zoran-horvat 5 місяців тому
@@tomtoups My experience is also that the impact is negligible. I never saw a performance issue with GUIDs. Therefore, if the IDs are generated by the client, this is my default decision. It never failed.
@zalmar5855 5 місяців тому ⁺¹
The problem is not the index fragmentation after insterts. The problem is the page splits during insterts. It's easier to insert at the end and just append to an existing structure, than to insert somewhere in the middle, and having to rearrange the entire structure.
Fill factor is always 100%. It should not be any lower unless there is some weirdly specific reason.
@zoran-horvat 5 місяців тому
@@zalmar5855 Your statement is ridiculously wrong. You never tested it in reality. In Oracle, for example, the default organization is heap with B-tree index. Have you ever heard anyone saying Oracle "straight up murders performance of every query", as you have put it? Of course not
@MixturaLife 9 місяців тому ⁺⁵
Fantastic! Your voice makes it sound like fairty tale)
@PhantasyAI0 9 місяців тому ⁺¹
Thank you Zoran, indexing has been something which very little videos exist for EF core. Please if you can create more videos explaining indexing, so that when your database gets large you can insert and read data in a performant way?
@zoran-horvat 9 місяців тому ⁺¹
There are many channels about SQL Server for example where you can find excellent videos on those topics. I am not sure I can explain indexes and performance even remotely well as in some videos I watched in the past.
@PhantasyAI0 9 місяців тому
@@zoran-horvat I have a question is this problem which you mentioned, only happen when you let the database create GUIDs as well as creating your own? How does this apply if you always create your own GUIDs, and when doing so you you have them as non clustered?
@admindravid482 9 місяців тому ⁺¹
Thank you so much...I encountered this exact problem....when the table got big...
@zoran-horvat 9 місяців тому
That is when you find out...
What were your timings? How long would an insert take when you noticed the problem?
@Krauttrooper 9 місяців тому
Alter table is a DDL instruction and can not included in a transaction. Only DML is protected by a transaction.
@zoran-horvat 9 місяців тому ⁺²
DDL statements in SQL Server exexute in a transaction, with a few exceptions like CREATE/DROP/ALTER DATABASE and the same for FULLTEXT INDEX.
You can also mix DDL and DML statements in the same transaction.
That is not limited to SQL Server. Other databases also support transactional DDL, most notably Oracle.
@Krauttrooper 9 місяців тому ⁺¹
@@zoran-horvat didn't know. Thanks
@eveneveneven 9 місяців тому
That's not true for SQL Server is it?
@zoran-horvat 9 місяців тому ⁺¹
@@eveneveneven SQL Server supports DDL in transactions.
@coolY2k 9 місяців тому ⁺²
But what if the database is read-heavy, and you search data by id: guid?
Yes, insert would be slow O(long n), insted of O(1), but the read would be O(log n), instead of O(n).
It's a trade off.
@zoran-horvat 9 місяців тому
It is a tradeoff, but not with those asymptotic complexities.
The insert of out-of-sequence IDs into a table with the clustered index takes O(n) time and read takes O(log n).
The insert and read from a table with a common key is O(log n) in both cases. So it is actually O(log n) vs. O(n) tradeoff, which becomes tragic on any table with more than, say, 100k rows and absolutely unsustainable on a 1M rows table.
The point is that there is no arithmetic joggling that would make the relocation of half a table in every insert even remotely performant as anything else a man could imagine, lest the table is of a very small size, hence not important in any analysis.
@coolY2k 9 місяців тому
@@zoran-horvat > The insert of out-of-sequence IDs into a table with the clustered index takes O(n) time
Ok. Fair enough finding where to insert data is O(log n), but actually moving everything and inserting is O(n).
For tables where I expect a lot of rows, and where I'm not going use id:guid as most used where condition I don't think it's a problem.
It would be fun to see what are actually read and write speeds for 1k, 10k, 100+k rows table with clustered and not-clustered id:guid.
@desertfish74 9 місяців тому ⁺²
How,do other databases such as Postgres or even SQLite handle this?
@zoran-horvat 9 місяців тому
From what I know, Postgres also supports clustered indexes.
@cdoubleplusgood 9 місяців тому ⁺⁴
Postgres has something similar, but it is not applied by default. By default, the PK is a "normal" index.
Similar for Oracle. They have something called an "index organized table". But that is also not used by default.
@fernandocalmet 9 місяців тому ⁺¹
Perfect explanation, very kind to share this lesson, big thanks.
@sfif4585 8 місяців тому ⁺¹
Thank you so much for your work. You are really amazing teacher.
@PelFox 9 місяців тому ⁺¹
Do you recommend making some other column with an incremental value as clustered or is it fine to have a table without it?
@zoran-horvat 9 місяців тому ⁺¹
That would be an optimization and, as such, it is valid in certain scenarios.
I have intentionally tried to avoid getting into the details of measuring and optimizing read and write performance because that would make it an hour long video at least, but it seems that the question of performance is inescapable...
@ml_serenity 4 місяці тому
As an alternative that let you generate PKs with a clustered index, and it's almost guaranteed to not cause page fragmentation is to use Snowflake IDs.
@nikize 9 місяців тому
0:47 There is no guarantee whatsoever that there will never be collisions 💥 however, it is unlikely.
Guids is 128 bits, some of these bits has defined values and can not change, and there is several issues that can make them non unique, the default new Guid implementations are even predictable.
@Dalamain 4 місяці тому
Thanks, very enlightening. So what will be the default ordering of tha data at the table/page level now that we do not have a clustered index? ... I guess it doesn't concern us since we now have a non-clustered index on the GUID column, so we're able to find a GUID efficiently.
@zoran-horvat 4 місяці тому
@@Dalamain There is no ordering at the table level in that case. It is just a heal, and rows are found using a common index external to the table.
Some definitions of GUIDs produce ordered sequences, which may bring clustered index back into the game. However, you might need to ensure to only generate them on one node, which defeats the idea once again.
@Veretax 9 місяців тому
Yeah I'm glad that you showed this because I remember that when I was a junior lots of my co-workers thought that you would just generate the idea and this is before everything went to Goods because then you wouldn't have to find a way to look up what you just saved or find a way to get it to return the good from the database but when you have a distributed system this becomes a problem in part because if every client is creating a good there is a slight possibility of a collision it's extremely rare but depending upon the algorithm on which the idea is being created yeah you can have a collision it is more common in integer Auto incrementing what keys then g u i d paid guid has sort of become the standard to avoid this but again because the way networks work I would rather the server to find these values than the client 99% of the time. Because otherwise you might have values to get created that could be security issues a function which would then do an insert that would add something with a good then you might have data being added. Net has done a good job of trying to sanitize these things and make it harder for it to happen but it's one of those things that's always been in the back of my head that there's a reason why some things are best left on the server
@zoran-horvat 9 місяців тому
The possibility of a collision in GIIDs is truly negligent, so you shouldn't count it in. There are issues that happen far more frequently in distributed databases and we still have the means to handle them.
@Veretax 9 місяців тому
Yeah I feel like I didn't really word what I was trying to say very well.
@zalmar5855 5 місяців тому ⁺¹
Please do not remove the clustered index from a table, it will straight up murder the performance of every query. Removing the clustered index makes the table a heap, also clears all plans from the cache that use it. and forces SQL server to recompile the queries, leading to brutal CPU pressure, while the table being the heap means SQL server will have to read and cache more pages, leading to memory pressure. Unless you want your DBAs to murder you, please do not remove a clustered index from a table, especially if your root problem is coming from the table being too large.
Use a non-clustered index on a GUID that uniquely identifies a record between systems, and keep some kind of incremental index (GUID or Identity int) for the "internal" clustered primary key. You don't even have to store the external GUID on the table if you don't want to, you can create a mapping schema or mapping table. The purpose of using GUIDs for unique identification between multiple/distributed systems is to remove dependencies from each other. Don't change the DB structure to accomodate this.
@zoran-horvat 5 місяців тому
@@zalmar5855 Another commenter said DBA confirmed no issues ever noticed after removing a clustered index. Now figure that.
I also confirm there is no performance penalty except in a few extremely specific corner cases.
@bfrytech 4 місяці тому
What RDBMS are you referencing?
@krm1t 9 місяців тому ⁺¹
This does not seem like a very great advise - Unless you have a small database or don't care about select/update/delete performance.
Imagine having a really large warehouse and when ever someone gives you some new data to "insert" you put it in a box and label it with your primary key.
Since your key is ordered (typically ascending) you need to place the box at the right spot. If your key comes after the last box then you just put the box right at the end but if it's somewhere in the middle you now have to move all the boxes (this is called page splitting) that comes after your key to make room for the new one.
Now imagine that same warehouse but without the clustered index.
You get some new data but instead of putting the data into a box and labeling it you just put data on what ever shelf where you can find room.
You are not even assured that all the data is in the same place. So when your "client" wants their data again you have to run all over the warehouse to gather up the pieces in order to get them their data.
To put it simply. This fragments data on the disc making it slower to find the data for selects, updates and, deleting.
The real solution to this problem is to make an integer or bigint column as the primary key and then make it an identity column.
If you insist on having a Guid as your identifier then add that as a separate column with a nonclustered unique index.
@zoran-horvat 9 місяців тому
First of all, my advice is NOT that you should not use a clustered index, as your comment seems to imply. The video is about a very specific situation when an application generates GUIDs using an algorithm that does not guarantee the sequence and there is a clustered index. One must go.
Regarding your analysis, the actual performance depends on read and write patterns. Neither the clustered nor non-clustered index is better in advance without knowing the statistics. Your analysis does not generalize and, hence, does not lead to any conclusion.
@krm1t 9 місяців тому
@@zoran-horvat what pattern are you talking about when the sql server just sticks the data wherever it can find available space on the disc?
@zoran-horvat 9 місяців тому
@@krm1t This what you said is too wrong to answer to.
@krm1t 9 місяців тому
@zoran-horvat sry it seems UA-cam decided to not show all of your message. A refresh added the first part which I hadn't read.
That is true. You don't say it's a general advise. That's a misunderstanding on my part.
@MetronSM 9 місяців тому
I'm not that fit in databases. But does that mean that I can basically generate my (gu-)ids in my software and use those ids to link entries of different tables before storing the data in the database? We ran into the problem that in Entity Framework 6, when referencing objects, we have to transfer large parts of the database because of the direct object links. The clients don't have direct access to the database but go through a server which handles the communication with the database. Being able to directly generate the ids myself would help us to mitigate some problems we have during multiple transfers.
@zoran-horvat 9 місяців тому ⁺¹
If I got your description right, that is precisely the scenario where we choose to let the application generate IDs. The application then has the control of the times when it creates objects and times when it does then to the database - those two don't have to happen simultaneously.
It is also useful when sending objects between systems, as well the systems will associate the same object with the same ID.
@MetronSM 9 місяців тому
Thanks for the reply@@zoran-horvat
@kocot. 9 місяців тому ⁺¹
TLDR; if you can't ensure ascending GUIDs order, don't use clustered index, you're welcome.
@zoran-horvat 9 місяців тому
Watch the video with more attention. You're welcome.
@kocot. 9 місяців тому
@@zoran-horvatsorry, I've rephrased it, now it's both more precise and shorter :) don't get me wrong, many are interested in a full video, but most don't have 15min+ just to figure out what the whole thing about and and worth spending time, unfortunately it's rare that youtubers provide sufficient description, or even tray to bait with incomplete titles
@zoran-horvat 9 місяців тому ⁺¹
@@kocot.It is not possible to state the problem precisely in 50-characters title, nor in the description for that matter. My channel consists of advanced-level videos and trust me, quite a few programmers watch them very carefully. I expect them to because I don't know how to pour an advanced piece of knowledge into a head that lacks patience. Look at me - I have been learning programming for 35 years now.
Regarding the problem addressed in this video, the assumption is that the change cannot fit the existing table with data in it. That is a regular case in practice. Either the IDs are coming from another system that just didn't care to make them sortable, or the table already contains a lot of data, like a million rows, produced with a sortable UUID generated by a different algorithm, so the new algorithm does not extend the existing sequence.
If either is the case, we must make cuts. One is to give up the clustered index, and that decision is tied to a few other considerations, such as write and read statistics. The other option is to recreate the table and generate new IDs using the new algorithm. Both decisions are hard to make, and this video does not suggest which one is the right one. Actually, chances are that it will be the DBA to decide.
The upcoming video, which is in the publishing queue right now, will demonstrate both possibilities on two tables in the same database.
@Dragonet17 9 місяців тому
What do you think about ULID lib ? Does it solve the problem ?
@zoran-horvat 9 місяців тому ⁺¹
There is the new video demonstrating that possibility: ua-cam.com/video/MJcaxi3bOO8/v-deo.html
The caveat is that ULID will not solve the problem if the table already contains many rows. ULIDs are sorted left-to-right, where GUIDs used by the SQL Server have a different structure and sorting. Therefore, the sequence of ULIDs would not extend the sequence of IDs already in the table, once again causing the relocation of great numbers of existing rows upon every insert.
@Dragonet17 9 місяців тому ⁺¹
@@zoran-horvat thank you for you answer, I've already started watching your latest video, I'm glad that answer is there
@zhenglaowang8489 9 місяців тому
Actually I have a question. I am currently working with EF Core using a .NET Core project, and I want to be able to prevent values below 0 from getting inserted/updated into a database table (basically expecting an exception thrown by EF Core in that case). I have read that there is a "CHECK" clause to enable us to validate a value before updating a column, but chieving it with EF Core seems to be difficult to do. I have noticed there is a Fluent-API call within EF Core module "HasCheckConstraint()", which seems to be the way to go, but it's marked as "[Obsolete]". I want a method that I can use together with migrations. I have also tried the "[Range(0, int.MaxValue)]" annotation, but with no success. Any idea as to how I can do it, and in an elegant and robust way, supporting migrations with EF Core? THANK YOU!
@sazarkevich 3 місяці тому
Can anybody explain why SQL generated GUIDs are "in growing sequence" (time - 3.53) ?
I see they has random part in first place, so it will not lexicographically sequential...
am I missing something?
@nothingisreal6345 2 місяці тому
Newsequentialid() as default
@xorlop 9 місяців тому ⁺¹
Wow this was a cool video to learn about. Has general applications anywhere :)
@reikooters 9 місяців тому ⁺¹
You said what you did, but you didn't say why you did it. Why don't you want sequential guids? Where are the benchmarks? Before the change, what was the insert and select time and the query plan generated? After you made the change, what was the insert and select time and the query plan generated? Why did you decide you don't you want a clustered index? You may speed up writes but surely reads will be slower.
@zoran-horvat 9 місяців тому ⁺¹
It depends on what you think that I said. I see that many viewers didn't see the central message of the video, which is probably my fault in passing that message.
So, the message is: IF the application has been generating IDs for some time AND there was a clustered index THEN... we begin.
The video doesn't question any of the decisions because those are driven by entirely different forces. It only questions the combination of decisions.
@reikooters 9 місяців тому
@@zoran-horvat That does help to clarify the starting situation, and is still interesting as I definitely think others have been in this situation, although I do think it would've been nice to see some details on the improvement that the change made
@PANYNY-mob Місяць тому
Thank you Dr. Xavier, u helped a lot
@Buutyful 9 місяців тому ⁺¹
hi zoran, thank you for all this knowladge, is there a video in your channel that talks about domain modeling design?
@zoran-horvat 9 місяців тому ⁺¹
In many videos I cover different aspects of domain modeling, though only a few videos mention that intent in the title.
@andersjuul8310 9 місяців тому
Brilliant! I didn't know there was a problem (and always used app generated guids before). One day a colleague would have suggested db-generation and I would have accepted if not for your caution...
@zoran-horvat 9 місяців тому
It's the other way around: database-generated IDs are normally safe, because the tools would set up the database correctly. It is the application-generated IDs that may sometimes cause trouble if not aligned with how the database is created.
@lettuceturnipthebeets790 9 місяців тому ⁺¹
I have an important question
you've written a custom migration there, which means that in order to squash all the migrations in the future, your colleague would have to find the custom code among the sea of other auto-generated migrations and then recreate it later
from my experience, it's a huuge pain in the butt
how would you deal with custom migrations? is it ok to just put them in a different folder (under same namespace)? or is there a more elegant way for that? thanks!
@zoran-horvat 9 місяців тому
There is no difference between custom and generated migration code. Each migration is immutable and EF will not write into it again.
As I pointed out, this is not the problem caused by EF. I have been using other migration tools before and they had the same problem, that you must add custom migration code to make a certain change.
On the other hand, the whole issue only happened once I made a bug in the earlier configuration.
@lettuceturnipthebeets790 9 місяців тому ⁺¹
@@zoran-horvatI understand that migrations are immutable once created, but how do you squash migrations that you cannot recreate using ef core? by squashing I mean creating a big migration from hundreds of smaller ones (which will get the custom migrations lost if not careful)
@PaulPendor 9 місяців тому
@@lettuceturnipthebeets790I’m trying to understand your question. What Zoran did was generate a migration, and then add to it. You can add whatever you want to the migration, even down to custom SQL statements. But that is the complete definition of your migration. If you later want to create a migration script, say an idempotent script which you will generate and apply within a CI pipeline, all you’re doing is combining the migrations you have defined. The migrations aren’t generated again. You’re just combining, or squashing if I am understanding your question correctly, into a single SQL migration script.
@brianm.9451 9 місяців тому
I didn’t know that EF Core and SQL Server generated sequential GUID’s when properties are empty. I started defaulting back to using long’s over guids. I prefer the sequential nature of it and clustered index is important for me.
@zoran-horvat 9 місяців тому
When a clustered index is required, then the number of options is limited. That is the case in systems with heavy inserts.
The way EF Core handles that is to initialize the table field with NEWSEQUENTIALID(), the same way as we would do when writing the CREATE TABLE script by hand.
@pyce. 9 місяців тому
I would keep an auto-incremented long as primary key (clustered) and a GUID with unique constraint (unclustered).
@zoran-horvat 9 місяців тому
That is one possible design. But only one out of several valid ones. We should always keep the requirements in mind, both functional and nonfunctional, when making design decisions.
For example, a clustered index can cause write congestion in some scenarios and work like a charm in others. A table with a common scattered index can cause read delays in other scenarios. There is no one size that fits all.
@MrWTFNETWORK 9 місяців тому
Why not just have an auto incremented long as Primary Key (generated by the db) and don't even have the Guid?
@zoran-horvat 9 місяців тому ⁺²
@@MrWTFNETWORK Because it might be that the application is generating ID and database is only storing it. That is frequently the case in distributed systems. In such a situation, autoincrement numbers are not an option.
@MrWTFNETWORK 9 місяців тому
@@zoran-horvat you made me think of a scenario where I store invoices and invoice line itens, 1 to Many relationship and we were using guid for primary Key of both tables. In reality to acess line items I see that the line items table could instead have a composite primary Key as invoice Id and line item id (incrementing integer generated in memory instead of guid as there can only be max 100000 items), what do you think about this approach? We only need to get 2000 items in a paged way until we allow clients to get all line items via API. We would use a continuation token being the id of the last item and always sort by the id
@stevehancock2396 9 місяців тому
Hi Zoran, thanks for the video on this subject, certainly interesting. The only thing that I'd suggest would be a little more focus on the database itself, regarding default column values and the using NewID() and NewSequentialId() functions. While EntityFramework can take care of this, if you're using EF to generate your database, this isn't the case if either using another ORM, or working against an already existing database. Just felt that this bit of detail had been glossed over.
@zoran-horvat 9 місяців тому ⁺¹
Yes, that could be a value topic for another video. I thought that covering other aspects in this video would hurt its purpose.
@artou5226 Місяць тому ⁺¹
blaaast!... I've got an error...
love it^^
@MrValhallans 9 місяців тому
You note a performance impact on using application generated GUIDs on a clustered index, however I would love to know what sort of impact that might be. For instance against a 1 million row X 10 column table, just as an example to get the percentage differance on read and write ops.
@zoran-horvat 9 місяців тому
The percentage doesn't matter. It is the fact that an operation that takes near-constant time would be substituted with an operation that takes time proportional to a half of the table's size on average. One simply does not do the latter.
@MrValhallans 9 місяців тому
@@zoran-horvat I take your point that its is self evidently the correct approach, however I have applications with this issue and I need to prioritize my time. If I go back and spend the dev time on the migrations will I see a 5% or a 40% perf increase. Thanks for the amazing videos, big fan 😁
@zoran-horvat 9 місяців тому ⁺¹
@@MrValhallans The point is that when this issue manifests, it makes every row insert a pain so obvious that measuring it makes no difference, especially knowing that every next day will only make it slower. How do you prioritize work on a system where inserting a row takes a second?
@adam-xt8te 7 місяців тому
I wonder why people prefer GUIDs instead incremented integers as PKs.
Doing emergency hand-corrections in DB with FKs cause headache then.
@zoran-horvat 7 місяців тому ⁺¹
An application cannot generate an autoincrement number. It must rely on the database to communicate the ID back to the application. Not to mention that autoincrement anything is almost invariably out of the question in distributed systems.
Now list ten situations that come to your mind where waiting for the autoincrement number is an issue, and you will have ten use cases for GUIDs. It shouldn't be hard.
@adam-xt8te 7 місяців тому
@@zoran-horvat Adding related data to many tables at once requires knowledge about PKs and FKs, but I always relied on EF or SQL's SCOPE_IDENTITY then and never set them by hand.
Ok, it's just easier having keys before and after the action.
@jaiderariza1292 9 місяців тому
The problem occurred when the app needs to create its own GUIDs, but if I understand you correctly, there is no problem with letting the DB generating the GUID + clustered, if the app does not want to create GUIDs, and will perform better. correct?
@zoran-horvat 9 місяців тому ⁺¹
The problem was with IDs generated out of order when there was a clustered index. The database generates GUIDs in the ascending order when instructed, and so the problem would not occur then.
However, that is not to say that database-generated IDs or clustered index perform better. The overall performance is also the result of read and write patterns exhibited by the application.
@OzgulEzgin 9 місяців тому
Thanky you very much Zoran? Is it possible to create sequential Guid IDs in the Application so we don't have to give up on clustered index?
@zoran-horvat 9 місяців тому ⁺³
There is a class that apparently does that (learn.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.valuegeneration.sequentialguidvaluegenerator?view=efcore-8.0).
I must admit I have never used it. To make it worse, I have never even read is documentation carefully. The critical question to answer is whether the sequence remains ascending if you move the application to a new server? Though there exists the standard definition of GUID that puts a timestamp at the front, thus making any sequence of GUIDs ascending, there is a nasty corner case when the two servers are out of sync with time. The problem is the same as when using a timestamp to version entities where it may sometimes lead to lost updates!
For all these reasons, I normally let the database generate IDs when a clustered index is a requirement (in systems with heavy inserts, e.g.), and drop the sequential IDs request otherwise.
@OzgulEzgin 9 місяців тому
Thank you so much @@zoran-horvat
@devmarkmonster 9 місяців тому
Just wondering how many deployment systems actually use the Down migration. I haven’t seen them yet, mostly a idempotent migration script is generated for the Up. Am I missing out?
@zoran-horvat 9 місяців тому ⁺²
It is a safety valve in case you suddenly figure something out after deploying to production. I remember events of that kind in my career. Very rare indeed, but when it happened it was always driven by panic reactions at the customer, so they would request immediate rollback of the deployment and only then an investigation.
@DavidBeaumont 9 місяців тому
Interesting. Why is it better to want to switch to the non clustered approach? Why not let the DB continue to create the ordered GUIDs itself?
@zoran-horvat 9 місяців тому ⁺¹
It is not the question whether it is better or not, but which node in the distributed system is in charge to generate or, let's say, know the value of the ID.
If it is an applicative node, a service, or anything of that sort, then keeping the storage in charge of generating IDs requires an additional trip back to the application before it can continue doing what it planned.
It is also hard to rely on the database in polyglot systems where some of the storage nodes might not even support generating IDs.
@JohnSmith-op7ls 9 місяців тому ⁺²
This is why you just let SQL Server generate the GUIDs for you in a simple setup or if you have a clustered DB setup or need to generate IDs in the application for other reasons, you don’t bother with GUIDs and instead use a better, sequential ID algo like Snowflake or some variant.
Not only is it more compact, saving data file and index space, but you can keep a clustered index in cases where your reads will be more frequent than writes, both of which will improve performance and lower costs.
@zoran-horvat 9 місяців тому ⁺⁵
If it only were "just"...
@nothingisreal6345 2 місяці тому
Do not have a table in SQL server without a cluster index!! Select performance will be degraded! If you want a guid id column for a primary key use newsequentialid() to generate it as default value. That is why code first is tricky! You must understand the db.
@zoran-horvat 2 місяці тому
@@nothingisreal6345 You are selling one possibility as a general solution, using shaky argumente that are easy to prove wrong.
BTW how come that that silver bullet of yours doesn't apply to Oracle?
@nickbarton3191 9 місяців тому
Some years ago, I worked on a project using SSRS; I was writing T-SQL scripts to improve report turn-around constantly. Scary stuff and absolutely hated it.
How does non-clustered indexes compare in performance in-the-large ?
@zoran-horvat 9 місяців тому ⁺²
It depends on what you need. Someone pointed out how fast they are if we tend to read the rules in chronological order, which is the central request in some applications. The clustered index will then make such reads substantially faster than the one with rows scattered all over the data pages.
On the other hand, most of the business applications do not have such a request. We usually read the routes as per request, and requests are randomly driven by end users. That makes clustered indexes less important, even to a degree that maintaining them becomes a liability with no benefit in return.
@nickbarton3191 9 місяців тому ⁺¹
@@zoran-horvat Yes, I found that clustered indexes worked faster with time-ordered data; happened to be quantities of mushrooms collected at distributed farms per picker over a week and aggregates per farm per cost-center.
@kasperlaursen7005 9 місяців тому ⁺³
I usually avoid using guids as primary keys in the database completely.
I assign standard integer database owned primary keys and add a second id column that is the guid.
Thoughts on this?
@zoran-horvat 9 місяців тому ⁺⁸
Then you lose the greatest benefits of using GUIDs - the ability to generate the value before persistence and the ability to persist entities in the storage that does not support generating IDs.
@billy65bob 9 місяців тому ⁺¹
@@zoran-horvat Such a setup doesn't preclude the benefits of application generated guids.
Applications with such a setup are usually ignorant of the incrementing Ids when they're not resolving foreign keys in their queries.
It does come at the cost of requiring 2 indexes at minimum; a clustered one for the incrementing ID, and a unique nonclustered one for the GUID; which does have some performance implications.
@zoran-horvat 9 місяців тому
@@billy65bob Yes, a better formulation in my reply would be "you don't take these benefits for free".
@AndrewRussell-w4l 9 місяців тому
Hi Zoran, thanks for another interesting video. I like to add a clustered index on another column which I make as a shadow property in EF and make this and identity seed to ensure the order. What are your thoughts on this approach?
@zoran-horvat 9 місяців тому
I have used similar tricks in the early days, but then found out that I favor simpler configurations... Though, the idea you mentioned will work.
@berndeckenfels 9 місяців тому
Why would sql server sort by the 4th segment?
@zoran-horvat 9 місяців тому
The structure of UUIDs is defined by the corresponding standards. If you need to analyze the values, then you take the parts as defined in the standard. There is no freedom to decide.
@berndeckenfels 9 місяців тому
@@zoran-horvat yes but the sql server does sort the bytes/characters in order of left to right, and normal v1 uuids like you showed do not sort chronological this way. V6 are better in this regard.
@Galakyllz 9 місяців тому
Resetting the UUIDs to UUID7 would be all that you would want to do. Removing the clustered index is not the best solution when we have UUID7.
UUID7's initial bytes are sorted by a timestamp and then the right side is random, allowing you to sort them and immediately tell which ones were created first-to-last. If these are meant to be customer-facing and being able to tell when an entry was created is a security concern, then the primary key should be an auto-incrementing integer (with a clustered index) and a second column should exist with a UUID4 (with a regular index). As a side-note, including a UNIQUE constraint on the second column would cause unnecessary overhead.
@zoran-horvat 9 місяців тому
That is true, with a caveat. What used to be encapsulated in the database, both creating and consuming IDs, now becomes the problem of someone else. It is entirely possible that a future update, such as accepting IDs from the node or service which was not entitled with that before, could break the rule causing a surge in performance.
I would advise to either keep the dependency in an algorithm in one place (database), or to not have it at all.
@Galakyllz 9 місяців тому ⁺¹
@@zoran-horvat True, either the database needs to have an implementation of UUID7 that it uses to generate the IDs or (when the UUID7 must come from an external source) all pathways that result in an insert into that table must go through the same abstract layer that generates a UUID7 on the caller's behalf. Most systems have a "repository" layer that would suffice but once you have more than one system with access to insert into that database, each one becomes an entrypoint for bad (non-UUID7) IDs.
Creating a shared library for interacting with tables of that database would be my next path forward to greatly reduce errors.
@gunnarliljas8459 3 місяці тому
Not good in SQL Server, since there you want the LEFT side to be random.
@jarosawfrankowski7845 9 місяців тому ⁺¹
Great content!
@ScorpionJL 9 місяців тому ⁺²
This is one reason I do not use migrations for SQL Server and instead go with a database first approach. The VS SSDT project allows for publishing changes to the DB schema and handles all the drops and creates needed. It's especially useful when iterating changes over a dev database and then pushing them all to a prod one when ready.
Also, removing the clustered key results in a table classified as a HEAP and has it's own pros and cons. If you only ever lookup via the GUID and never do a range lookup there won't be much difference, but if you ever need to lookup a range of rows in a table, you'll want to create a clustered key. Also, depending on the concurrency and throughput of the application, sometimes a clustered GUID column may help in the long run, as long as you are doing database maintenance such as index rebuilds or defrags on a consistent basis.
@mcdaddy1334 9 місяців тому ⁺¹
Very Nice!
@DavidSmith-ef4eh 6 місяців тому
there must be some downside to using non-clustered indexes... maybe queries or sorting become less performant
@zoran-horvat 6 місяців тому ⁺¹
It is a tradeoff, like any other decision. However, *if* IDs are generated from the application, then the clustered index is most probably not applicable.
@DavidSmith-ef4eh 6 місяців тому
@@zoran-horvat I am totally new to this. I am even confused why they use UUIDs instead of numeric increments. I was thinking about using csharp for part of our backend administration panel, but we mariadb. I assume that would be possible, the driver is surely supported, but would it be feasable. Every video I see about csharp and databases, there are Guid in it. I'll probably just keep using php or maybe Go, eventhough cshapr does have the most advanced language featrues and syntactic sugar
@DavidSmith-ef4eh 6 місяців тому
also, modern php is OK. no bully. we have static analysis with php-stan, traits, and some multithreading/async features when using swoole. I would still love to use csharp though.
@petrmalecik5661 9 місяців тому
Great Video I reall liked It. Just an offtopic question are you planning to create a course on Udemy or Pluralsight which would go more deeper into programming paradigm used by you. I took your OOP course and Design Pattern Course. Now I would also love to see some FP/OOP course will such course ever be created? Thank you.
@zoran-horvat 9 місяців тому
I have one such course in the queue about OOD and domain modeling, but no dates when I might actually start working on it.
@Syed.KhubaibAli 9 місяців тому ⁺¹
Thankyou @zoran-horvat, I really like your content , please keep posting valuable stuff.
@JoeFeser 9 місяців тому
The new index will end up getting fragmented over time and will require some maintenance.
@zoran-horvat 9 місяців тому ⁺¹
Every nonclustered index does, and databases are still live and kicking.
@JoeFeser 9 місяців тому ⁺²
I was on a project a few years back that every id in every table was a guid. One of the cross tables was a 3 way primary key. Inserts would take 20-40 seconds. I was assigned to "fix" the perf issue and they were not happy with my solution, DROP THE INDEX.. I tried telling them the design was bad.
@zoran-horvat 9 місяців тому ⁺²
That is the point, the DBA job must be done well for a database to operate well.
@jasonfreeman8022 9 місяців тому ⁺¹
Use INT ids for joins WITHIN the DB. Assign GUID’s as actual entity IDs so you can sync across DB’s (think Dev, QA, UAT and PROD) without using identity insert. If you want to use guids for actual IDs then use a time-based guid thats more suitable to your clustered index.
@zoran-horvat 9 місяців тому ⁺¹
If you are going to query by GUID then you got nothing... That is the reason why we use external ID as the key (in those scenarios). Less code and better database performance, all in one package.
@ДмитрийКондратенко-б5ь 9 місяців тому ⁺¹
Perfect video Zoran!
Looks like that initial approach with generating guids on database level is quite simple and we may forget about these nuances of how sql server works. We just don't need to assign new guid values for entities in our application code, am I right?
@zoran-horvat 9 місяців тому
Actually, I believe it is better to generate GUIDs in the application because that is their greatest strength: global uniqueness. It lets it generate them in the frontend as well. It lets us choose whether to store entities in aggregate storage in one node or bounded context and in a relational database in another.
When you put everything on the table, leaving the database to generate values on insert begins to look less attractive.
@ДмитрийКондратенко-б5ь 9 місяців тому
Please correct me if I wrong here, but for Publishers table with this approach we don't have any indexes now because primary key PublisherId is not the clustered index now. Will our search and filter operations in this table now be slower than with a clustered index?
@zoran-horvat 9 місяців тому ⁺¹
@@ДмитрийКондратенко-б5ь There must always be an index on the key, or otherwise the database would perform a full table scan on every access to a row. The default in SQL Server (and only in SQL Server) is that it keeps the table sorted by the ascending je, a.k.a. the clustered index. But if you opt for a "regular" unique index, as I did in the video, then there will be a separate index built and updated on every insert.
So, the answer is that both Books and Publishers tables will have unique indexes on their keys at the end of the video.
@ДмитрийКондратенко-б5ь 9 місяців тому ⁺¹
@@zoran-horvat Thank you for your explanations, I became smarter after watching your video and reading your answers!
@anon_y_mousse 9 місяців тому
Is that a problem with SQL Server specifically? If so, why not use something better? Also, GUID ID is redundant. You can just say GUID as both ID's are the same thing.
@zoran-horvat 9 місяців тому
This is not the problem, nor is it with SQL Server specifically. Many other databases support clustered indexes - Oracle and Postgres do, for instance. The video is showing one typical mistake that can happen. A few commenters confirmed that happened to them in the past, and it also happened to me.
You would typically notice the problem when an insert into a table becomes unreasonably long, such as over second or longer. I have heard horror stories of an insert taking half a minute before someone noticed that the things are not right.
Regarding the naming, you are right that GUID is the ID. It is a good point one should keep in mind. However awkwardly, I was reading it in C# - Guid Id, i.e. the Guid type and the Id property.
@anon_y_mousse 9 місяців тому
@@zoran-horvat Sequential indexes shouldn't cause that problem if the database is implemented correctly. I can only assume that they're using the GUID directly as a key instead of hashing it. This kind of thing is why increasingly I write my own. I keep seeing these kinds of logical fails in "professional" software.
@dxhelios7902 9 місяців тому ⁺¹
@zoran
"on SQL Server, when a GUID property is configured as a primary key, the provider automatically performs value generation client-side, using an algorithm to generate OPTIMAL SEQUENTIAL GUID values." from EF core docs.
@zoran-horvat 9 місяців тому ⁺²
That is true. The problem occurs when GUIDs are generated in the application and only transferred to EF. In that case, the algorithm will be whatever the application used and, unless a specific sequential algorithm is configured, that will be an out-of-sequence algorithm, as it was in my demo.
@dxhelios7902 9 місяців тому
@@zoran-horvat "whatever application"? eehh, i don't know. Let's say you have existing dataset from other system (SYSEXT) that you would like to integrate into your system (SYSPROD). This dataset has entities with attributes (RDBMS sense): SYSEXTItemID, SYSEXTItemTitle. You want to store these entities in your DB as well. In that case you may say this is application generated IDs. I can get such scenario. You see how I view PK - they are specific to your DAL of your DB (SYSPRODDB). IMHO DAL responsibility is to save, operate and retrieve data consistently from your data store. The implications of such approach:
1. Only DAL controls how data is saved and retrieved.
2. It must control indexes, constraints and relations as needed inside DAL.
In your scenario you would may as well have EF CORE generated sequential IDs in following way:
- define table like SYSPRODID, SYSEXTItemID, SYSEXTItemTitle
- add index on SYSEXTItemID, if needed, but keeping in mind your goals of adding records quickly, probably you want to avoid that
In that case you would remove design dependency between PK and "IDs generated by application". Even if the application changes its behavior, DAL implementation stay stable.
If I would encounter such an issue, then I would keep my DAL fully responsible for consistency and performance and would remove coupling between DB design and application as it would lead to other issues. Team may misuse and abuse such approaches.
Also this does not help if Item is deleted or SQL is restarted due to maintenance on-prem or in the cloud. Fragmentation still possible and index maintenance should help more than anything else.
@7th_CAV_Trooper 9 місяців тому
Guids will kill your index. Btrees are optimized for right only writes. Use ulids instead because they're monotonicly increasing. I'm 99% sure Azure SQL requires clustered pk. EF can be setup to allow auto-inc into indexes. This is the best option, unless you're building mobile apps that require offline operation.
@ivancavlek4755 9 місяців тому
Ok, but what about creating a sequential GUID in the app? There are NuGet packages that enable that (SequentialGuid by jhtodd for example) and also code examples on Code Project by Jeremy Todd ("GUIDs as fast primary keys under multiple databases"). I use this code for almost a decade now in the app and it didn't fail me (yet).
Would these solve the problem i.e. have your cake and eat it too, or clustered index with an app generated sequential ID?
@zoran-horvat 9 місяців тому ⁺¹
That would resolve the problem, but with a caveat: Every node that generates the IDs would have to know the condition and, on top of that, use the same algorithm.
I fear more from configuration errors than the conceptual ones, especially as the resulting big is a nonfunctional one, therefore hard to reproduce and catch.
On a related note, I plan to make a video on strongly-typed IDs with EF Core. The strong ID type would be the ideal place to lock the GUID generation algorithm, though it remains that the C# module containing that decision would have to be in sync with any other source of IDs.
@ssa2955 9 місяців тому
Is there a benefit in creating the GUIDs in code in comparison to creating them on the RDMS side?
@zoran-horvat 9 місяців тому ⁺²
It is tricky to vote this in terms of benefits. It is rather a reaction to other forces when designing a distributed system. So, in practice, you will need to cope with a situation where applications already have the need to generate IDs and letting a database do it for them is a sub par solution.
@KangoV 9 місяців тому
Uuid key solved by using Type 1. These are ordered. This is how Cassandra solves it. You're welcome.
@zoran-horvat 9 місяців тому
Yeah sure, but you forgot the problem statement. You just messed up the database in exactly the same way as I did in the demo. You're welcome.
@KangoV 9 місяців тому
@@zoran-horvat Explain how? I've been using Type 1 UUIDs for years. Never had a problem. Maybe this is SQL Server specific?
@zoran-horvat 9 місяців тому
@@KangoV That is your homework, not mine.
BTW, I did explain it in the video. Two times.
@Dalendrion 9 місяців тому
So SQL Server generates the GUIDs in a way that each GUID is sequential.
Intuitively, it seems that it's less random, and so it's more likely to create collisions.
Letting the app generate GUIDs seems safer.
Is my intuition correct? Or am I missing some context?
@Dalendrion 9 місяців тому
"Sequential" is the wrong word, I suppose. "Ordered" covers it better.
@zoran-horvat 9 місяців тому ⁺¹
@@Dalendrion Yes, ordered is a better term for that. There are several standards for GUIDs, at least one of them guarantees order on top of universal uniqueness (that is the principal trait of that standard).
@SantiagoBurbano-su9oe 9 місяців тому
Don't, don't, thousand don'ts.
You have just turned your Books table into a Heap, which in MSSQL absolutely kills reading performance. While the problem you describe with non-sequential guids is correct, leaving the table without Clustered Index is much much worse.
A simple Index reorganization will fix the guid problem as your table grows. Stick it to the Maintenance Plan (you DO have one, right?) and call it a day.
@zoran-horvat 9 місяців тому
Did you just explain we should never use heap tables before asking what queries we plan to execute? You know better than that, I hope. For example, that "absolutely kills reading performance" depends on what we query, right?
@SantiagoBurbano-su9oe 9 місяців тому ⁺¹
@@zoran-horvat You ask me to provide context when you yourself provided none. In fact, that's actually the main objection I make to your video: the lack of context.
Sure, Heaps have a place, sure non-clustered PKs have a place. But giving a 'general advice' to drop the Clustered when client-generated Guids are used as PKs without any warnings or caveats can put inexperienced developers in a much worse place than a fragmented clustered index would. I'm absolutely sure you know better too. 😉
@zoran-horvat 9 місяців тому
@@SantiagoBurbano-su9oe I think I gave quite a sufficient amount of context when I said that the unsorted IDs are coming in from an external source. That condition is a given at the beginning of the demo. The rest is a consequence.
@joshman1019 9 місяців тому
Couldn't you seed your ID with some sort of sequential ID in a format similar to {SquentialID}_{GUID}?
So something like a Linux based time index. That way the record always gets sorted into the end of the file. This would allow you the benefits of clustered indexing and collision avoidance.
If you did get a few rows generated on the same time index, then the re-sorting would be fairly efficient. In my opinion.
@zoran-horvat 9 місяців тому ⁺¹
You can use a specialized algorithm that guarantees sequence, but that opens a different set of issues. All components that you might allow assigning IDs must guarantee to use the same algorithm, for example. Any later change is not possible, and so on.
All these limitations stem from the principal problem that we have separated the source of IDs from the storage that depends on them. None of the issues appear if we lift the artificial constraint of the storage, that is the point.
@joshman1019 9 місяців тому
@@zoran-horvatThis is great information. Thank you!
@Muhammadsaleh77 9 місяців тому ⁺¹
That's great
@Sokooo 9 місяців тому
That’s why you use Postgres ;p
@zoran-horvat 9 місяців тому ⁺¹
No, that's not why you use Postgres. One must be stupid to make such a random technical decision.
@berndeckenfels 9 місяців тому
Btw I think clustered uuid are not that bad if you insert/delete regularly because the gaps in the index will stabilize and only minor reordering is needed. In fact the easier clustered lookup will make up for it in some cases and you don’t run into one sided balanced index trees or mind extra old records keeping a lot of empty heap dangling,
@CronKz 9 місяців тому
Great video as always!
What about using ULIDs instead? By the spec they are sorted by definition, since the first part of a ULID is the encoded timestamp.
In that case you might be able to keep using the clustered index, even if the application generates the values.
ULIDs are great! Unfortunately they are not yet commonly used.
@zoran-horvat 9 місяців тому
It is a good idea. The only issue with types that are sorted by time is when used for versioning - once the system truly becomes distributed, small discrepancies among timers will cause incorrect sorting of versions and lost updates.
However, the concept alone is very useful.
@berndeckenfels 9 місяців тому
Normal uuids are not sorted by timestamp but there are special variants which are
@berndeckenfels 9 місяців тому
@@zoran-horvatwhy does gaps in timers cause data loss?
@zoran-horvat 9 місяців тому
@@berndeckenfels When a time sequence is used to version changes in a distributed system, then we risk the case when two updates are reversed because the node that effectively generated a later update could have its timer late. In that case, the earlier update could overwrite it. That is one of the failure modes for distributed databases.
@berndeckenfels 9 місяців тому
@@zoran-horvat ah ok, yeah that’s not really related to uuid but yes that’s a problem if you don’t use vector clocks or similar
@bfrytech 4 місяці тому
Code first solutions are rarely the answer. Data first....business rules next
@bfrytech 4 місяці тому
Why would you use external code to do data migration? This makes my stomach hurt...
@zoran-horvat 4 місяці тому
@@bfrytech Can you explain in plain words the reasons why data-first design methods were mainly abandoned in practice?
@bfrytech 4 місяці тому
@@zoran-horvat It's mainly a lack of understanding of how databases work and knowledge of how to properly normalize a db for transactional data. Plus, databases are boring. Writing C# code is far more fun than writing scripts. But, you gotta eat your veggies if you're gonna get desert.
@zoran-horvat 4 місяці тому
@@bfrytech Those who had the understanding and knowledge have also moved in the same direction. Why was that?
@bfrytech 4 місяці тому
@@zoran-horvat I want to clarify that I'm a full stack guy, who happens to enjoy the db side. And I'm not being all fire and brimstone on the subject. Just from my experience, every project that I was pulled into, regarding CF, there was a lack of experience on the data structure side. Ordinarily, the CF stuff was what was stalling the project. For the exact reasons you outlined. A stored procedure with an output parameter is faster and more natural than fighting the issue you outlined. Also, not many DBAs wanna grant that level of permission to an external program.
Now, to your question, I can't answer that. It may be external business pressures or having an expert at CF that mandates the practice, could be a lot of things. But, a majority of managed code folks that I've worked with, don't really understand how an RDBMS handles data. Most managed code collections tend to be handled iteratively and database engines handle data in bulk. I always tried to treat data the same way the data engines to.
@matiascasag 9 місяців тому
UA-cam why my comments are being deleted?
@zoran-horvat 9 місяців тому ⁺¹
Did you add tags or links into them? UA-cam regularly deleted those.
@matiascasag 9 місяців тому
@@zoran-horvat I wrote N3wId from M@ss Tr@ansit
@ricodomonkos3053 9 місяців тому
DDLs not always can be rolled back. For example in MySQL, Postgres and Oracle a DDL is implicitly committed.
@zoran-horvat 9 місяців тому ⁺¹
Yes, sure, but that doesn't change the fact that every mainstream RDBMS supports transactional DDL. Actually, missing such a fundamental feature in 2024 either makes it outdated or a niche technology with an accent on something else (which is fine).
@gunnarliljas8459 3 місяці тому
This is really bad advice as a blanket statement. I mean, there are certainly benefits to skipping the clustered index, especially for inserting stuff, but heap tables come with many, many disadvantages in very common scenarios.
@zoran-horvat 3 місяці тому
@@gunnarliljas8459 "many many disadvantages" is a bad advice as a blanket statement. So many systems are built using this idea, including quite a few where I worked, and those things people fear never happened. For example, the default table organization in Oracle is heap - are there many many disadvantages to using Oracle?
@jonijarvinen7833 9 місяців тому ⁺¹
Sorry but this video provides absolutely no context in which circumstances using GUIDs is slow. It isn't even slow by default and can be just as fast or faster than ordered sequential keys because it prevents congestion. This video also does not provide any data on what exactly is "slow". Don't take it from me, just go watch a video about with actual data and experience to back up what they are saying, which is, you should not worry about using GUIDs. They are just as fast in almost all situations withou any kind of index shenanigans:
ua-cam.com/video/nc4CMo7VSPo/v-deo.htmlsi=2dPyPAT3rX2OGHOn
@zoran-horvat 9 місяців тому ⁺³
My video is not on the topic you are mentioning. I would appreciate it if you could watch my video as being what it is, rather than as not being on the topic of some other video.
My video is about a very particular situation when a programmer unintentionally leaves the clustered index on the key column, and then intentionally uses an out-of-sequence source of IDs. The situation naturally occurs when using GUIDs.
Contrary to what you have imposed, the video you have commented on is not about index performance.
After reading your comment two times, I have an impression that you have missed the theme of my video entirely, particularly the point that out-of-sequence IDs cause relocation of large number of table rows on every insert - an effect that has no relationship to fragmentation, specificity, uniqueness, or any other property of indexes.
@jonijarvinen7833 9 місяців тому ⁺²
@@zoran-horvat And that video takes a deep dive into clustered indexes. There is no one best practice. It depends how you use your data. Heavy inserts are slower on an ordered clustered index because you are creating a hot spot to the last page. From the author of the video from another source:
"First, I ask that you realize my video isn't just about GUIDs. I use Random GUIDs to emphasize many myths and points about all kinds of indexes.
Shifting gears back to your question, when you're working on an almost pure OLTP platform and as I say in the video, logical fragmentation isn't going to matter at all. You're going to be looking up one row at a time and that means a trip in the B-Tree from the root thru the intermediate levels of the B-Tree, to the page in the leaf level where your row is. If the index is (for example), 4 levels deep, then you'll do 4 8k page reads for every look IF the index can be used. If not, then a scan of the leaf level is going to occur and I probably don't need to tell you that can be millions of pages.
Also, Random GUIDs any better or worse than IDENTITY columns when it comes to finding things UNLESS you're using those things as previously identified keys. They're both surrogate keys and mostly tell you nothing about the data on the row. In both cases, they work great for OLTP IF you already know which key to lookup. For everything else, you're going to need a different column or two to find what you're looking for.
The key thing to remember when it comes to "Clustered Indexes" is that "It Depends" and there is no panacea. If your system IS most OLTP then, like I said, Random GUIDs and IDENTITY columns are great even as a Clustered Index. If, however, you're doing "range" lookups by (for example), dates, then the better choice would be to put the Clustered Index on the date column as the first key and the unique surrogate column as the second key column so that you SQL doesn't have to do "Row Lookups" to get the information not carried in a non-clustered index with a date key.
Even that isn't a universal recommendation because, just like an IDENTTY column, have a Clustered Index on the data will likely create an Insert "Hot Spot" at the logical last page of the Clustered Index. Even worse and as I mention in the video, it's even worse if you do such inserts (which all go in to 100% page fullness and then they create another page) and those are following by any UPDATE that makes the row larger (which I refer to as "ExpAnsive Updates") before you can rebuild your index at a lower fill factor, you're going to get massive page splitting and the resulting fragmentation. Of course, Random GUIDs don't have either of those issues because all INSERTs are fairly well equally distributed across multiple pages (usually 1 page per row if the table has at least 1 page per row being inserted).
EVERYTHING is a tradeoff when it comes to the Clustered Index. If someone starts a recommendation about Clustered Indexes with "It a "Best Practice" that all Clustered Indexes should be...", just walk away because they completely wrong unless they end that with "assigned with great caution because there is no panacea".
I also used to believe in the old advice that "every table should have a clustered index". Again, I'll now say "It Depends", especially since that came out with ALTER TABLE REBUILD in 2008. Yep... I know about that nasty old RID being added to non-clustered indexes instead of a nice narrow INTEGER or BIGINT but still "It Depends".
So, for OLTP, "It Depends". For Analytics, "It Depends" on something else. Even within Analytics, each table may have (likely WILL have) different requirements to maximize performance and minimize resource usage.
I'll also end this with logical fragmentation may or may not make a difference for reads (you need to test), will usually make a difference for writes because even supposed "good" page splits are 4 to 43 times more resource intensive as proven by Paul Randal, and page density should almost always be a primary concern. It's very bad to have a low page density in many cases and it's very bad to have a high page density in others.
Again, I know it sounds like a cop out on my part but... "It Depends" and "Must Look Eye" are the only truths when it comes to indexes and the lack there of. Ah... there's one more truth... no matter what you decide on, it will be a compromise... especially for Clustered Indexes and Heaps :D"
@zoran-horvat 9 місяців тому ⁺²
@@jonijarvinen7833 My video does not take any dive on clustered indexes. My video is not about clustered indexes. OK?
@yaroslavpanych2067 9 місяців тому
EF. Duh. It is such a bitch. It would be ok, if it just worked with data, not fucking with structure! I'm developer, and in most cases i have few architects above me, we most definitely know about structure better, and how to update it.
@zoran-horvat 9 місяців тому
Actually, EF did nothing here. It was just passing values between the application and the database with no modifications.
@vasilecuzmin5153 7 місяців тому
Can you talk normally?
@zoran-horvat 7 місяців тому
No, never.
@mahfoudbouabdallah6286 3 місяці тому
Many thanks for the excellent tutorials! Now, does the same principle hold true for the new GUID version 7 introduced in .NET 9 (Guid.CreateVersion7)?
@zoran-horvat 3 місяці тому ⁺¹
@@mahfoudbouabdallah6286 Any GUID that guarantees a sequence is generally fine.

Наступне

Автоматичне відтворення

Learn C#: Abstract or Virtual Method, Which Fits Better Here?