This channel is proof that sometimes the most valuable advice is not on some big channel with millions of subscribers, but on a small channel like this. I cannot express how grateful I am for all your content and advice. You are extremely valuable to the programming community. Thank you so much.
@@zoran-horvat I really hope you become as successful as you want to be on YouTube! I was just thinking that the demographic for advanced programming discussions is not large. That's why we see channels that explain the basics receive way more views than channels that showcase advanced knowledge. If channel views were proportional to the knowledge of the creator, I'm sure you'd be in the top 10 YouTubers already. Anyway, I was just trying to express my gratitude for your videos and everything you teach us. Again, thank you!
Thanks Zoran. A note though that there are a number of advantages to a clustered index. Some watching this video might be led to believe that it's about insertion performance, but there is more to it. In fact, I would suggest it's about read performance. Many large tables represent transactions (e.g. an order); over time the older transactions are rarely accessed, and it's the active current ones that are accessed concurrently. Having these records physically "clustered" greatly increases the chances that they are already in memory when accessed. It also reduces the read effort a little because the leaf index pages *are* the record pages. So whether to use a clustered index or not is a decision to be made carefully, and not just on the basis that a developer may prefer to allocate the ID themselves rather than let the DB do it. That said, there are approaches that give the best of both worlds, that is, keep the index clustered and generate keys in the application that are highly likely to be sequential.
@@mannetjie3704 I am not sure that SQL Azure really forces you to use a clustered index. I'd rather say it takes it as a default, which is the same as in SQL Server. Are you sure about that? And yes, you can build a clustered index on any column you wish; even nullable and non-unique columns can carry a clustered index. The rule that makes sense is that one table can only have one clustered index.
It depends on the use case. Seeking a clustered index saves the extra key lookup in the main table, but the table itself being the index also means that typically fewer rows fit in one page, and as a result, a clustered index seek causes more IO than a seek in a "lean" index. The best case is when you have a "lean" covering index with only a few columns. That might not be possible for every use case.
@@cdoubleplusgood That is true, but this video is not about removing a clustered index. It is about a bug that may happen when there is a clustered index. To fix the bug, you must either guarantee the sequence or drop the clustered index. That is the message. Looking at it from the other end: if the clustered index is required (e.g. for the performance benefits it offers in some scenarios), then we must be very careful when generating ID values.
Protip for people who want to drop primary keys: If you make a change to it in SSMS' Table designer, you can have it generate a script, and it will output all the corresponding drop/create statements for you. I use this a lot when I clean up databases, as it's often faster to reverify foreign keys (minutes) than to leave them active during mass data operations (days). Especially if you have foreign keys that aren't a primary predicate in your indexes, which is often the case.
If I had a wishlist for SQL Server, right at the very top would be some sort of mechanism that optionally deferred validating referential integrity until commit.
@@zoran-horvat I would not be surprised to learn that OracleDB does. But some googling suggests that both OracleDB and PostgreSQL support 3 levels of it on all constraints. Those levels being Never (e.g. UNIQUE), optionally during Statements (e.g. bulk insert with a FOREIGN KEY to self), and Until Commit, with the level being set when the constraint is defined.
Thank you very much for this video, Zoran! I was not aware of the nature of clustered indices. The explanation helped me revise the DB configuration on a current project I am working on. Very helpful!
Thank you for making this topic accessible! The explanation on the performance pitfalls of using GUIDs with clustered indexes was particularly insightful.
A lot of data to move when inserting rows in the middle of the clustered index -> this is just a page split, and only then when a page is full. Not super-ideal, but there is an upper-bound to the cost.
Thank you for sharing, Zoran. As you're aware, removing the clustered primary key entails additional overhead for each key retrieval. While this action resolves index fragmentation, it introduces an I/O penalty with each key access. An alternative solution I propose is the adoption of ULID or UUIDv7 (with a personal preference for the former). These options ensure the monotonicity of generated GUIDs. This quality not only renders clustered primary keys feasible and time-sequential processing effective, but also enriches the ID with additional time information that could prove beneficial in some scenarios. It would have saved you a lot of migrations ;)
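The core idea is that the timestamp leads the byte sequence, so freshly generated IDs keep landing at the end of the index. A rough Python sketch of that idea (not a spec-complete UUIDv7 implementation; `uuid7_like` is a made-up helper):

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    """UUIDv7-style value: a 48-bit Unix-millisecond timestamp up front,
    then random bits. A real UUIDv7 also sets version and variant bits."""
    ts_ms = time.time_ns() // 1_000_000
    head = ts_ms.to_bytes(6, "big")  # timestamp-first => byte-sortable
    tail = os.urandom(10)
    return uuid.UUID(bytes=head + tail)

# IDs generated over time sort in creation order, so a clustered
# primary key only ever appends instead of splitting pages.
ids = []
for _ in range(5):
    ids.append(uuid7_like())
    time.sleep(0.002)  # force distinct millisecond timestamps
assert ids == sorted(ids, key=lambda u: u.bytes)
```

On the .NET side, a ULID or UUIDv7 library would play the same role when the application generates the keys.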
Yes, the use of a UUID standard that guarantees sequence is preferable, but be careful if the system is truly distributed. The monotonicity could fall apart if node clocks are out of sync. On the other hand, a clustered index forgives small discrepancies, i.e. the performance loss will usually be immeasurable if it only has to move a few rows at worst. That is another hidden gem of clustered indexes.
I am entranced by your voice, @zoran-horvat! It was like you were telling me a story. I could listen to you all day long. Also, great informational video :)
In summary: a GUID-type primary key column should not be a clustered index, because SQL Server will generate sequential GUIDs, which causes a performance issue. And this performance issue has nothing to do with Entity Framework. It also doesn't matter whether the app or the DB generates the GUID. Curious to know whether all kinds of DBs have this kind of issue or not? Let me know if I misunderstood any part.
Following the advice of this video, I asked the DBAs about this at my work, and I was surprised to learn that they don't rebuild indexes in production. They said index fragmentation has a negligible performance impact when the storage medium is non-mechanical. Statistics, however, are still important. So random GUIDs may take up more space, but they don't think it's a big issue as long as you give the index plenty of fill factor.
@@tomtoups My experience is also that the impact is negligible. I never saw a performance issue with GUIDs. Therefore, if the IDs are generated by the client, this is my default decision. It never failed.
The problem is not the index fragmentation after inserts. The problem is the page splits during inserts. It's easier to insert at the end and just append to an existing structure than to insert somewhere in the middle and have to rearrange the entire structure. Fill factor is always 100%. It should not be any lower unless there is some weirdly specific reason.
@@zalmar5855 Your statement is ridiculously wrong. You never tested it in reality. In Oracle, for example, the default organization is a heap with a B-tree index. Have you ever heard anyone saying Oracle "straight up murders performance of every query", as you have put it? Of course not.
Thank you Zoran. Indexing is a topic for which very few videos exist for EF Core. Please create more videos explaining indexing, so that when your database gets large you can still insert and read data in a performant way.
There are many channels about SQL Server, for example, where you can find excellent videos on those topics. I am not sure I can explain indexes and performance even remotely as well as some videos I watched in the past.
@@zoran-horvat I have a question: does this problem you mentioned only happen when you let the database create GUIDs as well as creating your own? How does this apply if you always create your own GUIDs, and when doing so you have them as non-clustered?
DDL statements in SQL Server execute in a transaction, with a few exceptions like CREATE/DROP/ALTER DATABASE and the same for FULLTEXT INDEX. You can also mix DDL and DML statements in the same transaction. That is not limited to SQL Server. Other databases also support transactional DDL, most notably Oracle.
But what if the database is read-heavy, and you search data by id: guid? Yes, the insert would be slower, O(log n) instead of O(1), but the read would be O(log n) instead of O(n). It's a trade-off.
It is a tradeoff, but not with those asymptotic complexities. The insert of out-of-sequence IDs into a table with the clustered index takes O(n) time and read takes O(log n). The insert and read from a table with a common key is O(log n) in both cases. So it is actually an O(log n) vs. O(n) tradeoff, which becomes tragic on any table with more than, say, 100k rows and absolutely unsustainable on a 1M-row table. The point is that there is no arithmetic juggling that would make the relocation of half a table in every insert even remotely as performant as anything else a man could imagine, unless the table is of a very small size, hence not important in any analysis.
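The asymmetry is easy to see in a toy simulation. The Python sketch below uses a sorted list as a stand-in for the clustered index's leaf level and counts how many elements must shift on each insert:

```python
import bisect
import random

def insert_cost(keys):
    """Count element shifts needed to keep a sorted array ordered while
    inserting the given keys one by one: a toy stand-in for row
    relocation in a clustered index."""
    table, shifts = [], 0
    for k in keys:
        pos = bisect.bisect(table, k)  # find the slot: O(log n)
        shifts += len(table) - pos     # elements moved to make room
        table.insert(pos, k)
    return shifts

n = 10_000
sequential = insert_cost(range(n))                    # in-order IDs
random_ids = insert_cost(random.sample(range(n), n))  # out-of-order IDs

print(sequential)  # 0: every insert appends at the end
print(random_ids)  # roughly n*n/4: each insert moves half the "table"
```

Sequential keys never move an existing element, while random keys move about half the table per insert, which is the O(log n) vs. O(n) difference in concrete numbers.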
@@zoran-horvat > The insert of out-of-sequence IDs into a table with the clustered index takes O(n) time. Ok, fair enough: finding where to insert data is O(log n), but actually moving everything and inserting is O(n). For tables where I expect a lot of rows, and where I'm not going to use id:guid as the most used where-condition, I don't think it's a problem. It would be fun to see what the actual read and write speeds are for 1k, 10k, and 100k+ row tables with a clustered and non-clustered id:guid.
Postgres has something similar, but it is not applied by default. By default, the PK is a "normal" index. Similar for Oracle. They have something called an "index organized table". But that is also not used by default.
That would be an optimization and, as such, it is valid in certain scenarios. I have intentionally tried to avoid getting into the details of measuring and optimizing read and write performance because that would make it an hour long video at least, but it seems that the question of performance is inescapable...
An alternative that lets you generate PKs with a clustered index, and is almost guaranteed not to cause page fragmentation, is to use Snowflake IDs.
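For anyone curious, here is a minimal sketch of the Snowflake layout in Python. The bit widths and epoch follow the commonly cited scheme; a production generator would also handle sequence overflow within a millisecond and clocks moving backwards:

```python
import threading
import time

class SnowflakeGenerator:
    """Toy Snowflake-style generator: 41-bit millisecond timestamp,
    10-bit node id, 12-bit per-millisecond sequence."""

    def __init__(self, node_id: int, epoch_ms: int = 1_288_834_974_657):
        assert 0 <= node_id < 1024
        self.node_id = node_id
        self.epoch_ms = epoch_ms
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = time.time_ns() // 1_000_000 - self.epoch_ms
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
            else:
                self.last_ms, self.seq = now, 0
            return (now << 22) | (self.node_id << 12) | self.seq

# IDs come out strictly increasing on a node, so a clustered index
# only ever appends.
gen = SnowflakeGenerator(node_id=7)
ids = [gen.next_id() for _ in range(1000)]
assert ids == sorted(ids) and len(set(ids)) == len(ids)
```

Because the timestamp occupies the most significant bits, IDs from different nodes also interleave in roughly chronological order.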
0:47 There is no guarantee whatsoever that there will never be collisions 💥 however, it is unlikely. GUIDs are 128 bits, some of those bits have defined values and cannot change, and there are several issues that can make them non-unique; the default new GUID implementations are even predictable.
Thanks, very enlightening. So what will be the default ordering of the data at the table/page level now that we do not have a clustered index? ... I guess it doesn't concern us since we now have a non-clustered index on the GUID column, so we're able to find a GUID efficiently.
@@Dalamain There is no ordering at the table level in that case. It is just a heap, and rows are found using a common index external to the table. Some definitions of GUIDs produce ordered sequences, which may bring the clustered index back into the game. However, you might need to ensure they are only generated on one node, which defeats the idea once again.
Yeah, I'm glad that you showed this, because I remember that when I was a junior, lots of my co-workers thought that you would just generate the ID (this was before everything went to GUIDs), because then you wouldn't have to find a way to look up what you just saved, or a way to get the GUID back from the database. But when you have a distributed system this becomes a problem, in part because if every client is creating a GUID there is a slight possibility of a collision. It's extremely rare, though it depends on the algorithm with which the ID is created. It is more common with auto-incrementing integer keys than with GUIDs; GUIDs have sort of become the standard to avoid this. But again, because of the way networks work, I would rather the server define these values than the client 99% of the time. Otherwise you might have values being created that could be security issues: a function doing an insert that adds something with a GUID, and then you might have unwanted data being added. .NET has done a good job of trying to sanitize these things and make it harder for that to happen, but it's one of those things that's always been in the back of my head: there's a reason why some things are best left on the server.
The possibility of a collision in GUIDs is truly negligible, so you shouldn't count it in. There are issues that happen far more frequently in distributed databases, and we still have the means to handle them.
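For what it's worth, the "negligible" claim is easy to back with birthday-bound arithmetic. A quick Python estimate, using the 122 random bits of a version-4 GUID:

```python
from math import exp

def collision_probability(n: float, bits: int = 122) -> float:
    """Birthday-bound approximation for n random values drawn from a
    2**bits space: P(collision) ~= 1 - exp(-n**2 / 2**(bits + 1)).
    A version-4 GUID has 122 random bits."""
    return 1.0 - exp(-(n * n) / (2.0 ** (bits + 1)))

# Even after generating 103 trillion random GUIDs, the odds of a
# single duplicate are only about one in a billion.
p = collision_probability(103e12)
assert 1e-10 < p < 1e-8
```

At volumes any real system produces, other failure modes (clock drift, network partitions, plain bugs) dominate long before GUID collisions matter.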
Please do not remove the clustered index from a table; it will straight up murder the performance of every query. Removing the clustered index makes the table a heap and clears all plans from the cache that use it, forcing SQL Server to recompile the queries and leading to brutal CPU pressure, while the table being a heap means SQL Server will have to read and cache more pages, leading to memory pressure. Unless you want your DBAs to murder you, please do not remove a clustered index from a table, especially if your root problem is coming from the table being too large. Use a non-clustered index on a GUID that uniquely identifies a record between systems, and keep some kind of incremental index (GUID or identity int) for the "internal" clustered primary key. You don't even have to store the external GUID on the table if you don't want to; you can create a mapping schema or mapping table. The purpose of using GUIDs for unique identification between multiple/distributed systems is to remove dependencies from each other. Don't change the DB structure to accommodate this.
@@zalmar5855 Another commenter said DBA confirmed no issues ever noticed after removing a clustered index. Now figure that. I also confirm there is no performance penalty except in a few extremely specific corner cases.
This does not seem like very good advice, unless you have a small database or don't care about select/update/delete performance. Imagine having a really large warehouse, and whenever someone gives you some new data to "insert" you put it in a box and label it with your primary key. Since your key is ordered (typically ascending), you need to place the box at the right spot. If your key comes after the last box, then you just put the box right at the end, but if it's somewhere in the middle, you now have to move all the boxes that come after your key (this is called page splitting) to make room for the new one. Now imagine that same warehouse but without the clustered index. You get some new data, but instead of putting the data into a box and labeling it, you just put the data on whatever shelf has room. You are not even assured that all the data is in the same place. So when your "client" wants their data again, you have to run all over the warehouse to gather up the pieces in order to get them their data. To put it simply: this fragments data on the disk, making it slower to find the data for selects, updates, and deletes. The real solution to this problem is to make an integer or bigint column the primary key and then make it an identity column. If you insist on having a GUID as your identifier, then add that as a separate column with a nonclustered unique index.
First of all, my advice is NOT that you should not use a clustered index, as your comment seems to imply. The video is about a very specific situation when an application generates GUIDs using an algorithm that does not guarantee the sequence and there is a clustered index. One must go. Regarding your analysis, the actual performance depends on read and write patterns. Neither the clustered nor non-clustered index is better in advance without knowing the statistics. Your analysis does not generalize and, hence, does not lead to any conclusion.
@zoran-horvat Sry, it seems YouTube decided not to show all of your message. A refresh added the first part, which I hadn't read. That is true, you don't say it's general advice. That's a misunderstanding on my part.
I'm not that fit in databases. But does that mean that I can basically generate my (gu-)ids in my software and use those ids to link entries of different tables before storing the data in the database? We ran into the problem that in Entity Framework 6, when referencing objects, we have to transfer large parts of the database because of the direct object links. The clients don't have direct access to the database but go through a server which handles the communication with the database. Being able to directly generate the ids myself would help us to mitigate some problems we have during multiple transfers.
If I got your description right, that is precisely the scenario where we choose to let the application generate IDs. The application then has control of the times when it creates objects and the times when it saves them to the database; those two don't have to happen simultaneously. It is also useful when sending objects between systems, as the systems will associate the same object with the same ID.
@@zoran-horvat sorry, I've rephrased it; now it's both more precise and shorter :) Don't get me wrong, many are interested in a full video, but most don't have 15+ minutes just to figure out what the whole thing is about and whether it's worth spending time on. Unfortunately, it's rare that YouTubers provide a sufficient description; some even try to bait with incomplete titles.
@@kocot. It is not possible to state the problem precisely in a 50-character title, nor in the description for that matter. My channel consists of advanced-level videos and, trust me, quite a few programmers watch them very carefully. I expect them to, because I don't know how to pour an advanced piece of knowledge into a head that lacks patience. Look at me: I have been learning programming for 35 years now. Regarding the problem addressed in this video, the assumption is that the change cannot fit the existing table with data in it. That is a regular case in practice. Either the IDs are coming from another system that just didn't care to make them sortable, or the table already contains a lot of data, like a million rows, produced with a sortable UUID generated by a different algorithm, so the new algorithm does not extend the existing sequence. If either is the case, we must make cuts. One is to give up the clustered index, and that decision is tied to a few other considerations, such as write and read statistics. The other option is to recreate the table and generate new IDs using the new algorithm. Both decisions are hard to make, and this video does not suggest which one is the right one. Actually, chances are that it will be the DBA who decides. The upcoming video, which is in the publishing queue right now, will demonstrate both possibilities on two tables in the same database.
There is the new video demonstrating that possibility: ua-cam.com/video/MJcaxi3bOO8/v-deo.html The caveat is that ULID will not solve the problem if the table already contains many rows. ULIDs are sorted left to right, whereas GUIDs used by SQL Server have a different structure and sort order. Therefore, the sequence of ULIDs would not extend the sequence of IDs already in the table, once again causing the relocation of great numbers of existing rows upon every insert.
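The mismatch can be illustrated in Python. SQL Server does not compare a uniqueidentifier left to right; the byte comparison order below is the one commonly documented for SqlGuid, so treat it as an assumption to verify against your own server:

```python
import uuid

# Commonly documented SqlGuid comparison order: the last six bytes of
# the RFC 4122 layout are the most significant, the first four the least.
SQL_GUID_ORDER = (10, 11, 12, 13, 14, 15, 8, 9, 6, 7, 4, 5, 0, 1, 2, 3)

def sql_server_sort_key(u: uuid.UUID) -> bytes:
    """Reorder the GUID's bytes so that plain byte-wise comparison
    mimics SQL Server's uniqueidentifier ordering."""
    b = u.bytes  # big-endian RFC 4122 byte layout
    return bytes(b[i] for i in SQL_GUID_ORDER)

a = uuid.UUID("00000000-0000-0000-0000-000000000001")
b = uuid.UUID("01000000-0000-0000-0000-000000000000")

# Left-to-right (ULID-style) order and SQL Server order disagree:
assert a.bytes < b.bytes                                # ULID view: a first
assert sql_server_sort_key(a) > sql_server_sort_key(b)  # SQL Server: b first
```

This is why a ULID sequence, sorted left to right, does not extend a sequence previously produced by NEWSEQUENTIALID().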
Actually I have a question. I am currently working with EF Core in a .NET Core project, and I want to prevent values below 0 from being inserted/updated into a database table (basically expecting an exception thrown by EF Core in that case). I have read that there is a "CHECK" constraint to let us validate a value before updating a column, but achieving it with EF Core seems to be difficult. I have noticed there is a Fluent API call in the EF Core module, "HasCheckConstraint()", which seems to be the way to go, but it's marked as "[Obsolete]". I want a method that I can use together with migrations. I have also tried the "[Range(0, int.MaxValue)]" annotation, but with no success. Any idea as to how I can do it, in an elegant and robust way, supporting migrations with EF Core? THANK YOU!
Can anybody explain why SQL-generated GUIDs are "in growing sequence" (at 3:53)? I see they have a random part at the front, so they will not be lexicographically sequential... am I missing something?
You said what you did, but you didn't say why you did it. Why don't you want sequential GUIDs? Where are the benchmarks? Before the change, what were the insert and select times and the query plan generated? After you made the change, what were the insert and select times and the query plan generated? Why did you decide you don't want a clustered index? You may speed up writes, but surely reads will be slower.
It depends on what you think that I said. I see that many viewers didn't see the central message of the video, which is probably my fault in passing that message. So, the message is: IF the application has been generating IDs for some time AND there was a clustered index THEN... we begin. The video doesn't question any of the decisions because those are driven by entirely different forces. It only questions the combination of decisions.
@@zoran-horvat That does help to clarify the starting situation, and is still interesting as I definitely think others have been in this situation, although I do think it would've been nice to see some details on the improvement that the change made
Brilliant! I didn't know there was a problem (and always used app generated guids before). One day a colleague would have suggested db-generation and I would have accepted if not for your caution...
It's the other way around: database-generated IDs are normally safe, because the tools would set up the database correctly. It is the application-generated IDs that may sometimes cause trouble if not aligned with how the database is created.
I have an important question. You've written a custom migration there, which means that in order to squash all the migrations in the future, your colleague would have to find the custom code among the sea of other auto-generated migrations and then recreate it later. From my experience, it's a huge pain in the butt. How would you deal with custom migrations? Is it ok to just put them in a different folder (under the same namespace)? Or is there a more elegant way? Thanks!
There is no difference between custom and generated migration code. Each migration is immutable, and EF will not write into it again. As I pointed out, this is not a problem caused by EF. I have been using other migration tools before, and they had the same property: you must add custom migration code to make a certain change. On the other hand, the whole issue only happened because I made a bug in the earlier configuration.
@@zoran-horvat I understand that migrations are immutable once created, but how do you squash migrations that you cannot recreate using EF Core? By squashing I mean creating a big migration from hundreds of smaller ones (which will lose the custom migrations if you're not careful).
@@lettuceturnipthebeets790 I’m trying to understand your question. What Zoran did was generate a migration, and then add to it. You can add whatever you want to the migration, even down to custom SQL statements. But that is the complete definition of your migration. If you later want to create a migration script, say an idempotent script which you will generate and apply within a CI pipeline, all you’re doing is combining the migrations you have defined. The migrations aren’t generated again. You’re just combining, or squashing if I am understanding your question correctly, into a single SQL migration script.
I didn’t know that EF Core and SQL Server generate sequential GUIDs when properties are empty. I started defaulting back to using longs over GUIDs. I prefer the sequential nature of it, and a clustered index is important for me.
When a clustered index is required, then the number of options is limited. That is the case in systems with heavy inserts. The way EF Core handles that is to initialize the table field with NEWSEQUENTIALID(), the same way as we would do when writing the CREATE TABLE script by hand.
That is one possible design. But only one out of several valid ones. We should always keep the requirements in mind, both functional and nonfunctional, when making design decisions. For example, a clustered index can cause write congestion in some scenarios and work like a charm in others. A table with a common scattered index can cause read delays in other scenarios. There is no one size that fits all.
@@MrWTFNETWORK Because it might be that the application is generating ID and database is only storing it. That is frequently the case in distributed systems. In such a situation, autoincrement numbers are not an option.
@@zoran-horvat You made me think of a scenario where I store invoices and invoice line items, a 1-to-many relationship, and we were using a GUID for the primary key of both tables. In reality, to access line items, I see that the line items table could instead have a composite primary key of invoice ID and line item ID (an incrementing integer generated in memory instead of a GUID, as there can only be max 100000 items). What do you think about this approach? We only need to get 2000 items in a paged way until we allow clients to get all line items via the API. We would use a continuation token being the ID of the last item and always sort by the ID.
Hi Zoran, thanks for the video on this subject, certainly interesting. The only thing that I'd suggest would be a little more focus on the database itself, regarding default column values and using the NewID() and NewSequentialId() functions. While Entity Framework can take care of this if you're using EF to generate your database, this isn't the case if you're either using another ORM or working against an already existing database. I just felt that this bit of detail had been glossed over.
You note a performance impact of using application-generated GUIDs on a clustered index; however, I would love to know what sort of impact that might be. For instance, against a 1-million-row by 10-column table, just as an example, to get the percentage difference on read and write ops.
The percentage doesn't matter. It is the fact that an operation that takes near-constant time would be substituted with an operation that takes time proportional to a half of the table's size on average. One simply does not do the latter.
@@zoran-horvat I take your point that it is self-evidently the correct approach; however, I have applications with this issue and I need to prioritize my time. If I go back and spend the dev time on the migrations, will I see a 5% or a 40% perf increase? Thanks for the amazing videos, big fan 😁
@@MrValhallans The point is that when this issue manifests, it makes every row insert a pain so obvious that measuring it makes no difference, especially knowing that every next day will only make it slower. How do you prioritize work on a system where inserting a row takes a second?
An application cannot generate an autoincrement number. It must rely on the database to communicate the ID back to the application. Not to mention that autoincrement anything is almost invariably out of the question in distributed systems. Now list ten situations that come to your mind where waiting for the autoincrement number is an issue, and you will have ten use cases for GUIDs. It shouldn't be hard.
@@zoran-horvat Adding related data to many tables at once requires knowledge about PKs and FKs, but I always relied on EF or SQL's SCOPE_IDENTITY then and never set them by hand. Ok, it's just easier having keys before and after the action.
The problem occurred when the app needs to create its own GUIDs. But if I understand you correctly, there is no problem with letting the DB generate the GUID with a clustered index, if the app does not need to create GUIDs, and it will perform better. Correct?
The problem was with IDs generated out of order when there was a clustered index. The database generates GUIDs in the ascending order when instructed, and so the problem would not occur then. However, that is not to say that database-generated IDs or clustered index perform better. The overall performance is also the result of read and write patterns exhibited by the application.
There is a class that apparently does that (learn.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.valuegeneration.sequentialguidvaluegenerator?view=efcore-8.0). I must admit I have never used it. To make it worse, I have never even read its documentation carefully. The critical question to answer is whether the sequence remains ascending if you move the application to a new server. Though there exists a standard definition of GUID that puts a timestamp at the front, thus making any sequence of GUIDs ascending, there is a nasty corner case when the two servers are out of sync with time. The problem is the same as when using a timestamp to version entities, where it may sometimes lead to lost updates! For all these reasons, I normally let the database generate IDs when a clustered index is a requirement (in systems with heavy inserts, e.g.), and drop the sequential-IDs request otherwise.
Just wondering how many deployment systems actually use the Down migration. I haven’t seen them yet; mostly an idempotent migration script is generated for the Up. Am I missing out?
It is a safety valve in case you suddenly figure something out after deploying to production. I remember events of that kind in my career. Very rare indeed, but when it happened it was always driven by panic reactions at the customer, so they would request immediate rollback of the deployment and only then an investigation.
It is not the question whether it is better or not, but which node in the distributed system is in charge to generate or, let's say, know the value of the ID. If it is an applicative node, a service, or anything of that sort, then keeping the storage in charge of generating IDs requires an additional trip back to the application before it can continue doing what it planned. It is also hard to rely on the database in polyglot systems where some of the storage nodes might not even support generating IDs.
This is why you just let SQL Server generate the GUIDs for you in a simple setup or if you have a clustered DB setup or need to generate IDs in the application for other reasons, you don’t bother with GUIDs and instead use a better, sequential ID algo like Snowflake or some variant. Not only is it more compact, saving data file and index space, but you can keep a clustered index in cases where your reads will be more frequent than writes, both of which will improve performance and lower costs.
Do not have a table in SQL Server without a clustered index!! Select performance will be degraded! If you want a GUID column for a primary key, use NEWSEQUENTIALID() to generate it as the default value. That is why code-first is tricky: you must understand the DB.
@@nothingisreal6345 You are selling one possibility as a general solution, using shaky arguments that are easy to prove wrong. BTW, how come that silver bullet of yours doesn't apply to Oracle?
Some years ago, I worked on a project using SSRS; I was constantly writing T-SQL scripts to improve report turn-around. Scary stuff, and I absolutely hated it. How do non-clustered indexes compare in performance in the large?
It depends on what you need. Someone pointed out how fast reads are if we tend to read the rows in chronological order, which is the central request in some applications. The clustered index will then make such reads substantially faster than a table with rows scattered all over the data pages. On the other hand, most business applications do not have such a request. We usually read the rows as per request, and requests are randomly driven by end users. That makes clustered indexes less important, even to a degree that maintaining them becomes a liability with no benefit in return.
@@zoran-horvat Yes, I found that clustered indexes worked faster with time-ordered data; it happened to be quantities of mushrooms collected at distributed farms per picker over a week, and aggregates per farm per cost center.
I usually avoid using GUIDs as primary keys in the database completely. I assign standard integer, database-owned primary keys and add a second column that holds the GUID. Thoughts on this?
Then you lose the greatest benefits of using GUIDs: the ability to generate the value before persistence, and the ability to persist entities in storage that does not support generating IDs.
@@zoran-horvat Such a setup doesn't preclude the benefits of application-generated GUIDs. Applications with such a setup are usually ignorant of the incrementing IDs when they're not resolving foreign keys in their queries. It does come at the cost of requiring two indexes at minimum: a clustered one for the incrementing ID, and a unique nonclustered one for the GUID, which does have some performance implications.
Hi Zoran, thanks for another interesting video. I like to add a clustered index on another column, which I make a shadow property in EF and an identity seed to ensure the order. What are your thoughts on this approach?
The structure of UUIDs is defined by the corresponding standards. If you need to analyze the values, then you take the parts as defined in the standard. There is no freedom to decide.
@@zoran-horvat Yes, but SQL Server sorts the bytes/characters in order from left to right, and normal v1 UUIDs like you showed do not sort chronologically this way. V6 is better in this regard.
Switching the UUIDs to UUIDv7 would be all you would want to do. Removing the clustered index is not the best solution when we have UUIDv7. UUIDv7's initial bytes are ordered by a timestamp and the right side is random, which lets you sort them and immediately tell which ones were created first to last. If these are meant to be customer-facing and being able to tell when an entry was created is a security concern, then the primary key should be an auto-incrementing integer (with a clustered index) and a second column should hold a UUIDv4 (with a regular index). As a side note, including a UNIQUE constraint on the second column would cause unnecessary overhead.
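A rough Python sketch of that UUIDv7 layout (field placement per RFC 9562; an illustration, not a vetted library) shows why the values compare in creation order:

```python
import os
import time
import uuid

def uuid7_sketch() -> uuid.UUID:
    """Build a UUIDv7-style value: 48-bit millisecond timestamp first, random tail."""
    ts_ms = int(time.time() * 1000)
    b = bytearray(16)
    b[0:6] = ts_ms.to_bytes(6, "big")  # timestamp-first => byte order tracks time
    b[6:16] = os.urandom(10)           # random tail for uniqueness
    b[6] = (b[6] & 0x0F) | 0x70        # set version 7
    b[8] = (b[8] & 0x3F) | 0x80        # set the RFC variant bits
    return uuid.UUID(bytes=bytes(b))

first = uuid7_sketch()
time.sleep(0.002)                      # ensure a later millisecond
second = uuid7_sketch()
print(first.bytes < second.bytes)      # the later value sorts after the earlier one
```

One caveat for SQL Server specifically: as far as I know, uniqueidentifier comparison does not run plainly left to right over the 16 bytes (the trailing byte group is the most significant), so a timestamp-first UUID may still need its bytes rearranged to sort the way SQL Server expects.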
That is true, with a caveat. What used to be encapsulated in the database, both creating and consuming IDs, now becomes someone else's problem. It is entirely possible that a future update, such as accepting IDs from a node or service which was not entitled to that before, could break the rule, causing a surge in insert times. I would advise either keeping the dependency on the algorithm in one place (the database), or not having it at all.
@@zoran-horvat True, either the database needs to have an implementation of UUIDv7 that it uses to generate the IDs, or (when the UUIDv7 must come from an external source) all pathways that result in an insert into that table must go through the same abstraction layer that generates a UUIDv7 on the caller's behalf. Most systems have a "repository" layer that would suffice, but once you have more than one system with access to insert into that database, each one becomes an entry point for bad (non-UUIDv7) IDs. Creating a shared library for interacting with the tables of that database would be my next path forward to greatly reduce errors.
This is one reason I do not use migrations for SQL Server and instead go with a database-first approach. The VS SSDT project allows publishing changes to the DB schema and handles all the drops and creates needed. It's especially useful when iterating changes over a dev database and then pushing them all to a prod one when ready. Also, removing the clustered key results in a table classified as a heap, which has its own pros and cons. If you only ever look up by the GUID and never do a range lookup, there won't be much difference, but if you ever need to look up a range of rows in a table, you'll want to create a clustered key. Also, depending on the concurrency and throughput of the application, sometimes a clustered GUID column may help in the long run, as long as you are doing database maintenance such as index rebuilds or defrags on a consistent basis.
It is a tradeoff, like any other decision. However, *if* IDs are generated from the application, then the clustered index is most probably not applicable.
@@zoran-horvat I am totally new to this. I am even confused why they use UUIDs instead of numeric increments. I was thinking about using C# for part of our backend administration panel, but we use MariaDB. I assume that would be possible, the driver is surely supported, but would it be feasible? Every video I see about C# and databases has Guids in it. I'll probably just keep using PHP, or maybe Go, even though C# does have the most advanced language features and syntactic sugar.
Also, modern PHP is OK, no bully. We have static analysis with PHPStan, traits, and some multithreading/async features when using Swoole. I would still love to use C# though.
Great video, I really liked it. Just an off-topic question: are you planning to create a course on Udemy or Pluralsight that would go deeper into the programming paradigm you use? I took your OOP course and Design Patterns course. Now I would also love to see some FP/OOP course; will such a course ever be created? Thank you.
I was on a project a few years back where every ID in every table was a GUID. One of the cross tables had a three-way primary key. Inserts would take 20-40 seconds. I was assigned to "fix" the perf issue, and they were not happy with my solution: DROP THE INDEX. I tried telling them the design was bad.
Use INT IDs for joins WITHIN the DB. Assign GUIDs as actual entity IDs so you can sync across DBs (think Dev, QA, UAT and PROD) without using identity insert. If you want to use GUIDs for actual IDs, then use a time-based GUID that's more suitable to your clustered index.
If you are going to query by GUID then you got nothing... That is the reason why we use external ID as the key (in those scenarios). Less code and better database performance, all in one package.
Perfect video, Zoran! Looks like the initial approach of generating GUIDs at the database level is quite simple, and we may forget about these nuances of how SQL Server works. We just don't need to assign new GUID values to entities in our application code, am I right?
Actually, I believe it is better to generate GUIDs in the application because that is their greatest strength: global uniqueness. It lets us generate them in the frontend as well. It lets us choose whether to store entities in aggregate storage in one node or bounded context and in a relational database in another. When you put everything on the table, leaving the database to generate values on insert begins to look less attractive.
Please correct me if I'm wrong here, but with this approach we don't have any indexes on the Publishers table now, because the primary key PublisherId is no longer the clustered index. Will our search and filter operations on this table now be slower than with a clustered index?
@@ДмитрийКондратенко-б5ь There must always be an index on the key, or otherwise the database would perform a full table scan on every access to a row. The default in SQL Server (and only in SQL Server) is that it keeps the table sorted by the ascending key, a.k.a. the clustered index. But if you opt for a "regular" unique index, as I did in the video, then there will be a separate index built and updated on every insert. So, the answer is that both the Books and Publishers tables will have unique indexes on their keys at the end of the video.
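The difference described above can be mimicked with a toy Python model, where a plain list stands in for an unindexed table and a dict stands in for the unique index (purely illustrative, no SQL Server internals implied):

```python
# A list plays the unindexed table; a dict plays the unique index on the key.
rows = [{"id": i, "title": f"book-{i}"} for i in range(10_000)]

def full_scan(table, key):
    """Find a row without an index, counting how many rows we had to touch."""
    comparisons = 0
    for row in table:
        comparisons += 1
        if row["id"] == key:
            return row, comparisons
    return None, comparisons

# The index is built once and must be updated on every insert,
# which is the maintenance cost mentioned above.
index = {row["id"]: row for row in rows}

row, cost = full_scan(rows, 9_999)
print(cost)                      # touches all 10,000 rows
print(index[9_999]["title"])     # a single hash lookup instead
```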
Is that a problem with SQL Server specifically? If so, why not use something better? Also, "GUID ID" is redundant. You can just say GUID, as both IDs are the same thing.
This is not the problem, nor is it specific to SQL Server. Many other databases support clustered indexes - Oracle and Postgres do, for instance. The video is showing one typical mistake that can happen. A few commenters confirmed that it happened to them in the past, and it also happened to me. You would typically notice the problem when an insert into a table becomes unreasonably long, such as over a second or longer. I have heard horror stories of an insert taking half a minute before someone noticed that things were not right. Regarding the naming, you are right that the GUID is the ID. It is a good point one should keep in mind. However awkwardly, I was reading it in C#: Guid Id, i.e. the Guid type and the Id property.
@@zoran-horvat Sequential indexes shouldn't cause that problem if the database is implemented correctly. I can only assume that they're using the GUID directly as a key instead of hashing it. This kind of thing is why I increasingly write my own. I keep seeing these kinds of logical failures in "professional" software.
@zoran "on SQL Server, when a GUID property is configured as a primary key, the provider automatically performs value generation client-side, using an algorithm to generate OPTIMAL SEQUENTIAL GUID values." from EF core docs.
That is true. The problem occurs when GUIDs are generated in the application and only transferred to EF. In that case, the algorithm will be whatever the application used and, unless a specific sequential algorithm is configured, that will be an out-of-sequence algorithm, as it was in my demo.
@@zoran-horvat "Whatever the application used"? Eeh, I don't know. Let's say you have an existing dataset from another system (SYSEXT) that you would like to integrate into your system (SYSPROD). This dataset has entities with attributes (in the RDBMS sense): SYSEXTItemID, SYSEXTItemTitle. You want to store these entities in your DB as well. In that case you may say these are application-generated IDs; I can get such a scenario. You see how I view PKs: they are specific to the DAL of your DB (SYSPRODDB). IMHO the DAL's responsibility is to save, operate on, and retrieve data consistently from your data store. The implications of such an approach: 1. Only the DAL controls how data is saved and retrieved. 2. It must control indexes, constraints and relations as needed inside the DAL. In your scenario you may as well have EF Core generate sequential IDs in the following way: define the table as SYSPRODID, SYSEXTItemID, SYSEXTItemTitle, and add an index on SYSEXTItemID if needed, but keeping in mind your goal of adding records quickly, you probably want to avoid that. In that case you would remove the design dependency between the PK and the IDs generated by the application. Even if the application changes its behavior, the DAL implementation stays stable. If I encountered such an issue, I would keep my DAL fully responsible for consistency and performance and would remove the coupling between DB design and the application, as it would lead to other issues; teams may misuse and abuse such approaches. Also, this does not help if an item is deleted or SQL is restarted due to maintenance, on-prem or in the cloud. Fragmentation is still possible, and index maintenance should help more than anything else.
GUIDs will kill your index. B-trees are optimized for rightmost (append-only) writes. Use ULIDs instead, because they're monotonically increasing. I'm 99% sure Azure SQL requires a clustered PK. EF can be set up to allow auto-increment in indexes. This is the best option, unless you're building mobile apps that require offline operation.
Ok, but what about creating a sequential GUID in the app? There are NuGet packages that enable that (SequentialGuid by jhtodd, for example) and also code examples on Code Project by Jeremy Todd ("GUIDs as fast primary keys under multiple databases"). I have used this code in the app for almost a decade now, and it hasn't failed me (yet). Would this solve the problem, i.e. have your cake and eat it too: a clustered index with an app-generated sequential ID?
That would resolve the problem, but with a caveat: every node that generates the IDs would have to know the condition and, on top of that, use the same algorithm. I fear configuration errors more than conceptual ones, especially as the resulting bug is a nonfunctional one, therefore hard to reproduce and catch. On a related note, I plan to make a video on strongly-typed IDs with EF Core. The strong ID type would be the ideal place to lock the GUID generation algorithm, though it remains true that the C# module containing that decision would have to be in sync with any other source of IDs.
It is tricky to view this in terms of benefits. It is rather a reaction to other forces when designing a distributed system. So, in practice, you will need to cope with situations where applications already have the need to generate IDs, and letting a database do it for them is a subpar solution.
So SQL Server generates the GUIDs in a way that each GUID is sequential. Intuitively, it seems that it's less random, and so it's more likely to create collisions. Letting the app generate GUIDs seems safer. Is my intuition correct? Or am I missing some context?
@@Dalendrion Yes, ordered is a better term for that. There are several standards for GUIDs, at least one of them guarantees order on top of universal uniqueness (that is the principal trait of that standard).
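On the collision worry, a back-of-envelope birthday-bound check in Python (122 random bits is the v4 figure; the approximation is the standard one):

```python
import math

def collision_probability(n: int, random_bits: int = 122) -> float:
    """Approximate chance of at least one collision among n random IDs."""
    # Standard birthday approximation: p ~ 1 - exp(-n^2 / 2^(bits+1)).
    return -math.expm1(-(n * n) / 2.0 ** (random_bits + 1))

p = collision_probability(10**12)  # a trillion random v4 GUIDs
print(p < 1e-12)                   # still vanishingly unlikely
```

Ordered variants such as UUIDv7 keep 74 random bits next to the timestamp, so a similar bound applies among values sharing a millisecond; the ordering itself is not what makes collisions likelier.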
Don't, don't, a thousand don'ts. You have just turned your Books table into a heap, which in MSSQL absolutely kills reading performance. While the problem you describe with non-sequential GUIDs is correct, leaving the table without a clustered index is much, much worse. A simple index reorganization will fix the GUID problem as your table grows. Stick it into the maintenance plan (you DO have one, right?) and call it a day.
Did you just explain we should never use heap tables before asking what queries we plan to execute? You know better than that, I hope. For example, that "absolutely kills reading performance" depends on what we query, right?
@@zoran-horvat You ask me to provide context when you yourself provided none. In fact, that's actually the main objection I make to your video: the lack of context. Sure, Heaps have a place, sure non-clustered PKs have a place. But giving a 'general advice' to drop the Clustered when client-generated Guids are used as PKs without any warnings or caveats can put inexperienced developers in a much worse place than a fragmented clustered index would. I'm absolutely sure you know better too. 😉
@@SantiagoBurbano-su9oe I think I gave quite a sufficient amount of context when I said that the unsorted IDs are coming in from an external source. That condition is a given at the beginning of the demo. The rest is a consequence.
Couldn't you seed your ID with some sort of sequential ID in a format similar to {SequentialID}_{GUID}? So something like a Linux-based time index. That way the record always gets sorted to the end of the file. This would give you the benefits of clustered indexing and collision avoidance. If you did get a few rows generated on the same time index, then the re-sorting would be fairly efficient. In my opinion.
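A minimal Python sketch of that idea (the format and field widths are my assumptions, not any standard):

```python
import itertools
import time
import uuid

# A {SequentialID}_{GUID} key: the time-based prefix keeps new keys sorting to
# the end, and the random GUID tail guards against collisions. The counter
# disambiguates keys created within the same clock reading.
_counter = itertools.count()

def timestamped_key() -> str:
    seq = (time.time_ns() << 20) | (next(_counter) & 0xFFFFF)
    return f"{seq:030d}_{uuid.uuid4()}"  # zero-padded so text order = numeric order

keys = [timestamped_key() for _ in range(100)]
print(keys == sorted(keys))  # lexicographic order follows creation order
```

The prefix makes the key wider than a plain GUID, which is the tradeoff; ULID and UUIDv7 pack the same idea into 128 bits.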
You can use a specialized algorithm that guarantees sequence, but that opens a different set of issues. All components that you might allow assigning IDs must guarantee to use the same algorithm, for example. Any later change is not possible, and so on. All these limitations stem from the principal problem that we have separated the source of IDs from the storage that depends on them. None of the issues appear if we lift the artificial constraint of the storage, that is the point.
Btw, I think clustered UUIDs are not that bad if you insert/delete regularly, because the gaps in the index will stabilize and only minor reordering is needed. In fact, the easier clustered lookup will make up for it in some cases, and you don't run into one-sidedly balanced index trees, or old records keeping a lot of empty heap pages dangling.
Great video as always! What about using ULIDs instead? By the spec they are sorted by definition, since the first part of a ULID is the encoded timestamp. In that case you might be able to keep using the clustered index, even if the application generates the values. ULIDs are great! Unfortunately they are not yet commonly used.
It is a good idea. The only issue with types that are sorted by time is when used for versioning - once the system truly becomes distributed, small discrepancies among timers will cause incorrect sorting of versions and lost updates. However, the concept alone is very useful.
@@berndeckenfels When a time sequence is used to version changes in a distributed system, then we risk the case when two updates are reversed because the node that effectively generated a later update could have its timer late. In that case, the earlier update could overwrite it. That is one of the failure modes for distributed databases.
@@zoran-horvat It's mainly a lack of understanding of how databases work and of how to properly normalize a DB for transactional data. Plus, databases are boring. Writing C# code is far more fun than writing scripts. But you gotta eat your veggies if you're gonna get dessert.
@@zoran-horvat I want to clarify that I'm a full-stack guy who happens to enjoy the DB side. And I'm not being all fire and brimstone on the subject. Just from my experience, every project that I was pulled into regarding CF had a lack of experience on the data-structure side. Ordinarily, the CF stuff was what was stalling the project, for the exact reasons you outlined. A stored procedure with an output parameter is faster and more natural than fighting the issue you outlined. Also, not many DBAs want to grant that level of permission to an external program. Now, to your question, I can't answer that. It may be external business pressures, or having an expert at CF who mandates the practice; it could be a lot of things. But a majority of managed-code folks that I've worked with don't really understand how an RDBMS handles data. Most managed-code collections tend to be handled iteratively, while database engines handle data in bulk. I always tried to treat data the same way the data engines do.
Yes, sure, but that doesn't change the fact that every mainstream RDBMS supports transactional DDL. Actually, missing such a fundamental feature in 2024 either makes it outdated or a niche technology with an accent on something else (which is fine).
This is really bad advice as a blanket statement. I mean, there are certainly benefits to skipping the clustered index, especially for inserting stuff, but heap tables come with many, many disadvantages in very common scenarios.
@@gunnarliljas8459 "Many, many disadvantages" is bad advice as a blanket statement. So many systems are built using this idea, including quite a few where I worked, and the things people fear never happened. For example, the default table organization in Oracle is heap - are there many, many disadvantages to using Oracle?
Sorry, but this video provides absolutely no context about the circumstances in which using GUIDs is slow. It isn't even slow by default, and it can be just as fast as or faster than ordered sequential keys because it prevents congestion. This video also does not provide any data on what exactly is "slow". Don't take it from me, just go watch a video with actual data and experience to back up what they are saying, which is: you should not worry about using GUIDs. They are just as fast in almost all situations without any kind of index shenanigans: ua-cam.com/video/nc4CMo7VSPo/v-deo.htmlsi=2dPyPAT3rX2OGHOn
My video is not on the topic you are mentioning. I would appreciate it if you could watch my video as being what it is, rather than as not being on the topic of some other video. My video is about a very particular situation, when a programmer unintentionally leaves the clustered index on the key column and then intentionally uses an out-of-sequence source of IDs. The situation naturally occurs when using GUIDs. Contrary to what you have imposed, the video you have commented on is not about index performance. After reading your comment two times, I have the impression that you have missed the theme of my video entirely, particularly the point that out-of-sequence IDs cause relocation of a large number of table rows on every insert, an effect that has no relationship to fragmentation, specificity, uniqueness, or any other property of indexes.
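That relocation effect is easy to see in a toy Python model. Counting element shifts in a sorted array stands in for the data movement; a real clustered index splits pages rather than shifting whole rows, so treat the numbers as directional only:

```python
import random
from bisect import bisect_left

def shifts_for_inserts(keys):
    """Insert keys into a list kept sorted, counting elements shifted right."""
    table, moved = [], 0
    for k in keys:
        pos = bisect_left(table, k)
        moved += len(table) - pos  # everything after pos moves to make room
        table.insert(pos, k)
    return moved

n = 2000
sequential = shifts_for_inserts(range(n))                   # always appends at the end
scattered = shifts_for_inserts(random.sample(range(n), n))  # random arrival order
print(sequential, scattered)  # 0 shifts vs. on the order of n*n/4
```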
@@zoran-horvat And that video takes a deep dive into clustered indexes. There is no one best practice; it depends on how you use your data. Heavy inserts are slower on an ordered clustered index because you are creating a hot spot on the last page. From the author of the video, from another source: "First, I ask that you realize my video isn't just about GUIDs. I use random GUIDs to emphasize many myths and points about all kinds of indexes. Shifting gears back to your question, when you're working on an almost pure OLTP platform, as I say in the video, logical fragmentation isn't going to matter at all. You're going to be looking up one row at a time, and that means a trip in the B-tree from the root through the intermediate levels to the page in the leaf level where your row is. If the index is (for example) 4 levels deep, then you'll do 4 8k page reads for every lookup IF the index can be used. If not, then a scan of the leaf level is going to occur, and I probably don't need to tell you that can be millions of pages. Also, random GUIDs aren't any better or worse than IDENTITY columns when it comes to finding things UNLESS you're using those things as previously identified keys. They're both surrogate keys and mostly tell you nothing about the data in the row. In both cases, they work great for OLTP IF you already know which key to look up. For everything else, you're going to need a different column or two to find what you're looking for. The key thing to remember when it comes to clustered indexes is that "It Depends" and there is no panacea. If your system IS mostly OLTP then, like I said, random GUIDs and IDENTITY columns are great even as a clustered index.
If, however, you're doing "range" lookups by (for example) dates, then the better choice would be to put the clustered index on the date column as the first key and the unique surrogate column as the second key column, so that your SQL doesn't have to do "row lookups" to get the information not carried in a non-clustered index with a date key. Even that isn't a universal recommendation because, just like with an IDENTITY column, having a clustered index on the date will likely create an insert "hot spot" at the logical last page of the clustered index. Even worse, as I mention in the video, if such inserts (which all go into 100% page fullness and then create another page) are followed by any UPDATE that makes the row larger (which I refer to as "ExpAnsive Updates") before you can rebuild your index at a lower fill factor, you're going to get massive page splitting and the resulting fragmentation. Of course, random GUIDs don't have either of those issues because all INSERTs are fairly equally distributed across multiple pages (usually 1 page per row if the table has at least 1 page per row being inserted). EVERYTHING is a tradeoff when it comes to the clustered index. If someone starts a recommendation about clustered indexes with "It's a 'Best Practice' that all clustered indexes should be...", just walk away, because they are completely wrong unless they end it with "assigned with great caution because there is no panacea". I also used to believe the old advice that "every table should have a clustered index". Again, I'll now say "It Depends", especially since ALTER TABLE REBUILD came out in 2008. Yep, I know about that nasty old RID being added to non-clustered indexes instead of a nice narrow INTEGER or BIGINT, but still, "It Depends". So, for OLTP, "It Depends". For analytics, "It Depends" on something else.
Even within analytics, each table may have (likely WILL have) different requirements to maximize performance and minimize resource usage. I'll also end this with: logical fragmentation may or may not make a difference for reads (you need to test), will usually make a difference for writes because even supposed "good" page splits are 4 to 43 times more resource intensive, as proven by Paul Randal, and page density should almost always be a primary concern. It's very bad to have a low page density in many cases and very bad to have a high page density in others. Again, I know it sounds like a cop-out on my part, but "It Depends" and "Must Look Eye" are the only truths when it comes to indexes and the lack thereof. Ah, there's one more truth: no matter what you decide on, it will be a compromise, especially for clustered indexes and heaps :D"
EF. Duh. It is such a pain. It would be OK if it just worked with data instead of messing with the structure! I'm a developer, and in most cases I have a few architects above me; we most definitely know the structure better, and how to update it.
This channel is proof that sometimes the most valuable advice is not on some big channel with millions of subscribers, but on a small channel like this. I cannot express how grateful I am for all your content and advice. You are extremely valuable to the programming community. Thank you so much.
Hey, don't speak like that! I want my channel to grow, too :)
@@zoran-horvat I really hope you become as successful as you want to be on YouTube! I was just thinking that the demographic for advanced programming discussions is not large. That's why we see channels that explain the basics receive way more views than channels that showcase advanced knowledge. If channel views were proportional to the knowledge of the creator, I'm sure you'd be in the top 10 YouTubers already. Anyway, I was just trying to express my gratitude for your videos and everything you teach us. Again, thank you!
@@zoran-horvat You have my sub mister... and epic name
Thanks Zoran. A note, though, that there are a number of advantages to a clustered index. Some watching this video might be led to believe that it's about insertion performance, but there is more to it. In fact, I would suggest it's about read performance. Many large tables represent transactions (e.g. an order); over time the older transactions are rarely accessed, and it's the active, current ones that are accessed concurrently. Having these records physically "clustered" greatly increases the chances that they are already in memory when accessed. It also reduces the read effort a little, because the leaf index pages *are* the record pages.
So whether to use a clustered index or not is a decision to be made carefully, and not just on the basis that a developer may prefer to allocate the ID themselves rather than let the DB do it. That said, there are approaches that give the best of both worlds, that is, keep the index clustered and generate keys in the application that are highly likely to be sequential.
You have good points here, thanks! Especially the one on reading the data sequentially.
@@zoran-horvat SQL Azure forces you to add a clustered index on your table(s). That said, it doesn't have to be the primary key :)
@@mannetjie3704 I am not sure that SQL Azure really forces you to use a clustered index. I'd rather say it takes it as a default, which is the same as in SQL Server. Are you sure about that?
And yes, you can build a clustered index on any column you wish; even nullable and non-unique columns can take a clustered index. The rule that makes sense is that one table can only have one clustered index.
It depends on the use case. Seeking a clustered index saves the extra key lookup in the main table, but the table itself being the index also means that typically fewer rows fit in one page, and as a result, a clustered index seek causes more IO than a seek in a "lean" index.
The best case is when you have a "lean" index with few rows that is still covering. That might not be possible for every use case.
@@cdoubleplusgood That is true, but this video is not about removing a clustered index. It is about a bug that may happen when there is a clustered index. To fix the bug, you must either guarantee the sequence or drop the clustered index. That is the message.
Looking from the other end: if the clustered index is required (e.g. for the performance benefits it offers in some scenarios), then we must be very careful when generating ID values.
Protip for people who want to drop primary keys:
If you make a change to it in SSMS' Table designer, you can have it generate a script, and it will output all the corresponding drop/create statements for you.
I use this a lot when I clean up databases, as it's often faster to reverify foreign keys (minutes) than to leave them active during mass data operations (days).
Especially if you have foreign keys that aren't a primary predicate in your indexes, which is often the case.
That is a very useful tip, thank you!
If I had a wishlist for SQL Server, right at the very top would be some sort of mechanism that optionally deferred validating referential integrity until commit.
@@billy65bob That is an interesting thought. Do you know of some other database that can do that already?
@@zoran-horvat I would not be surprised to learn that OracleDB does.
But some googling suggests that both OracleDB and PostgreSQL support 3 levels of it on all constraints.
Those levels being Never (e.g. UNIQUE), optionally during Statements (e.g. bulk insert with FOREIGN KEY to self), and Until Commit.
With the level being set when the constraint is defined.
love your sense of humor Zoran! 😊
Thank you Zoran! Nice video. I always try to avoid using these GUIDs and use ints, but sometimes I have to use them.
I cannot believe how much I learned in 15 minutes. This is one of your best.
Glad to hear that.
Thank you very much for this video, Zoran! I was not aware about the nature of clustered indices. The explanation helped me to revise the DB configuration on a current project I am working on. Very helpful!
Thank you for making this topic accessible! The explanation on the performance pitfalls of using GUIDs with clustered indexes was particularly insightful.
The brilliance of this video cannot be overstated. Well done.
A lot of data to move when inserting rows in the middle of the clustered index -> this is just a page split, and that only when a page is full. Not super-ideal, but there is an upper bound to the cost.
Thank you for sharing, Zoran.
As you're aware, removing the clustered primary key entails additional overhead for each key retrieval. While this action resolves index fragmentation, it introduces an I/O penalty with each key access.
An alternative solution I propose is the adoption of ULID or UUIDv7 (with a personal preference for the former). These options ensure the monotonicity of generated GUIDs. This quality not only renders clustered primary keys feasible and time-sequential processing effective, but also enriches the ID with additional time-information that could prove beneficial in some scenarios.
You would have saved a lot of migrations ;)
Yes, the use of a UUID standard that guarantees sequence is preferable, but be careful if the system is truly distributed. The monotonicity could fail if node clocks are out of sync.
On the other hand, a clustered index forgives small discrepancies, i.e. the performance loss will usually be immeasurable if it only has to move a few rows at worst. That is another hidden gem of clustered indexes.
Your voice and manner of talking are amazing. Thanks for the valuable advice.
I just want to emphasize how professional and helpful this video was. Please keep it up and thank you. 👏
I am entranced by your voice, @zoran-horvat! It was like you were telling me a story. I could listen to you all day long.
Also, great informational video :)
Could the original setup work, but with Ulid instead of Guid? So we can keep the clustered index?
Clear and concise explanation; best video I've seen in quite a while 💯
In summary, a GUID-type primary key column should not be a clustered index when the GUIDs arrive out of sequence. SQL Server avoids the issue by generating sequential GUIDs itself. And this performance issue has nothing to do with Entity Framework; it also does not matter whether the app or the DB generates the GUID, only whether the values arrive in order.
Curious to know whether all kinds of DBs have this kind of issue or not? Let me know if I misunderstood any part.
@@manmohanmundhraa3087 The problem is universal - a clustered index requires a sorted key sequence to remain performant.
I love these videos. While I don't code in c# I do find them remarkably informative and useful.
Really useful, as I had created a clustered GUID PK because SQL Server suggests it by default. Thanks a lot.
Following the advice of this video, I asked the DBAs about this at my work, and I was surprised to learn that they don't rebuild indexes in production. They said index fragmentation has a negligible performance impact when the storage medium is non-mechanical. Statistics, however, are still important. So random GUIDs may take up more space, but they don't think it's a big issue as long as you give the index plenty of fill factor.
@@tomtoups My experience is also that the impact is negligible. I never saw a performance issue with GUIDs. Therefore, if the IDs are generated by the client, this is my default decision. It never failed.
The problem is not the index fragmentation after inserts. The problem is the page splits during inserts. It's easier to insert at the end and just append to an existing structure than to insert somewhere in the middle and have to rearrange the entire structure.
Fill factor is always 100%. It should not be any lower unless there is some weirdly specific reason.
@@zalmar5855 Your statement is ridiculously wrong. You never tested it in reality. In Oracle, for example, the default organization is a heap with a B-tree index. Have you ever heard anyone saying Oracle "straight up murders performance of every query", as you have put it? Of course not.
Fantastic! Your voice makes it sound like a fairy tale :)
Thank you Zoran. Indexing is a topic for which very few EF Core videos exist. Please create more videos explaining indexing, so that we can keep inserting and reading data in a performant way as the database gets large.
There are many channels, about SQL Server for example, where you can find excellent videos on those topics. I am not sure I can explain indexes and performance even remotely as well as some videos I watched in the past.
@@zoran-horvat I have a question: does the problem you mentioned only happen when you let the database create GUIDs as well as creating your own? How does this apply if you always create your own GUIDs and make the index non-clustered?
Thank you so much...I encountered this exact problem....when the table got big...
That is when you find out...
What were your timings? How long would an insert take when you noticed the problem?
Alter table is a DDL instruction and cannot be included in a transaction. Only DML is protected by a transaction.
DDL statements in SQL Server execute in a transaction, with a few exceptions like CREATE/DROP/ALTER DATABASE and the same for FULLTEXT INDEX.
You can also mix DDL and DML statements in the same transaction.
That is not limited to SQL Server. Other databases also support transactional DDL, most notably Oracle.
@@zoran-horvat didn't know. Thanks
That's not true for SQL Server is it?
@@eveneveneven SQL Server supports DDL in transactions.
But what if the database is read-heavy, and you search data by id: guid?
Yes, insert would be slow, O(log n) instead of O(1), but the read would be O(log n) instead of O(n).
It's a trade off.
It is a tradeoff, but not with those asymptotic complexities.
The insert of out-of-sequence IDs into a table with the clustered index takes O(n) time and read takes O(log n).
The insert and read from a table with a common key is O(log n) in both cases. So it is actually O(log n) vs. O(n) tradeoff, which becomes tragic on any table with more than, say, 100k rows and absolutely unsustainable on a 1M rows table.
The point is that there is no arithmetic juggling that would make the relocation of half a table on every insert even remotely as performant as anything else one could imagine, unless the table is very small, hence not important in any analysis.
@@zoran-horvat > The insert of out-of-sequence IDs into a table with the clustered index takes O(n) time
Ok, fair enough: finding where to insert the data is O(log n), but actually moving everything and inserting is O(n).
For tables where I expect a lot of rows, and where I'm not going to use the GUID id as the most-used where-condition, I don't think it's a problem.
It would be fun to see what the actual read and write speeds are for 1k, 10k, and 100k+ row tables with a clustered and a non-clustered GUID id.
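As a rough toy model of that experiment (Python, counting element relocations in a sorted array rather than timing real SQL Server pages; `shifts_for` is an invented helper), ordered keys append for free while scattered keys relocate on the order of n²/4 rows in total:

```python
import bisect
import random

def shifts_for(keys):
    """Count how many existing rows must move when each key is inserted
    into a sorted array, a toy stand-in for a clustered index."""
    table, moved = [], 0
    for k in keys:
        pos = bisect.bisect(table, k)
        moved += len(table) - pos        # rows relocated to make room
        table.insert(pos, k)
    return moved

n = 10_000
sequential_cost = shifts_for(range(n))                # like NEWSEQUENTIALID()
random_cost = shifts_for(random.sample(range(n), n))  # like random Guid.NewGuid()
# sequential_cost is 0; random_cost is in the millions for n = 10_000
```

Real benchmarks would add page-level effects (splits, caching, fill factor), but the asymmetry between append-only and scattered inserts already shows up in this model.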
How do other databases such as Postgres or even SQLite handle this?
From what I know, Postgres also supports clustered indexes.
Postgres has something similar, but it is not applied by default. By default, the PK is a "normal" index.
Similar for Oracle. They have something called an "index organized table". But that is also not used by default.
Perfect explanation, very kind to share this lesson, big thanks.
Thank you so much for your work. You are really amazing teacher.
Do you recommend making some other column with an incremental value as clustered or is it fine to have a table without it?
That would be an optimization and, as such, it is valid in certain scenarios.
I have intentionally tried to avoid getting into the details of measuring and optimizing read and write performance because that would make it an hour long video at least, but it seems that the question of performance is inescapable...
An alternative that lets you generate PKs in the application while keeping a clustered index, and is almost guaranteed not to cause page fragmentation, is to use Snowflake IDs.
0:47 There is no guarantee whatsoever that there will never be collisions 💥 however, it is unlikely.
A GUID is 128 bits; some of those bits have defined values and cannot change, and there are several issues that can make them non-unique. The default new-GUID implementations are even predictable.
Thanks, very enlightening. So what will be the default ordering of the data at the table/page level now that we do not have a clustered index? ... I guess it doesn't concern us, since we now have a non-clustered index on the GUID column, so we're able to find a GUID efficiently.
@@Dalamain There is no ordering at the table level in that case. It is just a heap, and rows are found using an ordinary index external to the table.
Some definitions of GUIDs produce ordered sequences, which may bring clustered index back into the game. However, you might need to ensure to only generate them on one node, which defeats the idea once again.
Yeah, I'm glad you showed this, because I remember that when I was a junior lots of my co-workers just generated the ID themselves (this was before everything moved to GUIDs) so they wouldn't have to find a way to look up what they had just saved, or get the database to return the new ID. But when you have a distributed system this becomes a problem, in part because if every client is creating a GUID there is a slight possibility of a collision. It is extremely rare, depending on the algorithm with which the ID is created, and collisions are more common with auto-incrementing integer keys; GUIDs have sort of become the standard to avoid this. But again, because of the way networks work, I would rather the server define these values than the client 99% of the time. Otherwise you might get values created that could be security issues: a function doing an insert could add something with a client-supplied GUID, and then you might have unwanted data being added. .NET has done a good job of trying to sanitize these things and make them harder to happen, but it's one of those things that's always been in the back of my head: there's a reason why some things are best left on the server.
The possibility of a collision in GUIDs is truly negligible, so you shouldn't factor it in. There are issues that happen far more frequently in distributed databases, and we still have the means to handle them.
Yeah I feel like I didn't really word what I was trying to say very well.
Please do not remove the clustered index from a table; it will straight up murder the performance of every query. Removing the clustered index makes the table a heap. It also clears all plans from the cache that use it and forces SQL Server to recompile the queries, leading to brutal CPU pressure, while the table being a heap means SQL Server will have to read and cache more pages, leading to memory pressure. Unless you want your DBAs to murder you, please do not remove a clustered index from a table, especially if your root problem is coming from the table being too large.
Use a non-clustered index on a GUID that uniquely identifies a record between systems, and keep some kind of incremental index (GUID or Identity int) for the "internal" clustered primary key. You don't even have to store the external GUID on the table if you don't want to; you can create a mapping schema or mapping table. The purpose of using GUIDs for unique identification between multiple/distributed systems is to remove dependencies from each other. Don't change the DB structure to accommodate this.
@@zalmar5855 Another commenter said DBA confirmed no issues ever noticed after removing a clustered index. Now figure that.
I also confirm there is no performance penalty except in a few extremely specific corner cases.
What RDBMS are you referencing?
This does not seem like very good advice, unless you have a small database or don't care about select/update/delete performance.
Imagine having a really large warehouse and when ever someone gives you some new data to "insert" you put it in a box and label it with your primary key.
Since your key is ordered (typically ascending), you need to place the box at the right spot. If your key comes after the last box, then you just put the box right at the end, but if it's somewhere in the middle, you now have to move all the boxes that come after your key (this is called page splitting) to make room for the new one.
Now imagine that same warehouse but without the clustered index.
You get some new data, but instead of putting the data into a box and labeling it, you just put the data on whatever shelf has room.
You are not even assured that all the data is in the same place. So when your "client" wants their data again you have to run all over the warehouse to gather up the pieces in order to get them their data.
To put it simply: this fragments data on the disk, making it slower to find the data for selects, updates, and deletes.
The real solution to this problem is to make an integer or bigint column as the primary key and then make it an identity column.
If you insist on having a Guid as your identifier then add that as a separate column with a nonclustered unique index.
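The warehouse analogy above can be put in code. A toy Python simulation (`avg_fill` and `PAGE` are invented for illustration; real B-trees likewise special-case rightmost appends, which is exactly what a sequential key benefits from): half-splits triggered by scattered keys leave pages partially empty, so the same rows occupy more pages.

```python
import bisect
import random

PAGE = 50  # toy page capacity, in rows

def avg_fill(keys):
    """Insert keys into fixed-capacity sorted 'pages', splitting a full
    page in half, except for appends at the very end, which just start a
    fresh page (the free ride a sequential key pattern gets)."""
    pages = [[]]
    for k in keys:
        # find the page whose key range should hold k
        i = next((j for j, p in enumerate(pages) if p and k <= p[-1]),
                 len(pages) - 1)
        page = pages[i]
        bisect.insort(page, k)
        if len(page) > PAGE:
            if i == len(pages) - 1 and page[-1] == k:
                pages.append([page.pop()])        # rightmost append: nothing moves
            else:
                mid = len(page) // 2              # page split: half the rows move
                pages[i:i + 1] = [page[:mid], page[mid:]]
    return sum(len(p) for p in pages) / (len(pages) * PAGE)

n = 10_000
seq_fill = avg_fill(range(n))                    # near 100% full pages
rand_fill = avg_fill(random.sample(range(n), n)) # roughly 70% full pages
```

More half-empty pages means more pages to read and cache for the same data, which is the read-side cost the analogy describes.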
First of all, my advice is NOT that you should not use a clustered index, as your comment seems to imply. The video is about a very specific situation when an application generates GUIDs using an algorithm that does not guarantee the sequence and there is a clustered index. One must go.
Regarding your analysis, the actual performance depends on read and write patterns. Neither the clustered nor non-clustered index is better in advance without knowing the statistics. Your analysis does not generalize and, hence, does not lead to any conclusion.
@@zoran-horvat What pattern are you talking about when SQL Server just sticks the data wherever it can find available space on the disk?
@@krm1t What you said is too wrong to answer.
@zoran-horvat Sorry, it seems YouTube decided not to show all of your message. A refresh added the first part, which I hadn't read.
That is true. You don't say it's general advice. That's a misunderstanding on my part.
I'm not that fit in databases. But does that mean that I can basically generate my (gu-)ids in my software and use those ids to link entries of different tables before storing the data in the database? We ran into the problem that in Entity Framework 6, when referencing objects, we have to transfer large parts of the database because of the direct object links. The clients don't have direct access to the database but go through a server which handles the communication with the database. Being able to directly generate the ids myself would help us to mitigate some problems we have during multiple transfers.
If I got your description right, that is precisely the scenario where we choose to let the application generate IDs. The application then has control over the times when it creates objects and the times when it writes them to the database; those two don't have to happen simultaneously.
It is also useful when sending objects between systems, as all the systems will associate the same object with the same ID.
Thanks for the reply@@zoran-horvat
TLDR; if you can't ensure ascending GUIDs order, don't use clustered index, you're welcome.
Watch the video with more attention. You're welcome.
@@zoran-horvat Sorry, I've rephrased it; now it's both more precise and shorter :) Don't get me wrong, many are interested in a full video, but most don't have 15+ minutes just to figure out what the whole thing is about and whether it's worth spending the time. Unfortunately, it's rare that youtubers provide a sufficient description, or they even try to bait with incomplete titles.
@@kocot. It is not possible to state the problem precisely in a 50-character title, nor in the description for that matter. My channel consists of advanced-level videos and, trust me, quite a few programmers watch them very carefully. I expect them to, because I don't know how to pour an advanced piece of knowledge into a head that lacks patience. Look at me: I have been learning programming for 35 years now.
Regarding the problem addressed in this video, the assumption is that the change cannot fit the existing table with data in it. That is a regular case in practice. Either the IDs are coming from another system that just didn't care to make them sortable, or the table already contains a lot of data, like a million rows, produced with a sortable UUID generated by a different algorithm, so the new algorithm does not extend the existing sequence.
If either is the case, we must make cuts. One is to give up the clustered index, and that decision is tied to a few other considerations, such as write and read statistics. The other option is to recreate the table and generate new IDs using the new algorithm. Both decisions are hard to make, and this video does not suggest which one is the right one. Actually, chances are that it will be the DBA to decide.
The upcoming video, which is in the publishing queue right now, will demonstrate both possibilities on two tables in the same database.
What do you think about ULID lib ? Does it solve the problem ?
There is the new video demonstrating that possibility: ua-cam.com/video/MJcaxi3bOO8/v-deo.html
The caveat is that ULID will not solve the problem if the table already contains many rows. ULIDs are sorted left to right, whereas GUIDs used by SQL Server have a different structure and sort order. Therefore, the sequence of ULIDs would not extend the sequence of IDs already in the table, once again causing the relocation of great numbers of existing rows upon every insert.
@@zoran-horvat thank you for your answer. I've already started watching your latest video; I'm glad the answer is there.
Actually I have a question. I am currently working with EF Core in a .NET Core project, and I want to prevent values below 0 from getting inserted/updated into a database table (basically expecting an exception thrown by EF Core in that case). I have read that there is a "CHECK" clause that lets us validate a value before updating a column, but achieving it with EF Core seems difficult. I have noticed there is a Fluent-API call in the EF Core module, "HasCheckConstraint()", which seems to be the way to go, but it's marked as "[Obsolete]". I want a method that I can use together with migrations. I have also tried the "[Range(0, int.MaxValue)]" annotation, but with no success. Any idea how I can do it in an elegant and robust way, supporting migrations with EF Core? THANK YOU!
Can anybody explain why SQL-generated GUIDs are "in growing sequence" (time 3:53)?
I see they have a random part in the first place, so they will not be lexicographically sequential...
Am I missing something?
NEWSEQUENTIALID() as the default.
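To expand on that answer: SQL Server's documented comparison of uniqueidentifier values gives the final byte group the highest precedence, which is why NEWSEQUENTIALID() output looks unordered when read left to right. A hedged Python illustration (`mssql_guid_sort_key` is a made-up helper approximating the documented group precedence; within-group endianness details are glossed over):

```python
import uuid

def mssql_guid_sort_key(g: uuid.UUID) -> bytes:
    """Approximate SQL Server's uniqueidentifier ordering: byte groups
    are compared with the LAST group first (10-15, then 8-9, 6-7, 4-5,
    and 0-3 last), rather than plain left-to-right."""
    b = g.bytes
    return b[10:16] + b[8:10] + b[6:8] + b[4:6] + b[0:4]

low = uuid.UUID("ffffffff-ffff-ffff-ffff-000000000000")
high = uuid.UUID("00000000-0000-0000-0000-000000000001")
# Textually low > high, but SQL Server compares the last group first:
assert mssql_guid_sort_key(high) > mssql_guid_sort_key(low)
```

So a sequence can be "growing" for SQL Server even when the leading, randomized-looking part of the printed GUID is not.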
Wow this was a cool video to learn about. Has general applications anywhere :)
You said what you did, but you didn't say why you did it. Why don't you want sequential GUIDs? Where are the benchmarks? Before the change, what were the insert and select times and the query plan generated? After you made the change, what were the insert and select times and the query plan generated? Why did you decide you don't want a clustered index? You may speed up writes, but surely reads will be slower.
It depends on what you think that I said. I see that many viewers didn't see the central message of the video, which is probably my fault in passing that message.
So, the message is: IF the application has been generating IDs for some time AND there was a clustered index THEN... we begin.
The video doesn't question any of the decisions because those are driven by entirely different forces. It only questions the combination of decisions.
@@zoran-horvat That does help to clarify the starting situation, and is still interesting as I definitely think others have been in this situation, although I do think it would've been nice to see some details on the improvement that the change made
Thank you Dr. Xavier, you helped a lot.
Hi Zoran, thank you for all this knowledge. Is there a video on your channel that talks about domain modeling design?
In many videos I cover different aspects of domain modeling, though only a few videos mention that intent in the title.
Brilliant! I didn't know there was a problem (and always used app generated guids before). One day a colleague would have suggested db-generation and I would have accepted if not for your caution...
It's the other way around: database-generated IDs are normally safe, because the tools would set up the database correctly. It is the application-generated IDs that may sometimes cause trouble if not aligned with how the database is created.
I have an important question
you've written a custom migration there, which means that in order to squash all the migrations in the future, your colleague would have to find the custom code among the sea of other auto-generated migrations and then recreate it later
from my experience, it's a huuge pain in the butt
how would you deal with custom migrations? is it ok to just put them in a different folder (under same namespace)? or is there a more elegant way for that? thanks!
There is no difference between custom and generated migration code. Each migration is immutable and EF will not write into it again.
As I pointed out, this is not a problem caused by EF. I have been using other migration tools before, and they had the same problem: you must add custom migration code to make certain changes.
On the other hand, the whole issue only happened because I had made a bug in the earlier configuration.
@@zoran-horvat I understand that migrations are immutable once created, but how do you squash migrations that you cannot recreate using EF Core? By squashing I mean creating one big migration from hundreds of smaller ones (which will lose the custom migrations if you're not careful).
@@lettuceturnipthebeets790 I’m trying to understand your question. What Zoran did was generate a migration, and then add to it. You can add whatever you want to the migration, even down to custom SQL statements. But that is the complete definition of your migration. If you later want to create a migration script, say an idempotent script which you will generate and apply within a CI pipeline, all you’re doing is combining the migrations you have defined. The migrations aren’t generated again. You’re just combining, or squashing if I am understanding your question correctly, into a single SQL migration script.
I didn’t know that EF Core and SQL Server generate sequential GUIDs when properties are empty. I started defaulting back to using longs over GUIDs. I prefer the sequential nature, and a clustered index is important for me.
When a clustered index is required, then the number of options is limited. That is the case in systems with heavy inserts.
The way EF Core handles that is to initialize the table field with NEWSEQUENTIALID(), the same way as we would do when writing the CREATE TABLE script by hand.
I would keep an auto-incremented long as primary key (clustered) and a GUID with unique constraint (unclustered).
That is one possible design. But only one out of several valid ones. We should always keep the requirements in mind, both functional and nonfunctional, when making design decisions.
For example, a clustered index can cause write congestion in some scenarios and work like a charm in others. A table with a common scattered index can cause read delays in other scenarios. There is no one size that fits all.
Why not just have an auto incremented long as Primary Key (generated by the db) and don't even have the Guid?
@@MrWTFNETWORK Because it might be that the application is generating the ID and the database is only storing it. That is frequently the case in distributed systems. In such a situation, autoincrement numbers are not an option.
@@zoran-horvat You made me think of a scenario where I store invoices and invoice line items, a 1-to-many relationship, and we were using GUIDs for the primary key of both tables. In reality, to access line items, I see that the line items table could instead have a composite primary key of invoice id and line item id (an incrementing integer generated in memory instead of a GUID, as there can only be max 100000 items). What do you think about this approach? We only need to get 2000 items in a paged way until we allow clients to get all line items via API. We would use a continuation token being the id of the last item and always sort by the id.
Hi Zoran, thanks for the video on this subject, certainly interesting. The only thing that I'd suggest would be a little more focus on the database itself, regarding default column values and the use of the NewID() and NewSequentialId() functions. While EntityFramework can take care of this if you're using EF to generate your database, this isn't the case if you're either using another ORM or working against an already existing database. Just felt that this bit of detail had been glossed over.
Yes, that could be a valuable topic for another video. I thought that covering other aspects in this video would hurt its purpose.
blaaast!... I've got an error...
love it^^
You note a performance impact of using application-generated GUIDs on a clustered index; however, I would love to know what sort of impact that might be. For instance, against a 1-million-row x 10-column table, just as an example, to get the percentage difference on read and write ops.
The percentage doesn't matter. It is the fact that an operation that takes near-constant time would be substituted with an operation that takes time proportional to a half of the table's size on average. One simply does not do the latter.
@@zoran-horvat I take your point that it is self-evidently the correct approach; however, I have applications with this issue and I need to prioritize my time. If I go back and spend the dev time on the migrations, will I see a 5% or a 40% perf increase? Thanks for the amazing videos, big fan 😁
@@MrValhallans The point is that when this issue manifests, it makes every row insert a pain so obvious that measuring it makes no difference, especially knowing that every next day will only make it slower. How do you prioritize work on a system where inserting a row takes a second?
I wonder why people prefer GUIDs instead of auto-incremented integers as PKs.
Doing emergency hand-corrections in a DB with FKs causes headaches then.
An application cannot generate an autoincrement number. It must rely on the database to communicate the ID back to the application. Not to mention that autoincrement anything is almost invariably out of the question in distributed systems.
Now list ten situations that come to your mind where waiting for the autoincrement number is an issue, and you will have ten use cases for GUIDs. It shouldn't be hard.
@@zoran-horvat Adding related data to many tables at once requires knowledge of the PKs and FKs, but I always relied on EF or SQL's SCOPE_IDENTITY for that and never set them by hand.
Ok, it's just easier having keys before and after the action.
The problem occurred when the app needs to create its own GUIDs. But if I understand you correctly, there is no problem with letting the DB generate the GUID with a clustered index, if the app does not want to create GUIDs, and it will perform better. Correct?
The problem was with IDs generated out of order when there was a clustered index. The database generates GUIDs in the ascending order when instructed, and so the problem would not occur then.
However, that is not to say that database-generated IDs or clustered index perform better. The overall performance is also the result of read and write patterns exhibited by the application.
Thank you very much, Zoran! Is it possible to create sequential GUIDs in the application so we don't have to give up on the clustered index?
There is a class that apparently does that (learn.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.valuegeneration.sequentialguidvaluegenerator?view=efcore-8.0).
I must admit I have never used it. To make it worse, I have never even read its documentation carefully. The critical question to answer is whether the sequence remains ascending if you move the application to a new server. Though there exists a standard definition of GUID that puts a timestamp at the front, thus making any sequence of GUIDs ascending, there is a nasty corner case when two servers are out of sync with time. The problem is the same as when using a timestamp to version entities, where it may sometimes lead to lost updates!
For all these reasons, I normally let the database generate IDs when a clustered index is a requirement (in systems with heavy inserts, e.g.), and drop the sequential IDs request otherwise.
Thank you so much @@zoran-horvat
Just wondering how many deployment systems actually use the Down migration. I haven’t seen them yet; mostly an idempotent migration script is generated for the Up. Am I missing out?
It is a safety valve in case you suddenly figure something out after deploying to production. I remember events of that kind in my career. Very rare indeed, but when it happened it was always driven by panic reactions at the customer, so they would request immediate rollback of the deployment and only then an investigation.
Interesting. Why is it better to want to switch to the non clustered approach? Why not let the DB continue to create the ordered GUIDs itself?
It is not the question whether it is better or not, but which node in the distributed system is in charge to generate or, let's say, know the value of the ID.
If it is an applicative node, a service, or anything of that sort, then keeping the storage in charge of generating IDs requires an additional trip back to the application before it can continue doing what it planned.
It is also hard to rely on the database in polyglot systems where some of the storage nodes might not even support generating IDs.
This is why you just let SQL Server generate the GUIDs for you in a simple setup or if you have a clustered DB setup or need to generate IDs in the application for other reasons, you don’t bother with GUIDs and instead use a better, sequential ID algo like Snowflake or some variant.
Not only is it more compact, saving data file and index space, but you can keep a clustered index in cases where your reads will be more frequent than writes, both of which will improve performance and lower costs.
If it only were "just"...
Do not have a table in SQL Server without a clustered index!! Select performance will be degraded! If you want a GUID id column for a primary key, use NEWSEQUENTIALID() to generate it as the default value. That is why code-first is tricky! You must understand the DB.
@@nothingisreal6345 You are selling one possibility as a general solution, using shaky arguments that are easy to prove wrong.
BTW how come that that silver bullet of yours doesn't apply to Oracle?
Some years ago, I worked on a project using SSRS; I was writing T-SQL scripts to improve report turn-around constantly. Scary stuff and absolutely hated it.
How do non-clustered indexes compare in performance in the large?
It depends on what you need. Someone pointed out how fast clustered indexes are if we tend to read the rows in chronological order, which is the central request in some applications. The clustered index will then make such reads substantially faster than with rows scattered all over the data pages.
On the other hand, most business applications do not have such a request. We usually read the rows per request, and requests are randomly driven by end users. That makes clustered indexes less important, even to the degree that maintaining them becomes a liability with no benefit in return.
@@zoran-horvat Yes, I found that clustered indexes worked faster with time-ordered data; happened to be quantities of mushrooms collected at distributed farms per picker over a week and aggregates per farm per cost-center.
I usually avoid using guids as primary keys in the database completely.
I assign standard integer database owned primary keys and add a second id column that is the guid.
Thoughts on this?
Then you lose the greatest benefits of using GUIDs - the ability to generate the value before persistence and the ability to persist entities in the storage that does not support generating IDs.
@@zoran-horvat Such a setup doesn't preclude the benefits of application generated guids.
Applications with such a setup are usually ignorant of the incrementing Ids when they're not resolving foreign keys in their queries.
It does come at the cost of requiring 2 indexes at minimum; a clustered one for the incrementing ID, and a unique nonclustered one for the GUID; which does have some performance implications.
@@billy65bob Yes, a better formulation in my reply would be "you don't take these benefits for free".
Hi Zoran, thanks for another interesting video. I like to add a clustered index on another column, which I make a shadow property in EF and an identity seed to ensure the order. What are your thoughts on this approach?
I have used similar tricks in the early days, but then found out that I favor simpler configurations... Though, the idea you mentioned will work.
Why would sql server sort by the 4th segment?
The structure of UUIDs is defined by the corresponding standards. If you need to analyze the values, then you take the parts as defined in the standard. There is no freedom to decide.
@@zoran-horvat Yes, but SQL Server does sort the bytes in a fixed order, and normal v1 UUIDs like you showed do not sort chronologically under that ordering. V6 is better in this regard.
Resetting the UUIDs to UUIDv7 would be all you would want to do. Removing the clustered index is not the best solution when we have UUIDv7.
UUIDv7's initial bytes are a timestamp and the right side is random, allowing you to sort them and immediately tell which ones were created first to last. If these are meant to be customer-facing and being able to tell when an entry was created is a security concern, then the primary key should be an auto-incrementing integer (with a clustered index) and a second column should hold a UUIDv4 (with a regular index). As a side note, including a UNIQUE constraint on the second column would cause unnecessary overhead.
That is true, with a caveat. What used to be encapsulated in the database, both creating and consuming IDs, now becomes someone else's problem. It is entirely possible that a future update, such as accepting IDs from a node or service that was not entitled to that before, could break the rule, causing a sudden drop in performance.
I would advise either keeping the dependency on the algorithm in one place (the database), or not having it at all.
@@zoran-horvat True, either the database needs to have an implementation of UUID7 that it uses to generate the IDs or (when the UUID7 must come from an external source) all pathways that result in an insert into that table must go through the same abstract layer that generates a UUID7 on the caller's behalf. Most systems have a "repository" layer that would suffice but once you have more than one system with access to insert into that database, each one becomes an entrypoint for bad (non-UUID7) IDs.
Creating a shared library for interacting with tables of that database would be my next path forward to greatly reduce errors.
Not good in SQL Server, since there you want the LEFT side to be random.
Great content!
This is one reason I do not use migrations for SQL Server and instead go with a database first approach. The VS SSDT project allows for publishing changes to the DB schema and handles all the drops and creates needed. It's especially useful when iterating changes over a dev database and then pushing them all to a prod one when ready.
Also, removing the clustered key results in a table classified as a HEAP, which has its own pros and cons. If you only ever look up rows via the GUID and never do a range lookup, there won't be much difference, but if you ever need to look up a range of rows in a table, you'll want to create a clustered key. Also, depending on the concurrency and throughput of the application, sometimes a clustered GUID column may help in the long run, as long as you are doing database maintenance such as index rebuilds or defrags on a consistent basis.
Very Nice!
There must be some downside to using non-clustered indexes... maybe queries or sorting become less performant?
It is a tradeoff, like any other decision. However, *if* IDs are generated from the application, then the clustered index is most probably not applicable.
@@zoran-horvat I am totally new to this. I am even confused about why they use UUIDs instead of numeric increments. I was thinking about using C# for part of our backend administration panel, but we use MariaDB. I assume that would be possible, the driver is surely supported, but would it be feasible? Every video I see about C# and databases has Guid in it. I'll probably just keep using PHP or maybe Go, even though C# does have the most advanced language features and syntactic sugar.
Also, modern PHP is OK, no bully. We have static analysis with PHPStan, traits, and some multithreading/async features when using Swoole. I would still love to use C# though.
Great video, I really liked it. Just an off-topic question: are you planning to create a course on Udemy or Pluralsight that would go deeper into the programming paradigm you use? I took your OOP course and Design Patterns course. Now I would also love to see an FP/OOP course. Will such a course ever be created? Thank you.
I have one such course in the queue about OOD and domain modeling, but no dates when I might actually start working on it.
Thank you @zoran-horvat, I really like your content, please keep posting valuable stuff.
The new index will end up getting fragmented over time and will require some maintenance.
Every nonclustered index does, and databases are still live and kicking.
I was on a project a few years back where every id in every table was a GUID. One of the cross tables had a three-way composite primary key. Inserts would take 20-40 seconds. I was assigned to "fix" the perf issue and they were not happy with my solution: DROP THE INDEX. I tried telling them the design was bad.
That is the point, the DBA job must be done well for a database to operate well.
Use INT ids for joins WITHIN the DB. Assign GUIDs as actual entity IDs so you can sync across DBs (think Dev, QA, UAT and PROD) without using identity insert. If you want to use GUIDs as actual IDs, then use a time-based GUID that's more suitable to your clustered index.
If you are going to query by GUID then you got nothing... That is the reason why we use external ID as the key (in those scenarios). Less code and better database performance, all in one package.
Perfect video Zoran!
Looks like that initial approach with generating guids on database level is quite simple and we may forget about these nuances of how sql server works. We just don't need to assign new guid values for entities in our application code, am I right?
Actually, I believe it is better to generate GUIDs in the application because that is their greatest strength: global uniqueness. It lets us generate them in the frontend as well. It lets us choose whether to store entities in aggregate storage in one node or bounded context and in a relational database in another.
When you put everything on the table, leaving the database to generate values on insert begins to look less attractive.
Please correct me if I'm wrong here, but with this approach the Publishers table doesn't have any indexes now, because the primary key PublisherId is no longer a clustered index. Will our search and filter operations on this table now be slower than with a clustered index?
@@ДмитрийКондратенко-б5ь There must always be an index on the key, or otherwise the database would perform a full table scan on every access to a row. The default in SQL Server (and only in SQL Server) is that it keeps the table sorted by the ascending key, a.k.a. the clustered index. But if you opt for a "regular" unique index, as I did in the video, then there will be a separate index built and updated on every insert.
So, the answer is that both Books and Publishers tables will have unique indexes on their keys at the end of the video.
@@zoran-horvat Thank you for your explanations, I became smarter after watching your video and reading your answers!
Is that a problem with SQL Server specifically? If so, why not use something better? Also, GUID ID is redundant. You can just say GUID as both ID's are the same thing.
This is not the problem, nor is it with SQL Server specifically. Many other databases support clustered indexes - Oracle and Postgres do, for instance. The video is showing one typical mistake that can happen. A few commenters confirmed that happened to them in the past, and it also happened to me.
You would typically notice the problem when an insert into a table becomes unreasonably long, such as a second or longer. I have heard horror stories of an insert taking half a minute before someone noticed that things were not right.
Regarding the naming, you are right that GUID is the ID. It is a good point one should keep in mind. However awkwardly, I was reading it in C# - Guid Id, i.e. the Guid type and the Id property.
@@zoran-horvat Sequential indexes shouldn't cause that problem if the database is implemented correctly. I can only assume that they're using the GUID directly as a key instead of hashing it. This kind of thing is why increasingly I write my own. I keep seeing these kinds of logical fails in "professional" software.
@zoran
"on SQL Server, when a GUID property is configured as a primary key, the provider automatically performs value generation client-side, using an algorithm to generate OPTIMAL SEQUENTIAL GUID values." from EF core docs.
That is true. The problem occurs when GUIDs are generated in the application and only transferred to EF. In that case, the algorithm will be whatever the application used and, unless a specific sequential algorithm is configured, that will be an out-of-sequence algorithm, as it was in my demo.
@@zoran-horvat "Whatever the application used"? Eehh, I don't know. Let's say you have an existing dataset from another system (SYSEXT) that you would like to integrate into your system (SYSPROD). This dataset has entities with attributes (in the RDBMS sense): SYSEXTItemID, SYSEXTItemTitle. You want to store these entities in your DB as well. In that case you may say these are application-generated IDs; I can see such a scenario. Here is how I view PKs: they are specific to the DAL of your DB (SYSPRODDB). IMHO the DAL's responsibility is to save, operate on, and retrieve data consistently from your data store. The implications of such an approach:
1. Only the DAL controls how data is saved and retrieved.
2. It must control indexes, constraints and relations as needed inside the DAL.
In your scenario you might as well have EF Core generate sequential IDs in the following way:
- define table like SYSPRODID, SYSEXTItemID, SYSEXTItemTitle
- add an index on SYSEXTItemID if needed, but keeping in mind your goal of adding records quickly, you probably want to avoid that
In that case you would remove the design dependency between the PK and "IDs generated by the application". Even if the application changes its behavior, the DAL implementation stays stable.
If I encountered such an issue, I would keep my DAL fully responsible for consistency and performance and would remove the coupling between DB design and the application, as it would lead to other issues. Teams may misuse and abuse such approaches.
Also, this does not help if an Item is deleted or SQL Server is restarted due to maintenance, on-prem or in the cloud. Fragmentation is still possible, and index maintenance should help more than anything else.
GUIDs will kill your index. B-trees are optimized for appends on the right-hand side. Use ULIDs instead, because they're monotonically increasing. I'm 99% sure Azure SQL requires a clustered PK. EF can be set up to allow auto-incrementing keys in indexes. This is the best option, unless you're building mobile apps that require offline operation.
Ok, but what about creating a sequential GUID in the app? There are NuGet packages that enable that (SequentialGuid by jhtodd, for example) and also code examples on Code Project by Jeremy Todd ("GUIDs as fast primary keys under multiple databases"). I have used this code for almost a decade now in the app, and it hasn't failed me (yet).
Would these solve the problem, i.e., have your cake and eat it too: keep the clustered index with an app-generated sequential ID?
That would resolve the problem, but with a caveat: Every node that generates the IDs would have to know the condition and, on top of that, use the same algorithm.
I fear configuration errors more than conceptual ones, especially as the resulting bug is a nonfunctional one, and therefore hard to reproduce and catch.
On a related note, I plan to make a video on strongly-typed IDs with EF Core. The strong ID type would be the ideal place to lock the GUID generation algorithm, though it remains that the C# module containing that decision would have to be in sync with any other source of IDs.
Is there a benefit to creating the GUIDs in code compared to creating them on the RDBMS side?
It is tricky to frame this in terms of benefits. It is rather a reaction to other forces when designing a distributed system. In practice, you will need to cope with a situation where applications already have the need to generate IDs, and letting a database do it for them is a subpar solution.
The UUID key problem is solved by using Type 1 UUIDs. These are ordered. This is how Cassandra solves it. You're welcome.
Yeah sure, but you forgot the problem statement. You just messed up the database in exactly the same way as I did in the demo. You're welcome.
@@zoran-horvat Explain how? I've been using Type 1 UUIDs for years. Never had a problem. Maybe this is SQL Server specific?
@@KangoV That is your homework, not mine.
BTW, I did explain it in the video. Two times.
So SQL Server generates the GUIDs in a way that each GUID is sequential.
Intuitively, it seems that it's less random, and so it's more likely to create collisions.
Letting the app generate GUIDs seems safer.
Is my intuition correct? Or am I missing some context?
"Sequential" is the wrong word, I suppose. "Ordered" covers it better.
@@Dalendrion Yes, ordered is a better term for that. There are several standards for GUIDs, at least one of them guarantees order on top of universal uniqueness (that is the principal trait of that standard).
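To put that intuition in numbers: per RFC 9562, a version-7 GUID keeps 74 random bits next to the 48-bit timestamp, so collisions are only possible among IDs minted within the same millisecond. A back-of-the-envelope birthday bound (my own sketch, not from the video):

```python
from math import exp

def collision_probability(ids_per_ms: int, random_bits: int = 74) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-k^2 / 2^(bits+1))."""
    space = 2 ** random_bits
    return 1 - exp(-ids_per_ms * ids_per_ms / (2 * space))

# Even a million IDs generated within one millisecond barely registers.
print(collision_probability(1_000_000) < 1e-9)  # True
```

So the ordered prefix costs some randomness, but not enough to make collisions a practical concern.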
Don't, don't, thousand don'ts.
You have just turned your Books table into a heap, which in MSSQL absolutely kills reading performance. While the problem you describe with non-sequential GUIDs is real, leaving the table without a clustered index is much, much worse.
A simple index reorganization will fix the GUID problem as your table grows. Stick it in the Maintenance Plan (you DO have one, right?) and call it a day.
Did you just explain we should never use heap tables before asking what queries we plan to execute? You know better than that, I hope. For example, that "absolutely kills reading performance" depends on what we query, right?
@@zoran-horvat You ask me to provide context when you yourself provided none. In fact, that's actually the main objection I make to your video: the lack of context.
Sure, Heaps have a place, sure non-clustered PKs have a place. But giving a 'general advice' to drop the Clustered when client-generated Guids are used as PKs without any warnings or caveats can put inexperienced developers in a much worse place than a fragmented clustered index would. I'm absolutely sure you know better too. 😉
@@SantiagoBurbano-su9oe I think I gave quite a sufficient amount of context when I said that the unsorted IDs are coming in from an external source. That condition is a given at the beginning of the demo. The rest is a consequence.
Couldn't you seed your ID with some sort of sequential ID in a format similar to {SequentialID}_{GUID}?
So something like a Unix-style time index. That way the record always gets sorted to the end of the file. This would give you the benefits of clustered indexing and collision avoidance.
If you did get a few rows generated on the same time index, the re-sorting would be fairly efficient, in my opinion.
You can use a specialized algorithm that guarantees sequence, but that opens a different set of issues. All components that you might allow assigning IDs must guarantee to use the same algorithm, for example. Any later change is not possible, and so on.
All these limitations stem from the principal problem that we have separated the source of IDs from the storage that depends on them. None of the issues appear if we lift the artificial constraint of the storage, that is the point.
@@zoran-horvat This is great information. Thank you!
That's great
That’s why you use Postgres ;p
No, that's not why you use Postgres. One must be stupid to make such a random technical decision.
Btw, I think clustered UUIDs are not that bad if you insert/delete regularly, because the gaps in the index will stabilize and only minor reordering is needed. In fact, the easier clustered lookup will make up for it in some cases, and you don't run into one-sided unbalanced index trees or old records keeping a lot of empty heap space dangling.
Great video as always!
What about using ULIDs instead? By the spec they are sorted by definition, since the first part of a ULID is the encoded timestamp.
In that case you might be able to keep using the clustered index, even if the application generates the values.
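For reference, here is a from-scratch sketch of the ULID layout (48-bit millisecond timestamp plus 80 random bits, encoded as 26 Crockford base32 characters); treat it as an illustration rather than a vetted implementation:

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_ulid() -> str:
    """48-bit Unix-ms timestamp + 80 random bits, 26 base32 chars."""
    ts_ms = int(time.time() * 1000)
    value = (ts_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])  # take 5 bits at a time
        value >>= 5
    return "".join(reversed(chars))

a = new_ulid()
time.sleep(0.002)
b = new_ulid()
print(a < b)  # True: plain string comparison follows creation order
```

Because the timestamp occupies the most significant bits, lexicographic order on the encoded string matches creation order, which is exactly what a clustered index wants.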
ULIDs are great! Unfortunately they are not yet commonly used.
It is a good idea. The only issue with types that are sorted by time is when used for versioning - once the system truly becomes distributed, small discrepancies among timers will cause incorrect sorting of versions and lost updates.
However, the concept alone is very useful.
Normal UUIDs are not sorted by timestamp, but there are special variants which are.
@@zoran-horvat Why do gaps in timers cause data loss?
@@berndeckenfels When a time sequence is used to version changes in a distributed system, then we risk the case when two updates are reversed because the node that effectively generated a later update could have its timer late. In that case, the earlier update could overwrite it. That is one of the failure modes for distributed databases.
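A tiny model of that failure mode, with made-up timestamps and a last-write-wins rule standing in for the database's conflict resolution:

```python
def last_write_wins(events):
    """Resolve concurrent updates by keeping the highest timestamp."""
    return max(events, key=lambda e: e[0])[1]

# Node A writes at real time 100 with an accurate clock.
# Node B writes later, at real time 103, but its clock runs 5 units slow,
# so it stamps its (newer) update with timestamp 98.
events = [(100, "from A"), (98, "from B, actually newer")]
print(last_write_wins(events))  # "from A": B's newer update is lost
```

The update with the larger timestamp wins regardless of real-world order, which is why timestamp-versioned distributed systems need synchronized clocks or logical clocks instead.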
@@zoran-horvat ah ok, yeah that’s not really related to uuid but yes that’s a problem if you don’t use vector clocks or similar
Code-first solutions are rarely the answer. Data first... business rules next.
Why would you use external code to do data migration? This makes my stomach hurt...
@@bfrytech Can you explain in plain words the reasons why data-first design methods were mainly abandoned in practice?
@@zoran-horvat It's mainly a lack of understanding of how databases work and knowledge of how to properly normalize a DB for transactional data. Plus, databases are boring. Writing C# code is far more fun than writing scripts. But you gotta eat your veggies if you're gonna get dessert.
@@bfrytech Those who had the understanding and knowledge have also moved in the same direction. Why was that?
@@zoran-horvat I want to clarify that I'm a full-stack guy who happens to enjoy the DB side. And I'm not being all fire and brimstone on the subject. Just from my experience, in every project I was pulled into regarding code-first, there was a lack of experience on the data-structure side. Ordinarily, the code-first stuff was what was stalling the project, for the exact reasons you outlined. A stored procedure with an output parameter is faster and more natural than fighting the issue you outlined. Also, not many DBAs wanna grant that level of permission to an external program.
Now, to your question: I can't answer that. It may be external business pressures, or having an expert at code-first who mandates the practice; it could be a lot of things. But a majority of the managed-code folks I've worked with don't really understand how an RDBMS handles data. Most managed-code collections tend to be handled iteratively, while database engines handle data in bulk. I always tried to treat data the same way the data engines do.
YouTube, why are my comments being deleted?
Did you add tags or links to them? YouTube regularly deletes those.
@@zoran-horvat I wrote N3wId from M@ss Tr@ansit
DDL statements cannot always be rolled back. For example, in MySQL, Postgres and Oracle a DDL statement is implicitly committed.
Yes, sure, but that doesn't change the fact that every mainstream RDBMS supports transactional DDL. Actually, missing such a fundamental feature in 2024 either makes it outdated or a niche technology with an accent on something else (which is fine).
This is really bad advice as a blanket statement. I mean, there are certainly benefits to skipping the clustered index, especially for inserting stuff, but heap tables come with many, many disadvantages in very common scenarios.
@@gunnarliljas8459 "Many, many disadvantages" is bad advice as a blanket statement. So many systems are built using this idea, including quite a few where I worked, and those things people fear never happened. For example, the default table organization in Oracle is heap. Are there many, many disadvantages to using Oracle?
Sorry, but this video provides absolutely no context about the circumstances in which using GUIDs is slow. It isn't even slow by default, and it can be just as fast as or faster than ordered sequential keys because it prevents congestion. This video also does not provide any data on what exactly is "slow". Don't take it from me; just go watch a video with actual data and experience to back up what they are saying, which is: you should not worry about using GUIDs. They are just as fast in almost all situations without any kind of index shenanigans:
ua-cam.com/video/nc4CMo7VSPo/v-deo.htmlsi=2dPyPAT3rX2OGHOn
My video is not on the topic you are mentioning. I would appreciate it if you could watch my video as being what it is, rather than as not being on the topic of some other video.
My video is about a very particular situation when a programmer unintentionally leaves the clustered index on the key column, and then intentionally uses an out-of-sequence source of IDs. The situation naturally occurs when using GUIDs.
Contrary to what you have imposed, the video you have commented on is not about index performance.
After reading your comment two times, I have the impression that you have missed the theme of my video entirely, particularly the point that out-of-sequence IDs cause the relocation of a large number of table rows on every insert, an effect that has no relationship to fragmentation, specificity, uniqueness, or any other property of indexes.
@@zoran-horvat And that video takes a deep dive into clustered indexes. There is no single best practice; it depends on how you use your data. Heavy inserts are slower on an ordered clustered index because you are creating a hot spot on the last page. From the author of that video, from another source:
"First, I ask that you realize my video isn't just about GUIDs. I use Random GUIDs to emphasize many myths and points about all kinds of indexes.
Shifting gears back to your question: when you're working on an almost pure OLTP platform, as I say in the video, logical fragmentation isn't going to matter at all. You're going to be looking up one row at a time, and that means a trip through the B-Tree from the root, through the intermediate levels, to the page in the leaf level where your row is. If the index is (for example) 4 levels deep, then you'll do four 8k page reads for every lookup IF the index can be used. If not, a scan of the leaf level is going to occur, and I probably don't need to tell you that can be millions of pages.
Also, Random GUIDs aren't any better or worse than IDENTITY columns when it comes to finding things UNLESS you're using those things as previously identified keys. They're both surrogate keys and mostly tell you nothing about the data in the row. In both cases, they work great for OLTP IF you already know which key to look up. For everything else, you're going to need a different column or two to find what you're looking for.
The key thing to remember when it comes to "Clustered Indexes" is that "It Depends" and there is no panacea. If your system IS mostly OLTP then, like I said, Random GUIDs and IDENTITY columns are great even as a Clustered Index. If, however, you're doing "range" lookups by (for example) dates, then the better choice would be to put the Clustered Index on the date column as the first key and the unique surrogate column as the second key column, so that your SQL doesn't have to do "Row Lookups" to get the information not carried in a non-clustered index with a date key.
Even that isn't a universal recommendation because, just like with an IDENTITY column, having a Clustered Index on the date will likely create an INSERT "Hot Spot" at the logical last page of the Clustered Index. Even worse, as I mention in the video, if you do such inserts (which all go in at 100% page fullness and then create another page) and those are followed by any UPDATE that makes the rows larger (which I refer to as "ExpAnsive Updates") before you can rebuild your index at a lower fill factor, you're going to get massive page splitting and the resulting fragmentation. Of course, Random GUIDs don't have either of those issues because all INSERTs are fairly evenly distributed across multiple pages (usually 1 page per row if the table has at least 1 page per row being inserted).
EVERYTHING is a tradeoff when it comes to the Clustered Index. If someone starts a recommendation about Clustered Indexes with "It's a "Best Practice" that all Clustered Indexes should be...", just walk away, because they're completely wrong unless they end that with "assigned with great caution because there is no panacea".
I also used to believe in the old advice that "every table should have a clustered index". Again, I'll now say "It Depends", especially since that came out with ALTER TABLE REBUILD in 2008. Yep... I know about that nasty old RID being added to non-clustered indexes instead of a nice narrow INTEGER or BIGINT but still "It Depends".
So, for OLTP, "It Depends". For Analytics, "It Depends" on something else. Even within Analytics, each table may have (likely WILL have) different requirements to maximize performance and minimize resource usage.
I'll also end this with: logical fragmentation may or may not make a difference for reads (you need to test), it will usually make a difference for writes because even supposed "good" page splits are 4 to 43 times more resource-intensive, as proven by Paul Randal, and page density should almost always be a primary concern. It's very bad to have a low page density in many cases and it's very bad to have a high page density in others.
Again, I know it sounds like a cop-out on my part but... "It Depends" and "Must Look Eye" are the only truths when it comes to indexes and the lack thereof. Ah... there's one more truth... no matter what you decide on, it will be a compromise... especially for Clustered Indexes and Heaps :D"
@@jonijarvinen7833 My video does not take any dive on clustered indexes. My video is not about clustered indexes. OK?
EF. Duh. It is such a bitch. It would be OK if it just worked with data instead of messing with the structure! I'm a developer, and in most cases I have a few architects above me; we most definitely know the structure better, and how to update it.
Actually, EF did nothing here. It was just passing values between the application and the database with no modifications.
Can you talk normally?
No, never.
Many thanks for the excellent tutorials! Now, does the same principle hold true for the new GUID version 7 introduced in .NET 9 (Guid.CreateVersion7)?
@@mahfoudbouabdallah6286 Any GUID that guarantees a sequence is generally fine.