Quickie this morning to start the day off. I saw a question on a forum: if I *have* to use a GUID and *must* have a primary key, should I make the primary key clustered or nonclustered?
Now, I'm not getting into the whole GUID vs. bigint identifier, or random GUID vs. GUID generated by NEWSEQUENTIALID(), so please don't comment on those issues, they're not relevant here. I just want to address the question – what kind of index should it be?
From a Storage Engine perspective, my answer is nonclustered. Here are three reasons why:
-
If the index is clustered, then the cluster key is immediately at least 16 bytes (the size of a GUID). This doesn't change the size of the clustered index records (as the GUID column has to be stored in the table anyway, and a clustered index IS the table), but it does change the size of the nonclustered indexes. All nonclustered indexes on the table must include the cluster keys, even of they are not explicitly part of the nonclustered index keys (I'll do a post on this later). This means the GUID is present in every nonclustered index record too. From this perspective, it would be better to use a smaller clustered index key and have the GUID primary ley be nonclustered so it's only present in that one nonclustered index.
-
Random GUIDs used as the high-order key cause index fragmentation. Their random nature means the insertion point into the index is also random. This causes page splits, which cause fragmentation and are *expensive*. (I touched on this a bit a few days ago in my post How expensive are page splits in terms of transaction log?.). With a random key value, it's hard to avoid page splits and fragmentation, although you can delay them somewhat using FILLFACTOR, but at the expense of using extra space. By making the GUID index nonclustered, you can delay page splits even further. The clustered index is the table, so the records are (usually always) larger than nonclustered index records. This means you can get fewer clustered index records on an 8KB page than nonclustered index records. With fewer records per page, you can do fewer random insertions on the page before a page split occurs. So using a nonclustered index for the GUID key means you can do fewer expensive page splits.
-
Given that whatever kind of index you create for the GUID key is going to experience index fragmentation, you're probably going to want to periodically remove the fragmentation as part of your database maintenance plan. It makes sense to try to limit the amount of resources used by the fragmentation removal operation (e.g. cpu, IO, disk space, transaction log space), and so the smaller the fragmented index, the better. A nonclustered index for the GUID key will be smaller than a clustered index, so if you choose a non-fragmentation-causing clustered index key, and confine the fragmentation to the nonclustered index, you can use fewer resources during database maintenance.
And there you have it. I'm sure some of you have seen pathological cases that disprove one of the above points, but my arguments are generalizations. Maybe this is a can of worms I've opened, in which case I look forward to the comments!
PS Brent did a great post about humor when blogging, the cartoon links he includes are great. Check it out here.
12 thoughts on “Clustered or nonclustered index on a random GUID?”
Hello Paul,
but what about the alternatives? In some cases you cannot choose BIGINT Identity or NEW_SEQUENTIAL_ID() because the Application doesn’t support it.
What do you think about adding a datetime column default(getdate()) and making this the clustered index? Eigher when the records shall be displayed by date most of the time or even when the records have no "natural order".
Günter
One potential extension to your point there is that if you do make that PK nonclustered, not only does make defragmentation cheaper, but you may not even need to defragment it very often in the longer term regardless.
From my experience at least, a heavily fragmented index that never gets scanned is rarely a real performance problem because fragmentation doesn’t exact a heavy toll on single-row seek operations.
Thanks for the comment but quite an orthogonal discussion from this post – and what I specifically said I didn’t want to get into. Way too much of a can of worms. Hassle Kimberly about that one :-)
Well, this is a big can of worms and what I’d say is that a GOOD clustering key should be: Unique, Narrow, Static and ever-increasing. I’ve blogged about that here: http://www.sqlskills.com/blogs/KIMBERLY/post/Ever-increasing-clustering-key-the-Clustered-Index-Debateagain!.aspx and in quite a few other indexing posts I go into even more details about why that’s a good idea.
However, in a large table where you are likely to have data that’s ever-increasing based on date, where you might end up wanting to partition on date and where you also do a lot of range queries – I typically argue for a date, id composite clustering key. I don’t usually use date alone as it’s not unique but when I add a column to uniquify it I want to add the smallest column I can. Ideally, an int (if that’s enough) but if you absolutely don’t have something that can uniquify the rows, you can rely on the internal uniquifier that SQL Server uses (internally every clustering key MUST be unique).
Anyway, check out my other indexing posts and certainly "hassle" (as per Paul :) me if you have more questions/comments!
Cheers,
kt
Great point Aaron – as long as the page density doesn’t get too low, it’s probably only being used for singleton-lookups anyway. Thanks!
If the primary key in all tables are the GUIDs, would it still be best to have the indexes on the GUID be non-clustered? Wouldn’t that impact perfomance on the joins and thus the querying?
Maybe, depends on the schema and queries – there are good alternatives to GUIDs that would solve both problems. But that’s an even bigger can of worms :-)
Hi Paul —
I in general hate using GUIDS for a PK and consider it to be simply bad design, *HOWEVER*, when you are writing
cross-platform applications (please, no religious debates – it is what it is and it’s made the owners of my company very very wealthy) the ugly GUID is simply the only feasible way to write a cross-platform piece of code.
As an alternative to the GUID, however, I would recommend that people look at implementing the Mersenne Twister algorithm as a DLL to generate unique key values. In our testing, we have created IIRC an 18 byte
numeric (whatever the boundary is for < 10 bytes) that generated no duplicate numbers over 80,000,000 calls.
Twister is about as close to random as you can get without using three randomly tuned radios generating white noise frequencies or, as would be preferable, nuclear decay. There are still issues with fragmentation, however you are dropping from a 36 byte key to a 9 byte key, thus enabling four times as many keys per page. We also looked at the ‘short GUID’, which is simply a GUID minus the dashes, but the savings in space weren’t enough to justify that approach.
Over a set size of 20M rows, a GUID clustered PK takes up ~1.2GB of space, a Twister numeric key takes up about 450MB of space, roughly a 3x decrease. For reference, a clustered identity starting at 1 uses a bit over 240MB of space.
The other point WRT to both a Twister and GUID PK is that both of the indexes were 100% (yes, 100%) fragmented – not so good for performance – and the identity was almost 0% fragmented. I have done a great deal of research into this area in looking for a replacement for the GUID in our company and Mersenne at a width of 18 bytes is the front runner. Using any less than 18 bytes for the Twister key quickly runs into a knee where the rate of duplicate generation is almost vertical beyond a certain point – at about ~2M keys for an 8 byte Twister key, ~5M for a 10 byte key.
I know this is an old post but so called GUIDs, if stored properly, don’t take 36 bytes… they take 16. They’re also quite random and very uniformly distributed so and you don’t have to mess around with an external DLL.
Dave – extremely interesting – thanks for posting.
"would be better to use a smaller clustered index key and have the GUID primary ley be nonclustered"
For a nonclustered index on a guid to work well, must there be a clustered index on some other column? It would take several columns as a group to be unique or close to being unique.
Not necessarily. The NC index will either contain the cluster keys or the heap RID of the base table record so in that respect, heap vs clustered index isn’t an issue. It really depends on the access patterns of the table whether a clustered index is required, but it has no real bearing on whether the NC index on the GUID will work well or not. In general, I’d say that a clustered index is desirable for many reasons.