Kimberly and I were presenting at our local (Redmond) .Net Developers Association on Monday and the following question came up while Kimberly was talking about missing and extra indexes (paraphrasing): What’s the best non-clustered index to use for the query with a predicate WHERE lastname = ‘Randal’ AND firstname = ‘Paul’ AND middleinitial = ‘S’? Kimberly said that the order of the keys (e.g. lastname, firstname, middleinitial; or middleinitial, lastname, firstname; etc) doesn’t matter for this case. I thought about it for a second and then argued, saying that the most selective column should come first. We agreed to discuss with the group at the end, but I thought about it some more and realized (and admitted to the group) that she’s right – I should know better than to question Kimberly’s knowledge of indexing… :-) She’s right because for a pure equality query using AND for multiple predicates, the Storage Engine will seek straight to the first exactly matching record in the index (and then scan for more matches if it’s a non-unique index). It doesn’t matter what order the index keys are defined because the Storage Engine is looking for an exact match. When I started arguing, I was thinking about a phone book, which is ordered by lastname, firstname, middleinitial. You may think that a phone book is ordered that way because lastname is the most selective. Wrong. It’s because the lastname is what most people know – it just happens to be the most selective of the three choices. Most SQL geeks should be able to find Kimberly in a phone book by looking for Tripp, Kimberly. But what if it was ordered by middleinital? I’d have no problem finding Kimberly, but how many of you would remember that her middleinitial is L? Probably a few as we both use our middle initials in our public names. What about if it was ordered by middleNAME? Again, no problem for me but who how many other people know her middle name is Lynn? Then I started thinking about other queries and how they would play into the index choice to answer to the question above. If I also wanted to support a query with the predicate WHERE lastname = ‘Randal’, then having the left-most index key be anything other than lastname won’t work so well. If the key order was firstname, middleinitial, lastname then all the distinct lastname values would be spread through the index rather than being together. The index might still be used to satisfy the query if it’s the lowest cost index to use. However, having lastname be the leading key probably wouldn’t work very well for a query with a predicate of WHERE firstname = ‘Paul’ – that argues for having firstname be the left-most index key. Which should I choose? I probably I can’t have both in the same index, so maybe I’d have TWO non-clustered indexes, to support both queries. The answer depends on how often the various queries are used and the trade-off between how much of a performance gain the non-clustered index would provide against the performance drop of having to maintain it during DML operations. I hear time and again about people adding a non-clustered index for every column in the table, thinking that this will help – and my thinking is that this is wrong because these indexes can only satisfy a query where the only predicate is the column being indexed. I ran this argument past Kimberly and she added that these indexes could also be used if the column is chosen as the most selective in a multi-predicate query, and no other index has a lower cost than that one (a slim chance usually). Even what I though of as a simple case has caveats! So what’s the point of this post? Well, I wanted to show how indexing for one very simple query is pretty straightforward, but as soon as the number of different queries grows, and the query predicates get more complicated, indexing becomes more complex. You really have to know your workload and your data to know which columns are used, in what combinations, and how often – and then it helps to know how indexes are costed and used so that you can make intelligent choices about which indexes to define. This thought-exercise has really shown me that I didn’t know how much I don’t know about indexes – I know precisely how they work at the Storage Engine level but not too much about how they’re used by the Query Processor. I have new-found respect for Kimberly’s indexing expertise. Luckily she’s teaching a class at Microsoft called Indexing For Performance next week – I think I’ll attend :-)
2024: the year in books
Back in 2009 I started posting a summary at the end of the year of what I read during the year and people have been
7 thoughts on “How hard is it to pick the right non-clustered indexes?”
Hi Paul
Thanks for this article , very useful. I’d like to ask you one question.
As we all know , SQL Server keeps statistics on first (left-most index key) , so how does it help to seek straight to the first exactly matching record as you said,I mean does not SQL Server start first looking for statistics to find the column?
Thanks
"Microsoft SQL Server 2005 Unleashed", for all your indexing needs. Aside from discussing the on-disk format, it also goes into depth on how the optimizer rolls. If you don’t want that much reading up, studying execution plans will also give you a feeling for how things are done.
Or you could just submit a representative workload to the Tuning Advisor. The earlier versions weren’t much cop, but the Advisor that comes with 2005 does a respectable job of indexing. Still not what an experienced DBA would accomplish, but good enough for most scenarios.
@Uri: first of all, statistics are *not* used to find records; just to decide whether the index is selective enough to use in the first place. Only the index itself can be used to find records.
Second, SQL Server keeps statistics on *all* columns in the index for purposes of determining index *density* (the median selectivity). The first column is special because histogram steps (which approximate selectivity for ranges of specific values) are kept only for the first column. Regardless of the order of the columns here, the optimizer would decide that the index is sufficiently selective to use, because the combined selectivity is the same in all cases.
Hi Jeroen
I did not say that statistics are used to find records, but if statistics are bad SQL Server will probably generate a bad execution plan.
For the rest thanks fo confirmation.
@Uri: Believe me, I know all about bad statistics generating bad plans… One of our databases has a rather pathological collection of data where one table has a foreign key to another table, and the number of records joined this way varies: it’s either 1 or 50 (either is about as likely). There’s no middle ground.
As a result, the statistics gathered on the table will indicate that the most likely number of records obtained when joining is 50. As a result, when you perform the join with multiple parent records, the optimizer thinks that there will be many more records involved than there actually are, and as a result it chooses a table scan when it could have used an index. Generating the statistics so that it samples all values rather than only some of them fixes this issue. (Of course, that’s far from the only way of fixing it, but it is rather convenient.)
As long as you use an index for an exact match, however, the histogram statistics for that index don’t matter. Unless the index itself is bad (very unselective) the optimizer will almost certainly use it, no matter what the histogram for the first column says.
Any chance the class is open to the public? :)
Unfortunately not. The next public indexing class we’re doing is a two-day event in September in Scotland. We’ll be doing a single session at TechEd and Fall Connections on indexing as well.
Hi Paul,
How about of I have index like (Lastname, MiddleName, Pno) and prdicate is
Where Lastname=’xyz’ and middlename=’abc’ and Pno=123
do you think SQLServer uses the index? I checked it is not using it just want to confirm.
Thanks.
—
Farhan