This survey is based on a suggestion from Jonathan Kehayias (blog/Twitter).

In this survey I'm asking two questions:

  1. When you're designing/implementing a database, do you manually create statistics or let SQL Server create them automatically?
  2. How do you maintain statistics? Manually or letting SQL Server do it for you? Or both? 

I'll probably collaborate with Kimberly on the editorial for this, as she's forgotten more about statistics than I'll ever know!

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers (comments are unfortunately moderated anyway, now that my blog gets a lot of spam). I'm very interested in your reasoning, but not until everyone else responds.

   

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

This is a question that came up at the Cleveland SQL Server User Group last night. An attendee had a wager with his colleagues about how the names of auto-created column-level stats are generated.

Let's find out. I'm going to create a table without any indexes and then do some queries.

USE master;
GO
CREATE DATABASE StatsTest;
GO
USE StatsTest;
GO

EXEC sp_dboption 'StatsTest', 'auto create statistics', 'on';
GO

CREATE TABLE t1 (c1 INT, c2 INT, c3 INT);
GO
INSERT INTO t1 VALUES (1, 1, 1);
INSERT INTO t1 VALUES (1, 1, 2);
INSERT INTO t1 VALUES (1, 1, 3);
GO

Now I'll do a little query. My database has AUTO_CREATE_STATISTICS turned out, so the query optimizer recognizes that it could use a statistic to potentially help choose a more efficient query plan more quickly - so it generates it.

SELECT * FROM t1 WHERE c3 < 2;
GO

SELECT [object_id], [name] FROM sys.stats
WHERE [object_id] = OBJECT_ID ('t1');
GO

object_id   name
----------- --------------------------
2073058421  _WA_Sys_00000003_7B905C75

I've got an automatically created statistic - but what does the name mean? Well, the 7B905C75 is the hex equivalent of the table's object ID. The Sys is because it was automatically created. The WA is because the SQL team is based in Washington State in the US (I'm absolutely serious!). The 00000003 is because the statistic is on column ID 3 of the table. I can prove this last one by doing a select on column ID 1 and checking the new statistic that is created.

SELECT * FROM t1 WHERE c1 < 2;
GO

SELECT [object_id], [name] FROM sys.stats
WHERE [object_id] = OBJECT_ID ('t1');
GO

object_id   name
----------- --------------------------
2073058421  _WA_Sys_00000003_7B905C75
2073058421  _WA_Sys_00000001_7B905C75

The new statistic has the exact same name as the previous one, but the first number is now 00000001 - clearly matching the column ID of the column I queried.

So there you go - wager settled!

Categories:
Statistics

Jack Li, one of the Senior Escalation Engineers in Product Support, just posted details of an interesting case over on the CSS blog - his article is here. It talks about index builds and rebuilds, but the issue is the same for both, so I'll just talk about rebuilds.

The jist of the problem is that index rebuilds can parallelize, but sometimes they don't parallelize vey well. Each thread gets a certain range of the index to rebuild, using the existing index statistics to divide the ranges equally between the threads. If there's massive data skew, then one thread can end up doing the majority of the work, leading to a long run-time. The case in Jack's post involved a 250 million row index where 150 million rows had the same (NULL) key value. This range has to be processed by a single thread - a single value can't be divided between two+ threads.

Now, this is understandable behavior by the database engine, but it relies on the statistics being up-to-date. That's a bit of a catch-22 - rebuilding an index updates the statistics, but if the statistics aren't up-to-date then the index rebuild might parallelize badly! I guess the solution is that if you know that you have massive data skew in your large indexes, update statistics BEFORE doing an index rebuild. And given what I've been hearing this week at SQL Connections about how badly statistics keep biting people, I'm leaning towards a different recommendation for those people who have lots of perf trouble caused by statistics and the potential for skewed data - rebuild all your statistics regularly, and only rebuild/reorganize fragmented indexes. Statistics just cause so many problems it seems.

Thanks

PS Kimberly has a lot more info about statistics over on her blog - I'm just starting to venture into that mine-field

Theme design by Nukeation based on Jelle Druyts