Time flies when you're having fun, eh? This week Paul and I are in Dublin, Ireland delivering a deep technical "SQL Immersion Event" on Internals, Performance Tuning and Maintenance with our partner ProData (and having a blast - we have a really fun and very interactive group). And, today's an even more interesting day to be in Dublin as it's the 250th anniversary of the Guinness Brewery...everyone around Dublin will be toasting the most popular drink in Ireland today (Sept 24) at 5:59pm GMT. Unfortunately, our class runs until 6pm but Paul tells me that we're still going to celebrate. ;-)

For now, we'll just have to celebrate some SQL (I can hear minor grumblings everywhere :)...

Filtered indexes and filtered stats are an incredibly powerful feature in SQL Server 2008 - one that I've posted on before and will again. Today's post is about some of the gotchas with filtered index stats and filtered stats. Cutting to the punchline - you might have some trouble with their accuracy as your data changes. The SIMPLE solution is to update these specific stats manually and more frequently (meaning that you don't want to rely on the database option AUTO_UPDATE_STATISTICS). In fact, depending on your data volatility, you might want to update these statistics on a VERY frequent basis. Part of the question you'll have to answer is how much of the filtered set changes over the course of a day/week/month. And, in whatever frequency you reach 5-10% (let's say weekly), I'd suggest updating the statistics weekly. If you update 5-10% within the course of a day - then daily. Let me give you the full story...

Statistics are traditionally updated when roughly 20% (+ a minimum of 500 rows) of the data has changed. If you have a table with 100,000 rows then SQL Server will hit the threshold at 20,500 rows (just a tad over 20%) and a table with 10,000 rows will hit the threshold at 2,500 rows (which is 25%). Additionally, in SQL Server 200x (2000, 2005 and 2008) statistics are NOT immediately updated when the threshold is reached, instead they are invalidated. It's not until someone needs the statistic that SQL Server updates it. This reduces thrashing that occurred in SQL Server 7.0 when stats were updated immediately instead of just being invalidated. Another interesting point is what is meant by "20% of the data has changed?"... How is that defined? Is it based on updates to columns or inserts of rows? Of course the answer is... it depends - here, it depends on the version of SQL Server that you're using:

  • SQL Server 2000 defines 20% as 20% of the ROWS have changed. You can see this in sysindexes.rcmodctr.
  • SQL Server 2005/8 defines 20% as 20% of the COLUMN data has changed. You cannot see this unless you are accessing SQL Server through the DAC as it's in a base system table (2005: sysrowsetcolumns.rcmodified and for 2008: sysrscols.rcmodified).

Now, for regular indexes (those that do not include a filter) both strategies have pros and cons. Let's start with the con for SQL 2000... If you have a particularly volatile column then the rowmodctr will be increased quickly and then ALL statistics (even those that have not changed) are invalidated. Without a way to understand a column's volatility, you will invalidate some statistics too soon. So, 2005/2008 created a column modification counter to better define when a statistic should be invalidated. But, the con for SQL 2005/2008 (which is the pro for row-based algorithm of SQL 2000) is that a lot more data must change in a single column before the statistics get updated. If your modifications are fairly well balanced (across the table) then the row-based algorithm could suit you better. If your modifications are isolated (to a specific column) then the column-based algorithm will suit you better. But, neither is perfect... and, the new column-based algorithm poses another HUGE problem for filtered indexes.

For filtered indexes the threshold invalidation algorithm is tied SOLELY to the column and not just to the filtered set. So, if your table has 10,000 rows it takes 2,500 modifications in that column to update statistics. If your filtered index only has 1,000 rows... then you could theoretically modify this specific filtered set 2.5 times before IT would be updated. Yes, I realized that this sounds unlikely but it's really not. What if your filter is for ACTIVE = 1. It's only 1% of your table and it's the most active data set. You will effectively have to modify this set 20 times over before the statistics for your filtered index get updated.

But, having said that - I also have a simple solution. Don't wait... For filtered indexes (especially on volatile data) add a special UPDATE STATISTICS job that specifically updates their stats with a FULLSCAN. The good news is that they are likely to be smaller indexes (and therefore easier/less-costly statistics to update) and you'll clearly get a lot better uses out of them. Additional good news is that many filtered indexes are designed to be covering indexes, covering indexes aren't going to rely as heavily on statistics so there are many covered queries that won't care. The following query will get the list of all of your tables/indexes that have filters. Once you have this list you can create a loop to update statistics.

SELECT object_name(object_id) AS [Table Name]
       , name AS [Index Name]
       , stats_date(object_id, stats_id) AS [Last Updated]
FROM sys.
stats
WHERE has_filter = 1

So, there are lots of options and definitely a few things of which to be aware... filtered indexes and filtered stats are a powerful option to leverage, just make sure you keep those stats up to date!

Cheers... it's time for a pint of Guinness!

kt

PS - For more information on statistics, check out the SQL Server whitepapers:

OK, it seems as though there's A LOT of confusion about what steps are required for proper database maintenance. And, it seems as though some recommendations are being given as "quick fixes" without any real recommendation for root cause analysis. I'm not saying that the generalizations are horribly wrong but in many cases they're just too broad and/or unspecific to actually be useful (and, well, in all honesty, some of them are just really bad recommendations because they’re so ambiguous). And, in my random internet trolling for the day, I found 4 different references that I want to go through (which is why this is only Part I). For this post, I’ll focus just on Sharepoint.

First, what did I see that’s motivating this post?
I found the following KB article – which was referenced by numerous sites as recommended reading. And, without knowing a lot about SQL (and, that’s NOT meant as a dig at all – it’s just that most apps that sit on SQL don’t ever even recommend that you need to know SQL and I can argue certain aspects of that point as well BUT, wrt to maintenance, it can really become a problem if you don't know a few things about these tasks), it does seem like good reading: Information about the Maintenance Plan Wizard in SQL Server 2005 and about tasks that administrators can perform against SharePoint databases

Here is the part that over-simplifies picking what maintenance tasks to run vs. what not to run:

DIRECTLY TAKEN FROM THIS KB ARTICLE IS THE FOLLOWING:

We have tested these tasks and the effects that these tasks have on database schema and performance. The following table summarizes the results of these tests.

Task                                        Safe to perform this task?
Check database Yes
Reduce a database Yes
Reorganize an index Yes
Clean up the history Yes
Update statistics Yes. However, this task is unnecessary because the SharePoint Timer service performs this task automatically.
Rebuild an index No. The task does not restore existing index options before the rebuild operation. However, you can use scripts that restore index options.
Note This problem was corrected in SQL Server 2005 Service Pack 2.

We used the following criteria to determine whether a task was safe to perform:

  • Whether the task modified the database schema from its default state
  • Whether the task decreased performanceResults may vary depending on the environment.

However, if you use the Maintenance Plan Wizard to perform the tasks that are listed in the table as "safe to perform," you are likely to experience increased performance in SQL Server 2005.

The big problem is: this is just too little information about too many VERY important tasks!

Let me break this down task by task and give you a few other places to go for more information.

Check database

The check database task refers to DBCC CHECKDB. This is definitely an important part of any maintenance plan. And, it really is a safe task to run as it’s NOT corrective by default. However, there is nothing mentioned about how this command may completely flush your buffer pool as it reads all of the pages of all of the objects it’s checking. So, this might impact performance but, of all of the tasks, this is the safest to run and it’s definitely a recommended task.

If you want to learn more about DBCC CHECKDB, check out these things: 

  1. Read Paul’s blog post to his TechNet article on Database Maintenance. It’s a great overview of all of these maintenance tasks and will give you a good overview of what each one does. Here’s a link to the blog post that pulls all of the TechNet resources together: http://www.sqlskills.com/BLOGS/PAUL/post/TechNet-Magazine-Effective-Database-Maintenance-article-and-August-SQL-QA-column.aspx  
  2. Read Paul’s blog post on Myths around causing corruption – so that you can get better insight into where/why the actual corruptions are occurring. 
  3. Finally, if you’re really interested in the internals of CHECKDB and how it works – Paul’s written a ton about it in his CHECKDB from Every Angle category. FYI, 3 of his 9 years on the SQL Server Development Team were spent writing CHECKDB and repair for SQL Server 2005 (so, he definitely knows how it works J). Here’s the link to the category: http://www.sqlskills.com/BLOGS/PAUL/category/CHECKDB-From-Every-Angle.aspx

Reduce a database 

OK, I’m sure I’ll get a lot of responses to this one but IMO, a database maintenance plan SHOULD NEVER INCLUDE A SHRINK.

Let me explain… J
To be honest, I'm not even a fan of manually running database-level shrinks (DBCC SHRINKDATABASE) either. Don't get me wrong - there are ACCEPTABLE times to shrink parts of a database but, in general, I'd recommend only using DBCC SHRINKFILE for individual file-level shrinks. I wouldn't schedule shrinks nor would I EVER turn on [the database option] autoshrink. I don't think shrinks should EVER be automated - either through the database option OR through maintenance plans.

If you need to do regular shrinks - then it's likely that you have some other problem. And, without DIRECTLY addressing this problem, you *might* be making things worse.

This is a bigger discussion and I’ve found a few other references that I want to pull together. I’ll post another post about this within the next day or so – and link to it from here BUT, for right now…Know this – free space is generally GOOD. Excessive free space has happened FOR A REASON. Maybe there’s a pattern to it but often shrinking is worse than just leaving the free space for the next data explosion (a bunch of data comes in, the database grows, the data is archived, the free space remains for the next set of data that comes in).

If you shrink the database you might make things worse by fragmenting everything. Paul’s video that goes with the TechNet article on Database Maintenance shows you the [shocking if you didn’t know this] effect of shrinking a database on indexes.

Reorganize an index, Update statistics and Rebuild an index

These need to be grouped together to start because this KB article does NOT address the impact of running these together. In fact, the problem, if you run these together – is that you MIGHT make things worse. First, let me give you an overview of each:

Reoganizing an index removes fragmentation in the largest part of an index (it’s called the leaf level of the index) and removing fragmentation in this level has the greatest (and positive) effect on range query scans and cache. So, this is really the most important type of fragmentation to remove. However, this is NOT the only way to do it… 

Rebuilding an index completely and totally removes ALL forms of fragmentation in all levels of an index; however, this is the most expensive (yet most effective) way to do it. As a result of rebuilding an index, SQL Server also updates the statistics for the indexes that were rebuilt. Therefore you do not need to update statistics OR reorganize an index if it gets rebuilt.

Updating statistics is important for query processing and optimization. The query processor uses statistics on your data to help determine how many rows will be processed by your query/statement. If SQL Server can accurately estimate the rows, then it can choose a more effective plan. However, if it doesn’t have good statistics, then it may not do as good of a job at accurately estimating rows and therefore it might not come up with as optimal of a plan. So, this is an integral part to good database health. However, some of this might be done via the database option: auto update statistics which is ON by default in SQL Server (and, YES, you should leave this on). Check out this post on: Auto update statistics and auto create statistics - should you leave them on and/or turn them on??

However, if you use a maintenance plan then I really see two problems: 

  1. You’ll end up doing maintenance on things that may not need it. The default behavior for these tasks is just to run them on the selected objects. And, since many people will choose all objects (possibly even of all databases) then you’ll probably select objects that won’t really need this as frequently as you run this maintenance plan. 
  2. You might end up running a combination of things that either – wastes cycles/CPU and a MASSIVE amount of log space (which can translate into all sorts of concerns for DR technologies like database mirroring which will need to send all log rows to the secondary server). For example, if you run ALL three of these things then they’ll have to be run in a certain order (you can change this in a maintenance plan). However, the default order is: Reorganize Index(es), Rebuid Index(es), Update Statistics. This means that the work that’s done by reorganizing is effectively wasted as the rebuild would have taken care of it AND the work that’s done for updating statistics could mean that they update statistics TWICE (during the rebuild AND after) and the end result is WORSE because the update statistics command might use a sampling mechanism to generate statistics (which can lead to LESS EFFECTIVE statistics information). However, this is ONLY if you change the wizard’s default. The default is for the updating statistics command to do a “full scan”. So, even if the statistics will end up being the same – it’s still problematic because it means that for all indexes you’ve just rebuilt – you’ve now updated their statistics TWICE.

SUMMARY

A database maintenance plan is CRITICAL for best performance (especially for databases that are prone to some of the problems corrected by these maintenance tasks (yes, you can read SHAREPOINT into that statement). Sharepoint uses GUIDs as PRIMARY KEYs (read this post to hear more about the side effects of this choice) and as a result, as clustering keys. This means that many Sharepoint tables are prone to [potentially a MASSIVE amount of] fragmentation.

You absolutely need to have a maintenance plan. But, what should it be?

My absolute preference is to NOT use the Database Maintenance Plan Wizard UNLESS you really know what you’re doing. It just doesn’t give enough prescriptive advice. And, if you just select the defaults, you will end up with an inoptimal maintenance plan.

A better approach would be to create your own maintenance plan. If you write the code yourself (or leverage one of the custom ones that are already out on the web) then you can strategically target ONLY the objects that have the warning signs and/or are out of date and you can set when to rebuild vs. when to reorganize (generally people rebuild if a table has more than 30% fragentation and they reorganize when it's less than 30%). Fragmentation is something that can be detected programmatically using the DMV: dm_db_index_physical_stats (in SQL 2005/2008) or by using DBCC SHOWCONTIG (in SQL 7.0/2000). Here are a few places to go to see the more flexible and programmatic way of rebuilding/reorganizing indexes:

Smart Indexing Part II - Conditional Rebuilding a blog post (with conditional index rebuild code) from SQLMCA Bob Duffy (a good friend who is located in Dublin, Ireland and whose wife (Carmel) just had a baby last week – congrats again Bob!! You guys are seriously outnumbered now!!!) here: http://blogs.msdn.com/boduff/archive/2007/06/08/smart-indexing-part-ii-conditional-rebuilding.aspx

Custom Index Defrag / Rebuild Procedures - a blog post with some posted code as well. http://www.sqlstuff.info/post/2008/03/Custom-Index-Defrag--Rebuild-Procedures.aspx

Rebuild and Reorganize Indexes in SQL 2005 – an article (with conditional index rebuild code) from SQL Server Central here: http://www.sqlservercentral.com/scripts/31857/  (NOTE: You will need to become a subscriber to get to this article.)

Rebuild Only the Indexes that Need Help - an article by Andrew Kelly (SQL MVP) on SQL Server Magazine here: http://www.sqlmag.com/articles/index.cfm?articleid=99019&pg=1 (NOTE: You will need to become a suscriber to get to the full text of the article.)

Or, build your own! Check out the BOL topic for the sys.dm_db_index_physical_stats for SQL 2005 here: http://msdn.microsoft.com/en-us/library/ms188917(SQL.90).aspx, Example D has sample code to help you get started! For SQL 2008 it’s here: http://msdn.microsoft.com/en-us/library/ms188917.aspx. It’s still Example D for the sample code to leverage. J

The most important thing I can tell you is that a SMALL amount of time getting familiar with what’s really happening in SQL as well as WHY it’s happening to you WILL BE A LOT MORE PRODUCTIVE then just slamming in a maintenance plan that solves some problems but probably creates others.

Hope this helps!
kt

YES!!!

OK, well, I guess I should be more specific because as in most things in SQL Server, the real answer is "it depends". And for these two options, it depends mostly on your SQL Server version. Since SQL Server 7.0, the way that auto update works, has changed (much so for the better!!). So, if you're in SQL Server 2005 or SQL Server 2008, I would say most definitely - leave these ON (or if you turned them off - turn them back on!!!)! If you still have problems with a specific index causing your grief, then turn off auto update at the index level NOT at the database level. To turn of auto update at the index level use STATISTICS_NORECOMPUTE in the index (or statistics) definition.

Now, as for why (and how!) this has changed over the versions... here we go:

SQL Server 7.0

  • Invalidation: Statistics were *invalidated* when a row modification counter (sysindexes.rowmodctr) was reached. This meant that they could not tell where the modifications were occuring and, if modifications were somewhat isolated to a specific column ALL of the statistics for the TABLE would be invalidated (so, statistics could be invalidated earlier than necessary)
  • Updating: Even worse, in SQL Server 7.0, when statistics were invalidated, they were immediately updated. This caused two problems - thrashing at the time of invalidation because all of the stats needed to be updated AND two, if the statistics were not used for awhile then extra work was involved to update them and by the time they were used, they might already be somewhat out of date already.

SQL Server 2000

  • Invalidation: Statistics were still invalidated based on a row modification counter.
  • Updating: SQL Server 2000 fixed the "updating-potentially-too-often" problem by only updating statistics when they were needed.

SQL Server 2005

  • Invalidation: The biggest changes were introduced in SQL Server 2005 where they decided to NO LONGER user the sysindexes.rowmodctr and instead use an internal (and undocumented) columns specific modification counter. Now, statistic invalidation is more isolated to only those columns which are heavily modified. This internal/undoc'ed column is sysrowsetcolumns.rcmodified and can only be seen when connecting to SQL Server using the DAC (Dedicated Admin Connection).
  • Updating: Updating didn't really change but, SQL Server 2005 added "Aynch Auto Update" for statistics so that when the QO (query optimizer) encounters an out-of-date (i.e. invalidated) statistic, they can "trigger" the update but not wait for the update (meaning that they'll optimize using the out-of-date statistic). This can be both positive (faster) and negative (might not be the best plan if the statistics have changed drastically). It is off by default and IMO, I'd leave it off in most cases but if you find that auto update events (which can be Profiled) are causing you grief, then you can turn this on at the database level.

SQL Server 2008

Nothing new except "Filtered Statistics" and these are interesting as the density vector is still relative to the table (not the predicate) but the histogram is just over the predicate (OK, I know I'll have to blog a lot more about this one!). Anyway, I'm still playing/learning a lot more about these and they make the most sense with filtered indexes (as opposed to just a filtered statistic) but, just like statistics on secondary columns you will also potentially want statistics on the secondary columns of your indexes. The next question is should they have a filter or not. I've found that sp_createstats doesn't seem to create statistics with filters and I'm going to need to do some testing here but I think statistics with filters (filters that match the non-clustered index) should help to make the stats better (and even allow better usage of filtered indexes) but, I'm really going to need a bunch of time with this - and another post :). As for auto create/auto update - no changes there!

Long story short, if you're using SQL Server 2005 or SQL Server 2008, you should leave auto create/auto update ON.

Thanks for reading!
kt

PS - A few of you have mailed me about a bug in the sp_helpindex2 script(s). OK, that's my next post!!! Possibly with an sp_helpstats2 script as well!

I know that Paul and I recommended that you subscribe to Conor's blog... but have you? He's posted some great details on Partitioning (Part 1 and Part 2) as well as statistics and it always reminds me of how much I can learn from other people's perspectives!

And, just to dove-tail on some of his statistics comments... I, too, have found that as tables get significantly larger AND have non-standard distributions of more than 200 distinct values (and un-even distribution between those values as well), that the optimizer just cannot possibly do a perfect job. The only way an optimizer can be good is when it can "find a good plan fast" (which I first heard from Nigel Ellis (former Development Manager of the Query Processor team) - back when he delivered a Pacific Northwest SQL Server User Group meeting many moons ago). The most important thing to realize is that it's just not possible to waste time to find the absolutely best plan... mathematically analyzing all permutations would be prohibitive - you'd have to take a vacation between query executions (wait, that's not a bad idea... I digress :).

The point:

  1. Make sure that statistics are up-to-date (either through the database option: auto update stats OR by manually updating statistics)
  2. Consider re-evaluating statistics over large tables (and, when poor performance occurs - look at the estimated rows v. the actual rows - if the estimate/actual are off by a fact of 10, then it could be the statistics). I'd try updating stats first and then if that doesn't work, updating with a fullscan. If neither of those work, I'd also re-evaluate other possible indexes (there are some distributions between tables being joined that just can't show a correct correlation between the values when in multiple indexes... sometimes the best index is a multi-column (ie. composite index)). 
  3. Consider breaking very large tables down into smaller chunks (not just table index partitioning but possibly Partitioned Tables AND Partitioned Views) as this can give the optimizer additional details about partiticularly interesting data sets. Even in SQL Server 2008, statistics are still table-level (filtered indexes can provide some, but not complete, relief... I'll give more details in a later post) but I'd often argue that some of the best table designs are not just for a single table. Consider the statistical, locking, and indexing implications for mixed workloads against a single table (and the tremendous amount of blocking that could occur in addition to varying access patterns). And, even while 2008 will offer Partition-level lock escalation, well-designed tables may not need it! I know I've mentioned this before but different perspectives on statistics, optimizers and the fact that a good optimizer has to be efficient in-and-of itself, remind me of some of the most basic things that are also the most common problems contributing to poor query performance.

Returning to the basics and optimizing a system from the ground up always leads to better scalability!

Enjoy!
kt

 

Theme design by Nukeation based on Jelle Druyts