Time flies when you're having fun, eh? This week Paul and I are in Dublin, Ireland delivering a deep technical "SQL Immersion Event" on Internals, Performance Tuning and Maintenance with our partner ProData (and having a blast - we have a really fun and very interactive group). And, today's an even more interesting day to be in Dublin as it's the 250th anniversary of the Guinness Brewery...everyone around Dublin will be toasting the most popular drink in Ireland today (Sept 24) at 5:59pm GMT. Unfortunately, our class runs until 6pm but Paul tells me that we're still going to celebrate. ;-)

For now, we'll just have to celebrate some SQL (I can hear minor grumblings everywhere :)...

Filtered indexes and filtered stats are an incredibly powerful feature in SQL Server 2008 - one that I've posted on before and will again. Today's post is about some of the gotchas with filtered index stats and filtered stats. Cutting to the punchline - you might have some trouble with their accuracy as your data changes. The SIMPLE solution is to update these specific stats manually and more frequently (meaning that you don't want to rely on the database option AUTO_UPDATE_STATISTICS). In fact, depending on your data volatility, you might want to update these statistics on a VERY frequent basis. Part of the question you'll have to answer is how much of the filtered set changes over the course of a day/week/month. And, in whatever frequency you reach 5-10% (let's say weekly), I'd suggest updating the statistics weekly. If you update 5-10% within the course of a day - then daily. Let me give you the full story...

Statistics are traditionally updated when roughly 20% (+ a minimum of 500 rows) of the data has changed. If you have a table with 100,000 rows then SQL Server will hit the threshold at 20,500 rows (just a tad over 20%) and a table with 10,000 rows will hit the threshold at 2,500 rows (which is 25%). Additionally, in SQL Server 200x (2000, 2005 and 2008) statistics are NOT immediately updated when the threshold is reached, instead they are invalidated. It's not until someone needs the statistic that SQL Server updates it. This reduces thrashing that occurred in SQL Server 7.0 when stats were updated immediately instead of just being invalidated. Another interesting point is what is meant by "20% of the data has changed?"... How is that defined? Is it based on updates to columns or inserts of rows? Of course the answer is... it depends - here, it depends on the version of SQL Server that you're using:

  • SQL Server 2000 defines 20% as 20% of the ROWS have changed. You can see this in sysindexes.rcmodctr.
  • SQL Server 2005/8 defines 20% as 20% of the COLUMN data has changed. You cannot see this unless you are accessing SQL Server through the DAC as it's in a base system table (2005: sysrowsetcolumns.rcmodified and for 2008: sysrscols.rcmodified).

Now, for regular indexes (those that do not include a filter) both strategies have pros and cons. Let's start with the con for SQL 2000... If you have a particularly volatile column then the rowmodctr will be increased quickly and then ALL statistics (even those that have not changed) are invalidated. Without a way to understand a column's volatility, you will invalidate some statistics too soon. So, 2005/2008 created a column modification counter to better define when a statistic should be invalidated. But, the con for SQL 2005/2008 (which is the pro for row-based algorithm of SQL 2000) is that a lot more data must change in a single column before the statistics get updated. If your modifications are fairly well balanced (across the table) then the row-based algorithm could suit you better. If your modifications are isolated (to a specific column) then the column-based algorithm will suit you better. But, neither is perfect... and, the new column-based algorithm poses another HUGE problem for filtered indexes.

For filtered indexes the threshold invalidation algorithm is tied SOLELY to the column and not just to the filtered set. So, if your table has 10,000 rows it takes 2,500 modifications in that column to update statistics. If your filtered index only has 1,000 rows... then you could theoretically modify this specific filtered set 2.5 times before IT would be updated. Yes, I realized that this sounds unlikely but it's really not. What if your filter is for ACTIVE = 1. It's only 1% of your table and it's the most active data set. You will effectively have to modify this set 20 times over before the statistics for your filtered index get updated.

But, having said that - I also have a simple solution. Don't wait... For filtered indexes (especially on volatile data) add a special UPDATE STATISTICS job that specifically updates their stats with a FULLSCAN. The good news is that they are likely to be smaller indexes (and therefore easier/less-costly statistics to update) and you'll clearly get a lot better uses out of them. Additional good news is that many filtered indexes are designed to be covering indexes, covering indexes aren't going to rely as heavily on statistics so there are many covered queries that won't care. The following query will get the list of all of your tables/indexes that have filters. Once you have this list you can create a loop to update statistics.

SELECT object_name(object_id) AS [Table Name]
       , name AS [Index Name]
       , stats_date(object_id, stats_id) AS [Last Updated]
FROM sys.
stats
WHERE has_filter = 1

So, there are lots of options and definitely a few things of which to be aware... filtered indexes and filtered stats are a powerful option to leverage, just make sure you keep those stats up to date!

Cheers... it's time for a pint of Guinness!

kt

PS - For more information on statistics, check out the SQL Server whitepapers:

OK, for years I've been saying that SQL Server doesn't care about the order in which you define the columns of your table because internally SQL Server will re-arrange your columns to store all of the fixed width columns first and the variable columns last. In both the fixed-width portion of the row as well as the variable-width portion of the row, the columns are defined in the order in which they are declared. So, what does matter?

It's all in the cost of the variable array's offset values. If the large majority of your NULLable records are at the end, then SQL Server doesn't need to completely populate the variable block array (which saves 2 bytes per column). If you have a table where 36 columns are NULLable and generally they are NULL, then defining those columns at the end of the row can save you space.

The following script will show you how the maximum length of the row changes based on whether or not a later column in the variable block is NOT NULL - even when most/all of the prior columns are!

CREATE TABLE RowSizeVariableBlock
(
ID
int NOT NULL identity
,
c01 char(10) NOT NULL default 'test'
,
c02 datetime2(7) NOT NULL default sysdatetime
(),
c03 char(80) NOT NULL default 'junk'
,
c04 varchar(100)
NULL,
c05 varchar(100)
NULL,
c06 varchar(100)
NULL,
c07 varchar(100)
NULL,
c08 varchar(100)
NULL,
c09 varchar(100)
NULL,
c10 varchar(100)
NULL,
c11 varchar(100)
NULL,
c12 varchar(100)
NULL,
c13 varchar(100)
NULL,
c14 varchar(100)
NULL,
c15 varchar(100)
NULL,
c16 varchar(100)
NULL,
c17 varchar(100)
NULL,
c18 varchar(100)
NULL,
c19 varchar(100)
NULL,
c20 varchar(100)
NULL,
c21 varchar(100)
NULL,
c22 varchar(100)
NULL,
c23 varchar(100)
NULL,
c24 varchar(100)
NULL,
c25 varchar(100)
NULL,
c26 varchar(100)
NULL,
c27 varchar(100)
NULL,
c28 varchar(100)
NULL,
c29 varchar(100)
NULL,
c30 varchar(100)
NULL,
c31 varchar(100)
NULL,
c32 varchar(100)
NULL,
c33 varchar(100)
NULL,
c34 varchar(100)
NULL,
c35 varchar(100)
NULL,
c36 varchar(100)
NULL,
c37 varchar(100)
NULL,
c38 varchar(100)
NULL,
c39 varchar(100)
NULL,
c40 varchar(100)
NULL
)
go

insert RowSizeVariableBlock DEFAULT VALUES
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 114
go

insert RowSizeVariableBlock (c01, c03, c20)
values ('med row', 'up to c20', 'test'
)
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 154
go

insert RowSizeVariableBlock (c01, c03, c30)
values ('med+ row', 'up to c30', 'test'
)
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 174
go

insert RowSizeVariableBlock (c01, c03, c40)
values ('large row', 'up to c40', 'test'
)
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 194
go

While there are some other optimizations at this level, most tables cannot benefit from this as the data populations aren't as predictable nor are most tables filled with so many variable-width and NULLable columns. However, if you do have this data pattern, defining these columns at the end of your table's definition - MIGHT save a tremendous amount of space, especially when this table is very large!

Paul's blogged more on these structures as well as the NULL bitmap here: http://www.sqlskills.com/BLOGS/PAUL/post/Misconceptions-around-null-bitmap-size.aspx.

Enjoy! And, thanks for reading,
kt

Recently, the SQL Server 2008 Internals title was released (and only in the past few days have people actually received their copies)! In fact, I still haven't seen the book in person... soon though!

Anyway, Kalen, Paul, Conor, Adam and I worked to create a comprehensive resource on SQL Server 2008 internals and to supplement the written content, many of us created demo scripts and examples. I've now gone back and created a sample script based on ALL of the code in the entire chapter (and in many cases I extended the code in the samples). And, while this updated content is going to be located on the companion content site, I thought I’d also release it here with some notes.

The zip contains a solution with 3 projects, each with a few scripts:

I know the names seem a bit strange but everything is ordered EXACTLY as it is shown in the book. And, in the book, I referenced "a" script called EmployeeCaseStudy-AnalyzeStructures.sql but that script was so big that I broke it down into 7 parts (hence the naming convention of 06...01, 06...02, etc.). Regardless, each script contains a brief header and a few details about the script. To get the most from the script, do not just open the script and execute it. If you really work your way through the script, you should see all of the comments and they will help you to make instance specific changes so that everything runs without error. Just take your time and really try to step back and think about each command (and what you're expecting the output to be) to test yourself while your working through the results. Taking your time and really grapsing all of these internals is what makes it fun!

Quite a bit of this content can stand alone but it's really best as companion content to the title as there's a lot more "text" and detailed information in the book. But, the scripts are really a great way to dive deeper, learn documented/undocumented commands and really get to know what the heck is really going on internally.

Finally, I only worked on Chapter 6 so here it is: 20090810-IndexInternals-Chapter6-Resources.zip (6.13 mb). As for the other companion content, you'll need to get the links from the book. Ah, or from Kalen :). 

And, certainly, if you find a typo or anything that you think needs more clarification, let me know! I'm more than happy to post updates to this content. 

Have fun,
kt

UPDATES/ERRORLOG:
2009-Apr-13 (8am): Updated the zip after remembering in my sleep (yes, sad, but true!) that one of my comments that referenced some line numbers didn't get updated in the final version. So, no errors and if you don't get this update, it's not going to break anything. But, the script that's been tweaked is script: 05_EmployeeCaseStudy-TableDefinition.sql. Enjoy! kt
2009-Apr-13 (4:30pm): Ha... guess what arrived today. Yes - our copies of the book. Wow, it's great to see it in person. Again, enjoy!
2009-Aug-10: Added a :CONNECT option inside the IndexInternals restore script AND, cleaned up the zip as it had an extra copy of the IndexInternals database in it.

Theme design by Nukeation based on Jelle Druyts