The very worst piece of advice I ever saw on the Internet was in response to someone asking on a SQL newsgroup 'how can I create a corrupt database?'. The first response was:

When I want to corrupt a database to play with, I go into the data center, find a hard-drive and flick the power switch on-and-off quickly a few times.

This was closely followed by a bunch of replies (including mine) saying 'Noooooooooooo!!!!'

So, for a few years now I've provided a zip file with a bunch of pre-corrupted databases in - so you can test your consistency checking jobs, and experiment with corrupt databases without having to create them yourself by destroying hard-drives, or less daft means. You can find the main zip file at the top of our Past Events resources page, along with a link to a blog post which explains the various databases and demo scripts.

Several times I've been asked to provide SQL Server 2008 versions of the two databases which demo system table corruption. Well, that task has finally bubbled to my long-list of blog posts and website updates. There's now a second, smaller zip file which has 2008-only versions of the DemoFatalCorruption1 and DemoFatalCorruption2 databases, which showcase corruptions that prevent DBCC CHECKDB from running.

Let me know if you have any problems (playing with the backups Wink)

Enjoy!

One of the things I love teaching is how the transaction log and logging/recovery work. I presented a session on this at both PASS and SQL Connections in the last two weeks, and in both sessions I promised to write some blog posts about the deep internals of logging operations. This is the first one in the series. Previous blog posts that dive into logging operations are:

Ok, on with the show. 

SQL Server 2005 introduced a feature called 'fast recovery' in Enterprise Edition. This allows a database to become available for use after the first part of recovery (REDO) completes and before the (usually longer running) second part of recovery (UNDO) completes. See my TechNet Magazine article Understanding Logging and Recovery in SQL Server if you don't know what I'm talking about. But how does SQL Server do this?

The answer is lock logging. A log record describes a single change made to a database. For log records describing changes that can be used as part of UNDO (yes, some changes to the database are one-way only - for instance PFS page changes), from 2005 onwards the log record also includes a description of which locks were being held at the time the change was made. These locks were necessary to protect the change being made when the original transaction was running (before the crash) and so the same locks will be necessary to protect the anti-operation which reverses the change. I'll explain more about these anti-operations in one of the next in-depth logging blog posts.

The Storage Engine does two passes through the log as part of crash recovery. The first pass does REDO and also reads the log records that will be processed as part of the second pass (UNDO), looking at the lock description and actually acquiring those locks. For fast recovery, at that point the database is brought online. This is possible because the recovery system knows that it already has the correct locks to guarantee that it can safely generate and perform the anti-operations necessary to perfom UNDO. One side-effect of this is that although the database is available for use, a query may bump into one of the locks being held to allow fast recovery - in which case it will have to wait for that lock to be dropped as UNDO progresses.

Cool eh?

That was the introduction to allow me to do some gratuitous spelunking around the internals :-) I'm going to create a few simple examples to show you lock logging in the log. Now, don't get confused - it's not logging actual locks (the memory used to hold the lock itself), it's just logging a description of which locks were held and in which modes.

Here's the script to create a database with a simple table. I'm using a LOB column and specifically setting it to be stored off row (see Importance of choosing the right LOB storage technique) so we can see some text page locks too. I'm using the SIMPLE recovery model for simplicity (ha ha) - so I can clear the log when a checkpoint occurs rather than having to muck around with log backups. I'll insert the first row and then clear the log.

CREATE DATABASE LockLogging;
GO
USE LockLogging;
GO

CREATE TABLE LockLogTest (c1 INT, c2 INT, c3 VARCHAR (MAX));
GO
EXEC sp_tableoption 'LockLogtest', 'large value types out of row', 'on';
GO

INSERT INTO LockLogTest VALUES (1, 1, 'a');
GO

ALTER DATABASE LockLogging SET RECOVERY SIMPLE;
GO
CHECKPOINT;
GO

Now let's try the first operation - a simple insert - and look at the log records using fn_dblog (and I'm skipping the checkpoint log records):

INSERT INTO LockLogTest VALUES (2, 2, 'b');
GO
SELECT [Operation], [Context], [Page ID], [Slot ID], [Number of Locks] AS Locks, [Lock Information]
FROM fn_dblog (NULL, NULL);
GO

Operation        Context       Page ID        Slot ID  Locks Lock Information
---------------- ------------- -------------- -------- ----- -------------------------------------------------------------------
LOP_BEGIN_XACT   LCX_NULL      NULL           NULL     NULL  NULL
LOP_INSERT_ROWS  LCX_TEXT_MIX  0001:00000098  1        2     ACQUIRE_LOCK_IX PAGE: 18:1:152; ACQUIRE_LOCK_X RID: 18:1:152:1
LOP_INSERT_ROWS  LCX_HEAP      0001:0000009a  1        3     ACQUIRE_LOCK_IX OBJECT: 18:2073058421:0;
                                                                 ACQUIRE_LOCK_IX PAGE: 18:1:154; ACQUIRE_LOCK_X RID: 18:1:154:1
LOP_COMMIT_XACT  LCX_NULL      NULL           NULL     NULL  NULL

We can see page IX and row X locks for the LOB value being inserted into the text page, plus table IX, page IX, and row X locks for the data record being inserted into the heap. The lock resources break out as follows:

  • 18:1:152 is page 152 in file 1 of database ID 18
  • 18:1:152:1 is slot 1 on page 152 in file 1 of database ID 18
  • 18:2073058421:0 is object ID 2073058421 (the object ID of the table LockLogTest) in database ID 18

Notice also the LOP_BEGIN_XACT and LOP_COMMIT_XACT log records - even though I didn't do an explicit transaction, SQL Server has to start one internally for me (called an implicit transaction) so that there's a boundary for where to rollback if something goes wrong during the operation. 

And now an update operation (with a checkpoint first to clear out the log):

CHECKPOINT;
GO
UPDATE LockLogTest SET c1 = 3;
GO
SELECT [Operation], [Context], [Page ID], [Slot ID], [Number of Locks] AS Locks, [Lock Information]
FROM fn_dblog (NULL, NULL);
GO

Operation        Context       Page ID        Slot ID  Locks Lock Information
---------------- ------------- -------------- -------- ----- -------------------------------------------------------------------
LOP_BEGIN_XACT   LCX_NULL      NULL           NULL     NULL  NULL
LOP_MODIFY_ROW   LCX_HEAP      0001:0000009a  0        3     ACQUIRE_LOCK_IX OBJECT: 18:2073058421:0;
                                                                 
ACQUIRE_LOCK_IX PAGE: 18:1:154; ACQUIRE_LOCK_X RID: 18:1:154:0
LOP_MODIFY_ROW   LCX_HEAP      0001:0000009a  1        3     ACQUIRE_LOCK_IX OBJECT: 18:2073058421:0;
                                                                 ACQUIRE_LOCK_IX PAGE: 18:1:154; ACQUIRE_LOCK_X RID: 18:1:154:1
LOP_COMMIT_XACT  LCX_NULL      NULL           NULL     NULL  NULL

Just as we expected - a table IX lock, a page IX lock, and two row X locks on that page.

Now, what about something more complicated like a TRUNCATE TABLE? Have you heard the myth about it not being logged? Right - it's a myth:

CHECKPOINT;
GO
TRUNCATE TABLE LockLogTest;
GO
SELECT [Operation], [Context], [Page ID], [Slot ID], [Number of Locks] AS Locks, [Lock Information]
FROM fn_dblog (NULL, NULL);
GO

Operation        Context       Page ID        Slot ID  Locks Lock Information
---------------- ------------- -------------- -------- ----- -------------------------------------------------------------------
LOP_BEGIN_XACT   LCX_NULL      NULL           NULL     NULL  NULL
LOP_LOCK_XACT    LCX_NULL      NULL           NULL     1     ACQUIRE_LOCK_SCH_M OBJECT: 18:2073058421:0
LOP_MODIFY_ROW   LCX_IAM       0001:0000009b  0        1     ACQUIRE_LOCK_X RID: 18:1:155:0
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:154
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:155
LOP_MODIFY_ROW   LCX_IAM       0001:00000099  0        1     ACQUIRE_LOCK_X RID: 18:1:153:0
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:152
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:153
LOP_SET_BITS     LCX_SGAM      0001:00000003  1        NULL  NULL
LOP_SET_BITS     LCX_GAM       0001:00000002  1        NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000014  89       NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000011  78       NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000014  90       NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000041  164      NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000041  165      NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000041  166      NULL  NULL
LOP_HOBT_DDL     LCX_NULL      NULL           NULL     NULL  NULL
LOP_MODIFY_ROW   LCX_CLUSTERED 0001:00000014  89       2     ACQUIRE_LOCK_IX OBJECT: 18:7:0;
                                                                
ACQUIRE_LOCK_X KEY: 18:458752 (0000c2681664)
LOP_HOBT_DDL     LCX_NULL      NULL           NULL     NULL  NULL
LOP_MODIFY_ROW   LCX_CLUSTERED 0001:00000014  90       2     ACQUIRE_LOCK_IX OBJECT: 18:7:0;
                                                                 ACQUIRE_LOCK_X KEY: 18:458752 (00007a581379)
LOP_MODIFY_ROW   LCX_CLUSTERED 0001:00000011  78       2     ACQUIRE_LOCK_IX OBJECT: 18:5:0;
                                                                 ACQUIRE_LOCK_X KEY: 18:327680 (00001df3833b)
LOP_COMMIT_XACT  LCX_NULL      NULL           NULL     NULL  NULL

Lots of logging and lots of locks. If you look at the Context column, you'll see that the operation is modifying allocation bitmaps (LCX_IAM, LCX_PFS, LCX_SGAM, LCX_GAM) but taking locks on the table pages, not on the allocation bitmaps themselves - they're only ever latched (an internal, much lighter-weight, synchronization mechanism). This is done as the pages comprising the table are deallocated - this is all done because the table's small enough that the Storage Engine chooses to deallocate all the storage immediately, instead of pushing it all onto the task queue for the deferred drop background task. See my previous post Search Engine Q&A #10: When are pages from a truncated table reused? which discusses this too.

There are no actual row operations performed on the table itself. The only table row operations are down at the bottom on table with object IDs 7 and 5 (sysallocunits and sysrowsets, respectively) to update the page counts, first IAM, and first page entries for the table.

So - hopefully this has been useful to you. In the next post in the series, I'll discuss compensation log records and how rollback operations work.

PS Send me an email or put in a comment if there's something in particular about the log (or log records) you'd like to see explained.

This is a blog post I've been meaning to do for a while, and I've recently noticed some info on the web about checkpoints which is a little misleading, so I want to do a quick post to explain how checkpoints work as far as log records are concerned.

When a checkpoint operation occurs, no matter how it's triggered (for instance through a manual CHECKPOINT, from a database or differential backup, or automatically) the same set of operations occurs:

  • All dirty data file pages for the database are written to disk (all pages that have changed in memory since they were read from disk or since the last checkpoint), regardless of the state of the transaction that made the change.
  • Before a page is written to disk, all log records up to and including the most recent log record describing a change to that page are written to disk (yes, log records can be cached in memory too). This guarantees recovery can work and is called write-ahead logging. Log records are written to the log sequentially, and log records from multiple transactions will be interspersed in the log. The log cannot be selectively written to disk, so writing a dirty page to disk that only has a single log record affecting it may mean writing many more previous log records to disk as well.
  • Log records describing the checkpoint are generated.
  • The LSN of the checkpoint is recorded in the database boot page in the dbi_checkptLSN field (see Search Engine Q&A #20: Boot pages, and boot page corruption).
  • If in the SIMPLE recovery model, the VLFs in the log are checked to see whether they can be marked inactive (called clearing or truncating the log - both of which are terrible misnomers, as nothing is either physically cleared or truncated).

I'm using some terms you may not have come across - for a background primer on logging, recovery, and transaction log architecture see my article in the February 2009 TechNet Magazine - Understanding Logging and Recovery in SQL Server (or come to my Wednesday November 4th PASS Spotlight session with the same title).

Checkpoints are not really tracked in the transaction log - it just serves as a useful repository for information about which transactions are active at the time of the checkpoint. The LSN of the last checkpoint is recorded in the database boot page. This is where recovery starts, and if this page is inaccessible, the database cannot be attached, opened, or processed in any way - partly because it's the boot page that knows whether the database was cleanly shut down or not, and partly because it's the only place that records the LSN of the last checkpoint record. You may say, well it's recorded in the transaction log too, but what if the log is corrupt in some way?

One area of confusion I've seen is that the checkpoint log records are overwritten by subsequent checkpoints. Absolutely not - once written, a log record is NEVER updated or overwritten - it will only be overwritten when the log wraps and the VLFs are re-used (see Inside the Storage Engine: More on the circular nature of the log). This has led to further confusion about when checkpoint information is retrievable from the log, using commands such as fn_dblog.

For the rest of this post, I want to show you what's going on in the transaction log when checkpoints occur under different circumstances.

Consider the following example:

CREATE DATABASE CheckpointTest;
GO
USE CheckpointTest;
GO
CREATE TABLE t1 (c1 INT);
GO
INSERT INTO t1 VALUES (1);
GO
CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000051:009b  LOP_BEGIN_CKPT
00000047:00000091:0001  LOP_END_CKPT

We see the log records for the checkpoint. In this case the checkpoint is very simple and so there are only two records - beginning and ending the checkpoint.

If we run another checkpoint, what do we see?

CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000092:0001  LOP_BEGIN_CKPT
00000047:00000093:0001  LOP_END_CKPT

Only information about one checkpoint, but with different LSNs for the log records. It's not that the previous checkpoint was overwritten, it's that we did a checkpoint - so the active portion of the log has moved forward and the log records for the previous checkpoint aren't considered active any more as they're not required (for instance, for database mirroring, an active transaction, a log backup, transactional replication). They're still there in the log though, but just aren't part of the required portion of the log and so aren't dumped by fn_dblog.

Now, what if I create an active transaction? In another connection, I'll do:

USE CheckpointTest;
GO
BEGIN TRAN;
GO
INSERT INTO t1 VALUES (1);
GO

And what if we do a checkpoint and look at the log now?

CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000094:0001  LOP_BEGIN_XACT
00000047:00000094:0002  LOP_INSERT_ROWS
00000047:00000094:0003  LOP_COUNT_DELTA
00000047:00000094:0004  LOP_COUNT_DELTA
00000047:00000094:0005  LOP_COUNT_DELTA
00000047:00000094:0006  LOP_BEGIN_CKPT
00000047:00000096:0001  LOP_XACT_CKPT
00000047:00000096:0002  LOP_END_CKPT

We see the start of the open transaction, the insert of the record, the update of row counts in metadata, and the checkpoint.

You may notice that there's another log record being generated for the checkpoint - LOP_XACT_CKPT. This is only generated when there are active (uncommitted) transactions and it lists information about all actvie transactions at the time the checkpoint begins. This is used during crash recovery to work out how far back in the transaction log to go to start REDO and UNDO operations (well, technically only UNDO will need to go this far back). Focusing in on this log record, we can see:

SELECT [Current LSN], [Operation], [Num Transactions], [Log Record]
FROM fn_dblog (NULL, NULL) WHERE [Operation] = 'LOP_XACT_CKPT';
GO

Current LSN             Operation                       Num Transactions
----------------------- ------------------------------- ----------------
00000047:00000096:0001  LOP_XACT_CKPT                   1

Log Record
-----------------------------------------------------------------------------------------------------------
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 6621000000000000

This log record contains information about each active (uncommitted) transaction at the time of the checkpoint. Without going into all the details about what's in the payload for this log record, you can see two things in there:

  • The LSN of the LOP_BEGIN_XACT log record for the oldest active transaction (the first bold number above - match it against the LOP_BEGIN_XACT in the log dump a little higher up)
  • The LSN of the first log record making a database change for that transaction (the second bold number above - match it against the LOP_INSERT_ROWS in the log dump a little higher up)

You'll notice that these LSNs are stored byte-reversed. 

How about if we do another checkpoint?

CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000094:0001  LOP_BEGIN_XACT
00000047:00000094:0002  LOP_INSERT_ROWS
00000047:00000094:0003  LOP_COUNT_DELTA
00000047:00000094:0004  LOP_COUNT_DELTA
00000047:00000094:0005  LOP_COUNT_DELTA
00000047:00000094:0006  LOP_BEGIN_CKPT
00000047:00000096:0001  LOP_XACT_CKPT
00000047:00000096:0002  LOP_END_CKPT
00000047:00000097:0001  LOP_BEGIN_CKPT
00000047:00000098:0001  LOP_XACT_CKPT
00000047:00000098:0002  LOP_END_CKPT

This time we see the log records for the current checkpoint and the previous one - as the active log stretches all the way back to the start of the oldest active transaction - no matter how many checkpoints we do while there's an active transaction.

Now what if we start another active transaction in a third connection?

USE CheckpointTest;
GO
BEGIN TRAN;
GO
INSERT INTO t1 VALUES (2);
GO

And then go back to the original connection and do another checkpoint and dump the log again:

CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000094:0001  LOP_BEGIN_XACT
00000047:00000094:0002  LOP_INSERT_ROWS
00000047:00000094:0003  LOP_COUNT_DELTA
00000047:00000094:0004  LOP_COUNT_DELTA
00000047:00000094:0005  LOP_COUNT_DELTA
00000047:00000094:0006  LOP_BEGIN_CKPT
00000047:00000096:0001  LOP_XACT_CKPT
00000047:00000096:0002  LOP_END_CKPT
00000047:00000097:0001  LOP_BEGIN_CKPT
00000047:00000098:0001  LOP_XACT_CKPT
00000047:00000098:0002  LOP_END_CKPT
00000047:00000099:0001  LOP_BEGIN_XACT
00000047:00000099:0002  LOP_INSERT_ROWS
00000047:00000099:0003  LOP_COUNT_DELTA
00000047:00000099:0004  LOP_COUNT_DELTA
00000047:00000099:0005  LOP_COUNT_DELTA
00000047:00000099:0006  LOP_BEGIN_CKPT
00000047:0000009b:0001  LOP_XACT_CKPT
00000047:0000009b:0002  LOP_END_CKPT

You can see that we now have three sets of checkpoint log records, and two active transactions. Only one of these sets of checkpoint log records is the pertinent one - the previous two have been superceded, but the log records haven't been overwritten or removed, as you can clearly see.

Looking inside all the LOP_XACT_CKPT records, we can see (with a bit of creative formatting of the output):

SELECT [Current LSN], [Operation], [Num Transactions], [Log Record]
FROM fn_dblog (NULL, NULL) WHERE [Operation] = 'LOP_XACT_CKPT';
GO

Current LSN             Operation                       Num Transactions
----------------------- ------------------------------- ----------------
00000047:00000096:0001  LOP_XACT_CKPT                   1
00000047:00000098:0001  LOP_XACT_CKPT                   1
00000047:0000009b:0001  LOP_XACT_CKPT                   2

Log Record
-------------------------------------------------------------------------------------------------------------
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 21000000000000
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 21000000000000
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 21000000000000 ...
    ... 79050000000000004700000099000000010004014700000099000000020000000100000002000000DC000000

The first two checkpoints only list one active transaction and the most recent one lists two, as we'd expect. The log record payload for the first two lists the same oldest active transaction (highlighted in bold). The payload for the last checkpoint lists the same oldset active transaction (as that hasn't changed), but now lists an additional transaction (match the 470000009900000001000 with the LSN of the second LOP_BEGIN_XACT in the log dump above), and has a transaction count of two.

Just to finish things off, let's look at the boot page of the database either using DBCC PAGE or DBCC DBINFO:

DBCC TRACEON (3604);
GO
DBCC DBINFO ('CheckpointTest');
GO

DBINFO STRUCTURE:


DBINFO @0x6711EF64

dbi_dbid = 18                        dbi_status = 65536                   dbi_nextid = 2089058478
dbi_dbname = CheckpointTest          dbi_maxDbTimestamp = 2000            dbi_version = 611
dbi_createVersion = 611              dbi_ESVersion = 0                   
dbi_nextseqnum = 1900-01-01 00:00:00.000                                  dbi_crdate = 2009-09-28 07:06:35.873
dbi_filegeneration = 0              
dbi_checkptLSN

m_fSeqNo = 71                        m_blockOffset = 153                  m_slotId = 6
dbi_RebuildLogs = 0                  dbi_dbccFlags = 2                   
dbi_dbccLastKnownGood = 1900-01-01 00:00:00.000                          
<snip>

The dbi_checkptLSN is dumped out in decimal - converting to hex gives us  (47:99:6), which exactly matches the LSN of the most recent LOP_BEGIN_CKPT record in the log dump above.

Hopefully this explains things clearly!

Here's a question that came up recently: if I've upgraded a database from SQL 2000 or before, how can I tell if the data purity checks will be run or not?

As you may know, DBCC CHECKDB in 2005 onwards includes 'data purity' checks. These look for column values where the value is outside the valid range of values for the column's data type. For databases created on SQL 2005 onwards, these checks are always performed by DBCC CHECKDB and cannot be turned off. For databases created on earlier versions, it's a little more complicated.

In SQL Server versions prior to 2005, it was possible to import invalid data values into a database. These invalid values could cause query execution problems, or possibly even wrong results. In 2005, when the import 'holes' were closed, the data purity checks were added to DBCC CHECKDB, but not by default for upgraded databases. Because the possibility existed of upgraded database containing invalid values, the decision was made not to enable the data purity checks by default as this could lead people to suspect that the upgrade had caused corruptions that weren't there on 2000 or before.

So, if you have a database that was upgraded and you want to run the data purity checks, you need to use the WITH DATA_PURITY option for DBCC CHECKDB. This is all documented in Books Online and I put it into the Release Notes for SQL 2005 as well. Back to the question though - at what point do you NOT need to specify the option? Well, for an upgraded database, if the WITH DATA_PURITY option is used and no problems are found, a bit is irrevocably flipped in the boot page (see Search Engine Q&A #20: Boot pages, and boot page corruption) and from that point onwards the data purity checks will be performed on the database whenever DBCC CHECKDB runs.

The problem is though, how can you tell whether the bit has been flipped? You need to look at the boot page. The easiest way to do that is to use the DBCC DBINFO command (undocumented, but perfectly safe). It's the equivalent of using DBCC PAGE ('dbname', 1, 9, 3) to look at the boot page contents, as I explained in the blog post referenced in the previous paragraph.

There are two things you need to look for: what version of SQL Server created the database, and it the 'create version' is 2000 or lower, whether the special 'data purity' flag is set or not.

For a database created on SQL Server 2005:

DBCC TRACEON (3604);
GO
DBCC DBINFO ('master');
GO

DBINFO STRUCTURE:

DBINFO @0x66C8EF64

dbi_dbid = 1                         dbi_status = 65544                   dbi_nextid = 1984726123
dbi_dbname = master                  dbi_maxDbTimestamp = 16000           dbi_version = 611
dbi_createVersion = 611              dbi_ESVersion = 0                   
dbi_nextseqnum = 1900-01-01 00:00:00.000                                  dbi_crdate = 1900-01-01 00:00:00.000
dbi_filegeneration = 0              
dbi_checkptLSN

m_fSeqNo = 7612                      m_blockOffset = 224                  m_slotId = 1
dbi_RebuildLogs = 0                  dbi_dbccFlags = 0                   
dbi_dbccLastKnownGood = 2009-05-12 16:07:15.647                          
dbi_dbbackupLSN

<snip>

If the dbi_createVersion is 611 or higher, the database was created on SQL Server 2005+ and will always have data purity checks performed.

For an upgraded database (this is one of my pre-corrupted databases, you can get it from Conference corruption demo scripts and example corrupt databases):

DBCC TRACEON (3604);
GO
DBCC DBINFO ('DemoCorruptMetadata');
GO

DBINFO STRUCTURE:

DBINFO @0x6855EF64

dbi_dbid = 7                         dbi_status = 16                      dbi_nextid = 2089058478
dbi_dbname = DemoCorruptMetadata     dbi_maxDbTimestamp = 100             dbi_version = 611
dbi_createVersion = 539              dbi_ESVersion = 0                   
dbi_nextseqnum = 1900-01-01 00:00:00.000                                  dbi_crdate = 2009-06-17 15:14:49.490
dbi_filegeneration = 0              
dbi_checkptLSN

m_fSeqNo = 10                        m_blockOffset = 303                  m_slotId = 1
dbi_RebuildLogs = 0                  dbi_dbccFlags = 0                   
dbi_dbccLastKnownGood = 1900-01-01 00:00:00.000                          
dbi_dbbackupLSN

<snip>

This database has a dbi_createVersion lower than 611, so we need to look at the dbi_dbccFlags field. A value of 0 means that the data purity checks are not enabled by default. A value of 2 means they are enabled by default. You can easily check this out for your own databases.

Have fun!

This is a question that comes up every so often, most recently this morning while teaching a private class (and Kimberly's teaching now): how large is the forwarded record back-pointer? (And I haven't posted anything geeky for a while...)

In a heap it is possible to get forwarding and forwarded records. They occur when a record in a heap expands such that it no longer fits on the page it currently resides on. In this case, the record is moved to a new page, and a small forwarding record is left in the original location. The forwarding record points to the new location of the record, which is known as a forwarded record. This is done as a performance optimization so that all the nonclustered indexes on the heap do not have to be altered with the new location of the heap record.

As an aside, when a query uses a nonclustered index to satisfy a query, but needs more columns from the table record to fill the result set column list, it must go to the actual data record to retrieve the extra columns. It does this by using the table record locator that is stored in the nonclustered index record. If the table is a heap, the record locator is the physical location of the data record in the heap.

If the table is a clustered index (remember that a table can be organized as a heap OR a clustered index, not both), the record locator is the set of cluster keys of the data record. Both of these record locators are guaranteed to be unique. For a heap record locator, the record lookup (commonly called a bookmark lookup) goes directly to the physical location of the record. For a clustered index record locator, the record lookup uses the cluster keys to navigate down through the clustered index to the leaf level.

If a forwarding record occurs in a heap, when the record locator points to that location, the Storage Engine gets there and says Oh, the record isn't really here - it's over there! And then it has to do another (potentially physical) I/O to get to the page with the forwarded record on. This can result in a heap being less efficient that an equivalent clustered index.

There has been lots of discussion about whether it's better to have a heap or a clustered index - generally we recommend a clustered index, but there are special cases where a heap may be fine. This post isn't about that and I won't get drawn into a Comments argument about it. Go bug Kimberly :-)

Back to the point of the post. So the record has moved to a new location and there's a small record left in the original location which helps bookmark lookups.

Oh yeah, one more aside. The Storage Engine, when scanning the heap, will not process the forwarded record UNLESS it has reached it by following the forwarding record. This prevents race conditions where the record could be processed twice - if the forwarding operation occurs during a large table scan.

Ok, really back to the point of the post. What happens if the original record grows again and has to move again? Does it leave ANOTHER forwarding record when it moves to the second new location - creating a chain of forwarding records?

The answer is no. The *original* forwarding record is updated with the new location of the forwarded record. This can only be done if the forwarded record points *back* to the forwarding record - which it does.

The question becomes- how big is the forwarding record back-pointer? It's not dumped out by DBCC PAGE (sorry, when I rewrote DBCC PAGE for SQL 2005 I forgot to put that in there). Let's work it out using a script.

First off I'm going to create a database and table to play with.

CREATE DATABASE DBMaint2008;
GO
USE DBMaint2008;
GO

CREATE TABLE DbccPageTest (intCol1  INT IDENTITY,  intCol2  INT, vcharCol VARCHAR (8000),  lobCol  VARCHAR (MAX));
GO

INSERT INTO DbccPageTest VALUES (1, REPLICATE ('Row1', 600), REPLICATE ('Row1Lobs', 1000));
INSERT INTO DbccPageTest VALUES (2, REPLICATE ('Row2', 600), REPLICATE ('Row2Lobs', 1000));
GO

Using DBCC IND ('DBMaint2008', 'DbccPageTest', -1) , I find that the data page is page ID (1:154). If I dump out that page using DBCC PAGE, I can see both of the records fully contained on the page.

Now I'm going to update the second row to make it 8000+ bytes - forcing it to move to a new page.

UPDATE DbccPageTest SET vcharCol = REPLICATE ('LongRow2', 1000) WHERE intCol2 = 2;
GO

And now looking at page (1:154) with DBCC PAGE again, I can see that the second row has been replaced with:

Slot 1 Offset 0x137a Length 9

Record Type = FORWARDING_STUB        Record Attributes =
Memory Dump @0x66F4D37A

00000000:   049d0000 00010000 00†††††††††††††††††.........
Forwarding to  =  file 1 page 157 slot 0

The record has been replaced with a forwarding record - pointing to the new location of the record on page (1:157).

Now DBCC PAGE doesn't dump out the back-pointer in the forwarded record - so how can we tell how large it is? I'm going to see how large the record is, and then create a clustered index on the table. This will remove the back-pointer from the record without changing anything else about the record. The difference in size will be the back-pointer size.

Doing DBCC PAGE on page (1:157) I get (partial results):

Slot 0 Offset 0x60 Length 8057

Record Type = FORWARDED_RECORD       Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Now creating a clustered index and then dumping the page with the second row on it (left as an exercise for the reader :-)):

CREATE UNIQUE CLUSTERED INDEX Dbcc_CL ON DbccPageTest (intCol1);
GO

<figure out which page to look at>

DBCC PAGE ('DBMaint2008', 1, 169, 3);
GO

<partial results>
Slot 0 Offset 0x60 Length 8047

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Clearly the back-pointer is 10 bytes. And here's how it breaks down:

Hope this helps!

Page splits are always thought of as expensive, but just how bad are they? In this post I want to create an example to show how much more transaction log is created when a page in an index has to split. I'm going to use the sys.dm_tran_database_transactions DMV to show how much more transaction log is generated when a page has to split. You can find the list of columns and a small amount of explanation of each column in Books Online here - I was reminded of its existence by someone on Twitter (sorry, don't remember who it was and I couldn't find it in search).

In the example, I'm going to create a table with approximately 1000-byte long rows:

CREATE DATABASE PageSplitTest;
GO
USE pagesplittest;
GO

CREATE TABLE BigRows (c1 INT, c2 CHAR (1000));
CREATE CLUSTERED INDEX BigRows_CL ON BigRows (c1);
GO

INSERT INTO BigRows VALUES (1, 'a');
INSERT INTO BigRows VALUES (2, 'a');
INSERT INTO BigRows VALUES (3, 'a');
INSERT INTO BigRows VALUES (4, 'a');
INSERT INTO BigRows VALUES (6, 'a');
INSERT INTO BigRows VALUES (7, 'a');
GO

I've engineered the case where the clustered index data page has space for one more row, and I've left a 'gap' at c1=5. Let's add it as part of an explicit transaction and see how much transaction log is generated:

BEGIN TRAN
INSERT INTO BigRows VALUES (8, 'a');
GO

SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('PageSplitTest');
GO

database_transaction_log_bytes_used
-----------------------------------
1228

That's about what I'd expect for that row. Now what about when I cause a page split by inserting the 'missing' c1=5 row into the full page?

-- commit previous transaction
COMMIT TRAN
GO

BEGIN TRAN
INSERT INTO BigRows VALUES (5, 'a');
GO

SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('PageSplitTest');
GO

database_transaction_log_bytes_used
-----------------------------------
6724

Wow. 5.5x more bytes are written to the transaction log as part of the system transaction that does the split.

The ratio gets worse as the row size gets smaller. For a row with an approximately 100-byte long row (use the same code as above, but change to a CHAR (100), insert 67 rows with a 'gap' somewhere then insert the 68th to cause the split), the two numbers are 328 and 5924 - the split cause 18 times more log to be generated! For a row with an approximately 10-byte long row, I got numbers of 240 and 10436, because I created skewed data (about 256 rows with the key value 8) and then inserted key value 5 which forced a (rare) non-middle page split. That's a ratio of more than 43 times more log generated! You can try this yourself if you want: I changed the code to have a CHAR (10), inserted values 1, 2, 3, 4, 6, 7, then inserted 256 key values of 8 and then 2 of 5. The resulting page had only 6 rows - it split after the key value 5 - the Storage Engine doesn't always do a 50/50 page split. And that's not even causing nasty cascading page-splits, or splits that have to split a page multiple times to fit a new (variable-sized) row in.

Bottom line: page splits don't just cause extra IOs and index fragmentation, they generate a *lot* more transaction log. And all that log has to be (potentially) backed up, log shipped, mirrored....

This is a quick follow-on from my Misconceptions around null bitmap size post.

The null bitmap is *always* present in a data record (i.e. records in a heap or the leaf-level of a clustered index), but is optional in index records if all the columns in the index records are not nullable. The misconception is around what happens when a new column is added to the table. The common misconception is that if you have 8 columns in the table (and hence 8 bits in the null bitmap), if you add a ninth column then SQL Server has to go update every record so the null bitmaps all contain 9 bits. (Same misconception applies to adding the 17th, 25th, 33rd, etc column).

This is usually not true. Let's consider the cases:

  • New column is nullable, with a NULL default. The table's metadata records the fact that the new column exists but may not be in the record. This is why the null bitmap also has a count of the number of columns in that particular record. SQL Server can work out whether a column is present in the record or not. So - this is NOT a size-of-data operation - the existing table records are not updated when the new column is added. The records will be updated only when they are updated for some other operation.
  • New column is nullable, with a non-NULL default. This IS a size-of-data operation. The non-NULL default forces all existing records to be updated when the column is added, and so the null bitmap will be updated too.
  • New column is not-nullable (obviously with a non-NULL default). This IS a size-of-data operation, for the same reasons as above.

Hope this helps.

I'm starting a new series called 'Misconceptions' - a series of short posts that debunk some of the many myths and misconceptions that exist about the way SQL Server behaves. I actually already did the first post a couple of weeks back (Misconceptions around TF 1118) but just went back to re-tag it.

In this post I want to debunk the common myth that the null bitmap only contains bits for nullable columns. It doesn't - it has one bit per column in the table definition, as long as at least one column in the table is nullable. The 'unused' bits are always set to 1, which means 'null'.

And now the proof. I'm going to create two tables, with 10 columns each (meaning that the null bitmap has to have two bytes to store all the bits, plus two bytes for the count of columns in the record). The first table will have all nullable columns. The second will have the first 9 columns not nullable, and the tenth column nullable.

CREATE TABLE t1 (
    c1 INT, c2 INT, c3 INT, c4 INT, c5 INT,
    c6 INT, c7 INT, c8 INT, c9 INT, c10 INT);
GO

CREATE TABLE t2 (
    c1 INT NOT NULL, c2 INT NOT NULL, c3 INT NOT NULL,
    c4 INT NOT NULL, c5 INT NOT NULL, c6 INT NOT NULL,
    c7 INT NOT NULL, c8 INT NOT NULL, c9 INT NOT NULL,
    c10 INT);
GO

INSERT INTO t1 VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
GO

INSERT INTO t2 VALUES ( 1, 2, 3, 4, 5, 6, 7, 8, 9, NULL);
GO

And now let's look at the pages themselves. I'll get the pages involved using my sp_AllocationMetadata script (see Inside The Storage Engine: sp_AllocationMetadata - putting undocumented system catalog views to work):

EXEC sp_AllocationMetadata 't1';
EXEC sp_allocationMetadata 't2';
GO

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
------------ --------- ------------------ ---------------- ----------- ---------- ---------------
t1           0         72057594042318848  IN_ROW_DATA      (1:152)     (0:0)      (1:153)

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
------------ --------- ------------------ ---------------- ----------- ---------- ---------------
t2           0          2057594042384384  IN_ROW_DATA      (1:154)     (0:0)      (1:155)

And now dump them with DBCC PAGE to look at the row structure:

DBCC TRACEON (3604);
GO
DBCC PAGE ('NullTest', 1, 152, 3);
DBCC PAGE ('NullTest', 1, 154, 3);
GO

-- Record dump from t1, all ten columns nullable and not NULL

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x684EC060

00000000:   10002c00 01000000 02000000 03000000 †..,.............
00000010:   04000000 05000000 06000000 07000000 †................
00000020:   08000000 09000000 0a000000 0a0000fc †................

-- Record dump from t2, first nine columns not nullable, tenth column nullable and NULL

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x684EC060

00000000:   10002c00 01000000 02000000 03000000 †..,.............
00000010:   04000000 05000000 06000000 07000000 †................
00000020:   08000000 09000000 21212121 0a0000fe †........!!!!....

If you try this yourself, you may get a different bit pattern for the NULL c10 (I got 0x21212121). It's a bit of a crap-shoot depending on what memory SQL Server reuses to create the record in memory before writing to the page - the 4 bytes of the NULL column aren't overwritten from the previous usage of the memory (and don't need to be).

In the first record, you can see that the null bitmap (underlined and in bold) says there's 10 columns (the 0x0a00 gets byte-reversed into 0x000a) and the null bitmap is 0xfc00, which is 1111110000000000 - 6 unused bits all set to 1 and 10 not-NULL column values.

In the first record, you can see that the null bitmap (underlined and in bold) says there's 10 columns again and the null bitmap this time is 0xfe00, which is 1111111000000000 - 6 unused bits all set to 1, 1 column NULL and 9 columns not-NULL. Column 1 is the far right bit, so column 10 is the tenth from right (the first '1').

This clearly shows that all columns are represented in the null bitmap even if only one of the columns is nullable. It's the same if on the first column in the table is nullable, I'll leave that for you to prove to yourself.

Next up: the next misconception I come across!

There's been a recent flurry of confusion and misconceptions about trace flag 1118 in SQL Server 2008. This trace flag switches allocations in tempdb from single-page at a time for the first 8 pages, to immediately allocate an extent (8 pages). It's used to help alleviate allocation bitmap contention in tempdb under a heavy load of small temp table creation and deletion.

There are multiple points of confusion, which I'll address in turn. Then I'll prove that the trace flag still works in SQL Server 2008.

1) Why is the trace flag usually required in 2000? In SQL 2000, whenever a temp table is created in tempdb and a row inserted, an IAM page must be allocated and a single data page must be allocated. These two pages are both 'single-page' allocations, from a mixed extent (see Inside The Storage Engine: Anatomy of an extent for more info). This means that an SGAM allocation bitmap page must be accessed, and a PFS page must be accessed (see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps for more info).

With lots of very small temp tables being created, this means the very first SGAM page and the very first PFS page in the data file are accessed/changed by all the threads, leading to latch contention problems on these two pages. When the temp tables are deleted again, the various pages are deallocated, which again needs to access and change the PFS page, and potentially the SGAM page.

There are two ways to alleviate this problem. Firstly, create multiple data files in tempdb - which splits the latch contention over multiple allocation bitmaps (from having allocations come from multiple files) and thus reduces the contention. The general rule of thumb was one tempdb data file for each processor core. Secondly, turn on TF1118, which makes the first 8 data pages in the temp table come from a dedicated extent. This means one extent is allocated from the GAM page, rather than 8 single pages (and potentially 8 accesses to the SGAM page). The pages within the extent are reserved and allocated singly from this extent, as needed. This also cuts down on contention and is documented in KB 328551.

2) What does reserved vs. allocated mean? When an extent is allocated to a table, the 8 pages in the extent are not immediately allocated as well. Allocating an extent means those 8 pages are reserved exclusively for subsequent allocation to that table. The pages are allocated individually as needed, but no other table can allocate them. This is why such extents are called 'dedicated' extents (see my blog post link above for more details). You can see the counters of reserved pages vs. allocated pages in the output from sp_spaceused.

3) Why is the trace flag not required so much in 2005 and 2008? In SQL Server 2005, my team changed the allocation system for tempdb to reduce the possibility of contention. There is now a cache of temp tables. When a new temp table is created on a cold system (just after startup) it uses the same mechanism as for SQL 2000. When it is dropped though, instead of all the pages being deallocated completely, one IAM page and one data page are left allocated, and the temp table is put into a special cache. Subsequent temp table creations will look in the cache to see if they can just grab a pre-created temp table 'off the shelf'. If so, this avoids accessing the allocation bitmaps completely. The temp table cache isn't huge (I think it's 32 tables), but this can still lead to a *big* drop in latch contention in tempdb.

4) Does the trace flag still exist in 2005 and 2008? Yes it does - KB 328551 clearly states:

Note Trace flag -T1118 is also available and supported in Microsoft SQL Server 2005 and SQL Server 2008. However, if you are running SQL Server 2005 or SQL Server 2008, you do not have to apply any hotfix.

Just to make extra-sure (as I'm always paranoid about saying absolutes), I checked with my good friend Ryan Stonecipher, who's the dev lead for the team that owns allocation (and a bunch of other stuff, including DBCC). He confirmed the code is exactly the same in 2008 as it was in 2005. And I prove it to you below too.

5) And why is it sill there in 2005 and 2008? It does the same thing in 2005/2008 as it did in 2000. If the temp table creation/deletion workload is high enough, you can still see latch contention, as the temp table cache won't be enough to completely alleviate the need for creating actual new temp tables, rather than just being able to grab one 'off the shelf'. In that case, using the trace flag to change to extent-based allocation (in *exactly* the same way as for 2000) can help, as can creating more tempdb data files.

As far as data files go though, the number has changed. Instead of a 1-1 mapping between processor cores and tempdb data files (*IF* there's latch contention), now you don't need so many - so the recommendation from the SQL team is the number of data files should be 1/4 to 1/2 the number of processor cores (again, only *IF* you have latch contention). The SQL CAT team has also found that in 2005 and 2008, there's usually no gain from having more than 8 tempdb data files, even for systems with larger numbers of processor cores. Warning: generalization - your mileage may vary - don't post a comment saying this is wrong because your system benefits from 12 data files. It's a generalization, to which there are always exceptions.

6) Why does DBCC IND still show two pages, even with the trace flag on? I've heard of some people being confused by the output of DBCC IND in SQL 2008 when the trace flag is turned on. Creating a single row temp table will only show two pages allocated in the DBCC output - one IAM page and one data page. Yes, that's completely correct - as only two pages are allocated, but the data page comes from a dedicated extent, not a mixed extent. (IAM pages are *always* single-page allocations from mixed-extents).

And now the proof, on SQL 2008.

SELECT @@VERSION;
GO

Microsoft SQL Server 2008 (RTM) - 10.0.1600.22 (Intel X86)

First off, I'll create a temp table without the trace flag enabled, and see what pages the table has allocated, by looking at the first IAM. I'll use a temp table with an 8000+ byte row size, and insert two rows - so we have two data pages for clarity.

DBCC TRACEOFF (1118, -1);
GO

USE tempdb;
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

Now I'll figure out what is the first IAM page, using my sp_AllocationMetadata script (see here for the script and details), and dump it with DBCC PAGE to see the single-page allocations it's tracking, and which dedicated extents are allocated to the table:

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1152921505223016448  IN_ROW_DATA      (1:158)     (0:0)      (1:199)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 199, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A35C08E

Slot 0 = (1:158)                     Slot 1 = (1:200)                     Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A35C0C2

(1:0)        - (1:1016)     = NOT ALLOCATED   

As you can clearly see from the partial output of the dump of the IAM page, there are two single-page allocations and no extents allocated to the temp table. This is what should happen when the trace flag is not enabled.

Now I'll do the same thing with the trace flag 1118 enabled.

USE tempdb;
GO

DROP TABLE #temp;
GO

DBCC TRACEON (1118, -1);
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1224979099301904384  IN_ROW_DATA      (1:208)     (0:0)      (1:158)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 158, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A8FC08E

Slot 0 = (0:0)                       Slot 1 = (0:0)                       Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A8FC0C2

(1:0)        - (1:200)      = NOT ALLOCATED                              
(1:208)      -              =     ALLOCATED                              
(1:216)      - (1:1016)     = NOT ALLOCATED
                              

Now as you can clearly see, there are no single-page allocations, and there's a single extent allocated to the table. Proof that trace flag 1118 still does exactly what it should in SQL Server 2008.

Now for a DBCC IND on the table: 

DBCC IND ('tempdb', '#temp', -1);
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID    
------- ----------- ------ ----------- ----------- -----------
1       158         NULL   NULL        293576084   0
1       208         1      158         293576084   0
1       209         1      158         293576084   0

(I've removed some of the trailing columns for clarity.) We see that it still only lists the two data pages (1:208, 1:209) and the IAM page (1:158) - although an entire extent was allocated to the temp table, only two pages from the extent were actually allocated and used - the rest are reserved for use by that table, but remain unallocated. 

Hopefully this post has cleared up a lot of the confusion around this trace flag and what it does.

Every so often (well, ok, very frequently) my internal list of 'I must blog about that' gets too big and I have an almost uncontrollable urge. Luckily, these urges end up with me blogging several times, otherwise who knows what kind of trouble I'd get into? Smile

First up today is some pretty deep internals about how the transaction log works in a specific situation. This has come up a few times in the last few internals/maintenance classes I've taught in the transaction log module so I wanted to get a blog post out there that proves my answer is correct. The question is easier to frame with a picture, borrowed from my TechNet Magazine article from February (see here).

  

The image shows the circular natue of the transaction log, and how VLFs can be overwritten once the log has wrapped around, as long as they are no longer active (red = active log/VLF, blue = inactive log/VLF). If none of this makes sense to you, go read the TechNet Magazine article and it should do.

The question is: what happens if log records continue being written to VLFs 1 and 2 in the picture, and eventually the end of VLF 2 has been reached, but VLF 3 is still active, and can't be overwritten? Does the database just stop?

Let's find out.

I'm going to create a database and then engineer the situation above.

CREATE DATABASE LogTestDB ON PRIMARY (
    NAME = LogTestDB_data,
    FILENAME = N'C:\SQLskills\LogTestDB_data.mdf')
LOG ON (
    NAME = 'LogTestDB_log',
    FILENAME = N'C:\SQLskills\LogTestDB_log.ldf',
    SIZE = 5MB);
GO

The database is in the SIMPLE recovery model. Well, to be completely honest, it's in the FULL recovery model, but there hasn't been a full database backup yet, so as far as transaction log behavior is concerned, it behaves as if it was in SIMPLE (the log can be cleared by a CHECKPOINT) - again, if this makes no sense, go read the TechNet Magazine article. 

And then see how many VLFs it has using the DBCC LOGINFO command (and this is the *only* way to see how many VLFs there are):

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ -----------
2        1245184    8192          31       2        64     0
2        1245184    1253376       0        0        0      0
2        1245184    2498560       0        0        0      0
2        1499136    3743744       0        0        0      0

We've got 4 VLFs. The Status column tells us whether the VLF is active or not. A status of 2 means the VLF is active, 0 means it's not. The sequence number (FSeqNo) is the logical order of the VLFs within the log. The FileSize is in bytes (so each VLF is about 1.25 MB). Right now there's only one active VLF.

Now I'll engineer the situation in the image above. I'm going to fill the log so that VLFs 1, 2, and 3 are active. Then I'm going to start an explicit transaction that will hold VLF 3 and onwards active. Then I'll continue filling the log so it wraps around and starts to fill up VLF 1 again.

USE LogTestDB;
GO
CREATE TABLE BigRows (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO

SET NOCOUNT ON;
INSERT INTO BigRows DEFAULT VALUES;
GO 300

I've filled up VLFs 1 and 2, and started on 3. Let's check with DBCC LOGINFO:

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ -----------
2        1245184    8192          31       2        64     0
2        1245184    1253376       32       2        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       0        0        0      0

As you can see, the first 3 VLFs now have a status of 2 (active), and they're in sequence. Now I'll create an explicit transaction that will prevent VLF 3 and onwards being cleared.

BEGIN TRAN
INSERT INTO BigRows DEFAULT VALUES;
GO

If I explicitly do a CHECKPOINT now, VLFs 1 and 2 will clear:

CHECKPOINT
GO

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ----------
2        1245184    8192          31       0        64     0
2        1245184    1253376       32       0        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       0        0        0      0

Now I'll carry on filling up the BigRows table so the log wraps around and starts filling up VLFs 1 and two again.

INSERT INTO BigRows DEFAULT VALUES;
GO 300

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ----------
2        1245184    8192          35       2        128    0
2        1245184    1253376       32       0        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       34       2        64     0

You can see that the log has wrapped around now, but VLFs 3 and 4 are still active. Look at the sequence numbers of the active VLFs... the active log is VLF 3 then 4 then 1, with sequence numbers 33 then 34 then 35. Now if I carry on filling up the table, what's going to happen when the log bumps up against VLF 3 that is still active? Will it stop or will it grow?

INSERT INTO BigRows DEFAULT VALUES;
GO 300

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ------------------
2        1245184    8192          35       2        128    0
2        1245184    1253376       36       2        128    0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       34       2        64     0
2        253952     5242880       37       2        64     36000000049300052
2        270336     5496832       38       2        64     36000000049300052
2        253952     5767168       0        0        0      36000000107500066
2        335872     6021120       0        0        0      36000000107500066
2        253952     6356992       0        0        0      36000000190700020
2        401408     6610944       0        0        0      36000000190700020
2        253952     7012352       0        0        0      37000000037300040
2        466944     7266304       0        0        0      37000000037300040

The answer is that it grew, and kind of skipped the active VLFs! Look at the sequence numbers. The new sequence of the active log is VLF 3 then 4 then 1 then 2 then 5 then 6, as you can see from the sequence numbers. Once the active transaction I created commits or rolls back, VLFs 3, 4, 1, and 2 can clear and then the 'normal' sequencing of VLFs in the log will resume.

Every VLF has a small header which contains the sequence number of the VLF within the transaction log, so the log can kind of do contortions to work around active VLFs in the middle of the log. Very cool.

Ok - that was fun - now I feel better!

PS In the last couple of DBCC LOGINFO dumps, where the log has wrapped around, you can see that the parity bits for the log blocks in the VLFs have changed, as I explained in one of my previous posts Search Engine Q&A #24: Why can't the transaction log use instant initialization?.

(Yes, Kimberly's lecturing again...)

Here's something that I've seen crop up a lot recently on corruption forums:

Server: Msg 602, Level 21, State 50, Line 1
Could not find row in sysindexes for database ID 10, object ID 1, index ID 1. Run DBCC CHECKTABLE on sysindexes.

This isn't corruption - it comes from trying to attach a 2005 database to a 2000 server. If you try to restore a 2005 database on a 2000 server, you'll see:

Server: Msg 3169, Level 16, State 1, Line 1
The backed-up database has on-disk structure version 611. The server supports version 539 and cannot restore or upgrade this database.
Server: Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

This gives a more useful error. Basically, SQL Server is not up-level compatible in terms of the database physical structures. A SQL 2000 server cannot understand the new structures that are in a SQL 2005 database. There's a lot of confusion about this, and why setting database compatibility level is not the same as the physical version of the database. My blog post Search Engine Q&A #13: Difference between database version and database compatibility level has more details.

Trying to do the same thing with a 2008 database on a 2005 server is a bit better. For the attach we get:

Msg 1813, Level 16, State 2, Line 1
Could not open new database 'Dbmaint2008'. CREATE DATABASE is aborted.
Msg 948, Level 20, State 1, Line 1
The database 'Dbmaint2008' cannot be opened because it is version 655. This server supports version 612 and earlier. A downgrade path is not supported.

(Note that 611 is the physical version for all SQL Server 2005 builds, but if VARDECIMAL is enabled, the version gets bumped by one to 612 - long story...)

But the restore error is still a little cryptic:

Msg 3241, Level 16, State 7, Line 1
The media family on device 'c:\sqlskills\dbmaint2008.bck' is incorrectly formed. SQL Server cannot process this media family.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

Bottom line - you can't attach a higher-version database to a lower-version server.

A thread cropped up on SQLServerCentral involving IAM chain corruption (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units for details of IAM chains). The error from DBCC CHECKDB was:

Server: Msg 2576, Level 16, State 1, Line 1
IAM page (0:0) is pointed to by the previous pointer of IAM page (1:394336) object ID 229575856 index ID 4 but was not detected in the scan.

and there was some discussion of what the error meant, and why the initial page ID in the error of (0:0) means something special. There was a further question of how the errors would differ if the IAM page header was partially zero'd out by an I/O subsystem error.

We're on-stage here at SQL Connections doing a pre-con and I'm not on until this afternoon so I can bang out a quick blog post! I'm going to create a small database and show the difference between the two cases. The error above was from a SQL 2000 database, but the behavior is the same on SQL 2005 and 2008. Here's the script to create the database.

CREATE DATABASE CorruptIAMEXample;
GO
USE CorruptIAMExample;
GO

CREATE TABLE test (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX test_cl on test (c1);
GO

SET NOCOUNT ON;
GO
INSERT INTO test DEFAULT VALUES;
GO 1000

After corrupting the database, I can reproduce the error above by running DBCC CHECKDB on it:

Msg 2576, Level 16, State 1, Line 1
The Index Allocation Map (IAM) page (0:0) is pointed to by the previous pointer of IAM page (1:153) in object ID 0, index ID -1, partition ID 0, alloc unit ID 72057594042384384 (type Unknown), but it was not detected in the scan.

The error is saying that the first IAM page in the IAM chain (page 1:153) does not have a reference from metadata. The sysallocunits system table contains a link to the first IAM page, the root page, and the first page. You can see these by querying the sys.system_internals_allocation_units catalog view, or you can see this blog post  - Inside The Storage Engine: sp_AllocationMetadata - putting undocumented system catalog views to work. I corrupted the sysallocunits table so that the link to the first IAM page of the test table was removed. The clue is the first page ID in the error - if it's a (0:0), that's the missing metadata case. For SQL 2000, this can happen if someone manually updates the sysindexes table.

Now what about the other case, where the IAM page itself has a corrupt header? I recreated the database again and corrupted the header of the first IAM page of the test table. Here's the output from DBCC CHECKDB:

Msg 2575, Level 16, State 1, Line 1
The Index Allocation Map (IAM) page (1:153) is pointed to by the next pointer of IAM page (0:0) in object ID 2073058421, index ID 1, partition ID 72057594038386688, alloc unit ID 72057594042384384 (type In-row data), but it was not detected in the scan.
Msg 7965, Level 16, State 2, Line 1
Table error: Could not check object ID 2073058421, index ID 1, partition ID 72057594038386688, alloc unit ID 72057594042384384 (type In-row data) due to invalid allocation (IAM) page(s).
Msg 8928, Level 16, State 6, Line 1
Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown): Page (1:153) could not be processed. See other errors for details.

and a whole bunch of

Msg 8905, Level 16, State 1, Line 1
Extent (1:216) in database ID 21 is marked allocated in the GAM, but no SGAM or IAM has allocated it.

errors. This is because the IAM page no longer looks like an IAM page and so DBCC CHECKDB can't process it as such. If I zero out the IAM page completely, DBCC CHECKDB returns basically the same errors as above. 

I've created a zipped backup of the SQL 2005 database in each case:

You'll need to do a RESTORE FILELISTONLY and then maybe move the files when you restore. Have fun!

Way back at the start of me blogging here I wrote a comprehensive description of ghost records and the ghost cleanup process - see Inside the Storage Engine: Ghost cleanup in depth. A question came up in the class I'm teaching this week that's worth answering in a blog post - do ghost records occur in heaps? The answer is no, not during normal processing.

When snapshot isolation is enabled, deletes from a heap are ghosted, as part of the overall versioning process, which can lead to some interesting side-effects. A versioned record has an extra 14-bytes tagged on the end, so a heap record that suddenly becomes versioned is 14-bytes longer - which may mean it doesn't fit on the page any longer. This could lead to it being moved, resulting in a forwarding/forwarded record pair - just because the record was deleted! Now, the page has to be full for this time happen, and the Storage Engine will take steps to avoid this happening for rows less than 32 bytes long - but that's getting a little too deep.

Anyway, I digress. I want to show you the difference between deleting from a clustered index and from a heap. I'm going to create two such tables, then delete row from each and roll it back.

CREATE TABLE t1 (c1 CHAR (10));
CREATE CLUSTERED INDEX t1c1 on t1 (c1);
GO

CREATE TABLE t2 (c1 CHAR (10));
GO

INSERT INTO t1 VALUES ('PAUL');
INSERT INTO t1 VALUES ('KIMBERLY');

INSERT INTO t2 VALUES ('PAUL');
INSERT INTO t2 VALUES ('KIMBERLY');
GO

-- prevent random background transactions
ALTER DATABASE GhostTest SET AUTO_CREATE_STATISTICS OFF;
GO

BEGIN TRAN DelFromClust;
DELETE FROM t1 WHERE c1='KIMBERLY';
ROLLBACK TRAN;
GO

BEGIN TRAN DelFromHeap;
DELETE FROM t2 WHERE c1='KIMBERLY';
ROLLBACK TRAN;
GO

SELECT * FROM ::fn_dblog (null, null);
GO

Here's a portion of the results from looking in the transaction log. The line of code where I turn off auto-update stats is just to prevent the auto-create transactions from cluttering up my view of the transaction log.

The first (highlighted) transaction is for the delete/rollback in the clustered index. You can clearly see that the third column shows a log context of ghosting for the LOP_DELETE_ROWS log record, plus the setting of the 'this page has at least one ghost record' in the PFS byte for that page.

The second (unhighlighted) transaction is for the delete/rollback in the heap. Here you can see that it just does a straight delete.

If you look at the data page contents before the rollback in both cases, for the clustered index you'll still be able to see the deleted (ghosted) record, and for the heap you'll see the deleted record really is deleted.

Hope this helps.

After writing the FILESTREAM whitepaper for Microsoft, I've had lots of questions about the structure of the FILESTREAM data container. The FILESTREAM data container is the technical term for the NTFS directory structure where all the FILESTREAM data is stored.

When you want to use FILESTREAM data, you first add a filegroup (during or after database creation):

ALTER DATABASE FileStreamTestDB ADD FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM;
GO

And then add a 'file' to the filegroup:
ALTER DATABASE FileStreamTestDB ADD FILE (
     NAME = FSGroup1File, FILENAME = 'C:\Metro Labs\FileStreamTestDB\Documents')
TO FILEGROUP FileStreamGroup1;
GO

The 'file' is actually the pathname to what will become the root directory of the FILESTREAM data container. When it's initially created, it will contain a single file, filestream.hdr, and a single directory $FSLOG. Filestream.hdr is a metadata file describing the data container and the $FSLOG directory is the FILESTREAM equivalent of the database transaction log. You can think of them as equivalent, although the FILESTREAM log has some interesting semantics, which I'll cover in a separate post.

The question I most often get is: are all the FILESTREAM files for a database stored in one gigantic directory? The answer is no.

The root directory of the data container contains one sub-directory for each table (or each partition of a partitioned table). Each of those directories contains a further sub-directory for each FILESTREAM column defined in the table. An example is below, with the screen shot taken after running the following code:

CREATE TABLE FileStreamTest1 (
    DocId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
    DocName VARCHAR (25),
    Document VARBINARY(MAX) FILESTREAM);
GO

CREATE TABLE FileStreamTest2 (
    DocId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
    DocName VARCHAR (25),
    Document1 VARBINARY(MAX) FILESTREAM,
    Document2 VARBINARY(MAX) FILESTREAM);
GO

INSERT INTO FileStreamTest1 VALUES (NEWID (), 'Paul Randal', CAST ('SQLskills.com' AS VARBINARY(MAX)));
INSERT INTO FileStreamTest1 VALUES (NEWID (), 'Kimberly Tripp', CAST ('SQLskills.com' AS VARBINARY(MAX)));
GO

This image shows the FILESTREAM data container for our database that has two tables with FILESTREAM columns, each with a single partition. The first table has a two FILESTREAM columns and the second has a single FILESTREAM column. The filenames of all these directories are GUIDs. In the example, you can see two FILESTREAM files in a column-level directory. The FILESTREAM file names are actually the log-sequence number from the database transaction log at the time the files were created. You can correlate these by looking at the data with DBCC PAGE, but first finding the allocated pages using sp_AllocationMetadata (see this blog post):

EXEC sp_AllocationMetadata FileStreamTest1;
GO

Object Name     Index ID Alloc Unit ID     Alloc Unit Type   First Page Root Page First IAM Page
--------------- -------- ----------------- ---------------   ---------- --------- --------------
FileStreamTest1 0        72057594039697408 IN_ROW_DATA       (1:169)    (0:0)     (1:170)
FileStreamTest1 0        72057594039762944 ROW_OVERFLOW_DATA (0:0)      (0:0)     (0:0)
FileStreamTest1 2        72057594039828480 IN_ROW_DATA       (1:171)    (1:171)   (1:172)

(3 row(s) affected)

Notice there's a nonclustered index as well as the heap - that's the index that's enforcing the uniqueness constraint on the UNIQUEIDENTIFIER column. Now we can use DBCC PAGE to look at the first page of the heap, which will have out data records in:

DBCC TRACEON (3604);
DBCC PAGE (FileStreamTestDB, 1, 169, 3);
GO

<snip>

Slot 0 Offset 0x60 Length 88

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 88
Memory Dump @0x514EC060

00000000:   30001400 140d5047 2ca9f24f 874d35ca †0.....PG,©òO‡M5Ê
00000010:   e9e77649 03000002 00280058 80506175 †éçvI.....(.X.Pau
00000020:   6c205261 6e64616c 03000000 00000080 †l Randal........
00000030:   140d5047 2ca9f24f 874d35ca e9e77649 †..PG,©òO‡M5ÊéçvI
00000040:   01000000 68020000 00000000 17000000 †....h...........
00000050:   79000000 0c000000 †††††††††††††††††††y.......

Slot 0 Column 1 Offset 0x4 Length 16 Length (physical) 16

DocId = 47500d14-a92c-4ff2-874d-35cae9e77649

Slot 0 Column 2 Offset 0x1d Length 11 Length (physical) 11

DocName = Paul Randal

Document = [Filestream column] Slot 0 Column 3 Offset 0x28 Length 48

ColType = 3                          FileId = -2147483648                 UpdateSeq = 1
CreateLSN = 00000017:00000079:000c (23:121:12)                            TxFMiniVer = 0
XdesId = (0:616)

<snip>

You can see that the CreateLSN I've highlighted above matches the filename of the first FILESTREAM file in the example image.

Hopefully this explains how the FILESTREAM files are stored - more on this in the next post where I'll show how updates and garbage collection are implemented.

This question came up in a class I was teaching a few weeks ago: given that the variable-length column offset array in a record only stores offsets, how does the storage engine find a particular variable-length column?

The question comes because there's nothing in the record to say which variable-length column is which - so how does it work? The answer is a combination of the null bitmap in the record, plus the metadata for the table/index stored in the system tables. All variable-length columns have a fixed 'location' within the variable-length portion of the record when they are non-null. Let's see what I mean.

First off I'll create a test table with a single record, with all variable-length columns null, and dump out the record using DBCC PAGE:

CREATE TABLE vartest (c1 INT, c2 VARCHAR (100), c3 VARCHAR (100), c4 varchar (100));
GO
INSERT INTO vartest VALUES (1, NULL, NULL, NULL);
GO

DBCC TRACEON (3604);
DBCC PAGE (test, 1, 152, 1);
GO

Slot 0, Offset 0x60, Length 11, DumpStyle BYTE
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP
Memory Dump @0x66F4C060

00000000: 10000800 01000000 0400fe†††††††††††††...........

The 0x0400fe in bold-underlined is the null bitmap. The 0x0400 reverses to 0x0004 - which is the count of columns in the record. The 0xfe is the actual bitmap, which is 11111110 in binary. So all columns in the record except the first are null (although the null bitmap only needs to store bits for 4 columns, all bits in the null bitmap that aren't used are set to indicate a null). You can also see that there's no variable-length column offset array as *all* the variable-length columns are null.

Now I'll make the middle variable-length column non-null and dump out the record using DBCC PAGE:

UPDATE vartest SET c3 = 'c3c3c3c3';
GO
DBCC PAGE (test, 1, 152, 1);
GO

Slot 0, Offset 0x60, Length 25, DumpStyle BYTE
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS
Memory Dump @0x67FEC060

00000000: 30000800 01000000 0400fa02 00110019 †0...............
00000010: 00633363 33633363 33†††††††††††††††††.c3c3c3c3

We can see the null bitmap (bold-underlined) has changed to 0xfa in hex, or 11111010 in binary, to reflect that the first and third columns are non-null. As soon as a single variable-length column in the record is non-null, the variable-length column offset array is populated for all variable-length columns up to and including the last non-null column. The array has the count of entries in the array (the 0200, which reverses to 0x0002), and then for each column it stores the offset to the start of the following column, to avoid storing the length too. The difference between successive start-of-column offsets is the length, with the first length calculated using the end of the array itself as the starting point of the first column. In the record above, the array has two entries, ending at offsets 0x0011-1 and 0x0019-1 in the record. The array itself finishes at 0x0010, so the first entry in the array is essentially pointing at an empty value, which we know to be actually a null value (instead of a non-null empty value) using the null bitmap.

Now if I make the first variable-length column non-null, watch what happens to the record:

UPDATE vartest SET c2 = 'c2c2c2c2';
DBCC PAGE (test, 1, 152, 1);
GO

Slot 0, Offset 0x60, Length 33, DumpStyle BYTE
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS
Memory Dump @0x6714C060

00000000: 30000800 01000000 0400f802 00190021 †0..............!
00000010: 00633263 32633263 32633363 33633363 †.c2c2c2c2c3c3c3c
00000020: 33†††††††††††††††††††††††††††††††††††3

The null bitmap has changed from 0xfa to 0xf8 to reflect the newly non-null column. The variable-length portion is re-ordered so that the newly non-null column is in its correct place (you can see the c2c2c2c2c2 comes before the c3c3c3c3 in the hex dump of the record) and the offsets have been updated accordingly. The array still only hold two values though. Now if I set the last column to be non-null:

UPDATE vartest SET c4 = 'c4c4c4c4';
DBCC PAGE (test, 1, 152, 1);
GO

Slot 0, Offset 0x81, Length 43, DumpStyle BYTE
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS
Memory Dump @0x66FCC081

00000000: 30000800 01000000 0400f003 001b0023 †0..............#
00000010: 002b0063 32633263 32633263 33633363 †.+.c2c2c2c2c3c3c
00000020: 33633363 34633463 346334†††††††††††††3c3c4c4c4c4

The null bitmap now indicates all 4 columns are non-null. The count of columns in the variable-length column offset array has increased to 3 and the new entry is added on the end, as it is the last variable-length column.

You can conceptually think of the algorithm to obtain a variable-length column value as merging the relevant parts of the null bitmap with the variable-length column offset array, and then returning the value if it's non-null.

Hope this explains things!

In SQL Server 2000 and before, the symptoms of database corruption would occasionally manifest themselves as asserts, such as:

SQL Server Assertion: File: <recbase.cpp>, line=1378 Failed Assertion = 'm_offBeginVar < m_sizeRec'.

To reduce the number of assertions being fired by the SQL Engine, my team changed these asserts into real error messages for SQL Server 2005 onwards - either 5242 or 5243 - such as:

Msg 5242, Level 22, State 1, Line 1

An inconsistency was detected during an internal operation in database 'MyCorruptDatabase'(ID:12) on page (1:34923). Please contact technical support. Reference number 4.

Error 5243 is exactly the same except that the database couldn't be determined for some reason. Both of the errors above say that the offset of the variable length column offset array is beyond the end of the record.

I've noticed increasing confusion from these errors being reported - sometimes the worry is that maintenance jobs are causing them, or DBCC CHECKDB is causing them. This may appear to be so but is really just an artifact of the fact that DBCC CHECKDB reads all allocated pages in the database and may read a corrupt page that your normal queries just don't happen to cause to be read. The messages may disappear because other maintenance jobs cause indexes to be rebuilt and so corrupt pages may be deallocated from the database - meaning they won't be read by DBCC CHECKDB.

If you have either torn-page detection or page checksums enabled, you may not ever see these 5242 or 5243 errors as you'll likely see an 824 error instead. The 824 error is raised by the buffer pool when the page is read - before the page can be processed by the record-cracking code that would raise the 5242 or 5243 errors. Bottom line is that 5242 and 5243 says you've got corruption in the structure of a record - the reference numbers at the end of error say exactly what kind of corruption there is - for instance that the record size is invalid or the record type is wrong for the type of page it resides. If you see these errors, you need to go through the motions of recovering from corruption and figuring out what's wrong with your I/O subsystem.

Hope this helps.

Wow - today is all about new content. As if I haven't already blogged about enough stuff to keep you reading through next week, the February issue of TechNet Magazine is now available and contains a feature article I wrote about understanding how logging and recovery work inside SQL Server.

The article covers:

  • What is logging?
  • What is recovery?
  • The transaction log (include logical and physical architecture)
  • Recovery models and how they affect the behavior of the transaction log

There's also a ten-minute screencast video where I demonstrate a runaway transaction log.

Check it out at http://technet.microsoft.com/en-us/magazine/2009.02.logging.aspx.

(Posted with permission of the dev team)

Here's an interesting bug that people are hitting. I found out about this while here in Barcelona at TechEd when I got roped into a discussion with a couple of Microsoft colleagues. Their customer was seeing errors like the following:

Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information for page (1:35244) in object ID 1683128146, index ID 1, partition ID 223091033422352, alloc unit ID 81405523118118176 (type LOB data). Expected value 0_PCT_FULL, actual value 100_PCT_FULL

This error says that the PFS page (see this blog post) has the wrong free-space tracking bits for a text page. In SQL Server 2000, the algorithm to keep track of free space in the PFS pages wasn't perfect so CHECKDB never reported these errors - it would silently fix them. In 2005 we fixed the algorithm (supposedly), so I turned on the reporting of these errors again in CHECKDB.

When I first saw the description from the customer, my first reaction was that it was an I/O subsystem problem causing corruption, but the customer has page checksums turned on, so a corruption would result in an 824 error before an 8914 error. My conclusion then was that there's a bug in the new free-space tracking algorithm. After checking with the dev team, it turns out my suspicions were correct - there is a bug in 2005 SP2. It's fixed in 2005 SP3 and in 2008, but you may see these 8914 errors if you're not running one of those.

Here are the technical details of the problem (slightly edited from the dev team explanation):

The issue was that when minimal logging for LOBs was used (under the SIMPLE recovery model, during BULK INSERT/BCP/Large insert with TABLOCK), extents are being pre-allocated and the pages were being marked 100% full in the PFS page when the pages were allocated. The idea was that all pages eventually will be filled up with LOB data, and by marking them full during allocation we avoid an extra update to the PFS pages when the data is actually put on the page.

Suppose now that 64 pages are pre-allocated, and only 40 or so pages are used and have rows on them. When the transaction commits, the Storage Engine is supposed to deallocate the 24 extra pages that were pre-allocated, and marked 100% full, even though they don’t have any rows on them yet. There was a bug where in a certain case the deallocation wouldn't happen, so you end up with empty pages that have a PFS state of 100%, but don’t have any rows on them.

In general, even though this is a bug in the code, functionality wise, there is nothing wrong with the database, besides a number of additional pages that are empty and allocated to the LOB tree, so as long as the message tells you that the page is supposed to be empty, but is actually marked 100%, nothing can really go wrong with that page. Unfortunately, DBCC will keep reporting these errors.

Although I haven't tried this, my guess is that you can get rid of the empty LOB pages using ALTER INDEX ... REORGANIZE WITH (LOB_COMPACTION = ON). So, if you see some of these errors with no other errors, you may have hit this bug and have nothing really to worry about.

Every so often I'll see posts on the various data corruption forums discussing causes of corruption. In this post I want to debunk some of the myths around what can cause corruption. There are really two types of corruption to deal with, physical corruption and logical corruption.

Physical corruption

This is where something has altered the contents of a data or log file sector with no regard for what is being stored there. Possible causes of physical corruption are:

  • Problem with the I/O subsystem (99.8% of all cases I've ever seen - only 3 nines as I'd estimate I've seen around about a thousand corruption cases). Remember the I/O subsystem is everything underneath SQL Server in the I/O stack - including the OS, 3rd-party file system filter drivers, device drivers, RAID controllers, SAN controllers, network hardware, drives themselves, and so on. Millions of lines of code and lots of moving parts spinning very fast, very close to very fragile pieces of metal oxide (I once heard Jim Gray liken a disk drive head to a 747 jumbo jet flying at 500 mph at a height of 1/4 inch from the ground...)
  • Problem with the host machine hardware (0.1% of cases). Most of the time this is a memory error.
  • SQL Server bugs (0.1% of cases). Yes, there have been corruption bugs. Every piece of software has bugs. There are KB articles describing bugs.
  • Deliberate introduction of corruption using a hex editor or other means.

Physical corruption is what DBCC CHECKDB usually reports and the majority of cases are caused by a physical failures of some kind, with the minority caused by humans - software bugs.

Logical corruption

This is where something has altered some data so that a data relationship is broken. Possible causes of logical corruption are:

  • Humans

:-) Okay...

  • Application bug. The application deletes one part of an inherent data relationship but not the other. Or the application designer doesn't implement a constraint properly. Or the application designer doesn't cope with a transaction roll-back properly. You get the idea.
  • Accidental update/delete. Someone deletes or updates some data incorrectly.
  • SQL Server bug. See above.
  • DBCC CHECKDB when using the REPAIR_ALLOW_DATA_LOSS option. As is documented in Books Online, and I've blogged about and mentioned when lecturing, if you run repair, it doesn't take into account any inherent or explicit constraints on the data.

The point here is that a physical failure of a component does not cause logical corruption, it causes physical corruption. Conversely, application errors cause logical corruption, not physical corruption. DBCC CHECKDB errors are about physical corruption (okay, with the inclusion of DBCC CHECKCATALOG code in 2005, it will find cases where the DBA has manually altered the system tables, causing logical corruption) and applications cannot cause physical corruption as they can only manipulate data through SQL Server. If an application hits a SQL Server bug which causes physical corruption, that's still not the application causing physical corruption, it's SQL Server.

So - on to the myths.

  • Can an application cause physical corruption? No.
  • Can stopping a shrink operation cause corruption of any kind? No.
  • Can stopping an index rebuild cause corruption of any kind? No.
  • Can running DBCC CHECKDB without repair cause corruption of any kind? No.
  • Can creating a database snapshot cause corruption of any kind? No.

Hope this helps.

This is a quickie in response to a blog comment from my previous post on instant initialization - How to tell if you have instant initialization enabled?. The comment was:

I must say, I love instant initialization. It's awesome. But I always wondered why it's not available for the log file. I assume there's a technical reason... but what is it? Does it depend on having the rest of the file be zeroed out? Doesn't it already know where it's start and stop points are anyways, since the log is circular?

I couldn't remember the exact answer so I discussed with Peter Byrne on the Storage Engine dev team and now I have the answer to share. There is a lot of metadata kicking around in the Storage Engine about the transaction log (mostly in the boot page - see my post Search Engine Q&A #20: Boot pages, and boot page corruption), including where to start reading the log during crash recovery. However, there's nothing that can be used after a crash occurs to determine where the active transaction log ends (i.e. where should crash recovery stop processing log records).

The way this is done is to have each log sector have parity bits stamped on it. When the log is first created, it is zero-initialized (with zero being an illegal log sector parity value). As the log is written, each sector has parity bits in it. When the end of the log is reached, and it wraps around to the start of the log file, the parity bits are flipped, so that overwritten log sectors have the opposite parity from when they were last written. When a crash occurs, log sectors are read and processed until a log sector with an out-of-sequence parity is found.

This entire process will not work if there's already random data in the space used by the log file - some of the random data could just look like a valid set of parity bits and cause the recovery system to try to process a log sector full of garbage, leading to a suspect database, at best.

So - it's not just a "there wasn't time" - there really is a good, architectural reason why instant initialization cannot be done with the transaction log.

Following on from my previous post on boot pages and boot page corruption, I've been asked about file header pages - and I was already planning this post as the next in the series.

So what's a file header page? Every data file in a database has the very first 8kb page (i.e. page 0 in the file) set aside as the place to store all the metadata info about the file. As with the boot page, you can look at the contents with DBCC PAGE and it will interpret all the fields for you, or you can use the DBCC FILEHEADER command, which does a better job. This is undocumented and unsupported, just like DBCC DBINFO for looking at the database boot page, but it's been discussed and posted about on the Internet before so it's existence is no secret.

The command take a database name or database ID plus the file ID to dump. Here I've created a database called FileHeaderTest and used SSMS with results-to-text, plus a bunch of editing of the results to make it blog-able:

DBCC FILEHEADER ('FileHeaderTest', 1);
GO

FileId                : 1
LogicalName           : FileHeaderTest
BindingId             : D30AE3EF-14A6-47D5-B267-96F38238D882
FileGroup             : 1
Size                  : 152
MaxSize               : -1
MinSize               : 152
UserShrinkSize        : -1
Growth                : 128
BackupLSN             : 0
RedoStartLSN          : 0
FirstLSN              : 0
MaxLSN                : 0
FirstUpdateLSN        : 0
CreateLSN             : 0
SectorSize            : 512
RecoveryForkGUID      : 00000000-0000-0000-0000-000000000000
RecoveryForkLSN       : 0
DifferentialBaseLsn   : 19000000048800037
DifferentialBaseGuid  : 279A8EF4-4431-4CA5-8939-F613E5BC3033
Status                : 2
RestoreStatus         : 0
ReadOnlyLsn           : 0
ReadWriteLsn          : 0
MaxLsnBranchId        : 00000000-0000-0000-0000-000000000000
RedoTargetPointLsn    : 0
RedoTargetPointGuid   : 00000000-0000-0000-0000-000000000000
RestoreDiffBaseLsn    : 0
RestoreDiffBaseGuid   : 00000000-0000-0000-0000-000000000000
RestorePathOriginLsn  : 0
RestorePathOriginGuid : 00000000-0000-0000-0000-000000000000
OldestRestoredLsn     : 0

Lots of interesting stuff in here, such as:

  • BindingId: used to make sure a file is really part of this database
  • SectorSize: the disk sector size
  • Status: what kind of file and what state is it in (e.g. 2 = regular disk file)
  • Various sizes in number-of-8kb-pages (e.g. MaxSize of -1 means file growth is unlimited)
  • Growth: the number of pages to grow the file by if the 0x100000 bit is NOT set in the Status field. If it is set, the Growth is in percent.

And you can watch things change. For instance, if I change the file growth to 10%:

ALTER DATABASE FileHeaderTest MODIFY FILE (NAME = FileHeaderTest, FILEGROWTH = 10%);
GO

And then dump the file header page contents again, the Status and Growth fields have changed to:

.
.
Growth                : 10
.
.
Status                : 1048578
.
.

So what if a file header page is corrupt? I corrupted the file header page of my database and then started up SQL Server.

USE FileHeaderTest;
GO

Msg 945, Level 14, State 2, Line 1
Database 'FileHeaderTest' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Let's try EMERGENCY mode:

ALTER DATABASE FileHeaderTest SET EMERGENCY;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\FileHeaderTest.mdf' is not a valid database file header. The PageAudit property is incorrect.

In this case, there's nothing to do except restore from backups, and the database is inaccessible because the PRIMARY filegroup could not be brought online. If the corruption were in a file in a secondary filegroup, things are a little bit different. Now I've added a secondary filegroup with a single file (called CorruptFile) and corrupted it's file header page. After starting up SQL Server we get the same behavior - but this time we can set the file to be offline and access the rest of the database. This is called partial database availability and works in Enterprise (and Developer) Edition only.

ALTER DATABASE FileHeaderTest MODIFY FILE (NAME = CorruptFile, OFFLINE);
GO

Note that the only way to bring an offline file back online is to restore it from a backup - see this post from my old Storage Engine blog for more details. Another corruption that can only be repaired using backups...

 

Now that I've done all the business-related blog posts, back to the good stuff to stop people complaining!

Something that's cropped up a few times over the summer so far is people trying to repair boot page corruptions.

First off, what's a boot page? Every database has a single page that stores critical information about the database itself. It's always page 9 in file 1 (the first file in the PRIMARY filegroup). You can examine the page using DBCC PAGE and it will interpret all the fields for you, but there's another command, DBCC DBINFO, that also dumps all this info (in fact the DBCC PAGE code calls the same underlying dumping code). This command is undocumented and unsupported but widely known and 'documented' in lots of places on the web - given that it uses the same code as DBCC PAGE, it's just as safe to use IMHO.

So what's on the boot page?

DBCC DBINFO ('BootPageTest');
GO

DBINFO STRUCTURE:


DBINFO @0x5BF6EF84

dbi_dbid = 19                        dbi_status = 65536                   dbi_nextid = 2073058421
dbi_dbname = BootPageTest            dbi_maxDbTimestamp = 2000            dbi_version = 611
dbi_createVersion = 611              dbi_ESVersion = 0                   
dbi_nextseqnum = 1900-01-01 00:00:00.000                                  dbi_crdate = 2008-07-10 15:53:18.843
dbi_filegeneration = 0              
dbi_checkptLSN

m_fSeqNo = 41                        m_blockOffset = 29                   m_slotId = 55
dbi_RebuildLogs = 0                  dbi_dbccFlags = 2                   
dbi_dbccLastKnownGood = 1900-01-01 00:00:00.000                          
dbi_dbbackupLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_oldestBackupXactLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_LastLogBackupTime = 1900-01-01 00:00:00.000                          
dbi_differentialBaseLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_createIndexLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_versionChangeLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_familyGUID = a4e88c13-b4cf-4320-834e-92b237244d4b                    
dbi_recoveryForkNameStack


entry 0

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = a4e88c13-b4cf-4320-834e-92b237244d4b                            

entry 1

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = 00000000-0000-0000-0000-000000000000                            
dbi_differentialBaseGuid = 00000000-0000-0000-0000-000000000000           dbi_firstSysIndexes = 0001:00000014
dbi_collation = 872468488            dbi_category = 0                     dbi_maxLogSpaceUsed = 231936
dbi_localState = 0                   dbi_roleSequence = 0                
dbi_failoverLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmRedoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmOldestXactLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_dbMirrorId = 00000000-0000-0000-0000-000000000000                    
dbi_pageUndoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_disabledSequence = 0            
dbi_dvSplitPoint

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_CloneCpuCount = 0                dbi_CloneMemorySize = 0             
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

There's all kinds on interesting things in there, for instance:

  • dbi_version and dbi_createversion: the physical version number of the database (and when it was created). See question 1 in the August 2008 SQL Q&A column in TechNet Magazine for an explanation (see here).
  • dbi_RebuildLogs: a count of the number of times the transaction log has been rebuilt for the database. PSS can use this to tell whether corruption problems could have been caused by DBAs rebuilding the log
  • dbi_dbccLastKnownGood: the completion time of the last 'clean' run of DBCC CHECKDB
  • a bunch of different LSNs related to checkpoint, backups, database mirroring
  • dbi_LastLogBackupTime: self-explanatory
  • dbi_differentialBaseGuid: the GUID generated by the last full database backup. Differential backups can only be restored on top of a matching full backup - so an out-of-band full backup could screw-up your disaster recovery - see this blog post for more info.

Now, what about if this page is corrupt in some way? I corrupted the BootPageTest database to have a corrupt boot page. Let's see what happens:

USE BootPagetest;
GO

Msg 913, Level 16, State 4, Line 1
Could not find database ID 19. Database may not be activated yet or may be in transition. Reissue the query once the database is available. If you do not think this error is due to a database that is transitioning its state and this error continues to occur, contact your primary support provider. Please have available for review the Microsoft SQL Server error log and any additional information relevant to the circumstances when the error occurred.

Okay - let's try setting the database into EMERGENCY mode:

ALTER DATABASE BootPageTest SET EMERGENCY;
GO

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xcdee22fa; actual: 0xcb6ea2fa). It occurred during a read of page (1:9) in database ID 19 at offset 0x00000000012000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\BootPageTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Hmm. What about running DBCC CHECKDB?

DBCC CHECKDB ('BootPageTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 922, Level 14, State 1, Line 1
Database 'BootPageTest' is being recovered. Waiting until recovery is finished.

It's not looking good. Obviously the change to EMERGENCY mode couldn't complete properly. What's the database status?

SELECT [state_desc] FROM sys.databases WHERE [name] = 'BootPageTest';
GO

state_desc
------------------------------------------------------------
RECOVERY_PENDING

The boot page is inaccessible so in effect the database is inaccessible too, this is what the database state means in this case.

So what does this mean? If the boot page is corrupt, you can't run DBCC CHECKDB so you can't possibly run repair, and you can't put the database into EMERGENCY mode so you can't extract data into a new database. It means that there's NO WAY to recover from a corrupt boot page EXCEPT to restore from backups. One more reason to have backups...

(I'm actually on-stage here at TechEd doing the  DAT track pre-con with Kimberly - she's on now until lunch so I'm catching up on forum problems...)

Here's a question that came up on of the SQLServerCentral.com corruption forums I monitor that I think is worth blogging about. To paraphrase:

I have a bunch of corruptions in a database, that look like they've been there for a while. Repair is my only option - it works but I'd like to know what data is being deleted. How can I do that? Here are some of the errors:

Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168576) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168577) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168578) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168579) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168580) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168581) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168582) could not be processed. See other errors for details.
Server: Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168576) was not seen in the scan although its parent (1:165809) and previous (1:168575) refer to it. Check any previous errors.
Server: Msg 8978, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168583) is missing a reference from previous page (1:168582). Possible chain linkage problem.

This is a clustered index that CHECKDB  will repair by deleting pages at the leaf-level - essentially deleting a bunch of records. The pages look to be trashed (there were a bunch more errors that I didn't include here that said the page headers were all corrupted - looked like the IO subsystem trashde a whole 64KB chunk of the disk) so there's nothing much else you can do. As the table has a clustered index, you can use the error messages to find the pages on either 'logical' side of the pages being deleted - and hence figure out the range of records that have been deleted.

The errors show that pages 168576 through 168582 in file 1 are corrupt. There are also errors that say the previous page of 168576 is 168575, and the next page of 168582 is 168583. If you do a DBCC PAGE of these two pages, you can find the lower and upper bound of the clustered index key values that have been lost. Think of three ranges:

  • the lower range of records that are intact, logically before the corrupt pages in the index
  • the range of records that will be deleted by repair
  • the upper range of records that are intact, logically after the corrupt pages in the index

To find the upper bound of the lower range:

DBCC TRACEON (3604); -- allows the output to come to the console
DBCC PAGE ('dbname', 1, 168575, 3);
GO

The key value in the slot at the end of output is the upper bound of the bottom range that's intact.

Then do:

DBCC PAGE ('dbname', 1, 168583, 3);
GO

The key value in the slot at the beginning of the output is the lower bound of the upper range that's intact.

Everything in the middle will be deleted. You could also try a DBCC PAGE on the corrupt pages themselves too - you might be able to see some data in them.

I'll be blogging a bunch more about repair after my corruption session this week at TechEd - watch this space!

Over the weekend there was a question on one of the internal aliases at MS: how can I tell what percentage of a database has changed since the last full backup, so I can choose between a differential or full backup?

No such code exists as far as I know - until now! I happened to read the thread while sitting in the airport in Washington D.C. on the way back from Iceland so I started playing around and this morning I completed the code.

The code below creates a function and a stored procedure. The basic idea behind the code is as follows:

For each online data file in the database
   For each GAM interval in the file
      Crack the DIFF map page using DBCC PAGE
      Interpret the DIFF bitmap to aggregate the changed extents
      Add the sum to the total changed extents for the database
   End
End
Report results

There's a function that I create in msdb call SQLskillsConvertToExtents that cracks some of the DBCC PAGE output, and the main procedure is called sp_SQLskillsDIFForFULL and it created as a system object in master. I tried making it a table-valued function but you can't do things like INSERT-EXEC in a function, and that's required for processing the DBCC PAGE output. So - create your own wrapper function or whatever to use it. The interface/output is:

EXEC sp_SQLskillsDIFForFULL 'msdb';
GO

Total Extents Changed Extents Percentage Changed
------------- --------------- ----------------------
102           56              54.9

I've tested it with databases with multiple files and up to around 700GB for a single file size. There's the potential for an issue with file sizes of 4TB and beyond (where PFS intervals and GAM intervals map to the same extent every 4TB or so, but I think it'll be ok and the position of the DIFF map in the extent won't change - if someone could test it with a 4+TB file I'd be grateful). It's been tested on SQL Server 2005 and 2008. It will not work on SQL Server 2000 - I'll do a 2000 version sometime soon.

Note that after doing a full backup you will never see Changed Extents equal to zero. It will always be 4 + (number of online data files - 1), and around 20 or so for msdb. This is because the extent containing the file header in each file is always marked as changed, as are three extents in the primary file containing the roots of some critical system tables.

Anyway - here it is. You can download it in a zip file from SQLskillsDIFForFULL.zip (2.65KB). Enjoy!

/*============================================================================
   
File: SQLskillsDIFForFULL.sql

   Summary: This script creates a system-wide SP SQLskillsDIFForFILL that
   
works out what percentage of a database has changed since the
   
previous full database backup.

   Date: April 2008

   SQL Server Versions:
         
10.0.1300.13 (SS2008 February CTP - CTP-6)
         
9.00.3054.00 (SS2005 SP2)
------------------------------------------------------------------------------
   
Copyright (C) 2008 Paul S. Randal, SQLskills.com
   
All rights reserved.

   For more scripts and sample code, check out 
      
http://www.sqlskills.com/

   You may alter this code for your own *non-commercial* purposes. You may
   
republish altered code as long as you give due credit.

   THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
   
ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
   
TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
   
PARTICULAR PURPOSE.

============================================================================*/

-- Create the function in MSDB
--
USE msdb;
GO

IF EXISTS (SELECT * FROM sys.objects WHERE NAME = 'SQLskillsConvertToExtents')
   
DROP FUNCTION SQLskillsConvertToExtents;
GO

-- This function cracks the output from a DBCC PAGE dump
-- of an allocation bitmap. It takes a string in the form
-- "(1:8) - (1:16)" or "(1:8) -" and returns the number
-- of extents represented by the string. Both the examples
-- above equal 1 extent.
--

CREATE FUNCTION SQLskillsConvertToExtents (
   
@extents VARCHAR (100))
RETURNS INTEGER
AS
BEGIN
   
DECLARE @extentTotal   INT;
   
DECLARE @colon         INT;
   
DECLARE @firstExtent   INT;
   
DECLARE @secondExtent  INT;

   SET @extentTotal = 0;
   
SET @colon = CHARINDEX (':', @extents);

   -- Check for the single extent case
   --
   IF (CHARINDEX (':', @extents, @colon + 1) = 0)
      
SET @extentTotal = 1;
   
ELSE
      
-- We're in the multi-extent case
      --
      BEGIN
      
SET @firstExtent = CONVERT (INT,
         
SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
      
SET @colon = CHARINDEX (':', @extents, @colon + 1);
      
SET @secondExtent = CONVERT (INT,
         
SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
      
SET @extentTotal = (@secondExtent - @firstExtent) / 8 + 1;
   
END

   RETURN @extentTotal;
END;
GO

USE master;
GO

IF OBJECT_ID ('sp_SQLskillsDIFForFULL') IS NOT NULL
   
DROP PROCEDURE sp_SQLskillsDIFForFULL;
GO

-- This SP cracks all differential bitmap pages for all online
-- data files in a database. It creates a sum of changed extents
-- and reports it as follows (example small msdb):
--
-- EXEC sp_SQLskillsDIFForFULL 'msdb';
-- GO
--
-- Total Extents Changed Extents Percentage Changed
-- ------------- --------------- ----------------------
-- 102           56              54.9
--
-- Note that after a full backup you will always see some extents
-- marked as changed. The number will be 4 + (number of data files - 1).
-- These extents contain the file headers of each file plus the
-- roots of some of the critical system tables in file 1.
-- The number for msdb may be round 20.
--
CREATE PROCEDURE sp_SQLskillsDIFForFULL (
   
@dbName VARCHAR (128))
AS
BEGIN
   
SET NOCOUNT ON;

   -- Create the temp table
   
--
   
IF EXISTS (SELECT * FROM msdb.sys.objects WHERE NAME = 'SQLskillsDBCCPage')
   
DROP TABLE msdb.dbo.SQLskillsDBCCPage;

   CREATE TABLE msdb.dbo.SQLskillsDBCCPage (
      
[ParentObject] VARCHAR (100),
      
[Object]       VARCHAR (100),
      
[Field]        VARCHAR (100),
      
[VALUE]        VARCHAR (100));

   DECLARE @fileID         INT;
   
DECLARE @fileSizePages  INT;
   
DECLARE @extentID       INT;
   
DECLARE @pageID         INT;
   
DECLARE @DIFFTotal      INT;
   
DECLARE @sizeTotal      INT;
   
DECLARE @total          INT;
   
DECLARE @dbccPageString VARCHAR (200);

   SELECT @DIFFTotal = 0;
   
SELECT @sizeTotal = 0;

   -- Setup a cursor for all online data files in the database
   
--
   
DECLARE files CURSOR FOR
      
SELECT [file_id], [size] FROM master.sys.master_files
      
WHERE [type_desc] = 'ROWS'
      
AND [state_desc] = 'ONLINE'
      
AND [database_id] = DB_ID (@dbName);

   OPEN files;

   FETCH NEXT FROM files INTO @fileID, @fileSizePages;

   WHILE @@FETCH_STATUS = 0
   
BEGIN
      
SELECT @extentID = 0;

      -- The size returned from master.sys.master_files is in
      
-- pages - we need to convert to extents
      
--
      
SELECT @sizeTotal = @sizeTotal + @fileSizePages / 8;

      WHILE (@extentID < @fileSizePages)
      
BEGIN
         
-- There may be an issue with the DIFF map page position
         
-- on the four extents where PFS pages and GAM pages live
         
-- (at page IDs 516855552, 1033711104, 1550566656, 2067422208)
         
-- but I think we'll be ok.
         
-- PFS pages are every 8088 pages (page 1, 8088, 16176, etc)
         
-- GAM extents are every 511232 pages
         
--
         
SELECT @pageID = @extentID + 6;

         -- Build the dynamic SQL
         
--
         
SELECT @dbccPageString = 'DBCC PAGE ('
            
+ @dbName + ', '
            
+ CAST (@fileID AS VARCHAR) + ', '
            
+ CAST (@pageID AS VARCHAR) + ', 3) WITH TABLERESULTS, NO_INFOMSGS';

         -- Empty out the temp table and insert into it again
         
--
         
DELETE FROM msdb.dbo.SQLskillsDBCCPage;
         
INSERT INTO msdb.dbo.SQLskillsDBCCPage EXEC (@dbccPageString);

         -- Aggregate all the changed extents using the function
         
--
         
SELECT @total = SUM ([msdb].[dbo].[SQLskillsConvertToExtents] ([Field]))
         
FROM msdb.dbo.SQLskillsDBCCPage
            
WHERE [VALUE] = '    CHANGED'
            
AND [ParentObject] LIKE 'DIFF_MAP%';

         SET @DIFFTotal = @DIFFTotal + @total;

         -- Move to the next GAM extent
         
SET @extentID = @extentID + 511232;
      
END

      FETCH NEXT FROM files INTO @fileID, @fileSizePages;
   
END;

   -- Clean up
   
--
   
DROP TABLE msdb.dbo.SQLskillsDBCCPage;
   
CLOSE files;
   
DEALLOCATE files;

   -- Output the results
   
--
   
SELECT
      
@sizeTotal AS [Total Extents],
      
@DIFFTotal AS [Changed Extents],
      
ROUND (
         
(CONVERT (FLOAT, @DIFFTotal) /
         
CONVERT (FLOAT, @sizeTotal)) * 100, 2) AS [Percentage Changed];
END;
GO

-- Mark the SP as a system object
--
EXEC sys.sp_MS_marksystemobject sp_SQLskillsDIFForFULL;
GO

-- Test to make sure everything was setup correctly
--
EXEC sp_SQLskillsDIFForFULL 'msdb';
GO

This is a really interesting question that came up in the Microsoft Certified Architect class I'm teaching at present - if a database has torn-page protection enabled, and page checksums are enabled, is all the existing torn-page detection lost?

This is an important question, because enabling page checksums doesn't suddenly make all allocated pages be protected by page checksums (it's not until a page is read into the buffer pool, modified, and then written back to disk, that it gets a page checksum). If all the existing torn-page protection is discarded when page checksums are enabled, then the pages would be unprotected until they got page checksums on. I couldn't remember the answer, so I experimented!

My idea was to create a database with torn-page protection, create a table with a simulated torn-page in it, then enable page checksums and see if the torn-page was still reported.

-- Create the test database
USE master;
GO
CREATE DATABASE ChecksumTest;
GO
USE ChecksumTest;
GO

-- Explicitly set the database to have torn-page detection
ALTER DATABASE ChecksumTest SET PAGE_VERIFY TORN_PAGE_DETECTION;
GO

-- Create a test table and insert a row.
CREATE TABLE BrokenTable (c1 INT, c2 CHAR (1000));
INSERT INTO BrokenTable VALUES (1, 'a');
GO

-- Ensure the page is written to disk and then tossed from the buffer pool
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

Now I'm going to examine the page. There are two bits in the page header that specify whether the page is protected by torn-page detection or with a page checksum. Specifically, the m_flagBits field will have 0x100 set if the page is encoded for torn-page protection, and 0x200 set if the page has a page-checksum stored on it, and the page has not been modified (i.e. the checksum is stillvalid). You should not see the 0x100 bit set as torn-page encoding is removed when the page is read into the buffer pool - UNLESS the page IS actually torn, in which case the encoding is NOT removed.

sp_allocationmetadata 'BrokenTable';
GO
DBCC TRACEON (3604);
GO
DBCC PAGE ('ChecksumTest', 1, 143, 3);
GO

<snip>

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 2                        m_freeCnt = 6070
m_freeData = 2118                    m_reservedCnt = 0                    m_lsn = (28:183:2)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 770
      

<snip>     

In this case the torn-page encoding has been removed, and the page is fine. Once I've corrupted the page on disk, it's tricky to be able to see it with DBCC PAGE. I managed to catch it once and saw the following:

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8100
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 1                        m_freeCnt = 7083
m_freeData = 1107                    m_reservedCnt = 0                    m_lsn = (28:81:20)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 41949233

Now if I try to select from the table I get:         

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

The crux of the question is whether this will still be reported if the database switches to page checksums - let's try:

ALTER DATABASE checksumtest SET PAGE_VERIFY CHECKSUM;
GO

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Cool! The answer is YES - the torn-page is still detected, because the bit in the page header specifies which page protection algorithm the page is using. In fact, it even works if you turn off page checksums and torn-page detection completely.

One of the drawbacks of not being in the SQL team at Microsoft any longer is that I don't know about all the undocumented features in the next release - I have to hunt around for them like everyone else :-(

So I was poking about in SSMS in 2008 CTP-6 and noticed a function called sys.fn_PhysLocCracker that I'd never heard of. Doing an sp_helptext on it gets the following output:

-- Name: sys.fn_PhysLocCracker
--
-- Description:
-- Cracks the output of %%physloc%% virtual column
--
-- Notes:
-------------------------------------------------------------------------------
create function sys.fn_PhysLocCracker (@physical_locator binary (8))
returns @dumploc_table table
(
 [file_id] int not null,
 [page_id] int not null,
 [slot_id] int not null
)
as
begin

 declare @page_id binary (4)
 declare @file_id binary (2)
 declare @slot_id binary (2)

 -- Page ID is the first four bytes, then 2 bytes of page ID, then 2 bytes of slot
 --
 select @page_id = convert (binary (4), reverse (substring (@physical_locator, 1, 4)))
 select @file_id = convert (binary (2), reverse (substring (@physical_locator, 5, 2)))
 select @slot_id = convert (binary (2), reverse (substring (@physical_locator, 7, 2)))
 
 insert into @dumploc_table values (@file_id, @page_id, @slot_id)
 return
end

Cool - but something else I've never heard of %%physloc%% - what's that? After playing around for a while, I figured out how to make it work.  Just to be confusing, there's another identical version of the function called sys.fn_PhysLocFormatter - and that's the only one I could get to work. Here's an example:

CREATE TABLE TEST (c1 INT IDENTITY, c2 CHAR (4000) DEFAULT 'a');
GO
INSERT INTO TEST DEFAULT VALUES
;
INSERT INTO TEST DEFAULT VALUES
;
INSERT INTO TEST DEFAULT VALUES;
GO

SELECT sys.fn_PhysLocFormatter (%%physloc%%) AS [Physical RID], * FROM TEST;
GO

Physical RID       c1
-----------------  -----------
(1:411:0)          1
(1:411:1)          2
(1:413:0)          3

It's a physical-record locator function! Undocumented and unsupported (obviously), but hey, some of the best features are :-) It gives the database file, page within the file, and slot number on the page in the format (file:page:slot). I can think of a *bunch* of uses for this which I'll be exploring over the next few months.

How cool is that?!?!

Well this one is well overdue and I'm in the middle of writing a class where I want to reference this blog post - so I suppose I'd better write it!! This is an updated post from my old Storage Engine blog that now covers DIFF and ML map pages.

In some previous posts in this series I built up the storage basics in database files:

The final pieces in the allocation puzzle are the other allocation-tracking map pages - GAM, SGAM, PFS, ML map, and DIFF map pages. All of the following explanation holds for SQL Server 2000 and all subsequent releases so far. For any of these pages you can do a dump-style 3 DBCC PAGE dump and it will interpret the page and give you a human readable form of the allocation tracking data.

GAM pages

GAM stands for Global Allocation Map. If you remember from before, database data files are split up into GAM intervals (don't get confused - they're not split physically, just conceptually). A GAM interval is equivalent to the amount of space that the bitmaps in GAM, SGAM, ML map, DIFF map, and IAM pages track - 64000 extents or almost 4GB. These bitmaps are the same size in each of these five page types and have one bit per extent, but they mean different things in each of the different allocation pages.

The bits in the GAM bitmap have the following semantics:

  • bit = 1: the extent is available for allocation (you could think of it as currently allocated to the GAM page)
  • bit = 0: the extent is already allocated for use

These semantics are the same for mixed and dedicated/uniform extents.

One thing to note, at the start of every GAM interval is a GAM extent which contains the global allocation pages that track that GAM interval. This GAM extent cannot be used for any regular page allocations. The first GAM extent starts at page 0 in the file and has the following layout:

  • Page 0: the file header page (another post!)
  • Page 1: the first PFS page
  • Page 2: the first GAM page
  • Page 3: the first SGAM page
  • Page 4: Unused in 2005+
  • Page 5: Unused in 2005+
  • Page 6: the first DIFF map page
  • Page 7: the first ML map page

SGAM pages

I remember last year having an email discussion about what the 'S' stands for in SGAM. Various names have been used over the years inside and outside Microsoft but the official name that Books Online uses is Shared Global Allocation Map. To be honest, we always just call them 'es-gams' and never spell it out.

As I said above, the SGAM bitmap is exactly the same as the GAM bitmap in structure and the interval it covers, but the semantics of the bits are different:

  • bit = 1: the extent is a mixed extent and has at least one unallocated page available for use
  • bit = 0: the extent is either dedicated or is a mixed extent with no unallocated pages (essentially the same situation given that the SGAM is used to find mixed extents with unallocated pages)

Combining GAM, SGAM, and IAM pages

So, taking the GAM, SGAM and IAM pages together (remember that in the IAM bitmap, the bit is set if the extent is allocated to the IAM chain/allocation unit), the various combinations of bits are:

GAM

SGAM

Any IAM

Comments

0

0

0

Mixed extent with all pages allocated

0

0

1

Dedicated extent (must be allocated to only a single IAM page)

0

1

0

Mixed extent with >= 1 unallocated page

0

1

1

Invalid state

1

0

0

Unallocated extent

1

0

1

Invalid state

1

1

0

Invalid state

1

1

1

Invalid state

You can see that only 4 of the 8 possible bit combinations for any particular extent are valid. Anything else constitutes a corruption of some sort and can lead to all kinds of horrible situations.

ML map pages

ML stands for Minimally Logged. These pages track which extents have been modified by minimally-logged operations since the last transaction log backup when using the BULK_LOGGED recovery model. The idea is that the next transaction log backup will backup the log as usual, and then also include all the extents marked as changed in these bitmaps. The combination of these extents, plus the transaction log in the backup gives the differences that have occured in the database since the previous transaction log backup. The ML page bitmaps are cleared once they've been read. If you don't ever use the BULK_LOGGED recovery model, these pages are never used.

The ML page bitmap is exactly the same as the GAM bitmap in structure and the interval it covers, but the semantics of the bits are different:

  • bit = 1: the extent has been changed by a minimally logged operation since the last transaction log backup
  • bit = 0: the extent was not changed

DIFF map pages

DIFF stands for differential. These pages track which extents have been modified since the last full backup was taken. It is a common misconception that the bitmaps track the changes since the last differential backup. The idea is that a differential backup will contain all the extents that have changed since the last full backup. Restore time can be cut down significantly by using differential backups to avoid having to restore all the log backups in the period between the full and last differential backup - more on this in a later post. The bitmaps are not cleared until the next full backup. Note that I don't say full database backup in the explanation above. The full and differential backups can be database, filegroup, or file level backups.

The DIFF page bitmap is exactly the same as the GAM bitmap in structure and the interval it covers, but the semantics of the bits are different:

  • bit = 1: the extent has been changed since the last full backup
  • bit = 0: the extent was not changed

PFS pages

PFS stands for Page Free Space, but the PFS page tracks much more than that. As well as GAM intervals, every database file is also split (conceptually) into PFS intervals. A PFS interval is 8088 pages, or about 64MB. A PFS page doesn't have a bitmap - it has a byte-map, with one byte for each page in the PFS interval (not including itself).

The bits in each byte are encoded to mean the following:

  • bits 0-2: how much free space is on the page
    • 0x00 is empty
    • 0x01 is 1 to 50% full
    • 0x02 is 51 to 80% full
    • 0x03 is 81 to 95% full
    • 0x04 is 96 to 100% full
  • bit 3 (0x08): is there one or more ghost records on the page?
  • bit 4 (0x10): is the page an IAM page?
  • bit 5 (0x20): is the page a mixed-page?
  • bit 6 (0x40): is the page allocated?
  • Bit 7 is unused

For instance, an IAM page will have a PFS byte value of 0x70 (allocated + IAM page + mixed page).

Free space is only tracked for pages storing LOB values (i.e. text/image in SQL Server 2000, plus (n)varchar(max), varbinary(max), XML, and row-overflow data in SQL Server 2005) and heap data pages. This is because these are the only pages that store unordered data and so insertions can occur anywhere there's space. For indexes, there's an explicit ordering so there's no choice in the insertion point.

The point at which a PFS byte is reset is not intuitive. PFS bytes are not fully reset until the page is reallocated. On deallocation, the only bit in the PFS byte that's changed is the allocation status bit - this makes it very easy to rollback a deallocation.

Here's an example. Using a database with a simple table with one row. A DBCC PAGE of the IAM page includes:

PFS (1:1) = 0x70 IAM_PG MIXED_EXT ALLOCATED 0_PCT_FULL

If I drop the table in an explicit transaction and then do the DBCC PAGE again, the output no includes:

PFS (1:1) = 0x30 IAM_PG MIXED_EXT 0_PCT_FULL

And if I rollback then transaction, the DBCC PAGE output reverts to:

PFS (1:1) = 0x70 IAM_PG MIXED_EXT ALLOCATED 0_PCT_FULL

Ok - four blog posts in one day is quite enough! :-)

It's been a long few days of building slide decks and other content and I just had to stop for a bit and take care of my internals-hacking withdrawal symptoms!

While I was at Microsoft, I wrote some code in the Storage Engine to very easily return all the IAM chains/allocation units (see this post for more details of the internals of those), what type they are, and the relevant page IDs (first, root, first-IAM) so I could go spelunking with DBCC PAGE. Since I left six months ago, it's one of the things I've been missing using when poking around on customer sites, so this afternoon I sat down and wrote the equivalent in T-SQL, using the undocumented sys.system_internals_allocation_units DMV. The output is easy to match up to sys.partitions but the page IDs are formatted in byte-reversed hex so a little tweaking was needed to extract the fields and make them human readable - I've put them into the same format that all SQL Server error messages use when giving a page number.

So - I present to you sp_AllocationMetadata. I was having all kinds of trouble using it in other databases (trying to figure out a way to change database contexts in the SP) until I remembered that you can create an SP in master and mark it as a system object using the undocumented sys.sp_MS_marksystemobject SP. This makes any SP execute in the context of the database from where it is called - extremely useful when you're querying against a database's system catalog views.

[Edit: Kalen pointed out that the DMV *is* documented, but just not in the BOL index. Even better - that means the SP below isn't doing anything dodgy :-) - thanks Kalen!]

The SP can be called with an optional object name parameter, in which case it will only give you back the allocation metadata for that object. If you don't specify a parameter, it gives you back the allocation metadata for all objects in the database. Here's an example of the output:

USE AdventureWorks;
GO

EXEC sp_AllocationMetadata 'HumanResources.Employee';
GO

Object Name   Index ID   Alloc Unit ID       Alloc Unit Type   First Page   Root Page   First IAM Page
------------- ---------- ------------------- ----------------- ------------ ----------- ----------------
Employee      1          72057594050379776   IN_ROW_DATA       (1:588)      (1:594)     (1:593)
Employee      2          72057594055491584   IN_ROW_DATA       (1:2141)     (1:2144)    (1:2142)
Employee      3          72057594055557120   IN_ROW_DATA       (1:2146)     (1:2149)    (1:2147)
Employee      4          72057594055622656   IN_ROW_DATA       (1:2150)     (1:2150)    (1:2151)
Employee      5          72057594055688192   IN_ROW_DATA       (1:2153)     (1:2153)    (1:2154)

You'll notice there are only IN_ROW_DATA allocation units - that's because this table doesn't have any LOB data or an variable-length columns that have been pushed off-row (producing LOB_DATA and ROW_OVERFLOW_DATA allocation units, respectively). So - it only shows what actually exists (rather than creating NULL values, for instance).

Below is the script that creates the SP, and I've included it as an attachment too.

Ah - that feel's better :-) Happy spelunking!

USE master;
GO

IF OBJECT_ID ('sp_AllocationMetadata') IS NOT NULL
   DROP PROCEDURE sp_AllocationMetadata;
GO

CREATE PROCEDURE sp_AllocationMetadata
(
   
@object VARCHAR (128) = NULL
)
AS
SELECT
   
OBJECT_NAME (sp.object_id) AS [Object Name],
   
sp.index_id AS [Index ID],
   
sa.allocation_unit_id AS [Alloc Unit ID],
   
sa.type_desc AS [Alloc Unit Type],
   
'(' CONVERT (VARCHAR (6),
      
CONVERT (INT, SUBSTRING (sa.first_page, 6, 1) +
      
   SUBSTRING (sa.first_page, 5, 1))) +
   
':' + CONVERT (VARCHAR (20),
      
CONVERT (INT, SUBSTRING (sa.first_page, 4, 1) +
      
   SUBSTRING (sa.first_page, 3, 1) +
         
SUBSTRING (sa.first_page, 2, 1) +
         
SUBSTRING (sa.first_page, 1, 1))) +
   
')' AS [First Page],
   '(' + CONVERT (VARCHAR (6),
      
CONVERT (INT,
         
SUBSTRING (sa.root_page, 6, 1) +
         
SUBSTRING (sa.root_page, 5, 1))) +
   
':' + CONVERT (VARCHAR (20),
      
CONVERT (INT,
         
SUBSTRING (sa.root_page, 4, 1) +
         
SUBSTRING (sa.root_page, 3, 1) +
         
SUBSTRING (sa.root_page, 2, 1) +
         
SUBSTRING (sa.root_page, 1, 1))) +
   
')' AS [Root Page],
   
'(' + CONVERT (VARCHAR (6),
      
CONVERT (INT,
         
SUBSTRING (sa.first_iam_page, 6, 1) +
         
SUBSTRING (sa.first_iam_page, 5, 1))) +
   
':' + CONVERT (VARCHAR (20),
      
CONVERT (INT,
         
SUBSTRING (sa.first_iam_page, 4, 1) +
         
SUBSTRING (sa.first_iam_page, 3, 1) +
         
SUBSTRING (sa.first_iam_page, 2, 1) +
         
SUBSTRING (sa.first_iam_page, 1, 1))) +
   
')' AS [First IAM Page]
FROM
   
sys.system_internals_allocation_units AS sa,
   
sys.partitions AS sp
WHERE
   
sa.container_id = sp.partition_id
   AND sp.object_id =
      
(CASE WHEN (@object IS NULL)
         THEN sp.object_id
         
ELSE OBJECT_ID (@object)
      
END);
GO

EXEC sys.sp_MS_marksystemobject sp_AllocationMetadata;
GO

sp_AllocationMetadata.zip (.69 KB)

Ok - so we did more partying than we thought so blog posts have been a little sparse this month, but here's one to end off the year.

There's a popular impression that turning on snapshot isolation and then rebuilding indexes will cause all rows in the table to get the extra 14-byte versioning tags. Fact or fiction? Let's find out.

First I'll create a test database, containing a small table with a clustered index and few rows:

CREATE DATABASE SItest;
GO

USE SItest;
GO

CREATE TABLE SmallTable (c1 INT, c2 INT);
CREATE CLUSTERED INDEX SmallTableCI ON SmallTable (c1);
GO

INSERT INTO SmallTable VALUES (1, 1);
INSERT INTO SmallTable VALUES (2,2);
GO

Next I'll turn on READ_COMMITTED_SNAPSHOT and rebuild the index to see if statement level versioning does the trick:

ALTER DATABASE SItest SET READ_COMMITTED_SNAPSHOT ON;
GO

ALTER INDEX SmallTableCI ON SmallTable REBUILD;
GO

Now let's look at the data page holding the two rows to see if there's any versioning info (the output is snipped short a little for brevity):

DBCC IND (SItest, SmallTable, 1);
GO

DBCC TRACEON (3604); -- remember this makes the output go to the console
DBCC PAGE (SItest, 1, 153, 3);
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID
------- ----------- ------ ----------- ----------- -----------
1       154         NULL   NULL        2073058421  1
1       153         1      154         2073058421  1

(2 row(s) affected)

DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

PAGE: (1:153)

<SNIP SNIP SNIP>

Slot 0 Offset 0x60 Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C060

00000000:   10000c00 01000000 01000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 1 Offset 0x6f Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C06F

00000000:   10000c00 02000000 02000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 1 Column 2 Offset 0x8 Length 4

c2 = 2

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Nope - both the rows look normal. Now for completeness let's try transaction level versioning and a rebuild:

ALTER DATABASE SItest SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

ALTER INDEX SmallTableCI ON SmallTable REBUILD;
GO

DBCC IND (SItest, SmallTable, 1);
GO

DBCC PAGE (SItest, 1, 143, 3); -- page changed when we rebuilt the index
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID
------- ----------- ------ ----------- ----------- -----------
1       152         NULL   NULL        2073058421  1
1       143         1      152         2073058421  1

(2 row(s) affected)

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

PAGE: (1:143)

<SNIP SNIP SNIP>

Slot 0 Offset 0x60 Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C060

00000000:   10000c00 01000000 01000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 1 Offset 0x6f Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C06F

00000000:   10000c00 02000000 02000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 1 Column 2 Offset 0x8 Length 4

c2 = 2

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Nope - still nothing. Now I'll force a versioning operation and we should see the tags. Any update to the table should cause versioned records to be created. In this case, I'll start an explicit transaction and do some updates and we should be able to see the original values using another query window. First the updates:

BEGIN TRAN;
GO

UPDATE SmallTable SET c1 = 4;
GO

And in another window:

SELECT * FROM smalltable;
GO

c1          c2
----------- -----------
1           1
2           2

(2 row(s) affected)

Cool - so the original values are still there. Let's see the versioning info on the data page (output snipped again):

DBCC PAGE (SItest, 1, 143, 3);
GO

PAGE: (1:143)

<SNIP SNIP SNIP>

Slot 0 Offset 0x7e Length 29

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP VERSIONING_INFO

Memory Dump @0x61CEC07E

00000000:   5c000c00 01000000 01000000 0300f9e0 \...............
00000010:   00000001 00000057 03000000 00       .......W.....

Version Infomation =
 Transaction Timestamp: 855
 Version Pointer: (file 1 page 224 slot 0)

UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 1 Offset 0xb8 Length 29

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP VERSIONING_INFO

Memory Dump @0x61CEC0B8

00000000:   5c000c00 02000000 02000000 0300f9e0 \...............
00000010:   00000001 00010057 03000000 00       .......W.....

Version Infomation =
 Transaction Timestamp: 855
 Version Pointer: (file 1 page 224 slot 1)

UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 1 Column 2 Offset 0x8 Length 4

c2 = 2

Slot 2 Offset 0x9b Length 29

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VERSIONING_INFO

Memory Dump @0x61CEC09B

00000000:   50000c00 04000000 01000000 0300f800 P...............
00000010:   00000000 00000057 03000000 00       .......W.....

Version Information =
 Transaction Timestamp: 855
 Version Pointer: Null

UNIQUIFIER = [NULL]

Slot 2 Column 1 Offset 0x4 Length 4

c1 = 4

Slot 2 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 3 Offset 0xd5 Length 37

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS VERSIONING_INFO

Memory Dump @0x61CEC0D5

00000000:   70000c00 04000000 02000000 0300f801 p...............
00000010:   00170001 00000000 00000000 00000057 ...............W
00000020:   03000000 00                         .....

Version Information =
 Transaction Timestamp: 855
 Version Pointer: Null


Slot 3 Column 0 Offset 0x13 Length 4

UNIQUIFIER = 1

Slot 3 Column 1 Offset 0x4 Length 4

c1 = 4

Slot 3 Column 2 Offset 0x8 Length 4

c2 = 2

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

I've marked the versioning parts in bold. Notice that the old records have been turned into ghost records too. The second record is now tagged as having variable-length columns too. This is because I updated both records to have the same clustering key value and so the second record now needs a uniquifier - which is stored as a variable-length column.

So, the original statement is a myth - the only time that rows get versioning info added to them is when it's needed to support a versioning operation.

This is a question I was sent a week or so ago - if a table is truncated inside a transaction, what protects the integrity of the table's pages in case the transaction rolls back? Let's find out.

First off I'll create a simple table to experiment with.

CREATE TABLE TruncateTest (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'A');
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 1
;
WHILE (@a < 20
)
BEGIN

INSERT INTO TruncateTest DEFAULT VALUES;
SELECT @a = @a + 1;

END;
GO

We can see what pages and extents are allocated to the table using the undocumented DBCC IND command:

DBCC IND (test, TruncateTest, 0);
GO

PageFID PagePID
------- ---------
1       193
1       192
1       194
1       195
1       196
1       197
1       198
1       199
1       200
1       224
1       225
1       226
1       227
1       228
1       229
1       230
1       231
1       232
1       233
1       234

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

I've curtailed the output to just the page IDs and we can see that there are 4 extents used by this table (starting on pages (1:192), (1:200), (1:224), and (1:232)). Now if we truncate the table in a transaction, what will DBCC IND show?

BEGIN TRAN;
GO

TRUNCATE TABLE TruncateTest;
GO

DBCC IND (test, TruncateTest, 0);
GO

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Looks like there are no pages allocated to the table. So where are they? Let's check what locks there are. Instead of using sp_lock, I'm going to use it's replacement DMV, sys.dm_tran_locks:

SELECT resource_type, resource_description, request_mode FROM sys.dm_tran_locks WHERE resource_type IN ('EXTENT', 'PAGE');
GO

resource_type   resource_description   request_mode
--------------- ---------------------- --------------
EXTENT          1:200                  X
PAGE            1:198                  X
PAGE            1:199                  X
PAGE            1:196                  X
PAGE            1:197                  X
PAGE            1:194                  X
PAGE            1:195                  X
PAGE            1:192                  X
PAGE            1:193                  X
EXTENT          1:192                  X
PAGE            1:200                  X
EXTENT          1:232                  X
EXTENT          1:224                  X

Ah - all the pages and extents are locked. The table doesn't show them as allocated any more but because they're exclusively locked, the allocation subsystem can't really deallocate them until the locks are dropped (when the transaction commits). That's the answer - they can't be reused until they're really deallocated. If a transaction rollback happens, the pages are just marked as allocated again.

In one of the sessions Kimberly and I taught this week at SQL Connections, we discussed how to choose efficient data-types - I'd like to share the discussion here with an example.

The crux of the matter is in the amount of space required to store the data-types in the schema, and ensuring that the data-type choice matches the data being stored without incurring any unnecessary overhead. Here are a couple of examples:

  • A field that stores a person's age will usually be able to use a TINYINT type that can store a maximum value of 256 - unless its an archeological or historical database, for instance. Without putting a lot of thought into it, however, someone may choose to use an INTEGER type. What's the difference? A TINYINT take a single byte to store, whereas an INTEGER takes *4* bytes to store - wasting 3 bytes per record.
  • A field that stores a person's city of residence in the USA needs to be able to hold 24 characters (see my previous post on how long fields have to be) - so what data-type should you use? A CHAR (24) will be able to store all the possible values, but will *always* take up 24 bytes in the record as it's a fixed-length column. A VARCHAR (24), on the other hand will only store the number of bytes equal to the number of characters in the city name, so using the fixed-length type will waste a varying number of bytes per record.
  • In the USA, the names of the various States are commonly abbeviated to two characters - for instance, Washington = WA and California = CA. So is the best type to use CHAR (2) or VARCHAR (2)? Given that the abbreviations are always 2 characters, they'll both store 2 bytes all the time. BUT, a variable length column has a two-byte overhead (the two-byte pointer for the column that's stored in the variable-length column offset array in the record) - so in this case the best data-type to use is CHAR (2).

You can see how choosing data-types without considering whether it's the best type can lead to wasted space. Let's look at an example schema to support a population census of the USA.

CREATE TABLE CensusSchema1 (

SSN CHAR (256),

StateName CHAR (256),

Sex INTEGER,

Employed INTEGER,

SelfEmployed INTEGER,

EthnicOrigin INTEGER,

MaritalStatus INTEGER,

NumberOfDependents INTEGER,

Age INTEGER,

CountryOfBirth INTEGER,

HouseholdIncomeGroup INTEGER,

ZipCode5 INTEGER);

GO

At first glance this may look reasonable, but digging in you'll see that many of the columns are over-sized. Here's a cleaned-up schema to compare against, with notes on the per-column savings:

CREATE TABLE CensusSchema2 (

SSN CHAR (9), -- saving 244 bytes

StateName VARCHAR (256), -- saving at least 240 bytes (longest state name is 14 + 2-byte varchar overhead)

Sex BIT,

Employed BIT,

SelfEmployed BIT, -- saving 11 bytes altogether over these three fields

EthnicOrigin TINYINT, -- saving 3 bytes

MaritalStatus TINYINT, -- saving 3 bytes

NumberOfDependents TINYINT, -- saving 3 bytes

Age TINYINT, -- saving 3 bytes

CountryOfBirth TINYINT, -- saving 3 bytes

HouseholdIncomeGroup TINYINT, -- saving 3 bytes

ZipCode5 INTEGER); -- no saving

GO

The bad schema gives a per-row size of 574 bytes and the cleaned-up schema is no more than 48 bytes per-row. I designed these two schemas to support a US census. The population of the USA is approx. 300 million. This means the bad schema would take around 190GB to store all the info, and the cleaned-up schema only takes around 15GB - more than 12 times more efficient! Now we're starting to see how poor data-type choice can lead to poor performance.

Wider rows means:

  • Fewer rows can fit on an 8k page.
  • More CPU is necessary to crack open a record due to the number of CPU data cache line invalidations necessary to read the record into the CPU's various caches (every time a cache line is invalidated it takes a bunch of CPU cycles - see here for an explanation of CPUs and cache lines).

Less rows per page means:

  • More pages are needed to store the data
  • Indexes could have a smaller fan-out (if the index keys are wider than then need to be), leading to more levels in the index and less efficient index navigation from the index root page to the index leaf-level.

More pages means:

  • More IOs are necessary to read the same amount of actual data
  • More buffer pool memory is necessary to hold the same amount of actual data
  • More disk space is necessary to hold the same amount of actual data

Clearly there's a link between various aspects of workload performance and badly chosen data-types.

During SQL Server 2005 development I did a comprehensive rewrite of the Books Online entries concerned with estimating index, table, and database size (see http://msdn2.microsoft.com/en-us/library/ms187445.aspx). I was recently having a discussion with Øystein Sundsbø about a bug in one of the formulas I wrote and it turns out he's written a neat tool that codifies all the formulas.

In case you're interested, the part of the formula I had incorrect was for figuring out how many pages would be needed for the upper levels of an index b-tree. My formula was:

Num_Index_Pages = ∑Level (Index_Rows_Per_Page)Level – 1
where 1 <= Level <= Levels

and Øystein came up with a better formula:

Num_Index_Pages = ∑Level (Total_Num_Rows/(Index_Rows_Per_Page)Level )
where 1 <= Level <= Levels

where in both formulas:

Levels = 1 + log Index_Rows_Per_Page (Total_Num_Rows / Index_Rows_Per_Page)

Check out Øystein's cool tool at http://dbgoby.blogspot.com/2007/10/db-goby-v10.html!

Categories:
On-Disk Structures | Tools

Over the years I was in the Storage Engine team I saw a lot of concern on the various forums about the ghost cleanup task. There have been a few bugs with it in previous versions  (see these KB articles - 932115 and 815594) and there's very little info available on it. For some reason I didn't get around to posting about it on my old blog but today I want to go into some depth on it.

So what is ghost cleanup? It's a background process that cleans up ghost records - usually referred to as the ghost cleanup task. What's a ghost record? As I described briefly in the Anatomy of a record post last week, a ghost record is one that's just been deleted in an index on a table (well, actually it gets more complicated if snapshot isolation of some form is enabled but for now, a record in an index is a good start). Such a delete operation never physically removes records from pages - it only marks them as having been deleted, or ghosted. This is a performance optimization that allows delete operations to complete more quickly. It also allows delete operations to rollback more quickly because all that needs to happen is to unmark the records as being deleted/ghosted, instead of having to reinsert the deleted records. The deleted record will be physically removed (well, its slot will be removed - the record data isn't actually overwritten) later by the background ghost cleanup task. The ghost cleanup task will leave a single record on the page to avoid having to deallocate empty data or index pages.

The ghost cleanup task can't physically delete the ghost records until after the delete transaction commits because the deleted records are locked and the locks aren't released until the transaction commits. As an aside, when ghost records exist on a page, even a NOLOCK or READ UNCOMMITTED scan won't return them because they are marked as ghost records.

When a record is deleted, apart from it being marked as a ghost record, the page that the record is on is also marked as having ghost records in one of the allocation maps - the PFS page (post coming soon!) - and in its page header. Marking a page as having ghost records in a PFS page also changes the database state to indicate that there are some ghost records to cleanup - somewhere. Nothing tells the ghost cleanup task to clean the specific page that the delete happened on - yet. That only happens when the next scan operation reads the page and notices that the page has ghost records.

The ghost cleanup task doesn't just start up when it's told to - it starts up in the background every 5 seconds and looks for ghost records to cleanup. Remember that it won't be told to go cleanup a specific page by a delete operation - it's a subsequent scan that does it, if a scan happens. When the ghost cleanup task starts up it checks to see if its been told to cleanup a page - if so it goes and does it. If not, it picks the next database that is marked as having some ghost records and looks through the PFS allocation map pages to see if there are any ghost records to cleanup. It will check through or cleanup a limited number of pages each time it wakes up - I remember the limit is 10 pages - to ensure it doesn't swamp the system. So - the ghost records will eventually be removed - either by the ghost cleanup task processing a database for ghost records or by it specifically being told to remove them from a page. If it processes a database and doesn't find any ghost records, it marks the database as not having any ghost records so it will be skipped next time.

How can you tell its running? On SQL Server 2005, you can use the following code to see the ghost cleanup task in sys.dm_exec_requests:

SELECT * INTO myexecrequests FROM sys.dm_exec_requests WHERE 1 = 0;

GO

SET NOCOUNT ON;

GO

DECLARE @a INT

SELECT @a = 0;

WHILE (@a < 1)

BEGIN

INSERT INTO myexecrequests SELECT * FROM sys.dm_exec_requests WHERE command LIKE '%ghost%'

SELECT @a = COUNT (*) FROM myexecrequests

END;

GO

SELECT * FROM myexecrequests;

GO

And on SQL Server 2000 you need to use sysprocesses (well, on SQL Server 2005 this works as well but its fake view derived from the DMVs):

SELECT * INTO mysysprocesses FROM master.dbo.sysprocesses WHERE 1 = 0;

GO

SET NOCOUNT ON;

GO

DECLARE @a INT

SELECT @a = 0;

WHILE (@a < 1)

BEGIN

INSERT INTO mysysprocesses SELECT * FROM master.dbo.sysprocesses WHERE cmd LIKE '%ghost%'

SELECT @a = COUNT (*) FROM mysysprocesses

END;

GO

SELECT * FROM mysysprocesses;

GO

The output from sys.dm_exec_requests is (with most unused and uninteresting columns stripped off):

session_id request_id  start_time              status       command
---------- ----------- ----------------------- ------------ ----------------
15         0           2007-10-05 16:34:49.653 background   GHOST CLEANUP

So how can you tell if a record is ghosted? Let's engineer some and look at it with DBCC PAGE - I've stripped out the uninteresting bits of the output and highlighted the interesting ghost parts:

CREATE TABLE t1 (c1 CHAR(10))

CREATE CLUSTERED INDEX t1c1 on t1 (c1)

GO

BEGIN TRAN

INSERT INTO t1 VALUES ('PAUL')

INSERT INTO t1 VALUES ('KIMBERLY')

DELETE FROM t1 WHERE c1='KIMBERLY';

GO

DBCC IND ('ghostrecordtest', 't1', 1);

GO

DBCC TRACEON (3604);

GO

DBCC PAGE ('ghostrecordtest', 1, 143, 3);

GO

<snip>

m_freeData = 130    m_reservedCnt = 0   m_lsn = (20:88:20)
m_xactReserved = 0  m_xdesId = (0:518)  m_ghostRecCnt = 1
m_tornBits = 0

<snip>

Slot 0 Offset 0x71 Length 17

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP
Memory Dump @0x6256C071

00000000:   1c000e00 4b494d42 45524c59 20200200 †....KIMBERLY  ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 10

c1 = KIMBERLY

Slot 1 Offset 0x60 Length 17

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6256C060

00000000:   10000e00 5041554c 20202020 20200200 †....PAUL      ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 10

c1 = PAUL

Let's see what goes on the transaction log during this process (remember this is undocumented and unsupported - do it on a test database) - I've stripped off a bunch of the columns in the output:

DECLARE @a CHAR (20)

SELECT @a = [Transaction ID] FROM fn_dblog (null, null) WHERE [Transaction Name]='PaulsTran'

SELECT * FROM fn_dblog (null, null) WHERE [Transaction ID] = @a;

GO

Current LSN              Operation         Context             Transaction ID
------------------------ ----------------- ------------------- --------------
00000014:00000054:0011   LOP_BEGIN_XACT    LCX_NULL            0000:00000206
00000014:0000005a:0012   LOP_INSERT_ROWS   LCX_CLUSTERED       0000:00000206
00000014:0000005a:0013   LOP_INSERT_ROWS   LCX_CLUSTERED       0000:00000206
00000014:0000005a:0014   LOP_DELETE_ROWS   LCX_MARK_AS_GHOST   0000:00000206
00000014:0000005a:0016   LOP_DELETE_ROWS   LCX_MARK_AS_GHOST   0000:00000206

So there are the two inserts followed by the two deletes - with the rows being marked as ghost records. But where's the update to the PFS page? Well, changing the ghost bit in a PFS page is not done as part of a transaction. We'll need to look for it another way (apart from just dumping everything in the transaction log and searching manually):

SELECT Description, * FROM fn_dblog (null, null) WHERE Context like '%PFS%' AND AllocUnitName like '%t1%';

GO

Description               Current LSN              Operation        Context   Transaction ID
------------------------- ------------------------ ---------------- --------- ----------------
Allocated 0001:0000008f   00000014:00000054:0014   LOP_MODIFY_ROW   LCX_PFS   0000:00000208
                          00000014:0000005a:0015   LOP_SET_BITS     LCX_PFS   0000:00000000

The first one is just allocating a page but the second one is the one we're looking for - it's changed the bit for the page to say it has ghost records on. Let's commit the transaction and see what happens, filtering out all the previous transaction log:

SELECT MAX ([Current LSN]) FROM fn_dblog (null, null);

GO

-- 00000014:0000005e:0001

COMMIT TRAN

GO

SELECT [Page ID], * FROM fn_dblog (null, null) WHERE [Current LSN] > '00000014:0000005e:0001';

GO

Page ID         Current LSN              Operation          Context         Transaction ID
--------------- ------------------------ ------------------ --------------- --------------
NULL            00000014:0000005f:0001   LOP_COMMIT_XACT    LCX_NULL        0000:00000206
0001:0000008f   00000014:00000060:0001   LOP_EXPUNGE_ROWS   LCX_CLUSTERED   0000:00000000

We see that almost as soon as the transaction has commited, the ghost cleanup task goes in and process the page. Let's check a page dump to make sure the record is gone, and show that the contents of the record are still on the page (again, with non-relevant bits snipped out):

DBCC PAGE ('ghostrecordtest', 1, 143, 3);

GO

<snip>

m_freeData = 130         m_reservedCnt = 0        m_lsn = (20:94:1)
m_xactReserved = 0       m_xdesId = (0:518)       m_ghostRecCnt = 0
m_tornBits = 0

<snip>

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6212C060

00000000:   10000e00 5041554c 20202020 20200200 †....PAUL      ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 10

c1 = PAUL

DBCC PAGE ('ghostrecordtest', 1, 143, 2);

GO

<snip>

6212C040:   01000000 00000000 00000000 00000000 †................
6212C050:   00000000 00000000 00000000 00000000 †................
6212C060:   10000e00 5041554c 20202020 20200200 †....PAUL      ..
6212C070:   fc1c000e 004b494d 4245524c 59202002 †.....KIMBERLY  .
6212C080:   00fc0000 00000000 00000000 01000000 †................
6212C090:   00000000 13000000 01000000 00000000 †................

<snip>

So even though the record no longer exists, all that happened was that the slot was removed from the slot array at the end of the page - the record contents will remain on the page until the space is reused.

In the next post I'll go into details of the PFS and other allocation maps. Btw - please let me know if this stuff is interesting - I'd like to know where to spend blogging time. Thanks!

I'd like to kick off the Indexes From Every Angle series this evening by re-posting some articles from my old blog. Both of these topics continue to crop up on newsgroups and forums.

What happens to non-clustered indexes when changes are made to the underlying table? Are they always rebuilt or not? Well - it depends! (my favorite answer :-)).

Before we get into that discussion, I’ll give you a little background.

 

One way to describe a non-clustered index is whether it is a covering index or not. A covering index is one which has all the table columns necessary to satisfy a query, and so there is no need to go back to the underlying table (which I’ll call the base table) to fetch additional columns. Most non-clustered indexes are not covering indexes and so it is necessary for each non-clustered index record (even in covering indexes) to include a link back to the corresponding base table record. This link takes various forms, depending on the structure of the base table:

  • Heap: A heap does not impose any logical ordering on the records within it, so the only way to find a particular record within a heap is to do a full table scan or to know the physical location of the record (i.e. which record slot on which data page) – called the physical record ID or physical RID. Clearly doing a full table scan every time a query needs to go from a non-clustered index record back to a heap is not feasible, so each non-clustered index record stores the physical RID of the corresponding heap record.
  • Clustered index: A clustered index does impose a logical ordering on the records within it and the ordering exactly matches the definition of the clustered index keys – known as the cluster key. The fastest way to find a particular record in a clustered index is to use the cluster key to navigate through the clustered index b-tree to find the exact record. So, non-clustered indexes over a clustered index include the cluster key of the base table record in each non-clustered index record. This is known as the logical RID. There is a twist though, depending on whether the clustered index is unique or not.
    • Non-unique clustered index: Every record in a clustered index HAS to be unique, otherwise there would be no way to deterministically navigate to a particular record using the index b-tree. In a non-unique clustered index, SQL Server has to add a special column, called the uniquifier, to each record, so that if multiple records have the same cluster key values, the uniquifier column will be the tie-breaker. This uniquifier column is added as part of the cluster key, and so it is also present in all non-clustered index records as part of the logical RID.
    • Unique clustered index: Every record in a unique clustered index is already unique and so no extra column is required in the cluster key.

Ok - background out of the way. Now let's discuss what happens to non-clustered indexes when various actions are performed to the base table.

 

SQL Server 2000

  • Going from a heap to a clustered index: Changing the fundamental structure of a table from a heap to a clustered index means you're changing the way that records can be looked up in the table, from physical to logical. This means that all the physical RIDs that are stored in the non-clustered index records need to be changed to logical RIDs. So - in this case, all non-clustered indexes are rebuilt after the clustered index has been built.
  • Going from a clustered index to a heap: This is the opposite of the case above - you're changing the way records are looked up from logical to physical. This means that all the logical RIDs that are stored in the non-clustered index records need to be changed to physical RIDs. So - in this case, all non-clustered indexes are rebuilt after the clustered index has been dropped.
  • Rebuilding a unique clustered index: This operation has no effect on the cluster key, so there is no need for the non-clustered indexes to be rebuilt. Various early Service Packs of SQL Server 2000 had bugs that caused this behavior to be broken - and this is the cause of much of the confusion around this behavior.
  • Rebuilding a non-unique clustered index: In SQL Server 2000, when a non-unique clustered index (which contains uniquifier columns) is rebuilt, all the uniquifier values are regenerated. This essentially means that the cluster keys have changed and so all the non-clustered indexes are rebuilt after the clustered index rebuild has completed.
  • Changing the clustered index schema: This is simple - any change to the clustered index schema that changes the cluster keys will cause all the non-clustered indexes to be rebuilt.

 SQL Server 2005

  • Going from a heap to a clustered index: This is the same as SQL Server 2000 - all non-clustered indexes are rebuilt, regardless of whether the clustered index was created online or offline.
  • Going from a clustered index to a heap: Again, all non-clustered indexes are rebuilt, regardless of whether the clustered index is dropped online or offline.
  • Rebuilding a unique clustered index: This is the same as SQL Server 2000 - the cluster keys aren't changing and so the non-clustered indexes are not rebuilt.
  • Rebuilding a non-unique clustered index: Aha! This is different from SQL Server 2000. SQL Server 2005 will RE-USE the old uniquifier values so the cluster keys don't change. This means that non-clustered indexes are NOT rebuilt in this case - that's very cool!
  • Changing the clustered index schema:
    • Changing the cluster key: This behavior has to be the same as SQL Server 2000 - any change to the cluster keys forces all the non-clustered indexes to be rebuilt after the new clustered index has been created.
    • Changing anything else (e.g. partitioning it or moving it): This is one of the cases that confuses people in SQL Server 2005. Applying a partitioning scheme to a clustered index, or moving it to a different filegroup, doesn't change the cluster key at all - so there's no need to rebuild the non-clustered indexes.

So the rule of thumb to remember is: if the physical or logical RID contained in the non-clustered index records needs to change then all the non-clustered indexes need to be rebuilt. Its as simple as that.

 

Hopefully this will help to clear up some confusion - let me know of any questions!

I mentioned this in my Anatomy of a page post - its a common misconception that records in an index are ALWAYS stored in the same physical order as the logical order defined by the index key. Here's proof for you that this is incorrect (as well as introducing you to the other dump styles for DBCC PAGE).

I'm going to create a table with a clustered index on an integer column and keep the table to a single page for simplicity:

USE MASTER;

GO

IF DATABASEPROPERTY (N'rowordertest', 'Version') > 0 DROP DATABASE rowordertest;

GO

CREATE DATABASE rowordertest;

GO

USE rowordertest;

GO

CREATE TABLE t1 (c1 INT, c2 VARCHAR (10));

CREATE CLUSTERED INDEX t1c1 ON t1 (c1);

GO

Now I'm going to insert a few rows into the table, with c1 from 2 to 5 - conspicuously not inserting c1 = 1:

INSERT INTO t1 VALUES (2, REPLICATE ('b', 10));

INSERT INTO t1 VALUES (3, REPLICATE ('c', 10));

INSERT INTO t1 VALUES (4, REPLICATE ('d', 10));

INSERT INTO t1 VALUES (5, REPLICATE ('e', 10));

GO

Now, using DBCC IND  we see that the data page is (1:143) and dumping that with DBCC PAGE gives the following (skipping the header output):

DBCC IND ('rowordertest', 't1', 1);

GO

DBCC TRACEON (3604);

GO

DBCC PAGE ('rowordertest', 1, 143, 3);

GO

Slot 0 Offset 0x60 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5BA3C060

00000000:   30000800 02000000 0300f802 0011001b †0...............
00000010:   00626262 62626262 626262†††††††††††††.bbbbbbbbbb
UNIQUIFIER = [NULL]  

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 0 Column 2 Offset 0x11 Length 10

c2 = bbbbbbbbbb

<snip slots 1 and 2>                   

Slot 3 Offset 0xb1 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5BA3C0B1

00000000:   30000800 05000000 0300f802 0011001b †0...............
00000010:   00656565 65656565 656565†††††††††††††.eeeeeeeeee
UNIQUIFIER = [NULL]

Slot 3 Column 1 Offset 0x4 Length 4

c1 = 5

Slot 3 Column 2 Offset 0x11 Length 10

c2 = eeeeeeeeee

DBCC PAGE with dump-style 3 always outputs the records on a page in their logical order (because that's how the slot array is ordered). Notice that the record with c1 = 2 is stored at offset 0x60 in the page and the last record on the page with c1 = 5 is stored at offset 0xb1. Now we'll insert a record with c1 =1. This will become the first logical record in the index, but will it cause the page to be shuffled so the records can all be stored in logical order?

INSERT INTO t1 VALUES (1, REPLICATE ('a', 10));

GO

DBCC PAGE ('rowordertest', 1, 143, 3);

GO

Slot 0 Offset 0xcc Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x61FCC0CC

00000000:   30000800 01000000 0300f802 0011001b †0...............
00000010:   00616161 61616161 616161†††††††††††††.aaaaaaaaaa
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x11 Length 10

c2 = aaaaaaaaaa

Slot 1 Offset 0x60 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

<snip>

The answer is no. Even though the record with c1 =1 is output by DBCC PAGE first, look at its offset within the page - 0xCC - clearly the last record on the page and stored in a different physical order than the logical order defined by the index key. Further proof can be obtained by looking at a raw hex dump of the page using dump style 2 with DBCC PAGE:

DBCC PAGE ('rowordertest', 1, 143, 2);

GO

<snip>

6204C000:   01010400 00400001 00000000 00000800 †.....@..........
6204C010:   00000000 00000500 44000000 0f1fe700 †........D.......
6204C020:   8f000000 01000000 13000000 60000000 †............`...
6204C030:   16000000 00000000 00000000 00000000 †................
6204C040:   01000000 00000000 00000000 00000000 †................
6204C050:   00000000 00000000 00000000 00000000 †................
6204C060:   30000800 02000000 0300f802 0011001b †0...............
6204C070:   00626262 62626262 62626230 00080003 †.bbbbbbbbbb0....
6204C080:   00000003 00f80200 11001b00 63636363 †............cccc
6204C090:   63636363 63633000 08000400 00000300 †cccccc0.........
6204C0A0:   f8020011 001b0064 64646464 64646464 †.......ddddddddd
6204C0B0:   64300008 00050000 000300f8 02001100 †d0..............
6204C0C0:   1b006565 65656565 65656565 30000800 †..eeeeeeeeee0...
6204C0D0:   01000000 0300f802 0011001b 00616161 †.............aaa
6204C0E0:   61616161 61616100 00000000 00000000 †aaaaaaa.........
6204C0F0:   00000000 00000000 00000000 00000000 †................

<snip>

You can clearly see that the last row I inserted, with c1 = 1 and the replicated 'a's is stored after the other records on the page, even though its key is logically before the others.

And just to nail the point home, doing a dump style 1 with DBCC PAGE will dump out the slot array for us:

DBCC PAGE ('rowordertest', 1, 143, 1);

GO

<snip>

OFFSET TABLE:

Row - Offset
4 (0x4) - 177 (0xb1)
3 (0x3) - 150 (0x96)
2 (0x2) - 123 (0x7b)
1 (0x1) - 96 (0x60)
0 (0x0) - 204 (0xcc)

<snip>

The slot array grows backwards, which is why its dumped in what looks like reverse logical order. You can see that slot 0, which represents the first logical record on the page, is stored at an offset greater than the others.

This is a combo from some previously posted material, with some more DBCC PAGE output thrown in.

IAM pages

An IAM (Index Allocation Map) page tracks approximately 4GB worth of space in a single file, aligned on a 4GB boundary. These 4GB chunks are called 'GAM intervals'. An IAM page tracks which extents within that specific GAM interval belongs to a single entity (I'm chosing my words carefully here and not using any word that has SQL Server connotations like 'object').

An IAM page can only track the space for a single GAM interval in a single file so If the database has multiple files, or some files are more then 4GB, and the entity has space allocated from multiple files or multiple GAM intervals within a file, then you can see how multiple IAM pages are needed for each entity to track all the space that its using. If an entity requires multiple IAM pages to track all its extents, then they IAM page have to be linked together. That's where an IAM chain comes in. More on these below.

Each IAM page has two records, an IAM page header and a bitmap. Let's look at one with DBCC PAGE. I'm using the database from the page split post. Doing a DBCC IND on the table we created gives us:

SEQA3.jpg

By looking at the PageType column, we can see that there's an IAM page (a page with type 10 - see the post on Anatomy of a page for more details) with page ID (1:152):

DBCC TRACEON (3604);

GO

DBCC PAGE ('pagesplittest', 1, 152, 3);

GO

m_pageId = (1:152)                   m_headerVersion = 1                  m_type = 10
m_typeFlagBits = 0x0                 m_level = 0                          m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594042384384
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 90                         m_slotCnt = 2                        m_freeCnt = 6
m_freeData = 8182                    m_reservedCnt = 0                    m_lsn = (18:116:13)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = -1947725876

Allocation Status

GAM (1:2) = ALLOCATED                SGAM (1:3) = ALLOCATED
PFS (1:1) = 0x70 IAM_PG MIXED_EXT ALLOCATED   0_PCT_FULL                  DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

IAM: Header @0x620CC064 Slot 0, Offset 96

sequenceNumber = 0                   status = 0x0                         objectId = 0
indexId = 0                          page_count = 0                       start_pg = (1:0)


IAM: Single Page Allocations @0x620CC08E

Slot 0 = (1:143)                     Slot 1 = (1:153)                     Slot 2 = (1:154)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)


IAM: Extent Alloc Status Slot 1 @0x620CC0C2

(1:0)        - (1:272)      = NOT ALLOCATED

Some things to note about the page header itself:

  • The page has type 10, as we'd expect
  • The previous and next page pointers are NULL, because there aren't any other IAM pages in this IAM chain
  • The slot count is 2 - one for the IAM header record and one for the bitmap itself
  • The page is almost entirely full

The IAM page header has the following fields:

  • sequenceNumber
    • This is the position of the IAM page in the IAM chain. This increases by one for each page added to the IAM chain.
  • status
    • This is unused.
  • objectId
  • indexId
    • On SQL Server 2000 and before, these are the object  and index IDS that the IAM page is part of. On SQL Server 2005 and later they are unused.
  • page_count
    • This is unused  - it used to be the number of page IDs that are being tracked in the single page allocation array.
  • start_pg
    • This is the GAM interval that the page maps. It stores the first page ID in the mapped interval.
  • Single Page Allocations array
    • These are the pages that have been allocated from mixed extents. This array is only used in the first IAM page in the chain (as the whole IAM chain only need to track at most 8 single-page allocations).

The bitmap occupies the rest of the IAM page and has a bit for each extent in the GAM interval. The bit is set if the extent is allocated to the entity, and clear if it is not. Obviously two IAM pages that map the same GAM interval for different entities cannot both have the same bit set - this is checked by DBCC CHECKDB. In the output from DBCC PAGE above, you can see that there are no extents allocated to the table. You'll notice that the output only goes up to the extent starting at page 272 in the file - this is because the data file is only that big. I inserted a bunch more rows into the table and did another DBCC PAGE of the IAM page. This time the DBCC PAGE output contains:

IAM: Single Page Allocations @0x620CC08E

Slot 0 = (1:143)                     Slot 1 = (1:153)                     Slot 2 = (1:154)
Slot 3 = (1:155)                     Slot 4 = (1:156)                     Slot 5 = (1:157)
Slot 6 = (1:158)                     Slot 7 = (1:159)


IAM: Extent Alloc Status Slot 1 @0x620CC0C2

(1:0)        - (1:152)      = NOT ALLOCATED
(1:160)      - (1:296)      =     ALLOCATED
(1:304)      - (1:400)      = NOT ALLOCATED

You can see that the entire single-page allocation array is full and then allocations switched to dedicated extents. The first available extent must have been the one starting at page 160 and all extents up to an including the one starting at page 296 are now allocated. Note also that the file must have grown because the output now goes up to page 400 in the file.

A couple more things to note about IAM pages:

  • There are themselves single-page allocations from mixed extents and are not tracked anywhere
  • They can be allocated from any file to track extents in any other file

IAM chains

If we continued to grow the file and fill up the table then eventually we'd need another IAM page to map the next GAM interval. This is where an IAM chain comes in. It's a linked-list of IAM pages that track the space allocated to a single entity. The linked-list is not sorted at all - IAM pages are appended to it in the order that they're needed. The IAM pages within the list are numbered, again, in the order that they were appended to the list.

Definition of 'entity' - what uses an IAM chain? This is vastly different in SQL Server 2000 and 2005.

In SQL Server 2000, a single IAM chain is used for each:

  • Heap or clustered index
    • A table can only have one or the other, not both. These have index IDs of 0 and 1 respectively.
  • Non-clustered index
    • These have index IDs from 2 to 250 (i.e. you can only have 249 of them)
  • Table's complete LOB storage
    • For LOB columns (text, ntext, image) in the heap or clustered index. This is sometimes called the 'text index' and has a fixed index ID of 255.

This gives a maximum of 251 IAM chains per object in SQL Server 2000 and before. I usually generalize and say that in SQL Server 2000, there's one IAM chain per index (which fits nicely if you remember that IAM stands for Index Allocation Map).

Allocation units (SQL Server 2005 and later)

Now in SQL Server 2005 and later, things have changed a lot. IAM chains and IAM pages are exactly the same, but what they correspond to is different. A table can now have up to 750000 IAM chains! There are now three things that IAM chains map space allocations for:

  1. heaps and b-trees (a b-tree is the internal structure used to store an index)
  2.  LOB data
  3. row-overflow data

and we now call these units of space allocation tracking allocation units. The internal names for these three types of allocation unit are (respectively):

  1. hobt allocation unit (Heap Or B-Tree, pronounced 'hobbit', yes, as in Lord Of The Rings)
  2. LOB allocation unit
  3. SLOB allocation unit (Small-LOB or Short-LOB)

and the external names are, respectively:

  1. IN_ROW_DATA allocation unit
  2. LOB_DATA allocation unit
  3. ROW_OVERFLOW_DATA allocation unit

They couldn't really continue to be called IAM chains, because they're no longer tracking space allocation for an index. However, they're chain of IAM pages is still called an IAM chain, and the unit of tracking is now called an allocation unit. Apart from that, there's no difference.

Let's have a quick look at three new features in SQL Server 2005 that made these changes necessary and boosted the number of potential IAM chains per table.

Included Columns
This is the ability for non-clustered indexes to include non-key columns at the leaf-level. This is useful for three reasons:

  1. Iit allows a non-clustered index to truly cover a query where the query results include more than 16 columns or the combination of column lengths in the query results is greater than 900 bytes (remember that a non-clustered index key is limited to 16 columns and 900 bytes).
  2. It allows columns to be include in the non-clustered index that have data types that cannot be part of an index key (e.g. varchar(max) or XML).
  3. It allows a non-clustered index to cover a query without having to have all the columns included in the index key. As the index key is included in rows at all levels of the b-tree, this allows the index to be smaller.

An example of space saving: imagine a 100 million row index, with a key length of 900 bytes, but only the first two integer keys are really needed as the index key, the other 4 fixed-length columns could be stored in the index as included columns. With the 900 byte key, 8 rows can fit per database page (i.e. the fanout is 8). This means there will be 12500000 pages at the leaf level, 1562500 pages at the next level up in the b-tree and so on, giving a total of 12500000 + 1562500 + 195313 + 24415 + 3052 + 382 + 48 + 6 + 1 = 14285717 pages (including 1785717 to store the upper levels of the b-tree).

If we go with the included columns method then the key size shrinks to 8 bytes, and with the row overhead we can get the row length in the upper levels of the b-tree down to 15 bytes (giving a fanout of approx. 537). Note that the fanout at the leaf-level is still going to be 8,  because the amount of data stored in each row at the leaf-level is the same. So, this means there will be 12500000 pages at the leaf level, 23278 pages at the next level up and so on, giving a total of 12500000 + 23278 + 44 + 1 = 12523323 pages (including 23323 to store the upper levels of the b-tree). Compared to the full-size 900-byte key, this is a 12% saving of 1762394 pages, or 13.6GB! Obviously this is a contrived case but you can see how the savings can occur.

The main reason for adding this feature it to enable true covering queries. A covering query is one where the query optimizer knows it can get all the query results from the non-clustered index and so the query can be satisfied without having to incur the extra IOs of looking up data in the base table - a significant performance saving.

Now that non-clustered indexes can have included columns, and those columns can be LOB data types (but only the new ones in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), and XML). This means that having a single LOB allocation unit (as in the case of the single text index in SQL Server 2000) isn't possible any more because each index may have its own set of LOBs. Now, you may ask why there isn't just a single set of LOBs with multiple references from various indexes plus the base table. We considered that during SQL Server 2005 development but it would have made things a lot more complicated.

So, with this new feature, each index needs two allocation units - one for the data or index rows (the hobt allocation unit) and one for any LOB data.

Large Rows

One of the things that has plagued schema designers for a long time is the 8060 byte limit on table row sizes so this restriction was removed in SQL Server 2005. The way this is done is to allow variable-length columns (e.g. varchar, sqlvariant) to get pushed off-row when the row size gets too big to fit on a single page.

But where do these column values get pushed to? They're effectively turned into mini LOB columns. The column value in the row is replaced with a 16-byte pointer to the off-row column value, which is stored as if its a LOB value in a seperate allocation unit - the row-overflow (or SLOB) allocation unit. These values are stored in text pages in exactly the same way as regular LOB values are, just using a separate allocation unit. The SLOB allocation unit is only created when the first column value is pushed off-row.

This feature works for non-clustered indexes too - if you consider the ability to have included columns in non-clustered indexes then you could easily have non-clustered index rows that won't fit on a page. It would have been short-sighted of to get rid of the 900-byte limit and replace it with an 8060-byte limit by not extending the row-overflow feature to non-clustered indexes too.

Now with the addition of this new feature, each index can have up to three allocation units - hobt, LOB, and SLOB. Even with this, that only makes a maximum of 750 IAM chains per table (remember an IAM chain now maps the storage allocations for an allocation unit, so 250 indexes * 3 allocation units = 750 IAM chains). But I mentioned 750 thousand IAM chains per table earlier - where do all the rest come from?

Partitioning

This is what gives us the 1000x multiplier. As you may already know, partitioning is the new feature that allows tables and indexes to be split into a series of ranges, with each range stored separately (most commonly in seperate filegroups). Partitioning is a topic for a separate post.

If each range or partition of the table or index is stored seperately, then each is going to need its own hobt allocation unit. Of course, the LOB values associated with each partition need to be stored with it, and so each partition also needs a LOB allocation unit. Also, the row-overflow feature is per-row, and so rows in each partition will overflow into SLOB allocation units just as for un-partitioned tables and indexes. Thus each partition of a table or index can have up to three allocation units (and hence three IAM chains).

Still, where does that 1000 come in? Each table or index can have up to 1000 partitions. This gives us 250 indexes x 1000 partitions x 3 allocation units = 750000 IAM chains. Realistically this probably won't happen, but it is possible.

This one's a quickie.

In the previous post I explained about database pages - their structure and some page types. Now I'd like to explain how pages are grouped into units called extents. An extent is a group of eight physically consecutive pages in a data file. Extents are always aligned on 64KB boundaries (i.e. 8-page boundaries), starting at the beginning of the file. Extents, and all their properties, are exactly the same in SQL Server 2000 and 2005.

There are two types of extents: mixed extents and dedicated (or uniform) extents.

Mixed extents

The first 8 pages that are allocated to any IAM chain (either for an index in SQL Server 2000 or allocation unit in SQL Server 2005) are single-page allocations, which are called mixed pages. This means that only a single page is allocated to the IAM chain at a time, rather than a whole extent. The rationale behind this is to allow very small tables to take up the minimum amount of space. Once an IAM chain has passed the 8-page threshold and switched to allocating dedicated extents, it will never go back to allocating mixed pages again.

These mixed pages are allocated from mixed extents that are not allocated to any particular IAM chain. The extent is tracked as being allocated globally (by a GAM page), so no IAM chain can allocate it. If the mixed extent has any unallocated pages in it, it will also be tracked by an SGAM page. When a mixed page is required to be allocated, the SGAM pages are checked to see if any such extents are available. If not, a new mixed extent is allocated, a page is allocated from it and then the extent is tracked by the relevant SGAM page until all its pages are allocated.

As the mixed extent is not allocated to any particular IAM chain, this means that it may hold pages allocated to possibly 8 separate IAM chains. IAM pages themselves are always mixed pages, regardless of how many pages have been allocated to an IAM chain. This means a mixed extent may hold a variety of page types too,including IAM, data, index, or text pages.

Dedicated/Uniform extents

Once the 8-page threshold is passed, all further allocations to an IAM chain are from dedicated extents. This means that an extent at a time is allocated to an IAM chain and marked as such in one of the IAM pages in the IAM chain - whichever one maps the GAM interval that the extent is in. The extent is also tracked as being allocated by the relevant GAM page so no other IAM chain can allocate it.

All pages from a dedicated extent must be allocated to the same IAM chain. They do not all need to be the same type, however. For instance, a clustered index has a mixture of data and index pages. Also, when an extent is allocated to an IAM chain the pages in the extent are not all allocated at once (except for certain bulk operations). The pages are usually allocated as needed. The allocation state of each page is tracked using a PFS page.

When all the pages from a dedicated extent are deallocated, the extent itself is deallocated from its owning IAM chain and is available again for allocation to any of IAM chain, or to become a mixed extent.

Tracking changes to extents for backup

Changes to extents are tracked in two places:

  • Any extent that has been changed since the last full or differential backup will be tracked in the relevant differential bitmap page. This is how a differential backup knows which extents to backup instead of the whole database. All diff map pages are reset when the next backup is taken.
  • Any extent that has been changed by a bulk-logged operation since the last full, differential, or log backup will be tracked in the relevant minimally-logged bitmap page. Any log backup that's taken after a bulk-logged operation will also include all extents tracked this way. All ML map pages are reset when the next backup is taken.

More on PFS, GAM, SGAM, ML, and DIFF pages in a not-too-distant future post.

Next up in the Inside the Storage Engine series is a discussion of page structure. Pages exist to store records. A database page is an 8192-byte (8KB) chunk of a database data file. They are aligned on 8KB boundaries within the data files, starting at byte-offset 0 in the file. Here's a picture of the basic structure:

page.gif

Header

The page header is 96 bytes long. What I'd like to do in this section is take an example page header dump from DBCC PAGE and explain what all the fields mean. I'm using the database from the page split post and I've snipped off the rest of the DBCC PAGE output.

DBCC TRACEON (3604)

DBCC PAGE ('pagesplittest', 1, 143, 1);

GO

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042384384                                
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (1:154)
pminlen = 8                          m_slotCnt = 4                        m_freeCnt = 4420
m_freeData = 4681                    m_reservedCnt = 0                    m_lsn = (18:116:25)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 1333613242
             

Here's what all the fields mean (note that the fields aren't quite stored in this order on the page):

  • m_pageId
    • This identifies the file number the page is part of and the position within the file. In this example, (1:143) means page 143 in file 1.
  • m_headerVersion
    • This is the page header version. Since version 7.0 this value has always been 1.
  • m_type
    • This is the page type. The values you're likely to see are:
      • 1 - data page. This holds data records in a heap or clustered index leaf-level.
      • 2 - index page. This holds index records in the upper levels of a clustered index and all levels of non-clustered indexes.
      • 3 - text mix page. A text page that holds small chunks of LOB values plus internal parts of text tree. These can be shared between LOB values in the same partition of an index or heap.
      • 4 - text tree page. A text page that holds large chunks of LOB values from a single column value.
      • 7 - sort page. A page that stores intermediate results during a sort operation.
      • 8 - GAM page. Holds global allocation information about extents in a GAM interval (every data file is split into 4GB chunks - the number of extents that can be represented in a bitmap on a single database page). Basically whether an extent is allocated or not. GAM = Global Allocation Map. The first one is page 2 in each file. More on these in a later post.
      • 9 - SGAM page. Holds global allocation information about extents in a GAM interval. Basically whether an extent is available for allocating mixed-pages. SGAM = Shared GAM. the first one is page 3 in each file. More on these in a later post.
      • 10 - IAM page. Holds allocation information about which extents within a GAM interval are allocated to an index or allocation unit, in SQL Server 2000 and 2005 respectively. IAM = Index Allocation Map. More on these in a later post.
      • 11 - PFS page. Holds allocation and free space information about pages within a PFS interval (every data file is also split into approx 64MB chunks - the number of pages that can be represented in a byte-map on a single database page. PFS = Page Free Space. The first one is page 1 in each file. More on these in a later post.
      • 13 - boot page. Holds information about the database. There's only one of these in the database. It's page 9 in file 1.
      • 15 - file header page. Holds information about the file. There's one per file and it's page 0 in the file.
      • 16 - diff map page. Holds information about which extents in a GAM interval have changed since the last full or differential backup. The first one is page 6 in each file.
      • 17 - ML map page. Holds information about which extents in a GAM interval have changed while in bulk-logged mode since the last backup. This is what allows you to switch to bulk-logged mode for bulk-loads and index rebuilds without worrying about breaking a backup chain. The first one is page 7 in each file.
  • m_typeFlagBits
    • This is mostly unused. For data and index pages it will always be 4. For all other pages it will always be 0 - except PFS pages. If a PFS page has m_typeFlagBits of 1, that means that at least one of the pages in the PFS interval mapped by the PFS page has at least one ghost record.
  • m_level
    • This is the level that the page is part of in the b-tree.
    • Levels are numbered from 0 at the leaf-level and increase to the single-page root level (i.e. the top of the b-tree).
    • In SQL Server 2000, the leaf level of a clustered index (with data pages) was level 0, and the next level up (with index pages) was also level 0. The level then increased to the root. So to determine whether a page was truly at the leaf level in SQL Server 2000, you need to look at the m_type as well as the m_level.
    • For all page types apart from index pages, the level is always 0.
  • m_flagBits
    • This stores a number of different flags that describe the page. For example, 0x200 means that the page has a page checksum on it (as our example page does) and 0x100 means the page has torn-page protection on it.
    • Some bits are no longer used in SQL Server 2005.
  • m_objId
  • m_indexId
    • In SQL Server 2000, these identified the actual relational object and index IDs to which the page is allocated. In SQL Server 2005 this is no longer the case. The allocation metadata totally changed so these instead identify what's called the allocation unit that the page belongs to (I'll do another post that describes these later today).
  • m_prevPage
  • m_nextPage
    • These are pointers to the previous and next pages at this level of the b-tree and store 6-byte page IDs.
    • The pages in each level of an index are joined in a doubly-linked list according to the logical order (as defined by the index keys) of the index. The pointers do not necessarily point to the immediately adjacent physical pages in the file (because of fragmentation).
    • The pages on the left-hand side of a b-tree level will have the m_prevPage pointer be NULL, and those on the right-hand side will have the m_nextPage be NULL.
    • In a heap, or if an index only has a single page, these pointers will both be NULL for all pages.
  • pminlen
    • This is the size of the fixed-length portion of the records on the page.
  • m_slotCnt
    • This is the count of records on the page.
  • m_freeCnt
    • This is the number of bytes of free space in the page.
  • m_freeData
    • This is the offset from the start of the page to the first byte after the end of the last record on the page. It doesn't matter if there is free space nearer to the start of the page.
  • m_reservedCnt
    • This is the number of bytes of free space that has been reserved by active transactions that freed up space on the page. It prevents the free space from being used up and allows the transactions to roll-back correctly. There's a very complicated algorithm for changing this value.
  • m_lsn
    • This is the Log Sequence Number of the last log record that changed the page.
  • m_xactReserved
    • This is the amount that was last added to the m_reservedCnt field.
  • m_xdesId
    • This is the internal ID of the most recent transaction that added to the m_reservedCnt field.
  • m_ghostRecCnt
    • The is the count of ghost records on the page.
  • m_tornBits
    • This holds either the page checksum or the bits that were displaced by the torn-page protection bits - depending on what form of page protection is turnde on for the database.

Note that I didn't include the fields starting with Metadata:. That's because they're not part of a page header. During SQL Server 2005 development I did some major work rewriting the guts of DBCC PAGE and to save everyone using it from having to do all the system table lookups to determine what the actual object and index IDs are, I changed DBCC PAGE to do them internally and output the results.

Records

See this blog post for details.

Slot Array

It's a very common misconception that records within a page are always stored in logical order. This is not true. There is another misconception that all the free-space in a page is always maintained in one contiguous chunk. This also is not true. (Yes, the image above shows the free space in one chunk and that very often  is the case for pages that are being filled gradually.)

If a record is deleted from a page, everything remaining on the page is not suddenly compacted - inserters pay the cost of compaction when its necessary, not deleters.

Consider a completely full page - this means that record deletions cause free space holes within the page. If a new record needs to be inserted onto the page, and one of the holes is big enough to squeeze the record into, why go to the bother of comapcting it? Just stick the record in and carry on. What if the record should logically have come at the end of all other records on the page, but we've just inserted it in the middle - doesn't that screw things up somewhat?

No, because the slot array is ordered and gets reshuffled as records are inserted and deleted from pages. As long as the first slot array entry points to the logically first record on the page, everything's fine. Each slot entry is just a two-byte pointer into the page - so its far more efficient to manipulate the slot array than it is to manipulate a bunch of records on the page. Only when we know there's enough free space contained within the page to fit in a record, but its spread about the page do we compact the records on the page to make the free space into a contiguous chunk.

One interesting fact is that the slot array grows backwards from the end of the page, so the free space is squeezed from the top by new rows, and from the bottom by the slot array.

Time for the first post in the Inside the Storage Engine series. I'm going to focus on SQL Server 2005 in this series and I'll point out major differences between 2005 and previous versions. Please drop me a line if there's something you'd like to see explained and demo'd.

Before jumping into how things work, I'd like to go over two commands I'll be using a lot - DBCC PAGE and DBCC IND. These are both undocumented and unsupported commands, but are very safe to use as they're used extensively inside and outside Microsoft when troubleshooting. Nevertheless, use at your own risk. They're quite well known in the SQL community and I and others have publicized them before (I even demo'd them last year at ITForum in Spain).

To illustrate their use, I'm going to use a simple script I wrote to prove that page splits never roll back. I was having a discussion with someone a while ago about this question and the answer is always no. A page split occurs when an insert or update has to happen at a certain point in an index page, and there's no room on the page to accomomodate the new or updated record. Page splits are done internally as separate 'system' transactions. Once a system transaction commits, it cannot be rolled back - even if the user transaction it was part of rolls back.

So, let's run through the script. First thing to do is create a database containing a table with an index (as page splits only happen in indexes).

USE MASTER;

GO

IF DATABASEPROPERTY (N'pagesplittest', 'Version') > 0 DROP DATABASE pagesplittest;

GO

CREATE DATABASE pagesplittest;

GO

USE pagesplittest;

GO

CREATE TABLE t1 (c1 INT, c2 VARCHAR (1000));

CREATE CLUSTERED INDEX t1c1 ON t1 (c1);

GO

Now I'm going to fill up a page in the index, but leave a gap in the c1 values so that I can force a page split by inserting the missing key value.

INSERT INTO t1 VALUES (1, REPLICATE ('a', 900));

INSERT INTO t1 VALUES (2, REPLICATE ('b', 900));

INSERT INTO t1 VALUES (3, REPLICATE ('c', 900));

INSERT INTO t1 VALUES (4, REPLICATE ('d', 900));

-- leave a gap at 5

INSERT INTO t1 VALUES (6, REPLICATE ('f', 900));

INSERT INTO t1 VALUES (7, REPLICATE ('g', 900));

INSERT INTO t1 VALUES (8, REPLICATE ('h', 900));

INSERT INTO t1 VALUES (9, REPLICATE ('i', 900));

GO

I can find out what the first index page is using the DBCC IND command:

DBCC IND ('pagesplittest', 't1', 1);

GO

This command list all the pages that are allocated to an index. Here's the output in this case:

SEQA3.jpg

The columns mean:

  • PageFID - the file ID of the page
  • PagePID - the page number in the file
  • IAMFID - the file ID of the IAM page that maps this page (this will be NULL for IAM pages themselves as they're not self-referential)
  • IAMPID - the page number in the file of the IAM page that maps this page
  • ObjectID - the ID of the object this page is part of
  • IndexID - the ID of the index this page is part of
  • PartitionNumber - the partition number (as defined by the partitioning scheme for the index) of the partition this page is part of
  • PartitionID - the internal ID of the partition this page is part of
  • PageType - the page type. Some common ones are:
    • 1 - data page
    • 2 - index page
    • 3 and 4 - text pages
    • 8 - GAM page
    • 9 - SGAM page
    • 10 - IAM page
    • 11 - PFS page
  • IndexLevel - what level the page is at in the index (if at all). Remember that index levels go from 0 at the leaf to N at the root page (except in clustered indexes in SQL Server 2000 and 7.0 - where there's a 0 at the leaf level (data pages) and a 0 at the next level up (first level of index pages))
  • NextPageFID and NextPagePID - the page ID of the next page in the doubly-linked list of pages at this level of the index
  • PrevPageFID and PrevPagePID - the page ID of the previous page in the doubly-linked list of pages at this level of the index

So you can see we've got a single page clustered index with an IAM page. Note that the page IDs returned may differ on your server. Let's look at the data page:

DBCC TRACEON (3604);

GO

DBCC PAGE (pagesplittest, 1, 143, 3);

GO

The traceflag is to make the output of DBCC PAGE go to the console, rather than to the error log. The syntax for DBCC PAGE is:

dbcc page ( {'dbname' | dbid}, filenum, pagenum [, printopt={0|1|2|3} ])

The filenum and pagenum parameters are taken from the page IDs that come from various system tables and appear in DBCC or other system error messages. A page ID of, say, (1:143) has filenum = 1 and pagenum = 143.

The printopt parameter has the following meanings:

  • 0 - print just the page header
  • 1 - page header plus per-row hex dumps and a dump of the page slot array (unless its a page that doesn't have one, like allocation bitmaps)
  • 2 - page header plus whole page hex dump
  • 3 - page header plus detailed per-row interpretation

The per-row interpretation works for all page types, including the allocation bitmaps. In our case, we asked for a detailed output. I'll explain the various parts of the output in a post about the anatomy of a page. Here's the output from DBCC PAGE, with a bunch of the repeated per-row info removed for brevity:

PAGE: (1:143)


BUFFER:


BUF @0x02C49720

bpage = 0x05400000                   bhash = 0x00000000                   bpageno = (1:143)
bdbid = 8                            breferences = 0                      bUse1 = 22163
bstat = 0xc0010b                     blog = 0x32159bb                     bnext = 0x00000000

PAGE HEADER:


Page @0x05400000

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042384384                                
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 8                          m_slotCnt = 8                        m_freeCnt = 744
m_freeData = 7432                    m_reservedCnt = 0                    m_lsn = (18:113:2)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 0                      

Allocation Status

GAM (1:2) = ALLOCATED                SGAM (1:3) = ALLOCATED              
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                         DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED           

Slot 0 Offset 0x60 Length 917

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C24C060

00000000:   30000800 01000000 0300f802 00110095 †0...............        
00000010:   03616161 61616161 61616161 61616161 †.aaaaaaaaaaaaaaa        

<snip> I've removed this section to save space

00000380:   61616161 61616161 61616161 61616161 †aaaaaaaaaaaaaaaa        
00000390:   61616161 61††††††††††††††††††††††††††aaaaa                   
UNIQUIFIER = [NULL]                 

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1                              

Slot 0 Column 2 Offset 0x11 Length 900

c2 = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
a                                   

Slot 1 Offset 0x3f5 Length 917

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C24C3F5

<snip> And again...

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

As you can see from the output, each row is 917 bytes long and there's only 744 bytes free (look at the m_freecnt value in the PAGE HEADER section). This means that we can't insert another row on that page of the same length - there just isn't space - but that's what we're going to do! Remember that this page currently has nine rows on it. Let's force a page split:

BEGIN TRAN;

GO

INSERT INTO t1 VALUES (5, REPLICATE ('a', 900));

GO

Now we know there wasn't enough room so the page must have split. Let's check DBCC IND again to see if another page was allocated to the index - here's the output:

SEQA41.jpg

Two pages have been added - an index page and another data page. Before we added the extra row and caused the page split, the index only needed one page. Now that there are two data pages, there needs to be an index page to allow searches through the index b-tree. Let's take a look at the two data pages to see which rows are stored on which page. Doing DBCC PAGE on them shows that page (1:143) has 5 rows, with c1 values 1 through 5, and page (1:154) has the 4 rows with c1 values 6 through 9 (I'm not going to post all the DBCC PAGE output - that would make the post way too long and it gives you an incentive to try the commands out). This is what we'd expect, as the page split occurs at the insertion point, and the row being inserted is put onto the page that split.

Now let's rollback the user transaction and see what happens:

ROLLBACK TRAN;

GO

Running the DBCC PAGE commands again shows that the index structure remains the same as after the split. Page (1:154) has the 4 rows on it from the split and page (1:143) has the other rows but not the one we inserted in the explicit transaction.

So, proof that a page split is never rolled back. I'll be making much more use of these two DBCC commands in future posts and I'll do the page anatomy one later this week. Let me know if there's anything in particular you'd like to see described in this series.

This week I'm going to post a bunch of info on the basic structures used to store data and track allocations in SQL Server. A bunch of this was posted back when I started blogging at TechEd 2006 but I want to consolidate/clarify info and add more about using DBCC PAGE to examine the various structures.

So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that...

Data records

  • Data records are stored on data pages.
  • Data records store rows from a heap or the leaf level of a clustered index.
  • A data record always stores all columns from a table row - either by-value or by-reference.
    • If any columns are for LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), XML), then there's a pointer stored in the data record which points to a text record on a different page (the root of a loose tree that stores the LOB value). Exceptions to this are when the schema has been set to store LOB columns 'in-row' when possible. This is when a LOB value is small enough to fit within the size limits of a data record. This is a performance benefit as selecting the LOB column does not require an extra IO to read the text record.
    • In SQL Server 2005, non-LOB variable length columns (e.g. varchar, sqlvariant) may also be stored 'off-row' as part of the row-overflow feature of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values - a pointer in the data record pointing to a text record.
  • There is a difference to how the columns are laid out between heaps and clustered indexes - I'll cover that in a later post.

Forwarded/Forwarding records

  • These are technically data records and are only present in a heap.
  • A forwarded record is a data record in a heap that was updated and was too large to fit in-place on its original page and so has been moved to another page. It contains a back-pointer to the forwarding record.
  • A forwarding record is left in its place and points to the new location of the record. It's sometimes known as a forwarding-stub, as all it contains is the location of the real data record.
  • This is done to avoid having to update any non-clustered index records that point back directly to the original physical location of the record.
  • Although this optimizes non-clustered index maintenance during updates, it can cause additional IOs during SELECTs. This is because the non-clustered index record points to the old location of the index, so an extra IO might be needed to read the real location of the data row. This is fuel for the heap vs clustered index debate, in favor of clustered indexes.

Index records

  • Index records are stored on index pages.
  • There are two types of index records (which differ only in what columns they store):
    1. Those that store non-clustered index rows at the leaf level of a non-clustered index
    2. Those that comprise the b-tree that make up clustered and non-clustered indexes (i.e. in index pages above the leaf level of a clustered or non-clustered index)
  • I'll explain more about the differences between these in a later post as it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
  • Index records typically do not contain all the column values in a table (although some do - called covering indexes).
  • In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).

Text records

  • Text records are stored on text pages.
  • There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page. I'll explain how they work and are linked together in a future post.
  • They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.

Ghost records

  • These are records that have been logically deleted but not physically deleted from a page. The reasons for this are complicated, but basically having ghost records simplfies key-range locking and transaction rollback.
  • The record is marked with a bit that indicates it's a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background proces (called the ghost-cleanup task) or it is converted back to a real record by an insert of a record with the exact same set of keys.

Other record types

  • There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and database boot page). Again, I'll go into these in later posts (there's a big queue of posts building up :-))

Record structure

All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.

The record structure is as follows:

  • record header
    • 4 bytes long
    • two bytes of record metadata (record type)
    • two bytes pointing forward in the record to the NULL bitmap
  • fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
  • NULL bitmap
    • two bytes for count of columns in the record
    • variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
    • this allows an optimization when reading columns that are NULL
  • variable-length column offset array
    • two bytes for the count of variable-length columns
    • two bytes per variable length column, giving the offset to the end of the column value
  • versioning tag
    • this is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb

NULL bitmap optimization

So why is the NULL bitmap an optimization?

Firstly, having a null bitmap removes the need for storing special 'NULL' values for fixed-length datatypes. Without the null bitmap, how can you tell whether a column is NULL? For fixed-length columns you'd need to define a special 'NULL' value, which limits the effective range of the datatype being stored. For varchar columns, the value could be a zero-length empty string, so just checking the length doesn't work - you'd need the special value again. For all other variable-length data types you can just check the length. So, we nede the NULL bitmap.

Secondly, it saves CPU cycles. If there was no NULL bitmap, then there are extra instructions executed for fixed-length and variable-length columns.

For fixed-length:

  1. read in the stored column value (possibly taking a cpu data cache miss)
  2. load the pre-defined NULL value for that datatype (possibly taking a cpu data cache miss, but only for the first read in the case of a multiple row select)
  3. do a comparison between the two values

For variable-length:

  1. calculate the offset of the variable length array
  2. read the number of variable length columns (possibly taking a cpu data cache miss)
  3. calculate the position in the variable length offset array to read
  4. read the column offset from it (possibly taking a cpu data cache miss)
  5. read the next one too (possibly taking another cpu data cache miss, if the offset in step 4 was on the boundary of a cache line size)
  6. compare them to see if they're the same

But with a NULL bitmap, all you have to do is:

  1. read the NULL bitmap offset (possibly taking a cpu data cache miss)
  2. calculate the additional offset of the NULL bit you want to read
  3. read it (possibly taking a cpu data cache miss)

So, its about even for a lookup of a single fixed-length column, but for variable-length columns, and for multiple row selects, there's a clear advantage to having the NULL bitmap.

Using DBCC IND and DBCC PAGE to examine a row in detail

Let's create an example table to look at:

USE MASTER;
GO
IF DATABASEPROPERTY (N'recordanatomy', 'Version') > 0 DROP DATABASE recordanatomy;
GO
CREATE DATABASE recordanatomy;
GO

USE recordanatomy;
GO
CREATE TABLE example (destination VARCHAR(100), activity VARCHAR(100), duration INT);
GO
INSERT INTO example VALUES ('Banff', 'sightseeing', 5);
INSERT INTO example VALUES ('Chicago', 'sailing', 4);
GO

And we can use DBCC IND again to find the page to look at:

DBCC IND ('recordanatomy', 'example', 1);
GO

The output tells us the data page is (1:143) so we can dump it with DBCC PAGE, using option 3 to get a fully interpreted dump of each record.

DBCC TRACEON (3604);
GO
DBCC PAGE ('recordanatomy', 1, 143, 3);
GO

Remember we need the trace-flag to make the DBCC PAGE output go to the console instead of the error log. The output will contain something like the following:

Slot 0 Offset 0x60 Length 33

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C76C060

00000000:   30000800 05000000 0300f802 00160021 †0..............!
00000010:   0042616e 66667369 67687473 6565696e †.Banffsightseein
00000020:   67†††††††††††††††††††††††††††††††††††g

Slot 0 Column 0 Offset 0x11 Length 5

destination = Banff

Slot 0 Column 1 Offset 0x16 Length 11

activity = sightseeing

Slot 0 Column 2 Offset 0x4 Length 4

duration = 5

Let's use the record structure I listed above to go through this record and see how things are stored.

  • Byte 0 is the TagA byte of the record metadata.

    • Its 0x30, which corresponds to 0x10 (bit 4) and 0x20 (bit 5). Bit 4 means the record has a NULL bitmap and bit 5 means the record has variable length columns. If 0x40 (bit 6) was also set, that would indicate that the record has a versioning tag. If 0x80 (bit 7) was also set, that would indicate that byte 1 has a value in it.
    • Bits 1-3 of byte 0 give the record type. The possible values are:
      • 0 = primary record. A data record in a heap that hasn't been forwarded or a data record at the leaf level of a clustered index.
      • 1 = forwarded record
      • 2 = forwarding record
      • 3 = index record
      • 4 = blob fragment
      • 5 = ghost index record
      • 6 = ghost data record
      • 7 = ghost version record. A special 15-byte record containing a single byte record header plus a 14-byte versioning tag that is used in some circumstances (like ghosting a versioned blob record)
    • In our example, none of these bits are set which means the record is a primary record. If the record was an index record, byte 0 would have the value 0x36. Remember that the record type starts on bit 1, not bit 0, and so the record type value from the enumeration above needs to be shifted left a bit (multiplied by two) to get its value in the byte.
  • Byte 1 is the TagB byte of the record metadata. It can either be 0x00 or 0x01. If it is 0x01, that means the record type is ghost forwarded record. In this case its 0x00, which is what we expect given the TagA byte value.
  • Bytes 2 and 3 are the offset of the NULL bitmap in the record. This is 0x0008 (DBCC PAGE presents multi-byte values in hex dumps as least-significant byte first). This means that there's a 4-byte fixed length portion of the record starting at byte 4. We expect this because we know the table schema.
  • Bytes 4 to 7 are the fixed length portion. Again, because we know the table schema, we know to interpret these bytes as a 4-byte integer. Without that knowledge, you'd have to guess. The value therefore is 0x00000005, which is what we'd expect to see as the value of the duration column.
  • Bytes 8 and 9 are the count of columns in the record. This is 0x0003 which is correct. Given that there are only 3 columns, the NULL bitmap of one bit per column will fit in a single byte.
  • Byte 10 is the NULL bitmap. The value is 0xF8. We need to convert it to binary to make sense of the value. 0xF8 = 11111000. This makes sense - bits 0-2 represent columns 1-3 and they're all 0, meaning the columns aren't NULL. Bits 3-7 represent non-existent columns and they're set to 1 for clarity.
  • Bytes 11 and 12 are the count of variable length columns in the record. That value is 0x0002, which we again know to be correct. This means there will be two two-byte entries in the variable length column offset array. These will be bytes 13-14 and 15-16, having values of 0x0016 and 0x0021 respectively. Remember that NULL bitmap entries point to the end of the column value - this is done so that we know how long each column is without having to store their length as well.
  • So, the final offset is bytes 15 and 16, which means the offset of the start of the first variable length column must be byte 17 (or 0x11 in hex), which agrees with the DBCC PAGE dump. The offset of the end of the first variable length column is 0x0016, so the first value is from byte 17 to byte 21 inclusive. This value is 0x42616E6666. We know from the table metadata that this is the first varchar column, destination. Converting to ASCII gives us the column value 'Banff'. Using similar logic, the second value is from byte 22 to byte 32 inclusive and has the value 'sightseeing'. Both of these match the data we're expecting.

And that's it.

Some of the features of SQL Server 2008 will introduce changes to the record structure - more on those when the features are available in CTPs.

Theme design by Nukeation based on Jelle Druyts