The very worst piece of advice I ever saw on the Internet was in response to someone asking on a SQL newsgroup 'how can I create a corrupt database?'. The first response was:

When I want to corrupt a database to play with, I go into the data center, find a hard-drive and flick the power switch on-and-off quickly a few times.

This was closely followed by a bunch of replies (including mine) saying 'Noooooooooooo!!!!'

So, for a few years now I've provided a zip file with a bunch of pre-corrupted databases in - so you can test your consistency checking jobs, and experiment with corrupt databases without having to create them yourself by destroying hard-drives, or less daft means. You can find the main zip file at the top of our Past Events resources page, along with a link to a blog post which explains the various databases and demo scripts.

Several times I've been asked to provide SQL Server 2008 versions of the two databases which demo system table corruption. Well, that task has finally bubbled to my long-list of blog posts and website updates. There's now a second, smaller zip file which has 2008-only versions of the DemoFatalCorruption1 and DemoFatalCorruption2 databases, which showcase corruptions that prevent DBCC CHECKDB from running.

Let me know if you have any problems (playing with the backups Wink)

Enjoy!

(Edited 2/27/10 to add database name to the output) 

Here's a little script I knocked up this afternoon to tell me who has open transactions on the server - not just the single oldest active transaction that DBCC OPENTRAN returns.

It gives back:

  • session ID
  • login name
  • database context
  • transaction begin time
  • how many log records have been generated by the transaction
  • how much log space has been taken up by those log records
  • how much log space has been reserved in case the transaction rolls back
  • the last T-SQL that was executed in the context of the transaction
  • the last query plan that was executed (only for currently executing plans)

It's ordered by the transaction begin time. I had some trouble using CROSS APPLY with the sys.dm_exec_query_plan DMV - if the plan isn't available, it blows out the entire result-set for that transaction. After messing around for ten minutes I discovered the joys of the OUTER APPLY operator - which is the same as CROSS APPLY but allows NULL values from the function being cross-applied.

Also thanks to fellow MVP Aaron Bertrand (twitter|blog) for pointing out a mistake in the way I was calling sys.dm_exec_query_plan.

Here's the script with some example output:

SELECT s_tst.[session_id],
   s_es.[login_name] AS [Login Name],
   DB_NAME (s_tdt.database_id) AS [Database],

   s_tdt.[database_transaction_begin_time] AS [Begin Time],
   s_tdt.[database_transaction_log_record_count] AS [Log Records],
   s_tdt.[database_transaction_log_bytes_used] AS [Log Bytes],
   s_tdt.[database_transaction_log_bytes_reserved] AS [Log Rsvd],
   s_est.[text] AS [Last T-SQL Text],
   s_eqp.[query_plan] AS [Last Plan]
FROM sys.dm_tran_database_transactions s_tdt
   JOIN sys.dm_tran_session_transactions s_tst
      ON s_tst.[transaction_id] = s_tdt.[transaction_id]
   JOIN sys.[dm_exec_sessions] s_es
      ON s_es.[session_id] = s_tst.[session_id]
   JOIN sys.dm_exec_connections s_ec
      ON s_ec.[session_id] = s_tst.[session_id]
   LEFT OUTER JOIN sys.dm_exec_requests s_er
      ON s_er.[session_id] = s_tst.[session_id]
   CROSS APPLY sys.dm_exec_sql_text (s_ec.[most_recent_sql_handle]) AS s_est
   OUTER APPLY sys.dm_exec_query_plan (s_er.[plan_handle]) AS s_eqp
ORDER BY [Begin Time] ASC;
GO

session_id Login Name        Database Begin Time              Log Records Log Bytes Log Rsvd Last T-SQL Text                      Last Plan
---------- ----------------- -------- ----------------------- ----------- --------- -------- ------------------------------------ ---------
54         ROADRUNNERPR\paul foo      2010-02-01 15:28:48.560 2           236       8550     begin tran insert into t1 values (1) NULL
55         ROADRUNNERPR\paul foo      2010-02-01 16:38:18.373 3           356       8852     insert into t1 values (3)            NULL

One of the perennial problems facing both experienced and involuntary DBAs is how to tell whether a database is really in the FULL recovery model or not? This is complicated by the fact that when you switch a database into the FULL recovery mode, it actually behaves as if it's in the SIMPLE recovery mode until the log backup chain is established (this is commonly called being in 'pseudo-SIMPLE').

It's a problem for several reasons:

1) if the database is really in the FULL recovery model then log backups must be taken so the log can clear/truncate properly and it doesn't grow out of control

2) if the database is in the FULL recovery model but the log backup chain has been broken (or not established at all since the database was created) then log backups are not possible (except for the yuckiness in SQL 2000 when log backups would succeed without complaint but be totally useless during disaster recovery)

I don't know of any script to easily determine whether a database is really in the FULL recovery mode, so I knocked one together - and I present it here for you to use.

The trick to the script is finding the last LSN that's been backed up for the database. if this is non-NULL, then a log backup chain exists and the database is really in the FULL recovery mode. This is stored in the dbi_dbbackupLSN field in the database boot page (see Search Engine Q&A #20: Boot pages, and boot page corruption) but also nicely available in the DMV sys.database_recovery_status.

I've tested this on 2005 SP3 and 2008 SP1.

Note that this doesn't work on SQL 2000 - I've poked around and can't find a way to get at the LSN without reading the boot page directly, which can't be done gracefully inside a function - I'll leave that as an exercise for you. You'd expect the IsTruncLog property returned by DATABASEPROPERTY to be correct when the database is in pseudo-SIMPLE, but it's not unfortunately.

Here are some test cases for the script:

CREATE DATABASE SimpleModeDB;
CREATE DATABASE BulkLoggedModeDB;
CREATE DATABASE FullModeDB;
GO

ALTER DATABASE SimpleModeDB SET RECOVERY SIMPLE;
ALTER DATABASE BulkLoggedModeDB SET RECOVERY BULK_LOGGED;
ALTER DATABASE FullModeDB SET RECOVERY FULL;
GO

SELECT [Name], msdb.dbo.SQLSkillsIsReallyInFullRecovery ([Name]) AS 'ReallyInFULL'
FROM sys.databases
WHERE [Name] LIKE '%ModeDB';
GO

Name              ReallyInFULL
----------------- -------------
SimpleModeDB      0
BulkLoggedModeDB  0
FullModeDB        0

This makes sense - the new FullModeDB database is still in pseudo-SIMPLE. Now what if we take a full database backup?

BACKUP DATABASE FullModeDB To DISK='C:\SQLskills\FullModeDB.bck' WITH INIT;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

Processed 152 pages for database 'FullModeDB', file 'FullModeDB' on file 1.
Processed 1 pages for database 'FullModeDB', file 'FullModeDB_log' on file 1.
BACKUP DATABASE successfully processed 153 pages in 0.230 seconds (5.449 MB/sec).

ReallyInFULL
------------
1

Perfect. Now what about switching it back to SIMPLE and back to FULL again?

ALTER DATABASE FullModeDB SET RECOVERY SIMPLE;
ALTER DATABASE FullModeDB SET RECOVERY FULL;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

ReallyInFULL
------------
0

Just as we expect - the log backup chain has been broken and the database is back to pseudo-SIMPLE again.

Now what if we restart the log backup chain using a full database backup?

BACKUP DATABASE FullModeDB To DISK='C:\SQLskills\FullModeDB.bck' WITH INIT;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

Processed 152 pages for database 'FullModeDB', file 'FullModeDB' on file 1.
Processed 1 pages for database 'FullModeDB', file 'FullModeDB_log' on file 1.
BACKUP DATABASE successfully processed 153 pages in 0.095 seconds (13.193 MB/sec).

ReallyInFULL
------------
1

Perfect. Now what about if we break the chain and try to restart it using a differential database backup?

ALTER DATABASE FullModeDB SET RECOVERY SIMPLE;
ALTER DATABASE FullModeDB SET RECOVERY FULL;
GO
BACKUP DATABASE FullModeDB To DISK='C:\SQLskills\FullModeDB_diff.bck' WITH INIT, DIFFERENTIAL;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

Processed 40 pages for database 'FullModeDB', file 'FullModeDB' on file 1.
Processed 1 pages for database 'FullModeDB', file 'FullModeDB_log' on file 1.
BACKUP DATABASE WITH DIFFERENTIAL successfully processed 41 pages in 0.080 seconds (4.192 MB/sec).

ReallyInFULL
------------
1

Perfect - that works too, as I'd expect. You may wonder why a differential backup works - either a full or differential backup will work as they bridge the LSN gap since the last full or differential backup before the log backup chain was broken - both of these backups include transaction log - see More on how much transaction log a full backup includes.

And here's the script itself - enjoy!

/*============================================================================
   File: SQLskillsIsReallyInFullRecovery.sql

   Summary: This script creates a function in msdb that returns a BIT value of
   1 if the supplied database is really in the FULL recovery mode and not still
   in pseudo-SIMPLE.

   Date: October 2009

   SQL Server Versions:
      10.0.2531.00 (SS2008 SP1)
      9.00.4035.00 (SS2005 SP3)
------------------------------------------------------------------------------
   Copyright (C) 2009 Paul S. Randal, SQLskills.com
   All rights reserved.

   For more scripts and sample code, check out
     
http://www.sqlskills.com/

   You may alter this code for your own *non-commercial* purposes. You may
   republish altered code as long as you give due credit.

   THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF
   ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED
   TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
   PARTICULAR PURPOSE.
============================================================================*/

USE [msdb];
GO

IF EXISTS (SELECT * FROM sys.objects WHERE NAME = 'SQLskillsIsReallyInFullRecovery')
   DROP FUNCTION SQLskillsIsReallyInFullRecovery;
GO
 
CREATE FUNCTION SQLskillsIsReallyInFullRecovery (
   @DBName sysname)
RETURNS BIT
AS
BEGIN
  
DECLARE @IsReallyFull  BIT;
   DECLARE @LastLogBackupLSN NUMERIC (25,0);
   DECLARE @RecoveryModel  TINYINT;

   SELECT @LastLogBackupLSN = [last_log_backup_lsn]
   FROM sys.database_recovery_status
   WHERE [database_id] = DB_ID (@DBName);

   SELECT @RecoveryModel = [recovery_model]
   FROM sys.databases
   WHERE [database_id] = DB_ID (@DBName);

   IF (@RecoveryModel = 1 AND @LastLogBackupLSN IS NOT NULL)
      SELECT @IsReallyFull = 1
   ELSE
      SELECT @IsReallyFull = 0;

   RETURN (@IsReallyFull);
END;
GO

Yes, I'm a week late (at least) uploading these, and I apologize. But better late than never!

For those lucky 48 people who attended our week-long SQL class in Dublin at the end of September, I've uploaded all our demo scripts on our past conferences page. Let me know if you have any issues.

Enjoy!

Categories:
Classes | Example Scripts

This is a question that comes up every so often, most recently this morning while teaching a private class (and Kimberly's teaching now): how large is the forwarded record back-pointer? (And I haven't posted anything geeky for a while...)

In a heap it is possible to get forwarding and forwarded records. They occur when a record in a heap expands such that it no longer fits on the page it currently resides on. In this case, the record is moved to a new page, and a small forwarding record is left in the original location. The forwarding record points to the new location of the record, which is known as a forwarded record. This is done as a performance optimization so that all the nonclustered indexes on the heap do not have to be altered with the new location of the heap record.

As an aside, when a query uses a nonclustered index to satisfy a query, but needs more columns from the table record to fill the result set column list, it must go to the actual data record to retrieve the extra columns. It does this by using the table record locator that is stored in the nonclustered index record. If the table is a heap, the record locator is the physical location of the data record in the heap.

If the table is a clustered index (remember that a table can be organized as a heap OR a clustered index, not both), the record locator is the set of cluster keys of the data record. Both of these record locators are guaranteed to be unique. For a heap record locator, the record lookup (commonly called a bookmark lookup) goes directly to the physical location of the record. For a clustered index record locator, the record lookup uses the cluster keys to navigate down through the clustered index to the leaf level.

If a forwarding record occurs in a heap, when the record locator points to that location, the Storage Engine gets there and says Oh, the record isn't really here - it's over there! And then it has to do another (potentially physical) I/O to get to the page with the forwarded record on. This can result in a heap being less efficient that an equivalent clustered index.

There has been lots of discussion about whether it's better to have a heap or a clustered index - generally we recommend a clustered index, but there are special cases where a heap may be fine. This post isn't about that and I won't get drawn into a Comments argument about it. Go bug Kimberly :-)

Back to the point of the post. So the record has moved to a new location and there's a small record left in the original location which helps bookmark lookups.

Oh yeah, one more aside. The Storage Engine, when scanning the heap, will not process the forwarded record UNLESS it has reached it by following the forwarding record. This prevents race conditions where the record could be processed twice - if the forwarding operation occurs during a large table scan.

Ok, really back to the point of the post. What happens if the original record grows again and has to move again? Does it leave ANOTHER forwarding record when it moves to the second new location - creating a chain of forwarding records?

The answer is no. The *original* forwarding record is updated with the new location of the forwarded record. This can only be done if the forwarded record points *back* to the forwarding record - which it does.

The question becomes- how big is the forwarding record back-pointer? It's not dumped out by DBCC PAGE (sorry, when I rewrote DBCC PAGE for SQL 2005 I forgot to put that in there). Let's work it out using a script.

First off I'm going to create a database and table to play with.

CREATE DATABASE DBMaint2008;
GO
USE DBMaint2008;
GO

CREATE TABLE DbccPageTest (intCol1  INT IDENTITY,  intCol2  INT, vcharCol VARCHAR (8000),  lobCol  VARCHAR (MAX));
GO

INSERT INTO DbccPageTest VALUES (1, REPLICATE ('Row1', 600), REPLICATE ('Row1Lobs', 1000));
INSERT INTO DbccPageTest VALUES (2, REPLICATE ('Row2', 600), REPLICATE ('Row2Lobs', 1000));
GO

Using DBCC IND ('DBMaint2008', 'DbccPageTest', -1) , I find that the data page is page ID (1:154). If I dump out that page using DBCC PAGE, I can see both of the records fully contained on the page.

Now I'm going to update the second row to make it 8000+ bytes - forcing it to move to a new page.

UPDATE DbccPageTest SET vcharCol = REPLICATE ('LongRow2', 1000) WHERE intCol2 = 2;
GO

And now looking at page (1:154) with DBCC PAGE again, I can see that the second row has been replaced with:

Slot 1 Offset 0x137a Length 9

Record Type = FORWARDING_STUB        Record Attributes =
Memory Dump @0x66F4D37A

00000000:   049d0000 00010000 00†††††††††††††††††.........
Forwarding to  =  file 1 page 157 slot 0

The record has been replaced with a forwarding record - pointing to the new location of the record on page (1:157).

Now DBCC PAGE doesn't dump out the back-pointer in the forwarded record - so how can we tell how large it is? I'm going to see how large the record is, and then create a clustered index on the table. This will remove the back-pointer from the record without changing anything else about the record. The difference in size will be the back-pointer size.

Doing DBCC PAGE on page (1:157) I get (partial results):

Slot 0 Offset 0x60 Length 8057

Record Type = FORWARDED_RECORD       Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Now creating a clustered index and then dumping the page with the second row on it (left as an exercise for the reader :-)):

CREATE UNIQUE CLUSTERED INDEX Dbcc_CL ON DbccPageTest (intCol1);
GO

<figure out which page to look at>

DBCC PAGE ('DBMaint2008', 1, 169, 3);
GO

<partial results>
Slot 0 Offset 0x60 Length 8047

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Clearly the back-pointer is 10 bytes. And here's how it breaks down:

Hope this helps!

I came across a question on ServerFault this afternoon that inflamed my desire to be ultra-geeky (it was really already inflamed after teaching backup and restore internals all afternoon). Basically the question boiled down to how to find out who dropped a table if there's no other way except the transaction log (e.g. no tracing is available, even the default trace). So I hacked around and figured out at least how to find out *when* a particular table was dropped plus the UID and SPID of who dropped it.

Everything hinges on using undocumented commands to look into the transaction log. I've played with this before on the blog: fn_dblog.

First off I created a script to create a database, populate a table and then drop it.

USE master;
GO
CREATE DATABASE FnDbLogTest;
GO
USE FnDbLogTest;
GO

CREATE TABLE TestTable (
    c1 INT IDENTITY,
    c2 CHAR (100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX TT_CL ON TestTable (c1);
GO
INSERT INTO TestTable DEFAULT VALUES;
GO

SELECT OBJECT_ID ('TestTable');
GO

DROP TABLE TestTable;
GO

First we need to find the transactions that drop tables in the log. The following code works for 2000, 2005, and 2008:

SELECT [Transaction Id], [Begin Time], [UID], [SPID]
FROM ::fn_dblog (NULL, NULL)
WHERE [Transaction Name] = 'DROPOBJ'
GO

The (NULL, NULL) is the starting LSN and ending LSN to process - NULL means process everything available. 

Results on SQL Server 2000 will look as follows for me logging in as a Windows login in the SA role:

Transaction Id Begin Time              UID         SPID       
-------------- ----------------------- ----------- -----------
0000:000000e0  2009/06/16 18:23:03:320 1           51

And for the same circumstances on 2005 and 2008, the results look like:

Transaction Id Begin Time               UID         SPID
-------------- ------------------------ ----------- -----------
0000:00000587  2009/06/16 17:49:56:927  -1          51

If the user who dropped the table logged in as a role member, and they're not connected as that SPID any more, you may not be able to tell who it was unless you're also tracking successful logins into your server - but you'll at least know what role it was.

Now, this only shows us that a table was dropped, not which table it was. There's no way to get the name of the table that was dropped, only the object ID - so you'll need to have some other way to determine what the table ID is if there are multiple table drops and only one of them is malignant.

For SQL Server 2000, the code to find which object ID we're talking about is as follows (dropping the Transacation Id into the WHERE clause):

SELECT DISTINCT [Object Name] FROM ::fn_dblog (NULL, NULL)
WHERE [Transaction Id] = '0000:000000e0'
AND [Context] = 'LCX_IAM';
GO

Object Name     
--------------------
(2009058193)

The object ID in parentheses is the ID of the table that was dropped. 

For SQL Server 2005 and 2008, the code is as follows (with the same Transaction Id substitution):

SELECT TOP (1) [Lock Information] FROM ::fn_dblog (NULL, NULL)
WHERE [Transaction Id] = '0000:00000587'
AND [Lock Information] LIKE '%SCH_M OBJECT%';
GO

Lock Information
--------------------------------------------
ACQUIRE_LOCK_SCH_M OBJECT: 8:2073058421:0

The 8:2073058421 is the database ID and object ID of the table that was dropped.

Now you can go find whoever it was and take whatever action you deem appropriate Wink

Hope this helps!

PS If you find the you don't get enough info from ::fn_dblog, try turning on trace flag 2537. It allows the function to look at *all* possible log, not just the active log.

There's still a widely held misconception that when properly in the FULL or BULK_LOGGED recovery models that full or differential backups can truncate the log. No. It *NEVER* happens. This is one of the reasons why I'm doing a whole spotlight session on this at PASS this year - the transaction log and its behavior is IMHO one of the most misunderstood parts of SQL Server.

Notice that I said 'when properly in the FULL or BULK_LOGGED recovery models'. If you switch recovery models to FULL or BULK_LOGGED, until you take the first full backup, you are still essentially in the SIMPLE recovery model, and so the log will truncate on checkpoint. Once you take that first full backup, you are then in I-will-manage-the-size-of-the-log-through-log-backups mode. After that, the ONLY thing that will allow the log to clear/truncate is a log backup, as long as nothing else requires those transaction log records.

If you're not familiar with the term 'log clearing' or 'log truncating', they mean exactly the same thing - part of the transaction log is marked as no longer needed and can be overwritten. Nothing is zeroed out, the log file size is not altered. Some background reading on this:

Earlier this week I was involved in a discussion about log backup size management and how to prevent a log backup following a maintenance operation to not contain details of the maintenance operation.

There's a very simple answer: you can't.

If you do an operation in the FULL or BULK_LOGGED recovery models, the next log backup will contain all information required to replay that operation. In the FULL recovery model, everything is fully logged, so the log backup will contain all the log records generated by the operation. In the BULK_LOGGED recovery model, you may perform a minimally-logged operation, which generates hardly any transaction log, but the next log *backup* will be about the same size as if the operation was fully logged - because the log backup will pick up all the data extents modified by the minimally-logged operation.

One point in the discussion was that if you're running in the FULL or BULK_LOGGED recovery models, and you do a full backup after the maintenance operation, and before the log backup, the full backup will contain all the changes made by the maintenance operation, yes, and will clear the log.

No. Never.

A log backup is *ALL* the log generated since the last log backup. If this were not the case, how would log shipping work? You could take a full backup on the log shipping primary and suddenly you've broken the log backup chain and log shipping breaks. No, this is not how things work. A full backup contains only enough transaction log necessary to be able to restore that database to a transactionally consistent time - the time at which the data reading portion of the full backup completed. I blogged about this extensively previously:

But you don't have to believe me - it's very simple to convince yourself. The following script will show you that a full backup has no effect on the transaction log. It does the following:

  • Create a database and put it into the FULL recovery model, with a full backup.
  • Create and populate and index.
  • Take log backup 1 (just to clear things out)
  • Rebuild the index.
  • Take log backup 2.
  • Rebuild the index.
  • Take a full backup.
  • Take log backup 3.

And we will see that log backup #3 is the same size as log backup #2. The full backup will make no difference whatsoever.

Here's the script:

USE master;
GO
DROP DATABASE LogBackupTest;
GO
CREATE DATABASE LogBackupTest;
GO
USE LogBackupTest;
GO

ALTER DATABASE LogBackupTest SET RECOVERY FULL;
GO
BACKUP DATABASE LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Full1.bak' WITH INIT;
GO

CREATE TABLE t1 (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX t1c1 ON t1 (c1);
GO
SET NOCOUNT ON;
GO
INSERT INTO t1 DEFAULT VALUES;
GO 1000

BACKUP LOG LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Log1.bak' WITH INIT;
GO

-- Rebuild the index to generate some log and get a baseline
ALTER INDEX t1c1 ON t1 REBUILD;
GO
BACKUP LOG LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Log2.bak' WITH INIT;
GO

-- Now do it again, but take a full backup before the log backup
ALTER INDEX t1c1 ON t1 REBUILD;
GO
BACKUP DATABASE LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Full2.bak' WITH INIT;
GO
BACKUP LOG LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Log3.bak' WITH INIT;
GO

And here's the result:

 

Log backup #3 is the same size as log backup #2. It contains all the log generated since log backup #2 was taken. The full backup had no affect whatsoever, because that would break the log backup chain.

If you don't believe me, run the script yourself and you'll see. A full backup does not and cannot affect the transaction log.

About a year ago, I blogged a cool script that would work out how much of a database has changed since the last full backup - i.e. how big will the next differential backup be. You can find that script at New script: How much of the database has changed since the last full backup?. I'm in the middle of writing a script that will tell you how big your next log backup will be, and as part of it, I need a way to determine how many extents from minimally logged operations will be included in the next log backup. I thought it might be useful to someone so I tidied it up and here it is for you to play with. It's almost the same as the previous script I published.

The code below creates a function and a stored procedure. The basic idea behind the code is as follows:

For each online data file in the database
   For each GAM interval in the file
      Crack the ML map page using DBCC PAGE
      Interpret the ML bitmap to aggregate the changed extents
      Add the sum to the total changed extents for the database
   End
End
Report results

There's a function that I create in msdb call SQLskillsConvertToExtents that cracks some of the DBCC PAGE output, and the main procedure is called sp_SQLskillsMLChangedData and it created as a system object in master. I tried making it a table-valued function but you can't do things like INSERT-EXEC in a function, and that's required for processing the DBCC PAGE output. So - create your own wrapper function or whatever to use it. The interface/output is:

EXEC sp_SQLskillsMLChangedData 'testdb';
GO

Total Extents Changed Extents Percentage Changed
------------- --------------- ----------------------
355           153              43.1

Anyway - here it is. You can download it in a zip file from SQLskillsMLChangedData.zip (2.50KB). Enjoy!

/*============================================================================
   File: SQLskillsMLChangedData.sql

   Summary: This script creates a system-wide SP SQLskillsMLChangedData that
   works out what percentage of a database has been changed by minimally
   logged operations since the last log backup.

   Date: May 2009

   SQL Server Versions:
      10.0.2531.00 (SS2008 SP1)
      9.00.4035.00 (SS2005 SP3)
------------------------------------------------------------------------------
   Copyright (C) 2009 Paul S. Randal, SQLskills.com
   All rights reserved.

   For more scripts and sample code, check out
      http://www.sqlskills.com/

   You may alter this code for your own *non-commercial* purposes. You may
   republish altered code as long as you give due credit.

   THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF
   ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED
   TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
   PARTICULAR PURPOSE.
============================================================================*/

-- Create the function in MSDB
--
USE msdb;
GO

IF EXISTS (SELECT * FROM sys.objects WHERE NAME = 'SQLskillsConvertToExtents')
   DROP FUNCTION SQLskillsConvertToExtents;
GO

-- This function cracks the output from a DBCC PAGE dump
-- of an allocation bitmap. It takes a string in the form
-- "(1:8) - (1:16)" or "(1:8) -" and returns the number
-- of extents represented by the string. Both the examples
-- above equal 1 extent.
--
CREATE FUNCTION SQLskillsConvertToExtents (
   @extents VARCHAR (100))
RETURNS INTEGER
AS
BEGIN
   DECLARE @extentTotal INT;
   DECLARE @colon INT;
   DECLARE @firstExtent INT;
   DECLARE @secondExtent INT;
   SET @extentTotal = 0;
   SET @colon = CHARINDEX (':', @extents);

   -- Check for the single extent case
   --
   IF (CHARINDEX (':', @extents, @colon + 1) = 0)
      SET @extentTotal = 1;
   ELSE
      -- We're in the multi-extent case
      --
      BEGIN
      SET @firstExtent = CONVERT (INT,
         SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
      SET @colon = CHARINDEX (':', @extents, @colon + 1);
      SET @secondExtent = CONVERT (INT,
         SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
      SET @extentTotal = (@secondExtent - @firstExtent) / 8 + 1;
   END

RETURN @extentTotal;
END;
GO

USE master;
GO

IF OBJECT_ID ('sp_SQLskillsMLChangedData') IS NOT NULL
   DROP PROCEDURE sp_SQLskillsMLChangedData;
GO

-- This SP cracks all minamally-logged bitmap pages for all online
-- data files in a database. It creates a sum of changed extents
-- and reports it as follows (example small msdb):
--
-- EXEC sp_SQLskillsMLChangedData 'msdb';
-- GO
--
-- Total Extents Changed Extents Percentage Changed
-- ------------- --------------- ----------------------
-- 355           153             43.1
--
CREATE PROCEDURE sp_SQLskillsMLChangedData (
   @dbName VARCHAR (128))
AS
BEGIN
   SET NOCOUNT ON;

   -- Create the temp table
   --
   IF EXISTS (SELECT * FROM msdb.sys.objects WHERE NAME = 'SQLskillsDBCCPage')
      DROP TABLE msdb.dbo.SQLskillsDBCCPage;

   CREATE TABLE msdb.dbo.SQLskillsDBCCPage (
      [ParentObject] VARCHAR (100),
      [Object] VARCHAR (100),
      [Field] VARCHAR (100),
      [VALUE] VARCHAR (100));

   DECLARE @fileID INT;
   DECLARE @fileSizePages INT;
   DECLARE @extentID INT;
   DECLARE @pageID INT;
   DECLARE @MLTotal INT;
   DECLARE @sizeTotal INT;
   DECLARE @total INT;
   DECLARE @dbccPageString VARCHAR (200);

   SELECT @MLTotal = 0;
   SELECT @sizeTotal = 0;

   -- Setup a cursor for all online data files in the database
   --
   DECLARE files CURSOR FOR
      SELECT [file_id], [size] FROM master.sys.master_files
      WHERE [type_desc] = 'ROWS'
      AND [state_desc] = 'ONLINE'
      AND [database_id] = DB_ID (@dbName);

   OPEN files;

   FETCH NEXT FROM files INTO @fileID, @fileSizePages;

   WHILE @@FETCH_STATUS = 0
   BEGIN
      SELECT @extentID = 0;

      -- The size returned from master.sys.master_files is in
      -- pages - we need to convert to extents
      --
      SELECT @sizeTotal = @sizeTotal + @fileSizePages / 8;

      WHILE (@extentID < @fileSizePages)
      BEGIN
         -- There may be an issue with the ML map page position
         -- on the four extents where PFS pages and GAM pages live
         -- (at page IDs 516855552, 1033711104, 1550566656, 2067422208)
         -- but I think we'll be ok.
         -- PFS pages are every 8088 pages (page 1, 8088, 16176, etc)
         -- GAM extents are every 511232 pages
         --
         SELECT @pageID = @extentID + 7;

         -- Build the dynamic SQL
         --
         SELECT @dbccPageString = 'DBCC PAGE ('
            + @dbName + ', '
            + CAST (@fileID AS VARCHAR) + ', '
            + CAST (@pageID AS VARCHAR) + ', 3) WITH TABLERESULTS, NO_INFOMSGS';

         -- Empty out the temp table and insert into it again
         --
         DELETE FROM msdb.dbo.SQLskillsDBCCPage;
         INSERT INTO msdb.dbo.SQLskillsDBCCPage EXEC (@dbccPageString);

         -- Aggregate all the changed extents using the function
         --
         SELECT @total = SUM ([msdb].[dbo].[SQLskillsConvertToExtents] ([Field]))
         FROM msdb.dbo.SQLskillsDBCCPage
            WHERE [VALUE] = ' MIN_LOGGED'
            AND [ParentObject] LIKE 'ML_MAP%';

         SET @MLTotal = @MLTotal + @total;

         -- Move to the next GAM extent
         SET @extentID = @extentID + 511232;
      END

      FETCH NEXT FROM files INTO @fileID, @fileSizePages;
   END;

   -- Clean up
   --
   DROP TABLE msdb.dbo.SQLskillsDBCCPage;
   CLOSE files;
   DEALLOCATE files;

   -- Output the results
   --
   SELECT
      @sizeTotal AS [Total Extents],
      @MLTotal AS [Changed Extents],
      ROUND (
         (CONVERT (FLOAT, @MLTotal) /
         CONVERT (FLOAT, @sizeTotal)) * 100, 2) AS [Percentage Changed];
END;
GO

-- Mark the SP as a system object
--
EXEC sys.sp_MS_marksystemobject sp_SQLskillsMLChangedData;
GO

-- Test to make sure everything was setup correctly
--
EXEC sp_SQLskillsMLChangedData 'msdb';
GO

After teaching some of the MCM-SharePoint class last week, one of the attendees pointed me at a blog post about measuring churn in SharePoint databases. The poster gave code to measure how large full backups are, which really only measures how much data there is in the database, not whether existing data has changed. If two successive full database backups are the same size, there's no way to tell how much changed - and of course, you need to take a full backup to be able to tell whether the size changed.

A while ago I wrote a script that would enable SQL database DBAs to tell how big the next differential backup will be. A differential backup contains everything that's changed since the last full database backup, so every new and everything changed. Although this still doesn't show whether a single piece of existing content changed multiple times, it can still show whether existing content changed at least once. Better still, you don't need to take any kind of backup to run this script.

So, to get an idea of the churn rate of your content databases, check out this script - see New script: How much of the database has changed since the last full backup?.

Enjoy!

PS If you're using SharePoint and find this useful, please let me know and if there's any other scripts that would be useful - Kimberly and I are both starting to get more into SharePoint admin from a SQL perspective.

While I was teaching the MCM-Database class last week, we were discussing fragmentation and the effect of a high-order GUID key on an index. Without going into too many details, having a random GUID - as generated from the NEWID() - function is bad, but having one generated by NEWSEQUENTIALID() isn't anyway near so bad (I'll discuss the details more in the fragmentation series I'm starting). As part of the demo, we wanted to change the column default for the leading key of a table from NEWID() to NEWSEQUENTIALID() - problem was that none of us could remember the exact syntax, so we worked it out together. I thought it would make an interesting post, so here it is.

First off, here's my table with a poor clustered index key:

CREATE TABLE BadKeyTable (
    c1 UNIQUEIDENTIFIER DEFAULT NEWID () ROWGUIDCOL,
    c2 SMALLDATETIME DEFAULT GETDATE (),
    c3 CHAR (400) DEFAULT 'a',
    c4 VARCHAR(MAX) DEFAULT 'b');
GO
CREATE CLUSTERED INDEX BadKeyTable_CL ON BadKeyTable (c1);
GO

INSERT INTO BadKeyTable DEFAULT VALUES;
GO

(And you'll notice that I've given up doing nice colors in the T-SQL I post - it's too time consuming). The default we're interested in is in bold above. To change the default, we first need to find the constraint name so we can drop it. There are two queries you can use:

SELECT [name] FROM sys.objects
WHERE [parent_object_id] = OBJECT_ID ('BadKeyTable');
GO

or

SELECT [name] FROM sys.default_constraints
WHERE [parent_object_id] = OBJECT_ID ('BadKeyTable');
GO

The second is obviously the more supported way, as the first will return the names of all sub-objects of this table - all constraints, and all internal tables, such as XML indexes. The second query returns:

name
-------------------------------
DF__BadKeyTable__c1__7C8480AE
DF__BadKeyTable__c2__7D78A4E7
DF__BadKeyTable__c3__7E6CC920
DF__BadKeyTable__c4__7F60ED59

The constraint we're interested in is the one for the first column - DF__BadKeyTable__c1__7C8480AE. Now we need to drop the constraint and then add the new one as there's no way to simply alter the constraint in-place. We do that using:

ALTER TABLE BadKeyTable DROP CONSTRAINT DF__BadKeyTable__c1__7C8480AE;
GO

ALTER TABLE BadKeyTable ADD CONSTRAINT DF__BadKeyTable__c1
DEFAULT NEWSEQUENTIALID() FOR c1;
GO

And we're done.

Way back at the start of me blogging here I wrote a comprehensive description of ghost records and the ghost cleanup process - see Inside the Storage Engine: Ghost cleanup in depth. A question came up in the class I'm teaching this week that's worth answering in a blog post - do ghost records occur in heaps? The answer is no, not during normal processing.

When snapshot isolation is enabled, deletes from a heap are ghosted, as part of the overall versioning process, which can lead to some interesting side-effects. A versioned record has an extra 14-bytes tagged on the end, so a heap record that suddenly becomes versioned is 14-bytes longer - which may mean it doesn't fit on the page any longer. This could lead to it being moved, resulting in a forwarding/forwarded record pair - just because the record was deleted! Now, the page has to be full for this time happen, and the Storage Engine will take steps to avoid this happening for rows less than 32 bytes long - but that's getting a little too deep.

Anyway, I digress. I want to show you the difference between deleting from a clustered index and from a heap. I'm going to create two such tables, then delete row from each and roll it back.

CREATE TABLE t1 (c1 CHAR (10));
CREATE CLUSTERED INDEX t1c1 on t1 (c1);
GO

CREATE TABLE t2 (c1 CHAR (10));
GO

INSERT INTO t1 VALUES ('PAUL');
INSERT INTO t1 VALUES ('KIMBERLY');

INSERT INTO t2 VALUES ('PAUL');
INSERT INTO t2 VALUES ('KIMBERLY');
GO

-- prevent random background transactions
ALTER DATABASE GhostTest SET AUTO_CREATE_STATISTICS OFF;
GO

BEGIN TRAN DelFromClust;
DELETE FROM t1 WHERE c1='KIMBERLY';
ROLLBACK TRAN;
GO

BEGIN TRAN DelFromHeap;
DELETE FROM t2 WHERE c1='KIMBERLY';
ROLLBACK TRAN;
GO

SELECT * FROM ::fn_dblog (null, null);
GO

Here's a portion of the results from looking in the transaction log. The line of code where I turn off auto-update stats is just to prevent the auto-create transactions from cluttering up my view of the transaction log.

The first (highlighted) transaction is for the delete/rollback in the clustered index. You can clearly see that the third column shows a log context of ghosting for the LOP_DELETE_ROWS log record, plus the setting of the 'this page has at least one ghost record' in the PFS byte for that page.

The second (unhighlighted) transaction is for the delete/rollback in the heap. Here you can see that it just does a straight delete.

If you look at the data page contents before the rollback in both cases, for the clustered index you'll still be able to see the deleted (ghosted) record, and for the heap you'll see the deleted record really is deleted.

Hope this helps.

Ola Hallengren, who we meet every so often at SQL Connections, has a great script that helps automate consistency checking, backups, and index maintenance. He's constantly updating it and it's good quality code. Check out the latest version for  SQL Server 2005 and 2008:

Enjoy! 

After writing the FILESTREAM whitepaper for Microsoft, I've had lots of questions about the structure of the FILESTREAM data container. The FILESTREAM data container is the technical term for the NTFS directory structure where all the FILESTREAM data is stored.

When you want to use FILESTREAM data, you first add a filegroup (during or after database creation):

ALTER DATABASE FileStreamTestDB ADD FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM;
GO

And then add a 'file' to the filegroup:
ALTER DATABASE FileStreamTestDB ADD FILE (
     NAME = FSGroup1File, FILENAME = 'C:\Metro Labs\FileStreamTestDB\Documents')
TO FILEGROUP FileStreamGroup1;
GO

The 'file' is actually the pathname to what will become the root directory of the FILESTREAM data container. When it's initially created, it will contain a single file, filestream.hdr, and a single directory $FSLOG. Filestream.hdr is a metadata file describing the data container and the $FSLOG directory is the FILESTREAM equivalent of the database transaction log. You can think of them as equivalent, although the FILESTREAM log has some interesting semantics, which I'll cover in a separate post.

The question I most often get is: are all the FILESTREAM files for a database stored in one gigantic directory? The answer is no.

The root directory of the data container contains one sub-directory for each table (or each partition of a partitioned table). Each of those directories contains a further sub-directory for each FILESTREAM column defined in the table. An example is below, with the screen shot taken after running the following code:

CREATE TABLE FileStreamTest1 (
    DocId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
    DocName VARCHAR (25),
    Document VARBINARY(MAX) FILESTREAM);
GO

CREATE TABLE FileStreamTest2 (
    DocId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
    DocName VARCHAR (25),
    Document1 VARBINARY(MAX) FILESTREAM,
    Document2 VARBINARY(MAX) FILESTREAM);
GO

INSERT INTO FileStreamTest1 VALUES (NEWID (), 'Paul Randal', CAST ('SQLskills.com' AS VARBINARY(MAX)));
INSERT INTO FileStreamTest1 VALUES (NEWID (), 'Kimberly Tripp', CAST ('SQLskills.com' AS VARBINARY(MAX)));
GO

This image shows the FILESTREAM data container for our database that has two tables with FILESTREAM columns, each with a single partition. The first table has a two FILESTREAM columns and the second has a single FILESTREAM column. The filenames of all these directories are GUIDs. In the example, you can see two FILESTREAM files in a column-level directory. The FILESTREAM file names are actually the log-sequence number from the database transaction log at the time the files were created. You can correlate these by looking at the data with DBCC PAGE, but first finding the allocated pages using sp_AllocationMetadata (see this blog post):

EXEC sp_AllocationMetadata FileStreamTest1;
GO

Object Name     Index ID Alloc Unit ID     Alloc Unit Type   First Page Root Page First IAM Page
--------------- -------- ----------------- ---------------   ---------- --------- --------------
FileStreamTest1 0        72057594039697408 IN_ROW_DATA       (1:169)    (0:0)     (1:170)
FileStreamTest1 0        72057594039762944 ROW_OVERFLOW_DATA (0:0)      (0:0)     (0:0)
FileStreamTest1 2        72057594039828480 IN_ROW_DATA       (1:171)    (1:171)   (1:172)

(3 row(s) affected)

Notice there's a nonclustered index as well as the heap - that's the index that's enforcing the uniqueness constraint on the UNIQUEIDENTIFIER column. Now we can use DBCC PAGE to look at the first page of the heap, which will have out data records in:

DBCC TRACEON (3604);
DBCC PAGE (FileStreamTestDB, 1, 169, 3);
GO

<snip>

Slot 0 Offset 0x60 Length 88

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 88
Memory Dump @0x514EC060

00000000:   30001400 140d5047 2ca9f24f 874d35ca †0.....PG,©òO‡M5Ê
00000010:   e9e77649 03000002 00280058 80506175 †éçvI.....(.X.Pau
00000020:   6c205261 6e64616c 03000000 00000080 †l Randal........
00000030:   140d5047 2ca9f24f 874d35ca e9e77649 †..PG,©òO‡M5ÊéçvI
00000040:   01000000 68020000 00000000 17000000 †....h...........
00000050:   79000000 0c000000 †††††††††††††††††††y.......

Slot 0 Column 1 Offset 0x4 Length 16 Length (physical) 16

DocId = 47500d14-a92c-4ff2-874d-35cae9e77649

Slot 0 Column 2 Offset 0x1d Length 11 Length (physical) 11

DocName = Paul Randal

Document = [Filestream column] Slot 0 Column 3 Offset 0x28 Length 48

ColType = 3                          FileId = -2147483648                 UpdateSeq = 1
CreateLSN = 00000017:00000079:000c (23:121:12)                            TxFMiniVer = 0
XdesId = (0:616)

<snip>

You can see that the CreateLSN I've highlighted above matches the filename of the first FILESTREAM file in the example image.

Hopefully this explains how the FILESTREAM files are stored - more on this in the next post where I'll show how updates and garbage collection are implemented.

Back in 2005 Kimberly produced two very popular webcast series - an 11-part webcast series for TechNet called SQL Server 2005 for the IT Professional and a 10-part webcast series for MSDN called A Primer for Proper SQL Server Development. The webcast links and blog posts were broken for quite a while but now they're all fixed up and working again. I've created some web pages that link to all the webcasts and blog posts, along with abstracts. I've also included some more recent ones too and will be adding to the list over the next few weeks.

There's over 30 hours of good stuff to watch - check them out at http://www.sqlskills.com/webcasts.asp

In the TechNet Magazine article I wrote on Advanced Troubleshooting with Extended Events, I mentioned the always-on event session called system_health. Jonathan Kehayias, a fellow MVP and blogging mad-man, posted a great article with SQL Server Central today about how to get historical deadlock graph info from it. His article explains some of the pre-reqs for doing this (like installing 2008 RTM CU1 - see KB 956717 - to get it working and a bug with the XML output) and gives some code.

After wrestling with the download of CU1 (it took 5 tries to get it to download!), I got it installed in one of my VPCs and gave Jon's code a whirl. I forced a deadlock by creating two tables, locking them from separate transactions and then trying to select from the other table. See the code below:

-- Connection 1:
CREATE TABLE t1 (c1 INT);
INSERT INTO t1 VALUES (1);
GO

BEGIN TRAN
UPDATE t1 SET c1 = 2;
GO

-- Connection 2:
CREATE TABLE t2 (c1 INT);
INSERT INTO t2 VALUES (1);
GO

BEGIN TRAN
UPDATE t2 SET c1 = 2;
GO

-- Connection 1:
SELECT * FROM t2;
GO

-- Connection 2:
SELECT * FROM t1;
GO

And one of them will be killed by the deadlock monitor.

Msg 1205, Level 13, State 45, Line 1
Transaction (Process ID 57) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Now running Jon's (slightly reformatted) code:

SELECT CAST (
    REPLACE (
        REPLACE (
            XEventData.XEvent.value ('(data/value)[1]', 'varchar(max)'),
            '<victim-list>', '<deadlock><victim-list>'),
        '<process-list>', '</victim-list><process-list>')
    AS XML) AS DeadlockGraph
FROM (SELECT CAST (target_data AS XML) AS TargetData
    FROM sys.dm_xe_session_targets st
    JOIN sys.dm_xe_sessions s ON s.address = st.event_session_address
    WHERE [name] = 'system_health') AS Data
CROSS APPLY TargetData.nodes ('//RingBufferTarget/event') AS XEventData (XEvent)
    WHERE XEventData.XEvent.value('@name', 'varchar(4000)') = 'xml_deadlock_report';

Gives the XML deadlock graph (which unfortunately can't be displayed graphically because the format differs from the Profiler/trace-flag output)

<deadlock-list>
  <deadlock>
    <victim-list>
      <victimProcess id="process53d9000" />
    </victim-list>
    <process-list>
      <process id="process53d9000" taskpriority="0" logused="256" waitresource="RID: 1:1:304:0" waittime="331" ownerId="1868" transactionname="user_transaction" lasttranstarted="2009-02-23T13:32:05.100" XDES="0x4eb8c10" lockMode="S" schedulerid="1" kpid="2216" status="suspended" spid="57" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2009-02-23T13:32:33.093" lastbatchcompleted="2009-02-23T13:32:05.100" clientapp="Microsoft SQL Server Management Studio - Query" hostname="CHICAGO" hostpid="2552" loginname="CHICAGO\Administrator" isolationlevel="read committed (2)" xactid="1868" currentdb="1" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
        <executionStack>
          <frame procname="" line="1" sqlhandle="0x02000000c70b7b1daba27f2ddf3e772600949464d524b6f2" />
        </executionStack>
        <inputbuf>
SELECT * FROM t1;
    </inputbuf>
      </process>
      <process id="process4985558" taskpriority="0" logused="256" waitresource="RID: 1:1:294:0" waittime="10181" ownerId="1877" transactionname="user_transaction" lasttranstarted="2009-02-23T13:32:08.067" XDES="0x4eb9be0" lockMode="S" schedulerid="1" kpid="2020" status="suspended" spid="56" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2009-02-23T13:32:23.240" lastbatchcompleted="2009-02-23T13:32:08.067" clientapp="Microsoft SQL Server Management Studio - Query" hostname="CHICAGO" hostpid="2552" loginname="CHICAGO\Administrator" isolationlevel="read committed (2)" xactid="1877" currentdb="1" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
        <executionStack>
          <frame procname="" line="1" sqlhandle="0x02000000f4d34532698b7ad324df813feb2ba5024730cb79" />
        </executionStack>
        <inputbuf>
SELECT * FROM t2;
    </inputbuf>
      </process>
    </process-list>
    <resource-list>
      <ridlock fileid="1" pageid="304" dbid="1" objectname="" id="lock47aea80" mode="X" associatedObjectId="72057594039042048">
        <owner-list>
          <owner id="process4985558" mode="X" />
        </owner-list>
        <waiter-list>
          <waiter id="process53d9000" mode="S" requestType="wait" />
        </waiter-list>
      </ridlock>
      <ridlock fileid="1" pageid="294" dbid="1" objectname="" id="lock47af200" mode="X" associatedObjectId="72057594038976512">
        <owner-list>
          <owner id="process53d9000" mode="X" />
        </owner-list>
        <waiter-list>
          <waiter id="process4985558" mode="S" requestType="wait" />
        </waiter-list>
      </ridlock>
    </resource-list>
  </deadlock>
</deadlock-list>

Cool!

Wow - today is all about new content. As if I haven't already blogged about enough stuff to keep you reading through next week, the February issue of TechNet Magazine is now available and contains a feature article I wrote about understanding how logging and recovery work inside SQL Server.

The article covers:

  • What is logging?
  • What is recovery?
  • The transaction log (include logical and physical architecture)
  • Recovery models and how they affect the behavior of the transaction log

There's also a ten-minute screencast video where I demonstrate a runaway transaction log.

Check it out at http://technet.microsoft.com/en-us/magazine/2009.02.logging.aspx.

Over the last few weeks Sunil Agarwal (from the SQL Storage Engine team) has posted a great series of blog articles about tempdb and the version store, over on my old stomping ground - the Storage Engine blog. The articles are well worth reading - the links are:

Enjoy!

This blog post describes the demo "2 - NC Indexes" from my Corruption Survival Techniques conference session from various conferences in 2008. The links to the scripts and databases to use are in this blog post.

The aim of this demo is to show that sometimes its just redundant data (i.e. nonclustered indexes) that get corrupted, and so you don't have to do anything that takes the actual data offline - like restoring from a full backup or running one of the repair options (both of which require the database to be in single-user mode).

Let's look at an example. Extract and restore the DemoNCIndex database, and the NCIndexCorruption.sql script. What do we get from running DBCC CHECKDB on the DemoNCIndex database (lines 39-42)?

DBCC CHECKDB (DemoNCIndex) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 8951, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Data row does not have a matching index row in the index 'CustomerName' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Line 1
Data row (1:45:28) identified by (CustomerID = 29) with index values 'LastName = 'Adams' and CustomerID = 29'.
Msg 8951, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Data row does not have a matching index row in the index 'CustomerName' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Line 1
Data row (1:180:164) identified by (CustomerID = 2118) with index values 'LastName = 'Adams' and CustomerID = 2118'.

<snip - removed for brevity>

Msg 8952, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Index row in index 'CustomerName' (ID 2) does not match any data row. Possible extra or invalid keys for:
Msg 8956, Level 16, State 1, Line 1
Index row (1:24482:16) with values (LastName = 'Andersen' and CustomerID = 18718) pointing to the data row identified by (CustomerID = 18718).
Msg 8952, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Index row in index 'CustomerName' (ID 2) does not match any data row. Possible extra or invalid keys for:
Msg 8956, Level 16, State 1, Line 1
Index row (1:24482:127) with values (LastName = 'Arthur' and CustomerID = 9758) pointing to the data row identified by (CustomerID = 9758).
CHECKDB found 0 allocation errors and 26 consistency errors in table 'Customers' (object ID 453576654).
CHECKDB found 0 allocation errors and 26 consistency errors in database 'DemoNCIndex'
repair_rebuild is the minimum repair level for the errors found by DBCC CHECKDB (DemoNCIndex).
 

Lots of errors. Now, in this example there are only 26 errors, but in cases where there are hundreds of errors it can be really hard to tell whether all the corruptions are in nonclustered indexes (i.e. indexes with IDs > 1). Luckily, there's an undocumented option to all the DBCC CHECK* commands - WITH TABLERESULTS. The option is undocumented because the output can change from release to release, but basically this converts the DBCC output into tabular form. Try running lines 48-50 in the script and you'll see what I mean. One of the columns in the output is IndexId - so you can easily see whether all the errors are in nonclustered indexes. In this case, they are, and all in one index of the Customers table.

You could run lines 55-57 of the script to prove to yourself that repairs can't be run online, and then realize that we can address the problem without having to run repair or restore. First off we need to figure out the name of the index to fix - index ID 2 of the Customers table. Lines 77-80 run sp_HelpIndex on the table (although I should really be using Kimberly's sp_HelpIndex2):

USE DemoNCIndex
GO
EXEC sp_HelpIndex 'Customers';
GO

index_name     index_description                                   index_keys
-------------- --------------------------------------------------- ------------
CustomerName   nonclustered located on PRIMARY                     LastName
CustomerPK     clustered, unique, primary key located on PRIMARY   CustomerID

The nonclustered index is called CustomerName. Plug in the index name to line 82, then try fixing the index by doing an online index rebuild, and run DBCC CHECKDB afterwards (lines 82-89). The corruption hasn't been fixed! Online index rebuild reads the old index to build the new one so the new index has the same missing rows as the old one. We need to do an offline index rebuild - with lines 110-115. After the last DBCC CHECKDB, the index is fixed up. Now, on SQL Server 2008, you may or may not get a query plan for the index rebuild that doesn't use the old index, because the query optimizer has some more plan choices available to it - so on SQL Server 2008 you may need to do an actual drop and create of the broken index (carefully, if its enforcing a constraint).

So - just because DBCC CHECKDB reports a ton of errors, that doesn't necessarily mean that the database needs to be taken (essentially) offline to repair it - check through the errors to see if it's just nonclustered indexes that are affected.

For those of you who couldn't make it to a conference this year where I presented my Corruption Survival Techniques session, the folks at TechEd EMEA have just posted an 80 minute long video of the presentation I did in Barcelona in early November. It walks through I/O errors, what CHECKDB does, how it works, how to run it, CHECKDB FAQ, how to interpret the output, choosing between repair and restore and has a bunch of demos of recovering from corruptions. Lots of fun stuff!

The video is available at http://www.microsoft.com/emea/teched2008/itpro/tv/default.aspx?vid=78. The accompanying scripts and corrupt databases are all posted on our website - see this blog post for details.

Enjoy!

This blog post describes the demo "1 - Fatal Errors" from my Corruption Survival Techniques conference session from various conferences in 2008. The links to the scripts and databases to use are in this blog post.

The aim of this demo is to show that sometimes a database is so corrupt that DBCC CHECKDB just cannot run on it. In that case, there's no way to force DBCC CHECKDB to get past the fatal corruption and so there's no way to run a repair either - you're looking at restoring from a backup or at worst, extracting as much data as possible into a new database.

Let's look at a couple of examples. Extract and restore the DemoFatalCorruption1 and DemoFatalCorruption2 databases, and the FatalErrors.sql script. What do we get from running DBCC CHECKDB on DemoFatalCorruption1 (lines 47-49 in the script)?

DBCC CHECKDB (DemoFatalCorruption1)
WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 8928, Level 16, State 6, Line 1
Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown): Page (1:71) could not be processed. See other errors for details.
CHECKDB found 1 allocation errors and 0 consistency errors not associated with any single object.
Msg 8906, Level 16, State 1, Line 1
Page (1:19) in database ID 8 is allocated in the SGAM (1:3) and PFS (1:1), but was not allocated in any IAM. PFS flags 'MIXED_EXT ALLOCATED 0_PCT_FULL'.
Msg 2575, Level 16, State 1, Line 1
The Index Allocation Map (IAM) page (1:71) is pointed to by the next pointer of IAM page (0:0) in object ID 15, index ID 1, partition ID 983040, alloc unit ID 983040 (type In-row data), but it was not detected in the scan.
Msg 7965, Level 16, State 2, Line 1
Table error: Could not check object ID 15, index ID 1, partition ID 983040, alloc unit ID 983040 (type In-row data) due to invalid allocation (IAM) page(s).
Msg 8906, Level 16, State 1, Line 1
Page (1:71) in database ID 8 is allocated in the SGAM (1:3) and PFS (1:1), but was not allocated in any IAM. PFS flags 'IAM_PG MIXED_EXT ALLOCATED 0_PCT_FULL'.
Msg 8939, Level 16, State 5, Line 1
Table error: Object ID 15, index ID 1, partition ID 983040, alloc unit ID 983040 (type In-row data), page (1:71). Test (m_headerVersion == HEADER_7_0) failed. Values are 0 and 1.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 15, index ID 1, partition ID 983040, alloc unit ID 983040 (type In-row data), page (1:71). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 0 and 0.
Msg 8939, Level 16, State 5, Line 1
Table error: Object ID 15, index ID 1, partition ID 983040, alloc unit ID 983040 (type In-row data), page (1:71). Test (m_headerVersion == HEADER_7_0) failed. Values are 0 and 1.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 15, index ID 1, partition ID 983040, alloc unit ID 983040 (type In-row data), page (1:71). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 0 and 0.
CHECKDB found 5 allocation errors and 3 consistency errors in table 'sys.syshobts' (object ID 15).
Msg 7995, Level 16, State 1, Line 1
Database 'DemoFatalCorruption1': consistency errors in system catalogs prevent further DBCC checkdb processing.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'ALLOCATION' (object ID 99).
CHECKDB found 6 allocation errors and 4 consistency errors in database 'DemoFatalCorruption1'.

A bunch of errors that look like regular DBCC CHECKDB output - but if you look carefully near the end of the output you'll see error 7995 stating that the system catalogs are so corrupt that DBCC CHECKDB can't continue. Notice also that there's nothing at the end of the output stating what the minimum repair level is to fix the errors - because repair cannot be run on this database.

The second example is even worse (running lines 53-55 in the script):

DBCC CHECKDB (DemoFatalCorruption2)
WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 211, Level 23, State 51, Line 1
Possible schema corruption. Run DBCC CHECKCATALOG.

In this case, the corruption is so bad that DBCC CHECKDB didn't even get a chance to terminate gracefully - the metadata subsystem in the Query Processor just blew away the whole command. Running DBCC CHECKCATALOG as the error message states doesn't do any better - it just prints the same error! (I didn't write that error message Wink)

So - just because DBCC CHECKDB completes, doesn't always mean it completes successfully. Make sure you always check the output.

This blog post explains the demo scripts and databases I've posted to cover all the Corruption Survival Techniques and DBCC CHECKDB sessions I've presented at conferences this year. There are two zip files you need to download: the example corrupt databases (36-MB zip) and the demo scripts. These are also both available through our past events pages.

The databases zip contains the following databases:

  • DemoDataPurity
    • 192-MB SQL Server 2005 database with a single 2570 (data purity) error
  • DemoFatalCorruption1
    • 1-MB SQL Server 2005 database with a corrupt system table (that allows CHECKDB to complete)
  • DemoFatalCorruption2
    • 1-MB SQL Server 2005 database with a corrupt system table (that terminates CHECKDB)
  • DemoNCIndex
    • 192-MB SQL Server 2005 database with a bunch of nonclustered index corruptions
  • DemoRestoreOrRepair
    • 1-MB SQL Server 2005 database with a page checksum failure (in fact a zero'd out page)
  • DemoCorruptMetadata
    • 1-MB SQL Server *2000* database with corrupt syscolumns table

The scripts zip contains the following directories:

  • 1 - Fatal Errors
    • This makes use of the DemoFatalCorruption1 and DemoFatalCorruption2 databases. The FatalErrors.sql script has the steps to follow and see this blog post for a complete walk-through.
  • 2 - NC Indexes
    • This makes use of the DemoNCIndex database. The NCIndexCorruption.sql script has the steps to follow and see this blog post for a complete walk-through.
  • 3 - Data Purity
    • This makes use of the DemoDataPurity database. The DataPurityCorruption.sql script has the steps to follow. I'll blog more details on this script sometime in the next week (and then update this post).
  • 4 - Metadata
    • This makes use of the DemoCorruptMetadata database. The CorruptMetadata.sql script has the steps to follow and see this blog post for a complete walk-through.
  • 5 - Restore or Repair
    • This makes use of the DemoRestoreOrRepair database. There's a setup script (01CreateRestoreOrRepair.sql) and two demo scripts - FixUsingPageRestore.sql and FixUsingRepair.sql. I'll blog more details on these scripts sometime in the next week (and then update this post).
  • 6 - Suspect Database
    • This demo doesn't come with a corrupt database - you create your own one. The SuspectDatabase.sql script has the steps to follow and see this blog post for a complete walk-through.

Enjoy!

It's been a long time since the last Conference Questions Pot-Pourri - in fact it was at the last SQL Connections in Orlando in April. Now we're in Las Vegas doing SQL Connections Fall - Kimberly's lecturing for an hour on partitioning so I can get out a quick post.

This is a question that came up yesterday - can CREATE INDEX ... WITH DROP_EXISTING be used to move indexes that enforce constraints? Let's check it out.

First up I'm going to create a couple of tables. Table t1 has a unique constraint backed by a nonclustered index. Table t1 has a primary key constraint backed by a clustered index.

CREATE DATABASE ConstraintTest;
GO
USE ConstraintTest;
GO

CREATE TABLE UniqueConstraint (c1 INT UNIQUE NONCLUSTERED);
GO
INSERT INTO UniqueConstraint VALUES (1);
GO

CREATE TABLE PrimaryKeyConstraint (c2 INT PRIMARY KEY CLUSTERED);
GO
INSERT INTO PrimaryKeyConstraint VALUES (1);
GO

(Oops - in the original post I c&p'd the wrong code and had the second table as a unique nonclustered constraint too - sorry for the mixup) 

Now I'll add another filegroup that we'll try to move the indexes into.

ALTER DATABASE ConstraintTest ADD FILEGROUP ExtraFilegroup;
GO

ALTER DATABASE ConstraintTest ADD FILE (
   NAME = ExtraFile1,
   FILENAME = 'C:\SQLskills\ExtraFile1.ndf',
   SIZE = 5MB)
TO FILEGROUP ExtraFilegroup;
GO

Now I'll try moving the nonclustered index enforcing the unique constraint.

SELECT [name], [index_id] FROM sys.indexes
WHERE [object_id] = OBJECT_ID ('UniqueConstraint')
GO

name                           index_id
------------------------------ --------
NULL                           0
UQ__UniqueConstraint__7C8480AE 2

The index we want is index ID=2, so we should be able to move it as follows:

CREATE UNIQUE NONCLUSTERED INDEX UQ__UniqueConstraint__7C8480AE
ON UniqueConstraint (c1) WITH (DROP_EXISTING = ON)
ON ExtraFilegroup;
GO

No problem! Now let's try the same thing for the clustered primary key constraint.

SELECT [name], [index_id] FROM sys.indexes
WHERE [object_id] = OBJECT_ID ('PrimaryKeyConstraint')
GO

name                           index_id
------------------------------ --------
PK__PrimaryKeyConstr__7E6CC920 1

We only have one choice, so we should be able to rebuild it using:

CREATE CLUSTERED INDEX PK__PrimaryKeyConstr__7E6CC920
ON PrimaryKeyConstraint (c2) WITH (DROP_EXISTING = ON)
ON ExtraFilegroup;
GO

Msg 1907, Level 16, State 1, Line 1
Cannot recreate index 'PK__PrimaryKeyConstr__7E6CC920'. The new index definition does not match the constraint being enforced by the existing index.

Hmm - what am I missing? I'll check sys.indexes again and not filter the columns this time:

SELECT * FROM sys.indexes
WHERE [object_id] = OBJECT_ID ('PrimaryKeyConstraint')
GO

object_id  name                           index_id type type_desc is_unique
---------- ------------------------------ -------- ---- --------- ---------
2105058535 PK__PrimaryKeyConstr__7E6CC920 1        1    CLUSTERED 1

data_space_id ignore_dup_key is_primary_key is_unique_constraint fill_factor
------------- -------------- -------------- -------------------- -----------
1             0              1              0                    0

is_padded is_disabled is_hypothetical allow_row_locks allow_page_locks
--------- ----------- --------------- --------------- ----------------
0
         0           0               1               1

Ah - I was missing the unique attribute on the index. What I need to do is the following:

CREATE UNIQUE CLUSTERED INDEX PK__PrimaryKeyConstr__7E6CC920
ON PrimaryKeyConstraint (c2) WITH (DROP_EXISTING = ON)
ON ExtraFilegroup;
GO

And that works fine. So - the answer is yes, you can use CREATE INDEX ... WITH DROP_EXISTING to move indexes that enforce constraints. This is really good, as it allows you to move these indexes without having to do DROP INDEX then CREATE INDEX. That method is bad, as it creates a window of opportunity for someone to enter data that violates the constraint while the index is dropped.

This is a quick post inspired by a question I was sent in email (thanks Marcos!) which very neatly lets me show a DMV I've been meaning to blog about for a while. And the weather here in Redmond really sucks right now so I can't go outside - blogging will serve as my work-avoidance strategy this afternoon :-).

The (paraphrased) question is:

A checkpoint is a process that writes all dirty pages to disk, and is per-database. So, if the data cache can hold a page from any database, how does checkpoint know which pages to check for a dirty status? Does it scan through buffer pool looking for pages for database X and process only those? Or is data cache somehow partitioned by database? I'd like to know a bit better how it works under the covers.

The answer is that pages are stored in buffers in the buffer pool (aka buffer cache or data cache), and the buffers are indeed hashed so they can easily be found by database. You can see what pages are currently in the buffer pool, and their status using the sys.dm_os_buffer_descriptors DMV in 2005:

SELECT * FROM sys.dm_os_buffer_descriptors;
GO

database_id file_id  page_id  page_level  allocation_unit_id   page_type      row_count   free_space_in_bytes is_modified
----------- -------- -------- ----------- -------------------- -------------- ----------- ------------------- -----------
1           1        9        0           6488064              BOOT_PAGE      1           7362                0
1           1        6        0           6488064              DIFF_MAP_PAGE  2           6                   0
1           1        7        0           6488064              ML_MAP_PAGE    2           6                   0
1           1        104      0           262144               DATA_PAGE      100         4196                0
1           1        105      0           851968               DATA_PAGE      65          5041                0
1           1        106      0           262144               DATA_PAGE      197         413                 0
1           1        107      0           262144               DATA_PAGE      207         23                  0
1           1        108      1           262144               INDEX_PAGE     7           7949                0
.
.

I cut off the output rather than list all 3258 pages in the buffer pool on my laptop. The DMV gives you back some info from the pages themselves as well as you can see(remember all this is in memory so it's quick to find).

I played around with the DMV a little bit and came up with a neat script that will tell you may many clean and dirty pages there are in the buffer pool per-database.

SELECT
   
(CASE WHEN ([is_modified] = 1) THEN 'Dirty' ELSE 'Clean' END) AS 'Page State',
   
(CASE WHEN ([database_id] = 32767) THEN 'Resource Database' ELSE DB_NAME (database_id) END) AS 'Database Name',
   
COUNT (*) AS 'Page Count'
FROM sys.dm_os_buffer_descriptors
   
GROUP BY [database_id], [is_modified]
   
ORDER BY [database_id], [is_modified];
GO

Page State Database Name             Page Count
---------- ------------------------- -----------
Clean      master                    302
Dirty      master                    1
Clean      tempdb                    88
Dirty      tempdb                    52
Clean      model                     56
Clean      msdb                      622
Dirty      msdb                      5
Clean      adventureworks            110
Clean      DemoRestoreOrRepair       64
Clean      DBMaint2008               88
Clean      DemoFatalCorruption1      64
Clean      DemoFatalCorruption2      64
Clean      broken                    64
Clean      DemoFatalCorruption3      64
Clean      DemoCorruptMetadata       111
Clean      DemoDataPurity            88
Clean      SalesDB                   123
Clean      DemoNCIndex               88
Clean      shrinktest                88
Clean      DemoRestoreOrRepairCopy   64
Clean      DemoSuspect               64
Clean      FileHeaderTest            96
Clean      MultiFileDB               96
Clean      HA2008                    88
Clean      SalesDB_Snapshot          21
Clean      BootPageTest              88
Clean      Resource Database         599

Later this week I'll try to blog a script that can tell you how much of a particular table is in memory. Enjoy!

I woke up this morning and someone had replaced my wife with someone who likes to blog :-). Kimberly's turned over a new leaf and is going to blog much more often - in fact she's blogged 4 times today already. Check out her blog here.

Continuing on the transaction log theme of the last few Search Engine Q&A posts, this one addresses a question I've heard a few times, most recently on an MVP discussion group. Let me paraphrase:

If I have a transaction that inserts a huge amount of data, the transaction log grows to 50-GB. I then rollback the transaction. When I take the next log backup, it's way smaller than 50-GB. What's going on?

Let's see if we can repro the scenario. I've created a database with a 500-MB data file and a 1-MB log file, with 100-MB and 1-MB auto-growth intervals. I want the log to be as small as possible and to grow in small chunks so I can see just how much it *has* to grow by, rather than having a large growth size. Then I set the recovery mode to full and took a database backup to make sure the log won't truncate until it's backed up.

CREATE DATABASE LogSizeTest ON
   
(NAME = N'LogSizeTest',
   
FILENAME = N'C:\SQLskills\LogSizeTest.mdf',
   
SIZE = 512MB,
   
FILEGROWTH = 100MB)
LOG ON 
   
(NAME = N'LogSizeTest_log',
   
FILENAME = N'C:\SQLskills\LogSizeTest_log.ldf',
   SIZE = 1MB,
   
FILEGROWTH = 1MB);
GO

ALTER DATABASE LogSizeTest SET RECOVERY FULL;
GO

BACKUP DATABASE LogSizeTest TO DISK = 'C:\SQLskills\LogSizeTest.bak';
GO

Let's check the size of the log:

DBCC SQLPERF (LOGSPACE);
GO

Database Name  Log Size (MB) Log Space Used (%) Status
-------------- ------------- ------------------ -----------
LogSizeTest    0.9921875     36.66339           0

This gives back info for all databases, I've trimmed down the output just for the LogSizeTest database.

Now I'm going to create a table, start an explicit transaction and add about 500-MB of info to the table.

USE LogSizeTest;
GO
SET NOCOUNT ON;
GO
CREATE
TABLE Test (c1 INT IDENTITY, C2 CHAR (8000) DEFAULT (REPLICATE ('a', 8000)));
GO

BEGIN TRAN;
GO

DECLARE @count INT;
SELECT @count = 0;
WHILE (@count < 64000)
BEGIN
   
INSERT INTO Test DEFAULT VALUES;
   
SELECT @count = @count + 1;
END;
GO

Checking the log file size again gives:

DBCC SQLPERF (LOGSPACE);
GO

Database Name  Log Size (MB) Log Space Used (%) Status
-------------- ------------- ------------------ -----------
LogSizeTest    703.9922      99.98737           0

The log size has grown to about 700-MB, way more than the size of the data I was inserting, and it's completely full. Now let's rollback the transaction and check the log size again.

ROLLBACK TRAN;
GO

DBCC SQLPERF (LOGSPACE);
GO

Database Name  Log Size (MB) Log Space Used (%) Status
-------------- ------------- ------------------ -----------
LogSizeTest    703.9922      85.21268           0

The size of the log file is the same, but the percentage used has actually gone down! How can that happen? Let's take a backup and checkout it's size:

BACKUP LOG LogSizeTest TO DISK = 'C:\SQLskills\LogSizeTest_log.bck';
GO
RESTORE HEADERONLY FROM DISK = 'C:\SQLskills\LogSizeTest_log.bck';
GO

The BackupSize in the output from the RESTORE HEADERONLY is 631454208, which is 602.2-MB. Taking the numbers from the DBCC SQLPERF output above, 85.21268% of 703.9922-MB is 599.89-MB - so the backup is roughly the same size as the used transaction log. That's what I'd expect, but why is it smaller than the total size of the transaction log?

So what's going on? Why did the transaction log need to grow so much larger than it needed to, and why did the percentage used actually *drop* after the transaction rolled back?

The answer is in the way the transaction log works. Whenever a logged operation occurs in a transaction, there is some transaction log space reserved in case the transaction rolls back. The idea is that there's always enough space available in the transaction log for a transaction to roll back, without having to grow the transaction log and potentially have that fail. If a transaction could not roll back successfully because the log didn't have enough space, the database would become transactionally inconsistent, would be taken offline and the state changed to SUSPECT.

The behavior we saw was the Storage Engine reserving transaction log space for a potential roll back. When the roll back occured, the transaction log records necessary to undo the effects of the transaction (called compensation log records) are created and written to the log. The issue is that they usually don't take up as much space as the Storage Engine reserved, as it tends to be very conservative in its estimates of how much log space to reserve, to avoid the potential for SUSPECT databases. This explains the difference between the various sizes and percentages we saw above.

The Storage Engine code to do the reservations is quite interesting - I remember fixing a couple of bugs in it during SQL Server 2000 development in 1999 while I was getting to know the internals of the logging and recovery system before tackling some of the (since removed) log-reading code in DBCC CHECKDB in SQL Server 2000.

Anyway, there you have it. Log space reservation is the answer, and is also one of the reasons why it can be tricky to estimate how large a transaction log should be when a database is created.

As you may already know, instant initialization is a way to prevent data (not log) file create and grow operations having to zero-initialize the new space before allowing it to be used. This can vastly speed up these operations as zero-initialization can take a *long* time for large files. This is especially useful in disaster recovery operations as the first phase of a restore is always to create the requisite files, if they don't already exist. Cutting minutes or even hours from this phase can significantly reduce downtime. It's available on XP SP2 and Windows Server 2003 and above. You can get more details from a blog post of Kimberly's from March 2007.

The way to enable it is to give the SQL Server service account the 'Perform volume maintenance tasks' privilege and then restart the service. There's no way to enable or disable it from within SQL Server, and until now, no way I've known of to tell whether it's enabled from within SQL Server. I was teaching a Microsoft-internal class on Database Maintenance last week and one of the students came up with a neat way to tell - using xp_cmdshell to execute the whoami /priv command, which lists all the privileges that SQL Server service account has.

The whoami command is available on Windows Server machines but for XP you need to download the support tools from Microsoft to get it to work, as I did on my laptop. You can get them here (5MB download) and you need to stop/start SQL Server on XP after installing them so it picks up the new tools path. Now, most people will be running with xp_cmdshell turned off, because of the security risks involved with enabling it, so here's a script that turns it on, checks the privileges, and turns it off again.

EXEC sp_configure 'xp_cmdshell', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

CREATE TABLE #xp_cmdshell_output (Output VARCHAR (8000));
GO

INSERT INTO #xp_cmdshell_output EXEC ('xp_cmdshell ''whoami /priv''');
GO

IF EXISTS (SELECT * FROM #xp_cmdshell_output WHERE Output LIKE '%SeManageVolumePrivilege%')
   
PRINT 'Instant Initialization enabled'
ELSE
   
PRINT 'Instant Initialization disabled';
GO

DROP TABLE #xp_cmdshell_output;
GO

EXEC sp_configure 'xp_cmdshell', 0;
GO
RECONFIGURE WITH OVERRIDE;
GO

Enjoy!

Today I presented my brand new session Surviving Corruption: From Detection to Recovery at TechEd. I had a lot of fun putting together the demos, presenting the session, and talking to people afterwards. During the session, I promised to blog each of the demos so that everyone can run through them - here's the first one.

On SQL 2000, it was pretty easy to get into the system tables and manually change them - all you had to do was:

EXEC sp_configure 'allow updates', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

And then you could insert, update, and delete whatever you wanted in the all the system tables, including the critical three - sysindexes, sysobjects, and syscolumns. The problem was that sometimes people actually did this and messed things up - for instance, by manually deleting an object from sysobjects, but leaving around all the other info about the object - such as indexes and columns. DBCC CHECKCATALOG in SQL 2000 would find this, but DBCC CHECKDB would not - as it didn't run the DBCC CHECKCATALOG code - any most people do not run DBCC CHECKCATALOG at all. Many times now, I've seen databases upgraded to 2005 and suddenly DBCC CHECKDB is reporting metadata corruption errors - all because someone had manually changed the system tables on 2000, and I changed DBCC CHECKDB in 2005 to include the DBCC CHECKCATALOG checks.

This demo is all about that. I created a 2000 database, manually deleted a row in sysobjects and then upgraded the database to 2005. The corrupt database is available in a zip file - DemoCorruptMetadata.zip. If you unzip it into a folder C:\SQLskills then you can attach it using:

RESTORE DATABASE DemoCorruptMetadata FROM DISK = 'C:\SQLskills\DemoCorruptMetadata.bak'
   
WITH MOVE 'DemoCorruptMetadata' TO 'C:\SQLskills\DemoCorruptMetadata.mdf',
   
MOVE 'DemoCorruptMetadata_log' TO 'C:\SQLskills\DemoCorruptMetadata_log.ldf',
   
REPLACE;
GO

So what does the corruption look like on 2005?

DBCC CHECKDB (DemoCorruptMetadata) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=1) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=2) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
CHECKDB found 0 allocation errors and 2 consistency errors in database 'DemoCorruptMetadata'.

This is what we expect. Notice that there's no recommended repair level at the end of the output - this is because CHECKDB can't repair metadata corruptions. We can't fix this with a backup - unless we have a backup from 2000 from before the manual delete in the system tables. To fix this we'd need to go back to 2000, fix the corruption, and then upgrade again - usually not feasible.

Instead, we're going to fix it by manually altering the system tables in 2005 - something that's purportedly not possible. First let's see what tables there are that could include column information (remembering that the system catalogs were completely rewritten between 2000 and 2005):

SELECT [name] FROM DemoCorruptMetadata.sys.objects WHERE [name] LIKE '%col%';
GO

name
------------------
sysrowsetcolumns
syshobtcolumns
syscolpars
sysiscols

I know that sysrowsetcolumns and syshobtcolumns are involved at low-levels of the Storage Engine and don't contain relational metadata, so let's try syscolpars. I want to see what columns there are to see if one of the looks like an object ID, and another looks like a column ID. This query will just return the table columns, with no rows (because the condition 1=0 is always false:

SELECT * FROM DemoCorruptMetadata.sys.syscolpars WHERE 1 = 0;
GO

Msg 208, Level 16, State 1, Line 1
Invalid object name 'DemoCorruptMetadata.sys.syscolpars'.

I can't bind to internal system tables in 2005. But - I can bind to internal system tables using the Dedicated Admind Connection (or DAC for short). This is documented in Books Online at http://msdn.microsoft.com/en-us/library/ms179503.aspx. You can get to the DAC through SQLCMD using the /A switch. So - assuming I'm now connected through the DAC, I'll try that command again:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> SELECT * FROM sys.syscolpars WHERE 1=0;
2> GO
id          number colid       name

xtype utype       length prec scale collationid status      maxinrow xmlns
 dflt        chk         idtval

----------- ------ ----------- -------------------------------------------------
-------------------------------------------------------------------------------
----- ----------- ------ ---- ----- ----------- ----------- -------- -----------
 ----------- ----------- -------------------------------------------------------
-----------

(0 rows affected)
1>

This looks like the table. Now I'll query against it using the object ID from the original corruption message:

1> SELECT colid, name FROM sys.syscolpars WHERE id = 1977058079;
2> GO
colid       name
----------- --------------------------------------------------------------------
------------------------------------------------------------
          1 SalesID
         
2 CustomerID
(2 rows affected)
1>

Cool. So I'll try deleting the orphaned columns:

1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO
Msg 259, Level 16, State 1, Server ROADRUNNERPR, Line 1
Ad hoc updates to system catalogs are not allowed.
1>

Hmm. And it doesn't help if I set 'allow updates' to 1, or try putting the database into single-user mode.

There IS a way though. You can put the SERVER into single-user mode, then connect with the DAC and you can then update the system tables. This particular twist on using the DAC isn't documented anywhere except in an MSDN forum thread answered by someone from Microsoft (see here).

BEWARE (if I could put little flashing lights around this too then I would...) that this is undocumented and unsupported - misuse will lead to unrepairable corruption of your databases.

The sequence of events to follow is:

  • make a backup of the database just in case something goes wrong
  • shutdown the server
  • go to the binaries directory (e.g. C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn) and start the server in single-user mode using 'sqlservr -m'
  • connect back in using SQLCMD /A, and run the deleta again. This time it will work, but will give an error about metadata cache consistency:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO

(2 rows affected)
Warning: System table ID 41 has been updated directly in database ID 12 and cache coherence may not have been maintained. SQL Server should be restarted.
1>

  • The system table has been updated, but the in-memory cache of metadata is now out-of-sync with the system tables. So, shutdown the server again as the message suggests and restart it normally
  • run CHECKDB again and you'll see the corruption has been fixed.

Hope this helps some of you. Watch this space for the next demo from TechEd of repairing corruption when no backup is available.

Before I start, I want to make it clear that you can only hit this bug if you ALREADY have corruption, that it's quite rare, and that there is a workaround.

I've noticed a few more people in the forums having CHECKDB fail with this particular error in the last month

Msg 8967, Level 16, State 216, Line 1
An internal error occured in DBCC which prevented further processing. Please contact Product Support.

instead of completing properly and listing the corruptions in the database.

Whenever CHECKDB is using a database snapshot, it must check that the page it read through the snapshot does not have an LSN (Log Sequence Number) higher than that when the snapshot was created. If it did, this would mean that the page was modified AFTER the snapshot was created and hence CHECKDB would be working from an inconsistent view of the database. If this case is discovered, CHECKDB stops immediately. When I rewrote CHECKDB for SQL Server 2005, I changed a bunch of code assertions into seperate states of the 8967 error, so that CHECKDB would fail gracefully if some condition occured that indicates a bug or something that should never happen. State 216 is for the bad LSN condition I've just described.

I used to think it was caused by a race condition with the NTFS code that implements sparse files, which is used by the hidden database snapshot that CHECKDB uses by default. However, I've come to learn that this is a bug in CHECKDB (not one of mine I should say :-)) that causes this behavior under certain circumstances when corruption is present. The bug is that if a corrupt page fails auditing inside CHECKDB, the LSN check is still performed. If the corruption affects the LSN stamped in the page header, the 8967 error could be triggered. I've seen this a handful of times in the last few weeks - hence the need for a blog post. I've discussed this with the dev team and hopefully the fix will make it into the next SPs for 2005 and 2008 (too late to fix such a rare problem in such a critical component at this stage of 2008 development). They're going to put a KB article together too - but in the meantime, I wanted to get this on the Internet so Google/Live Search pick it up.

Now let's repro the problem. Starting with a simple database and table, I'll find the first page so I can corrupt it.

CREATE DATABASE TestCheckdbBug;
GO
USE TestCheckdbBug;
GO
CREATE TABLE test (c1 INT, c2 CHAR (5000));
INSERT INTO test VALUES (1, 'a');
GO
EXEC sp_AllocationMetadata 'test';
GO

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
-----------  --------  -----------------  ---------------  ----------  ---------  --------------
test         0         72057594042318848  IN_ROW_DATA      (1:143)     (0:0)      (1:152)

Now I'm going to corrupt the page type on page (1:143) to be 255 (an invalid page type), which will guarantee the page fails the audit inside CHECKDB.

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Now I'm going to corrupt the LSN on that page such that it's guaranteed to be higher than the creation LSN of the database snapshot (basically by filling the first part of the page header LSN field with 0xFF).

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Bingo! And in the error log, there's some diagnostic information so we can tell which page caused the problem:

2008-05-22 14:55:01.95 spid53   DBCC encountered a page with an LSN greater than the current end of log LSN (31:0:1) for its internal database snapshot. Could not read page (1:143), database 'TestCheckdbBug' (database ID 15), LSN = (-1:65535:18), type = 255, isInSparseFile = 0. Please re-run this DBCC command.
2008-05-22 14:55:01.95 spid53   DBCC CHECKDB (TestCheckdbBug) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul terminated abnormally due to error state 1. Elapsed time: 0 hours 0 minutes 0 seconds.

Note the page ID (in black bold above) tells us the bad page and the LSN (in blue bold above) reflects the corruption that I caused. If the page ID field of the header was corrupt, it wouldn't be possible to tell from these diagnostics which page is corrupt.

However, all is not lost. This bug means that under these circumstances the default online behavior of CHECKDB can't run. The workaround is to use the WITH TABLOCK option of CHECKDB, which does offline checking and doesn't need the snapshot - but the trade-off is that an exclusive database lock is required for a short time and then shared table locks for all tables in the database (this is why online is the default). Running this option on my corrupt database gives:

DBCC CHECKDB ('TestCheckdbBug') WITH TABLOCK, ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Which are the exact same results we had before I corrupted the LSN field (this is expected, as there is no check of a page's LSN field EXCEPT when running from a database snapshot). Now we can proceed to restore/repair as appropriate.

So - a scary little bug that has caused some people headaches, but I want to stress again - this can only happen if the database is ALREADY corrupt, and that it's quite rare. Hope this helps some of you picking this up from search engines in the future.

We're sitting here in St. Pete Beach in Florida visiting some of Kimberly's family and having some sun-kissed R&R before heading back up to Seattle on Wednesday, and I thought I'd get the next post in my sparse columns mini-series out. Before I start though, Kimberly just posted the resources for the Accidental DBA class we taught at SQL Connections last week and in Iceland at the end of March - see here.

In my first post on sparse columns (see here) I introduced the concepts and explained why sparse columns are a useful feature. In this post I'm going to use the example I gave - a document repository with 50 document types and 20 unique attributes per document-type. Yes, it's a contrived example, but scale it up be a factor of 100+ (think Sharepoint Server) and methods like normalization no longer apply.

I'm using a CTP-6 VPC on a Lenovo T60P laptop with 4GB, a dual-core 2.2GHz CPU, and the VPC is running off a 6200RPM drive. Your mileage may vary for run-times of the example scripts. The VPC is the one we gave out at SQL Connections and Iceland, and you can download the scripts and a VPC (maybe only CTP-5) from the Microsoft JumpStart site (see here for details).

The first test I'll do is just creating the schema necessary to store the 1000+ columns of information in the test scenario. I'll do one with sparse columns and one without:

-- Create two tables, one with 1000 columns and one with 1000 columns but 997 sparse.
CREATE TABLE TableWithoutSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
   c0004 INT NULL, c0005 INT NULL, c0006 INT NULL, c0007 INT NULL, c0008 INT NULL, c0009 INT NULL,
   ...
   c0994 INT NULL, c0995 INT NULL, c0996 INT NULL, c0997 INT NULL, c0998 INT NULL, c0999 INT NULL,
   c1000 INT NULL);
GO

CREATE TABLE TableWithSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
   c0004 INT SPARSE NULL, c0005 INT SPARSE NULL, c0006 INT SPARSE NULL, c0007 INT SPARSE NULL,
   ...
   c0996 INT SPARSE NULL, c0997 INT SPARSE NULL, c0998 INT SPARSE NULL, c0999 INT SPARSE NULL,
   c1000 INT SPARSE NULL);
GO

I won't list all the column names for the sake of brevity. Next I'll insert some values into each table (the same values in each table):

-- Insert a few rows in each
INSERT INTO TableWithSparseColumns (DocName, Doctype) VALUES ('aaaa', 1);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0233, c0234,c0235,c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO

INSERT INTO TableWithoutSparseColumns (DocName, Doctype) VALUES ('aaaa', 1);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0233, c0234,c0235,c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO

Now let's see how big each table is:

-- Now lets see how big the rows are
SELECT [avg_record_size_in_bytes], [page_count] FROM sys.dm_db_index_physical_stats (
   DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithoutSparseColumns'), NULL, NULL, 'DETAILED');

SELECT [avg_record_size_in_bytes], [page_count] FROM sys.dm_db_index_physical_stats (
   
DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithSparseColumns'), NULL, NULL, 'DETAILED');
GO

avg_record_size_in_bytes page_count
------------------------ --------------------
4135                     5

(1 row(s) affected)

avg_record_size_in_bytes page_count
------------------------ --------------------
40.6                     1

(1 row(s) affected)

Ok - so that's not a huge difference in page count (because we've only got 5 rows), but it's a *massive* difference in average record size. Scaled up to hundreds of thousands or millions of records, the space savings will be astronomical!

Now let's try selecting the data back using results-to-grid mode and a simple SELECT * statement. It takes 20 seconds to return - solely because the client still has to retrieve the metadata for 1000+ columns. Even though the columns are still defined as SPARSE, they show up in a SELECT * resultset, and that makes extracting out the non-NULL values pretty difficult...

Time for another new feature - column sets. There's a new column type available for use with sparse columns - an XML COLUMN_SET. This is a column that is only materialized when selected, and will return all the non-NULL sparse columns in a row as an XML BLOB. It will also change the behavior of a SELECT * operation - removing all the sparse columns from the resultset and replacing them with itself, representing all the non-NULL sparse columns. Redefining our TableWithSparseColumns to have an XML COLUMN_SET column called SparseColumns (using the syntax 'SparseColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS'), and re-inserting the same values then gives the following results for a SELECT * operation:

DocID   DocName   DocType   SparseColumns
------- --------- --------- ---------------
1       aaaa      1         NULL
2       bbbb      2         <c0945>46</c0945>
3       cccc      3         <c0334>44</c0334>
4       dddd      4         <c0233>12</c0233><c0234>34</c0234>
5       eeee      4         <c0233>12</c0233><c0234>34</c0234><c0235>46</c0235><c0236>66</c0236>

Pretty cool - and it returns virtually instantaneously (obviously scaling up to hundreds of thousands or millions of rows would take longer due to the time necessary to read the pages into the buffer pool). One downside is that the XML blob only returns the column name and value - not the datatype - but if your application can cope with that then not having to wade through hundreds (or thousands by RTM) of NULL columns values is great.

Next time I'll discuss the internals of how sparse columns are stored.

Over the weekend there was a question on one of the internal aliases at MS: how can I tell what percentage of a database has changed since the last full backup, so I can choose between a differential or full backup?

No such code exists as far as I know - until now! I happened to read the thread while sitting in the airport in Washington D.C. on the way back from Iceland so I started playing around and this morning I completed the code.

The code below creates a function and a stored procedure. The basic idea behind the code is as follows:

For each online data file in the database
   For each GAM interval in the file
      Crack the DIFF map page using DBCC PAGE
      Interpret the DIFF bitmap to aggregate the changed extents
      Add the sum to the total changed extents for the database
   End
End
Report results

There's a function that I create in msdb call SQLskillsConvertToExtents that cracks some of the DBCC PAGE output, and the main procedure is called sp_SQLskillsDIFForFULL and it created as a system object in master. I tried making it a table-valued function but you can't do things like INSERT-EXEC in a function, and that's required for processing the DBCC PAGE output. So - create your own wrapper function or whatever to use it. The interface/output is:

EXEC sp_SQLskillsDIFForFULL 'msdb';
GO

Total Extents Changed Extents Percentage Changed
------------- --------------- ----------------------
102           56              54.9

I've tested it with databases with multiple files and up to around 700GB for a single file size. There's the potential for an issue with file sizes of 4TB and beyond (where PFS intervals and GAM intervals map to the same extent every 4TB or so, but I think it'll be ok and the position of the DIFF map in the extent won't change - if someone could test it with a 4+TB file I'd be grateful). It's been tested on SQL Server 2005 and 2008. It will not work on SQL Server 2000 - I'll do a 2000 version sometime soon.

Note that after doing a full backup you will never see Changed Extents equal to zero. It will always be 4 + (number of online data files - 1), and around 20 or so for msdb. This is because the extent containing the file header in each file is always marked as changed, as are three extents in the primary file containing the roots of some critical system tables.

Anyway - here it is. You can download it in a zip file from SQLskillsDIFForFULL.zip (2.65KB). Enjoy!

/*============================================================================
   
File: SQLskillsDIFForFULL.sql

   Summary: This script creates a system-wide SP SQLskillsDIFForFILL that
   
works out what percentage of a database has changed since the
   
previous full database backup.

   Date: April 2008

   SQL Server Versions:
         
10.0.1300.13 (SS2008 February CTP - CTP-6)
         
9.00.3054.00 (SS2005 SP2)
------------------------------------------------------------------------------
   
Copyright (C) 2008 Paul S. Randal, SQLskills.com
   
All rights reserved.

   For more scripts and sample code, check out 
      
http://www.sqlskills.com/

   You may alter this code for your own *non-commercial* purposes. You may
   
republish altered code as long as you give due credit.

   THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
   
ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
   
TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
   
PARTICULAR PURPOSE.

============================================================================*/

-- Create the function in MSDB
--
USE msdb;
GO

IF EXISTS (SELECT * FROM sys.objects WHERE NAME = 'SQLskillsConvertToExtents')
   
DROP FUNCTION SQLskillsConvertToExtents;
GO

-- This function cracks the output from a DBCC PAGE dump
-- of an allocation bitmap. It takes a string in the form
-- "(1:8) - (1:16)" or "(1:8) -" and returns the number
-- of extents represented by the string. Both the examples
-- above equal 1 extent.
--

CREATE FUNCTION SQLskillsConvertToExtents (
   
@extents VARCHAR (100))
RETURNS INTEGER
AS
BEGIN
   
DECLARE @extentTotal   INT;
   
DECLARE @colon         INT;
   
DECLARE @firstExtent   INT;
   
DECLARE @secondExtent  INT;

   SET @extentTotal = 0;
   
SET @colon = CHARINDEX (':', @extents);

   -- Check for the single extent case
   --
   IF (CHARINDEX (':', @extents, @colon + 1) = 0)
      
SET @extentTotal = 1;
   
ELSE
      
-- We're in the multi-extent case
      --
      BEGIN
      
SET @firstExtent = CONVERT (INT,
         
SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
      
SET @colon = CHARINDEX (':', @extents, @colon + 1);
      
SET @secondExtent = CONVERT (INT,
         
SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
      
SET @extentTotal = (@secondExtent - @firstExtent) / 8 + 1;
   
END

   RETURN @extentTotal;
END;
GO

USE master;
GO

IF OBJECT_ID ('sp_SQLskillsDIFForFULL') IS NOT NULL
   
DROP PROCEDURE sp_SQLskillsDIFForFULL;
GO

-- This SP cracks all differential bitmap pages for all online
-- data files in a database. It creates a sum of changed extents
-- and reports it as follows (example small msdb):
--
-- EXEC sp_SQLskillsDIFForFULL 'msdb';
-- GO
--
-- Total Extents Changed Extents Percentage Changed
-- ------------- --------------- ----------------------
-- 102           56              54.9
--
-- Note that after a full backup you will always see some extents
-- marked as changed. The number will be 4 + (number of data files - 1).
-- These extents contain the file headers of each file plus the
-- roots of some of the critical system tables in file 1.
-- The number for msdb may be round 20.
--
CREATE PROCEDURE sp_SQLskillsDIFForFULL (
   
@dbName VARCHAR (128))
AS
BEGIN
   
SET NOCOUNT ON;

   -- Create the temp table
   
--
   
IF EXISTS (SELECT * FROM msdb.sys.objects WHERE NAME = 'SQLskillsDBCCPage')
   
DROP TABLE msdb.dbo.SQLskillsDBCCPage;

   CREATE TABLE msdb.dbo.SQLskillsDBCCPage (
      
[ParentObject] VARCHAR (100),
      
[Object]       VARCHAR (100),
      
[Field]        VARCHAR (100),
      
[VALUE]        VARCHAR (100));

   DECLARE @fileID         INT;
   
DECLARE @fileSizePages  INT;
   
DECLARE @extentID       INT;
   
DECLARE @pageID         INT;
   
DECLARE @DIFFTotal      INT;
   
DECLARE @sizeTotal      INT;
   
DECLARE @total          INT;
   
DECLARE @dbccPageString VARCHAR (200);

   SELECT @DIFFTotal = 0;
   
SELECT @sizeTotal = 0;

   -- Setup a cursor for all online data files in the database
   
--
   
DECLARE files CURSOR FOR
      
SELECT [file_id], [size] FROM master.sys.master_files
      
WHERE [type_desc] = 'ROWS'
      
AND [state_desc] = 'ONLINE'
      
AND [database_id] = DB_ID (@dbName);

   OPEN files;

   FETCH NEXT FROM files INTO @fileID, @fileSizePages;

   WHILE @@FETCH_STATUS = 0
   
BEGIN
      
SELECT @extentID = 0;

      -- The size returned from master.sys.master_files is in
      
-- pages - we need to convert to extents
      
--
      
SELECT @sizeTotal = @sizeTotal + @fileSizePages / 8;

      WHILE (@extentID < @fileSizePages)
      
BEGIN
         
-- There may be an issue with the DIFF map page position
         
-- on the four extents where PFS pages and GAM pages live
         
-- (at page IDs 516855552, 1033711104, 1550566656, 2067422208)
         
-- but I think we'll be ok.
         
-- PFS pages are every 8088 pages (page 1, 8088, 16176, etc)
         
-- GAM extents are every 511232 pages
         
--
         
SELECT @pageID = @extentID + 6;

         -- Build the dynamic SQL
         
--
         
SELECT @dbccPageString = 'DBCC PAGE ('
            
+ @dbName + ', '
            
+ CAST (@fileID AS VARCHAR) + ', '
            
+ CAST (@pageID AS VARCHAR) + ', 3) WITH TABLERESULTS, NO_INFOMSGS';

         -- Empty out the temp table and insert into it again
         
--
         
DELETE FROM msdb.dbo.SQLskillsDBCCPage;
         
INSERT INTO msdb.dbo.SQLskillsDBCCPage EXEC (@dbccPageString);

         -- Aggregate all the changed extents using the function
         
--
         
SELECT @total = SUM ([msdb].[dbo].[SQLskillsConvertToExtents] ([Field]))
         
FROM msdb.dbo.SQLskillsDBCCPage
            
WHERE [VALUE] = '    CHANGED'
            
AND [ParentObject] LIKE 'DIFF_MAP%';

         SET @DIFFTotal = @DIFFTotal + @total;

         -- Move to the next GAM extent
         
SET @extentID = @extentID + 511232;
      
END

      FETCH NEXT FROM files INTO @fileID, @fileSizePages;
   
END;

   -- Clean up
   
--
   
DROP TABLE msdb.dbo.SQLskillsDBCCPage;
   
CLOSE files;
   
DEALLOCATE files;

   -- Output the results
   
--
   
SELECT
      
@sizeTotal AS [Total Extents],
      
@DIFFTotal AS [Changed Extents],
      
ROUND (
         
(CONVERT (FLOAT, @DIFFTotal) /
         
CONVERT (FLOAT, @sizeTotal)) * 100, 2) AS [Percentage Changed];
END;
GO

-- Mark the SP as a system object
--
EXEC sys.sp_MS_marksystemobject sp_SQLskillsDIFForFULL;
GO

-- Test to make sure everything was setup correctly
--
EXEC sp_SQLskillsDIFForFULL 'msdb';
GO

It's been a long few days of building slide decks and other content and I just had to stop for a bit and take care of my internals-hacking withdrawal symptoms!

While I was at Microsoft, I wrote some code in the Storage Engine to very easily return all the IAM chains/allocation units (see this post for more details of the internals of those), what type they are, and the relevant page IDs (first, root, first-IAM) so I could go spelunking with DBCC PAGE. Since I left six months ago, it's one of the things I've been missing using when poking around on customer sites, so this afternoon I sat down and wrote the equivalent in T-SQL, using the undocumented sys.system_internals_allocation_units DMV. The output is easy to match up to sys.partitions but the page IDs are formatted in byte-reversed hex so a little tweaking was needed to extract the fields and make them human readable - I've put them into the same format that all SQL Server error messages use when giving a page number.

So - I present to you sp_AllocationMetadata. I was having all kinds of trouble using it in other databases (trying to figure out a way to change database contexts in the SP) until I remembered that you can create an SP in master and mark it as a system object using the undocumented sys.sp_MS_marksystemobject SP. This makes any SP execute in the context of the database from where it is called - extremely useful when you're querying against a database's system catalog views.

[Edit: Kalen pointed out that the DMV *is* documented, but just not in the BOL index. Even better - that means the SP below isn't doing anything dodgy :-) - thanks Kalen!]

The SP can be called with an optional object name parameter, in which case it will only give you back the allocation metadata for that object. If you don't specify a parameter, it gives you back the allocation metadata for all objects in the database. Here's an example of the output:

USE AdventureWorks;
GO

EXEC sp_AllocationMetadata 'HumanResources.Employee';
GO

Object Name   Index ID   Alloc Unit ID       Alloc Unit Type   First Page   Root Page   First IAM Page
------------- ---------- ------------------- ----------------- ------------ ----------- ----------------
Employee      1          72057594050379776   IN_ROW_DATA       (1:588)      (1:594)     (1:593)
Employee      2          72057594055491584   IN_ROW_DATA       (1:2141)     (1:2144)    (1:2142)
Employee      3          72057594055557120   IN_ROW_DATA       (1:2146)     (1:2149)    (1:2147)
Employee      4          72057594055622656   IN_ROW_DATA       (1:2150)     (1:2150)    (1:2151)
Employee      5          72057594055688192   IN_ROW_DATA       (1:2153)     (1:2153)    (1:2154)

You'll notice there are only IN_ROW_DATA allocation units - that's because this table doesn't have any LOB data or an variable-length columns that have been pushed off-row (producing LOB_DATA and ROW_OVERFLOW_DATA allocation units, respectively). So - it only shows what actually exists (rather than creating NULL values, for instance).

Below is the script that creates the SP, and I've included it as an attachment too.

Ah - that feel's better :-) Happy spelunking!

USE master;
GO

IF OBJECT_ID ('sp_AllocationMetadata') IS NOT NULL
   DROP PROCEDURE sp_AllocationMetadata;
GO

CREATE PROCEDURE sp_AllocationMetadata
(
   
@object VARCHAR (128) = NULL
)
AS
SELECT
   
OBJECT_NAME (sp.object_id) AS [Object Name],
   
sp.index_id AS [Index ID],
   
sa.allocation_unit_id AS [Alloc Unit ID],
   
sa.type_desc AS [Alloc Unit Type],
   
'(' CONVERT (VARCHAR (6),
      
CONVERT (INT, SUBSTRING (sa.first_page, 6, 1) +
      
   SUBSTRING (sa.first_page, 5, 1))) +
   
':' + CONVERT (VARCHAR (20),
      
CONVERT (INT, SUBSTRING (sa.first_page, 4, 1) +
      
   SUBSTRING (sa.first_page, 3, 1) +
         
SUBSTRING (sa.first_page, 2, 1) +
         
SUBSTRING (sa.first_page, 1, 1))) +
   
')' AS [First Page],
   '(' + CONVERT (VARCHAR (6),
      
CONVERT (INT,
         
SUBSTRING (sa.root_page, 6, 1) +
         
SUBSTRING (sa.root_page, 5, 1))) +
   
':' + CONVERT (VARCHAR (20),
      
CONVERT (INT,
         
SUBSTRING (sa.root_page, 4, 1) +
         
SUBSTRING (sa.root_page, 3, 1) +
         
SUBSTRING (sa.root_page, 2, 1) +
         
SUBSTRING (sa.root_page, 1, 1))) +
   
')' AS [Root Page],
   
'(' + CONVERT (VARCHAR (6),
      
CONVERT (INT,
         
SUBSTRING (sa.first_iam_page, 6, 1) +
         
SUBSTRING (sa.first_iam_page, 5, 1))) +
   
':' + CONVERT (VARCHAR (20),
      
CONVERT (INT,
         
SUBSTRING (sa.first_iam_page, 4, 1) +
         
SUBSTRING (sa.first_iam_page, 3, 1) +
         
SUBSTRING (sa.first_iam_page, 2, 1) +
         
SUBSTRING (sa.first_iam_page, 1, 1))) +
   
')' AS [First IAM Page]
FROM
   
sys.system_internals_allocation_units AS sa,
   
sys.partitions AS sp
WHERE
   
sa.container_id = sp.partition_id
   AND sp.object_id =
      
(CASE WHEN (@object IS NULL)
         THEN sp.object_id
         
ELSE OBJECT_ID (@object)
      
END);
GO

EXEC sys.sp_MS_marksystemobject sp_AllocationMetadata;
GO

sp_AllocationMetadata.zip (.69 KB)

A couple of weeks ago I blogged about the three tracks of the SQL Server 2008 JumpStart course that SQLskills.com taught internally for Microsoft and some MVPs - see here for details. Well, the content is now available to download! Note that this was based on CTP-5 (November 2007 CTP) and there have been *lots* of behavioral changes since then (with more planned for CTP-6 Refresh and RTM), but if you want a high-level overview of a bunch of the features (albeit in bullet-point summaries on slides) then this is a good place to start.

For me, what's *REALLY* cool is that the site also has a downloadable VPC plus lab manuals for all of the AlwaysOn High-Availability hands-on labs that SQLskills.com wrote. We originally wrote these labs for SQL Server 2005 and I updated them all for CTP-5. The VPC has a long lab on each of the following:

  • Database Snapshots
  • Data Recovery and Preventative Techniques
  • Instant Initialization
  • Peer-to-Peer Replication (including the new Topology Wizard I blogged about here)
  • Table and Index Partitioning
  • Snapshot Isolation
  • Online Operations
  • Database Mirroring (including a demo I wrote of Automatic Page Repair, described here)
  • Service Oriented Database Architecture

There is some great depth in each of these - Kimberly blogged more info about the exercises in each lab here. There's also another VPC image with some higher-level labs on a variety of 2008 features and written by a number of different people- including some labs on Policy-Based Management and Performance Data Collection that Kimberly wrote.

So - where can you get these from? Go to http://sqlserver2008jumpstart.microsofttraining.com/ and hit the Download link on the right-hand side. Register and then you can get to the materials. The AlwaysOn VPC image is Collection 2 at the bottom of the page, and you'll see all the slide decks as you scroll down the page.

Enjoy!

This post is based on one from my old MSDN blog but the topic has come up a few times in recent days so I want to revamp it and re-post.

There are two things that confuse people about mirrored backups - can you mix-n-match backup devices from the mirrors, and what exactly do the various sizes mean?

1) Single-device backup, no mirror

The code below creates a single-device backup with no mirror, and then examines it.

BACKUP DATABASE AdventureWorks TO
DISK
= N'C:\SQLskills\mediaset1device1.bck'
WITH FORMAT, STATS;
GO

RESTORE HEADERONLY FROM DISK = N'C:\SQLskills\mediaset1device1.bck';
GO

The BackupSize in the HEADERONLY output is 168,899,072 bytes and the on-disk size of the file mediaset1device1.bck is 161MB.

2) Single-device backup, mirrored

The code below creates a single-device backup with a mirror, and then examines it.

BACKUP DATABASE AdventureWorks TO
DISK
= N'C:\SQLskills\mediaset1device1.bck'
MIRROR TO DISK = N'C:\SQLskills\mediaset2device1.bck'
WITH FORMAT, STATS;
GO

RESTORE HEADERONLY FROM DISK = N'C:\SQLskills\mediaset1device1.bck';
RESTORE HEADERONLY FROM DISK = N'C:\SQLskills\mediaset2device1.bck';
GO

The BackupSize in the HEADERONLY output of both files is 337,798,144 bytes. This is double the size of the backup in case #1 above - and it because there are now two copies of the backup. The on-disk size of both files is 161MB, which is what we'd expect as mediaset2device1.bck is a copy of mediaset1device1.bck.

3) Two-device backup, no mirror

The code below creates a two-device backup with no mirror, and then examines it.

BACKUP DATABASE AdventureWorks TO
DISK
= N'C:\SQLskills\mediaset1device1.bck',
DISK = N'C:\SQLskills\mediaset1device2.bck'
WITH FORMAT, STATS;
GO

RESTORE HEADERONLY FROM DISK = N'C:\SQLskills\mediaset1device1.bck';
GO

The BackupSize in the HEADERONLY output is 169,959,424 bytes. This is nearly exactly the same as for the single-device backup in case #1, but includes a bit more to account for the extra metadata in the second device. This time, the on-disk size of the file mediaset1device1.bck is 81MB. This is half of the on-disk size from the single-device case #1 as the backup is now split between the two files.

4) Two-device backup, mirrored

The code below creates a single-device backup with a mirror, and then examines it.

BACKUP DATABASE AdventureWorks TO
DISK
= N'C:\SQLskills\mediaset1device1.bck',
DISK = N'C:\SQLskills\mediaset1device2.bck'
MIRROR TO DISK = N'C:\SQLskills\mediaset2device1.bck',
DISK = N'C:\SQLskills\mediaset2device2.bck'
WITH FORMAT, STATS;
GO

RESTORE HEADERONLY FROM DISK = N'C:\SQLskills\mediaset1device1.bck';
RESTORE HEADERONLY FROM DISK = N'C:\SQLskills\mediaset2device1.bck';
GO

The BackupSize in the HEADERONLY output of both files is 339,918,848 bytes - again, double the size of the non-mirrored backup in case #3. The on-disk size of each file is 81MB, as each file is one half of a copy of the backup.

Restoring

Now let's try to mix devices from the two backup media sets and see if it's possible:

RESTORE DATABASE AdventureWorks
FROM DISK = N'C:\SQLskills\mediaset1device1.bck'
,
DISK = N
'C:\SQLskills\mediaset2device2.bck'
WITH REPLACE, STATS
;
GO

And it works fine - excellent! That's the whole point of having mirrored backups.

One other question is - can backup device types can differ between media sets in the same backup. The answer to this is no - as documented in Books Online. All the backup devices involved in a single backup, regardless of whether they're part of a mirror media set or not, must be of the same type and have similar characteristics.

Hope this is useful.

This is a post I've been meaning to do for a while - detailing some of the problems you can run into when using database snapshots.

Reverting to a snapshot

Many people use snapshots as a way of protecting against mistakes during a complicated set of changes to a database - you can just revert to the snapshot if something goes wrong. However, reverting to a snapshot has a little-known (but documented) problem. Let's see. I'm going to create a database plus a snapshot of it and then start taking backups.

CREATE DATABASE SnapshotTest;
GO
ALTER DATABASE SnapshotTest SET RECOVERY FULL;
GO
BACKUP DATABASE SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest.bak' WITH INIT;
GO

CREATE TABLE SnapshotTest.dbo.MyTable (c1 INT);
GO

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak' WITH INIT;
GO

-- Imagine a bunch of things happen here

-- Create the snapshot database, by first checking which files exist
SELECT * FROM SnapshotTest.sys.database_files;
GO

CREATE DATABASE ST_Snap ON
(NAME = SnapshotTest, FILENAME = 'C:\SQLskills\SnapshotTest_snap.snp')
AS SNAPSHOT OF SnapshotTest;
GO

Now I'm going to do some stuff that's a mistake:

DROP TABLE SnapshotTest.dbo.MyTable;
GO

Not a problem as I can revert to my database snapshot:

RESTORE DATABASE SnapshotTest FROM DATABASE_SNAPSHOT = 'ST_Snap';
GO

Oops! I forgot to take a log backup to capture everything that happened since the last log backup...

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak';
GO
Msg 4214, Level 16, State 1, Line 1
BACKUP LOG cannot be performed because there is no current database backup.
Msg 3013, Level 16, State 1, Line 1
BACKUP LOG is terminating abnormally.

Reverting to a snapshot rebuilds the transaction log and breaks the log backup chain. The only thing I can do is take a full or differential backup and then start taking log backups again. So - I lost the ability to do point-in-time restores in the period from the last log backup to the time when I reverted to the snapshot.

So how can I tell when the database was reverted? Books Online documents that the restorehistory table in msdb should have an entry with restore_type = 'R'. Let's try:

SELECT * FROM msdb.dbo.restorehistory WHERE [destination_database_name] = 'SnapshotTest';
GO

Nope - nothing. That functionality seems to be broken. The only way I could find to trace reverting was in the error log:

2008-01-30 11:09:21.73 spid53      Reverting database 'SnapshotTest' to the point in time of database snapshot 'ST_Snap' with split point LSN 26000000013800001 (0x0000001a:0000008a:0001). This is an informational message only. No user action is required.
2008-01-30 11:09:21.74 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.74 spid53      The database 'SnapshotTest' is marked RESTORING and is in a state that does not allow recovery to be run.
2008-01-30 11:09:21.85 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.87 spid53      Starting up database 'ST_Snap'.

Note that the snapshot database is still there and continues to work.

Running out of space in a snapshot

The next problem that can happen is if you create a database snapshot on a volume that doesn't have much disk space. In that case you can actually run out of space and the snapshot will go suspect and be unusable until it's dropped and recreated. Let's see what happens:

-- Create snapshot of large database on a volume with not much space
SELECT * FROM SalesDB.sys.database_files;
GO

CREATE DATABASE SalesDB_Snap ON
(NAME = SalesDBData, FILENAME = 'D:\sqlskills\SalesDB_snap.snp')
AS SNAPSHOT OF SalesDB;
GO

ALTER INDEX ALL ON SalesDB.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'D:\sqlskills\SalesDB_snap.snp' failed due to lack of disk space.

The query that causes the sparse file to run out of space doesn't fail, but the user running the query will see this error. Note that it doesn't say the snapshot has gone suspect. If you try to use it then it will tell you, plus there info written to the error log:

2008-01-30 11:50:29.14 spid53      Error: 17053, Severity: 16, State: 1.
2008-01-30 11:50:29.14 spid53      D:\sqlskills\SalesDB_snap.snp: Operating system error 112(There is not enough space on the disk.) encountered.
2008-01-30 11:50:29.15 spid53      Error: 3420, Severity: 21, State: 1.
2008-01-30 11:50:29.15 spid53      Database snapshot 'SalesDB_Snap' has failed an IO operation and is marked suspect.  It must be dropped and recreated.

Running out of space while running DBCC CHECKDB

One thing that isn't common knowledge is that DBCC CHECKDB uses database snapshots as it's mechanism for running online in SQL Server 2005. You can't control where the snapshot is created (it's actually created in the same location as the files comprising the database being checked - see my previous post on the DBCC CHECKDB stages here for more info) but you can create your own snapshot and check that. Anyway - it's possible to run out of space in the snapshot while DBCC CHECKDB is running. I setup a situation with a large database on a drive with only 1MB of space left, started a large index rebuild in the database and then tried to run DBCC CHECKDB. Let's see what happened:

The index rebuild statement:

ALTER INDEX ALL ON SalesDB2.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.

So some poor unsuspecting user gets this error and has no idea why. For DBAs its a bit obscure too. The syntax is describing an alternate stream on the existing data file and the snapshot in question can't be accessed at all.

The DBCC CHECKDB statement:

DBCC CHECKDB (salesdb2) WITH NO_INFOMSGS;
GO
Msg 1823, Level 16, State 1, Line 1
A database snapshot cannot be created because it failed to start.
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.
Msg 3313, Level 21, State 2, Line 1
During redoing of a logged operation in database 'salesdb2', an error occurred at log record ID (1628:252:1). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command.  The results, if any, should be discarded.

Wow - major amounts of errors - and it all boils down to running out of space.

Back in October 2007 I blogged about partition-level lock escalation in SQL Server 2008 (see here) and I promised to do a follow-up once CTP-5 came out with syntax etc. So here it is.

A brief recap - lock escalation in SQL Server 2005 and before only allowed table-level lock escalation. If you have a partitioned table with queries going against different partitions, then table-level escalation is a pain because the whole table is suddenly locked and concurrent queries against distinct partitions can't run. SQL Server 2008 gives the ability to escalate to a parttition lock, which won't affect the queries on the other partitions.

The lock escalation policy can only be set with ALTER TABLE after a table has been created, and the policy can only be set at the table level. The syntax is

ALTER TABLE TableName SET (LOCK_ESCALATION = TABLE | AUTO | DISABLE);

The options mean:

  • TABLE - escalation will always be to the table level. This is the default.
  • AUTO - escalation will be to the partition level if the table is partitioned; otherwise it will be to the table level
  • DISABLE - escalation will be disabled. This does not guarantee that it will NEVER occur - there are some cases where it is necessary (Books Online gives the example of scanning a heap in the SERIALIZABLE isolation level)

The only way I could find to check what the escalation policy for a table is set to is to use the sys.tables catalog view:

SELECT lock_escalation_desc FROM sys.mytables WHERE name = 'TableName';

Let's try it out. Here's a script that creates a database with an example table with 3 partitions. The partition ranges are negative infinity to 7999, 8000 to 15999, 16000 to positive infinity.

CREATE DATABASE LockEscalationTest;
GO

USE LockEscalationTest;
GO

-- Create three partitions: -7999, 8000-15999, 16000+
CREATE PARTITION FUNCTION MyPartitionFunction (INT) AS RANGE RIGHT FOR VALUES (8000, 16000);
GO

CREATE PARTITION SCHEME MyPartitionScheme AS PARTITION MyPartitionFunction
ALL TO ([PRIMARY]);
GO

-- Create a partitioned table
CREATE TABLE MyPartitionedTable (c1 INT);
GO

CREATE CLUSTERED INDEX MPT_Clust ON MyPartitionedTable (c1)
ON MyPartitionScheme (c1);
GO

-- Fill the table
SET NOCOUNT ON;
GO

DECLARE @a INT = 1;
WHILE (@a < 17000)
BEGIN
INSERT INTO MyPartitionedTable VALUES (@a);
SELECT @a = @a + 1;
END;
GO

Now I'm going to explicitly set the escalation to TABLE and start a transaction that should cause lock escalation.

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = TABLE);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

We should be able to see the locks being held:

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    X              LOCK           GRANT

Just as we expected - an X table lock. Trying any query against the table fails now. Now I'll rollback that transaction, set the escalation to partition-level and try again.

ROLLBACK TRAN;
GO

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = AUTO);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable');
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Excellent - the object lock is now IX rather than X, and the X lock is at the partition (HOBT) level for partition 1 (see the bold highlighting to match the partition ID with the lock resource). (For an explanation of HOBTs, see my post Inside The Storage Engine: IAM pages, IAM chains, and allocation units.) So now we should be able to do something with another partition - let's see if we can cause another partition level X lock in another connection:

USE LockEscalationTest;
GO

BEGIN TRAN
UPDATE
MyPartitionedTable set c1 = c1 WHERE c1 > 8100 AND c1 < 15900
;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable'
);
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE'
;
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039107584             X              LOCK           GRANT
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Now we have two partition X locks, for partitions 1 and 2 (as expected - use the color coding above to match up the IDs), plus two table-level IX locks (one for each  connection, as expected). Very cool!

Now I'm going to force a deadlock - by having each connection try to read a row from the other locked partition:

Connection 1:

SELECT * FROM MyPartitionedTable WHERE c1 = 8500;
GO

Conneciton 2:

SELECT * FROM MyPartitionedTable WHERE c1 = 100;
GO

Connection 2 succeeds but on connection 1 we get (as expected):

(local)\SQLDEV01(SQLHAVPC\Administrator): Msg 1205, Level 13, State 18, Line 1
Transaction (Process ID 51) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

This illustrates a potential problem with this new mechanism - applications that used to rely on the blocking nature of X table locks may now exhibit deadlocks if partition-level escalation is turned on in production without any testing. In fact, this mode was specifically chosen NOT to be the default setting for new tables because some trial workloads exhibited deadlocks during testing. Don't just turn it on in production without testing - as with any other option or feature.

I came across an interesting bug in Management Studio in the latest Katmai CTP today - when a connection disconnects from a database, SSMS doesn't release the shared database lock that it holds. This prevents any operations that need exclusive database access (like a RESTORE) and can be somewhat disconcerting if you don't realize what's going on. Try executing the following in SSMS in the November CTP:

USE master;
GO

CREATE DATABASE MySSMSTest;
GO

USE MySSMSTest;
GO

CREATE TABLE test (c1 INT);
GO

USE master;
GO

DROP DATABASE MySSMSTest;
GO

And you'll see:

Msg 3702, Level 16, State 4, Line 1
Cannot drop database "MySSMSTest" because it is currently in use.

It works perfectly in SQLCMD in the same CTP, and also in SQL Server 2005. You can work around this by doing something like:

ALTER DATABASE MySSMSTest SET RESTRICTED_USER WITH ROLLBACK IMMEDIATE;

GO

Not exactly ideal - but, hey, it's only a CTP. This has already been reported in Connect as issue 320135 - I added a workaround.

I'll post any further bugs that I find here as well as making sure they're on Connect.

Ok - so we did more partying than we thought so blog posts have been a little sparse this month, but here's one to end off the year.

There's a popular impression that turning on snapshot isolation and then rebuilding indexes will cause all rows in the table to get the extra 14-byte versioning tags. Fact or fiction? Let's find out.

First I'll create a test database, containing a small table with a clustered index and few rows:

CREATE DATABASE SItest;
GO

USE SItest;
GO

CREATE TABLE SmallTable (c1 INT, c2 INT);
CREATE CLUSTERED INDEX SmallTableCI ON SmallTable (c1);
GO

INSERT INTO SmallTable VALUES (1, 1);
INSERT INTO SmallTable VALUES (2,2);
GO

Next I'll turn on READ_COMMITTED_SNAPSHOT and rebuild the index to see if statement level versioning does the trick:

ALTER DATABASE SItest SET READ_COMMITTED_SNAPSHOT ON;
GO

ALTER INDEX SmallTableCI ON SmallTable REBUILD;
GO

Now let's look at the data page holding the two rows to see if there's any versioning info (the output is snipped short a little for brevity):

DBCC IND (SItest, SmallTable, 1);
GO

DBCC TRACEON (3604); -- remember this makes the output go to the console
DBCC PAGE (SItest, 1, 153, 3);
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID
------- ----------- ------ ----------- ----------- -----------
1       154         NULL   NULL        2073058421  1
1       153         1      154         2073058421  1

(2 row(s) affected)

DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

PAGE: (1:153)

<SNIP SNIP SNIP>

Slot 0 Offset 0x60 Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C060

00000000:   10000c00 01000000 01000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 1 Offset 0x6f Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C06F

00000000:   10000c00 02000000 02000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 1 Column 2 Offset 0x8 Length 4

c2 = 2

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Nope - both the rows look normal. Now for completeness let's try transaction level versioning and a rebuild:

ALTER DATABASE SItest SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

ALTER INDEX SmallTableCI ON SmallTable REBUILD;
GO

DBCC IND (SItest, SmallTable, 1);
GO

DBCC PAGE (SItest, 1, 143, 3); -- page changed when we rebuilt the index
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID
------- ----------- ------ ----------- ----------- -----------
1       152         NULL   NULL        2073058421  1
1       143         1      152         2073058421  1

(2 row(s) affected)

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

PAGE: (1:143)

<SNIP SNIP SNIP>

Slot 0 Offset 0x60 Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C060

00000000:   10000c00 01000000 01000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 1 Offset 0x6f Length 15

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6209C06F

00000000:   10000c00 02000000 02000000 0300f9 ...............
UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 1 Column 2 Offset 0x8 Length 4

c2 = 2

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Nope - still nothing. Now I'll force a versioning operation and we should see the tags. Any update to the table should cause versioned records to be created. In this case, I'll start an explicit transaction and do some updates and we should be able to see the original values using another query window. First the updates:

BEGIN TRAN;
GO

UPDATE SmallTable SET c1 = 4;
GO

And in another window:

SELECT * FROM smalltable;
GO

c1          c2
----------- -----------
1           1
2           2

(2 row(s) affected)

Cool - so the original values are still there. Let's see the versioning info on the data page (output snipped again):

DBCC PAGE (SItest, 1, 143, 3);
GO

PAGE: (1:143)

<SNIP SNIP SNIP>

Slot 0 Offset 0x7e Length 29

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP VERSIONING_INFO

Memory Dump @0x61CEC07E

00000000:   5c000c00 01000000 01000000 0300f9e0 \...............
00000010:   00000001 00000057 03000000 00       .......W.....

Version Infomation =
 Transaction Timestamp: 855
 Version Pointer: (file 1 page 224 slot 0)

UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 1 Offset 0xb8 Length 29

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP VERSIONING_INFO

Memory Dump @0x61CEC0B8

00000000:   5c000c00 02000000 02000000 0300f9e0 \...............
00000010:   00000001 00010057 03000000 00       .......W.....

Version Infomation =
 Transaction Timestamp: 855
 Version Pointer: (file 1 page 224 slot 1)

UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 1 Column 2 Offset 0x8 Length 4

c2 = 2

Slot 2 Offset 0x9b Length 29

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VERSIONING_INFO

Memory Dump @0x61CEC09B

00000000:   50000c00 04000000 01000000 0300f800 P...............
00000010:   00000000 00000057 03000000 00       .......W.....

Version Information =
 Transaction Timestamp: 855
 Version Pointer: Null

UNIQUIFIER = [NULL]

Slot 2 Column 1 Offset 0x4 Length 4

c1 = 4

Slot 2 Column 2 Offset 0x8 Length 4

c2 = 1

Slot 3 Offset 0xd5 Length 37

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS VERSIONING_INFO

Memory Dump @0x61CEC0D5

00000000:   70000c00 04000000 02000000 0300f801 p...............
00000010:   00170001 00000000 00000000 00000057 ...............W
00000020:   03000000 00                         .....

Version Information =
 Transaction Timestamp: 855
 Version Pointer: Null


Slot 3 Column 0 Offset 0x13 Length 4

UNIQUIFIER = 1

Slot 3 Column 1 Offset 0x4 Length 4

c1 = 4

Slot 3 Column 2 Offset 0x8 Length 4

c2 = 2

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

I've marked the versioning parts in bold. Notice that the old records have been turned into ghost records too. The second record is now tagged as having variable-length columns too. This is because I updated both records to have the same clustering key value and so the second record now needs a uniquifier - which is stored as a variable-length column.

So, the original statement is a myth - the only time that rows get versioning info added to them is when it's needed to support a versioning operation.

This is a question I was sent a week or so ago - if a table is truncated inside a transaction, what protects the integrity of the table's pages in case the transaction rolls back? Let's find out.

First off I'll create a simple table to experiment with.

CREATE TABLE TruncateTest (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'A');
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 1
;
WHILE (@a < 20
)
BEGIN

INSERT INTO TruncateTest DEFAULT VALUES;
SELECT @a = @a + 1;

END;
GO

We can see what pages and extents are allocated to the table using the undocumented DBCC IND command:

DBCC IND (test, TruncateTest, 0);
GO

PageFID PagePID
------- ---------
1       193
1       192
1       194
1       195
1       196
1       197
1       198
1       199
1       200
1       224
1       225
1       226
1       227
1       228
1       229
1       230
1       231
1       232
1       233
1       234

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

I've curtailed the output to just the page IDs and we can see that there are 4 extents used by this table (starting on pages (1:192), (1:200), (1:224), and (1:232)). Now if we truncate the table in a transaction, what will DBCC IND show?

BEGIN TRAN;
GO

TRUNCATE TABLE TruncateTest;
GO

DBCC IND (test, TruncateTest, 0);
GO

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Looks like there are no pages allocated to the table. So where are they? Let's check what locks there are. Instead of using sp_lock, I'm going to use it's replacement DMV, sys.dm_tran_locks:

SELECT resource_type, resource_description, request_mode FROM sys.dm_tran_locks WHERE resource_type IN ('EXTENT', 'PAGE');
GO

resource_type   resource_description   request_mode
--------------- ---------------------- --------------
EXTENT          1:200                  X
PAGE            1:198                  X
PAGE            1:199                  X
PAGE            1:196                  X
PAGE            1:197                  X
PAGE            1:194                  X
PAGE            1:195                  X
PAGE            1:192                  X
PAGE            1:193                  X
EXTENT          1:192                  X
PAGE            1:200                  X
EXTENT          1:232                  X
EXTENT          1:224                  X

Ah - all the pages and extents are locked. The table doesn't show them as allocated any more but because they're exclusively locked, the allocation subsystem can't really deallocate them until the locks are dropped (when the transaction commits). That's the answer - they can't be reused until they're really deallocated. If a transaction rollback happens, the pages are just marked as allocated again.

I’m in the middle of a flight from Washington D.C. to Zurich on the way to Barcelona for TechEd IT Forum and I can’t sleep – Kimberly’s out like a light so what else is there to do except write another blog post? :-) OK - actually posting this from Barcelona on Tuesday before our first of 12 sessions here!

In the Database Maintenance workshop we did at SQL Connections last week I promised to blog a script I used to show how data file shrink operations cause massive fragmentation of indexes. The reason is that data file shrink starts at the end of the data file and moves a single page at a time to a free space below the shrink threshold. In the process of doing so, it perfectly reverses the physical order of the pages comprising the leaf level of an index – thus perfectly fragmenting it!!

Let’s try out my simple script that demonstrates this. First thing I’m going to do is create a new database and create a 10MB ‘filler’ table, which I’m going to then drop later to create a space that shrink can use.

USE MASTER;

GO

 

IF DATABASEPROPERTYEX ('shrinktest', 'Version') > 0

      DROP DATABASE shrinktest;

 

CREATE DATABASE shrinktest;

GO

USE shrinktest;

GO

 

SET NOCOUNT ON;

GO

 

-- Create and fill the filler table

CREATE TABLE filler (c1 INT IDENTITY, c2 VARCHAR(8000))

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO filler VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO

Next I’ll create the ‘production’ table that we’d really like to keep in optimal shape for performance.

-- Create and fill the production table

CREATE TABLE production (c1 INT IDENTITY, c2 VARCHAR (8000));

CREATE CLUSTERED INDEX prod_cl ON production (c1);

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO production VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO 

Now I’ll use the sys.dm_db_index_physical_stats DMV to check the fragmentation of the production table’s clustered index – it should be almost zero:

-- check the fragmentation of the production table

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

0.390930414386239            6

This is what I expected. Now I’m going to drop the filler table, run a shrink operation and then check the fragmentation again:

-- drop the filler table and shrink the database

DROP TABLE filler;

GO

 

-- shrink the database

DBCC SHRINKDATABASE (shrinktest);

GO

 

-- check the index fragmentation again

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

99.7654417513683             1277

Wow! The index went from almost 0% fragmented to almost 100% fragmented – the shrink operation totally reversed the physical ordering of the leaf level of the clustered index – nasty.

One of the common maintenance operations I see at customer sites is to run a database shrink at some interval, and I always advise against it – now you can see why. Running a regular shrink operation can cause horrible fragmentation problems. The worst problems I see are those customers with maintenance plans that rebuild all indexes and then run a shrink to remove the extra space necessary for the index rebuilds – completely undoing the effects of the index rebuild!

One other common thing I see is to have auto-shrink set on for one or databases. This is bad for several reasons:

  • Shrink causes index fragmentation, as I’ve just demonstrated above.
  • You can't control when it kicks in. Although it doesn't have any effect like long-term blocking, it does take up a lot of resources, both IO and CPU. It also moves a lot of data through the buffer pool and so can cause hot pages to be pushed out to disk, slowing things down further. If the server is already pushing the limits of the IO subsystem, running shrink may push it over, causing long disk queue lengths and possibly IO timeouts.
  • You're likely to get into a death-spiral of auto-grow then auto-shrink then auto-grow then auto-shrink... (in my experience, if someone is using auto-shrink, they're most likely using and relying on auto-grow too). An active database usually requires free space for normal operations - so if you take that free space away then the database just has to grow again. This is bad for several reasons:
    • Repeatedly shrinking and growing the data files will cause file-system level fragmentation, which can slow down performance
    • It wastes a huge amount of resources, basically running the shrink algorithm for no reason
    • Auto-grow itself can be bad, especially if you're using SQL Server 2000 (or don't have Instant File Initialization turned on - see this post from Kimberly's blog) where all allocations to the file being grown are blocked while the new portion of the file is being zero-initialized.

Bottom-line: auto-shrink should *NEVER* be turned on…

I've been involved in a few conversations today that have highlighted some big misconceptions about how backups work. I'd like to use this blog post to debunk them. I checked everything I say here with my friend Steve Schmidt, the developer on the Storage Engine team responsible for BACKUP/RESTORE for the last ten years.

Myth 1: A full database backup only contains the transaction log from the start of the backup to the end of the backup

When you restore a full database backup, you get a transactionally consistent database. Consider the case where there's an active transaction that doesn't commit until after the backup completes. If the backup only contained the log that occured while the database was being backed up, how would it roll back the active transaction. It *has* to include enough transaction log to roll back the active transaction. The start LSN of the log included in a database backup is the minimum of:

  • LSN of the last checkpoint
  • LSN of the start of the oldest active transaction
  • LSN of the last replicated transaction

Let me prove it to you. I'm going to create a database, put it into FULL recovery mode, start a transaction, checkpoint, and then take a backup. The checkpoint ensures the page I've altered is flushed to disk.

CREATE DATABASE stopattest;

GO

ALTER DATABASE stopattest SET RECOVERY FULL;

GO

BACKUP DATABASE stopattest TO DISK = 'c:\sqlskills\stopattest.bck' WITH INIT;

GO

USE stopattest;

GO

CREATE TABLE t1 (c1 INT);

GO

BEGIN TRAN;

INSERT INTO t1 VALUES (1);

GO

Now in another connection I'll take another full database backup.

BACKUP DATABASE stopattest TO DISK = 'c:\sqlskills\stopattest.bck' WITH INIT;

GO

The msdb.dbo.backupmedia table will tell us the relevant LSNs in the backup (I added some spaces to delineate the prts of the LSN for clarity):

SELECT last_lsn, checkpoint_lsn, database_backup_lsn FROM msdb.dbo.backupset

WHERE database_name = 'stopattest';

GO

last_lsn              checkpoint_lsn        database_backup_lsn
--------------------- --------------------- ---------------------
21 0000000190 00001   21 0000000174 00037    21 0000000058 00037

[Edit: After swapping some email with Kalen Delaney, I realized that when I originally put this together I had more log records in the post and when I removed them I messed up the description of the (21:174:37) LSN - its now corrected below)

So you can see the checkpoint that begins the backup was at (21:174:37). The LSN of the first log record that the backup contains is (21:58:37), which is before the start of the backup. And the backup contains all the log from then until (21:190:1). Now let's look at the actual transaction log to see what these LSNs correspond to.

SELECT [Current LSN], Operation, [Transaction Name] FROM fn_dblog (null,null);

GO

Here's some selected output:

Current LSN              Operation        Transaction Name
------------------------ ---------------- ------------------
00000015:0000003a:0025   LOP_BEGIN_CKPT   NULL
                (this is the calculated minimum LSN the backup must contain (21:58:37) - which is (15:3a:25) in hex)
.
.
00000015:00000061:0001   LOP_BEGIN_XACT   user_transaction
                (here's my transaction starting - before the backup started but within the LSN range contained in the backup)
.
.
00000015:000000ab:0004   LOP_BEGIN_XACT   Backup:InvalidateDiffMaps
                (this is the backup clearing the differential bitmaps)
.
.
00000015:000000ae:0025  LOP_BEGIN_CKPT    NULL
                (this is the checkpoint that BACKUP does - matching the checkpoint LSN above)
.
.

So - this clearly shows that the backup contains more than just the log from the time the backup was running.

Myth 2: It's possible to do a STOPAT with only a full database backup

This myth is that its possible to use STOPAT with a full database backup to stop during the time the backup was being taken. The argument FOR this myth is that the backup contains the log for all the changes that happened while the backup was being taken, so it must be possible to stop at any point in time. Technically, that's correct, but in practice it's wrong - you cannot stop at a point while the backup was running, using only the database backup.

This one's more complicated to explain. Doing a STOPAT operation means getting the database to a state where operations later than the time or LSN specified in the STOPAT clause haven't affected the database yet. A database backup reads pages that may or may not have been changed while the backup was running. If they are changed, it could be at any point while the backup is running.

Consider the case where page X is changed at LSN (10:45:12), *just* before the backup completes and is read by the backup at the time equivalent to LSN (10:45:13). The backup will contain the changed page image, plus the log record for the change. What if I want to stop at a point while the backup was running but *before* the change to page X, say at LSN (10:44:00). The backup only contains the image of page X at LSN (10:45:12) - how can it be put back to the image at the time we want to stop at? The argument is that we have the log record for the change - can't SQL Server just undo it?

No. It won't even see it. STOPAT works by recovering the database up to the point that the STOPAT specified. If we ask to stop at LSN (10:44:00), then the log will only be read and recovered up to that point. However, because the database backup didn't read page X until LSN (10:45:13), it only has the image of it from when it was altered at (10:45:12). This clearly won't give a database image as of (10:44:00).

The only way to stop at a particular time/LSN, is to have images of *all* database pages from before that time/LSN (i.e. from the *previous* database backup) and then restore all the transaction logs up to and including the time/LSN to stop at.

Hopefully that makes sense.

Two of the cool features in SQL Server 2005 are CROSS APPLY and DMVs (Dynamic Management Views). Now, far be it for me to get my hands dirty explaining developer stuff like CROSS APPLY :-) but I was having a discussion with Colin Leversuch-Roberts in the UK about the composability limitations of the sys.dm_db_index_physical_stats DMV. (Btw - you should check out Colin's blog post series on Analysing Indexes - lots of useful stuff).

So CROSS APPLY lets you do join-like functionality with table-valued functions that take parameters - which you can't do using JOIN. This works for most of the DMVs, but some of them are written to an older internal implementation that doesn't support CROSS APPLY, and sys.dm_db_index_physical_stats is one of them. If you try it you'll get an error like:

Msg 413, Level 16, State 1, Line 26

Correlated parameters or sub-queries are not supported by the inline function "master.sys.dm_db_index_physical_stats".

Fortunately there's a way around this - wrap the DMV in an artificial TVF of your own, and then CROSS APPLY to that. Here's an example:

 

CREATE FUNCTION my_index_physical_stats (

@database_id INT,

@object_id INT,

@index_id INT,

@partition_number INT,

@mode INT)

RETURNS @result TABLE (

database_id SMALLINT NULL, object_id INT NULL, index_id INT NULL, partition_number INT NULL,

index_type_desc NVARCHAR(60) NULL, alloc_unit_type_desc NVARCHAR (60) NULL, index_depth TINYINT NULL,

index_level TINYINT NULL, avg_fragmentation_in_percent FLOAT NULL, fragment_count BIGINT NULL,

avg_fragment_size_in_pages FLOAT NULL, page_count BIGINT NULL, avg_page_space_used_in_percent FLOAT NULL,

record_count BIGINT NULL, ghost_record_count BIGINT NULL, version_ghost_record_count BIGINT NULL,

min_record_size_in_bytes INT NULL, max_record_size_in_bytes INT NULL, avg_record_size_in_bytes FLOAT NULL,

forwarded_record_count BIGINT NULL)

BEGIN

INSERT INTO @result SELECT * FROM

sys.dm_db_index_physical_stats (@database_id, @object_id, @index_id, @partition_number, @mode)

RETURN

END;

GO

(I've heard from many of you that the Comments feature of my blog isn't working. I know - there's an issue with our blog engine that we're fixing. My apologies - I'll post a quick note when it's fixed.)

After posting last week about a BACKUP feature that I don't like (WITH NO_LOG - see here), I thought I'd do a quick post this week about a feature that was introduced in SS2005 for BACKUP that I DO like - the COPY_ONLY option to BACKUP DATABASE and BACKUP LOG.

Here's a situation I've seen several times that really screws people up. A savvy DBA of a busy web-fronted sales business has a rigorous backup schedule setup - daily full backups at midnight and differential backups every 4 hours. Everything's working perfectly. One day a disaster strikes at 7pm and the storage for the database is destroyed. The DBA starts restoring the backups using WITH NORECOVERY, gets to the noon differential backup and gets the following message:

RESTORE DATABASE production FROM DISK = 'c:\sqlskills\production-diff12pm.bck' WITH NORECOVERY;

GO

Msg 3136, Level 16, State 1, Line 1
This differential backup cannot be restored because the database has not been restored to the correct earlier state.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

Uh-oh. That's not good. That says that the 12pm differential backup does not have the same differential base as the 4am and 8am ones. How can that have happened? The DBA takes a look in the backup history tables in msdb (stripping out all the log backups):

SELECT name, backup_start_date, type, first_lsn, database_backup_lsn

FROM msdb.dbo.backupset WHERE database_name = 'production';

GO

name                           backup_start_date       type first_lsn            database_backup_lsn
------------------------------ ----------------------- ---- -------------------- --------------------
production Full 10/14/07       2007-10-14 00:00:00.000 D    88000000025300001    0
production Diff 4am 10/14/07   2007-10-14 04:00:00.000 I    118000000003000160   88000000025300001
production Diff 8am 10/14/07   2007-10-14 08:00:00.000 I    144000000070500160   88000000025300001
NULL                           2007-10-14 10:29:50.000 D    161000000056100147   88000000025300001
production Diff 12pm 10/14/07  2007-10-14 12:00:00.000 I    161000000062800034   161000000056100147
production Diff 4pm 10/14/07   2007-10-14 16:00:00.000 I    173000000054100144   161000000056100147

Aha! Look at the highlighted date in the output - someone took a full database backup of the database at 10.29am. The DBA checks and finds that one of the developers wanted a copy of the production database to play with so took a database backup. He restored the backup and then deleted both it and the database. Looking at the database_backup_lsn field, we can see that all the backups up till the accidental backup (look at the LSNs highlighted blue) have the differential base equal to the first_lsn of the full backup from midnight. The two backups after that have the differential base equal to the first_lsn of the accidental full backup (the LSNs highlighted red).

Oops! That means that the production database cannot be rolled forward any further than the last log backup before the accidental full backup was taken - losing more than 8 hours of data completely. Even though all the subsequent backups are intact, the initial full backup for them no longer exists so they're useless!

So how can a developer get a copy of the database without screwing up a recovery from a potential disaster? Using the new COPY_ONLY option. Taking a full backup with this option does not make the new backup a differential base - it does not clear any of the differential bitmaps and basically doesn't interfere with the regularly scheduled backups. Apart from that, it's a regular full backup of the database. One thing to bear in mind is that it's a one-off - you can't use one of these backups as a differential base, so you can't take COPY_ONLY differential backups. If you specify COPY_ONLY with DIFFERENTIAL, the COPY_ONLY option is ignored.

One other cool thing is that you can specify this option for a BACKUP LOG command too. This behaves the same way - it takes a log backup, but does not change the transaction log at all (i.e. it doesn't make any portion of the log inactive and permit log truncation). This is useful for doing online file restores without having the necessary backup of the tail of the log affect the log backup chain. More on that in a later post...

[Edit: In the initial version of this post, the DBA's backup strategy included log backups. I went through a couple of versions of this post before settling on full + diffs, but I forgot to remove the reference to log backups. In the first comment, Mark House correctly points out that an accidental full backup doesn't prevent a DBA with a complete log backup chain from recovering to any point in time. Apologies for the confusion!]

This has come up several times over the last few days, and is something that Kimberly and I discuss (and deplore) when we lecture - the use of BACKUP LOG WITH NO_LOG (or TRUNCATE_ONLY - they're synonymous) to allow log truncation.

How is it used?

The common use is when the transaction log grows to be inordinately large on a database that's in FULL (or BULK_LOGGED) recovery mode. This happens after a database backup has been taken, which switches the log into a mode where it won't truncate until its been backed up. In these circumstances, if you don't take a transaction log backup, the log will continue to grow until it runs out of disk space.

The alternative to taking a real log backup is to issue a BACKUP LOG dbname WITH NO_LOG command. Let's see this in action - note that I'm not advocating its use but I want to show you what it does. First off I'll create some transaction log after taking a full database backup:

USE nologtest;
GO

BACKUP DATABASE nologtest TO DISK = 'c:\sqlskills\nologtest.bck' WITH INIT;
GO

CREATE TABLE t1 (c1 INT, c2 VARCHAR (8000))
GO

SET NOCOUNT ON
GO

DECLARE @a INT;
SELECT @a = 1;
WHILE (@a < 10000)
BEGIN
   
INSERT INTO t1 VALUES (@a, REPLICATE ('a', 8000));
   
SELECT @a = @a + 1;
END;
GO

How large is the transaction log now?

SELECT name, size FROM sys.database_files;
GO

name            size
--------------- --------
nologtest       90264
nologtest_log   104128

A little bit larger than the data file (which is what I'd expect after the operation I just performed) and they're both around 100MB. I'll pretend that I haven't been paying attention to the size of the database and log and now I don't have any space to perform a backup. Can I just shrink the log?

DBCC SHRINKFILE (nologtest_log, 2);
GO

Cannot shrink log file 2 (nologtest_log) because all logical log files are in use.
DbId   FileId      CurrentSize MinimumSize UsedPages   EstimatedPages
------ ----------- ----------- ----------- ----------- --------------
11     2           102944      63          102944      56

No. What's stopping me (well DBCC SHRINKFILE just told me, but let's double-check)?

SELECT log_reuse_wait_desc FROM sys.databases WHERE NAME = 'nologtest';
GO

LOG_BACKUP

Ok - so I can't perform a backup so I'll use BACKUP LOG WITH NO_LOG.

BACKUP LOG nologtest WITH NO_LOG;
GO

SELECT log_reuse_wait_desc FROM sys.databases WHERE NAME = 'nologtest';
GO

NOTHING

Now it looks like I can do the shrink:

DBCC SHRINKFILE (nologtest_log, 2);
GO

DbId   FileId      CurrentSize MinimumSize UsedPages   EstimatedPages
------ ----------- ----------- ----------- ----------- --------------
11     2           256         63          256         56

Success!

Hold on, are you sure?

Why is it bad?

Do you realize what just happened? We discarded the contents of the transaction log since the last full/differential/log backup. That means that until the next full or differential backup completes, all work since the last backup completed will be lost if the database is destroyed. So if your business relies on not losing data/work in the event of a disaster - then what I just did is sacrilege!

The whole point of FULL (and BULK_LOGGED) recovery modes is to preserve the transaction log so that the database can be recovered right up to the point of a failure, or to do point-in-time restores. Using BACKUP LOG WITH NO_LOG negates all of that. (For completeness, note that you can't do a point-in-time restore to any time in a transaction log backup containing a bulk-logged transaction).

What are the alternatives?

If you don't want the FULL recovery mode behavior, then don't use FULL recovery mode - switch to SIMPLE mode. But be aware that you lose the ability to restore to any point in between your full or differential backups.

If you want the FULL recovery mode behavior, but don't want to run out of log space - then the answer is again simple (pun intended :-), take log backups! If your log is growing too quickly, take more frequent log backups! If you don't have enough disk space to store the log backups, talk to your management about the options - buying more disks and continuing to use the FULL recovery options, or being forced to move to SIMPLE recovery mode.

One thing to bear in mind - you actually have to monitor the size of your log to tell whether its growing. That's what gets people into trouble in the first place - a combination of:

  • Not knowing that the database is in FULL recovery mode
  • Not tracking the size of the log

The problem comes for those people who just use an app and have no idea that the developer put the database (maybe for a web application log) into FULL recovery mode, or even that they have SQL Server on their system. For these people there is no good answer unfortunately apart from educating developers.

How to prevent it being used

If you're a sysadmin who wants to stop your database owners and backup operators from using the NO_LOG option, there is way to do it. Trace flag 3231 in SS2000 and SS2005 will turn the NO_LOG and TRUNCATE_ONLY options into no-ops in FULL/BULK_LOGGED recovery mode, and will clear the log in SIMPLE recovery mode. Trace flag 3031 in SS2005 turns them in checkpoints in all recovery modes. These trace flags are undocumented but they're allowed to be publicized.

This way no-one can truncate the log unless you allow them to, and when they come to you to complain, you can educate them!

This is another question that came up on the Disaster Recovery forum on MSDN. Paraphrasing - ' I have a backup file containing full backups for 45 databases. How can I restore them all using a script?'

The answer is pretty straightforward. Let's create the situation described, using 3 databases for clarity rather than 45:

BACKUP DATABASE tinylogtest TO DISK = 'c:\sqlskills\mixedbackups.bck' WITH NAME = 'tinylogtest 10/09/07', INIT;

BACKUP DATABASE pagesplittest TO DISK = 'c:\sqlskills\mixedbackups.bck' WITH NAME = 'pagesplittest 10/09/07';

BACKUP DATABASE ghostrecordtest TO DISK = 'c:\sqlskills\mixedbackups.bck' WITH NAME = 'ghostrecordtest 10/09/07';

GO

You now can't do a regular restore from that backup file for any database except the first one in the file:

RESTORE DATABASE pagesplittest FROM DISK = 'c:\sqlskills\mixedbackups.bck' WITH REPLACE;

GO

Msg 1834, Level 16, State 1, Line 1

The file 'c:\tinylogtest\tinylogtest.mdf' cannot be overwritten. It is being used by database 'tinylogtest'.

Msg 3156, Level 16, State 4, Line 1

File 'fgt_mdf' cannot be restored to 'c:\tinylogtest\tinylogtest.mdf'. Use WITH MOVE to identify a valid location for the file.

Msg 1834, Level 16, State 1, Line 1

The file 'c:\tinylogtest\tinylogtest.ldf' cannot be overwritten. It is being used by database 'tinylogtest'.

Msg 3156, Level 16, State 4, Line 1

File 'fgt_log' cannot be restored to 'c:\tinylogtest\tinylogtest.ldf'. Use WITH MOVE to identify a valid location for the file.

Msg 3119, Level 16, State 1, Line 1

Problems were identified while planning for the RESTORE statement. Previous messages provide details.

Msg 3013, Level 16, State 1, Line 1

RESTORE DATABASE is terminating abnormally.

So how to restore the other databases apart from the first one? The answer is that you need to specify which of the backups within the file you're after. You can find the positions using the RESTORE HEADERONLY command:

RESTORE HEADERONLY FROM DISK = 'c:\sqlskills\mixedbackups.bck';

GO

This returns lots of information:

and then a whole bunch more columns and then ending with:

For our purposes, the interesting things to note are the Position, DatabaseName, and BackupTypeDescription. In our simple example we have 3 full backups. To restore the pagesplittest database as we tried to do above, we have to specifiy the position within the file of the backup we want to restore - just the database name isn't enough.

RESTORE DATABASE pagesplittest FROM DISK = 'c:\sqlskills\mixedbackups.bck' WITH REPLACE, FILE = 2;

GO

And it works. So the question asked how to do this, plus how to do it in a script. Below is a script I've adapted from the example I wrote for the Books Online for DBCC SHOWCONTIG back in 1999 when I rewrote DBCC SHOWCONTIG for SQL Server 2000.

Enjoy!

-- Create a temporary table to hold the output from RESTORE HEADERONLY

CREATE TABLE master.dbo.restoreheaderonly (

BackupName NVARCHAR (128), BackupDescription NVARCHAR (255), BackupType SMALLINT, ExpirationDate DATETIME,

Compressed TINYINT, Position SMALLINT, DeviceType TINYINT, UserName NVARCHAR (128), ServerName NVARCHAR (128),

DatabaseName NVARCHAR (128), DatabaseVersion INT, DatabaseCreationDate DATETIME, BackupSize NUMERIC (20, 0),

FirstLSN NUMERIC (25, 0), LastLSN NUMERIC (25,0), CheckpointLSN NUMERIC (25,0), DatabaseBackupLSN NUMERIC (25, 0),

BackupStartDate DATETIME, BackupFinishDate DATETIME, SortOrder SMALLINT, CodePage SMALLINT, UnicodeLocaleId INT,

UnicodeComparisonStyle INT, CompatibilityLevel TINYINT, SoftwareVendorId INT, SoftwareVersionMajor INT,

SoftwareVersionMinor INT, SoftwareVersionBuild INT, MachineName NVARCHAR (128), Flags INT, BindingID UNIQUEIDENTIFIER,

RecoveryForkID UNIQUEIDENTIFIER, Collation NVARCHAR (128), FamilyGUID UNIQUEIDENTIFIER, HasBulkLoggedData BIT,

IsSnapshot BIT, IsReadOnly BIT, IsSingleUser BIT, HasBackupChecksums BIT, IsDamaged BIT, BeginsLogChain BIT,

HasIncompleteMetaData BIT, IsForceOffline BIT, IsCopyOnly BIT, FirstRecoveryForkID UNIQUEIDENTIFIER,

ForkPointLSN NUMERIC (25, 0) NULL, RecoveryModel NVARCHAR (60), DifferentialBaseLSN NUMERIC (25, 0) NULL,

DifferentialBaseGUID UNIQUEIDENTIFIER, BackupTypeDescription NVARCHAR (60), BackupSetGUID UNIQUEIDENTIFIER NULL);

GO

 

-- Populate the table

INSERT INTO master.dbo.restoreheaderonly EXEC ('RESTORE HEADERONLY FROM DISK = ''C:\sqlskills\mixedbackups.bck''') ;

GO

 

DECLARE @Position SMALLINT;

DECLARE @DatabaseName NVARCHAR (128);

DECLARE @ExecString NVARCHAR (255);

 

-- Declare a cursor to iterate over the results

DECLARE databases CURSOR FOR

SELECT Position, DatabaseName FROM master.dbo.restoreheaderonly WHERE BackupTypeDescription = 'Database';

 

-- Open the cursor.

OPEN databases;

 

-- Loop through the databases.

FETCH NEXT FROM databases INTO @Position, @DatabaseName;

WHILE @@FETCH_STATUS = 0

BEGIN

SELECT @ExecString = 'RESTORE DATABASE ' + RTRIM (@DatabaseName) +

' FROM DISK = ''C:\sqlskills\mixedbackups.bck''' +

' WITH REPLACE, FILE = ' + RTRIM (CONVERT (VARCHAR (10), @Position));

SELECT 'Restoring database ' + RTRIM (@DatabaseName);

EXEC (@ExecString);

FETCH NEXT FROM databases INTO @Position, @DatabaseName;

END;

 

-- Close and deallocate the cursor.

CLOSE databases;

DEALLOCATE databases;

 

-- Delete the temporary table.

DROP TABLE master.dbo.restoreheaderonly;

GO

Over the years I was in the Storage Engine team I saw a lot of concern on the various forums about the ghost cleanup task. There have been a few bugs with it in previous versions  (see these KB articles - 932115 and 815594) and there's very little info available on it. For some reason I didn't get around to posting about it on my old blog but today I want to go into some depth on it.

So what is ghost cleanup? It's a background process that cleans up ghost records - usually referred to as the ghost cleanup task. What's a ghost record? As I described briefly in the Anatomy of a record post last week, a ghost record is one that's just been deleted in an index on a table (well, actually it gets more complicated if snapshot isolation of some form is enabled but for now, a record in an index is a good start). Such a delete operation never physically removes records from pages - it only marks them as having been deleted, or ghosted. This is a performance optimization that allows delete operations to complete more quickly. It also allows delete operations to rollback more quickly because all that needs to happen is to unmark the records as being deleted/ghosted, instead of having to reinsert the deleted records. The deleted record will be physically removed (well, its slot will be removed - the record data isn't actually overwritten) later by the background ghost cleanup task. The ghost cleanup task will leave a single record on the page to avoid having to deallocate empty data or index pages.

The ghost cleanup task can't physically delete the ghost records until after the delete transaction commits because the deleted records are locked and the locks aren't released until the transaction commits. As an aside, when ghost records exist on a page, even a NOLOCK or READ UNCOMMITTED scan won't return them because they are marked as ghost records.

When a record is deleted, apart from it being marked as a ghost record, the page that the record is on is also marked as having ghost records in one of the allocation maps - the PFS page (post coming soon!) - and in its page header. Marking a page as having ghost records in a PFS page also changes the database state to indicate that there are some ghost records to cleanup - somewhere. Nothing tells the ghost cleanup task to clean the specific page that the delete happened on - yet. That only happens when the next scan operation reads the page and notices that the page has ghost records.

The ghost cleanup task doesn't just start up when it's told to - it starts up in the background every 5 seconds and looks for ghost records to cleanup. Remember that it won't be told to go cleanup a specific page by a delete operation - it's a subsequent scan that does it, if a scan happens. When the ghost cleanup task starts up it checks to see if its been told to cleanup a page - if so it goes and does it. If not, it picks the next database that is marked as having some ghost records and looks through the PFS allocation map pages to see if there are any ghost records to cleanup. It will check through or cleanup a limited number of pages each time it wakes up - I remember the limit is 10 pages - to ensure it doesn't swamp the system. So - the ghost records will eventually be removed - either by the ghost cleanup task processing a database for ghost records or by it specifically being told to remove them from a page. If it processes a database and doesn't find any ghost records, it marks the database as not having any ghost records so it will be skipped next time.

How can you tell its running? On SQL Server 2005, you can use the following code to see the ghost cleanup task in sys.dm_exec_requests:

SELECT * INTO myexecrequests FROM sys.dm_exec_requests WHERE 1 = 0;

GO

SET NOCOUNT ON;

GO

DECLARE @a INT

SELECT @a = 0;

WHILE (@a < 1)

BEGIN

INSERT INTO myexecrequests SELECT * FROM sys.dm_exec_requests WHERE command LIKE '%ghost%'

SELECT @a = COUNT (*) FROM myexecrequests

END;

GO

SELECT * FROM myexecrequests;

GO

And on SQL Server 2000 you need to use sysprocesses (well, on SQL Server 2005 this works as well but its fake view derived from the DMVs):

SELECT * INTO mysysprocesses FROM master.dbo.sysprocesses WHERE 1 = 0;

GO

SET NOCOUNT ON;

GO

DECLARE @a INT

SELECT @a = 0;

WHILE (@a < 1)

BEGIN

INSERT INTO mysysprocesses SELECT * FROM master.dbo.sysprocesses WHERE cmd LIKE '%ghost%'

SELECT @a = COUNT (*) FROM mysysprocesses

END;

GO

SELECT * FROM mysysprocesses;

GO

The output from sys.dm_exec_requests is (with most unused and uninteresting columns stripped off):

session_id request_id  start_time              status       command
---------- ----------- ----------------------- ------------ ----------------
15         0           2007-10-05 16:34:49.653 background   GHOST CLEANUP

So how can you tell if a record is ghosted? Let's engineer some and look at it with DBCC PAGE - I've stripped out the uninteresting bits of the output and highlighted the interesting ghost parts:

CREATE TABLE t1 (c1 CHAR(10))

CREATE CLUSTERED INDEX t1c1 on t1 (c1)

GO

BEGIN TRAN

INSERT INTO t1 VALUES ('PAUL')

INSERT INTO t1 VALUES ('KIMBERLY')

DELETE FROM t1 WHERE c1='KIMBERLY';

GO

DBCC IND ('ghostrecordtest', 't1', 1);

GO

DBCC TRACEON (3604);

GO

DBCC PAGE ('ghostrecordtest', 1, 143, 3);

GO

<snip>

m_freeData = 130    m_reservedCnt = 0   m_lsn = (20:88:20)
m_xactReserved = 0  m_xdesId = (0:518)  m_ghostRecCnt = 1
m_tornBits = 0

<snip>

Slot 0 Offset 0x71 Length 17

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP
Memory Dump @0x6256C071

00000000:   1c000e00 4b494d42 45524c59 20200200 †....KIMBERLY  ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 10

c1 = KIMBERLY

Slot 1 Offset 0x60 Length 17

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6256C060

00000000:   10000e00 5041554c 20202020 20200200 †....PAUL      ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 1 Column 1 Offset 0x4 Length 10

c1 = PAUL

Let's see what goes on the transaction log during this process (remember this is undocumented and unsupported - do it on a test database) - I've stripped off a bunch of the columns in the output:

DECLARE @a CHAR (20)

SELECT @a = [Transaction ID] FROM fn_dblog (null, null) WHERE [Transaction Name]='PaulsTran'

SELECT * FROM fn_dblog (null, null) WHERE [Transaction ID] = @a;

GO

Current LSN              Operation         Context             Transaction ID
------------------------ ----------------- ------------------- --------------
00000014:00000054:0011   LOP_BEGIN_XACT    LCX_NULL            0000:00000206
00000014:0000005a:0012   LOP_INSERT_ROWS   LCX_CLUSTERED       0000:00000206
00000014:0000005a:0013   LOP_INSERT_ROWS   LCX_CLUSTERED       0000:00000206
00000014:0000005a:0014   LOP_DELETE_ROWS   LCX_MARK_AS_GHOST   0000:00000206
00000014:0000005a:0016   LOP_DELETE_ROWS   LCX_MARK_AS_GHOST   0000:00000206

So there are the two inserts followed by the two deletes - with the rows being marked as ghost records. But where's the update to the PFS page? Well, changing the ghost bit in a PFS page is not done as part of a transaction. We'll need to look for it another way (apart from just dumping everything in the transaction log and searching manually):

SELECT Description, * FROM fn_dblog (null, null) WHERE Context like '%PFS%' AND AllocUnitName like '%t1%';

GO

Description               Current LSN              Operation        Context   Transaction ID
------------------------- ------------------------ ---------------- --------- ----------------
Allocated 0001:0000008f   00000014:00000054:0014   LOP_MODIFY_ROW   LCX_PFS   0000:00000208
                          00000014:0000005a:0015   LOP_SET_BITS     LCX_PFS   0000:00000000

The first one is just allocating a page but the second one is the one we're looking for - it's changed the bit for the page to say it has ghost records on. Let's commit the transaction and see what happens, filtering out all the previous transaction log:

SELECT MAX ([Current LSN]) FROM fn_dblog (null, null);

GO

-- 00000014:0000005e:0001

COMMIT TRAN

GO

SELECT [Page ID], * FROM fn_dblog (null, null) WHERE [Current LSN] > '00000014:0000005e:0001';

GO

Page ID         Current LSN              Operation          Context         Transaction ID
--------------- ------------------------ ------------------ --------------- --------------
NULL            00000014:0000005f:0001   LOP_COMMIT_XACT    LCX_NULL        0000:00000206
0001:0000008f   00000014:00000060:0001   LOP_EXPUNGE_ROWS   LCX_CLUSTERED   0000:00000000

We see that almost as soon as the transaction has commited, the ghost cleanup task goes in and process the page. Let's check a page dump to make sure the record is gone, and show that the contents of the record are still on the page (again, with non-relevant bits snipped out):

DBCC PAGE ('ghostrecordtest', 1, 143, 3);

GO

<snip>

m_freeData = 130         m_reservedCnt = 0        m_lsn = (20:94:1)
m_xactReserved = 0       m_xdesId = (0:518)       m_ghostRecCnt = 0
m_tornBits = 0

<snip>

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x6212C060

00000000:   10000e00 5041554c 20202020 20200200 †....PAUL      ..
00000010:   fc†††††††††††††††††††††††††††††††††††.
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 10

c1 = PAUL

DBCC PAGE ('ghostrecordtest', 1, 143, 2);

GO

<snip>

6212C040:   01000000 00000000 00000000 00000000 †................
6212C050:   00000000 00000000 00000000 00000000 †................
6212C060:   10000e00 5041554c 20202020 20200200 †....PAUL      ..
6212C070:   fc1c000e 004b494d 4245524c 59202002 †.....KIMBERLY  .
6212C080:   00fc0000 00000000 00000000 01000000 †................
6212C090:   00000000 13000000 01000000 00000000 †................

<snip>

So even though the record no longer exists, all that happened was that the slot was removed from the slot array at the end of the page - the record contents will remain on the page until the space is reused.

In the next post I'll go into details of the PFS and other allocation maps. Btw - please let me know if this stuff is interesting - I'd like to know where to spend blogging time. Thanks!

Here's a really interesting question that was in my search engine logs yesterday - if I have a transaction that runs and completes while a backup is running, will the complete transaction be in the backup? The answer is.... it depends!

In terms of what gets backed up, the way a full backup works is:

  1. Note the transaction log's LSN (Log Sequence Number)
  2. Read all allocated extents in the various data files
  3. Note the LSN again
  4. Read all the transaction log between the starting LSN and the ending LSN

Any transaction that commits before or on the LSN read in step 3 will be fully reflected when the database is restored. If not, the transaction will be undone. So you can't just go by the completion time of the backup and the completion time of the transaction. The transaction may well commit before the backup operation completes, but it may complete during step 4, and so it will get rolled back during a restore. In this case, it's necessary to take a log backup as well and restore that too to make the transaction be fully reflected after a restore.

As you know I always like to prove things  - so here's my proof of what I just said. I'm going to use the AdventureWorks database to do this. First thing is to set it to full recovery mode (and take the first full backup to start full recovery mode logging):

ALTER DATABASE AdventureWorks SET RECOVERY FULL;

BACKUP DATABASE AdventureWorks TO DISK='C:\SQLskills\AdventureWorks.bck' WITH INIT;

GO

Now I'm going to flush out the backup history tables in MSDB:

USE msdb;

GO

EXEC sp_delete_backuphistory '10/6/07';

GO

I've got a really contrived example that I'm going to use to show whether my transaction is wholely contained in the full backup. Using the HumanResources.Employee table, there's a column VacationHours which I'm going to set to 0 and then force all the changes pages to disk. This is my base state:

UPDATE AdventureWorks.HumanResources.Employee SET VacationHours = 0;

GO

CHECKPOINT;

GO

My contrived transaction is going to be in a tight loop updating all the rows in the table, which will generate lots of transaction log, and timed to complete just before the backup completes (i.e. in stage 4). In one connection I start the backup:

BACKUP DATABASE AdventureWorks TO DISK='C:\SQLskills\AdventureWorks.bck' WITH INIT;

SELECT GETDATE ();

GO

and in another I start my contrived transaction, after starting the backup:

BEGIN TRAN

DECLARE @a INT

DECLARE @b INT

SELECT @a = 1

WHILE (@a < 6)

BEGIN

SELECT @b = 1

WHILE (@b < 201)

BEGIN

UPDATE AdventureWorks.HumanResources.Employee SET VacationHours = @b

SELECT @b=@b+1

END

SELECT @a=@a+1

END

COMMIT TRAN;

SELECT GETDATE ();

GO

It's not pretty but it does the job. The backup finishes at 2007-10-05 17:42:38.983 and the transaction finishes at 2007-10-05 17:42:38.107 - before the backup finishes. Remember I set the VacationHours all to zero before running my transaction - let's check the transaction did actually change them:

SELECT MAX (VacationHours) FROM AdventureWorks.HumanResources.Employee;

GO

This returns 200 - which is what I'd expect. But is the transaction wholely contained in the backup? Let's look at the backup history to find out the last LSN that was captured in the full backup:

SELECT Backup_Start_Date, Backup_Finish_Date, First_LSN, Last_LSN

FROM msdb.dbo.backupset WHERE database_name = 'AdventureWorks';

GO

And the output we get is:

Backup_Start_Date       Backup_Finish_Date      First_LSN           Last_LSN
----------------------- ----------------------- ------------------- -------------------
2007-10-05 17:42:22.000 2007-10-05 17:42:38.000 86000000001600029   91000000625600001

The LSNs are in decimal, so we need to convert the three numbers to hex so we can compare them to what's in the log - giving: 5B:1870:1. This is the LSN of the last log record that was backed up in the full backup. Now let's take a look at the transaction log for AdventureWorks using the undocumented fn_dblog function. This is undocumented but very well known. It's a fully composable alternative to using the old DBCC LOG command.

USE AdventureWorks;

GO

SELECT [Current LSN], Operation, [Transaction ID], AllocUnitName FROM fn_dblog (NULL, NULL);

GO

The log record at that LSN is:

00000058:00001870:0001  LOP_MODIFY_ROW     0000:00001338  HumanResources.Employee.PK_Employee_EmployeeID

This is obviously in the middle of my contrived transaction - showing that it isn't all in the full backup. The end of the transaction isn't until way later in the log:

0000005e:00000628:01b1  LOP_MODIFY_ROW     0000:00001338  HumanResources.Employee.PK_Employee_EmployeeID
0000005e:00000628:01b2  LOP_COMMIT_XACT    0000:00001338  NULL

Before I do anything else, I want to take a log backup to preserve my transaction:

BACKUP LOG AdventureWorks TO DISK='C:\SQLskills\AdventureWorks_Log.bck' WITH INIT;

GO

Now I want to restore the full backup and really show that my transaction isn't all in there:

USE master;

GO

RESTORE DATABASE AdventureWorks FROM DISK='C:\SQLskills\AdventureWorks.bck' WITH REPLACE, RECOVERY;

GO

SELECT MAX (VacationHours) FROM AdventureWorks.HumanResources.Employee;

GO

This returns 0. Clearly my transaction isn't all in there, and the parts that are were rolled back during the restore. Now let's do the same thing but using NORECOVERY for the restore of the full backup and also applying the log backup I took:

RESTORE DATABASE AdventureWorks FROM DISK='C:\SQLskills\AdventureWorks.bck' WITH REPLACE, NORECOVERY;

GO

RESTORE LOG AdventureWorks FROM DISK='C:\SQLskills\AdventureWorks_Log.bck' WITH RECOVERY;

GO

SELECT MAX (VacationHours) FROM AdventureWorks.HumanResources.Employee;

GO

This time the SELECT returns 200. And now you can start playng around with fn_dblog if you didn't know about it before. I'll be posting more about Storage Engine internals that you can figure out from the transaction log in future.

(This is an updated repost from earlier this year on my old blog)

Whenever I’m discussing index maintenance, and specifically fragmentation, I always make a point of saying ‘Make sure the index is being used before doing anything about fragmentation’. If an index isn’t being used very much, but has very low page density (lots of free space in the index pages), then it will be occupying a lot more disk space than it could do and it may be worth compacting (with a rebuild or a defrag) to get that disk space back. However, usually there’s not much point spending resources to remove any kind of fragmentation when an index isn’t being used. This is especially true of those people who rebuild all indexes every night or every week.

 

You could even go so far as to say if a non-clustered index isn’t being used, why is it there at all? Extra non-clustered indexes drag down performance in a number of ways. Consider a non-clustered index called IX_MyNCIndex on the table MyTable:

  • Any time a record is inserted into MyTable, a matching record is inserted into IX_MyNCIndex. This is a bunch of extra IOs, extra log records, plus maybe even a page-split.
  • Any time a record is deleted from MyTable, the matching record in IX_MyNCIndex must be deleted. Extra IOs again and log records again.
  • Any time a record in MyTable is updated:
    • If MyTable has a clustered index, and the clustered index key value changes, then the matching record in IX_MyNCIndex must be updated. Extra IOs and log records again.
    • If any of the non-clustered index key values changes, or any of the INCLUDEd column values changes, then the matching record in IX_MyNCIndex must be updated. Extra IOs and log records again.
  • If a clustered index is created on MyTable, then IX_MyNCIndex has to be rebuilt to include the logical RIDs rather than the physical heap RIDs (see this post for an explanation). Lot of extra IOs and log records again.

That’s a significant amount of extra IOs and log records to maintain each extraneous non-clustered index.

 

So, how can you tell if an index is being used?

  • In SQL Server 2000 there is no way to do it
  • In SQL Server 2005 there are a few different ways in SQL Server 2005 – the one I want to discuss in this post is the sys.dm_db_index_usage_stats DMV.

This DMV exposes the information that is tracked about index usage (as the name suggests). It does not generate any information itself; it just returns info from a cache inside SQL Server. This cache is empty when the server instance starts, and is not persisted across instance restarts. All cache entries for indexes in a database are removed when that database is closed. So, the cache tracks usage information about which indexes have been used since the database they are part of was last opened (either manually or as part of instance start-up).

 

This continues to confuse people so I'll call it out: if the output from the DMV does not have an entry for the index you're interested in, it has not been used since the last database startup.

 

The cache tracks the following info for each index (for user queries and system queries):

  • The number of times it was used in a seek operation (either looking up a single row, or doing a range scan) along with the time of the last seek.
  • The number of times it was used in a scan operation (e.g. a select * operation) along with the time of the last scan
  • The number of times it was used in a lookup operation (this means a bookmark lookup – where a non-clustered index does not fully cover a query and additional columns must be retrieved from the base table row) along with the time of the last lookup.
  • The number of times it was used in an update operation (this counts inserts, updates, and deletes) along with the time of the last update.

Let’s have a look at its use.

SELECT * FROM sys.dm_db_index_usage_stats;

GO

The output is too wide for a single image so I've split it in two (I won't post any more output from the DMV - I'll just talk about it):

indexusage11.jpg

indexusage21.jpg

Unless you've just re-started your instance, you'll see a bunch of output from this, representing all index activity since the instance/databases started. If you're interested in whether an index is being used, you can filter the output. Let's focus in on a particular table  - AdventureWorks.Person.Address.

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

You'll probably see nothing in the output, unless you've been playing around with that table. Let's force the clustered index on that table to be used, and look at the DMV output again.

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

Now there's a single row, showing a scan on the clustered index. Let's do something else.

SELECT StateProvinceID FROM AdventureWorks.Person.Address

WHERE StateProvinceID > 4 AND StateProvinceId < 15;

GO

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

And there's another row, showing a seek in one of the table's non-clustered indexes.

So, its easy to look at the index usage for particular tables and indexes. But how can you monitor this over time? This is easy too - let's see how. First we need to create our own table to store snapshots of the DMV output.

IF OBJECTPROPERTY (object_id (N'master.dbo.MyIndexUsageStats'), 'IsUserTable') = 1 DROP TABLE dbo.MyIndexUsageStats;

GO

SELECT GETDATE () AS ExecutionTime, * INTO master.dbo.MyIndexUsageStats

FROM sys.dm_db_index_usage_stats WHERE database_id=0;

GO

Next we need to take a baseline snapshot of the DMV output.

INSERT master.dbo.MyIndexUsageStats

SELECT getdate (), * FROM sys.dm_db_index_usage_stats;

GO

And now simulate a few operations and take another snapshot of the DMV:

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT StateProvinceID FROM AdventureWorks.Person.Address

WHERE StateProvinceID > 4 AND StateProvinceId < 15;

GO

INSERT master.dbo.MyIndexUsageStats

SELECT getdate (), * FROM sys.dm_db_index_usage_stats;

GO

And look at the filtered contents of our snapshot table:

SELECT * FROM master.dbo.MyIndexUsageStats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

You should see four rows - two from the baseline snapshot and two from the final snapshot. If you ran just the statements above, you'll see that the user_scans count for the clustered index has increased by two, and the user_seeks count for the non-clustered index has increased by one.

So this is a pretty simple example of how you can track index usage. By putting something like this into a regularly run script you can tell which indexes aren't being used and could be candidates for less-regular index maintenance or removal altogether.

Let me know how you get on.

I mentioned this in my Anatomy of a page post - its a common misconception that records in an index are ALWAYS stored in the same physical order as the logical order defined by the index key. Here's proof for you that this is incorrect (as well as introducing you to the other dump styles for DBCC PAGE).

I'm going to create a table with a clustered index on an integer column and keep the table to a single page for simplicity:

USE MASTER;

GO

IF DATABASEPROPERTY (N'rowordertest', 'Version') > 0 DROP DATABASE rowordertest;

GO

CREATE DATABASE rowordertest;

GO

USE rowordertest;

GO

CREATE TABLE t1 (c1 INT, c2 VARCHAR (10));

CREATE CLUSTERED INDEX t1c1 ON t1 (c1);

GO

Now I'm going to insert a few rows into the table, with c1 from 2 to 5 - conspicuously not inserting c1 = 1:

INSERT INTO t1 VALUES (2, REPLICATE ('b', 10));

INSERT INTO t1 VALUES (3, REPLICATE ('c', 10));

INSERT INTO t1 VALUES (4, REPLICATE ('d', 10));

INSERT INTO t1 VALUES (5, REPLICATE ('e', 10));

GO

Now, using DBCC IND  we see that the data page is (1:143) and dumping that with DBCC PAGE gives the following (skipping the header output):

DBCC IND ('rowordertest', 't1', 1);

GO

DBCC TRACEON (3604);

GO

DBCC PAGE ('rowordertest', 1, 143, 3);

GO

Slot 0 Offset 0x60 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5BA3C060

00000000:   30000800 02000000 0300f802 0011001b †0...............
00000010:   00626262 62626262 626262†††††††††††††.bbbbbbbbbb
UNIQUIFIER = [NULL]  

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 0 Column 2 Offset 0x11 Length 10

c2 = bbbbbbbbbb

<snip slots 1 and 2>                   

Slot 3 Offset 0xb1 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5BA3C0B1

00000000:   30000800 05000000 0300f802 0011001b †0...............
00000010:   00656565 65656565 656565†††††††††††††.eeeeeeeeee
UNIQUIFIER = [NULL]

Slot 3 Column 1 Offset 0x4 Length 4

c1 = 5

Slot 3 Column 2 Offset 0x11 Length 10

c2 = eeeeeeeeee

DBCC PAGE with dump-style 3 always outputs the records on a page in their logical order (because that's how the slot array is ordered). Notice that the record with c1 = 2 is stored at offset 0x60 in the page and the last record on the page with c1 = 5 is stored at offset 0xb1. Now we'll insert a record with c1 =1. This will become the first logical record in the index, but will it cause the page to be shuffled so the records can all be stored in logical order?

INSERT INTO t1 VALUES (1, REPLICATE ('a', 10));

GO

DBCC PAGE ('rowordertest', 1, 143, 3);

GO

Slot 0 Offset 0xcc Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x61FCC0CC

00000000:   30000800 01000000 0300f802 0011001b †0...............
00000010:   00616161 61616161 616161†††††††††††††.aaaaaaaaaa
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x11 Length 10

c2 = aaaaaaaaaa

Slot 1 Offset 0x60 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

<snip>

The answer is no. Even though the record with c1 =1 is output by DBCC PAGE first, look at its offset within the page - 0xCC - clearly the last record on the page and stored in a different physical order than the logical order defined by the index key. Further proof can be obtained by looking at a raw hex dump of the page using dump style 2 with DBCC PAGE:

DBCC PAGE ('rowordertest', 1, 143, 2);

GO

<snip>

6204C000:   01010400 00400001 00000000 00000800 †.....@..........
6204C010:   00000000 00000500 44000000 0f1fe700 †........D.......
6204C020:   8f000000 01000000 13000000 60000000 †............`...
6204C030:   16000000 00000000 00000000 00000000 †................
6204C040:   01000000 00000000 00000000 00000000 †................
6204C050:   00000000 00000000 00000000 00000000 †................
6204C060:   30000800 02000000 0300f802 0011001b †0...............
6204C070:   00626262 62626262 62626230 00080003 †.bbbbbbbbbb0....
6204C080:   00000003 00f80200 11001b00 63636363 †............cccc
6204C090:   63636363 63633000 08000400 00000300 †cccccc0.........
6204C0A0:   f8020011 001b0064 64646464 64646464 †.......ddddddddd
6204C0B0:   64300008 00050000 000300f8 02001100 †d0..............
6204C0C0:   1b006565 65656565 65656565 30000800 †..eeeeeeeeee0...
6204C0D0:   01000000 0300f802 0011001b 00616161 †.............aaa
6204C0E0:   61616161 61616100 00000000 00000000 †aaaaaaa.........
6204C0F0:   00000000 00000000 00000000 00000000 †................

<snip>

You can clearly see that the last row I inserted, with c1 = 1 and the replicated 'a's is stored after the other records on the page, even though its key is logically before the others.

And just to nail the point home, doing a dump style 1 with DBCC PAGE will dump out the slot array for us:

DBCC PAGE ('rowordertest', 1, 143, 1);

GO

<snip>

OFFSET TABLE:

Row - Offset
4 (0x4) - 177 (0xb1)
3 (0x3) - 150 (0x96)
2 (0x2) - 123 (0x7b)
1 (0x1) - 96 (0x60)
0 (0x0) - 204 (0xcc)

<snip>

The slot array grows backwards, which is why its dumped in what looks like reverse logical order. You can see that slot 0, which represents the first logical record on the page, is stored at an offset greater than the others.

This is a follow-on article from two posts:

People get themselves into situations where they have no backups (or damaged backups) and the data or log files are damaged such that the only way to access the database is with EMERGENCY mode. In these situations, prior to SQL Server 2005, there was no documented or supported way to fix a database while in EMERGENCY mode - the only guidance could be found on the Internet or from calling Product Support and paying for help. The sequence of events was:

  1. Hack the system tables to get the database into 'emergency' mode.
  2. Use the undocumented and unsupported DBCC REBUILD_LOG command to build a new transaction log.
  3. Run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option to fix up corruptions in the data files - both those that may have caused the issue, and those caused by rebuilding the transaction log (e.g. because an active transaction altering the database structure was lost).
  4. Figure out what data was lost or is transactionally inconsistent (e.g. because a transaction altering multiple tables was lost) as far as your business logic is concerned
  5. Take the database out of emergency mode
  6. And then all the other stuff like root-cause analysis and getting a better backup strategy

I decided to add a new feature to SQL Server 2005 called EMERGENCY mode repair that will do steps 2 and 3 as an atomic operation. The reasons for this were:

  • Much of the advice of how to do this on the Internet missed steps out (particularly missing step 3!)
  • The DBCC REBUILD_LOG command was unsupported and undocumented and we didn't like advising customers to use it
  • Adding a documented last-resort method of recovering from this situation would reduce calls to Product Support - saving time and money for customers and Microsoft.

So, when in EMERGENCY mode, you can use DBCC CHECKDB to bring the database back online again. The only repair option allowed in EMERGENCY mode is REPAIR_ALLOW_DATA_LOSS and it does a lot more than usual:

  • Forces recovery to run on the transaction log (if it exists). You can think of this as 'recovery with CONTINUE_AFTER_ERROR' - see this post for more details on the real CONTINUE_AFTER_ERROR option for BACKUP and RESTORE. The idea behind this is that the database is already inconsistent because either the transaction log is corrupt or something in the database is corrupt in such a way that recovery cannot complete. So, given that the database is inconsistent and we're about to rebuild the transaction log, it makes sense to salvage as much transactional information as possible from the log before we throw it away and build a new one.
  • Rebuild the transaction log - but only if the transaction log is corrupt.
  • Run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option.
  • Set the database state to ONLINE.

It's a one-way operation and can't be rolled back. I always advise taking a copy of the database files before doing this in case something goes wrong or there are unrepairable errors. And if it does? Probably time to update your resume for not having a water-tight backup and disaster-recovery strategies in place. Saying that, I've never seen it fail. I can think of some pathalogical cases where it would fail though (involving the file system itself having problems) but that's really unlikely.

Let's walk-through an example of using it. I'm assuming there's a database called emergencydemo that's in the same state as at the end of the Search Engine Q&A #4 blog post - the database has no log file, is in EMERGENCY mode and the salaries table is corrupt.

First off I'll try bringing the database online, just to see what happens:

ALTER DATABASE emergencydemo SET ONLINE;
GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 945, Level 14, State 2, Line 1
Database 'emergencydemo' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.
Failed to restart the current database. The current database is switched to master.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.

The first message makes sense - the database knows it needs to be recovered because it wasn't cleanly shut down, but the log file simply isn't there. The second message is from the new feature in 2005 that will automatically create a log file if one is missing on startup or attach - as long as the database was cleanly shut down. The 945 and 5069 errors are self-explanatory but notice that the database gets switched to master underneath us. This has bitten me several times in the past.

Well, I expected that not to work. Let's run emergency-mode repair:

DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
GO

Msg 945, Level 14, State 2, Line 1
Database 'emergencydemo' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Hmm - looks like the failed ALTER DATABASE statement did change the state - but what to?

SELECT state_desc FROM sys.databases WHERE name='emergencydemo';
GO

state_desc
------------------------------------------------------------
RECOVERY_PENDING

That makes sense I guess. Ok - back to emergency mode and run repair:

ALTER DATABASE emergencydemo SET EMERGENCY;
GO
DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
GO

Msg 7919, Level 16, State 3, Line 1
Repair statement not processed. Database needs to be in single user mode.

EMERGENCY mode is not SINGLE_USER mode - a database needs to be in SINGLE_USER mode for repair to run and EMERGENCY mode allows multiple connections from members of the sysadmin role. You can set SINGLE_USER mode as well as EMERGENCY mode - however, the sys.databases field state_desc will still just say EMERGENCY.

ALTER DATABASE emergencydemo SET SINGLE_USER;
GO
DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

This time it worked. First of all we get the same error as if we tried to bring the database online - that's from the code that's trying to run 'recovery with CONTINUE_AFTER_ERROR' on the transaction log. Next we get a nice long warning that the transaction log has been rebuilt and the consequences of doing that (basically that you need to start a new log backup chain by taking a full backup). If there had been any corruptions we'd have seen the usual output from DBCC CHECKDB about what errors it found and fixed. There's also a bunch of stuff in the error log

2007-10-02 17:21:20.95 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:20.96 spid51      Error: 17207, Severity: 16, State: 1.
2007-10-02 17:21:20.96 spid51      FileMgr::StartLogFiles: Operating system error 2(The system cannot find the file specified.) occurred while creating or opening file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF'. Diagnose and correct the operating system error, and retry the operation.
2007-10-02 17:21:20.96 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:21.10 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:21.18 spid51      Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
2007-10-02 17:21:21.18 spid51      Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
2007-10-02 17:21:21.99 spid51      EMERGENCY MODE DBCC CHECKDB (emergencydemo, repair_allow_data_loss) WITH no_infomsgs executed by ROADRUNNERPR\paul found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 1 seconds.

Note that the usual error log entry from running DBCC CHECKDB is preceded by 'EMERGENCY MODE' this time.

Checking the database state:

SELECT state_desc FROM sys.databases WHERE name='emergencydemo';
GO

state_desc
------------------------------------------------------------
ONLINE

we find that it's been brought back online again because everything worked. It's still SINGLE_USER though so let's make it MULTI_USER and see what happened to our table:

ALTER DATABASE emergencydemo SET MULTI_USER;
GO
USE EMERGENCYDEMO;
GO
SELECT * FROM salaries;
GO

FirstName            LastName             Salary
-------------------- -------------------- -----------
John                 Williamson           10000
Stephen              Brown                0
Jack                 Bauer                10000

(3 row(s) affected)

And of course its still corrupt - because even though the transaction log was rebuilt and repaired, the original transaction that changed the salary to 0 never got a chance to rollback becuase I deleted the transaction log (in the previous post).

Now remember, you should only use this as a last resort, but if you do get yourself into trouble, you know there's a command that should be able to help you.

By far the most common search engine query leading to the blog is about fixing a suspect or unrecovered database. The very best way to do this is to use your backups, and to have a backup strategy that allows you to recover in the smallest amount of time and with no data loss. But what if you don't have a backup for some reason? Well, it depends what's damaged in the database and when the damage is noticed.

There are three states the database can be in when its damaged:

  1. ONLINE
    • If it's one of the data files that is damaged and the damage is hit during a query or some other operation then the database will remain online and accessible.
  2. RECOVERY PENDING
    • The database will be in this state if SQL Server knows that recovery needs to be run on the database but something is preventing recovery from starting. This is different from SUSPECT because there's nothing to say that recovery is going to fail - it just hasn't started yet.
    • An example of this is when the database wasn't cleanly shut down (i.e. there was at least one uncommitted transaction active at the time the database was shut down) and the log file has been deleted.
  3. SUSPECT
    • The database will be in this state if the transaction log is damaged and it prevents recovery or a transaction rollback from completing.
    • Some examples of this are:
      • When the database wasn't cleanly shut down and recovery tries to read a corrupt data page or comes across a corrupt log record.
      • A regular transaction rolls back and tries to read a corrupt data page or comes across a corrupt log record.

You can check the state of a database in the sys.databases catalog view:

SELECT state_desc FROM sys.databases WHERE name = 'master';

GO

or by using the DATABASEPROPERTYEX function:

SELECT DATABASEPROPERTYEX ('master', 'STATUS');

GO

Beware however, as DATABASEPROPERTYEX returns SUSPECT when the database is RECOVERY PENDING, as I'll show you below.

So the state the database is in determines what you can do if you don't have a backup. The easiest case is when it's still ONLINE. In this case you can probably run repair to remove the damage, most likely with some data loss (see my previous post on REPAIR_ALLOW_DATA_LOSS for more details), and then take steps to prevent the damage occuring again. If repair can't fix all the errors then your only option without a backup is to extract as much data as you can into a new database.

The other two database states are more difficult and are what's causing people to search for help. In this case the database isn't accessible at all, because recovery hasn't run or completed and so the database is in an inconsistent state. It could just be logically inconsistent (e.g. a transaction modifying data hasn't recovered) or worse it could structurally inconsistent (e.g. a system transaction modifying index linkages has't recovered). Either way, SQL Server wants to prevent you from getting into the database because it doesn't know what state the data and structures in the database are in. But if you don't have a backup, you need to get into the database, no matter what state things are in.

You can do this using EMERGENCY mode. In versions prior to SQL Server 2005, EMERGENCY mode wasn't documented and you had to hack the sysdatabases table to get a database into it (worse still, the exact hack to use was changed from version to version in a bid to obfuscate things). In 2005 though, EMERGENCY mode was documented and proper syntax added to support it. Members of the sysadmin role can put the database into EMERGENCY mode using:

ALTER DATABASE foo SET EMERGENCY;

GO

Once in EMERGENCY mode, the database is accessible only by members of the sysadmin role. The database is also read-only as nothing can be written to the transaction log.

Let's see an example of this (based on a demo from my Secrets of Fast Detection and Recovery from Database Corruptions presentation). I'm going to create a database and a sample table:

IF DATABASEPROPERTY (N'emergencydemo', 'Version') > 0 DROP DATABASE emergencydemo;

GO

CREATE DATABASE emergencydemo;

GO

USE emergencydemo;

GO

CREATE TABLE salaries (

FirstName CHAR (20),

LastName CHAR (20),

Salary INT);

GO

INSERT INTO salaries VALUES ('John', 'Williamson', 10000);

INSERT INTO salaries VALUES ('Stephen', 'Brown', 12000);

INSERT INTO salaries VALUES ('Jack', 'Bauer', 10000);

GO

I'm going to start an explicit user transaction and update a row in the table:

BEGIN TRANSACTION;

GO

UPDATE salaries SET Salary = 0 WHERE LastName='Brown';

GO

Now I'm going to force the data page holding the updated row to be written to disk:

CHECKPOINT;

GO

So we have an active, uncommitted transaction that's modified the table, and the table modification has been written to disk. If the power failed at this point, crash recovery would run and the transaction would be rolled back. I'm going to simulate this by shutting down SQL Server. In another connection:

SHUTDOWN WITH NOWAIT;

GO

Server shut down by NOWAIT request from login ROADRUNNERPR\paul.

SQL Server is terminating this process.

I'm also going to simulate damage to the transaction log:

C:\Documents and Settings\paul>del "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\emergencydemo_log.LDF"

C:\Documents and Settings\paul>

Now when I start up SQL Server again, we see the following in the error log:

2007-10-02 11:39:47.14 spid18s     Starting up database 'emergencydemo'.
2007-10-02 11:39:47.46 spid18s     Error: 17207, Severity: 16, State: 1.
2007-10-02 11:39:47.46 spid18s     FileMgr::StartLogFiles: Operating system error 2(The system cannot find the file specified.) occurred while creating or opening file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF'. Diagnose and correct the operating system error, and retry the operation.
2007-10-02 11:39:47.60 spid18s     File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF" may be incorrect.
2007-10-02 11:39:47.60 spid18s     The log cannot be rebuilt because the database was not cleanly shut down.

The database wasn't cleanly shut down and the transaction log isn't available so recovery couldn't run. The final message is interesting - there's a feature in SQL Server 2005 that if you attach or startup a database without a transaction log file, and the database was cleanly shut down, SQL Server will create a new log file automatically. In our case that can't happen though.

What happens if I try to get into the database?

USE emergencydemo;

GO

Msg 945, Level 14, State 2, Line 1

Database 'emergencydemo' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

So what state is the database in?

SELECT DATABASEPROPERTYEX ('emergencydemo', 'STATUS');

GO

returns SUSPECT. But checking the sys.databases table

SELECT state_desc FROM sys.databases WHERE name='emergencydemo';

GO

returns RECOVERY PENDING. This is what I'd expect, as recovery didn't get a chance to even start.

Now I'll set the database into EMERGENCY mode so I can get in and see what state things are in:

ALTER DATABASE emergencydemo SET EMERGENCY;

GO

In the errorlog you can tell when a database has been put into EMERGENCY mode:

2007-10-02 11:53:52.57 spid51      Setting database option EMERGENCY to ON for database emergencydemo.
2007-10-02 11:53:52.59 spid51      Starting up database 'emergencydemo'.
2007-10-02 11:53:52.62 spid51      The database 'emergencydemo' is marked EMERGENCY_MODE and is in a state that does not allow recovery to be run.

Let's try that again:

USE emergencydemo;

GO

This time it works. What's the state of the data?

SELECT * FROM salaries;

GO

FirstName            LastName             Salary
-------------------- -------------------- -----------
John                 Williamson           10000
Stephen              Brown                0
Jack                 Bauer                10000

It's inconsistent, as I'd expect.

That's the catch with EMERGENCY mode - you can get into the database but recovery hasn't run or completed so you don't know whether the database is logically or structurally consistent. However, at least you can get into the database to extract data out or repair any damage.

In the next post (later today) I'll show you how to repair any damage using the emergency-mode repair feature of DBCC CHECKDB.

Time for the first post in the Inside the Storage Engine series. I'm going to focus on SQL Server 2005 in this series and I'll point out major differences between 2005 and previous versions. Please drop me a line if there's something you'd like to see explained and demo'd.

Before jumping into how things work, I'd like to go over two commands I'll be using a lot - DBCC PAGE and DBCC IND. These are both undocumented and unsupported commands, but are very safe to use as they're used extensively inside and outside Microsoft when troubleshooting. Nevertheless, use at your own risk. They're quite well known in the SQL community and I and others have publicized them before (I even demo'd them last year at ITForum in Spain).

To illustrate their use, I'm going to use a simple script I wrote to prove that page splits never roll back. I was having a discussion with someone a while ago about this question and the answer is always no. A page split occurs when an insert or update has to happen at a certain point in an index page, and there's no room on the page to accomomodate the new or updated record. Page splits are done internally as separate 'system' transactions. Once a system transaction commits, it cannot be rolled back - even if the user transaction it was part of rolls back.

So, let's run through the script. First thing to do is create a database containing a table with an index (as page splits only happen in indexes).

USE MASTER;

GO

IF DATABASEPROPERTY (N'pagesplittest', 'Version') > 0 DROP DATABASE pagesplittest;

GO

CREATE DATABASE pagesplittest;

GO

USE pagesplittest;

GO

CREATE TABLE t1 (c1 INT, c2 VARCHAR (1000));

CREATE CLUSTERED INDEX t1c1 ON t1 (c1);

GO

Now I'm going to fill up a page in the index, but leave a gap in the c1 values so that I can force a page split by inserting the missing key value.

INSERT INTO t1 VALUES (1, REPLICATE ('a', 900));

INSERT INTO t1 VALUES (2, REPLICATE ('b', 900));

INSERT INTO t1 VALUES (3, REPLICATE ('c', 900));

INSERT INTO t1 VALUES (4, REPLICATE ('d', 900));

-- leave a gap at 5

INSERT INTO t1 VALUES (6, REPLICATE ('f', 900));

INSERT INTO t1 VALUES (7, REPLICATE ('g', 900));

INSERT INTO t1 VALUES (8, REPLICATE ('h', 900));

INSERT INTO t1 VALUES (9, REPLICATE ('i', 900));

GO

I can find out what the first index page is using the DBCC IND command:

DBCC IND ('pagesplittest', 't1', 1);

GO

This command list all the pages that are allocated to an index. Here's the output in this case:

SEQA3.jpg

The columns mean:

  • PageFID - the file ID of the page
  • PagePID - the page number in the file
  • IAMFID - the file ID of the IAM page that maps this page (this will be NULL for IAM pages themselves as they're not self-referential)
  • IAMPID - the page number in the file of the IAM page that maps this page
  • ObjectID - the ID of the object this page is part of
  • IndexID - the ID of the index this page is part of
  • PartitionNumber - the partition number (as defined by the partitioning scheme for the index) of the partition this page is part of
  • PartitionID - the internal ID of the partition this page is part of
  • PageType - the page type. Some common ones are:
    • 1 - data page
    • 2 - index page
    • 3 and 4 - text pages
    • 8 - GAM page
    • 9 - SGAM page
    • 10 - IAM page
    • 11 - PFS page
  • IndexLevel - what level the page is at in the index (if at all). Remember that index levels go from 0 at the leaf to N at the root page (except in clustered indexes in SQL Server 2000 and 7.0 - where there's a 0 at the leaf level (data pages) and a 0 at the next level up (first level of index pages))
  • NextPageFID and NextPagePID - the page ID of the next page in the doubly-linked list of pages at this level of the index
  • PrevPageFID and PrevPagePID - the page ID of the previous page in the doubly-linked list of pages at this level of the index

So you can see we've got a single page clustered index with an IAM page. Note that the page IDs returned may differ on your server. Let's look at the data page:

DBCC TRACEON (3604);

GO

DBCC PAGE (pagesplittest, 1, 143, 3);

GO

The traceflag is to make the output of DBCC PAGE go to the console, rather than to the error log. The syntax for DBCC PAGE is:

dbcc page ( {'dbname' | dbid}, filenum, pagenum [, printopt={0|1|2|3} ])

The filenum and pagenum parameters are taken from the page IDs that come from various system tables and appear in DBCC or other system error messages. A page ID of, say, (1:143) has filenum = 1 and pagenum = 143.

The printopt parameter has the following meanings:

  • 0 - print just the page header
  • 1 - page header plus per-row hex dumps and a dump of the page slot array (unless its a page that doesn't have one, like allocation bitmaps)
  • 2 - page header plus whole page hex dump
  • 3 - page header plus detailed per-row interpretation

The per-row interpretation works for all page types, including the allocation bitmaps. In our case, we asked for a detailed output. I'll explain the various parts of the output in a post about the anatomy of a page. Here's the output from DBCC PAGE, with a bunch of the repeated per-row info removed for brevity:

PAGE: (1:143)


BUFFER:


BUF @0x02C49720

bpage = 0x05400000                   bhash = 0x00000000                   bpageno = (1:143)
bdbid = 8                            breferences = 0                      bUse1 = 22163
bstat = 0xc0010b                     blog = 0x32159bb                     bnext = 0x00000000

PAGE HEADER:


Page @0x05400000

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042384384                                
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 8                          m_slotCnt = 8                        m_freeCnt = 744
m_freeData = 7432                    m_reservedCnt = 0                    m_lsn = (18:113:2)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 0                      

Allocation Status

GAM (1:2) = ALLOCATED                SGAM (1:3) = ALLOCATED              
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                         DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED           

Slot 0 Offset 0x60 Length 917

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C24C060

00000000:   30000800 01000000 0300f802 00110095 †0...............        
00000010:   03616161 61616161 61616161 61616161 †.aaaaaaaaaaaaaaa        

<snip> I've removed this section to save space

00000380:   61616161 61616161 61616161 61616161 †aaaaaaaaaaaaaaaa        
00000390:   61616161 61††††††††††††††††††††††††††aaaaa                   
UNIQUIFIER = [NULL]                 

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1                              

Slot 0 Column 2 Offset 0x11 Length 900

c2 = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
a                                   

Slot 1 Offset 0x3f5 Length 917

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C24C3F5

<snip> And again...

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

As you can see from the output, each row is 917 bytes long and there's only 744 bytes free (look at the m_freecnt value in the PAGE HEADER section). This means that we can't insert another row on that page of the same length - there just isn't space - but that's what we're going to do! Remember that this page currently has nine rows on it. Let's force a page split:

BEGIN TRAN;

GO

INSERT INTO t1 VALUES (5, REPLICATE ('a', 900));

GO

Now we know there wasn't enough room so the page must have split. Let's check DBCC IND again to see if another page was allocated to the index - here's the output:

SEQA41.jpg

Two pages have been added - an index page and another data page. Before we added the extra row and caused the page split, the index only needed one page. Now that there are two data pages, there needs to be an index page to allow searches through the index b-tree. Let's take a look at the two data pages to see which rows are stored on which page. Doing DBCC PAGE on them shows that page (1:143) has 5 rows, with c1 values 1 through 5, and page (1:154) has the 4 rows with c1 values 6 through 9 (I'm not going to post all the DBCC PAGE output - that would make the post way too long and it gives you an incentive to try the commands out). This is what we'd expect, as the page split occurs at the insertion point, and the row being inserted is put onto the page that split.

Now let's rollback the user transaction and see what happens:

ROLLBACK TRAN;

GO

Running the DBCC PAGE commands again shows that the index structure remains the same as after the split. Page (1:154) has the 4 rows on it from the split and page (1:143) has the other rows but not the one we inserted in the explicit transaction.

So, proof that a page split is never rolled back. I'll be making much more use of these two DBCC commands in future posts and I'll do the page anatomy one later this week. Let me know if there's anything in particular you'd like to see described in this series.

This week I'm going to post a bunch of info on the basic structures used to store data and track allocations in SQL Server. A bunch of this was posted back when I started blogging at TechEd 2006 but I want to consolidate/clarify info and add more about using DBCC PAGE to examine the various structures.

So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that...

Data records

  • Data records are stored on data pages.
  • Data records store rows from a heap or the leaf level of a clustered index.
  • A data record always stores all columns from a table row - either by-value or by-reference.
    • If any columns are for LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), XML), then there's a pointer stored in the data record which points to a text record on a different page (the root of a loose tree that stores the LOB value). Exceptions to this are when the schema has been set to store LOB columns 'in-row' when possible. This is when a LOB value is small enough to fit within the size limits of a data record. This is a performance benefit as selecting the LOB column does not require an extra IO to read the text record.
    • In SQL Server 2005, non-LOB variable length columns (e.g. varchar, sqlvariant) may also be stored 'off-row' as part of the row-overflow feature of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values - a pointer in the data record pointing to a text record.
  • There is a difference to how the columns are laid out between heaps and clustered indexes - I'll cover that in a later post.

Forwarded/Forwarding records

  • These are technically data records and are only present in a heap.
  • A forwarded record is a data record in a heap that was updated and was too large to fit in-place on its original page and so has been moved to another page. It contains a back-pointer to the forwarding record.
  • A forwarding record is left in its place and points to the new location of the record. It's sometimes known as a forwarding-stub, as all it contains is the location of the real data record.
  • This is done to avoid having to update any non-clustered index records that point back directly to the original physical location of the record.
  • Although this optimizes non-clustered index maintenance during updates, it can cause additional IOs during SELECTs. This is because the non-clustered index record points to the old location of the index, so an extra IO might be needed to read the real location of the data row. This is fuel for the heap vs clustered index debate, in favor of clustered indexes.

Index records

  • Index records are stored on index pages.
  • There are two types of index records (which differ only in what columns they store):
    1. Those that store non-clustered index rows at the leaf level of a non-clustered index
    2. Those that comprise the b-tree that make up clustered and non-clustered indexes (i.e. in index pages above the leaf level of a clustered or non-clustered index)
  • I'll explain more about the differences between these in a later post as it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
  • Index records typically do not contain all the column values in a table (although some do - called covering indexes).
  • In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).

Text records

  • Text records are stored on text pages.
  • There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page. I'll explain how they work and are linked together in a future post.
  • They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.

Ghost records

  • These are records that have been logically deleted but not physically deleted from a page. The reasons for this are complicated, but basically having ghost records simplfies key-range locking and transaction rollback.
  • The record is marked with a bit that indicates it's a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background proces (called the ghost-cleanup task) or it is converted back to a real record by an insert of a record with the exact same set of keys.

Other record types

  • There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and database boot page). Again, I'll go into these in later posts (there's a big queue of posts building up :-))

Record structure

All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.

The record structure is as follows:

  • record header
    • 4 bytes long
    • two bytes of record metadata (record type)
    • two bytes pointing forward in the record to the NULL bitmap
  • fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
  • NULL bitmap
    • two bytes for count of columns in the record
    • variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
    • this allows an optimization when reading columns that are NULL
  • variable-length column offset array
    • two bytes for the count of variable-length columns
    • two bytes per variable length column, giving the offset to the end of the column value
  • versioning tag
    • this is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb

NULL bitmap optimization

So why is the NULL bitmap an optimization?

Firstly, having a null bitmap removes the need for storing special 'NULL' values for fixed-length datatypes. Without the null bitmap, how can you tell whether a column is NULL? For fixed-length columns you'd need to define a special 'NULL' value, which limits the effective range of the datatype being stored. For varchar columns, the value could be a zero-length empty string, so just checking the length doesn't work - you'd need the special value again. For all other variable-length data types you can just check the length. So, we nede the NULL bitmap.

Secondly, it saves CPU cycles. If there was no NULL bitmap, then there are extra instructions executed for fixed-length and variable-length columns.

For fixed-length:

  1. read in the stored column value (possibly taking a cpu data cache miss)
  2. load the pre-defined NULL value for that datatype (possibly taking a cpu data cache miss, but only for the first read in the case of a multiple row select)
  3. do a comparison between the two values

For variable-length:

  1. calculate the offset of the variable length array
  2. read the number of variable length columns (possibly taking a cpu data cache miss)
  3. calculate the position in the variable length offset array to read
  4. read the column offset from it (possibly taking a cpu data cache miss)
  5. read the next one too (possibly taking another cpu data cache miss, if the offset in step 4 was on the boundary of a cache line size)
  6. compare them to see if they're the same

But with a NULL bitmap, all you have to do is:

  1. read the NULL bitmap offset (possibly taking a cpu data cache miss)
  2. calculate the additional offset of the NULL bit you want to read
  3. read it (possibly taking a cpu data cache miss)

So, its about even for a lookup of a single fixed-length column, but for variable-length columns, and for multiple row selects, there's a clear advantage to having the NULL bitmap.

Using DBCC IND and DBCC PAGE to examine a row in detail

Let's create an example table to look at:

USE MASTER;
GO
IF DATABASEPROPERTY (N'recordanatomy', 'Version') > 0 DROP DATABASE recordanatomy;
GO
CREATE DATABASE recordanatomy;
GO

USE recordanatomy;
GO
CREATE TABLE example (destination VARCHAR(100), activity VARCHAR(100), duration INT);
GO
INSERT INTO example VALUES ('Banff', 'sightseeing', 5);
INSERT INTO example VALUES ('Chicago', 'sailing', 4);
GO

And we can use DBCC IND again to find the page to look at:

DBCC IND ('recordanatomy', 'example', 1);
GO

The output tells us the data page is (1:143) so we can dump it with DBCC PAGE, using option 3 to get a fully interpreted dump of each record.

DBCC TRACEON (3604);
GO
DBCC PAGE ('recordanatomy', 1, 143, 3);
GO

Remember we need the trace-flag to make the DBCC PAGE output go to the console instead of the error log. The output will contain something like the following:

Slot 0 Offset 0x60 Length 33

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C76C060

00000000:   30000800 05000000 0300f802 00160021 †0..............!
00000010:   0042616e 66667369 67687473 6565696e †.Banffsightseein
00000020:   67†††††††††††††††††††††††††††††††††††g

Slot 0 Column 0 Offset 0x11 Length 5

destination = Banff

Slot 0 Column 1 Offset 0x16 Length 11

activity = sightseeing

Slot 0 Column 2 Offset 0x4 Length 4

duration = 5

Let's use the record structure I listed above to go through this record and see how things are stored.

  • Byte 0 is the TagA byte of the record metadata.

    • Its 0x30, which corresponds to 0x10 (bit 4) and 0x20 (bit 5). Bit 4 means the record has a NULL bitmap and bit 5 means the record has variable length columns. If 0x40 (bit 6) was also set, that would indicate that the record has a versioning tag. If 0x80 (bit 7) was also set, that would indicate that byte 1 has a value in it.
    • Bits 1-3 of byte 0 give the record type. The possible values are:
      • 0 = primary record. A data record in a heap that hasn't been forwarded or a data record at the leaf level of a clustered index.
      • 1 = forwarded record
      • 2 = forwarding record
      • 3 = index record
      • 4 = blob fragment
      • 5 = ghost index record
      • 6 = ghost data record
      • 7 = ghost version record. A special 15-byte record containing a single byte record header plus a 14-byte versioning tag that is used in some circumstances (like ghosting a versioned blob record)
    • In our example, none of these bits are set which means the record is a primary record. If the record was an index record, byte 0 would have the value 0x36. Remember that the record type starts on bit 1, not bit 0, and so the record type value from the enumeration above needs to be shifted left a bit (multiplied by two) to get its value in the byte.
  • Byte 1 is the TagB byte of the record metadata. It can either be 0x00 or 0x01. If it is 0x01, that means the record type is ghost forwarded record. In this case its 0x00, which is what we expect given the TagA byte value.
  • Bytes 2 and 3 are the offset of the NULL bitmap in the record. This is 0x0008 (DBCC PAGE presents multi-byte values in hex dumps as least-significant byte first). This means that there's a 4-byte fixed length portion of the record starting at byte 4. We expect this because we know the table schema.
  • Bytes 4 to 7 are the fixed length portion. Again, because we know the table schema, we know to interpret these bytes as a 4-byte integer. Without that knowledge, you'd have to guess. The value therefore is 0x00000005, which is what we'd expect to see as the value of the duration column.
  • Bytes 8 and 9 are the count of columns in the record. This is 0x0003 which is correct. Given that there are only 3 columns, the NULL bitmap of one bit per column will fit in a single byte.
  • Byte 10 is the NULL bitmap. The value is 0xF8. We need to convert it to binary to make sense of the value. 0xF8 = 11111000. This makes sense - bits 0-2 represent columns 1-3 and they're all 0, meaning the columns aren't NULL. Bits 3-7 represent non-existent columns and they're set to 1 for clarity.
  • Bytes 11 and 12 are the count of variable length columns in the record. That value is 0x0002, which we again know to be correct. This means there will be two two-byte entries in the variable length column offset array. These will be bytes 13-14 and 15-16, having values of 0x0016 and 0x0021 respectively. Remember that NULL bitmap entries point to the end of the column value - this is done so that we know how long each column is without having to store their length as well.
  • So, the final offset is bytes 15 and 16, which means the offset of the start of the first variable length column must be byte 17 (or 0x11 in hex), which agrees with the DBCC PAGE dump. The offset of the end of the first variable length column is 0x0016, so the first value is from byte 17 to byte 21 inclusive. This value is 0x42616E6666. We know from the table metadata that this is the first varchar column, destination. Converting to ASCII gives us the column value 'Banff'. Using similar logic, the second value is from byte 22 to byte 32 inclusive and has the value 'sightseeing'. Both of these match the data we're expecting.

And that's it.

Some of the features of SQL Server 2008 will introduce changes to the record structure - more on those when the features are available in CTPs.

This was a question from the MSDN Disaster Recovery forum I started while I was at Microsoft.

I have a 600 gig database that has a mirror.  I need to move the databases from local drives to a SAN.  Can anyone recommend a document that lists the steps to go through to move both the principle and mirror to the SAN with no down time? or minimal down time?

As far as I know, there isn't any such document so I had a crack at coming up with a list of operations. Here's what I had:

  1. Take a full backup of the principal on node A
  2. Restore it on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly, and with a different database name than the current mirror
  3. Take the required log backup on the principal and restore on the database copy on the SAN on node B
  4. Break the mirroring partnership
  5. Drop the current mirror database on node B
  6. Rename the database on the SAN on node B to be the mirror database -- THIS DOESN"T WORK!
  7. Setup the mirroring partnership to point to the newly restored database on the SAN on node B
  8. Start mirroring and the new mirror will catch-up
  9. Failover to the mirror on node B, which becomes the new principal
  10. Follow the same procedure to move the new mirror on node A onto its SAN
  11. Failback if you want to

And I promised to try it out to make sure I had it right so in this blog post I'm going to walk through the steps of doing this. It turns out that the steps above are slightly incorrect. Step 6 above doesn't work because the database is in recovery (so is inaccessible) and there's a short-cut when moving the database on the first node to avoid having to take and copy more backups. Let's see how it works and I'll post the corrected sequence at the end.

As I did in yesterday's mirroring post, I'm going to use the TicketSalesDB database from our Always-On DVDs. It's only a few hundred MB instead of 600GB but the principal is the same (no pun intended :-)). I've got mirroring running between two nodes, SQLDEV01 (the principal) and SQLDEV02 (the mirror), both of which are running 2005 SP2 and I've got a simulated workload inserting rows into the database. I don't actually have a SAN laying around so I'm cheating and I have directories called C:\SQLDEV01SAN and C:\SQLDEV02SAN instead. It's the location change that's the interesting part, not where the actual location is.

Step 1

On SQLDEV01, take a full backup and a log backup:

BACKUP DATABASE TicketSalesDB TO DISK = 'C:\SQLskills\TicketSalesDB.BAK' WITH INIT;

GO

BACKUP LOG TicketSalesDB TO DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH INIT;

GO

Step 2

On SQLDEV01, break the mirroring partnership:

ALTER DATABASE TicketSalesDB SET PARTNER OFF;

GO

And just check that it's gone:

SELECT mirroring_state_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

which returns:

NULL

Step 3

On SQLDEV02, drop the mirror database - this wouldn't work unless mirroring was no longer running:

DROP DATABASE TicketSalesDB;

GO

Step 4

Copy the backups to SQLDEV02 and restore them on the SAN and remembering to use WITH NORECOVERY:

RESTORE DATABASE TicketSalesDB FROM DISK='C:\SQLskills\TicketSalesDB.bak'

WITH

MOVE 'TicketSalesDBData' TO 'C:\SQLDEV02SAN\TicketSalesDBData.MDF',

MOVE 'TicketSalesFG2005Q1' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q1.NDF',

MOVE 'TicketSalesFG2005Q2' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q2.NDF',

MOVE 'TicketSalesFG2005Q3' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q3.NDF',

MOVE 'TicketSalesFG2005Q4' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q4.NDF',

MOVE 'TicketSalesDBLog' TO 'C:\SQLDEV02SAN\TicketSalesDBLog.LDF',

NORECOVERY;

GO

RESTORE LOG TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH NORECOVERY;

GO

Step 5

On SQLDEV02, set the mirroring partner to be SQLDEV01:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV01:5091';

GO

Step 6

On SQLDEV01, start mirroring:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV02:5092';

GO

And check that it's running:

SELECT mirroring_state_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

This time it returns:

SYNCHRONIZED

Step 7

Now we need to failover so that we can move the database on SQLDEV01 onto its SAN. Before we do that, let's make sure that SQLDEV01 is the principal:

SELECT mirroring_role_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

which returns:

PRINCIPAL

Now force the failover:

ALTER DATABASE TicketSalesDB SET PARTNER FAILOVER;

GO

And query the DMV again to make sure. This time the mirroring_state_desc returned is:

MIRROR

Excellent!

Now, I did all of this while my workload was running and it automatically failed over to SQLDEV02, with the database now hosted on the SAN. To do the same move on SQLDEV01, we don't need to go through the backup and copy process again - we can just use the original backups we took in step 1.

Step 8

We need to break the mirroring partnership again, this time executing on SQLDEV02, the new principal:

ALTER DATABASE TicketSalesDB SET PARTNER OFF;

GO

On SQLDEV01, we can now drop the database and restore the original backups onto the SAN:

DROP DATABASE TicketSalesDB;

GO

RESTORE DATABASE TicketSalesDB FROM DISK='C:\SQLskills\TicketSalesDB.bak'

WITH

MOVE 'TicketSalesDBData' TO 'C:\SQLDEV01SAN\TicketSalesDBData.MDF',

MOVE 'TicketSalesFG2005Q1' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q1.NDF',

MOVE 'TicketSalesFG2005Q2' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q2.NDF',

MOVE 'TicketSalesFG2005Q3' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q3.NDF',

MOVE 'TicketSalesFG2005Q4' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q4.NDF',

MOVE 'TicketSalesDBLog' TO 'C:\SQLDEV01SAN\TicketSalesDBLog.LDF',

NORECOVERY;

GO

RESTORE LOG TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH NORECOVERY;

GO

And setup mirroring again. On SQLDEV01:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV02:5092';

GO

And on SQLDEV02:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV01:5091';

GO

And we're running again.

Step 9

Now all we need to do is fail the workload back to SQLDEV01 by executing this on SQLDEV02:

ALTER DATABASE TicketSalesDB SET PARTNER FAILOVER;

GO

Summary

So - the corrected sequence for moving a database while mirroring is running is the following:

  1. Take a full backup of the principal database on node A, and the required log backup
  2. Break the mirroring partnership
  3. Drop the current mirror database on node B
  4. Copy the backups to node B and restore it on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly
  5. Setup the mirroring partnership to point to the newly restored database on the SAN on node B
  6. Start mirroring and the new mirror will catch-up
  7. Failover to the mirror on node B, which becomes the new principal
  8. Follow the same procedure to move the new mirror on node A onto its SAN, but using the original backups from step 1
  9. Failback

Hope this helps.

One of the great things about the blog engine we use is that it shows all the search engine queries that led to someone clicking through to the site. I've been looking through the logs to see what kind of problems people are having that end up here. In this occasional series, I'm going to pick a search engine query and blog about it - hopefully helping out people who have the problem in future.

First up is running out of transaction log space. This happens when the transaction log fills up to the point where it has to grow but either autogrow is not turned on or the volume on which the transaction log is placed has no more space for the file to grow. So what causes the transaction log to fill up in the first place? It could be a number of different things. The two most common ones I've seen are:

  • The database is in full recovery mode with normal processing, a full database backup has been taken but no log backups have been taken. This commonly happens when an application goes into production and someone decides to take a backup, without realizing the consequences. Log truncation is not automatic in such a case - a log backup is required to allow the log to truncate.
  • The database is in any recovery mode but there's a very long-running, uncommitted transaction that prevents log truncation. (Even in full recovery mode with regular log backups, all the log records for the long-running transaction are required in case it rolls-back - and SQL Server can't selectively truncate log records from the log for some transactions but not others.)

Let's have a look at both of these in SQL Server 2005 and example solutions.

The log backup case...

First thing to do is create a database with a very small transaction log that we can easily fill up.

USE master;
GO
IF DATABASEPROPERTY (N'tinylogtest', 'Version') > 0 DROP DATABASE tinylogtest;
GO

CREATE DATABASE tinylogtest ON (
   
NAME = fgt_mdf,
   
FILENAME = 'c:\tinylogtest\tinylogtest.mdf',
   
SIZE = 2MB)
LOG ON (
   
NAME = fgt_log,
   
FILENAME = 'c:\tinylogtest\tinylogtest.ldf',
   
SIZE = 512KB,
   
FILEGROWTH = 0);
GO

Notice that I've specifically turned auto-growth of the log file off by setting FILEGROWTH = 0. Now let's set the database into full recovery mode and take a full database backup.

ALTER DATABASE tinylogtest SET RECOVERY FULL;
GO
BACKUP DATABASE tinylogtest to DISK = 'C:\tinylogtest\dummybackup.bck';
GO

And then create a simple table and fill up the transaction log.

SET NOCOUNT ON;
GO
CREATE TABLE tinylogtest..testtable (c1 INT, c2 CHAR (3000));
GO
WHILE (1=1)
BEGIN
   
INSERT INTO tinylogtest..testtable VALUES (1, 'a')
END;

GO

And we get the following error:

Msg 9002, Level 17, State 2, Line 4
The transaction log for database 'tinylogtest' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases

This is much better than you get on SQL Server 2000:

Msg 9002, Level 17, State 2, Line 4
The log file for database 'tinylogtest' is full.

This has no helpful information at all. So, looking at the 2005 message, it's telling us to look in the sys.databases table for more info. Let's do that:

SELECT log_reuse_wait_desc FROM sys.databases WHERE [name] = 'tinylogtest';
GO

And the result is:

LOG_BACKUP

That's pretty clear - take a log backup! Here are the various values this can take, from the Books Online entry for sys.databases:

  • NOTHING
  • CHECKPOINT
  • LOG_BACKUP
  • ACTIVE_BACKUP_OR_RESTORE
  • ACTIVE_TRANSACTION
  • DATABASE_MIRRORING
  • REPLICATION
  • DATABASE_SNAPSHOT_CREATION
  • LOG_SCAN
  • OTHER_TRANSIENT

Their meanings are explained in the Books Online entry Factors That Can Delay Log Truncation. Now if we take a log backup, the log can be truncated.

BACKUP LOG tinylogtest to DISK = 'C:\tinylogtest\dummybackup.bck';
GO

And checking the state in sys.databases again returns:

NOTHING

Simple - but it's amazing how many times I see this problem on newsgroups and forums.

The long-running transaction case...

For this test, assume I've run the T-SQL above to create the database, put it into full recovery mode, take the first full backup, and then create the table. Now in a second connection, I'll create a long-running transaction:

BEGIN TRANSACTION PAULSTRAN;
GO

INSERT INTO tinylogtest..testtable VALUES (1, 'a');
GO

Note that the transaction hasn't been committed or rolled back. In the original connection, I'll execute the WHILE loop to fill up the log again. When we check sys.databases we get back:

LOG_BACKUP

So we take a log backup and check the state again. This time we get:

ACTIVE_TRANSACTION

We need to use the DBCC OPENTRAN command to find out what the long running transaction is:

DBCC OPENTRAN ('tinylogtest');
GO

And we get back:

Transaction information for database 'tinylogtest'.

Oldest active transaction:
    SPID (server process ID): 54
    UID (user ID) : -1
    Name            : PAULSTRAN
    LSN               : (18:98:1)
    Start time      : Sep 23 2007  5:49:53:077PM
    SID               : 0x010500000000000515000000ae4da6eadcd59cf661d6bb58ed030000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

From this info we can track down the transaction and commit it or roll it back. In our example, once we roll back the PAULSTRAN transaction, there's enough log space to insert one more record into the table, but then we run out of log space again. What's going on? Because the PAULSTRAN transaction we very close to the start of the transaction log, all of the log it was effectively holding hostage was backed up but couldn't be cleared. In this case we need to perform another log backup to allow the log to be cleared. This makes sense and is documented behavior.

Summary

Of course, there are other causes of the transaction log filling up and other things you can do apart from taking log backups or identifying long-running transactions - such as growing the log file or moving the log file to a volume with more free space. Here are some good resources you can use for further reading:

Hope this helps!

Quickie this morning - I was surfing some MVP blogs this morning and came across a series of posts by Allen White around using VB Script and SMO to automate regular maintenance jobs. His latest post contains code to regularly run consistency checks and more, and report any failures. Very nice - check it out!

This was originally posted as two posts on the SQL Server Storage Engine site. It was very popular so I've combined the two posts together and added a bunch more commentary - especially on page checksums and IO errors. You may also notice some color differences between the scripts from earlier in the year and today's post - more keywords are recognized in Management Studio in SP2 than before (but still not CHECKDB though...)

It's almost inevitable that at some point every DBA will face dealing with corruption - so it's very important that you know how the server will behave when corruption happens. You also need to make sure that whatever logic you've created to catch corruptions (either through error log parsing, alerts, or Agent jobs - topic for a future post) will actually work.

To do all this you need a corrupt database to play with. Earlier this year I created two corrupt databases - one for 2000 (attached in broken2000.zip) and one for 2005 (attached in broken2005.zip). The two attached files can be restored by unzipping them and then using the following syntax (substituting the correct backup name):

RESTORE DATABASE broken FROM DISK='c:\sqlskills\brokenXXX.bck'

WITH MOVE 'broken' TO 'c:\sqlskills\broken.mdf',

MOVE 'broken_log' TO 'c:\sqlskills\broken_log.ldf';

GO

The databases have the same schema - a table called 'brokentable' (c1 int, c2 varchar(7000)) with one row in it. The table has a single data-page which I've corrupted differently in 2000 and 2005:

  • 2000: The corrupt page has page ID (1:75) and the page header is corrupt so that selecting from the table will result in a 605 error which will kill the connection.
  • 2005: The corrupt page has page ID (1:143) and the page header is corrupt such that the page checksum is bad.

A cautionary note on page checksums - if you upgrade a database from 2000 to 2005 and turn on page checksums, nothing happens! It's not until a database page is read into the buffer pool, changed in some way and then written back out to disk that it will have a page checksum stamped on it. This means that once you turn them on, you need to trigger a page checksum being written to each page in some way - e.g. rebuilding all indexes or forcing an in-place update of all table rows. Neither of these is very palatable and there's no tool to force page checksums in SQL Server 2005. Unfortunately there are no plans to include such a tool in SQL Server 2008 either.

Below I've listed a few things you can try out to see what would happen on your database if a checksum failed. These are all using the 2005 corrupt database.

Query errors

Any query that touches that page is going to fail with an 824 error. The IO errors in 2005 are different from 2000 - they've been split into 3:

  • 823 - a hard IO error. This is where SQL Server has asked the OS to read the page but it just can't.
  • 824 - a soft IO error. This is where the OS could read the page but SQL Server decided that the page was corrupt - for example with a page checksum failure
  • 825 - a read-retry error. This is where either an 823 or 824 occured, SQL server retried the IO automatically and it succeeded. This error is written to the errorlog only - you need to be aware of these as they're a sign of your IO subsystem going awry. There's no way to turn off read-retry and force SQL Server to 'fail-fast' - whether this behavior is a good or bad thing can be argued both ways - personally I don't like it.

SELECT * FROM broken..brokentable;

GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:\sqlskills\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

DBCC CHECKDB

CHECKDB throws us some nice errors. Note that it doesn't actually mention a page checksum failure. CHECKDB is the only thing in SQL Server that can 'eat' IO errors and convert them into non-fatal corruption errors. Note in the CHECKDB output below that the repair level needed to repair this error is 'repair_allow_data_loss' - this is because the repair for a page with any kind of IO error on it is to delete the page, fix-up all relevant linkages, and rebuild any referencing indexes.

DBCC CHECKDB ('broken') WITH NO_INFOMSGS, ALL_ERRORMSGS;

GO

Msg 8928, Level 16, State 1, Line 1

Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.

Msg 8939, Level 16, State 98, Line 1

Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12716041 and -4.

CHECKDB found 0 allocation errors and 2 consistency errors in table 'brokentable' (object ID 2073058421).

CHECKDB found 0 allocation errors and 2 consistency errors in database 'broken'.

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (broken).

Backup with CHECKSUM

If you have page checksums turned on, you should always use the WITH CHECKSUM option when taking backups. This will cause the page checksums to be checked as they're read into the backup. If a bad page checksum is found, the backup will stop and print a message identifying the bad page. Using the WITH CHECKSUM option on a backup will also generate a checksum over the entire backup stream and store it in the backup. This means we can detect a damaged backup by recalculating the checksum and comparing it against that stored in the backup - in much the same way that page checksums work.

BACKUP DATABASE broken TO DISK='c:\sqlskills\broken2.bck'

WITH CHECKSUM;

GO

Msg 3043, Level 16, State 1, Line 1

BACKUP 'broken' detected an error on page (1:143) in file 'c:\sqlskills\broken.mdf'.

Msg 3013, Level 16, State 1, Line 1

BACKUP DATABASE is terminating abnormally.

The backup has failed because of a bad page checksum. However, we can force it to backup. If this is the only copy of the database we have, and we're being forced to run repair to fix a corruption, for instance, then we want to make sure we have a backup to restore from in case something goes wrong with the repair. Even a backup that contains a corrupt database is better than no backup at all. In this case, we can use the CONTINUE_AFTER_ERROR option which will force the backup to continue when it finds a bad page.

BACKUP DATABASE broken TO DISK='c:\sqlskills\broken2.bck'

WITH CHECKSUM, CONTINUE_AFTER_ERROR;

GO

Processed 160 pages for database 'broken', file 'broken' on file 1.

Processed 1 pages for database 'broken', file 'broken_log' on file 1.

BACKUP WITH CONTINUE_AFTER_ERROR successfully generated a backup of the damaged database. Refer to the SQL Server error log for information about the errors that were encountered.

BACKUP DATABASE successfully processed 161 pages in 2.025 seconds (0.651 MB/sec).

RESTORE VERIFYONLY

You can check the high-level validity of any backup set using the RESTORE VERIFYONLY command.

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2005.bck';

GO

The backup set on file 1 is valid.

What about on the backup that we forced using CONTINUE_AFTER_ERROR?

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2.bck';

GO

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

Isn't that cool? It tells us that the backup was already corrupt when it was written. Ok - let's ask it to specifically check the checksums in the backup. This will look through all the pages in the backup that have page checksums, check them, and recalculate the backup stream checksum.

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2.bck'

WITH CHECKSUM;

GO

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

In our case, we get the same as above because the database was known to be corrupt when the backup was taken, so none of the checksum checking is done. What about if we try to check the checksums on the initial backup?

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2005.bck'

WITH CHECKSUM;

GO

Msg 3187, Level 16, State 1, Line 1

RESTORE WITH CHECKSUM cannot be specified because the backup set does not contain checksum information.

Msg 3013, Level 16, State 1, Line 1

VERIFY DATABASE is terminating abnormally.

We can't do that as the backup wasn't taken using the WITH CHECKSUM option in the first place, even though some of the database pages may have page checksums on them.

RESTORE

How about we try to overwrite the existing 'broken' database with the one from the second backup we took?

RESTORE DATABASE broken FROM DISK='c:\sqlskills\broken2.bck'

WITH REPLACE;

GO

Msg 3183, Level 16, State 1, Line 1

RESTORE detected an error on page (1:143) in database "broken" as read from the backup set.

Msg 3013, Level 16, State 1, Line 1

RESTORE DATABASE is terminating abnormally.

It won't let us because the backup contains corrupt data (and it knows that because we forced the backup to complete using the CONTINUE_AFTER_ERROR option). However, there may be cases where you've lost your database and all you have is a corrupt backup. In this case it may be better to restore what data you do have rather than lose everything. You can do it using the CONTINUE_AFTER_ERROR option on the RESTORE command this time:

RESTORE DATABASE broken FROM DISK='c:\sqlskills\broken2.bck'

WITH REPLACE, CONTINUE_AFTER_ERROR;

GO

Processed 160 pages for database 'broken', file 'broken' on file 1.

Processed 1 pages for database 'broken', file 'broken_log' on file 1.

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

RESTORE WITH CONTINUE_AFTER_ERROR was successful but some damage was encountered. Inconsistencies in the database are possible.

RESTORE DATABASE successfully processed 161 pages in 0.392 seconds (3.364 MB/sec).

Isn't that cool? It works BUT it tells us that the backup set contained corrupt data and that the database was restored but could have corrupt data in.

Summary

Have a play about with these databases to familiarize yourself with the kind of responses you'll get from the various tools when a corruption exists, and how to work around it if need be.

Let me know if you want to see any particular kinds of corruptions explored, or want a database with something specific corrupted in.

broken2000.zip (41 KB)broken2005.zip (149.9 KB)

Theme design by Nukeation based on Jelle Druyts