The March edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • Distributed transactions and database mirroring - why they don't work together
  • Background processes that can cause I/Os even with no connections to the server
  • How to restore backups from a file containing multiple appended backups
  • The perennial problem of production databases being too large to restore in development

Check it out at http://technet.microsoft.com/en-us/magazine/ff458345.aspx.

Christmas comes but once a year... really? Then mine just came early on this afternoon's UPS truck.

The very nice folks at Fusion-io just sent me two of their fully-loaded top-of-the-line ioDrive Duos with 640GB of solid-state flash memory in each. This is really extra-nice of them because on Dell's Small Business website they're currently retailing for $12800 *each*. Expensive? Yes. Worth it? That's what I'm hoping to prove.

There's nothing like expensive, pretty hardware to get me excited... here's what they look like:

Now, above I said 'expensive', and these are, but they pack some pretty amazing specs in terms of read/write bandwidth so you're getting a lot of bangs for your bucks. But where does it really make sense to drop the bucks for the biggest bangs? To answer that I'm planning to do a whole series of blog posts as part of my benchmarking efforts to investigate which operations can benefit the most from these drives. With 1.2TB of SSD storage I'll be able to plug these into one of my test systems here and run comparisons against 15k SCSI and 7.2k SATA drives.

Anyway, there's a lot of hype about the speed of SSDs, and also a lot of angst about SSDs not being Enterprise-ready. I don't agree with them not being Enterprise-ready - in fact, fellow-MVP Greg Linwood, who runs (among other things) our partner company SQLskills Australia, already has a bunch of customers with Fusion-io drives deployed in their enterprises successfully. As with any critical hardware infrastructure (especially cutting-edge stuff like this), the key to success is having everything set up correctly so I'll be blogging about all my experiences with them.

To summarize, I'm very excited! I've been wanting to get my hands on some serious SSD hardware for a couple of years now so I can do some *real* testing - it doesn't get better than this!

Shoot me an email or leave a comment if there's something you're interested in seeing tested.

PS Full disclosure: yes, of course Fusion-io sent me these because they're getting publicity from me blogging about them, but we don't have any editorial/veto agreement. I want to be able to recommend these to our enterprise clients and the only way to honestly do that is to play with them myself - so it's a win-win for both of us. And you guys get to test them vicariously through me, so it's a win-win for you too :-)

Stay tuned...

A while ago I blogged about disk partition alignment, and how the default alignment of 31.5KB on Windows Server 2003 can lead to enormous I/O performance problems (see Are your disk partition offsets, RAID stripe sizes, and NTFS allocation units set correctly?). We've been on-site with clients this week and that topic came up again. I thought it would be useful to do a quick blog post showing how to use the diskpart and wmic tools. Google them for lots of info from the Microsoft site - but be careful not to play around with any of the destructive options on production systems. The options I'm using below will not alter the disks in any way.

Note: This stuff applies to MBR disks, not GPT or dynamic disks. Although these require correct alignment too, I don't have any information on how to do it for those disks. The SQLCAT team will be publishing some guidelines but has not yet done so, AFAIK. Check out the SQLCAT team whitepaper Disk Partition Alignment Best Practices for SQL Server for full details on this topic.

Bring up a command prompt and type diskpart. You'll see something like:

C:\Users\Administrator>diskpart

Microsoft DiskPart version 6.0.6001
Copyright (C) 1999-2007 Microsoft Corporation.
On computer: MONKEY

DISKPART>

Next you need to list the logical disks that Windows knows about. Type list disk. You'll see something like:

DISKPART> list disk

  Disk ###  Status      Size     Free     Dyn  Gpt
  --------  ----------  -------  -------  ---  ---
  Disk 0    Online       136 GB  1434 MB
  Disk 1    Online      1116 GB      0 B
  Disk 2    Online      2036 GB      0 B

DISKPART> 

Disks 1 and 2 are two RAID arrays I'm using right now for the performance benchmarking series I'm doing. Notice that the numbers in the Free column aren't correct - not sure why not. 

To see the partitions on a disk, you need to set the diskpart focus to be that disk. Type select disk X, where X is the disk you want to focus on. You'll see something like:

DISKPART> select disk 1

Disk 1 is now the selected disk.

DISKPART>

And now you can list the partitions on the disk using list partition. You'll see something like:

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary           1116 GB  1024 KB

DISKPART>

This is the output from one of my Windows Server 2008 servers, where the default partition offset is 1MB - which doesn't lead to perf issues.

On another Windows XP system, I get the following output:

DISKPART> select disk 0

Disk 0 is now the selected disk.

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary            119 GB    32 KB

DISKPART>

This disk isn't aligned correctly. If this was a RAID array, I'd pay a perf penalty every time a read or write straddled a RAID stripe offset. See the blog post link at the top of this post for more details.

Unfortunately, diskpart isn't always the best tool to use to get partition offsets, as it rounds up the values, and when there are multiple partitions, it can be hard to tell exactly what's what, especially with lots of disks where you need to select each one and then list the partitions.

In this case, use wmic to get the exact numbers. The command is as follows:

wmic partition get BlockSize, StartingOffset, Name, Index

For my server, I get the following output:

BlockSize  Index  Name                   StartingOffset
512        0      Disk #1, Partition #0  1048576
512        0      Disk #2, Partition #0  1048576
512        0      Disk #0, Partition #0  1505755136
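
Once you have the exact offsets, checking alignment is just a remainder calculation. Here's a minimal Python sketch (not part of the original tooling) that parses the sample output above and tests each StartingOffset against a stripe size. The 64KB stripe size and the function names are assumptions for illustration; substitute whatever your RAID array actually uses.

```python
# Sample output copied from the wmic command above.
WMIC_OUTPUT = """\
BlockSize  Index  Name                   StartingOffset
512        0      Disk #1, Partition #0  1048576
512        0      Disk #2, Partition #0  1048576
512        0      Disk #0, Partition #0  1505755136
"""

# Assumption: a 64KB RAID stripe size; change this to match your array.
STRIPE_BYTES = 64 * 1024


def parse_offsets(text):
    """Return {partition name: starting offset in bytes} from wmic output."""
    offsets = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # The first two fields are BlockSize and Index, the last one is the
        # offset, and everything in between is the partition name.
        name = " ".join(fields[2:-1])
        offsets[name] = int(fields[-1])
    return offsets


def is_aligned(offset_bytes, stripe_bytes=STRIPE_BYTES):
    """A partition is aligned when its offset is a whole number of stripes."""
    return offset_bytes % stripe_bytes == 0


if __name__ == "__main__":
    for name, offset in parse_offsets(WMIC_OUTPUT).items():
        state = "aligned" if is_aligned(offset) else "NOT aligned"
        print(f"{name}: {offset} bytes -> {state}")
```

All three partitions in the sample pass the check against a 64KB stripe. The old 63-sector default (63 x 512 = 32256 bytes, which diskpart reports as 32 KB) fails it.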
 

For dynamic disks, use:

dmddiag.exe -v 

Now - go out to your servers and check the partition alignment - fixing this can give you up to 30-40% I/O performance boost!!

How do you fix it? Well, that's the downside - fixing it means reformatting the disk to have the correct partition offset or moving the data to a disk that already has the correct partition offset. Remember - although Windows Server 2008 creates disks with the correct offset, taking a disk that was created on Windows Server 2003 and attaching it to Windows Server 2008 will have no effect on the existing partition offset.


(For the hardware setup I'm using, see this post.)

As part of my new benchmarking series I first wanted to play around with different configurations of data files and backup files for a 1-TB database to see what kind of performance gains I can get invoking the parallelism possible when backing up and restoring the database. To do that I need a way to quickly populate a 1-TB database so that I can mess around with different numbers of files and so on. It doesn't matter what the data in the database is, as backup doesn't care - as long as there's a terabyte of it. Why a terabyte? It's a nice round number, it's a common enough database size right now, and I have the storage to play around with it.

But then my plans got derailed. In figuring out how to most quickly populate a 1-TB database, I realized that in itself would be an interesting benchmark to explore, so that's what I'm doing first.

My aim is to give you improvements you can use in real life. If you think this is going to be boring, skip down to the end of the post where I show a detailed perfmon and explain what's going on in my overloaded IO subsystem, then come back up here :-)

The baseline for this benchmark is contrived - I'm going to populate a 1-TB clustered index (so I can do other experiments with the index) as quickly as I can. The interesting part is that I'm starting with a brain-dead database layout, configuration, and method of populating the table, and then I'm going to alter different things to see the effect on the system. The effects and gains will be the interesting part as it will expose parts of how SQL Server works which *WILL* be applicable to real-life situations and workloads - the whole point of me doing all of this is to show you improvements, why they work, and how they could be useful to you.

When doing any kind of performance testing it's *essential* to have a baseline with which to compare - otherwise how do you know what effect a variation is having? This post describes my baseline setup, measurements, and limitations I start to notice.

First let me describe the baseline setup:

  • Single data file and log file stored on the same volume, on an 8-drive RAID-10 array (each drive is a 300GB 15k SCSI drive), connected through iSCSI to the server
  • Data file is created to be 1-TB, with instant file initialization enabled
  • Log file is created to be 256MB
  • Database is using the SIMPLE recovery model

Yes, I'm deliberately putting the data and log on the same RAID array. I *want* to see some contention so I can prove to you how separation of data and log can reduce contention and improve performance sometimes.

Database and table creation script:

CREATE DATABASE BackupPerfTest ON
(NAME = 'BackupPerfTest_Data',
   FILENAME = 'K:\BackupPerfTest.mdf',
   SIZE = 1TB,
   FILEGROWTH = 1GB)
LOG ON
(NAME = 'BackupPerfTest_Log',
   FILENAME = 'K:\BackupPerfTest.ldf',
   SIZE = 256MB,
   FILEGROWTH = 50MB);
GO

ALTER DATABASE BackupPerfTest SET RECOVERY SIMPLE;
GO

USE BackupPerfTest;
GO

CREATE TABLE MyBigTable (c1 BIGINT IDENTITY, c2 CHAR (4100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

I figured the fastest way to fill the database is to have a single table with one row per page, and that having SQL Server create the large CHAR column inside itself, rather than me doing a REPLICATE, would be quickest.

For the actual population of the table, I worked out that 134217728 table rows gives me a terabyte, with each row taking up a single 8KB page.
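
For anyone who wants to check the arithmetic, here's a short Python sketch of it. The constant names are mine, chosen for illustration; the 8KB page size and the CHAR (4100) column width come from the setup above.

```python
PAGE_BYTES = 8 * 1024        # size of a SQL Server data page
TARGET_BYTES = 1024 ** 4     # 1TB
CHAR_COLUMN_BYTES = 4100     # width of column c2 in MyBigTable

# Two rows cannot share a page: the CHAR column alone, counted twice,
# is already larger than an entire page.
assert 2 * CHAR_COLUMN_BYTES > PAGE_BYTES

# With one row per page, the row count is just the page count.
rows_needed = TARGET_BYTES // PAGE_BYTES
print(rows_needed)  # 134217728
```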

The baseline setup of the table population is:

  • Varying numbers of concurrent connections (16, 32, 64, 128, 256) to the server (8-way DELL PowerEdge 1950)
  • Each connection runs a simple script that inserts 134217728 / number-of-connections rows into the table, and then logs the elapsed time into a results table
  • Each insert is done as a single-statement implicit transaction (if I don't do an explicit BEGIN TRAN/COMMIT TRAN, SQL Server does it for me)
  • A monitor connection pings the results table every 5 minutes checking to see if number-of-connections results are there, and sending me email if so

Yes, I'm deliberately using this method to insert each row. Again, I want to be able to make improvements and see the effect of the changes.

Each connection will be running this script:

SET NOCOUNT ON;
GO

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   INSERT INTO MyBigTable DEFAULT VALUES;
   SELECT @counter = @counter + 1;
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

This is run through SQLCMD, and the number of rows to insert is passed into the T-SQL script using:

sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1

%1 in the line above is passed from a master CMD that kicks off number-of-connections CMD windows, each of which just runs the SQLCMD line above.
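
The master script I used is a CMD file, but the same idea can be sketched in Python. This is a hypothetical equivalent, not the script from the test: the function names are made up, and the server, database, and script path are simply copied from the sqlcmd line above. It works out the per-connection row count and starts one sqlcmd process per connection.

```python
import subprocess

TOTAL_ROWS = 134217728
SCRIPT = r"C:\SQLskills\Populate1TBTest\PopulateTable.sql"


def build_command(rows):
    """Argument list matching the sqlcmd line above for one connection."""
    return [
        "sqlcmd",
        r"-S(local)\SQLDev01",
        "-dBackupPerfTest",
        f"-i{SCRIPT}",
        "-v", f"rows={rows}",
    ]


def build_commands(connections, total_rows=TOTAL_ROWS):
    """One command per connection, each inserting an equal share of the rows."""
    rows = total_rows // connections
    return [build_command(rows) for _ in range(connections)]


def launch(connections):
    """Start every connection at once, then wait for all of them to finish."""
    procs = [subprocess.Popen(cmd) for cmd in build_commands(connections)]
    return [proc.wait() for proc in procs]
```

Calling launch(64) would start the 64-way test. Nothing is started until launch is called, so build_commands can be used on its own to inspect the command lines first.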

So - a simple setup.

Here's a graph of the results:

 

The elapsed times for all connections to complete their work (as there could be up to an hour between the first and last to complete) were:

  • 16-way: 15 hours 25 minutes 5 seconds
  • 32-way: 13 hours 50 minutes 18 seconds (11% faster)
  • 64-way: 10 hours 12 minutes 48 seconds (27% faster)
  • 128-way: 8 hours 8 minutes 27 seconds (20% faster)
  • 256-way: 7 hours 24 minutes 21 seconds (9% faster)

More connections clearly lead to a faster runtime, but the improvements from doubling the number of threads clearly aren't directly proportional to the number of threads. The biggest improvement was from 32 to 64 threads, and then the percentage gain started to tail off. Let's look at the page allocations per second for each experiment too:

 

As I'd expect, the pace at which pages are being allocated in the database increases with more threads and the percentage improvements line pretty much matches that of the elapsed time graph above. There's a slight difference in the 128 and 256 gains here because the graph shows what the perf counter number was after the experiment reached a steady state. I noticed that the log grew enormously for the last few tests, which caused the steady-state number to not be reached for a while. I've already blogged about that phenomenon in Interesting case of watching log file growth during a perf test.
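
As a rough cross-check on the allocation pace, the average insert rate for each test can be derived from the elapsed times listed earlier. Since each row takes one page, this is also roughly the average number of pages allocated per second. This is a sketch with names chosen for illustration, and the result is an average over the whole run (including any log growth), so don't expect it to match the steady-state perf counter exactly.

```python
TOTAL_ROWS = 134217728  # one row per 8KB page

# Elapsed (hours, minutes, seconds) for each test, from the results list.
ELAPSED = {
    16: (15, 25, 5),
    32: (13, 50, 18),
    64: (10, 12, 48),
    128: (8, 8, 27),
    256: (7, 24, 21),
}


def to_seconds(hours, minutes, seconds):
    return hours * 3600 + minutes * 60 + seconds


def average_rows_per_second(connections):
    """Average insert rate over the whole run for the given test."""
    return TOTAL_ROWS / to_seconds(*ELAPSED[connections])


for connections in ELAPSED:
    rate = average_rows_per_second(connections)
    print(f"{connections}-way: {rate:,.0f} rows/sec on average")
```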

I also watched what was happening in perfmon to see if there were any obvious performance problems going on. Here's a perfmon graph for the 64-way test once it reached steady-state and the log wasn't growing:

 

Analysis: 

  • The black line is above 0 when a checkpoint is occurring.
  • The green line represents how much data is being written to the K: volume, where the data and log file are. You can see there's a constant amount of data being written all the time (transaction log records when the implicit transactions commit) with large spikes whenever a checkpoint occurs and causes the actual data pages to be written out.
  • The light blue line is the pages allocated per second. You can see that it takes a major nose dive whenever a checkpoint occurs. I'm speculating that this is because of disk contention preventing the transaction log being written to disk (thus slowing down the transaction throughput) while the checkpoint is occurring and writing out data pages.
  • The dark blue line at the bottom is the average disk seconds per write. You can see that it hovers around 4-5 milliseconds and spikes to 16-17 when a checkpoint occurs.
  • The brown line in the middle is the average disk write queue length. It hovers around 18-19 and spikes to around 25 when a checkpoint occurs.

Observations:

  • This system is clearly bottlenecked in the I/O subsystem
  • There is a direct correlation between checkpoints occurring and: increased disk queue length, reduced transaction throughput

This is what I would expect to see on a system with lots of updates where the log and data are on the same volume. Remember that I've got them on a fast RAID-10 array. This debunks the theory I've often heard that contention doesn't matter on RAID arrays. Yes, it does. You can still max out the throughput capabilities of any IO subsystem - it just depends what you're trying to do with it. Imagine if I had nonclustered indexes on this table too - more logging and more pages being allocated - much worse performance...

So now I've got my baseline and there are some obvious things I can try to improve things:

  • Data and log file separation
  • Varying number of data files
  • Varying placement of data files (e.g. different RAID arrays)
  • Manual checkpoints
  • Pre-sizing the log
  • Using explicit transactions with varying batch insert sizes
  • Using -E to get larger allocations

I'm not saying that all of these things are going to help, but over the next few weeks I'm going to try them all and report on the results. You'll be able to clearly see the effect of changing these on my benchmark, running on production-quality hardware, rather than just taking people's words for it.

I hope you're going to find these benchmarks and experiments useful - I'll be learning (hopefully) as I go along too.

Let me know if there's anything else you'd like to see me try, and if you're following along (I don't want to spend all this time if no-one's reading the series!)

Thanks!

A few weeks ago I kicked off a survey on how you add geo-redundancy to a failover cluster (see here for the survey). The results as of 8/26/09 are as follows:

 

So why is this interesting? Well, many people will suggest failover clustering as the best way to provide high-availability for a database (or group of databases). And it is a great technology to protect against server failure, but there's only a single copy of the database, which is the Achilles' heel of failover clustering. If that copy of the database is damaged, the application is down unless there's another copy of the database available. This is where providing geo-redundancy comes in. With that in mind, I'm surprised at the percentage of respondents that don't provide any geo-redundancy at all.

There are a bunch of options for providing a redundant copy of a database that is hosted on a failover cluster, with pros and cons to each, and that's what I'm going to spend the rest of this post on.

SAN replication: This is where the SAN hardware itself mirrors all write I/Os to a remote SAN, thus maintaining a remote copy of the database. The hardware has to provide disk-block size and write-order preservation; otherwise the database on the remote SAN could become corrupt. Imagine if write-ordering was not preserved and some data page write I/Os were completed on the remote SAN before log record write I/Os (thereby breaking the write-ahead logging protocol) - recovery wouldn't be able to work properly! This mechanism requires a remote SAN, a second failover cluster connected to the remote SAN, a network including both clusters, and a big, fat network pipe between the two SANs. The bigness and fatness of the pipe depends, of course, on how many write I/Os are performed on the local SAN, and whether the SAN replication is synchronous or asynchronous.

Synchronous replication requires that the I/O is completed on the remote SAN and acknowledged back to the local SAN before the local I/O can be acknowledged to the local server. If the network bandwidth and latency can't support the volume of write I/Os trying to be replicated to the remote SAN, the I/Os will start to queue up and delays will be incurred on the local server. This will lead to the workload slowing down as SQL Server has to wait longer and longer for I/Os to complete. Now, with synchronous replication you have the guarantee that the remote copy of the database is completely in-sync with the local copy, so if a failure occurs, no committed data will be lost. If the network can't keep up though, you may have to switch to asynchronous replication. This means the local I/Os don't have to wait for the remote I/Os to complete, and so no performance penalty is incurred. BUT as the replication is now asynchronous, committed data may be lost if the local copy of the SAN is damaged.
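
To make the trade-off concrete, here's a toy Python model of it. Everything in it is an assumption for illustration: the function names, the idea that the local and remote writes are issued in parallel, and the numbers in the usage note are all hypothetical, and real SAN replication behavior depends on the vendor and configuration.

```python
def sync_write_latency_ms(local_ms, remote_ms, round_trip_ms):
    """Latency seen by the server for one synchronously replicated write.

    Assumption: local and remote writes are issued in parallel, so the
    acknowledgement waits for whichever path is slower.
    """
    return max(local_ms, remote_ms + round_trip_ms)


def async_write_latency_ms(local_ms):
    """With asynchronous replication only the local write is waited for."""
    return local_ms


def backlog_mb(seconds, write_mb_per_s, link_mb_per_s):
    """Data written locally but not yet sent after `seconds` of activity.

    Under asynchronous replication this backlog is the data exposed to
    loss if the local SAN is damaged.
    """
    return max(0.0, (write_mb_per_s - link_mb_per_s) * seconds)
```

For example, with a hypothetical 5ms write on each SAN and a 20ms network round trip, sync_write_latency_ms(5, 5, 20) gives 25ms per write against 5ms for asynchronous. A workload writing 100MB/s over a link that can carry 80MB/s builds up backlog_mb(60, 100, 80) = 1200MB of unsent data in a minute.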

Apart from the potential for performance problems with SAN replication, it's also very expensive - as another SAN, another cluster, and some beefy network hardware/bandwidth are required. This isn't a technology I'd expect a small company to be using or considering. Finally, the portion of the remote SAN that's being replicated to cannot be accessed at all. On the MAJOR plus side, all databases on the SAN are replicated at once, without having to set up a technology to provide a redundant copy of each. For application ecosystems that include multiple databases, this is what I like to recommend.

Log shipping: This is the simplest way to maintain a redundant copy of the database - it's just backup log, copy, restore log; repeat. It works seamlessly with failover clustering and is really easy to set up and maintain. The only problem with this is that you open yourself up to data loss, as a log shipping secondary is usually not right up-to-date with the primary. You can use the secondary for reporting/querying by restoring the log backup WITH STANDBY (which requires a little more configuration, but not much), and you can protect against accidental data damage by having a secondary with a load-delay configured, so the database is, say, 8 hours behind the primary. In my experience, this is the most common technology that's used in conjunction with failover clustering as it's the cheapest and easiest. On the downside, it's a single database solution so it's not suitable for complicated application ecosystems.

Transactional replication: This isn't very commonly used at all, although again, it works seamlessly with failover clustering after a failover. The reason this isn't used very often for geo-redundancy is that transactional replication doesn't provide database-level redundancy, only table-level. It's also much more complicated to set up and troubleshoot when things go wrong, plus there's varying latency between a transaction committing in the publication database and it being applied to the subscription database(s).

Database mirroring: Database mirroring is the only technology apart from SAN replication that can provide a zero data-loss solution when configured for synchronous operation. It works by shipping the log records from a database rather than the raw I/Os, so doesn't require anywhere near as much capital expenditure, but the network has to be able to cope with sending the log generated on the principal, otherwise performance on the principal can be affected. Mirroring is relatively easy to set up and maintain, and the mirror database can be accessed, but only through a database snapshot. When combined with failover clustering, you need to be careful about setting the mirroring partner timeout, so that the local failover cluster gets a chance to fail over before mirroring does. Check out my blog post on this: Search Engine Q&A #3: Database mirroring failover types and partner timeouts. You can configure database mirroring for synchronous or asynchronous operation, with the same performance and data-loss exposure caveats as SAN replication. SQL Server 2008 provides log stream compression and automatic page repair, which make this more attractive (see SQL Server 2008: Performance boost for Database Mirroring and SQL Server 2008: Automatic Page Repair with Database Mirroring, respectively), but only supports a single database. I'm seeing this combination start to be used more, but again, it's a single database solution so isn't suitable for complicated application ecosystems.

Backups/homebrew: Good old backups can easily be used to provide a very low cost way of maintaining a redundant copy of a database, and if you think about it, this is really do-it-yourself log shipping. At the very least, databases should *always* be included in a backup strategy, no matter what other high-availability technology(s) you may have implemented.

3rd-party solution: There are a few non-Microsoft solutions for providing redundancy with failover clustering which don't involve traditional SAN replication. I'm not an expert in any of them, but I've heard of anecdotal issues with the two I mentioned in the survey and worked with customers who've had real issues with PolyServe (one of which I blogged about).

Summary 

When you're planning a high-availability strategy, you always need to consider the limitations of technologies while evaluating them. The big limitation of failover clustering is that there's no redundant copy of the database so you need to add another technology to provide that. I've just finished writing a 35-page whitepaper for Microsoft on the high-availability technologies in SQL Server 2008, as well as how to go about planning a strategy. It will be published before PASS in November, but in the meantime, this should have given you lots of food for thought.

Next post - the next survey!

There's another SQL Quiz (from Chris Shaw) doing the rounds where people blog the answer and then tag someone. This time I got tagged by two people (Jason Massie and Gail Shaw) in the same day for the same quiz (albeit over a week ago). They either think I'm going to say something profound or funny, or maybe profoundly funny. Can you say something funnily profound? Ah, got it: profound or strange, or strangely profound, or profoundly strange. Whatever. On with the quiz. I'll try not to disappoint.

Question 1: Do you feel that you have a reliable SAN Solution? If so what is the secret?

No. Well, that was easy, eh?

Ok, seriously - we don't have a SAN or any production databases per se, as we're a training and consulting company. We do have a lot of storage hardware (3 x DELL MD3000i's packed with 26TB (unformatted)), but it's not managed by a SAN. However, we do have a lot of clients that DO have SANs. So how do we know they have a reliable SAN solution? I guess there are a number of different factors off the top of my head, and I'm not a SAN expert:

  • Was it designed for the job it's doing? 
  • Are there redundant components to protect against hardware failure?
  • Was it configured by someone who knows what they're doing, with that brand of SAN?
  • Was it load tested to ensure it can handle the job it was designed for? Was SQLIOSim run to simulate overloading the SAN to flush out any issues?
  • At the time it was configured, was the firmware all up-to-date, with no known bugs? I say 'at the time it was configured' because you have to be careful about willy-nilly upgrades to firmware in the various components. Someone that doesn't know what they're doing can destabilize a SAN by upgrading a piece of firmware that subtly changes the behavior.
  • Are page checksums configured on the SQL databases to help detect I/O problems? Are regular consistency checks being run?

I would say that a 'no' answer to any of these is cause for concern.

Question 2: Describe database mirroring in laymen’s terms.

I'll try a few different answers. You be the judge.

1) I could make this very, very simple and just say "It's really technical and you don't want to know". That's not really in the spirit of things though.

2) Imagine 2 seven year-old girls, in separate rooms (like my youngest daughter and one of her friends). Maybe even separate countries. Girl #1 is painting a picture, using the standard easel setup. There's a webcam pointing at the painting that girl #1 is doing. Girl #2 has a monitor and can see what girl #1 is doing. They're also on the phone with each other. Whenever girl #1 paints a brush-stroke, she can't paint any more until girl #2 has made the exact same brush-stroke and said "Done it." That's the synchronous part of mirroring. Girl #1 can't get ahead of girl #2. Asynchronous mirroring is where girl #1 doesn't have to wait for girl #2 to keep up. With a witness, there's a third girl, with two webcams and another phone...

Hmm - ok this analogy isn't working. It seemed so promising! Let's try again...

3) This came to mind after taking a shower this morning in the hotel north of Houston. Database mirroring is like having redundant hot-water heaters. If the hot water fails from one heater, the heat-operated valve flips and the hot water is drawn from the other hot-water heater. You need two hot water heaters, and a fast-operating valve. If the first hot water heater is fixed/warmed up again, you can manually switch the water-flow valve back. If both hot-water heaters are unavailable, no hot water. See, failover clustering won't work, because then you've only got one hot water heater, with redundant pipes coming out of it. And replication won't work because there's a lot of latency between the water leaving the hot water heater and reaching the shower head. Ok - got a bit carried-away there.

Luckily we have a redundant hotel across the street, so if there's no hot water by the time we return from a day of bird-watching on the Gulf Coast, we can move to the redundant hotel. Although that's a lot more hassle, and a lot slower than if this hotel had a redundant hot water heater...

PS Some other folks (that I know of) have replied to the quiz - here are links to their answers:

And you can follow me on Twitter at @PaulRandal

PPS I'm not tagging anyone - been too long since the quiz started I think - the usual suspects have all been tagged already.

In this week's survey I'd like to know how often you run consistency checks on your *most critical* production database, regardless of *how* you run them (we did that survey already - see Importance of how you run consistency checks). I'll report on the results around July 4th.

I'd only like you to answer for your *most critical* production database, as the frequency will probably vary wildly by database, server, production vs. dev/QA and so on. If everyone answers for their most critical database then we won't get skewed results.

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS Thanks to Pat Wright for suggesting this week's topic on Twitter.

There are a couple of issues that I've heard of in the last few weeks (one while onsite at a customer) and I think they might bite some people so I'd like to share them with you.

DBCC CHECKDB in 2005 onwards uses a hidden database snapshot to create the transactionally-consistent point-in-time view of the database that it requires to run the consistency checks. The hidden database snapshot is created as a set of NTFS alternate streams on the existing database data files. The alternative to having DBCC CHECKDB do this automatically is to manually create your own database snapshot and run DBCC CHECKDB against that - it's the same thing really.

More info on DBCC CHECKDB's use of snapshots, and potential problems can be found at:

The two issues that I've heard of both are around an inability of DBCC CHECKDB to create the hidden snapshot. In that case it is forced to use locks to stabilize the database, which usually fails because the exclusive database lock required for running the allocation checks portion cannot be acquired.

The first issue is around the permissions of the SQL Server service account. To be able to create the NTFS alternate streams, the service account must have the privileges to create files in the DATA directory of the SQL Server instance. This is a really difficult problem to track down as the actual NTFS failure message is not surfaced by the snapshot creation code.

The second issue is around the use of HP PolyServe. Upgrading to Matrix Server 3.6.1 disables support for alternate streams in the filesystem, effectively breaking DBCC CHECKDB. Here's the paragraph from the 3.6.1 upgrade guide (available here):

In previous releases, MxDB for SQL Server provided ADS support internally for use with various SQL Server features such as the DBCC CHECKDB command. This internal support has been removed in HP PolyServe Software for Microsoft SQL Server. Instead, after all servers are upgraded to 3.6.1, you will need to enable ADS support on all filesystems previously used with MxDB for SQL Server. During the upgrade to 3.6.1, SQL Server operations requiring ADS will fail, as the new ADS support feature is not yet in place on the nodes running 3.6.1. For continuity of SQL Server operations, it is important to upgrade all nodes to 3.6.1 and upgrade filesystems for ADS as quickly as possible.

Enabling support after the upgrade means running the PolyServe psfscheck command (which I believe just runs the NTFS fsutil command under the covers), which unfortunately means taking the volume momentarily offline.

Hope this helps!

I've just been setting up some of our new hardware, and wanted to do some background reading to ensure I use the correct disk partition offset, RAID stripe size, and NTFS allocation unit size to enable the best possible performance for the volumes I'm creating.

You may not have heard about this (or your disk admins may not have heard about this) but on Windows Server 2003 and before, the default partition offset typically causes worse-than-optimal performance - and correcting it can get gains of maybe as high as 30% in terms of IO latency and duration. The SQLCAT team have just published a *fantastic* whitepaper (written by Jimmy May and Denny Lee) which explains the issue simply and clearly and shows you how to correct it. You should check out the whitepaper at Disk Partition Alignment Best Practices for SQL Server.

The summary is that on Windows Server 2003 and before, the default partition offset is 31.5KB (63 x 512-byte disk sectors), which does not align with the common RAID stripe sizes of 64KB or 128KB, or the optimal NTFS allocation unit size of 64KB. Because of the misalignment, some I/Os have to read/write multiple stripes instead of one, which causes a big perf drop. It can be fixed, as detailed in the whitepaper. For volumes *created* on Windows Server 2008, the problem does not exist as it creates a default partition offset of 1024KB.
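
The arithmetic behind that summary is easy to check. Here's a small sketch (the column aliases are mine) that divides each default partition offset by a 64KB stripe unit and looks at the remainder - zero means the partition starts on a stripe boundary:

```sql
-- Remainder of (partition offset / 64KB stripe unit); 0 means aligned
SELECT
    (63 * 512)    % (64 * 1024) AS win2003_default_remainder,   -- 32256, so not aligned
    (1024 * 1024) % (64 * 1024) AS win2008_default_remainder;   -- 0, so aligned
```

The same check works for a 128KB stripe size by swapping in (128 * 1024).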

In fact Jimmy just published a blog post to help you make the case to your disk admins/customers a few days ago: Disk Partition Alignment (Sector Alignment): Make the Case: Save Hundreds of Thousands of Dollars.

Luckily I'm using Windows Server 2008, which correctly sets the disk partition offset for the vast majority of cases.

Next thing I considered was RAID stripe size and NTFS allocation unit size (previously known as 'cluster size'). Kendal Van Dyke just published an *excellent* blog post series that provides a lot of empirical evidence as to what the best numbers are for the RAID level you're using. This saved me a lot of time. Check out his series at Disk Performance Hands On Series.

The Dell MD3000i units I'm using don't go any lower than 128KB for a RAID stripe size, so the default is fine. Unfortunately, I forgot to set the NTFS allocation unit size to 64KB when creating the partitions in Windows, so I need to recreate the partitions.

A massive thank-you to these guys for saving me a lot of time and hassle. You should go read this stuff too.

This has come up a few times now, most recently in an email question this morning - subsequent runs of DBCC CHECKDB show varying numbers of corruptions, and sometimes no corruptions - what's going on? Even more strange - a maintenance job runs a DBCC CHECKDB, which shows errors, but then in the morning - no consistency errors. What?

I answered this back in the April 2009 SQL Q&A column in TechNet Magazine, but I want to get it here on the blog too in a bit more detail. The answer has to do with the way the database is consistency checked, and how corruptions are detected.

In 2005 onwards, you're going to be using page checksums to help detect corruption. If you created the database on 2005 onwards, page checksums are enabled by default and every allocated page will have one. If you upgraded a database from 2000 or before, then you need to manually enable page checksums with ALTER DATABASE. Then nothing happens. Until a page is read in, changed, and then written back out. So your upgraded database will have a mixture of nothing/page checksums, or torn-page detection/page checksums. Note: torn-page protected pages remain torn-page protected, even with page checksums enabled, until the next time they're altered. Then they get a page checksum. See Inside The Storage Engine: Does turning on page checksums discard any torn-page protection? for an explanation and examples.
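
For an upgraded database, enabling page checksums and confirming the setting might look like this (a sketch only - SalesDB is a hypothetical database name):

```sql
-- Hypothetical database name: SalesDB
ALTER DATABASE SalesDB SET PAGE_VERIFY CHECKSUM;
GO

-- Confirm which page-verify option each database on the instance is using
SELECT name, page_verify_option_desc
FROM sys.databases;
GO
```

Remember that this only changes the setting - existing pages don't get a page checksum until they're next modified and written out.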

Once you've got page checksums enabled, how can you tell if there are corruptions in the database? Well, there are a number of ways corruptions will show up:

  1. You run an operation that hits a page that has been corrupted, and the page checksum test fails
  2. You run a BACKUP ... WITH CHECKSUM and it finds a page with a bad checksum
  3. You run a DBCC CHECKDB and it finds a page with a bad checksum

That's all very well, but what if a page *doesn't* have a page checksum on it (because it hasn't been changed since page checksums were enabled)? None of #1 to #3 will fail because of a bad page checksum, as there isn't a page checksum to check. #1 might fail, depending on how corrupt the page is, and it will likely fail with an obscure message that doesn't immediately scream 'corruption'. #2 won't fail, as the only time BACKUP examines what it's backing up is when WITH CHECKSUM is enabled and a page has a page checksum on it. #3 might find the corruption, depending on how the page is corrupt. If the corruption is in the middle of a large varchar field, for instance, probably not. Your best bet is to have page checksums enabled and regularly run DBCC CHECKDB.
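
As a rough sketch of #2 and #3 in practice, assuming a hypothetical database called SalesDB and a made-up backup path:

```sql
-- Hypothetical names: SalesDB and the backup file path.
-- WITH CHECKSUM makes the backup test any page checksums it comes across.
BACKUP DATABASE SalesDB
TO DISK = N'D:\Backups\SalesDB.bak'
WITH CHECKSUM, INIT;
GO

-- Re-verify the checksums in the backup without actually restoring it
RESTORE VERIFYONLY
FROM DISK = N'D:\Backups\SalesDB.bak'
WITH CHECKSUM;
GO

-- The regular consistency check
DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
```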

That's how corruptions are detected. So what about the disappearing corruptions? This gets into how consistency checks work. Consistency checks only run on the pages in the database that are allocated. If a page isn't allocated to anything, then the 8192 bytes of it are meaningless and can't be interpreted. Don't get confused between reserved and allocated - I explain that in the first misconceptions post here. As long as a page is allocated, it will be consistency checked by DBCC CHECKDB, including testing the page checksum, if it exists. A corruption can seem to 'disappear' if a corrupt page is allocated at the time a DBCC CHECKDB runs, but is then deallocated by the time the next DBCC CHECKDB runs. The first time it will be reported as corrupt, but the second time it's not allocated, so it isn't consistency checked and won't be reported as corrupt. The corruption looks like it's mysteriously vanished. But it hasn't - it's just that the corrupt page is no longer allocated. There's nothing stopping SQL Server deallocating a corrupt page - in fact, that's what many of the DBCC CHECKDB repairs do - deallocate what's broken, and fix up all the links.

The maintenance job phenomenon can occur because of the order of operations in the job. If the DBCC CHECKDB is first, and then there's an index rebuild, and the index rebuild happens to rebuild an index that DBCC CHECKDB had found a corruption in, then the *new* index will have a completely different set of database pages, and won't contain the corrupt page. Bingo - disappearing corruption. A subsequent DBCC CHECKDB might not find any corruption, because the previously corrupt pages are no longer allocated.

Bottom line - any time you get corruption error messages, 99.999% of the time it's your I/O subsystem that's got problems, even if the corruptions 'disappear'.
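
One place that may still hold a record of corruptions that have since 'disappeared' is the suspect_pages table in msdb, which SQL Server updates when it encounters 823 or 824 errors. This table isn't discussed above, so treat the following as a supplementary sketch rather than part of the explanation:

```sql
-- Pages that SQL Server has recorded as suspect, most recent first
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
```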

PS Don't forget to follow along on Twitter - http://twitter.com/PaulRandal

At the last few conferences I've presented at, there have been questions about using SSDs (Solid-State Drives) for enterprise storage and whether that will change some of the database maintenance practices. My answer to that is "I don't know" (ha - bet you thought I was going to say "It depends!") because adoption of SSDs is very low. I haven't been able to find much info about using them, but the Microsoft Research group in Cambridge just published a research paper Migrating Server Storage to SSDs: Analysis of Tradeoffs, which does a nice job of walking through the issues involved and concludes that for the majority of workloads, it makes more economic sense to host them on HDDs. The exception is for top-end OLTP databases. I'll warn you that this isn't a whitepaper - it's a research paper, and gets a bit deep into algorithms and mathematical analyses, but if you're up to the challenge it's a great read.

You can download the paper from http://research.microsoft.com/en-us/um/people/antr/ms/ssd.pdf. Enjoy!

PS I found it on James Hamilton's blog.

There are two pretty well-known I/O errors - 823, and 824 - but there's also one called 825 which most DBAs do *not* know about, and definitely should.

From SQL Server 2005 onwards, if you ever see an 823 or 824, SQL Server has actually tried that I/O a total of 4 times before it finally declares a lost cause and surfaces the high-severity I/O error to the connection's console, killing the connection into the bargain. The idea behind this read-retry logic came from Exchange, where adding the logic reduced the amount of immediate downtime that customers experienced. While in concept this was something I agreed with at the time, I didn't agree with the way it was implemented.

If the I/O continues to fail, then the 823/824 is surfaced - that's fine. But what if the I/O succeeds on one of the retries? No high-severity error is raised, and the query completes, blissfully unaware that anything untoward happened. However, something *did* go badly wrong - the I/O subsystem failed to read 8KB of data correctly until the read was attempted again. Basically, the I/O subsystem had a problem, which luckily wasn't fatal *this time*. And that's what I don't like - the I/O subsystem went wrong but there are no flashing lights and alarm bells that fire for the DBA, as with an 823 or 824. If read-retry is required to get a read to complete, the only notification of this is a severity-10 informational message in the error log - error 825. It looks like this:

Msg 825, Level 10, State 2, Line 1
A read of the file 'D:\SQLskills\TestReadRetry.mdf' at offset 0x0000017653C000 succeeded after failing 2 time(s) with error: incorrect checksum (expected: 0x4a224f20; actual: 0x2216ee12). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Now, what this message is really saying is that your I/O subsystem is going wrong and you must do something about it. And unless you're regularly scanning the error log looking for these, you'll be none-the-wiser.

So - my recommendation is that you add a specific Agent alert for error 825, along with your other alerts (see following blog post).
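
If you'd rather script the alert than click through the wizard, a minimal sketch using the msdb stored procedures could look like the following. The alert name is made up, and I'm assuming an operator called 'SysAdmin' with an email address already exists on the instance:

```sql
USE msdb;
GO

-- Alert name is hypothetical; pick whatever fits your naming convention
EXEC dbo.sp_add_alert
    @name = N'Error 825 - read-retry required',
    @message_id = 825,
    @severity = 0,                        -- must be 0 when @message_id is specified
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1;    -- 1 = include the error text in email
GO

-- Assumes an operator named 'SysAdmin' has already been created
EXEC dbo.sp_add_notification
    @alert_name = N'Error 825 - read-retry required',
    @operator_name = N'SysAdmin',
    @notification_method = 1;             -- 1 = email
GO
```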

Every so often I'll see posts on the various data corruption forums discussing causes of corruption. In this post I want to debunk some of the myths around what can cause corruption. There are really two types of corruption to deal with, physical corruption and logical corruption.

Physical corruption

This is where something has altered the contents of a data or log file sector with no regard for what is being stored there. Possible causes of physical corruption are:

  • Problem with the I/O subsystem (99.8% of all cases I've ever seen - only 3 nines as I'd estimate I've seen around about a thousand corruption cases). Remember the I/O subsystem is everything underneath SQL Server in the I/O stack - including the OS, 3rd-party file system filter drivers, device drivers, RAID controllers, SAN controllers, network hardware, drives themselves, and so on. Millions of lines of code and lots of moving parts spinning very fast, very close to very fragile pieces of metal oxide (I once heard Jim Gray liken a disk drive head to a 747 jumbo jet flying at 500 mph at a height of 1/4 inch from the ground...)
  • Problem with the host machine hardware (0.1% of cases). Most of the time this is a memory error.
  • SQL Server bugs (0.1% of cases). Yes, there have been corruption bugs. Every piece of software has bugs. There are KB articles describing bugs.
  • Deliberate introduction of corruption using a hex editor or other means.

Physical corruption is what DBCC CHECKDB usually reports and the majority of cases are caused by physical failures of some kind, with the minority caused by humans - software bugs.

Logical corruption

This is where something has altered some data so that a data relationship is broken. Possible causes of logical corruption are:

  • Humans

:-) Okay...

  • Application bug. The application deletes one part of an inherent data relationship but not the other. Or the application designer doesn't implement a constraint properly. Or the application designer doesn't cope with a transaction roll-back properly. You get the idea.
  • Accidental update/delete. Someone deletes or updates some data incorrectly.
  • SQL Server bug. See above.
  • DBCC CHECKDB when using the REPAIR_ALLOW_DATA_LOSS option. As is documented in Books Online, and I've blogged about and mentioned when lecturing, if you run repair, it doesn't take into account any inherent or explicit constraints on the data.
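
On that last point, one way to look for constraint violations left behind after a repair is DBCC CHECKCONSTRAINTS. A minimal sketch, assuming a hypothetical database called SalesDB:

```sql
USE SalesDB;   -- hypothetical database name
GO

-- Check all constraints (enabled and disabled) on every table in the current database
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;
GO
```

This only covers explicit constraints - any inherent business-logic relationships in the data still have to be checked by the application or by hand.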

The point here is that a physical failure of a component does not cause logical corruption, it causes physical corruption. Conversely, application errors cause logical corruption, not physical corruption. DBCC CHECKDB errors are about physical corruption (okay, with the inclusion of DBCC CHECKCATALOG code in 2005, it will find cases where the DBA has manually altered the system tables, causing logical corruption) and applications cannot cause physical corruption as they can only manipulate data through SQL Server. If an application hits a SQL Server bug which causes physical corruption, that's still not the application causing physical corruption, it's SQL Server.

So - on to the myths.

  • Can an application cause physical corruption? No.
  • Can stopping a shrink operation cause corruption of any kind? No.
  • Can stopping an index rebuild cause corruption of any kind? No.
  • Can running DBCC CHECKDB without repair cause corruption of any kind? No.
  • Can creating a database snapshot cause corruption of any kind? No.

Hope this helps.

Many times I've been asked to do a blog post about creating Agent alerts, and given that today I demo'd it as part of our Accidental DBA workshop at Connections, it seemed a good time to do the blog post too!

I demo this in the context of alerting a DBA when an 823 or 824 IO error occurs. One of my early blog posts (see here) explains what these are, as well as providing a corrupt database that you can use to see these errors happening.

The idea is that I want to know as soon as an IO error occurs so I can start recovering and take preventative action to stop it happening again. I don't want to rely on users telling me when a query hits an IO error, and I don't want to have to scan the SQL error logs to find them. So I'm going to create an alert.

The first step is to fire up Management Studio and make sure SQL Server Agent is running. Next we need to make sure there's actually an Operator defined - so the new alert has someone to actually alert! - so we'll use the New Operator wizard (see below for how to get there).

In the New Operator Wizard that appears, I've created an operator named 'SysAdmin'. There are three Notification Options you can use - email, net send, and pager. I've set up SysAdmin to use net send to my local machine. You need to make sure the Messenger service is enabled otherwise net send will not work. Also, be aware that net sends will NOT work unless the machine has a network connection - even if the net send source and destination are the same machine! Given the various issues with net send, it's better to use email or pager alerts - but for the purposes of this blog post it's the easiest option.

Now let's create the new alert - using the New Alert wizard (see below for how to get there).

In the New Alert Wizard that appears, I've created an alert named 'IO Errors' for all severity 24 errors on all databases. Below is a portion of the General tab of the wizard showing these settings:

I also need to specify what happens. In the Response tab of the wizard I've set the SysAdmin operator to be notified using net send. Again, see below.

In the Options tab I've checked the box to include the error text in the net send.
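
For anyone who prefers scripts to wizards, the same setup can be sketched in T-SQL with the msdb stored procedures. This is an approximation of the wizard steps above rather than a script generated from them, and the net send address is a hypothetical machine name:

```sql
USE msdb;
GO

-- The operator, notified by net send ('MYMACHINE' is a hypothetical machine name)
EXEC dbo.sp_add_operator
    @name = N'SysAdmin',
    @enabled = 1,
    @netsend_address = N'MYMACHINE';
GO

-- The alert: all severity 24 errors, all databases
EXEC dbo.sp_add_alert
    @name = N'IO Errors',
    @severity = 24,
    @enabled = 1,
    @include_event_description_in = 4;   -- 4 = include the error text in the net send
GO

-- Tie the alert to the operator
EXEC dbo.sp_add_notification
    @alert_name = N'IO Errors',
    @operator_name = N'SysAdmin',
    @notification_method = 4;            -- 4 = net send
GO
```

To use email instead, supply @email_address on the operator and use 1 for both @include_event_description_in and @notification_method.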

Now let's test it. Using the database called 'broken' that I provide as an example (see here), I'll force an IO error to occur. In my query window I get:

SELECT * from broken..brokentable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 10 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

And a few seconds later I get the net send:

Pretty cool!

You can also use the WMI Provider to do this - see Creating a SQL Server Agent Alert by Using the WMI Provider for Server Events.

This is a really interesting question that came up in the Microsoft Certified Architect class I'm teaching at present - if a database has torn-page protection enabled, and page checksums are enabled, is all the existing torn-page detection lost?

This is an important question, because enabling page checksums doesn't suddenly make all allocated pages be protected by page checksums (it's not until a page is read into the buffer pool, modified, and then written back to disk, that it gets a page checksum). If all the existing torn-page protection is discarded when page checksums are enabled, then the pages would be unprotected until they got page checksums on. I couldn't remember the answer, so I experimented!

My idea was to create a database with torn-page protection, create a table with a simulated torn-page in it, then enable page checksums and see if the torn-page was still reported.

-- Create the test database
USE master;
GO
CREATE DATABASE ChecksumTest;
GO
USE ChecksumTest;
GO

-- Explicitly set the database to have torn-page detection
ALTER DATABASE ChecksumTest SET PAGE_VERIFY TORN_PAGE_DETECTION;
GO

-- Create a test table and insert a row.
CREATE TABLE BrokenTable (c1 INT, c2 CHAR (1000));
INSERT INTO BrokenTable VALUES (1, 'a');
GO

-- Ensure the page is written to disk and then tossed from the buffer pool
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

Now I'm going to examine the page. There are two bits in the page header that specify whether the page is protected by torn-page detection or with a page checksum. Specifically, the m_flagBits field will have 0x100 set if the page is encoded for torn-page protection, and 0x200 set if the page has a page-checksum stored on it, and the page has not been modified (i.e. the checksum is still valid). You should not see the 0x100 bit set as torn-page encoding is removed when the page is read into the buffer pool - UNLESS the page IS actually torn, in which case the encoding is NOT removed.

sp_allocationmetadata 'BrokenTable';
GO
DBCC TRACEON (3604);
GO
DBCC PAGE ('ChecksumTest', 1, 143, 3);
GO

<snip>

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 2                        m_freeCnt = 6070
m_freeData = 2118                    m_reservedCnt = 0                    m_lsn = (28:183:2)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 770
      

<snip>     

In this case the torn-page encoding has been removed, and the page is fine. Once I've corrupted the page on disk, it's tricky to be able to see it with DBCC PAGE. I managed to catch it once and saw the following:

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8100
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 1                        m_freeCnt = 7083
m_freeData = 1107                    m_reservedCnt = 0                    m_lsn = (28:81:20)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 41949233
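
To make the bit-testing concrete, here's a small sketch (variable and column names are mine) that decodes the m_flagBits value of 0x8100 from the torn page's header using the two bits described earlier:

```sql
DECLARE @flagBits INT;
SET @flagBits = 33024;   -- 0x8100, the m_flagBits value from the torn page

SELECT
    CASE WHEN (@flagBits & 256) <> 0 THEN 'yes' ELSE 'no' END AS torn_page_encoded,      -- 256 = 0x100
    CASE WHEN (@flagBits & 512) <> 0 THEN 'yes' ELSE 'no' END AS valid_page_checksum;    -- 512 = 0x200
```

For 0x8100 this gives 'yes' and 'no'. Setting @flagBits to 32768 (0x8000, the value from the undamaged page) gives 'no' for both.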

Now if I try to select from the table I get:         

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
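
As an aside, the offset in that message is simply the page number multiplied by the 8192-byte page size, which is easy to confirm (column aliases are mine):

```sql
-- Page (1:143) is page 143 in file 1; each page is 8192 bytes
SELECT
    143 * 8192 AS offset_in_bytes,                       -- 1171456
    CAST(143 * 8192 AS VARBINARY(4)) AS offset_in_hex;   -- 0x0011E000
```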

The crux of the question is whether this will still be reported if the database switches to page checksums - let's try:

ALTER DATABASE ChecksumTest SET PAGE_VERIFY CHECKSUM;
GO

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Cool! The answer is YES - the torn-page is still detected, because the bit in the page header specifies which page protection algorithm the page is using. In fact, it even works if you turn off page checksums and torn-page detection completely.

Here's an issue that I thought was a one-off but it just popped up again over the weekend so I want to publicize it.

DBA runs a DBCC CHECKDB and gets output like the following:

Msg 5172, Level 16, State 15, Line 1
The header for file 'E:\Data\namechanged.mdf:MSSQL_DBCC14' is not a valid database file header. The PageAudit property is incorrect.
Msg 5120, Level 16, State 9, Line 1
Unable to open the physical file "E:\Data\namechanged.mdf:MSSQL_DBCC14". Operating system error 0: "(null)".
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.

etc

Wow! Looks like something's seriously wrong with that database. Until you find that this happens on *all* the databases on the server. And on all the other servers in the data center too. What on earth's going on? I saw a similar issue once before, back in 2004 or 2005 while customers were testing pre-RTM IDW builds of SQL Server 2005. After running a test and doing some investigation for me, the person who just hit this issue over the weekend confirmed my hypothesis that he'd hit the problem I saw a few years ago.

The problem is a 3rd-party file encryption solution. It installs a file-system filter driver that filters all reads and writes to the database files so it can perform the encryption/decryption. Unfortunately it doesn't cope with NTFS alternate streams - which is how the internal database snapshots that DBCC CHECKDB uses are created - and so returns garbage for all reads from the database snapshot. The simple workaround is to create your own database snapshot (whose constituent files will exist in their own right, rather than being alternate streams of the source database's files) and then run DBCC CHECKDB against that.

Now - I don't want to cry 'wolf' here, but be careful of attributing massive corruption problems to the real IO subsystem if there's any kind of file-system filter driver installed.

There's been a very interesting discussion going on over at SQLServerCentral.com about whether to create multiple files for a user database because the server has multiple CPUs - see the thread here. I wrote a couple of long responses during the thread that I wanted to duplicate here as I think it's of broad interest.

My first response was:

Doesn't make sense to split any database into multiple files for per-cpu performance, except tempdb which can suffer from multiple-CPUs all trying to modify the same allocation bitmaps under high load with small temp tables being created/deleted (see http://www.microsoft.com/technet/prodtechnol/sql/2005/workingwithtempdb.mspx for details).

Now, saying that, there's an exception - and that's when a non-tempdb database has the same allocation bitmap contention issues, but that only happens on the very high-end with thousands of single-row inserts per second happening on each CPU. It's pretty rare. I've never seen it but Kimberly has.

What you should be considering is splitting a database into filegroups to be able to isolate critical tables within separate filegroups, or splitting a partitioned table across multiple filegroups. For SS2005, these aren't for performance but instead for manageability (e.g. being able to restore a single filegroup offline while the rest of the database remains online, or setting several partitions of a table to be read-only). For SS2008, there are improvements to query plan generation over multiple partitions that could allow for better parallelism, but it's still not optimal compared to non-partitioned data.

So - complex topic and these are simple guidelines. Hope they help.

This was followed by a discussion pointing out that various hardware vendors imply the opposite, as do several MS sources. So here was my latest response:

Hi folks,

What's really not made clear in the MS literature is that the multiple-files in a database really only applies to tempdb. Even on tempdb, on 2005 you don't need one file per core - more like 1/4 to 1/2 as many files as there are cores.

The tempdb problem is this - common workloads create and drop many worktables per second. The allocations that happen for a table are initially single-pages (as opposed to whole extents). This means that a search of the SGAM page for the 4GB interval needs to be done to find a mixed-extent with a free page to allocate. Multiple CPUs all hammering this page cause contention on it and performance problems. Then, a page needs to be allocated for the first IAM page - same thing happens. Then these pages need to be marked allocated in a PFS page - same thing happens. And then these pages need to be inserted into the sysindexes row for the table - more contention. On 2000 this was particularly bad - so T1118 plus multiple files was the solution, where SQL Server would round-robin the single page allocations in the files in tempdb, alleviating the contention somewhat.

In SQL Server 2005, we changed the temp table mechanism so that whenever a temp table is dropped, one data page, one IAM page, and the system table entries (no longer in sysindexes, but instead in a 'hidden' table called sys.allocation_units) are cached. When a new temp table is allocated, if there's a cached 'template temp table' it is picked up and used without so much contention on the various allocation bitmaps. On a heavily loaded system there can still be contention and so you still need multiple files for an SMP box, but just not so many. And you don't need T1118 any more.

So - this is more prevalent on tempdb, but CAN happen on a user database under extreme load on monster hardware. Testing should show whether this is happening to you - if not, don't create multiple files for performance.

Now, in terms of what works for your particular vendor's IO scalability - that's beyond me and you may need to think about that if it's something they recommend. However - I'd still take it with a pinch of salt and do your own testing. See the whitepaper at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx for some testing/benchmarking info.

Hope this helps clarify a bit more - great discussion!

I'm interested to hear any other opinions on this - especially cases where you've found it necessary to create multiple files for performance.

Thanks!
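
As a starting point for the testing mentioned in the responses above, one common way to look for allocation bitmap contention is to watch for page latch waits on tempdb pages while the system is under load. This is a rough sketch rather than a complete diagnostic - it's a point-in-time view, so it needs to be sampled repeatedly:

```sql
-- Tasks currently waiting on page latches for pages in tempdb (database ID 2).
-- resource_description is in the form dbid:fileid:pageid.
SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE N'PAGELATCH%'
  AND resource_description LIKE N'2:%';
```

Lots of waits on pages such as 2:1:1 (PFS) or 2:1:3 (SGAM) suggest the contention described above. For a user database, swap the 2 for that database's ID.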

There's a new whitepaper on TechNet that I've just come across (even though it was published 6 months ago!) called Predeployment I/O Best Practices. It's really good - discussing the following:

  • Guidelines for determining I/O capacity
  • Disk configuration best practices and common pitfalls
  • Using SQLIO to determine capacity and interpreting its results
  • Using System Monitor to monitor an IO subsystem

You can read it here and I've added a link to our whitepapers page.

While I'm on the subject of I/O, Bob Dorr (a Senior Escalation Engineer in PSS) published a blog post last year that debunks a couple of urban legends around SQL Server's IO, specifically:

  • The myth that SQL Server used one thread per data file
  • The myth that a disk queue length greater than 2 indicates an I/O bottleneck

Check out his blog post and the subsequent discussion here.


One of the hottest features in SQL Server 2005 is database mirroring, and it's helped many companies implement successful and relatively inexpensive high-availability strategies. In SQL Server 2008, Database Mirroring has been enhanced in several ways - one of which is the ability to automatically repair corrupt pages!

This feature is based on the fact that the principal and mirror databases are exactly the same. So, if a page becomes corrupt on the principal, SQL Server should be able to read the page from the mirror and use it to fix the principal. Similarly, if a page becomes corrupt on the mirror, the page can be read from the principal to fix up the mirror. Pretty cool, eh?

Details

  • The feature works for pages that have 824 errors, 823 errors where the OS returns a CRC error while reading the page (to prevent resource issues triggering a page repair), and pages that have 829 errors (where the page is marked as restore pending). See my previous post here for more details on page errors.
  • Pages are fixed asynchronously.
    • If the page is corrupt in the principal, the query that hit the corrupt page will fail. Once discovered, a page is marked as being 829 until it's fixed. This prevents an issue where a transient disk error could allow a subsequent update to change the page after it's been queued for being repaired, and then the page is overwritten with a copy from the mirror, losing the update. Nasty.
    • If the page is corrupt in the mirror (which is discovered when the page is read as part of the continual recovery of the log), the mirroring session is suspended. The mirror keeps track of all corrupt pages that need to be repaired with copies from the principal. Once all corrupt pages have been repaired, the mirroring session will be resumed automatically. This means that if a page is corrupt in both the mirror and the principal, manual intervention will be required to resolve the issue.
  • The feature is only available in Enterprise Edition. This means that if the principal is on Enterprise Edition and the mirror is on Standard Edition, then corrupt pages on the principal can be repaired from the mirror but not the other way around.
  • There is a new DMV - sys.dm_db_mirroring_auto_page_repair - that allows you to track corrupt pages in mirrored databases:
    • It covers all mirrored databases on a server.
    • It provides info on the last 100 corrupt pages that were found in any mirrored database, as well as the status of the automatic page repair operation.
    • This DMV isn't yet included in the SQL Server 2008 Books Online available on TechNet but will have the following info:
      • Database ID the page is in
      • The Page ID, split into file and page-in-file
      • The error type - distinguishing between 823 errors, torn-page errors, page checksum failures, and all-other-824 errors
      • The status of the page repair operation
      • The time that the status was last updated
  • If a page repair fails for any reason (e.g. the mirroring partner couldn't supply the page) then the repair will be marked as failed. If the page is then hit again (by a query on the principal or a recovery operation on the mirror) then it will be re-queued for repair.
  • Not all pages can be repaired - the file header page, database boot page, and allocation bitmap pages (GAM, SGAM, PFS) cannot be repaired this way.
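
A minimal sketch of how the DMV could be queried with a readable status column follows. The column names match the sqlcmd output shown further down in this post, and page_status 5 (repair succeeded) is confirmed there. The other status descriptions are an assumption taken from later documentation, so verify them against your build before relying on them.

```sql
-- Sketch only: status descriptions other than 5 are an assumption
-- (taken from later documentation), not something shown in this post.
SELECT
    DB_NAME(database_id) AS database_name,
    file_id,
    page_id,
    error_type,          -- left raw; interpret against Books Online for your build
    CASE page_status
        WHEN 2 THEN 'Queued for request from partner'
        WHEN 3 THEN 'Request sent to partner'
        WHEN 4 THEN 'Queued for repair (partner responded)'
        WHEN 5 THEN 'Repair succeeded'
        WHEN 6 THEN 'Irreparable'
        ELSE 'Unknown'
    END AS page_status_desc,
    modification_time
FROM sys.dm_db_mirroring_auto_page_repair
ORDER BY modification_time DESC;
GO
```

Any row that stays at a status other than 5 is a candidate for manual investigation.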

In Action

I've got a system with a few SQL Server 2008 instances running so I decided to give it a try. Here's what I did, using the TicketSalesDB from the Always-On DVDs and labs that SQLskills produces:

  • Changed the database to use page checksums
  • Rebuilt the clustered index of one of the tables (so the pages have page checksums on)
  • Took the initial backup of the database that's needed for mirroring (so that it has no corruptions in)
  • Used DBCC IND to find a page in the leaf level of the clustered index and corrupted the page
  • Setup mirroring (but using the backup I took before introducing the corruption - so the mirror database will have a clean copy of the page I corrupted)
  • Performed a query on the table with the corrupt page to force the page checksum failure and kick-off automatic page repair
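
The preparation steps above (everything before the corruption itself and the mirroring setup) can be sketched roughly as follows. The table name dbo.TicketSales and the backup path are hypothetical placeholders, not names taken from the lab, and the mirroring setup itself (restoring WITH NORECOVERY on the mirror, creating endpoints, setting partners) is deliberately left out.

```sql
-- Hypothetical names: dbo.TicketSales and the backup path are
-- placeholders for illustration only.

-- 1. Switch the database to page checksums.
ALTER DATABASE TicketSalesDB SET PAGE_VERIFY CHECKSUM;
GO

USE TicketSalesDB;
GO

-- 2. Rebuild the table's indexes so its pages are rewritten with checksums
--    (ALL is used here for simplicity; the clustered index alone would do).
ALTER INDEX ALL ON dbo.TicketSales REBUILD;
GO

-- 3. Take the clean full backup that will be used to initialize the mirror.
BACKUP DATABASE TicketSalesDB
TO DISK = N'C:\sqlskills\TicketSalesDB_clean.bak'
WITH INIT;
GO

-- 4. List the pages of the clustered index (index ID 1).
--    Rows with PageType = 1 are data pages, i.e. the leaf level.
DBCC IND (N'TicketSalesDB', N'dbo.TicketSales', 1);
GO
```

DBCC IND is undocumented, so treat its output format as subject to change between builds.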

And it worked! Going back into the instance with the principal database and querying the DMV gives:

C:\>sqlcmd /E /S.\KATMAI01
1> SELECT * FROM sys.dm_db_mirroring_auto_page_repair;
2> GO
database_id file_id     page_id              error_type page_status modification_time
----------- ----------- -------------------- ---------- ----------- -----------------------
          6           4                 4256         -1           5 2007-09-27 17:23:20.067

(1 rows affected)
1>

The page I corrupted was (4:4256) and page_status of 5 means the repair succeeded. Running the query again confirms that the corruption has been fixed. The page was also logged in the suspect_pages table in msdb:

1> SELECT * FROM msdb..suspect_pages;
2> GO
database_id file_id     page_id              event_type  error_count last_update_date
----------- ----------- -------------------- ----------- ----------- -----------------------
          6           4                 4256           5           1 2007-09-27 17:23:20.407

(1 rows affected)
1>

In Books Online (both 2005 and 2008), event_type of 5 means that the page was repaired.
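
To avoid looking up the event_type codes each time, a query along these lines could decode them. The descriptions are paraphrased from the Books Online entry for suspect_pages; this is a convenience sketch rather than anything the post itself ran.

```sql
-- Decode event_type in msdb's suspect_pages table.
-- Descriptions paraphrased from Books Online.
SELECT
    DB_NAME(database_id) AS database_name,
    file_id,
    page_id,
    CASE event_type
        WHEN 1 THEN '823 error, or 824 error other than bad checksum/torn page'
        WHEN 2 THEN 'Bad checksum'
        WHEN 3 THEN 'Torn page'
        WHEN 4 THEN 'Restored'
        WHEN 5 THEN 'Repaired'
        WHEN 7 THEN 'Deallocated by DBCC'
        ELSE 'Unknown'
    END AS event_type_desc,
    error_count,
    last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
GO
```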

I checked the SQL Server error log for the principal and this is what I found (the bolding is mine):

2007-09-27 17:17:10.41 spid25s     Database mirroring is active with database 'TicketSalesDB' as the principal copy. This is an informational message only. No user action is required.
2007-09-27 17:23:19.92 spid51      Error: 824, Severity: 24, State: 2.
2007-09-27 17:23:19.92 spid51      SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x55684fbe; actual: 0x16e84fbe). It occurred during a read of page (4:4256) in database ID 6 at offset 0x00000002140000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.6\MSSQL\Data\TicketSalesFG2005Q1.NDF'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2007-09-27 17:23:19.96 spid26s     Database mirroring is attempting to repair physical page (4:4256) in database "TicketSalesDB" by requesting a copy from the partner.
2007-09-27 17:23:20.42 spid26s     Database mirroring successfully repaired physical page (4:4256) in database "TicketSalesDB" by obtaining a copy from the partner.
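
As an aside on reading the 824 message above: the offset it reports is simply the page number multiplied by the 8KB (8192-byte) page size, which is a handy cross-check when matching an error to a location in a data file. A small worked example, using the page from this test:

```sql
-- Worked arithmetic: byte offset of page 4256 within its data file.
DECLARE @page_id bigint;
SET @page_id = 4256;

SELECT
    @page_id * 8192                       AS byte_offset,      -- 34865152
    CAST(@page_id * 8192 AS varbinary(8)) AS byte_offset_hex;  -- 0x0000000002140000
GO
```

The hex value matches the offset 0x00000002140000 in the error message, apart from the number of leading zeros displayed.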

I also checked the error log for the mirror and there's nothing relevant in there. I guess the same info would be output to the mirror database's error log if the mirror became corrupted. It's reasonably simple to check this using similar steps as above, but corrupting the database before the initial backup is taken, fixing the corruption again after the backup and before mirroring starts (so the corruption is on the mirror but not the principal), and then triggering an update on the corrupt page. When the update is replayed on the mirror, the corruption will be hit and the page repaired in the same way. I'll check later and blog if there's any difference.

Summary

SQL Server 2008 introduces an enhancement to database mirroring that can bi-directionally pull pages between the principal and mirror databases to fix page corruptions. One word of caution I'd give is that this feature doesn't mean you can ignore these errors when they occur - you still need to do root-cause analysis on the corruption and take steps to prevent it happening again before a corruption occurs that automatic page repair cannot fix (as I mentioned above).

Nevertheless, this is a tremendously useful feature that's going to save a lot of downtime. Cool!

One of the comments I received recently is below:

Hi Paul,

If the corruption happens to be related to I/O Erros and there is nothing in the Event log or anywhere that points to I/O related issues, is there any Trace flag that we can enable when performing checkdb or checktable operations that can show us any information related to I/O Problems, Driver issues etc?. we have table corruptions happening on a regular basis but I need some kind of evidence to show to the SAN guys thats its a disk issue and not necessarily SQL Server. Any ideas or suggestions?

Thanks

Meher

This leads nicely into a blog post/repost about how to tell if your IO subsystem is causing corruptions. You've got recurring corruption - you blame the hardware and the hardware guys blame the software. There's no smoking gun and the hardware diagnostics come back clean. What can you do?

SQLIOSim

This is the hardware diagnostic you really want to run. It simulates a very heavy SQL Server workload and should do a far better job of discovering flaws in your hardware setup than individual hardware vendors' diagnostics will. I always recommend that you run it before installing a system, as well as using it to expose hardware as the problem in difficult-to-diagnose corruption problems.

You can find info on it at http://support.microsoft.com/default.aspx?scid=kb;en-us;231619 - this has been heavily updated since I originally publicized it last year. There are also some great resources on how to interpret the results - Kevin Kline wrote a blog post pulling them all together - check it out here.

One thing I like to say at conferences when I'm discussing SQLIOSim is that you're not just testing the hardware. You're also testing all the software in between the disk and SQL Server (the OS, 3rd party drivers, RAID controller firmware, disk drive firmware, and so on).

Page Checksums

This is a cool new feature of 2005. Once page checksums are enabled (at the database level), whenever a database page is written out of SQL Server's buffer pool, a checksum is calculated over the page's contents and stamped on the page. This is the very last operation performed on the page before it leaves SQL Server's control. When a page is read into SQL Server's buffer pool, if it has a page checksum on it then the checksum is recalculated and verified. If the re-calculated checksum doesn't match the one stamped on the page, something in the IO subsystem MUST have changed the page (i.e. in between SQL Server writing and subsequently reading the page, something underneath SQL Server in the IO stack corrupted the page).

Here are some points to note about page checksums (they debunk a bunch of common misconceptions):

  • Databases that are created on SQL Server 2005 automatically have page checksums turned on
  • Page checksums are a super-set of torn-page detection. Page checksums will also detect torn pages.
  • You cannot enable page checksums and torn-page detection at the same time.
  • Upgrading a database to SQL Server 2005 and turning on the page checksum option does not automatically protect all the pages, as a page has to be changed and written to disk after the database option is enabled to have a checksum written on it. Only when a page has been through this process is it protected. There is no tool or automatic way to force all pages to go through this process - as I mentioned in the last post.
  • In benchmarking tests with a TPC-H workload during SQL Server 2005 development, we measured approx 2% performance degradation as a result of having checksums enabled.
  • The checksum cannot be used for error correction. Generating an error-correcting checksum would be a more complicated algorithm and so would be slower to compute.
  • The checksum is validated when a page is read for checking by any of the DBCC CHECK* commands (regardless of whether the PHYSICAL_ONLY option was used) so all existing page checksums can be checked by issuing a DBCC CHECKDB command.
  • Any existing checksums are checked when pages are read as part of taking a backup. In addition, the restore logic will also verify the page checksums as the pages are restored from the backup media, so there is solid assurance that the data from the backup is consistent.
  • Page checksums do not prevent in-memory corruptions from memory scribblers (where the page is read in, corrupted in memory from some rogue process, and then written out with a new checksum). In Enterprise Edition, there is a ‘checksum sniffer’ that runs constantly as part of the lazywriter process, randomly picking unchanged pages in the buffer pool and validating their checksums to see if the page has been scribbled on. There are now documented cases of this process finding memory corruptions on SQL Server 2005 installations.
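
Since upgraded databases don't get page checksums automatically, a quick way to see which databases on an instance are not yet using them is to check sys.databases. The following is a small sketch; MyUpgradedDB is a hypothetical database name.

```sql
-- List databases that are not using page checksums.
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> N'CHECKSUM'
ORDER BY name;
GO

-- Enable page checksums on one of them (hypothetical database name).
-- Existing pages only get a checksum once they are changed and rewritten.
ALTER DATABASE MyUpgradedDB SET PAGE_VERIFY CHECKSUM;
GO
```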

Bad page checksums will result in IO errors being reported (as I mentioned in the previous post):

  • IO failures will trigger read-retry logic, which will re-read the page several times to see if the error clears itself (if it does, a message is written to the SQL error log)
  • IO failures that persist through read-retry are logged in the error log and Windows event log, so monitoring these will allow you to be alerted to hardware problems quickly.
  • If an IO error is reported, the page is not taken offline in any way. Anything that subsequently touches the page will get another IO error. The exception to this is if the IO error is encountered during transaction rollback. In this case, the database is taken offline and must be brought back online manually.

Trace flags

There are two trace flags you can use to add some extra auditing - these are documented in the SQL Server 2005 version of the SQL Server IO whitepaper.

Trace flag 806 will cause 'DBCC-style' page auditing to be performed whenever a database page is read into the buffer pool. This is useful to catch cases where pages are being corrupted in memory and then written out to disk with a new page checksum. When they're read back in the checksum will look correct, but the page is corrupt (because of the previous memory corruption). This page auditing goes some way to catching this - especially on non-Enterprise Edition systems that don't have the 'checksum sniffer' I describe above.

Trace flag 3422 will cause auditing of transaction log records as they're read (during transaction rollback or log recovery). This is useful because there is no equivalent to page checksums for transaction log records and so no way to detect whether log records are being corrupted.

Be careful with these trace flags - I don't recommend using them unless you're experiencing corruptions that you can't diagnose. Turning them on will cause a big CPU hit because of the extra auditing that's happening.
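
For reference, this is a sketch of how the two trace flags could be switched on for the whole instance during a diagnosis window and then switched off again. The -1 argument makes the setting global rather than per-connection; a setting made this way does not survive a restart.

```sql
-- Enable both auditing trace flags globally (expect extra CPU load).
DBCC TRACEON (806, 3422, -1);
GO

-- Confirm which trace flags are currently active globally.
DBCC TRACESTATUS (-1);
GO

-- Turn them off again once the diagnosis window is over.
DBCC TRACEOFF (806, 3422, -1);
GO
```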

Summary

So, to answer the question in the comment, there are a few things you should do on SQL Server 2005 to help detect IO subsystem problems. Page checksums in particular have helped to vastly reduce the number of undiagnosed corruption problems (saving time and hassle for customers and Product Support).

 

This was originally posted as two posts on the SQL Server Storage Engine site. It was very popular so I've combined the two posts together and added a bunch more commentary - especially on page checksums and IO errors. You may also notice some color differences between the scripts from earlier in the year and today's post - more keywords are recognized in Management Studio in SP2 than before (but still not CHECKDB though...)

It's almost inevitable that at some point every DBA will face dealing with corruption - so it's very important that you know how the server will behave when corruption happens. You also need to make sure that whatever logic you've created to catch corruptions (either through error log parsing, alerts, or Agent jobs - topic for a future post) will actually work.

To do all this you need a corrupt database to play with. Earlier this year I created two corrupt databases - one for 2000 (attached in broken2000.zip) and one for 2005 (attached in broken2005.zip). The two attached files can be restored by unzipping them and then using the following syntax (substituting the correct backup name):

RESTORE DATABASE broken FROM DISK='c:\sqlskills\brokenXXX.bck'
WITH MOVE 'broken' TO 'c:\sqlskills\broken.mdf',
MOVE 'broken_log' TO 'c:\sqlskills\broken_log.ldf';
GO

The databases have the same schema - a table called 'brokentable' (c1 int, c2 varchar(7000)) with one row in it. The table has a single data-page which I've corrupted differently in 2000 and 2005:

  • 2000: The corrupt page has page ID (1:75) and the page header is corrupt so that selecting from the table will result in a 605 error which will kill the connection.
  • 2005: The corrupt page has page ID (1:143) and the page header is corrupt such that the page checksum is bad.

A cautionary note on page checksums - if you upgrade a database from 2000 to 2005 and turn on page checksums, nothing happens! It's not until a database page is read into the buffer pool, changed in some way and then written back out to disk that it will have a page checksum stamped on it. This means that once you turn them on, you need to trigger a page checksum being written to each page in some way - e.g. rebuilding all indexes or forcing an in-place update of all table rows. Neither of these is very palatable and there's no tool to force page checksums in SQL Server 2005. Unfortunately there are no plans to include such a tool in SQL Server 2008 either.

Below I've listed a few things you can try out to see what would happen on your database if a checksum failed. These are all using the 2005 corrupt database.

Query errors

Any query that touches that page is going to fail with an 824 error. The IO errors in 2005 are different from 2000 - they've been split into 3:

  • 823 - a hard IO error. This is where SQL Server has asked the OS to read the page but it just can't.
  • 824 - a soft IO error. This is where the OS could read the page but SQL Server decided that the page was corrupt - for example with a page checksum failure
  • 825 - a read-retry error. This is where either an 823 or 824 occurred, SQL Server retried the IO automatically and it succeeded. This error is written to the error log only - you need to be aware of these as they're a sign of your IO subsystem going awry. There's no way to turn off read-retry and force SQL Server to 'fail-fast' - whether this behavior is a good or bad thing can be argued both ways - personally I don't like it.

SELECT * FROM broken..brokentable;
GO

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:\sqlskills\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
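
Because 825 errors never surface to a client, one way to hear about them (and about 823 and 824) is a SQL Server Agent alert per error number. The sketch below assumes SQL Server Agent is running and that the messages reach the Windows application event log; the alert names are arbitrary, and attaching an operator with sp_add_notification is left out.

```sql
USE msdb;
GO

-- One alert per IO error number. @severity must be 0 when @message_id is used.
EXEC dbo.sp_add_alert
    @name = N'Error 823 - hard IO error',
    @message_id = 823,
    @severity = 0,
    @include_event_description_in = 1;

EXEC dbo.sp_add_alert
    @name = N'Error 824 - soft IO error',
    @message_id = 824,
    @severity = 0,
    @include_event_description_in = 1;

EXEC dbo.sp_add_alert
    @name = N'Error 825 - read-retry required',
    @message_id = 825,
    @severity = 0,
    @include_event_description_in = 1;
GO
```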

DBCC CHECKDB

CHECKDB throws us some nice errors. Note that it doesn't actually mention a page checksum failure. CHECKDB is the only thing in SQL Server that can 'eat' IO errors and convert them into non-fatal corruption errors. Note in the CHECKDB output below that the repair level needed to repair this error is 'repair_allow_data_loss' - this is because the repair for a page with any kind of IO error on it is to delete the page, fix-up all relevant linkages, and rebuild any referencing indexes.

DBCC CHECKDB ('broken') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12716041 and -4.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'brokentable' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'broken'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (broken).
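
If you want to see what that repair level actually does to this test database, a sketch of running it looks like the following. Repair needs the database in single-user mode, and because the repair for this kind of error deletes the page, expect the single row in brokentable to be lost. Only try this on a throwaway copy, and take a backup first.

```sql
-- Repair requires single-user mode.
ALTER DATABASE broken SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO

-- Run the repair at the minimum level CHECKDB reported.
DBCC CHECKDB (N'broken', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

-- Return the database to normal use.
ALTER DATABASE broken SET MULTI_USER;
GO
```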

Backup with CHECKSUM

If you have page checksums turned on, you should always use the WITH CHECKSUM option when taking backups. This will cause the page checksums to be checked as they're read into the backup. If a bad page checksum is found, the backup will stop and print a message identifying the bad page. Using the WITH CHECKSUM option on a backup will also generate a checksum over the entire backup stream and store it in the backup. This means we can detect a damaged backup by recalculating the checksum and comparing it against that stored in the backup - in much the same way that page checksums work.

BACKUP DATABASE broken TO DISK='c:\sqlskills\broken2.bck'
WITH CHECKSUM;
GO

Msg 3043, Level 16, State 1, Line 1
BACKUP 'broken' detected an error on page (1:143) in file 'c:\sqlskills\broken.mdf'.
Msg 3013, Level 16, State 1, Line 1
BACKUP DATABASE is terminating abnormally.

The backup has failed because of a bad page checksum. However, we can force it to back up. If this is the only copy of the database we have, and we're being forced to run repair to fix a corruption, for instance, then we want to make sure we have a backup to restore from in case something goes wrong with the repair. Even a backup that contains a corrupt database is better than no backup at all. In this case, we can use the CONTINUE_AFTER_ERROR option which will force the backup to continue when it finds a bad page.

BACKUP DATABASE broken TO DISK='c:\sqlskills\broken2.bck'
WITH CHECKSUM, CONTINUE_AFTER_ERROR;
GO

Processed 160 pages for database 'broken', file 'broken' on file 1.
Processed 1 pages for database 'broken', file 'broken_log' on file 1.
BACKUP WITH CONTINUE_AFTER_ERROR successfully generated a backup of the damaged database. Refer to the SQL Server error log for information about the errors that were encountered.
BACKUP DATABASE successfully processed 161 pages in 2.025 seconds (0.651 MB/sec).

RESTORE VERIFYONLY

You can check the high-level validity of any backup set using the RESTORE VERIFYONLY command.

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2005.bck';
GO

The backup set on file 1 is valid.

What about on the backup that we forced using CONTINUE_AFTER_ERROR?

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2.bck';
GO

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

Isn't that cool? It tells us that the backup was already corrupt when it was written. Ok - let's ask it to specifically check the checksums in the backup. This will look through all the pages in the backup that have page checksums, check them, and recalculate the backup stream checksum.

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2.bck'
WITH CHECKSUM;
GO

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

In our case, we get the same as above because the database was known to be corrupt when the backup was taken, so none of the checksum checking is done. What about if we try to check the checksums on the initial backup?

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2005.bck'
WITH CHECKSUM;
GO

Msg 3187, Level 16, State 1, Line 1
RESTORE WITH CHECKSUM cannot be specified because the backup set does not contain checksum information.
Msg 3013, Level 16, State 1, Line 1
VERIFY DATABASE is terminating abnormally.

We can't do that as the backup wasn't taken using the WITH CHECKSUM option in the first place, even though some of the database pages may have page checksums on them.
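
Rather than finding this out from an error, you can inspect the backup header up front. A sketch using the forced backup from earlier:

```sql
-- Inspect the backup header. In the result set, look at:
--   HasBackupChecksums = 1 if the backup was taken WITH CHECKSUM
--   IsDamaged          = 1 if it was written by a BACKUP WITH CONTINUE_AFTER_ERROR
--                          that hit damaged data
RESTORE HEADERONLY FROM DISK='c:\sqlskills\broken2.bck';
GO
```

Running the same command against broken2005.bck should show HasBackupChecksums as 0, which is consistent with the 3187 error above.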

RESTORE

How about we try to overwrite the existing 'broken' database with the one from the second backup we took?

RESTORE DATABASE broken FROM DISK='c:\sqlskills\broken2.bck'
WITH REPLACE;
GO

Msg 3183, Level 16, State 1, Line 1
RESTORE detected an error on page (1:143) in database "broken" as read from the backup set.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

It won't let us because the backup contains corrupt data (and it knows that because we forced the backup to complete using the CONTINUE_AFTER_ERROR option). However, there may be cases where you've lost your database and all you have is a corrupt backup. In this case it may be better to restore what data you do have rather than lose everything. You can do it using the CONTINUE_AFTER_ERROR option on the RESTORE command this time:

RESTORE DATABASE broken FROM DISK='c:\sqlskills\broken2.bck'
WITH REPLACE, CONTINUE_AFTER_ERROR;
GO

Processed 160 pages for database 'broken', file 'broken' on file 1.
Processed 1 pages for database 'broken', file 'broken_log' on file 1.
The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.
RESTORE WITH CONTINUE_AFTER_ERROR was successful but some damage was encountered. Inconsistencies in the database are possible.
RESTORE DATABASE successfully processed 161 pages in 0.392 seconds (3.364 MB/sec).

Isn't that cool? It works, BUT it tells us that the backup set contained corrupt data and that the restored database could contain corrupt data too.

Summary

Have a play about with these databases to familiarize yourself with the kind of responses you'll get from the various tools when a corruption exists, and how to work around it if need be.

Let me know if you want to see any particular kinds of corruptions explored, or want a database with something specific corrupted in.

broken2000.zip (41 KB) | broken2005.zip (149.9 KB)
