Back in January I offered a promotion as a way of introducing our maintenance/operations auditing services. Now I've completed a bunch of them, with some excellent results for customers.

One international customer, Plex Systems, was so pleased with the outcome of my audit of their manufacturing ERP software hosting databases that they issued a press release today to their industry partners and clients. Another customer, Hose and Fittings, Etc, was amazed at the details and justifications in the report I presented them - see their testimonial on the Past Customers page.

Take a look at the new auditing page that describes how the various audits work, and let me know if you want to discuss doing one - we can do it wherever you are in the world.

Thanks!

The November 2009 edition of TechNet Magazine is up on the web and contains my latest feature article, the second in a 3-part series on backups/restores/repairs.

In this article I explain all about using the RESTORE command. Not much point having backups if you don't know how to use them! Topics include:

  • The four phases of a restore operation - how they work and how you can speed them up
  • Figuring out what you need to restore when a disaster happens
  • Figuring out what you're actually able to restore
  • Using WITH RECOVERY / NORECOVERY / STANDBY
  • Point-in-time restore operations
  • Considerations when restoring to a different location

There seem to be a few formatting issues when the magazine's web folks put the article up - I've notified them and hopefully they'll fix them soon.

You can get to the article at: http://technet.microsoft.com/en-us/magazine/ee677581.aspx

Enjoy!

One of the perennial problems facing both experienced and involuntary DBAs is how to tell whether a database is really in the FULL recovery model or not? This is complicated by the fact that when you switch a database into the FULL recovery mode, it actually behaves as if it's in the SIMPLE recovery mode until the log backup chain is established (this is commonly called being in 'pseudo-SIMPLE').

It's a problem for several reasons:

1) if the database is really in the FULL recovery model then log backups must be taken so the log can clear/truncate properly and it doesn't grow out of control

2) if the database is in the FULL recovery model but the log backup chain has been broken (or not established at all since the database was created) then log backups are not possible (except for the yuckiness in SQL 2000 when log backups would succeed without complaint but be totally useless during disaster recovery)

I don't know of any script to easily determine whether a database is really in the FULL recovery mode, so I knocked one together - and I present it here for you to use.

The trick to the script is finding the last LSN that's been backed up for the database. if this is non-NULL, then a log backup chain exists and the database is really in the FULL recovery mode. This is stored in the dbi_dbbackupLSN field in the database boot page (see Search Engine Q&A #20: Boot pages, and boot page corruption) but also nicely available in the DMV sys.database_recovery_status.

I've tested this on 2005 SP3 and 2008 SP1.

Note that this doesn't work on SQL 2000 - I've poked around and can't find a way to get at the LSN without reading the boot page directly, which can't be done gracefully inside a function - I'll leave that as an exercise for you. You'd expect the IsTruncLog property returned by DATABASEPROPERTY to be correct when the database is in pseudo-SIMPLE, but it's not unfortunately.

Here are some test cases for the script:

CREATE DATABASE SimpleModeDB;
CREATE DATABASE BulkLoggedModeDB;
CREATE DATABASE FullModeDB;
GO

ALTER DATABASE SimpleModeDB SET RECOVERY SIMPLE;
ALTER DATABASE BulkLoggedModeDB SET RECOVERY BULK_LOGGED;
ALTER DATABASE FullModeDB SET RECOVERY FULL;
GO

SELECT [Name], msdb.dbo.SQLSkillsIsReallyInFullRecovery ([Name]) AS 'ReallyInFULL'
FROM sys.databases
WHERE [Name] LIKE '%ModeDB';
GO

Name              ReallyInFULL
----------------- -------------
SimpleModeDB      0
BulkLoggedModeDB  0
FullModeDB        0

This makes sense - the new FullModeDB database is still in pseudo-SIMPLE. Now what if we take a full database backup?

BACKUP DATABASE FullModeDB To DISK='C:\SQLskills\FullModeDB.bck' WITH INIT;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

Processed 152 pages for database 'FullModeDB', file 'FullModeDB' on file 1.
Processed 1 pages for database 'FullModeDB', file 'FullModeDB_log' on file 1.
BACKUP DATABASE successfully processed 153 pages in 0.230 seconds (5.449 MB/sec).

ReallyInFULL
------------
1

Perfect. Now what about switching it back to SIMPLE and back to FULL again?

ALTER DATABASE FullModeDB SET RECOVERY SIMPLE;
ALTER DATABASE FullModeDB SET RECOVERY FULL;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

ReallyInFULL
------------
0

Just as we expect - the log backup chain has been broken and the database is back to pseudo-SIMPLE again.

Now what if we restart the log backup chain using a full database backup?

BACKUP DATABASE FullModeDB To DISK='C:\SQLskills\FullModeDB.bck' WITH INIT;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

Processed 152 pages for database 'FullModeDB', file 'FullModeDB' on file 1.
Processed 1 pages for database 'FullModeDB', file 'FullModeDB_log' on file 1.
BACKUP DATABASE successfully processed 153 pages in 0.095 seconds (13.193 MB/sec).

ReallyInFULL
------------
1

Perfect. Now what about if we break the chain and try to restart it using a differential database backup?

ALTER DATABASE FullModeDB SET RECOVERY SIMPLE;
ALTER DATABASE FullModeDB SET RECOVERY FULL;
GO
BACKUP DATABASE FullModeDB To DISK='C:\SQLskills\FullModeDB_diff.bck' WITH INIT, DIFFERENTIAL;
GO
SELECT msdb.dbo.SQLSkillsIsReallyInFullRecovery ('FullModeDB') AS 'ReallyInFULL';
GO

Processed 40 pages for database 'FullModeDB', file 'FullModeDB' on file 1.
Processed 1 pages for database 'FullModeDB', file 'FullModeDB_log' on file 1.
BACKUP DATABASE WITH DIFFERENTIAL successfully processed 41 pages in 0.080 seconds (4.192 MB/sec).

ReallyInFULL
------------
1

Perfect - that works too, as I'd expect. You may wonder why a differential backup works - either a full or differential backup will work as they bridge the LSN gap since the last full or differential backup before the log backup chain was broken - both of these backups include transaction log - see More on how much transaction log a full backup includes.

And here's the script itself - enjoy!

/*============================================================================
   File: SQLskillsIsReallyInFullRecovery.sql

   Summary: This script creates a function in msdb that returns a BIT value of
   1 if the supplied database is really in the FULL recovery mode and not still
   in pseudo-SIMPLE.

   Date: October 2009

   SQL Server Versions:
      10.0.2531.00 (SS2008 SP1)
      9.00.4035.00 (SS2005 SP3)
------------------------------------------------------------------------------
   Copyright (C) 2009 Paul S. Randal, SQLskills.com
   All rights reserved.

   For more scripts and sample code, check out
     
http://www.sqlskills.com/

   You may alter this code for your own *non-commercial* purposes. You may
   republish altered code as long as you give due credit.

   THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF
   ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED
   TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
   PARTICULAR PURPOSE.
============================================================================*/

USE [msdb];
GO

IF EXISTS (SELECT * FROM sys.objects WHERE NAME = 'SQLskillsIsReallyInFullRecovery')
   DROP FUNCTION SQLskillsIsReallyInFullRecovery;
GO
 
CREATE FUNCTION SQLskillsIsReallyInFullRecovery (
   @DBName sysname)
RETURNS BIT
AS
BEGIN
  
DECLARE @IsReallyFull  BIT;
   DECLARE @LastLogBackupLSN NUMERIC (25,0);
   DECLARE @RecoveryModel  TINYINT;

   SELECT @LastLogBackupLSN = [last_log_backup_lsn]
   FROM sys.database_recovery_status
   WHERE [database_id] = DB_ID (@DBName);

   SELECT @RecoveryModel = [recovery_model]
   FROM sys.databases
   WHERE [database_id] = DB_ID (@DBName);

   IF (@RecoveryModel = 1 AND @LastLogBackupLSN IS NOT NULL)
      SELECT @IsReallyFull = 1
   ELSE
      SELECT @IsReallyFull = 0;

   RETURN (@IsReallyFull);
END;
GO

The 35-page whitepaper on high availability I wrote for the SQL team over the summer has been published on MSDN. It's a 2-300 level whitepaper that describes the various high-availability technologies in SQL Server 2008 and how they can be used to mitigate disasters. It's chock-full of links to other whitepapers, technical articles and Books Online sections and also presents my methodology for planning a high-availability strategy.

You can get it at http://msdn.microsoft.com/en-us/library/ee523927.aspx.

Enjoy!

Here's the table of contents:

  • Introduction
  • Causes of Downtime and Data Loss
    • Planned Downtime
    • Unplanned Downtime and Data Loss
  • Planning a High-Availability Strategy
    • Requirements
    • Limitations
    • Technology Evaluation
  • SQL Server 2008 High-Availability Technologies
    • Logging and Recovery
    • Backup, Restore, and Related Technologies
      • Partial Database Availability and Online Piecemeal Restore
      • Instant File Initialization
      • Mirrored Backups
      • Backup Checksums
      • Backup Compression
    • Other Single-Instance Technologies
      • Online Operations
      • Database Snapshots
      • Hot-Add Memory and CPU
      • Resource Governor
    • Multi-Instance Technologies
      • Log Shipping
      • Transactional Replication
      • Database Mirroring
      • Failover Clustering
      • Combining Multi-Instance Technologies
      • Virtualization
  • Mitigating the Causes of Downtime and Data Loss
  • High-Availability Features Supported by SQL Server 2008 Editions
  • Conclusion

This survey is based on a suggestion from Jonathan Kehayias (blog/Twitter).

In this survey I'm asking two questions:

  1. When you're designing/implementing a database, do you manually create statistics or let SQL Server create them automatically?
  2. How do you maintain statistics? Manually or letting SQL Server do it for you? Or both? 

I'll probably collaborate with Kimberly on the editorial for this, as she's forgotten more about statistics than I'll ever know!

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers (comments are unfortunately moderated anyway, now that my blog gets a lot of spam). I'm very interested in your reasoning, but not until everyone else responds.

   

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

A few weeks ago I kicked off a survey asking what kind of disaster recovery guide/run-book/plan (I'll just call it a plan from now on) you have (see here for the survey). Here are the results as of 9/15/09:

 

Out of all these answers, IMHO the last answer is the only acceptable one for a production DBA responsible for recovering critical databases, to a defined RTO and RPO (recovery-time objective and recovery-point objective, respectively).

The problem is though, getting a wonderful and comprehensive disaster recovery plan together is waaay easier said than done. For a start, disaster recovery isn't a sexy topic with management UNTIL a disaster actually happens and the RTO and RPO are completely blown - so it's hard to justify the time and effort needed to put together a good plan. It's especially hard to put a plan together if you're an involuntary DBA, with no idea about what disasters could occur and what you'd do to recover from them.

The reasons to have a plan worked out in advance are pretty much common sense: in a disaster situation, where time is often of the essence, adrenaline and stress levels run high and it can be hard to remain cool, calm, and collected. For an unprepared DBA, this can very easily lead to costly mistakes being made. No-one wants to be the one that overwrote the only existing copy of a database with a corrupt backup and caused the business to be offline for several days. Good-bye job.

A pre-defined disaster recovery plan allows the DBA (or responsible person) to follow a set of tested steps to resolve problems that can occur. A really comprehensive plan covers more than just how to restore the database, but instead will contain troubleshooting information to determine what needs to be done (rather than just immediately taking everything offline and running a restore), and then coping with twists and kinks that could crop up while performing the recovery operation. What makes a plan wonderful is that it gets tested regularly to make sure everything in it is a) still appropriate b) still works c) still works within the defined RTO and RPO. Kimberly and always like to say that a disaster recovery plan should be written by the most experienced DBA you have and tested by everyone else, down to the most junior DBA - everyone needs to be able to work with it.

Bottom line (not a very long editorial this time) - you can't expect to be able to recover within any defined limits if you're just going to wing-it when a disaster strikes. Doesn't matter how experienced you are, crazy stuff happens that can trip you up.

Next up - the next survey!

In this survey I'd like to find out what kind of disaster recovery guide/run-book/plan you have, if any. When a disaster occurs, do you just wing-it, or is there something written-down that was worked out in advance to help you along the way. I'll report on the results in a couple of weeks.

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers (comments are unfortunately moderated anyway, now that my blog gets a lot of spam). I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

A long time ago, in a blog post far, far away (well before I went offline in July) I kicked off a weekly survey about how often you run consistency checks (see here for the survey). Now I'm back online again, and so here are the results as of 8/3/09.

 

The results are actually surprising - I didn't expect so many people to be running consistency checks so frequently - 25% daily and another 37% weekly - very cool!

The number of people who don't know what consistency checks are may look a little depressing but I think it's probably a symptom of the number of people coming into the SQL world as involuntary DBAs. For those people, consistency checks are a way of proactively checking for database corruption, which is nearly always caused by the I/O subsystem. You can read a good introduction to consistency checking and other database maintenance topics in the article I wrote for TechNet Magazine in August 2008 - Top Tips for Effective Database Maintenance and bit more in the previous survey Importance of how you run consistency checks.

Basically you need to run regular consistency checks. There's a myth that you don't need to run consistency checks - this was perpetuated by various marketing folks when SQL 7.0 shipped, because SQL 6.5 used to cause allocation corruptions and the rewrite for 7.0 removed all the corruption problems. Now, of course there have been bugs in SQL Server that cause problems, but they account for a tiny fraction of the corruptions out there. Nearly all corruptions are caused by something going wrong in the I/O subsystem - and you can't predict when that will or won't happen. Jim Gray once likened the disk heads in a hard drive as akin to a 747 flying at 500mph 1/4 inch above the ground - scary stuff. You DO need to run consistency checks, because corruptions do happen. You can read more about the causes of corruptions in this blog post: Search Engine Q&A #26: Myths around causing corruption.

So - how often should you run consistency checks? Well, it depends. When I'm teaching I like to give two examples:

  • You have a dodgy I/O subsystem that is causing corruptions. You have no backups. You have a zero data-loss requirement. With no backups, your only way to get rid of corruptions is to run repair, which usually leads to data loss (see Misconceptions around database repair). In that case, you want to know about corruption as soon as possible to limit the amount of data loss from running repair. Contrived example for sure, but you'd be surprised what I've seen...
  • You have a rock-solid I/O subsystem, with all drivers and firmware up-to-date. You have a comprehensive backup strategy which you've extensively tested and you've confident you can recover from corruptions with zero-to-minimal downtime and data loss. In this case you may be comfortable running consistency checks once a week, say.

The overwhelming factor in how often to run consistency checks is you. How comfortable are you with the integrity of your I/O subsystem and your ability to recover from a corruption problem. If you have corruptions in your database today, you'll probably run DBCC CHECKDB on it every day for a month, right?!?

Part of any database maintenance of high-availability strategy is proactively making sure that corruption doesn't exist in the database - otherwise when you DO discover it, it may be more pervasive and it will take you longer to recover, and potentially with more downtime. Therefore the answer really is that you should run it as often as you can. Sometimes that can be difficult for very large databases (your definition is likely to be different than mine - I think 500GB and larger - depends on your hardware etc) as CHECKDB can take a long time to run (see CHECKDB From Every Angle: How long will CHECKDB take to run?), but there are ways you can effectively consistency check even a VLDB - see CHECKDB From Every Angle: Consistency Checking Options for a VLDB.

So - I'm very pleased to see so many people running regular consistency checks. However, maybe I'm just a pessimist but I did expect some of the less than optimal options to have higher numbers. Let's look at each in turn.

Never. The only time this is acceptable is if you don't know what consistency checks are and why you need them. Enough said. Even if you think you *can't* run consistency checks because it takes too many resources/time/etc, take a look at the blog post I mention above. There's always a way.

Only when corruption is detected some other way. By the time corruption has occurred and is detected through regular operations, it's likely to be more pervasive. Many databases do not see all their data read/updated as part of regular operations, which means that if a part of the database that's not used for a while gets corrupted, and you're not running consistency checks, you're not going to know. That means you're not going to know that your I/O subsystem is causing problems - and so more corruption will occur. You might think this isn't a big deal, but it can be depending on what part of the database (or maybe system databases) gets corrupted. Some things can be restored or repaired relatively simply - but what if, say, the boot page of master is corrupted? Your entire instance is down until you sort that out. It's always better to proactively discover corruption before it hits you in a way that really messes you up.

Only during an event like an upgrade or migration. In this category I see people running consistency checks AFTER doing an upgrade but not before. From release to release, DBCC CHECKDB has improved - especially from 2000 to 2005, where system catalog consistency checks were added. If you run consistency checks after an upgrade and find the database is corrupt, what do you do? Run repair? Go back to the earlier version and run repair? Restore your backups? The odds are that if you've got corruption, it's also in your backups too. Also, you may think that the upgrade itself caused the corruption - I've never seen this. It's always been that the corruption was there before and only got discovered after the upgrade. Always run consistency checks BEFORE doing an upgrade - to make sure you've got a chance of putting things right before (potentially irrevocably) moving to the newer version.

Only after performing a restore, or after a failover. This is, of course, a good practice, but shouldn't be the only time a consistency check is performed - for most of the same reasons as I've explained above. What do you do if the database is found to be corrupt after restoring backups? You're looking at data loss now.

To summarize: make sure you're running regular consistency checks - with the regularity in-line with your comfort zone and your maintenance/high-availability strategies.

Next up - the next survey!

In this week's survey I'd like to know how often you run consistency checks on your *most critical* production database, regardless of *how* you run them (we did that survery already - see Importance of how you run consistency checks). I'll report on the results around July 4th.

I'd only like you to answer for your *most critical* production database, as the frequency will probably vary wildly by database, server, production vs. dev/QA and so on. If everyone answers for their most critical database then we won't get skewed results.

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS Thanks to Pat Wright for suggesting this week's topic on Twitter.

In last week's survey I asked how you manage the size of your database *data* files - see here for the survey. Here are the results as of 6/24/09.

The 'other' values are as follows:

  1. 5 x 'manual file growths and a custom mom alter to when the datafiles are 98% full. autogrow set to fixed amount in case we miss the mom e-mail'
  2. 1 x 'auto-grow with a procedure to keep the log file size at 20% relative to the total data file size'
  3. 1 x 'Create with very large initial file size, set auto-growth to %'
  4. 1 x 'Lots of white space in files. Auto-grow to a fixed size (in case of emergency).'
  5. 1 x 'set initial size for 1 yr usage, monitor size, manual grow, autogrow percentage - send alert if it does grow'
  6. 1 x 'set initial size for 2 year growth, capture growth stats daily, monitor physical disk space daily'
  7. 1 x 'Set to autogrow by fixed size to cater for emergencies, otherwise maintain 80-90% free space by daily reporting then manual off-peak size increase if necessary.'

As I mentioned in the survey itself, this is just about database *data* files. I covered log file size management in a previous survey - see Importance of proper transaction log size management.

There are really four parts to data file size management, which I'll discuss below.

The first thing I'll say is that if you're able to, enable instant file initialization on 2005+ - as it will vastly reduce the time required to create data files, grow data files, and restore backups (if the restore has to create data files). We're talking minutes/hours down to single-digit seconds. See Misconceptions around instant file initialization. If you're not one of the miniscule fraction of a percent of customers who have volumes shared between secure file servers and SQL Server instances, turn this on ASAP. Most DBAs don't know about this feature, but everyone I teach it to or show it to are amazed and then go turn it on. No brainer.

Initial data file sizing. This can be tricky. Without getting into the how-to-calculate-the-database-size quagmire, I'll simply say that you should provision as much space as you can, based upon your sizing estimates. Don't just provision for the here-and-now - if you're able to, provision space for the next year's worth of anticipated growth - to prevent auto-growth having to kick-in at all. I rewrote all the Books Online pages for 2005 (links are to the 2008 BOL) on Estimating the Size of Heaps, Clustered Indexes, and Nonclustered Indexes and in the blog post Tool for estimating the size of a database I link to a tool someone put together that codified all my formulas. You can also get sizing tools from hardware vendors too.

Data file growth. If you're able to, auto-grow should ALWAYS be turned on, as an emergency measure in case your monitoring fails - you don't want the database to have to grow but it's unable to and then it stops and the application is offline. However, you shouldn't *rely* on auto-grow - it's just for emergencies. The auto-growth default for data files used to be 10% for 2000 and before, but then changed to 1MB from 2005 onwards (log file default auto-growth remained at 10%). Neither of these are good choices. A percentage-based auto-growth means that as your files get bigger, so does the auto-growth, and potentially the time it takes if you don't have instant file initialization enabled. A 1MB autogrowth is just nonsensical. Your auto-growth should be set to a fixed size - but what that should be is a great big 'it depends'. You need to decide whether you want the auto-growth to be a quick stop-gap, or whether the auto-growth will replace manual growth after monitoring. You also need to consider how long the auto-growth will take, especially without instant file initialization. I can't give any guidance here as to what a good number is, but I'd probably settle on something around 10% (fixed), with the percentage steadily falling as the database size grows. It's very important that you have alerts setup to you can tell when auto-growth does occur, so you can then take any necessary action to grow it even more or tweak your settings.

'Other' response #2 is interesting. There's been a 'best-practice' around for a while that the log file should be sized to be a percentage of the data file size. It's totally unfounded and in most cases bears no relation to reality. The vast majority of the time, the size of the log is *NOT* dependent on the data file sizes in any way. Imagine a 10TB database - would you provision a 2TB log? Of course not. Now, I can see special cases where the operations performed on the tables in the database might affect a fixed portion of the largest table in a single batch, and that could generate enough log (with reserved space too) to equal 20% of the data file size - but that's a pretty nonsensical special case, to be honest. You shouldn't use 'set the log as a percentage of the data file' as a guideline.

Data file size/usage monitoring. There's a growing movement towards monitoring the data file usage and manually growing the files as they approach 100% full - avoiding auto-growth altogether, but still having it enabled for emergencies. In my book, this is the best way to go as you have all the control over what happens and more importantly, when it happens - especially without instant file initialization. There are some quirks here though. SCOM, for instance, has logic that disables file size and usage monitoring if you enable auto-grow. It assumes that if you enable auto-grow then you're not interested in monitoring. I happened to have one of the SCOM devs in my last maintenance class I taught on the Redmond MS campus and he's going to try to get that logic fixed.

Data file shrinking. Just this morning I wrote a long blog post about this - see Why you should not shrink your data files. Running data file shrink causes index fragmentation, uses lots of resources, and the vast majority of the time when people use it, is unnecessary and wasteful. It should NEVER be part of a regular maintenance plan - as you get into the shrink-grow-shrink-grow cycle which becomes a zero-sum game with a ton of transaction log being generated. Think about this - if you run a shrink, which is fully logged, then all the log has to be backed up, log-shipped, database mirrored, scanned by the replication log reader agent, and so on. And then the database will probably auto-grow again through normal operation, or some part of the maintenance job that rebuilds indexes. And then you shrink it again. And the cycle continues...

Bottom line - make sure you size the data files with some free space, have auto-growth set appropriately, have instant file initialization enabled if you can, monitor file sizes and usage, alert on auto-grows, and don't use shrink. And don't use shrink. Really.

Next up - this week's survey!

(Been a while longer than usual since blog posts - I've been really busy flying around the country doing stuff onsite with clients. Normal service will be resumed at the start of July :-)

One of my biggest hot-buttons is around shrinking data files. Although I used to own the shrink code while I was at Microsoft, I never had a chance to rewrite it so that data file shrink is a more palatable operation. I really don't like shrink.

Now, don't confuse shrinking the transaction log with shrinking data files. Shrinking the log is necessary if your log has grown out of control, or as part of a process to remove excessive VLF fragmentation (see Kimberly's excellent posts on this here and here). However, shrinking the log should be a rare operation and should not be part of any regular maintenance you perform.

Shrinking of data files should be performed even more rarely, if at all. Here's why - data file shrink causes *massive* index fragmentation. Let me demonstrate with a simple script you can run. The script below will create a data file, create a 10MB 'filler' table at the start of the data file, create a 10MB 'production' clustered index, drop the 'filler' table and then run a shrink to reclaim the space.

USE MASTER;
GO

IF DATABASEPROPERTYEX ('DBMaint2008', 'Version') > 0
 DROP DATABASE DBMaint2008;

CREATE DATABASE DBMaint2008;
GO
USE DBMaint2008;
GO

SET NOCOUNT ON;
GO

-- Create the 10MB filler table at the 'front' of the data file
CREATE TABLE FillerTable (c1 INT IDENTITY,  c2 CHAR (8000) DEFAULT 'filler');
GO

-- Fill up the filler table
INSERT INTO FillerTable DEFAULT VALUES;
GO 1280

-- Create the production table, which will be 'after' the filler table in the data file
CREATE TABLE ProdTable (c1 INT IDENTITY,  c2 CHAR (8000) DEFAULT 'production');
CREATE CLUSTERED INDEX prod_cl ON ProdTable (c1);
GO

INSERT INTO ProdTable DEFAULT VALUES;
GO 1280

-- check the fragmentation of the production table
SELECT [avg_fragmentation_in_percent] FROM sys.dm_db_index_physical_stats (
    DB_ID ('DBMaint2008'), OBJECT_ID ('ProdTable'), 1, NULL, 'LIMITED');
GO

-- drop the filler table, creating 10MB of free space at the 'front' of the data file
DROP TABLE FillerTable;
GO

-- shrink the database
DBCC SHRINKDATABASE (DBMaint2008);
GO

-- check the index fragmentation again
SELECT [avg_fragmentation_in_percent] FROM sys.dm_db_index_physical_stats (
    DB_ID ('DBMaint2008'), OBJECT_ID ('ProdTable'), 1, NULL, 'LIMITED');
GO

avg_fragmentation_in_percent
----------------------------
0.390625

DbId   FileId      CurrentSize MinimumSize UsedPages   EstimatedPages
------ ----------- ----------- ----------- ----------- --------------
6      1           1456        152         1448        1440
6      2           63          63          56          56

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

avg_fragmentation_in_percent
----------------------------
99.296875

Look at the output from the script! The logical fragmentation of the clustered index before the shrink is a near-perfect 0.4%. After the shrink, it's almost 100%. The shrink operation *completely* fragmented the index, removing any chance of efficient range scans on it by ensuring the all range-scan readahead I/Os will be single-page I/Os.

Why does this happen? A data file shrink operation works on a single file at a time, and uses the GAM bitmaps (see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps) to find the highest page allocated in the file. It then moves it as far towards the front of the file as it can, and so on, and so on. In the case above, it completely reversed the order of the clustered index, taking it from perfectly defragmented to perfectly fragmented.

The same code is used for DBCC SHRINKFILE, DBCC SHRINKDATABASE, and auto-shrink - they're equally as bad. As well as introducing index fragmentation, data file shrink also generates a lot of I/O, uses a lot of CPU, and generates *loads* of transaction log - as everything it does is fully logged.

Data file shrink should never be  part of regular maintenance, and you should NEVER, NEVER have auto-shrink enabled. I tried to have it removed from the product for SQL 2005 and SQL 2008 when I was in a position to do so - the only reason it's still there is for backwards compatibility. Don't fall into the trap of having a maintenance plan that rebuilds all indexes and then tries to reclaim the space required to rebuild the indexes by running a shrink - that's a zero-sum game where all you do is generate a log of transaction log for no actual gain in performance.

So what if you *do* need to run a shrink? For instance, if you've deleted a large proportion of a very large database and the database isn't likely to grow, or you need to empty a file before removing it?

The method I like to recommend is as follows:

  • Create a new filegroup
  • Move all affected tables and indexes into the new filegroup using the CREATE INDEX ... WITH (DROP_EXISTING) ON <filegroup> syntax, to move the tables and remove fragmentation from them at the same time
  • Drop the old filegroup that you were going to shrink anyway (or shrink it way down if its the primary filegroup)

Basically you need to provision some more space before you can shrink the old files, but it's a much cleaner mechanism.

If you absolutely have no choice and have to run a data file shrink operation, be aware that you're going to cause index fragmentation and you should take steps to remove it afterwards if it's going to cause performance problems. The only way to remove index fragmentation without causing data file growth again is to use DBCC INDEXDEFRAG or ALTER INDEX ... REORGANIZE. These commands only require a single 8KB page of extra space, instead of needing to build a whole new index in the case of an index rebuild operation.

Bottom line - try to avoid running data file shrink at all costs!

In this week's survey I'd like to know how you manage the sizes of your database *data* files (remember we've already done log file size management). I'll report on the results around 6/21/09.

I say this every week in the PS, but I'm moving it up here because I don't like having to delete comments and send email, *please* no comments on this post - wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS Thanks to Steve Jones (of SQL Server Central fame) for suggesting this week's topic on Twitter.

In last week's survey I asked you what you think is the hardest thing when becoming an involuntary DBA - see here for the survey. Here are the results as of 6/15/09.

The 'other' responses were:

  • 3 x 'Learning to tell good advice from bad advice'
  • 1 x 'Learning to know *that* you don't know'
  • 1 x 'Management having unrealistic expectation of my capabilities'

All great answers.

We all know the definition of 'involuntary DBA' - someone who is made to assume a DBA role or take on DBA responsibilities, usually against their will. I used to call them 'accidental DBAs' but too many people thought I meant DBAs who had accidents, so I changed my phrase. Let's go through some of the answers, ranked by number of responses.

Learning to know what you don't know. Actually the first step here is learning to know *that* you don't know - if you think you know everything (really in any field and with any expertise level) then you're most likely wrong, no matter who you are and what the field is. It can be a humbling experience making (and accepting) that realization and then quite daunting when you realize that there is maybe *so much* that you don't know and might have to learn. For an involuntary DBA this can be doubly daunting, because not only do you realize that there's a ton you don't know about being a DBA, but that means there's a ton you don't know about managing this (maybe mission-critical) SQL Server that you're suddenly responsible for and your job security is now suddenly dependent on doing the right thing.

Being the DBA as well as my actual job. Many times an involuntary DBA is actually a bolted-on side-role for someone with a real daytime job too - sometimes not even an especially technical daytime job. I'm teaching a class right now for a large customer of ours and about 1/3 of the 30+ people in the class are involuntary DBAs as well as operations engineers. Some other common examples - SharePoint admins who suddenly find themselves with an Enterprise-class SQL Server instance underpinning their SharePoint farm, developers using SQL Express as part of a client application who have to think about the maintenance implications of having SQL Server installed, the person who sits nearest the SQL Server instance when the real DBA leaves or is fired. Sometimes becoming an involuntary DBA can be very rewarding and take you down an unexpected but ultimately welcome career path, but many times its an unwelcome burden, and very occasionally it can be disastrous. A few weeks ago I heard on the forums from someone who had become an involuntary DBA after the actual DBA has been fired - and his job was on the line if he couldn't get the broken database fixed in 24 hours. Brutal. Being a DBA can be a full-time job in itself, and if you ask many production DBAs, they'll tell you that it can be way more than the standard 40-hour work week (seriously, does that even exist any more in our industry?) - so to try and do that *as well* as a regular 9-5 job might be nigh on impossible.

Finding information on what to do. This can be the real kicker - you've realized that there's lots you don't know - and you want to learn - but where to start? Who to ask? There's a great community of SQL Server folks that are very willing to help out - if you're reading this then you've already stumbled in to the community somehow. There are lots of blogs with information on - some of what I post is deeper information than you *really* need to know to get by, but the editorials in the Weekly Survey series are a good overview of some of the topics to be concerned about, and Kimberly also posts a ton of good info on her blog. In TechNet Magazine, I've been writing many articles aimed specifically at involuntary DBAs - here are some links:

Here's a quote from an interview I did with SQLRockstar recently (hope you don't mind Tom):

SR: What advice would you give to new SQL DBA’s just entering the field?

PR: There’s a huge amount you don’t know - that’s just a fact. SQL Server can appear very easy to setup and get running, but there are lots of ways to shoot yourself in the foot if you’re not careful. I’m not trying to be scary, just engender a healthy amount of trepidation. Get yourself a mentor if possible, get some training, practice, practice, practice, read a bunch of people’s blogs, follow folks on Twitter, and so on. You need to be an information sponge for a couple of years at least, and get as much experience as you can with the varied aspects of SQL Server - perf tuning, disaster recovery, security, design, etc. And the more you tinker, play, and learn - the more you’ll realize there’s lots you still don’t know. Accept that and be willing to take advice, make mistakes, and learn and you’ll go a long way. Don’t be intimidated to ask people questions, most of us are happy to help - we all started with zero SQL Server knowledge, no matter where we are now.

The last sentence is the most critical in that paragraph. No matter how knowledgeable anyone in the SQL community might seem, every single one of us started with *no* SQL Server knowledge at all. I had never even heard of SQL Server until I joined Microsoft in early 1999 - and there's a *huge* amount of SQL Server I know very little about - I can barely even spell BI for instance. But I know what I don't know and I know where to go to search for information, or who to ask.

And for all those of you in the SQL community, when you're answering what might, at first glance, look like a stupid question from someone with zero reputation or forum points, consider the fact that they may well be an involuntary DBA trying to find an answer because they know they need help and don't have anywhere better to go. Be kind. There's no such thing as a stupid question - the only stupid thing is *not* asking a question when you know you don't know the answer. A sure-fire way to put someone off asking any more questions is to slap them down when they ask their first.

Now, apart from asking the questions on forums etc, another problem that involuntary DBAs have is knowing what information/answers to trust. This one's very hard, as unless you get to know people in the community, you've got no clue who these people are and why you should believe them. The onus is also on us in the community to make damn sure that when we answer a question it's the *right* answer. This is especially true in the corruption forums where telling someone the wrong answer can deepen the corruption hole the poor DBA is already in.

Management that don't understand databases. And also don't understand that the new involuntary DBA isn't going to immediately and magically become 'super-DBA' overnight and fix all the problems with the SQL Server instance. It takes quite a while to learn the ins-and-outs of being a good DBA, and having the confidence to be able to argue with management about what is and isn't possible. There's no easy solution to this one I'm afraid - education is the only option.

Dealing with actual DBAs and developers. There's a tendency in human nature to be scornful and dismissive of people in the same field who are not as knowledgeable or talented - this can be especially true in the technical area we inhabit where sometimes the people we work with may be egotistical and possibly even lacking in social skills - the 'typical' geek developer. Apologies to all those out there who *are* nice and *do* have great social skills - but after 9 years as a developer and manager of various technical disciplines at Microsoft, I know of what I speak. As an involuntary DBA you may come up against some of this antipathy. My advice - you're just going to have to deal with it until you become more knowledgeable. It's unfortunate, but it's human nature. One way to gain respect and trust from these people is to learn some stuff and solve some problems - don't ever BS about something you don't know about - guaranteed way to quickly lose whatever respect you've built up. You might try complaining to your management, but if it's the same management that added being an involuntary DBA to your regular job, from what I've heard, they're unlikely to listen to you. I don't mean to sound depressing, I'm just being honest.

Performance tuning, disaster recovery, database maintenance, implementing an HA/DR strategy. It's hard to pick which of these is the most important for a new involuntary DBA to focus on - but I think I'd go for database maintenance. That addresses some perf issues (index fragmentation management, statistics maintenance) and some disaster recovery issues (taking backups, running consistency checks). Checkout the TechNet Magazine article above for (what I think is) a great primer, and for perf tuning, see the editorial from a couple of weeks ago that has lots of links: Important considerations when performance tuning.

So if you're an involuntary DBA, the bottom line is that it's going to be a struggle for you - but there are lots of resources out there and people willing to help out.

Don't be afraid to ask.

PS By all means add comments with more involuntary DBA resources and I'll collect them together.

Categories:
Involuntary DBA | Surveys

This week's survey is all about being a DBA or involuntary DBA - what do you think is the hardest thing when becoming an involuntary DBA? I'll report on the results sometime over the weekend of 6/13/09.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS And again, as always, no comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

Categories:
Involuntary DBA | Surveys

Chad Miller over on SQLServerCentral has put together a really great collection of scripts that allow you to easily monitor and report on what backups are happening on your system. These should be especially useful for involuntary DBAs, rather than having to dig into the backup history tables themselves. I'll be including a link to his post and downloadable scripts in all my DBA classes.

Check out Chad's article and scripts at Backup Monitoring and Reporting.

In last week's survey I asked you two things, as a precursor to a whitepaper I'm writing for Microsoft

The first question was what is your maximum allowable downtime SLA (either for 24x7 operation or not). See here for the survey. Here are the results as of 5/30/09.

 

The Other values were all about not having SLAs defined. And I think that's why this survey had a poor response rate - most of you out there don't have defined SLAs.

If you take a standard 365-day year, which means the year has 524160 minutes. Some example maximum allowable downtimes for 24x7 operations:

  • 99.999% (a.k.a 'five-nines') = slightly over 5 minutes downtime per year
  • 99.99% (a.k.a 'four-nines') = almost 52.5 minutes downtime per year
  • 99.9% (a.k.a 'three-nines') = almost 8.75 hours downtime per year
  • 99% (a.k.a 'two-nines') = just over 3.5 days of downtime per year
  • 98.5% = almost 5.5 days of downtime per year
  • 98% = just over a week of downtime per year
  • 95% = just over 2.5 weeks of downtime per year

You might be thinking - wow - whoever only has a target of 95% uptime must have a pretty crappy setup BUT it totally depends on the application and business requirements. 95% may be absolutely fine for some companies and utterly devastating to others (imagine the revenue loss if Amazon.com was down for more than 2 weeks...)

On the other end of the spectrum, you might be thinking - wow - 99.999% target is completely unattainable for the majority of businesses out there, BUT again it totally depends on the application and business requirements. This may be attainable for a simple application and database that doesn't generate much transaction log, or for a large, very busy OLTP database that generates lots of log but the company has lots of money to throw at the HA problem and can afford redundant clusters in separate data-centers with fat, dark-fiber links between them.

Defining SLAs, and not just for maximum allowable downtime, is incredibly important. When done properly, defining SLAs shows that the technical staff in a company and the business management in a company are in tune. It shows they have extensively analyzed the business requirements of an application and balanced them against the technical, space, power, HVAC, manpower, budgetary, and other limitations which may prevent a higher percentage being set as the target. It provides a meaningful input into the design and architecture of systems, and the choices of technologies (note the plural) required to achieve the target SLAs. It shows the company shareholders that the business managers understand the criticality of applications involved in running the business and that they are taking steps to safeguard the shareholders' equity interests.

Note I said 'when done properly'. Technical staff picking SLAs without business input, or vice-versa is a recipe for disaster. If the technical staff don't understand the business requirements, how can they pick appropriate SLAs? If the business managers don't understand the technical limitations, how can they pick achievable SLAs? Everyone has to be involved. And one of the most important things to consider is whether the SLA covers 24x7 operation, or just the time that the application has to be available, if not 24x7.

How to go about using the SLAs to pick appropriate technologies is one of the things I'll be going into in the whitepaper (due for publication around the end of September).

The second questions was what was your measured uptime (again, either for 24x7 operation or not) over the last year. See here for the survey. Here are the results as of 5/30/09.

 

The Other values for 24x7 were 2 x '96', and 1 x 'it varies across customers'. The Other value for non-24x7 was '92'.

Of course, this would be a little more meaningful if we could correlate target downtime with actual downtime achieved, but the free survey site doesn't reach that level of sophistication (and I don't think people have enough time to get into that much detail) while casually reading a blog post.

Nevertheless, the results are interesting, although not really statistically valid from this small a sample-size. The most interesting data point is that some respondents don't know what they achieved last year, or just didn't measure. I think this is the case for the majority of readers. Having well-defined and appropriate SLAs is the key to defining a workable strategy, and what's the point of defining the SLAs if you don't measure how well you did against them? If you don't meet the target you need to revisit the strategy, the SLAs, or maybe even both.

The fact that many people don't have SLAs speaks to the general poor state of high-availability and disaster-recovery planning in the industry, IMHO. And Kimberly just chimed in stating that it also shows how the PC-based server market doesn't focus anywhere near as much on regimented policies and procedures as the older, more traditional, mainframe market did/does.

There's a lot of work to be done. Many times this stuff just isn't high up in the priority list until a disaster actually happens. Invariably it then becomes very, very important. Get ahead of the curve and be proactive - one of the things we always tell our clients.

Next post - this week's survey!

This week's survey is inspired by this morning's Kimberly+Paul hot-tub conversation around data-dependent routing vs. network load balancing, which then turned to SLAs. Yes, we lead an exciting life Smile.

This survey is a *four*-parter. Part 1 is for each of your databases, which survey option is closest to your target maximum allowable downtime SLA (Service-Level Agreement, a.k.a. RTO - Recovery Time Objective)?

  • Survey 1 is if your SLA is based on 24x7 operation
  • Survey 2 is if your SLA is based on non-24x7 operation (e.g. allowing for a 6 hour daily maintenance window, the rest of the time we must be available 99.9%)

Part 2 is for each of your databases, what can you actually achieve?

  • Survey 3 is your achievement over the last year for 24x7 operation
  • Survey 4 is your achievement over the last year for non-24x7 operation

I'll report on the result sometime during the first week of June - as this is a big survey I'd like to get lots of results. Please repost/Tweet/whatever.

What do I mean by that downtime SLA? Well, however you have it defined. It could be that your SLA is an absolute value based on 24x7 operation, or it could be based on the 9-5 work-day. In all cases, what percentage of the defined time does the system have to be available? Note that I'm calling out a 5, 4, and 3 nines of 24x7 as separate answers as I'd like to see how often these occur and can be met.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

In last week's survey I asked whether you're ever tested your disaster recovery plan, and if so, what happened? (See here for the survey). Here are the results as of 5/25/09:

The 'other' responses are:

  • 2 x "restored to test env regularly. don't know if sla would be met."
  • 2 x "test it regularly, most goes according to plan"
  • 2 x "Test it regularly, people screw up. Was a great win when I obtained the budget for this activity."
  • 2 x "test it regulary and learn new things every time but overall it works"
  • 2 x "we have annual drp test company-wide"
  • 1 x "We have lots of DR plans, some tested, some not."
  • 1 x "We test the dr failover with mirror.however chain replications (subscriber becomes pulisher)we can't"

A good mixture of results, but only around 25% of respondents test it regularly. Rather depressingly, 35% of respondents either don't have a DR plan or have one but have never tested it. Given the stories I see almost every day on the various forums, this doesn't surprise me - but it's still depressing nevertheless.

The term 'disaster recovery' means different things to different people. The word 'disaster' in my mind spans everything that could go wrong and affect whether your system and data is online, available, and performing to spec. The word 'recovery' means any process that allows you to bring your system and data back online, available, and performing to spec. Your disaster recovery plan could be as simple as restoring from the last full database backup, or as complicated as failing over all processing to a remote data center and engaging a 3rd-party company to distribute new DNS routing entries across the Internet. It, of course, depends (I have to work that into every editorial Smile). Lots of people interchange DR and high-availability (HA), but HA is really a set of technologies that you implement to help protect against disasters causing problems. For instance, you might implement database mirroring so that part of the DR plan for a database is to failover to the mirror, keeping the database highly-available, while DR happens on the old principal. Or you might implement auto-grow on the transaction log file so that if it runs out of space the database doesn't become unusable. Neither of these are doing DR, they're preventing a disaster from affecting availability. DR in the second case would be what you do to provision more space for the log so the database can come online again.

Now, this editorial's not going to be about putting together your disaster recovery plan - that's an entire book in itself, as there are many techniques depending on the disaster and the resources you have available to facilitate recovery. If you don't have a disaster recovery plan (which I'm going to start calling DR plan), then you should *know* that come disaster time, you're risking whoever's on duty floundering around making mistakes and potentially leading to more downtime and data-loss than if you had a plan to follow. People panic in times of high-stress and crisis, and without a set of steps to follow, bad things happen. Enough said. You know who you are - go get a DR plan before a disaster happens and you lose time, data, your job, or all of the above. I see them all happen regularly.

Once you have a DR plan in place, the ONLY way to know whether you're going to be able to recover from a variety of disasters is to simulate some, in production. Yes, this is far easier said than done - persuading your business owners to take planned downtime (and possibly lose a bit of revenue) can be a hard argument to make, but unless you do, you can't know that your DR plan will work. One argument I've found effective is wouldn't you rather have all the various admins and DBAs on-site and expecting the test and things to potentially go wrong, than wait until a real disaster occurs in the middle of the night and THEN find out that the DR plan doesn't work and everyone has to scramble when they're least expecting it? Of course, business owners often aren't interested in low-probability potential problems. DR and HA aren't sexy topics UNTIL the company experiences a disaster. Then it's likely to be the top thing on the CEO's mind and you have to have a DR plan in production by Tuesday.

Seriously though, if you're responsible for your system meeting certain SLAs (downtime and data loss - a.k.a. RTO and RPO) then your DR plan actually has to work, no matter how carefully you've designed it. This means you have to try restoring from your backups and seeing if you can do it within your downtime SLA. What about if you have to setup a new server first? What about if there's no power in your building? What if your off-site backups are 200 miles off-site and the network link is down? What if none of your backups work? What if your differential backups are bad, do you still have all the log backups? How does that affect your recovery time? And so on and so on. You might think I'm just making stuff up and these things don't happen, but they do, and everything I'm citing as an example has happened to a customer I've personally been involved with while at Microsoft or since then. And they keep happening over and over again to different people.

If I had to list the most common reasons I see why disaster recovery fails, they are:

  • There are no backups, meaning recovery = data loss
  • Backups don't work or all contain the corruption, meaning recovery = data loss
  • The data volume has increased since last DR test, meaning recovery time exceeds downtime SLA
  • The initial failover when the disaster happens doesn't work because the failover site only has part of the application ecosystem, meaning recovery involves getting the application working on the failover site AND then recovering from the disaster on the main site

Doing an initial test when the DR plan is first produced is great, because at least you know that it works, or there are some things you've missed (which is almost invariably the case). The DR (and HA plans) should be written by the most experienced DBAs, as they're the ones who've "seen it all" and have a good idea of what could go wrong at any point during the recovery. And the plans should be tested by the most junior DBAs, as you can bet that if a disaster occurs at 2am on Thanksgiving morning, it won't be the most senior DBA who's on duty.

Doing a regular test is critical because things change. Data volume increases. Databases get added into the mix. Personnel change. SLAs change. And after a change, if you don't test regularly, then you won't know if your DR plan still works until you have a real disaster. If you can push for a DR plan test and everything works, everyone has increased peace of mind. If you can push for a DR plan test and things go south, you'll be praised for having exposed the problems. But if you wait, and things go south, no-one likes being responsible for unnecessary downtime or data loss - and that doesn't look good on a resume.

Next post - this week's survey!

This week's survey is inspired from many stories I saw on the forums and Twitter this week - mostly bad, one good (someone I'm following is spending the weekend testing their disaster recovery plan - cool!). I'd like to know whether you're ever tested your disaster recovery plan, and if so, what happened? I'll report on the results sometime of the weekend of 5/24/09.

And what do I mean by disaster recovery? The definition varies from just restoring a backup to conducting a full failover of a datacenter - but I'd like you to decide for yourself. I'll editorialize more about this next week.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog) if you have an idea for a good survey.

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

PPS Probably a lot less blogging/Twitter (@PaulRandal) this coming week - I'm teaching all 5 days.

Quickie this morning to start the day off. I saw a question on a forum: if I *have* to use a GUID and *must* have a primary key, should I make the primary key clustered or nonclustered?

Now, I'm not getting into the whole GUID vs. bigint identifier, or random GUID vs. GUID generated by NEWSEQUENTIALID(), so please don't comment on those issues, they're not relevant here. I just want to address the question - what kind of index should it be?

From a Storage Engine perspective, my answer is nonclustered. Here are three reasons why:

  • If the index is clustered, then the cluster key is immediately at least 16 bytes (the size of a GUID). This doesn't change the size of the clustered index records (as the GUID column has to be stored in the table anyway, and a clustered index IS the table), but it does change the size of the nonclustered indexes. All nonclustered indexes on the table must include the cluster keys, even of they are not explicitly part of the nonclustered index keys (I'll do a post on this later). This means the GUID is present in every nonclustered index record too. From this perspective, it would be better to use a smaller clustered index key and have the GUID primary ley be nonclustered so it's only present in that one nonclustered index.
  • Random GUIDs used as the high-order key cause index fragmentation. Their random nature means the insertion point into the index is also random. This causes page splits, which cause fragmentation and are *expensive*. (I touched on this a bit a few days ago in my post How expensive are page splits in terms of transaction log?.). With a random key value, it's hard to avoid page splits and fragmentation, although you can delay them somewhat using FILLFACTOR, but at the expense of using extra space. By making the GUID index nonclustered, you can delay page splits even further. The clustered index is the table, so the records are (usually always) larger than nonclustered index records. This means you can get fewer clustered index records on an 8KB page than nonclustered index records. With fewer records per page, you can do fewer random insertions on the page before a page split occurs. So using a nonclustered index for the GUID key means you can do fewer expensive page splits.
  • Given that whatever kind of index you create for the GUID key is going to experience index fragmentation, you're probably going to want to periodically remove the fragmentation as part of your database maintenance plan. It makes sense to try to limit the amount of resources used by the fragmentation removal operation (e.g. cpu, IO, disk space, transaction log space), and so the smaller the fragmented index, the better. A nonclustered index for the GUID key will be smaller than a clustered index, so if you choose a non-fragmentation-causing clustered index key, and confine the fragmentation to the nonclustered index, you can use fewer resources during database maintenance.

And there you have it. I'm sure some of you have seen pathological cases that disprove one of the above points, but my arguments are generalizations. Maybe this is a can of worms I've opened, in which case I look forward to the comments!

PS Brent did a great post about humor when blogging, the cartoon links he includes are great. Check it out here.

This is a true story, and unfolded over the last few days. It's deliberately written this way, I'm not trying to be patronizing - just illustrating the mistakes people can make if they don't know what not to do.

Once upon a time (well, a few days ago), there was a security person who had access to a 2000 SQL Server instance. He saw that the server had run out of disk space, but had no SQL server knowledge. He decided that the best thing to do was to delete the log file, instead of contact the real DBA to take corrective action. This person should not have had any access to SQL Server, or the ability to delete the log. Nevertheless, the log was deleted. And then the database became suspect. Eventually, with no other option, someone decided to create a new transaction log file using DBCC REBUILD_LOG. This was the right thing to do. The database wasn't being backed up so restore wasn't an option. Some of the details of who did what are unclear, as it so often the case. Needless to say, the real DBA was kept in the dark.

Unfortunately, whoever rebuilt the log didn't run DBCC CHECKDB afterwards to find out what corruption had been caused in the database by having the transaction log unceremoniously ripped out from under the database's feet. Rebuilding a transaction log is one of the worst things that can happen to a database, as all active transactions are thrown away, with no chance of rolling back. This leaves the database in a transactionally inconsistent, and potentially structurally inconsistent state (see Corruption: Last resorts that people try first... filed under my Bad Advice category, and Q4 from TechNet Magazine: February 2009 SQL Q&A column: Is it ever safe to rebuild a transaction log?).

Instead they carried on running. The real DBA thought everything was cool and even upgraded the instance from SQL 2000 to SQL 2005. To be honest, after seeing the 19000+ corruptions they had in the database, I'm surprised that the database upgraded successfully. But it did. Then he tried to rebuild some indexes, which failed with corruption errors. Which brought us to this morning, when DBCC CHECKDB was run. It reported a lot of corruption, but running from SSMS only showed the first 1000 errors (which was the inspiration for one of my previous posts today How to get all the corruption messages from CHECKDB. So they started to run REPAIR_ALLOW_DATA_LOSS, which will cause data loss - we chose the name of the repair option well. I got involved at this point, and saw from the list of errors that massive data loss would be the result.

Once they had re-run DBCC CHECKDB though the command line (luckily only 20 minutes) I could see all the errors. And I must say I was astounded. It was the *worst* case of multiply-allocated extents (see Inside the Storage Engine: Anatomy of an extent) that I've ever seen. Not only were the extents allocated by multiple IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) - in other words, two objects thought they had the same pages in the database allocated, but the two objects had actually each proceeded to overwite the other's updates in the pages - getting the two clustered indexes hopelessly interlinked. Lots of errors (1000s) like:

Msg 8904, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Extent (1:9528) in database ID 5 is allocated by more than one
allocation object.
...
Msg 8978, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: Object ID 1445580188, index ID 1, partition ID
376212519911424, alloc unit ID 94737543200768 (type In-row data). Page
(1:52696) is missing a reference from previous page (1:427112).
Possible chain linkage problem.
...
Msg 8977, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: Object ID 1445580188, index ID 1, partition ID
376212519911424, alloc unit ID 94737543200768 (type In-row data).
Parent node for page (1:143210) was not encountered.
...
Msg 2533, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: page (1:405139) allocated to object ID 1445580188, index
ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type
In-row data) was not seen. The page may be invalid or may have an
incorrect alloc unit ID in its header.
...
CHECKDB found 1653 allocation errors and 17646 consistency errors in
database 'NAMESCHANGEDTOPROTECTTHEINNOCENT'.

The repair for all of this? Delete it all and fix up the links. MASSIVE data loss.

Lessons to learn from this:

  • Don't give people with no clue about SQL Server access to SQL Server. 
  • Don't delete a transaction log to reclaim space. Cardinal sin.
  • Don't rebuild a transaction log UNLESS you run a full DBCC CHECKDB afterwards and satify yourself that you know the extent of the damage.
  • Don't upgrade a database without running DBCC CHECKDB first. Best case - the upgrade fails. Worst case - it upgrades and then you might not be able to fix the corruptions.
  • Don't just run repair without doing a little analysis on the errors reported, especially with a *huge* number of errors reported.

Finally, I'm really grateful to the DBA in question for letting me help him out with this, and to post this blog post - we all learn from our own and others' mistakes.

PS And I got involved in this from Twitter - I just *love* it. After initially being skeptical of how much time I'd spend on it, I'm finding the benefits of connecting to the SQL community in 'real-time' vastly outweigh the time I'm putting into it. Follow me on http://twitter.com/paulrandal and you'll see a bunch of other SQL MVPs on there too.

Hopefully all of you reading my blog already know to use the WITH ALL_ERRORMSGS option whenever you run DBCC CHECKDB (which is now the default in 2005 SP3, haven't checked if it made it into 2008 SP1). If you're doing some quick analysis of the results to see, for example, whether it's just nonclustered indexes that are corrupt then you need to see all the errors - as it may look like all the errors are (relatively) benign but there's one error you didn't see that says a clustered index data page is corrupt.

That option is all well and good *except* when you run the DBCC command through SSMS. It will only show the first 1000 errors (long story, but let's just says it involves me being able to create corruption cases during 2005 development that would generate so many errors it would cause SSMS to crash) and there's no way to get around this. So, if you have more than 1000 errors, as in a case I'm helping with today where there are 19000+ errors, you can't use SSMS and expect to get them all back. Problem is, you don't know that you've got that many errors until you run it - so if you used SSMS and find there are more than 1000 errors, AND you want to do some analysis of them, you'll need to go and run it again using a different connection to the server.

So - something I advise is to run DBCC from the command line. Either osql or sqlcmd will do the trick. For instance:

osql -E -Q"DBCC CHECKDB (master) WITH ALL_ERRORMSGS, NO_INFOMSGS" -oC:\outputfile.txt

This will run the DBCC command and store the output in a file for later analysis. You can substitute sqlcmd for osql in the command above if you want.

Last week's survey was two-fold - what's the largest SQL Server database you manage, and how many SQL Server databases are you responsible for managing. Here are the results as of 5/10/2009.

 

As far as the database sizes are concerned, the distribution curve isn't surprising, but the average size of databases *is*. 70% of respondents have a database over 100GB, with half of those being 500GB or more. One lucky (or unlucky, depending on how you look at it!) person gets to play with a 20+TB database. There some truly huge SQL Server databases out there - for instance the 1.1 petabyte astronomical database described here, which adds 1.4TB every night. Back in December 2003, the largest SQL Server database was a 5TB Verizon database. Now multi-TB databases seem quite common when talking to large customers.

The distribution curve for the number of databases also seems unsurprising, except for the uptick at the 150+ databases mark. Fully 1/4 of respondents are responsible for more than 200 databases! Even if the majority of the databases are small, that's still a huge amount of management to consider and context to have to know about.

So what was the point of this survey? Well, I've heard from quite a few people this year that some of their DBA colleagues have been let go and they've had to pick up more responsibility, whether they want to or not. Just last week, Microsoft laid off 3000 more people worldwide, including some in their various IT departments. In the DBA class we were teaching last week, it was obvious that some of the attendees were affected and had a lot more on their plates.

With the average size of databases growing, and the number of databases any one person is responsible for also growing, it's imperative that the overall environment is as easily managed as possible. What does this mean? Here's a 2-minute list of stuff off the top of my head that I think helps make an environment more manageable:

  • Setting up each database so that the important parts can be recovered as quickly and easily as possible
  • Setting up each database so that there are no performance bottlenecks (these two were both kind of addressed by my previous survey Physical database layout vs. database size)
  • Making sure each database has the right backup strategy (from last week's survey Importance of having the right backups)
  • Making sure that each database has the correct monitoring set up. For instance:
    • SQL Agent alerts on the instance for high-severity errors (see Easy monitoring of high-severity errors: create Agent alerts)
    • Monitoring data and log file sizes to avoid auto-growths
    • Monitoring index fragmentation levels
    • Pro-active monitoring for things like high disk-queue lengths or runaway queries
    • Regular consistency checks to find corruption
  • Security is setup appropriately (see TechNet Magazine: feature article on Common SQL Server Security Issues and Solutions)
  • Maintenance jobs are automated as much possible (e.g. backups, consistency checks, index maintenance, and statistics maintenance). There are a bunch of scripts people have published to help with this: Tara Kizer has a lot on her blog, Ola Hallengren has a good one here, as do many others (no offense to anyone else with scripts, these are just the two that spring to mind - by all means reply with a comment to let me know of your popular script)
  • Database settings adhere to common best-practices (e.g. auto-shrink off, auto-grow on (and managed!), auto-create and auto-update statistics on, page checksums on)

And I'm barely touching the tip of the iceberg here. Most of these are just database maintenance best practices - with the idea that the healthier a database is, the less likely something will go wrong that demands DBA attention. And in an environment where potentially hundreds of databases are under the control of a single person, avoiding problems is of paramount importance.

Ok - cup of coffee later and I have more stuff to add to the list:

  • Consistency of settings and scripts across database and instances, so what you're familiar with in one database context translates to all others
  • Environment-wide run-book that covers all aspects of day-to-day operations, so all DBAs do the same thing and a DBA can help out or pick up databases easily
  • Environment-wide disaster recovery guide - written by the most senior DBA and tested by all DBAs down to the most junior
  • T-SQL source code control
  • Management of access to databases
    • Physical access to servers is limited
    • Network access to servers is limited
    • SA access is limited
    • Developers can't deploy new code without going through QA first
    • And so on
  • High-availability technologies in place to aid recoverability, with plenty of practice of failing over applications
  • Use of tools to increase productivity. Most major 3rd-party tool vendors have tools that can help in a myriad of DBA activities - I'm not going to name any in particular as I don't want to favor any over any of the others.

Talking about tools, SQL Server 2008 itself has some stuff built in that can increase productivity. Kimberly did a great post last November on Central Management Servers - see SQL Server 2008 Central Management Servers - have you seen these?. There's also Policy-Based Management, which is a good first step along the road to centralized policy setting, testing, and enforcement (although it has some nits that need to be fixed in v2), and Performance Data Collection which can be used to easily collect perf data (same kind of stuff that Activity Monitor has) for multiple instances and stored in a single location - Kimberly will be blogging more about these two features in the future.

Overall, having a haphazard way of setting up and maintaining databases and instances is going to make it really hard to scale up size and number of databases in an environment without investing heavily in more people. Taking a more disciplined, planned approach, learning from others - using others' scripts and practices, and developing streamlined processes will lead to increased productivity, decreased stress, and happier, stronger DBAs. No-one wants to be in constant fire-fighting mode.

This has been a bit of a ramble, but that's the point of these editorial-style posts. I'm sure you know a lot of this, and I'm sure I've missed some obvious points, so feel free to comment with more tips, tricks, and links!

Next up - this week's survey!

This has come up a few times now, most recently in an email question this morning - subsequent runs of DBCC CHECKDB show varying numbers of corruptions, and sometimes no corruptions - what's going on? Even more strange - a maintenance job runs a DBCC CHECKDB, which shows errors, but then in the morning - no consistency errors. What?

I answered this back in the April 2009 SQL Q&A column in TechNet Magazine, but I want to get it here on the blog too in a bit more detail. The answer has to do with the way the database is consistency checked, and how corruptions are detected.

In 2005 onwards, you're going to be using page checksums to help detect corruption. If you created the database on 2005 onwards, page checksums are enabled by default and every allocated page will have one. If you upgraded a database from 2000 or before, then you need to manually enable page checksums with ALTER DATABASE. The nothing happens. Until a page is read in, changed, and then written back out. So your upgraded database will have a mixture of nothing/page checksums, or torn-page detection/page checksums. Note: torn-page protected pages remain torn-page protected, even with page checksums enabled, until the next time they're altered. Then they get a page checksum. See Inside The Storage Engine: Does turning on page checksums discard any torn-page protection? for an explanation and examples.

Once you've got page checksums enabled, who can you tell if there are corruptions in the database? Well, there are a number of ways corruptions will show up:

  1. You run an operation that hits a page that has been corrupted, and the page checksum test fails
  2. You run a BACKUP ... WITH CHECKSUM and it finds a page with a bad checksum
  3. You run a DBCC CHECKDB and it finds a page with a bad checksum

That's all very well, but what if a page *doesn't* have a page checksum on it (because it hasn't been changed since page checksums were enabled)? None of #1 to #3 will fail because of a bad page checksum, as there isn't a page checksum to check. #1 might fail, depending on how corrupt the page is, and it will likely fail with an obscure message that doesn't immediately scream 'corruption'. #2 won't fail, as the only time BACKUP examines what it's backing up is when WITH CHECKSUM is enabled and a page has a page checksum on it. #3 might find the corruption, depending on how the page is corrupt. If the corruption is in the middle of a large varchar field, for instance, probably not. Your best bet is to have page checksums enabled and regularly run DBCC CHECKDB.

That's how corruptions are detected. So what about the disappearing corruptions? This gets into how consistency checks work. Consistency checks only run on the pages in the database that are allocated. If a page isn't allocated to anything, then the 8192 bytes of it are meaningless and can't be interpreted. Don't get confused between reserved and allocated - I explain that in the first misconceptions post here. As long as a page is allocated, it will be consistency checked by DBCC CHECKDB, including testing the page checksum, if it exists. A corruption can seem to 'disappear' if a corrupt page is allocated at the time a DBCC CHECKDB runs, but is then deallocated by the time the next DBCC CHECKDB runs. The first time it will be reported as corrupt, but the second time it's not allocated, so it isn't consistency checked and won't be reported as corrupt. The corruption looks like it's mysteriously vanished. But it hasn't - it's just that the corrupt page is no longer allocated. There's nothing stopping SQL Server deallocating a corrupt page - in fact, that's what many of the DBCC CHECKDB repairs do - deallocate what's broken, and fix up all the links.

The maintenance job phenomenon can occur because of the order of operations in the job. If the DBCC CHECKDB is first, and then there's an index rebuild, and the index rebuild happens to rebuild an index that DBCC CHECKDB had found a corruption in, then the *new* index will have a completely different set of database pages, and won't contain the corrupt page. Bingo - disappearing corruption. A subsequent DBCC CHECKDB might not find any corruption, because the previously corrupt pages are no longer allocated.

Bottom line - any time you get corruption error messages, 99.999% of the time it's your I/O subsystem that's got problems, even if the corruptions 'disappear'.

PS Don't forget to follow along on Twitter - http://twitter.com/PaulRandal

Recently there's been a spate of people noticing strange behavior from active transactions when DBCC CHECKDB (or any of the other DBCC consistency checking commands run).

For example, I've create a database call DbccTest with a single table. In one connection I do:

BEGIN TRAN
INSERT INTO t1 VALUES (1, 1);
GO

And in another connection I do:

DBCC CHECKDB (DbccTest) WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO
 

Look what gets printed in the error log:

2009-05-04 16:03:21.55 spid54      1 transactions rolled back in database 'DbccTest' (14). This is an informational message only. No user action is required.
2009-05-04 16:03:21.91 spid54      DBCC CHECKDB (DbccTest) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 0 seconds.

I've highlighted the weird part in bold - it looks like crash recovery ran on the DbccTest database. What's going on?

Well, crash recovery *DID* run on the DbccTest database - only it ran crash recovery into a hidden database snapshot. In the first part of the very long (well, 13 pages - not as long as the 70 page description in the 2008 internals book) post on how DBCC CHECKDB works (see CHECKDB From Every Angle: Complete description of all CHECKDB stages), I explain why DBCC CHECKDB needs a transactionally-consistent, point-in-time view of the database to run consistency checks on. In 2005 I changed the code to use a database snapshot - which by it's very nature provides a point-in-time, transactionally-consistent view of the database. When a database snapshot is created on a database, the transaction log of that database is examined, and crash-recovery is run on it, but *into the database snapshot* - the source database is totally unaffected. The message in the error log is from DBCC CHECKDB's hidden database snapshot starting up - but it's pretty misleading, I must admit.

One more twist to this behavior is that if there is a very long-running transaction in the database being consistency-checked, the database snapshot creation could take hours (because it has to wait for the long-running transaction to rollback, in the snapshot). If you think that the consistency check has hung, and try to kill it, you won't be able to and the SPID will appear as if it's in rollback. This is confusing because DBCC CHECKB doesn't do anything - so shouldn't have anything to rollback - but it's the database snapshot creation that's the problem. Once crash recovery has started on the database snapshot, it can't be interrupted - so you have to wait for it to finish (and the database snapshot to be created) before the DBCC command will actually stop, and remove the database snapshot. It's a weird side-effect, but a necessary one unfortunately. The recovery code could be changed to check for attention signals every so often I suppose, but I'm not holding-my-breath for that change.

Hope this helps.

Jason Massie posted an interesting statistic yesterday - Facebook has 1.5 petabytes of image storage, and it grows by 25TB daily - I wonder how they store and manage it?

In this week's survey, I'm interested in two things: what's the largest single SQL Server database in your company and how many SQL Server databases are you responsible for managing? I'll report on the results sometimes over the weekend of 5/9/09.

A note on the second survey: if you manage a team of DBAs, but don't manage databases directly yourself, answer for your individual DBAs, rather than the team as a whole.

Thanks!

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

The June edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • Combining index and statistics maintenance
  • How some Enterprise-only features can prevent a database restoring on lower editions
  • Why database mirroring can switch between SYNCHRONIZED and SYNCHRONIZING
  • Use a database mirror for reporting

Check it out at http://technet.microsoft.com/en-us/magazine/dd797578.aspx.

Not much to do here in Tokyo while waiting for the Bangkok flight except read and blog, and I've already done enough reading for one day on the flight here - so that leaves blogging!

Here's an interesting case that I got involved in on SQLServerCentral (here's the original thread, linked here with permission of the original poster). The problem was that every night the maintenance job would run and end up producing a stack dump, from a 211 error (which always indicates metadata corruption causing the Engine to trip over). I guessed system table corruption but the consistency checking portion of the maintenance job was apparently running fine. So I asked for one of the stack dumps, the output from the maintenance plan, and the latest SQL error log to have a look at (I have a weakness for these things, as you well know).

After having a look, I saw some strange behavior in the error log - every night there 10 or so DBCC CHECKDBs of master within about 2 minutes - corresponding to the consistency checking part of the maintenance job that was clearing running DBCC CHECKDB against all the system and user databases. Then I remembered that vanilla 2005 SP2 has two nasty maintenance plan bugs - one of which caused the consistency checking part of the maintenance plan to malfunction when run, and run DBCC CHECKDB against master every time instead of the database specified. Fixing that would then lead to the being able to properly run consistency checks and find the corrupt database.

So - moral of the story is to make sure you're not running vanilla 2005 SP2 (or one of the builds close to it, 3150 to 3158) otherwise you could get into problems with maintenance plans - worse still, you may think you've been running consistency checks all this time on your user databases, but if no-one's checking the SQL error logs, you could just be repeatedly consistency checking master.

The KB that describes the bug is 934459 (available at http://support.microsoft.com/kb/934459), or you can install SP3 (available at http://support.microsoft.com/?kbid=955706).

Last week's survey was on how *you* manage the size of your transaction log (see here for the survey). Here are the results as of 4/10/2009 - the most popular survey yet:

In my opinion (guarantee that I'm going to say stuff that some of you won't agree with!), there are only two valid answers in this survey: #1 - run in the FULL recovery model and take regular log backups, and #4 - run in the SIMPLE recovery model all the time. The last answer is applicable if you run out of log space even though you're in either of these situations but isn't a general strategy like #1 or #4. IMHO, you should be in one of these two situations and in the rest of this editorial I'll explain why. I'm not going to touch on *why* your transaction log might start to fill up, instead here are some links:

Now for the survey options:

  1. Take regular log backups. I'm very pleased to see the vast majority of respondents doing this, as it's the only valid log size management strategy when in the FULL recovery model (same thing applies to BULK_LOGGED of course, but with a few twists). Once you take that first database backup in the FULL recovery model, SQL Server assumes you're going to manage the transaction log size from that point on by taking log backups. Unfortunately that isn't documented in big, red flashing letters in Books Online - so people can get themselves into troubel inadvertently. Also, the FULL recovery model is the default, and is required for database mirroring - which further adds to the potential for people (such as involuntary DBAs) to accidentally switch into I-will-take-log-backups mode and then not take them. If you don't want to take log backups for recovery purposes, or you don't want to use database mirroring, don't use the FULL recovery model - it's as SIMPLE as that (ha ha). You might argue and say that you're only using FULL because of database mirroring, and don't want to take log backups. I'd argue back and say that if you care enough to have hot standby of your database, you must also take backups - as you can't rely solely on a redundant copy of your database on a different server.
  2. Use BACKUP LOG WITH NO_LOG/TRUNCATE_ONLY regularly. These two commands do basically the same thing - allow the log to be cleared without taking a log backup. What's the point if you're not taking log backups? - just switch to SIMPLE and let the checkpoints clear the log. In fact, in 2008 these two commands have been removed. See my blog post BACKUP LOG WITH NO_LOG - use, abuse, and undocumented trace flags to stop it.
  3. Use BACKUP LOG WITH NO_LOG/TRUNCATE_ONLY when it fills up. Same as above. You might argue that you're only keeping the log around in case there's a disaster, so that you can take a log backup at that point and use it to recover up to the point of the disaster. I'd argue that's broken on two counts: 1) what if the log file is damaged and you can't back it up? 2) that's *all* the transaction log since the last full database backup you took (if you break the log backup chain and then take a full database backup, that backup becomes the base of subsequent log backups) so that may take a long time to restore and replay...
  4. Run in the SIMPLE recovery model all the time. If you don't need to use FULL, don't. Running in SIMPLE is perfectly acceptable, as long as you don't mind losing work in the event of a disaster.
  5. Switch to SIMPLE when it fills up, then back to FULL. This is like #s 2 and 3 - what's the point?
  6. Switch to SIMPLE when it fills up, shrink the log, then switch back to FULL. This is worse than 2, 3, or 5. If you shrink the log, then it's going to grow again - possibly causing VLF fragmentation (see Transaction Log VLFs - too many or too few?), and definitely causing your workload to pause while the log grows, as the log can't use instant initialization (see Search Engine Q&A #24: Why can't the transaction log use instant initialization?).
  7. Shutdown SQL Server and delete the transaction log file(s). Just don't get me started on this one - I'm glad no-one 'fessed up to doing it. There are many reasons why this is daft, including: 1) you have to shutdown to do it, so your workload is off-line 2) if the database wasn't cleanly shut down, it won't be able to be started again without using EMERGENCY mode repair, and your data will be transactionally inconsistent 3) as the log can't be instant initialized, the database won't come online until the log has been created and zero'd. Just don't do this. Ever.
  8. Add more transaction log files. As a general strategy for managing the size of the log, this is not good. Extra log files have zero effect on performance (the myth that SQL Server writes in parallel to the log files is just that - a myth) and make management more tricky. However, if your log fills up for some out-of-the-ordinary reason, then adding another log file may be the only way to allow the database to keep running. If you have to do this though, make sure that you remove the extra log file(s) when you can to keep log management uncomplicated.

To summarize, if you want to be able to take log backups to aid in point-in-time or up-to-the-second recovery, use the FULL recovery model. If not, use SIMPLE and you won't need to mess around with the log when it fills up because you're not taking log backups.

Next post - this week's survey! (And thanks to all those who are responding to them!)

PS For those of you who sent me details about your databases from the survey back at the start of March (see here) - I haven't forgotten. I was waiting to get a decent sample size and now I'm going to go through the data. If you want to send me any more data, you've got until Sunday.

The May 2009 TechNet Magazine is now available online, and it's the annual security issue. In there is an article I wrote highlighting 10 common security issues (and solutions) you should worry about if you're not a security-savvy DBA. It covers:

  • Physical security
  • Network security
  • Attach surface minimzation
  • Service accounts
  • Restricting use of administrator privileges
  • Authentication
  • Authorization
  • SQL injection
  • Disaster recovery
  • Auditing

There are also two screencasts of me demonstrating Transparent Data Encryption and SQL Server Audit, both in SQL Server 2008. 

To quote myself from the end of the article:

As far as takeaways from this article are concerned, I want you to realize that there are some steps you need to go through to ensure the data you are storing in SQL Server is as secure as you need it to be. This is especially important when you inherit a SQL Server instance that someone else has been managing. It's just like buying a house from someone—you need to ask if the alarm works, if the yard is fenced in, and who has copies of the keys. Running through the list I've given in this article is a good start, but make sure you dig deeper in areas that are relevant to you.

Check it out at http://technet.microsoft.com/en-us/magazine/2009.05.sql.aspx.

Enjoy!

Last week's survey was on what method you use to run consistency checks (see here for the survey). Here are the results as of 3/27/09 - again, very encouraging:

As you can see, 70% of respondents run DBCC CHECKDB on the production server, either with PHYSICAL_ONLY or without. For those running without it, be aware that using PHYSICAL_ONLY turns DBCC CHECKDB from a CPU-bound process into an I/O-bound process, and makes it run (potentially) magnitudes faster. It will still evaluate page checksums and torn-page protection, just skipping the higher-level logical checks.

I'm surprised to see so many people using a completely separate system to run consistency checks (restoring a backup and running a CHECKDB on the restored backup) - I've been a proponent of this method for a few years now, but I didn't think it had caught on so much (assuming a reasonably representative sample of readers responded to the survey). This method allows the entire consistency checking workload to be offloaded, and completely validates the backups used - but has the downside of requiring extra disk space on another server to restore the backup (I wish they'd build the system I got a patent for - to consistency check the database inside a backup without restoring it - see here).

Three of these choices I threw in to see if anyone was doing them so I could explain why they're not good methods to use.

  • "Don't run any consistency checks at all". I'm sure I don't need to labor the point with this one - you need to run consistency checks as well as having some kind of page protection turned on. If you don't proactively check for corruption, when it does occur it will likely be more widespread, and harder to recover within your data-loss and recovery time objectives than if you'd discovered it earlier. I've written lots about this in the Corruption and CHECKDB From Every Angle categories.
  • "Run DBCC CHECKDB on a database snapshot on a mirror database". I was asked about this several times while here at SQL Connections too. Database mirroring works by shipping transaction log records between the principal and mirror databases, NOT by shipping database pages - so if a page gets corrupted on the principal, the corruption will not be transferred to the mirror. This means that DBCC CHECKDB on a snapshot of the mirror does not reflect the consistency state of the principal database at all. Saying that, it is however, possible for corruption in the principal to affect the mirror. Imagine a column value is corrupted on disk by the principal's I/O subsystem. If that value is then read and used to calculate another value, which is then persisted - that calculated value is also 'corrupt' and will be reflected in the mirror database. Kind of an insidious, second-order corruption effect.
  • "Use BACKUP WITH CHECKSUM to validate page checksums, no DBCCs". Another option that sounds feasible, but in fact isn't for a couple of reasons. Firstly, BACKUP WITH CHECKSUM will stop when it finds a bad checksum, whereas DBCC CHECKDB will continue reading the rest of the database and tell you everything that's wrong with it. Secondly, BACKUP WITH CHECKSUM will only check those pages that HAVE page checkums, whereas DBCC CHECKDB will consistency check everything, regardless of whether a page has a checksum or not. For a database upgraded from an earlier version, where not every page will have a page checksum since you enabled them after upgrading, this is a critical point.

To summarize, this week's results were great, with almost 90% of respondents running some kind of consistency check. If you're not running any, for whatever reason, rethink your decision - there's always a way to run some kind of consistency checks and give yourself more peace of mind. See CHECKDB From Every Angle: Consistency Checking Options for a VLDB for more info.

Next post - this week's survey!

Jack Li, one of the Senior Escalation Engineers in Product Support, just posted details of an interesting case over on the CSS blog - his article is here. It talks about index builds and rebuilds, but the issue is the same for both, so I'll just talk about rebuilds.

The jist of the problem is that index rebuilds can parallelize, but sometimes they don't parallelize vey well. Each thread gets a certain range of the index to rebuild, using the existing index statistics to divide the ranges equally between the threads. If there's massive data skew, then one thread can end up doing the majority of the work, leading to a long run-time. The case in Jack's post involved a 250 million row index where 150 million rows had the same (NULL) key value. This range has to be processed by a single thread - a single value can't be divided between two+ threads.

Now, this is understandable behavior by the database engine, but it relies on the statistics being up-to-date. That's a bit of a catch-22 - rebuilding an index updates the statistics, but if the statistics aren't up-to-date then the index rebuild might parallelize badly! I guess the solution is that if you know that you have massive data skew in your large indexes, update statistics BEFORE doing an index rebuild. And given what I've been hearing this week at SQL Connections about how badly statistics keep biting people, I'm leaning towards a different recommendation for those people who have lots of perf trouble caused by statistics and the potential for skewed data - rebuild all your statistics regularly, and only rebuild/reorganize fragmented indexes. Statistics just cause so many problems it seems.

Thanks

PS Kimberly has a lot more info about statistics over on her blog - I'm just starting to venture into that mine-field

Just saw this on a forum - running REPAIR_ALLOW_DATA_LOSS on msdb to cope with corruption. Yes, this will work but it's certainly not safe and something I'd only recommend as a last resort.

Here's why: msdb stores your backup history and your Agent jobs. REPAIR_ALLOW_DATA_LOSS works by usually "delete what's broken and fix up all the links" (quoting myself). If you run repair on msdb, you need to pay close attention to what was deleted by repair, as it could lead to big problems later on. Some examples:

  1. What if repair deletes a record which just happened to store the Agent job that runs the transaction log backups for your main production database? Suddenly your log isn't being backed up and you don't know about it. The log starts to grow and eventually runs out of space. The database stops and your application is down until you figure out what's wrong.
  2. What if repair deletes a record which just happened to store the details of a log backup of the production database? Your have a disaster and run the script that looks through the backup history tables and auto-generates RESTORE statements to get you up and running again with up-to-the minute recovery. Because of the missing record, there's a missing RESTORE LOG statement in the middle of the restores of the log-backup-chain for the production database. So the restores fail, and you're down until you figure out what's wrong.

Bottom line, it's not safe to run repair on msdb - proceed with extreme caution. Not that repair won't work, or will cause corruption, but that it's far better to have backups of msdb that you can restore from.

(Yes, Kimberly's lecturing again...)

Here's something that I've seen crop up a lot recently on corruption forums:

Server: Msg 602, Level 21, State 50, Line 1
Could not find row in sysindexes for database ID 10, object ID 1, index ID 1. Run DBCC CHECKTABLE on sysindexes.

This isn't corruption - it comes from trying to attach a 2005 database to a 2000 server. If you try to restore a 2005 database on a 2000 server, you'll see:

Server: Msg 3169, Level 16, State 1, Line 1
The backed-up database has on-disk structure version 611. The server supports version 539 and cannot restore or upgrade this database.
Server: Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

This gives a more useful error. Basically, SQL Server is not up-level compatible in terms of the database physical structures. A SQL 2000 server cannot understand the new structures that are in a SQL 2005 database. There's a lot of confusion about this, and why setting database compatibility level is not the same as the physical version of the database. My blog post Search Engine Q&A #13: Difference between database version and database compatibility level has more details.

Trying to do the same thing with a 2008 database on a 2005 server is a bit better. For the attach we get:

Msg 1813, Level 16, State 2, Line 1
Could not open new database 'Dbmaint2008'. CREATE DATABASE is aborted.
Msg 948, Level 20, State 1, Line 1
The database 'Dbmaint2008' cannot be opened because it is version 655. This server supports version 612 and earlier. A downgrade path is not supported.

(Note that 611 is the physical version for all SQL Server 2005 builds, but if VARDECIMAL is enabled, the version gets bumped by one to 612 - long story...)

But the restore error is still a little cryptic:

Msg 3241, Level 16, State 7, Line 1
The media family on device 'c:\sqlskills\dbmaint2008.bck' is incorrectly formed. SQL Server cannot process this media family.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

Bottom line - you can't attach a higher-version database to a lower-version server.

Last week’s survey was on what kind of regular index maintenance you perform (see here for the survey) as a way of kicking off a new series I’m writing around index maintenance. Here are the results as of 3/21/09 – I find them very encouraging:

As you can see, about 2/5 of respondents are performing some form of analysis-based fragmentation removal (answers 5+6), which I consider the best way to perform index maintenance, if you’re willing to invest the time involved to set it up. It allows the least amount of work to be performed, for the most targeted performance gains – and so is especially appropriate for 24x7 systems where there’s a minimal or non-existent maintenance window.

The next best option is to do all rebuilds or all defrags based on a fragmentation threshold (answers 3+4), which about 1/5 of respondents do. This also allows work to be limited, but by choosing only a single method of removing fragmentation, there are pros and cons. Now, the survey was limited to a number of questions so I couldn’t explore what the threshold is that people are using (e.g. logical fragmentation, page density, extent fragmentation, or something else). Some measures are good to use and some not so good, and I’ll be exploring the various counters and ways of determining fragmentation as the series progresses.

Either of the options to operate on all indexes regardless of fragmentation (options 1+2) can lead to lots of wasted resources (disk space, transaction log space, I/Os, CPU) by operating on indexes that are not fragmented in the first place, or for which fragmentation removal has no benefit for workload performance. About 1/3 of respondents do this. This isn’t surprising to me as rebuild-all-the-indexes-every-night/week is a very common index maintenance plan for “involuntary DBAs” who know that index maintenance is important, but don’t have the knowledge or training to implement a more sophisticated maintenance plan. This growing size of this group of people is one of the main reasons I’m going to write this series.

Doing absolutely nothing for index maintenance, which about 1/10 do, is usually not a good idea, as indexes in a database that’s not read-only commonly become fragmented over time. However, these people may know they don’t suffer from fragmentation issues, or that removing fragmentation has no effect on workload performance. However, my suspicion (based on what I see in the field) is that some of these respondents don’t realize the benefits of performing index maintenance.

You may be surprised to hear that I don’t consider doing nothing to be the worst choice. That dubious honor goes to answer 7 – doing any kind of index maintenance followed by a database shrink operation – as 3 respondents do. A post-maintenance shrink operation may well undo some of the benefits of the maintenance by introducing massive amounts of index fragmentation – see my blog post Auto-shrink - turn it OFF! for details of how bad this can get.

I’m actually pretty encouraged by these results though. Compared to how things were around 1999 when I wrote DBCC INDEXDEFRAG and DBCC SHOWCONTIG for SQL Server 2000, these results show that knowledge in the field (or at least in the group that reads my blog and responded) has vastly improved. My aim for the forthcoming series about index fragmentation and maintenance is to increase knowledge a lot more broadly.

Next up - this week's survey. Thanks for reading!

I'm about to start a new series of post about index fragmentation and removing it. For this week's survey, I'd like to know what index maintenance you do to address fragmentation (in any of its forms) - I'll report on it in a week.

Thanks

Last week I kicked off the first weekly survey - on whether you validate your backups or not (see here for the survey). The results are very interesting (as of 3/13/09):

As you can see, almost 25% of respondents never validate their backups! And a further 25% only validate them occasionally, with 30% doing some kind of regular checks, and only a handful checking all the time.

While these results may seem shocking to you, based on what I've heard when teaching, they're pretty normal. There are lots of reasons why DBAs may choose not to validate backups as often as they should, including:

  • Not enough time to restore the backups to check them
  • Not enough disk space
  • Not part of the day-to-day operations guide
  • Don't see why it's important

Kimberly and I have a saying (well, to be fair, Kimberly coined it): you don't have a backup until you've restored it. You don't know whether the backup you just took was corrupt or not and will actually work in a disaster recovery situation.

Can you ever get a guarantee? No. Here's an analogy, taken from a very old post of mine. Consider Paul, who works for the Seattle Police Department in traffic control. Paul's in a control room somewhere in the city with a large bank of monitors connected to various traffic cameras. Paul's job is to cycle through the cameras every 1/2 hour, looking for traffic accidents. At the end of the 1/2 hour cycle, if Paul han;t seen any accidents then he knows that there are no accidents in the city.

Ah - but hold on. Does Paul really know that? No. All Paul knows is that at the point he looked at a particular camera, there was not an accident at that spot in the city. The very instant he switches to another camera feed, an accident could happen at a spot covered by the previous camera.

The same is true for validating backups. As soon as you've validated a backup, it could then be corrupted by the I/O subsystem, but at least you know that it was valid at some point. But what if that happens, I hear you ask? Well, then you need to have multiple copies of your backups, and you should not rely on backups as the only method of disaster recovery. A good high-availability solution includes as many technologies as you need to mitigate all risks - and backups are just one of those technologies. You're going to have to have some kind of redundant system too that you can fall back on (or mayb even immediately failover to, depending on your particular disaster recovery plan). But, saying that, you can't rely on the redundant server either - if it goes wrong, you'll need your backups.

So - whichever way you look at it, validating backups is a really good practice to get into so you don't get bitten when it comes to the crunch. When I teach, I've got many stories of customers losing data, business, time, and money (and DBAs losing their jobs) because the backups didn't work or were destroyed along with the data. Here's one for you (simplified, and no I won't divulge names etc). Major US investment firm decides to provision new hardware, so takes a backup of the database storing all the 401k accounts for all their customers (private and corporate), flattens the hardware, and goes to restore the backup. The backup is corrupt - on SQL 2000, where there's no RESTORE ... WITH CONTINUE_AFTER_ERROR. What happened? Well, the SQL team and Product Support had to get involved to help get the data back, but people in the firm lost their jobs and it cost a lot of time and money to recover the data. If only they'd had multiple copies of the backup, and tested their backup before removing the database (or better yet, restored the database on the new hardware before flattening the old hardware). They learned a costly lesson, but they did change their practices after that.

Unfortunately this is so often the way - people don't realize they need to validate backups or have an HA plan UNTIL they have a disaster. Then suddenly its the top priority at the company. Being proactive can save a lot of grief, and make you look good when disaster strikes.

Backups can be unusable for a number of reasons, including:

  • The full backup is corrupt, because something in the I/O subsystem corrupted it.
  • A backup in the log backup chain is corrupt, meaning restore cannot continue past that point in the chain.
  • All backups following a full backup are written to the same backup set, but the WITH INIT clause is used accidentally on all backups, meaning the only backup present in the backup set is the last one taken.
  • An out-of-band backup was taken without using the WITH COPY_ONLY clause and the log backup chain was broken (see BACKUP WITH COPY_ONLY - how to avoid breaking the backup chain).
  • The backups worked but the database contained corruption before it was backed up (kind of a separate issue).

The only ones that are out of your control are the first two, but they can be mitigated by having multiples copies of backups. All of these though, can be avoided at disaster recovery time by reguarly restoring your backups as a test of what you'd do if there was a real disaster. You might be surprised what you'd find out...

This is more of an editorial style post than a deep technical or example script post - I'm going to start doing more of these around the weekly surveys. Next post - this week's survey.

Thanks!

Ola Hallengren, who we meet every so often at SQL Connections, has a great script that helps automate consistency checking, backups, and index maintenance. He's constantly updating it and it's good quality code. Check out the latest version for  SQL Server 2005 and 2008:

Enjoy! 

This is an interesting case that cropped up yesterday - the transaction log is damaged so a log backup doesn't work (with the error below):

Backup detected log corruption in database FakeDBName. Context is FirstSector. LogFile: 2 'F:\SQLLOGS\XYZ\FakeDBName_Log.ldf' VLF SeqNo: x502e VLFBase: x2ce40000 LogBlockOffset: x2d0a9000 SectorStatus: 2 LogBlock.StartLsn.SeqNo: x4ee3 LogBlock.StartLsn.Blk: x1348 Size: x200 PrevSize: x400
2009-03-06 10:00:02.61 Backup      Error: 3041, Severity: 16, State: 1.
2009-03-06 10:00:02.61 Backup      BACKUP failed to complete the command BACKUP LOG FakeDBName Check the backup application log for detailed messages.
2009-03-06 10:00:03.61 Backup      Error: 3041, Severity: 16, State: 1.

However a full backup succeeds, as does a DBCC CHECKDB. What's going on?

The answer comes with understanding what portions of the transaction log are required for these operations. (For a good understanding of the transaction log itself, along with logging and recovery, see my article in the February TechNet Magazine.)

A transaction log backup, by its very nature, has to backup *all* transaction log generated since the last log backup - so it will try to backup the corrupt portion and fail.

A full database backup only has to backup enough transaction log to allow the database to be restored and recovered to a transactionally consistent point. In other words, it only requires the transaction log back to the beginning of the oldest active transaction at the point that the data-reading section of the full backup completes. This is a source of immense confusion - many people don't believe that a full (or differential) backup needs to also backup some transaction log. For a more in-depth study of this, see my previous blog posts Debunking a couple of myths around full database backups and More on how much transaction log a full backup includes.

A DBCC CHECKDB operation uses a database snapshot to get a transactionally consistent view of the database on which to run consistency checks. When the database snapshot is created, crash recovery is run on it to make it transactionally consistent. That requires the same amount of log as if a full backup was taken - back to the beginning of the oldest active transaction at the time the database snapshot is created. See CHECKDB From Every Angle: Complete description of all CHECKDB stages for more info.

So - it's entirely possible for the situation reported above to exist. The question then becomes, how to recover from it?

Assuming that the database files are intact, there is a simple solution. This solution will break the log backup chain, but given that the log is corrupt so a log backup cannot be taken, the log backup chain is *already* broken. Here's what to do:

  1. Stop all user activity in the database
  2. Switch to the SIMPLE recovery model (breaking the log backup chain and removing the requirement that the damaged portion of log must be backed up)
  3. Switch to the FULL recovery model
  4. Take a full database backup (thus starting a new log backup chain)
  5. Start taking log backups

You might want to manually shrink and grow the log file in between steps 2 and 3 too - in case the log file is on a damaged portion of disk - or maybe even shrink it right down and add another log file on an undamaged disk. You also will need to do some root-cause analysis to determine why the corruption occured in the first place, and to take preventative measures to stop it happening again.

Hope this helps

PS In my previous post, Testing a new survey method: backup validation, the answer with the largest number of responses so far is that people never verify their backups - very disturbing!

We're doing some research into database settings that people use and the best way to get a good variety of systems is to ask you to send us some data. We're interested in whether people are using the defaults or changing them in any way. If your're interested, run this query (in the context of any database, doesn't matter which) with text output:

SELECT d.*, mf.* FROM sys.databases d
CROSS APPLY sys.master_files mf
WHERE mf.database_id = d.database_id;

And send me the results in email by clicking here: mailto:paul@sqlskills.com?Subject=Feedback. Dumping into a .xls works and my email won't filter it out.

I'll aggregate the results and post some articles. As payment for your info, if I see anything in your settings that I recommend changing, I'll let you know.

Thanks in advance for any data you send!

[Edit: PS - of course everything will be totally anonymous and only aggregated data will be discussed on the blog - by all means remove database names if you want]

Back in 2005 Kimberly produced two very popular webcast series - an 11-part webcast series for TechNet called SQL Server 2005 for the IT Professional and a 10-part webcast series for MSDN called A Primer for Proper SQL Server Development. The webcast links and blog posts were broken for quite a while but now they're all fixed up and working again. I've created some web pages that link to all the webcasts and blog posts, along with abstracts. I've also included some more recent ones too and will be adding to the list over the next few weeks.

There's over 30 hours of good stuff to watch - check them out at http://www.sqlskills.com/webcasts.asp

In my previous post today I talked about error 825 and how you should have an Agent alert set up to catch it when it happens. There are two ways you can set up Agent alerts - using SSMS or using the stored-proc sp_add_alert. In a post from April 2008 I showed how to create an Operator and alerts using the SSMS wizards (see that post here). To use the SP to create the alert for message 825, the code would be as follows:

USE msdb;
GO

EXEC msdb.dbo.sp_add_alert @name = N'825 - Read-Retry Required',
    @message_id = 825,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1,
    @category_name = N'IO Subsystem Error';
GO

Note the @include_event_description parameter. This is a tinyint that says which of the notification mechanisms should include the description. The list of possibilities is:

  • 0 - None
  • 1 - Email
  • 2 - Pager
  • 4 - Net send

So my value of 1 only includes the description in email, even if I've set up the Operator to receive email and a net send. You can get more info about this SP from the following link: http://msdn.microsoft.com/en-us/library/ms189531.aspx.

As to which alerts you should have defined, my advice would be to have one for message 825 and a separate alert for each severity 19, 20, 21, 22, 23, 24, and 25. These are the high-severity errors and are also written to the error log. You can get a complete description of all error severities at http://msdn.microsoft.com/en-us/library/ms164086.aspx. Note that if you specify @severity, the @message_id must be 0 (and vice-versa).

Happy alerting!

PS And of course you'll need to use the SP sp_add_notification (see http://technet.microsoft.com/en-us/library/ms173843.aspx) to hook the alert to an Operator. Thanks Tibor (who just blogged about how to determine which errors to alert on) - got distracted by email while writing this and forgot to include it!

There are two pretty well-known I/O errors - 823, and 824 - but there's also one called 825 which most DBAs do*not* know about, and definitely should.

From SQL Server 2005 onwards, if you ever see an 823 or 824, SQL Server has actually tried that I/O a total of 4 times before it finally declares a lost cause and surfaces the high-severity I/O error to the connection's console, killing the connection into the bargain. The idea behind this read-retry logic came from Exchange, where adding the logic reduced the amount of immediate downtime that customers experienced. While in concept this was something I agreed with at the time, I didn't agree with the way it was implemented.

If the I/O continues to fail, then the 823/824 is surfaced - that's fine. But what if the I/O succeeds on one of the retries? No high-severity error is raised, and the query completes, blissfully unaware that anything untoward happened. However, something *did* go badly wrong - the I/O subsystem failed to read 8KB of data correctly until the read was attempted again. Basically, the I/O subsystem had a problem, which luckily wasn't fatal *this time*. And that's what I don't like - the I/O subsystem went wrong but there are no flashing lights and alarm bells that fire for the DBA, as with an 823 or 824. If read-retry is required to get a read to complete, the only notification of this is a severity-10 informational message in the error log - error 825. It looks like this:

Msg 825, Level 10, State 2, Line 1.
A read of the file ‘D:\SQLskills\TestReadRetry.mdf’ at offset 0×0000017653C000 succeeded after failing 2 time(s) with error: incorrect checksum (expected: 0×4a224f20; actual: 0×2216ee12). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Now, what this message is really saying is that your I/O subsystem is going wrong and you must do something about it. And unless you're regularly scanning the error log looking for these, you'll be none-the-wiser.

So - my recommendation is that you add a specific Agent alert for error 825, along with your other alerts (see following blog post).

One of the things I mentioned in my recent TechNet Magazine article on Understanding Logging and Recovery was the need to manage the transaction log so the number of VLFs (virtual log files) does not get too large (with too large being more than, say, 100 VLFs). Linchi Shea (a fellow MVP) has just posted some performance numbers comparing insert/update/delete performance between a database with 16 VLFs and one with 20000 VLFs, representing a poorly managed transaction log. You can see his blog post here.

For those wishing to take charge of unruly transaction logs, checkout Kimberly's blog posts - starting with this one. And if you want to quickly know how many VLFs your transaction log has, use the undocumented DBCC LOGINFO command - the number of lines of output is the number of VLFs you have.

Next posts coming up will be photos - we're in Bangkok right now and I've got 3 Where In The World Are Paul and Kimberly posts queued up - St. Lucia, Hyderabad, and Bangkok.

OK - last content post today. I forgot that the February TechNet Magazine also has the latest edition of my regular SQL Q&A column. This month's column covers:

  • Should backup compression be enabled at the instance level?
  • Client redirection during database mirroring failovers
  • Partition-level lock escalation in SQL Server 2008
  • Is it ever safe to rebuild a transaction log?

Check out the column at http://technet.microsoft.com/en-us/magazine/2009.02.sqlqa.aspx

Wow - today is all about new content. As if I haven't already blogged about enough stuff to keep you reading through next week, the February issue of TechNet Magazine is now available and contains a feature article I wrote about understanding how logging and recovery work inside SQL Server.

The article covers:

  • What is logging?
  • What is recovery?
  • The transaction log (include logical and physical architecture)
  • Recovery models and how they affect the behavior of the transaction log

There's also a ten-minute screencast video where I demonstrate a runaway transaction log.

Check it out at http://technet.microsoft.com/en-us/magazine/2009.02.logging.aspx.

Well, we're back from vacation finally (only for 10 days and then off to India and Thailand for 3 weeks... phew) and I've got a bunch of blogs posts to catch up on. First up - I did an interview with TechNet Radio in mid-December where I talked about database corruption and things to do to recover from it - similar to the conference sessions I've done but a little higher level.

The links for the interview are:

Enjoy!

Here's a question I got from someone who attended our database maintenance workshop at PASS last week (paraphrased):

I attended your pre-conference session on database maintenance and found it to be very informative.  From what you told use though, I think I need to change my nightly backup procedure.  I like to get my databases back to as small of a size as possible before backing them up, so I run the following commands to do this before taking the full database backup.  Could you help me with a better way of doing this? We're on SQL Server 2005.

BACKUP LOG <mydbname> WITH NO_LOG

DBCC SHRINKDATABASE (<mydbname>)

And here's the answer I sent back:

How large is the database? And how long must you keep the backups around? If the cumulative size of the backups takes up a large proportion of your available storage space (and we're talking more than just a single direct-attached 100+GB drive), then it may be worth compressing the backups - otherwise you're likely causing yourself more trouble than its worth.
 
By doing BACKUP LOG WITH NO_LOG you're effectively throwing away log records and removing the possibility of doing any kind of point-in-time, or up-to-the-second recovery (see BACKUP LOG WITH NO_LOG - use, abuse, and undocumented trace flags to stop it). If you're running in the FULL recovery model, and you don't care about either of these features, then you should switch to the SIMPLE recovery model. If you really want to be in FULL, don't ever use WITH NO_LOG.

The amount of transaction log that a full backup requires cannot be changed by you truncating the log. The full backup will backup any log it requires to enable the restored database to be a transctionally consistent copy of the database. See Debunking a couple of myths around full database backups and More on how much transaction log a full backup includes.

Doing a DBCC SHRINKDATABASE (the same exact operation as a database auto-shrink) will cause massive index fragmentation, and cause file-system fragmentation of the data files, as they will likely need to grow again after you've squeezed all the space out of them. See Auto-shrink - turn it OFF! for more details on the effects.

If you're really concerned about backup sizes and space is at a premium, I recommend using a 3rd-party backup compression tool such as LiteSpeed or HyperBac so you're not affecting the actual database. Remember also that SQL Server 2008 has native backup compression too - see my blog post here for more details.

Hope this helps

I got bored on the first leg of the journey from Seattle to London so thought I’d bang out a quick blog post.

After my previous post on checking whether a SQL instance is able to use instant initialization (see here), I had a discussion with Scott R., who regularly comments on blog articles. He proposed an alternative method of checking whether the SQL service account has the Perform volume maintenance tasks privilege (AKA SeManageVolumePrivilege) (or other useful privileges like Lock pages in memory – SeLockMemoryPrivilege) and outlined the pros and cons of the various methods. With his permission, I’ve turned the information he provided into this blog post.

To use whoami /priv to find the SQL service account privileges you need to enable xp_cmdshell to do it from within SQL, or be logged into the Windows box as the SQL service account. Scott suggested using an alternative tool called AccessChk, written by my friend Mark Russinovich (formerly of Sysinternals, now a Technical Fellow at Microsoft). Using it you can find the privileges assigned to other users, services, or processes. The example syntax to do this for a SQL instance would be:

AccessChk –p sqlservr.exe –f –q -v

This allows you to find the privileges of the SQL service account without having to enable xp_cmdshell or login as the service account itself.

Let’s compare the whoami + xp_cmdshell combination with the AccessChk + command window combination.

Whoami + xp_cmdshell

Advantages:

  • You don’t need access to a command window.
  • You don’t need an administrator-capable user account.
  • Whoami may already be installed on a given Windosw server.

Disadvantages:

  • You need to enable xp_cmdshell (for a brief period – and have the authority to do so) to run whoami, which is often a separate security issue in many IT organizations.
  • Whoami can’t be run from a separate command window, because it can’t get the privileges from any user account other than the one which is running the command.  Since the goal is to get the privileges of the SQL service account, and that service account is often locked down from unnecessary privileges (such as interactive logon, using command windows, etc.), whoami can’t easily be used to derive this information outside the context of running from SQL Server via xp_cmdshell.
  • Whoami may not already be installed on a given Windows server.  This could be a barrier to using it for this purpose in IT organizations with rigid program install guidelines.

AccessChk + command window

Advantages:

  • You don’t need to enable xp_cmdshell, bypassing this separate security issue.
  • You don’t need to use the SQL service account as AccessChk can report on privileges of other running processes (and their implied user account) or other explicit user accounts, without requiring the security context of the desired user account to run AccessChk.
  • This approach may be used to automate capture of such information for configuration management and reporting purposes.

Disadvantages:

  • You need access to a command window on the Windows server.
  • You may need an administrator-capable user account (which you may not have) – I am not sure of this requirement, but it may be the case.
  • AccessChk is most likely not already installed on a given Windows server.  This could be a barrier to using it for this purpose in IT organizations with rigid program install guidelines, just like for whoami.

Summary

So is there a recommendation here? No, just a quick examination of the trade-offs with the two methods. Once again, I recommend you grant this privilege to the SQL service account for the massive performance boost it gives with data file creation or growth operations.

This is one I've been trying to get to since I started blogging a couple of years ago: how to re-attach a detached suspect database. This is a pretty common scenario I see on the forums - a database goes suspect so the DBA tries to detach/attach, which fails. I wrote a demo for my corruption session at TechEd this year that shows how to create a suspect database with hex editor, then detaches it and shows how to re-attach and fix it. It's going to be a long blog post, but I've got everything wrapped up before we fly out to the UK tomorrow so I've got a bit of spare time.

Creating a Suspect Database

First off I'm going to create a simple database to use, called DemoSuspect with a table and some random data.

USE MASTER
GO

CREATE DATABASE DemoSuspect
GO

USE DemoSuspect;
GO

CREATE TABLE Employees (FirstName VARCHAR (20), LastName VARCHAR (20), YearlyBonus INT);
GO
INSERT INTO Employees VALUES ('Paul', 'Randal', 10000);
INSERT INTO Employees VALUES ('Kimberly', 'Tripp', 10000);
GO

Now I'll perform an update in an explicit transaction and force it to be written out to disk with a CHECKPOINT. I've accidentally deleted Kimberly's bonus!

-- Simulate an in-flight transaction
BEGIN TRAN;
UPDATE Employees SET YearlyBonus = 0 WHERE LastName = 'Tripp';
GO

CHECKPOINT;
GO

Then in another window, I'll simulate a crash using:

SHUTDOWN WITH NOWAIT;
GO

Now that SQL Server is shutdown, I'm going to simulate an I/O failure that corrupts the log file. I'm going to use a hex editor to do this - my editor of choice is the immensely popular and useful XVI32, written by Christian Maas. I opened the log file, filled the first section with zeroes, and then saved it again. See the screenshot below.

When I startup SQL Server again, it will try to run recovery on the DemoSuspect database and fail. This will put the database into the SUSPECT state.

So I restarted SQL Server, let's try getting in to the DemoSuspect database.

USE DemoSuspect;
GO

Msg 945, Level 14, State 2, Line 1
Database 'DemoSuspect' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Now let's check the database status:

SELECT DATABASEPROPERTYEX ('DemoSuspect', 'STATUS') AS 'Status';
GO

Status
--------

SUSPECT

Now at this point, the correct procedure is to restore from backups. If there are no backups available, then the next best thing is to get the database into EMERGENCY mode and extract as much data as possible, or run emergency-mode repair. However, I'm going to try the detach/attach route instead.

Detaching the Database

I'll try detaching the database using sp_detach_db:

EXEC sp_detach_db 'DemoSuspect';
GO

Msg 947, Level 16, State 1, Line 1
Error while closing database 'DemoSuspect'. Check for previous additional errors and retry the operation.

Hmm - did it work or didn't it?

SELECT * FROM sys.databases WHERE NAME = 'DemoSuspect';
GO

And there are no results, so the detach must have succeeded.

Re-attaching the Database

Let's try the obvious sp_attach_db:

EXEC sp_attach_db @dbname = N'DemoSuspect', 
   
@filename1 = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf', 
   
@filename2 = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf';
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf' is not a valid database file header. The PageAudit property is incorrect.

Hmm. How about using the ATTACH_REBUILD_LOG option on CREATE DATABASE? That should create a new log file for me:

CREATE DATABASE DemoSuspect ON
   (NAME = DemoSuspect, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf')
FOR ATTACH_REBUILD_LOG;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf' is not a valid database file header. The PageAudit property is incorrect.
File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 1813, Level 16, State 2, Line 1
Could not open new database 'DemoSuspect'. CREATE DATABASE is aborted.

Hmm. The database knows that there was an active transaction. Using the ATTACH_REBUILD_LOG command only works if the database was cleanly shut down and the log is missing. What about if I actually remove the log file and try that - maybe I could fool it? I took a copy of the data and log file, and then deleted the original log file (this is getting more and more dodgy...). Let's try again:

CREATE DATABASE DemoSuspect ON
   
(NAME = DemoSuspect, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf')
FOR ATTACH_REBUILD_LOG;
GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 1813, Level 16, State 2, Line 1
Could not open new database 'DemoSuspect'. CREATE DATABASE is aborted.

Well, that was a long-shot but SQL Server's smarter.

Basically the problem is that the database wasn't cleanly shutdown, which means that recovery HAS to run and complete before the database can be attached again. Given that our log file is corrupt, that's impossible. So, never detach a suspect database.

The only way to get the database back into SQL Server is to use a hack. I'm going to create a new dummy database with the exact same file layout and as close as possible to the file sizes of the detached database. Then I'm going to shutdown SQL Server, swap in the corrupt database files, and re-start SQL Server. If all goes well, the corrupt, suspect database will be attached again.

The one major downside of this is that if the SQL Server instance doesn't have instant initialization enabled (see How to tell if you have instant initialization enabled?), then creating the dummy database could take a long time if the data files are very big. This means that your application is offline while the files are created and zero'd out.

So, I've already got a copy of the corrupt database files, so now I need to delete the data file too. Before doing this you want to make absolutely sure you've got multiple copies of the corrupt database files... just in case. After deleting the data file, I can create my dummy database.

CREATE DATABASE DemoSuspect
GO

If you forget to delete the existing corrupt files first, you'll get the following error:

Msg 1802, Level 16, State 4, Line 1
CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
Msg 5170, Level 16, State 1, Line 1
Cannot create file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf' because it already exists. Change the file path or the file name, and retry the operation.

Ok - delete the data file and try again. Now I need to check the dummy database's files are there (at this point I'm totally paranoid), shutdown SQL Server, delete the dummy databases' files (this just seems wrong!), and swap in the corrupt files. I took another copy of the corrupt files before swapping them back in, just in case something goes wrong.

After restarting SQL Server, I can check the database state:

SELECT DATABASEPROPERTYEX ('DemoSuspect', 'STATUS') AS 'Status';
GO

Status
--------
SUSPECT

Woo-hoo - I'm back to having a suspect database attached again - after having to shutdown the server and mess about deleting and copying files around. Not good. Now I can actually fix it.

Fixing the Database

If you don't have any backups, then the only way to get into the database is to use EMERGENCY mode. This lets you into the database but you need to be aware that recovery has not completed so the contents of the database are transactionally (and possibly structurally) inconsistent. I'm going to choose to repair the database using emergency-mode repair. See CHECKDB From Every Angle: EMERGENCY mode repair - the very, very last resort for a detailed description of this tool.

ALTER DATABASE DemoSuspect SET EMERGENCY;
GO
ALTER DATABASE DemoSuspect SET SINGLE_USER;
GO
DBCC CHECKDB (DemoSuspect, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\DemoSuspect_log.LDF' is not a valid database file header. The PageAudit property is incorrect.
File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\DemoSuspect_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
The Service Broker in database "DemoSuspect" will be disabled because the Service Broker GUID in the database (9E879BFC-B742-4A69-AB14-4D6BD6F99E02) does not match the one in sys.databases (B4568D23-7018-40CF-B189-9C29DE697C09).
Warning: The log for database 'DemoSuspect' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

First off it tries to do the regular ATTACH_REBUILD_LOG. When that fails, DBCC CHECKDB takes over and forces the log to be rebuilt, after trying to force as much of the damaged log to be processed for recovery as it can. It then runs a full repair, in case there's anything corrupt in the database - in this case there isn't so there are no corruption messages in the output.

Notice the line about the Service Broker GUID being wrong. I had to use the hack method to get the database attached again, but when I created the dummy database, it created a Service Broker GUID for the DemoSuspect database in master.sys.databases. When I swapped in the corrupt database, it has a different GUID - so now I can't use Service Broker! And all because I detached the suspect database instead of repairing it.

So what's the state of the data?

USE DemoSuspect;
GO

SELECT * FROM Employees;
GO

FirstName   LastName   YearlyBonus
----------  ---------  ------------
Paul        Randal     10000
Kimberly    Tripp      0

Kimberly doesn't get a bonus this year - she won't be happy! This is contrived and flippant, of course, but it illustrates the point that after doing an emergency-mode repair, transactions that were active at the time the log was damaged will not get a chance to roll-back, most likely. In this case, I know what was going on when the crash occured, but what about on a busy OLTP system with hundreds or thousands of active transactions? What state will the data be in?

Summary

Yes, you can recover from a detached suspect database, but it's not pretty and you have to be very careful. The best course of action is always to have a comprehensive backup strategy that allows you to restore as quickly as possible. If you do have a suspect database and no backups, use EMERGENCY mode to access and/or repair the database. Hopefully this article will help people that find themselves in these situations.

On Tuesday I recorded a couple of sessions with Richard and Greg on RunAs Radio. It's been a while since either Kimberly or I have been on the show - last November at TechEd in Barcelona. In the first session we discussed what it is to be an "involuntary DBA" - someone who's thrown into the DBA role with no training. This goes along nicely with the TechNet Magazine article I wrote for the August 2008 issue on Effective Database Maintenance (for the involuntary DBA). Kimberly also pops in from time to time when she's not coughing, we make fun of her, and I describe one of my favorite Calvin and Hobbes cartoons around collecting spit in a jar - all the usual silliness.

The show is 35 minutes long and you can download it at http://www.runasradio.com/default.aspx?showNum=72.

Enjoy!

Well, we're just back from vacation (photo blog post to follow) and I've heard that the feature article on Effective Database Maintenance I wrote for the August issue of TechNet Magazine is live on the web. It also includes a 5 minute long screencast I recorded where I demo the effect of database shrink on index fragmentation.

You can get to the article at http://technet.microsoft.com/en-us/magazine/cc671165.aspx. The topics covered are:

  • Managing data and transaction log files
  • Eliminating index fragmentation
  • Ensuring accurate, up-to-date statistics
  • Detected corrupted database pages
  • Establishing an effective backup strategy

It's written around 2-300 level and presents a good overview (well, at least I think so :-)) of the concepts involved.

Also, the August SQL Q&A column is available at http://technet.microsoft.com/en-us/magazine/cc671180(TechNet.10).aspx. This month's topics on the web (more in the print magazine) are:

  • Database version changes with upgrades
  • Benefits of partitioning
  • Consistency checking options for VLDBs

Enjoy!

Theme design by Nukeation based on Jelle Druyts