Not much to do here in Tokyo while waiting for the Bangkok flight except read and blog, and I've already done enough reading for one day on the flight here - so that leaves blogging!

Here's an interesting case that I got involved in on SQLServerCentral (here's the original thread, linked here with permission of the original poster). The problem was that every night the maintenance job would run and end up producing a stack dump, from a 211 error (which always indicates metadata corruption causing the Engine to trip over). I guessed system table corruption but the consistency checking portion of the maintenance job was apparently running fine. So I asked for one of the stack dumps, the output from the maintenance plan, and the latest SQL error log to have a look at (I have a weakness for these things, as you well know).

After having a look, I saw some strange behavior in the error log - every night there 10 or so DBCC CHECKDBs of master within about 2 minutes - corresponding to the consistency checking part of the maintenance job that was clearing running DBCC CHECKDB against all the system and user databases. Then I remembered that vanilla 2005 SP2 has two nasty maintenance plan bugs - one of which caused the consistency checking part of the maintenance plan to malfunction when run, and run DBCC CHECKDB against master every time instead of the database specified. Fixing that would then lead to the being able to properly run consistency checks and find the corrupt database.

So - moral of the story is to make sure you're not running vanilla 2005 SP2 (or one of the builds close to it, 3150 to 3158) otherwise you could get into problems with maintenance plans - worse still, you may think you've been running consistency checks all this time on your user databases, but if no-one's checking the SQL error logs, you could just be repeatedly consistency checking master.

The KB that describes the bug is 934459 (available at http://support.microsoft.com/kb/934459), or you can install SP3 (available at http://support.microsoft.com/?kbid=955706).

Just saw this on a forum - running REPAIR_ALLOW_DATA_LOSS on msdb to cope with corruption. Yes, this will work but it's certainly not safe and something I'd only recommend as a last resort.

Here's why: msdb stores your backup history and your Agent jobs. REPAIR_ALLOW_DATA_LOSS works by usually "delete what's broken and fix up all the links" (quoting myself). If you run repair on msdb, you need to pay close attention to what was deleted by repair, as it could lead to big problems later on. Some examples:

  1. What if repair deletes a record which just happened to store the Agent job that runs the transaction log backups for your main production database? Suddenly your log isn't being backed up and you don't know about it. The log starts to grow and eventually runs out of space. The database stops and your application is down until you figure out what's wrong.
  2. What if repair deletes a record which just happened to store the details of a log backup of the production database? Your have a disaster and run the script that looks through the backup history tables and auto-generates RESTORE statements to get you up and running again with up-to-the minute recovery. Because of the missing record, there's a missing RESTORE LOG statement in the middle of the restores of the log-backup-chain for the production database. So the restores fail, and you're down until you figure out what's wrong.

Bottom line, it's not safe to run repair on msdb - proceed with extreme caution. Not that repair won't work, or will cause corruption, but that it's far better to have backups of msdb that you can restore from.

In my previous post today I talked about error 825 and how you should have an Agent alert set up to catch it when it happens. There are two ways you can set up Agent alerts - using SSMS or using the stored-proc sp_add_alert. In a post from April 2008 I showed how to create an Operator and alerts using the SSMS wizards (see that post here). To use the SP to create the alert for message 825, the code would be as follows:

USE msdb;
GO

EXEC msdb.dbo.sp_add_alert @name = N'825 - Read-Retry Required',
    @message_id = 825,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1,
    @category_name = N'IO Subsystem Error';
GO

Note the @include_event_description parameter. This is a tinyint that says which of the notification mechanisms should include the description. The list of possibilities is:

  • 0 - None
  • 1 - Email
  • 2 - Pager
  • 4 - Net send

So my value of 1 only includes the description in email, even if I've set up the Operator to receive email and a net send. You can get more info about this SP from the following link: http://msdn.microsoft.com/en-us/library/ms189531.aspx.

As to which alerts you should have defined, my advice would be to have one for message 825 and a separate alert for each severity 19, 20, 21, 22, 23, 24, and 25. These are the high-severity errors and are also written to the error log. You can get a complete description of all error severities at http://msdn.microsoft.com/en-us/library/ms164086.aspx. Note that if you specify @severity, the @message_id must be 0 (and vice-versa).

Happy alerting!

PS And of course you'll need to use the SP sp_add_notification (see http://technet.microsoft.com/en-us/library/ms173843.aspx) to hook the alert to an Operator. Thanks Tibor (who just blogged about how to determine which errors to alert on) - got distracted by email while writing this and forgot to include it!

There are two pretty well-known I/O errors - 823, and 824 - but there's also one called 825 which most DBAs do*not* know about, and definitely should.

From SQL Server 2005 onwards, if you ever see an 823 or 824, SQL Server has actually tried that I/O a total of 4 times before it finally declares a lost cause and surfaces the high-severity I/O error to the connection's console, killing the connection into the bargain. The idea behind this read-retry logic came from Exchange, where adding the logic reduced the amount of immediate downtime that customers experienced. While in concept this was something I agreed with at the time, I didn't agree with the way it was implemented.

If the I/O continues to fail, then the 823/824 is surfaced - that's fine. But what if the I/O succeeds on one of the retries? No high-severity error is raised, and the query completes, blissfully unaware that anything untoward happened. However, something *did* go badly wrong - the I/O subsystem failed to read 8KB of data correctly until the read was attempted again. Basically, the I/O subsystem had a problem, which luckily wasn't fatal *this time*. And that's what I don't like - the I/O subsystem went wrong but there are no flashing lights and alarm bells that fire for the DBA, as with an 823 or 824. If read-retry is required to get a read to complete, the only notification of this is a severity-10 informational message in the error log - error 825. It looks like this:

Msg 825, Level 10, State 2, Line 1.
A read of the file ‘D:\SQLskills\TestReadRetry.mdf’ at offset 0×0000017653C000 succeeded after failing 2 time(s) with error: incorrect checksum (expected: 0×4a224f20; actual: 0×2216ee12). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Now, what this message is really saying is that your I/O subsystem is going wrong and you must do something about it. And unless you're regularly scanning the error log looking for these, you'll be none-the-wiser.

So - my recommendation is that you add a specific Agent alert for error 825, along with your other alerts (see following blog post).

Many times I've been asked to do a blog post about creating Agent alerts, and given that today I demo'd it as part of our Accidental DBA workshop at Connections, it seemed a good time to do the blog post too!

I demo this in the context of alerting a DBA when an 823 or 824 IO error occurs. One of my early blog posts (see here) explains what these are, as well as providing a corrupt database that you can use to see these errors happening.

The idea is that I want to know as soon as an IO error occurs so I can start recovering and take preventative action to stop it happening again. I don't want to rely on users telling me when a query hits an IO error, and I don't want to have to scan the SQL error logs to find them. So I'm going to create an alert.

The first step is to fire up Management Studio and make sure SQL Server Agent is running. Next we need to make there's actually an Operator defined - so the new alert has someone to actually alert! - so we'll use the New Operator wizard (see below for how to get there).

In the New Operator Wizard that appears, I've created an operator named 'SysAdmin'. There are three Notification Options you can use - email, net send, and pager. I've setup SysAdmin to use net send to my local machine. You need to make sure the Messenger service is enabled otherwise net send will not work. Also, be aware the net sends will NOT work unless the machine has a network connection - even if the net send source and destination are the same machine! Given the various issues with net send, it's better to use email or pager alerts - but for the purposes of this blog post its the easiest option.

Now let's create the new alert - using the New Alert wizard (see the below for how to get there).

In the New Alert Wizard that appears, I've created an alert named 'IO Errors' for all severity 24 errors on all databases. Below is a portion of the General tab of the wizard showing these settings:

I also need to specify what happens. In the Response tab of the wizard I've set the SysAdmin operator to be notified using net send. Again, see below.

In the Options tab I've checked the box to include the error text in the net send.

Now let's test it. Using the database called 'broken' that I provide as an example (see here), I'll force an IO error to occur. In my query window I get:

SELECT * from broken..brokentable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 10 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

And a few seconds later I get the net send:

Pretty cool!

You can also use the WMI Provider to do this - see Creating a SQL Server Agent Alert by Using the WMI Provider for Server Events.

This came up several times during the week so I thought it was about time to blog about it. One of the new features we put into SQL Server 2005 was storing the last time that DBCC CHECKDB completed successfully (called the last-known good time). What does successfully mean? This is the catch - if DBCC CHECKDB runs to completion then it considers that a successful run - EVEN if it found some corruptions. However, the last-known good time is ONLY updated if there were NO corruptions found. Slightly confusing I know.

Cool - but how can you see it? Well, the only time it's ever reported is when the database starts up. Not too useful if the database has been running for months. So how to see it??? The trick is to use DBCC PAGE. The last-known good time is stored in the boot page of the database - page 9. The following code will dump page 9 for you:

-- you need this to get the DBCC PAGE output to your console
DBCC TRACEON (3604);
GO

-- page 9 is the boot page
DBCC PAGE (dbname, 1, 9, 3);
GO

You need to look for the dbi_dbccLastKnownGood field. That was the last time that DBCC CHECKDB ran without finding any corruptions.

Now - what about if you're trusting your SQL Agent jobs to run DBCC CHECKDB for you every so often (how often is a whole other discussion...) and relying on the Agent job failing if DBCC CHECKDB finds corruptions. Well, if all you do is run the DBCC command, you're never going to know about corruption unless DBCC CHECKDB itself fails for some reason - remember, it returns successfully even it it found corruptions.The key is to add another statement after you run DBCC that checks the value of @@ERROR (the last error code reported by SQL Server). Although DBCC CHECKDB doesn't stop with an error when it finds corruption, it will set @@ERROR if there are any error messages in its output.

Edit: The stuff I've struck-through above is not true. I thought I'd remembered it from testing previously and I've heard plenty of anecdotal evidence from customers too BUT an Agent job WILL fail if a DBCC CHECKDB within it fails. I went back and forth with Tara Kizer on SQLteam.com and we tested on 2005 and 2000 - and it worked. Now here is a real issue - the Job History that's captured will not contain all the output from DBCC CHECKDB - especially if you didn't use the WITH NO_INFOMSGS option. You should make sure you capture the output to a file to avoid having to go back and run DBCC CHECKDB all over again.

Summary - make sure you really know when DBCC CHECKDB runs successfully, without finding any corruptions!

It's common knowledge that SQL Server copes with daylight savings time (DST) correctly so why should you care?

Well, it's not so common knowledge that at the end of DST when the clocks go back an hour (always at 02:00 in the U.S.), SQL Agent essentially pauses for an hour (in at least SS2000 onwards). This means that if you have a job that's doing something every 15 minutes, there will be a gap of 75 minutes between the job execution at 01:45 and the job execution at 02:00. This happens because at 02:00, the time is set back to 01:00 but the next run time of all the jobs remains the same - so your job cannot execute until it's next scheduled time of 02:00. So, in the northern hemisphere every Fall, and in the southern hemisphere every Spring, you lose an hour's worth of SQL Agent jobs. Still, why should you care?

Well, it depends what the jobs are that get delayed by an hour. If you have a job that takes a log backup every 15 mins then on the day DST ends, there's actually a gap of 75 minutes between log backups. If you have a Service Level Agreement (SLA) that limits the maximum amount of lost work to 15 minutes in the event of a disaster, then for those 75 minutes you're exposed to potentially not being able to meet that SLA!

That could be a pretty big deal, especially if something goes wrong during that hour (no more or less likely than something going wrong at any other time, but still possible). In that case, you need to come up with an alternative solution. A couple of ways to get around the problem I can think of:

  • Have someone stay up late during that hour and take manual log backups.
  • Switch over to database mirroring, which continually streams the log to the redundant server and so isn't affected the DST issue.

Both of these are viable solutions but I think the best one is to create a SQL Agent job that runs at 01:59 and creates extra backup jobs to run at 01:00, 01:15, 01:30, and 01:45. I don't see why this shouldn't be possible. At 10:36 this morning I created a simple agent job to print the date to a file and set it to execute at 09:40 - in the past. I then set my system time back one hour and the job executed perfectly. The only downside of this solution is that you need to create and schedule the extra jobs using the T-SQL Agent SPs embedded in job steps for your 01:59 job - tedious but not hard. Maybe someone could send me a script and I'll blog it as a follow-on?

So with DST coming to an end on November 4th this is definitely something for you to be aware of even if you don't want to go to the trouble of coping with the extra hour's exposure. As an aside - the dates when DST starts and ends changed this year. KB article 931975 discusses which parts of SQL Server aren't aware of the changed dates and what you can do about it.

Theme design by Nukeation based on Jelle Druyts