Over the last week or so I've been helping the Product Support folks at Microsoft write a new KB article which discusses some of the things you should take into consideration when setting up database mirroring.

It discusses memory, CPU, I/O bandwidth, network bandwidth, and transaction volume. One of the most interesting things it discusses is exactly how many worker threads are required on the principal and mirror for each mirrored database.

Check it out at: http://support.microsoft.com/kb/2001270 (the formatting looks a little weird in the final online version, assuming they'll fix that up).

Enjoy!

The 35-page whitepaper on high availability I wrote for the SQL team over the summer has been published on MSDN. It's a 2-300 level whitepaper that describes the various high-availability technologies in SQL Server 2008 and how they can be used to mitigate disasters. It's chock-full of links to other whitepapers, technical articles and Books Online sections and also presents my methodology for planning a high-availability strategy.

You can get it at http://msdn.microsoft.com/en-us/library/ee523927.aspx.

Enjoy!

Here's the table of contents:

  • Introduction
  • Causes of Downtime and Data Loss
    • Planned Downtime
    • Unplanned Downtime and Data Loss
  • Planning a High-Availability Strategy
    • Requirements
    • Limitations
    • Technology Evaluation
  • SQL Server 2008 High-Availability Technologies
    • Logging and Recovery
    • Backup, Restore, and Related Technologies
      • Partial Database Availability and Online Piecemeal Restore
      • Instant File Initialization
      • Mirrored Backups
      • Backup Checksums
      • Backup Compression
    • Other Single-Instance Technologies
      • Online Operations
      • Database Snapshots
      • Hot-Add Memory and CPU
      • Resource Governor
    • Multi-Instance Technologies
      • Log Shipping
      • Transactional Replication
      • Database Mirroring
      • Failover Clustering
      • Combining Multi-Instance Technologies
      • Virtualization
  • Mitigating the Causes of Downtime and Data Loss
  • High-Availability Features Supported by SQL Server 2008 Editions
  • Conclusion

A few weeks ago I kicked off a survey on how you add geo-redundancy to a failover cluster (see here for the survey). The results as of 8/26/09 are as follows:

 

So why is this interesting? Well, many people will suggest failover clustering as the best way to provide high-availability for a database (or group of databases). And it is a great technology to protect against server failure, but there's only a single copy of the database, which is the Achilles' heel of failover clustering. If that copy of the database is damaged, the application is down unless there's another copy of the database available. This is where providing geo-redundancy comes in. With that in mind, I'm surprised at the percentage of respondents that don't provide any geo-redundancy at all.

There are a bunch of options for providing a redundant copy of a database that is hosted on a failover cluster, with pros and cons to each, and that's what I'm going to spend the rest of this post on.

SAN replication: This is where the SAN hardware itself mirrors all write I/Os to a remote SAN, thus maintaining a remote copy of the database. The hardware has to provide disk-block size and write-order preservation; otherwise the database on the remote SAN could become corrupt. Imagine if write-ordering was not preserved and some data pages write I/Os were completed on the remote SAN before log records write I/Os (thereby breaking the write-ahead logging protocol) - recovery wouldn't be able to work properly! This mechanism requires a remote SAN, a second failover cluster connected to the remote SAN, a network including both clusters, and a big, fat network pipe between the two SANs. The bigness and fatness of the pipe depends, of course, on how many write I/Os are performed on the local SAN, and whether the SAN replication is synchronous or asynchronous.

Synchronous replication requires that the I/O is completed on the remote SAN and acknowledged back to the local SAN before the local I/O can be acknowledged to the local server. If the network bandwidth and latency can't support the volume of write I/Os trying to be replicated to the remote SAN, the I/Os will start to queue up and delays will be incurred on the local server. This will lead to the workload slowing down as SQL Server has to wait longer and longer for I/Os to complete. Now, with synchronous replication you have the guarantee that the remote copy of the database is completely in-sync with the local copy, so if a failure occurs, no committed data will be lost. If the network can't keep up though, you may have to switch to asynchronous replication. This means the local I/Os don't have to wait for the remote I/Os to complete, and so no performance penalty is incurred. BUT as the replication is now asynchronous, committed data may be lost if the local copy of the SAN is damaged.

Apart from the potential for performance problems with SAN replication, it's also very expensive - as another SAN, another cluster, and some beefy network hardware/bandwidth is required. This isn't a technology I'd expect a small company to be using or considering. Finally, the portion of the remote SAN that's being replicated to cannot be accessed at all. On the MAJOR plus side, all databases on the SAN are replicated at once, without having to setup a technology to provide a redundant copy of each. For application ecosystems that include multiple databases, this is what I like to recommend.

Log shipping: This is the simplest way to maintain a redundant copy of the database - it's just backup log, copy, restore log; repeat. It works seamlessly with failover clustering and is really easy to setup and maintain. The only problem with this is that you open yourself up to data loss, as a log shipping secondary is usually not right up-to-date with the primary. You can use the secondary for reporting/querying by restoring the log backup WITH STANDBY (which requires a little more configuration, but not much), and you can protect against accidental data damage by having a secondary with a load-delay configured, so the database is, say, 8 hours behind the primary. In my experience, this is the most common technology that's used in conjunction with failover clustering as it's the cheapest and easiest. On the downside, it's a single database solution so its not suitable for complicated application ecosystems.

Transactional replication: This isn't very commonly used at all, although again, it works seamlessly with failover clustering after a failover. The reason this isn't used very often for geo-redundancy is that transactional replication doesn't provide database-level redundancy, only table-level. It's also much more complicated to setup and troubleshoot when things go wrong, plus there's varying latency between a transaction committing in the publication database and it being applied to the subscription database(s).

Database mirroring: Database mirroring is the only technology apart from SAN replication that can provide a zero data-loss solution when configured for synchronous operation. It works by shipping the log records from a database rather than the raw I/Os, so doesn't require anywhere near as much capital expenditure, but the network has to be able to cope with sending the log generated on the principal, otherwise performance on the principal can be affected. Mirroring is relatively easy to setup and maintain, and the mirror database can only be accessed, but only through a database snapshot. When combined with failover clustering, you need to be careful about setting the mirroring partner timeout, so that the local failover cluster gets a chance to fail over before mirroring does. Checkout my blog post on this: Search Engine Q&A #3: Database mirroring failover types and partner timeouts. You can configure database mirroring for synchronous or asychronous operation, with the same performance and data-loss exposure caveats as SAN replication. SQL Server 2008 provides log stream compression and automatic page repair, which make this more attractive (see SQL Server 2008: Performance boost for Database Mirroring and SQL Server 2008: Automatic Page Repair with Database Mirroring, respectively), but only supports a single database. I'm seeing this combination start to be used more, but again, it's a single database solution so isn't suitable for complicated application ecosystems.

Backups/homebrew: Good old backups can easily be used to provide a very low cost way of maintaining a redundant copy of a database, and if you think about it, this is really do-it-yourself log shipping. At the very least, databases should *always* be included in a backup strategy, no matter what other high-availability technology(s) you may have implemented.

3rd-party solution: There are a few non-Microsoft solutions for providing redundancy with failover clustering which don't involve traditional SAN replication. I'm not an expert in any of them, but I've heard of anecdotal issues with the two I mentioned in the survey and worked with customers who've had real issues with PolyServe (one of which I blogged about).

Summary 

When you're planning a high-availability strategy, you always need to consider the limitations of technologies while evaluating them. The big limitation of failover clustering is that there's no redundant copy of the database so you need to add another technology to provide that. I've just finished writing a 35-page whitepaper for Microsoft on the high-availability technologies in SQL Server 2008, as well as how to go about planning a strategy. It will be published before PASS in November, but in the meantime, this should have given you lots of food for thought.

Next post - the next survey!

I'm moving away from strictly weekly surveys as they're a little demanding, especially when we're on the road.

In this survey I'd like to know which technology you use to provide geo-redundancy to databases that are protected locally using a failover cluster. I'll report on the answers sometime around the middle of August.

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

In last week's survey I asked you two things, as a precursor to a whitepaper I'm writing for Microsoft

The first question was what is your maximum allowable downtime SLA (either for 24x7 operation or not). See here for the survey. Here are the results as of 5/30/09.

 

The Other values were all about not having SLAs defined. And I think that's why this survey had a poor response rate - most of you out there don't have defined SLAs.

If you take a standard 365-day year, which means the year has 524160 minutes. Some example maximum allowable downtimes for 24x7 operations:

  • 99.999% (a.k.a 'five-nines') = slightly over 5 minutes downtime per year
  • 99.99% (a.k.a 'four-nines') = almost 52.5 minutes downtime per year
  • 99.9% (a.k.a 'three-nines') = almost 8.75 hours downtime per year
  • 99% (a.k.a 'two-nines') = just over 3.5 days of downtime per year
  • 98.5% = almost 5.5 days of downtime per year
  • 98% = just over a week of downtime per year
  • 95% = just over 2.5 weeks of downtime per year

You might be thinking - wow - whoever only has a target of 95% uptime must have a pretty crappy setup BUT it totally depends on the application and business requirements. 95% may be absolutely fine for some companies and utterly devastating to others (imagine the revenue loss if Amazon.com was down for more than 2 weeks...)

On the other end of the spectrum, you might be thinking - wow - 99.999% target is completely unattainable for the majority of businesses out there, BUT again it totally depends on the application and business requirements. This may be attainable for a simple application and database that doesn't generate much transaction log, or for a large, very busy OLTP database that generates lots of log but the company has lots of money to throw at the HA problem and can afford redundant clusters in separate data-centers with fat, dark-fiber links between them.

Defining SLAs, and not just for maximum allowable downtime, is incredibly important. When done properly, defining SLAs shows that the technical staff in a company and the business management in a company are in tune. It shows they have extensively analyzed the business requirements of an application and balanced them against the technical, space, power, HVAC, manpower, budgetary, and other limitations which may prevent a higher percentage being set as the target. It provides a meaningful input into the design and architecture of systems, and the choices of technologies (note the plural) required to achieve the target SLAs. It shows the company shareholders that the business managers understand the criticality of applications involved in running the business and that they are taking steps to safeguard the shareholders' equity interests.

Note I said 'when done properly'. Technical staff picking SLAs without business input, or vice-versa is a recipe for disaster. If the technical staff don't understand the business requirements, how can they pick appropriate SLAs? If the business managers don't understand the technical limitations, how can they pick achievable SLAs? Everyone has to be involved. And one of the most important things to consider is whether the SLA covers 24x7 operation, or just the time that the application has to be available, if not 24x7.

How to go about using the SLAs to pick appropriate technologies is one of the things I'll be going into in the whitepaper (due for publication around the end of September).

The second questions was what was your measured uptime (again, either for 24x7 operation or not) over the last year. See here for the survey. Here are the results as of 5/30/09.

 

The Other values for 24x7 were 2 x '96', and 1 x 'it varies across customers'. The Other value for non-24x7 was '92'.

Of course, this would be a little more meaningful if we could correlate target downtime with actual downtime achieved, but the free survey site doesn't reach that level of sophistication (and I don't think people have enough time to get into that much detail) while casually reading a blog post.

Nevertheless, the results are interesting, although not really statistically valid from this small a sample-size. The most interesting data point is that some respondents don't know what they achieved last year, or just didn't measure. I think this is the case for the majority of readers. Having well-defined and appropriate SLAs is the key to defining a workable strategy, and what's the point of defining the SLAs if you don't measure how well you did against them? If you don't meet the target you need to revisit the strategy, the SLAs, or maybe even both.

The fact that many people don't have SLAs speaks to the general poor state of high-availability and disaster-recovery planning in the industry, IMHO. And Kimberly just chimed in stating that it also shows how the PC-based server market doesn't focus anywhere near as much on regimented policies and procedures as the older, more traditional, mainframe market did/does.

There's a lot of work to be done. Many times this stuff just isn't high up in the priority list until a disaster actually happens. Invariably it then becomes very, very important. Get ahead of the curve and be proactive - one of the things we always tell our clients.

Next post - this week's survey!

This week's survey is inspired by this morning's Kimberly+Paul hot-tub conversation around data-dependent routing vs. network load balancing, which then turned to SLAs. Yes, we lead an exciting life Smile.

This survey is a *four*-parter. Part 1 is for each of your databases, which survey option is closest to your target maximum allowable downtime SLA (Service-Level Agreement, a.k.a. RTO - Recovery Time Objective)?

  • Survey 1 is if your SLA is based on 24x7 operation
  • Survey 2 is if your SLA is based on non-24x7 operation (e.g. allowing for a 6 hour daily maintenance window, the rest of the time we must be available 99.9%)

Part 2 is for each of your databases, what can you actually achieve?

  • Survey 3 is your achievement over the last year for 24x7 operation
  • Survey 4 is your achievement over the last year for non-24x7 operation

I'll report on the result sometime during the first week of June - as this is a big survey I'd like to get lots of results. Please repost/Tweet/whatever.

What do I mean by that downtime SLA? Well, however you have it defined. It could be that your SLA is an absolute value based on 24x7 operation, or it could be based on the 9-5 work-day. In all cases, what percentage of the defined time does the system have to be available? Note that I'm calling out a 5, 4, and 3 nines of 24x7 as separate answers as I'd like to see how often these occur and can be met.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

This week's survey is inspired from many stories I saw on the forums and Twitter this week - mostly bad, one good (someone I'm following is spending the weekend testing their disaster recovery plan - cool!). I'd like to know whether you're ever tested your disaster recovery plan, and if so, what happened? I'll report on the results sometime of the weekend of 5/24/09.

And what do I mean by disaster recovery? The definition varies from just restoring a backup to conducting a full failover of a datacenter - but I'd like you to decide for yourself. I'll editorialize more about this next week.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog) if you have an idea for a good survey.

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

PPS Probably a lot less blogging/Twitter (@PaulRandal) this coming week - I'm teaching all 5 days.

Jason Massie posted an interesting statistic yesterday - Facebook has 1.5 petabytes of image storage, and it grows by 25TB daily - I wonder how they store and manage it?

In this week's survey, I'm interested in two things: what's the largest single SQL Server database in your company and how many SQL Server databases are you responsible for managing? I'll report on the results sometimes over the weekend of 5/9/09.

A note on the second survey: if you manage a team of DBAs, but don't manage databases directly yourself, answer for your individual DBAs, rather than the team as a whole.

Thanks!

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

A couple of weeks ago I kicked off the latest survey, on what the physical layout of your databases are and why you have them that way (see here for the survey). I let the survey run for a while to get a good sampling, and I wasn't disappointed, with over 1000 responses! Here are the results as of 4/27/2009.

Just like any other 'best practice' kind of topic, the question of how to design the physical layout of a database provokes a lot of (sometimes heated) discussion. There are lots of options and there are even more factors to consider - so the best answer is my perennial favorite "it depends"! In this post, I don't want to tell you how I think you should layout your database - instead I want to discuss some of the options and let you make up your own mind, with the added benefit of data on what your peers are doing with their databases. The main point of this survey was to see what people are doing, rather than as a driver for an editorial blog post.

As you can clearly see from the results above, and predictably, the distribution of layout types shifts as the database size increases - but I was very surprised by the number of single file databases over 10GB. Rather than go through each option in the survey, I'm going to talk a bit about some of the things to consider when planning a layout.

Underlying I/O subsystem

This could be the most important factor to consider. If you only have a single physical drive, for instance, there's arguably not much point creating multiple data files, as that will force the disk heads to bounce back and forth to the different file locations on the disk. On the other hand, if you have a SAN with several thousand drives grouped together into multiple LUNs, your possibilities are a lot wider (and maybe much harder to come up with the optimal layout). Several people asked if I'd go into depth around having multiple controllers, and different drive layouts in a SAN - and my answer is no. I'm not an expert at storage design, which, like indexing, is both an art and a science. There's a good whitepaper that discusses some of this: Physical Database Storage Design, which I helped review back in my MS days.

Performance, recoverability, manageability

Having multiple files with separate storage for each allows reads and writes to be parallelized for increased performance, lowering the amount of disk head contention. When a checkpoint occurs, and pages are written to disk, spreading the I/O load over multiples files can speed up the checkpoint and reduce the IOPS spikes that you may see. It can also lead to reduced contention for the various allocation bitmaps - in the same way as I've described for tempdb. In user databases with a very high rate of allocations, contention can arise on the GAM pages - but it's not common. Some people also advocate having separate filegroups for tables and indexes, and although this can sometimes be more trouble than it's worth, and often turns into a religious debate, I have heard of people getting a perf boost from this.

One of the most convincing reasons (I find) for having multiple filegroups is the ability to do much more targeted recovery. With a single file database, if it gets corrupted or lost, you have to restore the whole database, no matter how large the file is - and this can seriously affect your ability to recover within the RTO (Recovery Time Objective) agreement. By splitting the database into multiple filegroups, you can make use of partial database availability and online piecemeal restores (in Enterprise Edition) to allow the database to be online as soon as the primary filegroup is online, and then restore the remaining filegroups in priority order - bringing the application online as soon as the relevant filegroups are online. You can even use this layout to spread your backup workload - moving to filegroup-based backups instead of database backups, although this isn't very common.

As far as manageability is concerned, there are a few reasons to have multiple filegroups. Firstly, you can isolate a table that requires a lot of I/O (e.g. in terms of index maintenance) on separate storage from other tables, so that maintenance operations (and the I/O overhead of doing them) doesn't interfere with the I/O of the other tables. Also, you can provision different kinds of storage for different tables - in terms of disk speed and RAID level (redundancy), for instance. If you want to be able to move data around, you can do it much more easily if the database is split up, than if it's a single file.

Summary 

Ok - so I lied. I *am* going to offer advice - against one of the options: single filegroup, single file. For smaller databases, this is fine - but as the database size gets larger, say, over tens of GB, then having a single file can become a serious liability. With a single file database (or even a single filegroup database), you lose most of the benefits mentioned above.

Bottom line - as your databases get larger, you're going to need to think more carefully about their layout, otherwise you could run into big problems as your workload increases or when disaster strikes. As the survey results show, this is what your peers are doing.

Next post - this week's survey!

A short post to start the day (in India) and then at lunchtime I'll do how it works: FILESTREAM garbage collection.

This is a question that came up recently on an MVP discussion list: why does a 200MB log backup take 40 minutes to restore on my log-shipping secondary?

First answer I thought of was that the I/O subsystem on the log-shipping secondary was very slow, or broken in some way, so I/Os were taking ages to complete. Nope - I/O subsystem was performing well.

Next thing I suggested was the the log-shipping secondary was restoring the log backups using WITH STANDBY and there was something like an index rebuild that was being rolled back. As an aside, when you restore log backups on the log-shipping secondary, you have a choice how they recovery: either WITH NORECOVERY or WITH STANDBY. The first option doesn't allow any access to the database, as the database is still "in recovery". The second option runs the REDO part of recovery, then runs the UNDO part of recovery, but saves the details of what it had to do for UNDO into a file (who's name and location you specify). It then allows read-only access to the database, for queries/reporting/whatever. Obviously if there's a lot of transaction log that has to be undone (i.e. rolling back transactions that weren't committed at the time the log backup completed), then this could take some time. But 40 minutes? No. That wasn't it.

I was partly right on my second guess. The *previous* log backup that was restored WITH STANDBY contained a long-running index operation, and so the undo file that the restore created was *huge*. The next log backup that's restored after a RESTORE ... WITH STANDBY, must first undo everything in the undo file (i.e. put the database back to the exact state it was in as if the WITH STANDBY part of the restore never happened) before it can restore more transaction log. In this case, no-one had noticed that the previous log restore *also* took a lot of time and created the huge undo file. It just looked like the 200MB log backup was causing the problem.

My advice: in any kind of there's-suddenly-a-corruption-or-performance-problem situation, don't just look at what's happening now. Always look at what happened leading up to the problem, as the answer is often there.

This week's survey is a little more complicated. I'm interested in the physical layout of your databases. I've got four surveys, for a variety of database sizes. Please vote multiple times in each survey, as you see fit - and by all means forward this link to your friends/clients/etc or re-blog/tweet it. I'm going to report on this survey in two weeks, to give a bit more time for people to respond (and because we're travelling next week). I think we're going to see some interesting statistics come out of this - the more people that respond the better. I'll report on it 4/24/09.

One thing to note - this is just for user databases, not for tempdb. In the surveys, "multiple filegroups" implies multiple files too, and if you don't have them spread exactly one per drive/etc, just choose that option - I only have 10 options to choose from in the free surveys. 

(If you're in the over 1TB range and have multiple files/filesgroups spread over multiple drives/arrays/LUNs, vote using the last option on the >1TB survey and I'll lump them together.) 

Phew - thanks!

Last week I kicked off the first weekly survey - on whether you validate your backups or not (see here for the survey). The results are very interesting (as of 3/13/09):

As you can see, almost 25% of respondents never validate their backups! And a further 25% only validate them occasionally, with 30% doing some kind of regular checks, and only a handful checking all the time.

While these results may seem shocking to you, based on what I've heard when teaching, they're pretty normal. There are lots of reasons why DBAs may choose not to validate backups as often as they should, including:

  • Not enough time to restore the backups to check them
  • Not enough disk space
  • Not part of the day-to-day operations guide
  • Don't see why it's important

Kimberly and I have a saying (well, to be fair, Kimberly coined it): you don't have a backup until you've restored it. You don't know whether the backup you just took was corrupt or not and will actually work in a disaster recovery situation.

Can you ever get a guarantee? No. Here's an analogy, taken from a very old post of mine. Consider Paul, who works for the Seattle Police Department in traffic control. Paul's in a control room somewhere in the city with a large bank of monitors connected to various traffic cameras. Paul's job is to cycle through the cameras every 1/2 hour, looking for traffic accidents. At the end of the 1/2 hour cycle, if Paul han;t seen any accidents then he knows that there are no accidents in the city.

Ah - but hold on. Does Paul really know that? No. All Paul knows is that at the point he looked at a particular camera, there was not an accident at that spot in the city. The very instant he switches to another camera feed, an accident could happen at a spot covered by the previous camera.

The same is true for validating backups. As soon as you've validated a backup, it could then be corrupted by the I/O subsystem, but at least you know that it was valid at some point. But what if that happens, I hear you ask? Well, then you need to have multiple copies of your backups, and you should not rely on backups as the only method of disaster recovery. A good high-availability solution includes as many technologies as you need to mitigate all risks - and backups are just one of those technologies. You're going to have to have some kind of redundant system too that you can fall back on (or mayb even immediately failover to, depending on your particular disaster recovery plan). But, saying that, you can't rely on the redundant server either - if it goes wrong, you'll need your backups.

So - whichever way you look at it, validating backups is a really good practice to get into so you don't get bitten when it comes to the crunch. When I teach, I've got many stories of customers losing data, business, time, and money (and DBAs losing their jobs) because the backups didn't work or were destroyed along with the data. Here's one for you (simplified, and no I won't divulge names etc). Major US investment firm decides to provision new hardware, so takes a backup of the database storing all the 401k accounts for all their customers (private and corporate), flattens the hardware, and goes to restore the backup. The backup is corrupt - on SQL 2000, where there's no RESTORE ... WITH CONTINUE_AFTER_ERROR. What happened? Well, the SQL team and Product Support had to get involved to help get the data back, but people in the firm lost their jobs and it cost a lot of time and money to recover the data. If only they'd had multiple copies of the backup, and tested their backup before removing the database (or better yet, restored the database on the new hardware before flattening the old hardware). They learned a costly lesson, but they did change their practices after that.

Unfortunately this is so often the way - people don't realize they need to validate backups or have an HA plan UNTIL they have a disaster. Then suddenly its the top priority at the company. Being proactive can save a lot of grief, and make you look good when disaster strikes.

Backups can be unusable for a number of reasons, including:

  • The full backup is corrupt, because something in the I/O subsystem corrupted it.
  • A backup in the log backup chain is corrupt, meaning restore cannot continue past that point in the chain.
  • All backups following a full backup are written to the same backup set, but the WITH INIT clause is used accidentally on all backups, meaning the only backup present in the backup set is the last one taken.
  • An out-of-band backup was taken without using the WITH COPY_ONLY clause and the log backup chain was broken (see BACKUP WITH COPY_ONLY - how to avoid breaking the backup chain).
  • The backups worked but the database contained corruption before it was backed up (kind of a separate issue).

The only ones that are out of your control are the first two, but they can be mitigated by having multiples copies of backups. All of these though, can be avoided at disaster recovery time by reguarly restoring your backups as a test of what you'd do if there was a real disaster. You might be surprised what you'd find out...

This is more of an editorial style post than a deep technical or example script post - I'm going to start doing more of these around the weekly surveys. Next post - this week's survey.

Thanks!

Back in 2005 Kimberly produced two very popular webcast series - an 11-part webcast series for TechNet called SQL Server 2005 for the IT Professional and a 10-part webcast series for MSDN called A Primer for Proper SQL Server Development. The webcast links and blog posts were broken for quite a while but now they're all fixed up and working again. I've created some web pages that link to all the webcasts and blog posts, along with abstracts. I've also included some more recent ones too and will be adding to the list over the next few weeks.

There's over 30 hours of good stuff to watch - check them out at http://www.sqlskills.com/webcasts.asp

While trawling through the latest 2008 Books Online this morning to answer a question, I noticed a new section that I hadn't seen before, which explains in detail how to perform a rolling upgrade with database mirroring. The link to the MSDN page is http://msdn.microsoft.com/en-us/library/bb677181.aspx and below I've linked to the flowchart from that page.  

Books Online also has a lot of other "how-to" topics around database mirroring - here are some links:

Kevin Cox of the SQLCAT team also just blogged about a customer upgrade from 2005 to 2008 and some of the issues they faced. Btw - if you're not subscribed to their blog, you definitely should be - lots of cool stuff.

And now the bug. In 2008 RTM, if your database contains full-text then mirroring will not work when you perform a rolling upgrade. This is explained in KB 956017 (with a trace-flag workaround) and KB 957816, which points at 2008 RTM CU1 that has the fix in.

OK - last content post today. I forgot that the February TechNet Magazine also has the latest edition of my regular SQL Q&A column. This month's column covers:

  • Should backup compression be enabled at the instance level?
  • Client redirection during database mirroring failovers
  • Partition-level lock escalation in SQL Server 2008
  • Is it ever safe to rebuild a transaction log?

Check out the column at http://technet.microsoft.com/en-us/magazine/2009.02.sqlqa.aspx

It's really scary how quickly time flies - seems like it was just last week when I last blogged about TechNet Magazine (actually it was a month ago when I blogged about my 2008 Change Tracking article - see here). Anyway, a new issue of TechNet Magazine has just come out and this one has the latest installment of my bi-monthly SQL Q&A column.

This month's topics are:

  • How row-overflow columns can lead to poor range scan performance, even on completely defragmented indexes.
  • Combining database mirroring and failover clustering without undesired failover behavior.
  • Adding differential backups to a full+log backup strategy to lower recovery time.
  • Memory settings for multi-instance failover clustering

You can get to the column online at http://technet.microsoft.com/en-us/magazine/dd228989.aspx.

Enjoy!

PS If you have any ideas for what would make a good SQL Q&A topic, please drop me a line - paul@sqlskills.com

It's been almost exactly a week since the last post - an unusually long time for me. Kimberly and I were teaching the first week of the inaugural Microsoft Certifed Masters SQL course last week in Redmond (a little intense at 8 hours of *teaching* per 10-hour day - so no energy for blogging afterwards). Now we're in Vienna, where we're on-site with one of our favorite customers, then we head to Barcelona next week for TechEd EMEA, back to Vegas for SQL Connections the week after, and finally back to Seattle for PASS the following week. Phew! Look for a photo-stuffed Where In The World Are Paul and Kimberly post from Vienna next week.

I'll try to post some interesting stuff that comes up while we're at the conferences - always a good crop of questions.

In this quick post I want to touch on something that's becoming more important to explain as more and more people start managing systems who haven't got a lot of DBA experience (what I like to call involuntary DBAs). How do you plan a backup strategy? As you'll see if you look at my Backup/Restore category, I've got lots of info about specific types of backups but nothing on putting a plan together in the first place.

It's very simple to decide on using the SIMPLE recovery model and regular database backups - the backup schedule is easy. But, what happens when you come to recovering from a disaster? Which backups to you need to restore and how long does it take you? If you take weekly full backups, say, then you stand the chance of losing a lot of data if the disaster occurs just before your next full backup. So, if you switch to the FULL recovery model and add log backups, you can recover right up to the point of the crash.

But again, what backups do you have to restore and how long does it take you? Are you able to restore within the maximum allowable downtime for your business? If you have a 300-GB database, and the downtime allowance is 15 minutes (as with one DBA I've known), the answer is no. Do you need to move to a partitioned schema that makes use of multiple filegroups so that you can use partial database availability to bring your application online faster with only the critical filegroups?

The key point when planning a backup strategy is not to think about what backups you want to take - think about what restores you have to be able to perform, then work backwards from that.

Cheers

One of our customers (and friends!) sent us this last week - a stark reminder why hardware redundancy is a good idea!

 

The controller card in one of his servers literally blew up at 7.15am Tuesday morning. Although they didn't have redundant hardware, they were back online by 9am - pretty good going. Funnily enough they've been considering clustering  for a while now...

(Posted with permission.)

I just heard today that the first whitepaper I've written for Microsoft has been published!

The abstract is:

SQL Server Replication: Providing High-Availability using Database Mirroring

This white paper describes how to use database mirroring to increase the availability of the replication stream in a transactional environment. It covers setting up replication in a mirrored environment, the effect of mirroring partnership state changes, and the effect of mirroring failovers on replication. In addition, it describes how to use LSN-based initialization to recover from the failover of a mirrored subscriber database.

Although brief overviews are given of both replication and database mirroring, it is easier to understand this white paper if the reader has some experience with one or both of these technologies, and has at least a rudimentary knowledge of database concepts such as transactions.

You can download it from http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/ReplicationAndDBM.docx and I'll put a link in our whitepapers page.

Enjoy!

As you may already know, instant initialization is a way to prevent data (not log) file create and grow operations having to zero-initialize the new space before allowing it to be used. This can vastly speed up these operations as zero-initialization can take a *long* time for large files. This is especially useful in disaster recovery operations as the first phase of a restore is always to create the requisite files, if they don't already exist. Cutting minutes or even hours from this phase can significantly reduce downtime. It's available on XP SP2 and Windows Server 2003 and above. You can get more details from a blog post of Kimberly's from March 2007.

The way to enable it is to give the SQL Server service account the 'Perform volume maintenance tasks' privilege and then restart the service. There's no way to enable or disable it from within SQL Server, and until now, no way I've known of to tell whether it's enabled from within SQL Server. I was teaching a Microsoft-internal class on Database Maintenance last week and one of the students came up with a neat way to tell - using xp_cmdshell to execute the whoami /priv command, which lists all the privileges that SQL Server service account has.

The whoami command is available on Windows Server machines but for XP you need to download the support tools from Microsoft to get it to work, as I did on my laptop. You can get them here (5MB download) and you need to stop/start SQL Server on XP after installing them so it picks up the new tools path. Now, most people will be running with xp_cmdshell turned off, because of the security risks involved with enabling it, so here's a script that turns it on, checks the privileges, and turns it off again.

EXEC sp_configure 'xp_cmdshell', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

CREATE TABLE #xp_cmdshell_output (Output VARCHAR (8000));
GO

INSERT INTO #xp_cmdshell_output EXEC ('xp_cmdshell ''whoami /priv''');
GO

IF EXISTS (SELECT * FROM #xp_cmdshell_output WHERE Output LIKE '%SeManageVolumePrivilege%')
   
PRINT 'Instant Initialization enabled'
ELSE
   
PRINT 'Instant Initialization disabled';
GO

DROP TABLE #xp_cmdshell_output;
GO

EXEC sp_configure 'xp_cmdshell', 0;
GO
RECONFIGURE WITH OVERRIDE;
GO

Enjoy!

While we were at TechEd in June, Kimberly and I participated in an hour-long discussion panel (Leveraging SQL Server Technologies to Build a Solid High-Availability Strategy) that was video-taped by the TechEd Online folks. It's now been edited and is available for download/viewing. We cover everything from requirements analysis to technology details in SQL Server 2008. The other panel members were Satya Jayanty, Allan Hirt, Kevin Farlee, and Amit Bansal.

You can browse the various online videos at http://technet.microsoft.com/en-us/events/teched/cc561184.aspx or go straight to the panel discussion video here.

Enjoy!

It seems that today is going to be one of those days where I get lost in forums and blogging - I can live with that :-)

One of the questions that came up on a forum today was about choosing an HA solution - based solely on the hardware that was running the database! Given that single piece of info, it's impossible to come up with any kind of sensible answer. The other thing I see a lot is someone saying 'just use a cluster' - well, if you're trying to protect against damage to the data, just using a cluster won't do it because of the single-point-of-failure in a failover cluster - the shared disks.

So where do you start? The key to choosing an HA solution is to work out your requirements first and then choose a technology that allows you to meet as many of them as you can, within your available budget. Here are some of the questions I like to ask (not an exhaustive list):

  • What is the maximum application downtime SLA (service-level agreement)? In other words, if a disaster happens, how long can the application be off-line while failover occurs or the disaster is fixed?
  • What is the maximum acceptable data-loss SLA? If a disaster happens, how much can you afford to lose in terms of data or work? You might require up-to-the minute recovery for instance, or you might be able to cope with losing the last day's worth of transactions.
  • What are you trying to protect? Site, server, instance, database, filegroup, partition, table, group of tables?
  • What is the transaction log generation rate of your workload? If it's very high, that means you're going to have problems with backup up the log and with getting transaction log over to your redundant server/site.
  • What recovery model are you running your database(s) in? If you're in SIMPLE, then you can't get point-in-time recovery and so you're looking at losing all the work since your last full backup, and it also means you can't use any of the HA technologies which rely on the transaction log.
  • What’s your current backup strategy? If the answer is 'what backup strategy?' then you've got bigger problems than just getting an HA solution in place...
  • Are you trying to achieve site-level redundancy? If so, do you have a second site? Where is it? Does it have the same protection as the main site (in terms of security, HVAC, power, etc)
  • What’s the network bandwidth and latency to the second site? If your transction log generation rate is MBs/second, but your second site is 2000 miles away through a 720KB/second link, you're not going to be doing any kind of HA solution involving the second site that comes close to your downtime and data-loss requirements...
  • What’s the hardware at the second site?
  • Can you alter the application at all? If you can't alter the application then you may have a hard time getting it to gracefully failover to a redundant server. You also won't be able to use explicit redirection with database mirroring.
  • What's the application eco-system? In other words, what all has to failover so the application can run properly.

All of these figure into the choice of HA solution. Work these out, prioritize them, and then evaluate HA technologies (or combinations of technologies) to see which requirements you can meet. Don't just jump at failover clustering first!

Over the next few months I'll be posting more on designing for high-availability - let me know if there's anything in particular you want to see.

Theme design by Nukeation based on Jelle Druyts