This survey is based on a suggestion from Jonathan Kehayias (blog/Twitter).

In this survey I'm asking two questions:

  1. When you're designing/implementing a database, do you manually create statistics or let SQL Server create them automatically?
  2. How do you maintain statistics? Manually or letting SQL Server do it for you? Or both? 

I'll probably collaborate with Kimberly on the editorial for this, as she's forgotten more about statistics than I'll ever know!

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers (comments are unfortunately moderated anyway, now that my blog gets a lot of spam). I'm very interested in your reasoning, but not until everyone else responds.

   

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

A few weeks ago I kicked off a survey asking what kind of disaster recovery guide/run-book/plan (I'll just call it a plan from now on) you have (see here for the survey). Here are the results as of 9/15/09:

 

Out of all these answers, IMHO the last answer is the only acceptable one for a production DBA responsible for recovering critical databases, to a defined RTO and RPO (recovery-time objective and recovery-point objective, respectively).

The problem is though, getting a wonderful and comprehensive disaster recovery plan together is waaay easier said than done. For a start, disaster recovery isn't a sexy topic with management UNTIL a disaster actually happens and the RTO and RPO are completely blown - so it's hard to justify the time and effort needed to put together a good plan. It's especially hard to put a plan together if you're an involuntary DBA, with no idea about what disasters could occur and what you'd do to recover from them.

The reasons to have a plan worked out in advance are pretty much common sense: in a disaster situation, where time is often of the essence, adrenaline and stress levels run high and it can be hard to remain cool, calm, and collected. For an unprepared DBA, this can very easily lead to costly mistakes being made. No-one wants to be the one that overwrote the only existing copy of a database with a corrupt backup and caused the business to be offline for several days. Good-bye job.

A pre-defined disaster recovery plan allows the DBA (or responsible person) to follow a set of tested steps to resolve problems that can occur. A really comprehensive plan covers more than just how to restore the database, but instead will contain troubleshooting information to determine what needs to be done (rather than just immediately taking everything offline and running a restore), and then coping with twists and kinks that could crop up while performing the recovery operation. What makes a plan wonderful is that it gets tested regularly to make sure everything in it is a) still appropriate b) still works c) still works within the defined RTO and RPO. Kimberly and always like to say that a disaster recovery plan should be written by the most experienced DBA you have and tested by everyone else, down to the most junior DBA - everyone needs to be able to work with it.

Bottom line (not a very long editorial this time) - you can't expect to be able to recover within any defined limits if you're just going to wing-it when a disaster strikes. Doesn't matter how experienced you are, crazy stuff happens that can trip you up.

Next up - the next survey!

In this survey I'd like to find out what kind of disaster recovery guide/run-book/plan you have, if any. When a disaster occurs, do you just wing-it, or is there something written-down that was worked out in advance to help you along the way. I'll report on the results in a couple of weeks.

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers (comments are unfortunately moderated anyway, now that my blog gets a lot of spam). I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

A few weeks ago I kicked off a survey on how you add geo-redundancy to a failover cluster (see here for the survey). The results as of 8/26/09 are as follows:

 

So why is this interesting? Well, many people will suggest failover clustering as the best way to provide high-availability for a database (or group of databases). And it is a great technology to protect against server failure, but there's only a single copy of the database, which is the Achilles' heel of failover clustering. If that copy of the database is damaged, the application is down unless there's another copy of the database available. This is where providing geo-redundancy comes in. With that in mind, I'm surprised at the percentage of respondents that don't provide any geo-redundancy at all.

There are a bunch of options for providing a redundant copy of a database that is hosted on a failover cluster, with pros and cons to each, and that's what I'm going to spend the rest of this post on.

SAN replication: This is where the SAN hardware itself mirrors all write I/Os to a remote SAN, thus maintaining a remote copy of the database. The hardware has to provide disk-block size and write-order preservation; otherwise the database on the remote SAN could become corrupt. Imagine if write-ordering was not preserved and some data pages write I/Os were completed on the remote SAN before log records write I/Os (thereby breaking the write-ahead logging protocol) - recovery wouldn't be able to work properly! This mechanism requires a remote SAN, a second failover cluster connected to the remote SAN, a network including both clusters, and a big, fat network pipe between the two SANs. The bigness and fatness of the pipe depends, of course, on how many write I/Os are performed on the local SAN, and whether the SAN replication is synchronous or asynchronous.

Synchronous replication requires that the I/O is completed on the remote SAN and acknowledged back to the local SAN before the local I/O can be acknowledged to the local server. If the network bandwidth and latency can't support the volume of write I/Os trying to be replicated to the remote SAN, the I/Os will start to queue up and delays will be incurred on the local server. This will lead to the workload slowing down as SQL Server has to wait longer and longer for I/Os to complete. Now, with synchronous replication you have the guarantee that the remote copy of the database is completely in-sync with the local copy, so if a failure occurs, no committed data will be lost. If the network can't keep up though, you may have to switch to asynchronous replication. This means the local I/Os don't have to wait for the remote I/Os to complete, and so no performance penalty is incurred. BUT as the replication is now asynchronous, committed data may be lost if the local copy of the SAN is damaged.

Apart from the potential for performance problems with SAN replication, it's also very expensive - as another SAN, another cluster, and some beefy network hardware/bandwidth is required. This isn't a technology I'd expect a small company to be using or considering. Finally, the portion of the remote SAN that's being replicated to cannot be accessed at all. On the MAJOR plus side, all databases on the SAN are replicated at once, without having to setup a technology to provide a redundant copy of each. For application ecosystems that include multiple databases, this is what I like to recommend.

Log shipping: This is the simplest way to maintain a redundant copy of the database - it's just backup log, copy, restore log; repeat. It works seamlessly with failover clustering and is really easy to setup and maintain. The only problem with this is that you open yourself up to data loss, as a log shipping secondary is usually not right up-to-date with the primary. You can use the secondary for reporting/querying by restoring the log backup WITH STANDBY (which requires a little more configuration, but not much), and you can protect against accidental data damage by having a secondary with a load-delay configured, so the database is, say, 8 hours behind the primary. In my experience, this is the most common technology that's used in conjunction with failover clustering as it's the cheapest and easiest. On the downside, it's a single database solution so its not suitable for complicated application ecosystems.

Transactional replication: This isn't very commonly used at all, although again, it works seamlessly with failover clustering after a failover. The reason this isn't used very often for geo-redundancy is that transactional replication doesn't provide database-level redundancy, only table-level. It's also much more complicated to setup and troubleshoot when things go wrong, plus there's varying latency between a transaction committing in the publication database and it being applied to the subscription database(s).

Database mirroring: Database mirroring is the only technology apart from SAN replication that can provide a zero data-loss solution when configured for synchronous operation. It works by shipping the log records from a database rather than the raw I/Os, so doesn't require anywhere near as much capital expenditure, but the network has to be able to cope with sending the log generated on the principal, otherwise performance on the principal can be affected. Mirroring is relatively easy to setup and maintain, and the mirror database can only be accessed, but only through a database snapshot. When combined with failover clustering, you need to be careful about setting the mirroring partner timeout, so that the local failover cluster gets a chance to fail over before mirroring does. Checkout my blog post on this: Search Engine Q&A #3: Database mirroring failover types and partner timeouts. You can configure database mirroring for synchronous or asychronous operation, with the same performance and data-loss exposure caveats as SAN replication. SQL Server 2008 provides log stream compression and automatic page repair, which make this more attractive (see SQL Server 2008: Performance boost for Database Mirroring and SQL Server 2008: Automatic Page Repair with Database Mirroring, respectively), but only supports a single database. I'm seeing this combination start to be used more, but again, it's a single database solution so isn't suitable for complicated application ecosystems.

Backups/homebrew: Good old backups can easily be used to provide a very low cost way of maintaining a redundant copy of a database, and if you think about it, this is really do-it-yourself log shipping. At the very least, databases should *always* be included in a backup strategy, no matter what other high-availability technology(s) you may have implemented.

3rd-party solution: There are a few non-Microsoft solutions for providing redundancy with failover clustering which don't involve traditional SAN replication. I'm not an expert in any of them, but I've heard of anecdotal issues with the two I mentioned in the survey and worked with customers who've had real issues with PolyServe (one of which I blogged about).

Summary 

When you're planning a high-availability strategy, you always need to consider the limitations of technologies while evaluating them. The big limitation of failover clustering is that there's no redundant copy of the database so you need to add another technology to provide that. I've just finished writing a 35-page whitepaper for Microsoft on the high-availability technologies in SQL Server 2008, as well as how to go about planning a strategy. It will be published before PASS in November, but in the meantime, this should have given you lots of food for thought.

Next post - the next survey!

I'm moving away from strictly weekly surveys as they're a little demanding, especially when we're on the road.

In this survey I'd like to know which technology you use to provide geo-redundancy to databases that are protected locally using a failover cluster. I'll report on the answers sometime around the middle of August.

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

A long time ago, in a blog post far, far away (well before I went offline in July) I kicked off a weekly survey about how often you run consistency checks (see here for the survey). Now I'm back online again, and so here are the results as of 8/3/09.

 

The results are actually surprising - I didn't expect so many people to be running consistency checks so frequently - 25% daily and another 37% weekly - very cool!

The number of people who don't know what consistency checks are may look a little depressing but I think it's probably a symptom of the number of people coming into the SQL world as involuntary DBAs. For those people, consistency checks are a way of proactively checking for database corruption, which is nearly always caused by the I/O subsystem. You can read a good introduction to consistency checking and other database maintenance topics in the article I wrote for TechNet Magazine in August 2008 - Top Tips for Effective Database Maintenance and bit more in the previous survey Importance of how you run consistency checks.

Basically you need to run regular consistency checks. There's a myth that you don't need to run consistency checks - this was perpetuated by various marketing folks when SQL 7.0 shipped, because SQL 6.5 used to cause allocation corruptions and the rewrite for 7.0 removed all the corruption problems. Now, of course there have been bugs in SQL Server that cause problems, but they account for a tiny fraction of the corruptions out there. Nearly all corruptions are caused by something going wrong in the I/O subsystem - and you can't predict when that will or won't happen. Jim Gray once likened the disk heads in a hard drive as akin to a 747 flying at 500mph 1/4 inch above the ground - scary stuff. You DO need to run consistency checks, because corruptions do happen. You can read more about the causes of corruptions in this blog post: Search Engine Q&A #26: Myths around causing corruption.

So - how often should you run consistency checks? Well, it depends. When I'm teaching I like to give two examples:

  • You have a dodgy I/O subsystem that is causing corruptions. You have no backups. You have a zero data-loss requirement. With no backups, your only way to get rid of corruptions is to run repair, which usually leads to data loss (see Misconceptions around database repair). In that case, you want to know about corruption as soon as possible to limit the amount of data loss from running repair. Contrived example for sure, but you'd be surprised what I've seen...
  • You have a rock-solid I/O subsystem, with all drivers and firmware up-to-date. You have a comprehensive backup strategy which you've extensively tested and you've confident you can recover from corruptions with zero-to-minimal downtime and data loss. In this case you may be comfortable running consistency checks once a week, say.

The overwhelming factor in how often to run consistency checks is you. How comfortable are you with the integrity of your I/O subsystem and your ability to recover from a corruption problem. If you have corruptions in your database today, you'll probably run DBCC CHECKDB on it every day for a month, right?!?

Part of any database maintenance of high-availability strategy is proactively making sure that corruption doesn't exist in the database - otherwise when you DO discover it, it may be more pervasive and it will take you longer to recover, and potentially with more downtime. Therefore the answer really is that you should run it as often as you can. Sometimes that can be difficult for very large databases (your definition is likely to be different than mine - I think 500GB and larger - depends on your hardware etc) as CHECKDB can take a long time to run (see CHECKDB From Every Angle: How long will CHECKDB take to run?), but there are ways you can effectively consistency check even a VLDB - see CHECKDB From Every Angle: Consistency Checking Options for a VLDB.

So - I'm very pleased to see so many people running regular consistency checks. However, maybe I'm just a pessimist but I did expect some of the less than optimal options to have higher numbers. Let's look at each in turn.

Never. The only time this is acceptable is if you don't know what consistency checks are and why you need them. Enough said. Even if you think you *can't* run consistency checks because it takes too many resources/time/etc, take a look at the blog post I mention above. There's always a way.

Only when corruption is detected some other way. By the time corruption has occurred and is detected through regular operations, it's likely to be more pervasive. Many databases do not see all their data read/updated as part of regular operations, which means that if a part of the database that's not used for a while gets corrupted, and you're not running consistency checks, you're not going to know. That means you're not going to know that your I/O subsystem is causing problems - and so more corruption will occur. You might think this isn't a big deal, but it can be depending on what part of the database (or maybe system databases) gets corrupted. Some things can be restored or repaired relatively simply - but what if, say, the boot page of master is corrupted? Your entire instance is down until you sort that out. It's always better to proactively discover corruption before it hits you in a way that really messes you up.

Only during an event like an upgrade or migration. In this category I see people running consistency checks AFTER doing an upgrade but not before. From release to release, DBCC CHECKDB has improved - especially from 2000 to 2005, where system catalog consistency checks were added. If you run consistency checks after an upgrade and find the database is corrupt, what do you do? Run repair? Go back to the earlier version and run repair? Restore your backups? The odds are that if you've got corruption, it's also in your backups too. Also, you may think that the upgrade itself caused the corruption - I've never seen this. It's always been that the corruption was there before and only got discovered after the upgrade. Always run consistency checks BEFORE doing an upgrade - to make sure you've got a chance of putting things right before (potentially irrevocably) moving to the newer version.

Only after performing a restore, or after a failover. This is, of course, a good practice, but shouldn't be the only time a consistency check is performed - for most of the same reasons as I've explained above. What do you do if the database is found to be corrupt after restoring backups? You're looking at data loss now.

To summarize: make sure you're running regular consistency checks - with the regularity in-line with your comfort zone and your maintenance/high-availability strategies.

Next up - the next survey!

In this week's survey I'd like to know how often you run consistency checks on your *most critical* production database, regardless of *how* you run them (we did that survery already - see Importance of how you run consistency checks). I'll report on the results around July 4th.

I'd only like you to answer for your *most critical* production database, as the frequency will probably vary wildly by database, server, production vs. dev/QA and so on. If everyone answers for their most critical database then we won't get skewed results.

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS Thanks to Pat Wright for suggesting this week's topic on Twitter.

In last week's survey I asked how you manage the size of your database *data* files - see here for the survey. Here are the results as of 6/24/09.

The 'other' values are as follows:

  1. 5 x 'manual file growths and a custom mom alter to when the datafiles are 98% full. autogrow set to fixed amount in case we miss the mom e-mail'
  2. 1 x 'auto-grow with a procedure to keep the log file size at 20% relative to the total data file size'
  3. 1 x 'Create with very large initial file size, set auto-growth to %'
  4. 1 x 'Lots of white space in files. Auto-grow to a fixed size (in case of emergency).'
  5. 1 x 'set initial size for 1 yr usage, monitor size, manual grow, autogrow percentage - send alert if it does grow'
  6. 1 x 'set initial size for 2 year growth, capture growth stats daily, monitor physical disk space daily'
  7. 1 x 'Set to autogrow by fixed size to cater for emergencies, otherwise maintain 80-90% free space by daily reporting then manual off-peak size increase if necessary.'

As I mentioned in the survey itself, this is just about database *data* files. I covered log file size management in a previous survey - see Importance of proper transaction log size management.

There are really four parts to data file size management, which I'll discuss below.

The first thing I'll say is that if you're able to, enable instant file initialization on 2005+ - as it will vastly reduce the time required to create data files, grow data files, and restore backups (if the restore has to create data files). We're talking minutes/hours down to single-digit seconds. See Misconceptions around instant file initialization. If you're not one of the miniscule fraction of a percent of customers who have volumes shared between secure file servers and SQL Server instances, turn this on ASAP. Most DBAs don't know about this feature, but everyone I teach it to or show it to are amazed and then go turn it on. No brainer.

Initial data file sizing. This can be tricky. Without getting into the how-to-calculate-the-database-size quagmire, I'll simply say that you should provision as much space as you can, based upon your sizing estimates. Don't just provision for the here-and-now - if you're able to, provision space for the next year's worth of anticipated growth - to prevent auto-growth having to kick-in at all. I rewrote all the Books Online pages for 2005 (links are to the 2008 BOL) on Estimating the Size of Heaps, Clustered Indexes, and Nonclustered Indexes and in the blog post Tool for estimating the size of a database I link to a tool someone put together that codified all my formulas. You can also get sizing tools from hardware vendors too.

Data file growth. If you're able to, auto-grow should ALWAYS be turned on, as an emergency measure in case your monitoring fails - you don't want the database to have to grow but it's unable to and then it stops and the application is offline. However, you shouldn't *rely* on auto-grow - it's just for emergencies. The auto-growth default for data files used to be 10% for 2000 and before, but then changed to 1MB from 2005 onwards (log file default auto-growth remained at 10%). Neither of these are good choices. A percentage-based auto-growth means that as your files get bigger, so does the auto-growth, and potentially the time it takes if you don't have instant file initialization enabled. A 1MB autogrowth is just nonsensical. Your auto-growth should be set to a fixed size - but what that should be is a great big 'it depends'. You need to decide whether you want the auto-growth to be a quick stop-gap, or whether the auto-growth will replace manual growth after monitoring. You also need to consider how long the auto-growth will take, especially without instant file initialization. I can't give any guidance here as to what a good number is, but I'd probably settle on something around 10% (fixed), with the percentage steadily falling as the database size grows. It's very important that you have alerts setup to you can tell when auto-growth does occur, so you can then take any necessary action to grow it even more or tweak your settings.

'Other' response #2 is interesting. There's been a 'best-practice' around for a while that the log file should be sized to be a percentage of the data file size. It's totally unfounded and in most cases bears no relation to reality. The vast majority of the time, the size of the log is *NOT* dependent on the data file sizes in any way. Imagine a 10TB database - would you provision a 2TB log? Of course not. Now, I can see special cases where the operations performed on the tables in the database might affect a fixed portion of the largest table in a single batch, and that could generate enough log (with reserved space too) to equal 20% of the data file size - but that's a pretty nonsensical special case, to be honest. You shouldn't use 'set the log as a percentage of the data file' as a guideline.

Data file size/usage monitoring. There's a growing movement towards monitoring the data file usage and manually growing the files as they approach 100% full - avoiding auto-growth altogether, but still having it enabled for emergencies. In my book, this is the best way to go as you have all the control over what happens and more importantly, when it happens - especially without instant file initialization. There are some quirks here though. SCOM, for instance, has logic that disables file size and usage monitoring if you enable auto-grow. It assumes that if you enable auto-grow then you're not interested in monitoring. I happened to have one of the SCOM devs in my last maintenance class I taught on the Redmond MS campus and he's going to try to get that logic fixed.

Data file shrinking. Just this morning I wrote a long blog post about this - see Why you should not shrink your data files. Running data file shrink causes index fragmentation, uses lots of resources, and the vast majority of the time when people use it, is unnecessary and wasteful. It should NEVER be part of a regular maintenance plan - as you get into the shrink-grow-shrink-grow cycle which becomes a zero-sum game with a ton of transaction log being generated. Think about this - if you run a shrink, which is fully logged, then all the log has to be backed up, log-shipped, database mirrored, scanned by the replication log reader agent, and so on. And then the database will probably auto-grow again through normal operation, or some part of the maintenance job that rebuilds indexes. And then you shrink it again. And the cycle continues...

Bottom line - make sure you size the data files with some free space, have auto-growth set appropriately, have instant file initialization enabled if you can, monitor file sizes and usage, alert on auto-grows, and don't use shrink. And don't use shrink. Really.

Next up - this week's survey!

In this week's survey I'd like to know how you manage the sizes of your database *data* files (remember we've already done log file size management). I'll report on the results around 6/21/09.

I say this every week in the PS, but I'm moving it up here because I don't like having to delete comments and send email, *please* no comments on this post - wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS Thanks to Steve Jones (of SQL Server Central fame) for suggesting this week's topic on Twitter.

In last week's survey I asked you what you think is the hardest thing when becoming an involuntary DBA - see here for the survey. Here are the results as of 6/15/09.

The 'other' responses were:

  • 3 x 'Learning to tell good advice from bad advice'
  • 1 x 'Learning to know *that* you don't know'
  • 1 x 'Management having unrealistic expectation of my capabilities'

All great answers.

We all know the definition of 'involuntary DBA' - someone who is made to assume a DBA role or take on DBA responsibilities, usually against their will. I used to call them 'accidental DBAs' but too many people thought I meant DBAs who had accidents, so I changed my phrase. Let's go through some of the answers, ranked by number of responses.

Learning to know what you don't know. Actually the first step here is learning to know *that* you don't know - if you think you know everything (really in any field and with any expertise level) then you're most likely wrong, no matter who you are and what the field is. It can be a humbling experience making (and accepting) that realization and then quite daunting when you realize that there is maybe *so much* that you don't know and might have to learn. For an involuntary DBA this can be doubly daunting, because not only do you realize that there's a ton you don't know about being a DBA, but that means there's a ton you don't know about managing this (maybe mission-critical) SQL Server that you're suddenly responsible for and your job security is now suddenly dependent on doing the right thing.

Being the DBA as well as my actual job. Many times an involuntary DBA is actually a bolted-on side-role for someone with a real daytime job too - sometimes not even an especially technical daytime job. I'm teaching a class right now for a large customer of ours and about 1/3 of the 30+ people in the class are involuntary DBAs as well as operations engineers. Some other common examples - SharePoint admins who suddenly find themselves with an Enterprise-class SQL Server instance underpinning their SharePoint farm, developers using SQL Express as part of a client application who have to think about the maintenance implications of having SQL Server installed, the person who sits nearest the SQL Server instance when the real DBA leaves or is fired. Sometimes becoming an involuntary DBA can be very rewarding and take you down an unexpected but ultimately welcome career path, but many times its an unwelcome burden, and very occasionally it can be disastrous. A few weeks ago I heard on the forums from someone who had become an involuntary DBA after the actual DBA has been fired - and his job was on the line if he couldn't get the broken database fixed in 24 hours. Brutal. Being a DBA can be a full-time job in itself, and if you ask many production DBAs, they'll tell you that it can be way more than the standard 40-hour work week (seriously, does that even exist any more in our industry?) - so to try and do that *as well* as a regular 9-5 job might be nigh on impossible.

Finding information on what to do. This can be the real kicker - you've realized that there's lots you don't know - and you want to learn - but where to start? Who to ask? There's a great community of SQL Server folks that are very willing to help out - if you're reading this then you've already stumbled in to the community somehow. There are lots of blogs with information on - some of what I post is deeper information than you *really* need to know to get by, but the editorials in the Weekly Survey series are a good overview of some of the topics to be concerned about, and Kimberly also posts a ton of good info on her blog. In TechNet Magazine, I've been writing many articles aimed specifically at involuntary DBAs - here are some links:

Here's a quote from an interview I did with SQLRockstar recently (hope you don't mind Tom):

SR: What advice would you give to new SQL DBA’s just entering the field?

PR: There’s a huge amount you don’t know - that’s just a fact. SQL Server can appear very easy to setup and get running, but there are lots of ways to shoot yourself in the foot if you’re not careful. I’m not trying to be scary, just engender a healthy amount of trepidation. Get yourself a mentor if possible, get some training, practice, practice, practice, read a bunch of people’s blogs, follow folks on Twitter, and so on. You need to be an information sponge for a couple of years at least, and get as much experience as you can with the varied aspects of SQL Server - perf tuning, disaster recovery, security, design, etc. And the more you tinker, play, and learn - the more you’ll realize there’s lots you still don’t know. Accept that and be willing to take advice, make mistakes, and learn and you’ll go a long way. Don’t be intimidated to ask people questions, most of us are happy to help - we all started with zero SQL Server knowledge, no matter where we are now.

The last sentence is the most critical in that paragraph. No matter how knowledgeable anyone in the SQL community might seem, every single one of us started with *no* SQL Server knowledge at all. I had never even heard of SQL Server until I joined Microsoft in early 1999 - and there's a *huge* amount of SQL Server I know very little about - I can barely even spell BI for instance. But I know what I don't know and I know where to go to search for information, or who to ask.

And for all those of you in the SQL community, when you're answering what might, at first glance, look like a stupid question from someone with zero reputation or forum points, consider the fact that they may well be an involuntary DBA trying to find an answer because they know they need help and don't have anywhere better to go. Be kind. There's no such thing as a stupid question - the only stupid thing is *not* asking a question when you know you don't know the answer. A sure-fire way to put someone off asking any more questions is to slap them down when they ask their first.

Now, apart from asking the questions on forums etc, another problem that involuntary DBAs have is knowing what information/answers to trust. This one's very hard, as unless you get to know people in the community, you've got no clue who these people are and why you should believe them. The onus is also on us in the community to make damn sure that when we answer a question it's the *right* answer. This is especially true in the corruption forums where telling someone the wrong answer can deepen the corruption hole the poor DBA is already in.

Management that don't understand databases. And also don't understand that the new involuntary DBA isn't going to immediately and magically become 'super-DBA' overnight and fix all the problems with the SQL Server instance. It takes quite a while to learn the ins-and-outs of being a good DBA, and having the confidence to be able to argue with management about what is and isn't possible. There's no easy solution to this one I'm afraid - education is the only option.

Dealing with actual DBAs and developers. There's a tendency in human nature to be scornful and dismissive of people in the same field who are not as knowledgeable or talented - this can be especially true in the technical area we inhabit where sometimes the people we work with may be egotistical and possibly even lacking in social skills - the 'typical' geek developer. Apologies to all those out there who *are* nice and *do* have great social skills - but after 9 years as a developer and manager of various technical disciplines at Microsoft, I know of what I speak. As an involuntary DBA you may come up against some of this antipathy. My advice - you're just going to have to deal with it until you become more knowledgeable. It's unfortunate, but it's human nature. One way to gain respect and trust from these people is to learn some stuff and solve some problems - don't ever BS about something you don't know about - guaranteed way to quickly lose whatever respect you've built up. You might try complaining to your management, but if it's the same management that added being an involuntary DBA to your regular job, from what I've heard, they're unlikely to listen to you. I don't mean to sound depressing, I'm just being honest.

Performance tuning, disaster recovery, database maintenance, implementing an HA/DR strategy. It's hard to pick which of these is the most important for a new involuntary DBA to focus on - but I think I'd go for database maintenance. That addresses some perf issues (index fragmentation management, statistics maintenance) and some disaster recovery issues (taking backups, running consistency checks). Checkout the TechNet Magazine article above for (what I think is) a great primer, and for perf tuning, see the editorial from a couple of weeks ago that has lots of links: Important considerations when performance tuning.

So if you're an involuntary DBA, the bottom line is that it's going to be a struggle for you - but there are lots of resources out there and people willing to help out.

Don't be afraid to ask.

PS By all means add comments with more involuntary DBA resources and I'll collect them together.

Categories:
Involuntary DBA | Surveys

This week's survey is all about being a DBA or involuntary DBA - what do you think is the hardest thing when becoming an involuntary DBA? I'll report on the results sometime over the weekend of 6/13/09.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS And again, as always, no comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

Categories:
Involuntary DBA | Surveys

In last week's survey I asked what you think is the most important thing when it comes to performance tuning, if you *had* to choose one - see here for the survey. Here are the results as of 6/7/09.

Now of course you're all calling 'foul' because I didn't put an 'it depends' option on there which you all would have chosen, but that would have been boring, and I wanted you to pick what you really think is the #1 thing in the majority of cases. Of course, in general the answer is always 'it depends', but some things turn out to be the #1 problem more often than others. For the record, mine and Kimberly's answers were overall indexing strategy, but I deliberately broke them out into clustered and nonclustered to see what people would pick. I'm not surprised that application issues and nonclustered indexing strategy came out top, but good to see that all answers were represented and there's a variety of opinion out there. Variety is the spice of life as they say, or is it garlic? Depends who's cooking I suppose - but I digress.

Let's walk through each option and see how badly they can affect performance if something's wrong. This is by no means an exhaustive list of what could cause problems, just me rambling on at the end of a Sunday, as usual. Blog posts are so much easier than articles because you can have run-on sentences, short little fragments and all sorts of other verboten grammatical beasts.

IO subsystem design/tuning, including RAID: Quite a few things here could be wrong. Is there a write-intensive workload (either data files or log files) sitting on a RAID-5 array (with RAID-5 there's a performance penalty on writes)? Is the disk partitioning alignment, RAID stripe size, or NTFS allocation unit size incorrectly set? (See Are your disk partition offsets, RAID stripe sizes, and NTFS allocation units set correctly? for details.) Is the I/O subsystem just not up to the load being placed on it - high disk queue lengths, page IO latch wait errors, long-running IO errors? Is tempdb the bottle-neck because there a multiple procs with multiple users all trying to create/drop lots of temp tables? (See Misconceptions around TF 1118 for details.) Are there multiple databases using the same I/O subsystem such that disk-head contention is causing performance issues? Is there a network bottleneck with iSCSI storage? Are there filter drivers getting in the way and slowing things down?

Server hardware (CPUs/memory/other): Ok - I'm going to sound like a stuck-record saying 'quite a few things here could be wrong'. Let's just take it as read that each category has lots that could go wrong. A heavy load with a single CPU isn't going to perform well. However, sometimes excessive parallelization can lead to the dreaded CX Packet waits - multiple threads in the query processor waiting for information from each other because the server is overloaded - or blocking issues as locks start getting in the way - but that's more an application problem - oops, wrong section. Having more CPUs is usually a good thing as parallel plans usually run faster than single-threaded plans. Having too little memory can severely limit the buffer pool size and force buffer pool thrashing (low page life expectancies), and large operations to spool data to tempdb, placing further load on resources. Not sure what I meant now by 'other' - maybe having a wireless keyboard could slow down commands getting to SQL Server? Ok - I remember - CPU architecture. 64-bit vs. 32-bit. With 64-bit there's no real limit on the amount of memory that can be addressed, and no mucking around with AWE. There are some tricky issues with NUMA to make sure you don't get cross-node memory accesses (just like CPU cache invalidation could hit you on SMP architectures) but that's getting a little advanced and geeky. Btw - watch out for this when you're using Extended Events in SQL 2008 on multi-proc boxes - there's a setting where you can tell it what kind of CPU architecture you have.

Virtualized vs. 'real' server: Now, I'm the first to admit that I'm not an expert in virtualization, but now I have 3 8-ways and lots of disks to play with at home, I want to be. That's why I picked up a Hyper-V book today after reading Tom's virtual bookshelf post here. What I do know is what I've heard from talking to people at conferences and during classes - the main thing here is I/O virtualization. If there's software virtualization of I/Os, then performance is going to suck under load and you better not do it in production. Microsoft's Hyper-V gets around this, as do things like VMware ESX, by allowing you to assign actual real physical resources to virtual machines. And that's as much as I know about it until the book arrives and I get back from Houston to read it. And no doubt I'll be posting more about it.

Database physical layout (files/filegroups): Tempdb is the obvious case, and I've covered that above. User databases are a bit trickier, and it really depends on the underlying I/O subsystem. A single monolithic data file on a single physical disk isn't going to perform well under heavy concurrent load. But it may do fine on a RAID-10 array sliced-and-diced in a clever way by an expensive SAN. In today's world, more and more databases are on SANs so user database performance is taken care of in that respect, and all you need to worry about is tempdb. I touched on this a little bit in the editorial of a previous survey Physical database layout vs. database size.

Table design: Both ends of the spectrum work well here - from the plastic Costco tables with the fold-down metal legs to antique French-oak farmhouse tables. Both will give many years of service with no degradation of performance. Just like regular tables, SQL Server tables come in all shapes and sizes and you can easily pick the wrong one for what you need, leading to poor performance. This is a massive can of worms - in fact Kimberly spends 2 brain-busting days on this in one class she teaches and she could easily spend a week of 10-hour days on it (oh yes, it's called the SQL Masters program :-) From picking the right column types (see Michelle's post at Performance Considerations of Data Types) to deciding how to store you LOB values (see my post at Importance of choosing the right LOB storage technique) to deciding on a good primary key. Just don't pick a random GUID, or if you have one, replace it. Clustered index keys should be unique, narrow, static and ever increasing. Table design encompasses so many things I can't do it justice in a late evening paragraph. But you get the idea.

Heaps vs. clustered indexes: Kimberly likes to wrap this up with nonclustered indexes too, but I wanted to break them out so I could reference a whitepaper that strongly suggests you should use clustered indexes: Comparing Tables Organized with Clustered Indexes versus Heaps. Read it yourself and draw your own conclusions. Basically, with the possibility of forwarding records in heaps, and the majority of a clustered index's upper levels being in memory, the extra random physical I/Os to access a record in a heap outweigh the aggregated cost of the in-memory binary-searches at each level of clustered index tree navigation. I'm sure some people will argue about this - bring it on. There are very special cases where heaps are better, but not in general.

Nonclustered index strategy: Biggie #1. Where to start here and what to say? For the majority of scenarios, if you don't have a good nonclustered indexing strategy you're not going to have good performance. There's a simple reason for this - without the right nonclustered indexes to allow the query optimizer to choose them, you're going to get table scans. Of your 124 billion row table. How fast is that going to be? Again, Kimberly has this class that she teaches on this where she goes into details on how to index for ORs, for ANDs, for JOINs, and all sorts of other kinky stuff. I need to sit and listen to her one day so I can appear knowledgeable about this. I just know that you need to have indexes with the right selectivity. Ok - I just called Kimberly downstairs in the office to make sure I'm right before I blog something and look like an idiot and she starts with 'erm, well it depends'. I give up. I do *corruption* and *HA* and *maintenance*, not performance tuning. It's bloody hard. Luckily I'm married to one of the best people on the planet for SQL index tuning - wow, what hope do you have? (Of index tuning, not of marrying Kimberly :-) Seriously, index tuning isn't that bad - I just had a moment of weakness. You need to make sure you have the right indexes and no completely unused indexes otherwise you're wasting resources maintaining them. You could do worse than listen to her on RunAs Radio Interview Posted - "Kim Tripp Indexes Everything".

Statistics: If the statistics are out-of-date, the optimizer will not be able to pick a good plan and your performance will suck. It's that simple. Turn on AUTO_UPDATE_STATISTICS and make statistics updating is part of your regular maintenance. Don't forget to update non-index statistics too.

Application (design and code): Biggie #2. Sometimes no amount of cleverness can wring good performance from the twisted logic of a deranged application programmer. An application that is written with absolutely no concern for how SQL Server works is likely to not perform well. An application that is written with too much knowledge of how SQL Server works is likely to fall foul of relying on 'accidental' behaviors, or behaviors that are limited to certain data volumes or workloads.  An application that is going to make heavy use of SQL server has to take into account how SQL Server is going to behave under a variety of workloads, on a variety of servers, and on a variety of I/O subsystems. The common application test framework? A ten-row customer table with a single connection. 'Excellent - my 16-table join with CLR-based aggregations runs in less than 2 seconds'. For 10 rows. Six months later: 'Now we've got 3 million customers, why does performance suck so bad?' Go figure.

Database maintenance: Ok - starting to get tired now and I still need to do this week's survey. Maybe I could just say 'do maintenance' and be done with it? No? Ok - how about go checkout the article I wrote last August for TechNet Magazine: Effective Database Maintenance article. If you don't maintain your indexes, they'll likely get fragmented and affect performance in one way or another and you'll need to play with FILLFACTOR (see Kimberly's Database Maintenance Best Practices Part II – Setting FILLFACTOR). If you don't take care of transaction log VLF fragmentation, it will affect log-based operations (Kimberly again at Transaction Log VLFs - too many or too few?). Seriously - she says I blog so much - how many links are in this post to her stuff? She blogs a lot too!. If you have corruption, it could manifest itself as long-running I/Os. If I don't put my food on to cook, then I'll be eating way too late again. That was a 'Paul maintenance' one that slipped in - and can have disastrous effects on blogging performance. Seriously, you can't just put a database into production and walk away. It will slowly degrade over time. Like red jello melting on a hot summer's day, but probably not as fast, unless you use random GUID primary keys, or it's really hot. What?!?

Performance tuning is an art and a science. But there's a huge amount of science behind it before you have to get into the art side of things. I've just scratched the surface here in a blog post that took me more than an hour to write, banging away non-stop as I do. And I don't do the art side of things. I leave that to Kimberly - she just got one of our client's batch jobs from a 72-hour run-time down to a 6.5 hour run-time. They were pretty happy. We can't all be the Goddess of Performance Tuning Smile

Next post - this week's survey!

Categories:
Performance | Surveys

This week's survey is going to provide some fun debating topics. I'd like to know what the number 1 thing is that YOU go after when performance tuning - what do you think is most important? (Assuming you're not just randomly walking up to a server to do some perf tuning - there's an actual problem).

[Edit: And, as I say below - please no comments about what you picked until the results post next week - it skews the results. I've already had to delete 6 comments and ask for them to be held until next week.]

There is deliberately no 'it depends' answer, no 'combination' answer, and no way to suggest something else. If what you think is most important isn't on the list, don't vote, but I'd like to hear your opinion after the editorial next week.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS And again, as always, no comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

Categories:
Performance | Surveys

In last week's survey I asked you two things, as a precursor to a whitepaper I'm writing for Microsoft

The first question was what is your maximum allowable downtime SLA (either for 24x7 operation or not). See here for the survey. Here are the results as of 5/30/09.

 

The Other values were all about not having SLAs defined. And I think that's why this survey had a poor response rate - most of you out there don't have defined SLAs.

If you take a standard 365-day year, which means the year has 524160 minutes. Some example maximum allowable downtimes for 24x7 operations:

  • 99.999% (a.k.a 'five-nines') = slightly over 5 minutes downtime per year
  • 99.99% (a.k.a 'four-nines') = almost 52.5 minutes downtime per year
  • 99.9% (a.k.a 'three-nines') = almost 8.75 hours downtime per year
  • 99% (a.k.a 'two-nines') = just over 3.5 days of downtime per year
  • 98.5% = almost 5.5 days of downtime per year
  • 98% = just over a week of downtime per year
  • 95% = just over 2.5 weeks of downtime per year

You might be thinking - wow - whoever only has a target of 95% uptime must have a pretty crappy setup BUT it totally depends on the application and business requirements. 95% may be absolutely fine for some companies and utterly devastating to others (imagine the revenue loss if Amazon.com was down for more than 2 weeks...)

On the other end of the spectrum, you might be thinking - wow - 99.999% target is completely unattainable for the majority of businesses out there, BUT again it totally depends on the application and business requirements. This may be attainable for a simple application and database that doesn't generate much transaction log, or for a large, very busy OLTP database that generates lots of log but the company has lots of money to throw at the HA problem and can afford redundant clusters in separate data-centers with fat, dark-fiber links between them.

Defining SLAs, and not just for maximum allowable downtime, is incredibly important. When done properly, defining SLAs shows that the technical staff in a company and the business management in a company are in tune. It shows they have extensively analyzed the business requirements of an application and balanced them against the technical, space, power, HVAC, manpower, budgetary, and other limitations which may prevent a higher percentage being set as the target. It provides a meaningful input into the design and architecture of systems, and the choices of technologies (note the plural) required to achieve the target SLAs. It shows the company shareholders that the business managers understand the criticality of applications involved in running the business and that they are taking steps to safeguard the shareholders' equity interests.

Note I said 'when done properly'. Technical staff picking SLAs without business input, or vice-versa is a recipe for disaster. If the technical staff don't understand the business requirements, how can they pick appropriate SLAs? If the business managers don't understand the technical limitations, how can they pick achievable SLAs? Everyone has to be involved. And one of the most important things to consider is whether the SLA covers 24x7 operation, or just the time that the application has to be available, if not 24x7.

How to go about using the SLAs to pick appropriate technologies is one of the things I'll be going into in the whitepaper (due for publication around the end of September).

The second questions was what was your measured uptime (again, either for 24x7 operation or not) over the last year. See here for the survey. Here are the results as of 5/30/09.

 

The Other values for 24x7 were 2 x '96', and 1 x 'it varies across customers'. The Other value for non-24x7 was '92'.

Of course, this would be a little more meaningful if we could correlate target downtime with actual downtime achieved, but the free survey site doesn't reach that level of sophistication (and I don't think people have enough time to get into that much detail) while casually reading a blog post.

Nevertheless, the results are interesting, although not really statistically valid from this small a sample-size. The most interesting data point is that some respondents don't know what they achieved last year, or just didn't measure. I think this is the case for the majority of readers. Having well-defined and appropriate SLAs is the key to defining a workable strategy, and what's the point of defining the SLAs if you don't measure how well you did against them? If you don't meet the target you need to revisit the strategy, the SLAs, or maybe even both.

The fact that many people don't have SLAs speaks to the general poor state of high-availability and disaster-recovery planning in the industry, IMHO. And Kimberly just chimed in stating that it also shows how the PC-based server market doesn't focus anywhere near as much on regimented policies and procedures as the older, more traditional, mainframe market did/does.

There's a lot of work to be done. Many times this stuff just isn't high up in the priority list until a disaster actually happens. Invariably it then becomes very, very important. Get ahead of the curve and be proactive - one of the things we always tell our clients.

Next post - this week's survey!

This week's survey is inspired by this morning's Kimberly+Paul hot-tub conversation around data-dependent routing vs. network load balancing, which then turned to SLAs. Yes, we lead an exciting life Smile.

This survey is a *four*-parter. Part 1 is for each of your databases, which survey option is closest to your target maximum allowable downtime SLA (Service-Level Agreement, a.k.a. RTO - Recovery Time Objective)?

  • Survey 1 is if your SLA is based on 24x7 operation
  • Survey 2 is if your SLA is based on non-24x7 operation (e.g. allowing for a 6 hour daily maintenance window, the rest of the time we must be available 99.9%)

Part 2 is for each of your databases, what can you actually achieve?

  • Survey 3 is your achievement over the last year for 24x7 operation
  • Survey 4 is your achievement over the last year for non-24x7 operation

I'll report on the result sometime during the first week of June - as this is a big survey I'd like to get lots of results. Please repost/Tweet/whatever.

What do I mean by that downtime SLA? Well, however you have it defined. It could be that your SLA is an absolute value based on 24x7 operation, or it could be based on the 9-5 work-day. In all cases, what percentage of the defined time does the system have to be available? Note that I'm calling out a 5, 4, and 3 nines of 24x7 as separate answers as I'd like to see how often these occur and can be met.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

In last week's survey I asked whether you're ever tested your disaster recovery plan, and if so, what happened? (See here for the survey). Here are the results as of 5/25/09:

The 'other' responses are:

  • 2 x "restored to test env regularly. don't know if sla would be met."
  • 2 x "test it regularly, most goes according to plan"
  • 2 x "Test it regularly, people screw up. Was a great win when I obtained the budget for this activity."
  • 2 x "test it regulary and learn new things every time but overall it works"
  • 2 x "we have annual drp test company-wide"
  • 1 x "We have lots of DR plans, some tested, some not."
  • 1 x "We test the dr failover with mirror.however chain replications (subscriber becomes pulisher)we can't"

A good mixture of results, but only around 25% of respondents test it regularly. Rather depressingly, 35% of respondents either don't have a DR plan or have one but have never tested it. Given the stories I see almost every day on the various forums, this doesn't surprise me - but it's still depressing nevertheless.

The term 'disaster recovery' means different things to different people. The word 'disaster' in my mind spans everything that could go wrong and affect whether your system and data is online, available, and performing to spec. The word 'recovery' means any process that allows you to bring your system and data back online, available, and performing to spec. Your disaster recovery plan could be as simple as restoring from the last full database backup, or as complicated as failing over all processing to a remote data center and engaging a 3rd-party company to distribute new DNS routing entries across the Internet. It, of course, depends (I have to work that into every editorial Smile). Lots of people interchange DR and high-availability (HA), but HA is really a set of technologies that you implement to help protect against disasters causing problems. For instance, you might implement database mirroring so that part of the DR plan for a database is to failover to the mirror, keeping the database highly-available, while DR happens on the old principal. Or you might implement auto-grow on the transaction log file so that if it runs out of space the database doesn't become unusable. Neither of these are doing DR, they're preventing a disaster from affecting availability. DR in the second case would be what you do to provision more space for the log so the database can come online again.

Now, this editorial's not going to be about putting together your disaster recovery plan - that's an entire book in itself, as there are many techniques depending on the disaster and the resources you have available to facilitate recovery. If you don't have a disaster recovery plan (which I'm going to start calling DR plan), then you should *know* that come disaster time, you're risking whoever's on duty floundering around making mistakes and potentially leading to more downtime and data-loss than if you had a plan to follow. People panic in times of high-stress and crisis, and without a set of steps to follow, bad things happen. Enough said. You know who you are - go get a DR plan before a disaster happens and you lose time, data, your job, or all of the above. I see them all happen regularly.

Once you have a DR plan in place, the ONLY way to know whether you're going to be able to recover from a variety of disasters is to simulate some, in production. Yes, this is far easier said than done - persuading your business owners to take planned downtime (and possibly lose a bit of revenue) can be a hard argument to make, but unless you do, you can't know that your DR plan will work. One argument I've found effective is wouldn't you rather have all the various admins and DBAs on-site and expecting the test and things to potentially go wrong, than wait until a real disaster occurs in the middle of the night and THEN find out that the DR plan doesn't work and everyone has to scramble when they're least expecting it? Of course, business owners often aren't interested in low-probability potential problems. DR and HA aren't sexy topics UNTIL the company experiences a disaster. Then it's likely to be the top thing on the CEO's mind and you have to have a DR plan in production by Tuesday.

Seriously though, if you're responsible for your system meeting certain SLAs (downtime and data loss - a.k.a. RTO and RPO) then your DR plan actually has to work, no matter how carefully you've designed it. This means you have to try restoring from your backups and seeing if you can do it within your downtime SLA. What about if you have to setup a new server first? What about if there's no power in your building? What if your off-site backups are 200 miles off-site and the network link is down? What if none of your backups work? What if your differential backups are bad, do you still have all the log backups? How does that affect your recovery time? And so on and so on. You might think I'm just making stuff up and these things don't happen, but they do, and everything I'm citing as an example has happened to a customer I've personally been involved with while at Microsoft or since then. And they keep happening over and over again to different people.

If I had to list the most common reasons I see why disaster recovery fails, they are:

  • There are no backups, meaning recovery = data loss
  • Backups don't work or all contain the corruption, meaning recovery = data loss
  • The data volume has increased since last DR test, meaning recovery time exceeds downtime SLA
  • The initial failover when the disaster happens doesn't work because the failover site only has part of the application ecosystem, meaning recovery involves getting the application working on the failover site AND then recovering from the disaster on the main site

Doing an initial test when the DR plan is first produced is great, because at least you know that it works, or there are some things you've missed (which is almost invariably the case). The DR (and HA plans) should be written by the most experienced DBAs, as they're the ones who've "seen it all" and have a good idea of what could go wrong at any point during the recovery. And the plans should be tested by the most junior DBAs, as you can bet that if a disaster occurs at 2am on Thanksgiving morning, it won't be the most senior DBA who's on duty.

Doing a regular test is critical because things change. Data volume increases. Databases get added into the mix. Personnel change. SLAs change. And after a change, if you don't test regularly, then you won't know if your DR plan still works until you have a real disaster. If you can push for a DR plan test and everything works, everyone has increased peace of mind. If you can push for a DR plan test and things go south, you'll be praised for having exposed the problems. But if you wait, and things go south, no-one likes being responsible for unnecessary downtime or data loss - and that doesn't look good on a resume.

Next post - this week's survey!

This week's survey is inspired from many stories I saw on the forums and Twitter this week - mostly bad, one good (someone I'm following is spending the weekend testing their disaster recovery plan - cool!). I'd like to know whether you're ever tested your disaster recovery plan, and if so, what happened? I'll report on the results sometime of the weekend of 5/24/09.

And what do I mean by disaster recovery? The definition varies from just restoring a backup to conducting a full failover of a datacenter - but I'd like you to decide for yourself. I'll editorialize more about this next week.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog) if you have an idea for a good survey.

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

PPS Probably a lot less blogging/Twitter (@PaulRandal) this coming week - I'm teaching all 5 days.

Last week's survey was another two-fold one - when you buy new servers, what architecture to you predominantly buy, and why?; when you buy new servers, which Edition of SQL Server do you predominantly buy, and why? Here are the results as of 5/17/2009.

 

For the first survey, the 'other' values were basically that 64-bit is purchased only if required. For the second survey, nine of the 'other' values were basically that Enterprise Edition is purchased only if required, one person uses Web Edition, and one person uses Developer Edition as they're just a developer.

I don't have a huge amount to say about these two surveys, I really just wanted to confirm my gut feel (and give you all a view of what's happening out there in the field).

For the architecture survey, I was entirely unsurprised to see that almost 80% of respondents are using 64-bit. It's widely known that 64-bit can give you better performance through the availability of more memory for SQL Server to use. Although you can use more than 4GB using AWE on 32-bit, that extra memory is only available for the buffer pool to use - not for general query processing - and AWE access does incur a little overhead. Much has been written about this and I'm not going to duplicate it here. A selection of articles can easily be found at http://www.google.com/search?hl=en&q=sql+server+64+bit+vs+32+bit&aq=1&oq=sql+server+64 Smile More interesting is the small number of respondents who are not able to use 64-bit, either because their corporate policy is 32-bit (maybe because of cost?), they have software that doesn't run on 64-bit systems, or specifically because 64-bit is too expensive.

A few people only use 64-bit if required, and (I imagine) prefer to save money if it not by sticking with 32-bit. I wonder how long it will be until 32-bit servers are not available at all? Certainly future versions of Microsoft server software are starting to become 64-bit only - for instance Exchange 2010 (see here) and SharePoint Server 2010 (see here), will the next version of SQL Server be 64-bit only?

For the Edition survey, again, unsurprising results. 55% use Enterprise Edition for one or a combination of the various "-abilities" (if you make performancability a new word), with almost 10% more having the option to use it if required, and 16% having to stay with Standard because of budgetary constraints. What you may be surprised to see (but I wasn't from my time working side-by-side with the SQL marketing team) is that 20% of respondents don't need Enterprise Edition. There were a lot of improvements to the database engine in SQL Server 2000 and SQL Server 2005 that meant that for intermediate workloads (hmm - just made that up, but you know what I'm trying to say), Enterprise Edition isn't needed. And with synchronous database mirroring available in Standard Edition, you can implement a great, low-cost high-availability plan without paying for Enterprise. As you can see, I don't get any kickbacks from the SQL team for selling Enterprise Edition Smile Saying that, however, there are a lot of very cool features in Enterprise Edition in the various versions - but if you don't *need* them, why pay for the higher Edition just for the sake of it?

One thing I will say to summarize, it's important to look at your requirements in terms of performance, availability, etc to make the right choice of server architecture and SQL Server Edition. If you make the wrong choice, do you think your company will pay to rectify your mistake before the next round of capital-expenditure? Probably not - but you'll definitely pay for the wrong choice with several years worth of hassle trying to make the system be more performant and easily recoverable than your choices allow.

As always, thanks for participating in the surveys - I've had a bunch of mail from people who like to see what other people are doing.

Next up - this week's survey!

Categories:
General | Hardware | Surveys

This week's survey is another two-parter, part 1 of which suggested by Mike Walsh on Twitter:

  1. When you buy new servers, what archictecture do you predominantly buy, and why?
  2. When you buy new servers, which Edition of SQL Server do you predominantly buy, and why?

I know the surveys below don't have a 'whatever suits the project' (thanks Dave Dustin), if that's the high-order bit for you, please use the Other option and I'll collate those results manually.

I'll report on the result some time over the weekend of 5/16/09.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog) if you have an idea for a good survey.

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

Categories:
Surveys

Last week's survey was two-fold - what's the largest SQL Server database you manage, and how many SQL Server databases are you responsible for managing. Here are the results as of 5/10/2009.

 

As far as the database sizes are concerned, the distribution curve isn't surprising, but the average size of databases *is*. 70% of respondents have a database over 100GB, with half of those being 500GB or more. One lucky (or unlucky, depending on how you look at it!) person gets to play with a 20+TB database. There some truly huge SQL Server databases out there - for instance the 1.1 petabyte astronomical database described here, which adds 1.4TB every night. Back in December 2003, the largest SQL Server database was a 5TB Verizon database. Now multi-TB databases seem quite common when talking to large customers.

The distribution curve for the number of databases also seems unsurprising, except for the uptick at the 150+ databases mark. Fully 1/4 of respondents are responsible for more than 200 databases! Even if the majority of the databases are small, that's still a huge amount of management to consider and context to have to know about.

So what was the point of this survey? Well, I've heard from quite a few people this year that some of their DBA colleagues have been let go and they've had to pick up more responsibility, whether they want to or not. Just last week, Microsoft laid off 3000 more people worldwide, including some in their various IT departments. In the DBA class we were teaching last week, it was obvious that some of the attendees were affected and had a lot more on their plates.

With the average size of databases growing, and the number of databases any one person is responsible for also growing, it's imperative that the overall environment is as easily managed as possible. What does this mean? Here's a 2-minute list of stuff off the top of my head that I think helps make an environment more manageable:

  • Setting up each database so that the important parts can be recovered as quickly and easily as possible
  • Setting up each database so that there are no performance bottlenecks (these two were both kind of addressed by my previous survey Physical database layout vs. database size)
  • Making sure each database has the right backup strategy (from last week's survey Importance of having the right backups)
  • Making sure that each database has the correct monitoring set up. For instance:
    • SQL Agent alerts on the instance for high-severity errors (see Easy monitoring of high-severity errors: create Agent alerts)
    • Monitoring data and log file sizes to avoid auto-growths
    • Monitoring index fragmentation levels
    • Pro-active monitoring for things like high disk-queue lengths or runaway queries
    • Regular consistency checks to find corruption
  • Security is setup appropriately (see TechNet Magazine: feature article on Common SQL Server Security Issues and Solutions)
  • Maintenance jobs are automated as much possible (e.g. backups, consistency checks, index maintenance, and statistics maintenance). There are a bunch of scripts people have published to help with this: Tara Kizer has a lot on her blog, Ola Hallengren has a good one here, as do many others (no offense to anyone else with scripts, these are just the two that spring to mind - by all means reply with a comment to let me know of your popular script)
  • Database settings adhere to common best-practices (e.g. auto-shrink off, auto-grow on (and managed!), auto-create and auto-update statistics on, page checksums on)

And I'm barely touching the tip of the iceberg here. Most of these are just database maintenance best practices - with the idea that the healthier a database is, the less likely something will go wrong that demands DBA attention. And in an environment where potentially hundreds of databases are under the control of a single person, avoiding problems is of paramount importance.

Ok - cup of coffee later and I have more stuff to add to the list:

  • Consistency of settings and scripts across database and instances, so what you're familiar with in one database context translates to all others
  • Environment-wide run-book that covers all aspects of day-to-day operations, so all DBAs do the same thing and a DBA can help out or pick up databases easily
  • Environment-wide disaster recovery guide - written by the most senior DBA and tested by all DBAs down to the most junior
  • T-SQL source code control
  • Management of access to databases
    • Physical access to servers is limited
    • Network access to servers is limited
    • SA access is limited
    • Developers can't deploy new code without going through QA first
    • And so on
  • High-availability technologies in place to aid recoverability, with plenty of practice of failing over applications
  • Use of tools to increase productivity. Most major 3rd-party tool vendors have tools that can help in a myriad of DBA activities - I'm not going to name any in particular as I don't want to favor any over any of the others.

Talking about tools, SQL Server 2008 itself has some stuff built in that can increase productivity. Kimberly did a great post last November on Central Management Servers - see SQL Server 2008 Central Management Servers - have you seen these?. There's also Policy-Based Management, which is a good first step along the road to centralized policy setting, testing, and enforcement (although it has some nits that need to be fixed in v2), and Performance Data Collection which can be used to easily collect perf data (same kind of stuff that Activity Monitor has) for multiple instances and stored in a single location - Kimberly will be blogging more about these two features in the future.

Overall, having a haphazard way of setting up and maintaining databases and instances is going to make it really hard to scale up size and number of databases in an environment without investing heavily in more people. Taking a more disciplined, planned approach, learning from others - using others' scripts and practices, and developing streamlined processes will lead to increased productivity, decreased stress, and happier, stronger DBAs. No-one wants to be in constant fire-fighting mode.

This has been a bit of a ramble, but that's the point of these editorial-style posts. I'm sure you know a lot of this, and I'm sure I've missed some obvious points, so feel free to comment with more tips, tricks, and links!

Next up - this week's survey!

Jason Massie posted an interesting statistic yesterday - Facebook has 1.5 petabytes of image storage, and it grows by 25TB daily - I wonder how they store and manage it?

In this week's survey, I'm interested in two things: what's the largest single SQL Server database in your company and how many SQL Server databases are you responsible for managing? I'll report on the results sometimes over the weekend of 5/9/09.

A note on the second survey: if you manage a team of DBAs, but don't manage databases directly yourself, answer for your individual DBAs, rather than the team as a whole.

Thanks!

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

Last week's survey was on what kind of backups you take, along with the recovery model used (see here for the survey). Here are the results as of 5/2/2009.

 

The 'other' responses were combinations of the other answers.

This survey is a bit of a pre-cursor to my article on Understanding SQL Server Backups that will be in the July TechNet Magazine (available start of June). In that article I explain how the three major backup types work, and then how to combine them into an effective backup strategy, so I don't want to steal all my own thunder.  In the spirit of my weekly surveys, this is a 20-minute, brain-dump editorial, rather than a very carefully planned out article.

When I'm teaching classes on Database Maintenance or High-Availability, I always tell people not to plan a backup strategy; plan a restore strategy. You don't want to come up against a disaster recovery situation and find that the only backup you have of your multi TB 24x7 busy OLTP database is a full from several weeks ago. Kiss your job goodbye on that one. You have to make sure that you have backups that allow you to recover as quickly as possible and with the minimum of data loss.

So, you're probably taking backups so that you can restore when disaster strikes. Some of you may also be taking backups because:

  1. You periodically restore the database onto a reporting server, onto a data warehouse server, or onto test/dev servers
  2. You've got log shipping implemented (backup, copy, restore; backup, copy, restore;...)
  3. You've got database mirroring implemented (and so you must use the FULL recovery model, and thus must take log backups to manage the transaction log)
  4. You're in the FULL recovery model, even though you're not interested in up-to-the-minute or point-in-time recovery, and must take backups to manage the size of the transaction log

The survey included what recovery model you're running in too - basically SIMPLE or FULL. I don't know anyone that runs all the time in the BULK_LOGGED recovery model; and most 24x7 systems cannot switch into BULK_LOGGED because of the possibility of crashing and not being able to take a tail-of-the-log backup, if a minimally-logged operation has occurred since the last log backup.

In the SIMPLE recovery model, you cannot take log backups - so you're basically saying that you don't need to be able to do up-to-the-minute recovery, point-in-time recovery, or single-page restores. That's cool - it totally depends on your situation, and is you're running in SIMPLE it means you understand that there's no point running in FULL and having to take log backups if you're not interested in using them. If you're in the group covered by #4 above, switch to SIMPLE!

In the FULL recovery model, you have to take log backups - plain and simple - otherwise the log will grow out of control and eventually your database will grind to a halt when the log runs out of space. I've done a few blog posts about that so I won't labor the point (see Search Engine Q&A #1: Running out of transaction log space, Search Engine Q&A #23: My transaction log is full - now what?, and Importance of proper transaction log size management).

In either recovery model, there's the question of just how often should the various kinds of (I like to think of) 'mandatory' backups be taken, and whether to use the 'optional' differential backups.  Again, not stealing my own thunder from TechNet Magazine: a full backup is a copy of all the data in the database; a log backup is all the transaction log generated since the last log backup (or first full/differential following a break in the log backup chain); a differential backup is all data that has changed since the last *full* backup.

You need to take regular full backups - but just how regular depends. If the database is very large, and you need to keep your backups around for regulatory purposes, you might choose to take a full backup every couple of weeks or a month, with compression. Commonly I see people taking a full backup once a week.

Almost 20% of respondents are in SIMPLE and only take full backups - I wonder how many realize that a full backup only gives you a single point-in-time to which you can recover - and you lose all work since the last full backup. In the SIMPLE recovery model, you can't take log backups, so if you want to reduce the amount of data loss when a disaster occurs, you'll need to take differentials too, which only a handful of respondents do. Although this still means you'll have some data loss, it's vastly reduced. The amount of potential data loss will be the amount of work since your last (e.g. daily) differential backup, rather than since your last (e.g. weekly) full backup.

In the FULL recovery model, only a few % of respondents are NOT taking log backups - which means they shouldn't be in the FULL recovery model, or are being forced to (e.g. from using mirroring) and don't realize that having a redundant copy of the database isn't a good enough HA strategy - you have to have backups too in case of a secondary failure.

The vast majority of respondents use the FULL recovery model taking full and log backups (45% - which is what I expected) and about 25% are also taking differentials too. This is a more advanced strategy and can seriously REDUCE your downtime in the event of a disaster. A differential backup basically short-circuits the need to restore all the log backups that were taken in the time between the last full backup and that differential backup. You'll find out more on this in the article. For the ultimate in flexibility and fast recovery, this is the way to go - but at the cost of a little more complicated backup strategy, and extra storage space for the differential backups.

Now, what about the exotic answers at the end of the survey?

  • OS-level backups: this isn't a popular solution *at all* because of the complexity of getting the SQL data back out from the OS-level backup and recovered. I'm not surprised that no-one selected this.
  • SAN-level backups: I've seen a few customers do this, with mixed results. You must make sure that the SAN admin knows what they're doing - I heard of one customer who had data and log files on different LUNs. The SAN admin snapped the data LUN, and *then* the log LUN. Every so often, corruption would occur...
  • Shutdown SQL and copy the files: Just don't even think about this. Taking downtime to take a backup is daft, and what if the database is corrupt and won't re-attach?
  • Don't take backups: No need to discuss this one.

Ok - that's a quick blast that should give you some idea of why you'd want to make sure you've got the *right* backups, not just any old backup strategy.

Next post - this week's survey!

(PS I'm really enjoying being on Twitter - lots on interesting stuff. See me at http://twitter.com/PaulRandal)

In this week's survey I'm interested in what kind of backups you take, and also what recovery model you mostly use. If you have multiple database with different strategies, by all means respond multiple times. The more responses the better! I'll report on the survey results some time over the weekend of 5/2/2009.

A couple of notes on the survey:

  • If you're using BULK_LOGGED at all, just answer as FULL.
  • When I say full backup, I mean either full database, full filegroup, or full file backup. Same for differential.
  • I don't care whether you use a tool outside SQL Server to take your SQL backups (e.g. LiteSpeed) - it's the type of backup that's interesting.

Thanks!

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

A couple of weeks ago I kicked off the latest survey, on what the physical layout of your databases are and why you have them that way (see here for the survey). I let the survey run for a while to get a good sampling, and I wasn't disappointed, with over 1000 responses! Here are the results as of 4/27/2009.

Just like any other 'best practice' kind of topic, the question of how to design the physical layout of a database provokes a lot of (sometimes heated) discussion. There are lots of options and there are even more factors to consider - so the best answer is my perennial favorite "it depends"! In this post, I don't want to tell you how I think you should layout your database - instead I want to discuss some of the options and let you make up your own mind, with the added benefit of data on what your peers are doing with their databases. The main point of this survey was to see what people are doing, rather than as a driver for an editorial blog post.

As you can clearly see from the results above, and predictably, the distribution of layout types shifts as the database size increases - but I was very surprised by the number of single file databases over 10GB. Rather than go through each option in the survey, I'm going to talk a bit about some of the things to consider when planning a layout.

Underlying I/O subsystem

This could be the most important factor to consider. If you only have a single physical drive, for instance, there's arguably not much point creating multiple data files, as that will force the disk heads to bounce back and forth to the different file locations on the disk. On the other hand, if you have a SAN with several thousand drives grouped together into multiple LUNs, your possibilities are a lot wider (and maybe much harder to come up with the optimal layout). Several people asked if I'd go into depth around having multiple controllers, and different drive layouts in a SAN - and my answer is no. I'm not an expert at storage design, which, like indexing, is both an art and a science. There's a good whitepaper that discusses some of this: Physical Database Storage Design, which I helped review back in my MS days.

Performance, recoverability, manageability

Having multiple files with separate storage for each allows reads and writes to be parallelized for increased performance, lowering the amount of disk head contention. When a checkpoint occurs, and pages are written to disk, spreading the I/O load over multiples files can speed up the checkpoint and reduce the IOPS spikes that you may see. It can also lead to reduced contention for the various allocation bitmaps - in the same way as I've described for tempdb. In user databases with a very high rate of allocations, contention can arise on the GAM pages - but it's not common. Some people also advocate having separate filegroups for tables and indexes, and although this can sometimes be more trouble than it's worth, and often turns into a religious debate, I have heard of people getting a perf boost from this.

One of the most convincing reasons (I find) for having multiple filegroups is the ability to do much more targeted recovery. With a single file database, if it gets corrupted or lost, you have to restore the whole database, no matter how large the file is - and this can seriously affect your ability to recover within the RTO (Recovery Time Objective) agreement. By splitting the database into multiple filegroups, you can make use of partial database availability and online piecemeal restores (in Enterprise Edition) to allow the database to be online as soon as the primary filegroup is online, and then restore the remaining filegroups in priority order - bringing the application online as soon as the relevant filegroups are online. You can even use this layout to spread your backup workload - moving to filegroup-based backups instead of database backups, although this isn't very common.

As far as manageability is concerned, there are a few reasons to have multiple filegroups. Firstly, you can isolate a table that requires a lot of I/O (e.g. in terms of index maintenance) on separate storage from other tables, so that maintenance operations (and the I/O overhead of doing them) doesn't interfere with the I/O of the other tables. Also, you can provision different kinds of storage for different tables - in terms of disk speed and RAID level (redundancy), for instance. If you want to be able to move data around, you can do it much more easily if the database is split up, than if it's a single file.

Summary 

Ok - so I lied. I *am* going to offer advice - against one of the options: single filegroup, single file. For smaller databases, this is fine - but as the database size gets larger, say, over tens of GB, then having a single file can become a serious liability. With a single file database (or even a single filegroup database), you lose most of the benefits mentioned above.

Bottom line - as your databases get larger, you're going to need to think more carefully about their layout, otherwise you could run into big problems as your workload increases or when disaster strikes. As the survey results show, this is what your peers are doing.

Next post - this week's survey!

Last week I kicked off a survey about the physical layout of your databases, based on database size. I'm running it for two weeks and will report on the results next Friday.

One thing I missed was 'multiple filegroups over multiple drives/arrays/LUNs for any reason' in the > 1TB category - not enough options on the free surveys I use. If you fall into this category, use the final option on the >1TB survey and I'll lump them all together.

As of this morning there were 860 responses to the survey so far - very cool. Check it out at http://www.sqlskills.com/BLOGS/PAUL/post/Weekly-survey-does-size-really-matter-or-is-it-what-you-do-with-it.aspx.

Thanks!

Categories:
Surveys

This week's survey is a little more complicated. I'm interested in the physical layout of your databases. I've got four surveys, for a variety of database sizes. Please vote multiple times in each survey, as you see fit - and by all means forward this link to your friends/clients/etc or re-blog/tweet it. I'm going to report on this survey in two weeks, to give a bit more time for people to respond (and because we're travelling next week). I think we're going to see some interesting statistics come out of this - the more people that respond the better. I'll report on it 4/24/09.

One thing to note - this is just for user databases, not for tempdb. In the surveys, "multiple filegroups" implies multiple files too, and if you don't have them spread exactly one per drive/etc, just choose that option - I only have 10 options to choose from in the free surveys. 

(If you're in the over 1TB range and have multiple files/filesgroups spread over multiple drives/arrays/LUNs, vote using the last option on the >1TB survey and I'll lump them together.) 

Phew - thanks!

Last week's survey was on how *you* manage the size of your transaction log (see here for the survey). Here are the results as of 4/10/2009 - the most popular survey yet:

In my opinion (guarantee that I'm going to say stuff that some of you won't agree with!), there are only two valid answers in this survey: #1 - run in the FULL recovery model and take regular log backups, and #4 - run in the SIMPLE recovery model all the time. The last answer is applicable if you run out of log space even though you're in either of these situations but isn't a general strategy like #1 or #4. IMHO, you should be in one of these two situations and in the rest of this editorial I'll explain why. I'm not going to touch on *why* your transaction log might start to fill up, instead here are some links:

Now for the survey options:

  1. Take regular log backups. I'm very pleased to see the vast majority of respondents doing this, as it's the only valid log size management strategy when in the FULL recovery model (same thing applies to BULK_LOGGED of course, but with a few twists). Once you take that first database backup in the FULL recovery model, SQL Server assumes you're going to manage the transaction log size from that point on by taking log backups. Unfortunately that isn't documented in big, red flashing letters in Books Online - so people can get themselves into troubel inadvertently. Also, the FULL recovery model is the default, and is required for database mirroring - which further adds to the potential for people (such as involuntary DBAs) to accidentally switch into I-will-take-log-backups mode and then not take them. If you don't want to take log backups for recovery purposes, or you don't want to use database mirroring, don't use the FULL recovery model - it's as SIMPLE as that (ha ha). You might argue and say that you're only using FULL because of database mirroring, and don't want to take log backups. I'd argue back and say that if you care enough to have hot standby of your database, you must also take backups - as you can't rely solely on a redundant copy of your database on a different server.
  2. Use BACKUP LOG WITH NO_LOG/TRUNCATE_ONLY regularly. These two commands do basically the same thing - allow the log to be cleared without taking a log backup. What's the point if you're not taking log backups? - just switch to SIMPLE and let the checkpoints clear the log. In fact, in 2008 these two commands have been removed. See my blog post BACKUP LOG WITH NO_LOG - use, abuse, and undocumented trace flags to stop it.
  3. Use BACKUP LOG WITH NO_LOG/TRUNCATE_ONLY when it fills up. Same as above. You might argue that you're only keeping the log around in case there's a disaster, so that you can take a log backup at that point and use it to recover up to the point of the disaster. I'd argue that's broken on two counts: 1) what if the log file is damaged and you can't back it up? 2) that's *all* the transaction log since the last full database backup you took (if you break the log backup chain and then take a full database backup, that backup becomes the base of subsequent log backups) so that may take a long time to restore and replay...
  4. Run in the SIMPLE recovery model all the time. If you don't need to use FULL, don't. Running in SIMPLE is perfectly acceptable, as long as you don't mind losing work in the event of a disaster.
  5. Switch to SIMPLE when it fills up, then back to FULL. This is like #s 2 and 3 - what's the point?
  6. Switch to SIMPLE when it fills up, shrink the log, then switch back to FULL. This is worse than 2, 3, or 5. If you shrink the log, then it's going to grow again - possibly causing VLF fragmentation (see Transaction Log VLFs - too many or too few?), and definitely causing your workload to pause while the log grows, as the log can't use instant initialization (see Search Engine Q&A #24: Why can't the transaction log use instant initialization?).
  7. Shutdown SQL Server and delete the transaction log file(s). Just don't get me started on this one - I'm glad no-one 'fessed up to doing it. There are many reasons why this is daft, including: 1) you have to shutdown to do it, so your workload is off-line 2) if the database wasn't cleanly shut down, it won't be able to be started again without using EMERGENCY mode repair, and your data will be transactionally inconsistent 3) as the log can't be instant initialized, the database won't come online until the log has been created and zero'd. Just don't do this. Ever.
  8. Add more transaction log files. As a general strategy for managing the size of the log, this is not good. Extra log files have zero effect on performance (the myth that SQL Server writes in parallel to the log files is just that - a myth) and make management more tricky. However, if your log fills up for some out-of-the-ordinary reason, then adding another log file may be the only way to allow the database to keep running. If you have to do this though, make sure that you remove the extra log file(s) when you can to keep log management uncomplicated.

To summarize, if you want to be able to take log backups to aid in point-in-time or up-to-the-second recovery, use the FULL recovery model. If not, use SIMPLE and you won't need to mess around with the log when it fills up because you're not taking log backups.

Next post - this week's survey! (And thanks to all those who are responding to them!)

PS For those of you who sent me details about your databases from the survey back at the start of March (see here) - I haven't forgotten. I was waiting to get a decent sample size and now I'm going to go through the data. If you want to send me any more data, you've got until Sunday.

In this week's survey, I'm interested in how *you* manage the size of your transaction log. I'll report on the results around 4/10/2009.

Thanks!

(No comments please... wait till the results post next week) 

Last week's survey was on how you should store large-value character data in SQL 2005+ (see here for the survey). Here are the result as of 4/3/2009 - and I think my favorite answer is starting to catch-on:

My favorite answer is, of course, it depends! For all those who didn't answer 'it depends', your answer is valid, but only for particular circumstances, as each method has its pros and cons and won't be applicable in all cases. It's extremely important when designing a schema to consider how to store LOB data, as making the wrong choice can lead to nasty performance issues (where 'performance' is a catch-all to include things like slow queries, fragmentation, and wasted space). Now I'd like to run through each of the options and detail what I think of as the pros and cons. A couple of definitions first: 'in-row' means the column value is stored in the data or index record with the other columns; 'out-of-row' or 'off-row' means the column value is stored in a text page somewhere in the data file(s), with a physical pointer stored in the data/index record (taking either 16 or 24 bytes itself).

  • As a N/CHAR column. This is a great choice when the data that's stored in the column is a fixed size all the time, and always uses the full width of the column. Any time that the data may be smaller than the defined wdith of the column, space is being wasted in the row. Wasted space leads to fewer rows per page, more disk space being used to store the data, more I/Os to read the data, and more memory used in the buffer pool. However, if the character values are very volatile, and can change size, then having a fixed-width column can avoid the problem of a row having to expand and there not being enough space on the page to allow that - leading to a fragmentation-causing page split in an index (or forwarding record in a heap). There's a tipping point that can be hard to identify for your particular application...
  • As a N/VARCHAR (1-8000) column. For data values less than 8000 bytes, this is the common choice as it avoids wasted space. However, if the application can change the size of the data after the initial creation of the row, then there is the possibility of fragmentation occuring through page-splits. In SQL Server 2005+, a row can also be created that is more than 8060 bytes - one or more variable-length columns is pushed into off-row storage and replaced by a physical pointer. This means any access of the column has to do an extra I/O to reach the data - and this is commonly a physical I/O as the text page is not already in memory. This can lead to hard-to-diagnose performance issues if a query selects the column and some rows have the data in-row, and some out-of-row. Also, if the data values tend towards the larger end of the 1-8000 byte spectrum, individual rows can become vary large, leading to very few rows per page - and the problems described in the first option. If the data isn't used very much, then storing it in-row like this isn't very efficient.
  • As a N/VARCHAR (MAX) column in-row. This has the same pros and cons as the option above, with the added benefit that the value can grow larger than 8000 bytes. In that case it will be pushed off-row automatically, and start incurring the extra I/O for each access. These data types also work with the intrinsic functions in the same way as the character data types discussed above. I guess one drawback of this type compared to FILESTREAM is that it's limited to 2GB. Also, if there's a LOB data column in the table definition, the table's clustered index cannot have online operations peformed on it - even if all the LOB values are NULL or stored in-row!
  • As a N/VARCHAR (MAX) column out-of-row. The drawback of storing this data out-of-row is that accessing it requires an extra I/O to retrieve it, but if the data isn't used very much then this is an efficient way to go, but still uses space in-row to store the off-row pointer. An additional benefit of storing the data off-row is that it can be placed in a separate filegroup, possibly on less expensive storage (e.g. RAID 5) - but then there's the drawback that it can't be moved after being created except with an export/import operation. This option has the same online operations drawback as storing the data in-row.
  • As a N/TEXT column in-row. This has the same pros and cons as the N/VARCHAR (MAX) column in-row option, but these data types are deprecated and don't work with the majority of the intrinsic functions.
  • As a N/TEXT column out-of-row. Same as above.
  • In a seperate table and JOIN to it when required. This option is great when the data isn't used very much, as it doesn't require any storage at all in the main table (except for a value to use for the JOIN), but it does require some extra up-front design and slightly more complicated queries. There's another HUGE benefit to doing this - by moving the LOB data to another table, online operations become available on the main table's clustered index. (This concept is 'vertical partitioning' a huge topic in itself...)
  • As a FILESTREAM column. (Yes, I didn't have this in the survey, but it's a possibility). If your data values are going to be more than 1MB, then you may want to consider using the FILESTREAM data type in SQL 2008 to allow much faster access to the data than having to read it through the buffer pool before giving it to the client. There are lots of pros and cons to using FILESTREAM - see my whitepaper for more info here.

So, as you can see, the best answer for a general question like this is definitely It Depends!. Although I haven't covered every facet of each storage option, the aim of this post is to show that it is very important to consider the implications of the method you choose, as it could lead to performance problems down the line.

Next post - this week's survey!

(And it's official - this is blog post 39 this month, making this my most prolific month yet for blogging. An arguably dubious achievement...)

In this week's survey, I'm interested in your views on the best way to store large-value character data. I'll report on the results next week (around 4/3/09).

Thanks!

Categories:
LOB data | Surveys

Last week's survey was on what method you use to run consistency checks (see here for the survey). Here are the results as of 3/27/09 - again, very encouraging:

As you can see, 70% of respondents run DBCC CHECKDB on the production server, either with PHYSICAL_ONLY or without. For those running without it, be aware that using PHYSICAL_ONLY turns DBCC CHECKDB from a CPU-bound process into an I/O-bound process, and makes it run (potentially) magnitudes faster. It will still evaluate page checksums and torn-page protection, just skipping the higher-level logical checks.

I'm surprised to see so many people using a completely separate system to run consistency checks (restoring a backup and running a CHECKDB on the restored backup) - I've been a proponent of this method for a few years now, but I didn't think it had caught on so much (assuming a reasonably representative sample of readers responded to the survey). This method allows the entire consistency checking workload to be offloaded, and completely validates the backups used - but has the downside of requiring extra disk space on another server to restore the backup (I wish they'd build the system I got a patent for - to consistency check the database inside a backup without restoring it - see here).

Three of these choices I threw in to see if anyone was doing them so I could explain why they're not good methods to use.

  • "Don't run any consistency checks at all". I'm sure I don't need to labor the point with this one - you need to run consistency checks as well as having some kind of page protection turned on. If you don't proactively check for corruption, when it does occur it will likely be more widespread, and harder to recover within your data-loss and recovery time objectives than if you'd discovered it earlier. I've written lots about this in the Corruption and CHECKDB From Every Angle categories.
  • "Run DBCC CHECKDB on a database snapshot on a mirror database". I was asked about this several times while here at SQL Connections too. Database mirroring works by shipping transaction log records between the principal and mirror databases, NOT by shipping database pages - so if a page gets corrupted on the principal, the corruption will not be transferred to the mirror. This means that DBCC CHECKDB on a snapshot of the mirror does not reflect the consistency state of the principal database at all. Saying that, it is however, possible for corruption in the principal to affect the mirror. Imagine a column value is corrupted on disk by the principal's I/O subsystem. If that value is then read and used to calculate another value, which is then persisted - that calculated value is also 'corrupt' and will be reflected in the mirror database. Kind of an insidious, second-order corruption effect.
  • "Use BACKUP WITH CHECKSUM to validate page checksums, no DBCCs". Another option that sounds feasible, but in fact isn't for a couple of reasons. Firstly, BACKUP WITH CHECKSUM will stop when it finds a bad checksum, whereas DBCC CHECKDB will continue reading the rest of the database and tell you everything that's wrong with it. Secondly, BACKUP WITH CHECKSUM will only check those pages that HAVE page checkums, whereas DBCC CHECKDB will consistency check everything, regardless of whether a page has a checksum or not. For a database upgraded from an earlier version, where not every page will have a page checksum since you enabled them after upgrading, this is a critical point.

To summarize, this week's results were great, with almost 90% of respondents running some kind of consistency check. If you're not running any, for whatever reason, rethink your decision - there's always a way to run some kind of consistency checks and give yourself more peace of mind. See CHECKDB From Every Angle: Consistency Checking Options for a VLDB for more info.

Next post - this week's survey!

In this week's survey, I'd like to know *how* you run consistency checks, not how often. I'll report on the results next week (around 3/27/09)

Thanks!

Last week’s survey was on what kind of regular index maintenance you perform (see here for the survey) as a way of kicking off a new series I’m writing around index maintenance. Here are the results as of 3/21/09 – I find them very encouraging:

As you can see, about 2/5 of respondents are performing some form of analysis-based fragmentation removal (answers 5+6), which I consider the best way to perform index maintenance, if you’re willing to invest the time involved to set it up. It allows the least amount of work to be performed, for the most targeted performance gains – and so is especially appropriate for 24x7 systems where there’s a minimal or non-existent maintenance window.

The next best option is to do all rebuilds or all defrags based on a fragmentation threshold (answers 3+4), which about 1/5 of respondents do. This also allows work to be limited, but by choosing only a single method of removing fragmentation, there are pros and cons. Now, the survey was limited to a number of questions so I couldn’t explore what the threshold is that people are using (e.g. logical fragmentation, page density, extent fragmentation, or something else). Some measures are good to use and some not so good, and I’ll be exploring the various counters and ways of determining fragmentation as the series progresses.

Either of the options to operate on all indexes regardless of fragmentation (options 1+2) can lead to lots of wasted resources (disk space, transaction log space, I/Os, CPU) by operating on indexes that are not fragmented in the first place, or for which fragmentation removal has no benefit for workload performance. About 1/3 of respondents do this. This isn’t surprising to me as rebuild-all-the-indexes-every-night/week is a very common index maintenance plan for “involuntary DBAs” who know that index maintenance is important, but don’t have the knowledge or training to implement a more sophisticated maintenance plan. This growing size of this group of people is one of the main reasons I’m going to write this series.

Doing absolutely nothing for index maintenance, which about 1/10 do, is usually not a good idea, as indexes in a database that’s not read-only commonly become fragmented over time. However, these people may know they don’t suffer from fragmentation issues, or that removing fragmentation has no effect on workload performance. However, my suspicion (based on what I see in the field) is that some of these respondents don’t realize the benefits of performing index maintenance.

You may be surprised to hear that I don’t consider doing nothing to be the worst choice. That dubious honor goes to answer 7 – doing any kind of index maintenance followed by a database shrink operation – as 3 respondents do. A post-maintenance shrink operation may well undo some of the benefits of the maintenance by introducing massive amounts of index fragmentation – see my blog post Auto-shrink - turn it OFF! for details of how bad this can get.

I’m actually pretty encouraged by these results though. Compared to how things were around 1999 when I wrote DBCC INDEXDEFRAG and DBCC SHOWCONTIG for SQL Server 2000, these results show that knowledge in the field (or at least in the group that reads my blog and responded) has vastly improved. My aim for the forthcoming series about index fragmentation and maintenance is to increase knowledge a lot more broadly.

Next up - this week's survey. Thanks for reading!

I'm about to start a new series of post about index fragmentation and removing it. For this week's survey, I'd like to know what index maintenance you do to address fragmentation (in any of its forms) - I'll report on it in a week.

Thanks

Last week I kicked off the first weekly survey - on whether you validate your backups or not (see here for the survey). The results are very interesting (as of 3/13/09):

As you can see, almost 25% of respondents never validate their backups! And a further 25% only validate them occasionally, with 30% doing some kind of regular checks, and only a handful checking all the time.

While these results may seem shocking to you, based on what I've heard when teaching, they're pretty normal. There are lots of reasons why DBAs may choose not to validate backups as often as they should, including:

  • Not enough time to restore the backups to check them
  • Not enough disk space
  • Not part of the day-to-day operations guide
  • Don't see why it's important

Kimberly and I have a saying (well, to be fair, Kimberly coined it): you don't have a backup until you've restored it. You don't know whether the backup you just took was corrupt or not and will actually work in a disaster recovery situation.

Can you ever get a guarantee? No. Here's an analogy, taken from a very old post of mine. Consider Paul, who works for the Seattle Police Department in traffic control. Paul's in a control room somewhere in the city with a large bank of monitors connected to various traffic cameras. Paul's job is to cycle through the cameras every 1/2 hour, looking for traffic accidents. At the end of the 1/2 hour cycle, if Paul han;t seen any accidents then he knows that there are no accidents in the city.

Ah - but hold on. Does Paul really know that? No. All Paul knows is that at the point he looked at a particular camera, there was not an accident at that spot in the city. The very instant he switches to another camera feed, an accident could happen at a spot covered by the previous camera.

The same is true for validating backups. As soon as you've validated a backup, it could then be corrupted by the I/O subsystem, but at least you know that it was valid at some point. But what if that happens, I hear you ask? Well, then you need to have multiple copies of your backups, and you should not rely on backups as the only method of disaster recovery. A good high-availability solution includes as many technologies as you need to mitigate all risks - and backups are just one of those technologies. You're going to have to have some kind of redundant system too that you can fall back on (or mayb even immediately failover to, depending on your particular disaster recovery plan). But, saying that, you can't rely on the redundant server either - if it goes wrong, you'll need your backups.

So - whichever way you look at it, validating backups is a really good practice to get into so you don't get bitten when it comes to the crunch. When I teach, I've got many stories of customers losing data, business, time, and money (and DBAs losing their jobs) because the backups didn't work or were destroyed along with the data. Here's one for you (simplified, and no I won't divulge names etc). Major US investment firm decides to provision new hardware, so takes a backup of the database storing all the 401k accounts for all their customers (private and corporate), flattens the hardware, and goes to restore the backup. The backup is corrupt - on SQL 2000, where there's no RESTORE ... WITH CONTINUE_AFTER_ERROR. What happened? Well, the SQL team and Product Support had to get involved to help get the data back, but people in the firm lost their jobs and it cost a lot of time and money to recover the data. If only they'd had multiple copies of the backup, and tested their backup before removing the database (or better yet, restored the database on the new hardware before flattening the old hardware). They learned a costly lesson, but they did change their practices after that.

Unfortunately this is so often the way - people don't realize they need to validate backups or have an HA plan UNTIL they have a disaster. Then suddenly its the top priority at the company. Being proactive can save a lot of grief, and make you look good when disaster strikes.

Backups can be unusable for a number of reasons, including:

  • The full backup is corrupt, because something in the I/O subsystem corrupted it.
  • A backup in the log backup chain is corrupt, meaning restore cannot continue past that point in the chain.
  • All backups following a full backup are written to the same backup set, but the WITH INIT clause is used accidentally on all backups, meaning the only backup present in the backup set is the last one taken.
  • An out-of-band backup was taken without using the WITH COPY_ONLY clause and the log backup chain was broken (see BACKUP WITH COPY_ONLY - how to avoid breaking the backup chain).
  • The backups worked but the database contained corruption before it was backed up (kind of a separate issue).

The only ones that are out of your control are the first two, but they can be mitigated by having multiples copies of backups. All of these though, can be avoided at disaster recovery time by reguarly restoring your backups as a test of what you'd do if there was a real disaster. You might be surprised what you'd find out...

This is more of an editorial style post than a deep technical or example script post - I'm going to start doing more of these around the weekly surveys. Next post - this week's survey.

Thanks!

Whenever I teach a class I'm amazed at the number of DBAs who don't know about instant file initialization. Without wanting to redo blog posts that others have done, in a nutshell it allows a file to be create or grown 'instantly' by not having to zero the space in the file. This is especially useful in disaster recovery situations, where restoring a database may have to create the data files first - and with instant file initialization turned on it cuts out a bunch of downtime. See Kimberly's blog post Instant Initialization - What, Why and How? for more details.

Anyway, the reason I'm posting today is to clear up some misconceptions that I keep hearing.

Misconception: instant file initialization is on by default. No it isn't. There's a tiny security risk with having it enabled (for systems where the SQL data drives are shared by sensitive non-SQL files) so it's off by default and requires granting the SE_MANAGE_VOLUME_NAME permission (more commonly known as Perform Volume Maintenance Tasks). Kimberly's blog post explains how to enable it.

Misconception: instant file initialization is Enterprise-only. No it isn't.

Misconception: instant file initialization applies to log files too. No it doesn't. Log files cannot be instant initializated - my blog post Search Engine Q&A #24: Why can't the transaction log use instant initialization? explains why not.

Misconception: NTFS performs the zeroing when instant file initialization is not enabled. No it doesn't. SQL Server does it, because it can do it faster than NTFS. When instant file initialization is enabled, SQL skips zeroing the file and instead just calls the Windows API SetFileValidData. The special permission is required to be able to call that API. MSDN has more details at http://msdn.microsoft.com/en-us/library/aa365544(VS.85).aspx. Ok - that took a bit of remembering, and digging around on MSDN.

Misconception: when instant file initialization is enabled, pages are zero'd before they're written. No they're not. An entire 8k image is written to disk, which overwrites that uninitialized 8k portion of the file. Another flavor of this misconception is that if a page is allocated from an extent in an instant initialized file, the unallocated pages in the extent are zero'd out - that's not true either.

Hope this helps

PS And a perfect opportunity for another survey:

There's been an interesting discussion on SQLServerCentral about whether this question is valid for a DBA interview: what's the name of the executable that runs SQL Server?

My view is that it's a perfectly valid question, based on the cliched premise that the more you know, the further you go. I've conducted hundreds of interviews for positions at DEC and Microsoft and I've always been more impressed with people that knew more stuff (and why that stuff was useful to know - rather than just knowing weird facts by rote) than people who said they could look it up. To me, if someone knows things like the name of the SQL Server executable, it says to me that they've had more experience dealing with interesting situations (like having to start the server in single-user mode, or divide up resources on a multi-instance server using WSRM). And for that, they'd go higher up my 'hire' list than someone who would have to look it up. Of course, that's just one out of a large number of traits and characteristics that I'd be looking for.

So, my question to you is, do you think that's a good interview question (coupled with something to weed out those who don't know why it would be useful)? Vote in the survey below, and reply with any questions you think really are invalid for a DBA interview.

Thanks

PS And don't worry, it's not all going to be surveys from now on - I'm just having fun with a new toy. Some good internals posts coming up next week! 

Categories:
General | Surveys

I'm toying with the idea of having a weekly survey that'll highlight an interesting facet of database management. I've signed up with SurveyPopups.com, which is free and allows you to see the results as you vote on them.

If you think this is a cool idea, vote in the survey using the options below. If I get more than 100 people voting then I'll start doing one weekly or so and using the results to seed a blog post.

Thanks

Theme design by Nukeation based on Jelle Druyts