CHECKDB From Every Angle: Consistency Checking Options for a VLDB

This is a question that comes up a lot – how to run consistency checks on a VLDB?

We’re talking hundreds of GBs or 1 TB or more. These databases are now common on SQL Server – there are many customers with multi-TB databases. Any experienced DBA knows the value of running consistency checks, even when the system is behaving perfectly and the hardware is rock-solid. The two problems that people have with running a full DBCC CHECKDB on their VLDB are:

  • It takes a long time to run (based on many factors – see my previous post here for details).
  • It uses lots of resources – memory, CPU, IO bandwidth, tempdb space.

So it uses lots of resources for a long time. Even with a decent sized maintenance window, the DBCC CHECKDB may run over into normal operations. There’s also the case of a system that’s already pegged in more or more resource dimensions. Whatever the case, there are a number of options:

  • Don’t run consistency checks
  • Run DBCC CHECKDB using the WITH PHYSICAL_ONLY option
  • Use the partitioning feature and devise a consistency checking plan around that
  • Figure out your own scheme to divide up the consistency checking work over several days
  • Offload the consistency checks to a separate system

Let’s look at each in turn.

Don’t run consistency checks

A lot of people end up doing this as they can’t figure out how to run consistency checks on their DB. Don’t even think about using this option – there’s always a way to do it. If you absolutely cannot figure out a way to get consistency checks on your system, send me email and I’ll help you – I’ve helped many customers with all the schemes below, both in and out of Microsoft.

Use WITH PHYSICAL_ONLY

A full DBCC CHECKDB does a lot of stuff – see previous posts in this series for more details. You can vastly reduce the run-time and resource usage of DBCC CHECKDB by using the WITH PHYSICAL_ONLY option. With this option, DBCC CHECKDB will:

  • Run the equivalent of DBCC CHECKALLOC (i.e. check all the allocation structures)
  • Read and audit every allocated page in the database

So it skips all the logical checks, inter-page checks, and things like DBCC CHECKCATALOG. The fact that all allocated pages are read means that:

  • Any pages that cannot be read at all (i.e. 823 errors) will be discovered
  • If page checksums are enabled, any corruptions caused by the IO subsystem will be discovered as the page checksum will be checked as part of reading the page into the buffer pool

So there’s a trade-off of consistency checking depth against run-time and resource usage – but this option will pick up problems caused by the IO subsystem as long as page checksums are enabled and present.

Use the partitioning feature

One of the obvious ways to reduce the time/resources issue is to partition the load. If you’re using the partitioning feature then you’re already setup for this. Given that you’ve hopefully got your partitions stored on separate filegroups, you can use the DBCC CHECKFILEGROUP command.

Consider this example – you have the database partitioned by date such that the current month is on a read-write filegroup and the past 11 months are on 11 read-only filegroups (data from more than a year ago is on some offline storage medium). The prior months also have multiple backups on various media so are considered much ‘safer’ than the current month. It makes sense that you don’t need to check the read-only filegroups as often as the current month’s filegroup so an example consistency checking scheme could be:

  • Run a DBCC CHECKFILEGROUP on each read-only filegroup every week or two
  • Run a DBCC CHECKFILEGROUP on the read-write filegroup every day or two (depending on the stability of the hardware, how critical the datais, and the frequency and comprehensiveness of your backup strategy).

I know of several companies who’ve made the decision to move to partitioning in part because of this capability to easily divide up the consistency checking.

Figure out your own way to partition the checks

If you haven’t partitioned your tables, then there are ways you can split up the consistency checking workload so that it fits within a maintenance window. You basically need to simulate what DBCC CHECKDB does in a staggered manner. Here’s one scheme that I’ve recommended to several customers:

  • Figure out your largest tables (by number of pages) and split the total number into 7 buckets, such that there are a roughly equal number of database pages in each bucket.
  • Take all the remaining tables in the database and divide them equally between the 7 buckets (using number of pages again)
  • On Sunday:
    • Run a DBCC CHECKALLOC
    • Run a DBCC CHECKCATALOG
    • Run a DBCC CHECKTABLE on each table in the first bucket
  • On Monday, Tuesday, Wednesday:
    • Run a DBCC CHECKTABLE on each table in the 2nd, 3rd, 4th buckets, respectively
  • On Thursday:
    • Run a DBCC CHECKALLOC
    • Run a DBCC CHECKTABLE on each table in the 5th bucket
  • On Friday and Saturday:
    • Run a DBCC CHECKTABLE on each table in the 6th and 7th buckets, respectively

There’s one drawback to this method – a new internal database snapshot is created each time you start a new DBCC command, even for a DBCC CHECKTABLE. If the update workload on the database is significant, then there could be a lot of transaction log to recover each time the database snapshot is created – leading to a long total run-time. In this case, you may need to alter the number of buckets you use to make the total operation fit within your available window.

Use a separate system

This alternative is relatively simple – restore your backup (you are taking regular backups, right?) on another system and run a full DBCC CHECKDB on the restored database. This offloads the consistency checking burden from the production system and also allows you to check that your backups are valid. There are some drawbacks to this however:

  • You need to have sufficient disk space on the spare system to be able to restore the backup onto. If the production database is several TB, you need the same several TB on the spare box. This equates to a non-trivial amount of money – initial capital investment plus ongoing storage management costs. (Hopefully a future release will alleviate this – while at Microsoft I invented and patented a mechanism for consistency checking a database in a backup without restoring it.)
  • If the consistency checks find an error, you don’t know for sure that the database is corrupt on the production system. It could be a problem with the spare box that’s caused the corruption. The only way to know for sure is to run a consistency check on the production system. This is a small price to pay though, because most of the time the consistency checks on the spare system will be ok, so you know the production database was clean at the time the backup was taken.

Summary

You’ve got a bunch of choices to allow you to run consistency checks, so there’s really no excuse for not knowing (within a reasonable time-frame) that something’s gone wrong with your database.

40 thoughts on “CHECKDB From Every Angle: Consistency Checking Options for a VLDB

  1. Would another option be to take a SAN level snapshot of the database and then do a DBCC CheckDB on that copy of the database?

  2. I have a SQLServer 2012 server that is performing very strangely. I was running full CheckDB every day, for about a month or so, without any problems. Then, one day, it took about 12 hours to run it against a very small database. From this day on, I keep getting varying times to run it. Today, in one of the instances, it took my job 6 hours and a half to run CheckDB against the msdb database, which is 64Mb, with 40% unallocated space. What could there possibly be wrong?

  3. Paul, when you say – Figure out your largest tables (by number of pages) and split the total number into 7 buckets, such that there are a roughly equal number of database pages in each bucket.

    What do you mean by a ‘bucket’?

  4. Great article, Paul.

    In your “table buckets” scenario, does CHECKALLOC + CHECKCATALOG + CHECKTABLE cover everything that a CHECKDB does? The MSDB documentation says that CHECKDB also:

    Validates the contents of every indexed view in the database.

    Validates link-level consistency between table metadata and file system directories and files when storing varbinary(max) data in the file system using FILESTREAM.

    Validates the Service Broker data in the database.

    Are these important? Are there ways to do them separately? (I’m mostly worried about the indexed view validation, I don’t believe we are using FILESTREAM or Service Broker currently).

  5. Thanks for the great options. Is there recommended minimum frequency to running DBCC’s? Our policy has been to run “Physical only” nightly and full on weekends, but have timing issues on some servers. We anticipate partitioning the workload over a week’s time. Is that typical? Or could we extend over 2 weeks or a month? I know that’s a “it depends” question, but would like best practice recommendation.

  6. Hi Paul – PHYSICAL ONLY option does CHECKALLOC and skips CHECKCATALOG, when you go on say it still reads every page I assume that CHECKTABLE runs then too? thanks.

  7. I read through BOL and all your recommendations multiple times. I’ve been doing some testing also. I’m trying to determine if NOINDEX checking is a better approach then PHYSICAL_ONLY. They seem very similar and INDEXES are logical structures that are also skipped during PHYSICAL_ONLY from what I can tell.

    My questions are:

    1. Because PHYSICAL_ONLY skips checkcatalog, and checkcatalog seems important because the errors it detects require a restore, isn’t it less risk to run NOINDEX instead of PHYSICAL_ONLY as a best practice for production and VLDBs?

    2. Because indexes can always be rebuild isn’t running DBCC with no NOINDEX no real risk of data loss, because the worst case is I have to rebuild an index?

    3. If I was to separate and schedule the workload of DBCC, by large tables and different nights, it would be nice if DBCC CHECKDB had an omit tables option. It would allow me to do most of the critical databases on the maint schedule, and then run selected checktables (maybe with NOINDEX) as need on other nights.

    Please let me know what you think about using NOINDEX instead of PHYSICAL_ONLY. Thanks

    1. 1) No – you shouldn’t use NOINDEX as it skips indexes completely. PHYSICAL_ONLY still reads the index pages and checks their page checksums.
      2) Correct. But you still want to know if there’s corruption in the database as soon as possible. Just because it hit a nonclustered index (which will cause query failures), next time it might hit a clustered index or system table.
      3) Please submit a request on connect.microsoft.com

      Don’t use NOINDEX. I’ve never recommended it but never got around to pressing to have it removed back when I was on the SQL team.

  8. I have two servers (SQL 2008 SP4 Enterprise) with database mirroring configured across them. If I run CHECKDB WITH PHYSICAL_ONLY on principal database and normal DBCC CHECKDB () on mirrored database snapshot, will this option cover all consistency check that normal CHECKDB performs on principal database.

    Note: page_verify is configured as CHECKSUM for database on Principal database. From 2008 onward, we also have auto page repair option with database mirroring.

  9. Hi Paul

    I’m a huge fan of you and Kimberly’s work.

    Regarding this: “while at Microsoft I invented and patented a mechanism for consistency checking a database in a backup without restoring it.”

    This is very interesting. Is this functionality publicly available now or are there plans for it to be so?

    thanks
    Paul.

  10. Hey Paul

    SQL 2008 installation – CheckDB is running on a secondary system, so there is lots of everything *except* disk space – assuming checkdb is using what it needs and not more (DB data file is 293GB, tempdb grows to 34GB, largest table is 97GB) – I’m guessing all of the partitioning options don’t help me and the only way to make CheckDB use less tempdb (potentially) is PHYSICAL_ONLY – can’t seem to find any info on what impact (if any) that might have on tempdb space usage.

    Any other suggestions? I’ve asked management for more disk – they’re thinking about it (~2 years now…) – I could run checkdb in production – more space there and a reasonable maintenance window

    Thanks

    1. Correct – PHYSICAL_ONLY should use profoundly less space in tempdb. In your situation, I’d be tempted to run it in production every so often too so you get the full checks.

  11. Hello Paul, we have a Sql 2014 instance with latest patch level. These days we are facing logical inconsistency issues on tempdb when ever we ran some queries and these are intermittent. Actually it says incorrect page I’d detected on one of the tempdb file while trying to sort particular operation. What could be the reason for this? There are few procedures written with no lock and row lock hints and we have good amount of tempdb files based on the processors. There is no TDE enabled.

  12. Paul, thanks for the help already given and this wonderful article. I’m finally implementing a kind of “bucket” system- I have four hours a night that the system can take a performance hit so I’ve got a procedure that, each night, will checkalloc and checkcatalog (these take under a minute) and then proceed to checktable on every table, stopping after four hours. The next night, it runs checkalloc, checkcatalog, and continues on the next table in the list. If there aren’t any tables in the checktable queue, it fills up the queue again with 3900 tables or so (JD Edwards, ugh) and starts at the top.

    In an email you mentioned, “You’re probably missing system tables.” Can you elaborate? select * from databasename.sys.tables where type ‘U’ results in zero rows… should I be running checktables on some other tables that aren’t in sys.tables?

    Thank you

    Jeremy

  13. Hi Paul,

    Is it better option to devide the Full dbcc checkdb into three sepearte checks to reduce the overhead like

    Run a DBCC CHECKALLOC
    Run a DBCC CHECKCATALOG
    Run a DBCC CHECKTABLE

    1. There’s no right answer to this – it depends on what restrictions you have. And remember it’s not three parts if you do this, it’s two parts plus one part for each table in your database, as CHECKTABLE only checks one table.

  14. Hi Paul.

    Thank you for such a great posts about CHECKDB and for great feature at all.

    I want to ask you about using separate system option for CHECKDB. If I’ll restore FULL+DIFF+LOG backups and run CHECKDB,is it would be OK or I must restore FULL+DIFF without LOG?

      1. Thank you for quick reply.

        What if I want to run CHECKDB everyday after DIFF backup, would it be correct solution to run CHECKDB on restored FULL+DIFF? Or it wouldn’t give me any advantage?

  15. Hi Paul,

    We are facing multipe checkdb failures every weekend on our VLDB.By the time i came across this Physical_only feature.

    Can you please suggest if it is fine to replace checkdb with checkdb Physical_only and is it safe if my db or any page gets corrupted,does it will be able to get that corruption information as error message like as it does on normal Checkdb?

    Thanks,
    Jitanshu

    1. You need to fix the root cause of the corruption – likely hardware. No – you can’t *never* run a full DBCC CHECKDB as there are a lot of checks it does that a PHYSICAL_ONLY does not.

  16. Hi Paul,
    Thanks for your wonderful post.
    If we do Check table for every table in the database individually, Does it creates complete database snapshot for every DBCC CheckTable call?
    If it so, Is it preferrable to run for all the tables everytime which will create multiple DB snapshots?

  17. Hi Paul.
    Thank you for this article.

    I was hoping you could give an advise on this.

    I am running a checkdb(olla Hollagren) with noindex . It does not check the non clustered indexes only.

    I am worried we could miss a corruption of an index.

    The reason why I chose this solution is because of the time it takes to run

    What do you think if i decide to run a full check with index every 2 weeks and one without index every 2 weeks as well.

  18. Hi Paul, thanks so much for this article. Please may I ask a question about a situation I am facing. We run checkdb against our alwayson replica, it normally completes without error. However, on 2 occasions in the last 4 months, it has reported a corruption:

    The operating system returned error 1392(The file or directory is corrupted and unreadable.) to SQL Server during a write at offset 0x00000d0a690000 in file ‘E:\xxx\data\xxx.ndf_MSSQL_DBCC9’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

    Subsequent executions of checkdb report no corruption. I have logged an incident with MS and they told me to run chkdsk, which came back with zero errors, then told me it was likely a transient write error and likely safe to ignore. I’m comfortable ignoring this, but I am coming close to being out of ideas on how to troubleshoot it. So my question, Is there any chance that checkdb could run out of space, and report a corruption rather than an out of space error? (we are getting close to disk space limits).

    1. No – it’s definitely an I/O subsystem error with the sparse file CHECKDB uses for its snapshot. Apart from chkdsk, about the only thing you can do is run whatever diagnostics are provided by your I/O subsystem vendor.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.