There are two pretty well-known I/O errors – 823, and 824 – but there’s also one called 825 which most DBAs do*not* know about, and definitely should.
From SQL Server 2005 onwards, if you ever see an 823 or 824, SQL Server has actually tried that I/O a total of 4 times before it finally declares a lost cause and surfaces the high-severity I/O error to the connection’s console, killing the connection into the bargain. The idea behind this read-retry logic came from Exchange, where adding the logic reduced the amount of immediate downtime that customers experienced. While in concept this was something I agreed with at the time, I didn’t agree with the way it was implemented.
If the I/O continues to fail, then the 823/824 is surfaced – that’s fine. But what if the I/O succeeds on one of the retries? No high-severity error is raised, and the query completes, blissfully unaware that anything untoward happened. However, something *did* go badly wrong – the I/O subsystem failed to read 8KB of data correctly until the read was attempted again. Basically, the I/O subsystem had a problem, which luckily wasn’t fatal *this time*. And that’s what I don’t like – the I/O subsystem went wrong but there are no flashing lights and alarm bells that fire for the DBA, as with an 823 or 824. If read-retry is required to get a read to complete, the only notification of this is a severity-10 informational message in the error log – error 825. It looks like this:
Msg 825, Level 10, State 2, Line 1. A read of the file ‘D:\SQLskills\TestReadRetry.mdf’ at offset 0×0000017653C000 succeeded after failing 2 time(s) with error: incorrect checksum (expected: 0×4a224f20; actual: 0×2216ee12). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
What this message is really saying is that your I/O subsystem is going wrong and you must do something about it. And unless you’re regularly scanning the error log looking for these, you’ll be none-the-wiser.
So – my recommendation is that you add a specific Agent alert for error 825, along with your other alerts (see this blog post).
9 thoughts on “A little-known sign of impending doom: error 825”
Hi Paul,
Very interesting error. I wasn’t aware of it, but something that came in my mind, why this error was implemented in severity 10 and not like error 823 and 824 that was in severity 24?
I’ll start keep eyes on this one as well.
Thanks,
Marcos
What is scary with this one is that the drive subsystem will also have done many, many retries, etc. under the covers before SQL Server is even aware of the first retry. Keep in mind that by the time SQL Server even knows about the first retry, the drive subsystem had already given up trying.
Excellent article. I am dealing with some of these issues as well..
Can you help?
I’ve had five error 825 this week, and I have done a checkdb each time and it runs fine, no errors.
———————
SQL Server Alert System: ‘Corruption Alert 825 Error’ occurred
DESCRIPTION: A read of the file ‘E:\MSSQL\Data\PROD_1.ndf’ at offset 0x00002ec57e0000 succeeded after failing 1 time(s) with error: incorrect pageid (expected 3:24521712; actual 3:798104). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books
———————
–To convert
SELECT 0x00002ec57e0000 / 8192
———————
–To get the object id
DBCC PAGE(‘Prod’, 1, 119824, 0) WITH TABLERESULTS
DBCC PAGE(‘Prod’, 1, 58040, 0) WITH TABLERESULTS
DBCC PAGE(‘Prod’, 1, 71088, 0) WITH TABLERESULTS
DBCC PAGE(‘Prod’, 1, 261352, 0) WITH TABLERESULTS
DBCC PAGE(‘Prod’, 1, 197304, 0) WITH TABLERESULTS
———————
— To get the object name
SELECT * FROM sys.indexes s WHERE s.object_id in (60, 2137058649,
2116918613,
1173579219) AND s.index_id = 1
———————
One of these indexes came up twice PK_NxFWKAppObjectDataHistory
clst
IDX__TimeIndex__27DDE006
PK_NxFWKAppObjectDataHistory
———————
–No rows returned
select * from dbo.suspect_pages
———————
Error 825: is a hardware issue, and it’s a read error. A index object error came up twice, has nothing to do with the issue, the index isn’t the cause, that’s right isn’t it?
Yup – it’s always an I/O subsystem error – and yours is starting to fail. CHECKDB may always come back clean, as subsequent reads of the same page may work, so CHECKDB sees a clean page.
Great article, I wonder if you can help shed some light on my issue.
Following a Storage issue, a collection of 16 entries have arrived in our msdb.dbo.suspect_pages Table, together with 824 errors in the Error Log.
Subsequent DBCC checks on the Databases affected do not return any issues. No subsequent 824 errors have occurred.
The Pages referred to in the errors do not appear to be in the regular User Tables and no other symptons have been noticed.
Is it likely that the errors have now gone away? Is there any way that this can be checked and how would these entries normally be cleaned from the suspect_pages Table?
Many thanks, Nigel.
Likely those pages were subsequently removed from the set of allocated pages in the database, by something like an index rebuild. You’ll need to remove the entries manually – the suspect_pages table is only ever inserted into and updated, never deleted from, by SQL Server automatically.
I am on SQL Server 2014 sp3.
Did this change? It seems to be indicating a DB corruption now, not disk IO time out. I see the same information on Paul S. Randals page and I have opened a question with him.
select * from sys.messages where language_id = 1033 and message_id = 823
The text reads:
The operating system returned error %ls to SQL Server during a %S_MSG at offset %#016I64x in file ‘%ls’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
No – you’re looking at error 823, and the post is about error 825. 823/824 is when read-retry doesn’t work.