(Check out my Pluralsight online training course: SQL Server: Logging, Recovery, and the Transaction Log.)
This is a blog post I’ve been meaning to do for a while, and I’ve recently noticed some info on the web about checkpoints which is a little misleading, so I want to do a quick post to explain how checkpoints work as far as log records are concerned.
When a checkpoint operation occurs, no matter how it’s triggered (for instance through a manual CHECKPOINT, from a database or differential backup, or automatically) the same set of operations occurs:
- All dirty data file pages for the database are written to disk (all pages that have changed in memory since they were read from disk or since the last checkpoint), regardless of the state of the transaction that made the change.
- Before a page is written to disk, all log records up to and including the most recent log record describing a change to that page are written to disk (yes, log records can be cached in memory too). This guarantees recovery can work and is called write-ahead logging. Log records are written to the log sequentially, and log records from multiple transactions will be interspersed in the log. The log cannot be selectively written to disk, so writing a dirty page to disk that only has a single log record affecting it may mean writing many more previous log records to disk as well.
- Log records describing the checkpoint are generated.
- The LSN of the checkpoint is recorded in the database boot page in the dbi_checkptLSN field (see Search Engine Q&A #20: Boot pages, and boot page corruption).
- If in the SIMPLE recovery model, the VLFs in the log are checked to see whether they can be marked inactive (called clearing or truncating the log – both of which are terrible misnomers, as nothing is either physically cleared or truncated).
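As a quick aside, one way to watch the first step in the list above happen is to count the dirty pages for the database before and after a manual checkpoint. This is just a sketch using the sys.dm_os_buffer_descriptors DMV and its is_modified column, run against the CheckpointTest database created further down:

SELECT COUNT (*) AS [DirtyPages]
FROM sys.dm_os_buffer_descriptors
WHERE [database_id] = DB_ID (N'CheckpointTest') AND [is_modified] = 1;
GO
CHECKPOINT;
GO
-- Run the count again; it should drop to (or very close to) zero once the
-- checkpoint has flushed the dirty pages for the database.
SELECT COUNT (*) AS [DirtyPages]
FROM sys.dm_os_buffer_descriptors
WHERE [database_id] = DB_ID (N'CheckpointTest') AND [is_modified] = 1;
GO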
I’m using some terms you may not have come across – for a background primer on logging, recovery, and transaction log architecture see my article in the February 2009 TechNet Magazine – Understanding Logging and Recovery in SQL Server (or come to my Wednesday November 4th PASS Spotlight session with the same title).
Checkpoints are not really tracked in the transaction log – it just serves as a useful repository for information about which transactions are active at the time of the checkpoint. The LSN of the last checkpoint is recorded in the database boot page. This is where recovery starts, and if this page is inaccessible, the database cannot be attached, opened, or processed in any way – partly because it’s the boot page that knows whether the database was cleanly shut down or not, and partly because it’s the only place that records the LSN of the last checkpoint record. You may say, well it’s recorded in the transaction log too, but what if the log is corrupt in some way?
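(As an aside, if you want to look at the boot page yourself, it’s always page 9 in file 1. Here’s a minimal sketch using the undocumented DBCC PAGE command against the CheckpointTest database created below; the dbi_checkptLSN field shows up in the output, and we’ll see it again with DBCC DBINFO at the end of this post.)

DBCC TRACEON (3604);   -- send DBCC PAGE output to the console
GO
DBCC PAGE (N'CheckpointTest', 1, 9, 3);   -- file 1, page 9 is the boot page
GO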
One area of confusion I’ve seen is that the checkpoint log records are overwritten by subsequent checkpoints. Absolutely not – once written, a log record is NEVER updated or overwritten – it will only be overwritten when the log wraps and the VLFs are re-used (see Inside the Storage Engine: More on the circular nature of the log). This has led to further confusion about when checkpoint information is retrievable from the log, using commands such as fn_dblog.
For the rest of this post, I want to show you what’s going on in the transaction log when checkpoints occur under different circumstances.
[Edit: note that from 2012 onward, there’s an additional log record about updating the boot page.]
Consider the following example:
CREATE DATABASE CheckpointTest;
GO
USE CheckpointTest;
GO
CREATE TABLE t1 (c1 INT);
GO
INSERT INTO t1 VALUES (1);
GO
CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000051:009b LOP_BEGIN_CKPT
00000047:00000091:0001 LOP_END_CKPT
We see the log records for the checkpoint. In this case the checkpoint is very simple and so there are only two records – beginning and ending the checkpoint.
If we run another checkpoint, what do we see?
CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000092:0001 LOP_BEGIN_CKPT
00000047:00000093:0001 LOP_END_CKPT
Only information about one checkpoint, but with different LSNs for the log records. It’s not that the previous checkpoint was overwritten, it’s that we did a checkpoint – so the active portion of the log has moved forward and the log records for the previous checkpoint aren’t considered active any more as they’re not required (for instance, for database mirroring, an active transaction, a log backup, transactional replication). They’re still there in the log though, but just aren’t part of the required portion of the log and so aren’t dumped by fn_dblog.
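If you want to see which VLFs currently make up the required (active) portion of the log, a quick sketch is to use the undocumented DBCC LOGINFO command (VLFs with Status = 2 are active; Status = 0 means they can be reused). From SQL Server 2016 SP2 onward there’s also the documented sys.dm_db_log_info DMV.

DBCC LOGINFO (N'CheckpointTest');
GO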
Now, what if I create an active transaction? In another connection, I’ll do:
USE CheckpointTest;
GO
BEGIN TRAN;
GO
INSERT INTO t1 VALUES (1);
GO
And what if we do a checkpoint and look at the log now?
CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000094:0001 LOP_BEGIN_XACT
00000047:00000094:0002 LOP_INSERT_ROWS
00000047:00000094:0003 LOP_COUNT_DELTA
00000047:00000094:0004 LOP_COUNT_DELTA
00000047:00000094:0005 LOP_COUNT_DELTA
00000047:00000094:0006 LOP_BEGIN_CKPT
00000047:00000096:0001 LOP_XACT_CKPT
00000047:00000096:0002 LOP_END_CKPT
We see the start of the open transaction, the insert of the record, the update of row counts in metadata, and the checkpoint.
You may notice that there’s another log record being generated for the checkpoint – LOP_XACT_CKPT. This is only generated when there are active (uncommitted) transactions, and it lists information about all active transactions at the time the checkpoint begins. This is used during crash recovery to work out how far back in the transaction log to go to start REDO and UNDO operations (well, technically only UNDO will need to go this far back). Focusing in on this log record, we can see:
SELECT [Current LSN], [Operation], [Num Transactions], [Log Record]
FROM fn_dblog (NULL, NULL) WHERE [Operation] = 'LOP_XACT_CKPT';
GO

Current LSN             Operation        Num Transactions
----------------------- ---------------- ----------------
00000047:00000096:0001  LOP_XACT_CKPT    1

Log Record
-----------------------------------------------------------------------------------------------------------
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 6621000000000000
This log record contains information about each active (uncommitted) transaction at the time of the checkpoint. Without going into all the details about what’s in the payload for this log record, you can see two things in there:
- The LSN of the LOP_BEGIN_XACT log record for the oldest active transaction (the first bold number above – match it against the LOP_BEGIN_XACT in the log dump a little higher up)
- The LSN of the first log record making a database change for that transaction (the second bold number above – match it against the LOP_INSERT_ROWS in the log dump a little higher up)
You’ll notice that these LSNs are stored byte-reversed.
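To make the byte-reversal concrete, here’s a worked decode of the first of those LSNs (just a sketch, using the values from the dump above):

-- Bytes in the payload:  47 00 00 00 | 94 00 00 00 | 01 00
-- Reverse each field:    00 00 00 47 | 00 00 00 94 | 00 01
-- Giving LSN 00000047:00000094:0001, the LOP_BEGIN_XACT record above.
SELECT CONVERT (INT, 0x00000047) AS [VLF seq no],    -- 71
       CONVERT (INT, 0x00000094) AS [Block offset],  -- 148
       CONVERT (INT, 0x0001)     AS [Slot ID];       -- 1
GO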
How about if we do another checkpoint?
CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000094:0001 LOP_BEGIN_XACT
00000047:00000094:0002 LOP_INSERT_ROWS
00000047:00000094:0003 LOP_COUNT_DELTA
00000047:00000094:0004 LOP_COUNT_DELTA
00000047:00000094:0005 LOP_COUNT_DELTA
00000047:00000094:0006 LOP_BEGIN_CKPT
00000047:00000096:0001 LOP_XACT_CKPT
00000047:00000096:0002 LOP_END_CKPT
00000047:00000097:0001 LOP_BEGIN_CKPT
00000047:00000098:0001 LOP_XACT_CKPT
00000047:00000098:0002 LOP_END_CKPT
This time we see the log records for the current checkpoint and the previous one – as the active log stretches all the way back to the start of the oldest active transaction – no matter how many checkpoints we do while there’s an active transaction.
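A quick way to see what’s holding the start of the active log back (again, just a sketch against our test database) is DBCC OPENTRAN, which reports the oldest active transaction, along with the log_reuse_wait_desc column in sys.databases:

DBCC OPENTRAN (N'CheckpointTest');
GO
SELECT [name], [log_reuse_wait_desc]
FROM sys.databases
WHERE [name] = N'CheckpointTest';
GO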
Now what if we start another active transaction in a third connection?
USE CheckpointTest;
GO
BEGIN TRAN;
GO
INSERT INTO t1 VALUES (2);
GO
And then go back to the original connection and do another checkpoint and dump the log again:
CHECKPOINT;
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation
----------------------- -------------------------------
00000047:00000094:0001 LOP_BEGIN_XACT
00000047:00000094:0002 LOP_INSERT_ROWS
00000047:00000094:0003 LOP_COUNT_DELTA
00000047:00000094:0004 LOP_COUNT_DELTA
00000047:00000094:0005 LOP_COUNT_DELTA
00000047:00000094:0006 LOP_BEGIN_CKPT
00000047:00000096:0001 LOP_XACT_CKPT
00000047:00000096:0002 LOP_END_CKPT
00000047:00000097:0001 LOP_BEGIN_CKPT
00000047:00000098:0001 LOP_XACT_CKPT
00000047:00000098:0002 LOP_END_CKPT
00000047:00000099:0001 LOP_BEGIN_XACT
00000047:00000099:0002 LOP_INSERT_ROWS
00000047:00000099:0003 LOP_COUNT_DELTA
00000047:00000099:0004 LOP_COUNT_DELTA
00000047:00000099:0005 LOP_COUNT_DELTA
00000047:00000099:0006 LOP_BEGIN_CKPT
00000047:0000009b:0001 LOP_XACT_CKPT
00000047:0000009b:0002 LOP_END_CKPT
You can see that we now have three sets of checkpoint log records, and two active transactions. Only one of these sets of checkpoint log records is the pertinent one – the previous two have been superseded, but the log records haven’t been overwritten or removed, as you can clearly see.
Looking inside all the LOP_XACT_CKPT records, we can see (with a bit of creative formatting of the output):
SELECT [Current LSN], [Operation], [Num Transactions], [Log Record]
FROM fn_dblog (NULL, NULL) WHERE [Operation] = 'LOP_XACT_CKPT';
GO

Current LSN             Operation        Num Transactions
----------------------- ---------------- ----------------
00000047:00000096:0001  LOP_XACT_CKPT    1
00000047:00000098:0001  LOP_XACT_CKPT    1
00000047:0000009b:0001  LOP_XACT_CKPT    2

Log Record
-------------------------------------------------------------------------------------------------------------
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 21000000000000
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 21000000000000
0x000018 <snip> 780500000000000047000000940000000100040147000000940000000200000001 <snip> 21000000000000 …
… 79050000000000004700000099000000010004014700000099000000020000000100000002000000DC000000
The first two checkpoints only list one active transaction and the most recent one lists two, as we’d expect. The log record payload for the first two lists the same oldest active transaction (highlighted in bold). The payload for the last checkpoint lists the same oldest active transaction (as that hasn’t changed), but now lists an additional transaction (match the 470000009900000001000 with the LSN of the second LOP_BEGIN_XACT in the log dump above), and has a transaction count of two.
Just to finish things off, let’s look at the boot page of the database either using DBCC PAGE or DBCC DBINFO:
DBCC TRACEON (3604);
GO
DBCC DBINFO ('CheckpointTest');
GO

DBINFO STRUCTURE:

DBINFO @0x6711EF64

dbi_dbid = 18 dbi_status = 65536 dbi_nextid = 2089058478
dbi_dbname = CheckpointTest dbi_maxDbTimestamp = 2000 dbi_version = 611
dbi_createVersion = 611 dbi_ESVersion = 0
dbi_nextseqnum = 1900-01-01 00:00:00.000 dbi_crdate = 2009-09-28 07:06:35.873
dbi_filegeneration = 0
dbi_checkptLSN
m_fSeqNo = 71 m_blockOffset = 153 m_slotId = 6
dbi_RebuildLogs = 0 dbi_dbccFlags = 2
dbi_dbccLastKnownGood = 1900-01-01 00:00:00.000
<snip>
The dbi_checkptLSN is dumped out in decimal – converting to hex gives us (47:99:6), which exactly matches the LSN of the most recent LOP_BEGIN_CKPT record in the log dump above.
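If you want to double-check that conversion, here’s a trivial sketch:

SELECT CONVERT (BINARY (4), 71)  AS [m_fSeqNo],       -- 0x00000047
       CONVERT (BINARY (4), 153) AS [m_blockOffset],  -- 0x00000099
       CONVERT (BINARY (2), 6)   AS [m_slotId];       -- 0x0006
GO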
Hopefully this explains things clearly!
41 thoughts on “How do checkpoints work and what gets logged”
Paul,
Thanks for clarifying about this. My poor choice of words and not accounting for the active transactions led to this confusion. Thanks again.
Just to clarify, is a CHECKPOINT operation random or sequential in terms of I/O?
Random I/O to the data files, sequential I/O to the log.
Hi Paul,
I really enjoyed the article because of the art of writing and the simplicity with which the topics are explained – fantastic.
Paul,
Thanks for this pretty insightful post.
I’ve a couple of clarifications to seek, though…
1. I believe this checkpoint behavior does not change from 2000 through 2008 R2. Is that fair to assume?
2. During a checkpoint, is all other DB activity quiesced? I mean are updates (after reading in new data pages to the buffer) allowed?
Thanks,
DHF.
1) Correct
2) No – other activity continues as normal
Hello!
Not surprising that I end up at Paul Randal’s blog whenever I’m investigating SQL Server txlog behavior. :-)
Regarding this portion of the post:
***
If in the SIMPLE recovery model, the VLFs in the log are checked to see whether they can be marked inactive (called clearing or truncating the log – both of which are terrible misnomers, as nothing is either physically cleared or truncated).
***
It seems that too many vlfs (eg 10000 at 10 mb in size) in simple recovery mode can lead to a protracted time of log_reuse_wait_desc of ‘checkpoint’, and thus potentially contribute to a txlog reaching max size and terminating/rolling back transactions.
If so, on a system with 10000 vlfs at 10 mb in size, it seems like shrinking the file to its first two vlfs and manually growing to expected size in 8000 mb chunks (for 500 mb vlfs as you and Kimberly have mentioned elsewhere) would yield expected benefits in transaction rollback, database recovery, and preventing txlog full conditions.
Seem correct?
Yes, correct.
Hi Paul,
Very basic question here please. Just working on trying to understand logging architecture (Thanks for your pluralsight course Logging and Recovery as well. It is invaluable.)
When you say “disk” here:
“All dirty data file pages for the database are written to disk (all pages that have changed in memory since they were read from disk or since the last checkpoint), regardless of the state of the transaction that made the change.”
is it the .ldf or the .mdf (.ndf)?
Thanks a lot.
Also you say:
“Random I/O to the data files, sequential I/O to the log”
Are there multiple checkpoints?
Thank you.
No – that’s describing the I/O pattern to the various files that writes occur to.
Data files are mdf and ndf, log file is ldf.
Thanks Paul.
Hi Paul,
Your articles really helped me understand the logging mechanism. Presently I am working on a stored procedure to delete a high volume of records from the database as part of the archival activity.
In order to avoid filling the transaction log file, I am ensuring that the delete is done in small batches, the transaction is committed, a checkpoint is issued, and the recovery model is SIMPLE.
But I have searched all the stored procedures in all the databases in my application and couldn’t find any of them issuing a checkpoint statement. I believe issuing a checkpoint is mandatory for the VLFs to be marked as truncated, but if none of them is issuing the checkpoint statement then how is the transaction log managed? Please advise whether it is compulsory to issue a checkpoint to avoid the transaction-log-full error in the simple recovery model. We are using SQL Server 2008.
If you’re using SIMPLE, a checkpoint occurs automatically when the log is 70% full. Are you sure you want to use SIMPLE? You can do the same thing using FULL and doing log backups after each committed transaction. Alternatively, you could use partitioning and simply drop an entire partition of records with hardly any logging.
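In case it helps, here’s a minimal sketch of the FULL recovery approach (the table name, predicate, and backup path are hypothetical; substitute your own):

WHILE 1 = 1
BEGIN
    DELETE TOP (100000)
    FROM [dbo].[ArchiveTable]            -- hypothetical table
    WHERE [CreatedDate] < '20080101';    -- hypothetical archival predicate

    IF @@ROWCOUNT = 0
        BREAK;

    -- Back up the log so the space just used can be made reusable
    BACKUP LOG [MyDatabase]
        TO DISK = N'D:\Backups\MyDatabase_log.trn';   -- hypothetical path
END;
GO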
Dear Paul,
The introduction says “All dirty data file pages for the database are written to disk, regardless of the state of the transaction that made the change”. Are these pages written into the database data file(s)? If so, I presume the page exists twice in the data file: the page as it will be if the transaction is committed and the page as it will be if the transaction is rolled back. I wonder what will happen with those two versions? If the transaction has been completed (committed or rolled back) at the next checkpoint, I presume one of the two pages will be removed and the other one is read into memory without being marked “dirty”???
Thanks
Peter.
Nope – it exists once, at the current state of the transaction. If the transaction commits, with no further changes to that page, nothing else needs to happen. If the transaction rolls back, the page is changed in memory again and will be flushed out to disk on the next checkpoint, overwriting the page on disk.
Paul
I very much appreciate this article.
You might be surprised, but the line that meant the most to me was:
“If in the SIMPLE recovery model, the VLFs in the log are checked to see whether they can be marked inactive (called clearing or truncating the log – both of which are terrible misnomers, as nothing is either physically cleared or truncated).”
I have a SIMPLE recovery model database, and I have a huge update to do.
I wrote a loop to break the update into batches of 100,000 rows each.
But I wanted to make sure that I did not fill the TLog.
So I added a checkpoint to the loop and it kept the TLog from growing.
thanks
-gary h
Cool – glad to help!
Hi Paul, thank you for the post, it’s really helpful. I have a question about how checkpoints interact with user writes. If a checkpoint is in the middle of writing a data page to disk (IO request pending) and a user thread is trying to obtain an exclusive latch on that page in order to update it, what happens? (I suspect the thread will wait with a PAGEIOLATCH_EX, but not really sure). Thanks!
No – the thread will wait for a PAGELATCH_EX (PAGEIOLATCH_ is only when waiting for a page to be read from disk).
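If you want to catch such waits in the act, a sketch using the sys.dm_os_waiting_tasks DMV:

SELECT [session_id], [wait_type], [wait_duration_ms], [resource_description]
FROM sys.dm_os_waiting_tasks
WHERE [wait_type] LIKE N'PAGELATCH%'
    OR [wait_type] LIKE N'PAGEIOLATCH%';
GO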
Great article!!! It helped me a lot to understand the working of Checkpoint. Thanks a lot Paul!!!
Great article.
Hi Paul, thank you for the post, as always great :)
I have one question about checkpoints and the reuse of log space. I know that in the simple recovery model a checkpoint (direct or indirect) is the only way to allow space to be reused, but what is going on in the full recovery model – in the script below, why does “Log Space Used” drop after the checkpoint?
-- SQL 2016 DEV + SP1
USE [master]
GO
IF DATABASEPROPERTYEX (N'DUMMY', N'Version') > 0
BEGIN
ALTER DATABASE DUMMY SET SINGLE_USER
WITH ROLLBACK IMMEDIATE;
DROP DATABASE DUMMY;
END
GO
CREATE DATABASE [DUMMY]
ON PRIMARY
( NAME = N'DUMMY'
, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\DUMMY.mdf' ,
SIZE = 2024KB , MAXSIZE = UNLIMITED, FILEGROWTH = 65536KB )
LOG ON
( NAME = N'DUMMY_log'
, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\DUMMY_log.ldf'
, SIZE = 2024KB , MAXSIZE = 2048GB , FILEGROWTH = 1024 KB)
GO
USE [master]
GO
ALTER DATABASE [DUMMY]
SET RECOVERY FULL
WITH NO_WAIT
GO
USE [DUMMY]
GO
DROP TABLE IF EXISTS dbo.TEST
SELECT TOP 50 cast('A' AS CHAR(8000)) char_8000
INTO dbo.TEST
FROM master..spt_values
DBCC SQLPERF(logspace)
-- Log Size (MB)  Log Space Used (%)
-- 1,992188       38,48039
INSERT INTO dbo.TEST
SELECT TOP 50 cast('A' AS CHAR(8000)) char_8000
FROM master..spt_values
DBCC SQLPERF(logspace)
-- Log Size (MB)  Log Space Used (%)
-- 1,992188       58,84804
CHECKPOINT
DBCC SQLPERF(logspace)
-- Log Size (MB)  Log Space Used (%)
-- 1,992188       19,28922
Because you haven’t performed a full backup yet, the log behaves as if it’s in SIMPLE (as there’s no possibility of a log backup being performed).
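To see the difference, a sketch: take a full backup first so the log backup chain is established, then repeat the INSERT and CHECKPOINT from the script above; this time the log space won’t be released until a log backup is taken (the backup paths here are hypothetical):

BACKUP DATABASE [DUMMY] TO DISK = N'C:\Temp\DUMMY_full.bak';   -- hypothetical path
GO
-- ... repeat the INSERT and CHECKPOINT from the script above ...
DBCC SQLPERF (LOGSPACE);   -- Log Space Used stays high now
GO
BACKUP LOG [DUMMY] TO DISK = N'C:\Temp\DUMMY_log.trn';         -- hypothetical path
GO
DBCC SQLPERF (LOGSPACE);   -- and now the space can be reused
GO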
Thank you very much Paul.
Best wishes :)
Hi Paul, your post helped answer my own question on whether or not setting a database to READ_ONLY issues an internal checkpoint, and yeah, it seems like it does!
CREATE DATABASE CheckpointTest;
GO
USE CheckpointTest;
GO
CREATE TABLE t1 (c1 INT);
GO
INSERT INTO t1 VALUES (1);
GO
Alter database CheckpointTest SET READ_ONLY WITH NO_WAIT
GO
SELECT [Current LSN], [Operation] FROM fn_dblog (NULL, NULL);
GO
Indeed – any state change like that would have to flush out any dirty pages before making the database read only.
Hi Paul,
How does the checkpoint operation get a list of dirty pages? Does it query the PFS page? We noticed PFS page latch contention on a user database during the checkpoint process, followed by a stack dump.
Thanks.
You hit a bug – PFS pages aren’t used during checkpoints – it’s a coincidence. Old-style checkpoint runs through the BUF structures in the hash list for the database looking for ones marked dirty. In new-style checkpoints there’s a dirty page list.
Hi Paul,
Good Morning
Let me start my question with example
Let’s say a checkpoint occurred at 10 AM and at 10:12 AM my server crashed (there was no checkpoint after 10 AM).
Between 10 AM and 10:12 AM a transaction occurred and completed.
When the server crashed, will the SQL engine redo or undo the completed transaction?
Could you please explain the recovery?
Thanks,
Mahendra
If the transaction committed before the server crashed, it will be redone. As long as the commit log record is on disk, it’s durable and will be redone.
Dear Paul,
I have a query on inserting a few rows in a table: after the insert statement it said rows were affected, but when selecting or counting the rows the data has not changed. I did not see any trigger that would stop the commit, implicit transactions are also off, and there are no errors in the logs. Not sure why the records are not showing in the table.
Can’t happen if you’re doing begin tran; insert row; commit tran; select rows. Without looking at your code and other settings, impossible to say.
Hi Paul!
Does the automatic checkpoint occur for a Database?
Hi Paul,
Will a checkpoint flush dirty pages from the buffer cache?
Yes – that is the purpose of a checkpoint.
Hi Paul,
Does checkpoint run in the bulk-logged recovery model also? What does the eager write process do then? I am confused.
Yes, in all recovery modes. In bulk logged or simple, eager writing ensures that all data file pages changed by a minimally-logged operation have been written to disk before the commit of that operation occurs.
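For example, here’s a sketch of an operation that can be minimally logged under BULK_LOGGED or SIMPLE (SELECT ... INTO a brand-new table; the destination table name is hypothetical). Eager writing makes sure the data file pages it fills are on disk before the transaction commits.

SELECT [c1]
INTO [dbo].[t1_copy]   -- hypothetical new table
FROM [dbo].[t1];
GO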
Why can’t checkpoint flush dirty pages changed by minimally-logged operations to disk?
Of course it could, but for correctness of crash recovery, data pages changed by minimally-logged ops must be on disk before the op commits. But checkpoint won’t ever encounter such a page because of eager writing.