In yesterday’s blog post A Look at Backup Internals and How to Track Backup and Restore Throughput (Part 1), we looked at what happens when we Backup a database in SQL Server. Today, we are going to use the information we captured to perform some analysis of the Backup information in an attempt to find ways to decrease the time it takes to backup a database. When I began reviewing the data from the Backup in yesterdays post, I realized that I had made a mistake in the process and left Trace Flag 3213 off, which left some information that we’ll need out of the results, the Backup Buffer Configuration information. For this post I turned Trace Flag 3213 on, and reran the Backup from yesterday, so the results may differ slightly but the same Backup command was used for the tests.
The Backup Buffer Configuration information tells us how many Buffers were allocated, and what the Max Transfer Size being used is. In the screenshot above, this information is outlined in the red box. The default Buffer Count is determined by SQL Server when the BUFFERCOUNT option is not specified in the BACKUP DATABASE command. The calculation used is:
(NumberofBackupDevices * GetSuggestedIoDepth) + NumberofBackupDevices + (2*DatabaseDeviceCount)
This is covered in detail on Amit Banerjee’s blog post, Incorrect BufferCount data transfer option can lead to OOM condition. For the Backup in yesterday’s post, and the one that will be the baseline for today’s post, the BUFFERCOUNT option was not specified, and the Backup, as shown above, used 7 Buffers and the default MaxTransferSize of 1MB for the backup. If we look at the aggregated Event and wait_type information contained in the Event Session for the Backup we can begin to see what types of Backup bottlenecks we might have in the system.
SELECT
ISNULL(wait_type, event_name) AS Operation,
SUM(duration) AS duration,
SUM(signal_duration) AS signal_duration,
COUNT(*) AS occurences
FROM #TestResults
WHERE (duration IS NOT NULL OR signal_duration IS NOT NULL)
AND ISNULL(wait_type, event_name) IN
('BACKUPBUFFER', 'BACKUPIO')
GROUP BY ISNULL(wait_type, event_name)
Looking at this information, we have a large number of BACKUPBUFFER waits occurring during the backup of the database and this may be a potential tuning opportunity to improve the performance of our database Backups. To test this, we can change our Backup Script to include a 4GB MAXTRANSFERSIZE and a BUFFERCOUNT of 16. I also chose to change the filenames in the Extended Events Session to simplify working with the Target Data for each test individually. It is possible to specify the exact filename and metadatafile names when you read from the Target, but that requires more work than is needed in my opinion. For the sake of brevity I am not going to repeat all of the Extended Events information in this blog post, but instead show the outcome of running various configurations of BUFFERCOUNT against a test server.
Test Number | Backup File Count | Buffer Count | Max Transfer Size | Backup Time (s) | BACKUPBUFFER (wait ms) | BACKUPIO (wait ms) | BACKUPBUFFER (wait count) | BACKUPIO (wait count) |
1 | 1 | 7 | 1024 | 122.5 | 159471 | 62587 | 81616 | 22815 |
2 | 1 | 16 | 4096 | 105.2 | 90963 | 69091 | 14513 | 7982 |
3 | 1 | 32 | 4096 | 99.5 | 75236 | 88634 | 12298 | 8679 |
4 | 1 | 128 | 4096 | 95.9 | 70173 | 63435 | 8292 | 4679 |
5 | 1 | 256 | 4096 | 95.9 | 50988 | 48942 | 1538 | 1135 |
6 | 2 | 128 | 4096 | 96 | 152323 | 63800 | 12416 | 4925 |
7 | 2 | 256 | 4096 | 96.4 | 109565 | 46953 | 3067 | 1195 |
The same Event Session was used to gather the above metrics in seven repetitive tests. The test server is a Dell R710 dual quad core system with 24GB RAM and HT enabled. It has eight internal 15K RPM 146GB SAS drives that are configured into 2 RAID 1 drive pairs and a single 4 disk RAID 5 array. One of the RAID 1 drive pairs was dedicated to the OS, and SQL Binaries. The other was used for writing the backups, and the database data/log/tempdb files were placed on the RAID 5 array. This isn’t the ideal configuration for a setup, but its what I had available at the moment to work with, and is similar to some of the configurations I’ve seen in the real world as well. The Backups were segregated to a dedicated disk that was RAID 1 for this test to avoid RAID 5 write penalties, and to maximize the backup write throughput by isolating it from any other operations.
The above results can be interpreted a number of different ways. As the BUFFERCOUNT increases the backup time decreases, and so does the amount of time spent waiting on Backup Buffers. However, there is a tradeoff that is being made in memory consumption; specifically memory outside of the buffer pool from Virtual Address Space. On 32 bit servers this can lead to Out of Memory exceptions, the topic of Amit’s blog post referenced previously in this blog post. Test Number 5, with 256 buffers and a 4MB transfer size will use 1GB of memory as shown in the Backup Buffer Configuration Information.
On the test server used for this testing, the bottleneck is the disks that the backup file is being written to and further improvements in performance will require additional IO to accomplish. The test database has 30GB of data in it, and with backup compression, the backup size is 7.8GB in size on disk. For a full backup to take just over a minute and a half for this database is not that bad, but it is going to local disks and there is no safeguard for the data in the event of a catastrophic loss of the physical server entirely unless the data gets copied to another location in the network after a local backup occurs.
One thought on “An XEvent a Day (18 of 31) – A Look at Backup Internals and How to Track Backup and Restore Throughput (Part 2)”
A little late for this, but I believe you had a typo in the following sentence: “To test this, we can change our Backup Script to include a 4GB MAXTRANSFERSIZE and a BUFFERCOUNT of 16.” Did you mean 4MB MAXTRANSFERSIZE?
Great series by the way. I wish I would have started down the X events path sooner.