Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For an explanation of log growth and its effect on perf, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.
  • For the increasing performance through separation of data and log files, see this post.

In the previous post in the series, I examined the effects of separating the data and log files (one file each) to different RAID arrays. It was very obvious that separation gave a performance boost, and that having the portion of the database with the highest I/O write throughput requirements on the faster array (the 8-drive 15k SCSI RAID10) produced the biggest gain.

Now - a confession.  In the last post, when I posted it I found that moving the data an 8-drive 7.2k SATA RAID10 array was the best approach. *But* during the testing for this post, I found that one of my tests had screwed up and only half the client threads had run. You'll notice in that post I went back in and edited it to explain that and update the graph and results. I've now augmented my test harness with a way to check that all client threads are running - to make sure the tests are apples to apples, rather than apples to pomegranates :-)

So - the best I've been able to do so far with the tests is creating 1TB of data using 128 connections (each creating 1/128th TB using inserts with default values) with the single data file on an 8-drive 15k SCSI RAID array (pre-created to 1TB) and the log file on an 8-drive 7.2k SATA RAID10 array (pre-created to 256MB with 50MB autogrowth) in 20842 seconds.

Lots of people have been asking how my network is setup in these tests. Here's what I've been running with (all 1GB ethernet):

  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 10.x.x.x network
  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 192.168.x.x network
  • 2 x PowerConnect 5424 24-port iSCSI optimized switches , with no separation of traffic
  • The 10.x.x.x server NIC connected to all iSCSI arrays

Over the last couple of weeks I've been playing around with the network setup to make sure things are optimized, and this post will describe what I did and what effect it had. In all the tests below, I kept the dat aon the faster SCSI array and the log on the slower SATA array.

I'm very grateful to the help I received from Wes Brown (twitter|blog) and Denny Cherry (twitter|blog) to the technical questions and WTF?s I sent (and to anyone else on twitter I may have forgotten!).

1) Separation of network traffic

I decided to make one of the 5424 switches dedicated to iSCSI traffic on the 10.x.x.x network and the other for general network activity, including connecting to the management ports on the MD3000s. Turns out that I didn't really need to, as each 5424 can handle 48GB of throughput, way more than I'm generating. But hey ho, at least the wiring in the back of the 42U rack is a little tidier now :-)

Running the 128-way test with the new configuration gave a test time of 21252 seconds, slightly slower than the best time without separation! This was the first of the WTF?s. Until I realized that I hadn't actually removed any network bottleneck at all. I can't explain why things are slightly slower here, so I decided to take the switches out of the equation. My suspicion is that if I ran the test ten times, I'd get ten different results, but within a standard deviation of the median. So - no cause for concern. (In fact, I'm going to try this as part of the next set of tests.)

2) Direct connections to the iSCSI arrays

I configured another NIC (one from an Intel PRO/1000 PT Dual Port GigE card) and then had one NIC directly connected to one of the RAID controllers on the SCSI MD3000 (only one configured volume, so no concerns about having multiple volumes suddenly switching over to a non-optimal RAID controller) and the other NIC directly connected to the SATA MD3000.

Running the 128-way test with the new configuration gave a test time of 21859 seconds, slower than test #1. Very surprising - I expected to get some *gain* so I looked at the peak throughput of the arrays:

  • For test 1, peak SATA was 50500 bytes/sec and peak SCSI was 106012 bytes/sec.
  • For test 2, peak SATA was 46923 bytes/sec and peak SCSI was 107708 bytes/sec.

Things are slower with the network bottleneck removed.

3) Upgrading 5424 switch firmware and reconfiguring

Although the 5424 switches are supposed to come iSCSI optimized, I thought I'd flatten them and reconfigure the. I got the latest version of the switch firmware and re-flashed both switches. I think configured the 10.x.x.x one specifically for iSCSI using this excellent Dell whitepaper.

Running the 128-way test with the new configuration gave a test time of 20745 seconds. Finally an improvement, but nothing major, and still possibly just a statistical variation.

4) Upgrading the NIC drivers

Next I figured I'd bring the NICs up to the latest driver versions so upgraded all the NICs on all the servers.

Running the 128-way test with the new configuration gave a test time of 21743 seconds. Hmmm.

5) Homogenizing the network paths

At this point I started wondering if the Broadcom and Intel NICs had different characteristics so I decided to use the two Intel NICs for the iSCSI traffic. I also enabled jumbo frames. The Intel NICs have three setting for jumbo frames - off, 4088 bytes or 9014 bytes. The MD3000s can only go up to 9000 bytes, so I chose 4088 bytes and configured the MD3000 iSCSI ports to use the same.

Running the 128-way test with the new configuration gave a test time of 21526 seconds - nothing to write home about.

None of the network configuration changes I made had much effect on performance, apart from removing the network bottleneck, which made performance slightly worse overall. I checked other stuff like TCP offloading, but that wasn't enabled. My suspicion was that by removing the network bottleneck, I unmasked a SQL Server contention issue with my 128-connection test. I decided to try fewer client connections.

Here are the results:

There's clearly a SQL Server bottleneck that's being alleviated by reducing the number of connections and allowing the throughput to each array to increase slightly. With 8 connections, SQL Server isn't being driven hard enough and the elapsed time increases again, and this is reflected in the array throughput measurements too (a 10-15% drop compared to the 16-way test). One thing I forgot to do was examine the distribution of wait types while these tests were running, but my guess would be the bottleneck was in the transaction log manager.

Summary

By separating the network traffic and moving to two iSCSI NICs, I removed the network bottleneck I had (see the image at the bottom of the last post) and replaced it with a SQL Server bottleneck. Here's a snapshot of network utilization with the new setup:

 

In the next set of tests, I'm going to look at the effect of altering the transaction log auto-growth size, and pre-allocation size. In all the tests so far, the log has grown from the pre-allocated 256MB to somewhere between 6.5-8GB.

Should be interesting - stay tuned.

PS Yes, I'll be doing a bunch of stuff with the Fusion-io drives too - be patient! :-)

Christmas comes but once a year... really? Then mine just came early on this afternoon's UPS truck.

The very nice folks at Fusion-io just sent me two of their fully-loaded top-of-the-line ioDrive Duos with 640GB of solid-state flash memory in each. This is really extra-nice of them because on Dell's Small Business website they're currently retailing for $12800 *each*. Expensive? Yes. Worth it? That's what I'm hoping to prove.

There's nothing like expensive, pretty hardware to get me excited... here's what they look like:

Now, above I said 'expensive', and these are, but they pack some pretty amazing specs in terms of read/write bandwidth so you're getting a lot of bangs for you bucks. But where does it really make sense to drop the bucks for the biggest bangs? To answer that I'm planning to do a whole series of blog posts as part of my benchmarking efforts to investigate which operations can benefit the most from these drives. With 1.2TB of SSD storage I'll be able to plug these into one of my test systems here and run comparisons against 15k SCSI and 7.2k SATA drives.

Anyway, there's a lot of hype about the speed of SSDs, and also a lot of angst about SSDs not being Enterprise-ready. I don't agree with them not being Enterprise-ready - in fact, fellow-MVP Greg Linwood, who runs (among other things) our partner company SQLskills Australia, already has a bunch of customers with Fusion-io drives deployed in their enterprises successfully. As with any critical hardware infrastructure (especially cutting-edge stuff like this), the key to success is having everything setup correctly so I'll be blogging about all my experiences with them.

To summarize, I'm very excited! I've been wanting to get my hands on some serious SSD hardware for a couple of years now so I can do some *real* testing - it doesn't get better than this!

Shoot me an email or leave a comment if there's something you're interested in seeing tested.

PS Full disclosure: yes, of course Fusion-io sent me these because they're getting publicity from me blogging about them, but we don't have any editorial/veto agreement. I want to be able to recommend these to our enterprise clients and the only way to honestly do that is to play with them myself - so it's a win-win for both of us. And you guys get to test them vicariously through me, so it's a win-win for you too :-)

Stay tuned...

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.

In the previous post in the series, I optimized the log block size to get better throughput on the transaction log, but it was very obvious that having the log file and the data file on the same RAID array is a bottleneck.

Over the last couple of weeks I've been running some tests with the log and data files on separate RAID arrays, and this post will explain what I've found.

The first thing I did was setup the systems for remote access so I can log in to them from anywhere in the world. This involved:

  • Changing the TCP/IP port that Remote Desktop listens to (KB 306759)
  • Allowing that port through Windows Firewall (having to perform this step wasn't obvious)
  • Turning on port forwarding on our Internet-facing router for that port to the right server

It's very cool being able to play with these servers from anywhere - and to show live servers-under-load during a class, as I could last week.

The previous two tests were performed using a single 2TB RAID-10 array comprised of 8 300GB 15k SCSI drives. The new array I added to the mix for this test is comprised of various numbers of 1TB 7.2k SATA drives. I tested:

  • 4 drives in a RAID-10 configuration, giving 2TB
  • 6 drives in a RAID-10 configuration, giving 2TB usable
  • 8 drives in a RAID-10 configuration, giving 2TB usable

The volume size limit that the Dell MD3000i arrays allow is 2TB, so the 2nd and 3rd array configurations described above made use of extra spindles but with a lot of wasted space. In this set of tests I limited myself to a single data file, but coming up I'll try multiple data files which will enable more efficient usage of the available raw disk capacity.

I used the same scripts as in the previous tests, with 128 concurrent connections each inserting 4.1KB rows in batches of 10 inserts per transaction into the same table, for a total of 1/128 TB in each connection. The data file is pre-sized to 1TB. The log file is pre-sized to 250MB, with a 50MB auto-growth increment. These are invariants from the previous test (and are also something obvious that can be changed for better performance, but that's not what this set of tests was about).

I tried the following configurations:

  • Log on SCSI RAID-10, data on 4-, 6-, 8-drive SATA RAID-10
  • Data on SCSI RAID-10, log on 4-, 6-, 8-drive SATA RAID-10

And very interesting results they are too! In the previous set of tests, the best performance I could get was 21167 seconds for test completion, with the data and log files sharing the SCSI RAID-10 array.

Moving the log to a different array

The left-hand graph above shows that moving the log file off managed to get the time down to 20842 seconds in the best case, but that's using 8 drives. The 6- and 8-drive times were basically the same - which shows that 4-drives didn't provide enough IO parallelism for the load SQL Server was pushing through the array, but moving to 6 drives provided basically enough so that 8 drives didn't lead to a big performance gain.

Overall, I'd say there was no real performance gain from moving the log file to a different array.

As far as log disk queue lengths go, when the log was on the 4-drive array, the average log write disk queue length was in the 20s. For the 6- and 8-drive cases, the queue length dropped to low single-digits. It didn't drop right down to around zero because the perfmon counter is measuring the queue length as far as Windows sees it - and the iSCSI traffic was being bottle-necked through a single NIC, as we'll see later.

For data disk queue lengths, when the log was on the 4-drive array, the average data write queue length was around 5, with spikes to 20+. For the 6- and 8-drive cases, the queue length increased to an average of 10-15, with spikes to 30+. Clearly the log drive bottle-neck in the 4-drive case was lowering the overall transaction throughput, which reduced the load on the data drives.

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the log array. The results are in the right-hand graph above. You can see that the throughput of the data array remains basically static, but the log array throughput increases with more drives thrown into the mix. Nothing stunning here.

In the 4-drive case, the transaction log grew to 5.5GB but in the 6- and 8-drive cases the log grew to over 8GB - which is what I'd expect given the higher transaction throughput.

Moving the data to a different array

Edit 02/28/10: When performing the tests for the next post in the series, I found the results from this test didn't make sense. I went back to re-run the single-NIC tests and found what I feared - the original tests didn't complete properly. The corrected results are below. 

The left-hand graph above shows that moving the data file off managed to get the time down to 25400 seconds in the best case, but that's using 8 drives. The data file's throughput requirements are clearly higher than the log file's (for this specific benchmark test, not as a general statement by *any means*). It's clear that having the data file on the slower SATA array wasn't going to lead to any better performance than having it on the SCSI array.

As far as log disk queue lengths go, they were between 1-3 in all cases, as the 8-drive SCSI array could clearly provide enough IO throughput to satisfy the log's need.

For data disk queue lengths, when the data was on the 4-drive array, the average data write queue length was around 30, with wild spikes. For the 6- and 8-drive cases, the queue length decreased to an average of 20, and then down to 10 with spikes to 30+. When there was no checkpoint or lazy writer activity, the write queue lengths dropped to zero, as I'd expect. 

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the data array. The results are in the right-hand graph above. You can see that the throughput of the log array remains pretty static, but the data array throughput increases *dramatically* with more drives thrown into the mix. Again, nothing stunning here.

In the 4-drive case, the transaction log grew to 19GB but in the 6- and 8-drive cases the log grew to around 8GB - again I'd expect this. It seems that for this workload, with these default checkpoint settings, and with 8GB of server memory, the log needs to be around 8GB when there's adequate IO throughput. This is something I'll be trying to address in future tests.

Perfmon captures

I took a few perfmon snapshots during the various tests to provide some details of the system performance. I'm not going to go into as much detail explaining the various counters and what they mean, I did that in the previous post, but I will point out interesting details. These aren't necessarily meant to be representative of what you'll see running similar tests - they're of interesting times during the traces that I thought would be fun to look at and understand.

1) Log file on the 6-drive SATA array

 

This graph is highlighting the Avg. Disk Write Queue Length for the K: data array (the black line). You can see that when the Checkpoint pages/sec (pink line) and Lazy writes/sec (light green line) both drop to zero, the write queue length drops to zero - as there's no other activity on that RAID array. When either or both of these start up again, the write queue length spikes wildly. Other things to note:

  • The Avg. Disk Write Queue Length for the I: log array (the green line at the bottom) is pretty static in the low single-digits.
  • The Log Bytes Flushed/sec (the red line) tracks the Disk Write Bytes/sec for the I: log array, except during the heavy checkpoint activity at the start of the trace - this is a period of log file auto-growth.
  • The Pages Allocated/sec (the top light blue line) is fairly static, but varies wildly during the heavy checkpoint activity. This is because the log auto-growth is effectively stalling transaction throughput.

2) Log file on the 8-drive SATA array

 

This is highlighting the Pages Allocated/sec (the black line), which remains basically static, and is a measure of the overall transaction throughput. At this point in the test, the log has auto-grown as far as it will and performance is stable. Other things to note:

  • The Log Bytes Flushed/sec also remains static, as there's a constant transaction workload with no interruptions for log growth, and directly correlates with the spikes and troughs in the Pages Allocated/sec counter.
  • The Avg. Disk Write Queue Length for the I: log array (the bottom green line) is static.

3) Data on the 4-drive SATA array

 

This represents the worst combination of array sizes and file placements, and is a pretty chaotic trace. It's highlighting the Avg. Disk Write Queue Length of the I: data array (the black line) and you can see that it varies wildy between 0 and 80(!!!), clearly showing that the data file performance is hampered by under-powered RAID configuration. All other perf counters around the data array vary wildly - clearly not a recipe for high performance.

Summary

There's clearly a performance gain to be had from separating the data and log portions of the database in this case. However, doing so has highlighted the fact that the simple networking configuration I have is now a bottleneck.

Here's a Task Manager trace showing the network utilization during one of the tests:

 

The troughs are the constant traffic going to the log array, and the sustained peaks are when there's checkpoint and/or lazy writer activity going to the data array as well.

In the next post, I'll tune the network configuration, and then I'll move on to trying multiple data files in multiple RAID arrays.

Hope you're still enjoying the series!

Categories:
Benchmarking | Performance

(For the hardware setup I'm using, see this post. For the baseline performance measurements for this benchmark, see this post.)

In my previous post in the series, I described the benchmark I'm optimizing - populating a 1-TB clustered index as fast as possible using default values. I proved to you that I had an IO bottleneck because the IOs to the data file (from checkpoints) and the transaction log file (from transactions committing) were causing contention.

Several people commented that I might have mis-configured the iSCSI IO subsystem - so first off I want to look at that. Fellow MVP Denny Cherry (twitter|blog), who knows a lot more than me about IO subsystems, volunteered to discuss my iSCSI setup with me to make sure I hadn't goofed anywhere (many thanks Denny!). It seems like I haven't. I'm using a single iSCSI array right now, with a single NIC on the server dedicated to the iSCSI traffic - once I move to multiple volumes, I'll add in more NICs.

Looking at Task Manager in the middle of a 6-hour test run to see the network utilization through the NIC shows that it's not saturated, as shown below.

 

I ran the DELL smcli utility for two hours during the most recent test to see what peak throughput I'm getting, using the following command:

smcli -n Middle_MD3000 -c "set session performanceMonitorInterval=5 performanceMonitorIterations=1440;save storageArray performanceStats file=\"c:\\MiddlePerfStats.csv\";"

I saw around 101MBytes/sec. and earlier when testing the smcli settings I'd seen 106MBytes/sec. I'm sure once I remove some of the contention that this will get better.

On to the test!

The first thing I wanted to try was optimizing my use of the transaction log - i.e. doing as few and as large IOs as possible to the log. My hypothesis is that by changing the batch size from one to, say, 10 or 100 SQL Server can make more efficient use of the log.

I changed my T-SQL script to take a batch size parameter and use explicit transactions inserting <batch-size> records. Everything else remained the same as the baseline. I picked 128 concurrent threads as my test point. In the baseline, the 128-thread insert test took 8 hours 8 minutes 27 seconds to complete (29307 seconds). My simple T-SQL code changed to (with $(rows) and $(batch) being SQLCMD parameters to the script):

DECLARE @counter BIGINT;
DECLARE @inner   SMALLINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   SELECT @inner = 0;
   BEGIN TRAN;
   WHILE (@inner < $(batch))
   BEGIN
      INSERT INTO MyBigTable DEFAULT VALUES;
      SELECT @inner = @inner + 1;
   END
   COMMIT TRAN;
   SELECT @counter = @counter + $(batch);
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

Below are the results for 128 threads with batch sizes varying from 10 to 10000:

Increasing the batch size from 1 record to 10 records caused the elapsed time to drop from 29307 seconds to 21167 seconds - a 28% improvement! However, increasing the batch size further didn't improve things any more. Hmmm.

Next I tried different numbers of concurrent connections with a batch size of ten to see if the improvement was universal:

 

So moving from a single-record batch to a ten record batch has the same dramatic effect on performance for varying numbers of connections.

What's going on?

The answer involves some deep internals of the transaction log structure and IO behavior. The transaction log is split up internally into sections called virtual log files (VLFs for short). These VLFs are used to manage which parts of the log can be overwritten with new log records. If this is all unfamiliar, go read my TechNet Magazine article on Understanding Logging and Recovery in SQL Server which explains about the log in more detail and about transactions, otherwise what's coming next may not make much sense.

Whenever a transaction commits, all the transaction log records up to and including the LOP_COMMIT_XACT log record for that transaction (including log records from other transactions that are inter-mingled with those from the one that just committed) must be written out to disk, so the transaction is durable (called write-ahead logging). But it's not individual log records that are written to disk, the unit of IO for the transaction log is a log block. VLFs are split internally into dynamically sized log blocks, ranging from 512-bytes to 60KB in 512-byte increments, as needed. There are algorithms to figure out how big log blocks should be based on what log records are being flushed out.

With the single record batch, the amount of log generated in the transaction totals 4952 bytes. You can find this by doing something like:

CHECKPOINT;  -- to force the log in the SIMPLE recovery model to clear
GO

INSERT INTO MyBigTable DEFAULT VALUES;
GO

SELECT * FROM fn_dblog (NULL, NULL); -- show me all active log
GO

Then add up the values in the Log Record Length column for all records from the first LOP_BEGIN_XACT log record to the LOP_COMMIT_XACT with the matching Transaction ID.

Anyway, a single-record transaction generates 4952 bytes of transaction log, which will be flushed out in our case in a log block that's 5120 bytes (the nearest multiple of 512 above 4952), with a bunch of zero-padding at the end. For ten single-record transactions, that's 10 small IOs.

The ten-record transaction generates 48688 bytes of transaction log and will be flushed out in a log block that's 49152 bytes (the nearest multiple of 512 above 48688). This is clearly more efficient than 10 smaller IOs and is why changing to batches makes things go faster.

A 100-record transaction generates 489628 bytes of transaction log, which is clearly more than the 60KB log-block maximum, so it actually covers multiple log blocks (varying from 14-16 in my tests - I don't remember the exact algorithms). You can see the log block changing when the middle number in the Current LSN column of the fn_dblog output increases. It looks like it's jumping up, and it is - this is the offset of the log block within the current VLF divided by 512.

Because the 100-record transaction is split into multiple log blocks, there's no real IO gain during log flushes over the 10-record transaction - which is illustrated by the results I saw.

Now, even with this speed increase from the increased batch size, the average-disk-write-queue-length is still anywhere from 20-40 when running the 128-connection test, so there's still a significant bottleneck there. In fact, the transaction log grew significantly still during these tests (up to 23GB in one case) - for an explanation of this phenomenon, see Interesting case of watching log file growth during a perf test. I picked up some wait stats queries from fellow-MVP Glenn Berry (twitter|blog) to run while the tests were running. I'm using his as they're published and anyone can download and use them (see his blog post for the queries, and Bob Ward's CSS blog post that's starting to document the wait types).

His queries tell me that with the 10-record per batch and 128-connections:

  • Average write-stall to the data file is 37ms
  • Average write-stall to the log file is 18ms
  • Top wait types are PAGELATCH_EX (55% of all waits), PAGELATCH_SH (28% of all waits), and WRITELOG (14% of all waits)

The first two waits are the Storage Engine waiting for the buffer pool to kick out dirty pages so that buffers can be used to hold newly created pages, and the third is the log manager waiting for log block flushes to disk to complete.

Clearly IO IO IO.  Let's look at a a perfmon capture during the 10-records per batch test with 128 connections:

 

Let's go through each counter (top to bottom in the list in the image) and explain what's going on. I deliberately picked this time-slice, as it really simply shows the effect of contention:

  • Pages Allocated/sec: this is the light blue line and is the Access Methods part of the Storage Engine (the dev team I used to run) creating new data and index pages for the clustered index we're populating.
  • Checkpoint pages/sec: this is the pink line at the bottom left and bottom right of the capture. This is the buffer pool writing out dirty pages during a periodic checkpoint.
  • Avg. Disk sec/Write: this is the dark blue line that's tracking just above the thick black line. It's the amount of time in ms for a write to complete. You can see that it has a minimum around 12:51:00 and then varies wildly, hitting as high as 50+ms for a single write.
  • Avg. Disk Write Queue Length: this is the highlighted line in thick black. It also has a minimum around 12:51:00 and varies wildly the rest of the time.
  • Disk Write Bytes/sec: this is the dark green line at the top that shows the number of bytes being written to disk from all IO sources. Same story around 12:51:00.
  • Log Growths: A simple counter since the database was created/server started. It's over 100 and off the chart.
  • Log Bytes Flushed/sec: this is the red, fairly constant line around 1/3 the way up and is log blocks being flushed to disk because of transaction commits or checkpoints.
  • Lazy writes/sec: this is the light green line at the bottom and is the buffer pool having to force data pages to be written to disk (along with all transaction log flushed up to the point of the last log record to change the page being written) to make space for images of newly created pages.

This time-slice is really cool in that it shows what happens when contention goes away. Just before 12:51:00, a checkpoint ends and the lazywriter has nothing to do - so the only IOs hitting the disks are those coming from the transaction log flushing out log blocks as transactions commit. You can see the Avg. Disk Write Queue Length drop down to 2-3, the Avg. Disk sec/Write drop to about 5ms, and most beautiful of all (look, I'm a big geek ok? :-), the Disk Write Bytes/sec (the green line) drops down to be exactly equal to the Log Bytes Flushed/sec - proving that it's just log flushes hitting the disk. This is the no-contention case. It happens again for brief spell about 10 seconds later - the lazywriter most likely created a temporary surfeit of empty buffers. All the rest of the time, the lazywriter and checkpoints play havoc with the write throughput on the drives by causing contention.

It's clearly time to try some separation of files to relieve the contention - and that's what I'll cover in the next post in the series.

Hope you're enjoying the series - these take a long time to write up!

(For the hardware setup I'm using, see this post.)

As part of my new benchmarking series I first wanted to play around with different configurations of data files and backup files for a 1-TB database to see what kind of performance gains I can get invoking the parallelism possible when backing up and restoring the database. To do that I need a way to quickly populate a 1-TB database so that I can mess around with different numbers of files and so on. It doesn't matter what the data in the database is, as backup doesn't care - as long as there's a terabyte of it. Why a terabyte? It's a nice round number, it's a common enough database size right now, and I have the storage to play around with it.

But then my plans got derailed. In figuring out how to most quickly populate a 1-TB database, I realized that in itself would be an interesting benchmark to explore, so that's what I'm doing first.

My aim is to give you improvements you can use in real life. If you think this is going to be boring, skip down to the end of the post where I show a detailed perfmon and explain what's going on in my overloaded IO subsystem, then come back up here :-)

The baseline for this benchmark is contrived - I'm going to populate a 1-TB clustered index (so I can do other experiments with the index) as quickly as I can. The interesting part is that I'm starting with a brain-dead database layout, configuration, and method of populating the table, and then I'm going to alter different things to see the effect on the system. The effects and gains will be the interesting part as it will expose parts of how SQL Server works which *WILL* be applicable to real-life situations and workloads - the whole point of me doing all of this is to show you improvements, why they work, and how they could be useful to you.

When doing any kind of performance testing it's *essential* to have a baseline with which to compare - otherwise how do you know what effect a variation is having? This post describes my baseline setup, measurements, and limitations I start to notice.

First let me describe the baseline setup:

  • Single data file and log file stored on the same volume, on an 8-drive RAID-10 array (each drive is a 300GB 15k SCIS drive), connected through iSCSI to the server
  • Data file is created to be 1-TB, with instant file initialization enabled
  • Log file is created to be 256MB
  • Database is using the SIMPLE recovery model

Yes, I'm deliberately putting the data and log on the same RAID array. I *want* to see some contention so I can prove to you how separation of data and log can reduce contention and improve performance sometimes.

Database and table creation script:

CREATE DATABASE BackupPerfTest ON
(NAME = 'BackupPerfTest_Data',
   FILENAME = 'K:\BackupPerfTest.mdf',
   SIZE = 1TB,
   FILEGROWTH = 1GB)
LOG ON
(NAME = 'BackupPerfTest_Log',
   FILENAME = 'K:\BackupPerfTest.ldf',
   SIZE = 256MB,
   FILEGROWTH = 50MB);
GO

ALTER DATABASE BackupPerfTest SET RECOVERY SIMPLE;
GO

USE BackupPerfTest;
GO

CREATE TABLE MyBigTable (c1 BIGINT IDENTITY, c2 CHAR (4100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

I figured the fastest way to fill the database is to have a single table with one row per page, and that having SQL Server create the large CHAR column inside itself, rather than me doing a REPLICATE, would be quickest.

For the actual population of the table, I worked out that 134217728 table rows gives me a terabyte, with each row taking up a single 8KB page.

The baseline setup of the table population is:

  • Varying numbers of concurrent connections (16, 32, 64, 128, 256) to the server (8-way DELL PowerEdge 1950)
  • Each connection runs a simple script that inserts 134217728 / number-of-connections rows into the table, and then logs the elapsed time into a results table
  • Each insert is done as a single-statement implicit transaction (if I don't do an explicit BEGIN TRAN/COMMIT TRAN, SQL Server does it for me)
  • A monitor connection pings the results table every 5 minutes checking to see if number-of-connections results are there, and sending me email if so

Yes, I'm deliberately using this method to insert each row. Again, I want to be able to make improvements and see the effect of the changes.

Each connection will be running this script:

SET NOCOUNT ON;
GO

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   INSERT INTO MyBigTable DEFAULT VALUES;
   SELECT @counter = @counter + 1;
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

This is run through SQLCMD, and the number of rows to insert is passed into the T-SQL script using:

sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1

%1 in the line above is passed from a master CMD that kicks off number-of-connections CMD windows, each of which just runs the SQLCMD line above.

So - a simple setup.

Here's a graph of the results:

 

The elapsed times for all connections to complete their work (as there could be up to an hour between the first and last to complete) were:

  • 16-way: 15 hours 25 minutes 5 seconds
  • 32-way: 13 hours 50 minutes 18 seconds (11% faster)
  • 64-way: 10 hours 12 minutes 48 seconds (27% faster)
  • 128-way: 8 hours 8 minutes 27 seconds (20% faster)
  • 256-way: 7 hours 24 minutes 21 seconds (9% faster)

More connections clearly leads to a faster runtime, but the improvements from doubling the number of threads clearly aren't directly proportional to the number of threads. The biggest improvement was from 32 to 64 threads, and then the percentage gain started to tail off. Let's look at the page allocations per second for each experiment too:

 

As I'd expect, the pace at which pages are being allocated in the database increases with more threads and the percentage improvements line pretty much matches that of the elapsed time graph above. There's a slight difference in the 128 and 256 gains here because the graph is show what the perf counter number was after the experiment reached a steady state. I noticed that the log grew enormously for the last few tests, which caused the steady-state number to not be reached for a while. I've already blogged about that phenomenon in Interesting case of watching log file growth during a perf test.

I also watched what was happening in perfmon to see if there were any obvious performance problems going on. Here's a perfmon graph for the 64-way test once it reached steady-state and the log wasn't growing:

 

Analysis: 

  • The black line is above 0 when a checkpoint is occuring.
  • The green line represents how much data is being written to the K: volume, where the data and log file are. You can see there's a constant amount of data being written all the time (transaction log records when the implicit transactions commit) with large spikes whenever a checkpoint occurs and causes the actual data pages to be written out.
  • The light blue line is the pages allocated per second. You can see that it takes a major nose dive whenever a checkpoint occurs. I'm speculating that this is because of disk contention preventing the transaction log being written to disk (thus slowing down the transaction throughput) while the checkpoint is occuring and writing out data pages
  • The dark blue line at the bottom is the average disk seconds per write. You can see that it hovers around 4-5 milliseconds and spikes to 16-17 when a checkpoint occurs.
  • The brown line in the middle is the average disk write queue length. It hovers around 18-19 and spikes to around 25 when a checkpoint occurs.

Observations:

  • This system is clearly bottlenecked in the I/O subsystem
  • There is a direct correlation between checkpoints occuring and: increased disk queue length, reduced transaction throughput

This is what I would expect to see on a system with lots of updates where the log and data are on the same volume. Remember that I've got them on a fast RAID-10 array. This debunks the theory I've often heard that contention doesn't matter on RAID arrays. Yes, it does. You can still max out the throughput capabalities of any IO subsystem - it just depends what you're trying to do with it. Imagine if I had nonclustered indexes on this table too - more logging and more pages being allocated - much worse performance...

So now I've got my baseline and there are some obvious things I can try to improve things:

  • Data and log file separation
  • Varying number of data files
  • Varying placement of data files (e.g. different RAID arrays)
  • Manual checkpoints
  • Pre-sizing the log
  • Using explicit transactions with varying batch insert sizes
  • Using -E to get larger allocations

I'm not saying that all of these things are going to help, but over the next few weeks I'm going to try them all and report on the results. You'll be able to clearly see the effect of changing these on my benchmark, running on production-quality hardware, rather than just taking people's words for it.

I hope you're going to find these benchmarks and experiments useful - I'll be learning (hopefully) as I go along too.

Let me know if there's anything else you'd like to see me try, and if you're following along (I don't want to spend all this time if no-one's reading the series!)

Thanks!

I'm running some performance tests on the hardware we have (more details on the first of these tomorrow) and I was surprised to see some explosive transaction log growth while running in the SIMPLE recovery model with single row insert transactions!

Without spoiling tomorrow's thunder too much, I've got a setup with varying numbers of connections populating a 1TB table with default values, with the single data and log files on the same 8-drive RAID-10 array (again more details on why tomorrow). I was looking at perfmon while the test was running with 128 concurrent connections, and noticed some weird-looking behavior - basically what looked like a never-ending checkpoint occuring. So I dug in deeper and discovered the transaction log had grown over 7GB since the start of the test. Huh?

I restarted the test from scratch, with a 1TB data file and a 256MB log and watched what happened in perfmon; here's the screenshot of the first 90 seconds:

 

This is *so* interesting. The black line is the number of log growths, so you can see the log grows every time the line level goes up. The bright blue line is the number of pages being allocated per second to hold all the table rows my 128 connections are inserting. You can clearly see that every time there's a log growth, the allocations take a nose-dive - because no transactions can commit while the new portion of the log is being zeroed (remember that instant file initialization does not apply to the log - see this blog post). The green line (disk write bytes per second) spikes when the log grows because of the zeroes being written out to the log file by SQL Server (remember that SQL Server does the zeroing).

But why is the log growing in the first place? I'm in the SIMPLE recovery model and doing single-row implicit transactions (yes, I'm deliberately doing this) so there shouldn't be anything stopping the log from clearing during a checkpoint, right?

Wrong. The log starts out small (256MB) so one of the thresholds for triggering a checkpoint gets hit pretty fast (70% of the log file is used). So a checkpoint occurs (you can see the checkpoints occuring when the pink line at the bottom of the perfmon screen is above zero), and starts writing out to the data file, which is on the same disk as the log (see How do checkpoints work and what gets logged for an explanation of checkpoint mechanics), but it can't write fast enough (because of disk contention) to get to the log-clearing part before the log fills up completely and has to grow (because transactions are continuing at break-neck speed). So the log grows, and the insert transactions stop while the log is zeroed. And then the log starts to fill up again very quickly and another checkpoint is triggered, and so on and so on.

Eventually a steady state is reached where there's enough free log space during a checkpoint that no new log is required for the concurrent transactions to commit. You might ask why the checkpoint is so slow in the first place? Because I deliberately put the data file on the same RAID array as the log file, and both are being steadily hammered with writes - classic disk contention. Even though the RAID array is RAID-10 with 8x300GB 15k SCSI drives, the average disk queue write length is over 20 most of the time during the 128-way test because I'm simply trying to do too much.

The point of my blog post? Just because you don't have any of the classic causes of transaction log growth going on, doesn't mean you're going to be immune. In this case my (deliberate) poor physical layout of the database files and workload growth up to 128 concurrent connections caused the log to grow. What started out working when I was running 16 connections didn't work any more at 128 (actually I went back and re-ran some of the earlier tests and even with only 64 connections, the log grew to over 1GB before reaching steady-state).

Interesting eh?

Tomorrow I'll be continuing the perf/benchmarking series by creating my first benchmark and then tweaking the setup to see how I can improve performance (for instance with multiple data files, separation of log and data files - all the things I preach but have never *demonstrated*), but this behavior merited a post all on its own.

It's been a few weeks since my last posts but I've got a bunch in the pipeline coming up.

Firstly, I've got it together to start using the hardware we got a while back. I'm going to be doing some benchmarking, perf testing and playing with various HA technologies, and of course blogging a bunch about what I discover. The hardware was a lot of fun to put together!

This is kind of a reference post, detailing the hardware setup, so I can link to it in all the posts where I use our hardware.

Photos: front and back (no Facebook login required).

Here's what I'll be using, all mounted in a 42U rack:

  • 2 x Dell PowerEdge 1950 servers, each with
    • 2 x quad-core Xeon E5405 Processor2x6MB Cache, 2.0GHz, 1333MHz FSB
    • 8GB 667MHz RAM
    • 2 x 146GB 15K RPM SCSI drive (for OS etc)
    • 4 3GBps NICs
    • Windows Server 2008 Enterprise SP1 64-bit
    • Multiple SQL Server 2008 Enterprise instances
  • (Upper array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 15 x 1TB 7.2K RPM SATA drives
  • (Middle array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 8 x 300GB 15K RPM SCSI drives
  • (Bottom array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 8 x 1TB 7.2K RPM SATA drives
  • Connected with 1GbE through two PowerConnect 5424 24-port iSCSI optimized switches 

The drives will be reconfigured and I'll specify what I'm using in each post.

The SQL Server instances are vanilla installs, with no special parameters, sp_configure options, or trace flags. If I use any I'll detail them in each post.

Time to get using it!

Categories:
Benchmarking | Performance

Theme design by Nukeation based on Jelle Druyts