Once in a blue moon I come across something cool and it happens to be Friday, and I blog about it. Last time was drinking Guinness in Dublin and the time before was an insane driving skills video (see here).

My (scary) friend and fellow SQL MVP Grant Fritchey (Twitter|blog) posted the first two lines of the Time Warp from the Rocky Horror Picture Show on Twitter a few minutes ago, to which I responded with the next two lines. They're burned into my brain and I blame my parents.

My Mother used to run a drama school in Scotland when I was a kid (Helensburgh Theatre Workshop) and in the bi-annual shows someone would always "accidentally" says "I'm lucky, you're lucky, we're all LUCKY". Then the clock chimes would start and my Mum and Dad (dressed as Magenta and Riff-Raff) would come on stage and everyone would do the whole Time Warp song. Fond memories, and yes, my family is very outgoing (and strange :-)

In homage to them, Richard O'Brien and Tim Curry, here's a YouTube link to the movie version of the song. Wonderful! (Safe for work btw).

http://www.youtube.com/watch?v=yarYjuN-m8I

It's just a jump to the left!

(Twitter @paulrandal)

Categories:
Personal | TGIF

Way back in the mists of time, at the end of the last century, I wrote DBCC SHOWCONTIG for SQL Server 2000, to complement my new invention DBCC INDEXDEFRAG.

I also used to wear shorts all the time, with luminous orange, yellow, or green socks.

Many things change - I now have (some) dress sense, for one. One other thing that changed was that DMVs came onto the scene with SQL Server 2005. DBCC SHOWCONTIG was replaced by sys.dm_db_index_physical_stats. Under the covers though, they both use the same code - and the I/O characteristics haven't changed.

This is a blog post I've been meaning to do for a while now, and I finally had the impetus to do it when I heard about today's T-SQL Tuesday on I/O in general being run by Mike Walsh (Twitter|blog). It's a neat idea so I decided to join in this time. In retrospect, reading this over before hitting 'publish', I got a bit carried away (spending two hours on this) - but it's one of my babies, so I'm entitled to! :-)

This isn't a post about how to use DMVs in general, how to use this DMV in particular, or anything about index fragmentation. This is a blog post about how the DMV works.

DMV is a catch-all phrase that most people (myself included) use to describe all the various utility views in SQL Server 2005 and 2008. DMV = Dynamic Management View. There's a catch with the catch-all though - some of the DMVs aren't views at all, they're functions. A pure DMV gets info from SQL Server's memory (or system tables) and displays it in some form. A DMF, on the other hand, has to go and so some work before it can give you some results. The sys.dm_db_index_physical_stats DMV (which I'm going to call 'the DMV' from now on) is by far the most expensive of these - but only in terms of I/O.

The idea of the DMV is to display physical attributes of indexes (and the special case of a heap) - to do this it has to scan the pages comprising the index, calculating statistics as it goes. Many DMVs support what's called predicate pushdown, which means if you specify a WHERE clause, the DMV takes that into account as it prepares the information. This DMV doesn't. If you ask it for only the indexes in the database that have logical fragmentation > 30%, it will scan all the indexes, and then just tell you about those meeting your criteria. It has to do this because it has no way of knowing which ones meet your criteria until it analyzes them - so can't support predicate pushdown.

This is where understanding what it's doing under the covers comes in - the meat of this post.

LIMITED

The default operating mode of the DMV is called LIMITED. Kimberly always makes fun of the equivalent option for DBCC SHOWCONTIG, which I named as a young and foolish developer - calling it WITH FAST. Hey - it's descriptive!

The LIMITED mode can only return the logical fragmentation of the leaf level plus the page count. It doesn't actually read the leaf level. It makes use of the fact that the next level up in the index contains a key-ordered list of page IDs of the pages at the leaf level - so it's trivial to examine the key-ordered list and see if the page IDs are also in allocation order or not, thus calculating logical fragmentation.

The idea behind this option is to allow you to find the fragmentation of an index by reading the minimum number of pages, i.e. in the smallest amount of time. This option can be magnitudes faster than using the DETAILED mode scan, and it depends on how big the index's fanout is. Without getting too much into the guts of indexes, the fanout is based on the index key size, and determines the number of child-page pointers an index page can hold (e.g. the number of leaf-level pages that a page in the next level up has information about).

Consider an index with a char(800) key. Each entry in a page in the level above the leaf has to include a key value (the lowest key that can possibly appear on the page being referred to), plus a page ID, plus record overhead, plus slot array entry - so 812 bytes. So a page can only hold 8096/812 = 9 such entries. The fanout is at most 9.

Consider an index with a bigint key. Each entry is 13 bytes, so a page can hold 8096/13 = 622 entries. The fanout is at most 622, but will likely be smaller, depending on operations on the index causing fragmentation at the non-leaf levels.

For a table with 1 million pages at the leaf level, the first index will have 1 million/9 = 111112 pages at least at the level above the leaf. The second index will have at least 1608 pages. The savings in I/O from using the LIMITED mode scan will clearly differ based on the fanout.

I've created a 100GB clustered index (on the same hardware as I'm using for the benchmarking series) with 13421760 leaf-level pages and a maximum fanout of 540. In reality, I populated the index using 16 concurrent threads, so there's some fragmentation. The level above the leaf has 63012 pages, an effective fanout of 213. Still, the LIMITED mode scan will read 213x less than a DETAILED scan, but will it be 213x faster?

Here's a perfmon capture of the LIMITED mode scan on my index:

 

There's nothing special going on under the covers in a LIMITED mode scan - the chain of pages at the level above the leaf is read in page-linkage order, with no readahead. The perfmon capture shows:

  • Avg. Disk Read Queue Length (light blue) is a steady 1.
  • Avg. disk sec/Read (pink) is a steady 4ms.
  • Disk Read Bytes/sec (green) is roughly 14.5million.
  • Page reads/sec (dark blue) is roughly 1800.

DETAILED 

The DETAILED mode does two things:

  • Calculate fragmentation by doing a LIMITED mode scan
  • Calculate all other statistics by reading all pages at every level of the index

And so it's obviously the slowest. It has to do the LIMITED mode scan first to be able to calculate the logical fragmentation, because it reads the leaf level pages in the fastest possible way - in allocation order. DBCC has a customized read-ahead mechanism for allocation order scans that it uses for this DMV and for DBCC CHECK* commands. It's *incredibly* aggressive and will hit the disks as hard as it possibly can, especially with DBCC CHECK* running in parallel.

Here's a perfmon capture of the DETAILED mode scan on my index:

 

Not quite as pretty as the LIMITED mode scan, but I like it :-) Here's what it's showing:

  • Avg. Disk Read Queue Length (black) is in the multiple hundreds. Clearly it's appetite for data is outstripping what my RAID array can do. It basically tries to saturate the I/O subsystem to get as much data as possible flowing into SQL Server.
  • Avg. disk sec/Read (pink line at the bottom) is actually measuring in whole seconds, rather than ms. Given the disk queue length, I'd expect that.
  • DBCC Logical Scan Bytes/sec (red) varies substantially as the readahead mechanism throttles up and down, but it's driving anywhere up to 80MB/sec. You can see around 9:49:20 AM when it drops to zero for a few seconds.
  • Readahead pages/sec (green) is tracking the DBCC scan. This is a buffer pool counter, the DBCC one is an Access Methods counter (the dev team I used to run during 2005 development). If I had Disk Read Bytes/sec and Pages reads/sec showing, they'd track the other two perfectly - I turned them off for clarity.

So the DETAILED mode not only reads more data, but it does it a heck of a lot more aggressively so has a much more detrimental effect on the overall I/O capabilities of the system while it's running.

SAMPLED

There is a third mode that was introduced just for the DMV. The idea is that if you have a very large table and you want an idea of some of the leaf level statistics, but you don't want to take the perf hit of running a DETAILED scan, you can use this mode. It does:

  • LIMITED mode scan
  • If the number of leaf level pages is < 10000, read all the pages, otherwise read every 100th pages (i.e. a 1% sample)

Summary

There's no progress reporting from the DMV (or DBCC SHOWCONTIG) but if you look at the reads column in sys.dm_exec_sessions you can see how far through the operation it is. This method works best for DETAILED scans, where can compare that number against the in_row_data_page_count for the index in sys.dm_db_partition_stats (yes, you'll need to mess around a bit if the index is actually partitioned).

In terms of timing, I ran all three scan modes to completion. The results:

  • LIMITED mode: 282 seconds
  • SAMPLED mode: 414 seconds
  • DETAILED mode: 3700 seconds

Although the LIMITED mode scan read roughly 200x less than the DETAILED scan, it was only 13 times faster, because the readahead mechanism for the DETAILED scan is way more efficient than the (necessary) follow-the-page-linkages scan of the LIMITED mode.

Just for kicks, I ran a SELECT COUNT(*) on the index to see how the regular Access Methods readahead mechanism would fare - it completed in 3870 seconds - 5% slower, and it had less processing to do than the DMV. Clearly DBCC rules! :-)

Although the DETAILED mode gives the most comprehensive output, it has to do the most work. For very large indexes, this could mean that your buffer pool effectively gets flushed out by the lazy writer making space available for the DMV to read and process the pages. One of the reasons I advise people to only run the DMV on indexes they know they're interested in - and better yet, run it on a restored backup of the database.

Hope this is helpful!

PS Oh, also beware of using the SSMS fragmentation wizard. It uses a SAMPLED mode scan, but I found it impossible to cancel!

Back in January I offered a promotion as a way of introducing our maintenance/operations auditing services. Now I've completed a bunch of them, with some excellent results for customers.

One international customer, Plex Systems, was so pleased with the outcome of my audit of their manufacturing ERP software hosting databases that they issued a press release today to their industry partners and clients. Another customer, Hose and Fittings, Etc, was amazed at the details and justifications in the report I presented them - see their testimonial on the Past Customers page.

Take a look at the new auditing page that describes how the various audits work, and let me know if you want to discuss doing one - we can do it wherever you are in the world.

Thanks!

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For an explanation of log growth and its effect on perf, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.
  • For the increasing performance through separation of data and log files, see this post.

In the previous post in the series, I examined the effects of separating the data and log files (one file each) to different RAID arrays. It was very obvious that separation gave a performance boost, and that having the portion of the database with the highest I/O write throughput requirements on the faster array (the 8-drive 15k SCSI RAID10) produced the biggest gain.

Now - a confession.  In the last post, when I posted it I found that moving the data an 8-drive 7.2k SATA RAID10 array was the best approach. *But* during the testing for this post, I found that one of my tests had screwed up and only half the client threads had run. You'll notice in that post I went back in and edited it to explain that and update the graph and results. I've now augmented my test harness with a way to check that all client threads are running - to make sure the tests are apples to apples, rather than apples to pomegranates :-)

So - the best I've been able to do so far with the tests is creating 1TB of data using 128 connections (each creating 1/128th TB using inserts with default values) with the single data file on an 8-drive 15k SCSI RAID array (pre-created to 1TB) and the log file on an 8-drive 7.2k SATA RAID10 array (pre-created to 256MB with 50MB autogrowth) in 20842 seconds.

Lots of people have been asking how my network is setup in these tests. Here's what I've been running with (all 1GB ethernet):

  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 10.x.x.x network
  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 192.168.x.x network
  • 2 x PowerConnect 5424 24-port iSCSI optimized switches , with no separation of traffic
  • The 10.x.x.x server NIC connected to all iSCSI arrays

Over the last couple of weeks I've been playing around with the network setup to make sure things are optimized, and this post will describe what I did and what effect it had. In all the tests below, I kept the dat aon the faster SCSI array and the log on the slower SATA array.

I'm very grateful to the help I received from Wes Brown (twitter|blog) and Denny Cherry (twitter|blog) to the technical questions and WTF?s I sent (and to anyone else on twitter I may have forgotten!).

1) Separation of network traffic

I decided to make one of the 5424 switches dedicated to iSCSI traffic on the 10.x.x.x network and the other for general network activity, including connecting to the management ports on the MD3000s. Turns out that I didn't really need to, as each 5424 can handle 48GB of throughput, way more than I'm generating. But hey ho, at least the wiring in the back of the 42U rack is a little tidier now :-)

Running the 128-way test with the new configuration gave a test time of 21252 seconds, slightly slower than the best time without separation! This was the first of the WTF?s. Until I realized that I hadn't actually removed any network bottleneck at all. I can't explain why things are slightly slower here, so I decided to take the switches out of the equation. My suspicion is that if I ran the test ten times, I'd get ten different results, but within a standard deviation of the median. So - no cause for concern. (In fact, I'm going to try this as part of the next set of tests.)

2) Direct connections to the iSCSI arrays

I configured another NIC (one from an Intel PRO/1000 PT Dual Port GigE card) and then had one NIC directly connected to one of the RAID controllers on the SCSI MD3000 (only one configured volume, so no concerns about having multiple volumes suddenly switching over to a non-optimal RAID controller) and the other NIC directly connected to the SATA MD3000.

Running the 128-way test with the new configuration gave a test time of 21859 seconds, slower than test #1. Very surprising - I expected to get some *gain* so I looked at the peak throughput of the arrays:

  • For test 1, peak SATA was 50500 bytes/sec and peak SCSI was 106012 bytes/sec.
  • For test 2, peak SATA was 46923 bytes/sec and peak SCSI was 107708 bytes/sec.

Things are slower with the network bottleneck removed.

3) Upgrading 5424 switch firmware and reconfiguring

Although the 5424 switches are supposed to come iSCSI optimized, I thought I'd flatten them and reconfigure the. I got the latest version of the switch firmware and re-flashed both switches. I think configured the 10.x.x.x one specifically for iSCSI using this excellent Dell whitepaper.

Running the 128-way test with the new configuration gave a test time of 20745 seconds. Finally an improvement, but nothing major, and still possibly just a statistical variation.

4) Upgrading the NIC drivers

Next I figured I'd bring the NICs up to the latest driver versions so upgraded all the NICs on all the servers.

Running the 128-way test with the new configuration gave a test time of 21743 seconds. Hmmm.

5) Homogenizing the network paths

At this point I started wondering if the Broadcom and Intel NICs had different characteristics so I decided to use the two Intel NICs for the iSCSI traffic. I also enabled jumbo frames. The Intel NICs have three setting for jumbo frames - off, 4088 bytes or 9014 bytes. The MD3000s can only go up to 9000 bytes, so I chose 4088 bytes and configured the MD3000 iSCSI ports to use the same.

Running the 128-way test with the new configuration gave a test time of 21526 seconds - nothing to write home about.

None of the network configuration changes I made had much effect on performance, apart from removing the network bottleneck, which made performance slightly worse overall. I checked other stuff like TCP offloading, but that wasn't enabled. My suspicion was that by removing the network bottleneck, I unmasked a SQL Server contention issue with my 128-connection test. I decided to try fewer client connections.

Here are the results:

There's clearly a SQL Server bottleneck that's being alleviated by reducing the number of connections and allowing the throughput to each array to increase slightly. With 8 connections, SQL Server isn't being driven hard enough and the elapsed time increases again, and this is reflected in the array throughput measurements too (a 10-15% drop compared to the 16-way test). One thing I forgot to do was examine the distribution of wait types while these tests were running, but my guess would be the bottleneck was in the transaction log manager.

Summary

By separating the network traffic and moving to two iSCSI NICs, I removed the network bottleneck I had (see the image at the bottom of the last post) and replaced it with a SQL Server bottleneck. Here's a snapshot of network utilization with the new setup:

 

In the next set of tests, I'm going to look at the effect of altering the transaction log auto-growth size, and pre-allocation size. In all the tests so far, the log has grown from the pre-allocated 256MB to somewhere between 6.5-8GB.

Should be interesting - stay tuned.

PS Yes, I'll be doing a bunch of stuff with the Fusion-io drives too - be patient! :-)

The March edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • Distributed transactions and database mirroring - why they don't work together
  • Background processes that can cause I/Os even with no connections to the server
  • How to restore backups from a file containing multiple appended backups
  • The perennial problem of production databases being too large to restore in development

Check it out at http://technet.microsoft.com/en-us/magazine/ff458345.aspx.

Christmas comes but once a year... really? Then mine just came early on this afternoon's UPS truck.

The very nice folks at Fusion-io just sent me two of their fully-loaded top-of-the-line ioDrive Duos with 640GB of solid-state flash memory in each. This is really extra-nice of them because on Dell's Small Business website they're currently retailing for $12800 *each*. Expensive? Yes. Worth it? That's what I'm hoping to prove.

There's nothing like expensive, pretty hardware to get me excited... here's what they look like:

Now, above I said 'expensive', and these are, but they pack some pretty amazing specs in terms of read/write bandwidth so you're getting a lot of bangs for you bucks. But where does it really make sense to drop the bucks for the biggest bangs? To answer that I'm planning to do a whole series of blog posts as part of my benchmarking efforts to investigate which operations can benefit the most from these drives. With 1.2TB of SSD storage I'll be able to plug these into one of my test systems here and run comparisons against 15k SCSI and 7.2k SATA drives.

Anyway, there's a lot of hype about the speed of SSDs, and also a lot of angst about SSDs not being Enterprise-ready. I don't agree with them not being Enterprise-ready - in fact, fellow-MVP Greg Linwood, who runs (among other things) our partner company SQLskills Australia, already has a bunch of customers with Fusion-io drives deployed in their enterprises successfully. As with any critical hardware infrastructure (especially cutting-edge stuff like this), the key to success is having everything setup correctly so I'll be blogging about all my experiences with them.

To summarize, I'm very excited! I've been wanting to get my hands on some serious SSD hardware for a couple of years now so I can do some *real* testing - it doesn't get better than this!

Shoot me an email or leave a comment if there's something you're interested in seeing tested.

PS Full disclosure: yes, of course Fusion-io sent me these because they're getting publicity from me blogging about them, but we don't have any editorial/veto agreement. I want to be able to recommend these to our enterprise clients and the only way to honestly do that is to play with them myself - so it's a win-win for both of us. And you guys get to test them vicariously through me, so it's a win-win for you too :-)

Stay tuned...

Our first public class of the year is just over 3 weeks away! This is the *only* public class we'll be teaching in the US in 2010. There are a few spots left so book now to avoid disappointment.

We've teamed up with our good friend Adam Machanic to bring a week-long custom class to the Boston area. In this class Kimberly and I take turns teaching modules and we're both on hand to answer questions, do research, and try things out (and banter too!). The tag-team approach works *really well* and make the class very enjoyable and relaxed for those attending (and for us!).

The class will be March 29th-April 2nd in the Le Meridien hotel in Cambridge, MA and will cover:

  • On-disk structures: how the data is stored
  • Index internals: how the data is organized
  • Logging and recovery: how the data is protected
  • Choosing the RIGHT Data Type
  • Table & Index Partitioning Strategies
  • Data Access
  • Indexing Strategies
  • Data and log file provisioning and management
  • Tempdb
  • Index and statistics maintenance
  • Using backup and restore (plus internals)
  • Consistency checking and disaster recovery

You can see a much more detailed course outline here and full details including how to register at Adam's Boston SQL Training website.

We hope to see you there!

Categories:
Classes

A while ago I blogged about disk partition alignment, and how the default alignment of 31.5Kb on Windows Server 2003 can lead to enormous I/O performance problems (see Are your disk partition offsets, RAID stripe sizes, and NTFS allocation units set correctly?). We've been on-site with clients this week and that topic came up again. I thought it would be useful to do a quick blog post showing how to use the diskpart and wmic tools. Google them for lots of info from the Microsoft site - but be careful not to play around with any of the destructive options on productions systems. The options I'm using below will not alter the disks in any way.

Note: This stuff applies to MBR disks, not GPT or dynamic disks. Although these require correct alignment too, I don't have any information on how to do it for those disks. The SQLCAT team will be publishing some guidelines but has not yet done so, AFAIK. Check out the SQLCAT team whitepaper Disk Partition Alignment Best Practices for SQL Server for full details on this topic.

Bring up a command prompt and type diskpart. You'll see something like:

C:\Users\Administrator>diskpart

Microsoft DiskPart version 6.0.6001
Copyright (C) 1999-2007 Microsoft Corporation.
On computer: MONKEY

DISKPART>

Next you need to list the logical disks that Windows knows about. Type list disk. You'll see something like:

DISKPART> list disk

  Disk ###  Status      Size     Free     Dyn  Gpt
  --------  ----------  -------  -------  ---  ---
  Disk 0    Online       136 GB  1434 MB
  Disk 1    Online      1116 GB      0 B
  Disk 2    Online      2036 GB      0 B

DISKPART> 

Disks 1 and 2 are two RAID arrays I'm using right now for the performance benchmarking series I'm doing. Notice that the numbers in the Free column aren't correct - not sure why not. 

To see the partitions on a disk, you need to set the diskpart focus to be that disk. Type select disk X, where X is the disk you want to focus on. You'll see something like:

DISKPART> select disk 1

Disk 1 is now the selected disk.

DISKPART>

And now you can list the partitions on the disk using list partition. You'll see something like:

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary           1116 GB  1024 KB

DISKPART>

This is the output from one of my Windows Servr 2008 servers, where the default partition offset is 1MB - which doesn't lead to perf issues.

On another Windows XP system, I get the following output:

DISKPART> select disk 0

Disk 0 is now the selected disk.

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary            119 GB    32 KB

DISKPART>

This disk isn't aligned correctly. If this was a RAID array, I'd pay a perf penalty every time a read or write straddled a RAID stripe offset. See the blog post link at the top of this post for more details.

Unfortunately, diskpart isn't always the best tool to use to get partition offsets, as it rounds up the values, and when there are multiple partitions, it can be hard to tell exactly what's what, especially whtih lots of disks where you need to select each one and then list the partitions.

In this case, use wmic to get the exact numbers. The command is as follows:

wmic partition get BlockSize, StartingOffset, Name, Index

For my server, I get the following output:

BlockSize  Index  Name                   StartingOffset
512        0      Disk #1, Partition #0  1048576
512        0      Disk #2, Partition #0  1048576
512        0      Disk #0, Partition #0  1505755136
 

For dynamic disks, use:

dmddiag.exe -v 

Now - go out to your servers and check the partition alignment - fixing this can give you up to 30-40% I/O performance boost!!

How do you fix it? Well, that's the downside - fixing it means reformatting the disk to have the correct partition offset or moving the data to a disk that already has the correct partition offset. Remember - although Windows Server 2008 creates disks with the correct offset, taking a disk that was created on Windows Server 2003 and attaching it to Windows Server 2008 will have no effect on the existing partition offset.

Categories:
IO Subsystems | Performance | Tools

The very worst piece of advice I ever saw on the Internet was in response to someone asking on a SQL newsgroup 'how can I create a corrupt database?'. The first response was:

When I want to corrupt a database to play with, I go into the data center, find a hard-drive and flick the power switch on-and-off quickly a few times.

This was closely followed by a bunch of replies (including mine) saying 'Noooooooooooo!!!!'

So, for a few years now I've provided a zip file with a bunch of pre-corrupted databases in - so you can test your consistency checking jobs, and experiment with corrupt databases without having to create them yourself by destroying hard-drives, or less daft means. You can find the main zip file at the top of our Past Events resources page, along with a link to a blog post which explains the various databases and demo scripts.

Several times I've been asked to provide SQL Server 2008 versions of the two databases which demo system table corruption. Well, that task has finally bubbled to my long-list of blog posts and website updates. There's now a second, smaller zip file which has 2008-only versions of the DemoFatalCorruption1 and DemoFatalCorruption2 databases, which showcase corruptions that prevent DBCC CHECKDB from running.

Let me know if you have any problems (playing with the backups Wink)

Enjoy!

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.

In the previous post in the series, I optimized the log block size to get better throughput on the transaction log, but it was very obvious that having the log file and the data file on the same RAID array is a bottleneck.

Over the last couple of weeks I've been running some tests with the log and data files on separate RAID arrays, and this post will explain what I've found.

The first thing I did was setup the systems for remote access so I can log in to them from anywhere in the world. This involved:

  • Changing the TCP/IP port that Remote Desktop listens to (KB 306759)
  • Allowing that port through Windows Firewall (having to perform this step wasn't obvious)
  • Turning on port forwarding on our Internet-facing router for that port to the right server

It's very cool being able to play with these servers from anywhere - and to show live servers-under-load during a class, as I could last week.

The previous two tests were performed using a single 2TB RAID-10 array comprised of 8 300GB 15k SCSI drives. The new array I added to the mix for this test is comprised of various numbers of 1TB 7.2k SATA drives. I tested:

  • 4 drives in a RAID-10 configuration, giving 2TB
  • 6 drives in a RAID-10 configuration, giving 2TB usable
  • 8 drives in a RAID-10 configuration, giving 2TB usable

The volume size limit that the Dell MD3000i arrays allow is 2TB, so the 2nd and 3rd array configurations described above made use of extra spindles but with a lot of wasted space. In this set of tests I limited myself to a single data file, but coming up I'll try multiple data files which will enable more efficient usage of the available raw disk capacity.

I used the same scripts as in the previous tests, with 128 concurrent connections each inserting 4.1KB rows in batches of 10 inserts per transaction into the same table, for a total of 1/128 TB in each connection. The data file is pre-sized to 1TB. The log file is pre-sized to 250MB, with a 50MB auto-growth increment. These are invariants from the previous test (and are also something obvious that can be changed for better performance, but that's not what this set of tests was about).

I tried the following configurations:

  • Log on SCSI RAID-10, data on 4-, 6-, 8-drive SATA RAID-10
  • Data on SCSI RAID-10, log on 4-, 6-, 8-drive SATA RAID-10

And very interesting results they are too! In the previous set of tests, the best performance I could get was 21167 seconds for test completion, with the data and log files sharing the SCSI RAID-10 array.

Moving the log to a different array

The left-hand graph above shows that moving the log file off managed to get the time down to 20842 seconds in the best case, but that's using 8 drives. The 6- and 8-drive times were basically the same - which shows that 4-drives didn't provide enough IO parallelism for the load SQL Server was pushing through the array, but moving to 6 drives provided basically enough so that 8 drives didn't lead to a big performance gain.

Overall, I'd say there was no real performance gain from moving the log file to a different array.

As far as log disk queue lengths go, when the log was on the 4-drive array, the average log write disk queue length was in the 20s. For the 6- and 8-drive cases, the queue length dropped to low single-digits. It didn't drop right down to around zero because the perfmon counter is measuring the queue length as far as Windows sees it - and the iSCSI traffic was being bottle-necked through a single NIC, as we'll see later.

For data disk queue lengths, when the log was on the 4-drive array, the average data write queue length was around 5, with spikes to 20+. For the 6- and 8-drive cases, the queue length increased to an average of 10-15, with spikes to 30+. Clearly the log drive bottle-neck in the 4-drive case was lowering the overall transaction throughput, which reduced the load on the data drives.

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the log array. The results are in the right-hand graph above. You can see that the throughput of the data array remains basically static, but the log array throughput increases with more drives thrown into the mix. Nothing stunning here.

In the 4-drive case, the transaction log grew to 5.5GB but in the 6- and 8-drive cases the log grew to over 8GB - which is what I'd expect given the higher transaction throughput.

Moving the data to a different array

Edit 02/28/10: When performing the tests for the next post in the series, I found the results from this test didn't make sense. I went back to re-run the single-NIC tests and found what I feared - the original tests didn't complete properly. The corrected results are below. 

The left-hand graph above shows that moving the data file off managed to get the time down to 25400 seconds in the best case, but that's using 8 drives. The data file's throughput requirements are clearly higher than the log file's (for this specific benchmark test, not as a general statement by *any means*). It's clear that having the data file on the slower SATA array wasn't going to lead to any better performance than having it on the SCSI array.

As far as log disk queue lengths go, they were between 1-3 in all cases, as the 8-drive SCSI array could clearly provide enough IO throughput to satisfy the log's need.

For data disk queue lengths, when the data was on the 4-drive array, the average data write queue length was around 30, with wild spikes. For the 6- and 8-drive cases, the queue length decreased to an average of 20, and then down to 10 with spikes to 30+. When there was no checkpoint or lazy writer activity, the write queue lengths dropped to zero, as I'd expect. 

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the data array. The results are in the right-hand graph above. You can see that the throughput of the log array remains pretty static, but the data array throughput increases *dramatically* with more drives thrown into the mix. Again, nothing stunning here.

In the 4-drive case, the transaction log grew to 19GB but in the 6- and 8-drive cases the log grew to around 8GB - again I'd expect this. It seems that for this workload, with these default checkpoint settings, and with 8GB of server memory, the log needs to be around 8GB when there's adequate IO throughput. This is something I'll be trying to address in future tests.

Perfmon captures

I took a few perfmon snapshots during the various tests to provide some details of the system performance. I'm not going to go into as much detail explaining the various counters and what they mean, I did that in the previous post, but I will point out interesting details. These aren't necessarily meant to be representative of what you'll see running similar tests - they're of interesting times during the traces that I thought would be fun to look at and understand.

1) Log file on the 6-drive SATA array

 

This graph is highlighting the Avg. Disk Write Queue Length for the K: data array (the black line). You can see that when the Checkpoint pages/sec (pink line) and Lazy writes/sec (light green line) both drop to zero, the write queue length drops to zero - as there's no other activity on that RAID array. When either or both of these start up again, the write queue length spikes wildly. Other things to note:

  • The Avg. Disk Write Queue Length for the I: log array (the green line at the bottom) is pretty static in the low single-digits.
  • The Log Bytes Flushed/sec (the red line) tracks the Disk Write Bytes/sec for the I: log array, except during the heavy checkpoint activity at the start of the trace - this is a period of log file auto-growth.
  • The Pages Allocated/sec (the top light blue line) is fairly static, but varies wildly during the heavy checkpoint activity. This is because the log auto-growth is effectively stalling transaction throughput.

2) Log file on the 8-drive SATA array

 

This is highlighting the Pages Allocated/sec (the black line), which remains basically static, and is a measure of the overall transaction throughput. At this point in the test, the log has auto-grown as far as it will and performance is stable. Other things to note:

  • The Log Bytes Flushed/sec also remains static, as there's a constant transaction workload with no interruptions for log growth, and directly correlates with the spikes and troughs in the Pages Allocated/sec counter.
  • The Avg. Disk Write Queue Length for the I: log array (the bottom green line) is static.

3) Data on the 4-drive SATA array

 

This represents the worst combination of array sizes and file placements, and is a pretty chaotic trace. It's highlighting the Avg. Disk Write Queue Length of the I: data array (the black line) and you can see that it varies wildy between 0 and 80(!!!), clearly showing that the data file performance is hampered by under-powered RAID configuration. All other perf counters around the data array vary wildly - clearly not a recipe for high performance.

Summary

There's clearly a performance gain to be had from separating the data and log portions of the database in this case. However, doing so has highlighted the fact that the simple networking configuration I have is now a bottleneck.

Here's a Task Manager trace showing the network utilization during one of the tests:

 

The troughs are the constant traffic going to the log array, and the sustained peaks are when there's checkpoint and/or lazy writer activity going to the data array as well.

In the next post, I'll tune the network configuration, and then I'll move on to trying multiple data files in multiple RAID arrays.

Hope you're still enjoying the series!

Categories:
Benchmarking | Performance

The next set of our public classes (what we call Immersion Events) in 2010 are now officially open for registrations!

We've teamed up with SQLskills Australia (our partner company run by good friend and fellow-MVP Greg Linwood) to bring two 3-day advanced performance courses to Melbourne. We loved it so much when we were there in October 2009 that we couldn't wait to get back again!. In these two classes Kimberly and I take turns teaching modules and we're both on hand to answer questions, do research, and try things out (and banter too!). The tag-team approach works *really well* and make the class very enjoyable and relaxed for those attending (and for us!).

The classes will be April 21-23, and April 28-30 2010 in the SQLskills Australia training facility in central Melbourne. There's an early bird registration deal for both courses until late-March - check it out!

The classes will cover:

  • Class 1: Designing and Optimizing for High Performance
    • Internals
    • Data types
    • Horizontal partitioning
    • Index internals
    • Data access patterns
    • Statistics
    • Indexing strategies
    • Batches and plan cache
    • Optimizing procedural code
  • Class 2: Troubleshooting and Maintaining High Performance
    • Data file optimizations
    • Log file optimizations
    • Tempdb optimizations
    • Wait and queues
    • Locking and blocking
    • Resource governor
    • Extended events
    • Index cleanup
    • Index health
    • Missing indexes

You can see a *much* more detailed course outline here and full details including how to register on the SQLskills Australia website:

And check out the press release from the last public class we did in Ireland last year - some really satisfied attendees!

We hope to see you there!

Categories:
Classes

(Edited 2/27/10 to add database name to the output) 

Here's a little script I knocked up this afternoon to tell me who has open transactions on the server - not just the single oldest active transaction that DBCC OPENTRAN returns.

It gives back:

  • session ID
  • login name
  • database context
  • transaction begin time
  • how many log records have been generated by the transaction
  • how much log space has been taken up by those log records
  • how much log space has been reserved in case the transaction rolls back
  • the last T-SQL that was executed in the context of the transaction
  • the last query plan that was executed (only for currently executing plans)

It's ordered by the transaction begin time. I had some trouble using CROSS APPLY with the sys.dm_exec_query_plan DMV - if the plan isn't available, it blows out the entire result-set for that transaction. After messing around for ten minutes I discovered the joys of the OUTER APPLY operator - which is the same as CROSS APPLY but allows NULL values from the function being cross-applied.

Also thanks to fellow MVP Aaron Bertrand (twitter|blog) for pointing out a mistake in the way I was calling sys.dm_exec_query_plan.

Here's the script with some example output:

SELECT s_tst.[session_id],
   s_es.[login_name] AS [Login Name],
   DB_NAME (s_tdt.database_id) AS [Database],

   s_tdt.[database_transaction_begin_time] AS [Begin Time],
   s_tdt.[database_transaction_log_record_count] AS [Log Records],
   s_tdt.[database_transaction_log_bytes_used] AS [Log Bytes],
   s_tdt.[database_transaction_log_bytes_reserved] AS [Log Rsvd],
   s_est.[text] AS [Last T-SQL Text],
   s_eqp.[query_plan] AS [Last Plan]
FROM sys.dm_tran_database_transactions s_tdt
   JOIN sys.dm_tran_session_transactions s_tst
      ON s_tst.[transaction_id] = s_tdt.[transaction_id]
   JOIN sys.[dm_exec_sessions] s_es
      ON s_es.[session_id] = s_tst.[session_id]
   JOIN sys.dm_exec_connections s_ec
      ON s_ec.[session_id] = s_tst.[session_id]
   LEFT OUTER JOIN sys.dm_exec_requests s_er
      ON s_er.[session_id] = s_tst.[session_id]
   CROSS APPLY sys.dm_exec_sql_text (s_ec.[most_recent_sql_handle]) AS s_est
   OUTER APPLY sys.dm_exec_query_plan (s_er.[plan_handle]) AS s_eqp
ORDER BY [Begin Time] ASC;
GO

session_id Login Name        Database Begin Time              Log Records Log Bytes Log Rsvd Last T-SQL Text                      Last Plan
---------- ----------------- -------- ----------------------- ----------- --------- -------- ------------------------------------ ---------
54         ROADRUNNERPR\paul foo      2010-02-01 15:28:48.560 2           236       8550     begin tran insert into t1 values (1) NULL
55         ROADRUNNERPR\paul foo      2010-02-01 16:38:18.373 3           356       8852     insert into t1 values (3)            NULL

(For the hardware setup I'm using, see this post. For the baseline performance measurements for this benchmark, see this post.)

In my previous post in the series, I described the benchmark I'm optimizing - populating a 1-TB clustered index as fast as possible using default values. I proved to you that I had an IO bottleneck because the IOs to the data file (from checkpoints) and the transaction log file (from transactions committing) were causing contention.

Several people commented that I might have mis-configured the iSCSI IO subsystem - so first off I want to look at that. Fellow MVP Denny Cherry (twitter|blog), who knows a lot more than me about IO subsystems, volunteered to discuss my iSCSI setup with me to make sure I hadn't goofed anywhere (many thanks Denny!). It seems like I haven't. I'm using a single iSCSI array right now, with a single NIC on the server dedicated to the iSCSI traffic - once I move to multiple volumes, I'll add in more NICs.

Looking at Task Manager in the middle of a 6-hour test run to see the network utilization through the NIC shows that it's not saturated, as shown below.

 

I ran the DELL smcli utility for two hours during the most recent test to see what peak throughput I'm getting, using the following command:

smcli -n Middle_MD3000 -c "set session performanceMonitorInterval=5 performanceMonitorIterations=1440;save storageArray performanceStats file=\"c:\\MiddlePerfStats.csv\";"

I saw around 101MBytes/sec. and earlier when testing the smcli settings I'd seen 106MBytes/sec. I'm sure once I remove some of the contention that this will get better.

On to the test!

The first thing I wanted to try was optimizing my use of the transaction log - i.e. doing as few and as large IOs as possible to the log. My hypothesis is that by changing the batch size from one to, say, 10 or 100 SQL Server can make more efficient use of the log.

I changed my T-SQL script to take a batch size parameter and use explicit transactions inserting <batch-size> records. Everything else remained the same as the baseline. I picked 128 concurrent threads as my test point. In the baseline, the 128-thread insert test took 8 hours 8 minutes 27 seconds to complete (29307 seconds). My simple T-SQL code changed to (with $(rows) and $(batch) being SQLCMD parameters to the script):

DECLARE @counter BIGINT;
DECLARE @inner   SMALLINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   SELECT @inner = 0;
   BEGIN TRAN;
   WHILE (@inner < $(batch))
   BEGIN
      INSERT INTO MyBigTable DEFAULT VALUES;
      SELECT @inner = @inner + 1;
   END
   COMMIT TRAN;
   SELECT @counter = @counter + $(batch);
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

Below are the results for 128 threads with batch sizes varying from 10 to 10000:

Increasing the batch size from 1 record to 10 records caused the elapsed time to drop from 29307 seconds to 21167 seconds - a 28% improvement! However, increasing the batch size further didn't improve things any more. Hmmm.

Next I tried different numbers of concurrent connections with a batch size of ten to see if the improvement was universal:

 

So moving from a single-record batch to a ten record batch has the same dramatic effect on performance for varying numbers of connections.

What's going on?

The answer involves some deep internals of the transaction log structure and IO behavior. The transaction log is split up internally into sections called virtual log files (VLFs for short). These VLFs are used to manage which parts of the log can be overwritten with new log records. If this is all unfamiliar, go read my TechNet Magazine article on Understanding Logging and Recovery in SQL Server which explains about the log in more detail and about transactions, otherwise what's coming next may not make much sense.

Whenever a transaction commits, all the transaction log records up to and including the LOP_COMMIT_XACT log record for that transaction (including log records from other transactions that are inter-mingled with those from the one that just committed) must be written out to disk, so the transaction is durable (called write-ahead logging). But it's not individual log records that are written to disk, the unit of IO for the transaction log is a log block. VLFs are split internally into dynamically sized log blocks, ranging from 512-bytes to 60KB in 512-byte increments, as needed. There are algorithms to figure out how big log blocks should be based on what log records are being flushed out.

With the single record batch, the amount of log generated in the transaction totals 4952 bytes. You can find this by doing something like:

CHECKPOINT;  -- to force the log in the SIMPLE recovery model to clear
GO

INSERT INTO MyBigTable DEFAULT VALUES;
GO

SELECT * FROM fn_dblog (NULL, NULL); -- show me all active log
GO

Then add up the values in the Log Record Length column for all records from the first LOP_BEGIN_XACT log record to the LOP_COMMIT_XACT with the matching Transaction ID.

Anyway, a single-record transaction generates 4952 bytes of transaction log, which will be flushed out in our case in a log block that's 5120 bytes (the nearest multiple of 512 above 4952), with a bunch of zero-padding at the end. For ten single-record transactions, that's 10 small IOs.

The ten-record transaction generates 48688 bytes of transaction log and will be flushed out in a log block that's 49152 bytes (the nearest multiple of 512 above 48688). This is clearly more efficient than 10 smaller IOs and is why changing to batches makes things go faster.

A 100-record transaction generates 489628 bytes of transaction log, which is clearly more than the 60KB log-block maximum, so it actually covers multiple log blocks (varying from 14-16 in my tests - I don't remember the exact algorithms). You can see the log block changing when the middle number in the Current LSN column of the fn_dblog output increases. It looks like it's jumping up, and it is - this is the offset of the log block within the current VLF divided by 512.

Because the 100-record transaction is split into multiple log blocks, there's no real IO gain during log flushes over the 10-record transaction - which is illustrated by the results I saw.

Now, even with this speed increase from the increased batch size, the average-disk-write-queue-length is still anywhere from 20-40 when running the 128-connection test, so there's still a significant bottleneck there. In fact, the transaction log grew significantly still during these tests (up to 23GB in one case) - for an explanation of this phenomenon, see Interesting case of watching log file growth during a perf test. I picked up some wait stats queries from fellow-MVP Glenn Berry (twitter|blog) to run while the tests were running. I'm using his as they're published and anyone can download and use them (see his blog post for the queries, and Bob Ward's CSS blog post that's starting to document the wait types).

His queries tell me that with the 10-record per batch and 128-connections:

  • Average write-stall to the data file is 37ms
  • Average write-stall to the log file is 18ms
  • Top wait types are PAGELATCH_EX (55% of all waits), PAGELATCH_SH (28% of all waits), and WRITELOG (14% of all waits)

The first two waits are the Storage Engine waiting for the buffer pool to kick out dirty pages so that buffers can be used to hold newly created pages, and the third is the log manager waiting for log block flushes to disk to complete.

Clearly IO IO IO.  Let's look at a a perfmon capture during the 10-records per batch test with 128 connections:

 

Let's go through each counter (top to bottom in the list in the image) and explain what's going on. I deliberately picked this time-slice, as it really simply shows the effect of contention:

  • Pages Allocated/sec: this is the light blue line and is the Access Methods part of the Storage Engine (the dev team I used to run) creating new data and index pages for the clustered index we're populating.
  • Checkpoint pages/sec: this is the pink line at the bottom left and bottom right of the capture. This is the buffer pool writing out dirty pages during a periodic checkpoint.
  • Avg. Disk sec/Write: this is the dark blue line that's tracking just above the thick black line. It's the amount of time in ms for a write to complete. You can see that it has a minimum around 12:51:00 and then varies wildly, hitting as high as 50+ms for a single write.
  • Avg. Disk Write Queue Length: this is the highlighted line in thick black. It also has a minimum around 12:51:00 and varies wildly the rest of the time.
  • Disk Write Bytes/sec: this is the dark green line at the top that shows the number of bytes being written to disk from all IO sources. Same story around 12:51:00.
  • Log Growths: A simple counter since the database was created/server started. It's over 100 and off the chart.
  • Log Bytes Flushed/sec: this is the red, fairly constant line around 1/3 the way up and is log blocks being flushed to disk because of transaction commits or checkpoints.
  • Lazy writes/sec: this is the light green line at the bottom and is the buffer pool having to force data pages to be written to disk (along with all transaction log flushed up to the point of the last log record to change the page being written) to make space for images of newly created pages.

This time-slice is really cool in that it shows what happens when contention goes away. Just before 12:51:00, a checkpoint ends and the lazywriter has nothing to do - so the only IOs hitting the disks are those coming from the transaction log flushing out log blocks as transactions commit. You can see the Avg. Disk Write Queue Length drop down to 2-3, the Avg. Disk sec/Write drop to about 5ms, and most beautiful of all (look, I'm a big geek ok? :-), the Disk Write Bytes/sec (the green line) drops down to be exactly equal to the Log Bytes Flushed/sec - proving that it's just log flushes hitting the disk. This is the no-contention case. It happens again for brief spell about 10 seconds later - the lazywriter most likely created a temporary surfeit of empty buffers. All the rest of the time, the lazywriter and checkpoints play havoc with the write throughput on the drives by causing contention.

It's clearly time to try some separation of files to relieve the contention - and that's what I'll cover in the next post in the series.

Hope you're enjoying the series - these take a long time to write up!

Back in September I blogged about an old 2005 bug that prevented DBCC CHECKFILEGROUP checking the partitions of an object on the specified filegroup unless *all* partitions of the object are on the specified filegroup (not a smart way to set things up!). The bug was fixed ages ago in 2005 but has only just been fixed in 2008. The post which explains the bug in more detail is DBCC CHECKFILEGROUP bug on SQL Server 2008. This is an important feature to be able to split the consistency checks of a partitioned VLDB over a series of days - see CHECKDB From Every Angle: Consistency Checking Options for a VLDB for more details.

SQL Server 2008 SP1 Cumulative Update 6 (which you can get here) has the bug fix for 2008 finally. The KB article which describes the bug is 975991.

You're all running regular consistency checks, right?

(For the hardware setup I'm using, see this post.)

As part of my new benchmarking series I first wanted to play around with different configurations of data files and backup files for a 1-TB database to see what kind of performance gains I can get invoking the parallelism possible when backing up and restoring the database. To do that I need a way to quickly populate a 1-TB database so that I can mess around with different numbers of files and so on. It doesn't matter what the data in the database is, as backup doesn't care - as long as there's a terabyte of it. Why a terabyte? It's a nice round number, it's a common enough database size right now, and I have the storage to play around with it.

But then my plans got derailed. In figuring out how to most quickly populate a 1-TB database, I realized that in itself would be an interesting benchmark to explore, so that's what I'm doing first.

My aim is to give you improvements you can use in real life. If you think this is going to be boring, skip down to the end of the post where I show a detailed perfmon and explain what's going on in my overloaded IO subsystem, then come back up here :-)

The baseline for this benchmark is contrived - I'm going to populate a 1-TB clustered index (so I can do other experiments with the index) as quickly as I can. The interesting part is that I'm starting with a brain-dead database layout, configuration, and method of populating the table, and then I'm going to alter different things to see the effect on the system. The effects and gains will be the interesting part as it will expose parts of how SQL Server works which *WILL* be applicable to real-life situations and workloads - the whole point of me doing all of this is to show you improvements, why they work, and how they could be useful to you.

When doing any kind of performance testing it's *essential* to have a baseline with which to compare - otherwise how do you know what effect a variation is having? This post describes my baseline setup, measurements, and limitations I start to notice.

First let me describe the baseline setup:

  • Single data file and log file stored on the same volume, on an 8-drive RAID-10 array (each drive is a 300GB 15k SCIS drive), connected through iSCSI to the server
  • Data file is created to be 1-TB, with instant file initialization enabled
  • Log file is created to be 256MB
  • Database is using the SIMPLE recovery model

Yes, I'm deliberately putting the data and log on the same RAID array. I *want* to see some contention so I can prove to you how separation of data and log can reduce contention and improve performance sometimes.

Database and table creation script:

CREATE DATABASE BackupPerfTest ON
(NAME = 'BackupPerfTest_Data',
   FILENAME = 'K:\BackupPerfTest.mdf',
   SIZE = 1TB,
   FILEGROWTH = 1GB)
LOG ON
(NAME = 'BackupPerfTest_Log',
   FILENAME = 'K:\BackupPerfTest.ldf',
   SIZE = 256MB,
   FILEGROWTH = 50MB);
GO

ALTER DATABASE BackupPerfTest SET RECOVERY SIMPLE;
GO

USE BackupPerfTest;
GO

CREATE TABLE MyBigTable (c1 BIGINT IDENTITY, c2 CHAR (4100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

I figured the fastest way to fill the database is to have a single table with one row per page, and that having SQL Server create the large CHAR column inside itself, rather than me doing a REPLICATE, would be quickest.

For the actual population of the table, I worked out that 134217728 table rows gives me a terabyte, with each row taking up a single 8KB page.

The baseline setup of the table population is:

  • Varying numbers of concurrent connections (16, 32, 64, 128, 256) to the server (8-way DELL PowerEdge 1950)
  • Each connection runs a simple script that inserts 134217728 / number-of-connections rows into the table, and then logs the elapsed time into a results table
  • Each insert is done as a single-statement implicit transaction (if I don't do an explicit BEGIN TRAN/COMMIT TRAN, SQL Server does it for me)
  • A monitor connection pings the results table every 5 minutes checking to see if number-of-connections results are there, and sending me email if so

Yes, I'm deliberately using this method to insert each row. Again, I want to be able to make improvements and see the effect of the changes.

Each connection will be running this script:

SET NOCOUNT ON;
GO

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   INSERT INTO MyBigTable DEFAULT VALUES;
   SELECT @counter = @counter + 1;
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

This is run through SQLCMD, and the number of rows to insert is passed into the T-SQL script using:

sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1

%1 in the line above is passed from a master CMD that kicks off number-of-connections CMD windows, each of which just runs the SQLCMD line above.

So - a simple setup.

Here's a graph of the results:

 

The elapsed times for all connections to complete their work (as there could be up to an hour between the first and last to complete) were:

  • 16-way: 15 hours 25 minutes 5 seconds
  • 32-way: 13 hours 50 minutes 18 seconds (11% faster)
  • 64-way: 10 hours 12 minutes 48 seconds (27% faster)
  • 128-way: 8 hours 8 minutes 27 seconds (20% faster)
  • 256-way: 7 hours 24 minutes 21 seconds (9% faster)

More connections clearly leads to a faster runtime, but the improvements from doubling the number of threads clearly aren't directly proportional to the number of threads. The biggest improvement was from 32 to 64 threads, and then the percentage gain started to tail off. Let's look at the page allocations per second for each experiment too:

 

As I'd expect, the pace at which pages are being allocated in the database increases with more threads and the percentage improvements line pretty much matches that of the elapsed time graph above. There's a slight difference in the 128 and 256 gains here because the graph is show what the perf counter number was after the experiment reached a steady state. I noticed that the log grew enormously for the last few tests, which caused the steady-state number to not be reached for a while. I've already blogged about that phenomenon in Interesting case of watching log file growth during a perf test.

I also watched what was happening in perfmon to see if there were any obvious performance problems going on. Here's a perfmon graph for the 64-way test once it reached steady-state and the log wasn't growing:

 

Analysis: 

  • The black line is above 0 when a checkpoint is occuring.
  • The green line represents how much data is being written to the K: volume, where the data and log file are. You can see there's a constant amount of data being written all the time (transaction log records when the implicit transactions commit) with large spikes whenever a checkpoint occurs and causes the actual data pages to be written out.
  • The light blue line is the pages allocated per second. You can see that it takes a major nose dive whenever a checkpoint occurs. I'm speculating that this is because of disk contention preventing the transaction log being written to disk (thus slowing down the transaction throughput) while the checkpoint is occuring and writing out data pages
  • The dark blue line at the bottom is the average disk seconds per write. You can see that it hovers around 4-5 milliseconds and spikes to 16-17 when a checkpoint occurs.
  • The brown line in the middle is the average disk write queue length. It hovers around 18-19 and spikes to around 25 when a checkpoint occurs.

Observations:

  • This system is clearly bottlenecked in the I/O subsystem
  • There is a direct correlation between checkpoints occuring and: increased disk queue length, reduced transaction throughput

This is what I would expect to see on a system with lots of updates where the log and data are on the same volume. Remember that I've got them on a fast RAID-10 array. This debunks the theory I've often heard that contention doesn't matter on RAID arrays. Yes, it does. You can still max out the throughput capabalities of any IO subsystem - it just depends what you're trying to do with it. Imagine if I had nonclustered indexes on this table too - more logging and more pages being allocated - much worse performance...

So now I've got my baseline and there are some obvious things I can try to improve things:

  • Data and log file separation
  • Varying number of data files
  • Varying placement of data files (e.g. different RAID arrays)
  • Manual checkpoints
  • Pre-sizing the log
  • Using explicit transactions with varying batch insert sizes
  • Using -E to get larger allocations

I'm not saying that all of these things are going to help, but over the next few weeks I'm going to try them all and report on the results. You'll be able to clearly see the effect of changing these on my benchmark, running on production-quality hardware, rather than just taking people's words for it.

I hope you're going to find these benchmarks and experiments useful - I'll be learning (hopefully) as I go along too.

Let me know if there's anything else you'd like to see me try, and if you're following along (I don't want to spend all this time if no-one's reading the series!)

Thanks!

Okay, this post is describing some new services we provide (we are a consulting firm after all :-), but there's a promotion at the bottom where you can save $$$ on one of the new services.

We're starting to offer a set of standalone auditing services that we can perform remotely, greatly reducing the cost of having someone come on-site to evaluate your SQL Server environment. The new standalone audit services we are providing are:

  • Database maintenance and operations audit of your SQL Server environment
  • Disaster recovery and high-availability audit of your SQL Server environment

The remote audits take the form of us giving you some non-intrusive scripts to run on each server, plus a detailed questionnaire to fill in. We analyze the results and send you a list of recommendations, with supporting explanations and links to deeper information.

As part of a regular remote consulting engagement, we'd also be engaging with the DBA team through web-meetings, phone calls, email conversations to undertake performance tuning, design work, and so on. However, many people just want to buy a small, limited block of time to run a quick audit and get the OK or a set of changes to implement to improve operations - this is where the standalone audits come in.

The standalone audit services we offer take a much shorter time (depending on environment size) than a regular consulting engagement, as they don't involve web-meetings and in-depth interaction with the DBA team, so they're a lot more cost-effective when on a tight budget. Of course, we also provide remote performance tuning, available in small blocks of time rather than an open-ended engagement, and the traditional 'figure out all sorts of problems' consulting engagements.

These audits are usually billed at our regular offsite consulting rate, but as an initial promotion I'll perform a database maintenance audit for $500 to the first 10 new customers who sign up before the end of January, limited to 4 hours of my time (a 50% saving over our regular rates).

Shoot me an email (through the contact button at the top of the page) if you're interested in any of these services and/or want to be one of the lucky few to take part in the promotion.

We look forward to working with you!

Categories:
Auditing | Consulting

I'm running some performance tests on the hardware we have (more details on the first of these tomorrow) and I was surprised to see some explosive transaction log growth while running in the SIMPLE recovery model with single row insert transactions!

Without spoiling tomorrow's thunder too much, I've got a setup with varying numbers of connections populating a 1TB table with default values, with the single data and log files on the same 8-drive RAID-10 array (again more details on why tomorrow). I was looking at perfmon while the test was running with 128 concurrent connections, and noticed some weird-looking behavior - basically what looked like a never-ending checkpoint occuring. So I dug in deeper and discovered the transaction log had grown over 7GB since the start of the test. Huh?

I restarted the test from scratch, with a 1TB data file and a 256MB log and watched what happened in perfmon; here's the screenshot of the first 90 seconds:

 

This is *so* interesting. The black line is the number of log growths, so you can see the log grows every time the line level goes up. The bright blue line is the number of pages being allocated per second to hold all the table rows my 128 connections are inserting. You can clearly see that every time there's a log growth, the allocations take a nose-dive - because no transactions can commit while the new portion of the log is being zeroed (remember that instant file initialization does not apply to the log - see this blog post). The green line (disk write bytes per second) spikes when the log grows because of the zeroes being written out to the log file by SQL Server (remember that SQL Server does the zeroing).

But why is the log growing in the first place? I'm in the SIMPLE recovery model and doing single-row implicit transactions (yes, I'm deliberately doing this) so there shouldn't be anything stopping the log from clearing during a checkpoint, right?

Wrong. The log starts out small (256MB) so one of the thresholds for triggering a checkpoint gets hit pretty fast (70% of the log file is used). So a checkpoint occurs (you can see the checkpoints occuring when the pink line at the bottom of the perfmon screen is above zero), and starts writing out to the data file, which is on the same disk as the log (see How do checkpoints work and what gets logged for an explanation of checkpoint mechanics), but it can't write fast enough (because of disk contention) to get to the log-clearing part before the log fills up completely and has to grow (because transactions are continuing at break-neck speed). So the log grows, and the insert transactions stop while the log is zeroed. And then the log starts to fill up again very quickly and another checkpoint is triggered, and so on and so on.

Eventually a steady state is reached where there's enough free log space during a checkpoint that no new log is required for the concurrent transactions to commit. You might ask why the checkpoint is so slow in the first place? Because I deliberately put the data file on the same RAID array as the log file, and both are being steadily hammered with writes - classic disk contention. Even though the RAID array is RAID-10 with 8x300GB 15k SCSI drives, the average disk queue write length is over 20 most of the time during the 128-way test because I'm simply trying to do too much.

The point of my blog post? Just because you don't have any of the classic causes of transaction log growth going on, doesn't mean you're going to be immune. In this case my (deliberate) poor physical layout of the database files and workload growth up to 128 concurrent connections caused the log to grow. What started out working when I was running 16 connections didn't work any more at 128 (actually I went back and re-ran some of the earlier tests and even with only 64 connections, the log grew to over 1GB before reaching steady-state).

Interesting eh?

Tomorrow I'll be continuing the perf/benchmarking series by creating my first benchmark and then tweaking the setup to see how I can improve performance (for instance with multiple data files, separation of log and data files - all the things I preach but have never *demonstrated*), but this behavior merited a post all on its own.

At the start of the year I decided to get seriously back into playing with electronics after a long haitus. I agonized about where to blog about all this stuff (as I like to blog a *lot*, as you all well know) and I initially thought it would be cool to have everything intermingled on this blog. Then I started to play and realized that I have a *ton* of stuff I want to blog about around electronics and microcontrollers - photos, videos, code - and that it would really interfere with the purpose of this blog - disseminating SQL Server info. Couple that with the fact that this blog is picked up by lots of SQL Server blog aggregators, and I started to worry that people would get a little miffed by *lots* of non-SQL stuff coming from this blog.

Soooo I decided to create a completely new blog on the site today, with a new 'theme'. I now know lots about CSS style sheets, ASP.NET, IIS, and BlogEngine.Net - probably way more than I really wanted to. But I have a nice shiny new blog dedicated to my electronics adventures. I've deleted the two posts since 1/1/10 about electronics and re-posted on the new blog (sorry to the 3 people who'd commented!). There's a link to the new blog at the top-left of this one - feel free to subscribe, drop in from time to time, or ignore it completely :-) 

You can get to the new blog at Paul's Electronics (in homage to the name of the electronics kit my Dad made me when I was 10).

Normal service will be resumed here very shortly...

Categories:
Personal

Yesterday I was discussing life-direction-changing-events with Kimberly and our great friend Libby and I thought it would be interesting to find out from people how their lives brought them to where they are today. I'm often tagged in these kinds of blog posts from other people, so today I'm going to start one of my own (I feel the power tingling through my veins!!! Er, ahem).

I'll go first. There are three very clear events/decisions in my life that brought me to where I am today.

Event #1 is the single most important one, as it shaped the way I developed my intellectual interests forever. On my tenth birthday in 1982, my wonderful Dad gave me an electronics kit and an electronics book. The book was Simple Electronics, a British Ladybird series book from 1979, aimed at helping kids build little electronics projects. Now that in itself wasn't wonderful, but Dad actually made me the kit from scratch over a few weeks in his shed (think 'shop' if you're in the US), including a cool box to keep the book, my wooden circuit boards (with screws to hold the components in place), and all the components, sorted in little bags and boxes with labels on. I think this is the best present I've ever been given - thanks Dad!

Pictures of the book and the box are below (with this blog post in the background - click for bigger images):

 

Sadly the book isn't the original one he gave me, but I bought one on Ebay a couple of years ago for posterity. The box, of course, is the original one that I still treasure. This kit changed my life completely - I became totally absorbed in electronics (see my recent blog posts here and here for the start of the revival), leading on to 8-bit computers, and my engineering degree in computer science and electronics from the University of Edinburgh. This birthday present put me on the course to being an engineer. But I almost didn't go to college. That happened because of event #2.

Event #2 was what stopped me joining the Royal Navy as a Weapons Electronics Officer. With my Dad having been in the Royal Navy as an engineer (he used to teach nuclear reactor theory and control electronics at the Navy's engineering school - H.M.S. Sultan - smart guy - the person I look up to most in the world), our family life was entwined with the Navy and it's way of life. I loved it and planned to follow my Dad's footsteps. I was in the Navy Cadets at school and did all kinds of summer courses with the Navy. Then when I was 17, I spent a week on H.M.S. Glasgow, a Type-42 Destroyer that saw serious action in the 1982 Falklands War, sailing from Edinburgh, around the top of Scotland through the Pentland Firth and The Minch, down to Glasgow on the opposite coast.

Here's a picture of H.M.S. Glasgow (click for a bigger image):

 

This *is* the original one, but sadly *isn't* mine. Lot's of scope for rampant megalomania with one of these to play with :-)

It was a fantastic experience but I discovered that I get badly seasick (and that lots of the equipment was pretty old and battle-hardened (i.e. 'simple')) so that put paid to my plans to join the Navy. Off to college...

Event #3 happened quite a bit later, and was my decision to accept a job offer from Microsoft, move to the US from Scotland and start being a SQL geek instead of a VMS file-system geek. The story behind that is explained in the first few Q&As in an email interview I did last year with Tom LaRock (twitter|blog). Without that decision, I'd never have been into SQL Server, started presenting, met Kimberly, etc etc and you wouldn't be reading this blog post.

So, those are my three life-changing events. What do you think are yours? It's not what got you to be a DBA, it what got you to where you are in your life.

In the great spirit of the SQL Community, I'm going to tag a few of my friends to have a go next who I think will provide interesting answers in an entertaining way (no pressure!) - if you blog in this series, please link back here so I can make sure I read them.

Enjoy!

Categories:
Personal

I finally had some time over the last two days to play with the Arduino board I picked up late last year. The Arduino is a pretty neat concept - wrapping a microcontroller up in a neat board that makes playing with sensors, displays, motors, etc and prototyping very simple. It's all open source and you can read more about it on their homepage (http://www.arduino.cc/) which also has a freeware IDE to use for programming. The board I have uses Amtel's ATmega328P processor, with 32K of flash memory and can do 20MIPS. They're very popular and opening up electronics and gadget hacking to non-techies.

Here's the 2009 rev of the Arduino Duemilanove board (image from their website, click for larger version):

 

They cost about $30 - I got mine as part of a kit from the Nuts'n'Volts magazine store but loads of online stores have them.  I just picked up some accessories yesterday from SparkFun who have the full range, including the Arduino Mega which has 54 IO pins - can't wait for that to arrive!

The possibilities for this are just endless. The IDE provides a full C++ environment with a bunch of helper classes already defined, which takes a lot out of the tedium of programming microcontrollers. If you're going to play with this, I recommend using some of the samples that come with the IDE and on their very extensive web site.

My current interest is with making things light up in clever ways so I thought I'd start off by writing a simple program to play with an LED array. The circuit's very simple: pins 2-11 from the Arduino connected through 220ohm resistors to the LED array, which is connected to ground on the other side. Future, more complex projects will include a circuit diagram (once I find a nice freeware program to do it).

Here's a photo of the board connected up and a close up of the very simple circuit (click for larger versions):

  

I put together two easy programs - one to move the lit LED from right-to-left and back again and one to move the lit LEDs from the middle out to the two sides and back in again. Kind of Knightrider-esque, but also the way the old SUN machines I used at university had their status lights on the back of the machine.

The code for the first one is:

/*
  10-bar LED array
 
  Connect the LED array to pins 11-2 through a 220R resistor, and to ground on the other side.
  01/14/2010
*/

void setup ()
{
  for (int loop = 2; loop < 12; loop++)
  {
    pinMode (loop, OUTPUT);    // Set the pin IO mode
    digitalWrite (loop, LOW);
  }
}

void loop ()
{
  // The loop has to start at 2 because 1 isn't an IO pin.
  // Go from right to left
  for (int loop = 2; loop < 12; loop++)
  {
    // Switch the LED on for 10 milliseconds
    digitalWrite (loop, HIGH);
    delay (10);
    digitalWrite (loop, LOW);
    delay (20);
  }
 
  // And from left to right again
  for (int loop = 11; loop > 1; loop--)
  {
    digitalWrite (loop, HIGH);
    delay (10);
    digitalWrite (loop, LOW);
    delay (20);
  }
}

And the second one changes the loop to be:

void loop ()
{
  // Start on LED 6 and go up to 10 (remember, shifted by 1)
  for (int loop = 7; loop < 12; loop++)
  {
    // Light the LED on the LHS of middle and the matching one on the RHS
    // This will light LEDs 6-5, 7-4, 8-3, 9-2, 10-1 (shifted by 1)

    digitalWrite (loop, HIGH);

    digitalWrite (13 - loop, HIGH);
    delay (10);
    digitalWrite (loop, LOW);
    digitalWrite (13 - loop, LOW);
    delay (20);
  }

  // And back down to the middle again
  for (int loop = 11; loop > 6; loop--)
  {
    digitalWrite (loop, HIGH);
    digitalWrite (13 - loop, HIGH);
    delay (10);
    digitalWrite (loop, LOW);
    digitalWrite (13 - loop, LOW);
    delay (20);
  }
}

Note there's no main() function - it's all taken care of. The wrapper calls your setup() function and calls the loop() function in an infinite loop. You only have to provide these functions and you can use all the C++ programming constructs (if you want to) or keep it pretty simple.

I took some short (5 second) movies of each program running - you can get them from SideToSide movie and MiddleToSides movie (1.9MB each).

This was just my introduction to the environment, the real fun will start with some of the projects I have lined up:

  • 3-D LED cubes (3x3x3, 4x4x4, 8x8x8 and maybe higher - I have 1000 3mm red LEDs winging their way towards me - $29.95 from Rackmount-Devices.com)
  • panel of 8x8 LED arrays controlled through MAX7219 chips
  • object recognition with a simple CMOS camera
It's going to be a fun year! This is going to satisfy not only my desire to get back into electronics and start tinkering, but also to write some nifty low-level code.

As always, let me know if this is interesting, you're doing something similar, want something explained, or you have an idea for a cool project.

Enjoy!

Categories:

Theme design by Nukeation based on Jelle Druyts