** Well, this started out as an easy “row size” write-up but I ended up adding a bunch of tangents, and there’s apparently more coming, because as I got started I realized there’s way more to this topic than I can even begin to touch on here… fyi :) **
Maybe not something that will make performance scream or improve it by orders of magnitude, but just choosing the right data type for the job goes a long way toward creating a more compact and optimal row and keeping the table schema stable (a well-thought-out design shouldn’t be as volatile – hopefully). This can result in better cache utilization because the rows are more compact and more rows fit on a page; the overall table size – in terms of pages – is then smaller.
But this might not make as much sense to you yet… It’s partially based on internals and the way in which SQL Server stores the data row. First, rows have a maximum length of 8060 bytes; this is based on the fact that rows are stored on pages and pages are 8K in size in SQL Server. Now, rows could possibly span pages (with a change in internals and design) but I would probably still argue against it. I can make numerous arguments that even if this were possible it’s not necessarily desirable. In fact, I often advocate forms of “vertical” partitioning where columns are split into multiple tables in order to allow row sizes to be narrower and therefore fit more rows on a page… This leads to a narrower table where scans require fewer pages in cache. OK, vertical partitioning is another topic in and of itself; I’ll touch on it toward the end.

Back to row size – this 8060-byte limit covers ALL data types except the LOB types (text, ntext, and image), plus row overhead. I’m not really interested in all of the details of the overhead, as there are numerous excellent resources that describe each byte (and how they work) in great detail (for example, Inside SQL Server by Kalen Delaney), but I am interested in the overall choices you make. Since rows cannot span pages, some consideration has to be made during table design, because that implies that free space may remain if rows do not fit optimally on a page. Since many data rows have variable-width data, the number of rows that fits on a page varies, but the impact of improperly chosen types is easiest to see if I start with fixed-width rows. And – before I go too far – this is not THE silver bullet to make performance scream, but a best practice that’s easy to follow and HELPS lead to a well-balanced system.
Take a row size of 400 bytes per row. If a page is 8K = 8192 bytes and SQL Server uses 96 bytes of page overhead, then the remainder of the page available for data is 8096 bytes. The SQL Server team consciously chose to restrict the total row size even further (to 8060) to leave room for potential overhead needed in future changes (and they are using at least 14 bytes per row for an upcoming feature in Yukon – which I won’t mention now, but an upcoming whitepaper I’ve written goes into quite a few details on this feature and the overhead needed, etc.). So – as with any good design, forethought went into overhead and internals; while they weren’t exactly sure how they might use those bytes in the future, they kept them available. This often allows upgrades to future versions to be made simply and less painfully (and this is in fact the case with the upgrade to Yukon). It’s not always perfect and there are other things that might require work to upgrade (efficiently – again, thinking ahead :) but it’s always better to think about the present, the future, and how long you expect the design to work (and then double it :). So – suffice it to say – you get 8060 bytes for a single inserted row, but the entire 8096 bytes including row overhead when there are multiple rows on a page.

Taking a simple table design where the rows are 400 bytes per row, you can see that the math works out well: 8096 / 400 = 20 rows per page with 96 bytes remaining (and my 400 bytes includes overhead – there’s a row header of 4 bytes, a null block, a variable block, and then offsets for each variable-width column). Regardless, we have 20 rows per page and a very small amount of space remaining. If the row size were 4500 bytes we’d have a very different story – only 1 row would fit per page and we’d end up wasting disk space. Again, with variable character data this probably wouldn’t be as extreme… And many of you would probably answer “disk space is cheap,” which I agree with, BUT… since the smallest unit of I/O is the page, free space wasted on disk translates to wasted space in cache – which is not cheap and is definitely NOT infinite. So – narrower rows where the right data type is used are a good start. I’ve seen designs where every time someone needs a date they choose datetime (which is 8 bytes and stores time – precise to the time tick, which is 3.33 ms) and every time they need a number they choose int – even if they only have 15 status codes currently and only plan to add 20-30 more. The highest status code they’re looking at is 35-40… or let’s even say 60-75… These numbers can be stored in a tinyint – which is only 1 byte. There’s no real reason to waste 3 bytes for every row. And – I know you’re thinking, who cares about 3 bytes, but imagine 3 bytes for each row, for 6 different columns, over 87 million rows… (it’s something like 1.5GB), which is a heck of a lot of memory and a heck of a lot of disk space…
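To make that concrete, here’s a minimal sketch (the table and column names are purely hypothetical) of the kind of design I mean – the same information declared “wide” and then declared with the smaller types that actually fit the data:

-- A "wide" version: int and datetime used where smaller types would do
CREATE TABLE dbo.MemberStatusWide
(
    MemberID    int       NOT NULL,
    StatusCode  int       NOT NULL,  -- 4 bytes, yet values never exceed ~75
    StatusDate  datetime  NOT NULL   -- 8 bytes, yet the time portion is never used
)

-- A narrower version: same information, 7 fewer data bytes per row
CREATE TABLE dbo.MemberStatusNarrow
(
    MemberID    int           NOT NULL,
    StatusCode  tinyint       NOT NULL,  -- 1 byte, handles values 0-255
    StatusDate  smalldatetime NOT NULL   -- 4 bytes, minute precision, dates from Jan 1, 1900 on
)

Seven bytes per row doesn’t sound like much on its own, but multiply it across the really large tables and it turns into real cache and real disk.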
And – just to make another point: I don’t really care (from a disk space perspective) about EVERY table. But these are good practices everywhere, and they become very important for LARGE tables; review the “monsters” that will give you the most grief over time as they grow and change. Almost every OLTP database has lots of tables (lots of lookup tables, etc.) but often only a handful of really large tables.
If you’re interested in adding up all the bytes yourself for a new table, you can certainly do this… There are a few things to keep in mind in terms of overhead, and since you’ll probably have variable-width data you don’t have to worry about being exact here. Just try to get the average size of the actual data and add a bit of padding. If you have existing data, you can get the average BYTE size of a column using:
SELECT AVG(DATALENGTH(column)) FROM tablename
Make sure to use DATALENGTH – which is byte length – and not LEN – which is string length. Remember, UNICODE columns are two bytes per character.
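A quick sketch of the difference (the variable here is just for illustration):

DECLARE @s nvarchar(50)
SET @s = N'SQL Server'
SELECT LEN(@s)        AS CharLength,  -- 10 (characters)
       DATALENGTH(@s) AS ByteLength   -- 20 (bytes – Unicode is 2 bytes per character)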
If you already have a table with sample data and you want to see the ROW length averages, minimum, and maximum, then consider using DBCC SHOWCONTIG and looking at these columns: MinimumRecordSize, MaximumRecordSize, and AverageRecordSize. The only way to see this information is to use DBCC SHOWCONTIG WITH TABLERESULTS. Check out the BOL for more information, as this command DOES take shared locks to access the data and can therefore cause blocking.
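For example (the table name here is just a placeholder):

DBCC SHOWCONTIG ('dbo.YourLargeTable') WITH TABLERESULTS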
OK, so I really need to wrap this up. What are some of my final thoughts in terms of this [probably one of many] mail on table design:
First, take some time to learn all of the different data types and then make sure to find the right data type for the job! Use the most optimal yet flexible data type possible. If you have 4 status codes but think you might have 100 – then go tinyint, no need for int.
Consider data access/usage patterns as well. If you are storing dates without time, you have a few options: you could go smalldatetime (which is 4 bytes) if your dates are all Jan 1, 1900 or later, or, if you don’t want the hassle of dealing with time (when you only want a date), you could actually consider a char(8) for the date. OK, this seems to waste space, but it can often make other operations easier to deal with. A formatted string of yyyymmdd is a valid input to all of the datetime functions and can be a useful way of storing a date without time (see the sketch below). And – I have lots of opinions on datetime data… In fact, I’ve written a series of articles for SQL Server Magazine on exactly this topic, so if you’re interested check out the links here.
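A small sketch of that last point – a char(8) value in yyyymmdd form converts implicitly anywhere a datetime is expected (the value is just an example):

DECLARE @d char(8)
SET @d = '20040615'
SELECT DATEADD(day, 30, @d)    AS ThirtyDaysLater,  -- implicit conversion to datetime
       DATEPART(weekday, @d)   AS DayOfWeek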
And what about Nullability and/or GUIDs? Ugh – I thought I’d get to these but I just can’t even go here now… This is enough for today. I’ll get motivated again soon; in fact, on my 8th flight in three weeks tomorrow – whatever shall I do on that plane?! ;)
4 thoughts on “A Simple Start – Table Creation Best Practices”
I’m redesigning a db that will eventually have ~350 million rows. Instead of using datetime for the "date" field, I now plan on using smalldatetime as there is no time value. After looking at some of your other articles, I’m wondering about using an INT — but this field is one of the most frequently searched — what are the performance/indexing implications of using INT instead of smalldatetime for my date column? In terms of uniqueness, each day will hold about 500K rows. Thanks for any input.
I’d say that this depends more on usage than anything else. Can you send me a private mail with more details and we can figure this out… I’ll blog about the outcome :). Thanks!
Hi !
Thanks for sharing your considerations on your blog.
I am designing a database that will hold a couple of fields with durations (usually in the minutes to hours range) that need accuracy to the second. I have been considering datetime, smalldatetime and int.
I won’t have to do any searches at all, but some calculations (SUM) will be necessary to display some totals in reports.
Smalldatetime is not very useful due to its time precision (only to the minute).
The middle layer has a good cache mechanism, so if I stored, say, the duration in seconds as an int, all calculations to display it in hh:mm:ss format would have to be done in the presentation layer for every display of the data.
So, it seems datetime would be the best bet, with a view using CONVERT to format the duration right away. Problems arise if I need to save durations over 24 hours, and it also feels bad to store a time value and waste the date part. Also, the CONVERT function seems to be a bit slow? (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqlpro04/html/sp04e9.asp).
I would appreciate it very much if you could share a little bit of thought on my dilemma.
Thank you,
Mario