In one of the sessions Kimberly and I taught this week at SQL Connections, we discussed how to choose efficient data-types – I'd like to share the discussion here with an example. The crux of the matter is the amount of space required to store the data-types in the schema, and ensuring that the data-type choice matches the data being stored without incurring any unnecessary overhead. Choosing data-types without considering whether each one is the best fit for the data can lead to wasted space. Let's look at an example schema to support a population census of the USA.
CREATE TABLE CensusSchema1 (
SSN CHAR (256),
StateName CHAR (256),
Sex INTEGER,
Employed INTEGER,
SelfEmployed INTEGER,
EthnicOrigin INTEGER,
MaritalStatus INTEGER,
NumberOfDependents INTEGER,
Age INTEGER,
CountryOfBirth INTEGER,
HouseholdIncomeGroup INTEGER,
ZipCode5 INTEGER);
GO
At first glance this may look reasonable, but digging in you’ll see that many of the columns are over-sized. Here’s a cleaned-up schema to compare against, with notes on the per-column savings:
CREATE TABLE CensusSchema2 (
SSN CHAR (9), -- saving 247 bytes
StateName VARCHAR (256), -- saving at least 240 bytes (the longest state name is 14 characters, plus 2 bytes of varchar overhead)
Sex BIT,
Employed BIT,
SelfEmployed BIT, -- saving 11 bytes altogether over these three fields
EthnicOrigin TINYINT, -- saving 3 bytes
MaritalStatus TINYINT, -- saving 3 bytes
NumberOfDependents TINYINT, -- saving 3 bytes
Age TINYINT, -- saving 3 bytes
CountryOfBirth TINYINT, -- saving 3 bytes
HouseholdIncomeGroup TINYINT, -- saving 3 bytes
ZipCode5 INTEGER); -- no saving
GO
The bad schema gives a per-row size of 574 bytes; the cleaned-up schema is no more than 48 bytes per row. I designed these two schemas to support a US census, and the population of the USA is approximately 300 million. This means the bad schema would take around 172GB to store all the info, while the cleaned-up schema only takes around 15GB – almost 12 times more efficient! Now we're starting to see how poor data-type choice can lead to poor performance.
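If you want to verify these record sizes for yourself, here's a quick sketch (it assumes the two tables above exist in the current database; the sample row values are made up):
INSERT INTO CensusSchema1 VALUES ('123456789', 'North Carolina', 1, 1, 0, 3, 2, 2, 42, 1, 5, 27513);
INSERT INTO CensusSchema2 VALUES ('123456789', 'North Carolina', 1, 1, 0, 3, 2, 2, 42, 1, 5, 27513);
GO
-- Report the record size for each table; DETAILED mode is required
-- for the record-size columns to be populated.
SELECT
OBJECT_NAME (object_id) AS TableName,
max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats (DB_ID (), NULL, NULL, NULL, 'DETAILED')
WHERE object_id IN (OBJECT_ID ('CensusSchema1'), OBJECT_ID ('CensusSchema2'));
GO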
Wider rows means:
Fewer rows can fit on an 8KB page.
More CPU is necessary to crack open a record, due to the number of CPU data cache line invalidations needed to read the record into the CPU's various caches (every time a cache line is invalidated it costs a bunch of CPU cycles – see here for an explanation of CPUs and cache lines).
Fewer rows per page means:
More pages are needed to store the data.
Indexes could have a smaller fan-out (if the index keys are wider than they need to be), leading to more levels in the index and less efficient navigation from the index root page to the index leaf level.
More pages means:
More IOs are necessary to read the same amount of actual data
More buffer pool memory is necessary to hold the same amount of actual data
More disk space is necessary to hold the same amount of actual data
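To put some numbers on this, here's a sketch that compares data-page counts for the two tables once they've been loaded with the same rows (it assumes both tables are heaps populated identically; sys.dm_db_partition_stats reports per-partition page and row counts):
-- Compare the number of data pages each schema needs for the same rows.
SELECT
OBJECT_NAME (object_id) AS TableName,
row_count,
in_row_data_page_count AS DataPages,
in_row_data_page_count * 8 / 1024.0 AS DataMB
FROM sys.dm_db_partition_stats
WHERE object_id IN (OBJECT_ID ('CensusSchema1'), OBJECT_ID ('CensusSchema2'));
GO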
Clearly there’s a link between various aspects of workload performance and badly chosen data-types.
3 thoughts on “How can data-type choice affect performance?”
I appreciated this article. I knew about how to correctly size fields in the various ways you mentioned, and I knew that the consequence of a badly designed database would be that it took up more space. But I mostly thought "so what?" What if you have lots of space? Does it really matter beyond that? Now I see how it can affect performance, and that's a big reason to pay close attention to designing data types just right. Now I feel the effort I have made worrying about data types has been more than an intellectual exercise. It has practical benefits. Thanks. – JJ
Just curious why you would StateName VARCHAR(256) instead of StateCode CHAR(2)? The two-letter state abbreviations are fairly standard, and since we are talking about US Census data, then there is no need to worry about provinces or territories or any peculiarities of other countries…
You mentioned CHAR(2) vs. VARCHAR(2) in the text of the article, but the "cleaned up" version in T-SQL still uses StateName and merely swaps in VARCHAR(256) for CHAR(256).
Yup – you could (and probably would) use a 2-byte StateCode instead – I just chose to keep a state name in my example.
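For illustration, here's a minimal sketch of that approach (the StateLookup table and its sample rows are hypothetical, not from the article):
-- Hypothetical lookup table keyed on the standard 2-letter abbreviation.
CREATE TABLE StateLookup (
StateCode CHAR (2) PRIMARY KEY,
StateName VARCHAR (14)); -- longest state name is 14 characters
GO
INSERT INTO StateLookup VALUES ('NC', 'North Carolina'), ('WA', 'Washington');
GO
-- The census table then stores just the 2-byte StateCode and joins to
-- StateLookup whenever the full name is needed.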