We’re sitting here in St. Pete Beach in Florida, visiting some of Kimberly’s family and getting some sun-kissed R&R before heading back up to Seattle on Wednesday, and I thought I’d get the next post in my sparse columns mini-series out. Before I start though, Kimberly just posted the resources for the Accidental DBA class we taught at SQL Connections last week and in Iceland at the end of March – see here.
In my first post on sparse columns (see here) I introduced the concepts and explained why sparse columns are a useful feature. In this post I’m going to use the example I gave – a document repository with 50 document types and 20 unique attributes per document type. Yes, it’s a contrived example, but scale it up by a factor of 100+ (think SharePoint Server) and methods like normalization no longer apply.
I’m using a CTP-6 VPC on a Lenovo T60P laptop with 4GB of RAM and a dual-core 2.2GHz CPU, and the VPC is running off a 6200RPM drive, so your mileage may vary for run-times of the example scripts. The VPC is the one we gave out at SQL Connections and in Iceland, and you can download the scripts and a VPC (maybe only CTP-5) from the Microsoft JumpStart site (see here for details).
The first test I’ll do is just creating the schema necessary to store the 1000+ columns of information in the test scenario. I’ll create one table with sparse columns and one without:
-- Create two tables, one with 1000 columns and one with 1000 columns but 997 sparse
CREATE TABLE TableWithoutSparseColumns (DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
c0004 INT NULL, c0005 INT NULL, c0006 INT NULL, c0007 INT NULL, c0008 INT NULL, c0009 INT NULL,
…
c0994 INT NULL, c0995 INT NULL, c0996 INT NULL, c0997 INT NULL, c0998 INT NULL, c0999 INT NULL,
c1000 INT NULL);
GO
CREATE TABLE TableWithSparseColumns (DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
c0004 INT SPARSE NULL, c0005 INT SPARSE NULL, c0006 INT SPARSE NULL, c0007 INT SPARSE NULL,
…
c0996 INT SPARSE NULL, c0997 INT SPARSE NULL, c0998 INT SPARSE NULL, c0999 INT SPARSE NULL,
c1000 INT SPARSE NULL);
GO
I won’t list all 1000 column names, for the sake of brevity.
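If you don’t fancy typing out 1000 column declarations, something like the following dynamic SQL sketch (my convenience here, not part of the original demo scripts) will generate and run the sparse version of the CREATE TABLE – remove the SPARSE keyword and change the table name to build the non-sparse version:
-- Build and run the CREATE TABLE statement for the sparse table
DECLARE @sql NVARCHAR (MAX), @i INT;
SELECT @sql = N'CREATE TABLE TableWithSparseColumns (DocID INT IDENTITY, DocName VARCHAR (100), DocType INT', @i = 4;
WHILE @i <= 1000
BEGIN
    -- Column names run c0004 through c1000, zero-padded to four digits
    SELECT @sql = @sql + N', c' + RIGHT (N'000' + CAST (@i AS NVARCHAR (4)), 4) + N' INT SPARSE NULL';
    SELECT @i = @i + 1;
END;
SELECT @sql = @sql + N');';
EXEC (@sql);
GO
Next I’ll insert some values into each table (the same values in each table):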
-- Insert a few rows in each
INSERT INTO TableWithSparseColumns (DocName, DocType) VALUES ('aaaa', 1);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0233, c0234, c0235, c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO
INSERT INTO TableWithoutSparseColumns (DocName, DocType) VALUES ('aaaa', 1);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0233, c0234, c0235, c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO
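As a quick sanity check (an extra step I’m throwing in, not in the original scripts), the new is_sparse flag in sys.columns confirms which columns were actually declared sparse:
-- Count the sparse vs. non-sparse columns in each table
SELECT OBJECT_NAME ([object_id]) AS [table_name], [is_sparse], COUNT (*) AS [column_count]
FROM sys.columns
WHERE [object_id] IN (OBJECT_ID ('TableWithSparseColumns'), OBJECT_ID ('TableWithoutSparseColumns'))
GROUP BY [object_id], [is_sparse];
GO
The sparse table should show 997 sparse columns (c0004 through c1000) and 3 non-sparse ones; the other table should show all 1000 as non-sparse.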
Now let’s see how big each table is:
-- Now let's see how big the rows are
SELECT [avg_record_size_in_bytes], [page_count]
FROM sys.dm_db_index_physical_stats (
    DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithoutSparseColumns'), NULL, NULL, 'DETAILED');
SELECT [avg_record_size_in_bytes], [page_count]
FROM sys.dm_db_index_physical_stats (
    DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithSparseColumns'), NULL, NULL, 'DETAILED');
GO
avg_record_size_in_bytes page_count
------------------------ ----------
4135                     5

(1 row(s) affected)

avg_record_size_in_bytes page_count
------------------------ ----------
40.6                     1

(1 row(s) affected)
Ok – so that’s not a huge difference in page count (because we’ve only got 5 rows), but it’s a *massive* difference in average record size. That makes sense: the non-sparse table’s fixed-length INT columns take 4 bytes each whether they’re NULL or not (roughly 4000 bytes of fixed-length data plus row overhead gives the 4135 above), whereas NULL sparse columns take up no space at all. Scaled up to hundreds of thousands or millions of records, the space savings will be astronomical!
Now let’s try selecting the data back using results-to-grid mode and a simple SELECT * statement. It takes 20 seconds to return – solely because the client has to retrieve the metadata for 1000+ columns. And even though the columns are defined as SPARSE, they still show up in the SELECT * resultset, which makes extracting the non-NULL values pretty difficult…
Time for another new feature – column sets. There’s a new type of column available for use with sparse columns – an XML COLUMN_SET. This is a column that is only materialized when selected, and returns all the non-NULL sparse columns in a row as an XML blob. It also changes the behavior of a SELECT * operation – removing all the sparse columns from the resultset and replacing them with itself, representing all the non-NULL sparse columns. I redefined our TableWithSparseColumns to have an XML COLUMN_SET column called SparseColumns (using the syntax ‘SparseColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS’) and re-inserted the same values – the redefined table is sketched below, followed by the results of a SELECT * operation.
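Here’s roughly what the redefined table looks like (with the column list abbreviated as before – the only new piece is the column set declaration at the end):
CREATE TABLE TableWithSparseColumns (
    DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
    c0004 INT SPARSE NULL, c0005 INT SPARSE NULL,
    -- ... c0006 through c0999 declared the same way ...
    c1000 INT SPARSE NULL,
    -- The column set: only materialized when selected, and returns all the
    -- non-NULL sparse columns in the row as a single XML value
    SparseColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS);
GO
And SELECT * now returns: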
DocID DocName DocType SparseColumns
----- ------- ------- -------------
1     aaaa    1       NULL
2     bbbb    2       <c0945>46</c0945>
3     cccc    3       <c0334>44</c0334>
4     dddd    4       <c0233>12</c0233><c0234>34</c0234>
5     eeee    4       <c0233>12</c0233><c0234>34</c0234><c0235>46</c0235><c0236>66</c0236>
Pretty cool – and it returns virtually instantaneously (obviously scaling up to hundreds of thousands or millions of rows would take longer due to the time necessary to read the pages into the buffer pool). One downside is that the XML blob only returns the column name and value – not the datatype – but if your application can cope with that then not having to wade through hundreds (or thousands by RTM) of NULL column values is great.
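And because the column set is just untyped XML, you don’t have to do all the cracking on the client. As a quick ad-hoc example (my syntax here, with the cast supplying the datatype that the XML doesn’t carry), the xml type’s value() and exist() methods can pull an individual sparse value back out server-side:
-- Extract c0233 from the column set as an INT, for rows that have it
-- (the cast to INT is up to us - the column set XML carries no type info)
SELECT [DocID], SparseColumns.value ('(/c0233)[1]', 'INT') AS [c0233]
FROM TableWithSparseColumns
WHERE SparseColumns.exist ('/c0233') = 1;
GO
Of course, you can also just name c0233 directly in the select list – the underlying sparse columns still behave like regular columns when referenced explicitly.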
Next time I’ll discuss the internals of how sparse columns are stored.
2 thoughts on “SQL Server 2008: Sparse columns and XML COLUMN_SET”
Thanks Paul. I’m curious what is gained by returning a column that contains XML as in your example… Wouldn’t you still have to parse that out (aside from situations where SOAP is being used, etc.)? Couldn’t that additional parsing exceed the time it would have taken to retrieve the columns the old way?
Nope – generating all the NULLs and the column metadata for the thousands of NULL columns is orders of magnitude slower than cracking the simple XML that the column set returns.
Thanks