How much can sparse columns help to save disk space?

sparseSQL Server is implemented in a way that null values are stored and indexed similarly to other values. It is an opposite to what Oracle does. In the relational theory, null means unknown or not provided so why do null values consume disk space if they do not carry (useful) information? It is not big pain in many cases, but if a table has many null values compared to the real ones, space overhead may be significant.

The SPARSE keyword is kind of a solution to this. It indicates that a column does not have much real data in it so SQL Server should optimize the storage usage. This time, I am going to check how much space can be saved.

Sparse columns were introduced in SQL Server 2008 but this article is based on the latest release - SQL Server 2014. As indicated at the beginning, the SPARSE keyword reduced disk space for storing the null value but unfortunately it increases space for not null values.

 

Sparse columns and INT data type

For demonstration purposes, I have prepared two tables: one with a regular INT column and one with a sparse INT column.

CREATE TABLE cd.regular (
       col_int INT
);
CREATE TABLE cd.sparse (
       col_int INT sparse
);

Both tables are filled with 100 000 rows of a number 54.

DECLARE @rowNum INT = 0;
WHILE @rowNum < 100000 BEGIN
       INSERT INTO cd.regular
       VALUES (54);

       INSERT INTO cd.sparse
       VALUES (54);

       SET @rowNum = @rowNum + 1;
END;
GO

Amount of storage space used by the tables is provided by sp_spaceused procedure.

EXEC sp_spaceused@objname='cd.regular', @updateusage='true'
EXEC sp_spaceused@objname='cd.sparse', @updateusage='true'

int 54

 

Sparse column overhead is clearly visible. According to the documentation, SQL Server requires 8 bytes for not null values in an INT column when SPARSE setting is used while 4 bytes without SPARSE. If there are no null values in the column, there is no storage benefit from the sparse functionality but the amount of space is not even the same as of the regular column. It is much higher.

What if the columns have null values only?

For this case, I use the same tables structure as above - two tables with one INT column. Of course, at the beginning they are empty, then I fill them using the following script.

DECLARE @rowNum INT = 0;
WHILE @rowNum < 100000 BEGIN
       INSERT INTO cd.regular
       VALUES (null);

       INSERT INTO cd.sparse
       VALUES (null);

       SET @rowNum = @rowNum + 1;
END;
GO

Now, each table has 100 000 rows with null values only. How much storage space consumed do you expect? The regular table should take the same amount because INT uses 4 bytes no matter null is inserted or a number. The sparse column is not cost free even if all values are null so expecting 0 kB consumed would have no reason. Additionally, each rows uses space just because they are rows - for row headers. Lets see how it looks.

EXEC sp_spaceused@objname='cd.regular', @updateusage='true'
EXEC sp_spaceused@objname='cd.sparse', @updateusage='true'

int null

The saving is not spectacular but it is visible. Row headers overhead is significant enough when compared to 4 bytes needed for the INT columns so the benefit is not so appealing. But if there were 10 INT columns with more null values than numbers, it could be worth implementing.

 

Sparse columns and VARCHAR data type

A data type is an important factor when deciding on using sparse functionality. Lets see how it works with VARCHAR.

Similarly to the previous scenario, this time there are two tables with one column each, but VARCHAR is used instead of INT.

CREATE TABLE cd.regular (
       col_varchar100 VARCHAR(100)
);
CREATE TABLE cd.sparse (
       col_varchar100 VARCHAR(100) sparse
);
GO

The tables are filled with a sample text.

DECLARE @rowNum INT = 0;
WHILE @rowNum < 100000 BEGIN
       INSERT INTO cd.regular
       VALUES ('This is a sample text - not so short, not so long.');

       INSERT INTO cd.sparse
       VALUES ('This is a sample text - not so short, not so long.');

       SET @rowNum = @rowNum + 1;
END;
GO

 

After the experience with INT columns, higher storage usage in case of the sparse column is a natural prediction.

EXEC sp_spaceused@objname='cd.regular', @updateusage='true'
EXEC sp_spaceused@objname='cd.sparse', @updateusage='true'

varchar sample

The test confirmed it.

Will null values change the result?

DECLARE @rowNum INT = 0;
WHILE @rowNum < 100000 BEGIN
       INSERT INTO cd.regular
       VALUES (null);

       INSERT INTO cd.sparse
       VALUES (null);

       SET @rowNum = @rowNum + 1;
END;
GO
EXEC sp_spaceused@objname='cd.regular', @updateusage='true'
EXEC sp_spaceused@objname='cd.sparse', @updateusage='true'

varchar null

This time they did not. Why? Storage usually consumed by VARCHAR columns depends on a length of data. There is not much waste so in this case, benefit is not even visible.

 

Summary

Sparse columns may make sense only when null values are a significant part of all values in a particular column. Microsoft documentation states that a column should have at least 20-40% of null to even start considering sparse columns. A natural application areas for them are data warehouses where null values are very rife in table columns.
Another important conclusion is that sparse columns overhead is big so it definitely does not make sense to use this feature widely without knowing nature of data and making a conscious decision.

If you are looking for further storage usage savings, consider filtered indexes.