Stop being so precise! and more about using Load(end)dates (Datavault Series)

Introduction

My web site was offline a few days ago. Sorry for that.
I got this email from my webhosting provider: “A new message or response with subject: User *** has used up 153% of their bandwidth ..”.
Shortly after they suspended my account. It was suspended for about 8 hours. I have upgraded the network traffic limit.
All right, when all bandwidth was consumed .. then somebody must be reading my blog. Thank you for that.

This weeks post is about the LoadDate and LoadEndDate.
Actually there are two things to be nerdy on:

  1. End dating without gaps
  2. Getting the best value for bytes on DATETIME2 precision

By the way, these topics apply to SQL Server, the examples are made using SQL Server 2014.

1. End dating without gaps

How end dating works

In a Data Vault Satellite, different subsequent versions of a row in the source system are distinguished through a LoadDate (which usually contains a date/time value). So the HashKey of the corresponding Hub Row plus the LoadDate are the primary key for the satellite.
However to get the right version out when querying for a specific date, this requires a relatively difficult query, and this can also have a negative impact on performance.
This is why the LoadEndDate is an optional column of a Satellite, to make querying (read: getting data out) easier and better performing.

Important to remember that the LoadEndDate is not the date/time the load(process) ended, but the date/time the row was replaced by a newer row for the same business entity in the hub. What’s in a name, if I had to choose I would just call it EndDate, but LoadEndDate is the standard, and once you know it, it is not a problem.

There are two ways to determine the value for this LoadEndDate:

  1. Exclusive: the LoadEndDate is the LoadDate of the new row that replaces this one, minus a small time fraction. Using the exclusive methods enables the use of the BETWEEN keyword in Transact-SQL to get the right row out, example:

    WHERE @SnapshotDate BETWEEN [LoadDate] AND [LoadEndDate]

    This is the method that is used in most Data Vault examples.

  2. Inclusive: the LoadEndDate is exactly equal to the LoadDate of the new row that replaces this one. This requires no computation when updating the LoadEndDate, but disqualifies the BETWEEN keyword, for getting data out you need to do something like:

    WHERE @SnapshotDate >= [LoadDate] AND @SnapshotDate < [LoadEndDate]

Problem

There is a (theoretical) problem when using the exclusive method.
If you substract to much, there will be a tiny time gap between the LoadEndDate and the LoadDate of the subsequent row.
I fully admit this is a theoretical problem, because the chances that you require the row valid exactly on this moment are astronomically small.
(There is this not so hypothetical situation that your load starts at midnight, the LoadDate is set before the first second of the new day has passed, and you substract a second.
Then you do not find a record when using the exact date (without time fraction) to get the row out. But you need a lot of bad luck for this to happen).

Solution

Still if you are a purist you want to do it right, certainly because the solution is so simple.
If you make the the “grain” of substraction from the LoadDate equal to the precision of the LoadDate, there is no gap.
E.g.
– Substract 1 second from DATETIME2(0)
– Substract 1 centisecond from DATETIME2(2) ***TIP***
– Substract 1 millisecond from DATETIME2(3)
– Substract 1 microsecond from DATETIME2(6)
– Substract 100 nanoseconds from DATETIME2(7)

Examples in T-SQL:

Exclusive enddate without gaps.sql

DECLARE @dt20 DATETIME2(0) = '2016-04-13 20:52:17'
DECLARE @dt22 DATETIME2(2) = '2016-04-13 20:52:17.00'
DECLARE @dt23 DATETIME2(3) = '2016-04-13 20:52:17.000'
DECLARE @dt26 DATETIME2(6) = '2016-04-13 20:52:17.000000'
DECLARE @dt27 DATETIME2(7) = '2016-04-13 20:52:17.0000000'

SELECT @dt20, DATEADD(SECOND, -1, @dt20)
SELECT @dt22, DATEADD(MILLISECOND, -10, @dt22)
SELECT @dt23, DATEADD(MILLISECOND, -1, @dt23)
SELECT @dt26, DATEADD(MICROSECOND, -1, @dt26)
SELECT @dt27, DATEADD(NANOSECOND, -100, @dt27)

2. Getting the best value for bytes on DATETIME2 precision

This is about a “smart” precision to choose for your LoadDate and LoadEndDate columns. Unlike the older DATETIME datatype, DATETIME2 uses less or more bytes for storage depending on the precision you specify.
The Storage size is: 6 bytes for precisions less than 3; 7 bytes for precisions 3 and 4. All other precisions require 8 bytes. (I shamelessly pasted this from MSDN).
In the book “Building a scalable data warehouse with Data Vault 2.0” a lot of examples use DATETIME2(7). You can ask yourself why. Why do you need to be precise to 100 nanoseconds? I dare to say that in most cases (except when loading realtime or near-realtime streaming data into your Data Vault), seconds would be precise enough.
But looking back a few lines, to the storage sizes, DATETIME2(0) uses 6 bytes, but DATETIME2(2) ALSO uses 6 bytes. So with the latter you get a higher precision for the same storage size. And for daily loads, the centiseconds precision that DATETIME2(2) is providing, is really precise enough, believe me.
So DATETIME2(2) gives us the best value for bytes!


DV-S01E04-precise
Why would your LoadDates and LoadEndDates be more precise than needed ?..


In this way you can save two bytes, compared with a DATETIME2(7), which uses 8 bytes and is used in the book. Because the LoadDate is in the primary key of satellite tables, also the primary key index will be smaller.
In PIT Tables, which usually have multiple LoadDates, the storage gain is even more.
And what counts for money, “who does not respect a small gain, isn’t worth a big one”, does also count for small performance gains you can get by doing this kind of tweaks. Never forget that! Multiple smaller tweaks might add up to a noticable performance improvement!

Conclusion / Wrap up

In this blog post you have read about the two methods for enddating: Inclusive and Exclusive.
Also you could read how you can use the Exclusive method for end dating in Data Vault satellites, without having gaps in the timeline.
Finally I discussed which precision will in most cases be good enough for LoadDate and LoadEndDate columns.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Hans Michiels

Hans Michiels

Hans is an Independent Business Intelligence and Datawarehouse Consultant & Microsoft SQL Server Consultant, working in the Netherlands. He has been working in the software industry since 1996, with SQL Server since the year 2001, and since 2008 he has a primary focus on datawarehouse- and business intelligence projects using Microsoft technology, using a Datavault and Kimball architecture. He has a special interest in Datawarehouse Automation and Metadata driven solutions. * Certified in Data Vault Modeling: Certified Data Vault 2.0 Practitioner, CDVDM (aka Data Vault 1.0) * Certified in MS SQL Server: * MCSA (Microsoft Certified Solutions Associate) SQL Server 2012 - MCITP Business Intelligence Developer 2005/2008 - MCITP Database Developer 2005/2008 - MCITP Database Administrator 2005/2008

More Posts

4 thoughts on “Stop being so precise! and more about using Load(end)dates (Datavault Series)”

  1. Hi Hans,
    Thank you for this look at Load End Dates. I should let you know that Load End Dates are dead in Big Data Solutions. Why? Because they force a physical hard update in the database. Not all platforms can support this, and those that do (Teradata, Oracle, Sybase, SQLServer, MySQL, etc…) cannot do this at scale (billions of records in a sub-second time frame).

    I posted a blog entry called Load End Dates are Dead, I recommend you might take a look at it, as should your readers.

    Thank you kindly,
    Dan Linstedt – Founder and Inventor of Data Vault

Leave a Reply

Your e-mail address will not be published. Required fields are marked *