Tag Archives: DataVault

Implementing a hash key collision strategy

Introduction

Dan Linstedt, the inventor of Data Vault, has written a lot about hash keys.
For instance, one of his latest blog posts:
#datavault 2.0, Hashes, one more time.

I will not list all other sources, as you can use Google yourself.
A few comments on hash keys:

  1. You need them for scalability. Using sequence numbers is taking the risk that your data warehouse does not scale well later when the amount of data grows.
  2. They can collide: two different business keys can produce the same hash key. However, the chance that this happens is very small. For instance, when using SHA-1 (which produces a hash value of 160 bits) you will have a 1 in 10^18 chance of a hash collision when having 1.71 * 10^15 hash values (read: hub rows), according to this blog post.
  3. If collisions are unacceptable you need a hash key collision strategy.
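
The SHA-1 figures in point 2 can be checked roughly with the common birthday-bound approximation p ≈ n^2 / (2 * 2^bits). The T-SQL below is only a back-of-the-envelope sketch: the variable names are made up for this illustration, and the approximation is an assumption of mine, not necessarily the calculation used in the blog post referred to.

```sql
-- Birthday-bound approximation of the collision probability:
--   p ~ n^2 / (2 * 2^bits)
-- @n and @bits are illustrative names, not taken from the original article.
DECLARE @n    FLOAT = 1.71E15;  -- number of hash values (read: hub rows)
DECLARE @bits FLOAT = 160;      -- SHA-1 produces a 160-bit hash value

SELECT
    @n * @n / (2.0 * POWER(CAST(2 AS FLOAT), @bits)) AS approx_collision_probability;
```

The result is in the order of 1E-18, which agrees with the "1 in 10^18" mentioned above.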

The full article is posted on DWA.Guide, so you can read further there.


Picture credits: © Can Stock Photo / alexskp

Temporal Table applications in SQL Data Warehouse environments (SQL Server Series)

Today the subject of investigation is the Temporal Table, which is a new feature in SQL Server 2016. My focus will lean towards how to use it in Data Warehouse environments, but some general information will pass by as I write.
I want to cover the following topics:

  1. What is a temporal table (in short)?
  2. Can I use a temporal table for a table in the PSA (Persistent Staging Area)?
  3. Can I use a temporal table for a Data Vault Satellite?
  4. Is using temporal tables for full auditing in an OLTP system a good idea?

The full article is posted on SQLShack.com, so read further there ..

Free download: Blog Yearbook 2016

I have made a compilation of all my blog posts of the past year, and it was more work than you would think.
Especially because it also contains a non-published new article.

After filling in the form you can download the e-book for free.

    I promise you the following:

    • I will not give or sell your email address to any third party.
    • I will not spam you.
    • If I start with a newsletter, you can easily unsubscribe or change the frequency.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Zeros, bloody zeros! (Data Vault Series)

    Introduction

    I must admit I have a weakness for British humour.
    When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
    For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a leading role. It was made in 1976, and a remake was made in 2012.
    It tells in a funny way what can go wrong at meetings and how you can do better. Check it out if you can.

    [Image DV-S01E05-meetings: Mr John Cleese]

    But, obviously, this post is not about meetings but about zeros.

    Problem

    I can be short about that: leading zeros in business key values.
    For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
    This would not be a problem if it would always be delivered exactly like that. But to be honest, you cannot and will not know that upfront. Even if this is the case now, it might not be in the future.
    When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem,
    because every business key gets a different row in the hub, and this customer now exists multiple times!

    [Image DV-S01E05-zeros: No acting here, this is how I look sometimes]

    Solution

    If you are familiar with Data Vault, you might already think of same-as-links to solve this.
    But I think the solution should be implemented earlier, to avoid having multiple hub rows.
    Simply always remove leading zeros when the source column is (part of) a business key (either primary or foreign key) and looks like a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is very unlikely that 001806 and 1806 would refer to two different customers.
    Unless, of course, they would come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when the leading zeros are then dropped, satellite rows of different customers (in different source systems) could end up connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance, CRM.1806 and ERP.1806.
    In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.
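
    As a small illustration of the prefixing idea, the sketch below builds business keys like CRM.1806 and ERP.1806 while keeping the original value next to them. It reuses the PATINDEX approach from the T-SQL implementation in this post; the column names and the sample rows are assumptions, made up for this example only.

    ```sql
    -- Sketch: prefix the cleaned business key with the source system code.
    -- source_system, customer_number and the sample rows are hypothetical.
    SELECT
        src.source_system,
        src.customer_number AS original_value,  -- keep for auditing in a satellite
        src.source_system + '.' +
        CASE
            WHEN LTRIM(src.customer_number) = '' THEN ''
            WHEN PATINDEX('%[^0 ]%', src.customer_number) = 0 THEN '0'
            ELSE SUBSTRING(src.customer_number,
                           PATINDEX('%[^0 ]%', src.customer_number),
                           LEN(src.customer_number))
        END AS business_key
    FROM
        (VALUES
            ('CRM', '0001806'),
            ('CRM', '1806'),
            ('ERP', '01806')
        ) AS src (source_system, customer_number);
    ```

    Both CRM rows end up with the business key CRM.1806, and the ERP row gets ERP.1806, so customers from different source systems stay apart.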

    How to implement the solution

    There are many ways to remove leading zeros. When I was searching for this I had two requirements:

    • No casting to and from an integer may take place; otherwise all business keys would need to be numeric, which would make the solution less reliable.
    • No function, routine or assembly may be called, because this could negatively impact performance. I was looking for an “inline” conversion.

    After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, credit to him for his work), but it turned out to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
    So I had to add some logic to it: a CASE expression in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) to the SSIS expression.
    Furthermore I came across a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
    For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
    So .. this is what it has become:

    T-SQL

    Remove_leading_zeros.sql

    SELECT
        example.[id_with_leading_zeros],
       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
       END AS [id_without_zeros_method1],

       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE REPLACE(REPLACE(LTRIM(REPLACE(-- Robin Hames' method
                REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
                , '0', ' ') -- replace '0' with ' '
                ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
                , ' ', '0') -- change ' ' back to '0'
                , '!#!', ' ') -- change '!#!' back to ' '
       END AS [id_without_zeros_method2]
    FROM
        (
        SELECT
            TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
        FROM
            master.sys.objects
        UNION
        SELECT N' 00000 '
        UNION
        SELECT N'00'
        UNION
        SELECT N' '
        UNION
        SELECT ' 0099990 A '
        UNION
        SELECT '-5550'
        ) example

    SSIS Expression (can be used in Derived Column)

    (LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

    [Image DV-S01E05-151: In a Derived Column Transformation this looks, for instance, like this]

    Conclusion / Wrap up

    In this post I have explained why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
    This post also contains different ways to remove leading zeros, two for T-SQL and one as an SSIS expression.


    Stop being so precise! and more about using Load(end)dates (Datavault Series)

    Introduction

    My web site was offline a few days ago. Sorry for that.
    I got this email from my webhosting provider: “A new message or response with subject: User *** has used up 153% of their bandwidth ..”.
    Shortly after they suspended my account. It was suspended for about 8 hours. I have upgraded the network traffic limit.
    All right, when all bandwidth was consumed .. then somebody must be reading my blog. Thank you for that.

    This week's post is about the LoadDate and LoadEndDate.
    Actually there are two things to be nerdy on:

    1. End dating without gaps
    2. Getting the best value for bytes on DATETIME2 precision

    By the way, these topics apply to SQL Server; the examples are made using SQL Server 2014.

    1. End dating without gaps

    How end dating works

    In a Data Vault Satellite, different subsequent versions of a row in the source system are distinguished through a LoadDate (which usually contains a date/time value). So the HashKey of the corresponding Hub Row plus the LoadDate form the primary key for the satellite.
    However, getting the right version out when querying for a specific date requires a relatively difficult query, which can also have a negative impact on performance.
    This is why the LoadEndDate is an optional column of a Satellite: it makes querying (read: getting data out) easier and better performing.
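
    To give an idea of what such a "relatively difficult query" can look like when there is no LoadEndDate, here is a minimal sketch. The table variable @SatCustomer, its columns and its rows are assumptions, made up for this illustration; a correlated subquery is just one of several possible ways to write it.

    ```sql
    -- Hypothetical satellite WITHOUT a LoadEndDate column (names are illustrative).
    DECLARE @SatCustomer TABLE (
        CustomerHashKey CHAR(32)     NOT NULL,
        LoadDate        DATETIME2(2) NOT NULL,
        CustomerName    NVARCHAR(50) NOT NULL,
        PRIMARY KEY (CustomerHashKey, LoadDate)
    );

    INSERT INTO @SatCustomer (CustomerHashKey, LoadDate, CustomerName)
    VALUES
        ('A1', '2016-04-11 02:00:00.00', N'Jansen'),
        ('A1', '2016-04-12 02:00:00.00', N'Janssen'),
        ('A1', '2016-04-13 02:00:00.00', N'Janssen BV');

    DECLARE @SnapshotDate DATETIME2(2) = '2016-04-12 12:00:00.00';

    -- For every hub row, find the latest satellite row loaded on or before the snapshot date.
    SELECT s.CustomerHashKey, s.LoadDate, s.CustomerName
    FROM @SatCustomer AS s
    WHERE s.LoadDate =
        (
        SELECT MAX(s2.LoadDate)
        FROM @SatCustomer AS s2
        WHERE s2.CustomerHashKey = s.CustomerHashKey
          AND s2.LoadDate <= @SnapshotDate
        );
    ```

    With a LoadEndDate on the row, the subquery can be replaced by a simple range condition on LoadDate and LoadEndDate.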

    It is important to remember that the LoadEndDate is not the date/time the load (process) ended, but the date/time the row was replaced by a newer row for the same business entity in the hub. What’s in a name: if I had to choose I would just call it EndDate, but LoadEndDate is the standard, and once you know it, it is not a problem.

    There are two ways to determine the value for this LoadEndDate:

    1. Exclusive: the LoadEndDate is the LoadDate of the new row that replaces this one, minus a small time fraction. Using the exclusive method enables the use of the BETWEEN keyword in Transact-SQL to get the right row out, for example:

      WHERE @SnapshotDate BETWEEN [LoadDate] AND [LoadEndDate]

      This is the method that is used in most Data Vault examples.

    2. Inclusive: the LoadEndDate is exactly equal to the LoadDate of the new row that replaces this one. This requires no computation when updating the LoadEndDate, but disqualifies the BETWEEN keyword. For getting data out you need to do something like:

      WHERE @SnapshotDate >= [LoadDate] AND @SnapshotDate < [LoadEndDate]

    Problem

    There is a (theoretical) problem when using the exclusive method.
    If you subtract too much, there will be a tiny time gap between the LoadEndDate and the LoadDate of the subsequent row.
    I fully admit this is a theoretical problem, because the chances that you require the row valid at exactly this moment are astronomically small.
    (There is this not so hypothetical situation that your load starts at midnight, the LoadDate is set before the first second of the new day has passed, and you subtract a second.
    Then you do not find a record when using the exact date (without time fraction) to get the row out. But you need a lot of bad luck for this to happen).

    Solution

    Still, if you are a purist you want to do it right, certainly because the solution is so simple.
    If you make the “grain” of subtraction from the LoadDate equal to the precision of the LoadDate, there is no gap.
    E.g.
    – Subtract 1 second from DATETIME2(0)
    – Subtract 1 centisecond from DATETIME2(2) ***TIP***
    – Subtract 1 millisecond from DATETIME2(3)
    – Subtract 1 microsecond from DATETIME2(6)
    – Subtract 100 nanoseconds from DATETIME2(7)

    Examples in T-SQL:

    Exclusive enddate without gaps.sql

    DECLARE @dt20 DATETIME2(0) = '2016-04-13 20:52:17'
    DECLARE @dt22 DATETIME2(2) = '2016-04-13 20:52:17.00'
    DECLARE @dt23 DATETIME2(3) = '2016-04-13 20:52:17.000'
    DECLARE @dt26 DATETIME2(6) = '2016-04-13 20:52:17.000000'
    DECLARE @dt27 DATETIME2(7) = '2016-04-13 20:52:17.0000000'

    SELECT @dt20, DATEADD(SECOND, -1, @dt20)
    SELECT @dt22, DATEADD(MILLISECOND, -10, @dt22)
    SELECT @dt23, DATEADD(MILLISECOND, -1, @dt23)
    SELECT @dt26, DATEADD(MICROSECOND, -1, @dt26)
    SELECT @dt27, DATEADD(NANOSECOND, -100, @dt27)
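
    Applied to a whole satellite, the same subtraction can be combined with the LEAD window function (available since SQL Server 2012) to derive the LoadEndDate from the LoadDate of the next row. This is only a sketch under assumptions: the table variable @SatCustomer, its columns and the far-future date 9999-12-31 for the current row are made up for this illustration.

    ```sql
    -- Hypothetical satellite with DATETIME2(2) load dates (names are illustrative).
    DECLARE @SatCustomer TABLE (
        CustomerHashKey CHAR(32)     NOT NULL,
        LoadDate        DATETIME2(2) NOT NULL,
        CustomerName    NVARCHAR(50) NOT NULL,
        PRIMARY KEY (CustomerHashKey, LoadDate)
    );

    INSERT INTO @SatCustomer (CustomerHashKey, LoadDate, CustomerName)
    VALUES
        ('A1', '2016-04-11 02:00:00.00', N'Jansen'),
        ('A1', '2016-04-12 02:00:00.00', N'Janssen'),
        ('A1', '2016-04-13 02:00:00.00', N'Janssen BV');

    -- Exclusive end date without gaps: next LoadDate minus 1 centisecond,
    -- because the precision of the LoadDate is DATETIME2(2).
    SELECT
        s.CustomerHashKey,
        s.LoadDate,
        COALESCE(
            DATEADD(MILLISECOND, -10,
                LEAD(s.LoadDate) OVER (PARTITION BY s.CustomerHashKey ORDER BY s.LoadDate)),
            CAST('9999-12-31' AS DATETIME2(2))  -- assumed "open" end date for the current row
        ) AS LoadEndDate
    FROM @SatCustomer AS s
    ORDER BY s.CustomerHashKey, s.LoadDate;
    ```

    The row loaded at 2016-04-11 02:00:00.00 gets an end date of 2016-04-12 01:59:59.99, which is exactly one centisecond before the next LoadDate, so BETWEEN can be used without leaving a gap.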

    2. Getting the best value for bytes on DATETIME2 precision

    This is about a “smart” precision to choose for your LoadDate and LoadEndDate columns. Unlike the older DATETIME datatype, DATETIME2 uses fewer or more bytes for storage depending on the precision you specify.
    The Storage size is: 6 bytes for precisions less than 3; 7 bytes for precisions 3 and 4. All other precisions require 8 bytes. (I shamelessly pasted this from MSDN).
    In the book “Building a scalable data warehouse with Data Vault 2.0” a lot of examples use DATETIME2(7). You can ask yourself why. Why do you need to be precise to 100 nanoseconds? I dare to say that in most cases (except when loading realtime or near-realtime streaming data into your Data Vault), seconds would be precise enough.
    But looking back a few lines, to the storage sizes, DATETIME2(0) uses 6 bytes, but DATETIME2(2) ALSO uses 6 bytes. So with the latter you get a higher precision for the same storage size. And for daily loads, the centiseconds precision that DATETIME2(2) is providing, is really precise enough, believe me.
    So DATETIME2(2) gives us the best value for bytes!
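
    If you want to see these storage sizes for yourself, one possible check is the DATALENGTH function on values cast to the different precisions. This is just a quick sketch; the variable and column names are made up for this example.

    ```sql
    -- Compare the number of bytes used by DATETIME2 at different precisions.
    -- @now and the column aliases are illustrative names.
    DECLARE @now DATETIME2(7) = SYSDATETIME();

    SELECT
        DATALENGTH(CAST(@now AS DATETIME2(0))) AS bytes_precision_0,
        DATALENGTH(CAST(@now AS DATETIME2(2))) AS bytes_precision_2,
        DATALENGTH(CAST(@now AS DATETIME2(3))) AS bytes_precision_3,
        DATALENGTH(CAST(@now AS DATETIME2(4))) AS bytes_precision_4,
        DATALENGTH(CAST(@now AS DATETIME2(7))) AS bytes_precision_7;
    ```

    Going by the storage sizes quoted above, this should report 6, 6, 7, 7 and 8 bytes.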


    [Image DV-S01E04-precise: Why would your LoadDates and LoadEndDates be more precise than needed?]


    In this way you can save two bytes, compared with a DATETIME2(7), which uses 8 bytes and is used in the book. Because the LoadDate is in the primary key of satellite tables, the primary key index will also be smaller.
    In PIT Tables, which usually have multiple LoadDates, the storage gain is even greater.
    And what applies to money, “who does not respect a small gain, isn’t worth a big one”, also applies to the small performance gains you can get from this kind of tweak. Never forget that! Multiple smaller tweaks might add up to a noticeable performance improvement!

    Conclusion / Wrap up

    In this blog post you have read about the two methods for end dating: Inclusive and Exclusive.
    You have also read how you can use the Exclusive method for end dating in Data Vault satellites, without having gaps in the timeline.
    Finally I discussed which precision will in most cases be good enough for LoadDate and LoadEndDate columns.
