
Implementing a hash key collision strategy

Introduction

Dan Linstedt, the inventor of Data Vault, has written a lot about hash keys, for instance in one of his latest blog posts: "#datavault 2.0, Hashes, one more time". I will not list all other sources; you can use Google yourself. A few comments on hash keys:

  1. You need them for scalability. Using sequence numbers instead means taking the risk that your data warehouse does not scale well later, when the amount of data grows.
  2. They can collide: two different business keys can produce the same hash key. The chance that this happens is, however, very small. For instance, when using SHA-1 (which produces a hash value of 160 bits) you have a 1 in 10^18 chance of a hash collision once you have 1.71 * 10^15 hash values (read: hub rows), according to this blog post (a quick check of that figure follows below this list).
  3. If collisions are unacceptable you need a hash key collision strategy.
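
As a quick sanity check of those numbers (using the standard birthday approximation, which is not taken from the referenced post): with a b-bit hash and k distinct business keys, the collision probability is roughly k^2 / 2^(b+1). For SHA-1 (b = 160) and k = 1.71 * 10^15 hub rows this gives (1.71 * 10^15)^2 / 2^161, which is about 2.9 * 10^30 / 2.9 * 10^48, or roughly 10^-18. That indeed matches the 1 in 10^18 figure.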

Hash collision detection

If you take the possibility of a hash collision seriously, you need a detection strategy. The easiest way to implement this is to make sure that a primary key violation occurs when a hash key collision occurs. When your ETL processes are set up following best practices, they are restartable after fixing the error, or can be rolled back, so you can continue the load or reload later. A hash key collision will be detected when you check in the hub whether a business key (single-column or composite) already exists, while a primary key constraint is enabled on the hash key. For instance, the contents of a hub table (simplified example):

Hashkey | Business key
H1      | A
H2      | B

Let’s assume that you have a business key C that also produces the hashkey H1. The load process checks if C already exists, doesn’t find it, and thus wants to add a row to the hub. But because hashkey H1 already exists, a primary key violation occurs. This is good! Nothing is worse than an undetected hash key collision. If the load process checked for the hashkey H1 instead, business key C would never be added to the hub, and the describing context (satellite rows) and links of business key A would get mixed up with rows of business key C, but you would not know! So detection is essential. If for some reason your hubs do not have physical primary keys, you need to check afterwards for duplicate hashkeys. But this is only the second-best approach, because by then data corruption has already taken place. It can be a mess, but using the loading datetime and/or the source data (which is still in staging) you might be able to fix it.
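
To make the detection mechanism concrete, here is a minimal sketch of a hub with a primary key constraint on the hash key, plus the corresponding load statement. The schema, table and column names are just examples, not a prescribed standard:

    -- Minimal example hub: the primary key on the hash key turns an (extremely
    -- unlikely) hash collision into a hard, visible load error.
    CREATE TABLE [raw].[H_Customer] (
        [CustomerHashKey] BINARY(20)   NOT NULL, -- SHA-1 output is 20 bytes
        [CustomerNumber]  NVARCHAR(20) NOT NULL, -- the business key
        [LoadDate]        DATETIME2(2) NOT NULL,
        [RecordSource]    NVARCHAR(50) NOT NULL,
        CONSTRAINT [PK_H_Customer] PRIMARY KEY CLUSTERED ([CustomerHashKey]),
        CONSTRAINT [UQ_H_Customer_BK] UNIQUE ([CustomerNumber])
    );

    -- The load checks on the business key; if a new business key happens to
    -- produce an existing hash key, the INSERT fails on PK_H_Customer instead
    -- of silently mixing up satellites and links of two different keys.
    INSERT INTO [raw].[H_Customer] ([CustomerHashKey], [CustomerNumber], [LoadDate], [RecordSource])
    SELECT stg.[CustomerHashKey], stg.[CustomerNumber], stg.[LoadDate], stg.[RecordSource]
    FROM [stg].[Customer] stg
    WHERE NOT EXISTS (
        SELECT 1 FROM [raw].[H_Customer] hub
        WHERE hub.[CustomerNumber] = stg.[CustomerNumber]
    );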

Types of hash key collision strategies

So, Houston, we have a problem, and Seattle, we need a solution. How do you deal with hash key collisions? You can deal with them at different levels. Each level has a different impact on the load processes and a different refactoring impact (the effort needed to change the load process and rehash existing business keys). When choosing a hash collision strategy, load impact and refactoring impact are more or less inversely proportional: when the impact on the loading process increases, the refactoring impact will usually decrease. Because of the low chance that a hash collision will ever happen, it often makes sense to choose a strategy that has a low impact on loading (and hence a medium or high impact on refactoring, which might never be needed anyway). In an overview:

Level        | Load impact | Refactoring impact
System Level | None        | High
Table Level  | Low         | Low to High
Row Level    | High        | None

System Level

Implementing a hash key collision strategy on System Level means that you change the hash key computation of the entire system. The advantage is that you need no extra logic in the load processes; they simply always use the (changed) hash key computation. The disadvantages can be:

  • a performance impact: the changed way of hashing might be slower, and it affects all tables.
  • a risk of a high refactoring effort: the need to change a lot of ETL code if no data warehouse automation tool is used, and rehashing all business keys of the entire data warehouse.

In my opinion this feels like using a large ax to cut your lettuce.

Table Level

Implementing a hash key collision strategy on Table Level means that you change the hash key computation only for the business key of the hub that caused the hash key collision. The advantages of this approach are:

  • this exception can be hardcoded in manually made ETL code, or be metadata driven in a data warehouse automation tool. It could even be decided at runtime, using the hub name as a parameter for the hash key calculation (although I would not recommend the latter, because it could have a serious performance impact).
  • a performance impact (if any) only affects one table.
  • the refactoring impact can be Low to Medium: changing the ETL code only affects the hash key of one hub, and so does the rehashing of the business keys.

The disadvantages can be:

  • If it is a hub that is used in many links, the refactoring impact might still be Medium to High.
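
A minimal sketch of what a table-level change could look like, assuming the common pattern of hashing a sanitized business key with HASHBYTES and assuming the collision happened in a customer hub (all names are illustrative):

    -- Default hash key computation, generated for every hub (and the links
    -- that reference it), e.g. by a data warehouse automation tool:
    SELECT HASHBYTES('SHA1', UPPER(LTRIM(RTRIM(stg.[CustomerNumber])))) AS [CustomerHashKey]
    FROM [stg].[Customer] stg;

    -- Deviating computation, used only for the colliding hub and its links:
    -- here a fixed suffix is appended before hashing (see the rehashing
    -- methods further down; reversing the business key is another option).
    SELECT HASHBYTES('SHA1', UPPER(LTRIM(RTRIM(stg.[CustomerNumber]))) + N'#1') AS [CustomerHashKey]
    FROM [stg].[Customer] stg;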

Row Level

And finally, hash key collision mitigation at Row Level. This can be done by adding an extra table to the data vault that stores the business keys that would produce a duplicate hash key; each row in this table contains an alternate hash key to be used for that business key. The advantage of this approach is:

  • There is no refactoring impact. When a hash key collision occurs, just add the new business key that causes the collision to the exception table with an alternate hash key, and restart the load process.

The disadvantage however will be:

  • It makes loading much more complicated: for every hash key computation a lookup in the exception table (which in most cases will be empty) needs to be done. This can also have a performance impact. I must admit I did not try it, but I would not dare to take the risk that it does not scale.

So my first thought was: no-go, no-go! In my opinion the impact on the daily loads is just too big for a very small chance of a hash collision.
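
For completeness, a sketch of what such an exception table and the lookup during hashing could look like (purely illustrative, and as said, not my preferred approach):

    -- Exception table: normally empty; a row is only added after a collision.
    CREATE TABLE [raw].[HashKeyException] (
        [HubName]          SYSNAME       NOT NULL,
        [BusinessKey]      NVARCHAR(200) NOT NULL,
        [AlternateHashKey] BINARY(20)    NOT NULL,
        CONSTRAINT [PK_HashKeyException] PRIMARY KEY ([HubName], [BusinessKey])
    );

    -- Every hash key computation now needs an outer join to the exception table:
    SELECT COALESCE(ex.[AlternateHashKey],
                    HASHBYTES('SHA1', UPPER(LTRIM(RTRIM(stg.[CustomerNumber]))))) AS [CustomerHashKey]
    FROM [stg].[Customer] stg
    LEFT JOIN [raw].[HashKeyException] ex
           ON ex.[HubName] = N'H_Customer'
          AND ex.[BusinessKey] = UPPER(LTRIM(RTRIM(stg.[CustomerNumber])));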

Rehashing methods


Not discussed yet: how would you rehash? In the Data Vault 2.0 training, upgrading the hashing algorithm when possible (e.g. from MD5 to SHA-1) was mentioned. This is a valid choice, but it has an impact on the length of your hash key (in this example from 16 to 20 bytes when stored as binary), and it introduces an exception in your table design patterns. It does not have my preference. Er .. is there something else I can do? Yes! Dan Linstedt suggests reversing the business key value before hashing, and this can work. The length of the hash key stays the same, and all that needs to be changed in the ETL code is the hash key computation for this hub (also where it is used in links!). But the possibilities are almost endless! Using the business key 1234AB as an example: besides reversing, you could also:

  • duplicate the business key before hashing (1234AB1234AB)
  • append the first character (1234AB1)
  • prepend or append a fixed string (1234AB#, or #1234AB)
  • etcetera, etcetera. Just use your imagination and test what works best in terms of performance, code simplicity and uniqueness of the hash value (a small T-SQL illustration follows below).
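
To make these variants tangible, a small sketch using HASHBYTES (assuming SHA-1 and the example business key from above):

    DECLARE @bk NVARCHAR(50) = N'1234AB';

    SELECT
        HASHBYTES('SHA1', @bk)                AS [original],
        HASHBYTES('SHA1', REVERSE(@bk))       AS [reversed],
        HASHBYTES('SHA1', @bk + @bk)          AS [duplicated],
        HASHBYTES('SHA1', @bk + LEFT(@bk, 1)) AS [first_character_appended],
        HASHBYTES('SHA1', @bk + N'#')         AS [fixed_string_appended];
    -- Each variant produces a completely different hash value for the same
    -- business key, while the length of the hash key (20 bytes) stays the same.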

Conclusion / Wrap up

In this post I discussed how you can prepare for the unlikely event of a hash collision in your Data Vault 2.0 data warehouse.

Some things to keep in mind:

  1. Make sure you will detect a hash key collision when it happens.
  2. Make a decision on which level you want to deal with a hash key collision when it happens. Your options are: System, Table and Row level. I have a preference for Table level, but System level is also a valid choice. I think Row level is not a wise choice, for reasons described above.
  3. Know how to rehash the business keys for a hub that had a hash key collision: “do something” (as described above) with the business key before hashing, until the hash values for the hub are unique and the performance is acceptable.

Interested in Data Vault? Make sure you check out the website of Data Vault Alliance!


Temporal Table applications in SQL Data Warehouse environments (SQL Server Series)

Today the subject of investigation is the Temporal Table, which is a new feature in SQL Server 2016. My focus will be mainly on how to use it in Data Warehouse environments, but there is some general information passing by as well.
I want to cover the following topics:

  1. What is a temporal table (in short)?
  2. Can I use a temporal table for a table in the PSA (Persistent Staging Area)?
  3. Can I use a temporal table for a Data Vault Satellite?
  4. Is using temporal tables for full auditing in an OLTP system a good idea?

The full article is posted on SQLShack.com, so read further there ..

Free download: Blog Yearbook 2016

I have made a compilation of all my blog posts of the past year, and it was more work than you would think.
Especially because it also contains a previously unpublished article.

After filling in the form you can download the e-book for free.

    I promise you the following:

    • I will not give or sell your email address to any third party.
    • I will not spam you.
    • If I start with a newsletter, you can easily unsubscribe or change the frequency.


    Zeros, bloody zeros! (Data Vault Series)

    Introduction

    I must admit I have a weakness for British humour.
    When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
    For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
    It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.

    (Image: Mr John Cleese)

    But, obviously, this post is not about meetings but about zeros.

    Problem

    I can be short about that: leading zeros in business key values.
    For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
    This would not be a problem if it were always delivered exactly like that. But to be honest, you cannot and will not know that upfront. Even if this is the case now, it might not be in the future.
    When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem,
    because every business key gets a different row in the hub, and this customer then exists multiple times!
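
    A quick illustration of how the data warehouse sees this (just a sketch with made-up values):

    -- Four deliveries of the same real-world customer number:
    SELECT COUNT(DISTINCT [CustomerNumber]) AS [number_of_hub_rows]
    FROM (VALUES (N'1806'), (N'01806'), (N'0001806'), (N'00001806')) AS example([CustomerNumber]);
    -- Returns 4: without cleansing, one customer would end up as four hub rows.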

    (Photo: no acting here, this is how I look sometimes.)

    Solution

    If you are familiar with Data Vault, you might already think of same-as-links to solve this.
    But I think the solution should be implemented earlier, to avoid having multiple hub rows.
    Simply always remove leading zeros when the source column is (part of) a business key (either primary or foreign key) and looks like a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty much impossible that 001806 and 1806 would refer to two different customers.
    Unless, of course, they come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do anyway, because when the leading zeros are dropped, satellite rows of different customers (in different source systems) could end up connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance CRM.1806 and ERP.1806.
    In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.

    How to implement the solution

    There are many ways to remove leading zeros. When I was searching for this I had two requirements:

    • No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
    • No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.

    After some research I found an expression that was the same for SQL and SSIS and quite okay (the T-SQL version is by Robin Hames, my credits for his work), but it appeared to change a string containing only one or more zeros into an empty string. And because a 0 can have a meaning (and is certainly different from an empty string), this is undesired behavior, IMHO.
    So I had to add some logic to it: a CASE expression in T-SQL and an inline condition (format {condition} ? {true part} : {false part}) in the SSIS expression.
    Furthermore I came across a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
    For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
    So .. this is what it has become:

    T-SQL

    Remove_leading_zeros.sql

    SELECT
        example.[id_with_leading_zeros],
       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
       END AS [id_without_zeros_method1],

       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE REPLACE(REPLACE(LTRIM(REPLACE(-- Robin Hames' method
                REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
                , '0', ' ') -- replace '0' with ' '
                ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
                , ' ', '0') -- change ' ' back to '0'
                , '!#!', ' ') -- change '!#!' back to ' '
       END AS [id_without_zeros_method2]
    FROM
        (
        SELECT
            TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
        FROM
            master.sys.objects
        UNION
        SELECT N' 00000 '
        UNION
        SELECT N'00'
        UNION
        SELECT N' '
        UNION
        SELECT ' 0099990 A '
        UNION
        SELECT '-5550'
        ) example

    SSIS Expression (can be used in Derived Column)

    (LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

    (Screenshot: in a Derived Column Transformation this looks, for instance, like this.)

    Conclusion / Wrap up

    In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
    This post also contains different ways to remove leading zeros: two for T-SQL and one as an SSIS expression.


    Stop being so precise! and more about using Load(end)dates (Datavault Series)

    Introduction

    My web site was offline a few days ago. Sorry for that.
    I got this email from my webhosting provider: “A new message or response with subject: User *** has used up 153% of their bandwidth ..”.
    Shortly after they suspended my account. It was suspended for about 8 hours. I have upgraded the network traffic limit.
    All right, when all bandwidth was consumed .. then somebody must be reading my blog. Thank you for that.

    This week's post is about the LoadDate and LoadEndDate.
    Actually there are two things to be nerdy about:

    1. End dating without gaps
    2. Getting the best value for bytes on DATETIME2 precision

    By the way, these topics apply to SQL Server, the examples are made using SQL Server 2014.

    1. End dating without gaps

    How end dating works

    In a Data Vault Satellite, different subsequent versions of a row in the source system are distinguished through a LoadDate (which usually contains a date/time value). So the HashKey of the corresponding Hub Row plus the LoadDate are the primary key for the satellite.
    However, getting the right version out when querying for a specific date requires a relatively difficult query, and this can also have a negative impact on performance.
    This is why the LoadEndDate is an optional column of a Satellite, to make querying (read: getting data out) easier and better performing.

    It is important to remember that the LoadEndDate is not the date/time the load (process) ended, but the date/time the row was replaced by a newer row for the same business entity in the hub. What's in a name; if I had to choose I would just call it EndDate, but LoadEndDate is the standard, and once you know it, it is not a problem.

    There are two ways to determine the value for this LoadEndDate:

    1. Exclusive: the LoadEndDate is the LoadDate of the new row that replaces this one, minus a small time fraction. Using the exclusive method enables the use of the BETWEEN keyword in Transact-SQL to get the right row out, for example:

      WHERE @SnapshotDate BETWEEN [LoadDate] AND [LoadEndDate]

      This is the method that is used in most Data Vault examples.

    2. Inclusive: the LoadEndDate is exactly equal to the LoadDate of the new row that replaces this one. This requires no computation when updating the LoadEndDate, but disqualifies the BETWEEN keyword, for getting data out you need to do something like:

      WHERE @SnapshotDate >= [LoadDate] AND @SnapshotDate < [LoadEndDate]

    Problem

    There is a (theoretical) problem when using the exclusive method.
    If you subtract too much, there will be a tiny time gap between the LoadEndDate and the LoadDate of the subsequent row.
    I fully admit this is a theoretical problem, because the chances that you need the row that is valid exactly at this moment are astronomically small.
    (There is the not so hypothetical situation that your load starts at midnight, the LoadDate is set before the first second of the new day has passed, and you subtract a second.
    Then you do not find a record when using the exact date (without a time fraction) to get the row out. But you need a lot of bad luck for this to happen.)

    Solution

    Still, if you are a purist you want to do it right, certainly because the solution is so simple.
    If you make the “grain” of subtraction from the LoadDate equal to the precision of the LoadDate, there is no gap.
    E.g.
    – Subtract 1 second from DATETIME2(0)
    – Subtract 1 centisecond from DATETIME2(2) ***TIP***
    – Subtract 1 millisecond from DATETIME2(3)
    – Subtract 1 microsecond from DATETIME2(6)
    – Subtract 100 nanoseconds from DATETIME2(7)

    Examples in T-SQL:

    Exclusive enddate without gaps.sql

    DECLARE @dt20 DATETIME2(0) = '2016-04-13 20:52:17'
    DECLARE @dt22 DATETIME2(2) = '2016-04-13 20:52:17.00'
    DECLARE @dt23 DATETIME2(3) = '2016-04-13 20:52:17.000'
    DECLARE @dt26 DATETIME2(6) = '2016-04-13 20:52:17.000000'
    DECLARE @dt27 DATETIME2(7) = '2016-04-13 20:52:17.0000000'

    SELECT @dt20, DATEADD(SECOND, -1, @dt20)
    SELECT @dt22, DATEADD(MILLISECOND, -10, @dt22)
    SELECT @dt23, DATEADD(MILLISECOND, -1, @dt23)
    SELECT @dt26, DATEADD(MICROSECOND, -1, @dt26)
    SELECT @dt27, DATEADD(NANOSECOND, -100, @dt27)
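
    For completeness, a sketch of how the exclusive LoadEndDate could be (re)computed with a windowing function. The satellite and column names are made up, DATETIME2(2) is assumed, and so is the convention of using 9999-12-31 for the current (open) row:

    ;WITH versions AS (
        SELECT [CustomerHashKey], [LoadDate],
               LEAD([LoadDate]) OVER (PARTITION BY [CustomerHashKey]
                                      ORDER BY [LoadDate]) AS [NextLoadDate]
        FROM [raw].[S_Customer]
    )
    UPDATE sat
    SET sat.[LoadEndDate] = COALESCE(DATEADD(MILLISECOND, -10, v.[NextLoadDate]), -- minus 1 centisecond
                                     CONVERT(DATETIME2(2), '9999-12-31 23:59:59.99'))
    FROM [raw].[S_Customer] sat
    JOIN versions v
      ON v.[CustomerHashKey] = sat.[CustomerHashKey]
     AND v.[LoadDate] = sat.[LoadDate];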

    2. Getting the best value for bytes on DATETIME2 precision

    This is about choosing a “smart” precision for your LoadDate and LoadEndDate columns. Unlike the older DATETIME datatype, DATETIME2 uses fewer or more bytes for storage depending on the precision you specify.
    The storage size is: 6 bytes for precision less than 3; 7 bytes for precision 3 or 4; all other precisions require 8 bytes. (I shamelessly pasted this from MSDN.)
    In the book “Building a scalable data warehouse with Data Vault 2.0” a lot of examples use DATETIME2(7). You can ask yourself why. Why do you need to be precise to 100 nanoseconds? I dare to say that in most cases (except when loading realtime or near-realtime streaming data into your Data Vault), seconds would be precise enough.
    But looking back a few lines, at the storage sizes, DATETIME2(0) uses 6 bytes, and DATETIME2(2) ALSO uses 6 bytes. So with the latter you get a higher precision for the same storage size. And for daily loads, the centisecond precision that DATETIME2(2) provides is really precise enough, believe me.
    So DATETIME2(2) gives us the best value for bytes!
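
    If you want to check those storage sizes yourself, DATALENGTH returns the number of bytes a value occupies (just a quick verification, nothing more):

    SELECT
        DATALENGTH(CONVERT(DATETIME2(0), SYSDATETIME())) AS [bytes_datetime2_0], -- 6
        DATALENGTH(CONVERT(DATETIME2(2), SYSDATETIME())) AS [bytes_datetime2_2], -- 6
        DATALENGTH(CONVERT(DATETIME2(3), SYSDATETIME())) AS [bytes_datetime2_3], -- 7
        DATALENGTH(CONVERT(DATETIME2(7), SYSDATETIME())) AS [bytes_datetime2_7]; -- 8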


    (Image: why would your LoadDates and LoadEndDates be more precise than needed?)


    In this way you can save two bytes compared with a DATETIME2(7), which uses 8 bytes and is used in the book. And because the LoadDate is part of the primary key of satellite tables, the primary key index will also be smaller.
    In PIT tables, which usually have multiple LoadDates, the storage gain is even bigger.
    And the saying “who does not respect a small gain, isn’t worth a big one” does not only apply to money; it also applies to the small performance gains you can get from this kind of tweak. Never forget that! Multiple small tweaks might add up to a noticeable performance improvement!

    Conclusion / Wrap up

    In this blog post you have read about the two methods for end dating: inclusive and exclusive.
    Also you could read how you can use the Exclusive method for end dating in Data Vault satellites, without having gaps in the timeline.
    Finally I discussed which precision will in most cases be good enough for LoadDate and LoadEndDate columns.
