Category Archives: Data Vault Series

Contains blog posts related to Data Vault 1.0 and 2.0

Implementing a hash key collision strategy

Introduction

Dan Linstedt, the inventor of Data Vault, has written a lot about hash keys. For instance, one of his latest blog posts: #datavault 2.0, Hashes, one more time. I will not list all the other sources, as you can use Google yourself. A few comments on hash keys:

  1. You need them for scalability. Using sequence numbers is taking the risk that your data warehouse does not scale well later when the amount of data grows.
  2. They can collide: two different business keys can produce the same hash key. However the chance that this happens is very small. For instance when using SHA-1 (which produces a hash value of 160 bits) you will have a 1 in 10^18 chance of a hash collision when you have 1.71 * 10^15 hash values (read: hub rows), according to this blog post.
  3. If collisions are unacceptable you need a hash key collision strategy.

Hash collision detection

If you take the possibility of a hash collision seriously, you’ll need a detection strategy. The easiest way to implement this is to make sure that a primary key violation occurs when a hash key collision occurs. When your ETL processes are set up following best practices, your processes are restartable after fixing the error, or can be rolled back, so you can continue the load or reload later. A hash key collision will be detected when the load process checks in the hub whether a business key (single-column or composite) already exists, while a primary key constraint is enabled on the hash key. For instance the contents of a hub table (simplified example):

Hashkey | Business key
H1 | A
H2 | B

Let’s assume that you have a business key C that also produces the hashkey H1. The load process checks if C already exists, doesn’t find it, and thus wants to add a row to the hub. But because hashkey H1 already exists, a primary key violation will occur. This is good! Nothing is worse than an undetected hash key collision. If the load process would check for the hashkey H1, business key C would never be added to the hub, and the describing context (satellite rows) and links of business key A would all be messed up with rows of business key C, but you would not know! So detection is essential. If for some reason your hubs do not have physical primary keys, you need to check afterwards for duplicate hashkeys. But this is the second-best approach, because then data corruption has already taken place. It can be a mess, but using a loading datetime and/or the source data (which is still in staging) you might be able to fix it.
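A minimal T-SQL sketch of how such a hub could enforce this detection; table, schema and column names are just examples, and the CHAR(32) hash key assumes MD5 stored as a hexadecimal string:

Hub_with_hash_key_PK.sql

CREATE TABLE [raw].[H_Customer] (
    [CustomerHashKey] CHAR(32)      NOT NULL, -- e.g. MD5 hash stored as hexadecimal string
    [CustomerNumber]  NVARCHAR(20)  NOT NULL, -- the business key
    [LoadDts]         DATETIME2(2)  NOT NULL,
    [RecordSource]    NVARCHAR(100) NOT NULL,
    -- The primary key on the hash key is what turns a collision into a detectable error:
    CONSTRAINT [PK_H_Customer] PRIMARY KEY ([CustomerHashKey])
);

-- Second-best option, when hubs have no physical primary key: check afterwards for duplicates.
SELECT [CustomerHashKey], COUNT(*) AS [NumberOfBusinessKeys]
FROM [raw].[H_Customer]
GROUP BY [CustomerHashKey]
HAVING COUNT(*) > 1;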

Types of hash key collision strategies

So, Houston, we have a problem and Seattle, we need a solution. How to deal with hash key collisions? You can deal with them at different levels. Each level has a different impact on load processes and a different refactoring impact (the effort needed to change the load process and rehash existing business keys). When choosing a hash collision strategy, load impact and refactoring impact are more or less inversely proportional: when the impact on the loading process increases, the refactoring impact will usually decrease. Because of the low chance that a hash collision will ever happen, it often makes sense to choose a strategy that has a low impact on loading (and hence a medium or high impact on refactoring, which might never be needed anyway). In an overview:

Level | Load impact | Refactoring impact
System Level | None | High
Table Level | Low | Low to High
Row Level | High | None

System Level

Implementing a hash key collision strategy on System Level means that you change the hash key computation for the entire system. The advantage is that you need no logic in the load processes, they always use the (changed) hash key computation. The disadvantages can be:

  • a performance impact: the changed way of hashing might be slower, and it affects all tables.
  • a risk of a high refactoring effort: the need to change a lot of ETL code if no data warehouse automation tool is used, and rehashing all business keys of the entire data warehouse.

In my opinion this feels like using a large ax to cut your lettuce.

Table Level

Implementing a hash key collision strategy on Table Level means that you change the hash key computation only for the business key of the hub that causes the hash key collision. The advantages of this approach are:

  • this exception can be hardcoded in manually created ETL code or be metadata-driven in a data warehouse automation tool. It can even be decided at runtime using the hub name as a parameter for the hash key calculation (although I would not recommend the latter because this could have a serious performance impact).
  • a performance impact (if any) has only effect on one table.
  • the refactoring impact can be Low to Medium: changing ETL code only affects the hash key of one hub and so does the rehashing of business keys.

The disadvantages can be:

  • If it is a hub that is used in many links, the refactoring impact might still be Medium to High.

Row Level

And finally hash key collision mitigation at the row level. This can be done by adding extra tables to the data vault that store business keys that would produce a duplicate hash key; the table row contains an alternate hash key for the business key to be used. The advantage of this approach is:

  • There is no refactoring impact. When a hash key collision occurs, just add the new business key that causes the collision to the exception table with an alternate hash key, and restart the load process.

The disadvantage however will be:

  • It makes loading much more complicated: for every hash key computation a lookup in the exception table (which in most cases will be empty) needs to be done. This can also have a performance impact. I must admit I did not try it, but I would not dare to take the risk that this does not scale.

So my first thought was: no-go, no-go! In my opinion the influence on daily loads is just too big for a very small chance of a hash collision.
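For completeness, a minimal sketch of what such an exception table could look like (all names are made up; as said, I would not recommend this approach):

Row_level_exception_table.sql

CREATE TABLE [dv].[HashKeyException] (
    [HubName]          SYSNAME       NOT NULL, -- hub the exception applies to
    [BusinessKey]      NVARCHAR(200) NOT NULL, -- the colliding business key
    [AlternateHashKey] CHAR(32)      NOT NULL, -- hash key to use instead of the computed one
    CONSTRAINT [PK_HashKeyException] PRIMARY KEY ([HubName], [BusinessKey])
);
-- Every hash key computation would first look up the business key in this (usually empty)
-- table and only fall back to the regular hash computation when no row is found.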

Rehashing methods

Picture credits: © Can Stock Photo / alexskp

Not discussed yet: how would you rehash? In the Data Vault 2.0 training, upgrading the hashing algorithm when possible (e.g. MD5 to SHA-1) was mentioned. This is a valid choice, but it has an impact on the length of your hash key (in the example from 16 to 20 bytes when stored as binary), and this introduces an exception in your table design patterns. This does not have my preference. Euh .. is there something else I can do? Yes! Dan Linstedt suggests reversing the business key value before hashing, and this can work. The length of the hash key stays the same, and all that needs to be changed in the ETL code is the hash key computation for this hub (also when used in links!). But the possibilities are almost endless! Using the business key 1234AB as an example, besides reversing you could also (a small T-SQL sketch of a changed hash key computation follows after the list):

  • duplicate the business key before hashing (1234AB1234AB)
  • append the first character (1234AB1)
  • prepend or append a fixed string (1234AB#, or #1234AB)
  • etcetera etcetera. Just use your imagination and test what works best in terms of performance, code simplicity and uniqueness of the hash value.
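A small sketch of what such a table-level change could look like in T-SQL, using the “reverse the business key” method; the MD5/CHAR(32) convention and the table and column names are assumptions for this example:

Rehash_one_hub.sql

-- Regular hash key computation, as used for all hubs:
SELECT
    UPPER(CONVERT(CHAR(32),
        HASHBYTES('MD5', UPPER(LTRIM(RTRIM(CONVERT(NVARCHAR(50), [CustomerNumber])))))
    , 2)) AS [CustomerHashKey]
FROM [stg].[Customer];

-- Rehashed computation for the one hub that had a collision: reverse the business key first.
-- The same change must be applied everywhere this hub's hash key is computed (links included!).
SELECT
    UPPER(CONVERT(CHAR(32),
        HASHBYTES('MD5', REVERSE(UPPER(LTRIM(RTRIM(CONVERT(NVARCHAR(50), [CustomerNumber]))))))
    , 2)) AS [CustomerHashKey]
FROM [stg].[Customer];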

Conclusion / Wrap up

In this post I discussed how you can prepare for the unlikely event of a hash collision in your Data Vault 2.0 data warehouse.

Some things to keep in mind:

  1. Make sure you will detect a hash key collision when it happens.
  2. Make a decision on which level you want to deal with a hash key collision when it happens. Your options are: System, Table and Row level. I have a preference for Table level, but System level is also a valid choice. I think Row level is not a wise choice, for reasons described above.
  3. Know how to rehash the business keys for a hub that had a hash key collision: “do something” (as described above) with the business key before hashing, until the hash values for the hub are unique and the performance is acceptable.

Interested in Data Vault? Make sure you check out the website of Data Vault Alliance!

© 2017-2023 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Using a Persistent Staging Area: What, Why, and How

Change log

Date | Changes
21 February, 2017 | Due to a very useful discussion on LinkedIn with Dan Linstedt, I have made some changes in this article to emphasize that a PSA does not have a permanent place in any EDW (with the exception, as Dan suggests, of the Hadoop NoSQL platform used for slower storage mechanisms).

Today I want to discuss the Persistent Staging Area in an Enterprise Data Warehouse, or PSA for short. First I want to give credit to Roelant Vos, who has blogged about this subject before. He triggered me to go further with this concept to see “what’s in it for me” and to do a proof of concept using SQL Server 2016.

What is a Persistent Staging Area or PSA?

As an introduction, I want to tell what a (non-persistent) Staging Area is. Some of you will already know, so be it. A Staging Area is a “landing zone” for data flowing into a data warehouse environment. The data in a Staging Area is only kept there until it is successfully loaded into the data warehouse. Between two loads, all staging tables are made empty again (or dropped and recreated before the next load).
A Persistent Staging Area on the other hand, is a Staging Area that is not wiped out between loads: it contains full history from the source(s) that deliver data to it.


Picture credits: © creativecommonsstockphotos | Dreamstime Stock Photos
The “Staging Area” for passengers at an airport ..

Why would you want to make use of a PSA, and what are its pros and cons?

A PSA makes it possible to reload the next layer in the Data Warehouse (for example a Data Vault) when requirements change during development of the Data Vault / EDW without going back to the original source(s). The original source(s) might not contain all history anymore (e.g. because it is an OLTP system or a web service returning only current data) or it is cumbersome to do such a reload because the source data is delivered in large numbers of flat files.

As you might know, for data warehouses with (semi)structured data I always preach a multi-tier architecture, including a Data Vault for storing full history of unmodified source data and one or more dimensional models (virtual if performance allows, otherwise physical) as data feeds for cubes, reports, dashboards and BI tools.

So, I hear you think, in a Data Vault you already have full history, why would you need a PSA then? The answer is simple: because building a Data Vault takes time. You do not load all data on day 1; maybe there are uncertainties about the model, and most likely your resources are restricted. So you build the Data Vault in increments. But you can start capturing as much source data as you can! When Data Warehouse Automation tools are used, this is fairly easy to build. So even though you do not take the data further into your data warehouse than the (Persistent) Staging Area, you already start collecting your history of data.

Note: a different reason to use a PSA could be that your source systems contain high data volumes and are geographically spread: you could then create a PSA on premise near the source system, and export only delta data from it – each row would get an I, U or D indication for Insert, Update and Delete – to massively reduce data traffic between source and EDW. I will not discuss this use case scenario today, because then pros and cons would be completely different, and time travelling and full reload would not be applicable.

Using a PSA has its pros and cons. To mention a few:

Pros:

  • You collect data, even for those parts of the data warehouse you have not properly designed and developed yet.
  • You can clear the Data Vault, or even change the model completely (this can happen, for instance when business concepts initially were not well understood), create new ETL Code to load data to the modified Data Vault and do a replay of loading full history of the data captured in the PSA.
  • It is important that the PSA is implemented with changed rows only. If done so, it does not have excessive storage needs. So like a Data Vault Satellite, a new row for a table in the PSA is only added when something has changed for the primary (surrogate or natural) key. A start and end date/time can be used to implement this.
  • A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers. A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers! This was written twice on purpose! Do not forget! It is useful during building the EDW, but must be disarmed when the EDW has gone into production and has reached a maintenance phase. A good way to make sure this happens is to plan a decommission date and stick to the plan (can also be a plan like “after sources X, Y and Z are connected and are data feeds in production for 3 months”). Do not keep the PSA around for no good reason.
    In this way, most of the cons below are very likely to be not applicable!

Cons:

  • You can argue that you are in fact building a second data warehouse.
  • When sources change over time, your PSA might not reflect the current structure of the source system(s) anymore (e.g. a column is removed in the source system but you leave it in the PSA so that you can add new data with the column empty but keep the data in this column for older rows). This might be confusing.
  • When source systems are not used anymore (for instance when accounting software is replaced by a similar product from a different vendor) you need to add tables of the new system to the PSA, while you also keep the tables from the old system, which will not receive new data anymore. Now you need to keep the ETL code to load from the “old system’s” PSA tables, in case a full reload is needed, and you also need to develop code to get data from the “new system’s” PSA, and you might need logic for which point in time which tables should be loaded. As you can understand, the maintenance of your PSA can now start to become a heavy burden. So going back to the pros: make sure your PSA is optional and can be disarmed when the maintenance efforts no longer justify the benefits.

How could you implement a PSA?

To give an idea how this fits in my favourite architecture, I’ve added the PSA to it.
For a normal load the PSA is just loaded from the Staging Area (preferably in parallel with loading the Enterprise Data Warehouse from the Staging Area).
A Virtual Staging Access Layer (for example implemented with views) ensures that ETL code for loading the Enterprise Data Warehouse can switch easily between the Staging Area and the PSA for loading data. An example of how to implement this follows below.


How the PSA fits in the EDW architecture

For the next schema I owe you an explanation, because I know it’s a bit weird to mix architecture with processing in one schema, but in this case I thought it made sense and could clarify how to reload the EDW from the PSA: no sources and Staging Area are involved, but instead a time travelling mechanism queries the data from the PSA through the Virtual Staging Access Layer, to replay all loads that have been done so far, but with modified ETL Code to load data into a modified or extended Data Vault.


Reloading the DWH from the Persistent Staging Area

To set up time travelling, you could create a list of load sessions and their start date/times, then subtract a small time fraction and use that value as the Point-In-Time date/time value to retrieve all rows from the previous load.
Note: For the DATETIME2 columns, I use a precision of centiseconds, so DATETIME2(2). For justification see one of my older blog posts: Stop being so precise! and more about using Load(end)dates (Datavault Series). If needed you can use a higher precision for your data warehouse.

In a table:

Load no | LoadStartDatetime | Point-In-Time value
1 | 2017-02-01 02:00:01.74 | 2017-02-02 02:00:00.88
2 | 2017-02-02 02:00:00.89 | 2017-02-03 02:00:02.11
3 (last load so far) | 2017-02-03 02:00:02.12 | 9999-12-31 23:59:59.98 (! not 99 !)

The Time Travel icon with the arrows in the picture above refers to the process that loops through all Start date/time values chronologically and triggers execution of the load process from PSA to EDW for every Point-In-Time value.
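A small T-SQL sketch of how these Point-In-Time values could be derived from a (hypothetical) load session log table, using LEAD to peek at the next load and subtracting one centisecond, matching the DATETIME2(2) precision:

Derive_point_in_time_values.sql

SELECT
    [LoadNo],
    [LoadStartDatetime],
    ISNULL(
        DATEADD(MILLISECOND, -10,
            LEAD([LoadStartDatetime]) OVER (ORDER BY [LoadStartDatetime])),
        CONVERT(DATETIME2(2), '9999-12-31 23:59:59.98')
    ) AS [PointInTimeValue]
FROM [stg].[LoadSession]; -- hypothetical table with one row per load session so far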

SQL Server 2016 implementation: how could you implement loading the PSA from the Staging Area?

This logic is very similar to loading a Data Vault Satellite. For each staging table you should create a “History” version in the PSA database, with two extra columns, EffectiveStartDts and EffectiveEndDts, and a primary key defined as the primary key of the staging table plus the EffectiveStartDts. Most concepts of loading a satellite can be applied: you add rows that are new, insert new versions of rows that have changed, and mark absent rows (missing in the source) as such. For the PSA I would suggest just updating the end date/time so that such a row will simply not be in the selection when a later point-in-time is used. Also there is no need to create a separate “hub” table for each staging table.
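A minimal sketch of such a history table for the Customer example that is used further below; the column list is shortened and the exact names are assumptions:

CREATE_TABLE_psa_CustomerHistory.sql

CREATE TABLE [psa].[CustomerHistory] (
    [CustomerID]        INT          NOT NULL, -- primary key of the staging table
    [EffectiveStartDts] DATETIME2(2) NOT NULL,
    [EffectiveEndDts]   DATETIME2(2) NOT NULL, -- e.g. 9999-12-31 23:59:59.98 for the current row
    [FirstName]         NVARCHAR(50) NULL,
    [SurName]           NVARCHAR(50) NULL,
    -- ... the other columns of the corresponding staging table ...
    [RowHash]           CHAR(32)     NULL,     -- used to detect changed rows, like a satellite hash diff
    CONSTRAINT [PK_CustomerHistory] PRIMARY KEY ([CustomerID], [EffectiveStartDts])
);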

SQL Server 2016 implementation: how could you implement the historic reload from the PSA?

In a previous blog post Isolate cross database “databasename” dependencies using synonyms and schemas I already advocated the concept of putting each tier of the EDW in its own database schema, and use synonyms to achieve interconnectivity between objects in different databases. In the rest of this article I will assume that you understand this concept (if not, you could read this blog post first).

With some imagination, we can use a modified version of this concept by creating the Virtual Staging Access Layer in the PSA database. This virtual layer is needed for loading data from the PSA instead of the Staging Area.
Suppose the Staging Database has a database schema stg and that schema contains all tables that can be loaded into the EDW / Raw Data Vault.
We can do a similar thing in the PSA database: create a stg schema, but instead of tables this schema contains views with the same names and columns as the tables in the Staging Database! These views select the rows valid at a certain point in time.

The challenge to retrieve only the valid rows at a point in time is an easy one: To the PSA database we add a table [stg].[PointInTime] with only one row and a [CurrentPointInTime] column of datatype DATETIME2. Now we can build a SELECT statement on the PSA table (same as the staging table, but with start- and end date/time stamps to hold historic changed and deleted rows) with a CROSS JOIN on the PointInTime table and a WHERE clause to select only the rows valid at the given point in time.
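The point-in-time table itself can be as simple as this (a sketch; a single row that the time travelling mechanism keeps updating):

CREATE_TABLE_stg_PointInTime.sql

CREATE TABLE [stg].[PointInTime] (
    [CurrentPointInTime] DATETIME2(2) NOT NULL
);
-- Seed it with one row; during normal loads it can simply point to the far future.
INSERT INTO [stg].[PointInTime] ([CurrentPointInTime])
VALUES (CONVERT(DATETIME2(2), '9999-12-31 23:59:59.98'));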

An example of how such a view could look:

CREATE_VIEW_stg_Customer.sql

CREATE VIEW [stg].[Customer]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 15-FEB-2017
Description: View that has exactly the same structure as a table with the same name
             in the Staging Database.
             This view can be used for full reloads of all data in the PSA Database.
==========================================================================================
*/

SELECT
    [CustomerID],
    [FirstName],
    [Initials],
    [MiddleName],
    [SurName],
    [DateOfBirth],
    [Gender],
    [SocialSecurityNumber],
    [Address],
    [PostalCode],
    [Residence],
    [StateOrProvince],
    [Country],
    [RowHash]
FROM
    [psa].[CustomerHistory] hist
CROSS JOIN
    [stg].[PointInTime] pit
WHERE
    hist.EffectiveStartDts <= pit.CurrentPointInTime
    AND hist.EffectiveEndDts >= pit.CurrentPointInTime -- When INCLUSIVE enddating is used.
    -- AND hist.EffectiveEndDts > pit.CurrentPointInTime -- When EXCLUSIVE enddating is used.
GO

The next steps are:

  1. Make sure your ETL Code to load the data has a mechanism to skip loading data from the sources to the Staging Area and from the Staging Area to the PSA when the PSA is used instead of the normal Staging Area. For instance you could add a view [stg].[IsPSA] to both the Staging database and the PSA database, returning a single row with a single value “False” or “True”, respectively. In your package you could retrieve this value using the Staging Connection Manager, and change the flow of your package depending on whether the Staging database or the PSA database is connected.
  2. Build a time travelling mechanism (could be a FullReloadMaster.dtsx package) that updates the stg.PointInTime table for every point-in-time date/time and after every update starts the ETL process (e.g. a Master.dtsx SSIS package) with a changed connection string, connected to the PSA instead of the “normal” Staging database! A minimal T-SQL sketch of this loop follows below.
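A minimal T-SQL sketch of what that time travelling loop boils down to; in practice this logic would live in the FullReloadMaster.dtsx package, and [stg].[LoadSession] is the hypothetical load session log used earlier in this post:

Time_travel_loop.sql

DECLARE @PointInTime DATETIME2(2);

DECLARE pit_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT
        ISNULL(
            DATEADD(MILLISECOND, -10,
                LEAD([LoadStartDatetime]) OVER (ORDER BY [LoadStartDatetime])),
            CONVERT(DATETIME2(2), '9999-12-31 23:59:59.98')
        ) AS [PointInTimeValue]
    FROM [stg].[LoadSession]
    ORDER BY [LoadStartDatetime];

OPEN pit_cursor;
FETCH NEXT FROM pit_cursor INTO @PointInTime;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Set the point in time for the virtual staging views ..
    UPDATE [stg].[PointInTime] SET [CurrentPointInTime] = @PointInTime;

    -- .. and here FullReloadMaster.dtsx would start Master.dtsx, with the Staging
    -- connection string replaced by a connection to the PSA database.

    FETCH NEXT FROM pit_cursor INTO @PointInTime;
END
CLOSE pit_cursor;
DEALLOCATE pit_cursor;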

Here is a possible implementation of the [stg].[IsPSA] view, based on the fact that the PSA database will have _PSA_ in the database name and the staging database will not. This allows you to use the same version of the view for both the Staging and the PSA database.

CREATE_VIEW_stg_IsPSA.sql

CREATE VIEW [stg].[IsPSA]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 10-FEB-2017
Description: View that returns one row with a bit column [IsPSA] that indicates
             if the current database is the PSA or the STG.
==========================================================================================
*/

SELECT
  CONVERT(BIT,
    CASE
    WHEN CHARINDEX('_PSA_', DB_NAME()) > 0 THEN 1
    ELSE 0
    END) AS [IsPSA]
GO


Example of how a FullReloadMaster.dtsx package could look. It is important that the Staging database connection string is a parameter of Master.dtsx and is passed in by FullReloadMaster.dtsx, but now it is a connection to the PSA database.

Now the next challenge is how to get this working. If you use an SSIS data flow with an OLE DB Source (and a destination), it is not so difficult: you can just change the connection string of the OLE DB Source. I have done a proof of concept and it worked! You do not have to redeploy SSIS packages. If views in the PSA database are used with exactly the same structure as the tables in the Staging database, the SSIS package agrees with the “metadata”, and “VS_NEEDSNEWMETADATA” or “VS_ISBROKEN” errors do not occur. Of course you could also make views for reading the Staging tables in the Staging database. This would create an additional layer of abstraction, because then in both the Staging Area and the PSA you read the data from views, and you could, when needed, implement views whose structure differs from the underlying tables.

If you use SQL code (e.g. stored procedures) this will be a little more work, but still doable. You can follow one of the following strategies for storing the SQL Code:

  • a push strategy: SQL code is in the PSA database, and should then also be in the Staging database. From there, you push the data to the raw Data Vault.
  • a pull strategy: SQL code is in the raw Data Vault. From there you pull the data from the Staging Area or PSA. You will need to switch via synonyms: drop the stg synonyms and recreate them pointing to the PSA database before a historic reload is done (see the sketch after this list). Afterwards, the reverse action needs to take place to let them point to the Staging database again.
  • an outsider strategy: SQL code is in a separate database, and has synonyms for both staging and raw Data Vault tables and views. Now the same switch via synonyms method can be used.
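A minimal sketch of the synonym switch used by the pull and outsider strategies; database and object names are examples only:

Switch_stg_synonyms.sql

-- Before the historic reload: let the stg synonyms point to the views in the PSA database.
DROP SYNONYM [stg].[Customer];
CREATE SYNONYM [stg].[Customer] FOR [MyCompany_PSA].[stg].[Customer];

-- ... run the historic (time travelling) reload ...

-- Afterwards: point the synonyms back to the tables in the Staging database.
DROP SYNONYM [stg].[Customer];
CREATE SYNONYM [stg].[Customer] FOR [MyCompany_STG].[stg].[Customer];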

A few considerations:

  • a push strategy forces you to install your SQL code twice: in the Staging and the PSA database. Also a Staging database might not feel like the “right place” for stored procedures.
  • a pull strategy and an outsider strategy force you to change synonyms, something that normally only has to be done during a deployment, but now before a full reload (and afterwards the reverse action).

I have a slight preference for switching via synonyms, assuming that a full reload will not happen often.
Still using the push strategy could be a valid choice too.

Conclusion / Wrap up

In this post you could read more about the concept of a Persistent Staging Area. Although I have not used it in production environments yet, I think in some projects it can be very useful.
Important to keep in mind:

  • Make sure that your data warehouse can survive if the PSA is switched off. So do not expose it to end users or make it a mandatory part of your ETL Code.
  • Plan a decommission moment when the PSA is created and stick to the plan. Do not keep the PSA around for no good reason.
  • Use code generation tools to update the PSA from the staging area to limit the development effort to set it up.
  • Virtualize the read access to the staging tables, for instance by using views.

If you stick to these “rules” I think a PSA can prove its value for your Enterprise Data Warehouse!

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Free download: Blog Yearbook 2016

I have made a compilation of all my blog posts of the past year, and it was more work than you would think.
Especially because it also contains a non-published new article.

After filling in the form you can download the e-book for free.

    I promise you the following:

    • I will not give or sell your email address to any third party.
    • I will not spam you.
    • If I start with a newsletter, you can easily unsubscribe or change the frequency.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Zeros, bloody zeros! (Data Vault Series)

    Introduction

    I must admit I have a weakness for British humour.
    When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
    For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
    It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.

    Mr John Cleese

    But, obviously, this post is not about meetings but about zeros.

    Problem

    I can be short about that: leading zeros in business key values.
    For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
    This would not be a problem if it would always be delivered exactly like that. But to be honest, you cannot and will not know that upfront. Even if this is the case now, it might not be in the future.
    When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem.
    Because every business key gets a different row in the hub, and this customer now exists multiple times!

    (No acting here, this is how I look sometimes)

    Solution

    If you are familiar with Data Vault, you might already think of same-as-links to solve this.
    But I think the solution should be implemented earlier, to avoid having multiple hub rows.
    Simply always remove leading zeros when the source column is (part of) a business key (either primary or foreign key) and appears to be a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty impossible that 001806 and 1806 would refer to two different customers.
    Unless, of course, they would come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when the leading zeros are dropped, satellite rows of different customers (in different source systems) could end up connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance CRM.1806 and ERP.1806.
    In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.

    How to implement the solution

    There are many ways to remove leading zeros. When I was searching for this I had two requirements:

    • No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
    • No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.

    After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, my credits for his work), but appeared to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
    So I had to add some logic to it: a CASE expression in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) to the SSIS expression.
    Furthermore I came up with a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
    For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
    So .. this is what it has become:

    T-SQL

    Remove_leading_zeros.sql

    SELECT
        example.[id_with_leading_zeros],
       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
       END AS [id_without_zeros_method1],

       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE REPLACE(REPLACE(LTRIM(REPLACE(-- Robin Hames' method
                REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
                , '0', ' ') -- replace '0' with ' '
                ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
                , ' ', '0') -- change ' ' back to '0'
                , '!#!', ' ') -- change '!#!' back to ' '
       END AS [id_without_zeros_method2]
    FROM
        (
        SELECT
            TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
        FROM
            master.sys.objects
        UNION
        SELECT N' 00000 '
        UNION
        SELECT N'00'
        UNION
        SELECT N' '
        UNION
        SELECT ' 0099990 A '
        UNION
        SELECT '-5550'
        ) example

    SSIS Expression (can be used in Derived Column)

    (LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

    In a Derived Column Transformation this looks for instance like this

    Conclusion / Wrap up

    In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
    This post also contains different ways to remove leading zeros, two for T-SQL and one for a SSIS expression.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Stop being so precise! and more about using Load(end)dates (Datavault Series)

    Introduction

    My web site was offline a few days ago. Sorry for that.
    I got this email from my webhosting provider: “A new message or response with subject: User *** has used up 153% of their bandwidth ..”.
    Shortly after they suspended my account. It was suspended for about 8 hours. I have upgraded the network traffic limit.
    All right, when all bandwidth was consumed .. then somebody must be reading my blog. Thank you for that.

    This week’s post is about the LoadDate and LoadEndDate.
    Actually there are two things to be nerdy on:

    1. End dating without gaps
    2. Getting the best value for bytes on DATETIME2 precision

    By the way, these topics apply to SQL Server, the examples are made using SQL Server 2014.

    1. End dating without gaps

    How end dating works

    In a Data Vault Satellite, different subsequent versions of a row in the source system are distinguished through a LoadDate (which usually contains a date/time value). So the HashKey of the corresponding Hub Row plus the LoadDate are the primary key for the satellite.
    However to get the right version out when querying for a specific date, this requires a relatively difficult query, and this can also have a negative impact on performance.
    This is why the LoadEndDate is an optional column of a Satellite, to make querying (read: getting data out) easier and better performing.

    Important to remember that the LoadEndDate is not the date/time the load(process) ended, but the date/time the row was replaced by a newer row for the same business entity in the hub. What’s in a name, if I had to choose I would just call it EndDate, but LoadEndDate is the standard, and once you know it, it is not a problem.

    There are two ways to determine the value for this LoadEndDate:

    1. Exclusive: the LoadEndDate is the LoadDate of the new row that replaces this one, minus a small time fraction. Using the exclusive method enables the use of the BETWEEN keyword in Transact-SQL to get the right row out, example:

      WHERE @SnapshotDate BETWEEN [LoadDate] AND [LoadEndDate]

      This is the method that is used in most Data Vault examples.

    2. Inclusive: the LoadEndDate is exactly equal to the LoadDate of the new row that replaces this one. This requires no computation when updating the LoadEndDate, but disqualifies the BETWEEN keyword, for getting data out you need to do something like:

      WHERE @SnapshotDate >= [LoadDate] AND @SnapshotDate < [LoadEndDate]

    Problem

    There is a (theoretical) problem when using the exclusive method.
    If you subtract too much, there will be a tiny time gap between the LoadEndDate and the LoadDate of the subsequent row.
    I fully admit this is a theoretical problem, because the chances that you require the row valid exactly on this moment are astronomically small.
    (There is this not so hypothetical situation that your load starts at midnight, the LoadDate is set before the first second of the new day has passed, and you subtract a second.
    Then you do not find a record when using the exact date (without time fraction) to get the row out. But you need a lot of bad luck for this to happen).

    Solution

    Still if you are a purist you want to do it right, certainly because the solution is so simple.
    If you make the “grain” of subtraction from the LoadDate equal to the precision of the LoadDate, there is no gap.
    E.g.
    – Subtract 1 second from DATETIME2(0)
    – Subtract 1 centisecond from DATETIME2(2) ***TIP***
    – Subtract 1 millisecond from DATETIME2(3)
    – Subtract 1 microsecond from DATETIME2(6)
    – Subtract 100 nanoseconds from DATETIME2(7)

    Examples in T-SQL:

    Exclusive enddate without gaps.sql

    DECLARE @dt20 DATETIME2(0) = '2016-04-13 20:52:17'
    DECLARE @dt22 DATETIME2(2) = '2016-04-13 20:52:17.00'
    DECLARE @dt23 DATETIME2(3) = '2016-04-13 20:52:17.000'
    DECLARE @dt26 DATETIME2(6) = '2016-04-13 20:52:17.000000'
    DECLARE @dt27 DATETIME2(7) = '2016-04-13 20:52:17.0000000'

    SELECT @dt20, DATEADD(SECOND, -1, @dt20)
    SELECT @dt22, DATEADD(MILLISECOND, -10, @dt22)
    SELECT @dt23, DATEADD(MILLISECOND, -1, @dt23)
    SELECT @dt26, DATEADD(MICROSECOND, -1, @dt26)
    SELECT @dt27, DATEADD(NANOSECOND, -100, @dt27)

    2. Getting the best value for bytes on DATETIME2 precision

    This is about a “smart” precision to choose for your LoadDate and LoadEndDate columns. Unlike the older DATETIME datatype, DATETIME2 uses less or more bytes for storage depending on the precision you specify.
    The Storage size is: 6 bytes for precisions less than 3; 7 bytes for precisions 3 and 4. All other precisions require 8 bytes. (I shamelessly pasted this from MSDN).
    In the book “Building a scalable data warehouse with Data Vault 2.0” a lot of examples use DATETIME2(7). You can ask yourself why. Why do you need to be precise to 100 nanoseconds? I dare to say that in most cases (except when loading realtime or near-realtime streaming data into your Data Vault), seconds would be precise enough.
    But looking back a few lines, to the storage sizes, DATETIME2(0) uses 6 bytes, but DATETIME2(2) ALSO uses 6 bytes. So with the latter you get a higher precision for the same storage size. And for daily loads, the centiseconds precision that DATETIME2(2) is providing, is really precise enough, believe me.
    So DATETIME2(2) gives us the best value for bytes!
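    A quick way to check this for yourself; as far as I know DATALENGTH returns the storage size of the value, but run it on your own server to verify:

    Datetime2_storage_size.sql

    DECLARE @d2 DATETIME2(2) = SYSDATETIME()
    DECLARE @d7 DATETIME2(7) = SYSDATETIME()

    SELECT DATALENGTH(@d2) AS [Bytes_datetime2_2] -- expected: 6
         , DATALENGTH(@d7) AS [Bytes_datetime2_7] -- expected: 8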


    Why would your LoadDates and LoadEndDates be more precise than needed ?..


    In this way you can save two bytes, compared with a DATETIME2(7), which uses 8 bytes and is used in the book. Because the LoadDate is in the primary key of satellite tables, also the primary key index will be smaller.
    In PIT Tables, which usually have multiple LoadDates, the storage gain is even more.
    And what counts for money, “who does not respect a small gain, isn’t worth a big one”, also counts for the small performance gains you can get by doing this kind of tweak. Never forget that! Multiple smaller tweaks might add up to a noticeable performance improvement!

    Conclusion / Wrap up

    In this blog post you have read about the two methods for enddating: Inclusive and Exclusive.
    Also you could read how you can use the Exclusive method for end dating in Data Vault satellites, without having gaps in the timeline.
    Finally I discussed which precision will in most cases be good enough for LoadDate and LoadEndDate columns.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Hash Diff calculation with SQL Server (Datavault Series)

    Updates

    Change log

    Date | Changes
    31 January, 2017 | Really good news! The 8k limitation on HASHBYTES has been removed in SQL Server 2016!

    Today I would like to discuss Hash Diff calculation using SQL Server 2014, from my own experience.

    My purpose is NOT to be complete and/or to replicate book “Building a scalable data warehouse with Data Vault 2.0”. For full details please read the book.

    However I will give a short intro for those who hardly know what I am talking about.

    Introduction

    Hash differences or Hash diffs

    A hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s).

    A hash difference column is used to detect changes in a new version of a satellite’s row for a particular business key, and can be used in a comparison instead of doing a comparison on every individual descriptive attribute.
    The main reason to use hash diff columns is to improve performance when loading new satellite rows (of which one or more attributes have changed).
    The more columns the satellite has, the higher the performance gain will be.
    To support data warehouse automation patterns, hash diffs can also be used for satellites with only a few or even one descriptive attribute.

    I think a different name, like RowHash, RecordHash, DataHash or AttributeValuesHash, would better describe the contents of the column, because the value itself is not a difference, but that’s a different (relatively unimportant) discussion.

    Hash keys

    This article is not about Hash keys, however to be complete on hashing as far as Data Vault 2.0 is concerned, I will give a short introduction on that as well.

    Hash keys replace sequence numbers (generated by the database engine) of the Data Vault 1.0 standard. They support geographically distributed data warehouses, as well as integration with big data environments like Hadoop.

    A hash key is a hash value of the business key column(s) used in a Hub or Link. The advantage is that it is predictable, which enables parallel loading and a multiplatform / distributed datawarehouse architecture. For instance a customer with code NL123 will have the same hash key on all distributed parts of the data warehouse. Because the hash key always has the same length and is stored as a (fixed length) CHAR column, performance is usually better than when the business key column(s) are directly used as primary key.

    How to calculate a hash difference in T-SQL

    For full details about Endianness, differences of hash functions between different platforms, how to cope with new columns and other technical details I refer to the book mentioned above. If you want to work with Data Vault 2.0 you will need the book anyway.

    As said, a hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s). To do the hashing, we have to put all the column values together in one nvarchar value, and then apply a hashing function on it. This is the short version and the base to understand it.


    Between the different column values you have to use a delimiter, preferably one that does not occur in the column values. Otherwise you have to “escape” the delimiter in column values.

    If you want to do a case insensitive compare, you should convert all values to either upper- or lowercase. But this means that if in the source system a customer name was “hans michiels” and is changed to “Hans Michiels”, you will not detect the change and therefore do not store it in the Data Vault. So it depends on the contents of the satellite and your business view on it if this is desired behaviour or not.

    The hash diff calculation results in a binary(16) value, when using the hashing algorithm MD5 (which is recommended). To enable better cross-platform- and tools support, this is then converted to a hexadecimal string of 32 characters, stored as a CHAR(32) column.

    In all cases you must also convert the hexadecimal hash value to all UPPER- or all lowercase, because not all convert functions give the same output, some lower- and others uppercase. To compare values that were made by different hash functions (read: on different systems), making it all uppercase (or lowercase) makes sure the values do not differ in casing only (which could lead to a false negative answer on the “are the rows the same” question).

    I understand this is all dry and boring nerd stuff if you do not see an example, so .. a simple example:

    Calculating_a_hash_diff.sql

    SELECT
          [bla]
        , UPPER(
            CONVERT(CHAR(32),
              HASHBYTES('MD5',
                UPPER(
                  CONCAT( ISNULL(CONVERT(NVARCHAR, [BusinessKeyOfHub]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), [Column01]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), [Column02]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), [Column03]), N'')
                  ) -- END CONCAT
                ) -- END UPPER
              ) -- END HASHBYTES
            , 2) -- END CONVERT
          ) -- END UPPER
    FROM [stg].[bla]

    So far nothing shocking, but there are a few pitfalls. If you keep them in mind, you will be an excellent hasher soon. 😉

    CONVERT(NVARCHAR without specifying the size

    Do this only for values of 30 or fewer characters, otherwise the value will be cut off, which could lead to wrong hash diff values that do not change when the attribute value changes. It is for instance okay for integers, decimals (unless the precision is more than 28, mind the decimal separator and possible negative sign), datetime stamps, and bits. When converting (n)varchar the safest thing to do is follow the original size; for unique identifiers use 38 or 40.
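    A quick demonstration of this default length of 30 (the test value is made up):

    Convert_nvarchar_default_length.sql

    DECLARE @value NVARCHAR(50) = REPLICATE(N'A', 40)

    SELECT LEN(CONVERT(NVARCHAR, @value))     AS [LengthWithoutSize] -- 30: the value is cut off
         , LEN(CONVERT(NVARCHAR(50), @value)) AS [LengthWithSize]    -- 40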

    Beware of CONCAT

    Beware of concatenation of (N)VARCHAR(MAX), (N)TEXT and XML columns using the CONCAT function. It appears that only the first 8000 bytes of (N)VARCHAR(MAX) columns end up in the concatenated result.

    And even worse, HASHBYTES ..

    HASHBYTES only hashes an NVARCHAR value with a length of 4000 or less. Above that, the error “String or binary data would be truncated” occurs. This is really important to keep in mind.
    The 8000 bytes (or 4000 unicode characters) limitation on any string hashed with the HASHBYTES function has been removed in SQL Server 2016! Now you can hash larger (N)VARCHAR values!

    Cope with the HASHBYTES and CONCAT limitations

    4000 characters is quite something, right? Only with very wide satellites (a lot of columns) and/or usage of (N)VARCHAR(MAX), (N)TEXT or XML columns (hereafter called ‘unlimited length columns’) you can get into trouble.

    So, what is wise to do? Some measures you can take are:

    Make your satellites not too wide

    This is a best practice, not only from hashing function perspective. Split the satellites by source system, then by rate of change. Keep in mind that the business key column(s) and all satellite column values are converted to NVARCHAR and then (including all delimiters) may not exceed the 4000 characters if you want to use the HASHBYTES function.

    Size unlimited length columns according to the actual maximum length

    What I mean is, if for instance a Country name is stored in an unlimited length column in the source system, you know that a size of 100 will be long enough. When in doubt, just do some data profiling on the source to see what the maximum length actually is. Use that size plus a safety margin for the satellite column size. Problem gone.
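    For example (source object and column names are made up):

    Profile_max_length.sql

    SELECT MAX(LEN([CountryName])) AS [MaxActualLength]
    FROM [SourceStage].[dbo].[Country]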

    Handle true unlimited length columns

    Put true unlimited length columns (for instance free entry ‘memo’ fields or xml columns in source systems) in their own satellite! Otherwise if any other (small) attribute changes, the 4k memo is copied, even when it was not changed. This consumes storage. But that alone is not enough; as said, with HASHBYTES, above 8000 bytes (4000 characters when unicode) you are still f*c**d. Luckily there is a system function in the master database which uses the MD5 algorithm and takes a binary value as input. So to compute a MD5 hash value for a text column with unlimited length, you could do something like:

    Hash_for_memo.sql

    SELECT master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
    [MyHugeMemoField])) AS [HashDiff]

    But don’t make the mistake to use the function [sys].[fn_repl_hash_binary] for all hashing, because the performance of HASHBYTES is much better!
    On my machine HASHBYTES is three times faster than [sys].[fn_repl_hash_binary] on the same dataset.
    You can check for yourself using this script:

    Performance_comparison.sql

    USE [msdb]
    GO

    DECLARE @start DATETIME2(7)
    DECLARE @stop DATETIME2(7)
    DECLARE @elapsed1 BIGINT
    DECLARE @elapsed2 BIGINT
    --\-------------------------------------------------------------------------
    ---) TEST PERFORMANCE of HASHBYTES function.
    --/-------------------------------------------------------------------------
    SELECT @start = GETDATE();
    SELECT TOP 1000000
          UPPER(
            CONVERT(CHAR(32),
              HASHBYTES('MD5',
                --UPPER(
                  CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
                  ) -- END CONCAT
                --) -- END UPPER
              ) -- END HASHBYTES
            , 2) -- END CONVERT
          ) -- END UPPER
      FROM [sys].[objects] o1
      CROSS JOIN [sys].[objects] o2
      CROSS JOIN [sys].[objects] o3;

    SELECT @stop = GETDATE();
    SELECT @elapsed1 = DATEDIFF(MICROSECOND,@start, @stop);

    --\-------------------------------------------------------------------------
    ---) TEST PERFORMANCE of [master].[sys].[fn_repl_hash_binary] function.
    --/-------------------------------------------------------------------------
    SELECT @start = GETDATE();
    SELECT TOP 1000000
          UPPER(
            CONVERT(CHAR(32),
              master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
                --UPPER(
                  CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                        , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
                     ) -- END CONCAT
                --) -- END UPPER
              ) -- END fn_repl_hash_binary
              ) -- END CONVERT
            , 2) -- END CONVERT
          ) -- END UPPER
      FROM [sys].[objects] o1
      CROSS JOIN [sys].[objects] o2
      CROSS JOIN [sys].[objects] o3;

    SELECT @stop = GETDATE();
    SELECT @elapsed2 = DATEDIFF(MICROSECOND,@start, @stop);

    SELECT @elapsed1 AS [Elapsed_HashBytes]
    , @elapsed2 AS [Elapsed_fn_repl_hash_binary]
    , 'HASHBYTES is ' + CONVERT(VARCHAR(20), CONVERT(DECIMAL(19, 2), 1.0 * @elapsed2 / @elapsed1))
    + ' times faster than [fn_repl_hash_binary]' AS [Result]

    That’s about it.
    Just a few more things:

    • If you use a different hashing algorithm (e.g. SHA1) you cannot use the sys.fn_repl_hash_binary function, in that case you will have to seek your solution in a C# script task or CLR stored procedure, doing the hashing in .NET code.
    • Don’t use CONCAT if the satellite has more than one column like [MyHugeMemoField], or you can still run into problems.

    Conclusion / Wrap up

    I have explained in short why Hash diff columns are used in a Data Vault 2.0 satellite, and how these values can be calculated.

    After that, you have read about some limitations of the CONCAT and HASHBYTES function when used for calculating Hash diff values, and how to cope with these limitations.

    Credits
    My thanks go to Vincent Kuiper for reviewing the initial draft of this article.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Data Vault and Dimensional Modelling, a happy marriage! (Data Vault Series)

    We need ambassadors!

    I was at a one-day course recently and spoke to a professional working in the business intelligence domain.
    He said that he had just completed a two-week course for data warehouse architects.
    I asked him: what did they teach about Data Vault?
    His answer was shocking: “Not so much ..”

    If recently trained data warehouse architects are not taught to use Data Vault in a data warehouse architecture, there is definitely a lot of missionary work to do.
    A Data Vault can add so much value to your data warehouse solution.
    And more importantly, it is not a choice between a Kimball dimensional model OR a Linstedt Data Vault: use both and combine their strengths!

    Intended audience

    This is a strong call to everybody who reads this: share this post with at least one person you know who works in the business intelligence domain and uses star schemas (a.k.a. dimensional models) fed directly from source systems.

    What are the strengths of a Dimensional Model (a.k.a. Star Schema)?

    • Because the dimensional model has already been around for decades, a lot of front-end tools have excellent support for it.
    • A dimensional model is designed for reporting, slicing and dicing, for getting data out. That is what it is good at.

      This becomes clear from the following examples:

      • Query performance is usually better than that of other models, like Data Vault or a relational data warehouse, especially for aggregations.
      • A star model is an excellent source for cube databases, like SSAS, especially when using a multidimensional model.
      • Date selections and comparisons of periods (e.g. Q1 this year versus Q1 last year) are easier than in Data Vault or relational models.
    • It enables self-service BI. Give business (power) users (direct) access to a dimensional model, and there is a good chance that they can get data out in the format they want.

    What are the strengths of a Data Vault?

    • A Data Vault is meant to capture all the data all the time. Therefore it stores all data from the source systems, captures the changes in it, and keeps a full history of all changes (with the side note that multiple changes between two load cycles will be seen as one).
    • A Data Vault captures all data exactly as it is stored in the source systems, “the good, the bad and the ugly”, with no business rules applied to it (except, in some cases, hard business rules like data types). This makes all data auditable.
    • Data Vault 2.0 supports cross-platform integration with other systems (e.g. Hadoop) and decentralized data warehouses, for instance in different global regions.
    • With Data Vault 2.0, working agile is easy: in sprints, building the system in increments.
    • Data Vault has very good support for data integration: satellites hanging off the same hub can contain data from different source systems (see the sketch below).
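
    To make that integration point a bit more tangible, here is a minimal sketch with hypothetical table and column names (not taken from any example above): one hub with two satellites, one per source system.

    -- Sketch (hypothetical names): one hub, two satellites, one per source system.
    CREATE TABLE [raw].[H_Customer]
        ( [CustomerHashKey] CHAR(32) NOT NULL PRIMARY KEY -- e.g. MD5 hash of the business key, stored as hex
        , [CustomerBK] NVARCHAR(20) NOT NULL              -- the business key
        , [LoadDts] DATETIME2 NOT NULL
        , [RecordSource] NVARCHAR(100) NOT NULL
        );

    CREATE TABLE [raw].[S_Customer_CRM]                   -- describing context from the CRM system
        ( [CustomerHashKey] CHAR(32) NOT NULL
        , [LoadDts] DATETIME2 NOT NULL
        , [HashDiff] CHAR(32) NOT NULL
        , [CustomerName] NVARCHAR(100) NULL
        , [RecordSource] NVARCHAR(100) NOT NULL
        , PRIMARY KEY ([CustomerHashKey], [LoadDts])
        );

    CREATE TABLE [raw].[S_Customer_ERP]                   -- describing context from the ERP system
        ( [CustomerHashKey] CHAR(32) NOT NULL
        , [LoadDts] DATETIME2 NOT NULL
        , [HashDiff] CHAR(32) NOT NULL
        , [CreditLimit] DECIMAL(19, 2) NULL
        , [RecordSource] NVARCHAR(100) NOT NULL
        , PRIMARY KEY ([CustomerHashKey], [LoadDts])
        );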

    So? What are the strengths of a datawarehouse architecture using both?

    • You can change your mind! If your business rules change (and it happens!), just reload your dimensional model from the Data Vault with the new rules applied.
    • You have one version of the facts. If you only have multiple dimensional models, you do not.
    • Data history is not like a destroyed monument. With a Data Vault, you capture all data from the source systems, and all changes applied to it. Source systems usually reflect only the current state of the data, so without change-capturing central storage like a Data Vault, your history is gone forever!
    • Your dimensional model does not have to serve two purposes: data storage and presentation of information. These two tasks can be in conflict; for instance, there is currently no reporting need for all the data that you could grab from source systems, so you leave it out of your star schema. But if the question comes tomorrow, you have no data, no history!
    • You have full auditability. Because the (raw) Data Vault stores the data from source systems unmodified, every measure and every dimensional attribute can be traced back to the source it came from and when. This makes it possible to refute claims that the data is not correct. Maybe now it will become clear that the old data warehouse or reporting system has been lying for years!

    How can you use both?

    The answer is simple: by using a multi-tier architecture.
    Data from source systems is first extracted to a Staging area, before it is moved into the Enterprise Data Warehouse using a Data Vault Model (with or without some optional components).
    From there it will be distributed to dimensional models (star schemas) and cubes, and whilst the data is on its way, business rules can be applied.
    [Figure DV-S01E02: Multi-tier architecture using both Data Vault and Dimensional Modelling techniques.]
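
    To illustrate that last step, here is a minimal sketch (building on the hypothetical customer hub and satellite sketched earlier, plus a hypothetical [dim] schema) of a dimension that is fed from the Data Vault, with a simple business rule applied while the data is on its way:

    -- Sketch: a dimension view fed from the (hypothetical) customer hub and CRM satellite,
    -- applying a simple business rule on the way.
    CREATE VIEW [dim].[DimCustomer]
    AS
    SELECT hub.[CustomerHashKey] AS [CustomerKey]
         , hub.[CustomerBK]      AS [CustomerNumber]
         , ISNULL(sat.[CustomerName], N'Unknown') AS [CustomerName] -- business rule: no name known yet
    FROM [raw].[H_Customer] hub
    OUTER APPLY ( SELECT TOP 1 s.[CustomerName]        -- most recent satellite row only
                  FROM [raw].[S_Customer_CRM] s
                  WHERE s.[CustomerHashKey] = hub.[CustomerHashKey]
                  ORDER BY s.[LoadDts] DESC
                ) sat;

    If the business rules change, only this view (or the load process of the dimension) needs to change; the Data Vault underneath stays untouched.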

    Conclusion / Wrap up

    I have written this post to create more awareness about using both Data Vault and Dimensional Modelling or Star Schemas in a data warehouse architecture.
    I have listed strengths and benefits of Data Vault, Dimensional Modelling and of an architecture using both.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Isolate cross database “databasename” dependencies using synonyms and schemas (Data Vault Series)

    Intended audience

    This post could be interesting for professionals working on datawarehouse projects using the Microsoft SQL Server database engine. This post assumes that you have at least a basic understanding of Data Vault principles.

    Introduction

    Those who have read the book “Building a Scalable Data Warehouse with Data Vault 2.0” by Dan Linstedt and Michael Olschimke will know that database schemas are used in the examples in the book for the “raw” and “biz” (business) layers in the Data Vault.

    If you do not know this book: buy it. If you are the intended audience you should read it (no, I have no commercial interest in saying this), so check it out on Amazon.

    What is done in the book is putting tables or views in the database schemas “raw” and “biz”, respectively, making it transparent whether a satellite is virtual (and at least partially computed) or persisted in a physical table.

    Putting these data layers in different schemas is a great starting point for my interdatabase connectivity concept.

    What data layers can we have?

    • Staging area
    • Raw Data Vault
    • Business (Data) Vault
    • Metrics Vault
    • Error Vault
    • Information Marts

    A rule of thumb is that each layer resides in its own database, but multiple layers can also share one database. What should be avoided is spreading one layer over multiple databases. This does not apply to different Information Marts though, which should each be put in their own database.

    Whether to put layers together can depend, among other things, on backup requirements (the biz layer could for instance be restored from a database backup OR by reprocessing from the Raw Data Vault, which could lead to the decision to back up that database less frequently).

    A typical set up could be:

    • Staging Area – separate database, “stg” database schema.
    • Raw Data Vault – together with Business Vault, “raw” database schema.
    • Business Vault – together with Raw Vault, “biz” database schema.

    To keep the example simple I will leave the other data layers out for now.

    What is the problem?

    In short: SQL statements that copy data to a different database.

    When using data flows in SSIS, this problem does not occur.

    Why is it bad?

    Database names are prefixed in FROM clauses, causing a dependency on the database name.

    So what, I still don’t see the problem

    Scenarios where it can become a hassle to manage this include:

    • Running different environments on the same database instance (e.g. development and test), distinguishing the environments by database name. When deploying code or copying databases, all interdatabase dependencies must be changed.
    • Running different environments for multiple clients in a BIAAS (Business Intelligence As A Service) solution on the same database instance.

    So, let’s divide data layers using database schemas ..

    This is an example of the initial set up.

    [Figure Dv01E01_dbschema1: the initial set-up.]
    Please note that you should keep all objects (tables, views, stored procedures, UDFs) from one data layer together in one database.

    Now, to pull data from the “stg” to the “raw” layer, SQL statements could be like (simplified example):

    Insert.sql
    INSERT INTO [raw].[TableName]
    SELECT ..
    FROM [EDW_STG_TEST].[stg].[TableName]

    But, as you can see, the database name is prefixed before the Staging table, and this is a problem (at least it makes managing the solution more complex).

    How can we solve this?

    Managing interdatabase dependencies using synonyms

    First, a little explanation from MSDN:

    A synonym is a database object that serves the following purposes:

    • Provides an alternative name for another database object, referred to as the base object, that can exist on a local or remote server.
    • Provides a layer of abstraction that protects a client application from changes made to the name or location of the base object.

    Read more on MSDN.

    What if we had created a synonym for the staging table in the “EDW_DV_TEST” database, also in a “stg” schema in that database?

    Then we could omit the database name from the SQL statement:

    Insert.sql
    INSERT INTO [raw].[TableName]
    SELECT ..
    FROM [stg].[TableName]
    That is what we want!
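
    The synonym itself is created with a one-liner, run in the EDW_DV_TEST database (using the database, schema and table names from this example):

    CREATE SYNONYM [stg].[TableName] FOR [EDW_STG_TEST].[stg].[TableName];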

    In a picture:

    [Figure Dv01E01_dbschema2: the set-up with synonyms in the “stg” schema of the EDW_DV_TEST database.]

    The good news is that creating the synonyms can be automated with a script, so a change of the Staging database name only requires running a SQL script in EDW_DV_TEST to drop and recreate the synonyms for the staging database objects!

    Another advantage is that you could merge different layers into one database or split them into different databases. For example, all code will keep working if you move all Staging tables to the EDW_DV_TEST database, thereby replacing the synonyms already there in the stg schema with actual tables. All you would have to do is change the connection string for the processes that fill the staging tables.

    In a similar way, you could decide to separate the raw and biz layer into different databases.

    Script to create or change the synonyms

    DV-S01E01-Create_synonyms_for_objects_in_a_different_database.sql

    --\
    ---) Author: Hans Michiels
    ---) Creates synonyms for tables, views, user defined functions
    ---) and stored procedures in a different database.
    ---) All synonyms in the specified schema are dropped
    ---) prior to (re)creation.
    --/
    /*
    (c) Copyright 2016 - hansmichiels.com
     
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
     
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
     
    You should have received a copy of the GNU General Public License
    along with this program. If not, see http://www.gnu.org/licenses/.
    */

    USE [AdventureWorks2014] -- Replace with database you want to create the synonyms in.
    GO

    --\--------------------------------------------------------------------------
    ---) DECLARATIONS
    --/--------------------------------------------------------------------------

    DECLARE @SourceDatabase SYSNAME
    DECLARE @SourceSchema SYSNAME
    DECLARE @SynonymsSchema SYSNAME

    DECLARE @SynonymsPrefix NVARCHAR(128)
    DECLARE @SynonymName SYSNAME

    DECLARE @CurSchemaName SYSNAME
    DECLARE @CurObjectName SYSNAME
    DECLARE @CurObjectType CHAR(2)

    DECLARE @nsql NVARCHAR(4000)
    DECLARE @DebugPrint NVARCHAR(500)
    DECLARE @SimulationMode CHAR(1)

    --\--------------------------------------------------------------------------
    ---) CONFIGURATION: adjust the variable values to meet your needs.
    --/--------------------------------------------------------------------------

    SELECT @SourceDatabase = 'AW2014_Staging' -- Database in which objects exist
                                              -- for which synonyms must be created.
         , @SourceSchema = 'stg' -- Schema name of source objects.
         , @SynonymsSchema = 'stg' -- Schema name to put synonyms in.
         , @SynonymsPrefix = '' -- Optional Synonyms name prefix.
         , @SimulationMode = 'Y' -- Use Y if you only want the SQL statement in the
                                   -- output window without it being executed.

    --\--------------------------------------------------------------------------
    ---) INITIALIZATION
    --/--------------------------------------------------------------------------

    IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name = @SynonymsSchema)
    BEGIN
        SELECT @nsql ='CREATE SCHEMA ' + QUOTENAME(@SynonymsSchema) + ' AUTHORIZATION [dbo];';
        IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';
        PRINT @nsql;
        IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;
    END

    --\--------------------------------------------------------------------------
    ---) MAIN PART I: Drop all existing synonyms in the specified schema.
    --/--------------------------------------------------------------------------

    DECLARE ExistingObjCursor CURSOR LOCAL STATIC FOR
        SELECT syn.name AS [SynonymName]
        FROM sys.synonyms syn
        JOIN sys.schemas sch
          ON syn.schema_id = sch.schema_id
        WHERE sch.name = @SynonymsSchema
     
    IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';

    SET @DebugPrint = '-- Dropping all existing synonyms in schema ' + QUOTENAME(@SynonymsSchema)
    PRINT REPLICATE('-', LEN(@DebugPrint))
    PRINT @DebugPrint
    PRINT REPLICATE('-', LEN(@DebugPrint))

    OPEN ExistingObjCursor
    FETCH NEXT FROM ExistingObjCursor INTO @SynonymName

    WHILE @@FETCH_STATUS = 0
    BEGIN

        SET @nsql = 'IF OBJECT_ID(''' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName) + ''', ''SN'') IS NOT NULL'
              + ' DROP SYNONYM ' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName) + ';'

        PRINT @nsql;
        IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;

        FETCH NEXT FROM ExistingObjCursor INTO @SynonymName
    END

    CLOSE ExistingObjCursor
    DEALLOCATE ExistingObjCursor

    --\--------------------------------------------------------------------------
    ---) MAIN PART II: Recreate all synonyms in the specified schema.
    --/--------------------------------------------------------------------------

    IF OBJECT_ID('tempdb..#SourceObjects') IS NOT NULL BEGIN
        DROP TABLE #SourceObjects
    END

    CREATE TABLE #SourceObjects
        ( [SchemaName] SYSNAME NOT NULL
        , [ObjectName] SYSNAME NOT NULL
        , [ObjectType] CHAR(2) NOT NULL
        )

    -- Insert USER objects
    SELECT @nsql =
    'INSERT INTO #SourceObjects([SchemaName], [ObjectName], [ObjectType])'
    + ' SELECT sch.name AS [SchemaName]'
    + ' , obj.name AS [ObjectName]'
    + ' , obj.type AS [ObjectType]'
    + ' FROM [' + @SourceDatabase + '].sys.objects obj'
    + ' JOIN [' + @SourceDatabase + '].sys.schemas sch'
    + ' ON sch.schema_id = obj.schema_id '
    + ' WHERE sch.name = ''' + @SourceSchema + ''''
    + ' AND obj.type IN ( '
    + ' ''FN'' /* SQL_SCALAR_FUNCTION */ '
    + ' , ''P '' /* SQL_STORED_PROCEDURE */ '
    + ' , ''IF'' /* SQL_INLINE_TABLE_VALUED_FUNCTION */ '
    + ' , ''TF'' /* SQL_TABLE_VALUED_FUNCTION */ '
    + ' , ''U '' /* USER_TABLE */ '
    + ' , ''V '' /* VIEW */ '
    + ' )'
    + ' ORDER BY obj.type, sch.name, obj.name'

    PRINT '/*'
    EXECUTE sp_executesql @nsql;
    PRINT '*/'
    DECLARE ObjectCursor CURSOR LOCAL STATIC FOR
        SELECT [SchemaName], [ObjectName], [ObjectType]
        FROM #SourceObjects
     
    IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';

    SET @DebugPrint = '-- Recreate synonyms in schema ' + QUOTENAME(@SynonymsSchema)
    PRINT REPLICATE('-', LEN(@DebugPrint))
    PRINT @DebugPrint
    PRINT REPLICATE('-', LEN(@DebugPrint))

    OPEN ObjectCursor
    FETCH NEXT FROM ObjectCursor INTO @CurSchemaName, @CurObjectName, @CurObjectType

    WHILE @@FETCH_STATUS = 0
    BEGIN

        SELECT @SynonymName = @SynonymsPrefix + @CurObjectName
        
        SET @nsql = 'CREATE SYNONYM ' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName)
                  + ' FOR ' + QUOTENAME(@SourceDatabase) + '.' + QUOTENAME(@CurSchemaName) + '.' + QUOTENAME(@CurObjectName) + ';'

        PRINT @nsql;
        IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;

        FETCH NEXT FROM ObjectCursor INTO @CurSchemaName, @CurObjectName, @CurObjectType
    END

    CLOSE ObjectCursor
    DEALLOCATE ObjectCursor
    GO

    --\--------------------------------------------------------------------------
    ---) CLEAN UP
    --/--------------------------------------------------------------------------
    IF OBJECT_ID('tempdb..#SourceObjects') IS NOT NULL BEGIN
        DROP TABLE #SourceObjects
    END
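
    When run in simulation mode with the configuration above, the script only prints the statements it would otherwise execute. For a hypothetical staging table [TableName] in the AW2014_Staging database, the printed output would look roughly like this:

    ----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------
    ------------------------------------
    -- Recreate synonyms in schema [stg]
    ------------------------------------
    CREATE SYNONYM [stg].[TableName] FOR [AW2014_Staging].[stg].[TableName];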

    Download the script here.

    Conclusion / wrap up

    Database synonyms can be used to isolate interdatabase dependencies into objects that can be easily updated with a changed database name.

    This makes managing the Data Vault solution easier.

    This concept can also be of use in non-Data Vault database environments.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.