Introduction
I must admit I have a weakness for British humour.
When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.
Mr John Cleese
But, obviously, this post is not about meetings but about zeros.
Problem
I can be short about that: leading zeros in business key values.
For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
This would not be a problem it is would always be delivered exactly like that. But to be honest, you can and will not know that upfront. Even this might be the case now, it might not be in the future.
When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem.
Because every business key gets a different row in the hub, and this customer now exists multiple times!
(No acting here, this is how I look sometimes)
Solution
If you are familiar with Data Vault, you might already think of same-as-links to solve this.
But I think the solution should be implemented earlier, to avoid having multiple hub rows.
Simply always remove leading zeros when the sourcecolumn is (part of) a business key (either primary or foreign key) and seems a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty impossible that 001806 and 1806 would refer to two different customers.
Unless, of course, they would come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when then leading zeros dropped off, satellite rows of different customers (in different source systems) could end up as connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance, CRM.1806 and ERP.1806.
In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.
How to implement the solution
There are many ways to remove leading zeros. When I was searching for this I had two requirements:
- No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
- No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.
After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, my credits for his work), but appeared to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
So I had to add some logic to it: a SELECT CASE in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) to the SSIS expression.
Furthermore I came on a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
So .. this is what it has become:
T-SQL
SELECT
example.[id_with_leading_zeros],
CASE
WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
END AS [id_without_zeros_method1],
CASE
WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
ELSE REPLACE(REPLACE(LTRIM(REPLACE(-- Robin Hames' method
REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
, '0', ' ') -- replace '0' with ' '
) -- end of LTRIM to remove leading '0's that have been changed to ' 's
, ' ', '0') -- change ' ' back to '0'
, '!#!', ' ') -- change '!#!' back to ' '
END AS [id_without_zeros_method2]
FROM
(
SELECT
TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
FROM
master.sys.objects
UNION
SELECT N' 00000 '
UNION
SELECT N'00'
UNION
SELECT N' '
UNION
SELECT ' 0099990 A '
UNION
SELECT '-5550'
) example
SSIS Expression (can be used in Derived Column)
(LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")
In a Derived Column Transformation this looks for instance like this
Conclusion / Wrap up
In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
This post also contains different ways to remove leading zeros, two for T-SQL and one for a SSIS expression.
(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.
Hans,
Is the purpose of having a business vault not to solve this. Changing data bits when entering the raw vault will mess up accountability & traceability. Not a good thing…
Hi Eric, thanks for your feedback.
I agree that solving this in the Business Vault still is a defendable approach.
However this topic is closely related to calculating the hashkey from the business key, and in that sense leading zeros can in my opinion be placed in the same category as differences in case (uppercase/lowercase) for an alphanumeric business key, and leading and trailing spaces. (read more in “Building a Scalable Data Warehouse with Data Vault 2.0” paragraph 11.2.2)
Usually case sensitivity and leading and trailing spaces are irrelevant for a business key. It seldom occurs that a business key AAA would be something different than Aaa. I think the same counts for leading zeros. When checked with the business, it is simpler to remove the leading zeros than solve it in the Business Vault. And I like the KISS (Keep It Simple Stupid) approach whenever possible.
As for accountability and traceability, as mentioned in the article you still can store the business key as normal attribute with leading zeros in a satellite to keep full auditability.
But it is not black/white and should be decided per business key, in that sense my article has not enough nuance, so your addition was certainly welcome.
Hans