Implementing a hash key collision strategy

Introduction

Dan Linstedt, the inventor of Data Vault, has written a lot about it: hashkeys.
For instance, one of his latest blog posts:
#datavault 2.0, Hashes, one more time.

I will not list all other sources, as you can use Google yourself.
A few comments on hash keys:

  1. You need them for scalability. Using sequence numbers is taking the risk that your data warehouse does not scale well later when the amount of data grows.
  2. They can collide: two different business keys can produce the same hash key. However the chance that this happens is very small. For instance when using SHA-1 (which produces a hash value of 160 bits) you will have a 1 in 1018 chance on a hash collision when having 1.71 * 1015 hash values (read: hub rows) according to this blog post.
  3. If collisions are unacceptable you need a hash key collision strategy.

The full article is posted on DWA.Guide, so you read further there ..


Picture credits: © Can Stock Photo / alexskp

Hans Michiels

Hans is an Independent Business Intelligence and Datawarehouse Consultant & Microsoft SQL Server Consultant, working in the Netherlands. He has been working in the software industry since 1996, with SQL Server since the year 2001, and since 2008 he has a primary focus on datawarehouse- and business intelligence projects using Microsoft technology, using a Datavault and Kimball architecture. He has a special interest in Datawarehouse Automation and Metadata driven solutions. * Certified in Data Vault Modeling: Certified Data Vault 2.0 Practitioner, CDVDM (aka Data Vault 1.0) * Certified in MS SQL Server: * MCSA (Microsoft Certified Solutions Associate) SQL Server 2012 - MCITP Business Intelligence Developer 2005/2008 - MCITP Database Developer 2005/2008 - MCITP Database Administrator 2005/2008

More Posts

Leave a Reply

Your email address will not be published. Required fields are marked *