Detect duplicates in a Data Flow Task (SSIS Series)

Intended audience

This article assumes that you have at least a basic understanding of SSIS, and the Data Flow Task in particular. C# knowledge is an advantage but not required.

Introduction

If you ever used the T-SQL windowing function ROW_NUMBER() OVER (PARTITION BY ..) to find duplicates in a query result, this post might be of interest for you.

Because not all sources having potential duplicate entries (e.g. business keys) come from a SQL Server database. If it is a csv- or Excel file ROW_NUMBER will not automaticly be your friend (it might be when you use OPENROWSET, but I think that is a rather complicated solution).

C# script components are extremely flexible yet fairly easy to understand, develop and maintain.
This is why this is my preferred solution to tackle the problem above.
Let’s begin!

How to detect duplicates in a Data Flow Task using a C# Script component

First an overview of the end result (the Data Flow of the Data Flow Task). I will tell you step by step how to achieve this using SSIS 2014. I did not make a print screen of the Control Flow because it only contains one Data Flow Task.
Please note that I enabled two Data Viewers on both outputs of the Conditional Split component, so you can see the data that flows through (screenshot later).
SSIS-S01E02-image01
For the demo I will use the following components in a dataflow. To solve your problem just pick out the things you need.

  • An OLE DB Source
  • A script component (the most interesting part!)
  • An (optional) conditional split
  • Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates.

An OLE DB Source

My example is based on Advertureworks. I select customers from the [AdventureWorksDW] database, and make sure there are duplicates by using UNION ALL in the SQL statement.
In the real world it would be more likely that your source would not be a SQL Server query, but for instance a csv- or Excel file.

OLEDB Source.sql

SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
  WHERE CustomerKey % 2 = 0

SQL Statement for the OLE DB Source

A script component (the most interesting part!)

From the SSIS Toolbox drag a Script Component to the Data flow surface. In the “Select Script Component Type” choose “Transformation”.

SSIS-S01E02-image02

To use a column value in the script, you have to define it as an input column. Select the column you want to check for duplicate values with Usage Type ReadOnly. In the demo this is the CustomerBusinessKey column.
SSIS-S01E02-image03

We also need a new column to store the occurrence number for the CustomerBusinessKey. To do this, in the Script Transformation Editor, Select Inputs and Outputs on the left side, then unfold Output 0 and Output Columns, respectively.
Then click on the Add column button to add a column with DataType four-byte signed integer [DT_I4]. Give the new column a clear name. In the demo it is called CustomerBusinessKeyOccurrence.
SSIS-S01E02-image03b

After these preparations we can start writing the C# script. Press the Edit script … button to do so.
SSIS-S01E02-image04

Adjust the script as follows (the changes are also marked inside red blocks in the picture below):

  • In the Namespaces region, add:
    using System.Collections.Generic;
  • Above PreExecute(), add:
    // The column to count duplicates for will be stored as Key (with string datatype) of the dictionary.
    // The occurrence counter will be stored as Value of the dictionary (with int datatype).
    private Dictionary<string, int> m_dictOccurrencesCount;
  • Inside PreExecute(), add:
    // Initialize the dictionary, otherwise it will be null when accessed.
    m_dictOccurrencesCount = new Dictionary<string, int>();

SSIS-S01E02-image05
Script adjustments – part 1

Then adjust the script as follows:

  • Inside Input0_ProcessInputRow(Input0Buffer Row), add:
    // Count occurrences of CustomerBusinessKeys.
    if (m_dictOccurrencesCount.ContainsKey(Row.CustomerBusinessKey))
    {
      // If the CustomerBusinessKey of the current row is already in the dictionary,
      // increase the counter with 1:
      m_dictOccurrencesCount[Row.CustomerBusinessKey]++;
      // Also assign the new value to the CustomerBusinessKeyOccurrence column,
      // so it can be used in the Data flow later.
      Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[Row.CustomerBusinessKey];
    }
    else
    {
      // If the CustomerBusinessKey of the current row is not yet in the dictionary,
      // add it, and set the counter to 1
      m_dictOccurrencesCount.Add(Row.CustomerBusinessKey, 1);
      // Also assign 1 to the CustomerBusinessKeyOccurrence column.
      Row.CustomerBusinessKeyOccurrence = 1;
    }

SSIS-S01E02-image06
Script adjustments – part 2

When the script is ready, Choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Transformation Editor, close it by pressing OK.

An (optional) conditional split

Optionally you can add a Conditional Split.
I use the conditional split to store the first occurrences for every CustomerBusinessKey in one table and all duplicates in another table.
Both tables are only to show you that the script task and conditional split do their jobs properly. In real projects, you might develop this differently, for example by storing only the first occurence.

SSIS-S01E02-image07
Conditional Split Configuration

Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates

I basicly created a two temporary tables by composing a SELECT TOP 0 .. INTO statement from the OLE DB Source SQL Statement, because the result of the demo needs to flow to somewhere.

SSIS-S01E02-OLEDB-Source.sql
IF OBJECT_ID('[dbo].[DimCustomerFirstOne]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerFirstOne]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerFirstOne]
FROM
  [dbo].[DimCustomer]
GO

IF OBJECT_ID('[dbo].[DimCustomerDuplicates]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerDuplicates]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerDuplicates]
FROM
  [dbo].[DimCustomer]
GO

OLE DB Destination OLE_DST FirstOne uses the [dbo].[DimCustomerFirstOne] table.
OLE DB Destination OLE_DST Duplicate uses the [dbo].[DimCustomerDuplicates] table.

Let it flow, let if flow, let it flow

Okay, time to test.
When data flows, the “First occurence” and “Duplicates” Data Viewers pop up. Then I can check that all the rows are divided well over both tables based on the value of column CustomerBusinessKeyOccurrence: exactly 1 or higher than 1.
When I run it with the Data Viewers disabled, the package execution finished literally in a split second. So the performance is excellent.
Test succeeded!
SSIS-S01E02-image08
The Data flow in Debug Mode

SSIS-S01E02-image09b
Data Viewer for First occurrence flow: CustomerBusinessKeyOccurrence is 1 for all rows.

SSIS-S01E02-image10b
Data Viewer for Duplicates flow: CustomerBusinessKeyOccurrence is 2 or higher for all rows.

Other considerations

If multiple columns make a row unique concatenate all column values with a delimiter, and then add the concatenated result to the dictionary m_dictOccurrencesCount.
For instance you could check duplicates using the customer nameparts (firstname, middlename and lastname), the birthdate and the gender. This is a softer check than on a real business key, but sometimes you do not have that.
Of course, for this to work, you must add those columns as ReadOnly input columns to the script component.
You can create a concatenated value as follows:
Inside Input0_ProcessInputRow(Input0Buffer Row), start with:

// Concatenate FirstName, MiddleName, LastName, BirthDate and Gender, using a | as delimiter.
string businessKey = string.Format(“{0}|{1}|{2}|{3:yyyyMMdd}|{4}”,
  Row.FirstName, Row.MiddleName, Row.LastName, Row.BirthDate, Row.Gender).ToUpper();

The concatenated value is converted to uppercase, so that differences in lower- or uppercase lead to the same value.
Then use the businessKey variable (instead of Row.CustomerBusinessKey) in the if statement as shown above.

If the concatenated value becomes very long, you could calculate a hash value and add that to the dictionary instead (thanks for the suggestion, Joost). This would use less memory and might be faster, but the only way to find out if this is the case is testing .. testing ..
( I did not test this to make the demo not too complicated and because performance did not seem to be an issue ).

Thinking a bit further, you could use this concept also to calculate running totals.
Just configure the dictionary to hold a decimal instead of an int and you can calculate running totals for numeric values (e.g. sales figures) easily.

Conclusion / Wrap up

In SSIS you can use a Script Component within a Data Flow Task to detect duplicates in the source data.
This can be an alternative to the T-SQL ROW_NUMBER() function, when the source is not a SQL Server Database.

Credits
My thanks go to Joost van Rossum for reviewing the initial draft of this article. His input was very useful.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

When reverse engineering is a necessary evil .. (SQL Server Series)

.. or: How to find data (text, guids, numbers) in your entire SQL Server database ..

Introduction

A number of years ago I had to make queries on an OLTP database used by third parties’ software. As a person preferring data warehouses for too many reasons to mention here, I am not proud of it, but sometimes these requests just happen. If you have the same now and then, you are not the only one 😉 .

Problem

A little simplified, this is what happened:
I started asking around, is there documentation? “No”, there wasn’t.
Is there a data dictionary? “No”, there wasn’t, either.
Can I have access to the database? “Yes, here’s the connection string.”
So I made a connection, but .. what do all these tables mean? How are they related? How are the foreign keys? The model didn’t tell me, with no physical constraints and unclear table- and column names.
Do you have anything else for me to see how this works? “Well, we have a PROD and a TEST environment, you can have read only access to PROD and read/write to the TEST.”
Okay, that is something ..

MSSQL-S01E02-testconnection
Picture credits: © David Coleman | Dreamstime Stock Photos
When there is only a (test)connection..

Towards a solution

Could you show me how you use the system? Luckily the employee of my client was very helpful and showed me how he used the application.
I tried to get a grip on the SQL Statements caused by his actions, but due to a lack of permissions on the database, this didn’t work.
This could be arranged, but could take a while before I actually would have temporary “elevated” permissions.
As it was just some small piece of “inbetween” work, that seemed an inefficient route to follow.
Okay, I have seen what you did, can I do that on the TEST as well? “Yeah, go ahead.”
So I started to make an order, using silly names and descriptions, and choosing weird seldomly used products. So, what’s next?
Ok, if I can find where my description “First order of Hans” is stored, I can start to map the frontend to the backend, or make clear for myself where the entered information is stored in the database.

Solution

This is where this script became useful. It finds all occurrences of one or more text values in a SQL Server database.

WARNING: Do not use in production environments, as this script can potentially run pretty long (10 to 20 minutes when I used it) and can have a negative impact on database performance in general (so other users might regret that you are running the script).
So use it in a test enviroment when you can.

SearchTextInDatabase.sql

--\
---) SearchTextInDatabase script.
---) Author: Hans Michiels
---) Searches for one or more texts in all (n)(var)char and (n)text
---) columns of a SQL Server database.
---) WARNING: Can potentially run very long, my advise is not
---) to run this script in production environments!
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014]
GO

--\-----------------------------------------------------------------------------------
---) DECLARATIONS
--/-----------------------------------------------------------------------------------
DECLARE @schema_name SYSNAME
DECLARE @table_name SYSNAME
DECLARE @column_name SYSNAME
DECLARE @nsql NVARCHAR(4000)
DECLARE @param_def NVARCHAR(500)
DECLARE @search_text NVARCHAR(1000)
DECLARE @search_texts TABLE ([search_text] NVARCHAR(1000))
DECLARE @debug_print NVARCHAR(500)
DECLARE @results_count INT = 0

IF OBJECT_ID('tempdb..#search_results') IS NOT NULL BEGIN
    DROP TABLE #search_results
END
CREATE TABLE #search_results
    ( [search_text] NVARCHAR(1000)
    , [schema_name] NVARCHAR(128)
    , [table_name] NVARCHAR(128)
    , [column_name] NVARCHAR(128)
    , [the_value] NVARCHAR(1000)
    )

DECLARE columns_cursor CURSOR LOCAL STATIC FOR
    SELECT sch.name AS [schema_name]
    , tbl.name AS [table_name]
    , col.name AS [column_name]
    FROM sys.tables tbl
     JOIN sys.schemas sch
        ON sch.schema_id = tbl.schema_id
      JOIN sys.columns col
        ON col.object_id = tbl.object_id
    WHERE col.system_type_id IN (175, 167, 231, 239, 35, 99) -- (N)(VAR)CHAR, (N)TEXT

--\-----------------------------------------------------------------------------------
---) INITIALIZATION
---) Specify the texts to search for below.
---) Use the wildcard % if you want to do a 'like' search.
--/-----------------------------------------------------------------------------------
INSERT INTO @search_texts

  SELECT '%Assembly%'
  UNION SELECT 'First order of Hans'

DECLARE search_cursor CURSOR LOCAL STATIC FOR
    SELECT search_text
    FROM @search_texts

--\-----------------------------------------------------------------------------------
---) MAIN
--/-----------------------------------------------------------------------------------
OPEN columns_cursor
FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @debug_print = 'Processing ' + @table_name + '.' + @column_name;
    PRINT REPLICATE('-', LEN(@debug_print))
    PRINT @debug_print
    PRINT REPLICATE('-', LEN(@debug_print))
    
    SET @nsql = 'INSERT INTO #search_results SELECT DISTINCT @inner_search_text, '
        + '''' + @schema_name + ''' AS [schema_name], '
        + '''' + @table_name + ''' AS [table_name], '
        + '''' + @column_name + ''' AS [column_name], CONVERT(NVARCHAR(1000), ' + QUOTENAME(@column_name) + ') AS [the_value] '
        + ' FROM ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' WITH (NOLOCK) '
        + ' WHERE ' + QUOTENAME(@column_name) + ' LIKE @inner_search_text;';

    OPEN search_cursor
    FETCH NEXT FROM search_cursor INTO @search_text

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT ''
        PRINT '--> Processing search text ''' + @search_text + '''';
        SET @param_def = N'@inner_search_text NVARCHAR(1000)';
        EXECUTE sp_executesql @nsql, @param_def,
                              @inner_search_text = @search_text;

        FETCH NEXT FROM search_cursor INTO @search_text
    END
    
    CLOSE search_cursor

    FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name
    
END

CLOSE columns_cursor
DEALLOCATE columns_cursor

SELECT
      t.schema_name
    , t.table_name
    , t.column_name
    , t.the_value
    , 'SELECT * FROM ' + QUOTENAME(t.schema_name) + '.' + QUOTENAME(t.table_name)
      + ' WHERE ' + QUOTENAME(t.column_name) + ' = ''' + REPLACE(t.the_value, '''', '''''') + '''' AS [select_statement]
 FROM #search_results t

GO

Some time later, I also made two similar versions for searching unique identifiers and numeric values.
I know the scripts might not look so ‘pretty’, but they do the job when needed.

SearchUniqueIdentifierInDatabase.sql

--\
---) SearchUniqueIdentifierInDatabase script.
---) Author: Hans Michiels
---) Searches for one or more unique identifiers in all UNIQUEIDENTIFIER columns
---) of a SQL Server database.
---) WARNING: Can potentially run very long, my advise is not
---) to run this script in production environments!
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014]
GO

--\-----------------------------------------------------------------------------------
---) DECLARATIONS
--/-----------------------------------------------------------------------------------
DECLARE @schema_name SYSNAME
DECLARE @table_name SYSNAME
DECLARE @column_name SYSNAME
DECLARE @nsql NVARCHAR(4000)
DECLARE @param_def NVARCHAR(500)
DECLARE @search_value UNIQUEIDENTIFIER
DECLARE @search_values TABLE ([search_value] UNIQUEIDENTIFIER)
DECLARE @debug_print NVARCHAR(500)
DECLARE @results_count INT = 0

IF OBJECT_ID('tempdb..#search_results') IS NOT NULL BEGIN
    DROP TABLE #search_results
END
CREATE TABLE #search_results
    ( [search_value] UNIQUEIDENTIFIER
    , [schema_name] NVARCHAR(128)
    , [table_name] NVARCHAR(128)
    , [column_name] NVARCHAR(128)
    , [the_value] UNIQUEIDENTIFIER
    )

DECLARE columns_cursor CURSOR LOCAL STATIC FOR
    SELECT sch.name AS [schema_name]
    , tbl.name AS [table_name]
    , col.name AS [column_name]
    FROM sys.tables tbl
     JOIN sys.schemas sch
        ON sch.schema_id = tbl.schema_id
      JOIN sys.columns col
        ON col.object_id = tbl.object_id
                              -- UNIQUEIDENTIFIER
    WHERE col.system_type_id IN (36)

--\-----------------------------------------------------------------------------------
---) INITIALIZATION
---) Specify the unique identifiers to search for below.
--/-----------------------------------------------------------------------------------
INSERT INTO @search_values
SELECT 'D081136E-38D6-4D42-8FFD-19A6A8FA61E6'
UNION SELECT '9AADCB0D-36CF-483F-84D8-585C2D4EC6E9'
UNION SELECT '32A54B9E-E034-4BFB-B573-A71CDE60D8C0'
UNION SELECT '4C506923-6D1B-452C-A07C-BAA6F5B142A4'

DECLARE search_cursor CURSOR LOCAL STATIC FOR
    SELECT search_value
    FROM @search_values

--\-----------------------------------------------------------------------------------
---) MAIN
--/-----------------------------------------------------------------------------------
OPEN columns_cursor
FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @debug_print = 'Processing ' + @table_name + '.' + @column_name;
    PRINT REPLICATE('-', LEN(@debug_print))
    PRINT @debug_print
    PRINT REPLICATE('-', LEN(@debug_print))
    
    SET @nsql = 'INSERT INTO #search_results SELECT DISTINCT @inner_search_value, '
        + '''' + @schema_name + ''' AS [schema_name], '
        + '''' + @table_name + ''' AS [table_name], '
        + '''' + @column_name + ''' AS [column_name], ' + QUOTENAME(@column_name) + ' AS [the_value] '
        + ' FROM ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' WITH (NOLOCK) '
        + ' WHERE ' + QUOTENAME(@column_name) + ' = @inner_search_value ;';

    OPEN search_cursor
    FETCH NEXT FROM search_cursor INTO @search_value

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT ''
        PRINT '--> Processing search value ''' + CONVERT(NVARCHAR(40), @search_value) + '''';
        SET @param_def = N'@inner_search_value UNIQUEIDENTIFIER';
        EXECUTE sp_executesql @nsql, @param_def,
                              @inner_search_value = @search_value;

        FETCH NEXT FROM search_cursor INTO @search_value
    END
    
    CLOSE search_cursor

    FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name
    
    SELECT @results_count = (SELECT COUNT(*) FROM #search_results);
    
END

CLOSE columns_cursor
DEALLOCATE columns_cursor

SELECT
      t.schema_name
    , t.table_name
    , t.column_name
    , t.the_value
    , 'SELECT * FROM ' + QUOTENAME(t.schema_name) + '.' + QUOTENAME(t.table_name)
      + ' WHERE ' + QUOTENAME(t.column_name) + ' = ''' + REPLACE(t.the_value, '''', '''''') + '''' AS [select_statement]
 FROM #search_results t

GO


SearchNumberInDatabase.sql

--\
---) SearchNumberInDatabase script.
---) Author: Hans Michiels
---) Searches for one or more numeric values in all TINYINT, SMALLINT,
---) INT, FLOAT, DECIMAL and NUMERIC columns of a SQL Server database.
---) WARNING: Can potentially run very long, my advise is not
---) to run this script in production environments!
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014]
GO

--\-----------------------------------------------------------------------------------
---) DECLARATIONS
--/-----------------------------------------------------------------------------------
DECLARE @schema_name SYSNAME
DECLARE @table_name SYSNAME
DECLARE @column_name SYSNAME
DECLARE @nsql NVARCHAR(4000)
DECLARE @param_def NVARCHAR(500)
DECLARE @search_value DECIMAL(38, 2)
DECLARE @search_values TABLE ([search_value] DECIMAL(38, 2))
DECLARE @debug_print NVARCHAR(500)
DECLARE @results_count INT = 0
DECLARE @max_difference DECIMAL(38, 2) -- Use to find values 'near by'

IF OBJECT_ID('tempdb..#search_results') IS NOT NULL BEGIN
    DROP TABLE #search_results
END
CREATE TABLE #search_results
    ( [search_value] DECIMAL(38, 2)
    , [schema_name] NVARCHAR(128)
    , [table_name] NVARCHAR(128)
    , [column_name] NVARCHAR(128)
    , [the_value] DECIMAL(38, 2)
    )

DECLARE columns_cursor CURSOR LOCAL STATIC FOR
    SELECT sch.name AS [schema_name]
    , tbl.name AS [table_name]
    , col.name AS [column_name]
    FROM sys.tables tbl
     JOIN sys.schemas sch
        ON sch.schema_id = tbl.schema_id
      JOIN sys.columns col
        ON col.object_id = tbl.object_id
                              -- TINYINT, SMALLINT, INT, INT
    WHERE col.system_type_id IN (48, 52, 56, 127
                              -- FLOAT, DECIMAL, FLOAT, DECIMAL, DECIMAL, DECIMAL
                              --, 59, 60, 62, 106, 108, 122, 127
                                )

--\-----------------------------------------------------------------------------------
---) INITIALIZATION
---) Specify the texts to search for below “INSERT INTO @search_values”
---) Also set a value for @max_difference
---) @max_difference makes it possible to find nearby numbers.
--/-----------------------------------------------------------------------------------
SELECT @max_difference = 1.0
INSERT INTO @search_values
  SELECT 755
  --UNION SELECT 97867563

DECLARE search_cursor CURSOR LOCAL STATIC FOR
    SELECT search_value
    FROM @search_values

-------------------------------------------------------------------------------------
-- MAIN
-------------------------------------------------------------------------------------
OPEN columns_cursor
FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @debug_print = 'Processing ' + @table_name + '.' + @column_name;
    PRINT REPLICATE('-', LEN(@debug_print))
    PRINT @debug_print
    PRINT REPLICATE('-', LEN(@debug_print))
    
    SET @nsql = 'INSERT INTO #search_results SELECT DISTINCT @inner_search_value, '
        + '''' + @schema_name + ''' AS [schema_name], '
        + '''' + @table_name + ''' AS [table_name], '
        + '''' + @column_name + ''' AS [column_name], CONVERT(DECIMAL(38, 2), ' + QUOTENAME(@column_name) + ') AS [the_value] '
        + ' FROM ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' WITH (NOLOCK) WHERE ABS(' + QUOTENAME(@column_name) + ' - @inner_search_value) <= ' + CAST(@max_difference AS VARCHAR(50)) + ';';

    PRINT @nsql
    
    OPEN search_cursor
    FETCH NEXT FROM search_cursor INTO @search_value

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT ''
        PRINT '--> Processing search value ''' + CONVERT(NVARCHAR(40), @search_value) + '''';
        SET @param_def = N'@inner_search_value DECIMAL(38, 2)';
        EXECUTE sp_executesql @nsql, @param_def,
                              @inner_search_value = @search_value;

        FETCH NEXT FROM search_cursor INTO @search_value
    END
    
    CLOSE search_cursor

    FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name
    
    SELECT @results_count = (SELECT COUNT(*) FROM #search_results);
    
END

CLOSE columns_cursor
DEALLOCATE columns_cursor

SELECT
      t.schema_name
    , t.table_name
    , t.column_name
    , t.the_value
    , 'SELECT * FROM ' + QUOTENAME(t.schema_name) + '.' + QUOTENAME(t.table_name) + ' WHERE ' + QUOTENAME(t.column_name) + ' = ''' + REPLACE(t.the_value, '''', '''''') + ''''
 FROM #search_results t

GO

Download all scripts here.

Conclusion / Wrap up

When you have to understand a SQL Server datamodel, for instance to get data out, and there is no documentation but only a test environment and database connection available, these scripts can help you to find where data entered in the frontend software is stored in the database.
A different use case is when the database is using unique identifiers for primary and foreign keys, and the foreign key relationships are not physically enforced and/or flexible. In that case you can search for a specific unique identifier and see where it is used in the database.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Stop being so precise! and more about using Load(end)dates (Datavault Series)

Introduction

My web site was offline a few days ago. Sorry for that.
I got this email from my webhosting provider: “A new message or response with subject: User *** has used up 153% of their bandwidth ..”.
Shortly after they suspended my account. It was suspended for about 8 hours. I have upgraded the network traffic limit.
All right, when all bandwidth was consumed .. then somebody must be reading my blog. Thank you for that.

This weeks post is about the LoadDate and LoadEndDate.
Actually there are two things to be nerdy on:

  1. End dating without gaps
  2. Getting the best value for bytes on DATETIME2 precision

By the way, these topics apply to SQL Server, the examples are made using SQL Server 2014.

1. End dating without gaps

How end dating works

In a Data Vault Satellite, different subsequent versions of a row in the source system are distinguished through a LoadDate (which usually contains a date/time value). So the HashKey of the corresponding Hub Row plus the LoadDate are the primary key for the satellite.
However to get the right version out when querying for a specific date, this requires a relatively difficult query, and this can also have a negative impact on performance.
This is why the LoadEndDate is an optional column of a Satellite, to make querying (read: getting data out) easier and better performing.

Important to remember that the LoadEndDate is not the date/time the load(process) ended, but the date/time the row was replaced by a newer row for the same business entity in the hub. What’s in a name, if I had to choose I would just call it EndDate, but LoadEndDate is the standard, and once you know it, it is not a problem.

There are two ways to determine the value for this LoadEndDate:

  1. Exclusive: the LoadEndDate is the LoadDate of the new row that replaces this one, minus a small time fraction. Using the exclusive methods enables the use of the BETWEEN keyword in Transact-SQL to get the right row out, example:

    WHERE @SnapshotDate BETWEEN [LoadDate] AND [LoadEndDate]

    This is the method that is used in most Data Vault examples.

  2. Inclusive: the LoadEndDate is exactly equal to the LoadDate of the new row that replaces this one. This requires no computation when updating the LoadEndDate, but disqualifies the BETWEEN keyword, for getting data out you need to do something like:

    WHERE @SnapshotDate >= [LoadDate] AND @SnapshotDate < [LoadEndDate]

Problem

There is a (theoretical) problem when using the exclusive method.
If you substract to much, there will be a tiny time gap between the LoadEndDate and the LoadDate of the subsequent row.
I fully admit this is a theoretical problem, because the chances that you require the row valid exactly on this moment are astronomically small.
(There is this not so hypothetical situation that your load starts at midnight, the LoadDate is set before the first second of the new day has passed, and you substract a second.
Then you do not find a record when using the exact date (without time fraction) to get the row out. But you need a lot of bad luck for this to happen).

Solution

Still if you are a purist you want to do it right, certainly because the solution is so simple.
If you make the the “grain” of substraction from the LoadDate equal to the precision of the LoadDate, there is no gap.
E.g.
– Substract 1 second from DATETIME2(0)
– Substract 1 centisecond from DATETIME2(2) ***TIP***
– Substract 1 millisecond from DATETIME2(3)
– Substract 1 microsecond from DATETIME2(6)
– Substract 100 nanoseconds from DATETIME2(7)

Examples in T-SQL:

Exclusive enddate without gaps.sql

DECLARE @dt20 DATETIME2(0) = '2016-04-13 20:52:17'
DECLARE @dt22 DATETIME2(2) = '2016-04-13 20:52:17.00'
DECLARE @dt23 DATETIME2(3) = '2016-04-13 20:52:17.000'
DECLARE @dt26 DATETIME2(6) = '2016-04-13 20:52:17.000000'
DECLARE @dt27 DATETIME2(7) = '2016-04-13 20:52:17.0000000'

SELECT @dt20, DATEADD(SECOND, -1, @dt20)
SELECT @dt22, DATEADD(MILLISECOND, -10, @dt22)
SELECT @dt23, DATEADD(MILLISECOND, -1, @dt23)
SELECT @dt26, DATEADD(MICROSECOND, -1, @dt26)
SELECT @dt27, DATEADD(NANOSECOND, -100, @dt27)

2. Getting the best value for bytes on DATETIME2 precision

This is about a “smart” precision to choose for your LoadDate and LoadEndDate columns. Unlike the older DATETIME datatype, DATETIME2 uses less or more bytes for storage depending on the precision you specify.
The Storage size is: 6 bytes for precisions less than 3; 7 bytes for precisions 3 and 4. All other precisions require 8 bytes. (I shamelessly pasted this from MSDN).
In the book “Building a scalable data warehouse with Data Vault 2.0” a lot of examples use DATETIME2(7). You can ask yourself why. Why do you need to be precise to 100 nanoseconds? I dare to say that in most cases (except when loading realtime or near-realtime streaming data into your Data Vault), seconds would be precise enough.
But looking back a few lines, to the storage sizes, DATETIME2(0) uses 6 bytes, but DATETIME2(2) ALSO uses 6 bytes. So with the latter you get a higher precision for the same storage size. And for daily loads, the centiseconds precision that DATETIME2(2) is providing, is really precise enough, believe me.
So DATETIME2(2) gives us the best value for bytes!


DV-S01E04-precise
Why would your LoadDates and LoadEndDates be more precise than needed ?..


In this way you can save two bytes, compared with a DATETIME2(7), which uses 8 bytes and is used in the book. Because the LoadDate is in the primary key of satellite tables, also the primary key index will be smaller.
In PIT Tables, which usually have multiple LoadDates, the storage gain is even more.
And what counts for money, “who does not respect a small gain, isn’t worth a big one”, does also count for small performance gains you can get by doing this kind of tweaks. Never forget that! Multiple smaller tweaks might add up to a noticable performance improvement!

Conclusion / Wrap up

In this blog post you have read about the two methods for enddating: Inclusive and Exclusive.
Also you could read how you can use the Exclusive method for end dating in Data Vault satellites, without having gaps in the timeline.
Finally I discussed which precision will in most cases be good enough for LoadDate and LoadEndDate columns.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Hash Diff calculation with SQL Server (Datavault Series)

Updates

Change log

Date Changes
31 January, 2017 Really good news! The 8k limitation on HASHBYTES has been removed in SQL Server 2016!

Today I would like to discuss Hash Diff calculation using SQL Server 2014, from my own experience.

My purpose is NOT to be complete and/or to replicate book “Building a scalable data warehouse with Data Vault 2.0”. For full details please read the book.

However I will give a short intro for those who hardly know what I am talking about.

Introduction

Hash differences or Hash diffs

A hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s).

A hash difference column is used to detect changes in a new version of a satellite’s row for a particular business key, and can be used in a comparison instead of doing a comparison on every individual descriptive attribute.
The main reason to use hash diff columns is to improve performance when loading new satellite rows (of which one or more attributes have changed).
The more columns the satellite has, the higher the performance gain will be.
To support data warehouse automation patterns, hash diffs can also be used for satellites with only a few or even one descriptive attribute.

I think a different name, like RowHash, RecordHash, DataHash or AttributeValuesHash, would better describe the contents of the column, because the value itself is not a difference, but that’s a different (relatively unimportant) discussion.

Hash keys

This article is not about Hash keys, however to be complete on hashing as far as Data Vault 2.0 is concerned, I will give a short introduction on that as well.

Hash keys replace sequence numbers (generated by the database engine) of the Data Vault 1.0 standard. They support geographically distributed data warehouses, as well as integration with big data environments like Hadoop.

A hash key is a hash value of the business key column(s) used in a Hub or Link. The advantage is that it is predictable, which enables parallel loading and a multiplatform / distributed datawarehouse architecture. For instance a customer with code NL123 will have the same hash key on all distributed parts of the data warehouse. Because the hash key always has the same length and is stored as a (fixed length) CHAR column, performance is usually better than when the business key column(s) are directly used as primary key.

How to calculate a hash difference in T-SQL

For full details about Endianness, differences of hash functions between different platforms, how to cope with new columns and other technical details I refer to the book mentioned above. If you want to work with Data Vault 2.0 you will need the book anyway.

As said, a hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s). To do the hashing, we have to put all the column values together in one nvarchar value, and then apply a hashing function on it. This is the short version and the base to understand it.

DV-S01E02-hashdiff

Between the different column values you have to use a delimiter, preferably one that does not occur in the column values. Otherwise you have to “escape” the delimiter in column values.

If you want to do a case insensitive compare, you should convert all values to either upper- or lowercase. But this means that if in the source system a customer name was “hans michiels” and is changed to “Hans Michiels”, you will not detect the change and therefore do not store it in the Data Vault. So it depends on the contents of the satellite and your business view on it if this is desired behaviour or not.

The hash diff calculation results in a binary(16) value, when using the hashing algorithm MD5 (which is recommended). To enable better cross-platform- and tools support, this is then converted to a hexadecimal string of 32 characters, stored as a CHAR(32) column.

In all cases you must also convert the hexadecimal hash value to all UPPER- or all lowercase, because not all convert functions give the same output, some lower- and other uppercase. To compare values that were made by different hash functions (read on different systems) making it all uppercase (or lowercase) makes sure the values do not differ in casing only (which could lead to a false negative answer on the “are the rows the same” question).

I understand this is all dry and boring nerd stuff if you do not see an example, so .. a simple example:

Calculating_a_hash_diff.sql

SELECT
      [bla]
    , UPPER(
        CONVERT(CHAR(32),
          HASHBYTES('MD5',
            UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR, [BusinessKeyOfHub]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column01]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column02]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column03]), N'')
              ) -- END CONCAT
            ) -- END UPPER
          ) -- END HASHBYTES
        , 2) -- END CONVERT
      ) -- END UPPER
FROM [stg].[bla]

So far nothing shocking, but there are a few pitfalls. If you keep them in mind, you will be an excellent hasher soon. 😉

CONVERT(NVARCHAR without specifying the size

Do this only for values of 30 or less characters, otherwise the value will be cut off which could lead to wrong hash diff values, not changing when the attribute value changes. It is for instance okay for integers, decimals (unless the precision is more than 28, mind the decimal separator and possible negative sign), datetime stamps, and bits. When converting (n)varchar safest thing to do is follow the original size, for unique identifiers use 38 or 40.

Beware for CONCAT

Beware for concatenation of (N)VARCHAR(MAX), (N)TEXT and XML columns using the CONCAT function. It appears that only the first 8000 bytes of (N)VARCHAR(MAX) columns are in the concatenated result.

And even worse, HASHBYTES ..

HASHBYTES does only hash a NVARCHAR value with a length of 4000 or less. Above that the error “String or binary data would be truncated” occurs. This is really important to keep in mind.
The 8000 characters (or 4000 unicode characters) limitation on any string hashed with the HASHBYTES function has been removed in SQL Server 2016! Now you can hash larger (N)VARCHAR values!

Cope with the HASHBYTES and CONCAT limitations

4000 characters is quite something, right? Only with very wide satellites (a lot of columns) and/or usage of (N)VARCHAR(MAX), (N)TEXT or XML columns (hereafter called ‘unlimited length columns’) you can get into trouble.

So, what is wise to do? Some measures you can take are:

Make your satellites not too wide

This is a best practice, not only from hashing function perspective. Split the satellites by source system, then by rate of change. Keep in mind that the business key column(s) and all satellite column values are converted to NVARCHAR and then (including all delimiters) may not exceed the 4000 characters if you want to use the HASHBYTES function.

Size unlimited length columns according to the actual maximum length

What I mean is, if for instance a Country name is stored in an unlimited length column in the source system, you know that a size of 100 will be long enough. When you doubt just do some data profiling on the source to see what is the maximum length. Use that size plus a safety margin for the satellite column size. Problem gone.

Handle true unlimited length columns

Put true unlimited length columns (for instance free entry ‘memo’ fields or xml columns in source systems in their own satellite! Otherwise if any other (small) attribute changes, the 4k memo is copied, even when it was not changed. This absorbs storage. But that alone is not enough, as said, with HASHBYTES, above 8000 characters (4000 when unicode) you are still f*c**d. Luckily there is a system function in the master database, which uses the MD5 algorithm and takes a binary value as input. So to compute a MD5 hash value for a text column with unlimited length, you could do something like:

Hash_for_memo.sql

SELECT master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
[MyHugeMemoField)) AS [HashDiff]

But don’t make the mistake to use the function [sys].[fn_repl_hash_binary] for all hashing, because the performance of HASHBYTES is much better!
On my machine HASHBYTES is three times faster than [sys].[fn_repl_hash_binary] on the same dataset.
You can check for yourself using this script:

Performance_comparison.sql

USE [msdb]
GO

DECLARE @start DATETIME2(7)
DECLARE @stop DATETIME2(7)
DECLARE @elapsed1 BIGINT
DECLARE @elapsed2 BIGINT
--\-------------------------------------------------------------------------
---) TEST PERFORMANCE of HASHBYTES function.
--/-------------------------------------------------------------------------
SELECT @start = GETDATE();
SELECT TOP1000000
      UPPER(
        CONVERT(CHAR(32),
          HASHBYTES('MD5',
            --UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
              ) -- END CONCAT
            --) -- END UPPER
          ) -- END HASHBYTES
        , 2) -- END CONVERT
      ) -- END UPPER
  FROM [sys].[objects] o1
  CROSS JOIN [sys].[objects] o2
  CROSS JOIN [sys].[objects] o3;

SELECT @stop = GETDATE();
SELECT @elapsed1 = DATEDIFF(MICROSECOND,@start, @stop);

--\-------------------------------------------------------------------------
---) TEST PERFORMANCE of [master].[sys].[fn_repl_hash_binary] function.
--/-------------------------------------------------------------------------
SELECT @start = GETDATE();
SELECT TOP 1000000
      UPPER(
        CONVERT(CHAR(32),
          master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
            --UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
                 ) -- END CONCAT
            --) -- END UPPER
          ) -- END HASHBYTES
          ) -- END CONVERT
        , 2) -- END CONVERT
      ) -- END UPPER
  FROM [sys].[objects] o1
  CROSS JOIN [sys].[objects] o2
  CROSS JOIN [sys].[objects] o3;

SELECT @stop = GETDATE();
SELECT @elapsed2 = DATEDIFF(MICROSECOND,@start, @stop);

SELECT @elapsed1 AS [Elapsed_HashBytes]
, @elapsed2 AS [Elapsed_fn_repl_hash_binary]
, 'HASHBYTES is ' + CONVERT(VARCHAR, CONVERT(decimal(19, 2), @elapsed2 / @elapsed1))
+ ' times faster than [fn_repl_hash_binary]' AS [Result]

That’s about it.
Just a few more things:

  • If you use a different hashing algorithm (e.g. SHA1) you cannot use the sys.fn_repl_hash_binary function, in that case you will have to seek your solution in a C# script task or CLR stored procedure, doing the hashing in .NET code.
  • Don’t use CONCAT if you have more than one column in the satellite containing [MyHugeMemoField], or you still can run into problems.

Conclusion / Wrap up

I have explained in short why Hash diff columns are used in a Data Vault 2.0 satellite, and how these values can be calculated.

After that, you have read about some limitations of the CONCAT and HASHBYTES function when used for calculating Hash diff values, and how to cope with these limitations.

Credits
My thanks go to Vincent Kuiper for reviewing the initial draft of this article.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Data Vault and Dimensional Modelling, a happy marriage! (Data Vault Series)

We need ambassadors!

I was at this one day course recently and spoke to a professional working in the business intelligence domain.
He said that he had just followed a two week course for datawarehouse architects.
I asked him, what did they tell about Data Vault?
His answer was shocking: “Not so much ..”

If recently trained datawarehouse architects do not get trained to use Data Vault in a datawarehouse architecture, there definitely is a lot of missionary work to do.
Because the Data Vault can add so much value to your datawarehouse solution. 
And more important, it is not a choice, a Kimball dimensional model OR a Linstedt Data Vault. Use both and
combine each strengths!

Intended audience

This is a strong call for everybody who reads this to share this post with at least one person you know working in the business intelligence domain using star schemas (a.k.a. dimensional modelling) directly fed from source systems.

What are the strengths of a Dimensional Model (a.k.a. Star Schema)?

  • Because the dimensional model is already around for decades, a lot of front end tools have an excellent support for it.
  • A dimensional model is designed for reporting, slicing and dicing, for getting data out. That is what it is good at.

    This becomes clear by the following examples:

    • Query performance is usually better than of other models, like Data Vault or Relational Datawarehouse, especially for aggregations.
    • A star model is an excellent source for Cube databases, like SSAS, especially when using a multidimensional model.
    • Date selections and comparison of periods (e.g. Q1 this year versus Q1 Last Year) is easier than in Data Vault or Relational models.
  • It enables self service BI. Give business (power) users (direct) access to a dimensional model, and there is a good chance that they can get  data out in the format they want.

What are the strengths of a Data Vault?

  • A Data Vault is meant to capture all the data all the time. Therefore it stores all data from the source systems, captures the changes in it, and keeps a full history of all changes (with a sidemark, that multiple changes between two load cycles, will be seen as one).
  • A Data Vault captures all data exactly as it is stored in the source systems. “The good, the bad and the ugly”with no business rules applied to it (except, in some cases hard business rules like data types). This makes all data auditable.
  • Data Vault 2.0 supports cross platform integration with other systems (e.g. Hadoop) and decentralized datawarehouses, for instance in different global regions.
  • With Data Vault 2.0 working agile is easy, in sprints, building the system in increments.
  • Data Vault has a very good support for Data Integration. Satellites, hanging off the Hub, containing data from different source systems.

So? What are the strengths of a datawarehouse architecture using both?

  • You can change your mind! If your business rules change (and it happens!), just reload your dimensional model from the Data Vault with the new rules applied.
  • You have one version of the facts . If you have multiple dimensional models, you have not.
  • Data history is not like a destroyed monument. With a Data Vault, you capture all data from the source systems, and all changes applied to it. Source systems usually reflect only the current state of the data, so without a change capturing central storage like Data Vault, your history is gone forever!
  • Your dimensional model does not have to serve two purposes: data storage and presentation of information. These two tasks can be in conflict, for instance there is currently no report need for all data that you could grap from source systems, so you leave it out of your star schema. But if the question comes tomorrow, you have no data, no history!
  • You have full auditibility. Because the (raw) Data Vault stores the data from source systems unmodified, every measure, every dimensional attribute, can be traced back to the source it came from and when. This makes it able to refute claims that the data is not correct. Maybe now it will become clear that the old datawarehouse- or reporting system has lied for years!

How can you use both?

The answer is simple: by using a multi-tier architecture.
Data from source systems is first extracted to a Staging area, before it is moved into the Enterprise Data Warehouse using a Data Vault Model (with or without some optional components).
From there it will be distributed to dimensional models (star schemas) and cubes, and whilst the data is on its way, business rules can be applied.
DV-S01E02-multi-tier-architecture
Multi-tier architecture using both Data Vault and Dimensional Modelling techniques.

Conclusion / Wrap up

I have written this post to create more awareness about using both Data Vault and Dimensional Modelling or Star Schemas in a data warehouse architecture.
I have listed strengths and benefits of Data Vault, Dimensional Modelling and of an architecture using both.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.