Detect duplicates in a Data Flow Task (SSIS Series)

Intended audience

This article assumes that you have at least a basic understanding of SSIS, and the Data Flow Task in particular. C# knowledge is an advantage but not required.

Introduction

If you have ever used the T-SQL windowing function ROW_NUMBER() OVER (PARTITION BY ..) to find duplicates in a query result, this post might be of interest to you.

Not all sources with potential duplicate entries (e.g. business keys) come from a SQL Server database. If the source is a CSV or Excel file, ROW_NUMBER will not automatically be your friend (it might be when you use OPENROWSET, but I think that is a rather complicated solution).

C# script components are extremely flexible, yet fairly easy to understand, develop and maintain.
That is why they are my preferred solution for the problem above.
Let’s begin!

How to detect duplicates in a Data Flow Task using a C# Script component

First an overview of the end result (the Data Flow of the Data Flow Task). I will explain step by step how to achieve this using SSIS 2014. I did not include a screenshot of the Control Flow, because it only contains one Data Flow Task.
Please note that I enabled a Data Viewer on each of the two outputs of the Conditional Split component, so you can see the data that flows through (screenshots later).
SSIS-S01E02-image01
For the demo I will use the following components in a data flow. To solve your own problem, just pick the parts you need.

  • An OLE DB Source
  • A script component (the most interesting part!)
  • An (optional) conditional split
  • Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates.

An OLE DB Source

My example is based on AdventureWorks: I select customers from the [AdventureWorksDW] database, and make sure there are duplicates by using UNION ALL in the SQL statement.
In the real world it is more likely that your source is not a SQL Server query, but for instance a CSV or Excel file.

OLEDB Source.sql

SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
WHERE CustomerKey % 2 = 0

SQL Statement for the OLE DB Source

A script component (the most interesting part!)

From the SSIS Toolbox, drag a Script Component onto the Data Flow surface. In the “Select Script Component Type” dialog, choose “Transformation”.

SSIS-S01E02-image02

To use a column value in the script, you have to define it as an input column. Select the column you want to check for duplicate values with Usage Type ReadOnly. In the demo this is the CustomerBusinessKey column.
SSIS-S01E02-image03

We also need a new column to store the occurrence number for the CustomerBusinessKey. To do this, in the Script Transformation Editor, Select Inputs and Outputs on the left side, then unfold Output 0 and Output Columns, respectively.
Then click on the Add column button to add a column with DataType four-byte signed integer [DT_I4]. Give the new column a clear name. In the demo it is called CustomerBusinessKeyOccurrence.
SSIS-S01E02-image03b

After these preparations we can start writing the C# script. Press the Edit script … button to do so.
SSIS-S01E02-image04

Adjust the script as follows (the changes are also marked inside red blocks in the picture below):

  • In the Namespaces region, add:
    using System.Collections.Generic;
  • Above PreExecute(), add:
    // The column to count duplicates for will be stored as Key (with string datatype) of the dictionary.
    // The occurrence counter will be stored as Value of the dictionary (with int datatype).
    private Dictionary<string, int> m_dictOccurrencesCount;
  • Inside PreExecute(), add:
    // Initialize the dictionary, otherwise it will be null when accessed.
    m_dictOccurrencesCount = new Dictionary<string, int>();

SSIS-S01E02-image05
Script adjustments – part 1

Then adjust the script as follows:

  • Inside Input0_ProcessInputRow(Input0Buffer Row), add:
    // Count occurrences of CustomerBusinessKeys.
    if (m_dictOccurrencesCount.ContainsKey(Row.CustomerBusinessKey))
    {
      // If the CustomerBusinessKey of the current row is already in the dictionary,
      // increase the counter by 1:
      m_dictOccurrencesCount[Row.CustomerBusinessKey]++;
      // Also assign the new value to the CustomerBusinessKeyOccurrence column,
      // so it can be used in the Data flow later.
      Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[Row.CustomerBusinessKey];
    }
    else
    {
      // If the CustomerBusinessKey of the current row is not yet in the dictionary,
      // add it, and set the counter to 1
      m_dictOccurrencesCount.Add(Row.CustomerBusinessKey, 1);
      // Also assign 1 to the CustomerBusinessKeyOccurrence column.
      Row.CustomerBusinessKeyOccurrence = 1;
    }

SSIS-S01E02-image06
Script adjustments – part 2
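
For reference, this is roughly how the adjusted parts fit together in the generated ScriptMain class. This is only a sketch: the class attributes, the remaining default using directives and the PostExecute() method that SSIS generates are left out.

using System.Collections.Generic;

public class ScriptMain : UserComponent
{
  // The column to count duplicates for will be stored as Key (with string datatype) of the dictionary.
  // The occurrence counter will be stored as Value of the dictionary (with int datatype).
  private Dictionary<string, int> m_dictOccurrencesCount;

  public override void PreExecute()
  {
    base.PreExecute();
    // Initialize the dictionary, otherwise it will be null when accessed.
    m_dictOccurrencesCount = new Dictionary<string, int>();
  }

  public override void Input0_ProcessInputRow(Input0Buffer Row)
  {
    // Count occurrences of CustomerBusinessKeys.
    if (m_dictOccurrencesCount.ContainsKey(Row.CustomerBusinessKey))
    {
      // Key already seen: increase the counter by 1 and pass the new value downstream.
      m_dictOccurrencesCount[Row.CustomerBusinessKey]++;
      Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[Row.CustomerBusinessKey];
    }
    else
    {
      // First time this key is seen: add it with counter value 1.
      m_dictOccurrencesCount.Add(Row.CustomerBusinessKey, 1);
      Row.CustomerBusinessKeyOccurrence = 1;
    }
  }
}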

When the script is ready, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio window, then close that window. Back in the Script Transformation Editor, close it by pressing OK.

An (optional) conditional split

Optionally you can add a Conditional Split.
I use the Conditional Split to send the first occurrence of every CustomerBusinessKey to one table and all duplicates to another table, based on the value of the CustomerBusinessKeyOccurrence column (exactly 1 versus higher than 1).
Both tables only serve to show that the script component and the Conditional Split do their jobs properly. In real projects you might develop this differently, for example by storing only the first occurrence.

SSIS-S01E02-image07
Conditional Split Configuration

Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates

I basically created two temporary tables by composing a SELECT TOP 0 .. INTO statement from the OLE DB Source SQL statement, because the result of the demo needs to flow somewhere.

SSIS-S01E02-OLEDB-Source.sql
IF OBJECT_ID('[dbo].[DimCustomerFirstOne]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerFirstOne]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerFirstOne]
FROM
  [dbo].[DimCustomer]
GO

IF OBJECT_ID('[dbo].[DimCustomerDuplicates]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerDuplicates]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerDuplicates]
FROM
  [dbo].[DimCustomer]
GO

OLE DB Destination OLE_DST FirstOne uses the [dbo].[DimCustomerFirstOne] table.
OLE DB Destination OLE_DST Duplicate uses the [dbo].[DimCustomerDuplicates] table.

Let it flow, let it flow, let it flow

Okay, time to test.
When the data flows, the “First occurrence” and “Duplicates” Data Viewers pop up, and I can check that all rows are divided correctly over both tables based on the value of the CustomerBusinessKeyOccurrence column: exactly 1, or higher than 1.
When I run the package with the Data Viewers disabled, execution finishes in a split second, so performance is excellent.
Test succeeded!
SSIS-S01E02-image08
The Data flow in Debug Mode

SSIS-S01E02-image09b
Data Viewer for First occurrence flow: CustomerBusinessKeyOccurrence is 1 for all rows.

SSIS-S01E02-image10b
Data Viewer for Duplicates flow: CustomerBusinessKeyOccurrence is 2 or higher for all rows.

Other considerations

If multiple columns make a row unique, concatenate all column values with a delimiter, and then add the concatenated result to the dictionary m_dictOccurrencesCount.
For instance you could check for duplicates using the customer name parts (first name, middle name and last name), the birth date and the gender. This is a softer check than one on a real business key, but sometimes you do not have one.
Of course, for this to work, you must add those columns as ReadOnly input columns to the script component.
You can create a concatenated value as follows:
Inside Input0_ProcessInputRow(Input0Buffer Row), start with:

// Concatenate FirstName, MiddleName, LastName, BirthDate and Gender, using a | as delimiter.
string businessKey = string.Format("{0}|{1}|{2}|{3:yyyyMMdd}|{4}",
  Row.FirstName, Row.MiddleName, Row.LastName, Row.BirthDate, Row.Gender).ToUpper();

The concatenated value is converted to uppercase, so that differences in lower- or uppercase lead to the same value.
Then use the businessKey variable (instead of Row.CustomerBusinessKey) in the if statement as shown above.
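
Concretely, the counting logic then becomes (a sketch, reusing the output column name from the demo):

// Count occurrences of the concatenated business key.
if (m_dictOccurrencesCount.ContainsKey(businessKey))
{
  m_dictOccurrencesCount[businessKey]++;
  Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[businessKey];
}
else
{
  m_dictOccurrencesCount.Add(businessKey, 1);
  Row.CustomerBusinessKeyOccurrence = 1;
}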

If the concatenated value becomes very long, you could calculate a hash value and add that to the dictionary instead (thanks for the suggestion, Joost). This would use less memory and might be faster, but the only way to find out if this is the case is testing .. testing ..
(I did not test this, to keep the demo simple and because performance did not seem to be an issue.)
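
As an illustration only, such a hash could be calculated with a small helper method like the one below. This is an untested sketch on my part, not something used in the demo; the method name HashKey is made up, and MD5 is just one of the hash algorithms available in the .NET Framework.

// In the Namespaces region, add (using System; is already there by default):
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper inside the ScriptMain class: returns a short,
// fixed-length hash of the concatenated key, so the dictionary stores
// less data per entry than the full concatenation.
private static string HashKey(string concatenatedKey)
{
  using (MD5 md5 = MD5.Create())
  {
    byte[] hashBytes = md5.ComputeHash(Encoding.UTF8.GetBytes(concatenatedKey));
    // Base64 of an MD5 hash is always 24 characters, regardless of the input length.
    return Convert.ToBase64String(hashBytes);
  }
}

Inside Input0_ProcessInputRow(Input0Buffer Row) you would then use HashKey(businessKey) as the dictionary key instead of businessKey itself.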

Thinking a bit further, you could use this concept also to calculate running totals.
Just configure the dictionary to hold a decimal instead of an int and you can calculate running totals for numeric values (e.g. sales figures) easily.
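
To sketch that idea: the SalesAmount input column and RunningTotal output column below are hypothetical, they are not part of the demo package.

// Dictionary holding the running total per business key.
private Dictionary<string, decimal> m_dictRunningTotals;

public override void PreExecute()
{
  base.PreExecute();
  m_dictRunningTotals = new Dictionary<string, decimal>();
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
  // Add the current row's amount to the running total for this business key.
  if (m_dictRunningTotals.ContainsKey(Row.CustomerBusinessKey))
  {
    m_dictRunningTotals[Row.CustomerBusinessKey] += Row.SalesAmount;
  }
  else
  {
    m_dictRunningTotals.Add(Row.CustomerBusinessKey, Row.SalesAmount);
  }
  // Pass the running total so far downstream.
  Row.RunningTotal = m_dictRunningTotals[Row.CustomerBusinessKey];
}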

Conclusion / Wrap up

In SSIS you can use a Script Component within a Data Flow Task to detect duplicates in the source data.
This can be an alternative to the T-SQL ROW_NUMBER() function, when the source is not a SQL Server Database.

Credits
My thanks go to Joost van Rossum for reviewing the initial draft of this article. His input was very useful.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Hans Michiels

Hans is an independent Business Intelligence and Data Warehouse Consultant and Microsoft SQL Server Consultant, working in the Netherlands. He has been working in the software industry since 1996 and with SQL Server since 2001. Since 2008 his primary focus has been on data warehouse and business intelligence projects using Microsoft technology, with a Data Vault and Kimball architecture. He has a special interest in Data Warehouse Automation and metadata driven solutions.

  • Certified in Data Vault Modeling: Certified Data Vault 2.0 Practitioner, CDVDM (aka Data Vault 1.0)
  • Certified in MS SQL Server: MCSA (Microsoft Certified Solutions Associate) SQL Server 2012, MCITP Business Intelligence Developer 2005/2008, MCITP Database Developer 2005/2008, MCITP Database Administrator 2005/2008


20 thoughts on “Detect duplicates in a Data Flow Task (SSIS Series)”

  1. Great post…
    Do we need to have the [CustomerBusinessKey] field sorted for the script to work?

    I mainly use HashBytes (MD5) for duplicates and SCD.

    1. Hello Nik, sorting is not necessary. The dictionary in the C# script takes care of the counting, no matter in what order the rows flow through.

  2. Great,
    why is Row. not recognised? I tried to use it with the rows of my Excel file but it does not work.
    Any ideas please?
    string businessKey = string.Format("{0}|{1}|{2}|{3:yyyyMMdd}|{4}",
    Row.FirstName, Row.MiddleName, Row.LastName

    1. Hello ilhamK, from the information you give it is difficult for me to troubleshoot.
      At least Row.FirstName, Row.MiddleName and Row.LastName are examples from the post; you should replace them with your own column names.
      Also, for every number in curly brackets (here {0}{1}{2}{3}{4}) you need to pass a corresponding Row.SomeColumnName argument to string.Format.
      If you cannot get it working, I can have a look, if you send me your SSIS Package and other files or scripts needed to debug it.
      Please give me a week to look at it, as my time is limited.

  3. Hi Hans,

    This solution looks absolutely ideal. I have one problem: the field I’m testing for duplicates on is an integer, not a string, so the code segment fails with “cannot convert from ‘int’ to ‘string’”. Could you help / direct me to the adjustment I need to make to your code to reference it correctly? I’ve no C# experience unfortunately!

    Huge thanks

    1. Hi Alex,
      Could you try to replace all occurrences of “Dictionary<string, int>” in the script component with “Dictionary<int, int>”?
      I am pretty sure that will solve your problem.
      If not, please let me know and I’ll help you further.
      Best regards,
      Hans

      1. It worked perfectly, thank you kindly for your time, hugely appreciated.

        Kind regards,
        Alex

  4. Hi, thanks you so much for this tutorial.

    I would like to adapt this so that the behavior is the following :

    For each row containing a duplicated value, display the total number of duplicates in CustomerBusinessKeyOccurrence.

    For a unique CustomerBusinessKey, display 1.

    Is it at all possible ?

    BR,
    Gui

    1. Hello Gui,
      Thanks for your question. If I understand correctly, you want to mark the business keys that are unique. The problem is that you can’t do this in the data flow, because only after all rows have gone through the Data Flow Task can you derive which business keys are unique.
      If you follow the setup from my blog post and assume that the first occurrences go into a table named [dbo].[FirstOccurence] and the duplicates go into a table named [dbo].[Duplicates], you could find the unique business keys with this query:
      SELECT
        … columns ..
      FROM
        [dbo].[FirstOccurence]
      WHERE
        [BusinessKey] NOT IN
        (
          SELECT [BusinessKey]
          FROM [dbo].[Duplicates]
        );

      Hope this helps.
      Best regards,
      Hans

  5. Hi,
    thanks for this, but I want to detect a duplicate when for example the same email is assigned to different persons.

    1. Hello Fares,
      This is too little information for me to say something useful about the thing you want to achieve.
      Is there an entire email message in a column in the data flow?
      Best regards,
      Hans

  6. Hello Hans,
    We need to do a duplicates check on 3 columns, and if the 4th column is taken into consideration we need to check the 4th column as well
    (if we get 2 rows as duplicates we need to mark the first one as 1 and the 2nd as zero), else the 4th row will be considered as 1.
    Could you please help? We are using Impala, so we need to implement this in a script task.

    1. Hi Vikramjit, this is a very specific use case and I do not fully understand your requirement. Also I have never worked with Impala, so I am afraid I cannot help you.
      Maybe just one general remark: as you want to check on multiple columns, you could use the concatenation or hash tip described in the blog post.
