Tag Archives: BigData

Data Warehouse Automation, the way to go

Introduction

Due to other activities I was not in the opportunity to blog a lot in the past months. One of those activities was the creation of the web site

DataWarehouseAutomation.Guide
.
I think it was worth the pause. And also in the future it will have influence on my time available for blogging. But I love to dive into a more detailed technical post when I have some time.

I just wanted to let you know of a few blog posts on that new web site

www.dwa.guide
(the short URL will redirect to the long one):

Contact me if you want to discuss your data warehouse automation needs or concerns in a 1-on-1 conversation!

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

How to automate your SSIS package deployment and configuration (SSIS Series)




Last update: 20 May, 2017: made newer scripts available for download after giving a presentation for PASS Slowakia.

Enhancements in the newer scripts (may 2017)

  • can handle multiple folders/environments on the same SQL Server Instance.
  • can install ispac files that are located in subfolders.
  • can install individual SSIS packages (dtsx files) from Powershell (SQL Server 2016 only feature)

Introduction

Hi there, today I want to share my scripts on package deployment. As manual deployment can be cumbersome and prone to errors, I have automated this boring repetitive task.
This article is based on SQL Server 2016, but I have included scripts to be used for SQL Server 2012 and 2014.

Problem

How to automate SSIS Package deployment.

Solution

  1. Setting up folder and environment
  2. Deploy your SSIS Packages
  3. Setup package configuration (or: connect package- and project parameters to environment variables)
  4. Glueing everything together using Powershell

Setting up folder and environment

As you could read in one of my previous posts Using the project deployment model with environments I usually create a one-on-one relation between folders and environments.
So a folder = an environment, and within the folder several SSIS projects can exist.
What I have seen others do is creating a “Generic”environment. That is also a possibility, this is a matter of personal choice. I prefer having the environment having an explicit name identical to the foldername (e.g. DEV), just to decrease the risk that any changes are made in the wrong Generic environment variable values.

Our main goal for this step is to have a way to automate setting up a folder and environment, including all (scripted) environment variables. This script should be rerunable to be able to add or update environment variables when needed.
One thing you should know is that the database SSISDB has a number of stored procecures in the catalog schema, that I use to achieve this.
Well, here’s the script. Don’t bother copy and paste while you can download all scripts here.

pre_deployment.sql

--\ -------------------------------------------------------------------
---) Script to create folder, environment and settings.
--/ -------------------------------------------------------------------
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [SSISDB]
GO

DECLARE @environment_variables_table TABLE(
  [id] int identity(1, 1),
  --)> Just an autonumber to be able to go through the table without cursor.
  [variable_name] nvarchar(128),
  --) Name of the variable
  [data_type] nvarchar(128),
  --) Variable datatype e.g. [ Boolean | Byte | DateTime | Decimal | Double | Int16 | Int32 | Int64 | SByte | Single | String | UInt32 | UInt64 ]
  --) Check all possible values using this query: SELECT DISTINCT [ssis_data_type] FROM [SSISDB].[internal].[data_type_mapping]
  [sensitive] bit,
  --) Indication of the environment variable is sensitive (e.g. a password).
  [value] nvarchar(4000),
  --) The variable value.
  [description] nvarchar(1024),
  --) Extra description for the variable.
  [create_only] bit
  --) Indication that a variable should only be created when it does not exist yet, but should not be replaced when it already exists.
  --) (so the value in the script does not overwrite the existing value).
  --) In this way you can prevent later configuration changes in an environment being undone by the values in the script.
  )

DECLARE @id int
DECLARE @max_id int
DECLARE @folder_id int
DECLARE @folder_name sysname = N'unknown'
DECLARE @environment_name nvarchar(128) = N'unknown'
DECLARE @environment_description nvarchar(128) = N'unknown'
DECLARE @variable_name nvarchar(128)
DECLARE @data_type nvarchar(128)
DECLARE @sensitive bit
DECLARE @value sql_variant
DECLARE @description nvarchar(1024)
DECLARE @create_only bit --
DECLARE @exists bit
DECLARE @nsql nvarchar(max)
DECLARE @message nvarchar(255)

DECLARE @false BIT = 0, @true BIT = 1

--\
---) Environment settings.
---) This is a generic script for all environments, using the @@SERVERNAME (SQL Server Instance name)
---) to set values right for each environment.
--/

IF @@SERVERNAME LIKE 'VWS63-SQL1%1'
BEGIN
    SELECT
      @folder_name = N'DEV',
      @environment_name = N'DEV',
      @environment_description = N'My development environment'
END
IF @@SERVERNAME = '<SQL INSTANCE NAME>'
BEGIN
    SELECT
      @folder_name = N'TST',
      @environment_name = N'TST',
      @environment_description = N'My test environment'
END
IF @@SERVERNAME = '<SQL INSTANCE NAME>'
BEGIN
    SELECT
      @folder_name = N'PRD',
      @environment_name = N'PRD',
      @environment_description = N'My production environment'
END

IF @environment_name = N'unknown'
BEGIN
    RAISERROR('Unknown environment', 16, 1)
    -- Make sure that the script does not continue when the environment settings could not be set due to an unknown @@SERVERNAME.
    GOTO Finally
END

----\
-----) Create folder, when needed.
----/
IF NOT EXISTS (
    SELECT 1
    FROM [SSISDB].[internal].[folders] f
    WHERE f.name = @folder_name
    )
BEGIN
    SET @message = 'Folder “' + @folder_name + '” is being created.'
    RAISERROR(@message , 0, 1) WITH NOWAIT;
    EXEC [SSISDB].[catalog].[create_folder] @folder_name, @folder_id OUTPUT;
END

--\
---) Create environment, when needed.
--/
IF NOT EXISTS (
    SELECT 1
    FROM [SSISDB].[internal].[environments] e
    JOIN [SSISDB].[internal].[folders] f
        ON e.folder_id = f.folder_id
    WHERE e.[environment_name] = @environment_name
        AND f.name = @folder_name
    )
BEGIN
    SET @message = 'Environment “' + @environment_name + '” is being created.'
    RAISERROR(@message , 0, 1) WITH NOWAIT;
    EXEC [SSISDB].[catalog].[create_environment] @folder_name, @environment_name, @environment_description;
END

IF @environment_name = N'DEV'
BEGIN
  INSERT INTO @environment_variables_table
      ( [create_only], [variable_name], [data_type], [sensitive], [value], [description] )
  VALUES
  -- Just a few examples of environment variables, add your own here or generate this part from existing environments or deployed packages, see scripts below.
      ( @true, N'Parameter1', N'Int32', @false, N'1', N'Example parameter 1 (DEV).' )
    , ( @true, N'Parameter2', N'String', @false, N'Hello world from script (DEV).', N'Example parameter 2 (DEV).' )
    , ( @true, N'Parameter3', N'Decimal', @false, N'9.95', N'Example parameter 3 (DEV).' )
    , ( @true, N'ProjectParameter1', N'String', @false, N'Hello world from script (DEV).', N'Example projectparameter 1 (DEV).' )
    , ( @true, N'ProjectParameter2', N'Int32', @false, N'4', N'Example projectparameter 2 (DEV).' )
    
END

IF @environment_name = N'TST'
BEGIN
  INSERT INTO @environment_variables_table
      ( [create_only], [variable_name], [data_type], [sensitive], [value], [description] )
  VALUES
  -- Just a few examples of environment variables, add your own here or generate this part from existing environments or deployed packages, see scripts below.
      ( @true, N'Parameter1', N'Int32', @false, N'1', N'Example parameter 1 (TST).' )
    , ( @true, N'Parameter2', N'String', @false, N'Hello world from script (TST).', N'Example parameter 2 (TST).' )
    , ( @true, N'Parameter3', N'Decimal', @false, N'9.95', N'Example parameter 3 (TST).' )
    , ( @true, N'ProjectParameter1', N'String', @false, N'Hello world from script (TST).', N'Example projectparameter 1 (TST).' )
    , ( @true, N'ProjectParameter2', N'Int32', @false, N'4', N'Example projectparameter 2 (TST).' )
    
END

IF @environment_name = N'PRD'
BEGIN
  INSERT INTO @environment_variables_table
      ( [create_only], [variable_name], [data_type], [sensitive], [value], [description] )
  VALUES
  -- Just a few examples of environment variables, add your own here or generate this part from existing environments or deployed packages, see scripts below.
      ( @true, N'Parameter1', N'Int32', @false, N'1', N'Example parameter 1 (PRD).' )
    , ( @true, N'Parameter2', N'String', @false, N'Hello world from script (PRD).', N'Example parameter 2 (PRD).' )
    , ( @true, N'Parameter3', N'Decimal', @false, N'19.95', N'Example parameter 3 (PRD).' )
    , ( @true, N'ProjectParameter1', N'String', @false, N'Hello world from script (PRD).', N'Example projectparameter 1 (PRD).' )
    , ( @true, N'ProjectParameter2', N'Int32', @false, N'4', N'Example projectparameter 2 (PRD).' )
    
END

SELECT @id = 1, @max_id = MAX([id]) FROM @environment_variables_table
WHILE @id <= @max_id
BEGIN
    SELECT
      @variable_name = v.variable_name,
      @data_type = v.data_type,
      @sensitive = v.sensitive,
      @value = v.value,
      @description = v.[description],
      @create_only = v.[create_only],
      @exists = 0
    FROM @environment_variables_table v
    WHERE [id] = @id;

    IF EXISTS (
        SELECT 1
        FROM
            [SSISDB].[internal].[environment_variables] v
        JOIN
            [SSISDB].[internal].[environments] e ON e.environment_id = v.environment_id
        JOIN
            [SSISDB].[internal].[folders] f
            ON e.folder_id = f.folder_id
        WHERE
            v.[name] = @variable_name
            AND e.environment_name = @environment_name
            AND f.name = @folder_name
        )
    BEGIN
        IF @create_only = 1
        BEGIN
            SET @message = @variable_name + ' already exists and is not replaced.'
            RAISERROR(@message , 0, 1) WITH NOWAIT;
            SET @exists = 1;
        END ELSE BEGIN
            SET @nsql = N'EXECUTE [catalog].[delete_environment_variable] '
              + N'@folder_name = N'''+ @folder_name + ''', @environment_name = N''' + @environment_name + ''', '
              + N'@variable_name = N''' + @variable_name + ''''
            PRINT @nsql;
            EXEC sp_executesql @nsql;
        END
    END

    IF @exists = 0
    BEGIN
        SET @message = 'Creating variable “' + @variable_name + '”.';
        RAISERROR(@message , 0, 1) WITH NOWAIT;

        SET @nsql = N'EXECUTE [catalog].[create_environment_variable] '
          + N'@folder_name = N'''+ @folder_name + ''', @environment_name = N''' + @environment_name + ''', '
          + N'@variable_name = N'''+ @variable_name + ''', @data_type = N''' + @data_type + ''', '
          + N'@sensitive = ' + CONVERT(NVARCHAR, @sensitive) + ', @description = N''' + @description + ''', '
          + CHAR(13) + CHAR(10) + N'@value = ' +
          CASE UPPER(@data_type)
          WHEN 'String' THEN 'N''' + CONVERT(NVARCHAR(1000), @value) + ''' '
          ELSE CONVERT(NVARCHAR(1000), @value)
          END + '; '
        -- PRINT @nsql;
        EXEC sp_executesql @nsql;
    END
    SET @id = @id + 1
END

RAISERROR('Creating folder, environment and settings has completed ..', 0, 1) WITH NOWAIT;

Finally:

GO
--\
---) ON a server where the VARIABLES already exist, you can use this query to generate the 'VALUES' part of the INSERT command used above.
--/
/*
DECLARE @environment_name nvarchar(128) = N'DEV'

SELECT
    m.*,
    v.name,
    N', ( N''' + v.name + ''', N''' + m.[ssis_data_type] + ''', '
      + CONVERT(NCHAR(1), v.sensitive) + ', N'''
      + CASE v.sensitive
        WHEN 0 THEN CONVERT(NVARCHAR(4000), v.value)
        ELSE '< type sensitive value here TODO >' END
      + ''', N''' + v.description + ''' )' AS [sql_values_part]
FROM
    [SSISDB].[internal].[environment_variables] v
JOIN
    [SSISDB].[internal].[environments] e
   ON e.environment_id = v.environment_id
JOIN
    [SSISDB].[internal].[folders] f
    ON e.folder_id = f.folder_id
JOIN
    [SSISDB].[internal].[data_type_mapping] m
    ON m.sql_data_type = v.base_data_type
    AND ( m.sql_data_type != 'tinyint' OR m.ssis_data_type = 'Byte')
WHERE
    e.environment_name = @environment_name
*/

--\
---) ON a server where the PACKAGES are ALREADY DEPLOYED, you can use this query to generate the 'VALUES' part of the INSERT command used above.
---) Manually check the output before running the sql script!
--/
/*
SELECT
  N', ( N''' + q.[variable_name] + ''', N''' + q.[data_type] + ''', '
    + CONVERT(NCHAR(1), q.[sensitive]) + ', N'''
    + CASE q.sensitive
        WHEN 0 THEN CONVERT(NVARCHAR(4000), q.value)
        ELSE '< type sensitive value here TODO >' END
    + ''', N''' + q.[description] + ''' )' AS [sql_values_part]
FROM
  (
  SELECT TOP 2147483647
      prm.[parameter_name] AS [variable_name],
      prm.[parameter_data_type] AS [data_type],r
      prm.[sensitive],
      COALESCE(prm.[default_value], prm.[design_default_value]) AS [value],
      prm.[description],
      ROW_NUMBER() OVER (PARTITION BY prm.[parameter_name] ORDER BY prm.[parameter_name]) AS [Occurrence]
  FROM
      [SSISDB].[internal].[object_parameters] prm
  -- [variable_name], [data_type], [sensitive], [value], [description]
  WHERE
      prm.[value_type] = 'V'
      AND prm.[parameter_name] NOT LIKE 'CM.%'
      AND LEFT(prm.[parameter_name], 1) != '_'
      AND NOT ( prm.[object_type] = 30 AND LEFT(prm.[object_name], 1) = '_') -- Naming convention for internally used SSIS Packages: start with _
  ORDER BY
      prm.[parameter_name]
  ) q
WHERE
  q.Occurrence = 1
*/

GO

ssis-e10-428
Example output of this script, when executed using Management Studio.

ssis-e10-429
And the second time, when a parameter is added and “create_only” is true for the existing parameters.

Deploy your SSIS Packages

This is about how to deploy an .ispac file in an automated way.
An .ispac file is a zipfile that contains all packages in a SSIS project that can be deployed all at once. You create it when you chose Build or Rebuild in Visual Studio when you have the SSIS project open.
By the way, if you rename it and give it a .zip extension, you can simply look what is inside.

Amongst the files you can download here there are demo SSIS projects that I used for this on SQL Server 2012, 2014 and 2016.

ssis-e10-423
The demo SSIS project ..
ssis-e10-424
Menu-option to rebuild the SSIS solution ..
ssis-e10-425
ssis-e10-426
After rebuilding the solution from the menu you will find the ispac files in the bin\<configuration name> folder of each project. In the screenprints above you can see how the bin subfolder is related to the chosen configuration in Visual Studio.

To install an ispac file, I use the command line options of the Integration Services Deployment Wizard.

For a default installation you can find the Integration Services deployment wizard (ISDeploymentWizard.exe) here:

Version Path
SQL Server 2012 C:\Program Files\SQL Server\110\DTS\Binn
SQL Server 2014 C:\Program Files\SQL Server\120\DTS\Binn
SQL Server 2016 C:\Program Files\SQL Server\130\DTS\Binn

You will find the exact command line in one of the following paragraphs, where all steps are glued together.
On 64-bit machines there is also a 32-bit version of this tool, but to be honest I do not see a reason why you would want to use the 32-bit version instead of the 64-bit, therefore I did not list the paths here.

ssis-e10-427
After installation you can find the folder, project and environment here ..

Setup package configuration

After the packages/SSIS projects are deployed, the next thing to do is adding an environment reference to the project(s) and connecting environment variables to package parameters and project parameters, based on the name.
I use this script for that:

post_deployment.sql

--\ -------------------------------------------------------------------
---) Script to add environment reference to project(s) and to connect
---) environment variables to package parameters, based on the name
--/ -------------------------------------------------------------------
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [SSISDB]
GO

DECLARE @object_parameter_value_table TABLE(
  [id] int identity(1, 1),
  [object_type] smallint,
  [object_name] nvarchar(260),
  [parameter_name] nvarchar(128),
  [project_name] NVARCHAR(128)
  )

DECLARE @project_names_table TABLE(
  [id] int identity(1, 1),
  [project_name] NVARCHAR(128)
  )

DECLARE @message nvarchar(255)

DECLARE @id int
DECLARE @max_id int
DECLARE @folder_name sysname = N'unknown'
DECLARE @environment_name nvarchar(128) = N'unknown'
DECLARE @environment_description nvarchar(128) = N'unknown'
DECLARE @project_name NVARCHAR(128)

DECLARE @object_type SMALLINT
DECLARE @object_name NVARCHAR(260)
DECLARE @value_type CHAR(1)
DECLARE @parameter_name NVARCHAR(128)
DECLARE @parameter_value SQL_VARIANT

DECLARE @reference_id BIGINT

--\
---) Environment settings
--/

IF @@SERVERNAME LIKE 'VWS63-SQL1%1'
BEGIN
    SELECT
      @folder_name = N'DEV',
      @environment_name = N'DEV',
      @environment_description = N'My development environment'
END
IF @@SERVERNAME = '<SQL INSTANCE NAME>'
BEGIN
    SELECT
      @folder_name = N'TST',
      @environment_name = N'TST',
      @environment_description = N'My test environment'
END
IF @@SERVERNAME = '<SQL INSTANCE NAME>'
BEGIN
    SELECT
      @folder_name = N'PRD',
      @environment_name = N'PRD',
      @environment_description = N'My production environment'
END

IF @folder_name = N'unknown'
BEGIN
    RAISERROR('Unknown environment', 16, 1)
    -- Make sure that the script does not continue when the environment settings could not be set due to an unknown @@SERVERNAME.
    GOTO Finally
END

INSERT INTO @project_names_table([project_name])
  SELECT
      p.name
  FROM
      [SSISDB].[internal].[folders] f
  JOIN
      [SSISDB].[internal].[environments] e
      ON e.folder_id = f.folder_id
  JOIN
      [SSISDB].[internal].[projects] p
      ON p.folder_id = f.folder_id
  WHERE
      f.name = @folder_name
      AND e.environment_name = @environment_name
      -- Optional project_name filter here:
      -- AND p.name IN ('Project_1', 'Project_2')

--\
---) Add environment reference to project(s).
--/
SELECT @id = 1, @max_id = MAX([id]) FROM @project_names_table
WHILE @id <= @max_id
BEGIN
    SELECT
      @project_name = v.[project_name]
    FROM @project_names_table v
    WHERE [id] = @id;

    IF NOT EXISTS( SELECT 1
        FROM
            [SSISDB].[internal].[folders] f
        JOIN
            [SSISDB].[internal].[environments] e
            ON e.folder_id = f.folder_id
        JOIN
            [SSISDB].[internal].[projects] p
            ON p.folder_id = f.folder_id
        JOIN
            [SSISDB].[internal].[environment_references] r
            ON r.environment_name = e.environment_name
            AND p.project_id = r.project_id
        WHERE
            f.name = @folder_name
            AND e.environment_name = @environment_name
            AND p.name = @project_name
        )
    BEGIN
        SET @message = 'An environment reference for project “' + @project_name + '” is being created.'
        RAISERROR(@message , 0, 1) WITH NOWAIT;

        EXEC [SSISDB].[catalog].[create_environment_reference]
            @environment_name=@environment_name, @reference_id=@reference_id OUTPUT,
            @project_name=@project_name, @folder_name=@folder_name, @reference_type='R';
        --Select @reference_id
    END

    SET @id = @id + 1
END

--\
---) Connect environment variables to package parameters, based on the name
--/
INSERT INTO @object_parameter_value_table (
    [object_type],
    [object_name],
    [parameter_name],
    [project_name]
    )
SELECT
    prm.[object_type],
    prm.[object_name],
    prm.[parameter_name],
    prj.name
FROM
    [SSISDB].[internal].[folders] f
JOIN
    [SSISDB].[internal].[environments] e
    ON e.folder_id = f.folder_id
JOIN
    [SSISDB].[internal].[environment_variables] ev
    ON e.environment_id = ev.environment_id
JOIN
    [SSISDB].[internal].[projects] prj
    ON prj.folder_id = f.folder_id
JOIN
    @project_names_table prjsel
    ON prjsel.project_name = prj.name
JOIN
    [SSISDB].[internal].[object_parameters] prm
    ON prj.project_id = prm.project_id
    AND prm.parameter_name = ev.name
WHERE
    prm.[value_type] != 'R'
    AND prm.value_set = 0
    AND prm.[parameter_name] NOT LIKE 'CM.%'
    AND LEFT(prm.[parameter_name], 1) != '_' -- Naming convention for internally used parameters: start with _
    AND NOT ( prm.[object_type] = 30 AND LEFT(prm.[object_name], 1) = '_') -- Naming convention for internally used SSIS Packages: start with _
    AND f.name = @folder_name
    AND e.environment_name = @environment_name
ORDER BY
    prm.object_name, prm.parameter_name

SELECT @id = 1, @max_id = MAX([id]) FROM @object_parameter_value_table
WHILE @id <= @max_id
BEGIN
    SELECT
        @object_type = v.[object_type],
        @object_name = v.[object_name],
        @parameter_name = v.[parameter_name],
        @project_name = v.[project_name]
    FROM
        @object_parameter_value_table v
    WHERE
        [id] = @id;

    SELECT @value_type = 'R', @parameter_value = @parameter_name;
    
    SET @message = 'Parameter “' + @parameter_name + '” (of object “' + @object_name + '”) is mapped to environment variable.'
    RAISERROR(@message , 0, 1) WITH NOWAIT;

    EXEC [SSISDB].[catalog].[set_object_parameter_value]
      @object_type, @folder_name, @project_name, @parameter_name,
      @parameter_value, @object_name, @value_type

    SET @id = @id + 1
END

RAISERROR('Adding environment reference to project(s) and mapping environment variables to package parameters has completed ..', 0, 1) WITH NOWAIT;

Finally:

GO

Also important is that I use a naming convention to decide whether or not to create a link between a package parameter and an environment variable.
For this I borrowed a naming convention from c# coding, where a variable starting with an underscore is a private variable within a class.
So what I have done is the following:

  • To exclude one package parameter from being mapped to an environment variable, start the name with _ (underscore). I know you could just use parameter names that do not exist as environment variables, but this naming convention makes your design more clear. The naming convention also makes it possible to generate the script to create environment variables after you have initially deployed your packages, you can then just skip the package parameter names that start with _.
  • To exclude all package parameters for a SSIS package, start the package name with _ (e.g. when this package is always executed from other packages and does not need environment mappings).

For the demo I use a solution with two projects, one of them with three packages, and the second one with just one package. This is done to demonstrate how connecting package- and project parameters to environment variables (step 3) works.

Project SSIS-E10-P1

  • ExampleMaster.dtsx: contains Parameter1 and Parameter2. Both parameters should be mapped to environment variables with the same name.
  • Export_Persons.dtsx: contains Parameter1, Parameter2 and _Parameter3. Only the parameters not starting with the underscore should be mapped to the environment variables with the same name.
  • _LoadDimPerson.dtsx: contains Parameter1, Parameter2 and Parameter4. Not a single parameter should be mapped to an environment variable, because the package name starts with _
  • The project contains ProjectParameter1 which should be mapped to an environment variable with the same name.

Project SSIS-E10-P2

  • AnyPackage.dtsx: contains Parameter1 and Parameter5. Only Parameter1 will be mapped, because no environment variable with name Parameter5 exists.
  • The project contains ProjectParameter1 and ProjectParameter2. Both parameters should be mapped to environment variables with the same name.
  • ssis-e10-431
    Example output windows when this script is run after having deployed the ispac files.

    ssis-e10-430
    And the result: the environment is “selectable” and the parameters show that they are mapped to an environment variable by having an underlined environment variable name in the “Value” column.

    Glueing everything together using Powershell

    Alright, what do we have so far?

    1. A sql script for setting up the folder and environment with variables.
    2. One or more .ispac files that can silently be deployed using ISDeploymentWizard.exe and its command line options.
    3. A sql script for connecting package- and project parameters to environment variables.

    So our PowerShell script simply needs to execute a sql script, then deploy the packages, and then execute another sql script. That shouldn’t be too difficult.
    This is the script that does exact those steps. If I have some time left I want to split the configuration and the actual script over different files/scripts.

    deployment_x64.ps1
    ######################################################################
    # Script:
    # ssis_deployment_script.ps1
    # Author:
    # Hans Michiels
    # Description:
    # This scripts consists of three parts:
    # 1. Executes pre_deployment.sql, which sets up an environment
    # with its variables.
    # 2. Installs one or more SSIS Deployment packages (ispac files)
    # that are in the same folder as this script.
    # 3. Executes post_deployment.sql, which configures the installed
    # packages to use the environment variables.
    #
    # (c) Copyright 2016 - hansmichiels.com
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program. If not, see http://www.gnu.org/licenses/.
    #
    ######################################################################

    cls
    # Get the folder where this script is located
    $ScriptPath = split-path -parent $MyInvocation.MyCommand.Definition
    # Write-Output $ScriptPath;

    ######################################################################
    # CONFIGURATION
    ######################################################################
    # Configure SQL instance name and (SSIS) foldername here.
    $SQLInstance = “VWS63-SQL161”
    $FolderName = “DEV”
    # SQL Server 2016 default location:
    $ISDeploymentWizard = “C:\Program Files\Microsoft SQL Server\130\DTS\Binn\ISDeploymentWizard.exe”
    # My location:
    # $ISDeploymentWizard = “C:\MSSQL\Shared features\130\DTS\Binn\ISDeploymentWizard.exe”

    ######################################################################
    # MAIN SCRIPT
    ######################################################################
    # \
    # 1. Calls pre_deployment sql script
    # /
    $File = $ScriptPath + “\pre_deployment.sql”
    sqlcmd -S $SQLInstance -i $File

    # \
    # 2. Installs each ispac file found in the folder where the powershell script is located.
    # /
    Get-ChildItem $ScriptPath -Filter *.ispac |
    Foreach-Object {
        $ProjectName = $_.BaseName
        Write-Host “Deploying project $ProjectName ..”;
        & $ISDeploymentWizard /Silent /ModelType:Project /SourcePath:”$ScriptPath\$ProjectName.ispac” /DestinationServer:”$SQLInstance” /DestinationPath:”/SSISDB/$FolderName/$ProjectName” | Out-Null
        Write-Host “Deployment of project $ProjectName completed ..”;
    }
    # \
    # 3. Calls post_deployment sql script
    # /
    $File = $ScriptPath + “\post_deployment.sql”
    sqlcmd -S $SQLInstance -i $File

    And this is the result of the execution of the Powershell script. From now on no more manual deployments!
    ssis-e10-435Powershell output window ..

    Download all scripts and SSIS Solutions for SQL Server 2012, 2014 and 2016 here.

    Conclusion / Wrap up

    The goal of this blog post was to explain to you how to automate the deployment of SSIS packages including the configuration and environment setup.
    I have explained how:

    • to create a folder, an environment and environment variables in SSISDB using a sql script;
    • to deploy your SSIS Packages /ispac files using Powershell and the command line options of Integration Services Deployment Wizard;
    • to setup your package configuration after deployment (or: connect package- and project parameters to environment variables);
    • to execute all of the above using Powershell.

    You could also read that I use a naming convention to be able to exclude individual package parameters or all parameters of a package from being connected to environment variables.

    (c) 2016/2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.


    Using the project deployment model with environments (SSIS Series)

    Introduction

    Hi, as announced in my previous post of July, 14, I took a summer break. I was really useful and I enjoyed it.
    From now on I will try to write some interesting posts weekly again.
    Today I want to explain the project deployment model of SSIS, introduced already in SSIS 2012.
    Also you are gonna find a (hopefully useful) script below that creates folders, environments and environment variables, and creates connections from package parameters to environment variables.
    But that’s for later, first a little introduction.

    Folders and environments

    What I found very confusing when I started using the project deployment model, already some years ago, is that an environment in this context is not what I expected it to be.
    From a software development perspective an environment is an installed and working version of your software, optionally using one or more databases and configuration setttings exclusively for that environment. For instance a Development or a Test environment.
    But the project deployment model of SSIS does not agree with that definition.
    For the project deployment model an environment is only a set of values for all environment variables, and nothing more.
    Can you think of what that means? Suppose you have a server, both used for Development and Test, and you have databases and cubes with a “_DEV” or a “_TST” suffix to their names.
    But as far as SSIS is concerned, you can only install one version of an SSIS package! Yeah, you can execute it with a different set of parameter values stored in SSIS environment variables, but that’s it.
    You cannot have two different versions of the same package installed on the same instance of SQL Server – at least not using only SSIS Environments.
    I found it not so intuitive that you need to use folders inside the Integration Services Catalog to achieve this.
    So far the theory, let us try this stuff out together.

    Let’s try it

    I’ll just click an example together, you do not need to follow me, my goal is that you understand the concepts so that you can apply it to your own projects.
    By the way, I am using SQL Server 2016, but it should also work in SQL Server 2012 and 2014.

    My test setup is very basic:

    • I have made two databases, [SSIS_S01E08_DEV] AND [SSIS_S01E08_TST], as development- and testenvironment, respectively.
    • Both databases have a table [tbl].[Fruit], but the table in the DEV database has one extra column, [FruitDescription], to simulate a change that is under development but not deployed to test yet.
    • I have made one SSIS Package AddFruit.dtsx, that inserts a row into the table [tbl].[Fruit] passing the value for [FruitName] only. During the demo I will change the package so that it also provides the [FruitDescription] column value.

    Playing around with the example above, I can demonstrate the use of folders and environments when the package is deployed using the project deployment model.

    The initial SSIS Package

    The initial SSIS Package contains just one Execute SQL Task, one parameter named FruitName and one OLEDB Connection manager named MyDatabase.connectionmanager.
    The connection manager is configured by an expression so that the property ConnectionString gets the value of a project parameter also named ConnectionString as demonstrated in the images below.
    SSIS-S01E08-280The Execute SQL Task

    SSIS-S01E08-281The package parameter

    SSIS-S01E08-282The project parameter

    SSIS-S01E08-283The OLEDB connection manager configured with an expression
    SSIS-S01E08-284Then rebuild the Solution

    Deploying the SSIS Package

    Once the solution has been rebuilt, you can deploy the package with the Integration Services Deployment Wizard. You can also call this utility from the command line, but explaining that is out of scope for this blog post.
    To execute this wizard, you have to locate the .ispac file, it is in a subfolder of your projects bin folder.
    If you have created additional solution configurations in your Visual Studio solution, this subfolder might be different from the image below (where the subfolder is Development).
    SSIS-S01E08-285Locating the .ispac file, that was created when the solution was rebuilt.

    SSIS-S01E08-286First step of the wizard

    SSIS-S01E08-287The default deployment model is Project Deployment, the one you need. The other option is Package Deployment.

    SSIS-S01E08-288Select a server name and a path [1]

    SSIS-S01E08-289Select a server name and a path [2]: create a new folder

    SSIS-S01E08-290Select a server name and a path [3]: name the folder after your environment

    SSIS-S01E08-292Select a server name and a path [4]: then click Next

    SSIS-S01E08-293Review and deploy.

    SSIS-S01E08-295Deployment succeeded.

    SSIS-S01E08-296In the Integration Services Catalog you can find your project and package.

    SSIS-S01E08-297You could configure it now, but there are no SSIS environments yet (note: these are a set of environment variable values).

    SSIS-S01E08-298Don’t configure just yet. We will do this later.

    SSIS-S01E08-299Here you could create an environment .. don’t do it now !

    SSIS-S01E08-TST-environment_setup.sql

    --\
    ---) A script to create a folder and environment in an SSIS Catalog.
    ---) Author: Hans Michiels
    --/
    /*
    (c) Copyright 2016 - hansmichiels.com
     
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
     
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
     
    You should have received a copy of the GNU General Public License
    along with this program. If not, see http://www.gnu.org/licenses/.
    */

    USE [SSISDB]
    GO

    DECLARE @environment_variables_table TABLE(
      [id] int identity(1, 1),
      [variable_name] nvarchar(128),
      [data_type] nvarchar(128),
      [sensitive] bit,
      [value] nvarchar(4000),
      [description] nvarchar(1024)
      )

    DECLARE @object_parameter_value_table TABLE(
      [id] int identity(1, 1),
      [object_type] smallint,
      [object_name] nvarchar(260),
      [parameter_name] nvarchar(128)
      )

    DECLARE @id INT
    DECLARE @max_id INT
    DECLARE @folder_name sysname
    DECLARE @environment_name nvarchar(128)
    DECLARE @environment_description nvarchar(128)
    DECLARE @project_name nvarchar(128)
    DECLARE @variable_name nvarchar(128)
    DECLARE @data_type nvarchar(128)
    DECLARE @sensitive bit
    DECLARE @value sql_variant
    DECLARE @value_tinyint tinyint
    DECLARE @description nvarchar(1024)
    DECLARE @nsql nvarchar(max)

    DECLARE @object_type smallint
    DECLARE @object_name nvarchar(260)
    DECLARE @value_type CHAR(1)
    DECLARE @parameter_name nvarchar(128)
    DECLARE @parameter_value sql_variant

    --\
    ---) Environment settings
    --/
    SELECT
      @folder_name = N'TST',
      @environment_name = N'TST',
      @environment_description = N'My demo Test environment',
      @project_name=N'SSIS_S01E08'

    --\
    ---) Script to create environment and settings.
    --/
    IF NOT EXISTS (
        SELECT 1
        FROM [SSISDB].[internal].[environments]
        WHERE [environment_name] = @environment_name
        )
    BEGIN
        EXEC [SSISDB].[catalog].[create_environment] @folder_name, @environment_name, @environment_description;
    END

    INSERT INTO @environment_variables_table
      ( [variable_name], [data_type], [sensitive], [value], [description] )
      VALUES
          ( N'ConnectionString', N'String', 0, N'Data Source=.;Initial Catalog=SSIS_S01E08_TST;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;', N'Connection string for the database.' )
        , ( N'FruitName', N'String', 0, N'Tamarillo', N'' )
        --, ( N'Int32Example', N'Int32', 0, N'1', N'' )
        --, ( N'BooleanExample', N'Boolean', 0, N'1', N'' )
        --, ( N'ByteExample', N'Byte', 0, N'5', N'' )

    SELECT @id = 1, @max_id = MAX([id]) FROM @environment_variables_table
    WHILE @id <= @max_id
    BEGIN
        SELECT
          @variable_name = v.variable_name,
          @data_type = v.data_type,
          @sensitive = v.sensitive,
          @value = v.value,
          @description = v.[description]
        FROM @environment_variables_table v
        WHERE [id] = @id;

        IF EXISTS (
            SELECT 1
            FROM [SSISDB].[internal].[environment_variables] v
            JOIN [SSISDB].[internal].[environments] e ON e.environment_id = v.environment_id
            WHERE v.[name] = @variable_name
            AND e.environment_name = @environment_name
            )
        BEGIN
            SET @nsql = N'EXECUTE [catalog].[delete_environment_variable] '
              + N'@folder_name = N'''+ @folder_name + ''', @environment_name = N''' + @environment_name + ''', '
              + N'@variable_name = N''' + @variable_name + ''''
            PRINT @nsql;
            EXEC sp_executesql @nsql;
        END

        PRINT '/*'
        PRINT @variable_name
        PRINT CONVERT(nvarchar(128), SQL_VARIANT_PROPERTY(@value, 'BaseType'));
        PRINT '*/'

        SET @nsql = N'EXECUTE [catalog].[create_environment_variable] '
          + N'@folder_name = N'''+ @folder_name + ''', @environment_name = N''' + @environment_name + ''', '
          + N'@variable_name = N'''+ @variable_name + ''', @data_type = N''' + @data_type + ''', '
          + N'@sensitive = ' + CONVERT(NVARCHAR, @sensitive) + ', @description = N''' + @description + ''', '
          + CHAR(13) + CHAR(10) + N'@value = ' +
          CASE UPPER(@data_type)
          WHEN 'String' THEN 'N''' + CONVERT(NVARCHAR(1000), @value) + ''' '
          ELSE CONVERT(NVARCHAR(1000), @value)
          END + '; '
        PRINT @nsql;
        EXEC sp_executesql @nsql;

        SET @id = @id + 1
    END

    --\
    ---) Add environment reference to project.
    --/
    Declare @reference_id bigint
    IF NOT EXISTS(SELECT 1
      FROM [SSISDB].[internal].[environment_references] r
      JOIN [SSISDB].[internal].[projects] p
        ON p.project_id = r.project_id
      WHERE p.name = @project_name
        AND r.environment_name = @environment_name
      )
    BEGIN
        EXEC [SSISDB].[catalog].[create_environment_reference]
            @environment_name=@environment_name, @reference_id=@reference_id OUTPUT,
            @project_name=@project_name, @folder_name=@folder_name, @reference_type='R';
        --Select @reference_id
    END

    --\
    ---) Connect environment variables to package parameters, based on the name
    --/
    INSERT INTO @object_parameter_value_table (
        [object_type],
        [object_name],
        [parameter_name]
        )
    SELECT
        prm.[object_type],
        prm.[object_name],
        prm.[parameter_name]
    FROM
        [SSISDB].[internal].[object_parameters] prm
    JOIN
        [SSISDB].[internal].[environment_variables] ev
        ON ev.name = prm.parameter_name
    JOIN
        [SSISDB].[internal].[projects] prj
        ON prj.project_id = prm.project_id
    WHERE
        prm.[value_type] != 'R'
    AND
        prm.value_set = 0
    AND
        prj.name = @project_name
    ORDER BY
        prm.object_name, prm.parameter_name

    SELECT @id = 1, @max_id = MAX([id]) FROM @object_parameter_value_table
    WHILE @id <= @max_id
    BEGIN
        SELECT
          @object_type = v.[object_type],
          @object_name = v.[object_name],
          @parameter_name = v.[parameter_name]
        FROM @object_parameter_value_table v
        WHERE [id] = @id;

        SELECT @value_type = 'R', @parameter_value = @parameter_name;
        
        EXEC [SSISDB].[catalog].[set_object_parameter_value]
          @object_type, @folder_name, @project_name, @parameter_name,
          @parameter_value, @object_name, @value_type

        SET @id = @id + 1
    END

    GO

    My script that facilitates to set up multiple environments quickly (download all the used scripts here).

    In the demo it would not be a problem to do all the configuration by hand, but if you have a lot of SSIS Packages with a lot of parameters, you really do not want to do this, especially not if you have different environments, like Development, Test and Production.
    This is why I made a script that automates this work for me:

    • It creates an environment, if it doesn’t exist yet.
    • It (re)creates environment variables, defined in the script.
    • It adds an environment reference to the project, if it doesn’t exist yet.
    • It connects environment variables to package parameters, based on the name.
      For this to work, it is important that package parameters with the same name used in different packages mean the same thing!
      For instance if you have a parameter SourceFolder used in different packages, it should be the same folder! If not, rename the parameters so that they have a unique name and – when mapped to environment variables – can both be assigned a different value, e.g. CrmSourceFolder and ErpSourceFolder.

    We run the script for the TST environment.

    SSIS-S01E08-300The script was executed successfully.

    SSIS-S01E08-301After the script has run, a TST folder with an TST environment inside exists. You can view or edit its properties.

    SSIS-S01E08-302When the Variables pane is selected on the left, the environment variables, created by the script, are shown.

    SSIS-S01E08-303View package configuration.

    SSIS-S01E08-304The package parameter FruitName is now connected to environment variable FruitName.

    SSIS-S01E08-305When I choose Execute ..

    SSIS-S01E08-306… all I have to do is check Environment, it will automaticly show .\TST

    SSIS-S01E08-307After execution of the package, a row exists in the table (and one from a previous test)

    SSIS-S01E08-308We can also execute the package with a SQL Agent Job ..

    SSIS-S01E08-310Execute the package with a SQL Agent Job: all I have to do is check Environment, it will automaticly show .\TST ..

    SSIS-S01E08-311.. and all values are connected properly ..

    SSIS-S01E08-312And yes, it works!

    Extending the package with a FruitDescription

    I must admit, I cheated a little. I skipped the part of deploying the first version of the SSIS Package to the DEV environment.
    So now I am going to deploy to a new DEV Folder and Environment for the first time, while it is will be the second version of the package. This is not a recommended practice in real projects.
    Anyway, the package gets an extra parameter FruitDescription.
    SSIS-S01E08-313Added FruitDescription

    SSIS-S01E08-314Modify the Execute SQL Task to insert also the value for @FruitDescription [1]

    SSIS-S01E08-315Modify the Execute SQL Task to insert also the value for @FruitDescription [2]

    SSIS-S01E08-316Then Rebuild

    SSIS-S01E08-317Then run deployment wizard again, but now create a DEV folder

    SSIS-S01E08-318Script modifications [1]: use a DEV folder and a DEV environment name.

    SSIS-S01E08-319Add FruitDescription to the list of environment variables and change the values: connection string must use the DEV database and fruitname and -description something different than in the TST environment.

    Then run the script.
    SSIS-S01E08-320The package was deployed using the deployment wizard.

    SSIS-S01E08-321SSIS Environment was created using the script

    SSIS-S01E08-322Environment variables where added using the script

    SSIS-S01E08-323Environment variables were connected to package parameters using the script

    SSIS-S01E08-324Package execution test: note the new parameter FruitDescription

    SSIS-S01E08-325Test result

    SSIS-S01E08-326Final result: different versions of the same SSIS Package installed side by side on the same machine. But in addition to SSIS Environments with variable values, folders inside the Integration Services Catalog are used for that!

    Download the scripts here.

    Conclusion / Wrap up

    With the project deployment model of SSIS, introduced in SQL Server 2012, deployment and configuration of SSIS Packages has become easier.
    However it is important to understand how folders and environments in the Integration Services Catalog work. I have tried to explain this, and provided a script to set up multiple environments quickly.

    Why did you use those weird fruit examples?

    D=Development
    T=Test
    😉
    Read more here:
    Durian
    Tamarillo

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Implement logging in a Script Task (SSIS Series)

    Introduction

    This post is a follow-up on two older posts and will be the last one before my summer break.

    Today I want to enhance the SSIS package made in post Fixing corrupt csv files in a SSIS Package (SSIS Series) by using the Plug and Play Logging Solution.

    Problem

    As long as a script runs well, you might not need logging inside. BUT .. when something GOES wrong, having logging in the script can reduce the time needed to troubleshoot the issue.
    In the example for fixing csv files, anything can be wrong with a file, causing an error in the script. In that case it is for instance useful to know which file was being processed when the error occurred.

    Prerequisite

    For this to work, you have to install my Plug and Play Logging Solution, which can be downloaded from this post. It is just one SQL Script you have to run, that creates a [logdb] database with objects inside. You can also add those objects to a different database that already exists.
    It is however not mandatory to rebuild the Fixing corrupt csv files in a SSIS Package example, as you can also add logging to your own Script Task.

    Solution

    This post is a enhancement on the Fixing corrupt csv files in a SSIS Package post, so you might need to read that post first, if you did not do that already.

    Therefore I will dive directly into the changes that are needed to add logging in the Script Task.
    This will be done with lots of screenprints with some comment in between.

    SSIS-S01E07-169Adding a package parameter “LoggingConnectionString”
    First you need to add a package parameter LoggingConnectionString that can be used in an expression of the OLE DB Connection and as input variable for the Script Task.

    SSIS-S01E07-170OLE DB Connection configuration ..
    Then add an OLE DB Connection for the logging database.

    SSIS-S01E07-171OLE DB Connection properties ..
    Use an expression to let the ConnectionString of the OLE DB Connection be equal to the value of the package parameter LoggingConnectionString.

    SSIS-S01E07-173OLE DB Connection – Expression configuration [1]

    SSIS-S01E07-172OLE DB Connection – Expression configuration [2]

    SSIS-S01E07-174Final result
    By the fx icon you can see that the connection manager uses an expression.

    SSIS-S01E07-176Enable logging
    Now enable logging for the package.

    SSIS-S01E07-177Enable logging [2]
    Add a log provider for SQL Server and let it use the logdb.oledbConnection by selecting it under the Configuration column header.

    SSIS-S01E07-178Enable logging [3]
    Then select all events. Filtering on what is actually logged is done by the logging solution (by the value of @MaxMessageClass, see this blog post for more details).

    SSIS-S01E07-179Select the Script Task
    Select the Script Task and add the following Variables to ReadOnlyVariables:

    • System::ExecutionInstanceGUID
    • System::PackageID
    • $Package::LoggingConnectionString


    • SSIS-S01E07-180The added ReadOnlyVariables in the red rectangles

      Below you will find a number of screenprints of the script task to talk you through the changes.
      You can download the C# script here.

      SSIS-S01E07-181Overview
      First make sure the Namespaces region is as shown.
      Then fold the namespace with the guid in the name, and paste the entire namespace HansMichielsCom.PlugAndPlaySSISLoggingSolution underneath it.
      This code is put in a separate namespace, so that it could also be placed in a .NET assembly that is added to the GAC (Global Assembly Cache). When you would do this, you do not have to add the code to every Script Task.
      For the example of today, we just put this namespace inside the Script Task to make things not too complicated for now.

      SSIS-S01E07-182Using the HansMichielsCom.PlugAndPlaySSISLoggingSolution namespace
      As a result, you have to tell the first guid-like namespace, that you want to call code inside the second namespace. Therefore add the using statement as shown above.

      SSIS-S01E07-183Constant used for logging
      Below you will see some printscreens with changed parts in the script.

      SSIS-S01E07-184Method GetSsisLogWriter to instantiate a SsisLogWriter object

      SSIS-S01E07-187Method Main is extended with logging.

      SSIS-S01E07-188Pass the logWriter as parameter to other methods ..

      SSIS-S01E07-189IMPORTANT: Bugfix in CheckAndReturnHeader!

      SSIS-S01E07-190IMPORTANT: Bugfix in CheckAndReturnHeader!
      (header == null) is added to cope with empty files.

      Testing, testing, one, two ..



      SSIS-S01E07-191Test preparations [1]

      SSIS-S01E07-193Test preparations [2]

      SSIS-S01E07-194Test execution

      SSIS-S01E07-195Test result: logging rows done inside the script are in the log table.

      Conclusion / Wrap up

      In this post I have demonstrated how to implement logging in SSIS Script Tasks using my Plug and Play Logging Solution.
      This type of logging gives more control on what to log and how to log it than when you implement logging using SSIS events.
      The examples given are very basic. You can use your imagination to implement logging of errors using a try .. catch block, or use all available parameters of logWriter.AddLogEntry to change the Retention Class, Message Class, and so on.

      In the summer I will take some time for study, reflection, holiday, and still .. work.
      My next post will be early September at the latest, maybe earlier.

      Download the C# script here.

      (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Zeros, bloody zeros! (Data Vault Series)

    Introduction

    I must admit I have a weakness for British humour.
    When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
    For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
    It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.

    DV-S01E05-meetingsMr John Cleese

    But, obviously, this post is not about meetings but about zeros.

    Problem

    I can be short about that: leading zeros in business key values.
    For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
    This would not be a problem it is would always be delivered exactly like that. But to be honest, you can and will not know that upfront. Even this might be the case now, it might not be in the future.
    When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem.
    Because every business key gets a different row in the hub, and this customer now exists multiple times!

    DV-S01E05-zeros(No acting here, this is how I look sometimes)

    Solution

    If you are familiar with Data Vault, you might already think of same-as-links to solve this.
    But I think the solution should be implemented earlier, to avoid having multiple hub rows.
    Simply always remove leading zeros when the sourcecolumn is (part of) a business key (either primary or foreign key) and seems a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty impossible that 001806 and 1806 would refer to two different customers.
    Unless, of course, they would come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when then leading zeros dropped off, satellite rows of different customers (in different source systems) could end up as connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance, CRM.1806 and ERP.1806.
    In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.

    How to implement the solution

    There are many ways to remove leading zeros. When I was searching for this I had two requirements:

    • No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
    • No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.

    After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, my credits for his work), but appeared to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
    So I had to add some logic to it: a SELECT CASE in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) to the SSIS expression.
    Furthermore I came on a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
    For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
    So .. this is what it has become:

    T-SQL

    Remove_leading_zeros.sql

    SELECT
        example.[id_with_leading_zeros],
       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
       END AS [id_without_zeros_method1],

       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE REPLACE(REPLACE(LTRIM(REPLACE(-- Robin Hames' method
                REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
                , '0', ' ') -- replace '0' with ' '
                ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
                , ' ', '0') -- change ' ' back to '0'
                , '!#!', ' ') -- change '!#!' back to ' '
       END AS [id_without_zeros_method2]
    FROM
        (
        SELECT
            TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
        FROM
            master.sys.objects
        UNION
        SELECT N' 00000 '
        UNION
        SELECT N'00'
        UNION
        SELECT N' '
        UNION
        SELECT ' 0099990 A '
        UNION
        SELECT '-5550'
        ) example

    SSIS Expression (can be used in Derived Column)

    (LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

    DV-S01E05-151In a Derived Column Transformation this looks for instance like this

    Conclusion / Wrap up

    In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
    This post also contains different ways to remove leading zeros, two for T-SQL and one for a SSIS expression.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Zip sourcefiles in a SSIS Package (SSIS Series)

    Introduction

    As I am in a (un)zipping mood, this week the opposite problem of last week.
    The example is not so ready-to-use as in my previous post, as it depends on choices made in your development process of other SSIS packages.
    Still I hope you can adjust this example to match your needs.

    Problem

    If you have imported and archived (csv- or xml) files, these files can take up a lot of disk space. The archive folder can also become overwhelmingly full with ten thousands of files, which can make it slowly accessible and difficult to manage.

    Solution

    To tackle this problem I have made a fully configurable SSIS package, that can be executed from a different package with an Execute Package Task.
    The package is made using SSIS 2014.

    The zipping itself is done with 7za.exe, for more information on how to get and install this free program see my previous blog post.

    Scenario

    When I used this concept, a different SSIS package moved files to a subfolder of an archive folder after they were processed, but their name was prefixed with the LoadCycleID with leading zeros, for instance 001345_customers.xml.

    The archive folder and subfolder were separated, because the archive folder was an environment variable and the same for all packages, while the subfolder was different for different SSIS packages, and contained the name of the source system of the files.
    If you only have one archive folder, just change the example to use that instead of the concatenation of archive folder and subfolder.

    Overview

    Here is a picture of the end result:

    SSIS-S01E06-124
    The Control Flow

    Detailed steps

    Parameters and variables

    First you need to add a few parameters and variables to the package.

    Parameters:

    Name Data type Required Description
    _7zaPathAndFilename String False The full path and filename of 7za.exe.
    ArchiveFolder String True The root folder for archiving processed files.
    ArchiveSubFolder String True The subfolder of the ArchiveFolder in which the files to be zipped exist, the subfolder can for instance be named after a source system like SAP or CRM.
    LoadCycleID Integer True The ID of the LoadCycle that is processing these files. In your situation it could also be called ExecutionID, SessionID and so on, the name is not really important.

    SSIS-S01E06-125
    Package parameters

    Variables:

    However not all variables might be necessary, I like the concept of being able to configure all values in one place.
    The data type of all variables is String, except LoadCycleID which is an Integer.
    Every variable uses an expression to set its value.
    Here is a detailed list of all variables and the expression used for each one.

    Name Expression
    ArchiveFilesToDeleteInFolder @[$Package::ArchiveFolder] + "\\" + @[$Package::ArchiveSubFolder]
    ArchiveFilesToDeletePattern @[User::LoadCycleIDString] + "*"
    ArchiveFilesToZipPattern @[$Package::ArchiveFolder] + "\\" + @[$Package::ArchiveSubFolder] + "\\" + @[User::LoadCycleIDString] + "*.*"
    ArchiveFilesWorkingDirectory @[$Package::ArchiveFolder]
    ArchiveZipFile @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
    ArchiveZipPathFile_Final @[$Package::ArchiveFolder] + "\\" +@[$Package::ArchiveSubFolder] + "\\" + @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
    ArchiveZipPathFile_Temp @[$Package::ArchiveFolder] + "\\" + @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
    LoadCycleID @[$Package::LoadCycleID]
    LoadCycleIDString RIGHT("00000" + (DT_STR, 10, 1252) @[$Package::LoadCycleID], 6)

    Important note: if you expect that the value for LoadCycleID will grow beyond 999999 over time, you should add more prefixed zero’s to LoadCycleIDString and increase the number 6, the second parameter of the RIGHT function.

    SSIS-S01E06-126
    SSIS variables of the package

    The Control Flow

    The control flow is quite straight forward, only three tasks and we’re done.

    EPR Zip files in archive subfolder

    This task creates one zipfile containing multiple other files in the ArchiveSubFolder.
    From the SSIS Toolbox drag an Execute Process Task to the Control Flow panel and open the Execute Process Task Editor.
    On the General and Process panes you do not have to change anything.

    SSIS-S01E06-128
    SSIS-S01E06-129
    No changes needed here

    On the Expressions pane, press the ellipsis next to Expressions. In the Property Expressions Editor add the following:

    Property Expression
    Arguments "a "+ "\"" + @[User::ArchiveZipFile] + "\" " + "\"" + @[User::ArchiveFilesToZipPattern] + "\" -mx=9"
    Executable @[$Package::_7zaPathAndFilename]
    WorkingDirectory @[User::ArchiveFilesWorkingDirectory]

    Please note that in the Arguments expression -mx=9 is used. This is the compression level, level 9 means “ultra” compression. This level gives the smallest zip files but is also the slowest compression level. You can for instance also use 5 for “normal” or 3 for “fast” compression, if this is what you want.

    SSIS-S01E06-127
    Expressions for the Execute Process Task

    SCR Delete uncompressed files

    This task deletes the files that were added to the zip file in the previous task.

    In the Script Task Editor, add both variables User::ArchiveFilesToDeleteInFolder and User::ArchiveFilesToDeletePattern to ReadOnlyVariables, then press Edit Script ….

    Then in the namespaces region add one namespace, as indicated below.

    Namespaces region
    #region Namespaces
    using System;
    using System.Data;
    using Microsoft.SqlServer.Dts.Runtime;
    using System.Windows.Forms;
    // Added namespace:
    using System.IO;
    #endregion

    Adjust public void main() as follows:

    public void Main

    public void Main()
    {
      // Copy SSIS variable values to C# string variables.
      string archiveFilesToDeleteInFolder = Dts.Variables["User::ArchiveFilesToDeleteInFolder"].Value.ToString();
      string archiveFilesToDeletePattern = Dts.Variables["User::ArchiveFilesToDeletePattern"].Value.ToString();

      // Delete each file that just has been added to the zip archive.
      foreach (string f in Directory.EnumerateFiles(archiveFilesToDeleteInFolder, archiveFilesToDeletePattern))
      {
        File.Delete(f);
      }

     Dts.TaskResult = (int)ScriptResults.Success;
    }


    WARNING: Make sure that no files that meet the file name pattern of files to add to the zip file are added to the ArchiveSubFolder while or after the zipfile is (being) created. Otherwise files that are NOT zipped might get deleted!

    Download the script here.

    FST Move zip file

    The File System Task is to move the zip file to the ArchiveSubFolder, because the zipfile is initially created in a temporary folder and not in the ArchiveSubFolder.
    This is to prevent that it would be deleted by coincidence, if you use a value for ArchiveFilesToDeletePattern that ends with .*
    Configure the File System Task as shown in the picture below:

    SSIS-S01E06-130
    File System Task Configuration

    Download the script here.

    Let the car drive

    Time to test the package and tell the result in pictures (if you don’t mind).

    SSIS-S01E06-131
    Before: files of LoadCycleID 001346 need to be zipped ..

    SSIS-S01E06-132
    Execute the package ..

    SSIS-S01E06-133
    All csv files replaced with one zip file ..

    SSIS-S01E06-134
    .. that contains the csv files ..

    Conclusion / Wrap up

    In this blog post you could read the steps needed to build a SSIS package that creates a zip file containing a number of files in an archive(sub)folder.
    This makes the archive smaller and easier to manage.
    However beware that you only delete files that were added to the zip file. For more details see the warning above.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Unzip specific files from lots of zipped archives (SSIS Series)

    Introduction

    This week I want to discuss the trouble you may have when searching hundreds of zip files for specific files inside them.
    Recognizable?
    At least I faced this problem some time ago, and solved it using SSIS and the commandline version of 7-zip, named 7za.exe.
    The choice for SSIS was purely based on convenience, as you could tackle this problem also using Powershell, C# or almost any other scripting language.
    But I thought SSIS is just a neat and quick way to do this.

    Problem

    Finding specific files in huge amounts of zip files.

    Solution

    To tackle this problem I have made a fully configurable SSIS package, that can be executed with different parameters.
    The main task of the package is to extract specific files (based on a file name pattern) from zipfiles in a folder and – optionally – subfolders.
    The unzipping itself is done with 7za.exe, which is the commandline version of 7z, the great open source zip program, that exists already for several years and is used by millions of users.

    By the way, the package is made using SSIS 2014.

    Preparations: download and install 7za

    It is important to download this package from the official 7z web site, just to avoid that you would get a copy from some deceitful web site infected with malware.
    To go to the official 7-zip download page click on this link:
    7-zip download page
    Most important is the 7-Zip Extra: standalone console version.
    For the example I have used version 16.02 (2016-05-21) for Windows, but it should also work with version 9.20 (2010-11-18), then called 7-Zip Command Line Version.

    If you download the newest version, you also need software to unzip the 7z160-extra.7z file, for instance the regular 32-bit or 64-bit version of 7-zip, which can also be downloaded on the same page.

    SSIS-S01E05-100
    Download page of 7-zip

    I assume you will use the newest version 7z160-extra.7z
    Once unzipped, just take the time to read License.txt, readme.txt and history.txt.
    Then copy the three files 7zxa.dll, 7za.dll and 7za.exe to a folder on the Windows computer where SSIS runs, usually a Windows Server.
    If you run a x64 version of Windows, you can take the x64 version of the 7za files, which are in a subfolder “x64” of the unzipped files.

    SSIS-S01E05-101
    Unzipped 7za files ..

    I have created a subfolder “7za” in the “C:\Program files” folder of my virtual Windows Server, and copied the x64-version of the three mentioned files to it, but you can put them in any folder you like.
    When you have done this, the real fun can start ..

    Overview

    Here is a picture of the end result:

    SSIS-S01E05-102
    The Control Flow

    Detailed steps

    Parameters and variables

    First you need to add a few parameters and variables to the package.

    Parameters:

    Name Data type Required Description
    _7zaPathAndFilename String False The full path and filename of 7za.exe.
    FilenamePattern String True The filename pattern of files you want to unzip.
    SourceRootFolder String True The main folder where you want to search for zip files.
    TraverseSubfolders Boolean False Indication if you want to look only in SourceRootFolder, or also in its subfolders.
    UnzipRootFolder String True The folder to which files are unzipped. Important: this may NOT be a subfolder of SourceRootFolder.
    ZipFilenamePattern String True The filename pattern of the zipfiles from which you want to unzip files.

    SSIS-S01E05-103
    Package parameters

    Variables:

    Name Data type Description
    FLC_UnzipFolder String The subfolder where the files will be unzipped. This is a subfolder of the UnzipRootFolder, with the same relative subpath as the zipfile has in SourceRootFolder. Additionally, the zipfilename has been replaced with a subfolder that is the same as the zipfilename without the file extension.
    FLC_ZipPathAndFileName String The zipfile full path, that is target of investigation if any files with FilenamePattern are zipped inside.

    SSIS-S01E05-104
    SSIS variables needed to do the job

    Developing the package

    FELC Unzip files

    This foreach loop container finds all zip files.
    In the Foreach Loop Editor, select Collection in the left pane.
    As we want to make the solution generic by using all those parameters mentioned above, we have to add a few expressions.

    SSIS-S01E05-105
    Foreach Loop container Collection configuration

    Add the following three expressions (in my opinion the editor is not very userfriendly, the window is too small, but it can be done):

    Property Expression
    Directory @[$Package::SourceRootFolder]
    FileSpec @[$Package::ZipFilenamePattern]
    Recurse @[$Package::TraverseSubfolders]

    Please note that these expressions will set the values for Folder, Files and the Traverse subfolders checkbox in the Enumerator configuration part of the screen (inside the purple square in the picture above).
    If you close and reopen the Foreach Loop Editor, you will notice that the configuration fields now use the values from the corresponding parameters, except the value Fully qualified for Retrieve file name, which is hardcoded.

    SSIS-S01E05-106
    Example of how to set an expression..

    Then, still in the Foreach Loop Editor, select Variable Mappings in the left pane and map variable User::FLC_ZipPathAndFileName to Index 0, as shown in the picture below.

    SSIS-S01E05-107
    Variable Mappings

    EXPR Set Unzip folder

    The next nifty thing that needs to be done, is creating a folder to store the unzipped files. As we do not know if zipped files with the same name exist in multiple zip files, the safest thing to do is to create a directory structure identical to the directory structure of the zip files.
    Additionally, the basename of the zip file (so without file extension) is used as an extra subfolder. In this way we are sure that unzipped files with the same name can be stored without overwriting other ones.

    To do this, from the SSIS Toolbox drag an Expression Task inside the Foreach Loop container.
    Open the Expression Task and paste this expression into the Expression textbox.

    @[User::FLC_UnzipFolder] =
    REPLACE(LEFT(@[User::FLC_ZipPathAndFileName],
    FINDSTRING(@[User::FLC_ZipPathAndFileName], "."
    + TOKEN(@[User::FLC_ZipPathAndFileName], ".",
    TOKENCOUNT(@[User::FLC_ZipPathAndFileName], ".")),1) -1),
    @[$Package::SourceRootFolder], @[$Package::UnzipRootFolder])

    Then press the Evaluate Expression button to check that the expression is valid.

    SSIS-S01E05-108
    Configuring the Expression Task

    FSYS Create Unzip folder

    In the previous step we have composed the full path to unzip files into, but this folder structure might not exist yet.
    Therefore we use A File System Task named FSYS Create Unzip folder to create the folder structure.
    The good news is that the File System Task can create multiple (sub)folders at once.
    Configure it as shown in the picture.
    SSIS-S01E05-109
    Configuration of FSYS Create Unzip folder

    EPR Unzip Selected Files

    Below you see a few screenshots for the configuration of the Execute Process Task named EPR Unzip Selected Files.
    The funny thing is, that in the Process pane, you do not have to change anything. Executable and Arguments are set using expressions (coming next) and all the other properties can keep their default values.

    SSIS-S01E05-110
    Process configuration.

    SSIS-S01E05-111
    Setting Executable and Arguments using expressions
    Arguments (copy-friendly):

    "-y x \"" + @[User::FLC_ZipPathAndFileName] + "\" -o" + @[User::FLC_UnzipFolder] + " " + @[$Package::FilenamePattern] + " -r"

    SCR Remove Empty folders

    There is one problem with the steps done so far. If a zipfile does not contain a file that matches the FilenamePattern, then the folderstructure is made for nothing.
    However you could just leave these empty folders, I think the solution is a bit better if these empty folders are deleted.
    To do this, from the SSIS Toolbox drag an Script Task inside the Foreach Loop container.

    SSIS-S01E05-112
    In the Script Task Editor, add both variables User::FLC_UnzipFolder and $Package::UnzipRootFolder to ReadOnlyVariables

    In the Script Task Editor window, press Edit Script ….

    Please note that you can download the C# script here..

    Then in the namespaces region add two namespaces, as indicated below.

    Namespaces region
    #region Namespaces
    using System;
    using System.Data;
    using Microsoft.SqlServer.Dts.Runtime;
    using System.Windows.Forms;
    // Added namespaces:
    using System.IO;
    using System.Linq;
    #endregion

    Adjust public void main() as follows:

    public void Main

    public void Main()
    {
      // Copy SSIS variable value of FLC_UnzipFolder to C# string variable.
      string unzipFolder = Dts.Variables["User::FLC_UnzipFolder"].Value.ToString();

      // Source: http://stackoverflow.com/questions/22520761/delete-all-empty-directories-in-specified-path
      // My credits go to author "Jon" from Berlin.
      // Because not all zipfiles might contain files with the FilenamePattern we are looking for, an empty path might have been created.
      // This script deletes these empty folders, so that in the end only folders will exist with unzipped files or subfolders in it.
      var di = new DirectoryInfo(unzipFolder);
      var root = Dts.Variables["$Package::UnzipRootFolder"].Value.ToString(); // no trailing slash!
      while (di.FullName != root
              && !di.EnumerateFiles().Any()
              && !di.EnumerateDirectories().Any())
      {
        di.Delete();
        di = di.Parent;
      }

     Dts.TaskResult = (int)ScriptResults.Success;
    }

    Download the script here.

    The proof of the pudding is in the eating

    Time to do some testing.
    I have created a few folders with subfolders and subsubfolders. You can see the folder structure in the last picture below.
    As you might have noticed the default values for my parameters already prepared for the testcase: I want to unzip only customer files from zip files with 201605 in the name.

    SSIS-S01E05-113
    Executing the package in Debug mode..

    Below you can see the result. Only customer files are unzipped. The subfolders under the Archive folder that contain no customer zip files (but only currency files), do not exist under the Unzipped folder.

    SSIS-S01E05-114
    Folder structure of Archive and Unzipped rootfolders..

    SSIS-S01E05-115
    Unzipped “Customer” files ..

    Important notes for using with huge amounts of zip files

    From my experience I must also warn you. If you run this package on a folder containing gigabytes of zipfiles, there are two things to be aware of:

    1. The execution of the package can potentially take a long time, several hours is not unusual.
    2. Make sure you have enough disk space. Suppose you have the following hypothetical scenario: zip files: 100GB, average zip compression rate: 20%, percentage of files to be unzipped: 30%. In this case you are going to unzip files that have a zipped size of 30% of 100GB, is 30GB. Once unzipped, their size will be 30GB * (1/0,20) = 150GB. You need to have that disk space available on the disk where the Unzipped folder resides.

    Conclusion / Wrap up

    With this SSIS package you can selectively unzip files from archived zip files.
    When using it, make sure you have enough disk space, and be aware that the package can run for several hours, depending on the number and (average) size of the zip files to process.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    11 Things you should know about Biml – an overview (Datawarehouse Series)

    Intended audience and purpose of this post

    This post gives an overview of a number products and features related to Biml on a high level.
    Therewith it is a starting point for further learning and investigating.

    Introduction


    Biml means Business Intelligence Markup Language.
    What else do you need to know about it, if you are a professional worker in the business intelligence and data warehouse domain?

    SSIS-S01E05-biml-logo

    11 Things you should know about Biml – an overview

    1. Who invented Biml?
    2. What is Biml?
    3. What are BimlExpress and BIDS Helper?
    4. What is BimlScript?
    5. What is BimlFlex?
    6. What are Mist and BimlStudio?
    7. What are BimlBundles?
    8. What are BimlTransformers?
    9. Is the free part of Biml a way to get me addicted to it?
    10. How do I know that Biml is a proven technology that is going to stay?
    11. Where can I find more information on the internet?

    1. Who invented Biml?

    Biml and all tools around it is created by Varigence, a company with CEO Scott Currie, a former Microsoft Employee.

    Back to top

    2. What is Biml?

    Biml – the markup language – is an XML dialect to describe Microsoft SSIS Packages and SSAS Cubes.
    Everything you can do with SSIS, you can do with Biml (quote of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016).
    The goal of Biml is to improve productivity of Data warehouse development.
    But the biml ‘ecosystem’ consists of more than only the Markup Language. Other components are:

    • Biml Script
    • BimlExpress and BIDS Helper
    • BimlFlex
    • BimlStudio (also/formerly known as Mist)
    • BimlBundles
    • BimlTransformers

    Back to top

    3. What are BimlExpress and BIDS Helper?

    BimlExpress and BIDS Helper both are Add-ins for Business Intelligence Development Studio (actually the Visual Studio version with the project types needed for Business Intelligence).
    BimlExpress is the successor of BIDS Helper and one of it biggest advantages is syntax highlighting and intellisense.
    It is really worth checking them out, because they can improve your productivity, especially when you use metadata and design patterns for SSIS packages.
    You can download and install BimlExpress. You have to register but can use the product at no cost.

    Back to top

    4. What is BimlScript?

    BimlScript is what makes Biml (the markup language) even more interesting. With BimlScript you can write .NET Code (C#) and do all kind of interesting stuff, like accessing your metadata, using for each loops, and generate SSIS packages using a design pattern.
    I have generated packages to load dimensions of a star schema, and I intend to blog about it later this year in more detail.

    Back to top

    5. What is BimlFlex?

    BimlFlex is a datawarehouse automation framework of Varigence. It can greatly improve productivity when building data warehouses.
    It uses Predefined patterns for data warehouse automation and Extension Points that you can use to change the standard automation patterns on certain ‘change points’, or can build exceptions for the pattern of this one very special table.
    BimlFlex is part of BimlStudio and therefore a commercial (paid) product.

    Back to top

    6. What are Mist and BimlStudio?

    Mist and BimlStudio are the same tool. BimlStudio is just the new name, due to strange associations with the “Mist” name in different languages, for example in German (What does the German word “Mist” mean?). This is why they changed the name.
    Also the names for all products and techniques associated with Biml now start with Biml, which makes them more recognizable.
    BimlStudio has a number of extra features when compared with BimlExpress.
    BimlStudio is a paid product, you can buy a perpetual license or a monthly subscription.

    SSIS-S01E05-bimlstudio-logo

    Back to top

    7. What are BimlBundles?

    BimlBundles are part of BimlStudio and protect your source code while your customer still can make changes at framework extension points.
    Koos van Strien did a very good job at explaining BimlBundles in more detail here.

    Back to top

    8. What are BimlTransformers?

    With BimlTransformers you can define and store changes on Biml code, instead of changing the Biml code itself. If you have worked with Visual Studio/.NET before, you can compare it with changes you make for your config file for different environments.
    Koos van Strien did a very good job at explaining BimlTransformers in more detail here.

    Back to top

    9. Is the free part of Biml a way to get me addicted to it?

    Nope. Although Varigence works sales model is a Fremium model (a free product that can be extended/enhanced with paid products), the promise of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016, is that all features that are released for free will stay for free. You can count on that.

    Back to top

    10. How do I know that Biml is a proven technology that is going to stay?

    Biml already exists for about 8 years and has proven itself already. Varigence is a global player and steady company that does not suddenly disappear. I am not concerned if Biml will exist 10 years from now, and you should not have to be either. Even in the utterworst case, you would still have the SSIS packages that Biml has generated for you, and you can edit/maintain them without Biml.

    Back to top

    11. Where can I find more information on the internet?

    To read more about Biml, here are a few interesting web sites, blog posts and Linked In Usergroups:

    Back to top

    Conclusion / Wrap up

    Biml is here to stay and adds a lot of value, even if you use only the free parts of it. Varigence can only be praised for that.
    In this blog post I shortly described a number of products and techniques that are part of the Biml ‘ecosystem’.
    I do not have a commercial interest in promoting Biml, I am just a Business Intelligence Consultant who likes Biml a lot.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Fixing corrupt csv files in a SSIS Package (SSIS Series)

    (Promotional)

    Big Data & Analytics insight 2016: save the date for an insightful conference
    (as a speaker I was asked to do a bit of promotion ..)

    Introduction

    A few weeks ago my blog post was about using a C# Script Component to detect duplicates in a Data Flow Task.

    This week I want to discuss it’s brother, the C# Script Task, which can be used in the Control Flow.

    Again, extremely flexible, basicly you embed a small C# program inside your package, and you can use the entire .NET framework for all kind of things. I used it to fix malformed csv source files, before processing them in a Data Flow Task.

    Pros and Problems

    Csv (comma seperated files) have their pros and problems, as compared with other formats, like XML.

    A few pros are that they are usually smaller in size than XML files, and easier and faster to load. These are two reasons why csv files still are used a lot these days, despite of the fact that XML files are in some cases more reliable, and more flexible as far as the file contents is concerned.

    Unfortunately csv files are not so good in holding memo/comment fields that might contain carriage return/line feeds in it. Memo fields with those characters do not stay on one line in the file, therefore making the csv file more difficult to process (at least the standard flat file source can not handle it in all cases).

    Also the structure is very strict. If a column is added to a csv source file, your Data Flow Task will break.

    Solution

    In this post I will discuss a possible solution for both problems:

    1. carriage return/line feeds in contents of csv files
    2. changing file formats over time

    This post assumes that you have worked with SSIS before, know what For Each Loop Containers and Data Flow Tasks do, and recognize the csv problems above. It is a big advantage if you can read and understand C# code.

    Okay, let’s start!

    Scenario

    My demo is based on the following hypothetical scenario:

    1. You need to process multiple CustomerDetails.csv files.
    2. In this file a pipe (|) is used as separator between the columns.
    3. The file contains a header line: SocialSecurityNumber|FirstName|LastName|Gender|Residence|Comments
    4. The Comments column is added later, your package also needs to be able to load historic files without the Comments column.
    5. CustomerDetails.csv files come from hundreds of locations and systems (hereafter called: sources) world wide. Fixing the csv files in the export processes at all these locations is impossible. Even if some sources deliver correct csv files, your SSIS package cannot rely on the fact that all sources will do so.

    Overview

    Here are a few pictures of the end result:
    SSIS-S01E03-109
    The connection managers that are needed

    SSIS-S01E03-110
    The Control Flow
    SSIS-S01E03-111
    The Data Flow

    Detailed steps

    Parameters and variables

    First you need to add a few parameters and variables to the package.

    Parameters:

    • SourceFolder: The folder that that For Each Loop Container reads for csv files.
    • ArchiveFolder: The folder to which files are moved after being processed, if no error occurred.
    • ErrorFolder: The folder to which files are moved that caused an error during processing.

    All parameters are Required and their Data type is String.

    SSIS-S01E03-113
    The parameters of the package

    Variables:

    • CRLF: This variable will be filled with a Carriage Return/Line Feed in the Script Task.
    • CRLF_Replacer: This is the character string that is used to replace CarriageReturn/LineFeeds in the Comment column of the file. It must be a character string that does not occur in the Comment field. Therefore ~CRLF~ is used. If you want a different character string, you can change it to something else.
    • SourcePathAndFilename: The full path and filename of the original filename to process. It is filled by the For Each Loop Container. The value set at design time is unimportant.
    • PathAndFilenameToProcess: The full path and filename of the file to process. This can be either the original file, or the modified temporary file. The Flat File Connection Manager uses this path and filename. The value set at design time is unimportant.

    The Data type of all variables is String.

    SSIS-S01E03-114
    SSIS variables needed to do the job

    The Control Flow – For Each Loop Container

    Add a For Each Loop Container FELC – Load sourcefiles and configure it as a Foreach File Enumerator. Retrieve file name Fully qualified.

    SSIS-S01E03-115

    On the Collection pane, just below the Enumerator type, press the ellipsis next to Expressions. In the Property Expressions Editor configure the Property Directory to be set by
    Expression @[$Package::SourceFolder].
    SSIS-S01E03-116

    On the Variable Mappings pane, map variable User::SourcePathAndFilename to Index 0
    SSIS-S01E03-102
    For Each Loop Container configuration

    The Control Flow – Script Task

    Next, add a Script Task SCR – Fix csv file if needed.
    In the Script Task Editor, configure the following:

    • ReadOnlyVariables: User::CRLF_Replacer,User::SourcePathAndFilename
    • ReadWriteVariables: User::CRLF,User::PathAndFilenameToProcess

    SSIS-S01E03-117
    Variable configuration in the Script Task Editor

    Then press the Edit Script… button.
    Don’t get scared by the amount of C# code, if you keep your head cool, it will be fixed in 10 minutes!

    Adjust the script as follows:

    • In the Namespaces region, if not already there, add:
      using System.IO;
      using System.Text;


    • SSIS-S01E03-108
      Replace the highlighted part above with the code below

      Now remove public void Main() completely and replace it with the code underneath (to avoid malformatting due to html rendering, please download the original csharp script here).
      Then check the code for !!! ATTENTION !!! comment. These are points in the script that you might need to change.

      ScriptTask.cs

        /*
        (c) Copyright 2016 - hansmichiels.com
       
        This program is free software: you can redistribute it and/or modify
        it under the terms of the GNU General Public License as published by
        the Free Software Foundation, either version 3 of the License, or
        (at your option) any later version.
       
        This program is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
        GNU General Public License for more details.
       
        You should have received a copy of the GNU General Public License
        along with this program. If not, see http://www.gnu.org/licenses/.
        */

       
        /// This method is called when this script task executes in the control flow.
        /// Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
        /// To open Help, press F1.
        /// </summary>
       public void Main()
       {
        // Copy a few SSIS variables to C# string variables.
         string sourcePathAndFilename = Dts.Variables[“User::SourcePathAndFilename”].Value.ToString();
         string crlfReplacer = Dts.Variables[“User::CRLF_Replacer”].Value.ToString();
       
         // Set the variable that needs to hold a CarriageReturn/LineFeed.
         Dts.Variables[“User::CRLF”].Value = string.Format(“\r\n”);
       
         // Call the method that checks if the sourcefile is corrupt and if so, fixes it.
         FixFileIfCorrupt(sourcePathAndFilename, crlfReplacer);
       
         // Set returnvalue for script task.
        Dts.TaskResult = (int)ScriptResults.Success;
       }
       
        /// Csv-file can have lines with carriage return-linefeeds, causing errors during the load.
        /// By preinvestigating the file we can correct the file and prevent these errors from ocurring.
        /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
        /// Author: Hans Michiels
        /// License: GNU General Public License, see http://www.gnu.org/licenses/
        /// </summary>
        private void FixFileIfCorrupt(string sourcePathAndFilename, string crlfReplacer)
        {
         string header = string.Empty;
         string headerAppend = string.Empty;
         string detailAppend = string.Empty;
       
         // Initialize pathAndFilenameToProcess with sourcePathAndFilename. Overwrite later if needed.
         string pathAndFilenameToProcess = sourcePathAndFilename;
       
         // Check if the header of the file is as expected, and return the header row through a ref parameter.
         // If the file header is not as expected, do not touch the file and exit this method.
         if (!CheckAndReturnHeader(sourcePathAndFilename, ref header, ref headerAppend, ref detailAppend)) return;
         
         // Get the encoding of the current file. We need this if we have to create a corrected file.
         var encoding = GetEncoding(sourcePathAndFilename);
       
         // Read all lines of the file into a string array.
         // !!! ATTENTION !!! WARNING: only do this when the files to process are relatively small.
         // !!! ATTENTION !!! When in doubt, test it and monitor memory usage.
         // (I suppose files up to 50~100 MB might not be a big problem but larger might, but testing is the way to find out)
         string[] lines = File.ReadAllLines(sourcePathAndFilename, encoding);
       
         // Use a variable to set when the file has changed due to corrections this method makes.
         bool fileHasChanged = false;
         bool lineIsOkay = true;
       
         // appendbuffer is used to keep parts of 'broken' lines.
         string appendbuffer = string.Empty;
       
         // Get lower- and upperbound of the string array into variables.
         int lower = lines.GetLowerBound(0);
         int upper = lines.GetUpperBound(0);
       
         // Loop through all lines in the file from back to front!
         for (int i = upper; i >= lower; i--)
         {
          // If there is still some text in the buffer, append it to the line and make the buffer empty again.
          if (appendbuffer != string.Empty)
          {
            lines[i] = lines[i] + appendbuffer;
            appendbuffer = string.Empty;
            fileHasChanged = true;
          }
       
          // reset lineIsOkay variable, set to true underneath when the line passes the 'line ok' checks.
          lineIsOkay = false;
       
          // !!! ATTENTION !!!
          // Here you need to add some custom code.
          // Describe one or more properties of a correct line, e.g.
          // - first 8 characters are numeric;
          // - The length is at least 12;
          // - It contains at least x delimiters;
          // etc.
          // Then convert this check to C# code. Use your imagination if needed!
       
          // I check for:
          // - the length to be 12 or more characters.
          // - the first 8 characters (the social security number) to be numeric.
          // - at least 4 pipes.
          if (lines[i].Length > 12)
          {
            string checkString = lines[i].Substring(0, 8);
            int intOutput;
       
            lineIsOkay = (Int32.TryParse(checkString, out intOutput));
       
            if (lineIsOkay)
            {
             // Check that the line contains at least 4 delimiters.
             lineIsOkay = (lines[i].Split('|').Length - 1 >= 4);
            }
          }
       
          if ((!lineIsOkay) && (i > lower))
          {
            // Line is NOT OK.
            // I have to append this line to the previous one, except when it is the header.
            // To do this put this line into appendbuffer variable.
            // I use a replacement string for the removed carriage return/line feed.
            // This will be converted back to a carriage return/line feed in the Data Flow.
            appendbuffer = string.Format(“{0}{1}”, crlfReplacer, lines[i]);
            lines[i] = string.Empty; // Package can handle empty lines.
            fileHasChanged = true;
          }
       
         }
       
         // !!! ATTENTION !!! If you do not have old fileformats you can remove this block.
         // Code to handle the old file format, without a Comments column.
         // The code below reformats the file so that it become always in new fileformat, with an empty comment.
         if (headerAppend != string.Empty)
         {
          // This is an old format without the Comments column. Append Comments column header to header and a delimiter (pipe) to detail rows.
          fileHasChanged = true;
       
          for (int i = upper; i >= lower; i--)
          {
           if (lines[i].Length > 0)
           {
            if (i > lower)
            {
              lines[i] += detailAppend;
            }
            else
            {
              lines[i] += headerAppend;
            }
           }
          }
         }
       
         if (fileHasChanged)
         {
          // Here the modified/corrected file is stored to a temporary file, so a “tmp” extension is added to the modified filename.
          // For auditing purposes I keep the original file and not the corrected one,
          // because the latter can always be recreated when needed.
          // Also, when a bug would exist in the creation of the corrected file
          // (initially I had a problem with codepages, which should now be solved),
          // the original file might get lost forever.
          pathAndFilenameToProcess = String.Format(“{0}.tmp”, sourcePathAndFilename);
       
          // Now create a new corrected csv file.
          File.WriteAllLines(pathAndFilenameToProcess, lines, encoding);
         }
       
         // Also change the SSIS Package variable, because the Flat File Source uses it.
         Dts.Variables[“User::PathAndFilenameToProcess”].Value = pathAndFilenameToProcess;
       
        }
       
        /// <summary>
        /// This method checks if the header is as expected. If not, return false.
        /// Also the value of by ref parameters 'header', 'headerAppend' and 'detailAppend' is set.
        /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
        /// Author: Hans Michiels
        /// License: GNU General Public License, see http://www.gnu.org/licenses/
        /// <param name=”sourcePathAndFilename”></param>
        /// <param name=”header”></param>
        ///
        private bool CheckAndReturnHeader(string sourcePathAndFilename,
          ref string header, ref string headerAppend, ref string detailAppend)
        {
         // Initialize return values.
         header = string.Empty;
         headerAppend = string.Empty;
         detailAppend = string.Empty;
       
         // Open the file to read the header line, then close the file again.
         using (StreamReader reader = new StreamReader(sourcePathAndFilename))
         {
          header = reader.ReadLine();
          reader.Close();
         }
         // !!! ATTENTION !!! Here you could do a check on (parts) of the header,
         // so that you do not change files with an unexpected format.
         if (!header.ToLower().StartsWith(“socialsecuritynumber|firstname|lastname|gender|residence”))
         {
          return false;
         }
         // !!! ATTENTION !!! Here you could check for old file formats.
         // Old formats can only be handled when one or more new columns were added at the end of the csv lines.
         // In the example, there is a file format without the Comments column still being delivered.
         // By appending the Comments column with an empty value for those files, they can be processed
         // by the same Data Flow Task using only one Flat File Source Connection.
         if (!header.ToLower().Contains(“|comments”))
         {
          // This is an old format without the Comments column.
          // Append Comments column header to header and a delimiter (pipe) to detail rows.
          headerAppend = “|Comments”;
          detailAppend = “|”;
         }
         return true;
        }
       
        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM).
        /// Defaults to ASCII when detection of the text file's endianness fails.
        /// Source: http://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
        /// With thanks to the author ‘2Toad’.
        /// </summary>
        /// <param name=”filename”>The text file to analyze.</param>
        /// <returns>The detected encoding.</returns>
        public static Encoding GetEncoding(string filename)
        {
         // Read the BOM
         var bom = new byte[4];
         using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
         {
          file.Read(bom, 0, 4);
         }
       
         // Analyze the BOM
         if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
         if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
         if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
         if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
         if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
         // return Encoding.GetEncoding(1252); // Without BOM, default to Windows Codepage 1252
         return Encoding.Default; // Without BOM, the encoding for the OS's current ANSI code page is returned by default.
        }

      When you have adjusted the script for your file, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Task Editor, close it by pressing OK.

      Developing the Data Flow Task

      Add a Data Flow Task to the Control Flow and name it DFT – Import Customer Details.
      Then add the following components to it:

      Flat File Source FF_SRC – CustomerDetails

      From the SSIS Toolbox, drag a Flat File Source to the Data Flow area and create a new Flat File connection manager.
      SSIS-S01E03-121

      If you want to rebuild the demo, download the csv files (and C# script) here.
      Configure the Flat File Connection as shown in the picture below:
      SSIS-S01E03-118

      Choose Advanced in the pane on the left and configure the properties for each column as follows:

      • SocialSecurityNumber: DataType string [DT_STR], OutputColumnWidth 9
      • FirstName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
      • LastName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
      • Gender: DataType string [DT_STR], OutputColumnWidth 1
      • Residence: DataType Unicode string [DT_WSTR], OutputColumnWidth 60
      • Comments: DataType Unicode string [DT_WSTR], OutputColumnWidth 1000

      SSIS-S01E03-120
      Example of column properties

      Conditional Split CSPL – Filter out empty lines

      We use a Conditional Split to filter out empty lines in the csv file. Because the SocialSecurityNumber is a required column, we assume that the line is empty (at least it is useless) if SocialSecurityNumber is empty.
      So add an Output named Empty lines with Condition SocialSecurityNumber == “”. If you like, you can check on multiple columns to be empty.
      Rename the default Output to Non-empty lines.

      SSIS-S01E03-122
      Conditional Split Configuration

      Derived Column DC – Re-insert CRLF

      Because the (corrected) file might contain replacements for carriage return/line feeds, we are going to replace them with the original carriage return/line feeds, before the value is inserted into the database table.
      In the Derived Column Transformation Editor, add a derived column, configured as follows:

      • Derived Column Name: Comments
      • Derived Column: Replace ‘Comments’
      • Expression: REPLACE(Comments,@[User::CRLF_Replacer],@[User::CRLF])

      SSIS-S01E03-123
      The Derived Column Transformation Editor

      OLEDB Destination OLE_DST stg DimCustomer

      To rebuild the demo, create a table to store the rows from the data flow task,
      add a OLEDB Destination that uses this table.

      destination_table.sql

      SET ANSI_NULLS ON
      GO
      SET QUOTED_IDENTIFIER ON
      GO
      SET ANSI_PADDING ON
      GO
      CREATE TABLE [stg].[DimCustomer](
       [SocialSecurityNumber] [char](9) NOT NULL,
       [FirstName] [nvarchar](50) NULL,
       [LastName] [nvarchar](50) NULL,
       [Gender] [char](1) NULL,
       [Comments] [nvarchar](1000) NULL,
       [Residence] [nvarchar](60) NULL,
       CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED
      (
       [SocialSecurityNumber] ASC
      )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
      ) ON [PRIMARY]
      GO
      SET ANSI_PADDING OFF
      GO

      The Control Flow – File System Tasks

      Finally the File System Tasks archive the original file and delete the corrected file (if it exists).
      For auditing purposes I keep the original file and not the corrected one, because the latter can always be recreated when needed.
      Also, when a bug would exist in the creation of the corrected file (initially I had a problem with codepages, which should now be solved), the original file might get lost forever.

      Please go back to the picture of the The Control Flow at the beginning of this post, to see how the File System Tasks are connected.
      Next are the printscreens that should help you to configure the File System Tasks.

      SSIS-S01E03-124
      Configuration of “FST – Move original file to archive folder”

      SSIS-S01E03-125
      Configuration of “FST – Move original file to error folder”

      SSIS-S01E03-126
      Precedence constraint of “FST – Delete temporary file” (you need to set it twice).

      SSIS-S01E03-128
      Configuration of “FST – Delete temporary file”

      The proof

      To show you that it works, I run the package in debug mode.
      To be able to have a look at the temporary/corrected files, afterwards, I have disabled the File System Task that deletes this file.
      SSIS-S01E03-130

      When I open the “CustomerDetails.csv.tmp” file after the package has run, I can see that line 15 was corrected.
      From the second screenshot you can see that the CRLF_Replacer value was inserted twice.
      SSIS-S01E03-131
      SSIS-S01E03-132

      And when I run a query on the destination table, I can see that the Comments column contains the original CarriageReturn/LineFeeds!
      SSIS-S01E03-134

      By the way, all the data in the csv files is generated from random firstnames, lastnames and residences and fake social security numbers. Any match with existing persons would be based solely on coincidence.

      Conclusion / Wrap up

      Csv files still are useful today, because they have some advantages when compared with XML files.
      However they also have their limitations, which can be frustrating when loading those files with a Data Flow Task.
      In this blog post I have demonstrated how to overcome a few of those limitations, namely:

      1. carriage return/line feeds in contents of csv files
      2. changing file formats over time

      The solution is based on using a C# Script Task.

      Please be aware that I do not pretend this is the best solution.
      If you are very familiar with C#, you could all of the processsing in the Script Task, using SqlBulkCopy, and you would not need a temporary file.
      However the Data Flow Task offers some advantages, like lookups you can use, without the need to do this in a script.
      If you create good Interface Specification Documents on forehand (for instance containing how to handle carriage return/line feeds in column values) you can force that all sources deliver correct csv-files, and then they do not have to be corrected.

      Download the C# script and csv files here.

      (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Detect duplicates in a Data Flow Task (SSIS Series)

    Intended audience

    This article assumes that you have at least a basic understanding of SSIS, and the Data Flow Task in particular. C# knowledge is an advantage but not required.

    Introduction

    If you ever used the T-SQL windowing function ROW_NUMBER() OVER (PARTITION BY ..) to find duplicates in a query result, this post might be of interest for you.

    Because not all sources having potential duplicate entries (e.g. business keys) come from a SQL Server database. If it is a csv- or Excel file ROW_NUMBER will not automaticly be your friend (it might be when you use OPENROWSET, but I think that is a rather complicated solution).

    C# script components are extremely flexible yet fairly easy to understand, develop and maintain.
    This is why this is my preferred solution to tackle the problem above.
    Let’s begin!

    How to detect duplicates in a Data Flow Task using a C# Script component

    First an overview of the end result (the Data Flow of the Data Flow Task). I will tell you step by step how to achieve this using SSIS 2014. I did not make a print screen of the Control Flow because it only contains one Data Flow Task.
    Please note that I enabled two Data Viewers on both outputs of the Conditional Split component, so you can see the data that flows through (screenshot later).
    SSIS-S01E02-image01
    For the demo I will use the following components in a dataflow. To solve your problem just pick out the things you need.

    • An OLE DB Source
    • A script component (the most interesting part!)
    • An (optional) conditional split
    • Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates.

    An OLE DB Source

    My example is based on Advertureworks. I select customers from the [AdventureWorksDW] database, and make sure there are duplicates by using UNION ALL in the SQL statement.
    In the real world it would be more likely that your source would not be a SQL Server query, but for instance a csv- or Excel file.

    OLEDB Source.sql

    SELECT
      [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome]
    FROM
      [dbo].[DimCustomer]
    UNION ALL
    SELECT
      [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome]
    FROM
      [dbo].[DimCustomer]
    UNION ALL
    SELECT
      [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome]
    FROM
      [dbo].[DimCustomer]
      WHERE CustomerKey % 2 = 0

    SQL Statement for the OLE DB Source

    A script component (the most interesting part!)

    From the SSIS Toolbox drag a Script Component to the Data flow surface. In the “Select Script Component Type” choose “Transformation”.

    SSIS-S01E02-image02

    To use a column value in the script, you have to define it as an input column. Select the column you want to check for duplicate values with Usage Type ReadOnly. In the demo this is the CustomerBusinessKey column.
    SSIS-S01E02-image03

    We also need a new column to store the occurrence number for the CustomerBusinessKey. To do this, in the Script Transformation Editor, Select Inputs and Outputs on the left side, then unfold Output 0 and Output Columns, respectively.
    Then click on the Add column button to add a column with DataType four-byte signed integer [DT_I4]. Give the new column a clear name. In the demo it is called CustomerBusinessKeyOccurrence.
    SSIS-S01E02-image03b

    After these preparations we can start writing the C# script. Press the Edit script … button to do so.
    SSIS-S01E02-image04

    Adjust the script as follows (the changes are also marked inside red blocks in the picture below):

    • In the Namespaces region, add:
      using System.Collections.Generic;
    • Above PreExecute(), add:
      // The column to count duplicates for will be stored as Key (with string datatype) of the dictionary.
      // The occurrence counter will be stored as Value of the dictionary (with int datatype).
      private Dictionary<string, int> m_dictOccurrencesCount;
    • Inside PreExecute(), add:
      // Initialize the dictionary, otherwise it will be null when accessed.
      m_dictOccurrencesCount = new Dictionary<string, int>();

    SSIS-S01E02-image05
    Script adjustments – part 1

    Then adjust the script as follows:

    • Inside Input0_ProcessInputRow(Input0Buffer Row), add:
      // Count occurrences of CustomerBusinessKeys.
      if (m_dictOccurrencesCount.ContainsKey(Row.CustomerBusinessKey))
      {
        // If the CustomerBusinessKey of the current row is already in the dictionary,
        // increase the counter with 1:
        m_dictOccurrencesCount[Row.CustomerBusinessKey]++;
        // Also assign the new value to the CustomerBusinessKeyOccurrence column,
        // so it can be used in the Data flow later.
        Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[Row.CustomerBusinessKey];
      }
      else
      {
        // If the CustomerBusinessKey of the current row is not yet in the dictionary,
        // add it, and set the counter to 1
        m_dictOccurrencesCount.Add(Row.CustomerBusinessKey, 1);
        // Also assign 1 to the CustomerBusinessKeyOccurrence column.
        Row.CustomerBusinessKeyOccurrence = 1;
      }

    SSIS-S01E02-image06
    Script adjustments – part 2

    When the script is ready, Choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Transformation Editor, close it by pressing OK.

    An (optional) conditional split

    Optionally you can add a Conditional Split.
    I use the conditional split to store the first occurrences for every CustomerBusinessKey in one table and all duplicates in another table.
    Both tables are only to show you that the script task and conditional split do their jobs properly. In real projects, you might develop this differently, for example by storing only the first occurence.

    SSIS-S01E02-image07
    Conditional Split Configuration

    Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates

    I basicly created a two temporary tables by composing a SELECT TOP 0 .. INTO statement from the OLE DB Source SQL Statement, because the result of the demo needs to flow to somewhere.

    SSIS-S01E02-OLEDB-Source.sql
    IF OBJECT_ID('[dbo].[DimCustomerFirstOne]', 'U') IS NOT NULL
      DROP TABLE [dbo].[DimCustomerFirstOne]
    GO
    SELECT TOP 0
      [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
    INTO [dbo].[DimCustomerFirstOne]
    FROM
      [dbo].[DimCustomer]
    GO

    IF OBJECT_ID('[dbo].[DimCustomerDuplicates]', 'U') IS NOT NULL
      DROP TABLE [dbo].[DimCustomerDuplicates]
    GO
    SELECT TOP 0
      [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
    INTO [dbo].[DimCustomerDuplicates]
    FROM
      [dbo].[DimCustomer]
    GO

    OLE DB Destination OLE_DST FirstOne uses the [dbo].[DimCustomerFirstOne] table.
    OLE DB Destination OLE_DST Duplicate uses the [dbo].[DimCustomerDuplicates] table.

    Let it flow, let if flow, let it flow

    Okay, time to test.
    When data flows, the “First occurence” and “Duplicates” Data Viewers pop up. Then I can check that all the rows are divided well over both tables based on the value of column CustomerBusinessKeyOccurrence: exactly 1 or higher than 1.
    When I run it with the Data Viewers disabled, the package execution finished literally in a split second. So the performance is excellent.
    Test succeeded!
    SSIS-S01E02-image08
    The Data flow in Debug Mode

    SSIS-S01E02-image09b
    Data Viewer for First occurrence flow: CustomerBusinessKeyOccurrence is 1 for all rows.

    SSIS-S01E02-image10b
    Data Viewer for Duplicates flow: CustomerBusinessKeyOccurrence is 2 or higher for all rows.

    Other considerations

    If multiple columns make a row unique concatenate all column values with a delimiter, and then add the concatenated result to the dictionary m_dictOccurrencesCount.
    For instance you could check duplicates using the customer nameparts (firstname, middlename and lastname), the birthdate and the gender. This is a softer check than on a real business key, but sometimes you do not have that.
    Of course, for this to work, you must add those columns as ReadOnly input columns to the script component.
    You can create a concatenated value as follows:
    Inside Input0_ProcessInputRow(Input0Buffer Row), start with:

    // Concatenate FirstName, MiddleName, LastName, BirthDate and Gender, using a | as delimiter.
    string businessKey = string.Format(“{0}|{1}|{2}|{3:yyyyMMdd}|{4}”,
      Row.FirstName, Row.MiddleName, Row.LastName, Row.BirthDate, Row.Gender).ToUpper();

    The concatenated value is converted to uppercase, so that differences in lower- or uppercase lead to the same value.
    Then use the businessKey variable (instead of Row.CustomerBusinessKey) in the if statement as shown above.

    If the concatenated value becomes very long, you could calculate a hash value and add that to the dictionary instead (thanks for the suggestion, Joost). This would use less memory and might be faster, but the only way to find out if this is the case is testing .. testing ..
    ( I did not test this to make the demo not too complicated and because performance did not seem to be an issue ).

    Thinking a bit further, you could use this concept also to calculate running totals.
    Just configure the dictionary to hold a decimal instead of an int and you can calculate running totals for numeric values (e.g. sales figures) easily.

    Conclusion / Wrap up

    In SSIS you can use a Script Component within a Data Flow Task to detect duplicates in the source data.
    This can be an alternative to the T-SQL ROW_NUMBER() function, when the source is not a SQL Server Database.

    Credits
    My thanks go to Joost van Rossum for reviewing the initial draft of this article. His input was very useful.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.