
Fixing corrupt csv files in a SSIS Package (SSIS Series)

Introduction

A few weeks ago my blog post was about using a C# Script Component to detect duplicates in a Data Flow Task.

This week I want to discuss its brother, the C# Script Task, which can be used in the Control Flow.

Again, it is extremely flexible: you basically embed a small C# program inside your package, and you can use the entire .NET Framework for all kinds of things. I used it to fix malformed csv source files before processing them in a Data Flow Task.

Pros and Problems

Csv (comma-separated values) files have their pros and problems compared with other formats, like XML.

A few pros are that they are usually smaller than XML files, and easier and faster to load. These are two reasons why csv files are still used a lot these days, despite the fact that XML files are in some cases more reliable, and more flexible as far as the file contents are concerned.

Unfortunately csv files are not so good at holding memo/comment fields that might contain carriage return/line feeds. Memo fields with those characters do not stay on one line in the file, which makes the csv file more difficult to process (at least the standard Flat File Source cannot handle it in all cases).

Also the structure is very strict. If a column is added to a csv source file, your Data Flow Task will break.

Solution

In this post I will discuss a possible solution for both problems:

  1. carriage return/line feeds in contents of csv files
  2. changing file formats over time

This post assumes that you have worked with SSIS before, know what For Each Loop Containers and Data Flow Tasks do, and recognize the csv problems above. It is a big advantage if you can read and understand C# code.

Okay, let’s start!

Scenario

My demo is based on the following hypothetical scenario:

  1. You need to process multiple CustomerDetails.csv files.
  2. In this file a pipe (|) is used as separator between the columns.
  3. The file contains a header line: SocialSecurityNumber|FirstName|LastName|Gender|Residence|Comments
  4. The Comments column was added later; your package also needs to be able to load historic files without the Comments column.
  5. CustomerDetails.csv files come from hundreds of locations and systems (hereafter called: sources) worldwide. Fixing the csv files in the export processes at all these locations is impossible. Even if some sources deliver correct csv files, your SSIS package cannot rely on all sources doing so.

Overview

Here are a few pictures of the end result:
SSIS-S01E03-109
The connection managers that are needed

SSIS-S01E03-110
The Control Flow
SSIS-S01E03-111
The Data Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.

Parameters:

  • SourceFolder: The folder that the For Each Loop Container reads for csv files.
  • ArchiveFolder: The folder to which files are moved after being processed, if no error occurred.
  • ErrorFolder: The folder to which files are moved that caused an error during processing.

All parameters are Required and their Data type is String.

SSIS-S01E03-113
The parameters of the package

Variables:

  • CRLF: This variable will be filled with a Carriage Return/Line Feed in the Script Task.
  • CRLF_Replacer: This is the character string that is used to replace CarriageReturn/LineFeeds in the Comment column of the file. It must be a character string that does not occur in the Comment field. Therefore ~CRLF~ is used. If you want a different character string, you can change it to something else.
  • SourcePathAndFilename: The full path and filename of the original file to process. It is filled by the For Each Loop Container. The value set at design time is unimportant.
  • PathAndFilenameToProcess: The full path and filename of the file to process. This can be either the original file, or the modified temporary file. The Flat File Connection Manager uses this path and filename. The value set at design time is unimportant.

The Data type of all variables is String.

SSIS-S01E03-114
SSIS variables needed to do the job
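
To make the role of CRLF_Replacer more concrete, here is a minimal C# sketch of the round trip, assuming the ~CRLF~ token from the list above. The class and method names (CrlfToken, Protect, Restore, IsSafeFor) are made up for this illustration and are not part of the package. IsSafeFor shows one possible way to verify that the token does not already occur in the data.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: these helpers are hypothetical and not part of the SSIS package.
public static class CrlfToken
{
    // Replace each carriage return/line feed by the token, so the value fits on one line.
    public static string Protect(string value, string token)
    {
        return value.Replace("\r\n", token);
    }

    // Put the original carriage return/line feeds back.
    public static string Restore(string value, string token)
    {
        return value.Replace(token, "\r\n");
    }

    // The round trip is only lossless when the token does not occur in the data itself.
    public static bool IsSafeFor(IEnumerable<string> lines, string token)
    {
        foreach (string line in lines)
        {
            if (line.Contains(token)) return false;
        }
        return true;
    }
}
```

In the package itself, the Script Task does the first step and the Derived Column in the Data Flow does the second.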

The Control Flow – For Each Loop Container

Add a For Each Loop Container FELC – Load sourcefiles and configure it as a Foreach File Enumerator. Set Retrieve file name to Fully qualified.

SSIS-S01E03-115

On the Collection pane, just below the Enumerator type, press the ellipsis next to Expressions. In the Property Expressions Editor configure the Property Directory to be set by Expression @[$Package::SourceFolder].
SSIS-S01E03-116

On the Variable Mappings pane, map variable User::SourcePathAndFilename to Index 0
SSIS-S01E03-102
For Each Loop Container configuration

The Control Flow – Script Task

Next, add a Script Task SCR – Fix csv file if needed.
In the Script Task Editor, configure the following:

  • ReadOnlyVariables: User::CRLF_Replacer,User::SourcePathAndFilename
  • ReadWriteVariables: User::CRLF,User::PathAndFilenameToProcess

SSIS-S01E03-117
Variable configuration in the Script Task Editor

Then press the Edit Script… button.
Don’t get scared by the amount of C# code. If you keep a cool head, it will be done in 10 minutes!

Adjust the script as follows:

  • In the Namespaces region, if not already there, add:
    using System.IO;
    using System.Text;


  • SSIS-S01E03-108
    Replace the highlighted part above with the code below

    Now remove public void Main() completely and replace it with the code underneath (to avoid malformatting due to html rendering, please download the original csharp script here).
    Then check the code for !!! ATTENTION !!! comments. These are points in the script that you might need to change.

    ScriptTask.cs

      /*
      (c) Copyright 2016 - hansmichiels.com
     
      This program is free software: you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation, either version 3 of the License, or
      (at your option) any later version.
     
      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.
     
      You should have received a copy of the GNU General Public License
      along with this program. If not, see http://www.gnu.org/licenses/.
      */

     
      /// <summary>
      /// This method is called when this script task executes in the control flow.
      /// Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
      /// To open Help, press F1.
      /// </summary>
     public void Main()
     {
      // Copy a few SSIS variables to C# string variables.
       string sourcePathAndFilename = Dts.Variables["User::SourcePathAndFilename"].Value.ToString();
       string crlfReplacer = Dts.Variables["User::CRLF_Replacer"].Value.ToString();
     
       // Set the variable that needs to hold a CarriageReturn/LineFeed.
       Dts.Variables["User::CRLF"].Value = "\r\n";
     
       // Call the method that checks if the sourcefile is corrupt and if so, fixes it.
       FixFileIfCorrupt(sourcePathAndFilename, crlfReplacer);
     
       // Set returnvalue for script task.
      Dts.TaskResult = (int)ScriptResults.Success;
     }
     
      /// <summary>
      /// Csv files can have lines with carriage return/line feeds, causing errors during the load.
      /// By inspecting the file beforehand we can correct it and prevent these errors from occurring.
      /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
      /// Author: Hans Michiels
      /// License: GNU General Public License, see http://www.gnu.org/licenses/
      /// </summary>
      private void FixFileIfCorrupt(string sourcePathAndFilename, string crlfReplacer)
      {
       string header = string.Empty;
       string headerAppend = string.Empty;
       string detailAppend = string.Empty;
     
       // Initialize pathAndFilenameToProcess with sourcePathAndFilename. Overwrite later if needed.
       string pathAndFilenameToProcess = sourcePathAndFilename;
     
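       // Check if the header of the file is as expected, and return the header row through a ref parameter.
       // If the file header is not as expected, do not touch the file and exit this method.
       if (!CheckAndReturnHeader(sourcePathAndFilename, ref header, ref headerAppend, ref detailAppend))
       {
        // Still point the package variable at the original file,
        // otherwise it would keep the value from a previous loop iteration.
        Dts.Variables["User::PathAndFilenameToProcess"].Value = pathAndFilenameToProcess;
        return;
       }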
       
       // Get the encoding of the current file. We need this if we have to create a corrected file.
       var encoding = GetEncoding(sourcePathAndFilename);
     
       // Read all lines of the file into a string array.
       // !!! ATTENTION !!! WARNING: only do this when the files to process are relatively small.
       // !!! ATTENTION !!! When in doubt, test it and monitor memory usage.
       // (I suppose files up to 50~100 MB might not be a big problem but larger might, but testing is the way to find out)
       string[] lines = File.ReadAllLines(sourcePathAndFilename, encoding);
     
       // Use a variable to set when the file has changed due to corrections this method makes.
       bool fileHasChanged = false;
       bool lineIsOkay = true;
     
       // appendbuffer is used to keep parts of 'broken' lines.
       string appendbuffer = string.Empty;
     
       // Get lower- and upperbound of the string array into variables.
       int lower = lines.GetLowerBound(0);
       int upper = lines.GetUpperBound(0);
     
       // Loop through all lines in the file from back to front!
       for (int i = upper; i >= lower; i--)
       {
        // If there is still some text in the buffer, append it to the line and make the buffer empty again.
        if (appendbuffer != string.Empty)
        {
          lines[i] = lines[i] + appendbuffer;
          appendbuffer = string.Empty;
          fileHasChanged = true;
        }
     
        // reset lineIsOkay variable, set to true underneath when the line passes the 'line ok' checks.
        lineIsOkay = false;
     
        // !!! ATTENTION !!!
        // Here you need to add some custom code.
        // Describe one or more properties of a correct line, e.g.
        // - first 8 characters are numeric;
        // - The length is at least 12;
        // - It contains at least x delimiters;
        // etc.
        // Then convert this check to C# code. Use your imagination if needed!
     
        // I check for:
        // - the length to be 12 or more characters.
        // - the first 8 characters (the social security number) to be numeric.
        // - at least 4 pipes.
        if (lines[i].Length >= 12)
        {
          string checkString = lines[i].Substring(0, 8);
          int intOutput;
     
          lineIsOkay = (Int32.TryParse(checkString, out intOutput));
     
          if (lineIsOkay)
          {
           // Check that the line contains at least 4 delimiters.
           lineIsOkay = (lines[i].Split('|').Length - 1 >= 4);
          }
        }
     
        if ((!lineIsOkay) && (i > lower))
        {
          // Line is NOT OK.
          // I have to append this line to the previous one, except when it is the header.
          // To do this put this line into appendbuffer variable.
          // I use a replacement string for the removed carriage return/line feed.
          // This will be converted back to a carriage return/line feed in the Data Flow.
          appendbuffer = string.Format("{0}{1}", crlfReplacer, lines[i]);
          lines[i] = string.Empty; // Package can handle empty lines.
          fileHasChanged = true;
        }
     
       }
     
       // !!! ATTENTION !!! If you do not have old fileformats you can remove this block.
       // Code to handle the old file format, without a Comments column.
       // The code below reformats the file so that it is always in the new file format, with an empty comment.
       if (headerAppend != string.Empty)
       {
        // This is an old format without the Comments column. Append Comments column header to header and a delimiter (pipe) to detail rows.
        fileHasChanged = true;
     
        for (int i = upper; i >= lower; i--)
        {
         if (lines[i].Length > 0)
         {
          if (i > lower)
          {
            lines[i] += detailAppend;
          }
          else
          {
            lines[i] += headerAppend;
          }
         }
        }
       }
     
       if (fileHasChanged)
       {
        // Here the modified/corrected file is stored as a temporary file, so a ".tmp" extension is added to the original filename.
        // For auditing purposes I keep the original file and not the corrected one,
        // because the latter can always be recreated when needed.
        // Also, if there were a bug in the creation of the corrected file
        // (initially I had a problem with codepages, which should now be solved)
        // and only the corrected file were kept, the original data might get lost forever.
        pathAndFilenameToProcess = String.Format("{0}.tmp", sourcePathAndFilename);
     
        // Now create a new corrected csv file.
        File.WriteAllLines(pathAndFilenameToProcess, lines, encoding);
       }
     
       // Also change the SSIS Package variable, because the Flat File Source uses it.
       Dts.Variables["User::PathAndFilenameToProcess"].Value = pathAndFilenameToProcess;
     
      }
     
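      /// <summary>
      /// This method checks if the header is as expected. If not, it returns false.
      /// It also sets the values of the by ref parameters 'header', 'headerAppend' and 'detailAppend'.
      /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
      /// Author: Hans Michiels
      /// License: GNU General Public License, see http://www.gnu.org/licenses/
      /// </summary>
      /// <param name="sourcePathAndFilename">Full path and filename of the csv file to check.</param>
      /// <param name="header">Receives the header line of the file.</param>
      /// <param name="headerAppend">Receives the text to append to the header of an old format file.</param>
      /// <param name="detailAppend">Receives the text to append to the detail rows of an old format file.</param>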
      private bool CheckAndReturnHeader(string sourcePathAndFilename,
        ref string header, ref string headerAppend, ref string detailAppend)
      {
       // Initialize return values.
       header = string.Empty;
       headerAppend = string.Empty;
       detailAppend = string.Empty;
     
       // Open the file to read the header line; the using block closes the file again.
       using (StreamReader reader = new StreamReader(sourcePathAndFilename))
       {
        // ReadLine returns null for an empty file; treat that as an empty header.
        header = reader.ReadLine() ?? string.Empty;
       }
       // !!! ATTENTION !!! Here you could do a check on (parts) of the header,
       // so that you do not change files with an unexpected format.
       if (!header.ToLower().StartsWith("socialsecuritynumber|firstname|lastname|gender|residence"))
       {
        return false;
       }
       // !!! ATTENTION !!! Here you could check for old file formats.
       // Old formats can only be handled when one or more new columns were added at the end of the csv lines.
       // In the example, there is a file format without the Comments column still being delivered.
       // By appending the Comments column with an empty value for those files, they can be processed
       // by the same Data Flow Task using only one Flat File Source Connection.
       if (!header.ToLower().Contains("|comments"))
       {
        // This is an old format without the Comments column.
        // Append Comments column header to header and a delimiter (pipe) to detail rows.
        headerAppend = "|Comments";
        detailAppend = "|";
       }
       return true;
      }
     
      /// <summary>
      /// Determines a text file's encoding by analyzing its byte order mark (BOM).
      /// Defaults to the operating system's current ANSI code page when no BOM is found.
      /// Source: http://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
      /// With thanks to the author '2Toad'.
      /// </summary>
      /// <param name="filename">The text file to analyze.</param>
      /// <returns>The detected encoding.</returns>
      public static Encoding GetEncoding(string filename)
      {
       // Read the BOM
       var bom = new byte[4];
       using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
       {
        file.Read(bom, 0, 4);
       }
     
       // Analyze the BOM
       if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
       if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
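       // UTF-32LE must be tested before UTF-16LE, because both BOMs start with FF FE.
       if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
       if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
       if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
       // The BOM 00 00 FE FF means big-endian UTF-32; Encoding.UTF32 is little-endian.
       if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE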
       // return Encoding.GetEncoding(1252); // Without BOM, default to Windows Codepage 1252
       return Encoding.Default; // Without BOM, the encoding for the OS's current ANSI code page is returned by default.
      }

    When you have adjusted the script for your file, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio window. Then close that window. You return to the Script Task Editor; close it by pressing OK.
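
The script above reads the whole file into memory and warns about large files. As a possible alternative, here is a minimal sketch of a streaming variant that walks the file front to back and holds only one record at a time. It is not the script from this post: the names CsvLineJoiner, LooksLikeRecordStart and JoinBrokenLines are hypothetical, and it assumes that only the last column (Comments) can contain line breaks, so that the first physical line of every record already passes the 'line ok' check.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Illustration only: a hypothetical streaming variant, not the script from the post.
public static class CsvLineJoiner
{
    // Same idea as the 'line ok' check in the demo:
    // at least 12 characters, 8 leading digits and at least 4 pipes.
    public static bool LooksLikeRecordStart(string line)
    {
        if (line.Length < 12) return false;
        for (int i = 0; i < 8; i++)
        {
            if (line[i] < '0' || line[i] > '9') return false;
        }
        int pipes = 0;
        foreach (char c in line)
        {
            if (c == '|') pipes++;
        }
        return pipes >= 4;
    }

    // Glues continuation lines to the record (or header) before them,
    // using crlfReplacer in place of the removed carriage return/line feed.
    public static IEnumerable<string> JoinBrokenLines(IEnumerable<string> lines, string crlfReplacer)
    {
        StringBuilder current = null;
        foreach (string line in lines)
        {
            // The very first line is the header and always starts a new record.
            if (current == null || LooksLikeRecordStart(line))
            {
                if (current != null) yield return current.ToString();
                current = new StringBuilder(line);
            }
            else
            {
                current.Append(crlfReplacer).Append(line);
            }
        }
        if (current != null) yield return current.ToString();
    }
}
```

Combined with File.ReadLines(path, encoding) as input and File.WriteAllLines(tmpPath, joinedLines, encoding) as output, the file is processed line by line. Unlike the back-to-front loop, this sketch does not report whether anything was changed, and it checks each physical line on its own instead of the merged line, so test it against your own files before relying on it.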

    Developing the Data Flow Task

    Add a Data Flow Task to the Control Flow and name it DFT – Import Customer Details.
    Then add the following components to it:

    Flat File Source FF_SRC – CustomerDetails

    From the SSIS Toolbox, drag a Flat File Source to the Data Flow area and create a new Flat File connection manager.
    SSIS-S01E03-121

    If you want to rebuild the demo, download the csv files (and C# script) here.
    Configure the Flat File Connection as shown in the picture below:
    SSIS-S01E03-118

    Choose Advanced in the pane on the left and configure the properties for each column as follows:

    • SocialSecurityNumber: DataType string [DT_STR], OutputColumnWidth 9
    • FirstName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • LastName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • Gender: DataType string [DT_STR], OutputColumnWidth 1
    • Residence: DataType Unicode string [DT_WSTR], OutputColumnWidth 60
    • Comments: DataType Unicode string [DT_WSTR], OutputColumnWidth 1000

    SSIS-S01E03-120
    Example of column properties

    Conditional Split CSPL – Filter out empty lines

    We use a Conditional Split to filter out empty lines in the csv file. Because the SocialSecurityNumber is a required column, we assume that the line is empty (at least it is useless) if SocialSecurityNumber is empty.
    So add an Output named Empty lines with Condition SocialSecurityNumber == "". If you like, you can check on multiple columns to be empty.
    Rename the default Output to Non-empty lines.

    SSIS-S01E03-122
    Conditional Split Configuration

    Derived Column DC – Re-insert CRLF

    Because the (corrected) file might contain replacements for carriage return/line feeds, we are going to replace them with the original carriage return/line feeds, before the value is inserted into the database table.
    In the Derived Column Transformation Editor, add a derived column, configured as follows:

    • Derived Column Name: Comments
    • Derived Column: Replace ‘Comments’
    • Expression: REPLACE(Comments,@[User::CRLF_Replacer],@[User::CRLF])

    SSIS-S01E03-123
    The Derived Column Transformation Editor

    OLEDB Destination OLE_DST stg DimCustomer

    To rebuild the demo, create a table to store the rows from the Data Flow Task
    and add an OLE DB Destination that uses this table.

    destination_table.sql

    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO
    CREATE TABLE [stg].[DimCustomer](
     [SocialSecurityNumber] [char](9) NOT NULL,
     [FirstName] [nvarchar](50) NULL,
     [LastName] [nvarchar](50) NULL,
     [Gender] [char](1) NULL,
     [Comments] [nvarchar](1000) NULL,
     [Residence] [nvarchar](60) NULL,
     CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED
    (
     [SocialSecurityNumber] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO
    SET ANSI_PADDING OFF
    GO

    The Control Flow – File System Tasks

    Finally the File System Tasks archive the original file and delete the corrected file (if it exists).
    For auditing purposes I keep the original file and not the corrected one, because the latter can always be recreated when needed.
    Also, if there were a bug in the creation of the corrected file (initially I had a problem with codepages, which should now be solved) and only the corrected file were kept, the original data might get lost forever.

    Please go back to the picture of the Control Flow at the beginning of this post to see how the File System Tasks are connected.
    Next are the screenshots that should help you to configure the File System Tasks.

    SSIS-S01E03-124
    Configuration of “FST – Move original file to archive folder”

    SSIS-S01E03-125
    Configuration of “FST – Move original file to error folder”

    SSIS-S01E03-126
    Precedence constraint of “FST – Delete temporary file” (you need to set it twice).

    SSIS-S01E03-128
    Configuration of “FST – Delete temporary file”

    The proof

    To show you that it works, I ran the package in debug mode.
    To be able to look at the temporary/corrected file afterwards, I disabled the File System Task that deletes it.
    SSIS-S01E03-130

    When I open the “CustomerDetails.csv.tmp” file after the package has run, I can see that line 15 was corrected.
    From the second screenshot you can see that the CRLF_Replacer value was inserted twice.
    SSIS-S01E03-131
    SSIS-S01E03-132

    And when I run a query on the destination table, I can see that the Comments column contains the original CarriageReturn/LineFeeds!
    SSIS-S01E03-134

    By the way, all the data in the csv files is generated from random firstnames, lastnames and residences and fake social security numbers. Any match with existing persons would be based solely on coincidence.

    Conclusion / Wrap up

    Csv files still are useful today, because they have some advantages when compared with XML files.
    However they also have their limitations, which can be frustrating when loading those files with a Data Flow Task.
    In this blog post I have demonstrated how to overcome a few of those limitations, namely:

    1. carriage return/line feeds in contents of csv files
    2. changing file formats over time

    The solution is based on using a C# Script Task.

    Please be aware that I do not claim this is the best solution.
    If you are very familiar with C#, you could do all of the processing in the Script Task, using SqlBulkCopy, and you would not need a temporary file.
    However the Data Flow Task offers some advantages, like lookups you can use, without the need to do this in a script.
    If you create good Interface Specification Documents beforehand (for instance describing how to handle carriage return/line feeds in column values), you can require all sources to deliver correct csv files, and then they do not have to be corrected.
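
For readers who want to explore the SqlBulkCopy route mentioned above, here is a rough sketch of the part that can be tried without a database: turning corrected lines into a DataTable. The class name CustomerTableBuilder and the method ToDataTable are hypothetical, and the input is assumed to be lines that have already been repaired (one record per line, header first). The actual load is not shown; it would pass the table to SqlBulkCopy.WriteToServer.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Illustration only: a hypothetical helper, not part of the package in this post.
public static class CustomerTableBuilder
{
    private static readonly string[] Columns =
    {
        "SocialSecurityNumber", "FirstName", "LastName", "Gender", "Residence", "Comments"
    };

    // Builds a DataTable from already corrected lines; the first line is the header.
    public static DataTable ToDataTable(IEnumerable<string> lines, string crlfReplacer)
    {
        DataTable table = new DataTable("DimCustomer");
        foreach (string column in Columns)
        {
            table.Columns.Add(column, typeof(string));
        }

        bool isHeader = true;
        foreach (string line in lines)
        {
            if (isHeader) { isHeader = false; continue; }
            if (line.Length == 0) continue; // skip empty lines

            // Split into at most 6 parts, so pipes inside Comments are preserved.
            string[] parts = line.Split(new[] { '|' }, Columns.Length);
            object[] values = new object[Columns.Length];
            for (int i = 0; i < Columns.Length; i++)
            {
                // Old format files have no Comments column: use an empty value.
                values[i] = i < parts.Length ? parts[i] : string.Empty;
            }
            // Put the original carriage return/line feeds back into Comments.
            values[5] = ((string)values[5]).Replace(crlfReplacer, "\r\n");
            table.Rows.Add(values);
        }
        return table;
    }
}
```

Because the demo table stg.DimCustomer lists Comments before Residence, a bulk copy would probably need explicit column mappings by name instead of relying on column order.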

    Download the C# script and csv files here.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Detect duplicates in a Data Flow Task (SSIS Series)

Intended audience

This article assumes that you have at least a basic understanding of SSIS, and the Data Flow Task in particular. C# knowledge is an advantage but not required.

Introduction

If you have ever used the T-SQL windowing function ROW_NUMBER() OVER (PARTITION BY ..) to find duplicates in a query result, this post might be of interest to you.

Not all sources with potential duplicate entries (e.g. business keys) come from a SQL Server database. If the source is a csv or Excel file, ROW_NUMBER will not automatically be your friend (it might be when you use OPENROWSET, but I think that is a rather complicated solution).

C# script components are extremely flexible yet fairly easy to understand, develop and maintain.
This is why this is my preferred solution to tackle the problem above.
Let’s begin!

How to detect duplicates in a Data Flow Task using a C# Script component

First an overview of the end result (the Data Flow of the Data Flow Task). I will tell you step by step how to achieve this using SSIS 2014. I did not make a screenshot of the Control Flow because it only contains one Data Flow Task.
Please note that I enabled Data Viewers on both outputs of the Conditional Split component, so you can see the data that flows through (screenshot later).
SSIS-S01E02-image01
For the demo I will use the following components in a dataflow. To solve your problem just pick out the things you need.

  • An OLE DB Source
  • A script component (the most interesting part!)
  • An (optional) conditional split
  • Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates.

An OLE DB Source

My example is based on AdventureWorks. I select customers from the [AdventureWorksDW] database, and make sure there are duplicates by using UNION ALL in the SQL statement.
In the real world it would be more likely that your source would not be a SQL Server query, but for instance a csv- or Excel file.

OLEDB Source.sql

SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
  WHERE CustomerKey % 2 = 0

SQL Statement for the OLE DB Source

A script component (the most interesting part!)

From the SSIS Toolbox drag a Script Component to the Data flow surface. In the “Select Script Component Type” choose “Transformation”.

SSIS-S01E02-image02

To use a column value in the script, you have to define it as an input column. Select the column you want to check for duplicate values with Usage Type ReadOnly. In the demo this is the CustomerBusinessKey column.
SSIS-S01E02-image03

We also need a new column to store the occurrence number for the CustomerBusinessKey. To do this, in the Script Transformation Editor, Select Inputs and Outputs on the left side, then unfold Output 0 and Output Columns, respectively.
Then click on the Add column button to add a column with DataType four-byte signed integer [DT_I4]. Give the new column a clear name. In the demo it is called CustomerBusinessKeyOccurrence.
SSIS-S01E02-image03b

After these preparations we can start writing the C# script. Press the Edit script … button to do so.
SSIS-S01E02-image04

Adjust the script as follows (the changes are also marked inside red blocks in the picture below):

  • In the Namespaces region, add:
    using System.Collections.Generic;
  • Above PreExecute(), add:
    // The column to count duplicates for will be stored as Key (with string datatype) of the dictionary.
    // The occurrence counter will be stored as Value of the dictionary (with int datatype).
    private Dictionary<string, int> m_dictOccurrencesCount;
  • Inside PreExecute(), add:
    // Initialize the dictionary, otherwise it will be null when accessed.
    m_dictOccurrencesCount = new Dictionary<string, int>();

SSIS-S01E02-image05
Script adjustments – part 1

Then adjust the script as follows:

  • Inside Input0_ProcessInputRow(Input0Buffer Row), add:
    // Count occurrences of CustomerBusinessKeys.
    if (m_dictOccurrencesCount.ContainsKey(Row.CustomerBusinessKey))
    {
      // If the CustomerBusinessKey of the current row is already in the dictionary,
      // increase the counter with 1:
      m_dictOccurrencesCount[Row.CustomerBusinessKey]++;
      // Also assign the new value to the CustomerBusinessKeyOccurrence column,
      // so it can be used in the Data flow later.
      Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[Row.CustomerBusinessKey];
    }
    else
    {
      // If the CustomerBusinessKey of the current row is not yet in the dictionary,
      // add it, and set the counter to 1
      m_dictOccurrencesCount.Add(Row.CustomerBusinessKey, 1);
      // Also assign 1 to the CustomerBusinessKeyOccurrence column.
      Row.CustomerBusinessKeyOccurrence = 1;
    }

SSIS-S01E02-image06
Script adjustments – part 2

When the script is ready, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio window. Then close that window. You return to the Script Transformation Editor; close it by pressing OK.
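
If you want to try the counting logic outside SSIS first, it could look like the minimal sketch below. The class name OccurrenceCounter is hypothetical. It behaves like the dictionary code above, but uses TryGetValue so that each row needs one lookup and one write.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: a standalone version of the occurrence counter.
public sealed class OccurrenceCounter
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // Returns 1 for the first occurrence of a key, 2 for the second, and so on.
    public int Next(string key)
    {
        int count;
        _counts.TryGetValue(key, out count); // count stays 0 when the key is new
        count++;
        _counts[key] = count;
        return count;
    }
}
```

Inside Input0_ProcessInputRow the equivalent call would assign the result of Next(Row.CustomerBusinessKey) to Row.CustomerBusinessKeyOccurrence.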

An (optional) conditional split

Optionally you can add a Conditional Split.
I use the conditional split to store the first occurrences for every CustomerBusinessKey in one table and all duplicates in another table.
Both tables are only there to show you that the script component and conditional split do their jobs properly. In real projects, you might develop this differently, for example by storing only the first occurrence.

SSIS-S01E02-image07
Conditional Split Configuration

Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates

I basically created two temporary tables by composing a SELECT TOP 0 .. INTO statement from the OLE DB Source SQL Statement, because the result of the demo needs to flow somewhere.

SSIS-S01E02-OLEDB-Source.sql
IF OBJECT_ID('[dbo].[DimCustomerFirstOne]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerFirstOne]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerFirstOne]
FROM
  [dbo].[DimCustomer]
GO

IF OBJECT_ID('[dbo].[DimCustomerDuplicates]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerDuplicates]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerDuplicates]
FROM
  [dbo].[DimCustomer]
GO

OLE DB Destination OLE_DST FirstOne uses the [dbo].[DimCustomerFirstOne] table.
OLE DB Destination OLE_DST Duplicate uses the [dbo].[DimCustomerDuplicates] table.

Let it flow, let it flow, let it flow

Okay, time to test.
When data flows, the “First occurrence” and “Duplicates” Data Viewers pop up. Then I can check that all the rows are divided correctly over both tables based on the value of column CustomerBusinessKeyOccurrence: exactly 1 or higher than 1.
When I run it with the Data Viewers disabled, the package execution finished literally in a split second. So the performance is excellent.
Test succeeded!
SSIS-S01E02-image08
The Data flow in Debug Mode

SSIS-S01E02-image09b
Data Viewer for First occurrence flow: CustomerBusinessKeyOccurrence is 1 for all rows.

SSIS-S01E02-image10b
Data Viewer for Duplicates flow: CustomerBusinessKeyOccurrence is 2 or higher for all rows.

Other considerations

If multiple columns together make a row unique, concatenate all column values with a delimiter, and then add the concatenated result to the dictionary m_dictOccurrencesCount.
For instance you could check duplicates using the customer name parts (firstname, middlename and lastname), the birthdate and the gender. This is a softer check than on a real business key, but sometimes you do not have that.
Of course, for this to work, you must add those columns as ReadOnly input columns to the script component.
You can create a concatenated value as follows:
Inside Input0_ProcessInputRow(Input0Buffer Row), start with:

// Concatenate FirstName, MiddleName, LastName, BirthDate and Gender, using a | as delimiter.
string businessKey = string.Format("{0}|{1}|{2}|{3:yyyyMMdd}|{4}",
  Row.FirstName, Row.MiddleName, Row.LastName, Row.BirthDate, Row.Gender).ToUpper();

The concatenated value is converted to uppercase, so that differences in lower- or uppercase lead to the same value.
Then use the businessKey variable (instead of Row.CustomerBusinessKey) in the if statement as shown above.

If the concatenated value becomes very long, you could calculate a hash value and add that to the dictionary instead (thanks for the suggestion, Joost). This would use less memory and might be faster, but the only way to find out is testing .. testing ..
(I did not test this, to keep the demo simple and because performance did not seem to be an issue.)

Thinking a bit further, you could use this concept also to calculate running totals.
Just configure the dictionary to hold a decimal instead of an int and you can calculate running totals for numeric values (e.g. sales figures) easily.

Conclusion / Wrap up

In SSIS you can use a Script Component within a Data Flow Task to detect duplicates in the source data.
This can be an alternative to the T-SQL ROW_NUMBER() function, when the source is not a SQL Server Database.
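
For comparison, this is roughly what the T-SQL alternative looks like when the source is a SQL Server database. It is a minimal sketch, assuming the [dbo].[DimCustomer] table used earlier and [CustomerAlternateKey] as the business key. The ORDER BY column is my assumption; it decides which row counts as the first occurrence.

```sql
SELECT
    [CustomerKey],
    [CustomerAlternateKey],
    -- 1 for the first occurrence of a business key, 2 or higher for duplicates.
    ROW_NUMBER() OVER (
        PARTITION BY [CustomerAlternateKey]
        ORDER BY [CustomerKey]
    ) AS [OccurrenceNo]
FROM [dbo].[DimCustomer];
```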

Credits
My thanks go to Joost van Rossum for reviewing the initial draft of this article. His input was very useful.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Stop being so precise! and more about using Load(end)dates (Datavault Series)

Introduction

My web site was offline a few days ago. Sorry for that.
I got this email from my webhosting provider: “A new message or response with subject: User *** has used up 153% of their bandwidth ..”.
Shortly after they suspended my account. It was suspended for about 8 hours. I have upgraded the network traffic limit.
All right, when all bandwidth was consumed .. then somebody must be reading my blog. Thank you for that.

This week’s post is about the LoadDate and LoadEndDate.
Actually there are two things to be nerdy on:

  1. End dating without gaps
  2. Getting the best value for bytes on DATETIME2 precision

By the way, these topics apply to SQL Server; the examples were made using SQL Server 2014.

1. End dating without gaps

How end dating works

In a Data Vault Satellite, different subsequent versions of a row in the source system are distinguished through a LoadDate (which usually contains a date/time value). So the HashKey of the corresponding Hub Row plus the LoadDate are the primary key for the satellite.
However to get the right version out when querying for a specific date, this requires a relatively difficult query, and this can also have a negative impact on performance.
This is why the LoadEndDate is an optional column of a Satellite, to make querying (read: getting data out) easier and better performing.
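
To give an idea of the “relatively difficult query” when there is no LoadEndDate, here is a minimal sketch. The satellite [dv].[SatCustomer] and its column [CustomerHashKey] are hypothetical names, made up for this illustration.

```sql
-- Hypothetical satellite [dv].[SatCustomer] with primary key ([CustomerHashKey], [LoadDate]).
DECLARE @SnapshotDate DATETIME2(2) = '2016-04-13 20:52:17.00';

-- Without a LoadEndDate: for every hub row, look up the latest LoadDate
-- on or before the snapshot date, and return the satellite row that has it.
SELECT sat.*
FROM [dv].[SatCustomer] sat
WHERE sat.[LoadDate] = (
        SELECT MAX(prev.[LoadDate])
        FROM [dv].[SatCustomer] prev
        WHERE prev.[CustomerHashKey] = sat.[CustomerHashKey]
          AND prev.[LoadDate] <= @SnapshotDate
      );
```

The correlated subquery is what makes this harder to write and potentially slower than a simple range filter.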

It is important to remember that the LoadEndDate is not the date/time the load (process) ended, but the date/time the row was replaced by a newer row for the same business entity in the hub. What’s in a name: if I had to choose I would just call it EndDate, but LoadEndDate is the standard, and once you know it, it is not a problem.

There are two ways to determine the value for this LoadEndDate:

  1. Exclusive: the LoadEndDate is the LoadDate of the new row that replaces this one, minus a small time fraction. Using the exclusive method enables the use of the BETWEEN keyword in Transact-SQL to get the right row out, for example:

    WHERE @SnapshotDate BETWEEN [LoadDate] AND [LoadEndDate]

    This is the method that is used in most Data Vault examples.

  2. Inclusive: the LoadEndDate is exactly equal to the LoadDate of the new row that replaces this one. This requires no computation when updating the LoadEndDate, but disqualifies the BETWEEN keyword. To get data out you need to do something like:

    WHERE @SnapshotDate >= [LoadDate] AND @SnapshotDate < [LoadEndDate]

Problem

There is a (theoretical) problem when using the exclusive method.
If you subtract too much, there will be a tiny time gap between the LoadEndDate and the LoadDate of the subsequent row.
I fully admit this is a theoretical problem, because the chances that you require the row valid at exactly this moment are astronomically small.
(There is this not so hypothetical situation that your load starts at midnight, the LoadDate is set before the first second of the new day has passed, and you subtract a second.
Then you do not find a record when using the exact date (without time fraction) to get the row out. But you need a lot of bad luck for this to happen.)
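
The midnight situation can be reproduced with a few variables. This is only a sketch with made-up dates: the old row is end dated by subtracting a full second, while the LoadDate has a precision of centiseconds.

```sql
DECLARE @OldLoadDate    DATETIME2(2) = '2016-04-13 00:00:00.50';
DECLARE @NewLoadDate    DATETIME2(2) = '2016-04-14 00:00:00.50';
-- Subtracting a whole second is "too much" for a DATETIME2(2) LoadDate.
DECLARE @OldLoadEndDate DATETIME2(2) = DATEADD(SECOND, -1, @NewLoadDate);
-- The exact date without a time fraction, so midnight.
DECLARE @SnapshotDate   DATETIME2(2) = '2016-04-14';

SELECT
      @OldLoadEndDate AS [OldLoadEndDate]
    , CASE WHEN @SnapshotDate BETWEEN @OldLoadDate AND @OldLoadEndDate
           THEN 1 ELSE 0 END AS [MatchesOldRow]
    , CASE WHEN @SnapshotDate >= @NewLoadDate
           THEN 1 ELSE 0 END AS [MatchesNewRow];
```

Both match columns come out as 0: midnight falls after the old LoadEndDate (23:59:59.50) and before the new LoadDate (00:00:00.50), so neither row is found.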

Solution

Still, if you are a purist you want to do it right, certainly because the solution is so simple.
If you make the “grain” of subtraction from the LoadDate equal to the precision of the LoadDate, there is no gap.
E.g.
– Subtract 1 second from DATETIME2(0)
– Subtract 1 centisecond from DATETIME2(2) ***TIP***
– Subtract 1 millisecond from DATETIME2(3)
– Subtract 1 microsecond from DATETIME2(6)
– Subtract 100 nanoseconds from DATETIME2(7)

Examples in T-SQL:

Exclusive enddate without gaps.sql

DECLARE @dt20 DATETIME2(0) = '2016-04-13 20:52:17'
DECLARE @dt22 DATETIME2(2) = '2016-04-13 20:52:17.00'
DECLARE @dt23 DATETIME2(3) = '2016-04-13 20:52:17.000'
DECLARE @dt26 DATETIME2(6) = '2016-04-13 20:52:17.000000'
DECLARE @dt27 DATETIME2(7) = '2016-04-13 20:52:17.0000000'

SELECT @dt20, DATEADD(SECOND, -1, @dt20)
SELECT @dt22, DATEADD(MILLISECOND, -10, @dt22)
SELECT @dt23, DATEADD(MILLISECOND, -1, @dt23)
SELECT @dt26, DATEADD(MICROSECOND, -1, @dt26)
SELECT @dt27, DATEADD(NANOSECOND, -100, @dt27)
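
Applied to a satellite, both end dating methods could be computed with the LEAD window function (available from SQL Server 2012). This is a sketch under assumptions: the satellite [dv].[SatCustomer] and the column [CustomerHashKey] are hypothetical, and [LoadDate] is assumed to be a DATETIME2(2).

```sql
SELECT
      [CustomerHashKey]
    , [LoadDate]
    -- Inclusive method: exactly the LoadDate of the next row for the same hub row.
    , LEAD([LoadDate]) OVER (PARTITION BY [CustomerHashKey] ORDER BY [LoadDate])
        AS [LoadEndDateInclusive]
    -- Exclusive method: the next LoadDate minus one centisecond (10 milliseconds),
    -- which equals the precision of DATETIME2(2), so there is no gap.
    , DATEADD(MILLISECOND, -10,
        LEAD([LoadDate]) OVER (PARTITION BY [CustomerHashKey] ORDER BY [LoadDate])
      ) AS [LoadEndDateExclusive]
FROM [dv].[SatCustomer];
```

For the most recent row of each hub row LEAD returns NULL, so both columns are NULL there. Whether you keep NULL or replace it with a high end date is a design choice this sketch does not make.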

2. Getting the best value for bytes on DATETIME2 precision

This is about a “smart” precision to choose for your LoadDate and LoadEndDate columns. Unlike the older DATETIME datatype, DATETIME2 uses less or more bytes for storage depending on the precision you specify.
The Storage size is: 6 bytes for precisions less than 3; 7 bytes for precisions 3 and 4. All other precisions require 8 bytes. (I shamelessly pasted this from MSDN).
In the book “Building a scalable data warehouse with Data Vault 2.0” a lot of examples use DATETIME2(7). You can ask yourself why. Why do you need to be precise to 100 nanoseconds? I dare to say that in most cases (except when loading realtime or near-realtime streaming data into your Data Vault), seconds would be precise enough.
But looking back a few lines, to the storage sizes, DATETIME2(0) uses 6 bytes, but DATETIME2(2) ALSO uses 6 bytes. So with the latter you get a higher precision for the same storage size. And for daily loads, the centiseconds precision that DATETIME2(2) is providing, is really precise enough, believe me.
So DATETIME2(2) gives us the best value for bytes!
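
If you want to check the storage sizes on your own server, DATALENGTH can be used for that. A small sketch (the column aliases are mine):

```sql
DECLARE @now DATETIME2(7) = SYSDATETIME();

-- DATALENGTH returns the number of bytes used to represent each value.
SELECT
      DATALENGTH(CONVERT(DATETIME2(0), @now)) AS [BytesPrecision0]
    , DATALENGTH(CONVERT(DATETIME2(2), @now)) AS [BytesPrecision2]
    , DATALENGTH(CONVERT(DATETIME2(3), @now)) AS [BytesPrecision3]
    , DATALENGTH(CONVERT(DATETIME2(7), @now)) AS [BytesPrecision7];
```

If the storage sizes quoted above hold, the first two columns should show the same number of bytes, and the last one should be the largest.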


DV-S01E04-precise
Why would your LoadDates and LoadEndDates be more precise than needed ?..


In this way you can save two bytes, compared with a DATETIME2(7), which uses 8 bytes and is used in the book. Because the LoadDate is in the primary key of satellite tables, the primary key index will also be smaller.
In PIT Tables, which usually have multiple LoadDates, the storage gain is even bigger.
And what counts for money, “who does not respect a small gain, isn’t worth a big one”, also counts for the small performance gains you can get from this kind of tweak. Never forget that! Multiple smaller tweaks might add up to a noticeable performance improvement!

Conclusion / Wrap up

In this blog post you have read about the two methods for enddating: Inclusive and Exclusive.
Also you could read how you can use the Exclusive method for end dating in Data Vault satellites, without having gaps in the timeline.
Finally I discussed which precision will in most cases be good enough for LoadDate and LoadEndDate columns.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Hash Diff calculation with SQL Server (Datavault Series)

Updates

Change log

Date Changes
31 January, 2017 Really good news! The 8k limitation on HASHBYTES has been removed in SQL Server 2016!

Today I would like to discuss Hash Diff calculation using SQL Server 2014, from my own experience.

My purpose is NOT to be complete and/or to replicate the book “Building a scalable data warehouse with Data Vault 2.0”. For full details please read the book.

However I will give a short intro for those who hardly know what I am talking about.

Introduction

Hash differences or Hash diffs

A hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s).

A hash difference column is used to detect changes in a new version of a satellite’s row for a particular business key, and can be used in a comparison instead of doing a comparison on every individual descriptive attribute.
The main reason to use hash diff columns is to improve performance when loading new satellite rows (of which one or more attributes have changed).
The more columns the satellite has, the higher the performance gain will be.
To support data warehouse automation patterns, hash diffs can also be used for satellites with only a few or even one descriptive attribute.
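
As an illustration of such a comparison, here is a minimal sketch of selecting the staging rows that need a new satellite row. All object names ([stg].[Customer], [dv].[SatCustomer], [CustomerHashKey]) are hypothetical, and I assume here that the current satellite row is the one with a NULL LoadEndDate.

```sql
SELECT stg.[CustomerHashKey], stg.[HashDiff]
FROM [stg].[Customer] stg
LEFT JOIN [dv].[SatCustomer] sat
       ON sat.[CustomerHashKey] = stg.[CustomerHashKey]
      AND sat.[LoadEndDate] IS NULL           -- assumption: marks the current row
WHERE sat.[CustomerHashKey] IS NULL           -- no satellite row yet for this key
   OR sat.[HashDiff] <> stg.[HashDiff];       -- one or more attributes have changed
```

One comparison on [HashDiff] replaces a comparison per descriptive attribute, which is where the performance gain comes from.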

I think a different name, like RowHash, RecordHash, DataHash or AttributeValuesHash, would better describe the contents of the column, because the value itself is not a difference, but that’s a different (relatively unimportant) discussion.

Hash keys

This article is not about Hash keys, however to be complete on hashing as far as Data Vault 2.0 is concerned, I will give a short introduction on that as well.

Hash keys replace sequence numbers (generated by the database engine) of the Data Vault 1.0 standard. They support geographically distributed data warehouses, as well as integration with big data environments like Hadoop.

A hash key is a hash value of the business key column(s) used in a Hub or Link. The advantage is that it is predictable, which enables parallel loading and a multiplatform / distributed datawarehouse architecture. For instance a customer with code NL123 will have the same hash key on all distributed parts of the data warehouse. Because the hash key always has the same length and is stored as a (fixed length) CHAR column, performance is usually better than when the business key column(s) are directly used as primary key.
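
For the customer code from the example, a hash key could be calculated as sketched below. The trimming and uppercasing of the business key before hashing is my assumption of a sensible convention, not something prescribed here; the variable name is made up.

```sql
DECLARE @CustomerCode NVARCHAR(20) = N'NL123';

-- MD5 gives a binary(16) value; style 2 converts it to 32 hexadecimal characters.
SELECT UPPER(
         CONVERT(CHAR(32),
           HASHBYTES('MD5', UPPER(LTRIM(RTRIM(@CustomerCode))))
         , 2)
       ) AS [CustomerHashKey];
```

As long as every system applies the same conventions (data type, casing, trimming), the same business key leads to the same hash key everywhere.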

How to calculate a hash difference in T-SQL

For full details about Endianness, differences of hash functions between different platforms, how to cope with new columns and other technical details I refer to the book mentioned above. If you want to work with Data Vault 2.0 you will need the book anyway.

As said, a hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s). To do the hashing, we have to put all the column values together in one nvarchar value, and then apply a hashing function on it. This is the short version and the base to understand it.

DV-S01E02-hashdiff

Between the different column values you have to use a delimiter, preferably one that does not occur in the column values. Otherwise you have to “escape” the delimiter in column values.
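
Why the delimiter matters can be shown with two rows that have different column values but the same concatenation. A small sketch with literal values:

```sql
SELECT
      -- 'AB' + 'C' and 'A' + 'BC' both concatenate to 'ABC', so the hashes are equal.
      CASE WHEN HASHBYTES('MD5', CONCAT(N'AB', N'C'))
              = HASHBYTES('MD5', CONCAT(N'A', N'BC'))
           THEN 1 ELSE 0 END AS [SameHashWithoutDelimiter]
      -- With a delimiter the inputs are 'AB;C' and 'A;BC', which are different strings.
    , CASE WHEN HASHBYTES('MD5', CONCAT(N'AB', N';', N'C'))
              = HASHBYTES('MD5', CONCAT(N'A', N';', N'BC'))
           THEN 1 ELSE 0 END AS [SameHashWithDelimiter];
```

Without a delimiter a value that moves from one column to the next would go undetected.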

If you want to do a case insensitive compare, you should convert all values to either upper- or lowercase. But this means that if in the source system a customer name was “hans michiels” and is changed to “Hans Michiels”, you will not detect the change and therefore do not store it in the Data Vault. So it depends on the contents of the satellite and your business view on it if this is desired behaviour or not.

The hash diff calculation results in a binary(16) value, when using the hashing algorithm MD5 (which is recommended). To enable better cross-platform- and tools support, this is then converted to a hexadecimal string of 32 characters, stored as a CHAR(32) column.

In all cases you must also convert the hexadecimal hash value to all uppercase or all lowercase, because convert functions differ: some output lowercase, others uppercase. When you compare values produced by different hash functions (read: on different systems), a consistent case makes sure the values do not differ in casing only, which would lead to a false negative answer on the “are the rows the same” question.

I understand this is all dry and boring nerd stuff if you do not see an example, so .. a simple example:

Calculating_a_hash_diff.sql

SELECT
      [bla]
    , UPPER(
        CONVERT(CHAR(32),
          HASHBYTES('MD5',
            UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR, [BusinessKeyOfHub]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column01]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column02]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column03]), N'')
              ) -- END CONCAT
            ) -- END UPPER
          ) -- END HASHBYTES
        , 2) -- END CONVERT
      ) -- END UPPER
FROM [stg].[bla]

So far nothing shocking, but there are a few pitfalls. If you keep them in mind, you will be an excellent hasher soon. 😉

CONVERT(NVARCHAR without specifying the size

Do this only for values of 30 characters or fewer; otherwise the value will be cut off, which could lead to wrong hash diff values that do not change when the attribute value changes. It is for instance okay for integers, decimals (unless the precision is more than 28; mind the decimal separator and a possible negative sign), datetime stamps, and bits. When converting (n)varchar, the safest thing to do is to follow the original size. For unique identifiers use 38 or 40.
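
The cut-off is easy to demonstrate. A minimal sketch with a made-up variable of 40 characters:

```sql
DECLARE @LongValue NVARCHAR(100) = REPLICATE(N'x', 40);

SELECT
      -- No size specified: CONVERT uses a default length of 30 and silently truncates.
      LEN(CONVERT(NVARCHAR, @LongValue))      AS [LengthWithoutSize]  -- 30
      -- Size specified: the full value is kept.
    , LEN(CONVERT(NVARCHAR(100), @LongValue)) AS [LengthWithSize];    -- 40
```

No error or warning is raised, which is exactly why this pitfall is easy to miss.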

Beware of CONCAT

Beware of concatenating (N)VARCHAR(MAX), (N)TEXT and XML columns using the CONCAT function. It appears that only the first 8000 bytes of (N)VARCHAR(MAX) columns are in the concatenated result.

And even worse, HASHBYTES ..

In SQL Server 2014 and earlier, HASHBYTES only hashes an NVARCHAR value with a length of 4000 or less. Above that the error “String or binary data would be truncated” occurs. This is really important to keep in mind.
Update: the limitation of 8000 characters (or 4000 unicode characters) on any string hashed with the HASHBYTES function has been removed in SQL Server 2016! Now you can hash larger (N)VARCHAR values!

Cope with the HASHBYTES and CONCAT limitations

4000 characters is quite something, right? Only with very wide satellites (a lot of columns) and/or usage of (N)VARCHAR(MAX), (N)TEXT or XML columns (hereafter called ‘unlimited length columns’) you can get into trouble.

So, what is wise to do? Some measures you can take are:

Make your satellites not too wide

This is a best practice, not only from hashing function perspective. Split the satellites by source system, then by rate of change. Keep in mind that the business key column(s) and all satellite column values are converted to NVARCHAR and then (including all delimiters) may not exceed the 4000 characters if you want to use the HASHBYTES function.

Size unlimited length columns according to the actual maximum length

What I mean is, if for instance a Country name is stored in an unlimited length column in the source system, you know that a size of 100 will be long enough. When in doubt, just do some data profiling on the source to see what the maximum length is. Use that size plus a safety margin for the satellite column size. Problem gone.
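
Such data profiling can be as simple as the sketch below. The table [src].[Customer] and column [CountryName] are hypothetical, and the column is assumed to be (N)VARCHAR(MAX).

```sql
SELECT
      MAX(LEN([CountryName]))        AS [MaxCharacters]  -- trailing spaces are not counted
    , MAX(DATALENGTH([CountryName])) AS [MaxBytes]       -- 2 bytes per character for NVARCHAR
FROM [src].[Customer];
```

LEN does not accept (N)TEXT columns; for those, use only the DATALENGTH line.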

Handle true unlimited length columns

Put true unlimited length columns (for instance free entry ‘memo’ fields or xml columns in source systems) in their own satellite! Otherwise, if any other (small) attribute changes, the 4k memo is copied, even when it was not changed. This absorbs storage. But that alone is not enough: as said, with HASHBYTES, above 8000 characters (4000 when unicode) you are still stuck. Luckily there is a system function in the master database, which uses the MD5 algorithm and takes a binary value as input. So to compute an MD5 hash value for a text column with unlimited length, you could do something like:

Hash_for_memo.sql

SELECT master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
[MyHugeMemoField])) AS [HashDiff]

But don’t make the mistake of using the function [sys].[fn_repl_hash_binary] for all hashing, because the performance of HASHBYTES is much better!
On my machine HASHBYTES is three times faster than [sys].[fn_repl_hash_binary] on the same dataset.
You can check for yourself using this script:

Performance_comparison.sql

USE [msdb]
GO

DECLARE @start DATETIME2(7)
DECLARE @stop DATETIME2(7)
DECLARE @elapsed1 BIGINT
DECLARE @elapsed2 BIGINT
--\-------------------------------------------------------------------------
---) TEST PERFORMANCE of HASHBYTES function.
--/-------------------------------------------------------------------------
SELECT @start = GETDATE();
SELECT TOP 1000000
      UPPER(
        CONVERT(CHAR(32),
          HASHBYTES('MD5',
            --UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
              ) -- END CONCAT
            --) -- END UPPER
          ) -- END HASHBYTES
        , 2) -- END CONVERT
      ) -- END UPPER
  FROM [sys].[objects] o1
  CROSS JOIN [sys].[objects] o2
  CROSS JOIN [sys].[objects] o3;

SELECT @stop = GETDATE();
SELECT @elapsed1 = DATEDIFF(MICROSECOND,@start, @stop);

--\-------------------------------------------------------------------------
---) TEST PERFORMANCE of [master].[sys].[fn_repl_hash_binary] function.
--/-------------------------------------------------------------------------
SELECT @start = GETDATE();
SELECT TOP 1000000
      UPPER(
        CONVERT(CHAR(32),
          master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
            --UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
                 ) -- END CONCAT
            --) -- END UPPER
          ) -- END CONVERT(VARBINARY(MAX)
          ) -- END fn_repl_hash_binary
        , 2) -- END CONVERT
      ) -- END UPPER
  FROM [sys].[objects] o1
  CROSS JOIN [sys].[objects] o2
  CROSS JOIN [sys].[objects] o3;

SELECT @stop = GETDATE();
SELECT @elapsed2 = DATEDIFF(MICROSECOND,@start, @stop);

SELECT @elapsed1 AS [Elapsed_HashBytes]
, @elapsed2 AS [Elapsed_fn_repl_hash_binary]
, 'HASHBYTES is ' + CONVERT(VARCHAR, CONVERT(decimal(19, 2), 1.0 * @elapsed2 / NULLIF(@elapsed1, 0)))
+ ' times faster than [fn_repl_hash_binary]' AS [Result]

That’s about it.
Just a few more things:

  • If you use a different hashing algorithm (e.g. SHA1) you cannot use the sys.fn_repl_hash_binary function. In that case you will have to seek your solution in a C# script task or CLR stored procedure, doing the hashing in .NET code.
  • Don’t use CONCAT if you have more than one column in the satellite containing [MyHugeMemoField], or you still can run into problems.

Conclusion / Wrap up

I have explained in short why Hash diff columns are used in a Data Vault 2.0 satellite, and how these values can be calculated.

After that, you have read about some limitations of the CONCAT and HASHBYTES function when used for calculating Hash diff values, and how to cope with these limitations.

Credits
My thanks go to Vincent Kuiper for reviewing the initial draft of this article.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

A Plug and Play Logging Solution (SSIS Series)

Introduction

Ever had to investigate an error that occurred during execution of an SSIS package? Then you will know that logging is the key to knowing what happened.

And not only logging in SSIS itself, but also in stored procedures, C# script tasks and C# console applications that you might have written and that are executed from within the package.
SSIS-S01E01-log-story
So the log tells you a story (I once heard “so it reads like a thrilling boys’ book”, I liked that expression, GJ).

Also you do not want to reinvent the wheel for every SSIS project, but want to use a standard solution. Plug and Play instead of spending sprints on “utilities”.

Feature list

The features in short (as compared with what SSIS offers out of the box):

  1. It’s a Plug And Play SSIS Logging Solution in most cases: no additional programming in SSIS Packages is required*), so it can easily be used for all existing packages! Only when you have multiple master packages with the same name will you have to do a little setup work.
  2. Log messages are labeled with a message class (a sort of severity of the message) for easy filtering.
  3. Changing the maximum message class to log is possible without redeploying any SSIS Packages (for instance enable VERBOSE logging for troubleshooting, then set it back to DEBUG or INFO).
  4. Log messages are labeled with a retention class for easy purging. There is also an “audit” retention class for messages that may never be purged/deleted.
  5. Purging can be automated, so no scheduling is needed.
  6. Logging is integrated/centralized for SSIS Packages, stored procedures and other home-made software.
  7. Deduplicated logging can be enabled. If this is done, identical SSIS logging messages (e.g. first from the OLE DB Destination, then from the Data Flow Task, and then from the SSIS Package) are logged only once as much as possible.
  8. Master / child relations for SSIS packages can be configured, so that log entries from master and child packages can easily be retrieved together in a single timeline. Out of the box this works using a naming convention (the master package must have “master” in the name), but it can be fine-tuned for every package that runs.
  9. Easy configuration by executing a stored procedure, to set retention periods, maximum log class, automatic purging on/off, deduplicated logging on/off, and naming convention for the master package.
  10. Storage of both start and end times in the current timezone as well as UTC start and end times.

*) As far as the logging that SSIS does is concerned. If you want logging in your own stored procedures and executables you need to implement it first.

Quick Start Guide

  1. Download the installation solution script here.
  2. Execute the entire script to install on a new database [logdb]. To install on an existing database, skip the part of the script where the database is created and execute the rest against the existing database. All objects the solution creates are created in database schema [log] (except for stored procedure [dbo].[sp_ssis_addlogentry]; this name is needed by SSIS).
  3. Enable logging in your SSIS Packages if you have not already done that. Use a logging provider for SQL Server and select all events for all components of the package (more details below).
  4. Configure the connection managers of your SSIS Packages to use the database in which you installed the solution. Alternatively you can execute the stored procedure [log].[spCreateCrossRefs] to redirect logging for existing packages to the logging solution without any changes in your SSIS packages. See the User Manual for more details.

Steps to follow when you want to enable logging in your SSIS Package:

SSIS-S01E01-enable_logging

User Manual

Introduction

The “hansmichiels.com Plug And Play Logging Solution” provides Plug and Play logging for SSIS Packages, stored procedures and other home made software, like command line applications or web applications.
After installing the solution, as explained in the Quick Start Guide, you can manage the solution using a few stored procedures, or accept the default settings and run it out of the box.

Understanding MessageClass and RetentionClass

MessageClass can have one of these 6 values:

  • 1 = Fatal Error (SSIS OnError event)
  • 2 = Non-fatal error (reserved for user defined logging for non-fatal errors, not used by SSIS packages)
  • 3 = Warning (SSIS OnWarning event)
  • 4 = Info message (SSIS: PackageStart, PackageEnd, OnInformation events)
  • 5 = Debug message (SSIS: OnPreValidate, OnPostValidate, OnPreExecute, OnPostExecute events)
  • 6 = Verbose messages (SSIS: Pipeline, Diagnostic and OnProgress events, typically more than 90% of all SSIS log messages)
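
The mapping above can be read as a simple CASE expression. The sketch below is only my illustration of the list, not the actual code of the solution; the variable name is made up.

```sql
DECLARE @event SYSNAME = N'OnWarning';

SELECT CASE
         WHEN @event = N'OnError'   THEN 1  -- Fatal error
         -- 2 (non-fatal error) is reserved for user defined logging, never set by SSIS.
         WHEN @event = N'OnWarning' THEN 3  -- Warning
         WHEN @event IN (N'PackageStart', N'PackageEnd', N'OnInformation')
                                    THEN 4  -- Info
         WHEN @event IN (N'OnPreValidate', N'OnPostValidate',
                         N'OnPreExecute', N'OnPostExecute')
                                    THEN 5  -- Debug
         ELSE 6                             -- Verbose: Pipeline, Diagnostic, OnProgress
       END AS [MessageClass];
```

For the example value N'OnWarning' this returns 3.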

RetentionClass aligns with MessageClass because it also has values 1 to 6 and by default the RetentionClass will be equal to the MessageClass. However, when you do custom logging in your own stored procedures or executable programs, you can provide a different value for it, for instance MessageClass 4 (INFO) but RetentionClass 1 if you want the message to be kept longer than the value for RetentionClass 4.

In addition there is a RetentionClass 0 for messages that may never be purged/deleted.

There are two ways to prevent log messages from being purged/deleted:
  1. By setting the configuration setting @RetentionPeriodClassN (where N is from 1 to 6) to 0. Then ALL messages for that RetentionClass will never be purged. However when the configuration setting is changed to a different value, the next time spPurgeLog is executed, the messages can still be deleted.
  2. By explicitly adding log entries with RetentionClass 0. Those entries do not depend on configuration settings, and are never deleted, unless you delete them directly from the [log].[Entry] table.



Understanding the stored procedures and user defined function of the solution

Stored procedure [log].[spConfigure]

With this stored procedure you can change a number of configuration settings which are parameters of the stored procedure.

SSIS-S01E01-configure
Configuration: a child can do the laundry

You only have to provide values for the configuration settings (parameters) that you want to change. Other settings will remain unchanged.

You can set the following parameters:

  • @ExecutionGuid: Guid used for logging the configure operation. Can be NULL.
  • @MaxMessageClass: The maximum MessageClass to log. For instance 5 means that log entries with MessageClass=6 (VERBOSE) are not logged. The default value is 5 (DEBUG).
  • @RetentionPeriodClass1: Hours to keep a log entry if the [RetentionClass] column has value 1. Set the value to 0 if the log message may never be deleted. The default value is 0.
  • @RetentionPeriodClass2: The same for [RetentionClass] = 2. The default value is 0.
  • @RetentionPeriodClass3: The same for [RetentionClass] = 3. The default value is 4320 hours or 180 days.
  • @RetentionPeriodClass4: The same for [RetentionClass] = 4. The default value is 2160 hours or 90 days.
  • @RetentionPeriodClass5: The same for [RetentionClass] = 5. The default value is 336 hours or 14 days.
  • @RetentionPeriodClass6: The same for [RetentionClass] = 6. The default value is 168 hours or 7 days. Verbose messages are typically more than 90% of all messages, so a short retention period is recommended, if they are logged at all.
  • @DefaultMasterSourcePattern: default naming convention for master packages. Use wildcard % for name parts. The default value is '%Master%'.
  • @EnableAutoPurgeOnPackageEnd: If enabled, spPurgeLog will be executed every time an SSIS package ends. This keeps the log clean without any additional scheduling effort, but the downside is that spPurgeLog might be executed more often than when it is scheduled daily. This is not harmful, but could have a negative impact on package performance. The default value is 0 (false).
  • @EnableDeduplicatedLogging: If enabled, a staging table [log].[EntryStaging] will be created and duplicate SSIS log messages (with different sources) will be deduplicated as much as possible. Duplicates may still occur in some situations, but far fewer than when this setting is not enabled. The default value is 0 (false).
  • @ResetAllToDefaultValues: Use 1 to reset all values to the values of the DEFAULT constraints on the [log].[Config] table. In that case all other parameter values are ignored.

Examples of use:

EXECUTE [log].[spConfigure] @ResetAllToDefaultValues = 1;

EXECUTE [log].[spConfigure]
@EnableDeduplicatedLogging = 1,
@EnableAutoPurgeOnPackageEnd = 1,
@MaxMessageClass = 6,
@RetentionPeriodClass2 = 8640,
@RetentionPeriodClass6 = 48;




Stored procedure [log].[spAddLogEntry]

You can use this stored procedure to implement logging in your own stored procedures or executable programs.

It is advised that your own stored procedures that are called from within SSIS packages have a parameter @ExecutionGuid (with datatype UNIQUEIDENTIFIER). Provide the value from the package variable System::ExecutionInstanceGUID.

You can set the following parameters:

  • @Event: The log event, can be NULL. If you leave it empty, the event name will be based on the value for @MessageClass.
  • @Source: The log source.
  • @SourceGuid: The log source guid, can be NULL.
  • @ExecutionGuid: A globally unique identifier used to distinguish log messages of different jobs running in parallel.
  • @StartTime: Can be NULL, if NULL the current date/time is used.
  • @EndTime: Can be NULL, if NULL the value of @StartTime is used (which can be set to current date/time).
  • @MessageClass: The MessageClass of the message: 1=Fatal Error, 2=Error, 3=Warning, 4=Info, 5=Debug, 6=Verbose
  • @RetentionClass: A numeric indicator for the time after which the log messages can be deleted or purged. Values align with MessageClass, and in addition there is a RetentionClass with value 0 for log messages that may never be deleted, e.g. for auditing purposes. Can be NULL, if NULL the value provided for MessageClass will be used.
  • @Message: The log message.

Examples of use:

-- At the beginning of a stored procedure:
DECLARE @Source NVARCHAR(1024) = 'log.spCreateCrossRefs';
DECLARE @Message NVARCHAR(2048) = 'Starting: ' + @Source;
EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
@Source = @Source, @MessageClass = 4;

-- At the end of a stored procedure:
SELECT @Message = 'Finished: ' + @Source;
EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
@Source = @Source, @MessageClass = 4;

-- To log something specific:
SELECT @Message = 'Purged ' + CAST(@RowCount AS NVARCHAR) + ' log entries with RetentionClass ' + CAST(@RetentionClass AS NVARCHAR) + '.';
EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
@Source = @Source, @MessageClass = 4, @RetentionClass = 0;
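
Put together, a stored procedure that is called from an SSIS package could look roughly like the sketch below. The procedure name [dbo].[uspLoadDimCustomer] is hypothetical, and the sketch assumes that [log].[spAddLogEntry] can be reached from the database the procedure lives in (for instance through the cross references described further on). THROW requires SQL Server 2012 or later.

```sql
-- Hypothetical stored procedure, meant to be called from an SSIS package.
CREATE PROCEDURE [dbo].[uspLoadDimCustomer]
    @ExecutionGuid UNIQUEIDENTIFIER
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @Source NVARCHAR(1024) = N'dbo.uspLoadDimCustomer';
    DECLARE @Message NVARCHAR(2048) = N'Starting: ' + @Source;

    EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
        @Source = @Source, @MessageClass = 4;

    BEGIN TRY
        -- The actual work of the procedure goes here.
        SELECT 1 AS [Replace with your own code];
    END TRY
    BEGIN CATCH
        -- Log the failure as a non-fatal error (MessageClass 2), then re-raise it
        -- so that the calling package still notices the error.
        SET @Message = N'Failed: ' + ERROR_MESSAGE();
        EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
            @Source = @Source, @MessageClass = 2;
        THROW;
    END CATCH

    SET @Message = N'Finished: ' + @Source;
    EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
        @Source = @Source, @MessageClass = 4;
END
GO
```

In an Execute SQL Task you would then map the package variable System::ExecutionInstanceGUID to the @ExecutionGuid parameter.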




Stored procedure [log].[spPurgeLog]

Schedule the execution of this stored procedure daily when you do not want to set EnableAutoPurgeOnPackageEnd to 1 (true).

This stored procedure has a parameter @ExecutionGuid that is used for logging the purge operation. You can omit the parameter; in that case a new unique identifier value is used.


Example of use:



EXECUTE [log].[spPurgeLog]
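
If you prefer to pass the guid yourself, for instance to use the same value for other log entries of the same job, the call could look like this (a minimal sketch that only uses the parameter described above):

```sql
-- Generate the guid yourself and pass it to the purge operation.
DECLARE @ExecutionGuid UNIQUEIDENTIFIER = NEWID();

EXECUTE [log].[spPurgeLog] @ExecutionGuid = @ExecutionGuid;
```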


Stored procedure [log].[spCreateCrossRefs]

Stored procedure to create synonyms in a database and create or replace the stored procedure [dbo].[sp_ssis_addlogentry] (hereafter collectively called: the cross references)

Use this stored procedure for two purposes:

  1. If you want to do logging in your own stored procedures in the database specified with parameter @DatabaseName
  2. If you want to redirect logging for SSIS Packages that use the database specified with parameter @DatabaseName for logging. In this way those packages will use the logging solution without the need to change the connection string.

You can set the following parameters:

  • @ExecutionGuid: Guid used for logging the configure operation. Can be NULL.
  • @DatabaseName: the name of the database in which the cross references must be created.
  • @DebugMode: if 1, more details will be shown in the output window
  • @SimulationMode: if 1, only the SQL Statements are shown in the output window, but they are not executed.
  • @DropOnly: if 1, the cross references are only dropped, but not (re)created.



Example of use:

USE [logdb]
GO
EXECUTE [log].[spCreateCrossRefs]
    @DatabaseName = 'MyOtherDb'
  , @DebugMode = 1
  , @SimulationMode = 1
  , @DropOnly = 0
GO


Stored procedure [log].[spSetMasterSourcePattern]

While packages run, the table [log].[Source] is filled with Source (= SSIS Package) and MasterSource information, based on the configuration setting @DefaultMasterSourcePattern.

If the value for @DefaultMasterSourcePattern is correct for all packages that run, you do not have to change anything.

However when the Master/child Source relations need to be changed, you can use stored procedure [log].[spSetMasterSourcePattern] to configure master- and child package relationships.

The definition of a master package is a package that executes another SSIS Package, the child package. When going more than one level deep, only the ultimate parent package (the one that starts everything else) has to be defined, the “package tree” is not relevant. For instance Package A executes Package B, and Package B executes Package C. In this situation A should be defined as master package for C.

You can set the following parameters:

  • @ExecutionGuid: Guid used for logging the configure operation. Can be NULL.
  • @Source: The log source or log source pattern. For a pattern use a % wildcard, e.g. “Load%” for all sources of which the name starts with “Load”.
  • @SourceGuid: Source Guid, if NULL Source will be retrieved with the value of @Source only.
  • @MasterSourcePattern: The master source to use for this @Source/@SourceGuid
  • @MasterSourceGuid: Master Source Guid, if NULL the connection to the Master Source will be made using only the @MasterSourcePattern. You only need to configure @MasterSourceGuid if you have multiple master packages in different projects with the same name (e.g. all called “Master.dtsx”) that can run at the same time. In that situation you have to link child packages to master packages using both @MasterSourcePattern and @MasterSourceGuid.

The easiest way to configure your master and child packages is to use spConfigure to set @DefaultMasterSourcePattern so that it matches your master package name.
Then run your master package once. In this way the table [log].[Sources] is already filled with all executed (master and child) packages, so you do not have to do that yourself.
Then use the examples below to fine-tune, if necessary.

Examples of use:

--\
---) Set MasterSourcePattern for all sources (use only after SSIS Packages have run for the first time).
--/
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = '%', @MasterSourcePattern = 'Master%'
GO

-- Then undo for the master sources themselves
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'Master%', @MasterSourcePattern = NULL
GO

--\
---) Set MasterSourcePattern for one source.
--/
-- Without SourceGuid (use only after SSIS Packages have run for the first time).
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'LoadDimCustomer', @MasterSourcePattern = 'DWHMaster'
GO

-- With SourceGuid and MasterSourceGuid (can be used prior to running any packages).
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'LoadDimCustomer',
  @SourceGuid = '7480D07D-5099-4B76-9404-17C2AFD0E603', -- PackageID property of SSIS Package.
  @MasterSourcePattern = 'Master',
  @MasterSourceGuid = '0F429F36-784D-4F05-8D5A-9374609A32B9' -- PackageID property of your master SSIS Package.
GO
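
For the three-level situation described earlier (Package A executes Package B, and Package B executes Package C), the configuration could be sketched as follows. The package names are hypothetical; note that C points at A, not at B.

```sql
-- Package A is the ultimate parent, so it is the master for both B and C.
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'PackageB', @MasterSourcePattern = 'PackageA'
GO

EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'PackageC', @MasterSourcePattern = 'PackageA'
GO
```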




Stored procedures [log].[spOnExecutionStart] and [log].[spOnExecutionEnd]

Used internally by the logging solution.



User defined function [log].[fnLogEntry]

User defined function to select log entries.

This function has the following parameters:

  • @Source: Packagename to select log entries for (including log entries of subpackages). Can be NULL. If NULL then the last execution of any Package will be selected.
  • @LastXExecutions: A number to indicate for how many executions you want to see the log. Defaultvalue is 1.
  • @MaxMessageClass: Filter on MessageClass, for instance use 2 to show only log messages with a MessageClass less than or equal to 2. Defaultvalue is 6.
  • @IncludeRetention0: If 1, also messages with RetentionClass 0 (never delete) will be shown, regardless of their MessageClass.

Example of use:

SELECT * FROM [log].[fnLogEntry] ('Master', 1, 4, 1)
GO
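
Two more variations, as a sketch. The parameter order is assumed to be the order of the list above (which matches the example), and 'LoadDimCustomer' is a hypothetical package name.

```sql
-- Only errors and warnings (MessageClass 1 to 3) of the last 3 executions
-- of one package, without the RetentionClass 0 messages.
SELECT * FROM [log].[fnLogEntry] ('LoadDimCustomer', 3, 3, 0)
GO

-- Everything that was logged for the last execution of any package.
SELECT * FROM [log].[fnLogEntry] (NULL, 1, 6, 1)
GO
```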




How to implement logging in your own executables

Logging in your own executables (for instance C# Console Applications) is fairly easy to set up.
Keep the following in mind:

  • Generate a new guid every time the executable runs, and use it as value for @ExecutionGuid.
  • Assign the executable a persistent SourceGuid, and use it as value for @SourceGuid.
  • Use the executable name as value for @Source. While the executable runs, you can use lower level names (e.g. class and/or method names) for @Source, but still use the executable @SourceGuid, and not a different one.
  • When the executable starts, use the stored procedure spAddLogEntry to add an entry with @Event = 'ExecutionStart'. This triggers built-in facilities of the logging solution.
  • Just before the executable stops, use the stored procedure spAddLogEntry to add an entry with @Event = 'ExecutionEnd'.
  • When the application is called from within an SSIS package, use spSetMasterSourcePattern to create a master/child relationship.
  • If you are interested in a C# code example just let me know, and I will dedicate a future blog post to it.
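
Seen from the database side, the calls that such an executable makes could look like the sketch below. The executable name, the lower level source name, the messages and the guid value are all hypothetical; only the parameter names come from the description of spAddLogEntry above.

```sql
-- A new guid for every run, and a fixed guid that identifies the executable.
DECLARE @ExecutionGuid UNIQUEIDENTIFIER = NEWID();
DECLARE @SourceGuid UNIQUEIDENTIFIER = '3F2504E0-4F89-11D3-9A0C-0305E82C3301'; -- hypothetical, keep it the same for every run

-- When the executable starts:
EXECUTE [log].[spAddLogEntry]
    @Event = N'ExecutionStart', @Source = N'MyConsoleApp.exe', @SourceGuid = @SourceGuid,
    @ExecutionGuid = @ExecutionGuid, @MessageClass = 4, @Message = N'Starting: MyConsoleApp.exe';

-- While it runs: a lower level source name, but still the same @SourceGuid.
EXECUTE [log].[spAddLogEntry]
    @Source = N'MyConsoleApp.Importer.Run', @SourceGuid = @SourceGuid,
    @ExecutionGuid = @ExecutionGuid, @MessageClass = 4, @Message = N'Imported 3 files.';

-- Just before the executable stops:
EXECUTE [log].[spAddLogEntry]
    @Event = N'ExecutionEnd', @Source = N'MyConsoleApp.exe', @SourceGuid = @SourceGuid,
    @ExecutionGuid = @ExecutionGuid, @MessageClass = 4, @Message = N'Finished: MyConsoleApp.exe';
```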

 

Download the logging solution script here.

How about updates?

I intend to update the solution when needed.
If you find bugs or have interesting additions, I would be thankful if you can share those with me so I can improve the logging solution.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

The CONTROLLER – TASK design pattern for stored procedures

(This blog post was first published on 25-nov-2013 on my former blog)

The challenges

Working with SQL Server and stored procedures for several years, I have faced several challenges like:

  1. reuse of stored procedures: make sure stored procedures are small components that you can use for different purposes.
  2. maintainability of stored procedures: make sure stored procedures do not become large and procedurally written, because that makes them hard to maintain.
  3. generating code versus manual maintenance: how to merge generated and manual SQL code easily.
  4. where to implement error handling and the atomic unit using a transaction: how to ensure these without harming the needs above.

And I have found a way to deal with this: The CONTROLLER - TASK design pattern for stored procedures. This pattern addresses all four challenges in one working solution!

1) Reuse of stored procedures

Sometimes you want to reuse a particular INSERT, UPDATE or DELETE statement. When it is in a stored procedure you can simply call this stored procedure! But what if this stored procedure also contains other code, uses its own transaction or implements its own error handling? Then this might be not that simple!

The CONTROLLER-TASK design pattern for stored procedures makes reuse of SQL code very easy.

2) Maintainability of stored procedures

A best practice in C# programming (or any type of programming really) that has become even more popular over the last years, is to create short methods containing only a small amount of code, and having a “main” method to call all the short methods.

Advantages: the main method is very readable (when method names are self-explanatory), and code stays really maintainable. Also, short methods are easier to unit test.

Why not do the same in SQL code then?

Why not have an inventory of short stored procedures, either generated or manually made, that do only one task, and shop around to glue together what you need in a “main” stored procedure?

When working like this in a team:

- tasks can be divided more easily when stored procedures are short

- stored procedures are easier to understand for new developers when they are short

The CONTROLLER-TASK design pattern for stored procedures makes your SQL code more maintainable.

3) Generating code versus manual maintenance

How convenient would it be to generate generic stored procedures, for instance UPDATE stored procedures for each table, a LOAD stored procedure for each hub in a Data Vault data warehouse, and so on? Generating is not much more difficult than creating normal stored procedures. Just query catalog views like sys.objects, sys.columns, and so on, and a lot is possible.
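
To give an idea, here is a minimal sketch of such a generator. It builds the text of a load procedure for one table from sys.columns. The table name, the staging schema [stg], the target schema [sp] and the procedure name GTSK_Load_Customer are assumptions made for this illustration; a real generator would also have to deal with data types, keys and column types that cannot be inserted.

```sql
DECLARE @SchemaName SYSNAME = N'dbo';
DECLARE @TableName  SYSNAME = N'Customer';   -- hypothetical table name
DECLARE @ColumnList NVARCHAR(MAX);
DECLARE @Sql        NVARCHAR(MAX);

-- Build a comma separated list of the insertable columns of the table.
SELECT @ColumnList = STUFF((
        SELECT N', ' + QUOTENAME(c.name)
        FROM sys.columns AS c
        WHERE c.object_id = OBJECT_ID(QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName))
          AND c.is_identity = 0
          AND c.is_computed = 0
        ORDER BY c.column_id
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, N'');

IF @ColumnList IS NULL
BEGIN
    PRINT N'Table not found, or it has no insertable columns.';
END
ELSE
BEGIN
    -- Glue the generated stored procedure together as text.
    SET @Sql = N'CREATE PROCEDURE [sp].' + QUOTENAME(N'GTSK_Load_' + @TableName) + N'
AS
BEGIN
  INSERT INTO ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N' (' + @ColumnList + N')
  SELECT ' + @ColumnList + N'
  FROM [stg].' + QUOTENAME(@TableName) + N';
END';

    -- Review the generated code first; run it with sp_executesql when you are satisfied.
    PRINT @Sql;
    -- EXECUTE sys.sp_executesql @Sql;
END
```

Wrapping this in a loop over sys.objects (or sys.tables) would generate one such procedure per table.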

Imagine that these stored procedures populate your inventory of “tasks”, ready to use and glue together in a “main” stored procedure.

The CONTROLLER-TASK design pattern makes it extremely easy to combine manually made and generated SQL code into a working solution.

4) Where to implement the error handling and atomic unit using a transaction?

Is a stored procedure still reusable, if it is running in its own transaction? Maybe, if you use nested transactions. But there are drawbacks and risks. Code can become messy when this is not implemented in the same way in different stored procedures.

Is a stored procedure still reusable, if it implements its own error handling? Maybe, but you have to be careful. If it ‘swallows’ the error, the stored procedure that calls it will not notice it and will continue after the error. This is almost never what you want. Besides that, implementing error handling in every stored procedure, makes your code base consist of many more lines.
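
The effect of a ‘swallowed’ error is easy to reproduce. In this small sketch the first part plays the role of a task that handles its own error without raising it again, and the last line plays the role of the caller:

```sql
-- The "task": catches its own error and does not re-raise it.
BEGIN TRY
    SELECT 1 / 0 AS [WillFail];   -- raises a divide-by-zero error
END TRY
BEGIN CATCH
    PRINT 'Task: error swallowed: ' + ERROR_MESSAGE();
END CATCH

-- The "caller": gets here as if nothing went wrong.
PRINT 'Caller: continuing with the next step.';
```

Both messages are printed, so the caller carries on although the task failed.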

So .. why not separate the “core” SQL code to run from the “housekeeping” (atomic unit, error handling)? No more confusion, it is clear where things are done, and code is easier to reuse and maintain.

The CONTROLLER-TASK design pattern divides core functionality and housekeeping code, thereby keeping your code cleaner and again .. more maintainable!

My solution, easy to implement

My answer to the challenges above is the CONTROLLER-TASK design pattern for stored procedures. As a picture says more than a thousand words, take a close look at the one below.

controller_task_design_pattern

Getting started

All right, if you see the value of it, how can you implement the CONTROLLER-TASK design pattern?

In its basic form, you’ll need only a few things: a naming convention and two stored procedure templates.

I will tell you how I did this, so you can use or modify it in the way you prefer.

Choosing a naming convention

As explained, you have three types of stored procedures:

  • Task stored procedures (manually made)
  • Task stored procedures (generated)
  • Controller stored procedures

I just prefix my stored procedures with the typename as follows:

  • TASK_<BaseStoredProcName> for stored procedures that are manually made
  • GTSK_<BaseStoredProcName> for stored procedures that are generated
  • CTLR_<BaseStoredProcName> for controller stored procedures

Of course you can invent your own naming convention, as long as you use it consistently and document it for new developers working on your project.
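
A nice side effect of such prefixes is that the inventory of stored procedures is easy to query. A sketch, assuming the stored procedures live in a schema named sp, as in the templates further on:

```sql
-- List all task and controller stored procedures, grouped by type.
SELECT LEFT(p.name, 4) AS [ProcType]
     , p.name AS [ProcName]
FROM sys.procedures AS p
WHERE SCHEMA_NAME(p.schema_id) = N'sp'
  AND (   p.name LIKE N'TASK[_]%'
       OR p.name LIKE N'GTSK[_]%'
       OR p.name LIKE N'CTLR[_]%')
ORDER BY [ProcType], [ProcName];
```

The underscore is put between square brackets because it is a wildcard in a LIKE pattern.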

Using two templates

Here are my templates; you can modify them for your own needs.

Prerequisites

prerequisites.sql
CREATE SCHEMA [sp] AUTHORIZATION [dbo];

TASK stored procedures

TASK stored procedures.sql
PRINT 'ScriptFile P10.sp.TASK_ReplaceWithStoredProcBaseName.sql'
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'sp.TASK_ReplaceWithStoredProcBaseName') AND type in (N'P', N'PC'))
    DROP PROCEDURE sp.TASK_ReplaceWithStoredProcBaseName
GO
/* PRESS CTRL-SHIFT-M in SSMS to fill parameters
==========================================================================================
Author :
Create date :
Parameters:
- Input : none
- Output : none
Returnvalue : 0 if executed successfully.
Description : >
==========================================================================================
*/

CREATE PROCEDURE sp.TASK_ReplaceWithStoredProcBaseName
AS
BEGIN
  ---------------------------------------------------------------------------------------------------
  -- DECLARATIONS
  ---------------------------------------------------------------------------------------------------
  --N/A
  ---------------------------------------------------------------------------------------------------
  -- INITIALIZATION
  ---------------------------------------------------------------------------------------------------
  --N/A
  ---------------------------------------------------------------------------------------------------
  -- MAIN
  ---------------------------------------------------------------------------------------------------
  SELECT 1 AS [Replace with your own code]

END
GO

 

Controller stored procedures

Controller stored procedures.sql
PRINT 'ScriptFile P20.sp.CTLR_ReplaceWithStoredProcBaseName.sql'
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'sp.CTLR_ReplaceWithStoredProcBaseName') AND type in (N'P', N'PC'))
    DROP PROCEDURE sp.CTLR_ReplaceWithStoredProcBaseName
GO
/* PRESS CTRL-SHIFT-M in SSMS to fill parameters
==========================================================================================
Author :
Create date :
Parameters:
- Input : none
- Output : none
Returnvalue : 0 if executed successfully.
Description : >
==========================================================================================
*/

CREATE PROCEDURE sp.CTLR_ReplaceWithStoredProcBaseName
AS
BEGIN
  ---------------------------------------------------------------------------------------------------
  -- TEMPLATE CODE: CONFIGURATION OF CONTROLLER, BEGIN TRANSACTION, INFO LOGGING
  ---------------------------------------------------------------------------------------------------
  BEGIN TRY

    SET XACT_ABORT ON
    BEGIN TRANSACTION

    ---------------------------------------------------------------------------------------------------
    -- (NON-TEMPLATE) DECLARATIONS
    ---------------------------------------------------------------------------------------------------
    -- N/A
    ---------------------------------------------------------------------------------------------------
    -- EXECUTE TASK STORED PROCEDURES
    ---------------------------------------------------------------------------------------------------
    -- Example call:
    EXECUTE sp.TASK_ReplaceWithStoredProcBaseName
    --

    ---------------------------------------------------------------------------------------------------
    -- TEMPLATE CODE: COMMIT TRANSACTION
    ---------------------------------------------------------------------------------------------------
    COMMIT TRANSACTION

  END TRY
  BEGIN CATCH
    ---------------------------------------------------------------------------------------------------
    -- TEMPLATE CODE: CATCH BLOCK WITH ROLLBACK AND THROW
    ---------------------------------------------------------------------------------------------------
    IF @@TRANCOUNT > 0
    BEGIN
        ROLLBACK TRANSACTION;
    END
    -- Options here:
    --\
    ---> 1) Log the error (optional)
    --/
    -- Not part of the BASIC template

    --\
    ---> 2) Rethrow the error (optional)
    --/
    -- Syntax for SQL Server 2012 and later:
    -- ; THROW

    -- Syntax for SQL Server 2008 and 2008 R2:
    DECLARE @ORIGINAL_ERROR NVARCHAR(MAX)
          , @ERROR_SEVERITY INT = ERROR_SEVERITY()
          , @ERROR_STATE INT = ERROR_STATE()

    -- ERROR_PROCEDURE() is NULL when the error did not occur inside a stored
    -- procedure or trigger; ISNULL keeps the whole message from becoming NULL.
    SELECT @ORIGINAL_ERROR = ERROR_MESSAGE()
                        + ' (ERROR_NUMBER=' + CAST(ERROR_NUMBER() AS NVARCHAR)
                        + ', ERROR_LINE=' + CAST(ERROR_LINE() AS NVARCHAR)
                        + ', ERROR_PROCEDURE=' + ISNULL(ERROR_PROCEDURE(), N'N/A') + ')'
    RAISERROR(@ORIGINAL_ERROR, @ERROR_SEVERITY, @ERROR_STATE)

  END CATCH

  ---------------------------------------------------------------------------------------------------
  -- RETURN THE RESULT WHEN EXECUTED SUCCESSFULLY
  ---------------------------------------------------------------------------------------------------
  RETURN 0;
    
END
GO
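
To show what a filled-in controller might look like, here is a sketch for SQL Server 2012 or later that uses THROW instead of RAISERROR. The controller name and the two task names are hypothetical; the structure follows the template above.

```sql
CREATE PROCEDURE sp.CTLR_LoadCustomer
AS
BEGIN
  BEGIN TRY

    SET XACT_ABORT ON;
    BEGIN TRANSACTION;

    -- Both tasks run inside the transaction of the controller:
    -- either all changes are committed, or none at all.
    EXECUTE sp.GTSK_Load_Customer;        -- hypothetical generated task
    EXECUTE sp.TASK_UpdateCustomerStats;  -- hypothetical manually made task

    COMMIT TRANSACTION;

  END TRY
  BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    -- Rethrow the original error, so the caller of the controller notices it.
    THROW;
  END CATCH

  RETURN 0;
END
GO
```

The task stored procedures themselves contain no transaction and no error handling, which is what keeps them reusable in other controllers.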

 

Summary

In this blogpost I have explained to you the advantages of the CONTROLLER - TASK design pattern for stored procedures, which are:

  • makes reuse of SQL code very easy.
  • makes your SQL code more maintainable.
  • makes it extremely easy to combine manually made and generated SQL code into a working solution.
  • divides core functionality and housekeeping code, thereby keeping your code cleaner and again .. more maintainable!

After that you were shown how to get started with a basic implementation of this design pattern.

Thank you for reading my blogpost and I hope you enjoyed it and found it useful.

Query hints WITH (NOLOCK) and WITH (READPAST) in action

(This blog post was first published on 18-jan-2013 on my former blog)

Download all used scripts here.

Query hints WITH (NOLOCK) and WITH (READPAST) are a great way to increase performance of (read-only) queries that run on one or more tables in which data is added, edited or deleted at the same time, because these queries no longer wait for the locks held by those changes.

However, these query hints also have some drawbacks that you should be aware of when using them.

In this article I will demonstrate how these query hints work, and when to use them.

Let’s start with some theory: what do these query hints do?

It makes no sense to copy the MSDN description here. If you need more detailed information, you can read it yourself on MSDN.

I will just summarize what these query hints can do for me.

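WITH (READPAST) – shows only committed data, and skips rows that are locked by another transaction. While a record is being updated in a transaction that is still open, it will not be selected at all: neither the previous version of the record, nor the updated but uncommitted version.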

WITH (NOLOCK) – shows both committed and uncommitted (dirty) data. When a record gets updated, the new uncommitted values of the record will be selected.
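
WITH (NOLOCK) is the same as the table hint WITH (READUNCOMMITTED), and it makes the query read that table as if the READ UNCOMMITTED isolation level were active. A small sketch, using the demo table that is created in step 1 of the demo further on:

```sql
-- These two queries read the table in the same way (dirty reads allowed).
SELECT * FROM [QueryHintsDemo].[dbo].[DemoData] WITH (NOLOCK);

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM [QueryHintsDemo].[dbo].[DemoData];

-- Switch back to the default isolation level for the rest of the session.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

The difference is the scope: the hint only applies to the table it is written after, while the isolation level applies to every table that is read in the session until it is changed again.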

When to use

In general, do not use these query hints when you retrieve data to edit, e.g. in disconnected scenarios like editing in a web application. Working with uncommitted data will in this case seldom be what you want.

When it comes to read-only access, it depends. You have to decide what is more important: always showing the right (committed) data, or showing data fast. If the risk that uncommitted data is shown is acceptable, WITH (NOLOCK) is the query hint for you; if the risk that data being changed is not shown at all is acceptable, WITH (READPAST) is the query hint for you.

Also for semi-static data (dimension or system tables) that is seldom changed, the risk of the drawbacks of both query hints is usually acceptable.

Executing the demo

Step 1

First of all we have to do some setup work. I have made a script for that. You can run it in SQL Server Management Studio (see link to download all scripts at the top of this article).

Note that for simplicity of the demo I have created a CHAR column to be primary key. I know it is not a best practice, but for the demo it is sufficient. Normally you would always strive for numeric primary keys.

step1_setup.sql

-- Step 1: Create a QueryHintsDemo database with one table, DemoData
USE [master]
GO
IF NOT EXISTS (SELECT name FROM sys.databases WHERE name = N'QueryHintsDemo')
  CREATE DATABASE [QueryHintsDemo]
GO
USE [QueryHintsDemo]
GO
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[DemoData]') AND type in (N'U'))
  DROP TABLE [dbo].[DemoData]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[DemoData](
 [DemoDataCode] CHAR(6) NOT NULL,
 [DemoDataDescription] VARCHAR(100) NOT NULL,
 CONSTRAINT [PK_dbo_DemoData] PRIMARY KEY CLUSTERED
(
 [DemoDataCode] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO


Step 2

What is next?

Now you have to make sure you have two query windows open in Management Studio, so you can execute one script, and before that is finished, execute another script as well.

The first script will insert some initial data in the DemoData table, and then update it in a transaction. To be able to do this demonstration properly, the WAITFOR DELAY feature is used to wait 20 seconds before the transaction is rolled back.

In the first query window, use this script:

step2a_keep_transaction_open.sql

TRUNCATE TABLE [QueryHintsDemo].[dbo].[DemoData]
GO

INSERT INTO [QueryHintsDemo].[dbo].[DemoData]
           ([DemoDataCode]
           ,[DemoDataDescription])
     VALUES
           ('DEMO01'
           ,'This is the original text.')
GO

BEGIN TRANSACTION

  UPDATE [QueryHintsDemo].[dbo].[DemoData]
     SET [DemoDataDescription] = 'This is the updated text that gets rolled back later.'
   WHERE [DemoDataCode] = 'DEMO01'

  WAITFOR DELAY '00:00:20'

ROLLBACK TRANSACTION
GO

BEGIN TRANSACTION

  UPDATE [QueryHintsDemo].[dbo].[DemoData]
     SET [DemoDataDescription] = 'This is the updated text that gets committed.'
   WHERE [DemoDataCode] = 'DEMO01'
  
COMMIT TRANSACTION
GO

 

This is the script for the second querywindow:

step2b_select.sql

-- WITH READPAST will not show the updated, uncommitted record,
-- but also not the 'old' version of the record, before it got updated.
-- So you potentially miss out on some data when using this query hint.
SELECT 'WITH (READPAST)' AS [READPAST], *
FROM [QueryHintsDemo].[dbo].[DemoData] WITH (READPAST)
GO

-- WITH NOLOCK will show the updated, but uncommitted record. This is also called a "dirty read".
-- The risk here is that you use/display data that in the end will
-- not be committed and therefore never became 'official'.
SELECT 'WITH (NOLOCK)' AS [NOLOCK], *
FROM [QueryHintsDemo].[dbo].[DemoData] WITH (NOLOCK)
GO

-- Without using any query hint, the select statement will wait until the
-- first transaction is rolled back, and then show the original value.
-- This is the most reliable, but also the slowest.
-- You will notice that the first two SELECT statements return
-- a result immediately, but that the query below only shows a result after almost
-- 20 seconds, caused by the 20 seconds delay in the previous script.
SELECT 'Without queryhint (1)' AS [No hint 1], *
FROM [QueryHintsDemo].[dbo].[DemoData]

-- Wait a few seconds, so that the update that is committed in the other script has completed.
WAITFOR DELAY '00:00:02'
-- Without using any query hint, the select statement will wait until the
-- update is finished, and then show the final value.
SELECT 'Without queryhint (2)' AS [No hint 2], *
FROM [QueryHintsDemo].[dbo].[DemoData]

 

As you might already have understood, you should execute both scripts in SQL Server Management Studio: execute the first script, and while it is still running, start the second script in a separate query window as soon as possible (in any case within 20 seconds after starting the first script).
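
If you are curious what happens behind the scenes, you could open a third query window during those 20 seconds and look at the locks with the dynamic management view sys.dm_tran_locks. This is an optional extra and not part of the downloadable scripts; it requires the VIEW SERVER STATE permission.

```sql
-- Show the locks that are currently held or requested in the demo database.
SELECT l.request_session_id
     , l.resource_type
     , l.request_mode
     , l.request_status
FROM sys.dm_tran_locks AS l
WHERE l.resource_database_id = DB_ID(N'QueryHintsDemo')
ORDER BY l.request_session_id, l.resource_type;
```

While the first transaction is open you would typically see an exclusive (X) lock on the updated row, and a request with status WAIT for the SELECT statement without a query hint.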

Analyzing the query result

20130118_analyze_result2

 
Clean up

Use this script to clean up.

cleanup.sql

USE [master]
GO
EXEC msdb.dbo.sp_delete_database_backuphistory @database_name = N'QueryHintsDemo'
GO
ALTER DATABASE [QueryHintsDemo] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
GO
DROP DATABASE [QueryHintsDemo]
GO

 

Conclusion

The query hints WITH (NOLOCK) and WITH (READPAST) are a great way to improve performance.

However, they both have their own drawbacks that you should be aware of when using them:

  • When using WITH (READPAST), you potentially miss out on data that is being updated while your select query runs.
  • When using WITH (NOLOCK), the risk is that you use/display data that in the end will not be committed and therefore never became ‘official’.

It depends on the situation whether it is acceptable to use these query hints.