Tag Archives: SQLServer

Unzip specific files from lots of zipped archives (SSIS Series)

Introduction

This week I want to discuss the trouble you may have when searching hundreds of zip files for specific files inside them.
Recognizable?
At least I faced this problem some time ago, and solved it using SSIS and the commandline version of 7-zip, named 7za.exe.
The choice for SSIS was purely based on convenience, as you could tackle this problem also using Powershell, C# or almost any other scripting language.
But I thought SSIS is just a neat and quick way to do this.

Problem

Finding specific files in huge amounts of zip files.

Solution

To tackle this problem I have made a fully configurable SSIS package, that can be executed with different parameters.
The main task of the package is to extract specific files (based on a file name pattern) from zipfiles in a folder and – optionally – subfolders.
The unzipping itself is done with 7za.exe, which is the commandline version of 7z, the great open source zip program, that exists already for several years and is used by millions of users.

By the way, the package is made using SSIS 2014.

Preparations: download and install 7za

It is important to download this package from the official 7z web site, just to avoid that you would get a copy from some deceitful web site infected with malware.
To go to the official 7-zip download page click on this link:
7-zip download page
Most important is the 7-Zip Extra: standalone console version.
For the example I have used version 16.02 (2016-05-21) for Windows, but it should also work with version 9.20 (2010-11-18), then called 7-Zip Command Line Version.

If you download the newest version, you also need software to unzip the 7z160-extra.7z file, for instance the regular 32-bit or 64-bit version of 7-zip, which can also be downloaded on the same page.

SSIS-S01E05-100
Download page of 7-zip

I assume you will use the newest version 7z160-extra.7z
Once unzipped, just take the time to read License.txt, readme.txt and history.txt.
Then copy the three files 7zxa.dll, 7za.dll and 7za.exe to a folder on the Windows computer where SSIS runs, usually a Windows Server.
If you run a x64 version of Windows, you can take the x64 version of the 7za files, which are in a subfolder “x64” of the unzipped files.

SSIS-S01E05-101
Unzipped 7za files ..

I have created a subfolder “7za” in the “C:\Program files” folder of my virtual Windows Server, and copied the x64-version of the three mentioned files to it, but you can put them in any folder you like.
When you have done this, the real fun can start ..

Overview

Here is a picture of the end result:

SSIS-S01E05-102
The Control Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.

Parameters:

Name Data type Required Description
_7zaPathAndFilename String False The full path and filename of 7za.exe.
FilenamePattern String True The filename pattern of files you want to unzip.
SourceRootFolder String True The main folder where you want to search for zip files.
TraverseSubfolders Boolean False Indication if you want to look only in SourceRootFolder, or also in its subfolders.
UnzipRootFolder String True The folder to which files are unzipped. Important: this may NOT be a subfolder of SourceRootFolder.
ZipFilenamePattern String True The filename pattern of the zipfiles from which you want to unzip files.

SSIS-S01E05-103
Package parameters

Variables:

Name Data type Description
FLC_UnzipFolder String The subfolder where the files will be unzipped. This is a subfolder of the UnzipRootFolder, with the same relative subpath as the zipfile has in SourceRootFolder. Additionally, the zipfilename has been replaced with a subfolder that is the same as the zipfilename without the file extension.
FLC_ZipPathAndFileName String The zipfile full path, that is target of investigation if any files with FilenamePattern are zipped inside.

SSIS-S01E05-104
SSIS variables needed to do the job

Developing the package

FELC Unzip files

This foreach loop container finds all zip files.
In the Foreach Loop Editor, select Collection in the left pane.
As we want to make the solution generic by using all those parameters mentioned above, we have to add a few expressions.

SSIS-S01E05-105
Foreach Loop container Collection configuration

Add the following three expressions (in my opinion the editor is not very userfriendly, the window is too small, but it can be done):

Property Expression
Directory @[$Package::SourceRootFolder]
FileSpec @[$Package::ZipFilenamePattern]
Recurse @[$Package::TraverseSubfolders]

Please note that these expressions will set the values for Folder, Files and the Traverse subfolders checkbox in the Enumerator configuration part of the screen (inside the purple square in the picture above).
If you close and reopen the Foreach Loop Editor, you will notice that the configuration fields now use the values from the corresponding parameters, except the value Fully qualified for Retrieve file name, which is hardcoded.

SSIS-S01E05-106
Example of how to set an expression..

Then, still in the Foreach Loop Editor, select Variable Mappings in the left pane and map variable User::FLC_ZipPathAndFileName to Index 0, as shown in the picture below.

SSIS-S01E05-107
Variable Mappings

EXPR Set Unzip folder

The next nifty thing that needs to be done, is creating a folder to store the unzipped files. As we do not know if zipped files with the same name exist in multiple zip files, the safest thing to do is to create a directory structure identical to the directory structure of the zip files.
Additionally, the basename of the zip file (so without file extension) is used as an extra subfolder. In this way we are sure that unzipped files with the same name can be stored without overwriting other ones.

To do this, from the SSIS Toolbox drag an Expression Task inside the Foreach Loop container.
Open the Expression Task and paste this expression into the Expression textbox.

@[User::FLC_UnzipFolder] =
REPLACE(LEFT(@[User::FLC_ZipPathAndFileName],
FINDSTRING(@[User::FLC_ZipPathAndFileName], "."
+ TOKEN(@[User::FLC_ZipPathAndFileName], ".",
TOKENCOUNT(@[User::FLC_ZipPathAndFileName], ".")),1) -1),
@[$Package::SourceRootFolder], @[$Package::UnzipRootFolder])

Then press the Evaluate Expression button to check that the expression is valid.

SSIS-S01E05-108
Configuring the Expression Task

FSYS Create Unzip folder

In the previous step we have composed the full path to unzip files into, but this folder structure might not exist yet.
Therefore we use A File System Task named FSYS Create Unzip folder to create the folder structure.
The good news is that the File System Task can create multiple (sub)folders at once.
Configure it as shown in the picture.
SSIS-S01E05-109
Configuration of FSYS Create Unzip folder

EPR Unzip Selected Files

Below you see a few screenshots for the configuration of the Execute Process Task named EPR Unzip Selected Files.
The funny thing is, that in the Process pane, you do not have to change anything. Executable and Arguments are set using expressions (coming next) and all the other properties can keep their default values.

SSIS-S01E05-110
Process configuration.

SSIS-S01E05-111
Setting Executable and Arguments using expressions
Arguments (copy-friendly):

"-y x \"" + @[User::FLC_ZipPathAndFileName] + "\" -o" + @[User::FLC_UnzipFolder] + " " + @[$Package::FilenamePattern] + " -r"

SCR Remove Empty folders

There is one problem with the steps done so far. If a zipfile does not contain a file that matches the FilenamePattern, then the folderstructure is made for nothing.
However you could just leave these empty folders, I think the solution is a bit better if these empty folders are deleted.
To do this, from the SSIS Toolbox drag an Script Task inside the Foreach Loop container.

SSIS-S01E05-112
In the Script Task Editor, add both variables User::FLC_UnzipFolder and $Package::UnzipRootFolder to ReadOnlyVariables

In the Script Task Editor window, press Edit Script ….

Please note that you can download the C# script here..

Then in the namespaces region add two namespaces, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
// Added namespaces:
using System.IO;
using System.Linq;
#endregion

Adjust public void main() as follows:

public void Main

public void Main()
{
  // Copy SSIS variable value of FLC_UnzipFolder to C# string variable.
  string unzipFolder = Dts.Variables["User::FLC_UnzipFolder"].Value.ToString();

  // Source: http://stackoverflow.com/questions/22520761/delete-all-empty-directories-in-specified-path
  // My credits go to author "Jon" from Berlin.
  // Because not all zipfiles might contain files with the FilenamePattern we are looking for, an empty path might have been created.
  // This script deletes these empty folders, so that in the end only folders will exist with unzipped files or subfolders in it.
  var di = new DirectoryInfo(unzipFolder);
  var root = Dts.Variables["$Package::UnzipRootFolder"].Value.ToString(); // no trailing slash!
  while (di.FullName != root
          && !di.EnumerateFiles().Any()
          && !di.EnumerateDirectories().Any())
  {
    di.Delete();
    di = di.Parent;
  }

 Dts.TaskResult = (int)ScriptResults.Success;
}

Download the script here.

The proof of the pudding is in the eating

Time to do some testing.
I have created a few folders with subfolders and subsubfolders. You can see the folder structure in the last picture below.
As you might have noticed the default values for my parameters already prepared for the testcase: I want to unzip only customer files from zip files with 201605 in the name.

SSIS-S01E05-113
Executing the package in Debug mode..

Below you can see the result. Only customer files are unzipped. The subfolders under the Archive folder that contain no customer zip files (but only currency files), do not exist under the Unzipped folder.

SSIS-S01E05-114
Folder structure of Archive and Unzipped rootfolders..

SSIS-S01E05-115
Unzipped “Customer” files ..

Important notes for using with huge amounts of zip files

From my experience I must also warn you. If you run this package on a folder containing gigabytes of zipfiles, there are two things to be aware of:

  1. The execution of the package can potentially take a long time, several hours is not unusual.
  2. Make sure you have enough disk space. Suppose you have the following hypothetical scenario: zip files: 100GB, average zip compression rate: 20%, percentage of files to be unzipped: 30%. In this case you are going to unzip files that have a zipped size of 30% of 100GB, is 30GB. Once unzipped, their size will be 30GB * (1/0,20) = 150GB. You need to have that disk space available on the disk where the Unzipped folder resides.

Conclusion / Wrap up

With this SSIS package you can selectively unzip files from archived zip files.
When using it, make sure you have enough disk space, and be aware that the package can run for several hours, depending on the number and (average) size of the zip files to process.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

11 Things you should know about Biml – an overview (Datawarehouse Series)

Intended audience and purpose of this post

This post gives an overview of a number products and features related to Biml on a high level.
Therewith it is a starting point for further learning and investigating.

Introduction


Biml means Business Intelligence Markup Language.
What else do you need to know about it, if you are a professional worker in the business intelligence and data warehouse domain?

SSIS-S01E05-biml-logo

11 Things you should know about Biml – an overview

  1. Who invented Biml?
  2. What is Biml?
  3. What are BimlExpress and BIDS Helper?
  4. What is BimlScript?
  5. What is BimlFlex?
  6. What are Mist and BimlStudio?
  7. What are BimlBundles?
  8. What are BimlTransformers?
  9. Is the free part of Biml a way to get me addicted to it?
  10. How do I know that Biml is a proven technology that is going to stay?
  11. Where can I find more information on the internet?

1. Who invented Biml?

Biml and all tools around it is created by Varigence, a company with CEO Scott Currie, a former Microsoft Employee.

Back to top

2. What is Biml?

Biml – the markup language – is an XML dialect to describe Microsoft SSIS Packages and SSAS Cubes.
Everything you can do with SSIS, you can do with Biml (quote of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016).
The goal of Biml is to improve productivity of Data warehouse development.
But the biml ‘ecosystem’ consists of more than only the Markup Language. Other components are:

  • Biml Script
  • BimlExpress and BIDS Helper
  • BimlFlex
  • BimlStudio (also/formerly known as Mist)
  • BimlBundles
  • BimlTransformers

Back to top

3. What are BimlExpress and BIDS Helper?

BimlExpress and BIDS Helper both are Add-ins for Business Intelligence Development Studio (actually the Visual Studio version with the project types needed for Business Intelligence).
BimlExpress is the successor of BIDS Helper and one of it biggest advantages is syntax highlighting and intellisense.
It is really worth checking them out, because they can improve your productivity, especially when you use metadata and design patterns for SSIS packages.
You can download and install BimlExpress. You have to register but can use the product at no cost.

Back to top

4. What is BimlScript?

BimlScript is what makes Biml (the markup language) even more interesting. With BimlScript you can write .NET Code (C#) and do all kind of interesting stuff, like accessing your metadata, using for each loops, and generate SSIS packages using a design pattern.
I have generated packages to load dimensions of a star schema, and I intend to blog about it later this year in more detail.

Back to top

5. What is BimlFlex?

BimlFlex is a datawarehouse automation framework of Varigence. It can greatly improve productivity when building data warehouses.
It uses Predefined patterns for data warehouse automation and Extension Points that you can use to change the standard automation patterns on certain ‘change points’, or can build exceptions for the pattern of this one very special table.
BimlFlex is part of BimlStudio and therefore a commercial (paid) product.

Back to top

6. What are Mist and BimlStudio?

Mist and BimlStudio are the same tool. BimlStudio is just the new name, due to strange associations with the “Mist” name in different languages, for example in German (What does the German word “Mist” mean?). This is why they changed the name.
Also the names for all products and techniques associated with Biml now start with Biml, which makes them more recognizable.
BimlStudio has a number of extra features when compared with BimlExpress.
BimlStudio is a paid product, you can buy a perpetual license or a monthly subscription.

SSIS-S01E05-bimlstudio-logo

Back to top

7. What are BimlBundles?

BimlBundles are part of BimlStudio and protect your source code while your customer still can make changes at framework extension points.
Koos van Strien did a very good job at explaining BimlBundles in more detail here.

Back to top

8. What are BimlTransformers?

With BimlTransformers you can define and store changes on Biml code, instead of changing the Biml code itself. If you have worked with Visual Studio/.NET before, you can compare it with changes you make for your config file for different environments.
Koos van Strien did a very good job at explaining BimlTransformers in more detail here.

Back to top

9. Is the free part of Biml a way to get me addicted to it?

Nope. Although Varigence works sales model is a Fremium model (a free product that can be extended/enhanced with paid products), the promise of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016, is that all features that are released for free will stay for free. You can count on that.

Back to top

10. How do I know that Biml is a proven technology that is going to stay?

Biml already exists for about 8 years and has proven itself already. Varigence is a global player and steady company that does not suddenly disappear. I am not concerned if Biml will exist 10 years from now, and you should not have to be either. Even in the utterworst case, you would still have the SSIS packages that Biml has generated for you, and you can edit/maintain them without Biml.

Back to top

11. Where can I find more information on the internet?

To read more about Biml, here are a few interesting web sites, blog posts and Linked In Usergroups:

Back to top

Conclusion / Wrap up

Biml is here to stay and adds a lot of value, even if you use only the free parts of it. Varigence can only be praised for that.
In this blog post I shortly described a number of products and techniques that are part of the Biml ‘ecosystem’.
I do not have a commercial interest in promoting Biml, I am just a Business Intelligence Consultant who likes Biml a lot.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Adding custom header and footer to csv files (SSIS Series)

Introduction

This article wasn’t planned. SSIS again? No I wanted to write a post about something else.
Nevertheless I got this question on one of my previous posts and thought, I might as well create a blog post about it.

Problem

“Hi Hans,
I have created my ssis package for flat file destination. My problem is that I want to add header and footer in my flat file destination to send it to FTP location format will be

0|MMDDYYYY|HHMMSS
1|Row1
1|Row2
1|Row3
1|ROW4
9|6

The last line contains no of row count including header and trailer
Please let me know how to do that.. I am completely stuck with that and no good luck.. “

So to summarize, both the header and the footer line of the csv file need to contain dynamic content.
Also each row starts with a ‘row type’: 0 for the header, 1 for every detail row, and 9 for the footer.

Solution

As an example I just use a simple Data Flow Task with an OLEDB Source connected to table DimCurrency from the AdventureWorksDW database.
The data then flows to a Script Component used as a Destination.
Here are a few printscreens to demonstrate what I mean. The Script Component will create the file.
The package is made using SSIS 2014.

Developing the package

SSIS-S01E04-099
The demo package just contains one Data Flow Task

SSIS-S01E04-205
Design of the Data Flow Task

SSIS-S01E04-101
OLEDB Source: DimCurrency from the AdventureWorksDW database

SSIS-S01E04-102
OLEDB Source: select all three columns

SSIS-S01E04-208
When you drag the Script Component to the Data Flow, you will get this dialog. Use the Script Component as a Destination

SSIS-S01E04-207
Variables that the package contains.

SSIS-S01E04-209
In the Script Transformation Editor, add both variables User::HeaderTimestamp and User::TargetCsvPathAndFileName to ReadOnlyVariables

Now comes the most interesting part! We use the endless possibilities of C# to fix our problem.
In the Script Transformation Editor window, press Edit Script ….

Please note that you can download the entire script here, so you do not have to copy the (sometimes malformatted) code from this webpage..

Then in the namespaces region add two namespaces, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
// Added namespaces:
using System.IO;
using System.Text;
#endregion

Above public override void PreExecute() add the copyright notice and the following two regions: Configuration and Private variables.
Please note that these regions do not exist yet, so you have to copy the code including the #region and #endregion lines.

New regions
  /*
  (c) Copyright 2016 - hansmichiels.com
 
  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.
 
  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.
 
  You should have received a copy of the GNU General Public License
  along with this program. If not, see http://www.gnu.org/licenses/.
  */

 
  #region ***** Configuration *****

  // Here you can change the column delimiter and the text qualifier easily.
  private const string _delimiter = “|”;
  private const string _textQualifier = “”;
  // To set the text qualifier to ” use:
  // private const string _textQualifier = “\””; // I recommend to use a text qualifier.
  
  // Every x rows, write the buffer to the file, and clear the buffer.
  // This is faster than writing every individual line to the file.
  // The higher the value of _bufferRows is, the more memory will be used and the less write actions are done to the csv file.
  // A higher value is a good idea if you have enough memory in your server.
  private const int _bufferRows = 100000;

  #endregion ***** Configuration *****

  #region Private variables

  // Variable to hold the path and filename
  private string _fileName;
  // Variable to count the rows
  private int _rowCount;
  // StringBuilder to hold a buffer of text that still needs to be written to the csv file.
  private StringBuilder _fileContentsBuffer;

  #endregion Private variables

PreExecute() is called before the data starts to flow through. Here we do some preparation and compose the headerline of the file.

PreExecute()

 &nbs p;/// <summary>
  /// This method is called once, before rows begin to be processed in the data flow.
  /// </summary>
  public override void PreExecute()
  {
    base.PreExecute();

    // Initialize variables
    _rowCount = 0;
    _fileContentsBuffer = new StringBuilder();
    _fileName = Variables.TargetCsvPathAndFileName;
    DateTime headerTimestamp = Variables.HeaderTimestamp;

    // Append the new line to the buffer.
    // This will be written to file with the first detail rows.
    // If there are no detail rows, no file will be created.
    _fileContentsBuffer.AppendFormat(“{1}0{1}{0}{1}{2:ddMMyyyy}{1}{0}{1}{2:HHmmss}{1}\r\n”, _delimiter, _textQualifier, headerTimestamp);
    }

PostExecute() is called after the data has flowed through. Here we append the last lines in the _fileContentsBuffer variable and the footerline to the file.

PostExecute()
  /// <summary>
  /// This method is called after all the rows have passed through this component.
  ///
  /// You can delete this method if you don't need to do anything here.
  /// </summary>
  public override void PostExecute()
  {
    base.PostExecute();
    // If no detail rows, bail out. No file will be created when there are no detail rows.
    if (_rowCount == 0) return;

    using (StreamWriter sw = File.AppendText(_fileName))
    {
      // Append a footer line with the rowcount, example value: 9|105
      // The rowcount is a count of the detail rows, but the footer count
      // needs to include header and footer, so we add 2 to the value of _rowCount.
      _fileContentsBuffer.AppendFormat(“{1}9{1}{0}{1}{2}{1}”, _delimiter, _textQualifier, _rowCount + 2);
      sw.Write(_fileContentsBuffer.ToString());
      _fileContentsBuffer.Clear();
    }
  }

Input0_ProcessInputRow(Input0Buffer Row) is the place where all the data processing takes place. As this script is the Destination for the data, we need to do something with those data. A Flat File Destination is not there and will not do this for us.
So here we append every detail row to the _fileContentsBuffer variable. Once in a while, we write the buffer to the file, and clear the buffer again.

Input0_ProcessInputRow

  /// <summary>
  /// This method is called once for every row that passes through the component from Input0.
  /// </summary>
  /// <param name=”Row”>The row that is currently passing through the component</param>
  public override void Input0_ProcessInputRow(Input0Buffer Row)
  {
    // Increase the rowcounter
    _rowCount++;

    // Append the new line to the buffer.
    // The format needs to be like “{1}[middle part]{1}\r\n”, so that the row starts and ends with the text qualifier.
    // Replace [middle part] with {1}{0}{1}{x} for every column that you want to append to the file, where x is 2 for column 1, 3 for column 2 and so on.
    _fileContentsBuffer.AppendFormat(“{1}1{1}{0}{1}{2}{1}{0}{1}{3}{1}{0}{1}{4}{1}\r\n”, _delimiter, _textQualifier,
      SafeValue(Row.CurrencyKey), // First column, {2} in the format
      SafeValue(Row.CurrencyAlternateKey), // Second column, {3} in the format
      SafeValue(Row.CurrencyName) // Third column, {3} in the format
             // etc.
      );
    
    // Every x rows, write the buffer to the file, and clear the buffer.
    if (_rowCount % _bufferRows == 0)
    {
      using (StreamWriter sw = File.AppendText(_fileName))
      {
      sw.Write(_fileContentsBuffer.ToString());
      _fileContentsBuffer.Clear();
      }
    }
  }

Multiple overloaded versions of the SafeValue method (for different datatypes like string, int, DateTime, you can add more when needed) add escape characters to the column values when needed.

SafeValue method and overloads
  /// <summary>
  /// Escapes a double quote delimiter in a column value by doubling it.
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(string value)
  {
    // If _textQualifier is null or empty, just return the value.
    if (string.IsNullOrEmpty(_textQualifier)) return value;

    if (value.Contains(_textQualifier))
    {
    // If the value contains one or more text qualifiers,
    // escape them by replacing them with two text qualifiers.
    return value.Replace(_textQualifier, _textQualifier + _textQualifier);
    }
    return value;
  }

  /// <summary>
  /// Overloaded method for DateTime value (meant as an example)
  /// For other data types, create new overloaded methods,
  /// and in that overloaded method convert the value to string,
  /// before calling the 'SafeValue' method with a string parameter.
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(DateTime value)
  {
    return SafeValue(value.ToString(“yyyy-MM-dd HH:mm:ss.fff”));
  }

  /// <summary>
  /// Overloaded method for int value
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(int value)
  {
    return SafeValue(value.ToString());
  }
}

When the script is ready, Choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Transformation Editor, close it by pressing OK.

Testing the package

I simply run the package in debug mode. 105 currency rows are added to the csv file.
Including the header and footer, the csv file contains 107 rows, and this is the number that is written to the footer line.

SSIS-S01E04-110
Running the package in debug mode

SSIS-S01E04-111
First and last part of the created csv file.

Please note that to change the delimiter and text qualifier you only have to make two changes in the ***** Configuration ***** region.

Configuration
  #region ***** Configuration *****

  // Here you can change the column delimiter and the text qualifier easily.
  private const string _delimiter = “;”;
  private const string _textQualifier = “\””;
. . .

  #endregion ***** Configuration *****

By this change the file will look as follows:
SSIS-S01E04-112
First and last part of the created csv file with a different delimiter and text qualifier.

I also did a test with a source query that returned over 1.1 million rows, and the csv file of 25 MB was created in just 3 seconds (on a core i7 laptop with SSD).
So my conclusion is that the performance of this solution is very satisfying.

Download the C# script for the script component here.

Conclusion / Wrap up

In this post I have demonstrated how to use SSIS 2014 to create a csv file with a dynamic header (with a timestamp) and footer line (including a rowcount) using a Script Component as a Destination in a Data Flow Task.

(Promotional)

Big Data & Analytics insight 2016: save the date for an insightful conference
(as a speaker I was asked to do a bit of promotion ..)

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Fixing corrupt csv files in a SSIS Package (SSIS Series)

(Promotional)

Big Data & Analytics insight 2016: save the date for an insightful conference
(as a speaker I was asked to do a bit of promotion ..)

Introduction

A few weeks ago my blog post was about using a C# Script Component to detect duplicates in a Data Flow Task.

This week I want to discuss it’s brother, the C# Script Task, which can be used in the Control Flow.

Again, extremely flexible, basicly you embed a small C# program inside your package, and you can use the entire .NET framework for all kind of things. I used it to fix malformed csv source files, before processing them in a Data Flow Task.

Pros and Problems

Csv (comma seperated files) have their pros and problems, as compared with other formats, like XML.

A few pros are that they are usually smaller in size than XML files, and easier and faster to load. These are two reasons why csv files still are used a lot these days, despite of the fact that XML files are in some cases more reliable, and more flexible as far as the file contents is concerned.

Unfortunately csv files are not so good in holding memo/comment fields that might contain carriage return/line feeds in it. Memo fields with those characters do not stay on one line in the file, therefore making the csv file more difficult to process (at least the standard flat file source can not handle it in all cases).

Also the structure is very strict. If a column is added to a csv source file, your Data Flow Task will break.

Solution

In this post I will discuss a possible solution for both problems:

  1. carriage return/line feeds in contents of csv files
  2. changing file formats over time

This post assumes that you have worked with SSIS before, know what For Each Loop Containers and Data Flow Tasks do, and recognize the csv problems above. It is a big advantage if you can read and understand C# code.

Okay, let’s start!

Scenario

My demo is based on the following hypothetical scenario:

  1. You need to process multiple CustomerDetails.csv files.
  2. In this file a pipe (|) is used as separator between the columns.
  3. The file contains a header line: SocialSecurityNumber|FirstName|LastName|Gender|Residence|Comments
  4. The Comments column is added later, your package also needs to be able to load historic files without the Comments column.
  5. CustomerDetails.csv files come from hundreds of locations and systems (hereafter called: sources) world wide. Fixing the csv files in the export processes at all these locations is impossible. Even if some sources deliver correct csv files, your SSIS package cannot rely on the fact that all sources will do so.

Overview

Here are a few pictures of the end result:
SSIS-S01E03-109
The connection managers that are needed

SSIS-S01E03-110
The Control Flow
SSIS-S01E03-111
The Data Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.

Parameters:

  • SourceFolder: The folder that that For Each Loop Container reads for csv files.
  • ArchiveFolder: The folder to which files are moved after being processed, if no error occurred.
  • ErrorFolder: The folder to which files are moved that caused an error during processing.

All parameters are Required and their Data type is String.

SSIS-S01E03-113
The parameters of the package

Variables:

  • CRLF: This variable will be filled with a Carriage Return/Line Feed in the Script Task.
  • CRLF_Replacer: This is the character string that is used to replace CarriageReturn/LineFeeds in the Comment column of the file. It must be a character string that does not occur in the Comment field. Therefore ~CRLF~ is used. If you want a different character string, you can change it to something else.
  • SourcePathAndFilename: The full path and filename of the original filename to process. It is filled by the For Each Loop Container. The value set at design time is unimportant.
  • PathAndFilenameToProcess: The full path and filename of the file to process. This can be either the original file, or the modified temporary file. The Flat File Connection Manager uses this path and filename. The value set at design time is unimportant.

The Data type of all variables is String.

SSIS-S01E03-114
SSIS variables needed to do the job

The Control Flow – For Each Loop Container

Add a For Each Loop Container FELC – Load sourcefiles and configure it as a Foreach File Enumerator. Retrieve file name Fully qualified.

SSIS-S01E03-115

On the Collection pane, just below the Enumerator type, press the ellipsis next to Expressions. In the Property Expressions Editor configure the Property Directory to be set by
Expression @[$Package::SourceFolder].
SSIS-S01E03-116

On the Variable Mappings pane, map variable User::SourcePathAndFilename to Index 0
SSIS-S01E03-102
For Each Loop Container configuration

The Control Flow – Script Task

Next, add a Script Task SCR – Fix csv file if needed.
In the Script Task Editor, configure the following:

  • ReadOnlyVariables: User::CRLF_Replacer,User::SourcePathAndFilename
  • ReadWriteVariables: User::CRLF,User::PathAndFilenameToProcess

SSIS-S01E03-117
Variable configuration in the Script Task Editor

Then press the Edit Script… button.
Don’t get scared by the amount of C# code, if you keep your head cool, it will be fixed in 10 minutes!

Adjust the script as follows:

  • In the Namespaces region, if not already there, add:
    using System.IO;
    using System.Text;


  • SSIS-S01E03-108
    Replace the highlighted part above with the code below

    Now remove public void Main() completely and replace it with the code underneath (to avoid malformatting due to html rendering, please download the original csharp script here).
    Then check the code for !!! ATTENTION !!! comment. These are points in the script that you might need to change.

    ScriptTask.cs

      /*
      (c) Copyright 2016 - hansmichiels.com
     
      This program is free software: you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation, either version 3 of the License, or
      (at your option) any later version.
     
      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.
     
      You should have received a copy of the GNU General Public License
      along with this program. If not, see http://www.gnu.org/licenses/.
      */

     
      /// This method is called when this script task executes in the control flow.
      /// Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
      /// To open Help, press F1.
      /// </summary>
     public void Main()
     {
      // Copy a few SSIS variables to C# string variables.
       string sourcePathAndFilename = Dts.Variables[“User::SourcePathAndFilename”].Value.ToString();
       string crlfReplacer = Dts.Variables[“User::CRLF_Replacer”].Value.ToString();
     
       // Set the variable that needs to hold a CarriageReturn/LineFeed.
       Dts.Variables[“User::CRLF”].Value = string.Format(“\r\n”);
     
       // Call the method that checks if the sourcefile is corrupt and if so, fixes it.
       FixFileIfCorrupt(sourcePathAndFilename, crlfReplacer);
     
       // Set returnvalue for script task.
      Dts.TaskResult = (int)ScriptResults.Success;
     }
     
      /// Csv-file can have lines with carriage return-linefeeds, causing errors during the load.
      /// By preinvestigating the file we can correct the file and prevent these errors from ocurring.
      /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
      /// Author: Hans Michiels
      /// License: GNU General Public License, see http://www.gnu.org/licenses/
      /// </summary>
      private void FixFileIfCorrupt(string sourcePathAndFilename, string crlfReplacer)
      {
       string header = string.Empty;
       string headerAppend = string.Empty;
       string detailAppend = string.Empty;
     
       // Initialize pathAndFilenameToProcess with sourcePathAndFilename. Overwrite later if needed.
       string pathAndFilenameToProcess = sourcePathAndFilename;
     
       // Check if the header of the file is as expected, and return the header row through a ref parameter.
       // If the file header is not as expected, do not touch the file and exit this method.
       if (!CheckAndReturnHeader(sourcePathAndFilename, ref header, ref headerAppend, ref detailAppend)) return;
       
       // Get the encoding of the current file. We need this if we have to create a corrected file.
       var encoding = GetEncoding(sourcePathAndFilename);
     
       // Read all lines of the file into a string array.
       // !!! ATTENTION !!! WARNING: only do this when the files to process are relatively small.
       // !!! ATTENTION !!! When in doubt, test it and monitor memory usage.
       // (I suppose files up to 50~100 MB might not be a big problem but larger might, but testing is the way to find out)
       string[] lines = File.ReadAllLines(sourcePathAndFilename, encoding);
     
       // Use a variable to set when the file has changed due to corrections this method makes.
       bool fileHasChanged = false;
       bool lineIsOkay = true;
     
       // appendbuffer is used to keep parts of 'broken' lines.
       string appendbuffer = string.Empty;
     
       // Get lower- and upperbound of the string array into variables.
       int lower = lines.GetLowerBound(0);
       int upper = lines.GetUpperBound(0);
     
       // Loop through all lines in the file from back to front!
       for (int i = upper; i >= lower; i--)
       {
        // If there is still some text in the buffer, append it to the line and make the buffer empty again.
        if (appendbuffer != string.Empty)
        {
          lines[i] = lines[i] + appendbuffer;
          appendbuffer = string.Empty;
          fileHasChanged = true;
        }
     
        // reset lineIsOkay variable, set to true underneath when the line passes the 'line ok' checks.
        lineIsOkay = false;
     
        // !!! ATTENTION !!!
        // Here you need to add some custom code.
        // Describe one or more properties of a correct line, e.g.
        // - first 8 characters are numeric;
        // - The length is at least 12;
        // - It contains at least x delimiters;
        // etc.
        // Then convert this check to C# code. Use your imagination if needed!
     
        // I check for:
        // - the length to be 12 or more characters.
        // - the first 8 characters (the social security number) to be numeric.
        // - at least 4 pipes.
        if (lines[i].Length > 12)
        {
          string checkString = lines[i].Substring(0, 8);
          int intOutput;
     
          lineIsOkay = (Int32.TryParse(checkString, out intOutput));
     
          if (lineIsOkay)
          {
           // Check that the line contains at least 4 delimiters.
           lineIsOkay = (lines[i].Split('|').Length - 1 >= 4);
          }
        }
     
        if ((!lineIsOkay) && (i > lower))
        {
          // Line is NOT OK.
          // I have to append this line to the previous one, except when it is the header.
          // To do this put this line into appendbuffer variable.
          // I use a replacement string for the removed carriage return/line feed.
          // This will be converted back to a carriage return/line feed in the Data Flow.
          appendbuffer = string.Format(“{0}{1}”, crlfReplacer, lines[i]);
          lines[i] = string.Empty; // Package can handle empty lines.
          fileHasChanged = true;
        }
     
       }
     
       // !!! ATTENTION !!! If you do not have old fileformats you can remove this block.
       // Code to handle the old file format, without a Comments column.
       // The code below reformats the file so that it become always in new fileformat, with an empty comment.
       if (headerAppend != string.Empty)
       {
        // This is an old format without the Comments column. Append Comments column header to header and a delimiter (pipe) to detail rows.
        fileHasChanged = true;
     
        for (int i = upper; i >= lower; i--)
        {
         if (lines[i].Length > 0)
         {
          if (i > lower)
          {
            lines[i] += detailAppend;
          }
          else
          {
            lines[i] += headerAppend;
          }
         }
        }
       }
     
       if (fileHasChanged)
       {
        // Here the modified/corrected file is stored to a temporary file, so a “tmp” extension is added to the modified filename.
        // For auditing purposes I keep the original file and not the corrected one,
        // because the latter can always be recreated when needed.
        // Also, when a bug would exist in the creation of the corrected file
        // (initially I had a problem with codepages, which should now be solved),
        // the original file might get lost forever.
        pathAndFilenameToProcess = String.Format(“{0}.tmp”, sourcePathAndFilename);
     
        // Now create a new corrected csv file.
        File.WriteAllLines(pathAndFilenameToProcess, lines, encoding);
       }
     
       // Also change the SSIS Package variable, because the Flat File Source uses it.
       Dts.Variables[“User::PathAndFilenameToProcess”].Value = pathAndFilenameToProcess;
     
      }
     
      /// <summary>
      /// This method checks if the header is as expected. If not, return false.
      /// Also the value of by ref parameters 'header', 'headerAppend' and 'detailAppend' is set.
      /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
      /// Author: Hans Michiels
      /// License: GNU General Public License, see http://www.gnu.org/licenses/
      /// <param name=”sourcePathAndFilename”></param>
      /// <param name=”header”></param>
      ///
      private bool CheckAndReturnHeader(string sourcePathAndFilename,
        ref string header, ref string headerAppend, ref string detailAppend)
      {
       // Initialize return values.
       header = string.Empty;
       headerAppend = string.Empty;
       detailAppend = string.Empty;
     
       // Open the file to read the header line, then close the file again.
       using (StreamReader reader = new StreamReader(sourcePathAndFilename))
       {
        header = reader.ReadLine();
        reader.Close();
       }
       // !!! ATTENTION !!! Here you could do a check on (parts) of the header,
       // so that you do not change files with an unexpected format.
       if (!header.ToLower().StartsWith(“socialsecuritynumber|firstname|lastname|gender|residence”))
       {
        return false;
       }
       // !!! ATTENTION !!! Here you could check for old file formats.
       // Old formats can only be handled when one or more new columns were added at the end of the csv lines.
       // In the example, there is a file format without the Comments column still being delivered.
       // By appending the Comments column with an empty value for those files, they can be processed
       // by the same Data Flow Task using only one Flat File Source Connection.
       if (!header.ToLower().Contains(“|comments”))
       {
        // This is an old format without the Comments column.
        // Append Comments column header to header and a delimiter (pipe) to detail rows.
        headerAppend = “|Comments”;
        detailAppend = “|”;
       }
       return true;
      }
     
      /// <summary>
      /// Determines a text file's encoding by analyzing its byte order mark (BOM).
      /// Defaults to ASCII when detection of the text file's endianness fails.
      /// Source: http://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
      /// With thanks to the author ‘2Toad’.
      /// </summary>
      /// <param name=”filename”>The text file to analyze.</param>
      /// <returns>The detected encoding.</returns>
      public static Encoding GetEncoding(string filename)
      {
       // Read the BOM
       var bom = new byte[4];
       using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
       {
        file.Read(bom, 0, 4);
       }
     
       // Analyze the BOM
       if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
       if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
       if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
       if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
       if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
       // return Encoding.GetEncoding(1252); // Without BOM, default to Windows Codepage 1252
       return Encoding.Default; // Without BOM, the encoding for the OS's current ANSI code page is returned by default.
      }

    When you have adjusted the script for your file, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Task Editor, close it by pressing OK.

    Developing the Data Flow Task

    Add a Data Flow Task to the Control Flow and name it DFT – Import Customer Details.
    Then add the following components to it:

    Flat File Source FF_SRC – CustomerDetails

    From the SSIS Toolbox, drag a Flat File Source to the Data Flow area and create a new Flat File connection manager.
    SSIS-S01E03-121

    If you want to rebuild the demo, download the csv files (and C# script) here.
    Configure the Flat File Connection as shown in the picture below:
    SSIS-S01E03-118

    Choose Advanced in the pane on the left and configure the properties for each column as follows:

    • SocialSecurityNumber: DataType string [DT_STR], OutputColumnWidth 9
    • FirstName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • LastName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • Gender: DataType string [DT_STR], OutputColumnWidth 1
    • Residence: DataType Unicode string [DT_WSTR], OutputColumnWidth 60
    • Comments: DataType Unicode string [DT_WSTR], OutputColumnWidth 1000

    SSIS-S01E03-120
    Example of column properties

    Conditional Split CSPL – Filter out empty lines

    We use a Conditional Split to filter out empty lines in the csv file. Because the SocialSecurityNumber is a required column, we assume that the line is empty (at least it is useless) if SocialSecurityNumber is empty.
    So add an Output named Empty lines with Condition SocialSecurityNumber == “”. If you like, you can check on multiple columns to be empty.
    Rename the default Output to Non-empty lines.

    SSIS-S01E03-122
    Conditional Split Configuration

    Derived Column DC – Re-insert CRLF

    Because the (corrected) file might contain replacements for carriage return/line feeds, we are going to replace them with the original carriage return/line feeds, before the value is inserted into the database table.
    In the Derived Column Transformation Editor, add a derived column, configured as follows:

    • Derived Column Name: Comments
    • Derived Column: Replace ‘Comments’
    • Expression: REPLACE(Comments,@[User::CRLF_Replacer],@[User::CRLF])

    SSIS-S01E03-123
    The Derived Column Transformation Editor

    OLEDB Destination OLE_DST stg DimCustomer

    To rebuild the demo, create a table to store the rows from the data flow task,
    add a OLEDB Destination that uses this table.

    destination_table.sql

    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO
    CREATE TABLE [stg].[DimCustomer](
     [SocialSecurityNumber] [char](9) NOT NULL,
     [FirstName] [nvarchar](50) NULL,
     [LastName] [nvarchar](50) NULL,
     [Gender] [char](1) NULL,
     [Comments] [nvarchar](1000) NULL,
     [Residence] [nvarchar](60) NULL,
     CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED
    (
     [SocialSecurityNumber] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO
    SET ANSI_PADDING OFF
    GO

    The Control Flow – File System Tasks

    Finally the File System Tasks archive the original file and delete the corrected file (if it exists).
    For auditing purposes I keep the original file and not the corrected one, because the latter can always be recreated when needed.
    Also, when a bug would exist in the creation of the corrected file (initially I had a problem with codepages, which should now be solved), the original file might get lost forever.

    Please go back to the picture of the The Control Flow at the beginning of this post, to see how the File System Tasks are connected.
    Next are the printscreens that should help you to configure the File System Tasks.

    SSIS-S01E03-124
    Configuration of “FST – Move original file to archive folder”

    SSIS-S01E03-125
    Configuration of “FST – Move original file to error folder”

    SSIS-S01E03-126
    Precedence constraint of “FST – Delete temporary file” (you need to set it twice).

    SSIS-S01E03-128
    Configuration of “FST – Delete temporary file”

    The proof

    To show you that it works, I run the package in debug mode.
    To be able to have a look at the temporary/corrected files, afterwards, I have disabled the File System Task that deletes this file.
    SSIS-S01E03-130

    When I open the “CustomerDetails.csv.tmp” file after the package has run, I can see that line 15 was corrected.
    From the second screenshot you can see that the CRLF_Replacer value was inserted twice.
    SSIS-S01E03-131
    SSIS-S01E03-132

    And when I run a query on the destination table, I can see that the Comments column contains the original CarriageReturn/LineFeeds!
    SSIS-S01E03-134

    By the way, all the data in the csv files is generated from random firstnames, lastnames and residences and fake social security numbers. Any match with existing persons would be based solely on coincidence.

    Conclusion / Wrap up

    Csv files still are useful today, because they have some advantages when compared with XML files.
    However they also have their limitations, which can be frustrating when loading those files with a Data Flow Task.
    In this blog post I have demonstrated how to overcome a few of those limitations, namely:

    1. carriage return/line feeds in contents of csv files
    2. changing file formats over time

    The solution is based on using a C# Script Task.

    Please be aware that I do not pretend this is the best solution.
    If you are very familiar with C#, you could all of the processsing in the Script Task, using SqlBulkCopy, and you would not need a temporary file.
    However the Data Flow Task offers some advantages, like lookups you can use, without the need to do this in a script.
    If you create good Interface Specification Documents on forehand (for instance containing how to handle carriage return/line feeds in column values) you can force that all sources deliver correct csv-files, and then they do not have to be corrected.

    Download the C# script and csv files here.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Detect duplicates in a Data Flow Task (SSIS Series)

Intended audience

This article assumes that you have at least a basic understanding of SSIS, and the Data Flow Task in particular. C# knowledge is an advantage but not required.

Introduction

If you ever used the T-SQL windowing function ROW_NUMBER() OVER (PARTITION BY ..) to find duplicates in a query result, this post might be of interest for you.

Because not all sources having potential duplicate entries (e.g. business keys) come from a SQL Server database. If it is a csv- or Excel file ROW_NUMBER will not automaticly be your friend (it might be when you use OPENROWSET, but I think that is a rather complicated solution).

C# script components are extremely flexible yet fairly easy to understand, develop and maintain.
This is why this is my preferred solution to tackle the problem above.
Let’s begin!

How to detect duplicates in a Data Flow Task using a C# Script component

First an overview of the end result (the Data Flow of the Data Flow Task). I will tell you step by step how to achieve this using SSIS 2014. I did not make a print screen of the Control Flow because it only contains one Data Flow Task.
Please note that I enabled two Data Viewers on both outputs of the Conditional Split component, so you can see the data that flows through (screenshot later).
SSIS-S01E02-image01
For the demo I will use the following components in a dataflow. To solve your problem just pick out the things you need.

  • An OLE DB Source
  • A script component (the most interesting part!)
  • An (optional) conditional split
  • Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates.

An OLE DB Source

My example is based on Advertureworks. I select customers from the [AdventureWorksDW] database, and make sure there are duplicates by using UNION ALL in the SQL statement.
In the real world it would be more likely that your source would not be a SQL Server query, but for instance a csv- or Excel file.

OLEDB Source.sql

SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
UNION ALL
SELECT
  [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome]
FROM
  [dbo].[DimCustomer]
  WHERE CustomerKey % 2 = 0

SQL Statement for the OLE DB Source

A script component (the most interesting part!)

From the SSIS Toolbox drag a Script Component to the Data flow surface. In the “Select Script Component Type” choose “Transformation”.

SSIS-S01E02-image02

To use a column value in the script, you have to define it as an input column. Select the column you want to check for duplicate values with Usage Type ReadOnly. In the demo this is the CustomerBusinessKey column.
SSIS-S01E02-image03

We also need a new column to store the occurrence number for the CustomerBusinessKey. To do this, in the Script Transformation Editor, Select Inputs and Outputs on the left side, then unfold Output 0 and Output Columns, respectively.
Then click on the Add column button to add a column with DataType four-byte signed integer [DT_I4]. Give the new column a clear name. In the demo it is called CustomerBusinessKeyOccurrence.
SSIS-S01E02-image03b

After these preparations we can start writing the C# script. Press the Edit script … button to do so.
SSIS-S01E02-image04

Adjust the script as follows (the changes are also marked inside red blocks in the picture below):

  • In the Namespaces region, add:
    using System.Collections.Generic;
  • Above PreExecute(), add:
    // The column to count duplicates for will be stored as Key (with string datatype) of the dictionary.
    // The occurrence counter will be stored as Value of the dictionary (with int datatype).
    private Dictionary<string, int> m_dictOccurrencesCount;
  • Inside PreExecute(), add:
    // Initialize the dictionary, otherwise it will be null when accessed.
    m_dictOccurrencesCount = new Dictionary<string, int>();

SSIS-S01E02-image05
Script adjustments – part 1

Then adjust the script as follows:

  • Inside Input0_ProcessInputRow(Input0Buffer Row), add:
    // Count occurrences of CustomerBusinessKeys.
    if (m_dictOccurrencesCount.ContainsKey(Row.CustomerBusinessKey))
    {
      // If the CustomerBusinessKey of the current row is already in the dictionary,
      // increase the counter with 1:
      m_dictOccurrencesCount[Row.CustomerBusinessKey]++;
      // Also assign the new value to the CustomerBusinessKeyOccurrence column,
      // so it can be used in the Data flow later.
      Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[Row.CustomerBusinessKey];
    }
    else
    {
      // If the CustomerBusinessKey of the current row is not yet in the dictionary,
      // add it, and set the counter to 1
      m_dictOccurrencesCount.Add(Row.CustomerBusinessKey, 1);
      // Also assign 1 to the CustomerBusinessKeyOccurrence column.
      Row.CustomerBusinessKeyOccurrence = 1;
    }

SSIS-S01E02-image06
Script adjustments – part 2

When the script is ready, Choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Transformation Editor, close it by pressing OK.

An (optional) conditional split

Optionally you can add a Conditional Split.
I use the conditional split to store the first occurrences for every CustomerBusinessKey in one table and all duplicates in another table.
Both tables are only to show you that the script task and conditional split do their jobs properly. In real projects, you might develop this differently, for example by storing only the first occurence.

SSIS-S01E02-image07
Conditional Split Configuration

Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates

I basicly created a two temporary tables by composing a SELECT TOP 0 .. INTO statement from the OLE DB Source SQL Statement, because the result of the demo needs to flow to somewhere.

SSIS-S01E02-OLEDB-Source.sql
IF OBJECT_ID('[dbo].[DimCustomerFirstOne]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerFirstOne]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerFirstOne]
FROM
  [dbo].[DimCustomer]
GO

IF OBJECT_ID('[dbo].[DimCustomerDuplicates]', 'U') IS NOT NULL
  DROP TABLE [dbo].[DimCustomerDuplicates]
GO
SELECT TOP 0
  [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
  [Title], [FirstName], [MiddleName], [LastName],
  [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
  [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
INTO [dbo].[DimCustomerDuplicates]
FROM
  [dbo].[DimCustomer]
GO

OLE DB Destination OLE_DST FirstOne uses the [dbo].[DimCustomerFirstOne] table.
OLE DB Destination OLE_DST Duplicate uses the [dbo].[DimCustomerDuplicates] table.

Let it flow, let if flow, let it flow

Okay, time to test.
When data flows, the “First occurence” and “Duplicates” Data Viewers pop up. Then I can check that all the rows are divided well over both tables based on the value of column CustomerBusinessKeyOccurrence: exactly 1 or higher than 1.
When I run it with the Data Viewers disabled, the package execution finished literally in a split second. So the performance is excellent.
Test succeeded!
SSIS-S01E02-image08
The Data flow in Debug Mode

SSIS-S01E02-image09b
Data Viewer for First occurrence flow: CustomerBusinessKeyOccurrence is 1 for all rows.

SSIS-S01E02-image10b
Data Viewer for Duplicates flow: CustomerBusinessKeyOccurrence is 2 or higher for all rows.

Other considerations

If multiple columns make a row unique concatenate all column values with a delimiter, and then add the concatenated result to the dictionary m_dictOccurrencesCount.
For instance you could check duplicates using the customer nameparts (firstname, middlename and lastname), the birthdate and the gender. This is a softer check than on a real business key, but sometimes you do not have that.
Of course, for this to work, you must add those columns as ReadOnly input columns to the script component.
You can create a concatenated value as follows:
Inside Input0_ProcessInputRow(Input0Buffer Row), start with:

// Concatenate FirstName, MiddleName, LastName, BirthDate and Gender, using a | as delimiter.
string businessKey = string.Format(“{0}|{1}|{2}|{3:yyyyMMdd}|{4}”,
  Row.FirstName, Row.MiddleName, Row.LastName, Row.BirthDate, Row.Gender).ToUpper();

The concatenated value is converted to uppercase, so that differences in lower- or uppercase lead to the same value.
Then use the businessKey variable (instead of Row.CustomerBusinessKey) in the if statement as shown above.

If the concatenated value becomes very long, you could calculate a hash value and add that to the dictionary instead (thanks for the suggestion, Joost). This would use less memory and might be faster, but the only way to find out if this is the case is testing .. testing ..
( I did not test this to make the demo not too complicated and because performance did not seem to be an issue ).

Thinking a bit further, you could use this concept also to calculate running totals.
Just configure the dictionary to hold a decimal instead of an int and you can calculate running totals for numeric values (e.g. sales figures) easily.

Conclusion / Wrap up

In SSIS you can use a Script Component within a Data Flow Task to detect duplicates in the source data.
This can be an alternative to the T-SQL ROW_NUMBER() function, when the source is not a SQL Server Database.

Credits
My thanks go to Joost van Rossum for reviewing the initial draft of this article. His input was very useful.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

When reverse engineering is a necessary evil .. (SQL Server Series)

.. or: How to find data (text, guids, numbers) in your entire SQL Server database ..

Introduction

A number of years ago I had to make queries on an OLTP database used by third parties’ software. As a person preferring data warehouses for too many reasons to mention here, I am not proud of it, but sometimes these requests just happen. If you have the same now and then, you are not the only one 😉 .

Problem

A little simplified, this is what happened:
I started asking around, is there documentation? “No”, there wasn’t.
Is there a data dictionary? “No”, there wasn’t, either.
Can I have access to the database? “Yes, here’s the connection string.”
So I made a connection, but .. what do all these tables mean? How are they related? How are the foreign keys? The model didn’t tell me, with no physical constraints and unclear table- and column names.
Do you have anything else for me to see how this works? “Well, we have a PROD and a TEST environment, you can have read only access to PROD and read/write to the TEST.”
Okay, that is something ..

MSSQL-S01E02-testconnection
Picture credits: © David Coleman | Dreamstime Stock Photos
When there is only a (test)connection..

Towards a solution

Could you show me how you use the system? Luckily the employee of my client was very helpful and showed me how he used the application.
I tried to get a grip on the SQL Statements caused by his actions, but due to a lack of permissions on the database, this didn’t work.
This could be arranged, but could take a while before I actually would have temporary “elevated” permissions.
As it was just some small piece of “inbetween” work, that seemed an inefficient route to follow.
Okay, I have seen what you did, can I do that on the TEST as well? “Yeah, go ahead.”
So I started to make an order, using silly names and descriptions, and choosing weird seldomly used products. So, what’s next?
Ok, if I can find where my description “First order of Hans” is stored, I can start to map the frontend to the backend, or make clear for myself where the entered information is stored in the database.

Solution

This is where this script became useful. It finds all occurrences of one or more text values in a SQL Server database.

WARNING: Do not use in production environments, as this script can potentially run pretty long (10 to 20 minutes when I used it) and can have a negative impact on database performance in general (so other users might regret that you are running the script).
So use it in a test enviroment when you can.

SearchTextInDatabase.sql

--\
---) SearchTextInDatabase script.
---) Author: Hans Michiels
---) Searches for one or more texts in all (n)(var)char and (n)text
---) columns of a SQL Server database.
---) WARNING: Can potentially run very long, my advise is not
---) to run this script in production environments!
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014]
GO

--\-----------------------------------------------------------------------------------
---) DECLARATIONS
--/-----------------------------------------------------------------------------------
DECLARE @schema_name SYSNAME
DECLARE @table_name SYSNAME
DECLARE @column_name SYSNAME
DECLARE @nsql NVARCHAR(4000)
DECLARE @param_def NVARCHAR(500)
DECLARE @search_text NVARCHAR(1000)
DECLARE @search_texts TABLE ([search_text] NVARCHAR(1000))
DECLARE @debug_print NVARCHAR(500)
DECLARE @results_count INT = 0

IF OBJECT_ID('tempdb..#search_results') IS NOT NULL BEGIN
    DROP TABLE #search_results
END
CREATE TABLE #search_results
    ( [search_text] NVARCHAR(1000)
    , [schema_name] NVARCHAR(128)
    , [table_name] NVARCHAR(128)
    , [column_name] NVARCHAR(128)
    , [the_value] NVARCHAR(1000)
    )

DECLARE columns_cursor CURSOR LOCAL STATIC FOR
    SELECT sch.name AS [schema_name]
    , tbl.name AS [table_name]
    , col.name AS [column_name]
    FROM sys.tables tbl
     JOIN sys.schemas sch
        ON sch.schema_id = tbl.schema_id
      JOIN sys.columns col
        ON col.object_id = tbl.object_id
    WHERE col.system_type_id IN (175, 167, 231, 239, 35, 99) -- (N)(VAR)CHAR, (N)TEXT

--\-----------------------------------------------------------------------------------
---) INITIALIZATION
---) Specify the texts to search for below.
---) Use the wildcard % if you want to do a 'like' search.
--/-----------------------------------------------------------------------------------
INSERT INTO @search_texts

  SELECT '%Assembly%'
  UNION SELECT 'First order of Hans'

DECLARE search_cursor CURSOR LOCAL STATIC FOR
    SELECT search_text
    FROM @search_texts

--\-----------------------------------------------------------------------------------
---) MAIN
--/-----------------------------------------------------------------------------------
OPEN columns_cursor
FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @debug_print = 'Processing ' + @table_name + '.' + @column_name;
    PRINT REPLICATE('-', LEN(@debug_print))
    PRINT @debug_print
    PRINT REPLICATE('-', LEN(@debug_print))
    
    SET @nsql = 'INSERT INTO #search_results SELECT DISTINCT @inner_search_text, '
        + '''' + @schema_name + ''' AS [schema_name], '
        + '''' + @table_name + ''' AS [table_name], '
        + '''' + @column_name + ''' AS [column_name], CONVERT(NVARCHAR(1000), ' + QUOTENAME(@column_name) + ') AS [the_value] '
        + ' FROM ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' WITH (NOLOCK) '
        + ' WHERE ' + QUOTENAME(@column_name) + ' LIKE @inner_search_text;';

    OPEN search_cursor
    FETCH NEXT FROM search_cursor INTO @search_text

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT ''
        PRINT '--> Processing search text ''' + @search_text + '''';
        SET @param_def = N'@inner_search_text NVARCHAR(1000)';
        EXECUTE sp_executesql @nsql, @param_def,
                              @inner_search_text = @search_text;

        FETCH NEXT FROM search_cursor INTO @search_text
    END
    
    CLOSE search_cursor

    FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name
    
END

CLOSE columns_cursor
DEALLOCATE columns_cursor

SELECT
      t.schema_name
    , t.table_name
    , t.column_name
    , t.the_value
    , 'SELECT * FROM ' + QUOTENAME(t.schema_name) + '.' + QUOTENAME(t.table_name)
      + ' WHERE ' + QUOTENAME(t.column_name) + ' = ''' + REPLACE(t.the_value, '''', '''''') + '''' AS [select_statement]
 FROM #search_results t

GO

Some time later, I also made two similar versions for searching unique identifiers and numeric values.
I know the scripts might not look so ‘pretty’, but they do the job when needed.

SearchUniqueIdentifierInDatabase.sql

--\
---) SearchUniqueIdentifierInDatabase script.
---) Author: Hans Michiels
---) Searches for one or more unique identifiers in all UNIQUEIDENTIFIER columns
---) of a SQL Server database.
---) WARNING: Can potentially run very long, my advise is not
---) to run this script in production environments!
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014]
GO

--\-----------------------------------------------------------------------------------
---) DECLARATIONS
--/-----------------------------------------------------------------------------------
DECLARE @schema_name SYSNAME
DECLARE @table_name SYSNAME
DECLARE @column_name SYSNAME
DECLARE @nsql NVARCHAR(4000)
DECLARE @param_def NVARCHAR(500)
DECLARE @search_value UNIQUEIDENTIFIER
DECLARE @search_values TABLE ([search_value] UNIQUEIDENTIFIER)
DECLARE @debug_print NVARCHAR(500)
DECLARE @results_count INT = 0

IF OBJECT_ID('tempdb..#search_results') IS NOT NULL BEGIN
    DROP TABLE #search_results
END
CREATE TABLE #search_results
    ( [search_value] UNIQUEIDENTIFIER
    , [schema_name] NVARCHAR(128)
    , [table_name] NVARCHAR(128)
    , [column_name] NVARCHAR(128)
    , [the_value] UNIQUEIDENTIFIER
    )

DECLARE columns_cursor CURSOR LOCAL STATIC FOR
    SELECT sch.name AS [schema_name]
    , tbl.name AS [table_name]
    , col.name AS [column_name]
    FROM sys.tables tbl
     JOIN sys.schemas sch
        ON sch.schema_id = tbl.schema_id
      JOIN sys.columns col
        ON col.object_id = tbl.object_id
                              -- UNIQUEIDENTIFIER
    WHERE col.system_type_id IN (36)

--\-----------------------------------------------------------------------------------
---) INITIALIZATION
---) Specify the unique identifiers to search for below.
--/-----------------------------------------------------------------------------------
INSERT INTO @search_values
SELECT 'D081136E-38D6-4D42-8FFD-19A6A8FA61E6'
UNION SELECT '9AADCB0D-36CF-483F-84D8-585C2D4EC6E9'
UNION SELECT '32A54B9E-E034-4BFB-B573-A71CDE60D8C0'
UNION SELECT '4C506923-6D1B-452C-A07C-BAA6F5B142A4'

DECLARE search_cursor CURSOR LOCAL STATIC FOR
    SELECT search_value
    FROM @search_values

--\-----------------------------------------------------------------------------------
---) MAIN
--/-----------------------------------------------------------------------------------
OPEN columns_cursor
FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @debug_print = 'Processing ' + @table_name + '.' + @column_name;
    PRINT REPLICATE('-', LEN(@debug_print))
    PRINT @debug_print
    PRINT REPLICATE('-', LEN(@debug_print))
    
    SET @nsql = 'INSERT INTO #search_results SELECT DISTINCT @inner_search_value, '
        + '''' + @schema_name + ''' AS [schema_name], '
        + '''' + @table_name + ''' AS [table_name], '
        + '''' + @column_name + ''' AS [column_name], ' + QUOTENAME(@column_name) + ' AS [the_value] '
        + ' FROM ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' WITH (NOLOCK) '
        + ' WHERE ' + QUOTENAME(@column_name) + ' = @inner_search_value ;';

    OPEN search_cursor
    FETCH NEXT FROM search_cursor INTO @search_value

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT ''
        PRINT '--> Processing search value ''' + CONVERT(NVARCHAR(40), @search_value) + '''';
        SET @param_def = N'@inner_search_value UNIQUEIDENTIFIER';
        EXECUTE sp_executesql @nsql, @param_def,
                              @inner_search_value = @search_value;

        FETCH NEXT FROM search_cursor INTO @search_value
    END
    
    CLOSE search_cursor

    FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name
    
    SELECT @results_count = (SELECT COUNT(*) FROM #search_results);
    
END

CLOSE columns_cursor
DEALLOCATE columns_cursor

SELECT
      t.schema_name
    , t.table_name
    , t.column_name
    , t.the_value
    , 'SELECT * FROM ' + QUOTENAME(t.schema_name) + '.' + QUOTENAME(t.table_name)
      + ' WHERE ' + QUOTENAME(t.column_name) + ' = ''' + REPLACE(t.the_value, '''', '''''') + '''' AS [select_statement]
 FROM #search_results t

GO


SearchNumberInDatabase.sql

--\
---) SearchNumberInDatabase script.
---) Author: Hans Michiels
---) Searches for one or more numeric values in all TINYINT, SMALLINT,
---) INT, FLOAT, DECIMAL and NUMERIC columns of a SQL Server database.
---) WARNING: Can potentially run very long, my advise is not
---) to run this script in production environments!
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014]
GO

--\-----------------------------------------------------------------------------------
---) DECLARATIONS
--/-----------------------------------------------------------------------------------
DECLARE @schema_name SYSNAME
DECLARE @table_name SYSNAME
DECLARE @column_name SYSNAME
DECLARE @nsql NVARCHAR(4000)
DECLARE @param_def NVARCHAR(500)
DECLARE @search_value DECIMAL(38, 2)
DECLARE @search_values TABLE ([search_value] DECIMAL(38, 2))
DECLARE @debug_print NVARCHAR(500)
DECLARE @results_count INT = 0
DECLARE @max_difference DECIMAL(38, 2) -- Use to find values 'near by'

IF OBJECT_ID('tempdb..#search_results') IS NOT NULL BEGIN
    DROP TABLE #search_results
END
CREATE TABLE #search_results
    ( [search_value] DECIMAL(38, 2)
    , [schema_name] NVARCHAR(128)
    , [table_name] NVARCHAR(128)
    , [column_name] NVARCHAR(128)
    , [the_value] DECIMAL(38, 2)
    )

DECLARE columns_cursor CURSOR LOCAL STATIC FOR
    SELECT sch.name AS [schema_name]
    , tbl.name AS [table_name]
    , col.name AS [column_name]
    FROM sys.tables tbl
     JOIN sys.schemas sch
        ON sch.schema_id = tbl.schema_id
      JOIN sys.columns col
        ON col.object_id = tbl.object_id
                              -- TINYINT, SMALLINT, INT, INT
    WHERE col.system_type_id IN (48, 52, 56, 127
                              -- FLOAT, DECIMAL, FLOAT, DECIMAL, DECIMAL, DECIMAL
                              --, 59, 60, 62, 106, 108, 122, 127
                                )

--\-----------------------------------------------------------------------------------
---) INITIALIZATION
---) Specify the texts to search for below “INSERT INTO @search_values”
---) Also set a value for @max_difference
---) @max_difference makes it possible to find nearby numbers.
--/-----------------------------------------------------------------------------------
SELECT @max_difference = 1.0
INSERT INTO @search_values
  SELECT 755
  --UNION SELECT 97867563

DECLARE search_cursor CURSOR LOCAL STATIC FOR
    SELECT search_value
    FROM @search_values

-------------------------------------------------------------------------------------
-- MAIN
-------------------------------------------------------------------------------------
OPEN columns_cursor
FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @debug_print = 'Processing ' + @table_name + '.' + @column_name;
    PRINT REPLICATE('-', LEN(@debug_print))
    PRINT @debug_print
    PRINT REPLICATE('-', LEN(@debug_print))
    
    SET @nsql = 'INSERT INTO #search_results SELECT DISTINCT @inner_search_value, '
        + '''' + @schema_name + ''' AS [schema_name], '
        + '''' + @table_name + ''' AS [table_name], '
        + '''' + @column_name + ''' AS [column_name], CONVERT(DECIMAL(38, 2), ' + QUOTENAME(@column_name) + ') AS [the_value] '
        + ' FROM ' + QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name) + ' WITH (NOLOCK) WHERE ABS(' + QUOTENAME(@column_name) + ' - @inner_search_value) <= ' + CAST(@max_difference AS VARCHAR(50)) + ';';

    PRINT @nsql
    
    OPEN search_cursor
    FETCH NEXT FROM search_cursor INTO @search_value

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT ''
        PRINT '--> Processing search value ''' + CONVERT(NVARCHAR(40), @search_value) + '''';
        SET @param_def = N'@inner_search_value DECIMAL(38, 2)';
        EXECUTE sp_executesql @nsql, @param_def,
                              @inner_search_value = @search_value;

        FETCH NEXT FROM search_cursor INTO @search_value
    END
    
    CLOSE search_cursor

    FETCH NEXT FROM columns_cursor INTO @schema_name, @table_name, @column_name
    
    SELECT @results_count = (SELECT COUNT(*) FROM #search_results);
    
END

CLOSE columns_cursor
DEALLOCATE columns_cursor

SELECT
      t.schema_name
    , t.table_name
    , t.column_name
    , t.the_value
    , 'SELECT * FROM ' + QUOTENAME(t.schema_name) + '.' + QUOTENAME(t.table_name) + ' WHERE ' + QUOTENAME(t.column_name) + ' = ''' + REPLACE(t.the_value, '''', '''''') + ''''
 FROM #search_results t

GO

Download all scripts here.

Conclusion / Wrap up

When you have to understand a SQL Server datamodel, for instance to get data out, and there is no documentation but only a test environment and database connection available, these scripts can help you to find where data entered in the frontend software is stored in the database.
A different use case is when the database is using unique identifiers for primary and foreign keys, and the foreign key relationships are not physically enforced and/or flexible. In that case you can search for a specific unique identifier and see where it is used in the database.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Stop being so precise! and more about using Load(end)dates (Datavault Series)

Introduction

My web site was offline a few days ago. Sorry for that.
I got this email from my webhosting provider: “A new message or response with subject: User *** has used up 153% of their bandwidth ..”.
Shortly after they suspended my account. It was suspended for about 8 hours. I have upgraded the network traffic limit.
All right, when all bandwidth was consumed .. then somebody must be reading my blog. Thank you for that.

This weeks post is about the LoadDate and LoadEndDate.
Actually there are two things to be nerdy on:

  1. End dating without gaps
  2. Getting the best value for bytes on DATETIME2 precision

By the way, these topics apply to SQL Server, the examples are made using SQL Server 2014.

1. End dating without gaps

How end dating works

In a Data Vault Satellite, different subsequent versions of a row in the source system are distinguished through a LoadDate (which usually contains a date/time value). So the HashKey of the corresponding Hub Row plus the LoadDate are the primary key for the satellite.
However to get the right version out when querying for a specific date, this requires a relatively difficult query, and this can also have a negative impact on performance.
This is why the LoadEndDate is an optional column of a Satellite, to make querying (read: getting data out) easier and better performing.

Important to remember that the LoadEndDate is not the date/time the load(process) ended, but the date/time the row was replaced by a newer row for the same business entity in the hub. What’s in a name, if I had to choose I would just call it EndDate, but LoadEndDate is the standard, and once you know it, it is not a problem.

There are two ways to determine the value for this LoadEndDate:

  1. Exclusive: the LoadEndDate is the LoadDate of the new row that replaces this one, minus a small time fraction. Using the exclusive methods enables the use of the BETWEEN keyword in Transact-SQL to get the right row out, example:

    WHERE @SnapshotDate BETWEEN [LoadDate] AND [LoadEndDate]

    This is the method that is used in most Data Vault examples.

  2. Inclusive: the LoadEndDate is exactly equal to the LoadDate of the new row that replaces this one. This requires no computation when updating the LoadEndDate, but disqualifies the BETWEEN keyword, for getting data out you need to do something like:

    WHERE @SnapshotDate >= [LoadDate] AND @SnapshotDate < [LoadEndDate]

Problem

There is a (theoretical) problem when using the exclusive method.
If you substract to much, there will be a tiny time gap between the LoadEndDate and the LoadDate of the subsequent row.
I fully admit this is a theoretical problem, because the chances that you require the row valid exactly on this moment are astronomically small.
(There is this not so hypothetical situation that your load starts at midnight, the LoadDate is set before the first second of the new day has passed, and you substract a second.
Then you do not find a record when using the exact date (without time fraction) to get the row out. But you need a lot of bad luck for this to happen).

Solution

Still if you are a purist you want to do it right, certainly because the solution is so simple.
If you make the the “grain” of substraction from the LoadDate equal to the precision of the LoadDate, there is no gap.
E.g.
– Substract 1 second from DATETIME2(0)
– Substract 1 centisecond from DATETIME2(2) ***TIP***
– Substract 1 millisecond from DATETIME2(3)
– Substract 1 microsecond from DATETIME2(6)
– Substract 100 nanoseconds from DATETIME2(7)

Examples in T-SQL:

Exclusive enddate without gaps.sql

DECLARE @dt20 DATETIME2(0) = '2016-04-13 20:52:17'
DECLARE @dt22 DATETIME2(2) = '2016-04-13 20:52:17.00'
DECLARE @dt23 DATETIME2(3) = '2016-04-13 20:52:17.000'
DECLARE @dt26 DATETIME2(6) = '2016-04-13 20:52:17.000000'
DECLARE @dt27 DATETIME2(7) = '2016-04-13 20:52:17.0000000'

SELECT @dt20, DATEADD(SECOND, -1, @dt20)
SELECT @dt22, DATEADD(MILLISECOND, -10, @dt22)
SELECT @dt23, DATEADD(MILLISECOND, -1, @dt23)
SELECT @dt26, DATEADD(MICROSECOND, -1, @dt26)
SELECT @dt27, DATEADD(NANOSECOND, -100, @dt27)

2. Getting the best value for bytes on DATETIME2 precision

This is about a “smart” precision to choose for your LoadDate and LoadEndDate columns. Unlike the older DATETIME datatype, DATETIME2 uses less or more bytes for storage depending on the precision you specify.
The Storage size is: 6 bytes for precisions less than 3; 7 bytes for precisions 3 and 4. All other precisions require 8 bytes. (I shamelessly pasted this from MSDN).
In the book “Building a scalable data warehouse with Data Vault 2.0” a lot of examples use DATETIME2(7). You can ask yourself why. Why do you need to be precise to 100 nanoseconds? I dare to say that in most cases (except when loading realtime or near-realtime streaming data into your Data Vault), seconds would be precise enough.
But looking back a few lines, to the storage sizes, DATETIME2(0) uses 6 bytes, but DATETIME2(2) ALSO uses 6 bytes. So with the latter you get a higher precision for the same storage size. And for daily loads, the centiseconds precision that DATETIME2(2) is providing, is really precise enough, believe me.
So DATETIME2(2) gives us the best value for bytes!


DV-S01E04-precise
Why would your LoadDates and LoadEndDates be more precise than needed ?..


In this way you can save two bytes, compared with a DATETIME2(7), which uses 8 bytes and is used in the book. Because the LoadDate is in the primary key of satellite tables, also the primary key index will be smaller.
In PIT Tables, which usually have multiple LoadDates, the storage gain is even more.
And what counts for money, “who does not respect a small gain, isn’t worth a big one”, does also count for small performance gains you can get by doing this kind of tweaks. Never forget that! Multiple smaller tweaks might add up to a noticable performance improvement!

Conclusion / Wrap up

In this blog post you have read about the two methods for enddating: Inclusive and Exclusive.
Also you could read how you can use the Exclusive method for end dating in Data Vault satellites, without having gaps in the timeline.
Finally I discussed which precision will in most cases be good enough for LoadDate and LoadEndDate columns.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.