Adding custom header and footer to csv files (SSIS Series)

Introduction

This article wasn’t planned. SSIS again? No I wanted to write a post about something else.
Nevertheless I got this question on one of my previous posts and thought, I might as well create a blog post about it.

Problem

“Hi Hans,
I have created my ssis package for flat file destination. My problem is that I want to add header and footer in my flat file destination to send it to FTP location format will be

0|MMDDYYYY|HHMMSS
1|Row1
1|Row2
1|Row3
1|ROW4
9|6

The last line contains no of row count including header and trailer
Please let me know how to do that.. I am completely stuck with that and no good luck.. “

So to summarize, both the header and the footer line of the csv file need to contain dynamic content.
Also each row starts with a ‘row type’: 0 for the header, 1 for every detail row, and 9 for the footer.

Solution

As an example I just use a simple Data Flow Task with an OLEDB Source connected to table DimCurrency from the AdventureWorksDW database.
The data then flows to a Script Component used as a Destination.
Here are a few printscreens to demonstrate what I mean. The Script Component will create the file.
The package is made using SSIS 2014.

Developing the package

SSIS-S01E04-099
The demo package just contains one Data Flow Task

SSIS-S01E04-205
Design of the Data Flow Task

SSIS-S01E04-101
OLEDB Source: DimCurrency from the AdventureWorksDW database

SSIS-S01E04-102
OLEDB Source: select all three columns

SSIS-S01E04-208
When you drag the Script Component to the Data Flow, you will get this dialog. Use the Script Component as a Destination

SSIS-S01E04-207
Variables that the package contains.

SSIS-S01E04-209
In the Script Transformation Editor, add both variables User::HeaderTimestamp and User::TargetCsvPathAndFileName to ReadOnlyVariables

Now comes the most interesting part! We use the endless possibilities of C# to fix our problem.
In the Script Transformation Editor window, press Edit Script ….

Please note that you can download the entire script here, so you do not have to copy the (sometimes malformatted) code from this webpage..

Then in the namespaces region add two namespaces, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
// Added namespaces:
using System.IO;
using System.Text;
#endregion

Above public override void PreExecute() add the copyright notice and the following two regions: Configuration and Private variables.
Please note that these regions do not exist yet, so you have to copy the code including the #region and #endregion lines.

New regions
  /*
  (c) Copyright 2016 - hansmichiels.com
 
  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.
 
  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.
 
  You should have received a copy of the GNU General Public License
  along with this program. If not, see http://www.gnu.org/licenses/.
  */

 
  #region ***** Configuration *****

  // Here you can change the column delimiter and the text qualifier easily.
  private const string _delimiter = “|”;
  private const string _textQualifier = “”;
  // To set the text qualifier to ” use:
  // private const string _textQualifier = “\””; // I recommend to use a text qualifier.
  
  // Every x rows, write the buffer to the file, and clear the buffer.
  // This is faster than writing every individual line to the file.
  // The higher the value of _bufferRows is, the more memory will be used and the less write actions are done to the csv file.
  // A higher value is a good idea if you have enough memory in your server.
  private const int _bufferRows = 100000;

  #endregion ***** Configuration *****

  #region Private variables

  // Variable to hold the path and filename
  private string _fileName;
  // Variable to count the rows
  private int _rowCount;
  // StringBuilder to hold a buffer of text that still needs to be written to the csv file.
  private StringBuilder _fileContentsBuffer;

  #endregion Private variables

PreExecute() is called before the data starts to flow through. Here we do some preparation and compose the headerline of the file.

PreExecute()

 &nbs p;/// <summary>
  /// This method is called once, before rows begin to be processed in the data flow.
  /// </summary>
  public override void PreExecute()
  {
    base.PreExecute();

    // Initialize variables
    _rowCount = 0;
    _fileContentsBuffer = new StringBuilder();
    _fileName = Variables.TargetCsvPathAndFileName;
    DateTime headerTimestamp = Variables.HeaderTimestamp;

    // Append the new line to the buffer.
    // This will be written to file with the first detail rows.
    // If there are no detail rows, no file will be created.
    _fileContentsBuffer.AppendFormat(“{1}0{1}{0}{1}{2:ddMMyyyy}{1}{0}{1}{2:HHmmss}{1}\r\n”, _delimiter, _textQualifier, headerTimestamp);
    }

PostExecute() is called after the data has flowed through. Here we append the last lines in the _fileContentsBuffer variable and the footerline to the file.

PostExecute()
  /// <summary>
  /// This method is called after all the rows have passed through this component.
  ///
  /// You can delete this method if you don't need to do anything here.
  /// </summary>
  public override void PostExecute()
  {
    base.PostExecute();
    // If no detail rows, bail out. No file will be created when there are no detail rows.
    if (_rowCount == 0) return;

    using (StreamWriter sw = File.AppendText(_fileName))
    {
      // Append a footer line with the rowcount, example value: 9|105
      // The rowcount is a count of the detail rows, but the footer count
      // needs to include header and footer, so we add 2 to the value of _rowCount.
      _fileContentsBuffer.AppendFormat(“{1}9{1}{0}{1}{2}{1}”, _delimiter, _textQualifier, _rowCount + 2);
      sw.Write(_fileContentsBuffer.ToString());
      _fileContentsBuffer.Clear();
    }
  }

Input0_ProcessInputRow(Input0Buffer Row) is the place where all the data processing takes place. As this script is the Destination for the data, we need to do something with those data. A Flat File Destination is not there and will not do this for us.
So here we append every detail row to the _fileContentsBuffer variable. Once in a while, we write the buffer to the file, and clear the buffer again.

Input0_ProcessInputRow

  /// <summary>
  /// This method is called once for every row that passes through the component from Input0.
  /// </summary>
  /// <param name=”Row”>The row that is currently passing through the component</param>
  public override void Input0_ProcessInputRow(Input0Buffer Row)
  {
    // Increase the rowcounter
    _rowCount++;

    // Append the new line to the buffer.
    // The format needs to be like “{1}[middle part]{1}\r\n”, so that the row starts and ends with the text qualifier.
    // Replace [middle part] with {1}{0}{1}{x} for every column that you want to append to the file, where x is 2 for column 1, 3 for column 2 and so on.
    _fileContentsBuffer.AppendFormat(“{1}1{1}{0}{1}{2}{1}{0}{1}{3}{1}{0}{1}{4}{1}\r\n”, _delimiter, _textQualifier,
      SafeValue(Row.CurrencyKey), // First column, {2} in the format
      SafeValue(Row.CurrencyAlternateKey), // Second column, {3} in the format
      SafeValue(Row.CurrencyName) // Third column, {3} in the format
             // etc.
      );
    
    // Every x rows, write the buffer to the file, and clear the buffer.
    if (_rowCount % _bufferRows == 0)
    {
      using (StreamWriter sw = File.AppendText(_fileName))
      {
      sw.Write(_fileContentsBuffer.ToString());
      _fileContentsBuffer.Clear();
      }
    }
  }

Multiple overloaded versions of the SafeValue method (for different datatypes like string, int, DateTime, you can add more when needed) add escape characters to the column values when needed.

SafeValue method and overloads
  /// <summary>
  /// Escapes a double quote delimiter in a column value by doubling it.
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(string value)
  {
    // If _textQualifier is null or empty, just return the value.
    if (string.IsNullOrEmpty(_textQualifier)) return value;

    if (value.Contains(_textQualifier))
    {
    // If the value contains one or more text qualifiers,
    // escape them by replacing them with two text qualifiers.
    return value.Replace(_textQualifier, _textQualifier + _textQualifier);
    }
    return value;
  }

  /// <summary>
  /// Overloaded method for DateTime value (meant as an example)
  /// For other data types, create new overloaded methods,
  /// and in that overloaded method convert the value to string,
  /// before calling the 'SafeValue' method with a string parameter.
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(DateTime value)
  {
    return SafeValue(value.ToString(“yyyy-MM-dd HH:mm:ss.fff”));
  }

  /// <summary>
  /// Overloaded method for int value
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(int value)
  {
    return SafeValue(value.ToString());
  }
}

When the script is ready, Choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Transformation Editor, close it by pressing OK.

Testing the package

I simply run the package in debug mode. 105 currency rows are added to the csv file.
Including the header and footer, the csv file contains 107 rows, and this is the number that is written to the footer line.

SSIS-S01E04-110
Running the package in debug mode

SSIS-S01E04-111
First and last part of the created csv file.

Please note that to change the delimiter and text qualifier you only have to make two changes in the ***** Configuration ***** region.

Configuration
  #region ***** Configuration *****

  // Here you can change the column delimiter and the text qualifier easily.
  private const string _delimiter = “;”;
  private const string _textQualifier = “\””;
. . .

  #endregion ***** Configuration *****

By this change the file will look as follows:
SSIS-S01E04-112
First and last part of the created csv file with a different delimiter and text qualifier.

I also did a test with a source query that returned over 1.1 million rows, and the csv file of 25 MB was created in just 3 seconds (on a core i7 laptop with SSD).
So my conclusion is that the performance of this solution is very satisfying.

Download the C# script for the script component here.

Conclusion / Wrap up

In this post I have demonstrated how to use SSIS 2014 to create a csv file with a dynamic header (with a timestamp) and footer line (including a rowcount) using a Script Component as a Destination in a Data Flow Task.

(Promotional)

Big Data & Analytics insight 2016: save the date for an insightful conference
(as a speaker I was asked to do a bit of promotion ..)

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Big Data & Analytics insight 2016: save the date for an insightful conference

Dear reader,

Is the data warehouse at the end of its lifecycle or does the existing infrastructure gain new value from integrating with new Big Data technologies? If so, how do you go about it? Do you need inspiration for your analytics strategy? Then there’s plenty to go by at Big Data & Analytics Insight 2016, on June 7th in Utrecht, The Netherlands.

There will be input from Big Data domain experts and highly interesting presentations like the ones on Data Vault 2.0 & Big Data and Hands-on Predictive Text Analytics that will inspire you. Whether you are deep into the technical or the business aspects of Big Data & Analytics, this conference is a must attend if you want to stay ahead of future evolutions instead of enduring them.

Check the agenda and book your seat, or better, become a ba4all member and get access to one more conference and a hands-on session on Customer Analytics.

I look forward to seeing you at Big Data & Analytics Insight 2016!

Kind regards,

Hans Michiels
(as a speaker I was asked to do a bit of promotion ..)

Fixing corrupt csv files in a SSIS Package (SSIS Series)

(Promotional)

Big Data & Analytics insight 2016: save the date for an insightful conference
(as a speaker I was asked to do a bit of promotion ..)

Introduction

A few weeks ago my blog post was about using a C# Script Component to detect duplicates in a Data Flow Task.

This week I want to discuss it’s brother, the C# Script Task, which can be used in the Control Flow.

Again, extremely flexible, basicly you embed a small C# program inside your package, and you can use the entire .NET framework for all kind of things. I used it to fix malformed csv source files, before processing them in a Data Flow Task.

Pros and Problems

Csv (comma seperated files) have their pros and problems, as compared with other formats, like XML.

A few pros are that they are usually smaller in size than XML files, and easier and faster to load. These are two reasons why csv files still are used a lot these days, despite of the fact that XML files are in some cases more reliable, and more flexible as far as the file contents is concerned.

Unfortunately csv files are not so good in holding memo/comment fields that might contain carriage return/line feeds in it. Memo fields with those characters do not stay on one line in the file, therefore making the csv file more difficult to process (at least the standard flat file source can not handle it in all cases).

Also the structure is very strict. If a column is added to a csv source file, your Data Flow Task will break.

Solution

In this post I will discuss a possible solution for both problems:

  1. carriage return/line feeds in contents of csv files
  2. changing file formats over time

This post assumes that you have worked with SSIS before, know what For Each Loop Containers and Data Flow Tasks do, and recognize the csv problems above. It is a big advantage if you can read and understand C# code.

Okay, let’s start!

Scenario

My demo is based on the following hypothetical scenario:

  1. You need to process multiple CustomerDetails.csv files.
  2. In this file a pipe (|) is used as separator between the columns.
  3. The file contains a header line: SocialSecurityNumber|FirstName|LastName|Gender|Residence|Comments
  4. The Comments column is added later, your package also needs to be able to load historic files without the Comments column.
  5. CustomerDetails.csv files come from hundreds of locations and systems (hereafter called: sources) world wide. Fixing the csv files in the export processes at all these locations is impossible. Even if some sources deliver correct csv files, your SSIS package cannot rely on the fact that all sources will do so.

Overview

Here are a few pictures of the end result:
SSIS-S01E03-109
The connection managers that are needed

SSIS-S01E03-110
The Control Flow
SSIS-S01E03-111
The Data Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.

Parameters:

  • SourceFolder: The folder that that For Each Loop Container reads for csv files.
  • ArchiveFolder: The folder to which files are moved after being processed, if no error occurred.
  • ErrorFolder: The folder to which files are moved that caused an error during processing.

All parameters are Required and their Data type is String.

SSIS-S01E03-113
The parameters of the package

Variables:

  • CRLF: This variable will be filled with a Carriage Return/Line Feed in the Script Task.
  • CRLF_Replacer: This is the character string that is used to replace CarriageReturn/LineFeeds in the Comment column of the file. It must be a character string that does not occur in the Comment field. Therefore ~CRLF~ is used. If you want a different character string, you can change it to something else.
  • SourcePathAndFilename: The full path and filename of the original filename to process. It is filled by the For Each Loop Container. The value set at design time is unimportant.
  • PathAndFilenameToProcess: The full path and filename of the file to process. This can be either the original file, or the modified temporary file. The Flat File Connection Manager uses this path and filename. The value set at design time is unimportant.

The Data type of all variables is String.

SSIS-S01E03-114
SSIS variables needed to do the job

The Control Flow – For Each Loop Container

Add a For Each Loop Container FELC – Load sourcefiles and configure it as a Foreach File Enumerator. Retrieve file name Fully qualified.

SSIS-S01E03-115

On the Collection pane, just below the Enumerator type, press the ellipsis next to Expressions. In the Property Expressions Editor configure the Property Directory to be set by
Expression @[$Package::SourceFolder].
SSIS-S01E03-116

On the Variable Mappings pane, map variable User::SourcePathAndFilename to Index 0
SSIS-S01E03-102
For Each Loop Container configuration

The Control Flow – Script Task

Next, add a Script Task SCR – Fix csv file if needed.
In the Script Task Editor, configure the following:

  • ReadOnlyVariables: User::CRLF_Replacer,User::SourcePathAndFilename
  • ReadWriteVariables: User::CRLF,User::PathAndFilenameToProcess

SSIS-S01E03-117
Variable configuration in the Script Task Editor

Then press the Edit Script… button.
Don’t get scared by the amount of C# code, if you keep your head cool, it will be fixed in 10 minutes!

Adjust the script as follows:

  • In the Namespaces region, if not already there, add:
    using System.IO;
    using System.Text;


  • SSIS-S01E03-108
    Replace the highlighted part above with the code below

    Now remove public void Main() completely and replace it with the code underneath (to avoid malformatting due to html rendering, please download the original csharp script here).
    Then check the code for !!! ATTENTION !!! comment. These are points in the script that you might need to change.

    ScriptTask.cs

      /*
      (c) Copyright 2016 - hansmichiels.com
     
      This program is free software: you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation, either version 3 of the License, or
      (at your option) any later version.
     
      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.
     
      You should have received a copy of the GNU General Public License
      along with this program. If not, see http://www.gnu.org/licenses/.
      */

     
      /// This method is called when this script task executes in the control flow.
      /// Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
      /// To open Help, press F1.
      /// </summary>
     public void Main()
     {
      // Copy a few SSIS variables to C# string variables.
       string sourcePathAndFilename = Dts.Variables[“User::SourcePathAndFilename”].Value.ToString();
       string crlfReplacer = Dts.Variables[“User::CRLF_Replacer”].Value.ToString();
     
       // Set the variable that needs to hold a CarriageReturn/LineFeed.
       Dts.Variables[“User::CRLF”].Value = string.Format(“\r\n”);
     
       // Call the method that checks if the sourcefile is corrupt and if so, fixes it.
       FixFileIfCorrupt(sourcePathAndFilename, crlfReplacer);
     
       // Set returnvalue for script task.
      Dts.TaskResult = (int)ScriptResults.Success;
     }
     
      /// Csv-file can have lines with carriage return-linefeeds, causing errors during the load.
      /// By preinvestigating the file we can correct the file and prevent these errors from ocurring.
      /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
      /// Author: Hans Michiels
      /// License: GNU General Public License, see http://www.gnu.org/licenses/
      /// </summary>
      private void FixFileIfCorrupt(string sourcePathAndFilename, string crlfReplacer)
      {
       string header = string.Empty;
       string headerAppend = string.Empty;
       string detailAppend = string.Empty;
     
       // Initialize pathAndFilenameToProcess with sourcePathAndFilename. Overwrite later if needed.
       string pathAndFilenameToProcess = sourcePathAndFilename;
     
       // Check if the header of the file is as expected, and return the header row through a ref parameter.
       // If the file header is not as expected, do not touch the file and exit this method.
       if (!CheckAndReturnHeader(sourcePathAndFilename, ref header, ref headerAppend, ref detailAppend)) return;
       
       // Get the encoding of the current file. We need this if we have to create a corrected file.
       var encoding = GetEncoding(sourcePathAndFilename);
     
       // Read all lines of the file into a string array.
       // !!! ATTENTION !!! WARNING: only do this when the files to process are relatively small.
       // !!! ATTENTION !!! When in doubt, test it and monitor memory usage.
       // (I suppose files up to 50~100 MB might not be a big problem but larger might, but testing is the way to find out)
       string[] lines = File.ReadAllLines(sourcePathAndFilename, encoding);
     
       // Use a variable to set when the file has changed due to corrections this method makes.
       bool fileHasChanged = false;
       bool lineIsOkay = true;
     
       // appendbuffer is used to keep parts of 'broken' lines.
       string appendbuffer = string.Empty;
     
       // Get lower- and upperbound of the string array into variables.
       int lower = lines.GetLowerBound(0);
       int upper = lines.GetUpperBound(0);
     
       // Loop through all lines in the file from back to front!
       for (int i = upper; i >= lower; i--)
       {
        // If there is still some text in the buffer, append it to the line and make the buffer empty again.
        if (appendbuffer != string.Empty)
        {
          lines[i] = lines[i] + appendbuffer;
          appendbuffer = string.Empty;
          fileHasChanged = true;
        }
     
        // reset lineIsOkay variable, set to true underneath when the line passes the 'line ok' checks.
        lineIsOkay = false;
     
        // !!! ATTENTION !!!
        // Here you need to add some custom code.
        // Describe one or more properties of a correct line, e.g.
        // - first 8 characters are numeric;
        // - The length is at least 12;
        // - It contains at least x delimiters;
        // etc.
        // Then convert this check to C# code. Use your imagination if needed!
     
        // I check for:
        // - the length to be 12 or more characters.
        // - the first 8 characters (the social security number) to be numeric.
        // - at least 4 pipes.
        if (lines[i].Length > 12)
        {
          string checkString = lines[i].Substring(0, 8);
          int intOutput;
     
          lineIsOkay = (Int32.TryParse(checkString, out intOutput));
     
          if (lineIsOkay)
          {
           // Check that the line contains at least 4 delimiters.
           lineIsOkay = (lines[i].Split('|').Length - 1 >= 4);
          }
        }
     
        if ((!lineIsOkay) && (i > lower))
        {
          // Line is NOT OK.
          // I have to append this line to the previous one, except when it is the header.
          // To do this put this line into appendbuffer variable.
          // I use a replacement string for the removed carriage return/line feed.
          // This will be converted back to a carriage return/line feed in the Data Flow.
          appendbuffer = string.Format(“{0}{1}”, crlfReplacer, lines[i]);
          lines[i] = string.Empty; // Package can handle empty lines.
          fileHasChanged = true;
        }
     
       }
     
       // !!! ATTENTION !!! If you do not have old fileformats you can remove this block.
       // Code to handle the old file format, without a Comments column.
       // The code below reformats the file so that it become always in new fileformat, with an empty comment.
       if (headerAppend != string.Empty)
       {
        // This is an old format without the Comments column. Append Comments column header to header and a delimiter (pipe) to detail rows.
        fileHasChanged = true;
     
        for (int i = upper; i >= lower; i--)
        {
         if (lines[i].Length > 0)
         {
          if (i > lower)
          {
            lines[i] += detailAppend;
          }
          else
          {
            lines[i] += headerAppend;
          }
         }
        }
       }
     
       if (fileHasChanged)
       {
        // Here the modified/corrected file is stored to a temporary file, so a “tmp” extension is added to the modified filename.
        // For auditing purposes I keep the original file and not the corrected one,
        // because the latter can always be recreated when needed.
        // Also, when a bug would exist in the creation of the corrected file
        // (initially I had a problem with codepages, which should now be solved),
        // the original file might get lost forever.
        pathAndFilenameToProcess = String.Format(“{0}.tmp”, sourcePathAndFilename);
     
        // Now create a new corrected csv file.
        File.WriteAllLines(pathAndFilenameToProcess, lines, encoding);
       }
     
       // Also change the SSIS Package variable, because the Flat File Source uses it.
       Dts.Variables[“User::PathAndFilenameToProcess”].Value = pathAndFilenameToProcess;
     
      }
     
      /// <summary>
      /// This method checks if the header is as expected. If not, return false.
      /// Also the value of by ref parameters 'header', 'headerAppend' and 'detailAppend' is set.
      /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
      /// Author: Hans Michiels
      /// License: GNU General Public License, see http://www.gnu.org/licenses/
      /// <param name=”sourcePathAndFilename”></param>
      /// <param name=”header”></param>
      ///
      private bool CheckAndReturnHeader(string sourcePathAndFilename,
        ref string header, ref string headerAppend, ref string detailAppend)
      {
       // Initialize return values.
       header = string.Empty;
       headerAppend = string.Empty;
       detailAppend = string.Empty;
     
       // Open the file to read the header line, then close the file again.
       using (StreamReader reader = new StreamReader(sourcePathAndFilename))
       {
        header = reader.ReadLine();
        reader.Close();
       }
       // !!! ATTENTION !!! Here you could do a check on (parts) of the header,
       // so that you do not change files with an unexpected format.
       if (!header.ToLower().StartsWith(“socialsecuritynumber|firstname|lastname|gender|residence”))
       {
        return false;
       }
       // !!! ATTENTION !!! Here you could check for old file formats.
       // Old formats can only be handled when one or more new columns were added at the end of the csv lines.
       // In the example, there is a file format without the Comments column still being delivered.
       // By appending the Comments column with an empty value for those files, they can be processed
       // by the same Data Flow Task using only one Flat File Source Connection.
       if (!header.ToLower().Contains(“|comments”))
       {
        // This is an old format without the Comments column.
        // Append Comments column header to header and a delimiter (pipe) to detail rows.
        headerAppend = “|Comments”;
        detailAppend = “|”;
       }
       return true;
      }
     
      /// <summary>
      /// Determines a text file's encoding by analyzing its byte order mark (BOM).
      /// Defaults to ASCII when detection of the text file's endianness fails.
      /// Source: http://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
      /// With thanks to the author ‘2Toad’.
      /// </summary>
      /// <param name=”filename”>The text file to analyze.</param>
      /// <returns>The detected encoding.</returns>
      public static Encoding GetEncoding(string filename)
      {
       // Read the BOM
       var bom = new byte[4];
       using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
       {
        file.Read(bom, 0, 4);
       }
     
       // Analyze the BOM
       if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
       if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
       if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
       if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
       if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
       // return Encoding.GetEncoding(1252); // Without BOM, default to Windows Codepage 1252
       return Encoding.Default; // Without BOM, the encoding for the OS's current ANSI code page is returned by default.
      }

    When you have adjusted the script for your file, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Task Editor, close it by pressing OK.

    Developing the Data Flow Task

    Add a Data Flow Task to the Control Flow and name it DFT – Import Customer Details.
    Then add the following components to it:

    Flat File Source FF_SRC – CustomerDetails

    From the SSIS Toolbox, drag a Flat File Source to the Data Flow area and create a new Flat File connection manager.
    SSIS-S01E03-121

    If you want to rebuild the demo, download the csv files (and C# script) here.
    Configure the Flat File Connection as shown in the picture below:
    SSIS-S01E03-118

    Choose Advanced in the pane on the left and configure the properties for each column as follows:

    • SocialSecurityNumber: DataType string [DT_STR], OutputColumnWidth 9
    • FirstName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • LastName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • Gender: DataType string [DT_STR], OutputColumnWidth 1
    • Residence: DataType Unicode string [DT_WSTR], OutputColumnWidth 60
    • Comments: DataType Unicode string [DT_WSTR], OutputColumnWidth 1000

    SSIS-S01E03-120
    Example of column properties

    Conditional Split CSPL – Filter out empty lines

    We use a Conditional Split to filter out empty lines in the csv file. Because the SocialSecurityNumber is a required column, we assume that the line is empty (at least it is useless) if SocialSecurityNumber is empty.
    So add an Output named Empty lines with Condition SocialSecurityNumber == “”. If you like, you can check on multiple columns to be empty.
    Rename the default Output to Non-empty lines.

    SSIS-S01E03-122
    Conditional Split Configuration

    Derived Column DC – Re-insert CRLF

    Because the (corrected) file might contain replacements for carriage return/line feeds, we are going to replace them with the original carriage return/line feeds, before the value is inserted into the database table.
    In the Derived Column Transformation Editor, add a derived column, configured as follows:

    • Derived Column Name: Comments
    • Derived Column: Replace ‘Comments’
    • Expression: REPLACE(Comments,@[User::CRLF_Replacer],@[User::CRLF])

    SSIS-S01E03-123
    The Derived Column Transformation Editor

    OLEDB Destination OLE_DST stg DimCustomer

    To rebuild the demo, create a table to store the rows from the data flow task,
    add a OLEDB Destination that uses this table.

    destination_table.sql

    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO
    CREATE TABLE [stg].[DimCustomer](
     [SocialSecurityNumber] [char](9) NOT NULL,
     [FirstName] [nvarchar](50) NULL,
     [LastName] [nvarchar](50) NULL,
     [Gender] [char](1) NULL,
     [Comments] [nvarchar](1000) NULL,
     [Residence] [nvarchar](60) NULL,
     CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED
    (
     [SocialSecurityNumber] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO
    SET ANSI_PADDING OFF
    GO

    The Control Flow – File System Tasks

    Finally the File System Tasks archive the original file and delete the corrected file (if it exists).
    For auditing purposes I keep the original file and not the corrected one, because the latter can always be recreated when needed.
    Also, when a bug would exist in the creation of the corrected file (initially I had a problem with codepages, which should now be solved), the original file might get lost forever.

    Please go back to the picture of the The Control Flow at the beginning of this post, to see how the File System Tasks are connected.
    Next are the printscreens that should help you to configure the File System Tasks.

    SSIS-S01E03-124
    Configuration of “FST – Move original file to archive folder”

    SSIS-S01E03-125
    Configuration of “FST – Move original file to error folder”

    SSIS-S01E03-126
    Precedence constraint of “FST – Delete temporary file” (you need to set it twice).

    SSIS-S01E03-128
    Configuration of “FST – Delete temporary file”

    The proof

    To show you that it works, I run the package in debug mode.
    To be able to have a look at the temporary/corrected files, afterwards, I have disabled the File System Task that deletes this file.
    SSIS-S01E03-130

    When I open the “CustomerDetails.csv.tmp” file after the package has run, I can see that line 15 was corrected.
    From the second screenshot you can see that the CRLF_Replacer value was inserted twice.
    SSIS-S01E03-131
    SSIS-S01E03-132

    And when I run a query on the destination table, I can see that the Comments column contains the original CarriageReturn/LineFeeds!
    SSIS-S01E03-134

    By the way, all the data in the csv files is generated from random firstnames, lastnames and residences and fake social security numbers. Any match with existing persons would be based solely on coincidence.

    Conclusion / Wrap up

    Csv files still are useful today, because they have some advantages when compared with XML files.
    However they also have their limitations, which can be frustrating when loading those files with a Data Flow Task.
    In this blog post I have demonstrated how to overcome a few of those limitations, namely:

    1. carriage return/line feeds in contents of csv files
    2. changing file formats over time

    The solution is based on using a C# Script Task.

    Please be aware that I do not pretend this is the best solution.
    If you are very familiar with C#, you could all of the processsing in the Script Task, using SqlBulkCopy, and you would not need a temporary file.
    However the Data Flow Task offers some advantages, like lookups you can use, without the need to do this in a script.
    If you create good Interface Specification Documents on forehand (for instance containing how to handle carriage return/line feeds in column values) you can force that all sources deliver correct csv-files, and then they do not have to be corrected.

    Download the C# script and csv files here.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

A naming convention for virtual machines and home networks (Windows User Series)

Introduction

If you are a professional working in the Information Technology Domain you might have set up virtual machines to test different software versions, do Proofs of Concept or just use them for the ease of backup and restore.
But what name do you give those virtual machines?
As long as it are only a few, it doesn’t really matter.

Challenge

It becomes a bit of a challenge when you are have different virtual machines all with a specific setup, e.g. different versions of the same software.
Recognizable?

Solution

I want to share my naming convention that I use for virtual machines and my home network.
It is pulled together from different sources in the internet, as well as my own insight, so don’t be surprised if you see parts of other standards.

My device names consist of the following parts:
1. Device type
2. Operating System Line
3. Operating System Version
4. Separator
5. Purpose of the device
6. Sequence number

A naming convention for virtual machines and home networks

1. Device type

Minimum and maximum length

Device type is a single character.

Domain values

Namepart Description
V Virtual Machine
S Physical Server
L Physical Laptop
D Physical Desktop
C Cloud Computer
P Printer
T Tablet Computer
M Mobile Phone

2. Operating System Line

Minimum and maximum length

Operating System Line consists of exactly two characters.

Domain values

Namepart Description
WS Windows Server
WC Windows Desktop/Client
LS Linux Server
LC Linux Client
OX OS X

3. Operating System Version

Minimum and maximum length

Operating System Line consists of 1 to 4 characters.

Domain values

For Windows devices use the internal version number:

Namepart Description
51 Windows Server 2003
52 Windows Server 2003 R2
60 Windows Server 2008
61 Windows Server 2008 R2
62 Windows Server 2012
63 Windows Server 2012 R2
100 Windows Server 2016
A “C” can be added to server versions, if the “Core” version is installed. E.g. 63C
51 Windows XP
52 Windows XP Professional x64
60 Windows Vista
61 Windows 7
62 Windows 8
63 Windows 8.1
100 Windows 10

Get other/older versions from wikipedia.

4. Separator

Minimum and maximum length

The separator is a single character: a hyphen (-).

Domain values

Must be a hyphen (-).

5. Purpose of the device

Minimum and maximum length

Purpose of the device consists of 1 to 4 characters.

Domain values

For Servers:

Namepart Description
DC Domain Controller
FS File Server
PS Print Server
WEB Web Server
ORA Oracle database
SQL SQL Server database
DB other database(s)
EXH Microsoft Exchange
SHP Microsoft Sharepoint
CTX Citrix Server
ESX VMware ESX Server
HV Windows Hyper-V Host

This part is optional for desktop devices.
If implemented, you could use:

Namepart Description
STD or S Standard
DEV or D Used for software development
GRD or G Used for graphical design

6. Sequence number

Minimum and maximum length

The sequence number consists of 1 to 6 characters.

WIN-S01E03-image3
Picture credits: © Yan Zommer | Dreamstime Stock Photos
Numbering your Windows instances.

Domain values

A sequence number making the name unique.
You can decide to make it unique based on purpose only, so you could have:
DWC61-STD001 and LWC61-STD002 (so you do not use 001 for the latter).
If you have a limited number of devices, you could put some intelligence in the number so you know which version of the ‘purpose delivering’ software is running.

Example names

Example name Explanation
DWS63C-HV01 Physical desktop running Windows Server 2012R2 Core, being a Hyper V Host with sequence number 01.
VWS63-SQL141 Hyper V Virtual machine running Windows Server 2012R2 with SQL Server, sequence number 141 (intelligent number, indicates that it is my first virtual machine with SQL Server 2014 installed, I would use 08x for SQL 2008, 09x for SQL 2008R2, 12x for SQL 2012, you get the idea). You could use this if the number of virtual machines per SQL version is less than 10.
VWC100-DEV151 Hyper V Virtual machine running Windows 10, used for development, with sequence number 151 (intelligent number, 15 indicates it’s my Visual Studio 2015 virtual machine, as I have only one of those, 151 is an excellent number).
LWC61-STD0554 Physical Laptop with Windows 7, with software for a ‘standard workspace’, sequence number 0554.

WIN-S01E03-image1
A real life example.

Warning/disclaimer

Do NOT use this naming convention for enterprise networks! Problems that could arise if you use it include:

  • A physical location is not part of this naming convention. In enterprise networks servers often have a physical location reference in their name.
  • The operating system is in the name. This if fine for your own lab virtual machines and home computers, but can be very inconvenient in company networks where the device name is in an inventory system. Enterprises typically do not want devices to be renamed when the Operating System is upgraded.

So.. use this naming convention only for the intended purpose, and take advantage of it.

Conclusion / Wrap up

I posted my own naming convention for virtual and physical computer devices I use “AS IS”.
I do not say it is perfect or useful in all cases.
Use the parts you think are convenient and adjust further to meet your requirements.
I would be thankful if you would like to share your adjustments with me, so I can update this post.
I discourage to use this standard in enterprise networks. For more details see the warning/disclaimer above.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.