(Promotional)
Big Data & Analytics insight 2016: save the date for an insightful conference
(as a speaker I was asked to do a bit of promotion ..)
Introduction
A few weeks ago my blog post was about using a C# Script Component to detect duplicates in a Data Flow Task.
This week I want to discuss it’s brother, the C# Script Task, which can be used in the Control Flow.
Again, extremely flexible, basicly you embed a small C# program inside your package, and you can use the entire .NET framework for all kind of things. I used it to fix malformed csv source files, before processing them in a Data Flow Task.
Pros and Problems
Csv (comma seperated files) have their pros and problems, as compared with other formats, like XML.
A few pros are that they are usually smaller in size than XML files, and easier and faster to load. These are two reasons why csv files still are used a lot these days, despite of the fact that XML files are in some cases more reliable, and more flexible as far as the file contents is concerned.
Unfortunately csv files are not so good in holding memo/comment fields that might contain carriage return/line feeds in it. Memo fields with those characters do not stay on one line in the file, therefore making the csv file more difficult to process (at least the standard flat file source can not handle it in all cases).
Also the structure is very strict. If a column is added to a csv source file, your Data Flow Task will break.
Solution
In this post I will discuss a possible solution for both problems:
- carriage return/line feeds in contents of csv files
- changing file formats over time
This post assumes that you have worked with SSIS before, know what For Each Loop Containers and Data Flow Tasks do, and recognize the csv problems above. It is a big advantage if you can read and understand C# code.
Okay, let’s start!
Scenario
My demo is based on the following hypothetical scenario:
- You need to process multiple CustomerDetails.csv files.
- In this file a pipe (|) is used as separator between the columns.
- The file contains a header line: SocialSecurityNumber|FirstName|LastName|Gender|Residence|Comments
- The Comments column is added later, your package also needs to be able to load historic files without the Comments column.
- CustomerDetails.csv files come from hundreds of locations and systems (hereafter called: sources) world wide. Fixing the csv files in the export processes at all these locations is impossible. Even if some sources deliver correct csv files, your SSIS package cannot rely on the fact that all sources will do so.
Overview
Here are a few pictures of the end result:

The connection managers that are needed

The Control Flow

The Data Flow
Detailed steps
Parameters and variables
First you need to add a few parameters and variables to the package.
Parameters:
- SourceFolder: The folder that that For Each Loop Container reads for csv files.
- ArchiveFolder: The folder to which files are moved after being processed, if no error occurred.
- ErrorFolder: The folder to which files are moved that caused an error during processing.
All parameters are Required and their Data type is String.

The parameters of the package
Variables:
- CRLF: This variable will be filled with a Carriage Return/Line Feed in the Script Task.
- CRLF_Replacer: This is the character string that is used to replace CarriageReturn/LineFeeds in the Comment column of the file. It must be a character string that does not occur in the Comment field. Therefore ~CRLF~ is used. If you want a different character string, you can change it to something else.
- SourcePathAndFilename: The full path and filename of the original filename to process. It is filled by the For Each Loop Container. The value set at design time is unimportant.
- PathAndFilenameToProcess: The full path and filename of the file to process. This can be either the original file, or the modified temporary file. The Flat File Connection Manager uses this path and filename. The value set at design time is unimportant.
The Data type of all variables is String.

SSIS variables needed to do the job
The Control Flow – For Each Loop Container
Add a For Each Loop Container FELC – Load sourcefiles and configure it as a Foreach File Enumerator. Retrieve file name Fully qualified.

On the Collection pane, just below the Enumerator type, press the ellipsis next to Expressions. In the Property Expressions Editor configure the Property Directory to be set by
Expression @[$Package::SourceFolder].

On the Variable Mappings pane, map variable User::SourcePathAndFilename to Index 0

For Each Loop Container configuration
The Control Flow – Script Task
Next, add a Script Task SCR – Fix csv file if needed.
In the Script Task Editor, configure the following:
- ReadOnlyVariables: User::CRLF_Replacer,User::SourcePathAndFilename
- ReadWriteVariables: User::CRLF,User::PathAndFilenameToProcess

Variable configuration in the Script Task Editor
Then press the Edit Script… button.
Don’t get scared by the amount of C# code, if you keep your head cool, it will be fixed in 10 minutes!
Adjust the script as follows:
- In the Namespaces region, if not already there, add:
using System.IO;
using System.Text;

Replace the highlighted part above with the code below
Now remove public void Main() completely and replace it with the code underneath (to avoid malformatting due to html rendering, please download the original csharp script here).
Then check the code for !!! ATTENTION !!! comment. These are points in the script that you might need to change.
/*
(c) Copyright 2016 - hansmichiels.com
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/
/// This method is called when this script task executes in the control flow.
/// Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
/// To open Help, press F1.
/// </summary>
public void Main()
{
// Copy a few SSIS variables to C# string variables.
string sourcePathAndFilename = Dts.Variables[“User::SourcePathAndFilename”].Value.ToString();
string crlfReplacer = Dts.Variables[“User::CRLF_Replacer”].Value.ToString();
// Set the variable that needs to hold a CarriageReturn/LineFeed.
Dts.Variables[“User::CRLF”].Value = string.Format(“\r\n”);
// Call the method that checks if the sourcefile is corrupt and if so, fixes it.
FixFileIfCorrupt(sourcePathAndFilename, crlfReplacer);
// Set returnvalue for script task.
Dts.TaskResult = (int)ScriptResults.Success;
}
/// Csv-file can have lines with carriage return-linefeeds, causing errors during the load.
/// By preinvestigating the file we can correct the file and prevent these errors from ocurring.
/// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
/// Author: Hans Michiels
/// License: GNU General Public License, see http://www.gnu.org/licenses/
/// </summary>
private void FixFileIfCorrupt(string sourcePathAndFilename, string crlfReplacer)
{
string header = string.Empty;
string headerAppend = string.Empty;
string detailAppend = string.Empty;
// Initialize pathAndFilenameToProcess with sourcePathAndFilename. Overwrite later if needed.
string pathAndFilenameToProcess = sourcePathAndFilename;
// Check if the header of the file is as expected, and return the header row through a ref parameter.
// If the file header is not as expected, do not touch the file and exit this method.
if (!CheckAndReturnHeader(sourcePathAndFilename, ref header, ref headerAppend, ref detailAppend)) return;
// Get the encoding of the current file. We need this if we have to create a corrected file.
var encoding = GetEncoding(sourcePathAndFilename);
// Read all lines of the file into a string array.
// !!! ATTENTION !!! WARNING: only do this when the files to process are relatively small.
// !!! ATTENTION !!! When in doubt, test it and monitor memory usage.
// (I suppose files up to 50~100 MB might not be a big problem but larger might, but testing is the way to find out)
string[] lines = File.ReadAllLines(sourcePathAndFilename, encoding);
// Use a variable to set when the file has changed due to corrections this method makes.
bool fileHasChanged = false;
bool lineIsOkay = true;
// appendbuffer is used to keep parts of 'broken' lines.
string appendbuffer = string.Empty;
// Get lower- and upperbound of the string array into variables.
int lower = lines.GetLowerBound(0);
int upper = lines.GetUpperBound(0);
// Loop through all lines in the file from back to front!
for (int i = upper; i >= lower; i--)
{
// If there is still some text in the buffer, append it to the line and make the buffer empty again.
if (appendbuffer != string.Empty)
{
lines[i] = lines[i] + appendbuffer;
appendbuffer = string.Empty;
fileHasChanged = true;
}
// reset lineIsOkay variable, set to true underneath when the line passes the 'line ok' checks.
lineIsOkay = false;
// !!! ATTENTION !!!
// Here you need to add some custom code.
// Describe one or more properties of a correct line, e.g.
// - first 8 characters are numeric;
// - The length is at least 12;
// - It contains at least x delimiters;
// etc.
// Then convert this check to C# code. Use your imagination if needed!
// I check for:
// - the length to be 12 or more characters.
// - the first 8 characters (the social security number) to be numeric.
// - at least 4 pipes.
if (lines[i].Length > 12)
{
string checkString = lines[i].Substring(0, 8);
int intOutput;
lineIsOkay = (Int32.TryParse(checkString, out intOutput));
if (lineIsOkay)
{
// Check that the line contains at least 4 delimiters.
lineIsOkay = (lines[i].Split('|').Length - 1 >= 4);
}
}
if ((!lineIsOkay) && (i > lower))
{
// Line is NOT OK.
// I have to append this line to the previous one, except when it is the header.
// To do this put this line into appendbuffer variable.
// I use a replacement string for the removed carriage return/line feed.
// This will be converted back to a carriage return/line feed in the Data Flow.
appendbuffer = string.Format(“{0}{1}”, crlfReplacer, lines[i]);
lines[i] = string.Empty; // Package can handle empty lines.
fileHasChanged = true;
}
}
// !!! ATTENTION !!! If you do not have old fileformats you can remove this block.
// Code to handle the old file format, without a Comments column.
// The code below reformats the file so that it become always in new fileformat, with an empty comment.
if (headerAppend != string.Empty)
{
// This is an old format without the Comments column. Append Comments column header to header and a delimiter (pipe) to detail rows.
fileHasChanged = true;
for (int i = upper; i >= lower; i--)
{
if (lines[i].Length > 0)
{
if (i > lower)
{
lines[i] += detailAppend;
}
else
{
lines[i] += headerAppend;
}
}
}
}
if (fileHasChanged)
{
// Here the modified/corrected file is stored to a temporary file, so a “tmp” extension is added to the modified filename.
// For auditing purposes I keep the original file and not the corrected one,
// because the latter can always be recreated when needed.
// Also, when a bug would exist in the creation of the corrected file
// (initially I had a problem with codepages, which should now be solved),
// the original file might get lost forever.
pathAndFilenameToProcess = String.Format(“{0}.tmp”, sourcePathAndFilename);
// Now create a new corrected csv file.
File.WriteAllLines(pathAndFilenameToProcess, lines, encoding);
}
// Also change the SSIS Package variable, because the Flat File Source uses it.
Dts.Variables[“User::PathAndFilenameToProcess”].Value = pathAndFilenameToProcess;
}
/// <summary>
/// This method checks if the header is as expected. If not, return false.
/// Also the value of by ref parameters 'header', 'headerAppend' and 'detailAppend' is set.
/// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
/// Author: Hans Michiels
/// License: GNU General Public License, see http://www.gnu.org/licenses/
/// <param name=”sourcePathAndFilename”></param>
/// <param name=”header”></param>
///
private bool CheckAndReturnHeader(string sourcePathAndFilename,
ref string header, ref string headerAppend, ref string detailAppend)
{
// Initialize return values.
header = string.Empty;
headerAppend = string.Empty;
detailAppend = string.Empty;
// Open the file to read the header line, then close the file again.
using (StreamReader reader = new StreamReader(sourcePathAndFilename))
{
header = reader.ReadLine();
reader.Close();
}
// !!! ATTENTION !!! Here you could do a check on (parts) of the header,
// so that you do not change files with an unexpected format.
if (!header.ToLower().StartsWith(“socialsecuritynumber|firstname|lastname|gender|residence”))
{
return false;
}
// !!! ATTENTION !!! Here you could check for old file formats.
// Old formats can only be handled when one or more new columns were added at the end of the csv lines.
// In the example, there is a file format without the Comments column still being delivered.
// By appending the Comments column with an empty value for those files, they can be processed
// by the same Data Flow Task using only one Flat File Source Connection.
if (!header.ToLower().Contains(“|comments”))
{
// This is an old format without the Comments column.
// Append Comments column header to header and a delimiter (pipe) to detail rows.
headerAppend = “|Comments”;
detailAppend = “|”;
}
return true;
}
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// Source: http://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
/// With thanks to the author ‘2Toad’.
/// </summary>
/// <param name=”filename”>The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
// Read the BOM
var bom = new byte[4];
using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
file.Read(bom, 0, 4);
}
// Analyze the BOM
if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
// return Encoding.GetEncoding(1252); // Without BOM, default to Windows Codepage 1252
return Encoding.Default; // Without BOM, the encoding for the OS's current ANSI code page is returned by default.
}
When you have adjusted the script for your file, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Task Editor, close it by pressing OK.
Developing the Data Flow Task
Add a Data Flow Task to the Control Flow and name it DFT – Import Customer Details.
Then add the following components to it:
Flat File Source FF_SRC – CustomerDetails
From the SSIS Toolbox, drag a Flat File Source to the Data Flow area and create a new Flat File connection manager.

If you want to rebuild the demo, download the csv files (and C# script) here.
Configure the Flat File Connection as shown in the picture below:

Choose Advanced in the pane on the left and configure the properties for each column as follows:
- SocialSecurityNumber: DataType string [DT_STR], OutputColumnWidth 9
- FirstName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
- LastName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
- Gender: DataType string [DT_STR], OutputColumnWidth 1
- Residence: DataType Unicode string [DT_WSTR], OutputColumnWidth 60
- Comments: DataType Unicode string [DT_WSTR], OutputColumnWidth 1000

Example of column properties
Conditional Split CSPL – Filter out empty lines
We use a Conditional Split to filter out empty lines in the csv file. Because the SocialSecurityNumber is a required column, we assume that the line is empty (at least it is useless) if SocialSecurityNumber is empty.
So add an Output named Empty lines with Condition SocialSecurityNumber == “”. If you like, you can check on multiple columns to be empty.
Rename the default Output to Non-empty lines.

Conditional Split Configuration
Derived Column DC – Re-insert CRLF
Because the (corrected) file might contain replacements for carriage return/line feeds, we are going to replace them with the original carriage return/line feeds, before the value is inserted into the database table.
In the Derived Column Transformation Editor, add a derived column, configured as follows:
- Derived Column Name: Comments
- Derived Column: Replace ‘Comments’
- Expression: REPLACE(Comments,@[User::CRLF_Replacer],@[User::CRLF])

The Derived Column Transformation Editor
OLEDB Destination OLE_DST stg DimCustomer
To rebuild the demo, create a table to store the rows from the data flow task,
add a OLEDB Destination that uses this table.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [stg].[DimCustomer](
[SocialSecurityNumber] [char](9) NOT NULL,
[FirstName] [nvarchar](50) NULL,
[LastName] [nvarchar](50) NULL,
[Gender] [char](1) NULL,
[Comments] [nvarchar](1000) NULL,
[Residence] [nvarchar](60) NULL,
CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED
(
[SocialSecurityNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
The Control Flow – File System Tasks
Finally the File System Tasks archive the original file and delete the corrected file (if it exists).
For auditing purposes I keep the original file and not the corrected one, because the latter can always be recreated when needed.
Also, when a bug would exist in the creation of the corrected file (initially I had a problem with codepages, which should now be solved), the original file might get lost forever.
Please go back to the picture of the The Control Flow at the beginning of this post, to see how the File System Tasks are connected.
Next are the printscreens that should help you to configure the File System Tasks.

Configuration of “FST – Move original file to archive folder”

Configuration of “FST – Move original file to error folder”

Precedence constraint of “FST – Delete temporary file” (you need to set it twice).

Configuration of “FST – Delete temporary file”
The proof
To show you that it works, I run the package in debug mode.
To be able to have a look at the temporary/corrected files, afterwards, I have disabled the File System Task that deletes this file.

When I open the “CustomerDetails.csv.tmp” file after the package has run, I can see that line 15 was corrected.
From the second screenshot you can see that the CRLF_Replacer value was inserted twice.


And when I run a query on the destination table, I can see that the Comments column contains the original CarriageReturn/LineFeeds!

By the way, all the data in the csv files is generated from random firstnames, lastnames and residences and fake social security numbers. Any match with existing persons would be based solely on coincidence.
Conclusion / Wrap up
Csv files still are useful today, because they have some advantages when compared with XML files.
However they also have their limitations, which can be frustrating when loading those files with a Data Flow Task.
In this blog post I have demonstrated how to overcome a few of those limitations, namely:
- carriage return/line feeds in contents of csv files
- changing file formats over time
The solution is based on using a C# Script Task.
Please be aware that I do not pretend this is the best solution.
If you are very familiar with C#, you could all of the processsing in the Script Task, using SqlBulkCopy, and you would not need a temporary file.
However the Data Flow Task offers some advantages, like lookups you can use, without the need to do this in a script.
If you create good Interface Specification Documents on forehand (for instance containing how to handle carriage return/line feeds in column values) you can force that all sources deliver correct csv-files, and then they do not have to be corrected.
(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.