All posts by Hans Michiels

Hans is an Independent Business Intelligence and Datawarehouse Consultant & Microsoft SQL Server Consultant, working in the Netherlands. He has been working in the software industry since 1996, with SQL Server since the year 2001, and since 2008 he has a primary focus on datawarehouse- and business intelligence projects using Microsoft technology, using a Datavault and Kimball architecture. He has a special interest in Datawarehouse Automation and Metadata driven solutions. * Certified in Data Vault Modeling: Certified Data Vault 2.0 Practitioner, CDVDM (aka Data Vault 1.0) * Certified in MS SQL Server: * MCSA (Microsoft Certified Solutions Associate) SQL Server 2012 - MCITP Business Intelligence Developer 2005/2008 - MCITP Database Developer 2005/2008 - MCITP Database Administrator 2005/2008

Implement logging in a Script Task (SSIS Series)


This post is a follow-up on two older posts and will be the last one before my summer break.

Today I want to enhance the SSIS package made in post Fixing corrupt csv files in a SSIS Package (SSIS Series) by using the Plug and Play Logging Solution.


As long as a script runs well, you might not need logging inside. BUT .. when something GOES wrong, having logging in the script can reduce the time needed to troubleshoot the issue.
In the example for fixing csv files, anything can be wrong with a file, causing an error in the script. In that case it is for instance useful to know which file was being processed when the error occurred.


For this to work, you have to install my Plug and Play Logging Solution, which can be downloaded from this post. It is just one SQL Script you have to run, that creates a [logdb] database with objects inside. You can also add those objects to a different database that already exists.
It is however not mandatory to rebuild the Fixing corrupt csv files in a SSIS Package example, as you can also add logging to your own Script Task.


This post is a enhancement on the Fixing corrupt csv files in a SSIS Package post, so you might need to read that post first, if you did not do that already.

Therefore I will dive directly into the changes that are needed to add logging in the Script Task.
This will be done with lots of screenprints with some comment in between.

SSIS-S01E07-169Adding a package parameter “LoggingConnectionString”
First you need to add a package parameter LoggingConnectionString that can be used in an expression of the OLE DB Connection and as input variable for the Script Task.

SSIS-S01E07-170OLE DB Connection configuration ..
Then add an OLE DB Connection for the logging database.

SSIS-S01E07-171OLE DB Connection properties ..
Use an expression to let the ConnectionString of the OLE DB Connection be equal to the value of the package parameter LoggingConnectionString.

SSIS-S01E07-173OLE DB Connection – Expression configuration [1]

SSIS-S01E07-172OLE DB Connection – Expression configuration [2]

SSIS-S01E07-174Final result
By the fx icon you can see that the connection manager uses an expression.

SSIS-S01E07-176Enable logging
Now enable logging for the package.

SSIS-S01E07-177Enable logging [2]
Add a log provider for SQL Server and let it use the logdb.oledbConnection by selecting it under the Configuration column header.

SSIS-S01E07-178Enable logging [3]
Then select all events. Filtering on what is actually logged is done by the logging solution (by the value of @MaxMessageClass, see this blog post for more details).

SSIS-S01E07-179Select the Script Task
Select the Script Task and add the following Variables to ReadOnlyVariables:

  • System::ExecutionInstanceGUID
  • System::PackageID
  • $Package::LoggingConnectionString

  • SSIS-S01E07-180The added ReadOnlyVariables in the red rectangles

    Below you will find a number of screenprints of the script task to talk you through the changes.
    You can download the C# script here.

    First make sure the Namespaces region is as shown.
    Then fold the namespace with the guid in the name, and paste the entire namespace HansMichielsCom.PlugAndPlaySSISLoggingSolution underneath it.
    This code is put in a separate namespace, so that it could also be placed in a .NET assembly that is added to the GAC (Global Assembly Cache). When you would do this, you do not have to add the code to every Script Task.
    For the example of today, we just put this namespace inside the Script Task to make things not too complicated for now.

    SSIS-S01E07-182Using the HansMichielsCom.PlugAndPlaySSISLoggingSolution namespace
    As a result, you have to tell the first guid-like namespace, that you want to call code inside the second namespace. Therefore add the using statement as shown above.

    SSIS-S01E07-183Constant used for logging
    Below you will see some printscreens with changed parts in the script.

    SSIS-S01E07-184Method GetSsisLogWriter to instantiate a SsisLogWriter object

    SSIS-S01E07-187Method Main is extended with logging.

    SSIS-S01E07-188Pass the logWriter as parameter to other methods ..

    SSIS-S01E07-189IMPORTANT: Bugfix in CheckAndReturnHeader!

    SSIS-S01E07-190IMPORTANT: Bugfix in CheckAndReturnHeader!
    (header == null) is added to cope with empty files.

    Testing, testing, one, two ..

    SSIS-S01E07-191Test preparations [1]

    SSIS-S01E07-193Test preparations [2]

    SSIS-S01E07-194Test execution

    SSIS-S01E07-195Test result: logging rows done inside the script are in the log table.

    Conclusion / Wrap up

    In this post I have demonstrated how to implement logging in SSIS Script Tasks using my Plug and Play Logging Solution.
    This type of logging gives more control on what to log and how to log it than when you implement logging using SSIS events.
    The examples given are very basic. You can use your imagination to implement logging of errors using a try .. catch block, or use all available parameters of logWriter.AddLogEntry to change the Retention Class, Message Class, and so on.

    In the summer I will take some time for study, reflection, holiday, and still .. work.
    My next post will be early September at the latest, maybe earlier.

    Download the C# script here.

    (c) 2016 – Do not steal the contents – spread the link instead – thank you.

Zeros, bloody zeros! (Data Vault Series)


I must admit I have a weakness for British humour.
When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.

DV-S01E05-meetingsMr John Cleese

But, obviously, this post is not about meetings but about zeros.


I can be short about that: leading zeros in business key values.
For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
This would not be a problem it is would always be delivered exactly like that. But to be honest, you can and will not know that upfront. Even this might be the case now, it might not be in the future.
When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem.
Because every business key gets a different row in the hub, and this customer now exists multiple times!

DV-S01E05-zeros(No acting here, this is how I look sometimes)


If you are familiar with Data Vault, you might already think of same-as-links to solve this.
But I think the solution should be implemented earlier, to avoid having multiple hub rows.
Simply always remove leading zeros when the sourcecolumn is (part of) a business key (either primary or foreign key) and seems a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty impossible that 001806 and 1806 would refer to two different customers.
Unless, of course, they would come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when then leading zeros dropped off, satellite rows of different customers (in different source systems) could end up as connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance, CRM.1806 and ERP.1806.
In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.

How to implement the solution

There are many ways to remove leading zeros. When I was searching for this I had two requirements:

  • No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
  • No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.

After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, my credits for his work), but appeared to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
So I had to add some logic to it: a SELECT CASE in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) to the SSIS expression.
Furthermore I came on a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
So .. this is what it has become:



      WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
      WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
      ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
   END AS [id_without_zeros_method1],

      WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
      WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
            REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
            , '0', ' ') -- replace '0' with ' '
            ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
            , ' ', '0') -- change ' ' back to '0'
            , '!#!', ' ') -- change '!#!' back to ' '
   END AS [id_without_zeros_method2]
        TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
    SELECT N' 00000 '
    SELECT N'00'
    SELECT N' '
    SELECT ' 0099990 A '
    SELECT '-5550'
    ) example

SSIS Expression (can be used in Derived Column)

(LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

DV-S01E05-151In a Derived Column Transformation this looks for instance like this

Conclusion / Wrap up

In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
This post also contains different ways to remove leading zeros, two for T-SQL and one for a SSIS expression.

(c) 2016 – Do not steal the contents – spread the link instead – thank you.

SQL Server 2016 tempdb configurations benchmark (SQL Server Series)


When you have installed SQL Server 2016 you will have noticed that the configuration for tempdb has changed.
Microsoft made a proper tempdb configuration easy to do via the GUI during installation.

Today I want to go through the settings offered and will measure performance of different setups.

What are the options?

During installation you can set the following values:
For TempDB datafiles:

Setting Default value
Number of files 4
Initial size (MB) 8
Autogrowth (MB) 64
Data directories Default data directory, in my case C:\MSSQL\Instance root\MSSQL13.MSSQLSERVER\MSSQL\Data

For TempDB logfile:

Setting Default value
Initial size (MB) 8
Autogrowth (MB) 64
Log directory Default data directory, in my case C:\MSSQL\Instance root\MSSQL13.MSSQLSERVER\MSSQL\Data

Configuration screen during SQL Server 2016 installation (I changed the directories already).

So what is the best setup?

This subject is not new for SQL Server 2016, but the installation procedure now just gives a better opportunity to set it up right the first time.
There are roughly five differentiators in the setup:

  • Initial size (MB)
  • Autogrowth (MB)
  • Number of data files
  • Location of data files
  • Read/write performance of the different storage media

What are the considerations when setting these values?

Initial size (MB)

A larger initial size will accomplish two things:

  1. It will take longer before the file will have the need to grow.
  2. The file can be stored less fragmented on disk.

If you have enough disk space, the larger the better, there is no real downside on making the initial size quite large. More about this in my conclusion.

Autogrowth (MB)

With autogrowth the choice is a bit more delicated, as a small or large value will both have pros and cons, as pointed out in the schedule below.

Autogrowth value Pros Cons
Small More frequent smaller growth actions will cause less noticable hick ups TempDB datafiles will become more fragmented.
Large Tempdb datafiles will become less fragmented. Less frequent larger growth actions might cause noticable hick ups

Number of data files

In general, the recommendation is one data file per (logical) processor. You can configure more files than the number of (logical) processors, but do this only if you notice that this gives a better performance.

Location of data files

Locating the data files on different drives can have two effects:

  1. Queue length per disk drive in Windows (all scenarios, also with virtual disks residing on the same physical drive).
  2. Physical writes to the diskdrive can go in parallel (only when different physical drives).

The tests

The testcases

Creating combinations of all differentiators would just create *a lot* of testscenarios.
So I will be a bit pragmatic.
The testcases are chosen, so that they can be compared with at least one other testcase, of which one differentiator is different.

Test # # Data files Init size (MB) Auto growth (MB) # Fol- ders Scenario
0 4 8 (x4) 64 1 Default installation (data files on OS disk)
1 4 8 (x4) 64 1 Data files on dedicated drive
2 4 8 (x4) 64 4 Data files on more drives
3 4 2048 (x4) 64 4 Data files on more drives + Larger initial size
4 4 8 (x4) 256 4 Data files on more drives + Larger autogrowth
5 4 2048 (x4) 64 1 Data files on dedicated drive + Larger initial size
6 4 2048 (x4) 64 1 Data files on dedicated drive + Larger initial size + TempDB on different SSD
7 8 2048 (x8) 64 1 Data files on dedicated drive + Larger initial size + More data files
8 4 8 (x4) 64 1 Default installation (data files on OS disk) + Larger initial size

Test results

I will present you both two graphs as well as the raw data. I think especially the first graph gives some useful insights, as it compares all testcases with testcase 0 (default installation), by using an index 100 for testcase 0. If a different testcase performs better, the index will be less than 100. Also, in this way, the average result of both testsets (A and B) can be shown.
I left testcase 6 out of the graph, later I will tell why I did this.
MSSQL-S01E03-testresult-indexIndex-based comparison of testresults

MSSQL-S01E03-testresult-secondsExecution times of testcases for testset A and B

Test results figures

Test # Scenario Testresult testset A (ms) Testresult testset B (ms)
0 Default installation 30359 232795
1 Data files on dedicated drive 28344 261467
2 Data files on more drives 22514 248484
3 Data files on more drives + Larger initial size 28040 210223
4 Data files on more drives + Larger autogrowth 24173 200632
5 Data files on dedicated drive + Larger initial size 19608 192587
6 Data files on dedicated drive + Larger initial size + TempDB on different SSD 123289 387646
7 Data files on dedicated drive + Larger initial size + More data files 19692 211501
8 Default installation + Larger initial size 20149 361844


Data files on dedicated drive + Larger initial size (test 5) gives the best performance for both testsets. Putting tempdb files on multiple disks can not win from this, however the differences are relatively small.

Surprisingly performance dramaticly decreases when the virtual disk on which tempdb is stored is moved to a different (faster) SSD (this is test 6)! I have no explanation for this. Therefore I will leave test 6 out of the analysis, by doing this physical storage is the same for all testcases and changes in performance are only caused by the other differentiators.

My most important conclusion is that the default values offered during installation are quite okay, except for two things:

The Initial size (MB)

You should really set the initial size to a big number (gigabytes if possible). Note that if you leave the number of files 4 (which I recommend), the space used is four times the size that you use for Initial size.
If you have a dedicated disk for tempdb (which is recommended), you might want to set the initial size of the datafile(s) to fill up the disk. You should then disable Autogrowth.
Also on a production server, check the size of tempdb files after a week of uptime (also the SQL instance may not have been restarted). Say the tempdb size of all files together (can also be one file) is 22GB. You could then set your initial size to 28GB in total, so 7GB per file if you configured 4 files. This would mean that during normal operation, the tempdb never has to grow.

Dedicated disk

It is a common best practive to put tempdb on a seperate disk so that the Windows Disk Queue does not have to compete with other read/write actions to that disk.

Other settings are somewhat less important but … think for yourself (my disclaimer)

In my opinion both putting tempdb files on multiple different disks and setting Autogrowth to a different value do not offer much performance benefit.
The Autogrowth value is even irrelevant if the initial size is so large that temdb never has to grow during normal operation.
But every server is different! Test and think for yourself.
My testresults are presented “AS IS” in the hope they will be useful as a starting point for further own analysis. I am not responsible for any disadvantage you might have by using my test results for your tempdb setup.

What if SQL Server is already installed and/or it is an older version?

No worries, you still can change the tempdb configuration.
You can move and add files using SQL scripts.
Initial size and Autogrowth you can also set with Management Studio.
MSSQL-S01E03-166Go to properties of tempdb

MSSQL-S01E03-167Here you can change Initial size and Autogrowth

By the way, I noticed that SQL Server does not delete moved tempdb files. So if you move tempdb files, you should delete the files on the old location, so that they do not use disk space and do not cause confusion for administrators.

Test justification

  • All tests are done on a Hyper V Virtual Machine running Windows Server 2012 R2 and SQL Server 2016 Developer Edition.
  • Change of physical storage type is done by moving vhdx files to different physical disks, and then change the file location in the settings of the Hyper V Virtual Machine (while it is not running, of course).
  • Each test is done on an initial copy of the Virtual Machine, therewith the startsituation is equal for every test (e.g. no tempdb growth by previous tests).
  • Tests where done with two sets of scripts (set A and B) that inserted, updated and deleted rows in temporary tables in tempdb. Set A consists of three sql scripts, and set B of five.
  • All scripts were developed, so that all tempdb inserts started at the same time (maybe milliseconds difference in starttime only). This to ensure that every testrun was executed in the same way.
  • All scripts were executed in SQL Server Management Studio.
  • The first file of tempdb is stored on its own (virtual) disk except for testcase 0, where tempdb files are on the C-drive where Windows is installed.
  • When spreading files over multiple virtual disks, the disks created for database files, log files, and backup files are used. This was done for pragmatic reasons during the test. This was defendable, because no other activity on databases and log files was going on during the test. It is however not recommended to spread tempdb like that in production environment, unless every disk is dedicated to be used for tempdb only.
  • The host machine had the same applications open during every test, being a Windows Explorer and Microsoft Excel. No other applications that used CPU and could influence performance of the virtual machines were running.
  • The tempdb log file was placed on a seperate disk (the same one) for each test. Therewith it was no subject of performance investigation. This would have at least doubled the number of testcases.

Wrap up

Microsoft did a good job in improving the setup options for tempdb in the installation procedure.
It is a bit of a pity that they did not go the last mile, giving the tempdb data files a more realistic Initial size.
Luckily this is easy to change during installation, and it is a quick win for anyone installing SQL Server 2016.

Download all the used scripts here.

(c) 2016 – Do not steal the contents – spread the link instead – thank you.

Zip sourcefiles in a SSIS Package (SSIS Series)


As I am in a (un)zipping mood, this week the opposite problem of last week.
The example is not so ready-to-use as in my previous post, as it depends on choices made in your development process of other SSIS packages.
Still I hope you can adjust this example to match your needs.


If you have imported and archived (csv- or xml) files, these files can take up a lot of disk space. The archive folder can also become overwhelmingly full with ten thousands of files, which can make it slowly accessible and difficult to manage.


To tackle this problem I have made a fully configurable SSIS package, that can be executed from a different package with an Execute Package Task.
The package is made using SSIS 2014.

The zipping itself is done with 7za.exe, for more information on how to get and install this free program see my previous blog post.


When I used this concept, a different SSIS package moved files to a subfolder of an archive folder after they were processed, but their name was prefixed with the LoadCycleID with leading zeros, for instance 001345_customers.xml.

The archive folder and subfolder were separated, because the archive folder was an environment variable and the same for all packages, while the subfolder was different for different SSIS packages, and contained the name of the source system of the files.
If you only have one archive folder, just change the example to use that instead of the concatenation of archive folder and subfolder.


Here is a picture of the end result:

The Control Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.


Name Data type Required Description
_7zaPathAndFilename String False The full path and filename of 7za.exe.
ArchiveFolder String True The root folder for archiving processed files.
ArchiveSubFolder String True The subfolder of the ArchiveFolder in which the files to be zipped exist, the subfolder can for instance be named after a source system like SAP or CRM.
LoadCycleID Integer True The ID of the LoadCycle that is processing these files. In your situation it could also be called ExecutionID, SessionID and so on, the name is not really important.

Package parameters


However not all variables might be necessary, I like the concept of being able to configure all values in one place.
The data type of all variables is String, except LoadCycleID which is an Integer.
Every variable uses an expression to set its value.
Here is a detailed list of all variables and the expression used for each one.

Name Expression
ArchiveFilesToDeleteInFolder @[$Package::ArchiveFolder] + "\\" + @[$Package::ArchiveSubFolder]
ArchiveFilesToDeletePattern @[User::LoadCycleIDString] + "*"
ArchiveFilesToZipPattern @[$Package::ArchiveFolder] + "\\" + @[$Package::ArchiveSubFolder] + "\\" + @[User::LoadCycleIDString] + "*.*"
ArchiveFilesWorkingDirectory @[$Package::ArchiveFolder]
ArchiveZipFile @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
ArchiveZipPathFile_Final @[$Package::ArchiveFolder] + "\\" +@[$Package::ArchiveSubFolder] + "\\" + @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
ArchiveZipPathFile_Temp @[$Package::ArchiveFolder] + "\\" + @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
LoadCycleID @[$Package::LoadCycleID]
LoadCycleIDString RIGHT("00000" + (DT_STR, 10, 1252) @[$Package::LoadCycleID], 6)

Important note: if you expect that the value for LoadCycleID will grow beyond 999999 over time, you should add more prefixed zero’s to LoadCycleIDString and increase the number 6, the second parameter of the RIGHT function.

SSIS variables of the package

The Control Flow

The control flow is quite straight forward, only three tasks and we’re done.

EPR Zip files in archive subfolder

This task creates one zipfile containing multiple other files in the ArchiveSubFolder.
From the SSIS Toolbox drag an Execute Process Task to the Control Flow panel and open the Execute Process Task Editor.
On the General and Process panes you do not have to change anything.

No changes needed here

On the Expressions pane, press the ellipsis next to Expressions. In the Property Expressions Editor add the following:

Property Expression
Arguments "a "+ "\"" + @[User::ArchiveZipFile] + "\" " + "\"" + @[User::ArchiveFilesToZipPattern] + "\" -mx=9"
Executable @[$Package::_7zaPathAndFilename]
WorkingDirectory @[User::ArchiveFilesWorkingDirectory]

Please note that in the Arguments expression -mx=9 is used. This is the compression level, level 9 means “ultra” compression. This level gives the smallest zip files but is also the slowest compression level. You can for instance also use 5 for “normal” or 3 for “fast” compression, if this is what you want.

Expressions for the Execute Process Task

SCR Delete uncompressed files

This task deletes the files that were added to the zip file in the previous task.

In the Script Task Editor, add both variables User::ArchiveFilesToDeleteInFolder and User::ArchiveFilesToDeletePattern to ReadOnlyVariables, then press Edit Script ….

Then in the namespaces region add one namespace, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
// Added namespace:
using System.IO;

Adjust public void main() as follows:

public void Main

public void Main()
  // Copy SSIS variable values to C# string variables.
  string archiveFilesToDeleteInFolder = Dts.Variables["User::ArchiveFilesToDeleteInFolder"].Value.ToString();
  string archiveFilesToDeletePattern = Dts.Variables["User::ArchiveFilesToDeletePattern"].Value.ToString();

  // Delete each file that just has been added to the zip archive.
  foreach (string f in Directory.EnumerateFiles(archiveFilesToDeleteInFolder, archiveFilesToDeletePattern))

 Dts.TaskResult = (int)ScriptResults.Success;

WARNING: Make sure that no files that meet the file name pattern of files to add to the zip file are added to the ArchiveSubFolder while or after the zipfile is (being) created. Otherwise files that are NOT zipped might get deleted!

Download the script here.

FST Move zip file

The File System Task is to move the zip file to the ArchiveSubFolder, because the zipfile is initially created in a temporary folder and not in the ArchiveSubFolder.
This is to prevent that it would be deleted by coincidence, if you use a value for ArchiveFilesToDeletePattern that ends with .*
Configure the File System Task as shown in the picture below:

File System Task Configuration

Download the script here.

Let the car drive

Time to test the package and tell the result in pictures (if you don’t mind).

Before: files of LoadCycleID 001346 need to be zipped ..

Execute the package ..

All csv files replaced with one zip file ..

.. that contains the csv files ..

Conclusion / Wrap up

In this blog post you could read the steps needed to build a SSIS package that creates a zip file containing a number of files in an archive(sub)folder.
This makes the archive smaller and easier to manage.
However beware that you only delete files that were added to the zip file. For more details see the warning above.

(c) 2016 – Do not steal the contents – spread the link instead – thank you.

Unzip specific files from lots of zipped archives (SSIS Series)


This week I want to discuss the trouble you may have when searching hundreds of zip files for specific files inside them.
At least I faced this problem some time ago, and solved it using SSIS and the commandline version of 7-zip, named 7za.exe.
The choice for SSIS was purely based on convenience, as you could tackle this problem also using Powershell, C# or almost any other scripting language.
But I thought SSIS is just a neat and quick way to do this.


Finding specific files in huge amounts of zip files.


To tackle this problem I have made a fully configurable SSIS package, that can be executed with different parameters.
The main task of the package is to extract specific files (based on a file name pattern) from zipfiles in a folder and – optionally – subfolders.
The unzipping itself is done with 7za.exe, which is the commandline version of 7z, the great open source zip program, that exists already for several years and is used by millions of users.

By the way, the package is made using SSIS 2014.

Preparations: download and install 7za

It is important to download this package from the official 7z web site, just to avoid that you would get a copy from some deceitful web site infected with malware.
To go to the official 7-zip download page click on this link:
7-zip download page
Most important is the 7-Zip Extra: standalone console version.
For the example I have used version 16.02 (2016-05-21) for Windows, but it should also work with version 9.20 (2010-11-18), then called 7-Zip Command Line Version.

If you download the newest version, you also need software to unzip the 7z160-extra.7z file, for instance the regular 32-bit or 64-bit version of 7-zip, which can also be downloaded on the same page.

Download page of 7-zip

I assume you will use the newest version 7z160-extra.7z
Once unzipped, just take the time to read License.txt, readme.txt and history.txt.
Then copy the three files 7zxa.dll, 7za.dll and 7za.exe to a folder on the Windows computer where SSIS runs, usually a Windows Server.
If you run a x64 version of Windows, you can take the x64 version of the 7za files, which are in a subfolder “x64” of the unzipped files.

Unzipped 7za files ..

I have created a subfolder “7za” in the “C:\Program files” folder of my virtual Windows Server, and copied the x64-version of the three mentioned files to it, but you can put them in any folder you like.
When you have done this, the real fun can start ..


Here is a picture of the end result:

The Control Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.


Name Data type Required Description
_7zaPathAndFilename String False The full path and filename of 7za.exe.
FilenamePattern String True The filename pattern of files you want to unzip.
SourceRootFolder String True The main folder where you want to search for zip files.
TraverseSubfolders Boolean False Indication if you want to look only in SourceRootFolder, or also in its subfolders.
UnzipRootFolder String True The folder to which files are unzipped. Important: this may NOT be a subfolder of SourceRootFolder.
ZipFilenamePattern String True The filename pattern of the zipfiles from which you want to unzip files.

Package parameters


Name Data type Description
FLC_UnzipFolder String The subfolder where the files will be unzipped. This is a subfolder of the UnzipRootFolder, with the same relative subpath as the zipfile has in SourceRootFolder. Additionally, the zipfilename has been replaced with a subfolder that is the same as the zipfilename without the file extension.
FLC_ZipPathAndFileName String The zipfile full path, that is target of investigation if any files with FilenamePattern are zipped inside.

SSIS variables needed to do the job

Developing the package

FELC Unzip files

This foreach loop container finds all zip files.
In the Foreach Loop Editor, select Collection in the left pane.
As we want to make the solution generic by using all those parameters mentioned above, we have to add a few expressions.

Foreach Loop container Collection configuration

Add the following three expressions (in my opinion the editor is not very userfriendly, the window is too small, but it can be done):

Property Expression
Directory @[$Package::SourceRootFolder]
FileSpec @[$Package::ZipFilenamePattern]
Recurse @[$Package::TraverseSubfolders]

Please note that these expressions will set the values for Folder, Files and the Traverse subfolders checkbox in the Enumerator configuration part of the screen (inside the purple square in the picture above).
If you close and reopen the Foreach Loop Editor, you will notice that the configuration fields now use the values from the corresponding parameters, except the value Fully qualified for Retrieve file name, which is hardcoded.

Example of how to set an expression..

Then, still in the Foreach Loop Editor, select Variable Mappings in the left pane and map variable User::FLC_ZipPathAndFileName to Index 0, as shown in the picture below.

Variable Mappings

EXPR Set Unzip folder

The next nifty thing that needs to be done, is creating a folder to store the unzipped files. As we do not know if zipped files with the same name exist in multiple zip files, the safest thing to do is to create a directory structure identical to the directory structure of the zip files.
Additionally, the basename of the zip file (so without file extension) is used as an extra subfolder. In this way we are sure that unzipped files with the same name can be stored without overwriting other ones.

To do this, from the SSIS Toolbox drag an Expression Task inside the Foreach Loop container.
Open the Expression Task and paste this expression into the Expression textbox.

@[User::FLC_UnzipFolder] =
FINDSTRING(@[User::FLC_ZipPathAndFileName], "."
+ TOKEN(@[User::FLC_ZipPathAndFileName], ".",
TOKENCOUNT(@[User::FLC_ZipPathAndFileName], ".")),1) -1),
@[$Package::SourceRootFolder], @[$Package::UnzipRootFolder])

Then press the Evaluate Expression button to check that the expression is valid.

Configuring the Expression Task

FSYS Create Unzip folder

In the previous step we have composed the full path to unzip files into, but this folder structure might not exist yet.
Therefore we use A File System Task named FSYS Create Unzip folder to create the folder structure.
The good news is that the File System Task can create multiple (sub)folders at once.
Configure it as shown in the picture.
Configuration of FSYS Create Unzip folder

EPR Unzip Selected Files

Below you see a few screenshots for the configuration of the Execute Process Task named EPR Unzip Selected Files.
The funny thing is, that in the Process pane, you do not have to change anything. Executable and Arguments are set using expressions (coming next) and all the other properties can keep their default values.

Process configuration.

Setting Executable and Arguments using expressions
Arguments (copy-friendly):

"-y x \"" + @[User::FLC_ZipPathAndFileName] + "\" -o" + @[User::FLC_UnzipFolder] + " " + @[$Package::FilenamePattern] + " -r"

SCR Remove Empty folders

There is one problem with the steps done so far. If a zipfile does not contain a file that matches the FilenamePattern, then the folderstructure is made for nothing.
However you could just leave these empty folders, I think the solution is a bit better if these empty folders are deleted.
To do this, from the SSIS Toolbox drag an Script Task inside the Foreach Loop container.

In the Script Task Editor, add both variables User::FLC_UnzipFolder and $Package::UnzipRootFolder to ReadOnlyVariables

In the Script Task Editor window, press Edit Script ….

Please note that you can download the C# script here..

Then in the namespaces region add two namespaces, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
// Added namespaces:
using System.IO;
using System.Linq;

Adjust public void main() as follows:

public void Main

public void Main()
  // Copy SSIS variable value of FLC_UnzipFolder to C# string variable.
  string unzipFolder = Dts.Variables["User::FLC_UnzipFolder"].Value.ToString();

  // Source:
  // My credits go to author "Jon" from Berlin.
  // Because not all zipfiles might contain files with the FilenamePattern we are looking for, an empty path might have been created.
  // This script deletes these empty folders, so that in the end only folders will exist with unzipped files or subfolders in it.
  var di = new DirectoryInfo(unzipFolder);
  var root = Dts.Variables["$Package::UnzipRootFolder"].Value.ToString(); // no trailing slash!
  while (di.FullName != root
          && !di.EnumerateFiles().Any()
          && !di.EnumerateDirectories().Any())
    di = di.Parent;

 Dts.TaskResult = (int)ScriptResults.Success;

Download the script here.

The proof of the pudding is in the eating

Time to do some testing.
I have created a few folders with subfolders and subsubfolders. You can see the folder structure in the last picture below.
As you might have noticed the default values for my parameters already prepared for the testcase: I want to unzip only customer files from zip files with 201605 in the name.

Executing the package in Debug mode..

Below you can see the result. Only customer files are unzipped. The subfolders under the Archive folder that contain no customer zip files (but only currency files), do not exist under the Unzipped folder.

Folder structure of Archive and Unzipped rootfolders..

Unzipped “Customer” files ..

Important notes for using with huge amounts of zip files

From my experience I must also warn you. If you run this package on a folder containing gigabytes of zipfiles, there are two things to be aware of:

  1. The execution of the package can potentially take a long time, several hours is not unusual.
  2. Make sure you have enough disk space. Suppose you have the following hypothetical scenario: zip files: 100GB, average zip compression rate: 20%, percentage of files to be unzipped: 30%. In this case you are going to unzip files that have a zipped size of 30% of 100GB, is 30GB. Once unzipped, their size will be 30GB * (1/0,20) = 150GB. You need to have that disk space available on the disk where the Unzipped folder resides.

Conclusion / Wrap up

With this SSIS package you can selectively unzip files from archived zip files.
When using it, make sure you have enough disk space, and be aware that the package can run for several hours, depending on the number and (average) size of the zip files to process.

(c) 2016 – Do not steal the contents – spread the link instead – thank you.

11 Things you should know about Biml – an overview (Datawarehouse Series)

Intended audience and purpose of this post

This post gives an overview of a number products and features related to Biml on a high level.
Therewith it is a starting point for further learning and investigating.


Biml means Business Intelligence Markup Language.
What else do you need to know about it, if you are a professional worker in the business intelligence and data warehouse domain?


11 Things you should know about Biml – an overview

  1. Who invented Biml?
  2. What is Biml?
  3. What are BimlExpress and BIDS Helper?
  4. What is BimlScript?
  5. What is BimlFlex?
  6. What are Mist and BimlStudio?
  7. What are BimlBundles?
  8. What are BimlTransformers?
  9. Is the free part of Biml a way to get me addicted to it?
  10. How do I know that Biml is a proven technology that is going to stay?
  11. Where can I find more information on the internet?

1. Who invented Biml?

Biml and all tools around it is created by Varigence, a company with CEO Scott Currie, a former Microsoft Employee.

Back to top

2. What is Biml?

Biml – the markup language – is an XML dialect to describe Microsoft SSIS Packages and SSAS Cubes.
Everything you can do with SSIS, you can do with Biml (quote of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016).
The goal of Biml is to improve productivity of Data warehouse development.
But the biml ‘ecosystem’ consists of more than only the Markup Language. Other components are:

  • Biml Script
  • BimlExpress and BIDS Helper
  • BimlFlex
  • BimlStudio (also/formerly known as Mist)
  • BimlBundles
  • BimlTransformers

Back to top

3. What are BimlExpress and BIDS Helper?

BimlExpress and BIDS Helper both are Add-ins for Business Intelligence Development Studio (actually the Visual Studio version with the project types needed for Business Intelligence).
BimlExpress is the successor of BIDS Helper and one of it biggest advantages is syntax highlighting and intellisense.
It is really worth checking them out, because they can improve your productivity, especially when you use metadata and design patterns for SSIS packages.
You can download and install BimlExpress. You have to register but can use the product at no cost.

Back to top

4. What is BimlScript?

BimlScript is what makes Biml (the markup language) even more interesting. With BimlScript you can write .NET Code (C#) and do all kind of interesting stuff, like accessing your metadata, using for each loops, and generate SSIS packages using a design pattern.
I have generated packages to load dimensions of a star schema, and I intend to blog about it later this year in more detail.

Back to top

5. What is BimlFlex?

BimlFlex is a datawarehouse automation framework of Varigence. It can greatly improve productivity when building data warehouses.
It uses Predefined patterns for data warehouse automation and Extension Points that you can use to change the standard automation patterns on certain ‘change points’, or can build exceptions for the pattern of this one very special table.
BimlFlex is part of BimlStudio and therefore a commercial (paid) product.

Back to top

6. What are Mist and BimlStudio?

Mist and BimlStudio are the same tool. BimlStudio is just the new name, due to strange associations with the “Mist” name in different languages, for example in German (What does the German word “Mist” mean?). This is why they changed the name.
Also the names for all products and techniques associated with Biml now start with Biml, which makes them more recognizable.
BimlStudio has a number of extra features when compared with BimlExpress.
BimlStudio is a paid product, you can buy a perpetual license or a monthly subscription.


Back to top

7. What are BimlBundles?

BimlBundles are part of BimlStudio and protect your source code while your customer still can make changes at framework extension points.
Koos van Strien did a very good job at explaining BimlBundles in more detail here.

Back to top

8. What are BimlTransformers?

With BimlTransformers you can define and store changes on Biml code, instead of changing the Biml code itself. If you have worked with Visual Studio/.NET before, you can compare it with changes you make for your config file for different environments.
Koos van Strien did a very good job at explaining BimlTransformers in more detail here.

Back to top

9. Is the free part of Biml a way to get me addicted to it?

Nope. Although Varigence works sales model is a Fremium model (a free product that can be extended/enhanced with paid products), the promise of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016, is that all features that are released for free will stay for free. You can count on that.

Back to top

10. How do I know that Biml is a proven technology that is going to stay?

Biml already exists for about 8 years and has proven itself already. Varigence is a global player and steady company that does not suddenly disappear. I am not concerned if Biml will exist 10 years from now, and you should not have to be either. Even in the utterworst case, you would still have the SSIS packages that Biml has generated for you, and you can edit/maintain them without Biml.

Back to top

11. Where can I find more information on the internet?

To read more about Biml, here are a few interesting web sites, blog posts and Linked In Usergroups:

Back to top

Conclusion / Wrap up

Biml is here to stay and adds a lot of value, even if you use only the free parts of it. Varigence can only be praised for that.
In this blog post I shortly described a number of products and techniques that are part of the Biml ‘ecosystem’.
I do not have a commercial interest in promoting Biml, I am just a Business Intelligence Consultant who likes Biml a lot.

(c) 2016 – Do not steal the contents – spread the link instead – thank you.

Adding custom header and footer to csv files (SSIS Series)


This article wasn’t planned. SSIS again? No I wanted to write a post about something else.
Nevertheless I got this question on one of my previous posts and thought, I might as well create a blog post about it.


“Hi Hans,
I have created my ssis package for flat file destination. My problem is that I want to add header and footer in my flat file destination to send it to FTP location format will be


The last line contains no of row count including header and trailer
Please let me know how to do that.. I am completely stuck with that and no good luck.. “

So to summarize, both the header and the footer line of the csv file need to contain dynamic content.
Also each row starts with a ‘row type’: 0 for the header, 1 for every detail row, and 9 for the footer.


As an example I just use a simple Data Flow Task with an OLEDB Source connected to table DimCurrency from the AdventureWorksDW database.
The data then flows to a Script Component used as a Destination.
Here are a few printscreens to demonstrate what I mean. The Script Component will create the file.
The package is made using SSIS 2014.

Developing the package

The demo package just contains one Data Flow Task

Design of the Data Flow Task

OLEDB Source: DimCurrency from the AdventureWorksDW database

OLEDB Source: select all three columns

When you drag the Script Component to the Data Flow, you will get this dialog. Use the Script Component as a Destination

Variables that the package contains.

In the Script Transformation Editor, add both variables User::HeaderTimestamp and User::TargetCsvPathAndFileName to ReadOnlyVariables

Now comes the most interesting part! We use the endless possibilities of C# to fix our problem.
In the Script Transformation Editor window, press Edit Script ….

Please note that you can download the entire script here, so you do not have to copy the (sometimes malformatted) code from this webpage..

Then in the namespaces region add two namespaces, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
// Added namespaces:
using System.IO;
using System.Text;

Above public override void PreExecute() add the copyright notice and the following two regions: Configuration and Private variables.
Please note that these regions do not exist yet, so you have to copy the code including the #region and #endregion lines.

New regions
  (c) Copyright 2016 -
  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.
  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  GNU General Public License for more details.
  You should have received a copy of the GNU General Public License
  along with this program. If not, see

  #region ***** Configuration *****

  // Here you can change the column delimiter and the text qualifier easily.
  private const string _delimiter = “|”;
  private const string _textQualifier = “”;
  // To set the text qualifier to ” use:
  // private const string _textQualifier = “\””; // I recommend to use a text qualifier.
  // Every x rows, write the buffer to the file, and clear the buffer.
  // This is faster than writing every individual line to the file.
  // The higher the value of _bufferRows is, the more memory will be used and the less write actions are done to the csv file.
  // A higher value is a good idea if you have enough memory in your server.
  private const int _bufferRows = 100000;

  #endregion ***** Configuration *****

  #region Private variables

  // Variable to hold the path and filename
  private string _fileName;
  // Variable to count the rows
  private int _rowCount;
  // StringBuilder to hold a buffer of text that still needs to be written to the csv file.
  private StringBuilder _fileContentsBuffer;

  #endregion Private variables

PreExecute() is called before the data starts to flow through. Here we do some preparation and compose the headerline of the file.


 &nbs p;/// <summary>
  /// This method is called once, before rows begin to be processed in the data flow.
  /// </summary>
  public override void PreExecute()

    // Initialize variables
    _rowCount = 0;
    _fileContentsBuffer = new StringBuilder();
    _fileName = Variables.TargetCsvPathAndFileName;
    DateTime headerTimestamp = Variables.HeaderTimestamp;

    // Append the new line to the buffer.
    // This will be written to file with the first detail rows.
    // If there are no detail rows, no file will be created.
    _fileContentsBuffer.AppendFormat(“{1}0{1}{0}{1}{2:ddMMyyyy}{1}{0}{1}{2:HHmmss}{1}\r\n”, _delimiter, _textQualifier, headerTimestamp);

PostExecute() is called after the data has flowed through. Here we append the last lines in the _fileContentsBuffer variable and the footerline to the file.

  /// <summary>
  /// This method is called after all the rows have passed through this component.
  /// You can delete this method if you don't need to do anything here.
  /// </summary>
  public override void PostExecute()
    // If no detail rows, bail out. No file will be created when there are no detail rows.
    if (_rowCount == 0) return;

    using (StreamWriter sw = File.AppendText(_fileName))
      // Append a footer line with the rowcount, example value: 9|105
      // The rowcount is a count of the detail rows, but the footer count
      // needs to include header and footer, so we add 2 to the value of _rowCount.
      _fileContentsBuffer.AppendFormat(“{1}9{1}{0}{1}{2}{1}”, _delimiter, _textQualifier, _rowCount + 2);

Input0_ProcessInputRow(Input0Buffer Row) is the place where all the data processing takes place. As this script is the Destination for the data, we need to do something with those data. A Flat File Destination is not there and will not do this for us.
So here we append every detail row to the _fileContentsBuffer variable. Once in a while, we write the buffer to the file, and clear the buffer again.


  /// <summary>
  /// This method is called once for every row that passes through the component from Input0.
  /// </summary>
  /// <param name=”Row”>The row that is currently passing through the component</param>
  public override void Input0_ProcessInputRow(Input0Buffer Row)
    // Increase the rowcounter

    // Append the new line to the buffer.
    // The format needs to be like “{1}[middle part]{1}\r\n”, so that the row starts and ends with the text qualifier.
    // Replace [middle part] with {1}{0}{1}{x} for every column that you want to append to the file, where x is 2 for column 1, 3 for column 2 and so on.
    _fileContentsBuffer.AppendFormat(“{1}1{1}{0}{1}{2}{1}{0}{1}{3}{1}{0}{1}{4}{1}\r\n”, _delimiter, _textQualifier,
      SafeValue(Row.CurrencyKey), // First column, {2} in the format
      SafeValue(Row.CurrencyAlternateKey), // Second column, {3} in the format
      SafeValue(Row.CurrencyName) // Third column, {3} in the format
             // etc.
    // Every x rows, write the buffer to the file, and clear the buffer.
    if (_rowCount % _bufferRows == 0)
      using (StreamWriter sw = File.AppendText(_fileName))

Multiple overloaded versions of the SafeValue method (for different datatypes like string, int, DateTime, you can add more when needed) add escape characters to the column values when needed.

SafeValue method and overloads
  /// <summary>
  /// Escapes a double quote delimiter in a column value by doubling it.
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(string value)
    // If _textQualifier is null or empty, just return the value.
    if (string.IsNullOrEmpty(_textQualifier)) return value;

    if (value.Contains(_textQualifier))
    // If the value contains one or more text qualifiers,
    // escape them by replacing them with two text qualifiers.
    return value.Replace(_textQualifier, _textQualifier + _textQualifier);
    return value;

  /// <summary>
  /// Overloaded method for DateTime value (meant as an example)
  /// For other data types, create new overloaded methods,
  /// and in that overloaded method convert the value to string,
  /// before calling the 'SafeValue' method with a string parameter.
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(DateTime value)
    return SafeValue(value.ToString(“yyyy-MM-dd HH:mm:ss.fff”));

  /// <summary>
  /// Overloaded method for int value
  /// </summary>
  /// <param name=”value”></param>
  /// <returns></returns>
  private string SafeValue(int value)
    return SafeValue(value.ToString());

When the script is ready, Choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Transformation Editor, close it by pressing OK.

Testing the package

I simply run the package in debug mode. 105 currency rows are added to the csv file.
Including the header and footer, the csv file contains 107 rows, and this is the number that is written to the footer line.

Running the package in debug mode

First and last part of the created csv file.

Please note that to change the delimiter and text qualifier you only have to make two changes in the ***** Configuration ***** region.

  #region ***** Configuration *****

  // Here you can change the column delimiter and the text qualifier easily.
  private const string _delimiter = “;”;
  private const string _textQualifier = “\””;
. . .

  #endregion ***** Configuration *****

By this change the file will look as follows:
First and last part of the created csv file with a different delimiter and text qualifier.

I also did a test with a source query that returned over 1.1 million rows, and the csv file of 25 MB was created in just 3 seconds (on a core i7 laptop with SSD).
So my conclusion is that the performance of this solution is very satisfying.

Download the C# script for the script component here.

Conclusion / Wrap up

In this post I have demonstrated how to use SSIS 2014 to create a csv file with a dynamic header (with a timestamp) and footer line (including a rowcount) using a Script Component as a Destination in a Data Flow Task.


Big Data & Analytics insight 2016: save the date for an insightful conference
(as a speaker I was asked to do a bit of promotion ..)

(c) 2016 – Do not steal the contents – spread the link instead – thank you.

Big Data & Analytics insight 2016: save the date for an insightful conference

Dear reader,

Is the data warehouse at the end of its lifecycle or does the existing infrastructure gain new value from integrating with new Big Data technologies? If so, how do you go about it? Do you need inspiration for your analytics strategy? Then there’s plenty to go by at Big Data & Analytics Insight 2016, on June 7th in Utrecht, The Netherlands.

There will be input from Big Data domain experts and highly interesting presentations like the ones on Data Vault 2.0 & Big Data and Hands-on Predictive Text Analytics that will inspire you. Whether you are deep into the technical or the business aspects of Big Data & Analytics, this conference is a must attend if you want to stay ahead of future evolutions instead of enduring them.

Check the agenda and book your seat, or better, become a ba4all member and get access to one more conference and a hands-on session on Customer Analytics.

I look forward to seeing you at Big Data & Analytics Insight 2016!

Kind regards,

Hans Michiels
(as a speaker I was asked to do a bit of promotion ..)

Fixing corrupt csv files in a SSIS Package (SSIS Series)


Big Data & Analytics insight 2016: save the date for an insightful conference
(as a speaker I was asked to do a bit of promotion ..)


A few weeks ago my blog post was about using a C# Script Component to detect duplicates in a Data Flow Task.

This week I want to discuss it’s brother, the C# Script Task, which can be used in the Control Flow.

Again, extremely flexible, basicly you embed a small C# program inside your package, and you can use the entire .NET framework for all kind of things. I used it to fix malformed csv source files, before processing them in a Data Flow Task.

Pros and Problems

Csv (comma seperated files) have their pros and problems, as compared with other formats, like XML.

A few pros are that they are usually smaller in size than XML files, and easier and faster to load. These are two reasons why csv files still are used a lot these days, despite of the fact that XML files are in some cases more reliable, and more flexible as far as the file contents is concerned.

Unfortunately csv files are not so good in holding memo/comment fields that might contain carriage return/line feeds in it. Memo fields with those characters do not stay on one line in the file, therefore making the csv file more difficult to process (at least the standard flat file source can not handle it in all cases).

Also the structure is very strict. If a column is added to a csv source file, your Data Flow Task will break.


In this post I will discuss a possible solution for both problems:

  1. carriage return/line feeds in contents of csv files
  2. changing file formats over time

This post assumes that you have worked with SSIS before, know what For Each Loop Containers and Data Flow Tasks do, and recognize the csv problems above. It is a big advantage if you can read and understand C# code.

Okay, let’s start!


My demo is based on the following hypothetical scenario:

  1. You need to process multiple CustomerDetails.csv files.
  2. In this file a pipe (|) is used as separator between the columns.
  3. The file contains a header line: SocialSecurityNumber|FirstName|LastName|Gender|Residence|Comments
  4. The Comments column is added later, your package also needs to be able to load historic files without the Comments column.
  5. CustomerDetails.csv files come from hundreds of locations and systems (hereafter called: sources) world wide. Fixing the csv files in the export processes at all these locations is impossible. Even if some sources deliver correct csv files, your SSIS package cannot rely on the fact that all sources will do so.


Here are a few pictures of the end result:
The connection managers that are needed

The Control Flow
The Data Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.


  • SourceFolder: The folder that that For Each Loop Container reads for csv files.
  • ArchiveFolder: The folder to which files are moved after being processed, if no error occurred.
  • ErrorFolder: The folder to which files are moved that caused an error during processing.

All parameters are Required and their Data type is String.

The parameters of the package


  • CRLF: This variable will be filled with a Carriage Return/Line Feed in the Script Task.
  • CRLF_Replacer: This is the character string that is used to replace CarriageReturn/LineFeeds in the Comment column of the file. It must be a character string that does not occur in the Comment field. Therefore ~CRLF~ is used. If you want a different character string, you can change it to something else.
  • SourcePathAndFilename: The full path and filename of the original filename to process. It is filled by the For Each Loop Container. The value set at design time is unimportant.
  • PathAndFilenameToProcess: The full path and filename of the file to process. This can be either the original file, or the modified temporary file. The Flat File Connection Manager uses this path and filename. The value set at design time is unimportant.

The Data type of all variables is String.

SSIS variables needed to do the job

The Control Flow – For Each Loop Container

Add a For Each Loop Container FELC – Load sourcefiles and configure it as a Foreach File Enumerator. Retrieve file name Fully qualified.


On the Collection pane, just below the Enumerator type, press the ellipsis next to Expressions. In the Property Expressions Editor configure the Property Directory to be set by
Expression @[$Package::SourceFolder].

On the Variable Mappings pane, map variable User::SourcePathAndFilename to Index 0
For Each Loop Container configuration

The Control Flow – Script Task

Next, add a Script Task SCR – Fix csv file if needed.
In the Script Task Editor, configure the following:

  • ReadOnlyVariables: User::CRLF_Replacer,User::SourcePathAndFilename
  • ReadWriteVariables: User::CRLF,User::PathAndFilenameToProcess

Variable configuration in the Script Task Editor

Then press the Edit Script… button.
Don’t get scared by the amount of C# code, if you keep your head cool, it will be fixed in 10 minutes!

Adjust the script as follows:

  • In the Namespaces region, if not already there, add:
    using System.IO;
    using System.Text;

  • SSIS-S01E03-108
    Replace the highlighted part above with the code below

    Now remove public void Main() completely and replace it with the code underneath (to avoid malformatting due to html rendering, please download the original csharp script here).
    Then check the code for !!! ATTENTION !!! comment. These are points in the script that you might need to change.


      (c) Copyright 2016 -
      This program is free software: you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation, either version 3 of the License, or
      (at your option) any later version.
      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      GNU General Public License for more details.
      You should have received a copy of the GNU General Public License
      along with this program. If not, see

      /// This method is called when this script task executes in the control flow.
      /// Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
      /// To open Help, press F1.
      /// </summary>
     public void Main()
      // Copy a few SSIS variables to C# string variables.
       string sourcePathAndFilename = Dts.Variables[“User::SourcePathAndFilename”].Value.ToString();
       string crlfReplacer = Dts.Variables[“User::CRLF_Replacer”].Value.ToString();
       // Set the variable that needs to hold a CarriageReturn/LineFeed.
       Dts.Variables[“User::CRLF”].Value = string.Format(“\r\n”);
       // Call the method that checks if the sourcefile is corrupt and if so, fixes it.
       FixFileIfCorrupt(sourcePathAndFilename, crlfReplacer);
       // Set returnvalue for script task.
      Dts.TaskResult = (int)ScriptResults.Success;
      /// Csv-file can have lines with carriage return-linefeeds, causing errors during the load.
      /// By preinvestigating the file we can correct the file and prevent these errors from ocurring.
      /// Source:
      /// Author: Hans Michiels
      /// License: GNU General Public License, see
      /// </summary>
      private void FixFileIfCorrupt(string sourcePathAndFilename, string crlfReplacer)
       string header = string.Empty;
       string headerAppend = string.Empty;
       string detailAppend = string.Empty;
       // Initialize pathAndFilenameToProcess with sourcePathAndFilename. Overwrite later if needed.
       string pathAndFilenameToProcess = sourcePathAndFilename;
       // Check if the header of the file is as expected, and return the header row through a ref parameter.
       // If the file header is not as expected, do not touch the file and exit this method.
       if (!CheckAndReturnHeader(sourcePathAndFilename, ref header, ref headerAppend, ref detailAppend)) return;
       // Get the encoding of the current file. We need this if we have to create a corrected file.
       var encoding = GetEncoding(sourcePathAndFilename);
       // Read all lines of the file into a string array.
       // !!! ATTENTION !!! WARNING: only do this when the files to process are relatively small.
       // !!! ATTENTION !!! When in doubt, test it and monitor memory usage.
       // (I suppose files up to 50~100 MB might not be a big problem but larger might, but testing is the way to find out)
       string[] lines = File.ReadAllLines(sourcePathAndFilename, encoding);
       // Use a variable to set when the file has changed due to corrections this method makes.
       bool fileHasChanged = false;
       bool lineIsOkay = true;
       // appendbuffer is used to keep parts of 'broken' lines.
       string appendbuffer = string.Empty;
       // Get lower- and upperbound of the string array into variables.
       int lower = lines.GetLowerBound(0);
       int upper = lines.GetUpperBound(0);
       // Loop through all lines in the file from back to front!
       for (int i = upper; i >= lower; i--)
        // If there is still some text in the buffer, append it to the line and make the buffer empty again.
        if (appendbuffer != string.Empty)
          lines[i] = lines[i] + appendbuffer;
          appendbuffer = string.Empty;
          fileHasChanged = true;
        // reset lineIsOkay variable, set to true underneath when the line passes the 'line ok' checks.
        lineIsOkay = false;
        // !!! ATTENTION !!!
        // Here you need to add some custom code.
        // Describe one or more properties of a correct line, e.g.
        // - first 8 characters are numeric;
        // - The length is at least 12;
        // - It contains at least x delimiters;
        // etc.
        // Then convert this check to C# code. Use your imagination if needed!
        // I check for:
        // - the length to be 12 or more characters.
        // - the first 8 characters (the social security number) to be numeric.
        // - at least 4 pipes.
        if (lines[i].Length > 12)
          string checkString = lines[i].Substring(0, 8);
          int intOutput;
          lineIsOkay = (Int32.TryParse(checkString, out intOutput));
          if (lineIsOkay)
           // Check that the line contains at least 4 delimiters.
           lineIsOkay = (lines[i].Split('|').Length - 1 >= 4);
        if ((!lineIsOkay) && (i > lower))
          // Line is NOT OK.
          // I have to append this line to the previous one, except when it is the header.
          // To do this put this line into appendbuffer variable.
          // I use a replacement string for the removed carriage return/line feed.
          // This will be converted back to a carriage return/line feed in the Data Flow.
          appendbuffer = string.Format(“{0}{1}”, crlfReplacer, lines[i]);
          lines[i] = string.Empty; // Package can handle empty lines.
          fileHasChanged = true;
       // !!! ATTENTION !!! If you do not have old fileformats you can remove this block.
       // Code to handle the old file format, without a Comments column.
       // The code below reformats the file so that it become always in new fileformat, with an empty comment.
       if (headerAppend != string.Empty)
        // This is an old format without the Comments column. Append Comments column header to header and a delimiter (pipe) to detail rows.
        fileHasChanged = true;
        for (int i = upper; i >= lower; i--)
         if (lines[i].Length > 0)
          if (i > lower)
            lines[i] += detailAppend;
            lines[i] += headerAppend;
       if (fileHasChanged)
        // Here the modified/corrected file is stored to a temporary file, so a “tmp” extension is added to the modified filename.
        // For auditing purposes I keep the original file and not the corrected one,
        // because the latter can always be recreated when needed.
        // Also, when a bug would exist in the creation of the corrected file
        // (initially I had a problem with codepages, which should now be solved),
        // the original file might get lost forever.
        pathAndFilenameToProcess = String.Format(“{0}.tmp”, sourcePathAndFilename);
        // Now create a new corrected csv file.
        File.WriteAllLines(pathAndFilenameToProcess, lines, encoding);
       // Also change the SSIS Package variable, because the Flat File Source uses it.
       Dts.Variables[“User::PathAndFilenameToProcess”].Value = pathAndFilenameToProcess;
      /// <summary>
      /// This method checks if the header is as expected. If not, return false.
      /// Also the value of by ref parameters 'header', 'headerAppend' and 'detailAppend' is set.
      /// Source:
      /// Author: Hans Michiels
      /// License: GNU General Public License, see
      /// <param name=”sourcePathAndFilename”></param>
      /// <param name=”header”></param>
      private bool CheckAndReturnHeader(string sourcePathAndFilename,
        ref string header, ref string headerAppend, ref string detailAppend)
       // Initialize return values.
       header = string.Empty;
       headerAppend = string.Empty;
       detailAppend = string.Empty;
       // Open the file to read the header line, then close the file again.
       using (StreamReader reader = new StreamReader(sourcePathAndFilename))
        header = reader.ReadLine();
       // !!! ATTENTION !!! Here you could do a check on (parts) of the header,
       // so that you do not change files with an unexpected format.
       if (!header.ToLower().StartsWith(“socialsecuritynumber|firstname|lastname|gender|residence”))
        return false;
       // !!! ATTENTION !!! Here you could check for old file formats.
       // Old formats can only be handled when one or more new columns were added at the end of the csv lines.
       // In the example, there is a file format without the Comments column still being delivered.
       // By appending the Comments column with an empty value for those files, they can be processed
       // by the same Data Flow Task using only one Flat File Source Connection.
       if (!header.ToLower().Contains(“|comments”))
        // This is an old format without the Comments column.
        // Append Comments column header to header and a delimiter (pipe) to detail rows.
        headerAppend = “|Comments”;
        detailAppend = “|”;
       return true;
      /// <summary>
      /// Determines a text file's encoding by analyzing its byte order mark (BOM).
      /// Defaults to ASCII when detection of the text file's endianness fails.
      /// Source:
      /// With thanks to the author ‘2Toad’.
      /// </summary>
      /// <param name=”filename”>The text file to analyze.</param>
      /// <returns>The detected encoding.</returns>
      public static Encoding GetEncoding(string filename)
       // Read the BOM
       var bom = new byte[4];
       using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        file.Read(bom, 0, 4);
       // Analyze the BOM
       if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
       if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
       if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
       if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
       if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
       // return Encoding.GetEncoding(1252); // Without BOM, default to Windows Codepage 1252
       return Encoding.Default; // Without BOM, the encoding for the OS's current ANSI code page is returned by default.

    When you have adjusted the script for your file, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio Window. Then close that Window. You return to the Script Task Editor, close it by pressing OK.

    Developing the Data Flow Task

    Add a Data Flow Task to the Control Flow and name it DFT – Import Customer Details.
    Then add the following components to it:

    Flat File Source FF_SRC – CustomerDetails

    From the SSIS Toolbox, drag a Flat File Source to the Data Flow area and create a new Flat File connection manager.

    If you want to rebuild the demo, download the csv files (and C# script) here.
    Configure the Flat File Connection as shown in the picture below:

    Choose Advanced in the pane on the left and configure the properties for each column as follows:

    • SocialSecurityNumber: DataType string [DT_STR], OutputColumnWidth 9
    • FirstName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • LastName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
    • Gender: DataType string [DT_STR], OutputColumnWidth 1
    • Residence: DataType Unicode string [DT_WSTR], OutputColumnWidth 60
    • Comments: DataType Unicode string [DT_WSTR], OutputColumnWidth 1000

    Example of column properties

    Conditional Split CSPL – Filter out empty lines

    We use a Conditional Split to filter out empty lines in the csv file. Because the SocialSecurityNumber is a required column, we assume that the line is empty (at least it is useless) if SocialSecurityNumber is empty.
    So add an Output named Empty lines with Condition SocialSecurityNumber == “”. If you like, you can check on multiple columns to be empty.
    Rename the default Output to Non-empty lines.

    Conditional Split Configuration

    Derived Column DC – Re-insert CRLF

    Because the (corrected) file might contain replacements for carriage return/line feeds, we are going to replace them with the original carriage return/line feeds, before the value is inserted into the database table.
    In the Derived Column Transformation Editor, add a derived column, configured as follows:

    • Derived Column Name: Comments
    • Derived Column: Replace ‘Comments’
    • Expression: REPLACE(Comments,@[User::CRLF_Replacer],@[User::CRLF])

    The Derived Column Transformation Editor

    OLEDB Destination OLE_DST stg DimCustomer

    To rebuild the demo, create a table to store the rows from the data flow task,
    add a OLEDB Destination that uses this table.


    CREATE TABLE [stg].[DimCustomer](
     [SocialSecurityNumber] [char](9) NOT NULL,
     [FirstName] [nvarchar](50) NULL,
     [LastName] [nvarchar](50) NULL,
     [Gender] [char](1) NULL,
     [Comments] [nvarchar](1000) NULL,
     [Residence] [nvarchar](60) NULL,
     [SocialSecurityNumber] ASC
    ) ON [PRIMARY]

    The Control Flow – File System Tasks

    Finally the File System Tasks archive the original file and delete the corrected file (if it exists).
    For auditing purposes I keep the original file and not the corrected one, because the latter can always be recreated when needed.
    Also, when a bug would exist in the creation of the corrected file (initially I had a problem with codepages, which should now be solved), the original file might get lost forever.

    Please go back to the picture of the The Control Flow at the beginning of this post, to see how the File System Tasks are connected.
    Next are the printscreens that should help you to configure the File System Tasks.

    Configuration of “FST – Move original file to archive folder”

    Configuration of “FST – Move original file to error folder”

    Precedence constraint of “FST – Delete temporary file” (you need to set it twice).

    Configuration of “FST – Delete temporary file”

    The proof

    To show you that it works, I run the package in debug mode.
    To be able to have a look at the temporary/corrected files, afterwards, I have disabled the File System Task that deletes this file.

    When I open the “CustomerDetails.csv.tmp” file after the package has run, I can see that line 15 was corrected.
    From the second screenshot you can see that the CRLF_Replacer value was inserted twice.

    And when I run a query on the destination table, I can see that the Comments column contains the original CarriageReturn/LineFeeds!

    By the way, all the data in the csv files is generated from random firstnames, lastnames and residences and fake social security numbers. Any match with existing persons would be based solely on coincidence.

    Conclusion / Wrap up

    Csv files still are useful today, because they have some advantages when compared with XML files.
    However they also have their limitations, which can be frustrating when loading those files with a Data Flow Task.
    In this blog post I have demonstrated how to overcome a few of those limitations, namely:

    1. carriage return/line feeds in contents of csv files
    2. changing file formats over time

    The solution is based on using a C# Script Task.

    Please be aware that I do not pretend this is the best solution.
    If you are very familiar with C#, you could all of the processsing in the Script Task, using SqlBulkCopy, and you would not need a temporary file.
    However the Data Flow Task offers some advantages, like lookups you can use, without the need to do this in a script.
    If you create good Interface Specification Documents on forehand (for instance containing how to handle carriage return/line feeds in column values) you can force that all sources deliver correct csv-files, and then they do not have to be corrected.

    Download the C# script and csv files here.

    (c) 2016 – Do not steal the contents – spread the link instead – thank you.

A naming convention for virtual machines and home networks (Windows User Series)


If you are a professional working in the Information Technology Domain you might have set up virtual machines to test different software versions, do Proofs of Concept or just use them for the ease of backup and restore.
But what name do you give those virtual machines?
As long as it are only a few, it doesn’t really matter.


It becomes a bit of a challenge when you are have different virtual machines all with a specific setup, e.g. different versions of the same software.


I want to share my naming convention that I use for virtual machines and my home network.
It is pulled together from different sources in the internet, as well as my own insight, so don’t be surprised if you see parts of other standards.

My device names consist of the following parts:
1. Device type
2. Operating System Line
3. Operating System Version
4. Separator
5. Purpose of the device
6. Sequence number

A naming convention for virtual machines and home networks

1. Device type

Minimum and maximum length

Device type is a single character.

Domain values

Namepart Description
V Virtual Machine
S Physical Server
L Physical Laptop
D Physical Desktop
C Cloud Computer
P Printer
T Tablet Computer
M Mobile Phone

2. Operating System Line

Minimum and maximum length

Operating System Line consists of exactly two characters.

Domain values

Namepart Description
WS Windows Server
WC Windows Desktop/Client
LS Linux Server
LC Linux Client

3. Operating System Version

Minimum and maximum length

Operating System Line consists of 1 to 4 characters.

Domain values

For Windows devices use the internal version number:

Namepart Description
51 Windows Server 2003
52 Windows Server 2003 R2
60 Windows Server 2008
61 Windows Server 2008 R2
62 Windows Server 2012
63 Windows Server 2012 R2
100 Windows Server 2016
A “C” can be added to server versions, if the “Core” version is installed. E.g. 63C
51 Windows XP
52 Windows XP Professional x64
60 Windows Vista
61 Windows 7
62 Windows 8
63 Windows 8.1
100 Windows 10

Get other/older versions from wikipedia.

4. Separator

Minimum and maximum length

The separator is a single character: a hyphen (-).

Domain values

Must be a hyphen (-).

5. Purpose of the device

Minimum and maximum length

Purpose of the device consists of 1 to 4 characters.

Domain values

For Servers:

Namepart Description
DC Domain Controller
FS File Server
PS Print Server
WEB Web Server
ORA Oracle database
SQL SQL Server database
DB other database(s)
EXH Microsoft Exchange
SHP Microsoft Sharepoint
CTX Citrix Server
ESX VMware ESX Server
HV Windows Hyper-V Host

This part is optional for desktop devices.
If implemented, you could use:

Namepart Description
STD or S Standard
DEV or D Used for software development
GRD or G Used for graphical design

6. Sequence number

Minimum and maximum length

The sequence number consists of 1 to 6 characters.

Picture credits: © Yan Zommer | Dreamstime Stock Photos
Numbering your Windows instances.

Domain values

A sequence number making the name unique.
You can decide to make it unique based on purpose only, so you could have:
DWC61-STD001 and LWC61-STD002 (so you do not use 001 for the latter).
If you have a limited number of devices, you could put some intelligence in the number so you know which version of the ‘purpose delivering’ software is running.

Example names

Example name Explanation
DWS63C-HV01 Physical desktop running Windows Server 2012R2 Core, being a Hyper V Host with sequence number 01.
VWS63-SQL141 Hyper V Virtual machine running Windows Server 2012R2 with SQL Server, sequence number 141 (intelligent number, indicates that it is my first virtual machine with SQL Server 2014 installed, I would use 08x for SQL 2008, 09x for SQL 2008R2, 12x for SQL 2012, you get the idea). You could use this if the number of virtual machines per SQL version is less than 10.
VWC100-DEV151 Hyper V Virtual machine running Windows 10, used for development, with sequence number 151 (intelligent number, 15 indicates it’s my Visual Studio 2015 virtual machine, as I have only one of those, 151 is an excellent number).
LWC61-STD0554 Physical Laptop with Windows 7, with software for a ‘standard workspace’, sequence number 0554.

A real life example.


Do NOT use this naming convention for enterprise networks! Problems that could arise if you use it include:

  • A physical location is not part of this naming convention. In enterprise networks servers often have a physical location reference in their name.
  • The operating system is in the name. This if fine for your own lab virtual machines and home computers, but can be very inconvenient in company networks where the device name is in an inventory system. Enterprises typically do not want devices to be renamed when the Operating System is upgraded.

So.. use this naming convention only for the intended purpose, and take advantage of it.

Conclusion / Wrap up

I posted my own naming convention for virtual and physical computer devices I use “AS IS”.
I do not say it is perfect or useful in all cases.
Use the parts you think are convenient and adjust further to meet your requirements.
I would be thankful if you would like to share your adjustments with me, so I can update this post.
I discourage to use this standard in enterprise networks. For more details see the warning/disclaimer above.

(c) 2016 – Do not steal the contents – spread the link instead – thank you.