Tag Archives: logfiles

Free download: Blog Yearbook 2016

I have made a compilation of all my blog posts of the past year, and it was more work than you would think.
Especially because it also contains a non-published new article.

After filling in the form you can download the e-book for free.

I promise you the following:

  • I will not give or sell your email address to any third party.
  • I will not spam you.
  • If I start with a newsletter, you can easily unsubscribe or change the frequency.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Zip sourcefiles in a SSIS Package (SSIS Series)

Introduction

As I am in a (un)zipping mood, this week the opposite problem of last week.
The example is not so ready-to-use as in my previous post, as it depends on choices made in your development process of other SSIS packages.
Still I hope you can adjust this example to match your needs.

Problem

If you have imported and archived (csv- or xml) files, these files can take up a lot of disk space. The archive folder can also become overwhelmingly full with ten thousands of files, which can make it slowly accessible and difficult to manage.

Solution

To tackle this problem I have made a fully configurable SSIS package, that can be executed from a different package with an Execute Package Task.
The package is made using SSIS 2014.

The zipping itself is done with 7za.exe, for more information on how to get and install this free program see my previous blog post.

Scenario

When I used this concept, a different SSIS package moved files to a subfolder of an archive folder after they were processed, but their name was prefixed with the LoadCycleID with leading zeros, for instance 001345_customers.xml.

The archive folder and subfolder were separated, because the archive folder was an environment variable and the same for all packages, while the subfolder was different for different SSIS packages, and contained the name of the source system of the files.
If you only have one archive folder, just change the example to use that instead of the concatenation of archive folder and subfolder.

Overview

Here is a picture of the end result:

SSIS-S01E06-124
The Control Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.

Parameters:

Name Data type Required Description
_7zaPathAndFilename String False The full path and filename of 7za.exe.
ArchiveFolder String True The root folder for archiving processed files.
ArchiveSubFolder String True The subfolder of the ArchiveFolder in which the files to be zipped exist, the subfolder can for instance be named after a source system like SAP or CRM.
LoadCycleID Integer True The ID of the LoadCycle that is processing these files. In your situation it could also be called ExecutionID, SessionID and so on, the name is not really important.

SSIS-S01E06-125
Package parameters

Variables:

However not all variables might be necessary, I like the concept of being able to configure all values in one place.
The data type of all variables is String, except LoadCycleID which is an Integer.
Every variable uses an expression to set its value.
Here is a detailed list of all variables and the expression used for each one.

Name Expression
ArchiveFilesToDeleteInFolder @[$Package::ArchiveFolder] + "\\" + @[$Package::ArchiveSubFolder]
ArchiveFilesToDeletePattern @[User::LoadCycleIDString] + "*"
ArchiveFilesToZipPattern @[$Package::ArchiveFolder] + "\\" + @[$Package::ArchiveSubFolder] + "\\" + @[User::LoadCycleIDString] + "*.*"
ArchiveFilesWorkingDirectory @[$Package::ArchiveFolder]
ArchiveZipFile @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
ArchiveZipPathFile_Final @[$Package::ArchiveFolder] + "\\" +@[$Package::ArchiveSubFolder] + "\\" + @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
ArchiveZipPathFile_Temp @[$Package::ArchiveFolder] + "\\" + @[User::LoadCycleIDString] + "_" + @[$Package::ArchiveSubFolder] + ".zip"
LoadCycleID @[$Package::LoadCycleID]
LoadCycleIDString RIGHT("00000" + (DT_STR, 10, 1252) @[$Package::LoadCycleID], 6)

Important note: if you expect that the value for LoadCycleID will grow beyond 999999 over time, you should add more prefixed zero’s to LoadCycleIDString and increase the number 6, the second parameter of the RIGHT function.

SSIS-S01E06-126
SSIS variables of the package

The Control Flow

The control flow is quite straight forward, only three tasks and we’re done.

EPR Zip files in archive subfolder

This task creates one zipfile containing multiple other files in the ArchiveSubFolder.
From the SSIS Toolbox drag an Execute Process Task to the Control Flow panel and open the Execute Process Task Editor.
On the General and Process panes you do not have to change anything.

SSIS-S01E06-128
SSIS-S01E06-129
No changes needed here

On the Expressions pane, press the ellipsis next to Expressions. In the Property Expressions Editor add the following:

Property Expression
Arguments "a "+ "\"" + @[User::ArchiveZipFile] + "\" " + "\"" + @[User::ArchiveFilesToZipPattern] + "\" -mx=9"
Executable @[$Package::_7zaPathAndFilename]
WorkingDirectory @[User::ArchiveFilesWorkingDirectory]

Please note that in the Arguments expression -mx=9 is used. This is the compression level, level 9 means “ultra” compression. This level gives the smallest zip files but is also the slowest compression level. You can for instance also use 5 for “normal” or 3 for “fast” compression, if this is what you want.

SSIS-S01E06-127
Expressions for the Execute Process Task

SCR Delete uncompressed files

This task deletes the files that were added to the zip file in the previous task.

In the Script Task Editor, add both variables User::ArchiveFilesToDeleteInFolder and User::ArchiveFilesToDeletePattern to ReadOnlyVariables, then press Edit Script ….

Then in the namespaces region add one namespace, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
// Added namespace:
using System.IO;
#endregion

Adjust public void main() as follows:

public void Main

public void Main()
{
  // Copy SSIS variable values to C# string variables.
  string archiveFilesToDeleteInFolder = Dts.Variables["User::ArchiveFilesToDeleteInFolder"].Value.ToString();
  string archiveFilesToDeletePattern = Dts.Variables["User::ArchiveFilesToDeletePattern"].Value.ToString();

  // Delete each file that just has been added to the zip archive.
  foreach (string f in Directory.EnumerateFiles(archiveFilesToDeleteInFolder, archiveFilesToDeletePattern))
  {
    File.Delete(f);
  }

 Dts.TaskResult = (int)ScriptResults.Success;
}


WARNING: Make sure that no files that meet the file name pattern of files to add to the zip file are added to the ArchiveSubFolder while or after the zipfile is (being) created. Otherwise files that are NOT zipped might get deleted!

Download the script here.

FST Move zip file

The File System Task is to move the zip file to the ArchiveSubFolder, because the zipfile is initially created in a temporary folder and not in the ArchiveSubFolder.
This is to prevent that it would be deleted by coincidence, if you use a value for ArchiveFilesToDeletePattern that ends with .*
Configure the File System Task as shown in the picture below:

SSIS-S01E06-130
File System Task Configuration

Download the script here.

Let the car drive

Time to test the package and tell the result in pictures (if you don’t mind).

SSIS-S01E06-131
Before: files of LoadCycleID 001346 need to be zipped ..

SSIS-S01E06-132
Execute the package ..

SSIS-S01E06-133
All csv files replaced with one zip file ..

SSIS-S01E06-134
.. that contains the csv files ..

Conclusion / Wrap up

In this blog post you could read the steps needed to build a SSIS package that creates a zip file containing a number of files in an archive(sub)folder.
This makes the archive smaller and easier to manage.
However beware that you only delete files that were added to the zip file. For more details see the warning above.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Unzip specific files from lots of zipped archives (SSIS Series)

Introduction

This week I want to discuss the trouble you may have when searching hundreds of zip files for specific files inside them.
Recognizable?
At least I faced this problem some time ago, and solved it using SSIS and the commandline version of 7-zip, named 7za.exe.
The choice for SSIS was purely based on convenience, as you could tackle this problem also using Powershell, C# or almost any other scripting language.
But I thought SSIS is just a neat and quick way to do this.

Problem

Finding specific files in huge amounts of zip files.

Solution

To tackle this problem I have made a fully configurable SSIS package, that can be executed with different parameters.
The main task of the package is to extract specific files (based on a file name pattern) from zipfiles in a folder and – optionally – subfolders.
The unzipping itself is done with 7za.exe, which is the commandline version of 7z, the great open source zip program, that exists already for several years and is used by millions of users.

By the way, the package is made using SSIS 2014.

Preparations: download and install 7za

It is important to download this package from the official 7z web site, just to avoid that you would get a copy from some deceitful web site infected with malware.
To go to the official 7-zip download page click on this link:
7-zip download page
Most important is the 7-Zip Extra: standalone console version.
For the example I have used version 16.02 (2016-05-21) for Windows, but it should also work with version 9.20 (2010-11-18), then called 7-Zip Command Line Version.

If you download the newest version, you also need software to unzip the 7z160-extra.7z file, for instance the regular 32-bit or 64-bit version of 7-zip, which can also be downloaded on the same page.

SSIS-S01E05-100
Download page of 7-zip

I assume you will use the newest version 7z160-extra.7z
Once unzipped, just take the time to read License.txt, readme.txt and history.txt.
Then copy the three files 7zxa.dll, 7za.dll and 7za.exe to a folder on the Windows computer where SSIS runs, usually a Windows Server.
If you run a x64 version of Windows, you can take the x64 version of the 7za files, which are in a subfolder “x64” of the unzipped files.

SSIS-S01E05-101
Unzipped 7za files ..

I have created a subfolder “7za” in the “C:\Program files” folder of my virtual Windows Server, and copied the x64-version of the three mentioned files to it, but you can put them in any folder you like.
When you have done this, the real fun can start ..

Overview

Here is a picture of the end result:

SSIS-S01E05-102
The Control Flow

Detailed steps

Parameters and variables

First you need to add a few parameters and variables to the package.

Parameters:

Name Data type Required Description
_7zaPathAndFilename String False The full path and filename of 7za.exe.
FilenamePattern String True The filename pattern of files you want to unzip.
SourceRootFolder String True The main folder where you want to search for zip files.
TraverseSubfolders Boolean False Indication if you want to look only in SourceRootFolder, or also in its subfolders.
UnzipRootFolder String True The folder to which files are unzipped. Important: this may NOT be a subfolder of SourceRootFolder.
ZipFilenamePattern String True The filename pattern of the zipfiles from which you want to unzip files.

SSIS-S01E05-103
Package parameters

Variables:

Name Data type Description
FLC_UnzipFolder String The subfolder where the files will be unzipped. This is a subfolder of the UnzipRootFolder, with the same relative subpath as the zipfile has in SourceRootFolder. Additionally, the zipfilename has been replaced with a subfolder that is the same as the zipfilename without the file extension.
FLC_ZipPathAndFileName String The zipfile full path, that is target of investigation if any files with FilenamePattern are zipped inside.

SSIS-S01E05-104
SSIS variables needed to do the job

Developing the package

FELC Unzip files

This foreach loop container finds all zip files.
In the Foreach Loop Editor, select Collection in the left pane.
As we want to make the solution generic by using all those parameters mentioned above, we have to add a few expressions.

SSIS-S01E05-105
Foreach Loop container Collection configuration

Add the following three expressions (in my opinion the editor is not very userfriendly, the window is too small, but it can be done):

Property Expression
Directory @[$Package::SourceRootFolder]
FileSpec @[$Package::ZipFilenamePattern]
Recurse @[$Package::TraverseSubfolders]

Please note that these expressions will set the values for Folder, Files and the Traverse subfolders checkbox in the Enumerator configuration part of the screen (inside the purple square in the picture above).
If you close and reopen the Foreach Loop Editor, you will notice that the configuration fields now use the values from the corresponding parameters, except the value Fully qualified for Retrieve file name, which is hardcoded.

SSIS-S01E05-106
Example of how to set an expression..

Then, still in the Foreach Loop Editor, select Variable Mappings in the left pane and map variable User::FLC_ZipPathAndFileName to Index 0, as shown in the picture below.

SSIS-S01E05-107
Variable Mappings

EXPR Set Unzip folder

The next nifty thing that needs to be done, is creating a folder to store the unzipped files. As we do not know if zipped files with the same name exist in multiple zip files, the safest thing to do is to create a directory structure identical to the directory structure of the zip files.
Additionally, the basename of the zip file (so without file extension) is used as an extra subfolder. In this way we are sure that unzipped files with the same name can be stored without overwriting other ones.

To do this, from the SSIS Toolbox drag an Expression Task inside the Foreach Loop container.
Open the Expression Task and paste this expression into the Expression textbox.

@[User::FLC_UnzipFolder] =
REPLACE(LEFT(@[User::FLC_ZipPathAndFileName],
FINDSTRING(@[User::FLC_ZipPathAndFileName], "."
+ TOKEN(@[User::FLC_ZipPathAndFileName], ".",
TOKENCOUNT(@[User::FLC_ZipPathAndFileName], ".")),1) -1),
@[$Package::SourceRootFolder], @[$Package::UnzipRootFolder])

Then press the Evaluate Expression button to check that the expression is valid.

SSIS-S01E05-108
Configuring the Expression Task

FSYS Create Unzip folder

In the previous step we have composed the full path to unzip files into, but this folder structure might not exist yet.
Therefore we use A File System Task named FSYS Create Unzip folder to create the folder structure.
The good news is that the File System Task can create multiple (sub)folders at once.
Configure it as shown in the picture.
SSIS-S01E05-109
Configuration of FSYS Create Unzip folder

EPR Unzip Selected Files

Below you see a few screenshots for the configuration of the Execute Process Task named EPR Unzip Selected Files.
The funny thing is, that in the Process pane, you do not have to change anything. Executable and Arguments are set using expressions (coming next) and all the other properties can keep their default values.

SSIS-S01E05-110
Process configuration.

SSIS-S01E05-111
Setting Executable and Arguments using expressions
Arguments (copy-friendly):

"-y x \"" + @[User::FLC_ZipPathAndFileName] + "\" -o" + @[User::FLC_UnzipFolder] + " " + @[$Package::FilenamePattern] + " -r"

SCR Remove Empty folders

There is one problem with the steps done so far. If a zipfile does not contain a file that matches the FilenamePattern, then the folderstructure is made for nothing.
However you could just leave these empty folders, I think the solution is a bit better if these empty folders are deleted.
To do this, from the SSIS Toolbox drag an Script Task inside the Foreach Loop container.

SSIS-S01E05-112
In the Script Task Editor, add both variables User::FLC_UnzipFolder and $Package::UnzipRootFolder to ReadOnlyVariables

In the Script Task Editor window, press Edit Script ….

Please note that you can download the C# script here..

Then in the namespaces region add two namespaces, as indicated below.

Namespaces region
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
// Added namespaces:
using System.IO;
using System.Linq;
#endregion

Adjust public void main() as follows:

public void Main

public void Main()
{
  // Copy SSIS variable value of FLC_UnzipFolder to C# string variable.
  string unzipFolder = Dts.Variables["User::FLC_UnzipFolder"].Value.ToString();

  // Source: http://stackoverflow.com/questions/22520761/delete-all-empty-directories-in-specified-path
  // My credits go to author "Jon" from Berlin.
  // Because not all zipfiles might contain files with the FilenamePattern we are looking for, an empty path might have been created.
  // This script deletes these empty folders, so that in the end only folders will exist with unzipped files or subfolders in it.
  var di = new DirectoryInfo(unzipFolder);
  var root = Dts.Variables["$Package::UnzipRootFolder"].Value.ToString(); // no trailing slash!
  while (di.FullName != root
          && !di.EnumerateFiles().Any()
          && !di.EnumerateDirectories().Any())
  {
    di.Delete();
    di = di.Parent;
  }

 Dts.TaskResult = (int)ScriptResults.Success;
}

Download the script here.

The proof of the pudding is in the eating

Time to do some testing.
I have created a few folders with subfolders and subsubfolders. You can see the folder structure in the last picture below.
As you might have noticed the default values for my parameters already prepared for the testcase: I want to unzip only customer files from zip files with 201605 in the name.

SSIS-S01E05-113
Executing the package in Debug mode..

Below you can see the result. Only customer files are unzipped. The subfolders under the Archive folder that contain no customer zip files (but only currency files), do not exist under the Unzipped folder.

SSIS-S01E05-114
Folder structure of Archive and Unzipped rootfolders..

SSIS-S01E05-115
Unzipped “Customer” files ..

Important notes for using with huge amounts of zip files

From my experience I must also warn you. If you run this package on a folder containing gigabytes of zipfiles, there are two things to be aware of:

  1. The execution of the package can potentially take a long time, several hours is not unusual.
  2. Make sure you have enough disk space. Suppose you have the following hypothetical scenario: zip files: 100GB, average zip compression rate: 20%, percentage of files to be unzipped: 30%. In this case you are going to unzip files that have a zipped size of 30% of 100GB, is 30GB. Once unzipped, their size will be 30GB * (1/0,20) = 150GB. You need to have that disk space available on the disk where the Unzipped folder resides.

Conclusion / Wrap up

With this SSIS package you can selectively unzip files from archived zip files.
When using it, make sure you have enough disk space, and be aware that the package can run for several hours, depending on the number and (average) size of the zip files to process.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.