
Data Warehouse Automation, the way to go

Introduction

Due to other activities I have not had the opportunity to blog much in the past months. One of those activities was the creation of the web site DataWarehouseAutomation.Guide.
I think it was worth the pause, although it will also affect my time available for blogging in the future. But I love to dive into a more detailed technical post when I have some time.

I just wanted to let you know about a few blog posts on that new web site, www.dwa.guide (the short URL redirects to the long one).

Contact me if you want to discuss your data warehouse automation needs or concerns in a 1-on-1 conversation!

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

The Data Warehouse Automation Tools Survey Results (Data Warehouse Series)

Introduction

As you might know, I started a survey on the usage of and opinions about Data Warehouse Automation (DWA) tools.
Today I am presenting the results.
I sincerely want to thank all participants for sharing their thoughts.

The participants

First, some insight into the people who filled in the survey. There were not that many of you: 62 completed surveys.

In statistics there is a formula to compute the minimum sample size (number of participants) given a certain margin of error, confidence level and total population size. If I fill in 10% for the margin of error, 90% for the confidence level and 20000 for the total population (this number does not have much impact on the end result), the minimum sample size should be 68.
Please note that with a 10% margin of error and a 90% confidence level, there is a considerable chance that the answers are not representative for the entire population of data warehouse professionals.
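
For reference, the calculation behind this is presumably the standard (Cochran) sample size formula with a finite population correction, assuming maximum variance (p = 0.5):

n_0 = \frac{z^2 \, p(1-p)}{e^2}, \qquad n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}

With z ≈ 1.645 (90% confidence), e = 0.10, p = 0.5 and N = 20000 this gives n_0 ≈ 67.7, so roughly 68; the 62 completed surveys fall just short of that minimum.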

Participants by country

Participants are spread over 23 countries on several continents, with a concentration in the Netherlands and the United States.

Detailed list:

Country No of participants
Argentina 1
Australia 1
Austria 1
Belgium 3
Brazil 1
Bulgaria 1
Denmark 1
Germany 3
Ghana 1
Hungary 1
Iceland 1
India 4
Lithuania 1
Netherlands 20
New Zealand 1
Norway 1
Romania 1
Slovakia 1
Spain 1
Switzerland 3
Turkey 1
United Kingdom 2
United States 10

Job roles of participants

Almost half of the participants selected Data Warehouse Architect as the role that best describes their current job.
Furthermore, there were 12 ETL developers and a number of other job roles, as shown in the graph.

Exact counts:

Job role No of participants
BI & Analytics/Data Science manager 2
BI Consultant 1
Business analyst 1
Data warehouse architect 30
Director 2
ETL Developer 12
Front end BI Developer 1
Manager 1
Multi-disciplined BI Professional 4
Project manager 6
SQL Developer 1

Familiarity with the investigated tools

The first question to answer is: how familiar are the respondents with these tools?
I used five levels of familiarity, each with a short name and explanation. The option values have the following meaning:

  1. Don’t know: I do not know this product.
  2. Heard of: I have heard of this product but I have not worked in a project where it was tried or used.
  3. Tried: I have tried this product but not used it in production environments.
  4. Used: I have worked in a project where this product was used.
  5. Using: I am currently working in a project where this product is being used.

The results in a graph:

And the exact figures, first absolute, then percentages:

Issuer – Productname Don’t know Heard of Tried Used Using
AnalytiX DS – CatfX / Data Vault Code Gen Bundle 44 14 3 0 0
Attunity – Attunity Compose (formerly BiReady) 36 21 4 0 1
Gamma Systems – Data Warehouse Studio 55 5 1 0 0
Javlin – CloverETL 51 7 2 1 0
Magnitude – Magnitude Kalido 39 20 1 1 0
Qosqo – Quipu 43 12 3 2 1
TimeXtender – TimeXtender 42 14 3 0 2
Trivadis – biGenius 52 5 1 1 2
Varigence – BimlExpress / Bidshelper 43 3 4 3 8
Varigence – Mist / BimlStudio 46 4 9 1 1
WhereScape – RED 35 10 6 5 5
Average 44,2 10,5 3,4 1,3 1,8
Issuer – Productname Don’t know Heard of Tried Used Using
AnalytiX DS – CatfX / Data Vault Code Gen Bundle 72,1% 23% 4,9% 0% 0%
Attunity – Attunity Compose (formerly BiReady) 58,1% 33,9% 6,5% 0% 1,6%
Gamma Systems – Data Warehouse Studio 90,2% 8,2% 1,6% 0% 0%
Javlin – CloverETL 83,6% 11,5% 3,3% 1,6% 0%
Magnitude – Magnitude Kalido 63,9% 32,8% 1,6% 1,6% 0%
Qosqo – Quipu 70,5% 19,7% 4,9% 3,3% 1,6%
TimeXtender – TimeXtender 68,9% 23% 4,9% 0% 3,3%
Trivadis – biGenius 85,2% 8,2% 1,6% 1,6% 3,3%
Varigence – BimlExpress / Bidshelper 70,5% 4,9% 6,6% 4,9% 13,1%
Varigence – Mist / BimlStudio 75,4% 6,6% 14,8% 1,6% 1,6%
WhereScape – RED 57,4% 16,4% 9,8% 8,2% 8,2%
Average 72,3% 17,2% 5,6% 2,1% 2,9%

I think the most remarkable finding is that most of the tools are not very well known: the answer “Don’t know” is given on average by 72,3% of the participants.

Product satisfaction

The product satisfaction question showed that there are no ‘bad’ tools. All tools fulfill a need, and the scores range from 3 to 5 (on a scale of 1 to 5). So even the products with the lowest satisfaction mark still score a ‘C’ (or, as we say in the Netherlands, a ‘vijfje’).

Value for money

Value for money is obviously a very subjective matter: what one person finds expensive, another might consider cheap. Most vendors are not transparent about their prices.
Marks vary from 1,80 to 3,46. What we can conclude from this is that some tools are perceived as offering relatively low value for money, while others offer better value for money, according to the participants of the survey.
The products with an average score of 3 or higher (again on a scale of 1 to 5) are both Varigence products (BimlExpress/Bidshelper and Mist/BimlStudio), AnalytiX DS (CatfX / Data Vault Code Gen Bundle) and Trivadis (biGenius).

By the way, the question in the survey had the following scale: 1 = cheap and 5 = expensive. I translated this to a value-for-money score by reversing the answers (so 1 became 5, 2 became 4, and so on; effectively the reversed score is 6 minus the original answer). In this way, a higher score means a ‘cheaper’ pricing perception, and thus a better value for money.

Other tools that were mentioned

A number of other tools were also mentioned, many of them only once or twice.
It is remarkable that tools that do not meet the “DWA Tool” definition were also mentioned, for instance Microsoft BI products.
What I learned from this is that I should have given a definition of what I meant by a “Data Warehouse Automation Tool”.
From Wikipedia:
“Data warehouse automation or DWA refers to the process of accelerating and automating the data warehouse development cycles, while assuring quality and consistency. DWA is believed to provide automation of the entire lifecycle of a data warehouse, from source system analysis to testing to documentation. It helps improve productivity, reduce cost, and improve overall quality.”

Worth mentioning is the open source tool ODE. I didn’t know it, but tried it recently. It is a SQL-based, metadata-driven tool without a graphical user interface, and I could not get an “AdventureWorks” example working due to errors that occurred while working through the example. So I think this tool needs to become more stable, or at least should have a better manual on how to use it. But it’s open source, and that puts a smile on my face.
From a company/human resources perspective, working with such a tool can however lead to hidden costs, because using it can be more laborious than working with a commercial, ‘more stable’ product. And once you start changing the source code (for yourself or your company, not as a member of the open source development team) you can no longer upgrade to newer versions and are stuck with the maintenance of your copy of the tool. And if you join the open source team, you are now doing two projects instead of one!
In the Netherlands we have an old-fashioned saying, “Bezint eer gij begint”, which means “think well before you start”, and I think it is applicable here as well. A license for a commercial tool might be worth the money.

It also appears that a number of these tools are not for sale, but are tools that consultancy firms use for their own projects.

List of all the other tools that were mentioned:

Issuer – Productname Mentions
Adante – Datawarehouse on Demand 1
Automic – One Automation 2
Birst – Birst 1
Centennium – TBD engine (template based development) 1
Cubido – Cubikit 1
DIKW – DHW Accelerator 1
Grip Op Data B.V. – GDIP 1
Infent – Infenture Insight 1
Informatica – Powercenter 2
Informatica – B2b data exchange 1
Informatica – Informatica 1
Informatica – MFT 1
I-refact – I-refactory 1
Kadenza – Smart Datawarehouse generator 1
Microsoft – MSBI(SSIS,SSAS,SSRS) 9
Microsoft – Power BI 1
Optimalbi Limited – ODE (Open source) 1
Oracle – ODI 1
Oracle – OWB 1
Own build – Excelbased tool 1
Pentaho – Kettle 2
Pentaho – Big Data 1
Rabobank – Foundation Toolkit 1
SAP – Data Services 2
SAP – Information Steward 1
Tableau – Server 1
Talend – Data Fabric 1
Talend – Open Studio 1
TopBi – Data Vault Generator 1
TopBI – HSG 1
Varigence – BimlFlex 1
Varigence – BimlOnline 1
Viz Analytics – Data Warehouse Studio 1

Comments given about all the tools

Participants could give their opinion about the tools in free text format – and they did!
Here are just the bare comments.

Attunity – Attunity Compose (formerly BiReady)
  • “Like features, makes work flow easily “
Qosqo – Quipu
  • “This one could be split up between the free and paid variant for this research.”
  • “Most advanced options on data vault modeling and generation in one tool. Does not support data marts, requires Groovy knowledge to change templates.”
TimeXtender – TimeXtender
  • “vendor lock-in, no best practices applied when generating SSAS cubes”
  • “lock in, very good features, easy to use (even for business users)”
Trivadis – biGenius
  • “if they extend SSIS generation no lock in, very flexibel, fast, cool”
  • “No Vendor-Lock (generated Code can be scheduled and executed native), Microsoft and Oracle-Support, Data Vault Support”
  • “Integrated product, short learning curve, easy to adapt”
Varigence – BimlExpress / Bidshelper
  • “Great to building SSIS packages.”
  • “Free, but you need to write the generating code first. Ideal for consultants though.”
  • “Nice combination, but needs to mature.”
  • “no lock in, very cool that you generate SSIS/SSAS, free, a lot of work coding, not for everyone”
  • “Very flexible, but long learning curve”
  • “Second best tool to generate SSIS packages. Requires a lot of exceptions if you want to alter parts of code for specific tables. Debugging is hell.”
Varigence – Mist / BimlStudio
  • “Has potential but is progressing slowly. Pricing is fair.”
  • “No vendor lock-in, as the result of BIML are just SSIS packages”
  • “More mature version.”
  • “easier, expensive, nice you can easily re-engineer from existing DWH”
  • “There is no reverse engineering module.We can not convert old package to use BimlStudio”
  • “Very flexible”
  • “Best tool to generate SSIS packages. Still requires a lot of manual labor for low level changes.”
WhereScape – RED
  • “We were able to increase development speed with a factor 3. “
  • “Vendor-Lock (needs App-Server for Scheduling, Processing), Microsoft not fully supported (only Beta), Data Vault Support”
  • “Creates stored procedures which clutters db’s, powerful tool, but has some annoyances”
  • “Several products which are not completely integrated, some features not “ready”, several limitations (especially when working with datasets, expensive for what you get, only based overseas (except smalll base in UK)”
  • “You have to click and type almost everything, then it “generates” a stored procedure. Also stores its repository in the same database and schema as your data warehouse.”
  • “Changed projects into initiatives enabling business analytics. “
  • “Like lineage, like support for multiple generation types (such as data vault), dislike performance, dislike excessive storage, dislike cost of consulting, dislike unreliability, dislike UI, dislike lack of adapters, dislike reliance on database procedures, dislike scheduler, dislike target specific repository”

Comments for other tools (not all available for sale)

Adante – Datawarehouse on Demand
  • “Very easy to use automation tool for Dimensional modeling”
Centennium – TBD engine (template based development)
  • “SQL server based metadata based staging, vault en datamart generator. Developed by Centennium, for the larger part made adult by me.”
I-refact – I-refactory
  • “A very solid, well thought solution for the data management. It’s not just for dwh, but for the whole data platform of your company.”
Kadenza – Smart Datawarehouse generator
  • “Model driven generator c-sharp”
Talend – Data Fabric
  • “Talend is a very friendly for completing complex operations like Data Integration, Data Quality and Big Data (towards Hadoop) with minimum programming skills”
Viz Analytics – Data Warehouse Studio
  • “Custom built from the ground up, with own ETL engine, very quick to build and implement DWs.”

Used technologies

One of the questions was about the technology being used (taken from the Gartner Magic Quadrant for BI and Analytics). As more than one answer was allowed, the total number of responses is higher than the number of participants: some participants use two or more technologies.
In the top 5 we see Microsoft, Oracle, Tableau, SAP and Qlik. To be honest, nothing that surprises me.

Technology Mentions
Alteryx 7
Birst 3
Board International 3
Datameer 3
IBM 11
Information Builders 3
Microsoft 47
MicroStrategy 7
Oracle 25
Pentaho 6
Qlik 13
SAP 16
SAS 5
Sisense 1
Tableau 20
TIBCO Software 2
Yellowfin 1
Zoomdata 1

Conclusion / Wrap up

With this survey I tried to give the business intelligence community more insight into the available Data Warehouse Automation Tools.
Due to the low number of participants, conclusions come with large uncertainty margins. I think this is a pity.
Nevertheless, this investigation could be a good starting point for further analysis of DWA tools or for a DWA tool selection.

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Using a Persistent Staging Area: What, Why, and How

Change log

Date Changes
21 February, 2017: Due to a very useful discussion on LinkedIn with Dan Linstedt, I have made some changes in this article to emphasize that a PSA does not have a permanent place in any EDW
(with the exception, as Dan suggests, being the Hadoop NoSQL platform used for slower storage mechanisms).

Today I want to discuss the Persistent Staging Area in an Enterprise Data Warehouse, or PSA for short. First I want to give credit to Roelant Vos, who has blogged about this subject before. He triggered me to take this concept further, to see “what’s in it for me” and to do a proof of concept using SQL Server 2016.

What is a Persistent Staging Area or PSA?

As an introduction, I want to explain what a (non-persistent) Staging Area is; some of you will already know this. A Staging Area is a “landing zone” for data flowing into a data warehouse environment. The data in a Staging Area is only kept there until it is successfully loaded into the data warehouse. Between two loads, all staging tables are emptied again (or dropped and recreated before the next load).
A Persistent Staging Area, on the other hand, is a Staging Area that is not wiped out between loads: it contains the full history of the source(s) that deliver data to it.


Picture credits: © creativecommonsstockphotos | Dreamstime Stock Photos
The “Staging Area” for passengers at an airport ..

Why would you want to make use of a PSA, and what are its pros and cons?

A PSA makes it possible to reload the next layer in the Data Warehouse (for example a Data Vault) when requirements change during development of the Data Vault / EDW, without going back to the original source(s). The original source(s) might not contain all history anymore (e.g. because it is an OLTP system or a web service returning only current data), or it might be cumbersome to do such a reload because the source data is delivered in a large number of flat files.

As you might know, for data warehouses with (semi-)structured data I always advocate a multi-tier architecture, including a Data Vault for storing the full history of unmodified source data and one or more dimensional models (virtual if performance allows, otherwise physical) as data feeds for cubes, reports, dashboards and BI tools.

So, I hear you think: a Data Vault already contains full history, why would you need a PSA then? The answer is simple: because building a Data Vault takes time. You do not load all data on day 1; most likely there are uncertainties about the model and your resources are limited. So you build the Data Vault in increments. But you can start capturing as much source data as you can! When Data Warehouse Automation tools are used, this is fairly easy to build. So even though you do not take the data any further into your data warehouse than the (Persistent) Staging Area, you already start collecting your history of data.

Note: a different reason to use a PSA could be that your source systems contain high data volumes and are geographically spread: you could then create a PSA on premises, near the source system, and export only delta data from it – each row would get an I, U or D indicator for Insert, Update or Delete – to massively reduce data traffic between source and EDW. I will not discuss this use case today, because the pros and cons would then be completely different, and time travelling and full reloads would not be applicable.

Using a PSA has its pros and cons. To mention a few:

Pros:

  • You collect data, even for those parts of the data warehouse you have not properly designed and developed yet.
  • You can clear the Data Vault, or even change the model completely (this can happen, for instance, when business concepts were initially not well understood), create new ETL code to load data into the modified Data Vault, and replay the load of the full history of data captured in the PSA.
  • It is important that the PSA is implemented with changed rows only. If done so, it does not have excessive storage needs. So, like a Data Vault satellite, a new row is only added to a PSA table when something has changed for the primary (surrogate or natural) key. A start and end date/time can be used to implement this.
  • A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers. A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers! This was written twice on purpose! Do not forget! It is useful while building the EDW, but must be disarmed when the EDW has gone into production and has reached the maintenance phase. A good way to make sure this happens is to plan a decommission date and stick to the plan (this can also be a plan like “after sources X, Y and Z are connected and have been data feeds in production for 3 months”). Do not keep the PSA around for no good reason.
    In this way, most of the cons below are very likely not applicable!

Cons:

  • You can argue that you are in fact building a second data warehouse.
  • When sources change over time, your PSA might not reflect the current structure of the source system(s) anymore (e.g. a column is removed in the source system but you leave it in the PSA so that you can add new data with the column empty but keep the data in this column for older rows). This might be confusing.
  • When source systems are not used anymore (for instance when accounting software is replaced by a similar product from a different vendor), you need to add tables of the new system to the PSA, while you also keep the tables from the old system, which will not receive new data anymore. Now you need to keep the ETL code that loads from the “old system’s” PSA tables, in case a full reload is needed, and you also need to develop code to get data from the “new system’s” PSA, and you might need logic to determine which tables should be loaded for which point in time. As you can understand, the maintenance of your PSA can then start to become a heavy burden. So, going back to the pros, make sure your PSA is optional and can be disarmed when the maintenance effort no longer justifies the benefits.

How could you implement a PSA?

To give an idea of how this fits into my favourite architecture, I’ve added the PSA to it.
For a normal load, the PSA is simply loaded from the Staging Area (preferably in parallel with loading the Enterprise Data Warehouse from the Staging Area).
A Virtual Staging Access Layer (for example implemented with views) ensures that the ETL code for loading the Enterprise Data Warehouse can easily switch between the Staging Area and the PSA as the source for loading data. An example of how to implement this follows below.


How the PSA fits in the EDW architecture

I owe you an explanation for the next schema, because I know it’s a bit weird to mix architecture and processing in one schema, but in this case I thought it made sense and could clarify how to reload the EDW from the PSA: no sources or Staging Area are involved; instead a time travelling mechanism queries the data from the PSA through the Virtual Staging Access Layer, to replay all loads that have been done so far, but with modified ETL code that loads the data into a modified or extended Data Vault.


Reloading the DWH from the Persistent Staging Area

To set up time travelling, you could create a list of load sessions and their start date/times, then subtract a small time fraction from each next load’s start date/time and use that value as the Point-In-Time date/time value to retrieve all rows up to and including the previous load.
Note: for the DATETIME2 columns I use a precision of centiseconds, so DATETIME2(2). For justification see one of my older blog posts: Stop being so precise! and more about using Load(end)dates (Datavault Series). If needed you can use a higher precision for your data warehouse.

In a table:

Load no LoadStartDatetime Point-In-Time value
1 2017-02-01 02:00:01.74 2017-02-02 02:00:00.88
2 2017-02-02 02:00:00.89 2017-02-03 02:00:02.11
3 (last load so far) 2017-02-03 02:00:02.12 9999-12-31 23:59:59.98 (! not 99 !)
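
A minimal T-SQL sketch of how such a point-in-time list could be derived, assuming the load start date/times are kept in a (hypothetical) table [psa].[LoadSession]; the table and column names are only illustrative:

Derive_PointInTime_values.sql

-- Each load's point-in-time value is the next load's start minus one centisecond (DATETIME2(2)).
-- The last load so far gets the maximum value 9999-12-31 23:59:59.98.
SELECT
    ls.[LoadStartDatetime],
    ISNULL(
        DATEADD(MILLISECOND, -10,
            LEAD(ls.[LoadStartDatetime]) OVER (ORDER BY ls.[LoadStartDatetime])),
        CONVERT(DATETIME2(2), '9999-12-31 23:59:59.98')
        ) AS [PointInTimeValue]
FROM
    [psa].[LoadSession] ls
ORDER BY
    ls.[LoadStartDatetime];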

The Time Travel icon with the arrows in the picture above refers to the process that loops through all Start date/time values chronologically and triggers execution of the load process from PSA to EDW for every Point-In-Time value.

SQL Server 2016 implementation: how could you implement loading the PSA from the Staging Area?

This logic is very similar to loading a Data Vault satellite. For each staging table you should create a “History” version in the PSA database, with two extra columns, EffectiveStartDts and EffectiveEndDts, and a primary key defined as the primary key of the staging table plus the EffectiveStartDts. Most concepts of loading a satellite can be applied: you add rows that are new, insert new versions of rows that have changed, and mark absent rows (missing in the source) as such. For the PSA I would suggest simply updating the end date/time, so that the row will not be in the selection when a later point-in-time is used. Also, there is no need to create a separate “hub” table for each staging table. A simplified sketch of such a load follows below.
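
The sketch below uses the [stg].[Customer] / [psa].[CustomerHistory] example that also appears further down. It assumes a [RowHash] column containing a hash of all non-key columns, inclusive end dating with 9999-12-31 23:59:59.98 as the “open” end date, and it leaves out the handling of rows that are absent in the source:

Load_psa_CustomerHistory.sql

-- [stg].[Customer] is assumed to be (a synonym for) the table in the Staging database.
DECLARE @LoadDts DATETIME2(2) = SYSDATETIME();
DECLARE @OpenEndDts DATETIME2(2) = CONVERT(DATETIME2(2), '9999-12-31 23:59:59.98');

-- End date the current version of rows that have changed (inclusive end dating:
-- the old version ends one centisecond before the new version starts).
UPDATE hist
SET hist.[EffectiveEndDts] = DATEADD(MILLISECOND, -10, @LoadDts)
FROM [psa].[CustomerHistory] hist
JOIN [stg].[Customer] stg
  ON stg.[CustomerID] = hist.[CustomerID]
WHERE hist.[EffectiveEndDts] = @OpenEndDts
  AND stg.[RowHash] <> hist.[RowHash];

-- Insert a new version for rows that are new or whose current version was just end dated.
INSERT INTO [psa].[CustomerHistory]
    ( [CustomerID], [FirstName], [Initials], [MiddleName], [SurName], [DateOfBirth], [Gender],
      [SocialSecurityNumber], [Address], [PostalCode], [Residence], [StateOrProvince], [Country],
      [RowHash], [EffectiveStartDts], [EffectiveEndDts] )
SELECT
    stg.[CustomerID], stg.[FirstName], stg.[Initials], stg.[MiddleName], stg.[SurName], stg.[DateOfBirth], stg.[Gender],
    stg.[SocialSecurityNumber], stg.[Address], stg.[PostalCode], stg.[Residence], stg.[StateOrProvince], stg.[Country],
    stg.[RowHash], @LoadDts, @OpenEndDts
FROM [stg].[Customer] stg
WHERE NOT EXISTS
    ( SELECT 1
      FROM [psa].[CustomerHistory] hist
      WHERE hist.[CustomerID] = stg.[CustomerID]
        AND hist.[EffectiveEndDts] = @OpenEndDts );

In a real solution you would of course generate this code per staging table (for instance with a Data Warehouse Automation tool) instead of writing it by hand.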

SQL Server 2016 implementation: how could you implement the historic reload from the PSA?

In a previous blog post, Isolate cross database “databasename” dependencies using synonyms and schemas, I already advocated the concept of putting each tier of the EDW in its own database schema, and using synonyms to achieve interconnectivity between objects in different databases. In the rest of this article I will assume that you understand this concept (if not, you could read that blog post first).

With some imagination, we can use a modified version of this concept by creating the Virtual Staging Access Layer inside the PSA database. This virtual layer is needed for loading data from the PSA instead of the Staging Area.
Suppose the Staging Database has a database schema stg and that schema contains all tables that can be loaded into the EDW / Raw Data Vault.
What we can do is something similar in the PSA database: create a stg schema, but instead of tables this schema contains views with the same names and the same columns as the tables in the Staging Database! These views select the rows that were valid at a certain point in time.

The challenge of retrieving only the rows valid at a point in time is an easy one: to the PSA database we add a table [stg].[PointInTime] with only one row and a [CurrentPointInTime] column of data type DATETIME2. Now we can build a SELECT statement on the PSA history table (with the same columns as the staging table, but with start and end date/time stamps to hold historically changed and deleted rows) with a CROSS JOIN on the PointInTime table and a WHERE clause that selects only the rows valid at the given point in time.

An example of how such a view could look:

CREATE_VIEW_stg_Customer.sql

CREATE VIEW [stg].[Customer]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 15-FEB-2017
Description: View that has exactly the same structure as a table with the same name
             in the Staging Database.
             This view can be used for full reloads of all data in the PSA Database.
==========================================================================================
*/

SELECT
    [CustomerID],
    [FirstName],
    [Initials],
    [MiddleName],
    [SurName],
    [DateOfBirth],
    [Gender],
    [SocialSecurityNumber],
    [Address],
    [PostalCode],
    [Residence],
    [StateOrProvince],
    [Country],
    [RowHash]
FROM
    [psa].[CustomerHistory] hist
CROSS JOIN
    [psa].[PointInTime] pit
WHERE
    hist.EffectiveStartDts <= pit.CurrentPointInTime
    AND hist.EffectiveEndDts >= pit.CurrentPointInTime -- When INCLUSIVE enddating is used.
    -- AND hist.EffectiveEndDts > pit.CurrentPointInTime -- When EXCLUSIVE enddating is used.
GO

The next steps are:

  1. Make sure your ETL code to load the data has a mechanism to skip loading data from the sources to the Staging Area and from the Staging Area to the PSA when the PSA is used instead of the normal Staging Area. For instance, you could add a view [stg].[IsPSA] to both the Staging database and the PSA database, returning a single row with a single value “False” or “True”, respectively. In your package you could retrieve this value using the Staging Connection Manager, and change the flow of your package depending on whether the Staging database or the PSA database is connected.
  2. Build a time travelling mechanism (could be a FullReloadMaster.dtsx package) that updates the stg.PointInTime table for every point-in-time date/time and after every update starts the ETL process (e.g. a Master.dtsx SSIS package) with a changed connection string, connected to the PSA instead of the “normal” Staging database!

Here is a possible implementation of the [stg].[IsPSA] view, based on the fact that the PSA database will have _PSA_ in its database name and the staging database will not. This allows you to use the same version of the view for both the Staging and the PSA database.

CREATE_VIEW_stg_IsPSA.sql

CREATE VIEW [stg].[IsPSA]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 10-FEB-2017
Description: View that returns one row with a bit column [IsPSA] that indicates
             if the current database is the PSA or the STG.
==========================================================================================
*/

SELECT
  CONVERT(BIT,
    CASE
    WHEN CHARINDEX('_PSA_', DB_NAME()) > 0 THEN 1
    ELSE 0
    END) AS [IsPSA]
GO


An example of how a FullReloadMaster.dtsx package could look. It is important that the Staging database connection string is a parameter of Master.dtsx and is passed in by FullReloadMaster.dtsx, but that it now is a connection to the PSA database.

Now the next challenge is how to get this working. If you use an SSIS dataflow with an OLEDB Source (and a destination), it is not so difficult: you can just change the connection string of the OLEDB Source. I have done a proof of concept and it worked! You do not have to redeploy SSIS packages. If views in the PSA database are used with exactly the same structure as the tables in the Staging database, the SSIS package agrees with the “metadata” and the “VS_NEEDSNEWMETADATA” or “VS_ISBROKEN” errors do not occur. Of course you could also create views for reading the Staging tables in the Staging database. This would create an additional layer of abstraction, because then in both the Staging Area and the PSA you read the data from views, and you could, when needed, implement views whose structure differs from the underlying tables.

If you use SQL code (e.g. stored procedures) this will be a little more work, but still doable. You can follow one of the following strategies for storing the SQL Code:

  • a push strategy: SQL code is in the PSA database, and should then also be in the Staging database. From there, you push the data to the raw Data Vault.
  • a pull strategy: SQL code is in the raw Data Vault. From there you pull the data from the Staging Area or the PSA. You will need to switch via synonyms: drop the stg synonyms and recreate them, now pointing to the PSA database, before a historic reload is done. Afterwards, the reverse action needs to take place to let them point to the Staging database again (see the sketch after this list).
  • an outsider strategy: SQL code is in a separate database, and has synonyms for both staging and raw Data Vault tables and views. Now the same switch via synonyms method can be used.
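
To illustrate the switch via synonyms mentioned in the pull strategy, it could look something like the sketch below; the database names are made up for the example:

Switch_stg_synonyms.sql

-- Before a historic reload: point the stg synonym to the view in the PSA database.
IF OBJECT_ID('[stg].[Customer]', 'SN') IS NOT NULL
    DROP SYNONYM [stg].[Customer];
CREATE SYNONYM [stg].[Customer] FOR [EDW_PSA].[stg].[Customer];
GO

-- After the historic reload: point it back to the table in the Staging database.
IF OBJECT_ID('[stg].[Customer]', 'SN') IS NOT NULL
    DROP SYNONYM [stg].[Customer];
CREATE SYNONYM [stg].[Customer] FOR [EDW_STG].[stg].[Customer];
GO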

A few considerations:

  • a push strategy forces you to install your SQL code twice: in the Staging and the PSA database. Also a Staging database might not feel like the “right place” for stored procedures.
  • a pull strategy and an outsider strategy force you to change synonyms, something that normally only has to be done during a deployment, but now before a full reload (and afterwards the reverse action).

I have a slight preference for switching via synonyms, assuming that a full reload will not happen often.
Still, using the push strategy could be a valid choice too.

Conclusion / Wrap up

In this post you could read more about the concept of a Persistent Staging Area. Although I have not used it in production environments yet, I think in some projects it can be very useful.
Important to keep in mind:

  • Make sure that your data warehouse can survive if the PSA is switched off. So do not expose it to end users or make it a mandatory part of your ETL Code.
  • Plan a decommission moment when the PSA is created and stick to the plan. Do not keep the PSA around for no good reason.
  • Use code generation tools to update the PSA from the staging area to limit the development effort to set it up.
  • Virtualize the read access to the staging tables, for instance by using views.

If you stick to these “rules”, I think a PSA can prove its value for your Enterprise Data Warehouse!

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Free download: Blog Yearbook 2016

I have made a compilation of all my blog posts of the past year, and it was more work than you would think, especially because it also contains a previously unpublished article.

After filling in the form you can download the e-book for free.

    I promise you the following:

    • I will not give or sell your email address to any third party.
    • I will not spam you.
    • If I start with a newsletter, you can easily unsubscribe or change the frequency.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Using the project deployment model with environments (SSIS Series)

    Introduction

    Hi, as announced in my previous post of July 14, I took a summer break. It was really useful and I enjoyed it.
    From now on I will try to write some interesting posts weekly again.
    Today I want to explain the project deployment model of SSIS, introduced already in SSIS 2012.
    You will also find a (hopefully useful) script below that creates environments and environment variables, and connects package parameters to environment variables.
    But that’s for later, first a little introduction.

    Folders and environments

    What I found very confusing when I started using the project deployment model, some years ago already, is that an environment in this context is not what I expected it to be.
    From a software development perspective an environment is an installed and working version of your software, optionally using one or more databases and configuration settings exclusively for that environment, for instance a Development or a Test environment.
    But the project deployment model of SSIS does not agree with that definition.
    For the project deployment model an environment is only a set of values for all environment variables, and nothing more.
    Can you see what that means? Suppose you have one server used for both Development and Test, and you have databases and cubes with a “_DEV” or “_TST” suffix in their names.
    But as far as SSIS is concerned, you can only install one version of an SSIS package! Yeah, you can execute it with a different set of parameter values stored in SSIS environment variables, but that’s it.
    You cannot have two different versions of the same package installed on the same instance of SQL Server – at least not using only SSIS Environments.
    I found it not so intuitive that you need to use folders inside the Integration Services Catalog to achieve this.
    So much for the theory; let us try this stuff out together.

    Let’s try it

    I’ll just click an example together; you do not need to follow along, my goal is that you understand the concepts so that you can apply them to your own projects.
    By the way, I am using SQL Server 2016, but it should also work in SQL Server 2012 and 2014.

    My test setup is very basic:

    • I have made two databases, [SSIS_S01E08_DEV] and [SSIS_S01E08_TST], as development and test environment, respectively.
    • Both databases have a table [tbl].[Fruit], but the table in the DEV database has one extra column, [FruitDescription], to simulate a change that is under development but not deployed to test yet.
    • I have made one SSIS package, AddFruit.dtsx, that inserts a row into the table [tbl].[Fruit], passing the value for [FruitName] only. During the demo I will change the package so that it also provides the [FruitDescription] column value. (A sketch of this setup follows after this list.)
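
    A minimal sketch of this test setup; the exact data types are not given in the post, so they are assumptions here:

    CREATE DATABASE [SSIS_S01E08_DEV];
    GO
    USE [SSIS_S01E08_DEV];
    GO
    CREATE SCHEMA [tbl];
    GO
    CREATE TABLE [tbl].[Fruit] (
        [FruitName] NVARCHAR(50) NOT NULL,
        [FruitDescription] NVARCHAR(255) NULL -- only present in the DEV database for now
        );
    GO
    -- The test database [SSIS_S01E08_TST] is created the same way, except that
    -- [tbl].[Fruit] does not have the [FruitDescription] column yet.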

    Playing around with the example above, I can demonstrate the use of folders and environments when the package is deployed using the project deployment model.

    The initial SSIS Package

    The initial SSIS package contains just one Execute SQL Task, one package parameter named FruitName and one OLEDB connection manager named MyDatabase.connectionmanager.
    The connection manager is configured with an expression, so that the ConnectionString property gets the value of a project parameter that is also named ConnectionString, as demonstrated in the images below.
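
    The post does not spell out the SQL statement inside the Execute SQL Task, but with an OLEDB connection manager it presumably boils down to something like this, with the question mark mapped to the package parameter FruitName in the Parameter Mapping pane:

    INSERT INTO [tbl].[Fruit] ([FruitName])
    VALUES (?); -- ? is mapped to $Package::FruitName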
    SSIS-S01E08-280: The Execute SQL Task

    SSIS-S01E08-281: The package parameter

    SSIS-S01E08-282: The project parameter

    SSIS-S01E08-283: The OLEDB connection manager configured with an expression
    SSIS-S01E08-284: Then rebuild the Solution

    Deploying the SSIS Package

    Once the solution has been rebuilt, you can deploy the package with the Integration Services Deployment Wizard. You can also call this utility from the command line, but explaining that is out of scope for this blog post.
    To execute this wizard, you have to locate the .ispac file; it is in a subfolder of your project’s bin folder.
    If you have created additional solution configurations in your Visual Studio solution, this subfolder might be different from the image below (where the subfolder is Development).
    SSIS-S01E08-285: Locating the .ispac file that was created when the solution was rebuilt.

    SSIS-S01E08-286: First step of the wizard

    SSIS-S01E08-287: The default deployment model is Project Deployment, the one you need. The other option is Package Deployment.

    SSIS-S01E08-288: Select a server name and a path [1]

    SSIS-S01E08-289: Select a server name and a path [2]: create a new folder

    SSIS-S01E08-290: Select a server name and a path [3]: name the folder after your environment

    SSIS-S01E08-292: Select a server name and a path [4]: then click Next

    SSIS-S01E08-293: Review and deploy.

    SSIS-S01E08-295: Deployment succeeded.

    SSIS-S01E08-296: In the Integration Services Catalog you can find your project and package.

    SSIS-S01E08-297: You could configure it now, but there are no SSIS environments yet (note: these are sets of environment variable values).

    SSIS-S01E08-298: Don’t configure just yet. We will do this later.

    SSIS-S01E08-299: Here you could create an environment .. don’t do it now!

    SSIS-S01E08-TST-environment_setup.sql

    --\
    ---) A script to create an environment (with variables and references) in an existing SSIS Catalog folder.
    ---) Author: Hans Michiels
    --/
    /*
    (c) Copyright 2016 - hansmichiels.com
     
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
     
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
     
    You should have received a copy of the GNU General Public License
    along with this program. If not, see http://www.gnu.org/licenses/.
    */

    USE [SSISDB]
    GO

    DECLARE @environment_variables_table TABLE(
      [id] int identity(1, 1),
      [variable_name] nvarchar(128),
      [data_type] nvarchar(128),
      [sensitive] bit,
      [value] nvarchar(4000),
      [description] nvarchar(1024)
      )

    DECLARE @object_parameter_value_table TABLE(
      [id] int identity(1, 1),
      [object_type] smallint,
      [object_name] nvarchar(260),
      [parameter_name] nvarchar(128)
      )

    DECLARE @id INT
    DECLARE @max_id INT
    DECLARE @folder_name sysname
    DECLARE @environment_name nvarchar(128)
    DECLARE @environment_description nvarchar(128)
    DECLARE @project_name nvarchar(128)
    DECLARE @variable_name nvarchar(128)
    DECLARE @data_type nvarchar(128)
    DECLARE @sensitive bit
    DECLARE @value sql_variant
    DECLARE @value_tinyint tinyint
    DECLARE @description nvarchar(1024)
    DECLARE @nsql nvarchar(max)

    DECLARE @object_type smallint
    DECLARE @object_name nvarchar(260)
    DECLARE @value_type CHAR(1)
    DECLARE @parameter_name nvarchar(128)
    DECLARE @parameter_value sql_variant

    --\
    ---) Environment settings
    --/
    SELECT
      @folder_name = N'TST',
      @environment_name = N'TST',
      @environment_description = N'My demo Test environment',
      @project_name=N'SSIS_S01E08'

    --\
    ---) Script to create environment and settings.
    --/
    IF NOT EXISTS (
        SELECT 1
        FROM [SSISDB].[internal].[environments]
        WHERE [environment_name] = @environment_name
        )
    BEGIN
        EXEC [SSISDB].[catalog].[create_environment] @folder_name, @environment_name, @environment_description;
    END

    INSERT INTO @environment_variables_table
      ( [variable_name], [data_type], [sensitive], [value], [description] )
      VALUES
          ( N'ConnectionString', N'String', 0, N'Data Source=.;Initial Catalog=SSIS_S01E08_TST;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;', N'Connection string for the database.' )
        , ( N'FruitName', N'String', 0, N'Tamarillo', N'' )
        --, ( N'Int32Example', N'Int32', 0, N'1', N'' )
        --, ( N'BooleanExample', N'Boolean', 0, N'1', N'' )
        --, ( N'ByteExample', N'Byte', 0, N'5', N'' )

    SELECT @id = 1, @max_id = MAX([id]) FROM @environment_variables_table
    WHILE @id <= @max_id
    BEGIN
        SELECT
          @variable_name = v.variable_name,
          @data_type = v.data_type,
          @sensitive = v.sensitive,
          @value = v.value,
          @description = v.[description]
        FROM @environment_variables_table v
        WHERE [id] = @id;

        IF EXISTS (
            SELECT 1
            FROM [SSISDB].[internal].[environment_variables] v
            JOIN [SSISDB].[internal].[environments] e ON e.environment_id = v.environment_id
            WHERE v.[name] = @variable_name
            AND e.environment_name = @environment_name
            )
        BEGIN
            SET @nsql = N'EXECUTE [catalog].[delete_environment_variable] '
              + N'@folder_name = N'''+ @folder_name + ''', @environment_name = N''' + @environment_name + ''', '
              + N'@variable_name = N''' + @variable_name + ''''
            PRINT @nsql;
            EXEC sp_executesql @nsql;
        END

        PRINT '/*'
        PRINT @variable_name
        PRINT CONVERT(nvarchar(128), SQL_VARIANT_PROPERTY(@value, 'BaseType'));
        PRINT '*/'

        SET @nsql = N'EXECUTE [catalog].[create_environment_variable] '
          + N'@folder_name = N'''+ @folder_name + ''', @environment_name = N''' + @environment_name + ''', '
          + N'@variable_name = N'''+ @variable_name + ''', @data_type = N''' + @data_type + ''', '
          + N'@sensitive = ' + CONVERT(NVARCHAR, @sensitive) + ', @description = N''' + @description + ''', '
          + CHAR(13) + CHAR(10) + N'@value = ' +
          CASE UPPER(@data_type)
          WHEN 'String' THEN 'N''' + CONVERT(NVARCHAR(1000), @value) + ''' '
          ELSE CONVERT(NVARCHAR(1000), @value)
          END + '; '
        PRINT @nsql;
        EXEC sp_executesql @nsql;

        SET @id = @id + 1
    END

    --\
    ---) Add environment reference to project.
    --/
    Declare @reference_id bigint
    IF NOT EXISTS(SELECT 1
      FROM [SSISDB].[internal].[environment_references] r
      JOIN [SSISDB].[internal].[projects] p
        ON p.project_id = r.project_id
      WHERE p.name = @project_name
        AND r.environment_name = @environment_name
      )
    BEGIN
        EXEC [SSISDB].[catalog].[create_environment_reference]
            @environment_name=@environment_name, @reference_id=@reference_id OUTPUT,
            @project_name=@project_name, @folder_name=@folder_name, @reference_type='R';
        --Select @reference_id
    END

    --\
    ---) Connect environment variables to package parameters, based on the name
    --/
    INSERT INTO @object_parameter_value_table (
        [object_type],
        [object_name],
        [parameter_name]
        )
    SELECT
        prm.[object_type],
        prm.[object_name],
        prm.[parameter_name]
    FROM
        [SSISDB].[internal].[object_parameters] prm
    JOIN
        [SSISDB].[internal].[environment_variables] ev
        ON ev.name = prm.parameter_name
    JOIN
        [SSISDB].[internal].[projects] prj
        ON prj.project_id = prm.project_id
    WHERE
        prm.[value_type] != 'R'
    AND
        prm.value_set = 0
    AND
        prj.name = @project_name
    ORDER BY
        prm.object_name, prm.parameter_name

    SELECT @id = 1, @max_id = MAX([id]) FROM @object_parameter_value_table
    WHILE @id <= @max_id
    BEGIN
        SELECT
          @object_type = v.[object_type],
          @object_name = v.[object_name],
          @parameter_name = v.[parameter_name]
        FROM @object_parameter_value_table v
        WHERE [id] = @id;

        SELECT @value_type = 'R', @parameter_value = @parameter_name;
        
        EXEC [SSISDB].[catalog].[set_object_parameter_value]
          @object_type, @folder_name, @project_name, @parameter_name,
          @parameter_value, @object_name, @value_type

        SET @id = @id + 1
    END

    GO

    My script that makes it possible to set up multiple environments quickly (download all the used scripts here).

    In the demo it would not be a problem to do all the configuration by hand, but if you have a lot of SSIS Packages with a lot of parameters, you really do not want to do this, especially not if you have different environments, like Development, Test and Production.
    This is why I made a script that automates this work for me:

    • It creates an environment, if it doesn’t exist yet.
    • It (re)creates environment variables, defined in the script.
    • It adds an environment reference to the project, if it doesn’t exist yet.
    • It connects environment variables to package parameters, based on the name.
      For this to work, it is important that package parameters with the same name used in different packages mean the same thing!
      For instance if you have a parameter SourceFolder used in different packages, it should be the same folder! If not, rename the parameters so that they have a unique name and – when mapped to environment variables – can both be assigned a different value, e.g. CrmSourceFolder and ErpSourceFolder.

    We run the script for the TST environment.

    SSIS-S01E08-300: The script was executed successfully.

    SSIS-S01E08-301: After the script has run, a TST folder with a TST environment inside exists. You can view or edit its properties.

    SSIS-S01E08-302: When the Variables pane is selected on the left, the environment variables created by the script are shown.

    SSIS-S01E08-303: View package configuration.

    SSIS-S01E08-304: The package parameter FruitName is now connected to environment variable FruitName.

    SSIS-S01E08-305: When I choose Execute ..

    SSIS-S01E08-306: … all I have to do is check Environment; it will automatically show .\TST

    SSIS-S01E08-307: After execution of the package, a row exists in the table (and one from a previous test)

    SSIS-S01E08-308: We can also execute the package with a SQL Agent Job ..

    SSIS-S01E08-310: Execute the package with a SQL Agent Job: all I have to do is check Environment, it will automatically show .\TST ..

    SSIS-S01E08-311: .. and all values are connected properly ..

    SSIS-S01E08-312: And yes, it works!
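
    For completeness: the same execution can also be started from T-SQL, attaching the environment reference that the script created. This is a sketch using the folder, project and environment names from above:

    USE [SSISDB]
    GO
    DECLARE @execution_id BIGINT, @reference_id BIGINT;

    -- Look up the reference to the TST environment for this project.
    SELECT @reference_id = r.[reference_id]
    FROM [catalog].[environment_references] r
    JOIN [catalog].[projects] p ON p.[project_id] = r.[project_id]
    WHERE p.[name] = N'SSIS_S01E08'
      AND r.[environment_name] = N'TST';

    EXEC [catalog].[create_execution]
        @folder_name = N'TST',
        @project_name = N'SSIS_S01E08',
        @package_name = N'AddFruit.dtsx',
        @use32bitruntime = 0,
        @reference_id = @reference_id,
        @execution_id = @execution_id OUTPUT;

    EXEC [catalog].[start_execution] @execution_id;
    GO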

    Extending the package with a FruitDescription

    I must admit, I cheated a little: I skipped the part of deploying the first version of the SSIS package to the DEV environment.
    So now I am going to deploy to a new DEV folder and environment for the first time, while it is already the second version of the package. This is not a recommended practice in real projects.
    Anyway, the package gets an extra parameter FruitDescription.
    SSIS-S01E08-313: Added FruitDescription

    SSIS-S01E08-314: Modify the Execute SQL Task to also insert the value for @FruitDescription [1]

    SSIS-S01E08-315: Modify the Execute SQL Task to also insert the value for @FruitDescription [2]

    SSIS-S01E08-316: Then Rebuild

    SSIS-S01E08-317: Then run the deployment wizard again, but now create a DEV folder

    SSIS-S01E08-318: Script modifications [1]: use a DEV folder and a DEV environment name.

    SSIS-S01E08-319: Add FruitDescription to the list of environment variables and change the values: the connection string must use the DEV database, and fruit name and description must differ from those in the TST environment.

    Then run the script.
    SSIS-S01E08-320: The package was deployed using the deployment wizard.

    SSIS-S01E08-321: The SSIS environment was created using the script.

    SSIS-S01E08-322: Environment variables were added using the script.

    SSIS-S01E08-323: Environment variables were connected to package parameters using the script.

    SSIS-S01E08-324: Package execution test: note the new parameter FruitDescription

    SSIS-S01E08-325: Test result

    SSIS-S01E08-326: Final result: different versions of the same SSIS package installed side by side on the same machine. But in addition to SSIS environments with variable values, folders inside the Integration Services Catalog are used for that!

    Download the scripts here.

    Conclusion / Wrap up

    With the project deployment model of SSIS, introduced in SQL Server 2012, deployment and configuration of SSIS Packages has become easier.
    However, it is important to understand how folders and environments in the Integration Services Catalog work. I have tried to explain this, and provided a script to set up multiple environments quickly.

    Why did you use those weird fruit examples?

    D=Development
    T=Test
    😉
    Read more here:
    Durian
    Tamarillo

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Implement logging in a Script Task (SSIS Series)

    Introduction

    This post is a follow-up on two older posts and will be the last one before my summer break.

    Today I want to enhance the SSIS package made in post Fixing corrupt csv files in a SSIS Package (SSIS Series) by using the Plug and Play Logging Solution.

    Problem

    As long as a script runs well, you might not need logging inside. BUT .. when something GOES wrong, having logging in the script can reduce the time needed to troubleshoot the issue.
    In the example of fixing csv files, anything can go wrong with a file, causing an error in the script. In that case it is, for instance, useful to know which file was being processed when the error occurred.

    Prerequisite

    For this to work, you have to install my Plug and Play Logging Solution, which can be downloaded from this post. It is just one SQL script you have to run; it creates a [logdb] database with objects inside. You can also add those objects to a different database that already exists.
    It is, however, not mandatory to rebuild the Fixing corrupt csv files in a SSIS Package example, as you can also add logging to your own Script Task.

    Solution

    This post is an enhancement of the Fixing corrupt csv files in a SSIS Package post, so you might need to read that post first if you have not done so already.

    Therefore I will dive directly into the changes that are needed to add logging to the Script Task.
    This will be done with lots of screenprints and some comments in between.

    SSIS-S01E07-169: Adding a package parameter “LoggingConnectionString”
    First you need to add a package parameter LoggingConnectionString that can be used in an expression of the OLE DB Connection and as input variable for the Script Task.

    SSIS-S01E07-170: OLE DB Connection configuration ..
    Then add an OLE DB Connection for the logging database.

    SSIS-S01E07-171: OLE DB Connection properties ..
    Use an expression to let the ConnectionString of the OLE DB Connection be equal to the value of the package parameter LoggingConnectionString.

    SSIS-S01E07-173: OLE DB Connection – Expression configuration [1]

    SSIS-S01E07-172: OLE DB Connection – Expression configuration [2]

    SSIS-S01E07-174: Final result
    By the fx icon you can see that the connection manager uses an expression.

    SSIS-S01E07-176: Enable logging
    Now enable logging for the package.

    SSIS-S01E07-177: Enable logging [2]
    Add a log provider for SQL Server and let it use the logdb.oledbConnection by selecting it under the Configuration column header.

    SSIS-S01E07-178: Enable logging [3]
    Then select all events. Filtering on what is actually logged is done by the logging solution (by the value of @MaxMessageClass, see this blog post for more details).

    SSIS-S01E07-179: Select the Script Task
    Select the Script Task and add the following Variables to ReadOnlyVariables:

    • System::ExecutionInstanceGUID
    • System::PackageID
    • $Package::LoggingConnectionString


    SSIS-S01E07-180: The added ReadOnlyVariables in the red rectangles

      Below you will find a number of screenprints of the script task to talk you through the changes.
      You can download the C# script here.

      SSIS-S01E07-181: Overview
      First make sure the Namespaces region is as shown.
      Then fold the namespace with the guid in the name, and paste the entire namespace HansMichielsCom.PlugAndPlaySSISLoggingSolution underneath it.
      This code is put in a separate namespace, so that it could also be placed in a .NET assembly that is added to the GAC (Global Assembly Cache). If you do that, you do not have to add the code to every Script Task.
      For today’s example, we just put this namespace inside the Script Task to keep things simple for now.

      SSIS-S01E07-182: Using the HansMichielsCom.PlugAndPlaySSISLoggingSolution namespace
      As a result, you have to tell the first guid-like namespace that you want to call code inside the second namespace. Therefore, add the using statement as shown above.

      SSIS-S01E07-183: Constant used for logging
      Below you will see some printscreens with changed parts in the script.

      SSIS-S01E07-184: Method GetSsisLogWriter to instantiate a SsisLogWriter object

      SSIS-S01E07-187: Method Main is extended with logging.

      SSIS-S01E07-188: Pass the logWriter as parameter to other methods ..

      SSIS-S01E07-189: IMPORTANT: Bugfix in CheckAndReturnHeader!

      SSIS-S01E07-190: IMPORTANT: Bugfix in CheckAndReturnHeader!
      (header == null) is added to cope with empty files.

      Testing, testing, one, two ..



      SSIS-S01E07-191: Test preparations [1]

      SSIS-S01E07-193: Test preparations [2]

      SSIS-S01E07-194: Test execution

      SSIS-S01E07-195: Test result: logging rows done inside the script are in the log table.

      Conclusion / Wrap up

      In this post I have demonstrated how to implement logging in SSIS Script Tasks using my Plug and Play Logging Solution.
      This type of logging gives more control over what to log and how to log it than implementing logging using SSIS events.
      The examples given are very basic. You can use your imagination to implement logging of errors using a try .. catch block, or use all available parameters of logWriter.AddLogEntry to change the Retention Class, Message Class, and so on.

      In the summer I will take some time for study, reflection, holiday, and still .. work.
      My next post will be early September at the latest, maybe earlier.

      Download the C# script here.

      (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Zeros, bloody zeros! (Data Vault Series)

    Introduction

    I must admit I have a weakness for British humour.
    When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
    For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
    It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.

    DV-S01E05-meetings: Mr John Cleese

    But, obviously, this post is not about meetings but about zeros.

    Problem

    I can be short about that: leading zeros in business key values.
    For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
    This would not be a problem if it were always delivered exactly like that. But to be honest, you cannot and will not know that upfront. Even if this is the case now, it might not be in the future.
    When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem.
    Because every business key gets a different row in the hub, and this customer now exists multiple times!

    DV-S01E05-zeros: (No acting here, this is how I look sometimes)

    Solution

    If you are familiar with Data Vault, you might already think of same-as-links to solve this.
    But I think the solution should be implemented earlier, to avoid having multiple hub rows.
    Simply always remove leading zeros when the source column is (part of) a business key (either a primary or a foreign key) and appears to be a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty much impossible that 001806 and 1806 would refer to two different customers.
    Unless, of course, they come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when the leading zeros are dropped, satellite rows of different customers (in different source systems) could end up connected to the same hub row! In that situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance CRM.1806 and ERP.1806.
    In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.

    How to implement the solution

    There are many ways to remove leading zeros. When I was searching for this I had two requirements:

    • No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
    • No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.

    After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, my credits for his work), but appeared to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
    So I had to add some logic to it: a CASE expression in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) in the SSIS expression.
    Furthermore I came across a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
    For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
    So .. this is what it has become:

    T-SQL

    Remove_leading_zeros.sql

    SELECT
        example.[id_with_leading_zeros],
       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
       END AS [id_without_zeros_method1],

       CASE
          WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
          WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
          ELSE REPLACE(REPLACE(LTRIM(REPLACE(-- Robin Hames' method
                REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
                , '0', ' ') -- replace '0' with ' '
                ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
                , ' ', '0') -- change ' ' back to '0'
                , '!#!', ' ') -- change '!#!' back to ' '
       END AS [id_without_zeros_method2]
    FROM
        (
        SELECT
            TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
        FROM
            master.sys.objects
        UNION
        SELECT N' 00000 '
        UNION
        SELECT N'00'
        UNION
        SELECT N' '
        UNION
        SELECT ' 0099990 A '
        UNION
        SELECT '-5550'
        ) example

    SSIS Expression (can be used in Derived Column)

    (LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

    DV-S01E05-151
    In a Derived Column Transformation this looks for instance like this
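
    By the way, if you would rather do this in a C# Script Component, a minimal sketch could look like the method below. It only handles leading spaces and zeros; treat the T-SQL and SSIS expressions above as the reference versions.

    RemoveLeadingZeros.cs (sketch)

    private static string RemoveLeadingZeros(string value)
    {
      // Null, empty or whitespace-only input: return an empty string.
      if (value == null || value.Trim().Length == 0) return string.Empty;

      // Trim surrounding spaces first, then strip the leading zeros.
      string trimmed = value.Trim().TrimStart('0');

      // If everything was stripped away, the value consisted of zeros only, so keep a single '0'.
      return trimmed.Length == 0 ? "0" : trimmed;
    }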

    Conclusion / Wrap up

    In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
    This post also contains different ways to remove leading zeros, two for T-SQL and one for a SSIS expression.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Adding custom header and footer to csv files (SSIS Series)

    Introduction

    This article wasn’t planned. SSIS again? No, I wanted to write a post about something else.
    Nevertheless I got this question on one of my previous posts and thought, I might as well create a blog post about it.

    Problem

    “Hi Hans,
    I have created my ssis package for flat file destination. My problem is that I want to add header and footer in my flat file destination to send it to FTP location format will be

    0|MMDDYYYY|HHMMSS
    1|Row1
    1|Row2
    1|Row3
    1|ROW4
    9|6

    The last line contains no of row count including header and trailer
    Please let me know how to do that.. I am completely stuck with that and no good luck.. “

    So to summarize, both the header and the footer line of the csv file need to contain dynamic content.
    Also each row starts with a ‘row type’: 0 for the header, 1 for every detail row, and 9 for the footer.

    Solution

    As an example I just use a simple Data Flow Task with an OLEDB Source connected to table DimCurrency from the AdventureWorksDW database.
    The data then flows to a Script Component used as a Destination.
    Here are a few printscreens to demonstrate what I mean. The Script Component will create the file.
    The package is made using SSIS 2014.

    Developing the package

    SSIS-S01E04-099
    The demo package just contains one Data Flow Task

    SSIS-S01E04-205
    Design of the Data Flow Task

    SSIS-S01E04-101
    OLEDB Source: DimCurrency from the AdventureWorksDW database

    SSIS-S01E04-102
    OLEDB Source: select all three columns

    SSIS-S01E04-208
    When you drag the Script Component to the Data Flow, you will get this dialog. Use the Script Component as a Destination

    SSIS-S01E04-207
    Variables that the package contains.

    SSIS-S01E04-209
    In the Script Transformation Editor, add both variables User::HeaderTimestamp and User::TargetCsvPathAndFileName to ReadOnlyVariables

    Now comes the most interesting part! We use the endless possibilities of C# to fix our problem.
    In the Script Transformation Editor window, press Edit Script ….

    Please note that you can download the entire script here, so you do not have to copy the (sometimes malformatted) code from this webpage..

    Then in the namespaces region add two namespaces, as indicated below.

    Namespaces region
    #region Namespaces
    using System;
    using System.Data;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
    using Microsoft.SqlServer.Dts.Runtime.Wrapper;
    // Added namespaces:
    using System.IO;
    using System.Text;
    #endregion

    Above public override void PreExecute() add the copyright notice and the following two regions: Configuration and Private variables.
    Please note that these regions do not exist yet, so you have to copy the code including the #region and #endregion lines.

    New regions
      /*
      (c) Copyright 2016 - hansmichiels.com
     
      This program is free software: you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation, either version 3 of the License, or
      (at your option) any later version.
     
      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.
     
      You should have received a copy of the GNU General Public License
      along with this program. If not, see http://www.gnu.org/licenses/.
      */

     
      #region ***** Configuration *****

      // Here you can change the column delimiter and the text qualifier easily.
      private const string _delimiter = "|";
      private const string _textQualifier = "";
      // To set the text qualifier to " use:
      // private const string _textQualifier = "\""; // I recommend to use a text qualifier.
      
      // Every x rows, write the buffer to the file, and clear the buffer.
      // This is faster than writing every individual line to the file.
      // The higher the value of _bufferRows is, the more memory will be used and the less write actions are done to the csv file.
      // A higher value is a good idea if you have enough memory in your server.
      private const int _bufferRows = 100000;

      #endregion ***** Configuration *****

      #region Private variables

      // Variable to hold the path and filename
      private string _fileName;
      // Variable to count the rows
      private int _rowCount;
      // StringBuilder to hold a buffer of text that still needs to be written to the csv file.
      private StringBuilder _fileContentsBuffer;

      #endregion Private variables

    PreExecute() is called before the data starts to flow through. Here we do some preparation and compose the headerline of the file.

    PreExecute()

      /// <summary>
      /// This method is called once, before rows begin to be processed in the data flow.
      /// </summary>
      public override void PreExecute()
      {
        base.PreExecute();

        // Initialize variables
        _rowCount = 0;
        _fileContentsBuffer = new StringBuilder();
        _fileName = Variables.TargetCsvPathAndFileName;
        DateTime headerTimestamp = Variables.HeaderTimestamp;

        // Append the new line to the buffer.
        // This will be written to file with the first detail rows.
        // If there are no detail rows, no file will be created.
        _fileContentsBuffer.AppendFormat("{1}0{1}{0}{1}{2:ddMMyyyy}{1}{0}{1}{2:HHmmss}{1}\r\n", _delimiter, _textQualifier, headerTimestamp);
        }

    PostExecute() is called after the data has flowed through. Here we append the last lines in the _fileContentsBuffer variable and the footerline to the file.

    PostExecute()
      /// <summary>
      /// This method is called after all the rows have passed through this component.
      ///
      /// You can delete this method if you don't need to do anything here.
      /// </summary>
      public override void PostExecute()
      {
        base.PostExecute();
        // If no detail rows, bail out. No file will be created when there are no detail rows.
        if (_rowCount == 0) return;

        using (StreamWriter sw = File.AppendText(_fileName))
        {
          // Append a footer line with the rowcount, example value: 9|105
          // The rowcount is a count of the detail rows, but the footer count
          // needs to include header and footer, so we add 2 to the value of _rowCount.
          _fileContentsBuffer.AppendFormat("{1}9{1}{0}{1}{2}{1}", _delimiter, _textQualifier, _rowCount + 2);
          sw.Write(_fileContentsBuffer.ToString());
          _fileContentsBuffer.Clear();
        }
      }

    Input0_ProcessInputRow(Input0Buffer Row) is the place where all the data processing takes place. Because this script is the Destination for the data, we need to write the data to the file ourselves; there is no Flat File Destination that will do it for us.
    So here we append every detail row to the _fileContentsBuffer variable. Every x rows, we write the buffer to the file and clear the buffer again.

    Input0_ProcessInputRow

      /// <summary>
      /// This method is called once for every row that passes through the component from Input0.
      /// </summary>
      /// <param name="Row">The row that is currently passing through the component</param>
      public override void Input0_ProcessInputRow(Input0Buffer Row)
      {
        // Increase the rowcounter
        _rowCount++;

        // Append the new line to the buffer.
        // The format needs to be like "{1}[middle part]{1}\r\n", so that the row starts and ends with the text qualifier.
        // Replace [middle part] with {1}{0}{1}{x} for every column that you want to append to the file, where x is 2 for column 1, 3 for column 2 and so on.
        _fileContentsBuffer.AppendFormat("{1}1{1}{0}{1}{2}{1}{0}{1}{3}{1}{0}{1}{4}{1}\r\n", _delimiter, _textQualifier,
          SafeValue(Row.CurrencyKey), // First column, {2} in the format
          SafeValue(Row.CurrencyAlternateKey), // Second column, {3} in the format
          SafeValue(Row.CurrencyName) // Third column, {4} in the format
                 // etc.
          );
        
        // Every x rows, write the buffer to the file, and clear the buffer.
        if (_rowCount % _bufferRows == 0)
        {
          using (StreamWriter sw = File.AppendText(_fileName))
          {
          sw.Write(_fileContentsBuffer.ToString());
          _fileContentsBuffer.Clear();
          }
        }
      }

    Multiple overloaded versions of the SafeValue method (for different datatypes like string, int, DateTime, you can add more when needed) add escape characters to the column values when needed.

    SafeValue method and overloads
      /// <summary>
      /// Escapes a double quote delimiter in a column value by doubling it.
      /// </summary>
      /// <param name="value"></param>
      /// <returns></returns>
      private string SafeValue(string value)
      {
        // If _textQualifier is null or empty, just return the value.
        if (string.IsNullOrEmpty(_textQualifier)) return value;

        if (value.Contains(_textQualifier))
        {
        // If the value contains one or more text qualifiers,
        // escape them by replacing them with two text qualifiers.
        return value.Replace(_textQualifier, _textQualifier + _textQualifier);
        }
        return value;
      }

      /// <summary>
      /// Overloaded method for DateTime value (meant as an example)
      /// For other data types, create new overloaded methods,
      /// and in that overloaded method convert the value to string,
      /// before calling the 'SafeValue' method with a string parameter.
      /// </summary>
      /// <param name="value"></param>
      /// <returns></returns>
      private string SafeValue(DateTime value)
      {
        return SafeValue(value.ToString("yyyy-MM-dd HH:mm:ss.fff"));
      }

      /// <summary>
      /// Overloaded method for int value
      /// </summary>
      /// <param name="value"></param>
      /// <returns></returns>
      private string SafeValue(int value)
      {
        return SafeValue(value.ToString());
      }
    }
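
      If you need more data types, an overload for decimal values could for instance look like this. It is just an illustration of the pattern described above and not part of the downloadable script.

      /// <summary>
      /// Overloaded method for decimal value (illustration only)
      /// </summary>
      /// <param name="value"></param>
      /// <returns></returns>
      private string SafeValue(decimal value)
      {
        // Use the invariant culture so the decimal separator does not depend on regional settings.
        return SafeValue(value.ToString(System.Globalization.CultureInfo.InvariantCulture));
      }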

    When the script is ready, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio window. Then close that window. You return to the Script Transformation Editor; close it by pressing OK.

    Testing the package

    I simply run the package in debug mode. 105 currency rows are added to the csv file.
    Including the header and footer, the csv file contains 107 rows, and this is the number that is written to the footer line.

    SSIS-S01E04-110
    Running the package in debug mode

    SSIS-S01E04-111
    First and last part of the created csv file.

    Please note that to change the delimiter and text qualifier you only have to make two changes in the ***** Configuration ***** region.

    Configuration
      #region ***** Configuration *****

      // Here you can change the column delimiter and the text qualifier easily.
      private const string _delimiter = ";";
      private const string _textQualifier = "\"";
    . . .

      #endregion ***** Configuration *****

    By this change the file will look as follows:
    SSIS-S01E04-112
    First and last part of the created csv file with a different delimiter and text qualifier.

    I also did a test with a source query that returned over 1.1 million rows, and the csv file of 25 MB was created in just 3 seconds (on a core i7 laptop with SSD).
    So my conclusion is that the performance of this solution is very satisfying.

    Download the C# script for the script component here.

    Conclusion / Wrap up

    In this post I have demonstrated how to use SSIS 2014 to create a csv file with a dynamic header (with a timestamp) and footer line (including a rowcount) using a Script Component as a Destination in a Data Flow Task.

    (Promotional)

    Big Data & Analytics insight 2016: save the date for an insightful conference
    (as a speaker I was asked to do a bit of promotion ..)

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Fixing corrupt csv files in a SSIS Package (SSIS Series)

    (Promotional)

    Big Data & Analytics insight 2016: save the date for an insightful conference
    (as a speaker I was asked to do a bit of promotion ..)

    Introduction

    A few weeks ago my blog post was about using a C# Script Component to detect duplicates in a Data Flow Task.

    This week I want to discuss its brother, the C# Script Task, which can be used in the Control Flow.

    Again, extremely flexible: basically you embed a small C# program inside your package, and you can use the entire .NET framework for all kinds of things. I used it to fix malformed csv source files, before processing them in a Data Flow Task.

    Pros and Problems

    Csv (comma separated values) files have their pros and problems, as compared with other formats, like XML.

    A few pros are that they are usually smaller in size than XML files, and easier and faster to load. These are two reasons why csv files are still used a lot these days, despite the fact that XML files are in some cases more reliable, and more flexible as far as the file contents are concerned.

    Unfortunately csv files are not so good at holding memo/comment fields that might contain carriage returns/line feeds. Memo fields with those characters do not stay on one line in the file, making the csv file more difficult to process (at least the standard flat file source cannot handle it in all cases).

    Also the structure is very strict. If a column is added to a csv source file, your Data Flow Task will break.

    Solution

    In this post I will discuss a possible solution for both problems:

    1. carriage return/line feeds in contents of csv files
    2. changing file formats over time

    This post assumes that you have worked with SSIS before, know what For Each Loop Containers and Data Flow Tasks do, and recognize the csv problems above. It is a big advantage if you can read and understand C# code.

    Okay, let’s start!

    Scenario

    My demo is based on the following hypothetical scenario:

    1. You need to process multiple CustomerDetails.csv files.
    2. In this file a pipe (|) is used as separator between the columns.
    3. The file contains a header line: SocialSecurityNumber|FirstName|LastName|Gender|Residence|Comments
    4. The Comments column is added later, your package also needs to be able to load historic files without the Comments column.
    5. CustomerDetails.csv files come from hundreds of locations and systems (hereafter called: sources) worldwide. Fixing the csv files in the export processes at all these locations is impossible. Even if some sources deliver correct csv files, your SSIS package cannot rely on the fact that all sources will do so.

    Overview

    Here are a few pictures of the end result:
    SSIS-S01E03-109
    The connection managers that are needed

    SSIS-S01E03-110
    The Control Flow
    SSIS-S01E03-111
    The Data Flow

    Detailed steps

    Parameters and variables

    First you need to add a few parameters and variables to the package.

    Parameters:

    • SourceFolder: The folder that the For Each Loop Container reads for csv files.
    • ArchiveFolder: The folder to which files are moved after being processed, if no error occurred.
    • ErrorFolder: The folder to which files are moved that caused an error during processing.

    All parameters are Required and their Data type is String.

    SSIS-S01E03-113
    The parameters of the package

    Variables:

    • CRLF: This variable will be filled with a Carriage Return/Line Feed in the Script Task.
    • CRLF_Replacer: This is the character string that is used to replace CarriageReturn/LineFeeds in the Comment column of the file. It must be a character string that does not occur in the Comment field. Therefore ~CRLF~ is used. If you want a different character string, you can change it to something else.
    • SourcePathAndFilename: The full path and filename of the original file to process. It is filled by the For Each Loop Container. The value set at design time is unimportant.
    • PathAndFilenameToProcess: The full path and filename of the file to process. This can be either the original file, or the modified temporary file. The Flat File Connection Manager uses this path and filename. The value set at design time is unimportant.

    The Data type of all variables is String.

    SSIS-S01E03-114
    SSIS variables needed to do the job

    The Control Flow – For Each Loop Container

    Add a For Each Loop Container FELC – Load sourcefiles and configure it as a Foreach File Enumerator. Retrieve file name Fully qualified.

    SSIS-S01E03-115

    On the Collection pane, just below the Enumerator type, press the ellipsis next to Expressions. In the Property Expressions Editor configure the Property Directory to be set by
    Expression @[$Package::SourceFolder].
    SSIS-S01E03-116

    On the Variable Mappings pane, map variable User::SourcePathAndFilename to Index 0
    SSIS-S01E03-102
    For Each Loop Container configuration

    The Control Flow – Script Task

    Next, add a Script Task SCR – Fix csv file if needed.
    In the Script Task Editor, configure the following:

    • ReadOnlyVariables: User::CRLF_Replacer,User::SourcePathAndFilename
    • ReadWriteVariables: User::CRLF,User::PathAndFilenameToProcess

    SSIS-S01E03-117
    Variable configuration in the Script Task Editor

    Then press the Edit Script… button.
    Don’t get scared by the amount of C# code, if you keep your head cool, it will be fixed in 10 minutes!

    Adjust the script as follows:

    • In the Namespaces region, if not already there, add:
      using System.IO;
      using System.Text;


    • SSIS-S01E03-108
      Replace the highlighted part above with the code below

      Now remove public void Main() completely and replace it with the code underneath (to avoid malformatting due to html rendering, please download the original csharp script here).
      Then check the code for the !!! ATTENTION !!! comments. These are points in the script that you might need to change.

      ScriptTask.cs

        /*
        (c) Copyright 2016 - hansmichiels.com
       
        This program is free software: you can redistribute it and/or modify
        it under the terms of the GNU General Public License as published by
        the Free Software Foundation, either version 3 of the License, or
        (at your option) any later version.
       
        This program is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
        GNU General Public License for more details.
       
        You should have received a copy of the GNU General Public License
        along with this program. If not, see http://www.gnu.org/licenses/.
        */

       
        /// This method is called when this script task executes in the control flow.
        /// Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
        /// To open Help, press F1.
        /// </summary>
       public void Main()
       {
        // Copy a few SSIS variables to C# string variables.
      string sourcePathAndFilename = Dts.Variables["User::SourcePathAndFilename"].Value.ToString();
      string crlfReplacer = Dts.Variables["User::CRLF_Replacer"].Value.ToString();
    
      // Set the variable that needs to hold a CarriageReturn/LineFeed.
      Dts.Variables["User::CRLF"].Value = string.Format("\r\n");
       
         // Call the method that checks if the sourcefile is corrupt and if so, fixes it.
         FixFileIfCorrupt(sourcePathAndFilename, crlfReplacer);
       
         // Set returnvalue for script task.
        Dts.TaskResult = (int)ScriptResults.Success;
       }
       
        /// Csv-file can have lines with carriage return-linefeeds, causing errors during the load.
        /// By pre-investigating the file we can correct the file and prevent these errors from occurring.
        /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
        /// Author: Hans Michiels
        /// License: GNU General Public License, see http://www.gnu.org/licenses/
        /// </summary>
        private void FixFileIfCorrupt(string sourcePathAndFilename, string crlfReplacer)
        {
         string header = string.Empty;
         string headerAppend = string.Empty;
         string detailAppend = string.Empty;
       
         // Initialize pathAndFilenameToProcess with sourcePathAndFilename. Overwrite later if needed.
         string pathAndFilenameToProcess = sourcePathAndFilename;
       
         // Check if the header of the file is as expected, and return the header row through a ref parameter.
         // If the file header is not as expected, do not touch the file and exit this method.
         if (!CheckAndReturnHeader(sourcePathAndFilename, ref header, ref headerAppend, ref detailAppend)) return;
         
         // Get the encoding of the current file. We need this if we have to create a corrected file.
         var encoding = GetEncoding(sourcePathAndFilename);
       
         // Read all lines of the file into a string array.
         // !!! ATTENTION !!! WARNING: only do this when the files to process are relatively small.
         // !!! ATTENTION !!! When in doubt, test it and monitor memory usage.
         // (I suppose files up to 50~100 MB might not be a big problem, but larger ones might be; testing is the way to find out)
         string[] lines = File.ReadAllLines(sourcePathAndFilename, encoding);
       
         // Use a variable to set when the file has changed due to corrections this method makes.
         bool fileHasChanged = false;
         bool lineIsOkay = true;
       
         // appendbuffer is used to keep parts of 'broken' lines.
         string appendbuffer = string.Empty;
       
         // Get lower- and upperbound of the string array into variables.
         int lower = lines.GetLowerBound(0);
         int upper = lines.GetUpperBound(0);
       
         // Loop through all lines in the file from back to front!
         for (int i = upper; i >= lower; i--)
         {
          // If there is still some text in the buffer, append it to the line and make the buffer empty again.
          if (appendbuffer != string.Empty)
          {
            lines[i] = lines[i] + appendbuffer;
            appendbuffer = string.Empty;
            fileHasChanged = true;
          }
       
          // reset lineIsOkay variable, set to true underneath when the line passes the 'line ok' checks.
          lineIsOkay = false;
       
          // !!! ATTENTION !!!
          // Here you need to add some custom code.
          // Describe one or more properties of a correct line, e.g.
          // - first 8 characters are numeric;
          // - The length is at least 12;
          // - It contains at least x delimiters;
          // etc.
          // Then convert this check to C# code. Use your imagination if needed!
       
          // I check for:
          // - the length to be more than 12 characters.
          // - the first 8 characters (the social security number) to be numeric.
          // - at least 4 pipes.
          if (lines[i].Length > 12)
          {
            string checkString = lines[i].Substring(0, 8);
            int intOutput;
       
            lineIsOkay = (Int32.TryParse(checkString, out intOutput));
       
            if (lineIsOkay)
            {
             // Check that the line contains at least 4 delimiters.
             lineIsOkay = (lines[i].Split('|').Length - 1 >= 4);
            }
          }
       
          if ((!lineIsOkay) && (i > lower))
          {
            // Line is NOT OK.
            // I have to append this line to the previous one, except when it is the header.
            // To do this put this line into appendbuffer variable.
            // I use a replacement string for the removed carriage return/line feed.
            // This will be converted back to a carriage return/line feed in the Data Flow.
            appendbuffer = string.Format("{0}{1}", crlfReplacer, lines[i]);
            lines[i] = string.Empty; // Package can handle empty lines.
            fileHasChanged = true;
          }
       
         }
       
         // !!! ATTENTION !!! If you do not have old fileformats you can remove this block.
         // Code to handle the old file format, without a Comments column.
         // The code below reformats the file so that it become always in new fileformat, with an empty comment.
         if (headerAppend != string.Empty)
         {
          // This is an old format without the Comments column. Append Comments column header to header and a delimiter (pipe) to detail rows.
          fileHasChanged = true;
       
          for (int i = upper; i >= lower; i--)
          {
           if (lines[i].Length > 0)
           {
            if (i > lower)
            {
              lines[i] += detailAppend;
            }
            else
            {
              lines[i] += headerAppend;
            }
           }
          }
         }
       
         if (fileHasChanged)
         {
          // Here the modified/corrected file is stored to a temporary file, so a "tmp" extension is added to the modified filename.
          // For auditing purposes I keep the original file and not the corrected one,
          // because the latter can always be recreated when needed.
          // Also, when a bug would exist in the creation of the corrected file
          // (initially I had a problem with codepages, which should now be solved),
          // the original file might get lost forever.
          pathAndFilenameToProcess = String.Format("{0}.tmp", sourcePathAndFilename);
       
          // Now create a new corrected csv file.
          File.WriteAllLines(pathAndFilenameToProcess, lines, encoding);
         }
       
         // Also change the SSIS Package variable, because the Flat File Source uses it.
          Dts.Variables["User::PathAndFilenameToProcess"].Value = pathAndFilenameToProcess;
       
        }
       
        /// <summary>
        /// This method checks if the header is as expected. If not, return false.
        /// Also the value of by ref parameters 'header', 'headerAppend' and 'detailAppend' is set.
        /// Source: http://www.hansmichiels.com/2016/05/22/fixing-corrupt-csv-files-in-a-ssis-package-ssis-series/
        /// Author: Hans Michiels
        /// License: GNU General Public License, see http://www.gnu.org/licenses/
        /// </summary>
        /// <param name="sourcePathAndFilename"></param>
        /// <param name="header"></param>
        /// <param name="headerAppend"></param>
        /// <param name="detailAppend"></param>
        private bool CheckAndReturnHeader(string sourcePathAndFilename,
          ref string header, ref string headerAppend, ref string detailAppend)
        {
         // Initialize return values.
         header = string.Empty;
         headerAppend = string.Empty;
         detailAppend = string.Empty;
       
         // Open the file to read the header line, then close the file again.
         using (StreamReader reader = new StreamReader(sourcePathAndFilename))
         {
          header = reader.ReadLine();
          reader.Close();
         }
         // !!! ATTENTION !!! Here you could do a check on (parts) of the header,
         // so that you do not change files with an unexpected format.
          if (!header.ToLower().StartsWith("socialsecuritynumber|firstname|lastname|gender|residence"))
         {
          return false;
         }
         // !!! ATTENTION !!! Here you could check for old file formats.
         // Old formats can only be handled when one or more new columns were added at the end of the csv lines.
         // In the example, there is a file format without the Comments column still being delivered.
         // By appending the Comments column with an empty value for those files, they can be processed
         // by the same Data Flow Task using only one Flat File Source Connection.
          if (!header.ToLower().Contains("|comments"))
          {
           // This is an old format without the Comments column.
           // Append Comments column header to header and a delimiter (pipe) to detail rows.
           headerAppend = "|Comments";
           detailAppend = "|";
         }
         return true;
        }
       
        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM).
        /// Defaults to ASCII when detection of the text file's endianness fails.
        /// Source: http://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
         /// With thanks to the author '2Toad'.
         /// </summary>
         /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding.</returns>
        public static Encoding GetEncoding(string filename)
        {
         // Read the BOM
         var bom = new byte[4];
         using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
         {
          file.Read(bom, 0, 4);
         }
       
         // Analyze the BOM
         if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
         if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
         if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
         if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
         if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
         // return Encoding.GetEncoding(1252); // Without BOM, default to Windows Codepage 1252
         return Encoding.Default; // Without BOM, the encoding for the OS's current ANSI code page is returned by default.
        }

      When you have adjusted the script for your file, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio window. Then close that window. You return to the Script Task Editor; close it by pressing OK.

      Developing the Data Flow Task

      Add a Data Flow Task to the Control Flow and name it DFT – Import Customer Details.
      Then add the following components to it:

      Flat File Source FF_SRC – CustomerDetails

      From the SSIS Toolbox, drag a Flat File Source to the Data Flow area and create a new Flat File connection manager.
      SSIS-S01E03-121

      If you want to rebuild the demo, download the csv files (and C# script) here.
      Configure the Flat File Connection as shown in the picture below:
      SSIS-S01E03-118

      Choose Advanced in the pane on the left and configure the properties for each column as follows:

      • SocialSecurityNumber: DataType string [DT_STR], OutputColumnWidth 9
      • FirstName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
      • LastName: DataType Unicode string [DT_WSTR], OutputColumnWidth 50
      • Gender: DataType string [DT_STR], OutputColumnWidth 1
      • Residence: DataType Unicode string [DT_WSTR], OutputColumnWidth 60
      • Comments: DataType Unicode string [DT_WSTR], OutputColumnWidth 1000

      SSIS-S01E03-120
      Example of column properties

      Conditional Split CSPL – Filter out empty lines

      We use a Conditional Split to filter out empty lines in the csv file. Because the SocialSecurityNumber is a required column, we assume that the line is empty (at least it is useless) if SocialSecurityNumber is empty.
      So add an Output named Empty lines with Condition SocialSecurityNumber == “”. If you like, you can check on multiple columns to be empty.
      Rename the default Output to Non-empty lines.

      SSIS-S01E03-122
      Conditional Split Configuration

      Derived Column DC – Re-insert CRLF

      Because the (corrected) file might contain replacements for carriage return/line feeds, we are going to replace them with the original carriage return/line feeds, before the value is inserted into the database table.
      In the Derived Column Transformation Editor, add a derived column, configured as follows:

      • Derived Column Name: Comments
      • Derived Column: Replace ‘Comments’
      • Expression: REPLACE(Comments,@[User::CRLF_Replacer],@[User::CRLF])

      SSIS-S01E03-123
      The Derived Column Transformation Editor

      OLEDB Destination OLE_DST stg DimCustomer

      To rebuild the demo, create a table to store the rows from the data flow task,
      add a OLEDB Destination that uses this table.

      destination_table.sql

      SET ANSI_NULLS ON
      GO
      SET QUOTED_IDENTIFIER ON
      GO
      SET ANSI_PADDING ON
      GO
      CREATE TABLE [stg].[DimCustomer](
       [SocialSecurityNumber] [char](9) NOT NULL,
       [FirstName] [nvarchar](50) NULL,
       [LastName] [nvarchar](50) NULL,
       [Gender] [char](1) NULL,
       [Comments] [nvarchar](1000) NULL,
       [Residence] [nvarchar](60) NULL,
       CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED
      (
       [SocialSecurityNumber] ASC
      )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
      ) ON [PRIMARY]
      GO
      SET ANSI_PADDING OFF
      GO

      The Control Flow – File System Tasks

      Finally the File System Tasks archive the original file and delete the corrected file (if it exists).
      For auditing purposes I keep the original file and not the corrected one, because the latter can always be recreated when needed.
      Also, when a bug would exist in the creation of the corrected file (initially I had a problem with codepages, which should now be solved), the original file might get lost forever.

      Please go back to the picture of the The Control Flow at the beginning of this post, to see how the File System Tasks are connected.
      Next are the printscreens that should help you to configure the File System Tasks.

      SSIS-S01E03-124
      Configuration of “FST – Move original file to archive folder”

      SSIS-S01E03-125
      Configuration of “FST – Move original file to error folder”

      SSIS-S01E03-126
      Precedence constraint of “FST – Delete temporary file” (you need to set it twice).

      SSIS-S01E03-128
      Configuration of “FST – Delete temporary file”

      The proof

      To show you that it works, I run the package in debug mode.
      To be able to have a look at the temporary/corrected files afterwards, I have disabled the File System Task that deletes this file.
      SSIS-S01E03-130

      When I open the “CustomerDetails.csv.tmp” file after the package has run, I can see that line 15 was corrected.
      From the second screenshot you can see that the CRLF_Replacer value was inserted twice.
      SSIS-S01E03-131
      SSIS-S01E03-132

      And when I run a query on the destination table, I can see that the Comments column contains the original CarriageReturn/LineFeeds!
      SSIS-S01E03-134

      By the way, all the data in the csv files is generated from random firstnames, lastnames and residences and fake social security numbers. Any match with existing persons would be based solely on coincidence.

      Conclusion / Wrap up

      Csv files still are useful today, because they have some advantages when compared with XML files.
      However they also have their limitations, which can be frustrating when loading those files with a Data Flow Task.
      In this blog post I have demonstrated how to overcome a few of those limitations, namely:

      1. carriage return/line feeds in contents of csv files
      2. changing file formats over time

      The solution is based on using a C# Script Task.

      Please be aware that I do not pretend this is the best solution.
      If you are very familiar with C#, you could do all of the processing in the Script Task using SqlBulkCopy, and you would not need a temporary file (see the sketch below).
      However the Data Flow Task offers some advantages, like lookups you can use, without the need to do this in a script.
      If you create good Interface Specification Documents beforehand (for instance describing how to handle carriage return/line feeds in column values) you can force all sources to deliver correct csv files, and then they do not have to be corrected.
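
      For the curious, a very rough sketch of that SqlBulkCopy idea is shown below. It is hypothetical code (the table and column names are just the ones from this demo, the connection string handling is left out, and the CRLF_Replacer values would still need to be converted back), meant only to illustrate the approach, not a tested replacement for the Data Flow Task.

      SqlBulkCopy sketch

      // Needs: using System.Data; using System.Data.SqlClient;
      private static void BulkLoad(string[] lines, string connectionString)
      {
        // Build a DataTable with the same columns as the staging table.
        DataTable table = new DataTable();
        table.Columns.Add("SocialSecurityNumber", typeof(string));
        table.Columns.Add("FirstName", typeof(string));
        table.Columns.Add("LastName", typeof(string));
        table.Columns.Add("Gender", typeof(string));
        table.Columns.Add("Residence", typeof(string));
        table.Columns.Add("Comments", typeof(string));

        // Skip the header line (index 0) and add the detail rows.
        for (int i = 1; i < lines.Length; i++)
        {
          if (lines[i].Length == 0) continue;
          string[] parts = lines[i].Split('|');
          table.Rows.Add(parts[0], parts[1], parts[2], parts[3], parts[4],
            parts.Length > 5 ? parts[5] : string.Empty);
        }

        // Bulk insert the rows into the staging table, mapping columns by name.
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
        {
          bulkCopy.DestinationTableName = "stg.DimCustomer";
          foreach (DataColumn col in table.Columns)
          {
            bulkCopy.ColumnMappings.Add(col.ColumnName, col.ColumnName);
          }
          bulkCopy.WriteToServer(table);
        }
      }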

      Download the C# script and csv files here.

      (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

    Detect duplicates in a Data Flow Task (SSIS Series)

    Intended audience

    This article assumes that you have at least a basic understanding of SSIS, and the Data Flow Task in particular. C# knowledge is an advantage but not required.

    Introduction

    If you ever used the T-SQL windowing function ROW_NUMBER() OVER (PARTITION BY ..) to find duplicates in a query result, this post might be of interest for you.

    Not all sources having potential duplicate entries (e.g. business keys) come from a SQL Server database. If the source is a csv or Excel file, ROW_NUMBER will not automatically be your friend (it might be when you use OPENROWSET, but I think that is a rather complicated solution).

    C# script components are extremely flexible yet fairly easy to understand, develop and maintain.
    This is why this is my preferred solution to tackle the problem above.
    Let’s begin!

    How to detect duplicates in a Data Flow Task using a C# Script component

    First an overview of the end result (the Data Flow of the Data Flow Task). I will tell you step by step how to achieve this using SSIS 2014. I did not make a print screen of the Control Flow because it only contains one Data Flow Task.
    Please note that I enabled two Data Viewers on both outputs of the Conditional Split component, so you can see the data that flows through (screenshot later).
    SSIS-S01E02-image01
    For the demo I will use the following components in a dataflow. To solve your problem just pick out the things you need.

    • An OLE DB Source
    • A script component (the most interesting part!)
    • An (optional) conditional split
    • Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates.

    An OLE DB Source

    My example is based on AdventureWorks. I select customers from the [AdventureWorksDW] database, and make sure there are duplicates by using UNION ALL in the SQL statement.
    In the real world it would be more likely that your source would not be a SQL Server query, but for instance a csv- or Excel file.

    OLEDB Source.sql

    SELECT
      [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome]
    FROM
      [dbo].[DimCustomer]
    UNION ALL
    SELECT
      [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome]
    FROM
      [dbo].[DimCustomer]
    UNION ALL
    SELECT
      [CustomerKey], [CustomerAlternateKey] AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome]
    FROM
      [dbo].[DimCustomer]
      WHERE CustomerKey % 2 = 0

    SQL Statement for the OLE DB Source

    A script component (the most interesting part!)

    From the SSIS Toolbox drag a Script Component to the Data flow surface. In the “Select Script Component Type” choose “Transformation”.

    SSIS-S01E02-image02

    To use a column value in the script, you have to define it as an input column. Select the column you want to check for duplicate values with Usage Type ReadOnly. In the demo this is the CustomerBusinessKey column.
    SSIS-S01E02-image03

    We also need a new column to store the occurrence number for the CustomerBusinessKey. To do this, in the Script Transformation Editor, Select Inputs and Outputs on the left side, then unfold Output 0 and Output Columns, respectively.
    Then click on the Add column button to add a column with DataType four-byte signed integer [DT_I4]. Give the new column a clear name. In the demo it is called CustomerBusinessKeyOccurrence.
    SSIS-S01E02-image03b

    After these preparations we can start writing the C# script. Press the Edit script … button to do so.
    SSIS-S01E02-image04

    Adjust the script as follows (the changes are also marked inside red blocks in the picture below):

    • In the Namespaces region, add:
      using System.Collections.Generic;
    • Above PreExecute(), add:
      // The column to count duplicates for will be stored as Key (with string datatype) of the dictionary.
      // The occurrence counter will be stored as Value of the dictionary (with int datatype).
      private Dictionary<string, int> m_dictOccurrencesCount;
    • Inside PreExecute(), add:
      // Initialize the dictionary, otherwise it will be null when accessed.
      m_dictOccurrencesCount = new Dictionary<string, int>();

    SSIS-S01E02-image05
    Script adjustments – part 1

    Then adjust the script as follows:

    • Inside Input0_ProcessInputRow(Input0Buffer Row), add:
      // Count occurrences of CustomerBusinessKeys.
      if (m_dictOccurrencesCount.ContainsKey(Row.CustomerBusinessKey))
      {
        // If the CustomerBusinessKey of the current row is already in the dictionary,
        // increase the counter with 1:
        m_dictOccurrencesCount[Row.CustomerBusinessKey]++;
        // Also assign the new value to the CustomerBusinessKeyOccurrence column,
        // so it can be used in the Data flow later.
        Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[Row.CustomerBusinessKey];
      }
      else
      {
        // If the CustomerBusinessKey of the current row is not yet in the dictionary,
        // add it, and set the counter to 1
        m_dictOccurrencesCount.Add(Row.CustomerBusinessKey, 1);
        // Also assign 1 to the CustomerBusinessKeyOccurrence column.
        Row.CustomerBusinessKeyOccurrence = 1;
      }

    SSIS-S01E02-image06
    Script adjustments – part 2

    When the script is ready, choose Rebuild from the BUILD menu in the VstaProjects – Microsoft Visual Studio window. Then close that window. You return to the Script Transformation Editor; close it by pressing OK.

    An (optional) conditional split

    Optionally you can add a Conditional Split.
    I use the conditional split to store the first occurrences for every CustomerBusinessKey in one table and all duplicates in another table.
    Both tables are only there to show you that the script component and conditional split do their jobs properly. In real projects, you might develop this differently, for example by storing only the first occurrence.

    SSIS-S01E02-image07
    Conditional Split Configuration

    Two OLE DB Destinations: one for the first occurrences of each business key, and one for the duplicates

    I basically created two temporary tables by composing a SELECT TOP 0 .. INTO statement from the OLE DB Source SQL statement, because the result of the demo needs to flow somewhere.

    SSIS-S01E02-OLEDB-Source.sql
    IF OBJECT_ID('[dbo].[DimCustomerFirstOne]', 'U') IS NOT NULL
      DROP TABLE [dbo].[DimCustomerFirstOne]
    GO
    SELECT TOP 0
      [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
    INTO [dbo].[DimCustomerFirstOne]
    FROM
      [dbo].[DimCustomer]
    GO

    IF OBJECT_ID('[dbo].[DimCustomerDuplicates]', 'U') IS NOT NULL
      DROP TABLE [dbo].[DimCustomerDuplicates]
    GO
    SELECT TOP 0
      [CustomerKey], [CustomerAlternateKey] + 'B' AS [CustomerBusinessKey],
      [Title], [FirstName], [MiddleName], [LastName],
      [NameStyle], [BirthDate], [MaritalStatus], [Suffix],
      [Gender], [EmailAddress], [YearlyIncome], CONVERT(int, 0) AS [OccurrenceNo]
    INTO [dbo].[DimCustomerDuplicates]
    FROM
      [dbo].[DimCustomer]
    GO

    OLE DB Destination OLE_DST FirstOne uses the [dbo].[DimCustomerFirstOne] table.
    OLE DB Destination OLE_DST Duplicate uses the [dbo].[DimCustomerDuplicates] table.

    Let it flow, let it flow, let it flow

    Okay, time to test.
    When data flows, the “First occurrence” and “Duplicates” Data Viewers pop up. Then I can check that all the rows are divided well over both tables based on the value of column CustomerBusinessKeyOccurrence: exactly 1 or higher than 1.
    When I ran it with the Data Viewers disabled, the package execution finished literally in a split second. So the performance is excellent.
    Test succeeded!
    SSIS-S01E02-image08
    The Data flow in Debug Mode

    SSIS-S01E02-image09b
    Data Viewer for First occurrence flow: CustomerBusinessKeyOccurrence is 1 for all rows.

    SSIS-S01E02-image10b
    Data Viewer for Duplicates flow: CustomerBusinessKeyOccurrence is 2 or higher for all rows.

    Other considerations

    If multiple columns make a row unique, concatenate all column values with a delimiter, and then add the concatenated result to the dictionary m_dictOccurrencesCount.
    For instance you could check for duplicates using the customer name parts (first name, middle name and last name), the birth date and the gender. This is a softer check than on a real business key, but sometimes you do not have one.
    Of course, for this to work, you must add those columns as ReadOnly input columns to the script component.
    You can create a concatenated value as follows:
    Inside Input0_ProcessInputRow(Input0Buffer Row), start with:

    // Concatenate FirstName, MiddleName, LastName, BirthDate and Gender, using a | as delimiter.
    string businessKey = string.Format("{0}|{1}|{2}|{3:yyyyMMdd}|{4}",
      Row.FirstName, Row.MiddleName, Row.LastName, Row.BirthDate, Row.Gender).ToUpper();

    The concatenated value is converted to uppercase, so that differences in lower- or uppercase lead to the same value.
    Then use the businessKey variable (instead of Row.CustomerBusinessKey) in the if statement as shown above.
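
    Put together, that part of Input0_ProcessInputRow would then look something like this:

    // Count occurrences of the concatenated business key.
    if (m_dictOccurrencesCount.ContainsKey(businessKey))
    {
      m_dictOccurrencesCount[businessKey]++;
      Row.CustomerBusinessKeyOccurrence = m_dictOccurrencesCount[businessKey];
    }
    else
    {
      m_dictOccurrencesCount.Add(businessKey, 1);
      Row.CustomerBusinessKeyOccurrence = 1;
    }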

    If the concatenated value becomes very long, you could calculate a hash value and add that to the dictionary instead (thanks for the suggestion, Joost). This would use less memory and might be faster, but the only way to find out if this is the case is testing .. testing ..
    ( I did not test this to make the demo not too complicated and because performance did not seem to be an issue ).
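
    A minimal sketch of that hashing idea, using SHA1 from the .NET framework (untested here, just to illustrate; it needs using System.Security.Cryptography and System.Text in the Namespaces region):

    private static string HashKey(string concatenatedKey)
    {
      using (SHA1 sha1 = SHA1.Create())
      {
        byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(concatenatedKey));
        // Convert the hash to a hex string, so it can be used as the dictionary key.
        return BitConverter.ToString(hash).Replace("-", string.Empty);
      }
    }

    You would then add HashKey(businessKey) to the dictionary instead of businessKey itself.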

    Thinking a bit further, you could use this concept also to calculate running totals.
    Just configure the dictionary to hold a decimal instead of an int and you can calculate running totals for numeric values (e.g. sales figures) easily.
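
    A rough sketch of that running total idea, assuming a numeric input column SalesAmount and an output column RunningTotal that you would add yourself (both are hypothetical, they are not part of this demo):

    // Declared above PreExecute(), and initialized inside PreExecute() just like m_dictOccurrencesCount:
    private Dictionary<string, decimal> m_dictRunningTotals;

    // Inside Input0_ProcessInputRow:
    if (!m_dictRunningTotals.ContainsKey(Row.CustomerBusinessKey))
    {
      m_dictRunningTotals.Add(Row.CustomerBusinessKey, 0m);
    }
    m_dictRunningTotals[Row.CustomerBusinessKey] += Row.SalesAmount;
    Row.RunningTotal = m_dictRunningTotals[Row.CustomerBusinessKey];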

    Conclusion / Wrap up

    In SSIS you can use a Script Component within a Data Flow Task to detect duplicates in the source data.
    This can be an alternative to the T-SQL ROW_NUMBER() function, when the source is not a SQL Server Database.

    Credits
    My thanks go to Joost van Rossum for reviewing the initial draft of this article. His input was very useful.

    (c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.