Category Archives: Datawarehouse series

Contains blog posts related to datawarehousing that can not be placed in other more specific categories like “Data Vault Series” or “SSIS Series”.

Implementing a hash key collision strategy

Introduction

Dan Linstedt, the inventor of Data Vault, has written a lot about it: hashkeys.
For instance, one of his latest blog posts:
#datavault 2.0, Hashes, one more time.

I will not list all other sources, as you can use Google yourself.
A few comments on hash keys:

  1. You need them for scalability. Using sequence numbers is taking the risk that your data warehouse does not scale well later when the amount of data grows.
  2. They can collide: two different business keys can produce the same hash key. However the chance that this happens is very small. For instance when using SHA-1 (which produces a hash value of 160 bits) you will have a 1 in 1018 chance on a hash collision when having 1.71 * 1015 hash values (read: hub rows) according to this blog post.
  3. If collisions are unacceptable you need a hash key collision strategy.

The full article is posted on DWA.Guide, so you read further there ..


Picture credits: © Can Stock Photo / alexskp

The Data Warehouse Automation Tools Survey Results (Data Warehouse Series)

Introduction

As you might know, I started a survey on the usage and opinion about Data Warehouse Automation (DWA) Tools.
Today I am presenting the results.
I sincerely want to thank all the participants for sharing your thoughts.

The participants

First some insight into the persons that filled in the survey. You were not with so many (62 completed surveys).

In statistics there is a formula to compute the minimum sample (number of participants) given a certain margin of error, reliability level and total population. If I fill in 10% for error margin, 90% for reliability level and 20000 for total population (this number does not have much impact on the end result) the minimum sample (number of participants) should be 68.
Please note that with 10% error margin and 90% reliability level that there are considerable chances that the answers are not representative for the entire population of data warehouse professionals.

Participants by country

Participants are spread over 23 countries on several continents, with a concentration in the Netherlands and the United States.

Detailed list:

Country No of participants
Argentina 1
Australia 1
Austria 1
Belgium 3
Brazil 1
Bulgaria 1
Denmark 1
Germany 3
Ghana 1
Hungary 1
Iceland 1
India 4
Lithuania 1
Netherlands 20
New Zealand 1
Norway 1
Romania 1
Slovakia 1
Spain 1
Switzerland 3
Turkey 1
United Kingdom 2
United States 10

Jobroles of participants

Almost half of the participants selected Data Warehouse Architect as best describing their current job role.
Furthermore there were 12 ETL developers and a number of other job roles, as shown in the graph.

Exact counts:

Jobrole No of participants
BI & Analytics/Data Science manager 2
BI Consultant 1
Business analyst 1
Data warehouse architect 30
Director 2
ETL Developer 12
Front end BI Developer 1
Manager 1
Multi-disciplined BI Professional 4
Project manager 6
SQL Developer 1

Familiarity with the investigated tools

First question to answer is: how familiar are the respondents with those tools?
I used 5 levels of familiarity with a short name and explanation. The option values have the following meaning:

  1. Don’t know: I do not know this product.
  2. Heard of: I have heard of this product but I have not worked in a project where it was tried or used.
  3. Tried: I have tried this product but not used it in production environments.
  4. Used: I have worked in a project where this product was used.
  5. Using: I am currently working in a project where this product is being used.

The results in a graph:

And the exact figures, first absolute, then percentages:

Issuer – Productname Don’t know Heard of Tried Used Using
AnalytiX DS – CatfX / Data Vault Code Gen Bundle 44 14 3 0 0
Attunity – Attunity Compose (formerly BiReady) 36 21 4 0 1
Gamma Systems – Data Warehouse Studio 55 5 1 0 0
Javlin – CloverETL 51 7 2 1 0
Magnitude – Magnitude Kalido 39 20 1 1 0
Qosqo – Quipu 43 12 3 2 1
TimeXtender – TimeXtender 42 14 3 0 2
Trivadis – biGenius 52 5 1 1 2
Varigence – BimlExpress / Bidshelper 43 3 4 3 8
Varigence – Mist / BimlStudio 46 4 9 1 1
WhereScape – RED 35 10 6 5 5
Average 44,2 10,5 3,4 1,3 1,8
Issuer – Productname Don’t know Heard of Tried Used Using
AnalytiX DS – CatfX / Data Vault Code Gen Bundle 72,1% 23% 4,9% 0% 0%
Attunity – Attunity Compose (formerly BiReady) 58,1% 33,9% 6,5% 0% 1,6%
Gamma Systems – Data Warehouse Studio 90,2% 8,2% 1,6% 0% 0%
Javlin – CloverETL 83,6% 11,5% 3,3% 1,6% 0%
Magnitude – Magnitude Kalido 63,9% 32,8% 1,6% 1,6% 0%
Qosqo – Quipu 70,5% 19,7% 4,9% 3,3% 1,6%
TimeXtender – TimeXtender 68,9% 23% 4,9% 0% 3,3%
Trivadis – biGenius 85,2% 8,2% 1,6% 1,6% 3,3%
Varigence – BimlExpress / Bidshelper 70,5% 4,9% 6,6% 4,9% 13,1%
Varigence – Mist / BimlStudio 75,4% 6,6% 14,8% 1,6% 1,6%
WhereScape – RED 57,4% 16,4% 9,8% 8,2% 8,2%
Average 72,3% 17,2% 5,6% 2,1% 2,9%

I think most remarkable is that most of the tools are not very well known: the answer “Don’t know” is on the average given by 72,3% of the participants.

Product satisfaction

The product satisfaction question gave the insight that there are no ‘bad’ tools. All tools fulfill a need, and the digits go from 3 to 5 (on a scale of 1 to 5). So the products with the lowest mark for satisfaction, still score a ‘C’ (or as we say in the Netherlands, a ‘vijfje’).

Value for money

Value for money is obviously a very subjective matter. What one person thinks is expensive, the other might think of as cheap. Most vendors are not transparent about their prices.
Marks vary from 1,80 to 3,46. What we can conclude from this is that some tools have the perception of a relatively low value for money, while others provide a better value for money, according to the participants of the survey.
The products that have an average score of 3 or higher (again on a scale of 1 to 5) are both Varigence products (BimlExpress/Bidshelper and Mist/BimlStudio), AnalytiX DS (CatfX / Data Vault Code Gen Bundle) and Trivadis (biGenius).

By the way, the question in the survey had the following scale: 1=cheap and 5=expensive. I translated this to a value for money by reversing the answers (so 1 became 5 and vice versa and 2 became 4 and vice versa). In this way, a higher score means a ‘cheaper’ pricing perception, thus a better value for money.

Other tools that were mentioned

A number of other tools were also mentioned, many of them only once or twice.
Remarkable is that also tools that do not meet the “DWA Tool” definition were mentioned, for instance Microsoft BI Products.
What I learned from this is that I should have given a definition of what I meant with a “Data Warehouse Automation Tool”.
From Wikipedia:
“Data warehouse automation or DWA refers to the process of accelerating and automating the data warehouse development cycles, while assuring quality and consistency. DWA is believed to provide automation of the entire lifecycle of a data warehouse, from source system analysis to testing to documentation. It helps improve productivity, reduce cost, and improve overall quality.”

Worth mentioning is the open source tool ODE. I didn’t know it but tried it recently. It’s a sql based, metadata driven tool without graphical user interface, and I could not get an “AdventureWorks” example working due to errors that occurred during working through the example. So I think this tool needs to become more stable or at least should have a better manual on how to use it. But it’s Open Source and that puts a smile on my face.
From a company/human resources perspective, working with such a tool can however lead to a hidden cost because using it can be more laborious than working with a commercial, ‘more stable’ product. And once you start changing the source code (for yourself or your company, not as member of the Open Source Development team) you can’t upgrade to newer versions anymore and are stuck with the maintenance of your copy of the tool. And if you join the Open Source Team you are now doing two projects instead of one!
In the Netherlands we have an old-fashioned saying “Bezint eer gij begint” which means “think well before you start”, and I think this is applicable here as well. A license for a commercial tool might be worth the money.

What also appears to be the case, that a number of those tools are not for sale, but are tools that consultancy firms use for doing projects.

List of all the other tools that were mentioned:

Issuer – Productname Mentions
Adante – Datawarehouse on Demand 1
Automic – One Automation 2
Birst – Birst 1
Centennium – TBD engine (template based development) 1
Cubido – Cubikit 1
DIKW – DHW Accelerator 1
Grip Op Data B.V. – GDIP 1
Infent – Infenture Insight 1
Informatica – Powercenter 2
Informatica – B2b data exchange 1
Informatica – Informatica 1
Informatica – MFT 1
I-refact – I-refactory 1
Kadenza – Smart Datawarehouse generator 1
Microsoft – MSBI(SSIS,SSAS,SSRS) 9
Microsoft – Power BI 1
Optimalbi Limited – ODE (Open source) 1
Oracle – ODI 1
Oracle – OWB 1
Own build – Excelbased tool 1
Pentaho – Kettle 2
Pentaho – Big Data 1
Rabobank – Foundation Toolkit 1
SAP – Data Services 2
SAP – Information Steward 1
Tableau – Server 1
Talend – Data Fabric 1
Talend – Open Studio 1
TopBi – Data Vault Generator 1
TopBI – HSG 1
Varigence – BimlFlex 1
Varigence – BimlOnline 1
Viz Analytics – Data Warehouse Studio 1

Comments given about all the tools

Participants could give their opinion about the tools in free text format – and they did!
Here are just the bare comments.

Attunity – Attunity Compose (formerly BiReady)
  • “Like features, makes work flow easily “
Qosqo – Quipu
  • “This one could be split up between the free and paid variant for this research.”
  • “Most advanced options on data vault modeling and generation in one tool. Does not support data marts, requires Groovy knowledge to change templates.”
TimeXtender – TimeXtender
  • “vendor lock-in, no best practices applied when generating SSAS cubes”
  • “lock in, very good features, easy to use (even for business users)”
Trivadis – biGenius
  • “if they extend SSIS generation no lock in, very flexibel, fast, cool”
  • “No Vendor-Lock (generated Code can be scheduled and executed native), Microsoft and Oracle-Support, Data Vault Support”
  • “Integrated product, short learning curve, easy to adapt”
Varigence – BimlExpress / Bidshelper
  • “Great to building SSIS packages.”
  • “Free, but you need to write the generating code first. Ideal for consultants though.”
  • “Nice combination, but needs to mature.”
  • “no lock in, very cool that you generate SSIS/SSAS, free, a lot of work coding, not for everyone”
  • “Very flexible, but long learning curve”
  • “Second best tool to generate SSIS packages. Requires a lot of exceptions if you want to alter parts of code for specific tables. Debugging is hell.”
Varigence – Mist / BimlStudio
  • “Has potential but is progressing slowly. Pricing is fair.”
  • “No vendor lock-in, as the result of BIML are just SSIS packages”
  • “More mature version.”
  • “easier, expensive, nice you can easily re-engineer from existing DWH”
  • “There is no reverse engineering module.We can not convert old package to use BimlStudio”
  • “Very flexible”
  • “Best tool to generate SSIS packages. Still requires a lot of manual labor for low level changes.”
WhereScape – RED
  • “We were able to increase development speed with a factor 3. “
  • “Vendor-Lock (needs App-Server for Scheduling, Processing), Microsoft not fully supported (only Beta), Data Vault Support”
  • “Creates stored procedures which clutters db’s, powerful tool, but has some annoyances”
  • “Several products which are not completely integrated, some features not “ready”, several limitations (especially when working with datasets, expensive for what you get, only based overseas (except smalll base in UK)”
  • “You have to click and type almost everything, then it “generates” a stored procedure. Also stores its repository in the same database and schema as your data warehouse.”
  • “Changed projects into initiatives enabling business analytics. “
  • “Like lineage, like support for multiple generation types (such as data vault), dislike performance, dislike excessive storage, dislike cost of consulting, dislike unreliability, dislike UI, dislike lack of adapters, dislike reliance on database procedures, dislike scheduler, dislike target specific repository”

Comments for other tools (not all available for sale)

Adante – Datawarehouse on Demand
  • “Very easy to use automation tool for Dimensional modeling”
Centennium – TBD engine (template based development)
  • “SQL server based metadata based staging, vault en datamart generator. Developed by Centennium, for the larger part made adult by me.”
I-refact – I-refactory
  • “A very solid, well thought solution for the data management. It’s not just for dwh, but for the whole data platform of your company.”
Kadenza – Smart Datawarehouse generator
  • “Model driven generator c-sharp”
Talend – Data Fabric
  • “Talend is a very friendly for completing complex operations like Data Integration, Data Quality and Big Data (towards Hadoop) with minimum programming skills”
Viz Analytics – Data Warehouse Studio
  • “Custom built from the ground up, with own ETL engine, very quick to build and implement DWs.”

Used technologies

One of the questions was about the technology that is being used (from Gartner Magic Quadrant for BI and Analytics). As more than one answer was allowed, the total number of responses is more than the number of participants, because some participants use two or more technologies.
In the Top 5 we see: Microsoft, Oracle, Tableau, SAP and Qlik. To be honest, nothing that surprises me.

Technology Mentions
Alteryx 7
Birst 3
Board International 3
Datameer 3
IBM 11
Information Builders 3
Microsoft 47
MicroStrategy 7
Oracle 25
Pentaho 6
Qlik 13
SAP 16
SAS 5
Sisense 1
Tableau 20
TIBCO Software 2
Yellowfin 1
Zoomdata 1

Conclusion / Wrap up

With this survey I tried to give the business intelligence community more insight into the available Data Warehouse Automation Tools.
Due to the low number of participants, conclusions come with large uncertainty margins. I think this is a pity.
Nevertheless this investigation could be good starting point for further analysis on DWA Tools or a DWA Tool selection.

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Using a Persistent Staging Area: What, Why, and How

Change log

Date Changes
21 February, 2017 Due to a very useful discussion on LinkedIn with Dan Linstedt, I have made some changes in this article to emphasize that a PSA does not have a permanent place in any EDW
(with exception, as Dan suggests, being the Hadoop NoSQL platform used for slower storage mechanisms).

Today I want to discuss the Persistent Staging Area in an Enterprise Data Warehouse, or PSA for short. First I want to give a credit to Roelant Vos who has blogged about this subject before. He triggered me to go further with this concept to see “what’s in it for me” and to do a proof of concept using SQL Server 2016.

What is a Persistent Staging Area or PSA?

As an introduction, I want to tell what a (non-persistent) Staging Area is. Some of you will already know, so be it. A Staging Area is a “landing zone” for data flowing into a data warehouse environment. The data in a Staging Area is only kept there until it is successfully loaded into the data warehouse. Between two loads, all staging tables are made empty again (or dropped and recreated before the next load).
A Persistent Staging Area on the other hand, is a Staging Area that is not wiped out between loads: it contains full history from the source(s) that deliver data to it.


Picture credits: © creativecommonsstockphotos | Dreamstime Stock Photos
The “Staging Area” for passengers on an airport ..

Why would you want to make use of a PSA, and what are its pros and cons?

A PSA makes it possible to reload the next layer in the Data Warehouse (for example a Data Vault) when requirements change during development of the Data Vault / EDW without going back to the original source(s). The original source(s) might not contain all history anymore (e.g. because it is an OLTP system or a web service returning only current data) or it is cumbersome to do such a reload because the source data is delivered in large numbers of flat files.

As you might know for data warehouses with (semi)structured data I always preach for a multi-tier architecture including a Data Vault for storing full history of unmodified source data and one or more dimensional models (virtual if performance allows, otherwise physical) as data feeds for cubes, reports, dashboards and BI tools.

So, I hear you think, in a Data Vault you have already full history, why would you need a PSA then? The answer is simple: because building a Data Vault takes time. You do not load all data on day 1, maybe there are uncertainties about the model and your resources are restricted, most likely. So you build the Data Vault in increments. But you can start capturing as much source data as you can! When Data Warehouse Automation tools are used, this is fairly easy to build. So even though you do not take the data further into your data warehouse than the (Persistent) Staging Area, you already start collecting your history of data.

Note: a different reason to use a PSA could be that your source systems contain high data volumes and are geographically spread: you could then create a PSA on premise near the source system, and export only delta data from it – each row would get an I, U or D indication for Insert, Update and Delete – to massively reduce data traffic between source and EDW. I will not discuss this use case scenario today, because then pros and cons would be completely different, and time travelling and full reload would not be applicable.

Using a PSA has it’s pros and cons. To mention a few:

Pros:

  • You collect data, even for those parts of the data warehouse you have not properly designed and developed yet.
  • You can clear the Data Vault, or even change the model completely (this can happen, for instance when business concepts initially were not well understood), create new ETL Code to load data to the modified Data Vault and do a replay of loading full history of the data captured in the PSA.
  • Important is that the PSA is implemented with changed rows only. If done so, it does not have excessive storage needs. So like a Data Vault Satellite, a new row for a table in the PSA is only added, when something has changed for the primary (surrogate or natural) key. A start- and end date/time can be used to implement this.
  • A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers. A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers! This was written twice on purpose! Do not forget! It is useful during building the EDW, but must be disarmed when the EDW has gone into production and has reached a maintenance phase. A good way to make sure this happens is to plan a decommission date and stick to the plan (can also be a plan like “after sources X, Y and Z are connected and are data feeds in production for 3 months”). Do not keep the PSA around for no good reason.
    In this way, most of the cons below are very likely to be not applicable!

Cons:

  • You can argue that you are in fact building a second data warehouse.
  • When sources change over time, your PSA might not reflect the current structure of the source system(s) anymore (e.g. a column is removed in the source system but you leave it in the PSA so that you can add new data with the column empty but keep the data in this column for older rows). This might be confusing.
  • When source systems are not used anymore (for instance when accounting software is replaced by a similar product from a different vendor) you need to add tables of the new system to the PSA, while you also keep the tables from the old system, which will not receive new data anymore. Now you need to keep the ETL code to load from the “old systems” PSA tables, in case a full reload is needed, and you also need to develop code to get data from the “new systems” PSA, and might need logic for which point in time which tables should be loaded. As you can understand, now the maintenance of your PSA can start to become a heavy burden. So going back to the pros, make sure your PSA is optional and can be disarmed when the maintenance efforts do no longer justify the benefits.

How could you implement a PSA?

To give an idea how this fits in my favourite architecture, I’ve added the PSA to it.
For a normal load the PSA is just loaded from the Staging Area (preferably in parallel with loading the Enterprise Data Warehouse from the Staging Area).
A Virtual Staging Access Layer (for example implemented with views) ensures that ETL code for loading the Enterprise Data Warehouse can switch easily between Staging Area and PSA for loading data. An example how to implement this will follow below.


How the PSA fits in the EDW architecture

For the next schema I owe you an explanation, because I know it’s a bit weird to mix architecture with processing in one schema, but in this case I thought it made sense and could clarify how to reload the EDW from the PSA: no sources and Staging Area are involved, but instead a time travelling mechanism queries the data from the PSA through the Virtual Staging Access Layer, to replay all loads that have been done so far, but with modified ETL Code to load data into a modified or extended Data Vault.


Reloading the DWH from the Persistent Staging Area

To set up time travelling, you could create a list of load sessions and their start date/times, then substract a small time fraction and use that value as Point-In-Time date time value to retrieve all rows from the previous load.
Note: For the DATETIME2 columns, I use a precision of centiseconds, so DATETIME2(2). For justification see one of my older blog posts: Stop being so precise! and more about using Load(end)dates (Datavault Series). If needed you can use a higher precision for your data warehouse.

In a table:

Load no LoadStartDatetime Point-In-Time value
1 2017-02-01 02:00:01.74 2017-02-02 02:00:00.88
2 2017-02-02 02:00:00.89 2017-02-02 02:00:02.11
3 (last load so far) 2017-02-03 02:00:02.12 9999-12-31 23:59:59.98 (! not 99 !)

The Time Travel icon with the arrows in the picture above refers to the process that loops through all Start date/time values chronologically and triggers execution of the load process from PSA to EDW for every Point-In-Time value.

SQL Server 2016 implementation: how could you implement loading the PSA from the Staging Area?

This logic is very similar to loading a Data Vault Satellite. For each staging table you should create a “History” version in the PSA database, with two extra columns: EffectiveStartDts and EffectiveEndDts and a primary key defined as the primary key of the staging table plus the EffectiveStartDts. Most concepts of loading a satellite could be applied: you could add rows that are new, insert new versions of rows that have changed, and mark absent rows (missing in the source) as such. For the PSA I would suggest just to update the end date time so that this row will simply not be in the selection when a later point-in-time is used. Also there is no need to create a separate “hub” table for each staging table.

SQL Server 2016 implementation: how could you implement the historic reload from the PSA?

In a previous blog post Isolate cross database “databasename” dependencies using synonyms and schemas I already advocated the concept of putting each tier of the EDW in its own database schema, and use synonyms to achieve interconnectivity between objects in different databases. In the rest of this article I will assume that you understand this concept (if not, you could read this blog post first).

With some imagination, we can use a modified version of this concept by creating the Virtual Staging Access Layer into the PSA database. This virtual layer is needed for loading data from the PSA instead of the Staging Area.
Suppose the Staging Database has a database schema stg and that schema contains all tables that can be loaded into the EDW / Raw Data Vault.
What we can do is do a similar thing in the PSA database, so create a stg schema, but instead of tables this schema contains views with the same name and same columns as the tables in the Staging Database! These views select the rows valid at a certain point in time.

The challenge to retrieve only the valid rows at a point in time is an easy one: To the PSA database we add a table [stg].[PointInTime] with only one row and a [CurrentPointInTime] column of datatype DATETIME2. Now we can build a SELECT statement on the PSA table (same as the staging table, but with start- and end date/time stamps to hold historic changed and deleted rows) with a CROSS JOIN on the PointInTime table and a WHERE clause to select only the rows valid at the given point in time.

An example how such a view could look like:

CREATE_VIEW_stg_Customer.sql

CREATE VIEW [stg].[Customer]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 15-FEB-2017
Description: View that has exactly the same structure as a table with the same name
             in the Staging Database.
             This view can be used for full reloads of all data in the PSA Database.
==========================================================================================
*/

SELECT
    [CustomerID],
    [FirstName],
    [Initials],
    [MiddleName],
    [SurName],
    [DateOfBirth],
    [Gender],
    [SocialSecurityNumber],
    [Address],
    [PostalCode],
    [Residence],
    [StateOrProvince],
    [Country],
    [RowHash]
FROM
    [psa].[CustomerHistory] hist
CROSS JOIN
    [psa].[PointInTime] pit
WHERE
    hist.EffectiveStartDts <= pit.CurrentPointInTime
    AND hist.EffectiveEndDts >= pit.CurrentPointInTime -- When INCLUSIVE enddating is used.
    -- AND hist.EffectiveEndDts > pit.CurrentPointInTime -- When EXCLUSIVE enddating is used.
GO

The next steps are:

  1. Make sure your ETL Code to load the data has a mechanism to skip loading data from the sources to the Staging Area and from the Staging Area to the PSA when the PSA is used instead of the normal Staging area. For instance you could add a view [stg].[IsPSA] to both the Staging database and the PSA database, returning a single row with a single value “False” or “True”, respectively. In your package you could retrieve this value using the Staging Connection Manager, and change the flow of your package depending on the fact if the Staging database is connected or the PSA database.
  2. Build a time travelling mechanism (could be a FullReloadMaster.dtsx package) that updates the stg.PointInTime table for every point-in-time date/time and after every update starts the ETL Process (e.g. a Master.dtsx SSIS package) with a changed connection string, connected to the PSA instead of “normal” Staging database!

Here is a possible implementation of the [stg].[IsPSA] view, based on the fact that the PSA database will have _PSA_ in the database name and the staging database will not. This allows to use the same version of the view for the Staging and PSA database.

CREATE_VIEW_stg_IsPSA.sql

CREATE VIEW [stg].[IsPSA]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 10-FEB-2017
Description: View that returns one rows with a bitcolumn [IsPSA] that indicates
             if the current database is the PSA or the STG.
==========================================================================================
*/

SELECT
  CONVERT(BIT,
    CASE
    WHEN CHARINDEX('_PSA_', DB_NAME()) > 0 THEN 1
    ELSE 0
    END) AS [IsPSA]
GO


Example how a FullReloadMaster.dtsx package could look like. Important is that the Staging database connection string is a parameter of Master.dtsx, and is passed by FullReloadMaster.dtsx, but it now is a connection to the PSA database.

Now the next challenge is how to get this working. If you use a SSIS dataflow with a OLEDB Source (and a destination), it is not so difficult: you can just change the connection string of the OLEDB Source. I have done a proof of concept and it worked! You do not have to redeploy SSIS Packages. If views in the PSA database are used with exactly the same structure as the tables in the Staging database, the SSIS Package agrees with the “metadata” and “VS_NEEDSNEWMETADATA” or “VS_ISBROKEN” errors do not occur. Of course you could also make views for reading the Staging tables in the Staging database. This would create an additional layer of abstraction, because then in both the Staging Area and PSA you read the data from the views, and you could, when needed, implement views of which the structure differs from the underlying tables.

If you use SQL code (e.g. stored procedures) this will be a little more work, but still doable. You can follow one of the following strategies for storing the SQL Code:

  • a push strategy: SQL code is in the PSA database, and should then also be in the Staging database. From there, you push the data to the raw Data Vault.
  • a pull strategy: SQL code is in the raw Data Vault. From there you pull the data from the Staging Area or PSA. You will need to switch via synonyms: drop stg synonyms and recreate them but them pointing to the PSA database before a historic reload is done. Afterwards, the reverse action needs to take place to let them point to the Staging database again.
  • an outsider strategy: SQL code is in a separate database, and has synonyms for both staging and raw Data Vault tables and views. Now the same switch via synonyms method can be used.

A few considerations:

  • a push strategy forces you to install your SQL code twice: in the Staging and the PSA database. Also a Staging database might not feel like the “right place” for stored procedures.
  • a pull strategy and an outsider strategy force you to change synonyms, something that normally only has to be done during a deployment, but now before a full reload (and afterwards the reverse action).

I have a slight preference for switching via synonyms, assumed that a full reload will not happen often.
Still using the push strategy could be a valid choice too.

Conclusion / Wrap up

In this post you could read more about the concept of a Persistent Staging Area. Although I have not used it in production environments yet, I think in some projects it can be very useful.
Important to keep in mind:

  • Make sure that your data warehouse can survive if the PSA is switched off. So do not expose it to end users or make it a mandatory part of your ETL Code.
  • Plan a decommission moment when the PSA is created and stick to the plan. Do not keep the PSA around for no good reason.
  • Use code generation tools to update the PSA from the staging area to limit the development effort to set it up.
  • Virtualize the read access to the staging tables, for instance by using views.

If you stick to this “rules” I think a PSA can prove its value for your Enterprise Data Warehouse!

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Free download: Blog Yearbook 2016

I have made a compilation of all my blog posts of the past year, and it was more work than you would think.
Especially because it also contains a non-published new article.

After filling in the form you can download the e-book for free.

I promise you the following:

  • I will not give or sell your email address to any third party.
  • I will not spam you.
  • If I start with a newsletter, you can easily unsubscribe or change the frequency.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Zeros, bloody zeros! (Data Vault Series)

Introduction

I must admit I have a weakness for British humour.
When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.

DV-S01E05-meetingsMr John Cleese

But, obviously, this post is not about meetings but about zeros.

Problem

I can be short about that: leading zeros in business key values.
For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
This would not be a problem it is would always be delivered exactly like that. But to be honest, you can and will not know that upfront. Even this might be the case now, it might not be in the future.
When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem.
Because every business key gets a different row in the hub, and this customer now exists multiple times!

DV-S01E05-zeros(No acting here, this is how I look sometimes)

Solution

If you are familiar with Data Vault, you might already think of same-as-links to solve this.
But I think the solution should be implemented earlier, to avoid having multiple hub rows.
Simply always remove leading zeros when the sourcecolumn is (part of) a business key (either primary or foreign key) and seems a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty impossible that 001806 and 1806 would refer to two different customers.
Unless, of course, they would come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when then leading zeros dropped off, satellite rows of different customers (in different source systems) could end up as connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance, CRM.1806 and ERP.1806.
In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.

How to implement the solution

There are many ways to remove leading zeros. When I was searching for this I had two requirements:

  • No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
  • No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.

After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, my credits for his work), but appeared to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
So I had to add some logic to it: a SELECT CASE in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) to the SSIS expression.
Furthermore I came on a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
So .. this is what it has become:

T-SQL

Remove_leading_zeros.sql

SELECT
    example.[id_with_leading_zeros],
   CASE
      WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
      WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
      ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
   END AS [id_without_zeros_method1],

   CASE
      WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
      WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
      ELSE REPLACE(REPLACE(LTRIM(REPLACE(-- Robin Hames' method
            REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
            , '0', ' ') -- replace '0' with ' '
            ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
            , ' ', '0') -- change ' ' back to '0'
            , '!#!', ' ') -- change '!#!' back to ' '
   END AS [id_without_zeros_method2]
FROM
    (
    SELECT
        TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
    FROM
        master.sys.objects
    UNION
    SELECT N' 00000 '
    UNION
    SELECT N'00'
    UNION
    SELECT N' '
    UNION
    SELECT ' 0099990 A '
    UNION
    SELECT '-5550'
    ) example

SSIS Expression (can be used in Derived Column)

(LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

DV-S01E05-151In a Derived Column Transformation this looks for instance like this

Conclusion / Wrap up

In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
This post also contains different ways to remove leading zeros, two for T-SQL and one for a SSIS expression.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

11 Things you should know about Biml – an overview (Datawarehouse Series)

Intended audience and purpose of this post

This post gives an overview of a number products and features related to Biml on a high level.
Therewith it is a starting point for further learning and investigating.

Introduction


Biml means Business Intelligence Markup Language.
What else do you need to know about it, if you are a professional worker in the business intelligence and data warehouse domain?

SSIS-S01E05-biml-logo

11 Things you should know about Biml – an overview

  1. Who invented Biml?
  2. What is Biml?
  3. What are BimlExpress and BIDS Helper?
  4. What is BimlScript?
  5. What is BimlFlex?
  6. What are Mist and BimlStudio?
  7. What are BimlBundles?
  8. What are BimlTransformers?
  9. Is the free part of Biml a way to get me addicted to it?
  10. How do I know that Biml is a proven technology that is going to stay?
  11. Where can I find more information on the internet?

1. Who invented Biml?

Biml and all tools around it is created by Varigence, a company with CEO Scott Currie, a former Microsoft Employee.

Back to top

2. What is Biml?

Biml – the markup language – is an XML dialect to describe Microsoft SSIS Packages and SSAS Cubes.
Everything you can do with SSIS, you can do with Biml (quote of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016).
The goal of Biml is to improve productivity of Data warehouse development.
But the biml ‘ecosystem’ consists of more than only the Markup Language. Other components are:

  • Biml Script
  • BimlExpress and BIDS Helper
  • BimlFlex
  • BimlStudio (also/formerly known as Mist)
  • BimlBundles
  • BimlTransformers

Back to top

3. What are BimlExpress and BIDS Helper?

BimlExpress and BIDS Helper both are Add-ins for Business Intelligence Development Studio (actually the Visual Studio version with the project types needed for Business Intelligence).
BimlExpress is the successor of BIDS Helper and one of it biggest advantages is syntax highlighting and intellisense.
It is really worth checking them out, because they can improve your productivity, especially when you use metadata and design patterns for SSIS packages.
You can download and install BimlExpress. You have to register but can use the product at no cost.

Back to top

4. What is BimlScript?

BimlScript is what makes Biml (the markup language) even more interesting. With BimlScript you can write .NET Code (C#) and do all kind of interesting stuff, like accessing your metadata, using for each loops, and generate SSIS packages using a design pattern.
I have generated packages to load dimensions of a star schema, and I intend to blog about it later this year in more detail.

Back to top

5. What is BimlFlex?

BimlFlex is a datawarehouse automation framework of Varigence. It can greatly improve productivity when building data warehouses.
It uses Predefined patterns for data warehouse automation and Extension Points that you can use to change the standard automation patterns on certain ‘change points’, or can build exceptions for the pattern of this one very special table.
BimlFlex is part of BimlStudio and therefore a commercial (paid) product.

Back to top

6. What are Mist and BimlStudio?

Mist and BimlStudio are the same tool. BimlStudio is just the new name, due to strange associations with the “Mist” name in different languages, for example in German (What does the German word “Mist” mean?). This is why they changed the name.
Also the names for all products and techniques associated with Biml now start with Biml, which makes them more recognizable.
BimlStudio has a number of extra features when compared with BimlExpress.
BimlStudio is a paid product, you can buy a perpetual license or a monthly subscription.

SSIS-S01E05-bimlstudio-logo

Back to top

7. What are BimlBundles?

BimlBundles are part of BimlStudio and protect your source code while your customer still can make changes at framework extension points.
Koos van Strien did a very good job at explaining BimlBundles in more detail here.

Back to top

8. What are BimlTransformers?

With BimlTransformers you can define and store changes on Biml code, instead of changing the Biml code itself. If you have worked with Visual Studio/.NET before, you can compare it with changes you make for your config file for different environments.
Koos van Strien did a very good job at explaining BimlTransformers in more detail here.

Back to top

9. Is the free part of Biml a way to get me addicted to it?

Nope. Although Varigence works sales model is a Fremium model (a free product that can be extended/enhanced with paid products), the promise of Scott Currie, done during the Biml usergroup NL kickoff on May 7, 2016, is that all features that are released for free will stay for free. You can count on that.

Back to top

10. How do I know that Biml is a proven technology that is going to stay?

Biml already exists for about 8 years and has proven itself already. Varigence is a global player and steady company that does not suddenly disappear. I am not concerned if Biml will exist 10 years from now, and you should not have to be either. Even in the utterworst case, you would still have the SSIS packages that Biml has generated for you, and you can edit/maintain them without Biml.

Back to top

11. Where can I find more information on the internet?

To read more about Biml, here are a few interesting web sites, blog posts and Linked In Usergroups:

Back to top

Conclusion / Wrap up

Biml is here to stay and adds a lot of value, even if you use only the free parts of it. Varigence can only be praised for that.
In this blog post I shortly described a number of products and techniques that are part of the Biml ‘ecosystem’.
I do not have a commercial interest in promoting Biml, I am just a Business Intelligence Consultant who likes Biml a lot.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.