|21 February, 2017||Due to a very useful discussion on LinkedIn with Dan Linstedt, I have made some changes in this article to emphasize that a PSA does not have a permanent place in any EDW
(with exception, as Dan suggests, being the Hadoop NoSQL platform used for slower storage mechanisms).
Today I want to discuss the Persistent Staging Area in an Enterprise Data Warehouse, or PSA for short. First I want to give a credit to Roelant Vos who has blogged about this subject before. He triggered me to go further with this concept to see “what’s in it for me” and to do a proof of concept using SQL Server 2016.
What is a Persistent Staging Area or PSA?
As an introduction, I want to tell what a (non-persistent) Staging Area is. Some of you will already know, so be it. A Staging Area is a “landing zone” for data flowing into a data warehouse environment. The data in a Staging Area is only kept there until it is successfully loaded into the data warehouse. Between two loads, all staging tables are made empty again (or dropped and recreated before the next load).
A Persistent Staging Area on the other hand, is a Staging Area that is not wiped out between loads: it contains full history from the source(s) that deliver data to it.
Picture credits: © creativecommonsstockphotos | Dreamstime Stock Photos
The “Staging Area” for passengers on an airport ..
Why would you want to make use of a PSA, and what are its pros and cons?
A PSA makes it possible to reload the next layer in the Data Warehouse (for example a Data Vault) when requirements change during development of the Data Vault / EDW without going back to the original source(s). The original source(s) might not contain all history anymore (e.g. because it is an OLTP system or a web service returning only current data) or it is cumbersome to do such a reload because the source data is delivered in large numbers of flat files.
As you might know for data warehouses with (semi)structured data I always preach for a multi-tier architecture including a Data Vault for storing full history of unmodified source data and one or more dimensional models (virtual if performance allows, otherwise physical) as data feeds for cubes, reports, dashboards and BI tools.
So, I hear you think, in a Data Vault you have already full history, why would you need a PSA then? The answer is simple: because building a Data Vault takes time. You do not load all data on day 1, maybe there are uncertainties about the model and your resources are restricted, most likely. So you build the Data Vault in increments. But you can start capturing as much source data as you can! When Data Warehouse Automation tools are used, this is fairly easy to build. So even though you do not take the data further into your data warehouse than the (Persistent) Staging Area, you already start collecting your history of data.
Note: a different reason to use a PSA could be that your source systems contain high data volumes and are geographically spread: you could then create a PSA on premise near the source system, and export only delta data from it – each row would get an I, U or D indication for Insert, Update and Delete – to massively reduce data traffic between source and EDW. I will not discuss this use case scenario today, because then pros and cons would be completely different, and time travelling and full reload would not be applicable.
Using a PSA has it’s pros and cons. To mention a few:
- You collect data, even for those parts of the data warehouse you have not properly designed and developed yet.
- You can clear the Data Vault, or even change the model completely (this can happen, for instance when business concepts initially were not well understood), create new ETL Code to load data to the modified Data Vault and do a replay of loading full history of the data captured in the PSA.
- Important is that the PSA is implemented with changed rows only. If done so, it does not have excessive storage needs. So like a Data Vault Satellite, a new row for a table in the PSA is only added, when something has changed for the primary (surrogate or natural) key. A start- and end date/time can be used to implement this.
- A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers. A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers! This was written twice on purpose! Do not forget! It is useful during building the EDW, but must be disarmed when the EDW has gone into production and has reached a maintenance phase. A good way to make sure this happens is to plan a decommission date and stick to the plan (can also be a plan like “after sources X, Y and Z are connected and are data feeds in production for 3 months”). Do not keep the PSA around for no good reason.
In this way, most of the cons below are very likely to be not applicable!
- You can argue that you are in fact building a second data warehouse.
- When sources change over time, your PSA might not reflect the current structure of the source system(s) anymore (e.g. a column is removed in the source system but you leave it in the PSA so that you can add new data with the column empty but keep the data in this column for older rows). This might be confusing.
- When source systems are not used anymore (for instance when accounting software is replaced by a similar product from a different vendor) you need to add tables of the new system to the PSA, while you also keep the tables from the old system, which will not receive new data anymore. Now you need to keep the ETL code to load from the “old systems” PSA tables, in case a full reload is needed, and you also need to develop code to get data from the “new systems” PSA, and might need logic for which point in time which tables should be loaded. As you can understand, now the maintenance of your PSA can start to become a heavy burden. So going back to the pros, make sure your PSA is optional and can be disarmed when the maintenance efforts do no longer justify the benefits.
How could you implement a PSA?
To give an idea how this fits in my favourite architecture, I’ve added the PSA to it.
For a normal load the PSA is just loaded from the Staging Area (preferably in parallel with loading the Enterprise Data Warehouse from the Staging Area).
A Virtual Staging Access Layer (for example implemented with views) ensures that ETL code for loading the Enterprise Data Warehouse can switch easily between Staging Area and PSA for loading data. An example how to implement this will follow below.
How the PSA fits in the EDW architecture
For the next schema I owe you an explanation, because I know it’s a bit weird to mix architecture with processing in one schema, but in this case I thought it made sense and could clarify how to reload the EDW from the PSA: no sources and Staging Area are involved, but instead a time travelling mechanism queries the data from the PSA through the Virtual Staging Access Layer, to replay all loads that have been done so far, but with modified ETL Code to load data into a modified or extended Data Vault.
Reloading the DWH from the Persistent Staging Area
To set up time travelling, you could create a list of load sessions and their start date/times, then substract a small time fraction and use that value as Point-In-Time date time value to retrieve all rows from the previous load.
Note: For the DATETIME2 columns, I use a precision of centiseconds, so DATETIME2(2). For justification see one of my older blog posts: Stop being so precise! and more about using Load(end)dates (Datavault Series). If needed you can use a higher precision for your data warehouse.
In a table:
|Load no||LoadStartDatetime||Point-In-Time value|
|1||2017-02-01 02:00:01.74||2017-02-02 02:00:00.88|
|2||2017-02-02 02:00:00.89||2017-02-02 02:00:02.11|
|3 (last load so far)||2017-02-03 02:00:02.12||9999-12-31 23:59:59.98 (! not 99 !)|
The Time Travel icon with the arrows in the picture above refers to the process that loops through all Start date/time values chronologically and triggers execution of the load process from PSA to EDW for every Point-In-Time value.
SQL Server 2016 implementation: how could you implement loading the PSA from the Staging Area?
This logic is very similar to loading a Data Vault Satellite. For each staging table you should create a “History” version in the PSA database, with two extra columns: EffectiveStartDts and EffectiveEndDts and a primary key defined as the primary key of the staging table plus the EffectiveStartDts. Most concepts of loading a satellite could be applied: you could add rows that are new, insert new versions of rows that have changed, and mark absent rows (missing in the source) as such. For the PSA I would suggest just to update the end date time so that this row will simply not be in the selection when a later point-in-time is used. Also there is no need to create a separate “hub” table for each staging table.
SQL Server 2016 implementation: how could you implement the historic reload from the PSA?
In a previous blog post Isolate cross database “databasename” dependencies using synonyms and schemas I already advocated the concept of putting each tier of the EDW in its own database schema, and use synonyms to achieve interconnectivity between objects in different databases. In the rest of this article I will assume that you understand this concept (if not, you could read this blog post first).
With some imagination, we can use a modified version of this concept by creating the Virtual Staging Access Layer into the PSA database. This virtual layer is needed for loading data from the PSA instead of the Staging Area.
Suppose the Staging Database has a database schema stg and that schema contains all tables that can be loaded into the EDW / Raw Data Vault.
What we can do is do a similar thing in the PSA database, so create a stg schema, but instead of tables this schema contains views with the same name and same columns as the tables in the Staging Database! These views select the rows valid at a certain point in time.
The challenge to retrieve only the valid rows at a point in time is an easy one: To the PSA database we add a table [stg].[PointInTime] with only one row and a [CurrentPointInTime] column of datatype DATETIME2. Now we can build a SELECT statement on the PSA table (same as the staging table, but with start- and end date/time stamps to hold historic changed and deleted rows) with a CROSS JOIN on the PointInTime table and a WHERE clause to select only the rows valid at the given point in time.
An example how such a view could look like:
CREATE VIEW [stg].[Customer]
Author: Hans Michiels
Create date: 15-FEB-2017
Description: View that has exactly the same structure as a table with the same name
in the Staging Database.
This view can be used for full reloads of all data in the PSA Database.
hist.EffectiveStartDts <= pit.CurrentPointInTime
AND hist.EffectiveEndDts >= pit.CurrentPointInTime -- When INCLUSIVE enddating is used.
-- AND hist.EffectiveEndDts > pit.CurrentPointInTime -- When EXCLUSIVE enddating is used.
The next steps are:
- Make sure your ETL Code to load the data has a mechanism to skip loading data from the sources to the Staging Area and from the Staging Area to the PSA when the PSA is used instead of the normal Staging area. For instance you could add a view [stg].[IsPSA] to both the Staging database and the PSA database, returning a single row with a single value “False” or “True”, respectively. In your package you could retrieve this value using the Staging Connection Manager, and change the flow of your package depending on the fact if the Staging database is connected or the PSA database.
- Build a time travelling mechanism (could be a FullReloadMaster.dtsx package) that updates the stg.PointInTime table for every point-in-time date/time and after every update starts the ETL Process (e.g. a Master.dtsx SSIS package) with a changed connection string, connected to the PSA instead of “normal” Staging database!
Here is a possible implementation of the [stg].[IsPSA] view, based on the fact that the PSA database will have _PSA_ in the database name and the staging database will not. This allows to use the same version of the view for the Staging and PSA database.
CREATE VIEW [stg].[IsPSA]
Author: Hans Michiels
Create date: 10-FEB-2017
Description: View that returns one rows with a bitcolumn [IsPSA] that indicates
if the current database is the PSA or the STG.
WHEN CHARINDEX('_PSA_', DB_NAME()) > 0 THEN 1
END) AS [IsPSA]
Example how a FullReloadMaster.dtsx package could look like. Important is that the Staging database connection string is a parameter of Master.dtsx, and is passed by FullReloadMaster.dtsx, but it now is a connection to the PSA database.
Now the next challenge is how to get this working. If you use a SSIS dataflow with a OLEDB Source (and a destination), it is not so difficult: you can just change the connection string of the OLEDB Source. I have done a proof of concept and it worked! You do not have to redeploy SSIS Packages. If views in the PSA database are used with exactly the same structure as the tables in the Staging database, the SSIS Package agrees with the “metadata” and “VS_NEEDSNEWMETADATA” or “VS_ISBROKEN” errors do not occur. Of course you could also make views for reading the Staging tables in the Staging database. This would create an additional layer of abstraction, because then in both the Staging Area and PSA you read the data from the views, and you could, when needed, implement views of which the structure differs from the underlying tables.
If you use SQL code (e.g. stored procedures) this will be a little more work, but still doable. You can follow one of the following strategies for storing the SQL Code:
- a push strategy: SQL code is in the PSA database, and should then also be in the Staging database. From there, you push the data to the raw Data Vault.
- a pull strategy: SQL code is in the raw Data Vault. From there you pull the data from the Staging Area or PSA. You will need to switch via synonyms: drop stg synonyms and recreate them but them pointing to the PSA database before a historic reload is done. Afterwards, the reverse action needs to take place to let them point to the Staging database again.
- an outsider strategy: SQL code is in a separate database, and has synonyms for both staging and raw Data Vault tables and views. Now the same switch via synonyms method can be used.
A few considerations:
- a push strategy forces you to install your SQL code twice: in the Staging and the PSA database. Also a Staging database might not feel like the “right place” for stored procedures.
- a pull strategy and an outsider strategy force you to change synonyms, something that normally only has to be done during a deployment, but now before a full reload (and afterwards the reverse action).
I have a slight preference for switching via synonyms, assumed that a full reload will not happen often.
Still using the push strategy could be a valid choice too.
Conclusion / Wrap up
In this post you could read more about the concept of a Persistent Staging Area. Although I have not used it in production environments yet, I think in some projects it can be very useful.
Important to keep in mind:
- Make sure that your data warehouse can survive if the PSA is switched off. So do not expose it to end users or make it a mandatory part of your ETL Code.
- Plan a decommission moment when the PSA is created and stick to the plan. Do not keep the PSA around for no good reason.
- Use code generation tools to update the PSA from the staging area to limit the development effort to set it up.
- Virtualize the read access to the staging tables, for instance by using views.
If you stick to this “rules” I think a PSA can prove its value for your Enterprise Data Warehouse!
(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.
21 thoughts on “Using a Persistent Staging Area: What, Why, and How”
Thank you Hands
Very interesting articule about Bi technic.
Do you need ODS when you have PSA?
I my opinion the PSA is additional and optional, so you still need an ODS or Data Vault as core part of your EDW architecture.
Very good and detailed article.
Have you ever considered using a Hadoop system as a PSA?
I am working out a concept to minimise resources, reduce loading interfaces and virtualize DW layers as much as possible.
Hi Paul, thanks for your input and sorry for my late reply, it seems to have slipped through.
On LinkedIn I’ve had a discussion with Dan Linstedt on this. Hadoop is to be said a good alternative for a PSA as it is slower, “cheaper” (although have not seen a TCO calculation yet) storage. So yes, I suppose you can use Hadoop as a PSA. I want to make the distinction that the “relational” PSA I suggest always is optional/temporary (to support the DWH development process with full reloads) and a PSA on Hadoop might be permanent. I am curious to know how your concept works out, enough challenges I suppose! Good luck, and I am interested to stay in the loop of updates on that. Best regards, Hans
Good article, and may yet become the best approach in the current rapidly changing demands when it comes to processing data.
You talk about creating ‘History’ versions of all your staging tables. May I ask why you did not use the ‘Temporal tables’ option of SQL Server 2016?
To be honest I was doing a POC with Temporal tables, but then my blog article ended up going about two subjects, so I decided to leave it out. Temporal tables have some other challenges, like the start- and enddatetimes are controlled by SQL Server. This can be problematic for loading history from an older system (where the “LoadDts” is not the systemdate). Workaround is to add an additional LoadDts yourself). Performance of Temporal tables is good, though, better than the “History” table during my tests.
Also a Temporal table can not be truncated, and change detection is quite stupid. Any insert, update or delete is a change.
For instance when you update a row and the column values after the update stay the same, there will be two identical rows in your Temporal table.
The same for a delete and insert, even if they are the same, there will be two rows in your Temporal table.
Long story short, you will always need a plain staging table and a History or Temporal table, so you can do change detection yourself and in parallel with loading the rest of your EDW.
Interesting article. I think there are some situations in which this can solve some problems, in specific when you want to reload your Datavault from scratch without losing history. Frank Habers wrote an article about HDA (Historical Data Area) many years ago. Here you can find it: http://www.xr-magazine.nl/artikelen/1125/business-intelligence/een-eenvoudig-alternatief-voor-de-data-vault
There are some similarities but the his architecture is sligthly different.
Hopefully this can help you.
Thank you for your comment and alerting me to the article of Frank Habers. The article and comments show that there are people with strong opinions working in the field of BI/DWH. But discussion is good,as long as it is respectful. The HSA has in a technical sense a similarity with the PSA, but I want to emphazise that I see the PSA just as an temporary extra layer, for your convenience, that can be turned off when it is not useful anymore.
What is your opinion on this?
A PSA or HDA can both solve issues which occure when you want to reload your Datavault or DWH. The only problem which must solved on a certain moment is the huge amount of data. When one of its components is obsolete, this through it away. This will reduce the costs of maintanence and the cost of storage.
Thanks for your addition, Gert.
A good article on this subject covering some interesting concepts, it’s always good to learn some new tricks for an old dog. I would however like to see this pattern extended in two ways, 1). Not to have to end date each PSA, and 2) The same looping logic is also mentioned in Dan’s book DV2, such that in a non PSA load, we should still do a load for each load date found in a staging table,
Hi Andrew, thanks for your addition.
Regarding your comments:
1) A PSA table does not have to be enddated when temporal tables are used, because then SQL Server takes care of the enddating (see my current blog post on that: http://www.hansmichiels.com/2017/03/07/temporal-table-applications-in-sql-data-warehouse-environments-sql-server-series ).
2) Thanks for bringing the looping in the staging area to my attention. I agree that a looping mechanism can be used when your staging area might contain data from several loads simultaneously. This can be the case when multiple loads are done during the day (for instance each 15 minutes) and you cannot 100% guarantee that the previous load was already completely processed when the next flows in). This is however a different kind of looping: in the staging you filter on LoadDatetimeStamp = (value of your choice). In the PSA you filter on CurrentPointInTime BETWEEN
LoadDateTimeStamp AND LoadEndDateTimeStamp. So in the PSA you can select rows of older LoadDateTimeStamps, that have not been changed afterwards.
While attending DMzone last week, I had the pleasure to listen to a talk by Roelant Vos, who talks about the ‘virtual datawarehouse’, meaning it can be rebuilt on the fly. One of the nice things he does is that no physical end-dating is done in the PSA, at the moment data is loaded into different layers, the enddate is calculated on the fly using the SQL Server ROW_NUMBER() ranking function. I find that also an interesting approach to the PSA, which can make it even easier to build.
Hi Johan, thanks for your useful addition, calculating enddates on the fly could be a way to make a PSA easier to load/build.
I have a question on how to load a multiple versions from PSA. For example we have a PSA that is loaded everyday from the source systems. But the load to data warehouse is done for example once a week. It is a requirement to load the DW with full history. I am having a problem writing a join between two PSA table (Order and OrderLine) and to be able to connect OrderLine to the correct version of Order that was valid when OrderLine was valid. Do you have any examples on how to connect the correct versions to each other?
66017921|256708631|2017-10-13 14:32:03.180|2018-05-18 13:18:29.9337143|2018-05-22 05:45:48.3815396
66017921|256708631|2018-05-21 22:16:44.267|2018-05-22 05:45:48.3815396|9999-12-31 23:59:59.9999999
66017921|256708632|2017-10-13 14:32:03.180|2018-05-18 13:18:29.9337143|2018-05-22 05:45:48.3815396
66017921|256708632|2018-05-21 22:16:44.267|2018-05-22 05:45:48.3815396|9999-12-31 23:59:59.9999999
66017921|2017-10-16 10:45:41.800|2018-05-18 11:38:36.3231813|2018-05-22 05:47:00.8021379
66017921|2018-05-21 22:16:44.267|2018-05-22 05:47:00.8021379|9999-12-31 23:59:59.9999999
I think you have to pick a point in time, and then retrieve the Order rows and OrderLine rows that were valid at that specific point in time.
This point in time can be “now” or any other chosen date/time.
Then you get all valid rows from both tables using a WHERE clause like
WHERE SysStartTime <= @PointInTime AND SysEndTime > @PointInTime.
In your case you can not use the BETWEEN keyword because it would include the SysEndTime, and in your example this is the same as the SysStartTime of the next valid row, so it needs to be excluded (by the way, this would only be an issue if the point in time you have chosen is exactly the same as the SysEndTime of any row).
So an example :
DECLARE @PointInTime DATETIME2(7) = CURRENT_TIMESTAMP;
SELECT o.*, ol.*
SELECT * FROM Order WHERE SysStartTime <= @PointInTime AND SysEndTime > @PointInTime
SELECT * FROM OrderLine WHERE SysStartTime <= @PointInTime AND SysEndTime > @PointInTime
ON o.o_id = ol.o_id
If you get performance issues, you could borrow the Data Vault concept of a Point-In-Time table. Explaining the concept of a PIT table is a bit too much for here but if you need help on it let me know.
Hope this helps.
great article! I’m also a great fan of a PSA and would hardly recommend to use one within a modern DWH architecture.
There is one additional pro I’d like to add: You can load every interface without the risk you misinterpreted somethin businesswise – the only information you need, is of technical nature and hence loading processes for the PSA can be generated fully automated from technical metadata.
Thanks for your valuable addition.
Very good article about the PSA area for Data warehouse concepts.
Just one question about the effective end dts, for the “old” records, will you also have to update the value of effective end dts to the previous records? Because there is updated data from the delivery, and the previous records are not valid then.
If you do so, would it be more complicated for the whole process and also for the time consuming?
I think maybe that’s part of the reason, why we should only use effective start dts with Window Function.
Thanks, best wishes come from Beijing!
Thanks for your valuable feedback. You are mentioning a valid point, also in the last years also in DataVault Enddatetimestamps are not considered the best practice anymore, and – you can guess already – TSQL Windowing function has taken over the honour to determine the enddate/time, mainly to avoid updates. So good point.