Generate new predictable unique identifiers from existing ones

Introduction

Today I want to share a script with you that I used to copy some database rows that are identified by unique identifiers (GUIDs).
I had related rows that I needed to copy (think of OrderHeader / OrderDetail, where OrderDetail contains the OrderHeaderGuid).

Problem

When you want to copy related records that are identified by unique identifiers, you have a dependency on the unique identifier values that are used in other tables as foreign keys.
Especially when you want to copy multiple rows (having multiple different unique identifiers), creating new, predictable GUIDs from the existing ones can be useful.
In the example below I explain how this can be done using an MD5 hash of the source GUID.
The example uses the AdventureWorks database.
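The core of the trick is a single expression: hash the source GUID together with a salt value and convert the 16-byte MD5 result back to a uniqueidentifier. A minimal sketch (the GUID and salt values below are just examples):

-- Minimal sketch: derive a new, predictable guid from an existing one.
-- The same @SourceGuid + @Salt combination always yields the same result.
DECLARE @SourceGuid UNIQUEIDENTIFIER = 'B85A11C1-7F2B-4C5E-9A2D-3D1E5F6A7B8C';
DECLARE @Salt VARCHAR(30) = '2018-01-01 00:00:00';

SELECT CONVERT(UNIQUEIDENTIFIER,
    HASHBYTES('MD5', CONVERT(CHAR(36), @SourceGuid) + @Salt)) AS [NewPredictableGuid];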

Solution

TSQL-E04-script.sql

--\
---) Script: Generate new predictable GUIDs from existing ones - demo
---) Author: Hans Michiels
---) Website: www.hansmichiels.com
--/

--\
---) 1. Prepare for the demo
--/
USE [tempdb]
GO

IF OBJECT_ID('[dbo].[SalesOrderHeader]', 'U') IS NOT NULL
   DROP TABLE [dbo].[SalesOrderHeader];
IF OBJECT_ID('[dbo].[SalesOrderDetail]', 'U') IS NOT NULL
   DROP TABLE [dbo].[SalesOrderDetail];
GO

SELECT NEWID() AS [SalesOrderGUID], h.*
INTO [dbo].[SalesOrderHeader]
FROM [AdventureWorks].[Sales].[SalesOrderHeader] h

SELECT h.SalesOrderGUID, NEWID() AS [SalesOrderDetailGUID], d.*
INTO [dbo].[SalesOrderDetail]
FROM [AdventureWorks].[Sales].[SalesOrderDetail] d
JOIN [dbo].[SalesOrderHeader] h ON h.SalesOrderID = d.SalesOrderID

ALTER TABLE [tempdb].[dbo].[SalesOrderHeader] DROP COLUMN SalesOrderID;
ALTER TABLE [tempdb].[dbo].[SalesOrderDetail] DROP COLUMN SalesOrderID;
ALTER TABLE [tempdb].[dbo].[SalesOrderDetail] DROP COLUMN SalesOrderDetailID;
GO

--\
---) 2. Demonstrate how it works given the example that you would
---) want to copy an existing SalesOrder and its detail rows to
---) a new one, without having the dependency to insert the
---) header first, then retrieve the value for [SalesOrderGUID]
---) for the row, and then use it to insert the detail rows.
---) Using a new predictable unique identifier this lookup is
---) not needed.
--/

-- First we declare a value to use as 'salt'. This makes sure that we could
-- copy the same source rows multiple times to new rows, because the generated
-- guids will be different every time the script runs (as long as the runs are
-- at least a second apart, given the timestamp precision used below).
-- We keep it simple and just use a timestamp as the salt value.
DECLARE @NewGuidSalt VARCHAR(30) = CONVERT(VARCHAR(30), GETDATE(), 120)
PRINT @NewGuidSalt

-- Again I keep the example very simple, almost hypothetical, because the action
-- below is not very likely in real life: we select the 3 sales orders of a specific
-- CustomerID and copy them to a different CustomerID.
-- However, it explains the concept.
DECLARE @SourceCustomerID INT
DECLARE @TargetCustomerID INT

SELECT
  @SourceCustomerID = 11000
, @TargetCustomerID = 99999

-- Obviously it is important to run the entire operation
-- in a transaction and implement TRY CATCH to rollback
-- the transaction when an error occurs. For instance
-- a hash collision could occur (but the chance that it
-- happens is astronomically small).
BEGIN TRY

    BEGIN TRAN

    --\
    ---) 2a. Copy [SalesOrderHeader]
    --/
    INSERT INTO [dbo].[SalesOrderHeader]
      ( [SalesOrderGUID]
      , [RevisionNumber]
      , [OrderDate]
      , [DueDate]
      , [ShipDate]
      , [Status]
      , [OnlineOrderFlag]
      , [SalesOrderNumber]
      , [PurchaseOrderNumber]
      , [AccountNumber]
      , [CustomerID]
      , [SalesPersonID]
      , [TerritoryID]
      , [BillToAddressID]
      , [ShipToAddressID]
      , [ShipMethodID]
      , [CreditCardID]
      , [CreditCardApprovalCode]
      , [CurrencyRateID]
      , [SubTotal]
      , [TaxAmt]
      , [Freight]
      , [TotalDue]
      , [Comment]
      , [rowguid]
      , [ModifiedDate]
      )
    SELECT
        CONVERT(UNIQUEIDENTIFIER, HASHBYTES('MD5', CONVERT(CHAR(36), src.[SalesOrderGUID]) + @NewGuidSalt)) AS [SalesOrderGUID]
      , [RevisionNumber]
      , [OrderDate]
      , [DueDate]
      , [ShipDate]
      , [Status]
      , [OnlineOrderFlag]
      , [SalesOrderNumber]
      , [PurchaseOrderNumber]
      , [AccountNumber]
      , @TargetCustomerID AS [CustomerID]
      , [SalesPersonID]
      , [TerritoryID]
      , [BillToAddressID]
      , [ShipToAddressID]
      , [ShipMethodID]
      , [CreditCardID]
      , [CreditCardApprovalCode]
      , [CurrencyRateID]
      , [SubTotal]
      , [TaxAmt]
      , [Freight]
      , [TotalDue]
      , [Comment]
      , NEWID() AS [rowguid]
      , GETDATE() AS [ModifiedDate]
    FROM
        [dbo].[SalesOrderHeader] src
    WHERE
        src.CustomerID = @SourceCustomerID

    --\
    ---) 2b. Copy [SalesOrderDetail] rows
    --/
    INSERT INTO [dbo].[SalesOrderDetail]
      ( [SalesOrderGUID]
      , [SalesOrderDetailGUID]
      , [CarrierTrackingNumber]
      , [OrderQty]
      , [ProductID]
      , [SpecialOfferID]
      , [UnitPrice]
      , [UnitPriceDiscount]
      , [LineTotal]
      , [rowguid]
      , [ModifiedDate]
      )
    SELECT
        CONVERT(UNIQUEIDENTIFIER, HASHBYTES('MD5', CONVERT(CHAR(36), src.[SalesOrderGUID]) + @NewGuidSalt)) AS [SalesOrderGUID]
      , NEWID() AS [SalesOrderDetailGUID]
      -- For [SalesOrderDetailGUID] we simply use NEWID(), because it is not used anywhere else.
      , src.[CarrierTrackingNumber]
      , src.[OrderQty]
      , src.[ProductID]
      , src.[SpecialOfferID]
      , src.[UnitPrice]
      , src.[UnitPriceDiscount]
      , src.[LineTotal]
      , NEWID() AS [rowguid]
      , GETDATE() AS [ModifiedDate]
    FROM
        [dbo].[SalesOrderDetail] src
    JOIN
        [dbo].[SalesOrderHeader] h
        ON h.SalesOrderGUID = src.SalesOrderGUID
    WHERE
        h.CustomerID = @SourceCustomerID

    COMMIT TRAN;

END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK;
    DECLARE @ErrorMessage NVARCHAR(MAX) = ERROR_MESSAGE();
    RAISERROR(@ErrorMessage, 16, 1);
END CATCH

--\
---) 3. Now view the results
---) Note that the order detail rows are copied using the
---) same generated guid for [SalesOrderGUID]. Note that
---) if you rerun the script starting at step
---) "2. Demonstrate ..", you get a second copy of the
---) same orders with their order detail rows and a different
---) generated guid for [SalesOrderGUID], due to the salt
---) that was used.
--/
SELECT *
FROM
    [dbo].[SalesOrderHeader] h
WHERE
    h.CustomerID IN (@SourceCustomerID, @TargetCustomerID)

SELECT d.*
FROM
    [dbo].[SalesOrderDetail] d
JOIN
    [dbo].[SalesOrderHeader] h
    ON h.SalesOrderGUID = d.SalesOrderGUID
WHERE
    h.CustomerID IN (@SourceCustomerID, @TargetCustomerID)

GO

--\
---) 4. Cleanup
--/
USE [tempdb]
GO

IF OBJECT_ID('[dbo].[SalesOrderHeader]', 'U') IS NOT NULL
   DROP TABLE [dbo].[SalesOrderHeader];
IF OBJECT_ID('[dbo].[SalesOrderDetail]', 'U') IS NOT NULL
   DROP TABLE [dbo].[SalesOrderDetail];
GO

Download the script here.

© 2018 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Implementing a hash key collision strategy

Introduction

Dan Linstedt, the inventor of Data Vault, has written a lot about hash keys.
For instance, in one of his latest blog posts:
#datavault 2.0, Hashes, one more time.

I will not list all other sources, as you can use Google yourself.
A few comments on hash keys:

  1. You need them for scalability. Using sequence numbers means taking the risk that your data warehouse does not scale well later on, when the amount of data grows.
  2. They can collide: two different business keys can produce the same hash key. However, the chance that this happens is very small. For instance, when using SHA-1 (which produces a hash value of 160 bits), you will have a 1 in 10^18 chance of a hash collision when you have 1.71 * 10^15 hash values (read: hub rows), according to this blog post. (A hash key computation is sketched below the list.)
  3. If collisions are unacceptable you need a hash key collision strategy.
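For illustration, a hub hash key could be computed in T-SQL roughly as below; the business key value and column names are made up, and SHA-1 is used only because it is the algorithm mentioned above. A collision strategy then boils down to checking, on insert, that the computed hash key does not already exist in the hub for a different business key.

-- Hypothetical example: compute a SHA-1 based hash key for a hub row.
-- The business key is trimmed, upper-cased and delimited before hashing,
-- so the same business key always produces the same hash key.
DECLARE @CustomerCode NVARCHAR(20) = N'CUST-000123';

SELECT CONVERT(CHAR(40),
    HASHBYTES('SHA1', UPPER(LTRIM(RTRIM(@CustomerCode))) + N';'),
    2) AS [Customer_HK]; -- style 2 = hexadecimal string without '0x' prefix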

The full article is posted on DWA.Guide, so you can read further there ..


Picture credits: © Can Stock Photo / alexskp

The SEQUENCE, an updatable alternative for an IDENTITY column

Introduction

Recently I used SEQUENCES instead of IDENTITY columns. I was aware of their existence but there was never a need to use them .. until now.

Problem

Identity values from another system needed to be preserved. I know it is debatable whether you should want this at all, but without going into detail, I can tell you that this time it made sense.
The problem with IDENTITY columns, however, is that you can only assign a specific value during an INSERT, using the IDENTITY_INSERT option.
In my case this was not a simple solution. I would need a complicated script: create a new table, use IDENTITY_INSERT to copy over the rows from the current table, recreate all foreign keys that refer to the old table so that they refer to the new table, drop the foreign keys referring to the old table, drop the old table, and finally rename the new table to the old name. Sounds like a complicated plan, right? That is what I thought as well ..


Picture credits: © Can Stock Photo / Andreus

Solution

Using SEQUENCEs made it possible to UPDATE the values. But I must admit: if you already have a table with an IDENTITY column, with rows, with foreign keys, it is already too late to switch to a SEQUENCE, or at least you need the complicated plan described above.
But in my case I generated a data warehouse with biGENiUS, a data warehouse automation tool.
Then I did an initial load, and then I wanted to set some IDs right. I added IDs to Data Vault 2.0 hubs to use them for SCD Type 1 dimensions.
The flexibility of biGENiUS made it possible for me to change the template for the generated hub and use a column that gets its value from a SEQUENCE instead of an IDENTITY column. So 15 minutes later I had what I wanted: an updatable sequence number in a hub that has a hash key as its primary key.
Without going into more detail about why I chose this approach and whether it is good or not, I want to share the scripts with you.

First the script for the table with a sequence number/autonumbered column, in my case a hub table in a data vault.

create_hub.sql

USE [MSSQL_E13]
GO
IF SCHEMA_ID('raw') IS NULL EXEC('CREATE SCHEMA [raw]');
GO

IF OBJECT_ID (N'[raw].[Example_Hub]', N'U') IS NOT NULL
    DROP TABLE [raw].[Example_Hub];
GO

CREATE TABLE [raw].[Example_Hub] (
     [Example_HK] CHAR(32) NOT NULL -- the hashkey
    ,[Example_BK] NVARCHAR(10) NOT NULL -- the business key
    ,[SA_SourceSystem_ID] SMALLINT NOT NULL
    ,[DW_Load_ID] BIGINT NOT NULL
    ,[DW_Sequence_ID] INT NOT NULL -- don't add IDENTITY(1,1) here,
                                   -- the alternative is below.
    ,CONSTRAINT [PK_Example_Hub] PRIMARY KEY NONCLUSTERED (
          [Example_HK] ASC -- due to the distributed character of
          -- the hashkey, it is created as NONCLUSTERED.
    )
    ,CONSTRAINT [IX_UN_Example_Hub] UNIQUE (
          [Example_BK]
    )
    ,CONSTRAINT [IX_UC_Example_Hub] UNIQUE CLUSTERED (
          [DW_Sequence_ID] -- the additional sequence number is
          -- created with a UNIQUE CLUSTERED index.
    )
);
GO
-- Create a sequence to use instead of IDENTITY.
-- Note that the sequence is not bound to the table.
-- It is in our case only related to the table using a naming
-- convention {TABLENAME}_Sequence, but you could use a sequence
-- to be used by multiple tables, although I do not know why you
-- would want that (it could become a bottleneck for insertions).
IF OBJECT_ID('[raw].[Example_Hub_Sequence]', 'SO') IS NULL
    CREATE SEQUENCE [raw].[Example_Hub_Sequence] AS INT START WITH 1;

-- Add default constraint for column [DW_Sequence_ID] so that
-- it uses the next value of the sequence.
ALTER TABLE [raw].[Example_Hub]
    ADD CONSTRAINT [DF_raw_Example_Hub__DW_Sequence_ID]
    DEFAULT NEXT VALUE FOR [raw].[Example_Hub_Sequence]
    FOR [DW_Sequence_ID];
GO
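To show what the default constraint does, here is a small usage example with made-up values: the INSERT does not mention [DW_Sequence_ID], so the default supplies the next sequence value, and because it is a regular column it can simply be updated afterwards (as long as the UNIQUE CLUSTERED constraint is not violated).

-- [DW_Sequence_ID] is omitted: the default constraint assigns
-- NEXT VALUE FOR [raw].[Example_Hub_Sequence] to the new row.
INSERT INTO [raw].[Example_Hub]
    ([Example_HK], [Example_BK], [SA_SourceSystem_ID], [DW_Load_ID])
VALUES
    ('0123456789ABCDEF0123456789ABCDEF', N'DEMO-1', 1, 1);

-- Unlike an IDENTITY column, the column can be updated later on.
UPDATE [raw].[Example_Hub]
   SET [DW_Sequence_ID] = 42
 WHERE [Example_BK] = N'DEMO-1';
GO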

Then the script to update the sequence number from an existing table, based/joined on the business key.

update_sequence.sql

--\
---) Keep ExampleId's
---) Obviously it only makes sense to run this script when
---) (generated) values are loaded to the table and the [DW_Sequence_ID]
---) column, that you now want to change.
--/
BEGIN TRY
    BEGIN TRAN;

    --\
    ---) Swap the sign of all ID's to a negative value.
    ---) This works only if all values are positive.
    --/
    UPDATE hub
      SET [DW_Sequence_ID] = -1 * [DW_Sequence_ID]
      FROM [raw].[Example_Hub] hub
      WHERE [DW_Sequence_ID] > 0;

    --\
    ---) Now update the values.
    --/
    UPDATE hub
      SET [DW_Sequence_ID] = curdb.ExampleId
      FROM [raw].[Example_Hub] hub
      JOIN [ExistingDB].[dbo].[dim_Example] curdb
        ON curdb.ExampleCode = hub.[Example_BK]

    --\
    ---) Make sure that the sequence next value will be higher than the
    ---) already used ID's in the existing table.
    --/
    DECLARE @MaxID_New INT = ISNULL((SELECT MAX([DW_Sequence_ID]) FROM [raw].[Example_Hub]), 0) + 1;
    DECLARE @sql VARCHAR(1000)

    SET @sql = 'ALTER SEQUENCE [raw].[Example_Hub_Sequence] RESTART WITH ' + CONVERT(VARCHAR, @MaxID_New) + ';'
    PRINT @sql
    EXEC(@sql);

    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRAN;
    THROW;
END CATCH

--\
---) If not all values are updated, you have to check manually.
---) Maybe you can just swap the sign again to make it a positive value, if that
---) does not conflict with existing values.
--/
IF EXISTS(SELECT 1 FROM [raw].[Example_Hub] hub WHERE [DW_Sequence_ID] < 0)
BEGIN
    PRINT 'Not all rows of [Example_Hub] are updated, please check.'
    SELECT * FROM [raw].[Example_Hub] hub WHERE [DW_Sequence_ID] < 0;
END

GO

Download the scripts here.

Conclusion / Wrap up

A sequence is an object type in SQL Server that, as far as I know, is not very widely used. But it can be useful!
It can be used as an alternative to a column with an IDENTITY specification, if you want to update the value later on.
Please make sure that, if you update the values, the SEQUENCE object will afterwards only return values that are higher than the updated values. It is your responsibility to make this happen. An example of how to do this is shown in the second script.

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Data Warehouse Automation, the way to go

Introduction

Due to other activities I have not had the opportunity to blog much in the past months. One of those activities was the creation of the web site DataWarehouseAutomation.Guide.
I think it was worth the pause. It will also influence my time available for blogging in the future. But I love to dive into a more detailed technical post when I have some time.

I just wanted to let you know of a few blog posts on that new web site, www.dwa.guide (the short URL will redirect to the long one):

Contact me if you want to discuss your data warehouse automation needs or concerns in a 1-on-1 conversation!

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Executing your own .NET console application from SSIS (SSIS Series)

Introduction

The SSIS Script Task is a very powerful component to use in an SSIS package, and most of the time you can achieve what you want with it.
Still, I have faced a few situations where a Script Task was not the best solution. In those cases I made a .NET console application and executed it from within an SSIS package.

The full article is posted on SQLShack.com, so you can read further there ..

The Data Warehouse Automation Tools Survey Results (Data Warehouse Series)

Introduction

As you might know, I started a survey on the usage and opinion about Data Warehouse Automation (DWA) Tools.
Today I am presenting the results.
I sincerely want to thank all the participants for sharing your thoughts.

The participants

First, some insight into the people who filled in the survey. There were not that many of you (62 completed surveys).

In statistics there is a formula to compute the minimum sample size (number of participants) given a certain margin of error, reliability level and total population. If I fill in 10% for the error margin, 90% for the reliability level and 20,000 for the total population (this number does not have much impact on the end result), the minimum sample size should be 68.
Please note that with a 10% error margin and a 90% reliability level there is a considerable chance that the answers are not representative for the entire population of data warehouse professionals.
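For reference: this is presumably a variant of Cochran's sample size formula with a finite population correction; with z = 1.645 (90% reliability), p = 0.5 and e = 0.10 it indeed gives 68. A sketch of the calculation, in LaTeX notation:

n_0 = \frac{z^2 \, p(1-p)}{e^2} = \frac{1.645^2 \times 0.25}{0.10^2} \approx 67.7,
\qquad
n = \frac{n_0}{1 + (n_0 - 1)/N} = \frac{67.7}{1 + 66.7/20000} \approx 67.4 \;\Rightarrow\; 68 \text{ (rounded up)}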

Participants by country

Participants are spread over 23 countries on several continents, with a concentration in the Netherlands and the United States.

Detailed list:

Country No of participants
Argentina 1
Australia 1
Austria 1
Belgium 3
Brazil 1
Bulgaria 1
Denmark 1
Germany 3
Ghana 1
Hungary 1
Iceland 1
India 4
Lithuania 1
Netherlands 20
New Zealand 1
Norway 1
Romania 1
Slovakia 1
Spain 1
Switzerland 3
Turkey 1
United Kingdom 2
United States 10

Jobroles of participants

Almost half of the participants selected Data Warehouse Architect as best describing their current job role.
Furthermore there were 12 ETL developers and a number of other job roles, as shown in the graph.

Exact counts:

Jobrole No of participants
BI & Analytics/Data Science manager 2
BI Consultant 1
Business analyst 1
Data warehouse architect 30
Director 2
ETL Developer 12
Front end BI Developer 1
Manager 1
Multi-disciplined BI Professional 4
Project manager 6
SQL Developer 1

Familiarity with the investigated tools

The first question to answer is: how familiar are the respondents with these tools?
I used 5 levels of familiarity with a short name and explanation. The option values have the following meaning:

  1. Don’t know: I do not know this product.
  2. Heard of: I have heard of this product but I have not worked in a project where it was tried or used.
  3. Tried: I have tried this product but not used it in production environments.
  4. Used: I have worked in a project where this product was used.
  5. Using: I am currently working in a project where this product is being used.

The results in a graph:

And the exact figures, first absolute, then percentages:

Issuer – Productname Don’t know Heard of Tried Used Using
AnalytiX DS – CatfX / Data Vault Code Gen Bundle 44 14 3 0 0
Attunity – Attunity Compose (formerly BiReady) 36 21 4 0 1
Gamma Systems – Data Warehouse Studio 55 5 1 0 0
Javlin – CloverETL 51 7 2 1 0
Magnitude – Magnitude Kalido 39 20 1 1 0
Qosqo – Quipu 43 12 3 2 1
TimeXtender – TimeXtender 42 14 3 0 2
Trivadis – biGenius 52 5 1 1 2
Varigence – BimlExpress / Bidshelper 43 3 4 3 8
Varigence – Mist / BimlStudio 46 4 9 1 1
WhereScape – RED 35 10 6 5 5
Average 44.2 10.5 3.4 1.3 1.8
Issuer – Productname Don't know Heard of Tried Used Using
AnalytiX DS – CatfX / Data Vault Code Gen Bundle 72.1% 23% 4.9% 0% 0%
Attunity – Attunity Compose (formerly BiReady) 58.1% 33.9% 6.5% 0% 1.6%
Gamma Systems – Data Warehouse Studio 90.2% 8.2% 1.6% 0% 0%
Javlin – CloverETL 83.6% 11.5% 3.3% 1.6% 0%
Magnitude – Magnitude Kalido 63.9% 32.8% 1.6% 1.6% 0%
Qosqo – Quipu 70.5% 19.7% 4.9% 3.3% 1.6%
TimeXtender – TimeXtender 68.9% 23% 4.9% 0% 3.3%
Trivadis – biGenius 85.2% 8.2% 1.6% 1.6% 3.3%
Varigence – BimlExpress / Bidshelper 70.5% 4.9% 6.6% 4.9% 13.1%
Varigence – Mist / BimlStudio 75.4% 6.6% 14.8% 1.6% 1.6%
WhereScape – RED 57.4% 16.4% 9.8% 8.2% 8.2%
Average 72.3% 17.2% 5.6% 2.1% 2.9%

I think the most remarkable result is that most of the tools are not very well known: the answer “Don’t know” is given on average by 72.3% of the participants.

Product satisfaction

The product satisfaction question gave the insight that there are no ‘bad’ tools. All tools fulfill a need, and the marks go from 3 to 5 (on a scale of 1 to 5). So the products with the lowest mark for satisfaction still score a ‘C’ (or, as we say in the Netherlands, a ‘vijfje’).

Value for money

Value for money is obviously a very subjective matter. What one person thinks is expensive, the other might think of as cheap. Most vendors are not transparent about their prices.
Marks vary from 1.80 to 3.46. What we can conclude from this is that some tools are perceived as offering relatively low value for money, while others provide better value for money, according to the participants of the survey.
The products that have an average score of 3 or higher (again on a scale of 1 to 5) are both Varigence products (BimlExpress/Bidshelper and Mist/BimlStudio), AnalytiX DS (CatfX / Data Vault Code Gen Bundle) and Trivadis (biGenius).

By the way, the question in the survey had the following scale: 1=cheap and 5=expensive. I translated this into a value for money by reversing the answers (so 1 became 5, 2 became 4, and vice versa). In this way, a higher score means a ‘cheaper’ pricing perception, thus a better value for money.

Other tools that were mentioned

A number of other tools were also mentioned, many of them only once or twice.
It is remarkable that tools that do not meet the “DWA Tool” definition were also mentioned, for instance the Microsoft BI products.
What I learned from this is that I should have given a definition of what I meant by a “Data Warehouse Automation Tool”.
From Wikipedia:
“Data warehouse automation or DWA refers to the process of accelerating and automating the data warehouse development cycles, while assuring quality and consistency. DWA is believed to provide automation of the entire lifecycle of a data warehouse, from source system analysis to testing to documentation. It helps improve productivity, reduce cost, and improve overall quality.”

Worth mentioning is the open source tool ODE. I didn't know it, but tried it recently. It's a SQL-based, metadata-driven tool without a graphical user interface, and I could not get an “AdventureWorks” example working due to errors that occurred while working through the example. So I think this tool needs to become more stable, or at least should have a better manual on how to use it. But it's Open Source, and that puts a smile on my face.
From a company/human resources perspective, working with such a tool can however lead to hidden costs, because using it can be more laborious than working with a commercial, ‘more stable’ product. And once you start changing the source code (for yourself or your company, not as a member of the Open Source development team), you can't upgrade to newer versions anymore and are stuck with the maintenance of your copy of the tool. And if you join the Open Source team, you are now doing two projects instead of one!
In the Netherlands we have an old-fashioned saying “Bezint eer gij begint” which means “think well before you start”, and I think this is applicable here as well. A license for a commercial tool might be worth the money.

What also appears to be the case is that a number of these tools are not for sale, but are tools that consultancy firms use for doing their projects.

List of all the other tools that were mentioned:

Issuer – Productname Mentions
Adante – Datawarehouse on Demand 1
Automic – One Automation 2
Birst – Birst 1
Centennium – TBD engine (template based development) 1
Cubido – Cubikit 1
DIKW – DHW Accelerator 1
Grip Op Data B.V. – GDIP 1
Infent – Infenture Insight 1
Informatica – Powercenter 2
Informatica – B2b data exchange 1
Informatica – Informatica 1
Informatica – MFT 1
I-refact – I-refactory 1
Kadenza – Smart Datawarehouse generator 1
Microsoft – MSBI(SSIS,SSAS,SSRS) 9
Microsoft – Power BI 1
Optimalbi Limited – ODE (Open source) 1
Oracle – ODI 1
Oracle – OWB 1
Own build – Excelbased tool 1
Pentaho – Kettle 2
Pentaho – Big Data 1
Rabobank – Foundation Toolkit 1
SAP – Data Services 2
SAP – Information Steward 1
Tableau – Server 1
Talend – Data Fabric 1
Talend – Open Studio 1
TopBi – Data Vault Generator 1
TopBI – HSG 1
Varigence – BimlFlex 1
Varigence – BimlOnline 1
Viz Analytics – Data Warehouse Studio 1

Comments given about all the tools

Participants could give their opinion about the tools in free text format – and they did!
Here are just the bare comments.

Attunity – Attunity Compose (formerly BiReady)
  • “Like features, makes work flow easily “
Qosqo – Quipu
  • “This one could be split up between the free and paid variant for this research.”
  • “Most advanced options on data vault modeling and generation in one tool. Does not support data marts, requires Groovy knowledge to change templates.”
TimeXtender – TimeXtender
  • “vendor lock-in, no best practices applied when generating SSAS cubes”
  • “lock in, very good features, easy to use (even for business users)”
Trivadis – biGenius
  • “if they extend SSIS generation no lock in, very flexibel, fast, cool”
  • “No Vendor-Lock (generated Code can be scheduled and executed native), Microsoft and Oracle-Support, Data Vault Support”
  • “Integrated product, short learning curve, easy to adapt”
Varigence – BimlExpress / Bidshelper
  • “Great to building SSIS packages.”
  • “Free, but you need to write the generating code first. Ideal for consultants though.”
  • “Nice combination, but needs to mature.”
  • “no lock in, very cool that you generate SSIS/SSAS, free, a lot of work coding, not for everyone”
  • “Very flexible, but long learning curve”
  • “Second best tool to generate SSIS packages. Requires a lot of exceptions if you want to alter parts of code for specific tables. Debugging is hell.”
Varigence – Mist / BimlStudio
  • “Has potential but is progressing slowly. Pricing is fair.”
  • “No vendor lock-in, as the result of BIML are just SSIS packages”
  • “More mature version.”
  • “easier, expensive, nice you can easily re-engineer from existing DWH”
  • “There is no reverse engineering module.We can not convert old package to use BimlStudio”
  • “Very flexible”
  • “Best tool to generate SSIS packages. Still requires a lot of manual labor for low level changes.”
WhereScape – RED
  • “We were able to increase development speed with a factor 3. “
  • “Vendor-Lock (needs App-Server for Scheduling, Processing), Microsoft not fully supported (only Beta), Data Vault Support”
  • “Creates stored procedures which clutters db’s, powerful tool, but has some annoyances”
  • “Several products which are not completely integrated, some features not “ready”, several limitations (especially when working with datasets, expensive for what you get, only based overseas (except smalll base in UK)”
  • “You have to click and type almost everything, then it “generates” a stored procedure. Also stores its repository in the same database and schema as your data warehouse.”
  • “Changed projects into initiatives enabling business analytics. “
  • “Like lineage, like support for multiple generation types (such as data vault), dislike performance, dislike excessive storage, dislike cost of consulting, dislike unreliability, dislike UI, dislike lack of adapters, dislike reliance on database procedures, dislike scheduler, dislike target specific repository”

Comments for other tools (not all available for sale)

Adante – Datawarehouse on Demand
  • “Very easy to use automation tool for Dimensional modeling”
Centennium – TBD engine (template based development)
  • “SQL server based metadata based staging, vault en datamart generator. Developed by Centennium, for the larger part made adult by me.”
I-refact – I-refactory
  • “A very solid, well thought solution for the data management. It’s not just for dwh, but for the whole data platform of your company.”
Kadenza – Smart Datawarehouse generator
  • “Model driven generator c-sharp”
Talend – Data Fabric
  • “Talend is a very friendly for completing complex operations like Data Integration, Data Quality and Big Data (towards Hadoop) with minimum programming skills”
Viz Analytics – Data Warehouse Studio
  • “Custom built from the ground up, with own ETL engine, very quick to build and implement DWs.”

Used technologies

One of the questions was about the technology that is being used (from Gartner Magic Quadrant for BI and Analytics). As more than one answer was allowed, the total number of responses is more than the number of participants, because some participants use two or more technologies.
In the Top 5 we see: Microsoft, Oracle, Tableau, SAP and Qlik. To be honest, nothing that surprises me.

Technology Mentions
Alteryx 7
Birst 3
Board International 3
Datameer 3
IBM 11
Information Builders 3
Microsoft 47
MicroStrategy 7
Oracle 25
Pentaho 6
Qlik 13
SAP 16
SAS 5
Sisense 1
Tableau 20
TIBCO Software 2
Yellowfin 1
Zoomdata 1

Conclusion / Wrap up

With this survey I tried to give the business intelligence community more insight into the available Data Warehouse Automation Tools.
Due to the low number of participants, conclusions come with large uncertainty margins. I think this is a pity.
Nevertheless, this investigation could be a good starting point for further analysis of DWA Tools or for a DWA Tool selection.

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

How to mimic a wildcard search on Always Encrypted columns with Entity Framework

Introduction

The title of this post should have been “How to implement wildcard search functionality with Always Encrypted, make deterministic encryption safer, and load initial data using SqlBulkInsert”, but as you understand, that’s just too long for a title.

A number of years ago I built a web application with “Always Encrypted” columns, implemented as VARBINARY columns, before Microsoft offered this feature out of the box in SQL Server. So in case the database server were compromised by hackers, no client details could be revealed.

Just like Microsoft's solution, my Data Access Layer, built in .NET, did the encryption and decryption, transparently for other code that accessed it. And I had the same challenge of how to search on encrypted columns.

So it was with (maybe more than average) interest that I was looking forward to finding out how the Always Encrypted feature of SQL Server 2016 works.

The full article is posted on SQLShack.com, so you can read further there ..

Temporal Table applications in SQL Data Warehouse environments (SQL Server Series)

Today the subject of investigation is the Temporal Table, which is a new feature in SQL Server 2016. My focus will lean slightly towards how to use it in Data Warehouse environments, but some general information will pass by as I write.
I want to cover the following topics (a minimal temporal table definition is sketched right after the list):

  1. What is a temporal table (in short)?
  2. Can I use a temporal table for a table in the PSA (Persistent Staging Area)?
  3. Can I use a temporal table for a Data Vault Satellite?
  4. Is using temporal tables for full auditing in an OLTP system a good idea?
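To give an idea for topic 1: a system-versioned temporal table is an ordinary table with two period columns and an attached history table that SQL Server maintains automatically. A minimal sketch, with made-up table and column names:

CREATE TABLE [dbo].[Customer]
(
     [CustomerID]   INT           NOT NULL PRIMARY KEY
    ,[CustomerName] NVARCHAR(100) NOT NULL
    -- Period columns, maintained by SQL Server itself:
    ,[ValidFrom]    DATETIME2(2) GENERATED ALWAYS AS ROW START NOT NULL
    ,[ValidTo]      DATETIME2(2) GENERATED ALWAYS AS ROW END NOT NULL
    ,PERIOD FOR SYSTEM_TIME ([ValidFrom], [ValidTo])
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[Customer_History]));
GO

-- Point-in-time query over current and historic rows:
SELECT * FROM [dbo].[Customer]
FOR SYSTEM_TIME AS OF '2017-01-01 00:00:00';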

The full article is posted on SQLShack.com, so read further there ..

Using a Persistent Staging Area: What, Why, and How

Change log

Date Changes
21 February, 2017 Due to a very useful discussion on LinkedIn with Dan Linstedt, I have made some changes in this article to emphasize that a PSA does not have a permanent place in any EDW
(with the exception, as Dan suggests, of the Hadoop NoSQL platform used for slower storage mechanisms).

Today I want to discuss the Persistent Staging Area in an Enterprise Data Warehouse, or PSA for short. First I want to give a credit to Roelant Vos who has blogged about this subject before. He triggered me to go further with this concept to see “what’s in it for me” and to do a proof of concept using SQL Server 2016.

What is a Persistent Staging Area or PSA?

As an introduction, I want to explain what a (non-persistent) Staging Area is. Some of you will already know this, so be it. A Staging Area is a "landing zone" for data flowing into a data warehouse environment. The data in a Staging Area is only kept there until it is successfully loaded into the data warehouse. Between two loads, all staging tables are made empty again (or dropped and recreated before the next load).
A Persistent Staging Area on the other hand, is a Staging Area that is not wiped out between loads: it contains full history from the source(s) that deliver data to it.


Picture credits: © creativecommonsstockphotos | Dreamstime Stock Photos
The “Staging Area” for passengers at an airport ..

Why would you want to make use of a PSA, and what are its pros and cons?

A PSA makes it possible to reload the next layer in the Data Warehouse (for example a Data Vault) when requirements change during development of the Data Vault / EDW without going back to the original source(s). The original source(s) might not contain all history anymore (e.g. because it is an OLTP system or a web service returning only current data) or it is cumbersome to do such a reload because the source data is delivered in large numbers of flat files.

As you might know, for data warehouses with (semi-)structured data I always advocate a multi-tier architecture, including a Data Vault for storing the full history of unmodified source data, and one or more dimensional models (virtual if performance allows, otherwise physical) as data feeds for cubes, reports, dashboards and BI tools.

So, I hear you think: in a Data Vault you already have full history, so why would you need a PSA? The answer is simple: because building a Data Vault takes time. You do not load all data on day 1; most likely there are uncertainties about the model and your resources are restricted. So you build the Data Vault in increments. But you can already start capturing as much source data as you can! When Data Warehouse Automation tools are used, this is fairly easy to build. So even though you do not take the data further into your data warehouse than the (Persistent) Staging Area, you already start collecting your history of data.

Note: a different reason to use a PSA could be that your source systems contain high data volumes and are geographically spread: you could then create a PSA on premises, near the source system, and export only delta data from it – each row would get an I, U or D indication for Insert, Update and Delete – to massively reduce data traffic between source and EDW. I will not discuss this use case scenario today, because then the pros and cons would be completely different, and time travelling and full reloads would not be applicable.

Using a PSA has its pros and cons. To mention a few:

Pros:

  • You collect data, even for those parts of the data warehouse you have not properly designed and developed yet.
  • You can clear the Data Vault, or even change the model completely (this can happen, for instance when business concepts initially were not well understood), create new ETL Code to load data to the modified Data Vault and do a replay of loading full history of the data captured in the PSA.
  • It is important that the PSA is implemented with changed rows only. If done so, it does not have excessive storage needs. So, like a Data Vault Satellite, a new row for a table in the PSA is only added when something has changed for the primary (surrogate or natural) key. A start and end date/time can be used to implement this.
  • A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers. A (relational) PSA should have an optional, temporary character, and is not part of the core EDW tiers! This was written twice on purpose! Do not forget! It is useful during building the EDW, but must be disarmed when the EDW has gone into production and has reached a maintenance phase. A good way to make sure this happens is to plan a decommission date and stick to the plan (can also be a plan like “after sources X, Y and Z are connected and are data feeds in production for 3 months”). Do not keep the PSA around for no good reason.
    In this way, most of the cons below are very likely to be not applicable!

Cons:

  • You can argue that you are in fact building a second data warehouse.
  • When sources change over time, your PSA might not reflect the current structure of the source system(s) anymore (e.g. a column is removed in the source system but you leave it in the PSA so that you can add new data with the column empty but keep the data in this column for older rows). This might be confusing.
  • When source systems are not used anymore (for instance when accounting software is replaced by a similar product from a different vendor) you need to add tables of the new system to the PSA, while you also keep the tables from the old system, which will not receive new data anymore. Now you need to keep the ETL code to load from the “old systems” PSA tables, in case a full reload is needed, and you also need to develop code to get data from the “new systems” PSA, and might need logic for which point in time which tables should be loaded. As you can understand, now the maintenance of your PSA can start to become a heavy burden. So going back to the pros, make sure your PSA is optional and can be disarmed when the maintenance efforts do no longer justify the benefits.

How could you implement a PSA?

To give an idea of how this fits into my favourite architecture, I've added the PSA to it.
For a normal load the PSA is just loaded from the Staging Area (preferably in parallel with loading the Enterprise Data Warehouse from the Staging Area).
A Virtual Staging Access Layer (for example implemented with views) ensures that ETL code for loading the Enterprise Data Warehouse can switch easily between Staging Area and PSA for loading data. An example how to implement this will follow below.


How the PSA fits in the EDW architecture

For the next schema I owe you an explanation, because I know it's a bit weird to mix architecture with processing in one schema, but in this case I thought it made sense and could clarify how to reload the EDW from the PSA: no sources or Staging Area are involved; instead a time travelling mechanism queries the data from the PSA through the Virtual Staging Access Layer, to replay all loads that have been done so far, but with modified ETL code to load data into a modified or extended Data Vault.


Reloading the DWH from the Persistent Staging Area

To set up time travelling, you could create a list of load sessions and their start date/times, then subtract a small time fraction and use that value as the Point-In-Time date/time value to retrieve all rows from the previous load (a query sketch follows after the table below).
Note: for the DATETIME2 columns, I use a precision of centiseconds, so DATETIME2(2). For the justification, see one of my older blog posts: Stop being so precise! and more about using Load(end)dates (Datavault Series). If needed you can use a higher precision for your data warehouse.

In a table:

Load no LoadStartDatetime Point-In-Time value
1 2017-02-01 02:00:01.74 2017-02-02 02:00:00.88
2 2017-02-02 02:00:00.89 2017-02-02 02:00:02.11
3 (last load so far) 2017-02-03 02:00:02.12 9999-12-31 23:59:59.98 (! not 99 !)
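Such a list could be derived with a query like the sketch below. It assumes a (hypothetical) logging table [log].[LoadSession] with one row per load; LEAD() takes the next load's start date/time minus one centisecond as the Point-In-Time value, and the last load gets the '9999-12-31 23:59:59.98' value.

-- Sketch: derive Point-In-Time values from a (hypothetical) load session log.
SELECT
    ls.[LoadNo]
  , ls.[LoadStartDatetime]
  , ISNULL(
        DATEADD(MILLISECOND, -10,
            LEAD(ls.[LoadStartDatetime]) OVER (ORDER BY ls.[LoadStartDatetime]))
      , CONVERT(DATETIME2(2), '9999-12-31 23:59:59.98')
    ) AS [PointInTimeValue]
FROM [log].[LoadSession] ls
ORDER BY ls.[LoadStartDatetime];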

The Time Travel icon with the arrows in the picture above refers to the process that loops through all Start date/time values chronologically and triggers execution of the load process from PSA to EDW for every Point-In-Time value.

SQL Server 2016 implementation: how could you implement loading the PSA from the Staging Area?

This logic is very similar to loading a Data Vault Satellite. For each staging table you should create a "History" version in the PSA database, with two extra columns (EffectiveStartDts and EffectiveEndDts) and a primary key defined as the primary key of the staging table plus the EffectiveStartDts. Most concepts of loading a satellite can be applied: you add rows that are new, insert new versions of rows that have changed, and mark absent rows (missing in the source) as such. For the PSA I would suggest simply updating the end date/time, so that such a row will not be in the selection when a later point-in-time is used. Also, there is no need to create a separate "hub" table for each staging table.
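As a simplified sketch of that pattern: the statements below use the [psa].[CustomerHistory] table and the [RowHash] column from the view example further on, a made-up three-part name [EDW_STG_DB].[stg].[Customer] for the staging table, a shortened column list, and they leave out the end-dating of rows that are absent in the source.

-- Sketch: load new rows and new versions of changed rows into the PSA.
DECLARE @LoadDts DATETIME2(2) = SYSDATETIME();
DECLARE @EndOfTime DATETIME2(2) = '9999-12-31 23:59:59.98';

-- 1. End-date the current version of rows that have changed
--    (one centisecond before the new version starts, inclusive end dating).
UPDATE cur
   SET cur.[EffectiveEndDts] = DATEADD(MILLISECOND, -10, @LoadDts)
  FROM [psa].[CustomerHistory] cur
  JOIN [EDW_STG_DB].[stg].[Customer] stg
    ON stg.[CustomerID] = cur.[CustomerID]
 WHERE cur.[EffectiveEndDts] = @EndOfTime
   AND cur.[RowHash] <> stg.[RowHash];

-- 2. Insert new rows and new versions of changed rows.
INSERT INTO [psa].[CustomerHistory]
    ( [CustomerID], [FirstName], [RowHash], [EffectiveStartDts], [EffectiveEndDts] )
SELECT
    stg.[CustomerID], stg.[FirstName], stg.[RowHash], @LoadDts, @EndOfTime
FROM [EDW_STG_DB].[stg].[Customer] stg
LEFT JOIN [psa].[CustomerHistory] cur
       ON cur.[CustomerID] = stg.[CustomerID]
      AND cur.[EffectiveEndDts] = @EndOfTime
WHERE cur.[CustomerID] IS NULL; -- no current version left: a new or changed row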

SQL Server 2016 implementation: how could you implement the historic reload from the PSA?

In a previous blog post Isolate cross database “databasename” dependencies using synonyms and schemas I already advocated the concept of putting each tier of the EDW in its own database schema, and use synonyms to achieve interconnectivity between objects in different databases. In the rest of this article I will assume that you understand this concept (if not, you could read this blog post first).

With some imagination, we can use a modified version of this concept by creating the Virtual Staging Access Layer into the PSA database. This virtual layer is needed for loading data from the PSA instead of the Staging Area.
Suppose the Staging Database has a database schema stg and that schema contains all tables that can be loaded into the EDW / Raw Data Vault.
What we can do is do a similar thing in the PSA database, so create a stg schema, but instead of tables this schema contains views with the same name and same columns as the tables in the Staging Database! These views select the rows valid at a certain point in time.

The challenge of retrieving only the rows valid at a point in time is an easy one: to the PSA database we add a table [psa].[PointInTime] with only one row and a [CurrentPointInTime] column of datatype DATETIME2(2). Now we can build a SELECT statement on the PSA history table (the same as the staging table, but with start and end date/time stamps to hold historic, changed and deleted rows), with a CROSS JOIN on the PointInTime table and a WHERE clause that selects only the rows valid at the given point in time.
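The helper table itself is trivial; a sketch, using the same DATETIME2(2) precision as elsewhere in this post:

-- One-row helper table that holds the point in time for which the [stg] views
-- in the PSA database return data.
CREATE TABLE [psa].[PointInTime]
(
    [CurrentPointInTime] DATETIME2(2) NOT NULL
);

INSERT INTO [psa].[PointInTime] ([CurrentPointInTime])
VALUES ('9999-12-31 23:59:59.98'); -- default: return the latest state
GO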

An example of how such a view could look:

CREATE_VIEW_stg_Customer.sql

CREATE VIEW [stg].[Customer]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 15-FEB-2017
Description: View that has exactly the same structure as a table with the same name
             in the Staging Database.
             This view can be used for full reloads of all data in the PSA Database.
==========================================================================================
*/

SELECT
    [CustomerID],
    [FirstName],
    [Initials],
    [MiddleName],
    [SurName],
    [DateOfBirth],
    [Gender],
    [SocialSecurityNumber],
    [Address],
    [PostalCode],
    [Residence],
    [StateOrProvince],
    [Country],
    [RowHash]
FROM
    [psa].[CustomerHistory] hist
CROSS JOIN
    [psa].[PointInTime] pit
WHERE
    hist.EffectiveStartDts <= pit.CurrentPointInTime
    AND hist.EffectiveEndDts >= pit.CurrentPointInTime -- When INCLUSIVE enddating is used.
    -- AND hist.EffectiveEndDts > pit.CurrentPointInTime -- When EXCLUSIVE enddating is used.
GO

The next steps are:

  1. Make sure your ETL code to load the data has a mechanism to skip loading data from the sources to the Staging Area and from the Staging Area to the PSA when the PSA is used instead of the normal Staging Area. For instance, you could add a view [stg].[IsPSA] to both the Staging database and the PSA database, returning a single row with a single value "False" or "True", respectively. In your package you could retrieve this value using the Staging Connection Manager, and change the flow of your package depending on whether the Staging database or the PSA database is connected.
  2. Build a time travelling mechanism (this could be a FullReloadMaster.dtsx package) that updates the [psa].[PointInTime] table for every point-in-time date/time and, after every update, starts the ETL process (e.g. a Master.dtsx SSIS package) with a changed connection string, connected to the PSA instead of the "normal" Staging database (a small T-SQL sketch of such a loop follows below)!
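A minimal T-SQL version of such a loop could look like the sketch below; the load session table is the same hypothetical [log].[LoadSession] used earlier, and the actual start of Master.dtsx is only indicated with a comment.

-- Sketch: loop chronologically through all Point-In-Time values and,
-- for each value, update [psa].[PointInTime] before the ETL run is started.
DECLARE @PointInTime DATETIME2(2);

DECLARE pit_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT ISNULL(
               DATEADD(MILLISECOND, -10,
                   LEAD([LoadStartDatetime]) OVER (ORDER BY [LoadStartDatetime]))
             , CONVERT(DATETIME2(2), '9999-12-31 23:59:59.98'))
    FROM [log].[LoadSession]
    ORDER BY [LoadStartDatetime];

OPEN pit_cursor;
FETCH NEXT FROM pit_cursor INTO @PointInTime;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE [psa].[PointInTime] SET [CurrentPointInTime] = @PointInTime;

    -- Here Master.dtsx would be started (for instance via SSISDB's
    -- [catalog].[create_execution] / [catalog].[start_execution]),
    -- with its connection string pointing to the PSA database.

    FETCH NEXT FROM pit_cursor INTO @PointInTime;
END
CLOSE pit_cursor;
DEALLOCATE pit_cursor;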

Here is a possible implementation of the [stg].[IsPSA] view, based on the fact that the PSA database will have _PSA_ in its name and the Staging database will not. This makes it possible to use the same version of the view for both the Staging and the PSA database.

CREATE_VIEW_stg_IsPSA.sql

CREATE VIEW [stg].[IsPSA]
AS
/*
==========================================================================================
Author: Hans Michiels
Create date: 10-FEB-2017
Description: View that returns one row with a bit column [IsPSA] that indicates
             if the current database is the PSA or the STG.
==========================================================================================
*/

SELECT
  CONVERT(BIT,
    CASE
    WHEN CHARINDEX('_PSA_', DB_NAME()) > 0 THEN 1
    ELSE 0
    END) AS [IsPSA]
GO


Example of how a FullReloadMaster.dtsx package could look. It is important that the Staging database connection string is a parameter of Master.dtsx and is passed by FullReloadMaster.dtsx, but that it now is a connection to the PSA database.

Now the next challenge is how to get this working. If you use an SSIS dataflow with an OLEDB Source (and a destination), it is not so difficult: you can just change the connection string of the OLEDB Source. I have done a proof of concept and it worked! You do not have to redeploy SSIS packages. If views in the PSA database are used with exactly the same structure as the tables in the Staging database, the SSIS package agrees with the "metadata", and "VS_NEEDSNEWMETADATA" or "VS_ISBROKEN" errors do not occur. Of course you could also create views for reading the Staging tables in the Staging database. This would create an additional layer of abstraction, because then in both the Staging Area and the PSA you read the data from views, and you could, when needed, implement views whose structure differs from the underlying tables.

If you use SQL code (e.g. stored procedures) this will be a little more work, but still doable. You can follow one of the following strategies for storing the SQL Code:

  • a push strategy: SQL code is in the PSA database, and should then also be in the Staging database. From there, you push the data to the raw Data Vault.
  • a pull strategy: SQL code is in the raw Data Vault. From there you pull the data from the Staging Area or the PSA. You will need to switch via synonyms: drop the stg synonyms and recreate them pointing to the PSA database before a historic reload is done. Afterwards, the reverse action needs to take place to let them point to the Staging database again (see the sketch after this list).
  • an outsider strategy: SQL code is in a separate database, and has synonyms for both staging and raw Data Vault tables and views. Now the same switch via synonyms method can be used.
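For the pull and outsider strategies, the synonym switch could look like the sketch below; the database names are made up, and in practice a statement pair like this would be generated for every staging table.

-- Sketch: point the stg synonyms to the PSA views before a historic reload ...
IF OBJECT_ID('[stg].[Customer]', 'SN') IS NOT NULL
    DROP SYNONYM [stg].[Customer];
CREATE SYNONYM [stg].[Customer] FOR [EDW_PSA_DB].[stg].[Customer];
GO

-- ... and back to the Staging database tables afterwards.
IF OBJECT_ID('[stg].[Customer]', 'SN') IS NOT NULL
    DROP SYNONYM [stg].[Customer];
CREATE SYNONYM [stg].[Customer] FOR [EDW_STG_DB].[stg].[Customer];
GO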

A few considerations:

  • a push strategy forces you to install your SQL code twice: in the Staging and the PSA database. Also a Staging database might not feel like the “right place” for stored procedures.
  • a pull strategy and an outsider strategy force you to change synonyms, something that normally only has to be done during a deployment, but now before a full reload (and afterwards the reverse action).

I have a slight preference for switching via synonyms, assuming that a full reload will not happen often.
Still using the push strategy could be a valid choice too.

Conclusion / Wrap up

In this post you could read more about the concept of a Persistent Staging Area. Although I have not used it in production environments yet, I think in some projects it can be very useful.
Important to keep in mind:

  • Make sure that your data warehouse can survive if the PSA is switched off. So do not expose it to end users or make it a mandatory part of your ETL Code.
  • Plan a decommission moment when the PSA is created and stick to the plan. Do not keep the PSA around for no good reason.
  • Use code generation tools to update the PSA from the staging area to limit the development effort to set it up.
  • Virtualize the read access to the staging tables, for instance by using views.

If you stick to these "rules", I think a PSA can prove its value for your Enterprise Data Warehouse!

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Is SSDT ready for Visual Studio 2017 RC already? (Tooling Series)

Introduction

At the moment of writing, Microsoft released Visual Studio 2017 RC (Release Candidate) a week ago. It is the fourth release of a Visual Studio 2017 "preview" since November 2016.
The final release is planned for the first half of 2017 (source: Redmond Channel Partner website).
Time to see what’s in the box!
And I am especially interested in how it will cooperate with SQL Server Data Tools (SSDT).

Download and installation

Visual Studio 2017 RC1 can be downloaded here. It's a web installer, so during installation it will download the components you have selected.
Installation is quite straightforward.
A quiet install (with my PowerShell script) does not work yet; I did not put effort into finding out why.
I just started the executable for a normal install.
The number of Components you can install is impressive.


Installable parts of Visual Studio 2017 ..

I have selected Data storage and processing and .NET desktop development (for my open source console applications).
After selection they appear in the summary on the right (see picture below).


Installation summary ..

The installation doesn’t take too long; I did not use a stopwatch, but think it took about 10 to 15 minutes.

What draws the attention?

I have chosen the blue color theme, and Visual Studio 2017 looks very similar to Visual Studio 2015.


Choosing the color theme ..

What is new is a Visual Studio Installer shortcut in the “Apps” of Windows.


Visual Studio Installer shortcut ..
You can use it to modify your installation easily.


Visual Studio Installer screen ..

Although slightly out of scope, I couldn't help opening one of my console apps, SQLAgentJobStarter, in this version to check whether any upgrade or modification would be necessary. And the answer is: no. It just opens.
What draws my attention are features in the editor that in the past could only be achieved with third-party tools. For example, a code change is suggested, and a preview of the change is shown before you apply it. Useful!

Visual Studio suggests code changes and shows a preview of the change ..

Is SSDT ready for Visual Studio 2017?

No. Sorry.
I downloaded SQL Server Data Tools (17.0 RC1); the description says "Includes support for SQL Server vNext CTP1, but not recommended for production use".
Apparently it is more related to “SQL Server vNext” (SQL Server 2018?) than to “Visual Studio vNext”.
And it’s just SSDT for Visual Studio 2015.
It’s doesn’t even install, probably because VS 2015 is not installed.

SSDT (17.0 RC1) does not install ..

But is this bad? Again, no. Just have some patience.
SSDT has a different release cycle than Visual Studio. On Microsoft’s SSDT blog there is no news on any SSDT activity for VS 2017.
I have read on some forum (sorry, I could not find the link anymore) that it usually takes a few months after a new release of Visual Studio before an update of SSDT becomes available.

Conclusion / Wrap up

So, you could already install VS 2017, but for BI projects (SSIS, SSRS, SSAS) you will still have to use SSDT for VS 2015.
Visual Studio 2017 is expected to be released in the first half of 2017, you have to add a few months for SSDT.

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.