
Big Data & Analytics Insight 2016: save the date for an insightful conference

Dear reader,

Is the data warehouse at the end of its lifecycle, or can the existing infrastructure gain new value from integrating with new Big Data technologies? If so, how do you go about it? Do you need inspiration for your analytics strategy? Then there’s plenty to take away at Big Data & Analytics Insight 2016, on June 7th in Utrecht, The Netherlands.

There will be input from Big Data domain experts and highly interesting presentations, such as the ones on Data Vault 2.0 & Big Data and Hands-on Predictive Text Analytics, that will inspire you. Whether you are deep into the technical or the business aspects of Big Data & Analytics, this conference is a must-attend if you want to stay ahead of future evolutions instead of enduring them.

Check the agenda and book your seat, or better yet, become a ba4all member and get access to one more conference and a hands-on session on Customer Analytics.

I look forward to seeing you at Big Data & Analytics Insight 2016!

Kind regards,

Hans Michiels
(as a speaker I was asked to do a bit of promotion ..)

Data Vault and Dimensional Modelling, a happy marriage! (Data Vault Series)

We need ambassadors!

I was at a one-day course recently and spoke to a professional working in the business intelligence domain.
He said that he had just followed a two-week course for data warehouse architects.
I asked him: what did they tell you about Data Vault?
His answer was shocking: “Not so much ..”

If recently trained data warehouse architects are not taught to use Data Vault in a data warehouse architecture, there definitely is a lot of missionary work to do, because Data Vault can add so much value to your data warehouse solution.
And more importantly, it is not a choice between a Kimball dimensional model OR a Linstedt Data Vault: use both and combine their strengths!

Intended audience

This is a strong call for everybody who reads this: share this post with at least one person you know who works in the business intelligence domain with star schemas (a.k.a. dimensional modelling) fed directly from source systems.

What are the strengths of a Dimensional Model (a.k.a. Star Schema)?

  • Because the dimensional model has been around for decades, a lot of front-end tools have excellent support for it.
  • A dimensional model is designed for reporting, slicing and dicing, for getting data out. That is what it is good at.

    This becomes clear from the following examples:

    • Query performance is usually better than that of other models, like Data Vault or a relational data warehouse, especially for aggregations.
    • A star model is an excellent source for cube databases, like SSAS, especially when using a multidimensional model.
    • Date selections and comparisons of periods (e.g. Q1 this year versus Q1 last year) are easier than in Data Vault or relational models (see the query sketch after this list).
  • It enables self-service BI. Give business (power) users (direct) access to a dimensional model, and there is a good chance that they can get data out in the format they want.
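
To give an idea of how naturally data comes out of a star schema, here is a minimal query sketch for the period comparison mentioned above. All table and column names (FactSales, DimDate, DimProduct and their columns) are hypothetical and only serve as illustration:

StarSchemaQuery.sql
-- Compare Q1 of this year with Q1 of last year, per product category.
SELECT d.[CalendarYear]
     , p.[Category]
     , SUM(f.[SalesAmount]) AS [TotalSales]
FROM [dbo].[FactSales] f
JOIN [dbo].[DimDate] d
  ON d.[DateKey] = f.[OrderDateKey]
JOIN [dbo].[DimProduct] p
  ON p.[ProductKey] = f.[ProductKey]
WHERE d.[CalendarQuarter] = 1
  AND d.[CalendarYear] IN (2015, 2016)
GROUP BY d.[CalendarYear], p.[Category]
ORDER BY p.[Category], d.[CalendarYear];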

What are the strengths of a Data Vault?

  • A Data Vault is meant to capture all of the data all of the time. Therefore it stores all data from the source systems, captures the changes in it, and keeps a full history of all changes (with the side note that multiple changes between two load cycles will be seen as one).
  • A Data Vault captures all data exactly as it is stored in the source systems: “the good, the bad and the ugly”, with no business rules applied to it (except, in some cases, hard business rules like data types). This makes all data auditable.
  • Data Vault 2.0 supports cross-platform integration with other systems (e.g. Hadoop) and decentralized data warehouses, for instance in different global regions.
  • With Data Vault 2.0, working agile is easy: in sprints, building the system in increments.
  • Data Vault has very good support for data integration: satellites hanging off the same hub can contain data from different source systems (see the sketch after this list).
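
To illustrate that last point, here is a simplified sketch of a customer hub with one satellite per source system. All names are hypothetical and the structures are reduced to the bare minimum:

HubAndSatellites.sql
-- One hub for the business key ...
CREATE TABLE [raw].[Hub_Customer]
    ( [CustomerHashKey] CHAR(32) NOT NULL PRIMARY KEY -- hash of the business key (Data Vault 2.0 style)
    , [LoadDts] DATETIME2 NOT NULL
    , [RecordSource] NVARCHAR(100) NOT NULL
    , [CustomerBusinessKey] NVARCHAR(20) NOT NULL
    );

-- ... and one satellite per source system, hanging off that hub.
CREATE TABLE [raw].[Sat_Customer_CRM]
    ( [CustomerHashKey] CHAR(32) NOT NULL
    , [LoadDts] DATETIME2 NOT NULL
    , [RecordSource] NVARCHAR(100) NOT NULL
    , [CustomerName] NVARCHAR(100) NULL
    , PRIMARY KEY ([CustomerHashKey], [LoadDts])
    );

CREATE TABLE [raw].[Sat_Customer_ERP]
    ( [CustomerHashKey] CHAR(32) NOT NULL
    , [LoadDts] DATETIME2 NOT NULL
    , [RecordSource] NVARCHAR(100) NOT NULL
    , [CreditLimit] DECIMAL(18, 2) NULL
    , PRIMARY KEY ([CustomerHashKey], [LoadDts])
    );

Because both satellites share the same hub key, data about the same customer from CRM and ERP is integrated without applying any business rules.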

So, what are the strengths of a data warehouse architecture using both?

  • You can change your mind! If your business rules change (and it happens!), just reload your dimensional model from the Data Vault with the new rules applied.
  • You have one version of the facts. If you have multiple dimensional models, each fed directly from the source systems, you have not.
  • Data history is not like a destroyed monument. With a Data Vault you capture all data from the source systems, and all changes applied to it. Source systems usually reflect only the current state of the data, so without a change-capturing central storage like a Data Vault, your history is gone forever!
  • Your dimensional model does not have to serve two purposes: data storage and presentation of information. These two tasks can be in conflict: for instance, there is currently no reporting need for all the data that you could grab from the source systems, so you leave it out of your star schema. But if the question comes tomorrow, you have no data and no history!
  • You have full auditability. Because the (raw) Data Vault stores the data from the source systems unmodified, every measure and every dimensional attribute can be traced back to the source it came from, and to the moment it arrived. This makes it possible to refute claims that the data is not correct. Maybe now it will become clear that the old data warehouse or reporting system has been lying for years!

How can you use both?

The answer is simple: by using a multi-tier architecture.
Data from the source systems is first extracted to a Staging Area, before it is moved into the Enterprise Data Warehouse, which uses a Data Vault model (with or without some optional components).
From there it is distributed to dimensional models (star schemas) and cubes, and while the data is on its way, business rules can be applied.
(Figure: multi-tier architecture using both Data Vault and Dimensional Modelling techniques.)
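
As a minimal sketch of that last step, a dimension could be a view on the Data Vault that applies a (soft) business rule on the way out. The view below reuses the hypothetical hub and satellite sketched earlier in this post:

DimCustomerView.sql
-- Present the most recent satellite version as a dimension,
-- applying a soft business rule (replace missing names with 'Unknown').
CREATE VIEW [biz].[Dim_Customer] AS
SELECT h.[CustomerBusinessKey] AS [CustomerCode]
     , COALESCE(s.[CustomerName], 'Unknown') AS [CustomerName] -- soft business rule
FROM [raw].[Hub_Customer] h
LEFT JOIN [raw].[Sat_Customer_CRM] s
  ON  s.[CustomerHashKey] = h.[CustomerHashKey]
  AND s.[LoadDts] = (SELECT MAX(s2.[LoadDts])
                     FROM [raw].[Sat_Customer_CRM] s2
                     WHERE s2.[CustomerHashKey] = h.[CustomerHashKey]); -- most recent version

If the business rule changes, only the view (or the load of the dimensional model) changes; the data captured in the Data Vault stays untouched.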

Conclusion / Wrap up

I have written this post to create more awareness about using both Data Vault and Dimensional Modelling or Star Schemas in a data warehouse architecture.
I have listed strengths and benefits of Data Vault, Dimensional Modelling and of an architecture using both.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Isolate cross-database “databasename” dependencies using synonyms and schemas (Data Vault Series)

Intended audience

This post could be interesting for professionals working on data warehouse projects using the Microsoft SQL Server database engine. It assumes that you have at least a basic understanding of Data Vault principles.

Introduction

Those who have read the book “Building a Scalable Data Warehouse with Data Vault 2.0” by Dan Linstedt and Michael Olschimke will know that in the examples in the book, database schemas are used for the “raw” and “biz” (business) layers of the Data Vault.

If you do not know this book: buy it. If you are the intended audience, you should read it (no, I have no commercial interest in saying this), so check it out on Amazon.

What is done in the book is putting tables or views in the database schemas “raw” and “biz”, respectively, making it transparent whether a satellite is virtual (and at least partially computed) or persisted in a physical table.
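
For example (a hypothetical sketch, not taken from the book), a virtual satellite can be a view in the “biz” schema that computes its attributes from a persisted table in the “raw” schema; consumers query it exactly like a physical table:

VirtualSatellite.sql
-- A virtual (computed) satellite in the "biz" schema,
-- defined on top of a persisted satellite in the "raw" schema.
CREATE VIEW [biz].[Sat_Customer] AS
SELECT [CustomerHashKey]
     , [LoadDts]
     , [RecordSource]
     , UPPER(LTRIM(RTRIM([CustomerName]))) AS [CustomerName] -- computed attribute
FROM [raw].[Sat_Customer];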

Putting these data layers in different schemas is a great starting point for my interdatabase connectivity concept.

What data layers can we have?

  • Staging area
  • Raw Data Vault
  • Business (Data) Vault
  • Metrics Vault
  • Error Vault
  • Information Marts

A rule of thumb is that each layer resides in its own database, but multiple layers can also share one database. What should be avoided is spreading one layer over multiple databases. This does not apply to different Information Marts, though, which should be put in their own databases.

Whether to put layers together can depend, among other things, on backup requirements (the biz layer could for instance be restored from a database backup OR by doing some reprocessing from the Raw Data Vault, which could lead to the decision to back up that database less frequently).

A typical setup could be:

  • Staging Area – separate database, “stg” database schema.
  • Raw Data Vault – together with Business Vault, “raw” database schema.
  • Business Vault – together with Raw Data Vault, “biz” database schema.
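
In T-SQL, this setup could be sketched as follows. The database names are the example names used further on in this post:

CreateDatabasesAndSchemas.sql
-- Sketch of the typical setup described above.
CREATE DATABASE [EDW_STG_TEST]; -- Staging Area
CREATE DATABASE [EDW_DV_TEST];  -- Raw Data Vault and Business Vault together
GO
USE [EDW_STG_TEST]
GO
CREATE SCHEMA [stg] AUTHORIZATION [dbo];
GO
USE [EDW_DV_TEST]
GO
CREATE SCHEMA [raw] AUTHORIZATION [dbo];
GO
CREATE SCHEMA [biz] AUTHORIZATION [dbo];
GO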

To keep the example simple I will leave the other data layers out for now.

What is the problem?

In short: SQL statements that copy data to a different database.

When using data flows in SSIS, this problem does not occur, because the source and the destination each have their own connection manager.

Why is it bad?

Database names are prefixed in FROM clauses, causing a dependency on the database name.

So what, I still don’t see the problem

Scenarios in which this can become a hassle to manage include:

  • Running different environments on the same database instance (e.g. development and test), distinguishing the environments by database name. When deploying code or copying databases, all interdatabase dependencies must be changed.
  • Running different environments for multiple clients in a BIAAS (Business Intelligence As A Service) solution on the same database instance.

So, let’s divide data layers using database schemas ..

This is an example of the initial setup.

(Figure: the initial setup of the databases and schemas.)
Please note that you should keep all objects (tables, views, stored procedures, UDFs) from one data layer together in one database.

Now, to pull data from the “stg” to the “raw” layer, SQL statements could look like this (simplified example):

Insert.sql
INSERT INTO [raw].[TableName]
SELECT ..
FROM [EDW_STG_TEST].[stg].[TableName]

But, as you can see, the database name is prefixed before the staging table, and this is a problem (at least, it makes managing the solution more complex).

How can we solve this?

Managing interdatabase dependencies using synonyms

First, a little explanation from MSDN:

A synonym is a database object that serves the following purposes:

  • Provides an alternative name for another database object, referred to as the base object, that can exist on a local or remote server.
  • Provides a layer of abstraction that protects a client application from changes made to the name or location of the base object.

Read more on MSDN.

What if we had made a synonym for the staging table in the “EDW_DV_TEST” database, also in a “stg” schema in that database?
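
Creating that synonym could look like this (a minimal sketch, using the example database and object names from above):

CreateSynonym.sql
USE [EDW_DV_TEST]
GO
-- The synonym lives in the "stg" schema of EDW_DV_TEST,
-- but points to the actual table in the staging database.
CREATE SYNONYM [stg].[TableName]
FOR [EDW_STG_TEST].[stg].[TableName];
GO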

Then we could omit the database name from the SQL statement:

Insert.sql
INSERT INTO [raw].[TableName]
SELECT ..
FROM [stg].[TableName]

That is what we want!

In a picture:

(Figure: the same setup, now with synonyms in the “stg” schema of the EDW_DV_TEST database pointing to the tables in the EDW_STG_TEST database.)

The good news is that creating the synonyms can be automated with a script, so a change of the staging database name only requires running a SQL script in the EDW_DV_TEST database to drop and recreate the synonyms for the staging database objects!

Another advantage is that you can merge different layers into one database or split them into different databases. For example, all code will keep on working if you move all staging tables to the EDW_DV_TEST database, thereby replacing the synonyms already there in the stg schema with actual tables. All you would have to do is change the connection string for the processes that fill the staging tables.

In a similar way, you could decide to separate the raw and biz layer into different databases.

Script to create or change the synonyms

DV-S01E01-Create_synonyms_for_objects_in_a_different_database.sql

--\
---) Author: Hans Michiels
---) Creates synonyms for tables, views, user defined functions
---) and stored procedures in a different database.
---) All synonyms in the specified schema are dropped
---) prior to (re)creation.
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014] -- Replace with database you want to create the synonyms in.
GO

--\--------------------------------------------------------------------------
---) DECLARATIONS
--/--------------------------------------------------------------------------

DECLARE @SourceDatabase SYSNAME
DECLARE @SourceSchema SYSNAME
DECLARE @SynonymsSchema SYSNAME

DECLARE @SynonymsPrefix NVARCHAR(128)
DECLARE @SynonymName SYSNAME

DECLARE @CurSchemaName SYSNAME
DECLARE @CurObjectName SYSNAME
DECLARE @CurObjectType CHAR(2)

DECLARE @nsql NVARCHAR(4000)
DECLARE @DebugPrint NVARCHAR(500)
DECLARE @SimulationMode CHAR(1)

--\--------------------------------------------------------------------------
---) CONFIGURATION: adjust the variable values to meet your needs.
--/--------------------------------------------------------------------------

SELECT @SourceDatabase = 'AW2014_Staging' -- Database in which objects exist
                                          -- for which synonyms must be created.
     , @SourceSchema = 'stg' -- Schema name of source objects.
     , @SynonymsSchema = 'stg' -- Schema name to put synonyms in.
     , @SynonymsPrefix = '' -- Optional Synonyms name prefix.
     , @SimulationMode = 'Y' -- Use Y if you only want the SQL statement in the
                               -- output window without it being executed.

--\--------------------------------------------------------------------------
---) INITIALIZATION
--/--------------------------------------------------------------------------

IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name = @SynonymsSchema)
BEGIN
    SELECT @nsql ='CREATE SCHEMA ' + QUOTENAME(@SynonymsSchema) + ' AUTHORIZATION [dbo];';
    IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';
    PRINT @nsql;
    IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;
END

--\--------------------------------------------------------------------------
---) MAIN PART I: Drop all existing synonyms in the specified schema.
--/--------------------------------------------------------------------------

DECLARE ExistingObjCursor CURSOR LOCAL STATIC FOR
    SELECT syn.name AS [SynonymName]
    FROM sys.synonyms syn
    JOIN sys.schemas sch
      ON syn.schema_id = sch.schema_id
    WHERE sch.name = @SynonymsSchema
 
IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';

SET @DebugPrint = '-- Dropping all existing synonyms in schema ' + QUOTENAME(@SynonymsSchema)
PRINT REPLICATE('-', LEN(@DebugPrint))
PRINT @DebugPrint
PRINT REPLICATE('-', LEN(@DebugPrint))

OPEN ExistingObjCursor
FETCH NEXT FROM ExistingObjCursor INTO @SynonymName

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @nsql = 'IF OBJECT_ID(''' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName) + ''', ''SN'') IS NOT NULL'
          + ' DROP SYNONYM ' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName) + ';'

    PRINT @nsql;
    IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;

    FETCH NEXT FROM ExistingObjCursor INTO @SynonymName
END

CLOSE ExistingObjCursor
DEALLOCATE ExistingObjCursor

--\--------------------------------------------------------------------------
---) MAIN PART II: Recreate all synonyms in the specified schema.
--/--------------------------------------------------------------------------

IF OBJECT_ID('tempdb..#SourceObjects') IS NOT NULL BEGIN
    DROP TABLE #SourceObjects
END

CREATE TABLE #SourceObjects
    ( [SchemaName] SYSNAME NOT NULL
    , [ObjectName] SYSNAME NOT NULL
    , [ObjectType] CHAR(2) NOT NULL
    )

-- Insert USER objects
SELECT @nsql =
'INSERT INTO #SourceObjects([SchemaName], [ObjectName], [ObjectType])'
+ ' SELECT sch.name AS [SchemaName]'
+ ' , obj.name AS [ObjectName]'
+ ' , obj.type AS [ObjectType]'
+ ' FROM [' + @SourceDatabase + '].sys.objects obj'
+ ' JOIN [' + @SourceDatabase + '].sys.schemas sch'
+ ' ON sch.schema_id = obj.schema_id '
+ ' WHERE sch.name = ''' + @SourceSchema + ''''
+ ' AND obj.type IN ( '
+ ' ''FN'' /* SQL_SCALAR_FUNCTION */ '
+ ' , ''P '' /* SQL_STORED_PROCEDURE */ '
+ ' , ''IF'' /* SQL_INLINE_TABLE_VALUED_FUNCTION */ '
+ ' , ''TF'' /* SQL_TABLE_VALUED_FUNCTION */ '
+ ' , ''U '' /* USER_TABLE */ '
+ ' , ''V '' /* VIEW */ '
+ ' )'
+ ' ORDER BY obj.type, sch.name, obj.name'

PRINT '/*'
EXECUTE sp_executesql @nsql;
PRINT '*/'
DECLARE ObjectCursor CURSOR LOCAL STATIC FOR
    SELECT [SchemaName], [ObjectName], [ObjectType]
    FROM #SourceObjects
 
IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';

SET @DebugPrint = '-- Recreate synonyms in schema ' + QUOTENAME(@SynonymsSchema)
PRINT REPLICATE('-', LEN(@DebugPrint))
PRINT @DebugPrint
PRINT REPLICATE('-', LEN(@DebugPrint))

OPEN ObjectCursor
FETCH NEXT FROM ObjectCursor INTO @CurSchemaName, @CurObjectName, @CurObjectType

WHILE @@FETCH_STATUS = 0
BEGIN

    SELECT @SynonymName = @SynonymsPrefix + @CurObjectName
    
    SET @nsql = 'CREATE SYNONYM ' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName)
              + ' FOR ' + QUOTENAME(@SourceDatabase) + '.' + QUOTENAME(@CurSchemaName) + '.' + QUOTENAME(@CurObjectName) + ';'

    PRINT @nsql;
    IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;

    FETCH NEXT FROM ObjectCursor INTO @CurSchemaName, @CurObjectName, @CurObjectType
END

CLOSE ObjectCursor
DEALLOCATE ObjectCursor
GO

--\--------------------------------------------------------------------------
---) CLEAN UP
--/--------------------------------------------------------------------------
IF OBJECT_ID('tempdb..#SourceObjects') IS NOT NULL BEGIN
    DROP TABLE #SourceObjects
END

Download the script here.
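
By the way: in simulation mode (@SimulationMode = 'Y') the script only prints the statements it would run. For a hypothetical table [stg].[Customer] in the AW2014_Staging database, the printed output would contain statements like:

IF OBJECT_ID('[stg].[Customer]', 'SN') IS NOT NULL DROP SYNONYM [stg].[Customer];
CREATE SYNONYM [stg].[Customer] FOR [AW2014_Staging].[stg].[Customer];

Once you are happy with the printed statements, set @SimulationMode to 'N' and run the script again to actually execute them.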

Conclusion / wrap up

Database synonyms can be used to isolate interdatabase dependencies in objects that can easily be recreated when a database name changes.

This makes managing the Data Vault solution easier.

This concept can also be of use in non-Data Vault database environments.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.