Tag Archives: sql server

Hash Diff calculation with SQL Server (Datavault Series)

Updates

Change log

Date Changes
31 January, 2017 Really good news! The 8k limitation on HASHBYTES has been removed in SQL Server 2016!

Today I would like to discuss Hash Diff calculation using SQL Server 2014, from my own experience.

My purpose is NOT to be complete and/or to replicate book “Building a scalable data warehouse with Data Vault 2.0”. For full details please read the book.

However I will give a short intro for those who hardly know what I am talking about.

Introduction

Hash differences or Hash diffs

A hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s).

A hash difference column is used to detect changes in a new version of a satellite’s row for a particular business key, and can be used in a comparison instead of doing a comparison on every individual descriptive attribute.
The main reason to use hash diff columns is to improve performance when loading new satellite rows (of which one or more attributes have changed).
The more columns the satellite has, the higher the performance gain will be.
To support data warehouse automation patterns, hash diffs can also be used for satellites with only a few or even one descriptive attribute.

I think a different name, like RowHash, RecordHash, DataHash or AttributeValuesHash, would better describe the contents of the column, because the value itself is not a difference, but that’s a different (relatively unimportant) discussion.

Hash keys

This article is not about Hash keys, however to be complete on hashing as far as Data Vault 2.0 is concerned, I will give a short introduction on that as well.

Hash keys replace sequence numbers (generated by the database engine) of the Data Vault 1.0 standard. They support geographically distributed data warehouses, as well as integration with big data environments like Hadoop.

A hash key is a hash value of the business key column(s) used in a Hub or Link. The advantage is that it is predictable, which enables parallel loading and a multiplatform / distributed datawarehouse architecture. For instance a customer with code NL123 will have the same hash key on all distributed parts of the data warehouse. Because the hash key always has the same length and is stored as a (fixed length) CHAR column, performance is usually better than when the business key column(s) are directly used as primary key.

How to calculate a hash difference in T-SQL

For full details about Endianness, differences of hash functions between different platforms, how to cope with new columns and other technical details I refer to the book mentioned above. If you want to work with Data Vault 2.0 you will need the book anyway.

As said, a hash difference is a hash value of all the descriptive data of a row in a satellite, plus the business key column(s). To do the hashing, we have to put all the column values together in one nvarchar value, and then apply a hashing function on it. This is the short version and the base to understand it.

DV-S01E02-hashdiff

Between the different column values you have to use a delimiter, preferably one that does not occur in the column values. Otherwise you have to “escape” the delimiter in column values.

If you want to do a case insensitive compare, you should convert all values to either upper- or lowercase. But this means that if in the source system a customer name was “hans michiels” and is changed to “Hans Michiels”, you will not detect the change and therefore do not store it in the Data Vault. So it depends on the contents of the satellite and your business view on it if this is desired behaviour or not.

The hash diff calculation results in a binary(16) value, when using the hashing algorithm MD5 (which is recommended). To enable better cross-platform- and tools support, this is then converted to a hexadecimal string of 32 characters, stored as a CHAR(32) column.

In all cases you must also convert the hexadecimal hash value to all UPPER- or all lowercase, because not all convert functions give the same output, some lower- and other uppercase. To compare values that were made by different hash functions (read on different systems) making it all uppercase (or lowercase) makes sure the values do not differ in casing only (which could lead to a false negative answer on the “are the rows the same” question).

I understand this is all dry and boring nerd stuff if you do not see an example, so .. a simple example:

Calculating_a_hash_diff.sql

SELECT
      [bla]
    , UPPER(
        CONVERT(CHAR(32),
          HASHBYTES('MD5',
            UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR, [BusinessKeyOfHub]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column01]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column02]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), [Column03]), N'')
              ) -- END CONCAT
            ) -- END UPPER
          ) -- END HASHBYTES
        , 2) -- END CONVERT
      ) -- END UPPER
FROM [stg].[bla]

So far nothing shocking, but there are a few pitfalls. If you keep them in mind, you will be an excellent hasher soon. 😉

CONVERT(NVARCHAR without specifying the size

Do this only for values of 30 or less characters, otherwise the value will be cut off which could lead to wrong hash diff values, not changing when the attribute value changes. It is for instance okay for integers, decimals (unless the precision is more than 28, mind the decimal separator and possible negative sign), datetime stamps, and bits. When converting (n)varchar safest thing to do is follow the original size, for unique identifiers use 38 or 40.

Beware for CONCAT

Beware for concatenation of (N)VARCHAR(MAX), (N)TEXT and XML columns using the CONCAT function. It appears that only the first 8000 bytes of (N)VARCHAR(MAX) columns are in the concatenated result.

And even worse, HASHBYTES ..

HASHBYTES does only hash a NVARCHAR value with a length of 4000 or less. Above that the error “String or binary data would be truncated” occurs. This is really important to keep in mind.
The 8000 characters (or 4000 unicode characters) limitation on any string hashed with the HASHBYTES function has been removed in SQL Server 2016! Now you can hash larger (N)VARCHAR values!

Cope with the HASHBYTES and CONCAT limitations

4000 characters is quite something, right? Only with very wide satellites (a lot of columns) and/or usage of (N)VARCHAR(MAX), (N)TEXT or XML columns (hereafter called ‘unlimited length columns’) you can get into trouble.

So, what is wise to do? Some measures you can take are:

Make your satellites not too wide

This is a best practice, not only from hashing function perspective. Split the satellites by source system, then by rate of change. Keep in mind that the business key column(s) and all satellite column values are converted to NVARCHAR and then (including all delimiters) may not exceed the 4000 characters if you want to use the HASHBYTES function.

Size unlimited length columns according to the actual maximum length

What I mean is, if for instance a Country name is stored in an unlimited length column in the source system, you know that a size of 100 will be long enough. When you doubt just do some data profiling on the source to see what is the maximum length. Use that size plus a safety margin for the satellite column size. Problem gone.

Handle true unlimited length columns

Put true unlimited length columns (for instance free entry ‘memo’ fields or xml columns in source systems in their own satellite! Otherwise if any other (small) attribute changes, the 4k memo is copied, even when it was not changed. This absorbs storage. But that alone is not enough, as said, with HASHBYTES, above 8000 characters (4000 when unicode) you are still f*c**d. Luckily there is a system function in the master database, which uses the MD5 algorithm and takes a binary value as input. So to compute a MD5 hash value for a text column with unlimited length, you could do something like:

Hash_for_memo.sql

SELECT master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
[MyHugeMemoField)) AS [HashDiff]

But don’t make the mistake to use the function [sys].[fn_repl_hash_binary] for all hashing, because the performance of HASHBYTES is much better!
On my machine HASHBYTES is three times faster than [sys].[fn_repl_hash_binary] on the same dataset.
You can check for yourself using this script:

Performance_comparison.sql

USE [msdb]
GO

DECLARE @start DATETIME2(7)
DECLARE @stop DATETIME2(7)
DECLARE @elapsed1 BIGINT
DECLARE @elapsed2 BIGINT
--\-------------------------------------------------------------------------
---) TEST PERFORMANCE of HASHBYTES function.
--/-------------------------------------------------------------------------
SELECT @start = GETDATE();
SELECT TOP1000000
      UPPER(
        CONVERT(CHAR(32),
          HASHBYTES('MD5',
            --UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
              ) -- END CONCAT
            --) -- END UPPER
          ) -- END HASHBYTES
        , 2) -- END CONVERT
      ) -- END UPPER
  FROM [sys].[objects] o1
  CROSS JOIN [sys].[objects] o2
  CROSS JOIN [sys].[objects] o3;

SELECT @stop = GETDATE();
SELECT @elapsed1 = DATEDIFF(MICROSECOND,@start, @stop);

--\-------------------------------------------------------------------------
---) TEST PERFORMANCE of [master].[sys].[fn_repl_hash_binary] function.
--/-------------------------------------------------------------------------
SELECT @start = GETDATE();
SELECT TOP 1000000
      UPPER(
        CONVERT(CHAR(32),
          master.sys.fn_repl_hash_binary(CONVERT(VARBINARY(MAX),
            --UPPER(
              CONCAT( ISNULL(CONVERT(NVARCHAR(128), o1.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o1.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o2.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o2.[is_schema_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(128), o3.[name]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[principal_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[schema_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[parent_object_id]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[type_desc]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[create_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[modify_date], 126), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_ms_shipped]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_published]), N'')
                    , ';', ISNULL(CONVERT(NVARCHAR(50), o3.[is_schema_published]), N'')
                 ) -- END CONCAT
            --) -- END UPPER
          ) -- END HASHBYTES
          ) -- END CONVERT
        , 2) -- END CONVERT
      ) -- END UPPER
  FROM [sys].[objects] o1
  CROSS JOIN [sys].[objects] o2
  CROSS JOIN [sys].[objects] o3;

SELECT @stop = GETDATE();
SELECT @elapsed2 = DATEDIFF(MICROSECOND,@start, @stop);

SELECT @elapsed1 AS [Elapsed_HashBytes]
, @elapsed2 AS [Elapsed_fn_repl_hash_binary]
, 'HASHBYTES is ' + CONVERT(VARCHAR, CONVERT(decimal(19, 2), @elapsed2 / @elapsed1))
+ ' times faster than [fn_repl_hash_binary]' AS [Result]

That’s about it.
Just a few more things:

  • If you use a different hashing algorithm (e.g. SHA1) you cannot use the sys.fn_repl_hash_binary function, in that case you will have to seek your solution in a C# script task or CLR stored procedure, doing the hashing in .NET code.
  • Don’t use CONCAT if you have more than one column in the satellite containing [MyHugeMemoField], or you still can run into problems.

Conclusion / Wrap up

I have explained in short why Hash diff columns are used in a Data Vault 2.0 satellite, and how these values can be calculated.

After that, you have read about some limitations of the CONCAT and HASHBYTES function when used for calculating Hash diff values, and how to cope with these limitations.

Credits
My thanks go to Vincent Kuiper for reviewing the initial draft of this article.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Data Vault and Dimensional Modelling, a happy marriage! (Data Vault Series)

We need ambassadors!

I was at this one day course recently and spoke to a professional working in the business intelligence domain.
He said that he had just followed a two week course for datawarehouse architects.
I asked him, what did they tell about Data Vault?
His answer was shocking: “Not so much ..”

If recently trained datawarehouse architects do not get trained to use Data Vault in a datawarehouse architecture, there definitely is a lot of missionary work to do.
Because the Data Vault can add so much value to your datawarehouse solution. 
And more important, it is not a choice, a Kimball dimensional model OR a Linstedt Data Vault. Use both and
combine each strengths!

Intended audience

This is a strong call for everybody who reads this to share this post with at least one person you know working in the business intelligence domain using star schemas (a.k.a. dimensional modelling) directly fed from source systems.

What are the strengths of a Dimensional Model (a.k.a. Star Schema)?

  • Because the dimensional model is already around for decades, a lot of front end tools have an excellent support for it.
  • A dimensional model is designed for reporting, slicing and dicing, for getting data out. That is what it is good at.

    This becomes clear by the following examples:

    • Query performance is usually better than of other models, like Data Vault or Relational Datawarehouse, especially for aggregations.
    • A star model is an excellent source for Cube databases, like SSAS, especially when using a multidimensional model.
    • Date selections and comparison of periods (e.g. Q1 this year versus Q1 Last Year) is easier than in Data Vault or Relational models.
  • It enables self service BI. Give business (power) users (direct) access to a dimensional model, and there is a good chance that they can get  data out in the format they want.

What are the strengths of a Data Vault?

  • A Data Vault is meant to capture all the data all the time. Therefore it stores all data from the source systems, captures the changes in it, and keeps a full history of all changes (with a sidemark, that multiple changes between two load cycles, will be seen as one).
  • A Data Vault captures all data exactly as it is stored in the source systems. “The good, the bad and the ugly”with no business rules applied to it (except, in some cases hard business rules like data types). This makes all data auditable.
  • Data Vault 2.0 supports cross platform integration with other systems (e.g. Hadoop) and decentralized datawarehouses, for instance in different global regions.
  • With Data Vault 2.0 working agile is easy, in sprints, building the system in increments.
  • Data Vault has a very good support for Data Integration. Satellites, hanging off the Hub, containing data from different source systems.

So? What are the strengths of a datawarehouse architecture using both?

  • You can change your mind! If your business rules change (and it happens!), just reload your dimensional model from the Data Vault with the new rules applied.
  • You have one version of the facts . If you have multiple dimensional models, you have not.
  • Data history is not like a destroyed monument. With a Data Vault, you capture all data from the source systems, and all changes applied to it. Source systems usually reflect only the current state of the data, so without a change capturing central storage like Data Vault, your history is gone forever!
  • Your dimensional model does not have to serve two purposes: data storage and presentation of information. These two tasks can be in conflict, for instance there is currently no report need for all data that you could grap from source systems, so you leave it out of your star schema. But if the question comes tomorrow, you have no data, no history!
  • You have full auditibility. Because the (raw) Data Vault stores the data from source systems unmodified, every measure, every dimensional attribute, can be traced back to the source it came from and when. This makes it able to refute claims that the data is not correct. Maybe now it will become clear that the old datawarehouse- or reporting system has lied for years!

How can you use both?

The answer is simple: by using a multi-tier architecture.
Data from source systems is first extracted to a Staging area, before it is moved into the Enterprise Data Warehouse using a Data Vault Model (with or without some optional components).
From there it will be distributed to dimensional models (star schemas) and cubes, and whilst the data is on its way, business rules can be applied.
DV-S01E02-multi-tier-architecture
Multi-tier architecture using both Data Vault and Dimensional Modelling techniques.

Conclusion / Wrap up

I have written this post to create more awareness about using both Data Vault and Dimensional Modelling or Star Schemas in a data warehouse architecture.
I have listed strengths and benefits of Data Vault, Dimensional Modelling and of an architecture using both.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

A Plug and Play Logging Solution (SSIS Series)

Introduction

Ever had to investigate an error that occurred during execution of a SSIS package? Then you will know that logging is the key to know what happened.

And not only logging in SSIS itself, but also in stored procedures, c# script tasks, c# console applications you might have written and are executed from within the package.
SSIS-S01E01-log-story
So the log tells you a story (I once heard “so it reads like a thrilling boys’ book”, I liked that expression, GJ).

Also you do not want to reinvent the wheel for every SSIS project, but want to use a standard solution. Plug and Play instead of spending sprints on “utitilies”.

Feature list

The features in short (as compared with what SSIS offers out of the box):

  1. It’s a Plug And Play SSIS Logging Solution in most cases: no additional programming in SSIS Packages is required*), so can be easily used for all existing packages! Only when you have multiple master packages with the same name, you will have to do a little of set up work.
  2. Log messages are labeled with a message class (a sort of severity of the message) for easy filtering.
  3. Changing the maximum message class to log is possible without redeploying any SSIS Packages (for instance enable VERBOSE logging for trouble shooting, then set it back to DEBUG or INFO).
  4. Log messages are labeled with a retention class for easy purging. There is also an “audit” retention class for messages that may never be purged/deleted.
  5. Purging can be automated, so no scheduling is needed.
  6. Logging is Integrated/centralized for SSIS Packages, stored procedures and other home made software.
  7. Deduplicated logging can be enabled, if this is done, identical SSIS logging messages (e.g. first from the OLE DB Destination, then from the Data Flow Task, and then from the SSIS Package) are logged only once as much as possible.
  8. Master / child relations for SSIS packages can be configured, so that log entries from master- and childpackages can be retrieved easily together in a single timeline. Out of the box this works using a naming convention (Master package must have “master” in the name) but can be fine tuned for every package that runs.
  9. Easy configuration by executing a stored procedure, to set retention periods, maximum log class, automatic purging on/off, deduplicated logging on/off, and naming convention for the master package.
  10. Storage of both start- and endtimes of the current timezone as well as UTC start- and endtimes.

*) As far as the logging that SSIS does is concerned. If you want logging in your own stored procedures and executables you need to implement it first.

Quick Start Guide

  1. Download the installation solution script here.
  2. Execute the entire script to install on new database [logdb]. To install on existing database, skip the part of the script where the database is created and execute the rest against an existing database. All objects the solution creates are created in database schema [log] (except for stored procedure [dbo].[sp_ssis_addlogentry], this name is needed by SSIS).
  3. Enable logging in your SSIS Packages if you have not already done that. Use a logging provider for SQL Server and select all events for all components of the package (more details below).
  4. Configure the connection managers of your SSIS Packages to use the database in which you installed the solution. Alternatively you can execute the stored procedure [log].[spCreateCrossRefs] to redirect logging for existing packages to the logging solution without any changes in your SSIS packages. See the User Manual for more details.

Directly to User Manual

Steps to follow when you want to enable logging in your SSIS Package:

SSIS-S01E01-enable_logging

User Manual

Table Of Contents

Introduction

The “hansmichiels.com Plug And Play Logging Solution” provides Plug and Play logging for SSIS Packages, stored procedures and other home made software, like command line applications or web applications.
After installing the solution, as explained in the Quick Start Guide, you can manage the solution using a few stored procedures, or accept the default settings and run it out of the box.

Understanding MessageClass and RetentionClass

MessageClass can have one of these 6 values:

  • 1 = Fatal Error (SSIS OnError event)
  • 2 = Non-fatal error (reserved for user defined logging for non-fatal errors, not used by SSIS packages)
  • 3 = Warning (SSIS OnWarning event)
  • 4 = Info message (SSIS: PackageStart, PackageEnd, OnInformation events)
  • 5 = Debug message (SSIS: OnPreValidate, OnPostValidate, OnPreExecute, OnPostExecute events)
  • 6 = Verbose messages (SSIS: Pipeline, Diagnostic and OnProgress events, typically more than 90% of all SSIS log messages)

RetentionClass aligns with MessageClass because it also has values 1 to 6 and by default the RetentionClass will be equal to the MessageClass. However, when you do custom logging in your own stored procedures or executable programs, you can provide a different value for it, for instance MessageClass 4 (INFO) but RetentionClass 1 if you want the message to be kept longer than the value for RetentionClass 4.

In addition there is a RetentionClass 0 for messages that may never be purged/deleted.

There are two ways to prevent that log messages are purged/deleted:
  1. By setting the configuration setting @RetentionPeriodClassN (where N is from 1 to 6) to 0. Then ALL messages for that RetentionClass will never be purged. However when the configuration setting is changed to a different value, the next time spPurgeLog is executed, the messages can still be deleted.
  2. By explicitly adding log entries with RetentionClass 0. Those entries do not depend on configuration settings, and are never deleted, unless you delete them directly from the [log].[Entry] table.



Back to Table of Contents

Understanding the stored procedures and user defined function of the solution

Stored procedure [log].[spConfigure]

With this stored procedure you can change a number of configuration settings which are parameters of the stored procedure.

SSIS-S01E01-configure
Configuration: a child can do the laundry

You only have to provide values for the configuration settings (parameters) that you want to change. Other settings will remain unchanged.

You can set the following parameters:

  • @ExecutionGuid: Guid used for logging the configure operation. Can be NULL.
  • @MaxMessageClass: The maximum MessageClass to log. For instance 5 means that log entries with MessageClass=6 (VERBOSE) are not logged. The defaultvalue is 5 (DEBUG).
  • @RetentionPeriodClass1: Hours to keep a log entry if the [RetentionClass] column has value 1. Set the value to 0 if the log message may never be deleted. The defaultvalue is 0.
  • @RetentionPeriodClass2: The same for [RetentionClass] = 2. The defaultvalue is 0.
  • @RetentionPeriodClass3: The same for [RetentionClass] = 3. The defaultvalue is 4320 hours or 180 days.
  • @RetentionPeriodClass4: The same for [RetentionClass] = 4. The defaultvalue is 2160 hours or 90 days.
  • @RetentionPeriodClass5: The same for [RetentionClass] = 5. The defaultvalue is 336 hours or 14 days.
  • @RetentionPeriodClass6: The same for [RetentionClass] = 6. The defaultvalue is 168 hours or 7 days. Verbose messages are typically more than 90% of all messages, so a short retention period is recommended, if logged at all.
  • @DefaultMasterSourcePattern: default naming convention for master packages. Use wildcard % for nameparts. The defaultvalue is ‘%Master%’.
  • @EnableAutoPurgeOnPackageEnd: If enabled, spPurgeLog will be executed every time a SSIS Package ends. This keeps the log clean without an additional effort for scheduling, but the downside is that spPurgeLog might be executed more often than when it is scheduled daily. This is not harmful, but could have a negative impact on package performance. The defaultvalue is 0 (false).
  • @EnableDeduplicatedLogging: If enabled a staging table [log].[EntryStaging] will be created and duplicate SSIS log messages (with different sources) will be deduplicated as much as possible. Still in some situations duplicates may occur but a lot less than when this setting is not enabled. The defaultvalue is 0 (false).
  • @ResetAllToDefaultValues: Use 1 to reset all values to the values of the DEFAULT constraints on the [log].[Config] table. In that case all other parameter values are ignored.

Examples of use:

EXECUTE [log].[spConfigure] @ResetAllToDefaultValues = 1;

EXECUTE [log].[spConfigure]
@EnableDeduplicatedLogging = 1,
@EnableAutoPurgeOnPackageEnd = 1,
@MaxMessageClass = 6,
@RetentionPeriodClass2 = 8640,
@RetentionPeriodClass6 = 48;



Back to Table of Contents

Stored procedure [log].[spAddLogEntry]

You can use this stored procedure to implement logging in your own stored procedures or executable programs.

It is advised that your own stored procedures which are called from within SSIS Packages, have a parameter @ExecutionGuid (with datatype UNIQUEIDENTIFIER). Provide the value from the package variable System::ExecutionInstanceGUID.

You can set the following parameters:

  • @Event: The log event, can be NULL. If you leave it empty, the event name will be based on the value for @MessageClass.
  • @Source: The log source.
  • @SourceGuid:The log source guid, can be NULL.
  • @ExecutionGuid: A global unique identifier to be able to distinct log messages of different jobs running in parallel.
  • @StartTime: Can be NULL, if NULL the current date/time is used.
  • @EndTime: Can be NULL, if NULL the value of @StartTime is used (which can be set to current date/time).
  • @MessageClass: The MessageClass of the message: 1=Fatal Error, 2=Error, 3=Warning, 4=Info, 5=Debug, 6=Verbose
  • @RetentionClass: A numeric indicator for the time after which the log messages can be deleted or purged. Values align with MessageClass, and in addition there is a RetentionClass with value 0 for log messages that may never be deleted, e.g. for auditing purposes. Can be NULL, if NULL the value provided for MessageClass will be used.
  • @Message: The log message.

Examples of use:

— At the beginning of a stored procedure:
DECLARE @Source NVARCHAR(1024) = ‘log.spCreateCrossRefs’;
DECLARE @Message NVARCHAR(2048) = ‘Starting: ‘ + @Source;
EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
@Source = @Source, @MessageClass = 4;

— at the end of a stored procedure:
SELECT @Message = ‘Finished: ‘ + @Source;
EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
@Source = @Source, @MessageClass = 4;

— To log something specific:
SELECT @Message = ‘Purged ‘ + CAST(@RowCount AS NVARCHAR) + ‘ log entries with RetentionClass ‘ + CAST(@RetentionClass AS NVARCHAR) +‘.’;
EXECUTE [log].[spAddLogEntry] @Message = @Message, @ExecutionGuid = @ExecutionGuid,
@Source = @Source, @MessageClass = 4, @RetentionClass = 0;



Back to Table of Contents

Stored procedure [log].[spPurgeLog]

Schedule the execution of this stored procedure daily when you do not want to set EnableAutoPurgeOnPackageEnd to 1 (true).

This stored procedure has a parameter @ExecutionGuid used for logging the purge operation, but you can omit the parameter, then a new unique identifier value will be used.


Example of use:



EXECUTE [log].[spPurgeLog]

Back to Table of Contents

Stored procedure [log].[spCreateCrossRefs]

Stored procedure to create synonyms in a database and create or replace the stored procedure [dbo].[sp_ssis_addlogentry] (hereafter collectively called: the cross references)

Use this stored procedure for two purposes:

  1. If you want to do logging in your own stored procedures in the database specified with parameter @DatabaseName
  2. If you want to redirect logging for SSIS Packages that use the database specified with parameter @DatabaseName for logging. In this way those packages will use the logging solution without the need to change the connection string.

You can set the following parameters:

  • @ExecutionGuid: Guid used for logging the configure operation. Can be NULL.
  • @DatabaseName: the name of the database in which the cross references must be created.
  • @DebugMode: if 1, more details will be shown in the output window
  • @SimulationMode: if 1, only the SQL Statements are shown in the output window, but they are not executed.
  • @DropOnly: if 1, the cross references are only dropped, but not (re)created.



Example of use:

USE [logdb]
GO
EXECUTE [log].[spCreateCrossRefs]
    @DatabaseName = 'MyOtherDb'
  , @DebugMode = 1
  , @SimulationMode = 1
  , @DropOnly = 0
GO

Back to Table of Contents

Stored procedure [log].[spSetMasterSourcePattern]

While packages run, the table [log].[Source] is filled with Source (= SSIS Package) and MasterSource information, based on the configuration setting @DefaultMasterSourcePattern.

If the value for @DefaultMasterSourcePattern is correct for all packages that run, you do not have to change anything.

However when the Master/child Source relations need to be changed, you can use stored procedure [log].[spSetMasterSourcePattern] to configure master- and child package relationships.

The definition of a master package is a package that executes another SSIS Package, the child package. When going more than one level deep, only the ultimate parent package (the one that starts everything else) has to be defined, the “package tree” is not relevant. For instance Package A executes Package B, and Package B executes Package C. In this situation A should be defined as master package for C.

You can set the following parameters:

  • @ExecutionGuid: Guid used for logging the configure operation. Can be NULL.
  • @Source: The log source or log source pattern. For a pattern use a % wildcard, e.g. “Load%” for all sources of which the name starts with “Load”.
  • @SourceGuid: Source Guid, if NULL Source will be retrieved with the value of @Source only.
  • @MasterSourcePattern: The master source to use for this @Source/@SourceGuid
  • @MasterSourceGuid: Master Source Guid, if NULL the connection to the Master Source will be made using only the @MasterSourcePattern. You only need to configure @MasterSourceGuid, if you have multiple master packages in different projects with the same name (e.g. all called “Master.dtsx”) that can run at the same time. In that situation you have to link child- to masterpackages using both @MasterSourcePattern and @MasterSourceGuid.

The easiest way to configure your master- and childpackages is to use spConfigure to set @DefaultMasterSourcePattern so that it matches your masterpackage name.
Then run your masterpackage once. In this way the table [log].[Sources] is already filled with all executed (master- and child)packages, so you do not have to do that yourself.
Then use the examples below to finetune, if necessary.
Examples of use:

--\
---) Set MasterSourcePattern for all sources (use only after SSIS Packages have run for the first time).
--/
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = '%', @MasterSourcePattern = 'Master%'
GO

-- Then undo for the master sources themselves
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'Master%', @MasterSourcePattern = NULL
GO

--\
---) Set MasterSourcePattern for one source.
--/
-- Without SourceGuid (use only after SSIS Packages have run for the first time).
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'LoadDimCustomer', @MasterSourcePattern = 'DWHMaster'
GO

-- With SourceGuid and MasterSourceGuid (can be used prior to running any packages).
EXECUTE [log].[spSetMasterSourcePattern]
  @Source = 'LoadDimCustomer',
  @SourceGuid = '7480D07D-5099-4B76-9404-17C2AFD0E603', -- PackageID property of SSIS Package.
  @MasterSourcePattern = 'Master',
  @MasterSourceGuid = '0F429F36-784D-4F05-8D5A-9374609A32B9' -- PackageID property of your master SSIS Package.
GO



Back to Table of Contents

Stored procedures [log].[spOnExecutionStart] and [log].[spOnExecutionEnd]

Used internally by the logging solution.


Back to Table of Contents

User defined function [log].[fnLogEntry]

User defined function to select log entries.

This function has the following parameters:

  • @Source: Packagename to select log entries for (including log entries of subpackages). Can be NULL. If NULL then the last execution of any Package will be selected.
  • @LastXExecutions: A number to indicate for how many executions you want to see the log. Defaultvalue is 1.
  • @MaxMessageClass: Filter on MessageClass, for instance use 2 to show only log messages with a MessageClass less than or equal to 2. Defaultvalue is 6.
  • @IncludeRetention0: If 1, also messages with RetentionClass 0 (never delete) will be shown, regardless of their MessageClass.

Example of use:

SELECT * FROM [log].[fnLogEntry] (‘Master’, 1, 4, 1 )
GO



Back to Table of Contents

How to implement logging in own executables

Logging in own executables (for instance C# Console Applications) is fairly easy to set up.
Keep the following in mind:

  • Generate a new guid every time the executable runs, and use it as value for @ExecutionGuid.
  • Assign the executable a persistent SourceGuid, and use it as value for @SourceGuid.
  • Use the executable name as value for @Source. While the executable runs, you can use lower level names (e.g. class and/or method names) for @Source, but still use the executable @SourceGuid, and not a different one.
  • When the executable starts, use the stored procedure spAddLogEntry to add an entry with @event=”ExecutionStart”. This triggers built in facilities of the logging solution.
  • Just before the executable stops, use the stored procedure spAddLogEntry to add an entry with @event=”ExecutionEnd”.
  • When the application is called from within a SSIS Package, use spSetMasterSourcePattern to create a master/child relationship.
  • If you are interested in a C# code example just let me know, and I will dedicate a future blog post to it.

 

Download the logging solution script here.

How about updates?

I intend to update the solution when needed.
If you find bugs or have interesting additions, I would be thankful if you can share those with me so I can improve the logging solution.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Isolate cross database “databasename” dependencies using synonyms and schemas (Data Vault Series)

Intended audience

This post could be interesting for professionals working on datawarehouse projects using the Microsoft SQL Server database engine. This post assumes that you have at least a basic understanding of Data Vault principles.

Introduction

Those who have read the book “Building a Scalable Data Warehouse with Data Vault 2.0” by Dan Linstedt and Michael Olschimke will know that database schema’s are used in the examples in the book for the “raw” and “biz” (business) layer in the Data Vault.

If you do not know this book, buy it, if you are the intended audience you should read it ( no I have no commercial interest in saying this), so check it out on Amazon.

What is done in the book is putting tables or views in the database schema’s “raw” and “biz”, respectively, making it transparent if a satellite is either virtual (and at least partially computed) or persisted in a physical table.

Putting these data layers in different schema’s is a great starting point for a my interdatabase connectivity concept.

What data layers can we have

  • Staging area
  • Raw Data Vault
  • Business (Data) Vault
  • Metrics Vault
  • Error Vault
  • Information Marts

A rule of thumb is that each layer resides in it’s own database, but also multiple layers can be in one database. What should be avoided is spreading one layer over multiple databases. This does not count for different Information Marts though, which should be put in their own database.

Whether to put layers together can depend amongst others on backup requirements (the biz layer could for instance be restored from a database backup OR by doing some reprocessing from the Raw Data Vault, which could lead to the decision to backup the database less frequently).

A typical set up could be:

  • Staging Area – separate database, “stg” database schema.
  • Raw Data Vault – together with Business Vault, “raw” database schema.
  • Business Vault – together with Raw Vault, “biz” database schema.

To keep the example simple I will leave the other data layers out for now.

What is the problem?

To be short: SQL statements copying data to a different database.

When using data flows in SSIS, this problem does not occur.

Why is it bad?

Database names are prefixed in FROM clauses, causing a dependency on the database name.

So what, I still don’t see the problem

Scenario’s where it can become a hassle to manage this include:

  • Running different environments on the same database instance (e.g. development and test), differing the environments by database names. When deploying code or copying databases, all interdatabase dependencies must be changed.
  • Running different enviroments for multiple clients in a BIAAS (Business Intelligence As A Service) solution on the same database instance.

So, lets divide datalayers using database schema’s ..

This is an example of the initial set up.

Dv01E01_dbschema1
Please note that you should keep all objects (tables, views, stored procedures, udf’s) from one data layer together in one database.

Now, to pull data from the “stg” to the “raw” layer, SQL statements could be like (simplified example):

Insert.sql
INSERT INTO [raw].[TableName]
SELECT ..
FROM [EDW_STG_TEST].[stg].[TableName]

But, as you can see the database name is prefixed before the Staging table. And this is a problem (at least it makes managing the solution more complex).

How can we solve this?

Managing interdatabase dependencies using synonyms

First, a little explanation from MSDN:

A synonym is a database object that serves the following purposes:

  • Provides an alternative name for another database object, referred to as the base object, that can exist on a local or remote server.
  • Provides a layer of abstraction that protects a client application from changes made to the name or location of the base object.

Read more on MSDN.

What if we would have made a synonym for the staging table in the “EDW_DV_TEST” database, also in a “stg” schema in that database?

Then we could omit the database name from the SQL statement:

Insert.sql
INSERT INTO [raw].[TableName]
SELECT ..
FROM [stg].[TableName]
That is what we want!

In a picture:

Dv01E01_dbschema2

The good news is, creating the synonyms can be automated with a script, so a change of the Staging database name only requires running a SQL script in the EDW_DV_TEST to drop and recreate the synonyms for the staging database objects!

One other advantage is, you could merge different layers into one database or spit them into different databases. For example all code will keep on working if you move all Staging tables to the EDW_DV_TEST database, therewith replacing the synonyms already there in the stg schema by actual tables. All you would have to do is change the connection string for the processes that fill the staging tables.

In a similar way, you could decide to separate the raw and biz layer into different databases.

Script to create or change the synonyms

DV-S01E01-Create_synonyms_for_objects_in_a_different_database.sql

--\
---) Author: Hans Michiels
---) Creates synonyms for tables, views, user defined functions
---) and stored procedures in a different database.
---) All synonyms in the specified schema are dropped
---) prior to (re)creation.
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [AdventureWorks2014] -- Replace with database you want to create the synonyms in.
GO

--\--------------------------------------------------------------------------
---) DECLARATIONS
--/--------------------------------------------------------------------------

DECLARE @SourceDatabase SYSNAME
DECLARE @SourceSchema SYSNAME
DECLARE @SynonymsSchema SYSNAME

DECLARE @SynonymsPrefix NVARCHAR(128)
DECLARE @SynonymName SYSNAME

DECLARE @CurSchemaName SYSNAME
DECLARE @CurObjectName SYSNAME
DECLARE @CurObjectType CHAR(2)

DECLARE @nsql NVARCHAR(4000)
DECLARE @DebugPrint NVARCHAR(500)
DECLARE @SimulationMode CHAR(1)

--\--------------------------------------------------------------------------
---) CONFIGURATION: adjust the variable values to meet your needs.
--/--------------------------------------------------------------------------

SELECT @SourceDatabase = 'AW2014_Staging' -- Database in which objects exist
                                          -- for which synonyms must be created.
     , @SourceSchema = 'stg' -- Schema name of source objects.
     , @SynonymsSchema = 'stg' -- Schema name to put synonyms in.
     , @SynonymsPrefix = '' -- Optional Synonyms name prefix.
     , @SimulationMode = 'Y' -- Use Y if you only want the SQL statement in the
                               -- output window without it being executed.

--\--------------------------------------------------------------------------
---) INITIALIZATION
--/--------------------------------------------------------------------------

IF NOT EXISTS(SELECT 1 FROM sys.schemas WHERE name = @SynonymsSchema)
BEGIN
    SELECT @nsql ='CREATE SCHEMA ' + QUOTENAME(@SynonymsSchema) + ' AUTHORIZATION [dbo];';
    IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';
    PRINT @nsql;
    IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;
END

--\--------------------------------------------------------------------------
---) MAIN PART I: Drop all existing synonyms in the specified schema.
--/--------------------------------------------------------------------------

DECLARE ExistingObjCursor CURSOR LOCAL STATIC FOR
    SELECT syn.name AS [SynonymName]
    FROM sys.synonyms syn
    JOIN sys.schemas sch
      ON syn.schema_id = sch.schema_id
    WHERE sch.name = @SynonymsSchema
 
IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';

SET @DebugPrint = '-- Dropping all existing synonyms in schema ' + QUOTENAME(@SynonymsSchema)
PRINT REPLICATE('-', LEN(@DebugPrint))
PRINT @DebugPrint
PRINT REPLICATE('-', LEN(@DebugPrint))

OPEN ExistingObjCursor
FETCH NEXT FROM ExistingObjCursor INTO @SynonymName

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @nsql = 'IF OBJECT_ID(''' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName) + ''', ''SN'') IS NOT NULL'
          + ' DROP SYNONYM ' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName) + ';'

    PRINT @nsql;
    IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;

    FETCH NEXT FROM ExistingObjCursor INTO @SynonymName
END

CLOSE ExistingObjCursor
DEALLOCATE ExistingObjCursor

--\--------------------------------------------------------------------------
---) MAIN PART II: Recreate all synonyms in the specified schema.
--/--------------------------------------------------------------------------

IF OBJECT_ID('tempdb..#SourceObjects') IS NOT NULL BEGIN
    DROP TABLE #SourceObjects
END

CREATE TABLE #SourceObjects
    ( [SchemaName] SYSNAME NOT NULL
    , [ObjectName] SYSNAME NOT NULL
    , [ObjectType] CHAR(2) NOT NULL
    )

-- Insert USER objects
SELECT @nsql =
'INSERT INTO #SourceObjects([SchemaName], [ObjectName], [ObjectType])'
+ ' SELECT sch.name AS [SchemaName]'
+ ' , obj.name AS [ObjectName]'
+ ' , obj.type AS [ObjectType]'
+ ' FROM [' + @SourceDatabase + '].sys.objects obj'
+ ' JOIN [' + @SourceDatabase + '].sys.schemas sch'
+ ' ON sch.schema_id = obj.schema_id '
+ ' WHERE sch.name = ''' + @SourceSchema + ''''
+ ' AND obj.type IN ( '
+ ' ''FN'' /* SQL_SCALAR_FUNCTION */ '
+ ' , ''P '' /* SQL_STORED_PROCEDURE */ '
+ ' , ''IF'' /* SQL_INLINE_TABLE_VALUED_FUNCTION */ '
+ ' , ''TF'' /* SQL_TABLE_VALUED_FUNCTION */ '
+ ' , ''U '' /* USER_TABLE */ '
+ ' , ''V '' /* VIEW */ '
+ ' )'
+ ' ORDER BY obj.type, sch.name, obj.name'

PRINT '/*'
EXECUTE sp_executesql @nsql;
PRINT '*/'
DECLARE ObjectCursor CURSOR LOCAL STATIC FOR
    SELECT [SchemaName], [ObjectName], [ObjectType]
    FROM #SourceObjects
 
IF @SimulationMode = 'Y' PRINT '----- SIMULATION MODE: SCRIPT IS NOT EXECUTED ------ ';

SET @DebugPrint = '-- Recreate synonyms in schema ' + QUOTENAME(@SynonymsSchema)
PRINT REPLICATE('-', LEN(@DebugPrint))
PRINT @DebugPrint
PRINT REPLICATE('-', LEN(@DebugPrint))

OPEN ObjectCursor
FETCH NEXT FROM ObjectCursor INTO @CurSchemaName, @CurObjectName, @CurObjectType

WHILE @@FETCH_STATUS = 0
BEGIN

    SELECT @SynonymName = @SynonymsPrefix + @CurObjectName
    
    SET @nsql = 'CREATE SYNONYM ' + QUOTENAME(@SynonymsSchema) + '.' + QUOTENAME(@SynonymName)
              + ' FOR ' + QUOTENAME(@SourceDatabase) + '.' + QUOTENAME(@CurSchemaName) + '.' + QUOTENAME(@CurObjectName) + ';'

    PRINT @nsql;
    IF @SimulationMode != 'Y' EXECUTE sp_executesql @nsql;

    FETCH NEXT FROM ObjectCursor INTO @CurSchemaName, @CurObjectName, @CurObjectType
END

CLOSE ObjectCursor
DEALLOCATE ObjectCursor
GO

--\--------------------------------------------------------------------------
---) CLEAN UP
--/--------------------------------------------------------------------------
IF OBJECT_ID('tempdb..#SourceObjects') IS NOT NULL BEGIN
    DROP TABLE #SourceObjects
END

Download the script here.

Conclusion / wrap up

Database synonyms can be used to isolate interdatabase dependencies into objects that can be easily updated with a changed database name.

This makes managing the Data Vault solution easier.

This concept can also be of use in non-Data Vault database environments.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Top 10 Design Mistakes seen in OLTP production systems (SQL Server Series)

Introduction


If you ever had to source your datawarehouse from OnLine Transaction Processing databases, I suppose this top 10 will be a “joy” of recognition for you.

Top 10 Design Mistakes seen in OLTP production systems

1. Don’t have referential integrity enforced

Without actual foreign key constraints in the database, you will get orphaned records sooner or later (e.g. orderline rows for already deleted orders). This is bad for dataquality and can hide bugs in the front end software maintaining the data.

Back to top

2. Foreign key columns having a totally different name than the primary key they refer to

Especially when point 1 is the case, it is difficult to tell that a foreign key column “account_id” refers to the “customer_id” column in the customer table. When reverse engineering a OLTP design this does not really help to understand it.

Back to top

3. Foreign key columns missing “Id”, “Key” or “Guid” in the name

I personally dislike a column name “customer” when it is an integer or global unique identifier datatype. However from the datatype you can draw the conclusion that it cannot be the name, there are many situations where you only see the column name, and then it can be confusing.

Back to top

4. Foreign key columns meaning more than one thing

More than once, in standard software which is highly adjustable or configurable, the datamodel is “flexible”, e.g. link tables where object_id can be a customer, an employee, or yet something else. Apart from the problem this gives in enforcing referential integrity without a physical constraint, it makes it more difficult to understand the model and/or get data out.

Back to top

5. Accessing the datamodel through low-performance all-in-one views, protecting base tables from direct access

They exist! Views with 700+ columns, probably joining 20+ tables, to give an all-in-one view for an account. And still, then you need an extra column from an already joined table, that is not in the list of 700+ columns. You can guess what this leads to, suboptimal queries with poor performance. While, if you were allowed to write a view directly on the tables, you could optimize it for your needs.

Back to top

6. Oversized column lengths

Ever seen a varchar(max) column for a country code with a maximum length of 3? I Have. While this is an extreme example, oversized columns lead to several problems:

  • SQL Server handles the column differently. Without going into too much detail, a char(3) column for the country code would be stored much more efficiently than a varchar(max) column.
  • You do not always know how much space must be reserved for the column on reports and so on.
  • If you use the value in concatenated columns (e.g. business keys), you need to size it after the physical maximum size, even when the actual data will not fill it up.


Oversizing can happen:

  • when front end C# developers use “code first”, and the column length is not defined in the code. Therefore I recommend, when “code first” is used, a database developer or DBA has a critical look at the generated physical datamodel before going live.
  • when the column size is not specified in design documents.
  • when people just don’t care to size the column properly. Oversizing seems like a safe thing to do, if the column is big enough, no one will complain!

Back to top

7. Oversized INT Datatypes

Or: To Smallint or not to Smallint. When reference tables only contain a few rows, a smallint or even tinyint datatype for the primary key will do! If the primary key is used in foreign key constraints of large tables (with many rows) less storage is needed, indexes will be smaller, all leading to better performance.

Unfortunately, SQL Server does not support unsigned smallint, int and bigint datatypes, but nobody stops you from start counting at the minimum negative value, so -32768 instead of 1 for a smallint! Just change the seed part of the IDENTITY specification. If you do this, you can store up to 65536 rows in a table with a smallint primary key.

MSSQL-S01E01-Top 10 Design Mista…

–\
—) Author: Hans Michiels
—) Script to demonstrate how to start an
—) identity column with a negative number
–/

IF OBJECT_ID(‘[dbo].[A65kRowsTable]’, ‘U’) IS NOT NULL
DROP TABLE [dbo].[A65kRowsTable]
GO
CREATE TABLE [dbo].[A65kRowsTable](
[ID] SMALLINT IDENTITY(32768, 1) NOT NULL,
[SomeColumn] VARCHAR(50) NOT NULL
CONSTRAINT [PK_dbo_A65kRowsTable_ID] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO [dbo].[A65kRowsTable] ( [SomeColumn] )
VALUES (‘-32768 is the first ID.’)
GO
SELECT [ID], [SomeColumn]
FROM [dbo].[A65kRowsTable]
GO


If you do this trick with an INT, you can store 4.2 billion rows and you might in some cases be able to prevent that you need a bigint.

Back to top

8. Lack of unique indexes for business or natural keys

Again a dataquality issue. If the primary key is surrogate (an integer type which is autonumbered), a table still can have one or more different columns that make a row unique. For example, a social security number in a citizens table. Not enforcing this constraint via an unique index, can lead to two citizens with the same social security number! This can hide bugs in the front end software, fraude, or both. Anyway, it is not wise to take the risk that this happens.

Back to top

 9. Guids stored as (n)varchar

A guid, short for Global Unique IDentifier, is a generated value of 16 bytes, guaranteed to be globally unique. It is sometimes used for primary key columns, instead of integer datatypes.
I will keep it in the middle if this is wise to do, there might be some use cases when this is a valid approach, but I would only do this if using integers is not an option.

SQL Server has the UNIQUEIDENTIFIER datatype to store guids. Internally, this datatype uses 16 bytes, but the value is represented as a hexadecimal string of 36 characters (32 for the 16 bytes, with 4 hyphens in between, sometimes surrounded by curly brackets, e.g. {935D9FA8-2C77-4C34-894C-8FCDA8E47F19}.
But if the guid is stored in a varchar(36) column, it will use 36 bytes for the value plus 2 bytes overhead, so 38 bytes in total. Compared with the 16 bytes for the UNIQUEIDENTIFIER, you can imagine that storage space and index size will increase and performance will decrease.
Not to mention Nvarchar(36), the Unicode variant of varchar, which will use 2 bytes for every character plus 2 bytes overhead, so 74 bytes in total. This is massive! I have seen nvarchar(36) in a production system to store a guid and would classify it as a “My first database” mistake. Only very unexperienced developers, hardly knowing anything about datatypes would do this. Don’t be one of them!

MSSQL-S01E01-mistakes-homer-doh

Back to top

10. Where’s the DBA?

This point is more about daily operations than about the design, but I still want to mention it. If you are using SQL Server databases you need to maintain them. A simple maintenance plan is easy to set up, and will do index and statistics maintenance for you, make backups and could prevent that the transaction log file grows excessively. If you don’t have an internal DBA, hire one to set it up and monitor your SQL Server instances.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Schedule Ola Hallengren’s Maintenance Solution (T-SQL Scripting Series)

Introduction

If you have you ever struggled with setting up a maintenance plan for SQL Server, you’ll be glad that Ola Hallengren came around. This very clever SQL Server DBA thought, well the SQL Server Maintenance Plans aren’t that great. It does the job, but out of the box some jobs are scheduled at exactly the same time, amongst other inefficiencies. We all can be thankful that he wrote his own MaintenanceSolution.sql, which has won several awards and can be downloaded for free here.

F5 and then?

If you have downloaded Ola Hallengren’s MaintenanceSolution.sql, you still have to schedule the SQL Agent Jobs. But it is logical that this is not included, as every company, every server could have different requirements for backup, integrity checks, index optimization and statistics maintenance.

.. Configure!

In the script below you will find a CONFIGURATION section where you can configure how the SQL Agent Jobs created by Ola Hallengren should be scheduled. Just for your convenience, hope you like it.

Ola_Scheduling.sql

--\
---) Author: Hans Michiels
---) Creates jobschedules in SQL Server Agent jobs that
---) were created by Ola Hallengren's maintenance solution.
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
*/

USE [msdb]
GO

--\-------------------------------------------------------------------------------------
---) DECLARATIONS
--/-------------------------------------------------------------------------------------
DECLARE @cur_days VARCHAR(50)
     -- @cur_days contains a space separated list of weekdays, e.g. 'mon tue wed thu fri sat sun',
     -- or use 'all' for all days.
     -- Please note that using 'mon tue wed thu fri sat sun' will create a weekly job with
     -- all days checked while using 'all' will create a daily job.

DECLARE @cur_start_time INT
     -- @cur_start_time: the starttime of the job in integer format (hhmmss)

DECLARE @cur_subday_interval INT
     -- @cur_subday_interval: N in 'every N seconds/minutes/hours (for seconds max N = 100)
     -- Use 0 when N/A

DECLARE @cur_intraday_unit VARCHAR(7)
     -- @cur_intraday_unit: seconds (s), minutes (m) or hours (h).
     -- For configuration you can use both the full name or the first letter.

DECLARE @cur_delete_any_schedule CHAR(1)
     -- @cur_delete_any_schedule: indicates if any schedule for the given jobname should be deleted.
     -- Use NULL if you want to use the default value (recommended).
     -- It is recommended only to use 'N' if you use multiple schedules for the same job.

DECLARE @default_delete_any_schedule CHAR(1)
     -- @default_delete_any_schedule: valued used if in the configuration below
     -- [delete_any_schedule] is NULL for a specific job.

DECLARE @delete_schedules_only CHAR(1)
     -- @delete_schedules_only: if set to 'Y', only the existing schedules are deleted.
     -- No new schedules are made.
     -- If you use this setting, make sure to set @default_delete_any_schedule also to 'Y'
     -- without overruling it to 'N' for specific jobs.

DECLARE @cur_schedule_id int
DECLARE @cur_job_name SYSNAME
DECLARE @cur_schedule_name SYSNAME
DECLARE @cur_freq_type INT
DECLARE @cur_freq_interval INT
DECLARE @cur_freq_subday_type INT

DECLARE @active_start_date INT
DECLARE @message VARCHAR(500)

SET @active_start_date = CONVERT(INT, CONVERT(CHAR(8), CURRENT_TIMESTAMP, 112));

--\-------------------------------------------------------------------------------------
---) CONFIGURATION
--/-------------------------------------------------------------------------------------
SET @default_delete_any_schedule = 'Y';
-- If you DO NOT want to delete all existing schedules for all jobs in your configuration,
-- change @default_delete_any_schedule to 'N'.
-- Keeping this value on 'Y' is usually safe to do and this prevents that old schedules
-- might in some cases not be deleted which can give trouble due to multiple (conflicting)
-- schedules.
-- !! ONLY IF you have more than one schedule for a job, you should set [delete_any_schedule]
-- in the configuration above to 'N' for that specific job.

SET @delete_schedules_only = 'N';

DECLARE daily_weekly_cursor CURSOR LOCAL STATIC FOR
SELECT CONVERT(SYSNAME, NULL) AS [job_name], CONVERT(SYSNAME, NULL) AS [schedule_name],
        CONVERT(INT, NULL) AS [start_time], CONVERT(VARCHAR(50), NULL) AS [days],
        CONVERT(INT, NULL) AS [subday_interval],
        CONVERT(VARCHAR(7), NULL) AS [intraday_unit],
        CONVERT(CHAR(1), NULL) AS [delete_any_schedule]

-------------[job_name]------------------------------------[schedule]--[start]-[days]-----------------------[subday ]-[intraday]-[delete_any]
-----------------------------------------------------------[name ]--[time ]------------------------------[interval]-[unit ]-[schedule ]
UNION SELECT 'CommandLog Cleanup' , 'Schedule', 180000, 'wed sun' , 0 , NULL , NULL
UNION SELECT 'Output File Cleanup' , 'Schedule', 180500, 'wed sun' , 0 , NULL , NULL
UNION SELECT 'sp_delete_backuphistory' , 'Schedule', 181000, 'wed sun' , 0 , NULL , NULL
UNION SELECT 'sp_purge_jobhistory' , 'Schedule', 181500, 'wed sun' , 0 , NULL , NULL

UNION SELECT 'DatabaseIntegrityCheck - SYSTEM_DATABASES' , 'Schedule', 182000, 'wed sun' , 0 , NULL , NULL
UNION SELECT 'DatabaseIntegrityCheck - USER_DATABASES' , 'Schedule', 191500, 'wed sun' , 0 , NULL , NULL
UNION SELECT 'DatabaseBackup - SYSTEM_DATABASES - FULL' , 'Schedule', 001500, 'all' , 0 , NULL , NULL
UNION SELECT 'DatabaseBackup - USER_DATABASES - FULL' , 'Schedule', 003000, 'all' , 0 , NULL , NULL

UNION SELECT 'IndexOptimize - USER_DATABASES' , 'Schedule', 081000, 'mon tue wed thu fri sat sun', 0 , NULL , NULL

/* -- Weekly full, other days diff example:
UNION SELECT 'DatabaseBackup - USER_DATABASES - FULL' , 'Schedule', 201500, 'fri' , 0 , NULL , NULL
UNION SELECT 'DatabaseBackup - USER_DATABASES - DIFF' , 'Schedule', 201500, 'mon tue wed thu sat sun' , 0 , NULL , NULL
*/

/* -- Transaction log example / Example of use of [delete_any_schedule] with more than one schedule
UNION SELECT 'DatabaseBackup - USER_DATABASES - LOG' , 'Weekdays', 000001, 'mon tue wed thu fri' , 10 , 'minutes', 'Y'
UNION SELECT 'DatabaseBackup - USER_DATABASES - LOG' , 'Weekend' , 000001, 'sat sun' , 2 , 'hours', 'N'
*/

  

--\-------------------------------------------------------------------------------------
---) CREATION OF THE SCHEDULES
--/-------------------------------------------------------------------------------------
OPEN daily_weekly_cursor
FETCH NEXT FROM daily_weekly_cursor
INTO @cur_job_name, @cur_schedule_name, @cur_start_time, @cur_days,
     @cur_subday_interval, @cur_intraday_unit, @cur_delete_any_schedule

WHILE @@FETCH_STATUS = 0
BEGIN

    --\
    ---) Check if it is the dummy row to specify datatypes:
    --/
    IF @cur_job_name IS NULL
    BEGIN
      -- RAISERROR('NULL', 0, 1) WITH NOWAIT;
      FETCH NEXT FROM daily_weekly_cursor
      INTO @cur_job_name, @cur_schedule_name, @cur_start_time, @cur_days,
           @cur_subday_interval, @cur_intraday_unit, @cur_delete_any_schedule
    END

    SET @message = 'Job: ' + @cur_job_name
    RAISERROR(@message , 0, 1) WITH NOWAIT;

    --\
    ---) Get the current schedule_id, if any.
    --/
    SELECT @cur_schedule_id = -1;
    WHILE @cur_schedule_id IS NOT NULL
    BEGIN
        SELECT @cur_schedule_id = (
            SELECT TOP 1 sch.schedule_id
            FROM msdb.dbo.sysschedules sch
              JOIN msdb.dbo.sysjobschedules js
                ON js.schedule_id = sch.schedule_id
              JOIN msdb.dbo.sysjobs jobs
                ON jobs.job_id = js.job_id
             WHERE jobs.name = @cur_job_name
               AND ( ISNULL(@cur_delete_any_schedule, @default_delete_any_schedule) = 'Y' OR sch.name = @cur_schedule_name )
            );

        --\
        ---) If schedule exists, delete it first.
        --/
        IF @cur_schedule_id IS NOT NULL
        BEGIN
            EXEC msdb.dbo.sp_delete_schedule @schedule_id = @cur_schedule_id, @force_delete = 1;
            SET @message = ' ' + 'Schedule deleted with @schedule_id = ' + CONVERT(VARCHAR, @cur_schedule_id)
            RAISERROR(@message , 0, 1) WITH NOWAIT;
        END
    END

    --\
    ---) Set @cur_freq_type depending on days specification.
    --/
    SELECT @cur_freq_type = CASE @cur_days WHEN 'all' THEN 4 ELSE 8 END;

    --\
    ---) Set @cur_freq_interval depending on days specification.
    --/
    SELECT @cur_freq_interval = 0, @cur_days = ' ' + @cur_days + ' ';
    IF CHARINDEX(' all ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 1;
    IF CHARINDEX(' sun ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 1;
    IF CHARINDEX(' mon ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 2;
    IF CHARINDEX(' tue ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 4;
    IF CHARINDEX(' wed ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 8;
    IF CHARINDEX(' thu ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 16;
    IF CHARINDEX(' fri ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 32;
    IF CHARINDEX(' sat ', @cur_days) > 0 SET @cur_freq_interval = @cur_freq_interval + 64;
        
    IF @cur_freq_interval = 0
    BEGIN
      RAISERROR('Days string does not contain any valid day abbreviations', 16, 1);
    END

    IF @delete_schedules_only != 'Y'
    BEGIN
        IF @cur_subday_interval = 0
        BEGIN
            EXEC msdb.dbo.sp_add_jobschedule @job_name=@cur_job_name, @name=@cur_schedule_name,
          @enabled=1,
          @freq_type=@cur_freq_type,
          @freq_interval=@cur_freq_interval,
          @freq_subday_type=1,
          @freq_subday_interval=0,
          @freq_relative_interval=0,
          @freq_recurrence_factor=1,
          @active_start_date=@active_start_date,
          @active_end_date=99991231,
          @active_start_time=@cur_start_time,
          @active_end_time=235959, @schedule_id = @cur_schedule_id OUTPUT;

            SET @message = ' ' + 'Schedule created for days' + @cur_days + 'at ' + CONVERT(VARCHAR, @cur_start_time)
            RAISERROR(@message , 0, 1) WITH NOWAIT;

        END ELSE BEGIN

            SELECT @cur_freq_subday_type =
            CASE LOWER(LEFT(@cur_intraday_unit, 1))
            WHEN 'h' THEN 8
            WHEN 'm' THEN 4
            WHEN 's' THEN 2
            ELSE 0
            END

            IF @cur_freq_subday_type = 0
            BEGIN
              RAISERROR('''intraday_unit'' must contain a valid value when ''subday_interval'' contains a number greater than 0. Please check your configuration.', 16, 1);
            END

            EXEC msdb.dbo.sp_add_jobschedule @job_name=@cur_job_name, @name=@cur_schedule_name,
          @enabled=1,
          @freq_type=@cur_freq_type,
          @freq_interval=@cur_freq_interval,
          @freq_subday_type=@cur_freq_subday_type,
          @freq_subday_interval=@cur_subday_interval,
          @freq_relative_interval=0,
          @freq_recurrence_factor=1,
          @active_start_date=@active_start_date,
          @active_end_date=99991231,
          @active_start_time=@cur_start_time,
          @active_end_time=235959, @schedule_id = @cur_schedule_id OUTPUT;

            SET @message = ' ' + 'Schedule created for days' + @cur_days + 'every ' + CONVERT(VARCHAR, @cur_subday_interval) + ' ' + @cur_intraday_unit
            RAISERROR(@message , 0, 1) WITH NOWAIT;

        END
    END
    RAISERROR('-----' , 0, 1) WITH NOWAIT;
    
    SET @cur_schedule_id = NULL;
    FETCH NEXT FROM daily_weekly_cursor
    INTO @cur_job_name, @cur_schedule_name, @cur_start_time, @cur_days,
         @cur_subday_interval, @cur_intraday_unit, @cur_delete_any_schedule
END

CLOSE daily_weekly_cursor
DEALLOCATE daily_weekly_cursor

Download the script here.

(Note: the downloadable script has slightly better indenting than the html version above).

TSQL-S01E02-Schedule

Conclusion / Wrap up

With this script you can easily schedule the SQL Agent Jobs that Ola Hallengren’s MaintenanceSolution-script creates.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

How to copy a database table structure (T-SQL Scripting Series)

Problem

Do you recognize the need to copy a table structure in MS SQL Server, for temporary or permanent usage?

Most of you will know the famous

Insert Into Example.sql
SELECT TOP 0 *
INTO [dbo].[some unexisting table]
FROM [dbo].[some existing table]

But this approach has some limitations: Computed columns change into regular columns, and primary and foreign keys are not copied. Sometimes you just need more than this.

Solution

So I have made a script to copy a SQL server table structure with foreign keys but without indexes.
This script handles user defined types and computed columns gracefully.

My credits go to Aaron Bertand (see his post on mssqltips). I used his script for copying the foreign keys and added my own parts to it (download it here).

TSQL-S01E01-copy_table

TSQL-S01E01-CopyTableStructure.sql

--\
---) Author: Hans Michiels
---) Script to copy a sql server table structure with foreign keys but without indexes.
---) This script handles user defined types and computed columns gracefully.
--/
/*
(c) Copyright 2016 - hansmichiels.com
 
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License
along with this program. If not, see .
*/

--\
---) VARIABLES DECLARATIONS
--/
DECLARE @CrLf NVARCHAR(2)
DECLARE @Indent NVARCHAR(2)
DECLARE @nsql NVARCHAR(MAX)
DECLARE @SimulationMode CHAR(1)
DECLARE @SourceSchemaAndTable NVARCHAR(260)
DECLARE @TargetSchemaAndTable NVARCHAR(260)
DECLARE @FkNameSuffix NVARCHAR(128)
DECLARE @TableOptions NVARCHAR(500)

--\
---) CONFIGURATION: set the source and target schema/tablename here, and some other settings.
--/
SELECT
    @SimulationMode = 'Y' -- Use Y if you only want the SQL statement in the output window without it being executed.
  , @SourceSchemaAndTable = '[dbo].[Order]'
  , @TargetSchemaAndTable = '[dbo].[OrderCopy]'
  , @TableOptions = ' WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]'
  , @FkNameSuffix = '_' + REPLACE(CAST(NEWID() AS VARCHAR(40)), '-', '') -- A Guid is added to the foreign key name to make it unique.
  , @CrLf = CHAR(13) + CHAR(10)
  , @Indent = SPACE(2)
  -- For 'min' script use this (in case sql is near 4000 characters):
  -- , @CrLf = ' '
  -- , @Indent = ''

--\
---) BUILD SQL FOR CLONING TABLE
--/
SELECT @nsql
    = ISNULL(@nsql, '')
    + CASE col_sequence WHEN 1 THEN
      @CrLf + 'IF OBJECT_ID(N''' + @TargetSchemaAndTable + ''', ''U'') IS NOT NULL DROP TABLE ' + @TargetSchemaAndTable + ';'
    + @CrLf + 'CREATE TABLE ' + @TargetSchemaAndTable + @CrLf + @Indent + '( ' ELSE @CrLf + @Indent + ', ' END
    + [definition]
FROM (
      SELECT ROW_NUMBER() OVER (PARTITION BY tb.object_id ORDER BY tb.object_id, col.column_id) AS col_sequence
        , QUOTENAME(col.name) + ' '
        + COALESCE(
            'AS ' + cmp.definition + CASE ISNULL(cmp.is_persisted, 0) WHEN 1 THEN ' PERSISTED ' ELSE '' END,
            CASE
              WHEN col.system_type_id != col.user_type_id THEN QUOTENAME(usr_tp.schema_name) + '.' + QUOTENAME(usr_tp.name)
              ELSE
                QUOTENAME(sys_tp.name) +
                CASE
                  WHEN sys_tp.name IN ('char', 'varchar', 'binary', 'varbinary') THEN '(' + CONVERT(VARCHAR, CASE col.max_length WHEN -1 THEN 'max' ELSE CAST(col.max_length AS varchar(10)) END) + ')'
                  WHEN sys_tp.name IN ('nchar', 'nvarchar') THEN '(' + CONVERT(VARCHAR, CASE col.max_length WHEN -1 THEN 'max' ELSE CAST(col.max_length/2 AS varchar(10)) END) + ')'
                  WHEN sys_tp.name IN ('decimal', 'numeric') THEN '(' + CAST(col.precision AS VARCHAR) + ',' + CAST(col.scale AS VARCHAR) + ')'
                  WHEN sys_tp.name IN ('datetime2') THEN '(' + CAST(col.scale AS VARCHAR) + ')'
                  ELSE ''
                END
            END
            )
        + CASE col.is_nullable
            WHEN 0 THEN ' NOT NULL'
            ELSE CASE WHEN cmp.definition IS NULL THEN ' NULL' ELSE ' ' END
          END AS [definition]
       FROM sys.tables tb
       JOIN sys.schemas sch
         ON sch.schema_id = tb.schema_id
       JOIN sys.columns col
         ON col.object_id = tb.object_id
       JOIN sys.types sys_tp
         ON col.system_type_id = sys_tp.system_type_id
        AND col.system_type_id = sys_tp.user_type_id
       LEFT JOIN
            (
            SELECT tp.*, sch.name AS [schema_name]
            FROM sys.types tp
            JOIN sys.schemas sch
            ON tp.schema_id = sch.schema_id
            ) usr_tp
         ON col.system_type_id = usr_tp.system_type_id
        AND col.user_type_id = usr_tp.user_type_id
       LEFT JOIN sys.computed_columns cmp
         ON cmp.object_id = tb.object_id
        AND cmp.column_id = col.column_id
      WHERE tb.object_id = OBJECT_ID(@SourceSchemaAndTable, 'U')
      ) subqry
;
SELECT @nsql
    = ISNULL(@nsql, '')
    + CASE col_sequence
        WHEN 1 THEN @CrLf + ', PRIMARY KEY ' + CASE is_clustered_index WHEN 1 THEN 'CLUSTERED' ELSE 'NONCLUSTERED' END
           + @CrLf + @Indent + '( '
        ELSE @CrLf + @Indent + ', '
      END
    + QUOTENAME(pk_cols.column_name)
    + CASE is_descending_key
        WHEN 1 THEN ' DESC'
        ELSE ' ASC'
      END
FROM (
      SELECT TOP 2147483647 sch.name as schema_name, tb.name as table_name, col.name as column_name
               , ROW_NUMBER() OVER (PARTITION BY tb.object_id ORDER BY tb.object_id, ic.key_ordinal) AS col_sequence
               , ic.is_descending_key
               , CASE WHEN idx.index_id = 1 THEN 1 ELSE 0 END AS [is_clustered_index]
       FROM sys.tables tb
       JOIN sys.schemas sch
         ON sch.schema_id = tb.schema_id
       JOIN sys.indexes idx
         ON idx.is_primary_key = 1
        AND idx.object_id = tb.object_id
       JOIN sys.index_columns ic
         ON is_included_column = 0
        AND ic.object_id = tb.object_id
        AND ic.index_id = idx.index_id
       JOIN sys.columns col
         ON col.column_id = ic.column_id
        AND col.object_id = tb.object_id
      WHERE tb.object_id = OBJECT_ID(@SourceSchemaAndTable, 'U')
      ORDER BY tb.object_id, ic.key_ordinal
      ) pk_cols
 
SELECT @nsql = @nsql + @CrLf + @indent + ') ' + @TableOptions

IF @SimulationMode = 'Y'
BEGIN
  PRINT '-- Simulation mode: script is not executed.'
END
PRINT @nsql;
IF @SimulationMode != 'Y'
BEGIN
  EXEC(@nsql);
END
 
--\
---) Copy foreign key constraints
---) A guid is added to the foreign key name to make it unique within the database.
--/
SET @nsql = N'';
SELECT @nsql += N'
ALTER TABLE '

      + @TargetSchemaAndTable
      + ' ADD CONSTRAINT [' + LEFT(fk.name + @FkNameSuffix, 128) + '] '
      + ' FOREIGN KEY (' + STUFF((SELECT ',' + QUOTENAME(col.name)
      -- get all the columns in the constraint table
      FROM sys.columns AS col
      JOIN sys.foreign_key_columns AS fkc
        ON fkc.parent_column_id = col.column_id
        AND fkc.parent_object_id = col.[object_id]
      WHERE fkc.constraint_object_id = fk.[object_id]
      ORDER BY fkc.constraint_column_id
      FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 1, N'')
      + ') REFERENCES ' + QUOTENAME(rs.name) + '.' + QUOTENAME(rtb.name)
      + '('
      + STUFF((SELECT ',' + QUOTENAME(col.name)
      -- get all the referenced columns
      FROM sys.columns AS col
      JOIN sys.foreign_key_columns AS fkc
        ON fkc.referenced_column_id = col.column_id
       AND fkc.referenced_object_id = col.[object_id]
      WHERE fkc.constraint_object_id = fk.object_id
      ORDER BY fkc.constraint_column_id
      FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 1, N'') + ');'
FROM sys.foreign_keys AS fk
JOIN sys.tables AS rtb -- referenced table
   ON fk.referenced_object_id = rtb.[object_id]
JOIN sys.schemas AS rs
   ON rtb.[schema_id] = rs.[schema_id]
JOIN sys.tables AS ctb -- constraint table
   ON fk.parent_object_id = ctb.[object_id]
WHERE rtb.is_ms_shipped = 0 AND ctb.is_ms_shipped = 0
  AND ctb.object_id = OBJECT_ID(@SourceSchemaAndTable, 'U');

IF @SimulationMode = 'Y'
BEGIN
  PRINT '-- Simulation mode: script is not executed.'
END
PRINT @nsql;
IF @SimulationMode != 'Y'
BEGIN
  EXEC(@nsql);
END

Download the script here.

Conclusion / Wrap up

With this script you can copy the structure of a table including primary and foreign keys, computed columns and columns using a user defined datatype.
Last updated: 14 July, 2016.

(c) 2016 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

The CONTROLLER – TASK design pattern for stored procedures

(This blog post was first published on 25-nov-2013 on my former blog)

The challenges

Working with SQL Server and stored procedures for several years, I have faced several challenges like:

  1. reuse of stored procedures: make sure stored procedures are small components that you can use for different purposes.
  2. maintainability of stored procedures: make sure stored procedures do not become large ‘procedural written’, otherwise they can become hard to maintain due to this.
  3. generating code versus manual maintenance: how to merge generated and manual SQL code easily.
  4. where to implement error handling and the automic unit using a transaction: how to ensure these without harming the needs above.

And I have found a way to deal with this: The CONTROLLER - TASK design pattern for stored procedures. This pattern solves my challenges into a working solution!

1) Reuse of stored procedures

Sometimes you want to reuse a particular INSERT, UPDATE or DELETE statement. When it is in a stored procedure you can simply call this stored procedure! But what if this stored procedure also contains other code, uses its own transaction or implements its own error handling? Then this might be not that simple!

The CONTROLLER-TASK design pattern for stored procedures makes reuse of SQL code very easy.

2) Maintainability of stored procedures

A best practice in C# programming (or any type of programming really) that has become even more popular over the last years, is to create short methods containing only a small amount of code, and having a “main” method to call all the short methods.

Advantages: the main method is very readable (when method names are self-explaining), and code stays really maintainable. Also, short methods can be unittested easier.

Why not do the same in SQL code then?

Why not have an inventory of short stored procedures, either generated or manually made, that do only one task, and shop around to glue together what you need in a “main” stored procedure?

When working like this in a team:

- tasks can be divided easier when stored procedures are short

- stored procedures are easier to understand for new developers when they are short

The CONTROLLER-TASK design pattern for stored procedures makes your SQL code better maintainable.

3) Generating code versus manual maintenance

How convenient would it be to generate generic stored procedures, for instance UPDATE stored procedures for each table, a LOAD stored procedure for each hub in a datavault datawarehouse, and so on? Generating is not extremely more difficult than creating normal stored procedures. Just use tables like sys.objecs, sys.columns, and so on, and a lot is possible.

Imagine these stored procedures populate your inventory of “tasks”, ready to use and glue together in  a “main” stored procedure?

The CONTROLLER-TASK design pattern makes combining both manually made and generated SQL code to a working solution extremely easy.

4) Where to implement the error handling and automic unit using a transaction?

Is a stored procedure still reusable, if it is running in its own transaction? Maybe, if you use nested transactions. But there are drawnbacks and risks. Code can become messy when not implemented in the same way in different stored procedures.

Is a stored procedure still reusable, if it implements its own error handling? Maybe, but you have to be careful. If it ‘swallows’ the error, the stored procedure that calls it will not notice it and will continue after the error. This is almost never what you want. Besides that, implementing error handling in every stored procedure, makes your code base consist of many more lines.

So .. why not divide the “core” SQL code to run, and the “housekeeping” (automic unit, error handling)? No more confusion, it is clear where things are done, and code is easier reusable and maintainable.

The CONTROLLER-TASK design pattern divides core functionality and housekeeping code, therewith keeping your code cleaner and again .. better maintainable!

My solution, easy to implement

My answer to the challenges above is the CONTROLLER-TASK design pattern for stored procedures. As a picture says more than a thousand words, take a close look at the one below.

controller_task_design_pattern

Getting started

Allright, if you see the value of it, how can you implement the CONTROLLER-TASK design pattern?

In its basic form, you’ll need only a few things: a naming convention and two stored procedure templates.

I will tell you how I did this, so you can use or modify it in the way you prefer.

Choosing a naming convention

As explained, you have three types of stored procedures:

  • Task stored procedures (manually made)
  • Task stored procedures (generated)
  • Controller stored procedures

I just prefix my stored procedures with the typename as follows:

  • TASK_<BaseStoredProcName> for stored procedures that are manually made
  • GTSK_<BaseStoredProcName> for stored procedures that are generated
  • CTLR_<BaseStoredProcName> for controller stored procedures

Of course you can invent your own different naming convention, as long as you use it consequently and document it for new developers working on your project.

Using two templates

Here are my templates, you can modify for your own needs

Prerequisites

prerequisites.sql
CREATE SCHEMA [sp] AUTHORIZATION [dbo];

TASK stored procedures

TASK stored procedures.sql
PRINT ‘ScriptFile P10.sp.TASK_ReplaceWithStoredProcBaseName.sql’
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N’sp.TASK_ReplaceWithStoredProcBaseName’) AND type in (N’P’, N’PC’))
    DROP PROCEDURE sp.TASK_ReplaceWithStoredProcBaseName
GO
/* PRESS CTRL-SHIFT-M in SSMS to fill parameters
==========================================================================================
Author :
Create date :
Parameters:
- Input : none
- Output : none
Returnvalue : 0 if executed successfully.
Description : >
==========================================================================================
*/

CREATE PROCEDURE sp. TASK_ReplaceWithStoredProcBaseName
AS
BEGIN
  ---------------------------------------------------------------------------------------------------
  -- DECLARATIONS
  ---------------------------------------------------------------------------------------------------
  --N/A
  ---------------------------------------------------------------------------------------------------
  -- INITIALIZATION
  ---------------------------------------------------------------------------------------------------
  --N/A
  ---------------------------------------------------------------------------------------------------
  -- MAIN
  ---------------------------------------------------------------------------------------------------
  SELECT 1 AS [Replace with your own code]

END
GO

 

Controller stored procedures

Controller stored procedures.sql
PRINT ‘ScriptFile P20.sp.CTLR_ReplaceWithStoredProcBaseName.sql’
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N’sp.CTLR_ReplaceWithStoredProcBaseName’) AND type in (N’P’, N’PC’))
    DROP PROCEDURE sp.CTLR_ReplaceWithStoredProcBaseName
GO
/* PRESS CTRL-SHIFT-M in SSMS to fill parameters
==========================================================================================
Author :
Create date :
Parameters:
- Input : none
- Output : none
Returnvalue : 0 if executed successfully.
Description : >
==========================================================================================
*/

CREATE PROCEDURE sp.CTLR_ReplaceWithStoredProcBaseName
AS
BEGIN
  ---------------------------------------------------------------------------------------------------
  -- TEMPLATE CODE: CONFIGURATION OF CONTROLLER, BEGIN TRANSACTION, INFO LOGGING
  ---------------------------------------------------------------------------------------------------
  BEGIN TRY

    SET XACT_ABORT ON
    BEGIN TRANSACTION

    ---------------------------------------------------------------------------------------------------
    -- (NON-TEMPLATE) DECLARATIONS
    ---------------------------------------------------------------------------------------------------
    -- N/A
    ---------------------------------------------------------------------------------------------------
    -- EXECUTE TASK STORED PROCEDURES
    ---------------------------------------------------------------------------------------------------
    -- Example call:
    EXECUTE sp.TASK_ReplaceWithStoredProcBaseName
    --

    ---------------------------------------------------------------------------------------------------
    -- TEMPLATE CODE: COMMIT TRANSACTION
    ---------------------------------------------------------------------------------------------------
    COMMIT TRANSACTION

  END TRY
  BEGIN CATCH
    ---------------------------------------------------------------------------------------------------
    -- TEMPLATE CODE: CATCH BLOCK WITH ROLLBACK AND THROW
    ---------------------------------------------------------------------------------------------------
    IF @@TRANCOUNT > 0
    BEGIN
        ROLLBACK TRANSACTION;
    END
    -- Options here:
    --\
    ---> 1) Log the error (optional)
    --/
    -- Not part of the BASIC template

    --\
    ---> 2) Rethrow the error (optional)
    --/
    -- Syntax for SQL Server 2012:
    -- ; THROW

    -- Syntax for SQL Server 2008 and earlier:
    DECLARE @ORIGINAL_ERROR NVARCHAR(MAX)
          , @ERROR_SEVERITY INT = ERROR_SEVERITY()
          , @ERROR_STATE INT = ERROR_STATE()

    SELECT @ORIGINAL_ERROR = + ERROR_MESSAGE()
                        + ‘ (ERROR_NUMBER=’ + CAST(ERROR_NUMBER() AS NVARCHAR)
                        + ‘, ERROR_LINE=’ + CAST(ERROR_LINE() AS NVARCHAR)
                        + ‘, ERROR_PROCEDURE=’ + ERROR_PROCEDURE() + ‘)’
    RAISERROR(@ORIGINAL_ERROR, @ERROR_SEVERITY, @ERROR_STATE)

  END CATCH

  ---------------------------------------------------------------------------------------------------
  -- RETURN THE RESULT WHEN EXECUTED SUCCESFULLY
  ---------------------------------------------------------------------------------------------------
  RETURN 0;
    
END
GO

 

Summary

In this blogpost I have explained to you the advantages of the CONTROLLER - TASK design pattern for stored procedures, which are:

  • makes reuse of SQL code very easy.
  • makes your SQL code better maintainable.
  • makes combining both manually made and generated SQL code to a working solution extremely easy.
  • divides core functionality and housekeeping code, therewith keeping your code cleaner and again .. better maintainable!

After that you were shown how to get started with a basic implementation of this design pattern.

Thank you for reading my blogpost and I hope you enjoyed it and found it useful.

Query hints WITH (NOLOCK) and WITH (READPAST) in action

(This blog post was first published on 18-jan-2013 on my former blog)

Download all used scripts here.

Query hints WITH (NOLOCK) and WITH (READPAST) are a great way to increase performance on (readonly) queries that run on one or more tables in which data is added , edited or deleted at the same time.

However those query hints have also some drawbacks, to be aware of, when using them.

In this article I will demonstrate how these query hints work, and when to use them.

Let’s start with some theory: what do these query hints do?

It makes no sense to copy the MSDN description here. If you need more detailed information you can read it yourself on msdn

I will just summarize what these query hints can do for me.

WITH (READPAST) – shows only committed data. When a record gets updated, it will not be selected at all, not the previous version of the record, nor the updated but uncommitted version.

WITH (NOLOCK) – shows both committed and uncommitted (dirty) data. When a record gets updated, the new uncommitted values of the record will be selected.

When to use

In general, do not use these queryhints, when you retrieve data to edit, e.g. in disconnected scenario’s like editing in a webapplication. Working with uncommitted data will in this case seldom be what you want.

When it comes to read only access, it depends. You have to decide what is more important: showing always the right (committed) data, or showing data fast. If the risk that uncommitted data is shown is acceptable, WITH (NOLOCK) is the query hint for you, if the risk that data being changed is not shown at all, WITH (READPAST) is the query hint for you.

Also for semi-static data (dimension- or system tables), that is seldom changed, the risk of the drawbacks of both queryhints, is usually acceptable.

Executing the demo

Step 1

First of all we have to do some setup work. I have made a script for that. You can run it in SQL Server Management Studio (see link to download all scripts at the top of this article).

Note that for simplicity of the demo I have created a CHAR column to be primary key. I know it is not a best practice, but for the demo it is sufficient. Normally you would always strive for numeric primary keys.

step1_setup.sql

— Step 1: Create a QueryHintsDemo database with one table, DemoData
USE [master]
GO
IF NOT EXISTS (SELECT name FROM sys.databases WHERE name = N’QueryHintsDemo’)
  CREATE DATABASE [QueryHintsDemo]
GO
USE [QueryHintsDemo]
GO
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[DemoData]’) AND type in (N’U’))
DROP TABLE [dbo].[DemoData]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[DemoData](
 [DemoDataCode] CHAR(6) NOT NULL,
 [DemoDataDescription] VARCHAR(100) NOT NULL,
 CONSTRAINT [PK_dbo_DemoData] PRIMARY KEY CLUSTERED
(
 [DemoDataCode] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

Download all used scripts here.

Step 2

What is next?

Now you have to make sure you have two query windows open in Management Studio, so you can execute one script, and before that is finished, execute another script as well.

The first script will insert some initial data in the DemoData table, and then update it in a transaction. To be able to do this demonstration properly, the WAITFOR DELAY feature is used to wait 20 seconds before the transaction is rolled back.

In the first query window, use this script:

step2a_keep_transaction_open.sql

TRUNCATE TABLE [QueryHintsDemo].[dbo].[DemoData]
GO

INSERT INTO [QueryHintsDemo].[dbo].[DemoData]
           ([DemoDataCode]
           ,[DemoDataDescription])
     VALUES
           (‘DEMO01’
           ,‘This is the original text.’)
GO

BEGIN TRANSACTION

  UPDATE [QueryHintsDemo].[dbo].[DemoData]
     SET [DemoDataDescription] = ‘This is the updated text that gets rolled back later.’
   WHERE [DemoDataCode] = ‘DEMO01’

  WAITFOR DELAY ’00:00:20′

ROLLBACK TRANSACTION
GO

BEGIN TRANSACTION

  UPDATE [QueryHintsDemo].[dbo].[DemoData]
     SET [DemoDataDescription] = ‘This is the updated text that gets committed.’
   WHERE [DemoDataCode] = ‘DEMO01’
  
COMMIT TRANSACTION
GO

 

This is the script for the second querywindow:

step2b_select.sql

— WITH READPAST, will not show the updated, non committed record
— but also not the ‘old’ version of the record, before it got updated.
— So you potentially miss out some data when using this query hint.
SELECT ‘WITH (READPAST)’ AS [READPAST], *
FROM [QueryHintsDemo].[dbo].[DemoData] WITH (READPAST)
GO

— WITH NOLOCK, will show the updated, but uncommitted record. This is also called a “dirty read”.
— The risk here is that you use/display data, that in the end will
— not be committed and therefor never became ‘official’.
SELECT ‘WITH (NOLOCK)’ AS [NOLOCK], *
FROM [QueryHintsDemo].[dbo].[DemoData] WITH (NOLOCK)
GO

— Without using any queryhint, the select statement will wait until the
— first transaction is rolled back, and then show the original value.
— This is the most reliable, but also the slowest.
— You will notice that the first two SELECT statements will return
— a result immediately, but that the query below will only show result after almost
— 20 seconds, caused by the 20 seconds delay in the previous script.
SELECT ‘Without queryhint (1)’ AS [No hint 1], *
FROM [QueryHintsDemo].[dbo].[DemoData]

— Wait a few seconds, so that the update to be committed in the other script, is completed.
WAITFOR DELAY ’00:00:02′
— Without using any queryhint, the select statement will wait until the
— update is finished, and then show the final value.
SELECT ‘Without queryhint (2)’ AS [No hint 2], *
FROM [QueryHintsDemo].[dbo].[DemoData]

 

As you might already have understood, you should execute both scripts in SQL Server Management Studio: execute the first script, and while it is still running, start the second script in a seperate query window as soon as possible (but at least within 20 seconds after starting the first script).

Analyzing the query result

20130118_analyze_result2

 
Clean up
Use this script to clean up.

cleanup.sql

USE [master]
GO
EXEC msdb.dbo.sp_delete_database_backuphistory @database_name = N’QueryHintsDemo’
GO
ALTER DATABASE [QueryHintsDemo] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
GO
DROP DATABASE [QueryHintsDemo]
GO

 

Conclusion

The queryhints WITH (NOLOCK) and WITH (READPAST) are a great way to improve performance.

However they both have their own drawnbacks, to be aware of, when using them:

  • When using WITH (READPAST), you potentially miss out some data that is being updated, when your select query runs.
  • When using WITH (NOLOCK), the risk here is that you use/display data, that in the end will not be committed and therefor never became ‘official’.

It depends on the situation, when it is acceptable to use these queryhints.