How a SSIS Lookup works different from a T-SQL JOIN (SSIS Series)

Introduction

With a Lookup in a dataflow in a SSIS Package you can achieve a similar result as with a T-SQL Join clause in a SELECT statement: you can check if a row exists in a different dataset, based on one or more (business / natural / surrogate) key columns. However there are a number of differences. One of these differences took me several hours of investigating some time ago, when I had this lookup in SSIS that did not find any row. I did not understand why, because I had a similar T-SQL statement with a JOIN that found all rows. What could be different (it appeared to be a CHAR column connected to a VARCHAR)?
So today I am going to compare the two.

Subjects of comparison:

  1. Perform a lookup when the datatypes are different
  2. Perform a lookup on a different SQL Server Instance
  3. Perform a lookup when the join is not a equi-join
  4. When the source is not SQL Server
  5. When the joined columns differ in Case (Uppercase / Lowercase / Mixed case)

Simple setup for two lookup transformations using the Adventureworks database

Perform a lookup when the datatypes are different

A Transact-SQL query with a join is more forgiving than a SSIS Lookup when different datatypes are joined.

Connect a nvarchar to a varchar column

In a SSIS Lookup this does not work because an error is shown during design:

In T-SQL, this join can be made and will work:

different_datatypes.sql
SELECT
    *
FROM
    [Production].[STG_Product] stg
JOIN
    [Production].[ProductCategory] pc
    ON pc.name = stg.VARCHAR_ProductCategoryName

Connect a fixed length CHAR column to a variable length VARCHAR column (or NCHAR to NVARCHAR)

In SSIS the lookup can be developed, but will not work in practice, when the CHAR column value is not the maximum length (and thus has trailing spaces).
A “join” of a nvarchar to a nchar column ..
.. will not work !

In T-SQL, this join can be made and will work:

fixed_length_column.sql
SELECT
    *
FROM
    [Production].[STG_Product] stg
JOIN
    [Production].[ProductCategory] pc
    ON pc.name = stg.NCHAR_ProductCategoryName

Perform a lookup on a different SQL Server Instance

In a SSIS Lookup you can use different connection managers for source and Lookup transformation, so this is easy to do.

In T-SQL this would be more complex to implement, for instance you could use linked servers if your security policy allows this.

Perform a lookup when the join is not a equi-join

An equi-join is a join where the column values that are joined simply must be equal to each other.

By default a SSIS Lookup will do an equi-join. In the visual editor you connect one or more columns from the Available Input Columns to the Available Lookup Columns. Designed in this way it will be an equi-join.
There are two ways I know of to implement a different kind of join:

  1. in the Advanced Tab of the Lookup Transformation Editor you can modify the SQL statement.
  2. use a OLE DB Command in the Data Flow, but this can have a negative impact on performance.

In T-SQL you can do more complex joins, for instance with a BETWEEN clause for an EffectiveStartDate and EffectiveEndDate.

When the source is not SQL Server

In a SSIS Lookup you can use different connection managers for source and Lookup transformation, so this is easy to do.

In T-SQL this would be practicly impossible, or at least much more complex to accomplish (e.g. with OPENROWSET).

When the joined columns differ in Case

A SSIS Lookup is case-sensitive when Full cache is used, so differences in casing cause that lookup values are not found.
However when you switch to No Cache or Partial Cache , the lookup can become case-insensitive! Keith Mescha brought this to my attention (see below). Thank you, Keith.
So the safest thing to do is convert the column values for columns to connect from Available Input Columns to Available Lookup Columns to either upper- or lowercase on both sides. In T-SQL you can use the LOWER() and UPPER() functions for this, in SSIS you can add a Derived column to your dataflow, where you also can use the LOWER() and UPPER() functions. Depending on your needs, you can replace the existing column or add the upper or lower value as new column.

Example of a derived column that converts the value of a column of the Available Input Columns to uppercase.

In T-SQL the result of this join will depend on collation settings of the SQL Server Instance or column(s) affected in the join.
By default, SQL Server is installed using a case-insensitive collation (e.g. Latin1_General_CI_AS, where CI stands for Case Insensitive), and when the collation is not overruled on column-level, the join will work case-insensitively.

Performance considerations

From my own experience I know SSIS Lookups can be faster when developed correctly. This means:

  • Use full cache whenever the dataset allows this (so memory usage is acceptable given the amount of RAM on the server)
  • In the query of the lookup transformation, only include the columns that are needed in the Lookup.

Conclusion / Wrap up

In this blog post I have compared the SSIS Lookup transformation with the T-SQL Join.
In general T-SQL is more flexible and more forgiving, but when you need to go outside the SQL Server Instance for source or lookup it is easier to use a SSIS Lookup, but beware of the peculiarities of the SSIS Lookup, as described in this article.
Performance-wise, a SSIS Lookup can perform better than a T-SQL join, but this of course depends on a lot of design factors in both your Lookup as well as SQL database.

(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.

Hans Michiels

Hans is an Independent Business Intelligence and Datawarehouse Consultant & Microsoft SQL Server Consultant, working in the Netherlands. He has been working in the software industry since 1996, with SQL Server since the year 2001, and since 2008 he has a primary focus on datawarehouse- and business intelligence projects using Microsoft technology, using a Datavault and Kimball architecture. He has a special interest in Datawarehouse Automation and Metadata driven solutions. * Certified in Data Vault Modeling: Certified Data Vault 2.0 Practitioner, CDVDM (aka Data Vault 1.0) * Certified in MS SQL Server: * MCSA (Microsoft Certified Solutions Associate) SQL Server 2012 - MCITP Business Intelligence Developer 2005/2008 - MCITP Database Developer 2005/2008 - MCITP Database Administrator 2005/2008

More Posts

9 thoughts on “How a SSIS Lookup works different from a T-SQL JOIN (SSIS Series)”

  1. I have seen where the cache settings of the lookup could change the case sensitivity. This is based on all rows being pulled into memory vs a query to sql per row. Have you tested that as well?

    1. Hello Keith, this is really something. I did not try this before, but I just did, and you are right: when using No Cache or Partial Cache the lookup suddenly isn’t case-sensitive anymore. I did not know that. Thanks for being to attentive.

  2. Hi Hans, I’m from Argentina . I read your article and I have to mention that you are wrong about the equi-join and the between clause. You can work it out , using the advanced tab on the SSIS Lookup transformation task.

    1. Hi Agustin, this is a good addition, you are right, by customizing the SQL statement you can do this.
      Thanks for your feedback, it is always good to learn from each other.

  3. SSIS Lookups are for people unable to write complex SQL queries.
    People unable to look at an Execution Plan and unable to (re)write the query so that the Plan is optimal.
    Personal experience.

    1. Hi Teodoro, I do not agree with that. From my experiences SSIS Lookups are useful when source and destination are not within the same SQL Server Instance or when the source (or destination) isn’t SQL Server at all.
      And even within the same SQL Server Instance you have a choice, and I think SSIS Lookups are not only for people with less technical skills. It just depends on the situation and your needs. I use whatever is faster/easier to develop with the same result (also taking into account that a Lookup might be easier to maintain by a successor developer than a very complex SQL Statement), even though I can read Execution Plans and act accordingly. I do not want to make it more complex than needed to show that I am “smart” or to exercise my brain. I know developers that do not care if the solution is maintainable by other developers.

    2. No, I disagree. what you’re saying would only apply in the event that your lookup source is local to your data. Sometimes this is not the case and lookup will outperform a linked server query any day of the week!

Comments are closed.