Introduction
With a Lookup in a dataflow in a SSIS Package you can achieve a similar result as with a T-SQL Join clause in a SELECT statement: you can check if a row exists in a different dataset, based on one or more (business / natural / surrogate) key columns. However there are a number of differences. One of these differences took me several hours of investigating some time ago, when I had this lookup in SSIS that did not find any row. I did not understand why, because I had a similar T-SQL statement with a JOIN that found all rows. What could be different (it appeared to be a CHAR column connected to a VARCHAR)?
So today I am going to compare the two.
Subjects of comparison:
- Perform a lookup when the datatypes are different
- Perform a lookup on a different SQL Server Instance
- Perform a lookup when the join is not a equi-join
- When the source is not SQL Server
- When the joined columns differ in Case (Uppercase / Lowercase / Mixed case)
Simple setup for two lookup transformations using the Adventureworks database
Perform a lookup when the datatypes are different
A Transact-SQL query with a join is more forgiving than a SSIS Lookup when different datatypes are joined.
Connect a nvarchar to a varchar column
In a SSIS Lookup this does not work because an error is shown during design:
In T-SQL, this join can be made and will work:
*
FROM
[Production].[STG_Product] stg
JOIN
[Production].[ProductCategory] pc
ON pc.name = stg.VARCHAR_ProductCategoryName
Connect a fixed length CHAR column to a variable length VARCHAR column (or NCHAR to NVARCHAR)
In SSIS the lookup can be developed, but will not work in practice, when the CHAR column value is not the maximum length (and thus has trailing spaces).
A “join” of a nvarchar to a nchar column ..
.. will not work !
In T-SQL, this join can be made and will work:
*
FROM
[Production].[STG_Product] stg
JOIN
[Production].[ProductCategory] pc
ON pc.name = stg.NCHAR_ProductCategoryName
Perform a lookup on a different SQL Server Instance
In a SSIS Lookup you can use different connection managers for source and Lookup transformation, so this is easy to do.
In T-SQL this would be more complex to implement, for instance you could use linked servers if your security policy allows this.
Perform a lookup when the join is not a equi-join
An equi-join is a join where the column values that are joined simply must be equal to each other.
By default a SSIS Lookup will do an equi-join. In the visual editor you connect one or more columns from the Available Input Columns to the Available Lookup Columns. Designed in this way it will be an equi-join.
There are two ways I know of to implement a different kind of join:
- in the Advanced Tab of the Lookup Transformation Editor you can modify the SQL statement.
- use a OLE DB Command in the Data Flow, but this can have a negative impact on performance.
In T-SQL you can do more complex joins, for instance with a BETWEEN clause for an EffectiveStartDate and EffectiveEndDate.
When the source is not SQL Server
In a SSIS Lookup you can use different connection managers for source and Lookup transformation, so this is easy to do.
In T-SQL this would be practicly impossible, or at least much more complex to accomplish (e.g. with OPENROWSET).
When the joined columns differ in Case
A SSIS Lookup is case-sensitive when Full cache is used, so differences in casing cause that lookup values are not found.
However when you switch to No Cache or Partial Cache , the lookup can become case-insensitive! Keith Mescha brought this to my attention (see below). Thank you, Keith.
So the safest thing to do is convert the column values for columns to connect from Available Input Columns to Available Lookup Columns to either upper- or lowercase on both sides. In T-SQL you can use the LOWER() and UPPER() functions for this, in SSIS you can add a Derived column to your dataflow, where you also can use the LOWER() and UPPER() functions. Depending on your needs, you can replace the existing column or add the upper or lower value as new column.
Example of a derived column that converts the value of a column of the Available Input Columns to uppercase.
In T-SQL the result of this join will depend on collation settings of the SQL Server Instance or column(s) affected in the join.
By default, SQL Server is installed using a case-insensitive collation (e.g. Latin1_General_CI_AS, where CI stands for Case Insensitive), and when the collation is not overruled on column-level, the join will work case-insensitively.
Performance considerations
From my own experience I know SSIS Lookups can be faster when developed correctly. This means:
- Use full cache whenever the dataset allows this (so memory usage is acceptable given the amount of RAM on the server)
- In the query of the lookup transformation, only include the columns that are needed in the Lookup.
Conclusion / Wrap up
In this blog post I have compared the SSIS Lookup transformation with the T-SQL Join.
In general T-SQL is more flexible and more forgiving, but when you need to go outside the SQL Server Instance for source or lookup it is easier to use a SSIS Lookup, but beware of the peculiarities of the SSIS Lookup, as described in this article.
Performance-wise, a SSIS Lookup can perform better than a T-SQL join, but this of course depends on a lot of design factors in both your Lookup as well as SQL database.
(c) 2017 hansmichiels.com – Do not steal the contents – spread the link instead – thank you.