By Richard Hiers,2017
In the following post, I’ll be explaining how to use ETL and SSIS to generate surrogate keys for dimension tables in a data warehouse. Although most dimensional data possesses a natural key or a business key in the source tables, Kimball and everyone else I’ve read strongly recommend using a surrogate key for dimensional tables. Our prospects, for example, all have a unique ProspectID, but it is recommended that the ETL system generate a unique identifying key be for each record. This surrogate key then becomes the foreign key in the Fact tables.
You can read more about the advantages of using a surrogate key in Kimball’s The Data Warehouse Toolkit, 3rd Ed. starting on page 98, Corr’s Agile Data Warehouse Design starting on page 141, and Adamson’s Star Schema on page 30 and following. But in brief, some of the key benefits of using a surrogate key are:
- Isolates the data warehouse from any instability in the source OLTP system keys where business keys may possibly be deleted, recycled, and modified in ways that could be detrimental to the functioning of the data warehouse.
- Facilitates pulling dimensional attributes from disparate source systems in which each has its own independent business keys.
- Supports slowly changing dimensions (SCD) where there can be multiple versions of any given entity in the dimension.
So the question becomes, how do you create and maintain these surrogate keys?
The Need: An SSIS Surrogate Key Generator
It would be possible to design the dimension table with an auto-incrementing identity key, but at least for performance reasons, that doesn’t seem to be the preferred mechanism. From Kimball starting on page 469: “For improved efficiency, consider having the ETL tool generate and maintain the surrogate keys. Avoid the temptation of concatenating the operational key of the source system and a date/time stamp.” OK, so how do I accomplish that?
The Challenge: An SSIS Surrogate Key Generator
Given this and similar recommendations found on the web, and given the maturity of Microsoft’s SSIS ETL tools, I would expect this to be pretty straight forward. I would expect that I could just drag the Surrogate Key Creator transformation into my Data Flow and go on my merry way. But its not that straight forward. The tool doesn’t exist and even in Knight’s Professional Microsoft SQL Server 2014 Integration Services, the topic is not addressed that I can find, though I haven’t yet read it from stem to stern.
Given the complexity of some of the solutions I found on the web, it is really surprising to me that Microsoft hasn’t made this simpler to accomplish. But that seems to be where we are. With the help of some instructions on the web, I was eventually able to craft an ETL surrogate key generator independent of my data warehouse RDBMS. In the examples below I am creating a generator for my Expected Entrance Term dimension table.
Using a Script Component
I was able to create a working surrogate key (SK) generator first referring to the concepts and script found in a 2011 post from Joost van Rossum’s SSIS blog. I’ll briefly outline the solution since I found his instructions to be incomplete (He seems to have assumed a higher level of familiarity with the various components and thus omitted some details). After that, I’ll outline a simpler method which does not require the use of a script.
- Start by querying the destination dimension table for the last SK used. On the Control Flow tab in Microsoft Visual Studio, drag an Execute SQL Task to the canvas. Double click to edit.
- ResultSet set to “Single row”
- Define your OLE DB Connection
- Insert your SQL Statement (modify as needed):
SELECT MAX([TERM_KEY]) AS MaxSK
- On the Result Set tab, Click the Add button to add a Variable. Leave the Variable Name as “User::Counter” and rename the Result Name to whatever you prefer.
- Click OK to save your changes
- Add a Data Flow Task, connect the Execute SQL Task to it, and double-click to edit
- In the Data Flow, add an OLE DB Source to the canvas and configure it to select the records you need to populate your dimension (from the Term table in my case)
- Connect that to a Script Component and click to edit it (choose “Transformation” for the type when you drop it on the canvas)
- Go to the Inputs and Outputs tab and click Add Column and give it a Name and DataType (I used TERMKEY and four-byte signed integer [DT_I4]). Note, name it something other than “RowId”, you will need to modify the code below)
- Make sure the ScriptLanguage property is set to C# and click the Edit Script… button to open the script editor
- Replace the place-holder code with the C# code from Rossum’s blog post (step 7) referenced above. (make sure you select the C# code and not the VB.NET code). NOTE: If you did not name your new Column above “RowId” you will need to change the modify row 59 in the sample code to reflect your Column name.
- Save the script and close that window and then click OK to save and close the Script Component editor
- Add an OLE DB Destination and configure it to map the source columns plus the new key column to the appropriate destination columns. The data flow will look like this:
- Once you have saved this project, click the Start button to take it for a spin. On the first run you should see all of your dimension records added to the data warehouse with a sequential keys starting at 1.
- Start the project a second time, and you will see you will now have twice the records, but the surrogate keys for the second load will start where the previous load ended. In my case the first load ended with and SK of 280 and the second load started with 281.
Nice! Of course in a production ETL system, the second load would be only include any new rows (and assigned the next higher keys) or changes which would be handled differently based on the type of attribute that changed (are we tracking historical values, or just the current value?).
A Simpler Surrogate Key Generator
While attempting to fill in the gaps of the instructions I found for the script method above, I came across a simpler method that doesn’t require a script on Sam Vanga’s blog (scroll down to his “With ROW_NUMBER()” method). It utilizes the same Execute SQL Task, but makes utilizes the results in the initial OLE DB Source transformation tool rather than in a Script Component. Here are the steps I took:
First, perform steps 1-7 from the previous method.
- In the Data Flow, add an OLE DB Source to the canvas and configure it to select the records you need to populate your dimension (from the Term table in my case). However, in addition to the query you used in the previous method, you will add another column to the SELECT to make use of the variable returned from the Execute SQL Task.
SELECT ROW_NUMBER() OVER(ORDER BY TermCalendarID) + ? AS TERM_KEY
,TextTerm AS TERM_DESCRIPTION
,Term AS TERM_CODE
,TermCalendarID AS TERM_ID
,TermStartDate AS TERM_START_DATE
,TermEndDate AS TERM_END_DATE
- As in the query above, add the statement “ROW_NUMBER() OVER(ORDER BY <column_name>) + ? AS TERM_KEY”. Without the “+ ?” , this query orders the rows by the column indicated and then assigns sequential numbers, starting with 1, to each row. In my case, I had 280 rows in the result, numbered sequentially from 1-280.
- The “?” (question mark) is the parameter placeholder in the OLE DB Source for SSIS. After adding the question mark, click on the Parameters… button to the right of the query window. NOTE, if you don’t have a “?” in your query, you will get the message: “The SQL statement does not contain any parameters.”
- Select the variable you defined in step 5 (User::Counter in my case) and click OK to save.
What this does then (“ROW_NUMBER() OVER(ORDER BY <column_name>) + ? AS TERM_KEY”), is give each row a sequential row number, and then add to each the variable number which is the MAX surrogate key value of the destination dimension table. On initial load, the TERM dimension table is empty, so “0” is added to each row number. The destination ends up with 280 rows, with TERM-KEYs numbered sequentially 1-280. On the next subsequent load, each new row number will have 280 added to it. So if on the next load 5 new terms had been added to the source table, the first would be ROW_NUMBER 1, plus 280 which would result in a new surrogate key of 281, and so on.
- Add an OLE DB Destination and configure it to map the source columns plus the new key column to the appropriate destination columns. OK to save.
This simpler surrogate key generator populates the destination dimension just as the previous one did. I haven’t compared the performance of these two methods, but Sam Vanga claims that the ROW_NUMBER() method is twice as fast as the script option (he is using a different script than I though).
Kimball specified that the surrogate key generator should “independently generate surrogate keys for every dimension; it should be independent of database instance and able to serve distributed clients… consider having the ETL tool generate and maintain the surrogate keys” (pp. 469-470). Do these two methods meet that specification? They certainly do not rely on database triggers or on an identity column. I’m not sure what he is thinking of with “distributed clients” (Can anyone provide an insight? Please leave a comment below.) Are these methods examples of the ETL tool “maintaining” the keys? Kimball seems to be saying the generator is a thing. When I started looking around for an SSIS key generator, I was surprised to find only methods. I’ll continue to look for other options, but for now, this seems to be the best option. Let me know your thoughts and experience.
read more: https://diybidwguy.com/2017/10/20/using-ssis-generate-surrogate-keys/