PowerBI 1: How to Get Address by Passing Existing Coordinates to Google Maps Reverse Geocoding API

In this article I show you how to get address details such as country, postal code and street address out of Google Maps by passing existing coordinates to the Google Maps Geocoding API.

Requirements

To be able to follow the steps in this post you need to:

  • Have a Google account
  • Activate the Google Maps Geocoding API
  • Find the desired API in the Google Developers Guide
  • Use the API in Power BI
  • Have GPS device data exported to a CSV file

Activate the Google Maps Geocoding API

I assume you already have an existing Google account, or you know how to create a new one, so I’ll jump straight into the next step, which is activating the Geocoding API. To be able to start using the Google Maps Geocoding API you need to get an API key first.

Google Maps Geocoding API

  • Scroll down and click “Get A Key”

Get Google Maps API Key

  • This will navigate you to Google Developers Console
  • Click “Yes” then click “Agree and continue”

Get Google Maps API Key 2

  • Type in a name then click “Create”

Get Google Maps API Key 3

You successfully created an API key. We will use this key in the next sections of this article.

Using Google Maps Geocoding APIs In Power BI

Now that we have an API key we can use the many Google Geocoding APIs available here. As you can see there are lots of useful APIs you can use to enrich your Power BI model. In my sample I need to retrieve the country, postal code and street address for existing coordinates generated by my bike GPS computer, so I use Reverse Geocoding (Address Lookup). Here is a sample:

https://maps.googleapis.com/maps/api/geocode/json?latlng=-36.8484213,174.7457047&key=YOUR_API_KEY

If you copy and paste the above URL into your web browser you can see its output, which is a JSON document providing detailed information about the location.

Note: Do not forget to copy and paste your own API key at the end of the URL.

The idea is to define a function in Power BI then pass latitude and longitude to the function to get Country, Postal Code and Street Address.

How it works

  • Open Power BI Desktop
  • Get Data from Web
  • Paste the API in the URL section then click OK

Power BI Desktop Get Data From Web

  • Stick with the default and click “Connect”

Power BI Desktop Access Web Content

  • This opens “Query Editor”
  • Click “List”

Power BI Desktop Query Editor Expand List

  • Convert the list to a table by clicking the “To Table” button on the “Transform” tab of the ribbon

Power BI Desktop Convert List to Table

  • Click the expand complex column icon to expand the table, tick the “formatted_address” and “types” columns, then click OK

Power BI Desktop Expand Complex Column

Power BI Desktop Expand Complex Column 2

  • Expand the “types” column as well

Power BI Desktop Query Editor

  • Filter the “types” column to show “country”, “postal_code” and “street_address”

Power BI Desktop Query Editor Filter Column

Power BI Desktop Query Editor 2

  • We need the columns as rows, so we need to transpose the table. Go to the “Transform” tab and click “Transpose”

Power BI Desktop Transpose Table

  • As you can see, the second row contains the column names. Click “Reverse Rows” to show that row first

Power BI Desktop Reverse Rows

  • Click “Use First Row As Headers”

Power BI Desktop Use First Row As Header

  • Rename the query then save the model

Power BI Desktop Rename Query

PowerBI: How to Get Address by Passing Existing Coordinates to Google Maps Reverse Geocoding API

Create a Function from an Existing Query (Address Lookup)

From the previous tutorial, we loaded the geolocation information for just one coordinate. Now we need to turn the query into a function so that we can invoke it for all the coordinates in our dataset.

The function is constructed as follows:

  • Click “Advanced Editor”

Power BI Desktop Advanced Editor

  • Change the code as shown below
let
  get_geo = ( lat as text, lng as text) =>
let
 Source = Json.Document(Web.Contents("https://maps.googleapis.com/maps/api/geocode/json?latlng=" & lat & "," & lng & "&key=YOUR_API_KEY")),
 results = Source[results],
 #"Converted to Table" = Table.FromList(results, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
 #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"formatted_address", "types"}, {"Column1.formatted_address", "Column1.types"}),
 #"Expanded Column1.types" = Table.ExpandListColumn(#"Expanded Column1", "Column1.types"),
 #"Filtered Rows" = Table.SelectRows(#"Expanded Column1.types", each ([Column1.types] = "postal_code" or [Column1.types] = "street_address")),
 #"Transposed Table" = Table.Transpose(#"Filtered Rows"),
 #"Reversed Rows" = Table.ReverseRows(#"Transposed Table"),
 #"Promoted Headers" = Table.PromoteHeaders(#"Reversed Rows", [PromoteAllScalars=true]),
 #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"street_address", type text}, {"postal_code", type text}})
in
 #"Changed Type"
in
  get_geo

Power BI Desktop Create Function

1 – Function internal name

2 & 4 – Input parameters

3 & 5 – Parameters’ data types

6 – Function body

7 – Output

Note: The name used to invoke the function is the query name, which is “fn_GeoLocation” in our sample.

  • Continue modifying the code as below, then click “Done”

Power BI Desktop Function Parameters

  • What we have done above is replace the static latitude and longitude with parameters, using “&” to concatenate the parts of the source URL.
  • As soon as you click “Done” Power BI detects the query as a function

Power BI Desktop Invoke Function

  • To test the function just click “Invoke” and enter latitude and longitude then click OK

Power BI Desktop Invoke Function 2

Power BI Desktop Invoke Function 3

  • Delete the invoked function step

We successfully defined a query as a function so that we can invoke it anytime.

The next step is to load GPS data from CSV. I’m using my bike computer data that I exported to CSV format earlier.

Download sample CSV file here.

  • In Query Editor, click “New Source” from “Home” tab then click “CSV”

Power BI CSV

  • Browse and open the CSV file
  • Click OK

Power BI Query Editor CSV

  • Click “Add Custom Column” from the “Add Column” tab and name it “Geolocation”
  • Enter a formula that invokes the fn_GeoLocation function, for example = fn_GeoLocation(Number.ToText([Latitude]), Number.ToText([Longitude])) (assuming your latitude and longitude columns are named “Latitude” and “Longitude”; adjust the names to match your CSV), then click OK

Power BI Query Editor Add New Column

Note: I used the “Number.ToText” function to convert the latitude and longitude values from number to text, as we defined the function parameters as text.

  • You’ll get an “Information is required about data privacy” warning message. Click “Continue”

Power BI Query Editor Add New Column 2

  • Select a desired privacy level from the list then click “Save”

Power BI Privacy Levels

Note: Learn more about “Privacy levels” here.

  • Scroll right to find the newly added column
  • Expand “Geolocation” column

Power BI Expand Complex Column 2

Voila! We have all the information we need.

  • Click “Close & Apply” on the “Home” tab of the ribbon and wait until the data loads into the model.

Power BI Query Editor Close & Apply

Note: It might take some time to load the data into the model, depending on your dataset size. Coordinates are sent to Google Maps row by row and the results are then loaded into the model, so the process can be slow for large datasets. There are also some limitations applied to API usage.

http://biinsight.com/power-bi-and-google-maps-api-address-lookup/

Why Data Modelling is Important in BI

My experience comes from working on reporting systems based on semantic models, aka “self-service BI”, but the principles are the same ones we had in DSS (decision support systems), OLAP (on-line analytical processing), and the many other names we’ve seen over the years.

The basic idea is always the same: enable the user to navigate data without rewriting a new query every time. In order to do that, a description of the data model, with metadata that enriches the simple collection of tables and relationships, allows a generic client tool to offer a customized experience, as if it had been designed for a specific business need.

Creating a good data model specific to the business analysis is important because the goal is different from that of a database that runs a transactional application. If you want to record sales, you need a database that is optimized to collect data in a safe and accurate way. If you want to analyze sales, you need a data model that is optimized to query and aggregate data in volume. These two requirements differ in a way that goes far beyond a simple performance issue.

A normalized model for a transactional application can pose challenges for a query: how do you interpret missing data or a null value? A data model optimized for analysis is simpler in nature, because the data have been transformed so that they are unambiguous.

Power BI shares the same data modeling experience as Power Pivot and Analysis Services Tabular. The easiest way to create a data model is to get the data “as is” from a data source, define relationships, and then start designing reports. However, this approach raises several issues when you have too many tables, or different granularities in tables representing the same entity from different data sources. Power Query and the M language are here to solve these problems. If you are lucky enough to read data from a well-designed data mart that follows star schema principles, you don’t need any transformation. But in all other cases, you should probably massage your tables before creating your reports.

The question is: what is the right data model for the business analysis?
The first simple answer is: the model should be designed to answer business questions, and not to simply represent the data as they come from the data source.
Yes, this answer is not a real answer, because it does not provide any practical guidance.
A better hint is: one entity, one table. This is a “user-friendly” way to describe a star schema, where each set of attributes describing a business entity is collected in a table called a “dimension”, and each set of numeric metrics describing events and sharing the same granularity is stored in a table called a “fact table”. But “one entity, one table” is simpler to understand.

A product is a table. Category, color, price, manufacturer: these are attributes. There is no added value in creating many relationships just to describe attributes of the same entity. Just collapse all these attributes into the same Product table.
A customer is a table. Country, city, address, and number of employees are just attributes of the customer. It is useless to create a separate table of countries.
Unless the country is a business entity in your analysis. If you are doing demographic research, chances are that the country is a business entity, and not just an attribute of a customer.
An invoice is a table. Date of the invoice, shipping cost, products sold, quantities… all these attributes should be in a table with a conformed granularity. And when this is not possible, but only then, you start considering several tables at different granularities.
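To make “one entity, one table” concrete, here is a minimal star schema sketch in T-SQL; the table and column names are purely illustrative and not taken from any particular source system.

CREATE TABLE dbo.DimProduct
(
    ProductKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key for the dimension
    ProductName  NVARCHAR(50),
    Category     NVARCHAR(50),                   -- just an attribute, not a separate table
    Color        NVARCHAR(20),
    Manufacturer NVARCHAR(50),
    ListPrice    DECIMAL(8,2)
);

CREATE TABLE dbo.FactSales
(
    ProductKey  INT NOT NULL REFERENCES dbo.DimProduct (ProductKey),  -- link to the dimension
    OrderDate   DATETIME NOT NULL,
    Quantity    INT,
    SalesAmount DECIMAL(10,2)
);

Category, color, price and manufacturer all live on the dimension row; the fact table stores only the surrogate key and the numeric measures.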

Design principles for data modelling

There is no right or wrong data model, only good or bad ones. Good data modelling is difficult, and the following design principles can be useful:

– Be faithful to the specification of the requirements

– Use common sense and make assumptions only where the specification fails to explain

– Avoid duplication and other redundant information

– Follow the KISS principle (keep it simple)

Customer: A customer may not yet have booked a safari, or may have booked and participated in several safaris. A company or an individual person may be a customer, but only individual persons are booked onto scheduled safari trip occurrences. The information to be stored about a customer is:

  • For a company customer: name, address (first line address, second line address, city, postcode, country) and contact name, email address and phone number. A contact may change over time
  • For a person customer: first name, last name, address (first line address, second line address, city, postcode, country), email address, phone number and date of birth
  • A customer’s preferences (whether a company or a person) for the time slots during which they wish to travel must be retained in the database. A customer may be able to travel in several time slots during the year. Tane wishes to retain this information so that he can target publicity for scheduled trips only at those able to participate

SSIS to SQL Server Data Type Translations

It can be extremely confusing when you first encounter SSIS data types.  At first glance, they seem to be nothing like the SQL Server data types you know and love.  That’s why I’ve provided below a conversion chart of SSIS data types to SQL Server data types.  This information is readily available on MSDN but it always seems difficult to find.  Hope this helps!

SSIS to SQL Server Data Type Translations

http://www.sqlservercentral.com/blogs/dknight/2010/12/22/ssis-to-sql-server-data-type-translations/

SQL Server Data Types you Must Know

Why data types are important

  1. The data is stored in the database in a consistent and known format.
  2. Knowing the data type allows you to know which calculations and formulations you can use on the column.
  3. Data types affect storage. Some values take up more space when stored in one data type versus another.
  4. Data types affect performance. The less time the database has to spend inferring or converting values, the better.  (“Is December 32, 2015 a date?”)

Commonly used SQL Server Data Types

There are over thirty different data types you can choose from when defining columns.  Many of these are set up for very specific jobs, such as storing images, while others are more suited to general use.

Here are the data types you’ll most frequently encounter in your everyday use of SQL.  These are:

  • INT
  • VARCHAR, NVARCHAR
  • DATETIME
  • DECIMAL, FLOAT
  • BIT

INT – Integer Data Type

The integer data type is used to store whole numbers.  Examples include -23, 0, 5, and 10045.  Whole numbers don’t include decimal places.  Since SQL Server uses a fixed number of bytes to represent an integer, there are maximum and minimum values it can represent: an INT can store values from -2,147,483,648 to 2,147,483,647.

Practical uses of the INT data type include counting values, storing a person’s age, or serving as an ID key for a table.

But INT wouldn’t be so good for keeping track of the address space of a terabyte hard drive, as the INT data type only goes up to about 2 billion and we would need to count into the trillions.  For this you could use BIGINT.
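As a quick sketch (the variable name is made up), a byte count for a one-terabyte drive fits comfortably in a BIGINT but would overflow an INT:

DECLARE @Bytes BIGINT = 1099511627776;   -- 1 TB expressed in bytes
SELECT @Bytes AS TerabyteInBytes;        -- fine: well within the BIGINT range
-- SELECT CAST(@Bytes AS INT);           -- would fail: arithmetic overflow, INT stops at 2,147,483,647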

The INT data type can be used in calculations.  Since DaysToManufacture is defined as INT we can easily calculate hours by multiplying it by 24:

SELECT Name,
       DaysToManufacture,
       DaysToManufacture * 24 as HoursToManufacture
FROM   Production.Product

Here you can see the results

Use of INT to perform calculations.

There are many operations and functions you can use with integers which we’ll cover once we dig into functions.

VARCHAR and NVARCHAR – Text Values

Both VARCHAR and NVARCHAR are used to store variable length text values.  “VARCHAR” stands for variable length character.

The number of characters a VARCHAR or NVARCHAR column can store is defined in the column definition.  For instance, as you can see in the following column definition from Object Explorer, the product name is defined to hold fifty characters.

VARCHAR definition shown in SQL Server Management Studio

What makes VARCHAR popular is that values shorter than the declared length take up less space: only enough space to hold the actual value is allocated.  This differs from the CHAR data type, which always allocates the specified length, regardless of the length of the data actually stored.

The VARCHAR data type can typically store a maximum of 8,000 characters.  The NVARCHAR data type is used to store Unicode text; since Unicode characters occupy twice the space, NVARCHAR columns can typically store a maximum of 4,000 characters.

The advantage NVARCHAR has over VARCHAR is that it can store Unicode characters.  This makes it handy for storing extended character sets, like those used for scripts such as Kanji.
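A small sketch of the difference; the N prefix marks a Unicode literal, and the variable names are only for illustration:

DECLARE @City NVARCHAR(20) = N'東京';             -- Unicode literal, stored intact in NVARCHAR
DECLARE @Note VARCHAR(50)  = 'plain ASCII text';  -- single-byte characters only
SELECT @City AS CityName,
       LEN(@City)        AS CharacterCount,       -- 2 characters
       DATALENGTH(@City) AS BytesUsed;            -- 4 bytes (2 bytes per character)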

If your database was designed prior to SQL 2008 you’ll most likely encounter VARCHAR; however, more modern databases or those global in nature tend to use NVARCHAR.

DATETIME – Date and Time

The DATETIME data type is used to store the date and time.  An example of a DATETIME value is

1968-10-23 1:45:37.123

This is the value for October 23rd, 1968 at 1:45 AM.  Actually, the time is more precise than that: it is really 1:45 and 37.123 seconds.

In many cases you just need to store the date.  In these cases, the time component is zeroed out.  Thus, November 5th, 1972 is stored as

1972-11-05 00:00:00.000

A DATETIME can store dates from January 1, 1753 through December 31, 9999.  This makes DATETIME good for recording dates in today’s world, but not so much in William Shakespeare’s.
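For reference, these are the boundary values; anything outside this range raises a conversion error:

DECLARE @Earliest DATETIME = '1753-01-01';
DECLARE @Latest   DATETIME = '9999-12-31 23:59:59.997';   -- the last value DATETIME can hold
SELECT @Earliest AS EarliestDate, @Latest AS LatestDate;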

As you get more familiar with the various SQL built-in functions you’ll be able to manipulate this data.  To give you a glimpse, we’ll use the YEAR function to count employees hired each year.  When given a DATETIME value, the YEAR function returns the year.

The query we’ll use is

SELECT   YEAR(HireDate),
         Count(*)
FROM     HumanResources.Employee
GROUP BY YEAR(HireDate)
ORDER BY YEAR(HireDate)

And here are the results

Use YEAR on DATETIME data type

The benefit is that the DATETIME type ensures the values are valid dates.  Once this is assured, we’re able to use a slew of functions to calculate the number of days between dates, the month of a date, and so on.
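For instance, DATEDIFF and MONTH are two such functions. A quick sketch with made-up values:

DECLARE @Hired DATETIME = '1999-02-15',
        @Today DATETIME = '2005-07-01';
SELECT DATEDIFF(DAY, @Hired, @Today) AS DaysEmployed,   -- days between the two dates
       MONTH(@Hired)                 AS HireMonth;      -- 2, i.e. February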

We’ll explore these various functions in detail in another blog article.

DECIMAL and FLOAT – Decimal Points

As you may have guessed DECIMAL and FLOAT datatypes are used to work with decimal values such as 10.3.

I lumped DECIMAL and FLOAT into the same category, since they both can handle values with decimal points; however, they both do so differently:

If you need precise values, such as when working with financial or accounting data, then use DECIMAL.  The reason is that the DECIMAL data type allows you to define the number of decimal places to maintain.

DECIMAL

DECIMAL data types are defined by precision and scale.  The precision determines the total number of digits stored, whereas the scale determines the number of digits to the right of the decimal point.

A DECIMAL datatype is specified as DECIMAL(precision,scale).

A DECIMAL data type can hold no more than 38 digits.  The precision and scale must adhere to the following relation:

0 <= scale <= precision <= 38 digits
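For example, a DECIMAL(5,2) column or variable has room for three digits before the decimal point and two after (the variable names here are made up):

DECLARE @Price DECIMAL(5,2) = 123.45;        -- fits: 5 digits in total, 2 after the decimal point
SELECT @Price AS Price;
-- DECLARE @TooBig DECIMAL(5,2) = 1234.56;   -- would fail: 6 digits exceed the declared precision of 5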

In the Production.Product table, the Weight column’s data type is defined as DECIMAL(8,2).  The first number is the precision, the second the scale.

Weight is defined to have eight total digits, two of them to the right of the decimal place.  We’ll use the following sample query to illustrate how this data type works.

SELECT   DISTINCT Weight
FROM     Production.Product
WHERE    Weight BETWEEN 29.00 and 189.00
ORDER BY Weight DESC

The results follow:

Using DECIMAL data type to display results

FLOAT

Where DECIMAL data types are great for exact numbers, FLOATs are really good for very large or very small numeric values.  Though a DECIMAL value can have 38 digits in total, in many engineering and scientific applications this is inadequate.  For scientific applications where extreme numeric values are encountered, FLOAT rises to the top!

FLOATs have a range from -1.79E+308 to 1.79E+308.  That means the largest value can be 179 followed by 306 zeros (large indeed!).

Because of the way float data is stored in the computer (see the IEEE 754 floating point specification), the number stored is an extremely close approximation.  For many applications this is good enough.

Because of this approximate behavior, avoid using the <> and = operators in the WHERE clause.  Many a DBA has been burned by a statement like

WHERE mass = 2.5

Their expectations are dashed when mass is supposed to equal 2.5 but, in the computer, is stored as 2.499999999999999 and is therefore not equal to 2.500000000000000!

That is the nature of floating points and computers.  You and I see 2.499999999999999 and think that for practical purposes it is 2.5, but to the computer, we’re off just a bit.
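A small sketch of the problem, together with a safer tolerance-based comparison (the 1e-9 tolerance is an arbitrary choice for illustration):

DECLARE @a FLOAT = 0.1, @b FLOAT = 0.2;
DECLARE @mass FLOAT = @a + @b;   -- mathematically 0.3, but stored as roughly 0.30000000000000004
SELECT @mass AS StoredValue,
       CASE WHEN @mass = 0.3             THEN 'equal'        ELSE 'not equal' END AS ExactCompare,
       CASE WHEN ABS(@mass - 0.3) < 1e-9 THEN 'close enough' ELSE 'off'       END AS ToleranceCompare;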

BIT – Boolean or Yes/No values

There are times when you just need to store whether something “is” or “is not.”  For instance, whether an employee is active.  It is in these cases that the BIT data type comes into its own.  This data type can have one of three states: 1, 0, or NULL.

The value of 1 signifies TRUE and 0 FALSE.
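A minimal sketch of declaring such a flag column; the table and column names are hypothetical:

CREATE TABLE dbo.EmployeeFlagDemo
(
    EmployeeID INT PRIMARY KEY,
    IsActive   BIT NULL          -- 1 = TRUE, 0 = FALSE, NULL = unknown
);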

In this query we’re listing all salaried position job titles

SELECT DISTINCT JobTitle
FROM   HumanResources.Employee
WHERE  SalariedFlag = 1

Here are the results

Using the BIT data type in Searches

We could also have used ‘True’ instead of 1.  Here is the same example using ‘True’

SELECT DISTINCT JobTitle
FROM   HumanResources.Employee
WHERE  SalariedFlag = 'True'

And the opposite using ‘False’

SELECT DISTINCT JobTitle
FROM   HumanResources.Employee
WHERE  SalariedFlag = 'False'

I tend to stick with 1 and 0, since it is easier to type, but if you’re going for readability, then ‘True’ and ‘False’ are good options.

Read more:
https://www.essentialsql.com/commonly-used-sql-server-data-types/

Using SSIS to Generate Surrogate Keys

By Richard Hiers, 2017

In the following post, I’ll explain how to use ETL and SSIS to generate surrogate keys for dimension tables in a data warehouse. Although most dimensional data possesses a natural key or a business key in the source tables, Kimball and everyone else I’ve read strongly recommend using a surrogate key for dimension tables. Our prospects, for example, all have a unique ProspectID, but it is recommended that the ETL system generate a unique identifying key for each record. This surrogate key then becomes the foreign key in the fact tables.

You can read more about the advantages of using a surrogate key in Kimball’s The Data Warehouse Toolkit, 3rd Ed. starting on page 98, Corr’s Agile Data Warehouse Design starting on page 141, and Adamson’s Star Schema on page 30 and following. But in brief, some of the key benefits of using a surrogate key are:

  1. Isolates the data warehouse from any instability in the source OLTP system keys, where business keys may be deleted, recycled, or modified in ways that could be detrimental to the functioning of the data warehouse.
  2. Facilitates pulling dimensional attributes from disparate source systems, each of which has its own independent business keys.
  3. Supports slowly changing dimensions (SCD) where there can be multiple versions of any given entity in the dimension.

So the question becomes, how do you create and maintain these surrogate keys?

The Need: An SSIS Surrogate Key Generator

It would be possible to design the dimension table with an auto-incrementing identity key, but at least for performance reasons, that doesn’t seem to be the preferred mechanism. From Kimball starting on page 469: “For improved efficiency, consider having the ETL tool generate and maintain the surrogate keys. Avoid the temptation of concatenating the operational key of the source system and a date/time stamp.”  OK, so how do I accomplish that?

The Challenge: An SSIS Surrogate Key Generator

Given this and similar recommendations found on the web, and given the maturity of Microsoft’s SSIS ETL tools, I would expect this to be pretty straightforward. I would expect that I could just drag a Surrogate Key Creator transformation into my Data Flow and go on my merry way. But it’s not that straightforward. The tool doesn’t exist, and even in Knight’s Professional Microsoft SQL Server 2014 Integration Services the topic is not addressed as far as I can find, though I haven’t yet read it from stem to stern.

Given the complexity of some of the solutions I found on the web, it is really surprising to me that Microsoft hasn’t made this simpler to accomplish. But that seems to be where we are. With the help of some instructions on the web, I was eventually able to craft an ETL surrogate key generator independent of my data warehouse RDBMS. In the examples below I am creating a generator for my Expected Entrance Term dimension table.

Using a Script Component

I was able to create a working surrogate key (SK) generator by first referring to the concepts and script found in a 2011 post from Joost van Rossum’s SSIS blog. I’ll briefly outline the solution, since I found his instructions to be incomplete (he seems to have assumed a higher level of familiarity with the various components and thus omitted some details). After that, I’ll outline a simpler method which does not require the use of a script.

  1. Start by querying the destination dimension table for the last SK used. On the Control Flow tab in Microsoft Visual Studio, drag an Execute SQL Task to the canvas. Double click to edit.
  2. ResultSet set to “Single row”
  3. Define your OLE DB Connection
  4. Insert your SQL Statement (modify as needed):
SELECT MAX([TERM_KEY]) AS MaxSK
FROM SemesterTerms

SSIS sql task control flow
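One caveat worth noting: on the very first load the dimension table is empty and MAX() returns NULL rather than zero. A variant like the following (not from the original post, but using the same ISNULL pattern that appears later in this document) guards against that:

SELECT ISNULL(MAX([TERM_KEY]), 0) AS MaxSK
FROM SemesterTerms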

  5. On the Result Set tab, click the Add button to add a variable. Leave the Variable Name as “User::Counter” and rename the Result Name to whatever you prefer.

result set variable

  6. Click OK to save your changes
  7. Add a Data Flow Task, connect the Execute SQL Task to it, and double-click to edit
  8. In the Data Flow, add an OLE DB Source to the canvas and configure it to select the records you need to populate your dimension (from the Term table in my case)
  9. Connect that to a Script Component and click to edit it (choose “Transformation” for the type when you drop it on the canvas)
  10. Go to the Inputs and Outputs tab, click Add Column, and give it a Name and DataType (I used TERMKEY and four-byte signed integer [DT_I4]). Note: if you name it something other than “RowId”, you will need to modify the code below
  11. Make sure the ScriptLanguage property is set to C# and click the Edit Script… button to open the script editor
  12. Replace the placeholder code with the C# code from Rossum’s blog post (step 7) referenced above (make sure you select the C# code and not the VB.NET code). NOTE: If you did not name your new column “RowId”, you will need to modify row 59 in the sample code to reflect your column name
  13. Save the script, close that window, and then click OK to save and close the Script Component editor
  14. Add an OLE DB Destination and configure it to map the source columns plus the new key column to the appropriate destination columns. The data flow will look like this:

surrogate key generator with c# script

  15. Once you have saved this project, click the Start button to take it for a spin. On the first run you should see all of your dimension records added to the data warehouse with sequential keys starting at 1.
  16. Start the project a second time, and you will see that you now have twice the records, but the surrogate keys for the second load start where the previous load ended. In my case the first load ended with an SK of 280 and the second load started with 281.

Nice! Of course, in a production ETL system the second load would only include new rows (assigned the next higher keys) or changes, which would be handled differently based on the type of attribute that changed (are we tracking historical values, or just the current value?).

A Simpler Surrogate Key Generator

While attempting to fill in the gaps in the instructions I found for the script method above, I came across a simpler method on Sam Vanga’s blog that doesn’t require a script (scroll down to his “With ROW_NUMBER()” method). It uses the same Execute SQL Task, but consumes the result in the initial OLE DB Source transformation rather than in a Script Component. Here are the steps I took:

First, perform steps 1-7 from the previous method.

  1. In the Data Flow, add an OLE DB Source to the canvas and configure it to select the records you need to populate your dimension (from the Term table in my case). However, in addition to the query you used in the previous method, you will add another column to the SELECT to make use of the variable returned from the Execute SQL Task.
SELECT ROW_NUMBER() OVER(ORDER BY TermCalendarID) + ? AS TERM_KEY
,TextTerm AS TERM_DESCRIPTION
,Term AS TERM_CODE
,TermCalendarID AS TERM_ID
,TermStartDate AS TERM_START_DATE
,TermEndDate AS TERM_END_DATE
FROM SemesterTerms
  2. As in the query above, add the expression “ROW_NUMBER() OVER(ORDER BY <column_name>) + ? AS TERM_KEY”. Without the “+ ?”, this query orders the rows by the column indicated and then assigns sequential numbers, starting with 1, to each row. In my case, I had 280 rows in the result, numbered sequentially from 1 to 280.
  3. The “?” (question mark) is the parameter placeholder in the OLE DB Source for SSIS. After adding the question mark, click the Parameters… button to the right of the query window. NOTE: if you don’t have a “?” in your query, you will get the message “The SQL statement does not contain any parameters.”
  4. Select the variable you defined in step 5 (User::Counter in my case) and click OK to save.

SSIS OLE DB source editor parameters

What “ROW_NUMBER() OVER(ORDER BY <column_name>) + ?” does, then, is give each row a sequential row number and add to each one the variable, which holds the MAX surrogate key value of the destination dimension table. On the initial load, the TERM dimension table is empty, so 0 is added to each row number; the destination ends up with 280 rows, with TERM_KEYs numbered sequentially from 1 to 280. On the next load, each new row number will have 280 added to it. So if 5 new terms had been added to the source table, the first would be ROW_NUMBER 1 plus 280, resulting in a new surrogate key of 281, and so on.

  5. Add an OLE DB Destination and configure it to map the source columns plus the new key column to the appropriate destination columns. Click OK to save.

This simpler surrogate key generator populates the destination dimension just as the previous one did. I haven’t compared the performance of the two methods, but Sam Vanga claims that the ROW_NUMBER() method is twice as fast as the script option (though he uses a different script than I do).

Conclusion

Kimball specified that the surrogate key generator should “independently generate surrogate keys for every dimension; it should be independent of database instance and able to serve distributed clients… consider having the ETL tool generate and maintain the surrogate keys” (pp. 469-470). Do these two methods meet that specification? They certainly do not rely on database triggers or on an identity column. I’m not sure what he is thinking of with “distributed clients” (Can anyone provide an insight? Please leave a comment below.) Are these methods examples of the ETL tool “maintaining” the keys? Kimball seems to be saying the generator is a thing. When I started looking around for an SSIS key generator, I was surprised to find only methods. I’ll continue to look for other options, but for now, this seems to be the best option. Let me know your thoughts and experience.

 

read more:
https://diybidwguy.com/2017/10/20/using-ssis-generate-surrogate-keys/

3 Effective Ways of Generating Surrogate Keys With SSIS

A surrogate key is an auto-generated value, usually an integer, in a dimension table. It is made the primary key of the table and is used to join the dimension to a fact table. Among other benefits, surrogate keys allow you to maintain history in a dimension table. Despite their popularity, SSIS doesn’t have a built-in solution for generating surrogate keys. Let’s take a look at a few alternatives in this post.

First, create the following table. We will import data from the Person.Person table (AdventureWorks sample) into this table. Note: PersonSK is the surrogate key.

CREATE TABLE [dbo].[Person]
(
[PersonSK] INT IDENTITY(1,1) NOT NULL
, [FirstName] NVARCHAR(50) NULL
, [LastName] NVARCHAR(50) NULL
, CONSTRAINT PK_PersonSK PRIMARY KEY ([PersonSK])
) ;


1. With IDENTITY()

Drag a data flow task onto the control flow and configure the OLE DB source.

Next, drag an OLE DB destination and connect it to the source. Specify the connection manager, and choose Table or view – fast load as the data access mode. This performs a bulk insert and is the fastest of all.

The destination table has 3 columns, but the source has only 2. In the Mappings page, map the input and output columns for FirstName and LastName, and ignore the mapping for PersonSK.

When you run the package, because PersonSK is an identity column in the table, SQL Server will automatically generate the values for you. This solution is easy and fast, but sometimes, depending on your ETL methodology, you can’t rely on IDENTITY().


2. With Script Component

I frequently use the Script Transformation. The steps are nicely written up in this post by Phil Brammer. This is simply a script used as a transformation: the script generates a row number for each row passed through the data flow.


3. With ROW_NUMBER()

You can use ROW_NUMBER() when working with a SQL Server data source to let the database engine do the work for you. This can be faster.

If you’re doing an incremental load, first find the maximum key value in the destination. I’ll use the following query; it returns zero if there are no rows, otherwise it returns the maximum value.

SELECT  ISNULL(MAX(PersonSK),0) SK
FROM    dbo.Person

Add an Execute SQL Task to the control flow and set the ResultSet property to Single row. Then, add a variable to hold the return value.

Next, connect a data flow task to the Execute SQL Task. We will use the following SQL statement in the OLE DB source editor. In addition to the LastName and FirstName columns, we use the ROW_NUMBER() function to generate a unique number for every row.

SELECT
ROW_NUMBER() OVER(ORDER BY [LastName]) SK
, [LastName]
, [FirstName]
FROM
Person.Person

The query will generate numbers starting from 1 for each row, but when loading into the destination we don’t want to start from 1. We want to find the maximum value already in the destination and start from the next value. So, I’m adding the max value to every row number using parameters in the OLE DB source.
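The parameterised source query then looks something like the sketch below; the “?” placeholder is mapped to the SSIS variable that holds the maximum PersonSK, following the same pattern shown in the previous article. Note that the “?” is only meaningful inside the SSIS OLE DB Source editor, not in SSMS.

SELECT ROW_NUMBER() OVER(ORDER BY [LastName]) + ? AS SK   -- row number plus the current maximum key
     , [LastName]
     , [FirstName]
FROM   Person.Person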


In the OLE DB destination, check the box that says Keep Identity. By doing this we are asking SSIS to keep the identity values that are generated by the source. In the Mappings page, you’ll see the new column that was created in the OLE DB source; map it to the PersonSK surrogate key.

Go ahead and run the package. If it’s all good you will see cute little green tick marks like below.

Summary

In this post, we looked at different options for generating surrogate keys while loading dimensions with SSIS. We used the IDENTITY() property in SQL Server, we talked about the Script Component, and finally we saw how to make use of the ROW_NUMBER() function. The last option is about twice as fast as the Script Component with around 20,000 rows and an index on the LastName column.

 

SSIS 03: Deploy Packages with SSIS

What You Will Learn

The best way to become acquainted with the new tools, controls, and features available in Microsoft SQL Server Integration Services is to use them. This tutorial walks you through the steps to create an Integration Services project and then add the packages and other necessary files to the project. After the project is complete, you will create a deployment bundle, copy the bundle to the destination computer, and then install the packages on the destination computer.

 

Lessons in This Tutorial

Lesson 1: Preparing to Create the Deployment Bundle
In this lesson, you will get ready to deploy an ETL solution by creating a new Integration Services project and adding the packages and other required files to the project.

Lesson 2: Create the Deployment Bundle in SSIS
In this lesson, you will build a deployment utility and verify that the deployment bundle includes the necessary files.

Lesson 3: Install SSIS Packages
In this lesson, you will copy the deployment bundle to the target computer, install the packages, and then run the packages.


Requirements

This tutorial is intended for users who are already familiar with fundamental file system operations, but who have limited exposure to the new features available in SQL Server Integration Services. To better understand basic Integration Services concepts that you will put to use in this tutorial, you might find it useful to first complete the following Integration Services tutorial: SSIS How to Create an ETL Package.

Source computer. The computer on which you will create the deployment bundle must have the following components installed:

Destination computer. The computer to which you deploy packages must have the following components installed:

  • SQL Server
  • Sample data, completed packages, configurations, and a Readme. These files are installed together if you download the Adventure Works 2014 Sample Databases.
  • SQL Server Management Studio.
  • SQL Server Integration Services.
  • You must have permission to create and drop tables in AdventureWorks and to run packages in SQL Server Management Studio.
  • You must have read and write permission on the sysssispackages table in the msdb SQL Server system database.

If you plan to deploy packages to the same computer as the one on which you create the deployment bundle, that computer must meet requirements for both the source and destination computers.

Estimated time to complete this tutorial: 2 hours

 

SSIS 02: How to Create an ETL Package

In this tutorial, you learn how to use SSIS Designer to create a simple Microsoft SQL Server Integration Services package. The package that you create takes data from a flat file, reformats the data, and then inserts the reformatted data into a fact table. In following lessons, the package is expanded to demonstrate looping, package configurations, logging, and error flow.

When you install the sample data that the tutorial uses, you also install the completed versions of the packages that you create in each lesson of the tutorial. By using the completed packages, you can skip ahead and begin the tutorial at a later lesson if you like. If this is your first time working with packages or the new development environment, we recommend that you begin with Lesson 1.

Lessons in This Tutorial

Lesson 1: Create a Project and Basic Package with SSIS
In this lesson, you create a simple ETL package that extracts data from a single flat file, transforms the data using lookup transformations and finally loads the result into a fact table destination.

Lesson 2: Adding Looping with SSIS
In this lesson, you expand the package you created in Lesson 1 to take advantage of new looping features to extract multiple flat files into a single data flow process.

Lesson 3: Add Logging with SSIS
In this lesson, you expand the package you created in Lesson 2 to take advantage of new logging features.

Lesson 4: Add Error Flow Redirection with SSIS
In this lesson, you expand the package you created in lesson 3 to take advantage of new error output configurations.

Lesson 5: Add SSIS Package Configurations for the Package Deployment Model
In this lesson, you expand the package you created in Lesson 4 to take advantage of new package configuration options.

Lesson 6: Using Parameters with the Project Deployment Model in SSIS
In this lesson, you expand the package you created in Lesson 5 to take advantage of using new parameters with the project deployment model.


What You Learn

The best way to become acquainted with the new tools, controls, and features available in Microsoft SQL Server Integration Services is to use them. This tutorial walks you through SSIS Designer to create a simple ETL package that includes looping, configurations, error flow logic, and logging.

Requirements

This tutorial is intended for users familiar with fundamental database operations, but who have limited exposure to the new features available in SQL Server Integration Services.

To use this tutorial, your system must have the following components installed:

  • SQL Server with the AdventureWorksDW2012 database. To download the AdventureWorksDW2012 database, download AdventureWorksDW2012.bak from AdventureWorks sample databases and restore the backup.
  • Sample data. The sample data is included with the SSIS lesson packages. To download the sample data and the lesson packages as a Zip file, click here.

SSIS 01: SQL Server Integration Services Tutorials

This section contains tutorials for Integration Services.

Microsoft Integration Services is a platform for building enterprise-level data integration and data transformation solutions. You use Integration Services to solve complex business problems by copying or downloading files, sending e-mail messages in response to events, updating data warehouses, cleaning and mining data, and managing SQL Server objects and data. Packages can work alone or in concert with other packages to address complex business needs. Integration Services can extract and transform data from a wide variety of sources, such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations.

Integration Services includes a rich set of built-in tasks and transformations; tools for constructing packages; and the Integration Services service for running and managing packages. You can use the graphical Integration Services tools to create solutions without writing a single line of code; or you can program the extensive Integration Services object model to create packages programmatically and code custom tasks and other package objects.
