Why Data Modelling is Important in BI

My experience comes from working on reporting systems based on semantic models, also known as “self-service BI”, but the principles are the same ones we had in DSS (decision support systems), OLAP (on-line analytical processing), and the many other names we have seen over the years.

The basic idea is always the same: enable the user to navigate data without writing a new query every time. To do that, you describe the data model with metadata that enriches the simple collection of tables and relationships, so that a generic client tool can offer a customized experience, as if it were designed for a specific business need.
Creating a good data model specifically for business analysis is important because the goal is different from that of a database that runs a transactional application. If you want to record sales, you need a database that is optimized to collect data in a safe and accurate way. If you want to analyze sales, you need a data model that is optimized to query and aggregate data in volume. These two requirements differ by much more than a simple performance issue.

A normalized model for a transactional application can be challenging to query. How should a missing row or a null value be interpreted? A data model optimized for analysis is simpler by nature, because the data have been transformed so that they are unambiguous.

Power BI shares the same data modeling experience as Power Pivot and Analysis Services Tabular. The easiest way to create a data model is to get the data “as is” from a data source, define relationships, and then start designing reports. However, this approach raises several issues when you have too many tables, or different granularities in tables representing the same entity from different data sources. Power Query and the M language are here to solve these problems. If you are lucky enough to read data from a well-designed data mart that follows the star schema principles, you don’t need any transformation. But in all the other cases, you should probably massage your tables before creating your reports.

The question is: what is the right data model for the business analysis?
The first simple answer is: the model should be designed to answer business questions, and not to simply represent the data as they come from the data source.
Yes, this answer is not a real answer, because it does not provide any practical guidance.
A better hint is: one entity, one table. This is a “user-friendly” way to describe a star schema, where each set of attributes describing a business entity is collected in a table called “dimension”, and each set of numeric metrics describing events and sharing the same granularity is saved in a table called “fact table”. But “one entity, one table” is simpler to understand.

A product is a table. Category, color, price, manufacturer: these are attributes. There is no added value in creating many relationships to just describe attributes of the same entity. Just collapse all these attributes in the same Product table.
A customer is a table. Country, city, address, number of employees are just attributes of the customer. It is useless to create a table with the countries.
Unless the country is a business entity in your analysis. If you are doing demographic research, chances are that the country is a business entity, and not just an attribute of a customer.
An invoice is a table. Date of the invoice, shipping cost, products sold, quantities… all these attributes should be in a table with a conformed granularity. And when this is not possible, but only then, you start considering several tables at different granularities.
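
As a rough illustration of “one entity, one table”, here is a minimal star schema sketch. Every table and column name below is hypothetical and chosen only for this example; the data types and lengths are assumptions too.

```sql
-- Hypothetical star schema sketch; all names are illustrative.

-- Dimension: one row per product, attributes collapsed into one table.
CREATE TABLE Product (
    ProductKey   INT           NOT NULL PRIMARY KEY,
    ProductName  NVARCHAR(50)  NOT NULL,
    Category     NVARCHAR(50)  NOT NULL,
    Color        NVARCHAR(20)  NOT NULL,
    Manufacturer NVARCHAR(50)  NOT NULL,
    ListPrice    DECIMAL(10,2) NOT NULL
);

-- Dimension: one row per customer; country and city stay as attributes.
CREATE TABLE Customer (
    CustomerKey  INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    Country      NVARCHAR(50)  NOT NULL,
    City         NVARCHAR(50)  NOT NULL
);

-- Fact table: one row per invoice line, a single granularity.
CREATE TABLE Sales (
    InvoiceNumber INT           NOT NULL,
    InvoiceDate   DATE          NOT NULL,
    ProductKey    INT           NOT NULL REFERENCES Product (ProductKey),
    CustomerKey   INT           NOT NULL REFERENCES Customer (CustomerKey),
    Quantity      INT           NOT NULL,
    SalesAmount   DECIMAL(12,2) NOT NULL
);

-- A typical analytical query: one join per dimension, then aggregate.
SELECT   p.Category,
         c.Country,
         SUM(s.SalesAmount) AS TotalSales
FROM     Sales AS s
JOIN     Product  AS p ON p.ProductKey  = s.ProductKey
JOIN     Customer AS c ON c.CustomerKey = s.CustomerKey
GROUP BY p.Category, c.Country;
```

Because category and country are plain columns of their dimension tables, the query needs only one join per business entity.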

Design principles for data modelling

There is no right or wrong data model; just good or bad. Good data modelling is difficult, and the following design principles could be useful:

  • Be faithful to the specification of the requirements
  • Use common sense, and make assumptions only where the specification fails to explain
  • Avoid duplication and other redundant information
  • Follow the KISS principle (keep it simple)

Customer: A customer may not yet have booked for a safari, or may have booked for and participated in several safaris. A company or an individual person may be a customer – but only individual persons are booked into scheduled safari trip occurrences. The information to be stored about a customer is:

  • For a company customer: name, address (first line address, second line address, city, postcode, country) and contact name, email address and phone number. A contact may change over time
  • For a person customer: first name, last name, address (first line address, second line address, city, postcode, country), email address, phone number and date of birth
  • A customer’s preferences (whether company or person) for time slots during which they wish to travel must be retained in the database. A customer may be able to travel in several time slots during the year. Tane wishes to retain this information so that he can target only those able to participate with publicity for scheduled trips
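
One possible way to turn these requirements into tables is sketched below. This is only one design among several: it assumes a single customer table with a type flag, stores only the current contact (no contact history), and treats a time slot as a simple date range. All table names, column names, and lengths are hypothetical.

```sql
-- Hypothetical sketch; names, lengths and the single-table design are assumptions.
CREATE TABLE SafariCustomer (
    CustomerID   INT           NOT NULL PRIMARY KEY,
    CustomerType CHAR(1)       NOT NULL
        CHECK (CustomerType IN ('C', 'P')),  -- C = company, P = person
    -- Company-only columns
    CompanyName  NVARCHAR(100) NULL,
    ContactName  NVARCHAR(100) NULL,
    -- Person-only columns
    FirstName    NVARCHAR(50)  NULL,
    LastName     NVARCHAR(50)  NULL,
    DateOfBirth  DATE          NULL,
    -- Columns shared by both kinds of customer
    AddressLine1 NVARCHAR(100) NOT NULL,
    AddressLine2 NVARCHAR(100) NULL,
    City         NVARCHAR(50)  NOT NULL,
    Postcode     NVARCHAR(20)  NOT NULL,
    Country      NVARCHAR(50)  NOT NULL,
    Email        NVARCHAR(254) NOT NULL,
    Phone        NVARCHAR(30)  NOT NULL
);

-- A customer may be able to travel in several time slots, so the
-- preferences get their own table: one row per customer and slot.
CREATE TABLE CustomerTimeSlot (
    CustomerID INT  NOT NULL REFERENCES SafariCustomer (CustomerID),
    SlotStart  DATE NOT NULL,
    SlotEnd    DATE NOT NULL,
    PRIMARY KEY (CustomerID, SlotStart),
    CHECK (SlotEnd >= SlotStart)
);
```

Keeping the time slots in a separate table avoids repeating customer details for every slot, in line with the “avoid duplication” principle.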

SQL Server Data Types you Must Know

Why data types are important

  1. The data is stored in the database in a consistent and known format.
  2. Knowing the data type allows you to know which calculations and formulations you can use on the column.
  3. Data types affect storage. Some values take up more space when stored in one data type versus another.  An age, for example, takes less space stored as a small integer than as text.
  4. Data types affect performance. The less time the database has to spend inferring or converting values, the better.  It should not have to ask, “Is December 32, 2015 a date?”
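
To make the “December 32” point concrete, here is a small sketch using TRY_CONVERT, which is available from SQL Server 2012 onward. The column aliases are made up for the example.

```sql
-- TRY_CONVERT returns NULL instead of raising an error when the
-- conversion fails.
SELECT TRY_CONVERT(DATE, '2015-12-31') AS ValidDate,    -- 2015-12-31
       TRY_CONVERT(DATE, '2015-12-32') AS InvalidDate;  -- NULL
```

A column declared as a date type rejects the second value at insert time, so queries never have to ask the question again.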

Commonly used SQL Server Data Types

There are over thirty different data types you can choose from when defining columns.  Many of these are set up for very specific jobs, such as storing images, and others are more suitable for general use.

Here are the data types you’ll most frequently encounter in your everyday use of SQL:

  • INT
  • VARCHAR, NVARCHAR
  • DATETIME
  • DECIMAL, FLOAT
  • BIT

INT – Integer Data Type

The integer data type is used to store whole numbers.  Examples include -23, 0, 5, and 10045.  Whole numbers don’t include decimal places.  Since SQL Server uses a fixed number of bytes (four for INT) to represent an integer, there are maximum and minimum values which it can represent.  An INT datatype can store a value from -2,147,483,648 to 2,147,483,647.

Practical uses of the INT data type include using it to count values, store a person’s age, or use as an ID key to a table.

But INT wouldn’t be so good to keep track of a terabyte hard drive address space, as the INT data type only goes to 2 billion and we would need to track into the trillions.  For this you could use BIGINT.
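
Here is a quick sketch of what happens at the edge of the INT range. The variable name is just for illustration.

```sql
DECLARE @MaxInt INT = 2147483647;

-- Adding 1 while staying in INT would raise an arithmetic overflow error:
-- SELECT @MaxInt + 1;

-- Converting to BIGINT first leaves plenty of headroom.
SELECT CAST(@MaxInt AS BIGINT) + 1 AS NextValue;  -- 2147483648
```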

The INT data type can be used in calculations.  Since DaysToManufacture is defined as INT we can easily calculate hours by multiplying it by 24:

SELECT Name,
       DaysToManufacture,
       DaysToManufacture * 24 as HoursToManufacture
FROM   Production.Product

[Figure: Use of INT to perform calculations]

There are many operations and functions you can use with integers which we’ll cover once we dig into functions.

VARCHAR and NVARCHAR – Text Values

Both VARCHAR and NVARCHAR are used to store variable length text values.  “VARCHAR” stands for variable length character.

The number of characters to store in a VARCHAR or NVARCHAR is defined within the column.   For instance, in the column definition shown in the Object Explorer, the product name is defined to hold fifty characters.

[Figure: VARCHAR definition shown in SQL Server Management Studio]

What makes VARCHAR popular is that values less than fifty characters take less space.  Only enough space to hold the value is allocated.  This differs from the CHAR data type which always allocates the specified length, regardless of the length of the actual data stored.

A VARCHAR(n) column can store a maximum of 8,000 characters.  The NVARCHAR datatype is used to store Unicode text.  Since NVARCHAR stores each character in at least two bytes, twice the space of a single-byte VARCHAR character, an NVARCHAR(n) column can store a maximum of 4,000 characters.  The VARCHAR(MAX) and NVARCHAR(MAX) variants lift these limits.

The advantage NVARCHAR has over VARCHAR is that it can store Unicode characters.  This makes it handy for storing extended character sets, like the Kanji used in Japanese.

If your database was designed prior to SQL Server 2008 you’ll most likely encounter VARCHAR; however, more modern databases or those global in nature tend to use NVARCHAR.
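
To see the storage difference for yourself, you can compare the three types with DATALENGTH, which returns the number of bytes used by a value. The sample text and column aliases are made up for this sketch.

```sql
-- CHAR pads to the declared length, VARCHAR stores only what is needed,
-- and NVARCHAR uses two bytes per character for this text.
SELECT DATALENGTH(CAST('Chain' AS CHAR(50)))      AS CharBytes,      -- 50
       DATALENGTH(CAST('Chain' AS VARCHAR(50)))   AS VarcharBytes,   -- 5
       DATALENGTH(CAST(N'Chain' AS NVARCHAR(50))) AS NvarcharBytes;  -- 10
```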

DATETIME – Date and Time

The DATETIME data type is used to store the date and time.  An example of a DATETIME value is

1968-10-23 1:45:37.123

This is the value for October 23rd, 1968 at 1:45 AM.  Actually the time is more precise than that.  The time is really 1 hour, 45 minutes, and 37.123 seconds past midnight.

In many cases you just need to store the date.  In these cases, the time component is zeroed out.  Thus, November 5th, 1972 is

1972-11-05 00:00:00.000

A DATETIME can store dates from January 1, 1753, through December 31, 9999.  This makes the DATETIME good for recording dates in today’s world, but not so much in William Shakespeare’s.

As you get more familiar with the various SQL built-in functions you’ll be able to manipulate the data.  To give you a glimpse, we’ll use the YEAR function to count employees hired each year.  When given a DATETIME value, the YEAR function returns the year.

The query we’ll use is

SELECT   YEAR(HireDate),
         Count(*)
FROM     HumanResources.Employee
GROUP BY YEAR(HireDate)
ORDER BY YEAR(HireDate)

[Figure: Use YEAR on DATETIME data type]

The benefit is the DATETIME type ensures the values are valid dates.  Once this is assured, we’re able to use a slew of functions to calculate the number of days between dates, the month of a date and so on.
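
As a small taste of those functions, the sketch below reuses the two sample dates from this section. The variable names and column aliases are only for illustration.

```sql
DECLARE @Start DATETIME = '1968-10-23T01:45:37.123';
DECLARE @End   DATETIME = '1972-11-05T00:00:00.000';

SELECT MONTH(@Start)               AS StartMonth,      -- 10
       DATEDIFF(DAY, @Start, @End) AS DaysBetween,     -- day boundaries crossed
       DATEADD(DAY, 30, @Start)    AS ThirtyDaysLater; -- still a DATETIME
```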

We’ll explore these various functions in detail in another blog article.

DECIMAL and FLOAT – Decimal Points

As you may have guessed DECIMAL and FLOAT datatypes are used to work with decimal values such as 10.3.

I lumped DECIMAL and FLOAT into the same category, since they both can handle values with decimal points; however, they do so differently:

If you need precise values, such as when working with financial or accounting data, then use DECIMAL.  The reason is the DECIMAL datatype allows you to define the number of decimal places to maintain.

DECIMAL

DECIMAL data types are defined by precision and scale.  The precision determines the total number of digits to store, whereas the scale determines the number of digits to the right of the decimal point.

A DECIMAL datatype is specified as DECIMAL(precision,scale).

A DECIMAL datatype can be no more than 38 digits.  The precision and scale must adhere to the following relation

0 <= scale <= precision <= 38 digits

In the Production.Product table, the weight column’s datatype is defined as DECIMAL(8,2).  The first number is the precision, the second the scale.

Weight is defined to have eight total digits, two of them to the right of the decimal point.  We’ll use the following sample query to illustrate how this data type behaves.

SELECT   DISTINCT Weight
FROM     Production.Product
WHERE    Weight BETWEEN 29.00 and 189.00
ORDER BY Weight DESC

[Figure: Using DECIMAL data type to display results]
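
Here is a short sketch of how precision and scale play out with the same DECIMAL(8,2) definition. The literal values are made up for the example.

```sql
-- Values are rounded to the declared scale when converted.
SELECT CAST(123.456 AS DECIMAL(8,2)) AS RoundedToScale;  -- 123.46

-- DECIMAL(8,2) leaves six digits to the left of the decimal point,
-- so a seven-digit whole part does not fit:
-- SELECT CAST(1234567.89 AS DECIMAL(8,2));  -- arithmetic overflow error
```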

FLOAT

Where DECIMAL datatypes are great for exact numbers, FLOATs are really good for very large or very small numeric values.  Though a DECIMAL value can have 38 digits total, in many engineering and scientific applications this is inadequate.  For scientific applications where extreme numeric values are encountered, FLOAT rises to the top!

FLOATs have a range from -1.79E+308 to 1.79E+308.  That means the largest value can be 179 followed by 306 zeros (large indeed!).

Because of the way float data is stored in the computer (see the IEEE 754 floating point specification), the number stored is an extremely close approximation.  For many applications this is good enough.

Because of the approximate behavior, avoid using the <> and = operators on FLOAT columns in the WHERE clause.  Many a DBA has been burned by a statement such as:

WHERE mass = 2.5

Their expectations are dashed when mass is supposed to equal 2.5, but the value a calculation actually produced and stored is 2.499999999999999, which is not equal to 2.500000000000000!

That is the nature of floating points and computers.  You and I see 2.499999999999999 and think for practical purposes it is 2.5, but to the computer, we’re off just a bit.
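
A common workaround is to compare against a small tolerance instead of testing for exact equality. The sketch below uses the classic 0.1 + 0.2 case; the variable names and the tolerance of 1E-9 are arbitrary choices for illustration, and the right tolerance depends on your data.

```sql
DECLARE @a FLOAT = 0.1,
        @b FLOAT = 0.2;

-- Exact comparison: the sum is very close to 0.3 but not equal to it.
SELECT CASE WHEN @a + @b = CAST(0.3 AS FLOAT)
            THEN 'equal' ELSE 'not equal' END AS ExactComparison;      -- not equal

-- Tolerance comparison: treat values within a small distance as equal.
SELECT CASE WHEN ABS(@a + @b - CAST(0.3 AS FLOAT)) < 1E-9
            THEN 'equal' ELSE 'not equal' END AS ToleranceComparison;  -- equal
```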

BIT – Boolean or Yes/No values

There are times when you just need to store whether something “is” or “is not.”  For instance, whether an employee is active.  It is in these cases that the BIT datatype comes into its own.  This data type can be in one of three states: 1, 0, or NULL.

The value of 1 signifies TRUE and 0 FALSE.

In this query we’re listing all salaried position job titles

SELECT DISTINCT JobTitle
FROM   HumanResources.Employee
WHERE  SalariedFlag = 1

[Figure: Using the BIT data type in Searches]

We could have also used ‘True’ instead of 1.  Here is the same example using ‘True’

SELECT DISTINCT JobTitle
FROM   HumanResources.Employee
WHERE  SalariedFlag = 'True'

And the opposite using ‘False’

SELECT DISTINCT JobTitle
FROM   HumanResources.Employee
WHERE  SalariedFlag = 'False'

I tend to stick with 1 and 0, since it is easier to type, but if you’re going for readability, then ‘True’ and ‘False’ are good options.
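
The third state, NULL, is easy to overlook. Here is a minimal sketch using a made-up table variable to show how BIT conversions and NULL behave; the names are hypothetical.

```sql
-- Any nonzero number converts to 1, and the strings 'True' and 'False'
-- convert to 1 and 0.
SELECT CAST(5 AS BIT)      AS NonZero,   -- 1
       CAST(0 AS BIT)      AS Zero,      -- 0
       CAST('True' AS BIT) AS FromText;  -- 1

-- A NULL flag matches neither = 1 nor = 0; test it with IS NULL.
DECLARE @Flags TABLE (Flag BIT NULL);
INSERT INTO @Flags (Flag) VALUES (1), (0), (NULL);

SELECT COUNT(*) AS Ones     FROM @Flags WHERE Flag = 1;      -- 1
SELECT COUNT(*) AS Zeros    FROM @Flags WHERE Flag = 0;      -- 1
SELECT COUNT(*) AS NullRows FROM @Flags WHERE Flag IS NULL;  -- 1
```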

Read more:
https://www.essentialsql.com/commonly-used-sql-server-data-types/