Microsoft SQL Server Integration Services (SSIS) is a fantastic tool for managing how data is extracted, transformed and loaded (ETL) into your database systems. In this article I want to suggest some concepts and techniques to apply in constructing a framework to support your ETL.
The design of a Data Warehouse (DW) is a large topic and there are many different approaches that can be taken. If you are unfamiliar with DW concepts, then you will need to read up on the subject. The Kimball Group http://www.kimballgroup.com is a great place to start looking for technical information, and purchasing a book such as 'The Data Warehouse Toolkit, Third Edition' by Ralph Kimball and Margy Ross (published by Wiley) is a worthwhile investment.
Reasons to Build a Data Warehouse
In brief, below are a few of the many reasons why people construct a DW rather than working directly with the source data in OLTP systems:-
- Business users can find it very difficult to piece together all the links in a complex normalized relational database schema; there is a need for a simplified structure
- Tables and fields often don't have familiar business-oriented names and structures; alternative naming and re-structuring can add clarity to the business information
- OLTP systems are focused on showing operational state, but often do not represent the bigger picture of how data changed over time; analysis which presents an accurate picture over time is important to the business user
- Operational systems may suffer performance issues when extensive analysis is undertaken against a live system or large volumes of data; there is a need to perform analysis on large volumes of data
Below is a list of some of the technical terms which we will use in these articles:-
- Dimensional analysis; the process of restructuring data into a form more suitable for business users to consume
- Dimension; contains attributes which are linked to the facts or possibly other dimensions; dimensions operate like lookup lists and mainly contain non-numeric information
- Fact or Measure; contains numeric values that you measure and aggregate, often linked to several dimensions
- Dimension types; there are different types of dimensions, some can have slowly changing data values; the dimension captures these changes over time and supports accurate reporting of data over time
- Grain of a fact; this could be a transaction, a daily price, or a combination of other critical values; the choice of grain is important
- Star schema; the aim of the design is to construct a very simple data model, with a fact in the centre and relationships linked to the dimensions; it looks like a star
- Snowflake Schema; where we have dimensions linked to other dimensions; to be avoided when possible
Now that's a very brief description of some technical terms!
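To make the star schema concrete, here is a minimal sketch in T-SQL; all table and column names are hypothetical, and a real design would have many more attributes:

```sql
-- Hypothetical star schema: one fact table surrounded by dimensions.
CREATE TABLE dbo.DimDate (
    DateKey       INT           NOT NULL PRIMARY KEY,  -- e.g. 20240131
    CalendarDate  DATE          NOT NULL,
    CalendarYear  INT           NOT NULL
);

CREATE TABLE dbo.DimProduct (
    ProductKey    INT IDENTITY  NOT NULL PRIMARY KEY,  -- surrogate key
    ProductCode   NVARCHAR(20)  NOT NULL,              -- business key
    ProductName   NVARCHAR(100) NOT NULL
);

-- The fact holds the numeric measures at a declared grain:
-- here, one row per product per day.
CREATE TABLE dbo.FactSales (
    DateKey       INT            NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey    INT            NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    QuantitySold  INT            NOT NULL,
    SalesAmount   DECIMAL(18, 2) NOT NULL
);
```

Business users can then answer questions by joining the central fact to whichever dimensions they need, without navigating a normalized OLTP schema.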
Data Warehouse Layers
In Figure 1, on the left we have the live system. Data from the live system will be periodically uploaded into a staging area in the DW; the pipe from the live system could be a direct database link or a mechanism for importing data from CSV files. Data can often come from multiple systems (consolidating and relating data from different systems can be a significant challenge).
Populating the Staging layer is where you need to allow for data arriving in different formats; for a DW linked to a live system, the staging tables will be an almost identical copy of the live tables.
Figure 1: Data Warehouse Layers
The second layer of the DW is called History or Enterprise; this is where we accrue historical changes in data. Normally the tables are very similar in structure to the staging layer but have additional fields for recording how data has changed over time. This is technically a more complex layer of the DW, because we need to efficiently record how data changes over time. The history layer can be designed to record new rows only when the data has changed (sometimes referred to as recording the delta in the data), rather than making a fresh copy of all the data each time it is loaded.
It is important in the Staging layer that we can identify a unique key or combination of fields which can then be used to check whether the data row is already present in the history layer.
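As a sketch of this delta detection, assume a staging table with business key OrderID and a history table that stores a checksum of each row's non-key values (all names here are hypothetical; a production version would typically compare against only the latest history version of each key):

```sql
-- Insert a history row only when the staging row is new,
-- or its checksum differs from what history already holds.
INSERT INTO hist.Orders (OrderID, CustomerID, Amount, RowCheckSum, ValidFrom)
SELECT s.OrderID, s.CustomerID, s.Amount,
       HASHBYTES('SHA2_256', CONCAT(s.CustomerID, '|', s.Amount)),
       SYSUTCDATETIME()
FROM stg.Orders AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM hist.Orders AS h
    WHERE h.OrderID = s.OrderID
      AND h.RowCheckSum = HASHBYTES('SHA2_256', CONCAT(s.CustomerID, '|', s.Amount))
);
```

The separator in the CONCAT guards against two different column values concatenating to the same string.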
The final layer (of which there may be more than one) is called the Data Mart or Transformation layer; this is where we restructure our data into a form which users can consume. It usually has a simple star structure: facts in the centre surrounded by dimensions. If you are familiar with SQL Server Analysis Services (SSAS), then you can think of this layer as having the data structured for the construction of a cube.
The term Transformation (the T in ETL) can apply at many stages in the DW if you consider transformation to include, for example, changing a data type or concatenating columns, but I prefer to think of this process as where we transform structure, morphing our relational database structures into the star topology of the data marts.
Because building a DW can involve a significant amount of processing, it is a good idea to have a framework which can record and monitor the processing; this could involve recording row counts, tracking errors, reporting progress through the layers and facilitating recovery when data fails to load.
Here are a few general tips:-
- A framework has the idea of executing a cycle, daily or hourly; give each cycle a unique identifier which is then carried through the framework tagging all the data; this simplifies debugging, because you can always tell which load the data in a particular layer came from
- Use schemas to separate loading data from different systems or loading data into different sub-systems or layers
- Design the framework so if a process fails, you can restart processing from the point of failure without needing to re-run other parts of the framework; speeding up development
- Design the framework to partition work into the Staging, Enterprise and Transformation layers (and when dealing with multiple systems supplying data, break down the areas by the different systems); this provides fine control over processing in subsequent layers, for example Transformation layer processing can often only be undertaken after Enterprise is completed for all systems, whilst Enterprise processing can be undertaken as data from each source system successfully completes the staging process
- Record the starting row count in each table, the ending row count, the number of inserts, updates and deletes, and the time taken for each processing step; this all helps in understanding how data volumes will grow and the detailed nature of how data is changing
- If using dynamic SQL for Insert/Update/Delete operations, then save the generated SQL in the framework; this simplifies debugging
- If computing dynamic CheckSum calculations on rows of data, then record the SQL used for the calculations in the framework; this simplifies debugging
- Configure SSIS logging to record any warnings or errors generated by the packages, aim to have your packages execute without generating any warnings
SSIS Techniques for Processing Data
There are three common methods to process ETL data using SSIS:-
- Use the built-in data flows; this is the simplest method, SSIS will automatically check that your meta-data (table structures) on the input and output match, and it is very easy to obtain record counts and record them in the framework; I like this because it is very simple and efficient
- Use C# or VB code to load data; this often appeals to people when they don't want to maintain a large number of packages and can construct a generic approach to loading the data; it can work very well with suitable error handling. I like this when I have to deal with non-standard input formats and need to skip through the input file headers and footers and perform more complex processing to load the data
- Use stored procedures executed from SSIS; this will appeal to those with TSQL skills, and you can easily feed record counts back to the framework. I like to use this when moving data from staging to enterprise, where the table structures will be very similar. If using this technique I will write my own exception reports to check for meta-data changes between the staging and enterprise layers
Whichever technique you use, make sure to have good error handling and record all the processing statistics in the framework.
In Figure 2, we show the basic framework components:-
- Processing Run; every time the framework executes it creates a record which uniquely identifies the daily or hourly cycle
- Processing Step; a master list of all processing packages/steps; flags in the list can be used to enable or disable processing steps
- Schedule Processing; this is the logic which can be split between packages and stored procedures and co-ordinates both the scheduling of work and the progress of work through the framework
- Processing Worklist; when a run commences, and as it progresses, processing steps are copied into the worklist, which records the packages and steps scheduled for processing
- Run Progress; as the scheduled packages/steps in the Processing Worklist start and complete execution, information is recorded on the progress of the SSIS package processing
Figure 2: basic framework components
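A minimal sketch of the framework tables behind Figure 2; the schema, table and column names are illustrative rather than a prescribed design:

```sql
CREATE TABLE fw.ProcessingRun (       -- one row per daily/hourly cycle
    RunID       INT IDENTITY NOT NULL PRIMARY KEY,
    StartedAt   DATETIME2    NOT NULL,
    CompletedAt DATETIME2    NULL
);

CREATE TABLE fw.ProcessingStep (      -- master list of packages/steps
    StepID      INT IDENTITY  NOT NULL PRIMARY KEY,
    PackageName NVARCHAR(128) NOT NULL,
    StepName    NVARCHAR(128) NOT NULL,
    LayerName   NVARCHAR(50)  NOT NULL,   -- Staging / Enterprise / Transformation
    IsEnabled   BIT           NOT NULL DEFAULT 1  -- flag to disable a step
);

CREATE TABLE fw.ProcessingWorklist (  -- steps scheduled for a particular run
    RunID       INT NOT NULL REFERENCES fw.ProcessingRun (RunID),
    StepID      INT NOT NULL REFERENCES fw.ProcessingStep (StepID),
    StatusCode  NVARCHAR(20) NOT NULL DEFAULT 'Scheduled',
    PRIMARY KEY (RunID, StepID)
);
```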
In this framework, rather than having a single master job which runs the entire framework, we have a scheduled set of SQL Server Agent jobs, each designed to execute one or more packages/layers. So our framework is self-driven: as layers/systems complete execution, they schedule subsequent layers/systems for the next stages of processing.
This self-driven approach means that we design the packages so that when they are executed they check the framework to see if they are enabled for execution and have work to perform.
Figure 3: worklist processing
In Figure 3, we see an example of the worklist processing. The staging packages for system 1 have completed and have scheduled the enterprise packages for system 1 which are now executing. The system 2 staging packages are still processing.
Each block on the left in Figure 3 represents a SQL Server Agent job or job step; this means we need to make sure we allow enough time for the system 1 staging packages to complete before starting the system 1 enterprise packages. In our experience this need to allow execution time between the jobs has not proven to be a problem, and it avoids the need for complex job scheduling logic.
Each SSIS package will contain one or more processing steps (typically 7 – 15), for example in an ETL system we could have 10 packages allowing for around 150 processing steps.
The packages are designed to check the Run Progress and Processing Run to determine if the package is allowed to execute, and then process the appropriate steps.
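For example, a package's pre-execution check might run a query along these lines (fw.ProcessingWorklist, fw.ProcessingStep, fw.ProcessingRun and their columns are hypothetical names for the framework tables):

```sql
-- Does this package have enabled, scheduled work in the current run?
SELECT COUNT(*) AS WorkCount
FROM fw.ProcessingWorklist AS w
JOIN fw.ProcessingStep     AS s ON s.StepID = w.StepID
JOIN fw.ProcessingRun      AS r ON r.RunID  = w.RunID
WHERE r.CompletedAt IS NULL          -- the current, incomplete cycle
  AND s.PackageName = @PackageName   -- supplied by the executing package
  AND s.IsEnabled   = 1
  AND w.StatusCode  = 'Scheduled';
```

If WorkCount is zero the package simply records that it had nothing to do and exits.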
When completing execution, the package can check to see if the layer/system steps have all been completed and, if so, construct a set of worklist steps to be executed, allowing data to be processed by other packages into the next DW layer.
Using this technique we can easily update the data in the Processing Worklist to allow steps in the ETL to be re-run; this improves the speed of development when building the ETL.
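Re-running a failed step is then just a matter of resetting its worklist status so the next package execution picks it up again (table and column names hypothetical):

```sql
-- Reset a single failed step; the owning package will re-process it
-- the next time its SQL Server Agent job fires.
UPDATE fw.ProcessingWorklist
SET StatusCode = 'Scheduled'
WHERE RunID      = @RunID
  AND StepID     = @StepID
  AND StatusCode = 'Failed';
```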
The framework will have a number of supporting informational tables, examples of which are shown in Figure 4, such as status codes and a list of the processing layers and systems.
Figure 4: supporting tables of information
The tables of information shown in Figure 4 help when building exception reports for the framework.
Package Logging and Control
In Figure 5, we have an outline of the package logging and control:-
- Processing Run; supplies information to allow a package to determine if the DW layer has been enabled to execute
- Package Execution; each package will contain initial processing steps to determine and control how it executes, and final steps to record completion and schedule other processing package steps for execution
- Package Step Execution; when running, a package will execute the appropriate package steps
- SSIS Logging; SSIS allows for warnings and errors to be automatically logged
- Processing Worklist; the list that drives the processing steps in the package, this can also be used by the package to schedule steps to be executed in other packages
- Package Progress; a table which logs information on when a package starts and completes execution
- Step Progress; a table which logs each step within the package recording measurements on numbers of rows processed and processing time
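A sketch of how a step might write to such a progress table (fw.StepProgress and its columns are illustrative):

```sql
-- Called when the step starts; counts are filled in on completion.
INSERT INTO fw.StepProgress
    (RunID, StepID, StartedAt, RowsInserted, RowsUpdated, RowsDeleted)
VALUES
    (@RunID, @StepID, SYSUTCDATETIME(), 0, 0, 0);

-- Called when the step completes, recording counts and (implicitly,
-- via CompletedAt - StartedAt) the processing time.
UPDATE fw.StepProgress
SET CompletedAt  = SYSUTCDATETIME(),
    RowsInserted = @Inserted,
    RowsUpdated  = @Updated,
    RowsDeleted  = @Deleted
WHERE RunID = @RunID
  AND StepID = @StepID;
```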
Figure 5: package logging and control
The simplest way to construct an SSIS package with multiple steps is shown in Figure 6; here each of the processing blocks operates independently.
Figure 6: simple SSIS block of operations
We could add into these steps additional logic which reads from the Processing Worklist to determine if the step should be executed, and records execution in the Step Progress, as shown in Figure 7.
Figure 7: basic processing control added to processing steps
Because steps can take significant time to execute it is sometimes desirable when developing packages to limit the processing to a particular step (often the one which has failed and needs further design changes). Note that in Figure 7, the processing steps could execute in parallel.
In the next example, we place the bulk of our processing steps inside a container which will only undertake processing for items which are scheduled in the worklist and where processing has not been completed; adding a sequence container enables the package to be designed with pre-execution and post-execution operations.
In Figure 8, we show our general purpose template for constructing a package:-
- Pre-Execution Checks; this is where the package determines if it should execute, and then records that it has started
- Worklist Processing; in this section we execute each of the scheduled steps 1-3
- Steps 4-5; sometimes steps can only be executed after earlier steps are completed; in our example we have steps 4 and 5 which need to execute after steps 1-3
- Post-Execution Scheduling; this section records that the package has completed, it could also check and flag when an entire layer in the DW was completed and if required schedule other package steps for execution
Figure 8: general layout of package structure
Serial or Parallel Execution
In Figure 8, the large central box containing steps 1-3 would be a ForEach container, and the structure of the sequences (steps) can be constructed using one of two techniques.
In Figure 9, the sequences inside the ForEach container have a dummy task which allows the precedence constraints on each arrow to undertake the appropriate processing. The ForEach container would iterate over a recordset, loading the name of each step to process into a variable StepName.
The arrow leading to Step 1 would have a conditional test that the StepName == 'Step 1', and the arrow leading to Step 2 would have a conditional test that the StepName == 'Step 2'. The processing for this sequence would be serial, with Step 1 processing and then Step 2 processing etc.
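The recordset the ForEach container iterates over could simply be the scheduled steps for the package, for example (framework table and column names are hypothetical):

```sql
-- Result set loaded into an SSIS object variable; the ForEach loop
-- assigns each row's StepName to User::StepName, which the
-- precedence constraint expressions then test.
SELECT s.StepName
FROM fw.ProcessingWorklist AS w
JOIN fw.ProcessingStep     AS s ON s.StepID = w.StepID
WHERE w.RunID       = @RunID
  AND s.PackageName = @PackageName
  AND w.StatusCode  = 'Scheduled'
ORDER BY s.StepID;
```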
Figure 9: serial execution of steps
In Figure 10, the sequences inside the ForEach container still have the same precedence constraints, but because the steps are not linked together this will allow for parallel execution. This uses similar precedence constraint logic but multiple dummy tasks.
Figure 10: parallel execution of steps
The question of whether you want serial or parallel execution depends on a number of factors, such as the availability of multiple processors and the degree to which the processing steps need to access the same data, but in general a parallel approach will yield better performance.
Figure 11, demonstrates a general template for each step in a package:-
- Step1 Start; the step logs that execution has started
- Step1 Operations; the step completes a series of processing operations, recording record counts in variables
- Step1 Complete; the step records completion
- Step1 Operation Errors; the step records any failures
Figure 11: general layout of a package step
Reducing Processing Time
If you know that data has not changed in the OLTP system since it was last imported, then you can save time by not bringing in the unchanged data. In the OLTP system you would need to construct triggers on data tables (possibly slowly changing dimensional data) so that notifications are generated when any data is changed. The ETL framework can then decide, based on this notification, whether to schedule a particular table for processing.
Figure 12: schematic showing the polling process
In Figure 12, the OLTP system has triggers which write to a DWPolling table when dimensional data has changed. The ETL processing then reads the DWPolling information to decide when data needs to be checked for changes.
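A sketch of such a trigger on a hypothetical Customer table (the DWPolling table and its columns are also illustrative):

```sql
-- Records that dbo.Customer changed, so the next ETL cycle knows
-- to check that table for deltas.
CREATE TRIGGER trg_Customer_DWPolling
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.DWPolling (TableName, ChangedAt)
    VALUES ('dbo.Customer', SYSUTCDATETIME());
END;
```

Keep the trigger body this light; anything heavier adds latency to every OLTP write.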
Another useful feature of an OLTP system is recording, against each row in a table, the date/time when the record was created or last updated; you can then reduce the amount of data being processed by restricting processing to rows whose last updated date is later than the date when the data was last processed.
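Assuming each row carries a LastUpdated column and the framework stores the high-water mark of the previous successful load (names hypothetical), the staging extract can be restricted like this:

```sql
-- Extract only rows changed since the last successful load;
-- @LastProcessedAt is the high-water mark held in the framework.
SELECT o.OrderID, o.CustomerID, o.Amount, o.LastUpdated
FROM dbo.Orders AS o
WHERE o.LastUpdated > @LastProcessedAt;
```

After a successful load the framework records MAX(LastUpdated) from the extracted rows as the new high-water mark.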
In this first part of the article we have looked at the general concepts involved in building one type of SSIS framework; in future parts we will look at the detailed implementation of these ideas in a simplified example of a framework.