Mastering Big Data and Data Warehousing

ETL Process in Data Warehouse- Description and Examples

Introduction

Relational Database Management System (RDBMS) offers huge advantages over the traditional File Management System such as faster data access, multiple levels of abstraction, and high-end security mechanism. Another advantage of RDBMS is it supports the concept of a Data Warehouse which consists of ETL solutions. In this article, we  will learn the ETL full form and its description with examples.

But, before looking into the ETL Process in Data Warehouse, let’s take a look at what a Data Warehouse is. Generally, A warehouse simply means a place where all the materials and products are stored for sales and distribution. Similarly, a Data Warehouse is a centralized repository where all the Enterprise Data is stored for analysis and gaining useful insights.

In other words, it can be said that a Data Warehouse is a ‘Centralized RDBMS system that is queried for making some strategic business decisions. The Data Warehouse extracts data from various sources, systematically organizes it and then analyzes it to gain some information that is further used in business decision-making.

Data Warehouse contains an ETL System which is used to fetch the data from various resources. Now, we will look into the ETL full form with examples. ETL stands for Extraction Transformation and Loading. The fetching of the data by a Data warehouse is not done in a single step. Instead, it is a three-step process using ETL. Following Diagram shows the different steps in the ETL Process in Warehousing.

The three words in the ETL full form are the three steps through which the warehouse fetches the data from different resources. Therefore, ETL is the mechanism through which the data from various systems is brought into the Data Warehouse. The three steps of ETL are discussed as follows:

Extraction

Extract means Extraction of the Data. In this step, data is extracted from various sources and systems. The data extracted may have different file formats such as NoSQL, CRM or ERP server files, Flat Files, XML, CSV, etc. The data in the form of files are brought into a place called the staging area.

You might be wondering why data from different sources is not directly brought into the data warehouse. The reason behind this is the data may be corrupted which can cause damage to the warehouse. As recovering from warehouse damage will be difficult, data is loaded into the staging area.

Transformation

This is an important step in the ETL Process in which the data is transformed into a single specific format. This is done through a series of methods that are applied to the data. The end goal of this step is to transform the operational source format of data into a particular data warehouse format. Also, it aims to clean the data to improve the data quality and establish consistency.

The methods which can be applied include the filtering, cleaning, joining, splitting, and sorting of the data. For example, you may need to filter the data in some specific range or fill the null values with a default value (data cleaning).

Loading

This is the last step in the ETL process in which the data which has been transformed into a single data warehouse format is loaded into the target warehouse. The frequency of loading the data may vary as per the requirements. The entire Loading process in ETL is carried out as follows:

  1. Initial Load: All the Data Warehouse Tables are loaded into the warehouse.
  2. Incremental Load: Periodic Loading of the incremental data changes into the warehouse.
  3. Full Refresh: Erasing all the data of the table and then replacing it with fresh data in the warehouse.

Thus, ETL full form encapsulates various facets of data processing and loading techniques which form an important component of a Data Warehouse. In the current scenario, various ETL tools such as Oracle Data Integrator and IBM Infosphere Information Server are present to perform an efficient ETL Process in the Data Warehouse. However, ETL tasks require specialized skills to monitor performance and give accurate results.