Amazing world of data: ETL Process

In this blog we are going to discuss a very interesting concept: how data coming from different sources is combined to form an integrated, time-variant, and non-volatile target system such as a data warehouse. We know by now that data must be loaded into the data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the data warehouse. The challenge in data warehouse environments is to integrate, rearrange, and consolidate large volumes of data from many systems, thereby providing a new, unified information base for business intelligence. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL.

ETL stands for Extract, Transform and Load. It is a traditionally accepted way for organizations to combine data from multiple systems into a single database, data store, data warehouse, or data lake. To be more accurate,

ETL describes the end-to-end process by which a company takes its full breadth of data — structured and unstructured and managed by any number of teams from anywhere in the world — and gets it to a state where it’s actually useful for business purposes.

So we can say that the ETL process plays a vital role in bringing all relevant data together and making it actionable, so that executives, managers, and other stakeholders can make analytically informed business decisions from it.

The ETL pipeline consists of three main steps; let's take a look at each one.

Extract:

This step is carried out in two sub-steps. The first and most important sub-step is data identification, and remember, whenever I say data we are talking about anywhere from hundreds of kilobytes up to gigabytes. Sometimes the specific subset of interest is not available on its own, so we need to extract more data than necessary. This brings us to the second sub-step, extraction: as the word suggests, during this step data is copied or exported from source locations to a staging area. The data can come from virtually any structured or unstructured source: SQL or NoSQL servers, CRM and ERP systems, text and document files, emails, web pages, and more. Note that because data sizes vary so widely, the extraction process alone can take anywhere from minutes to hours or even days.
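As a rough sketch of the extraction sub-step, here is a minimal Python example that copies rows from an operational database into a CSV staging file. The source table, column names, and staging file name are all hypothetical, chosen just for illustration.

```python
import csv
import sqlite3

# Hypothetical operational source: an in-memory database with an "orders" table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "globex", 75.5), (3, "acme", 19.99)],
)

# Extract: copy the subset of interest out of the source into a staging file.
rows = source.execute("SELECT id, customer, amount FROM orders").fetchall()
with open("staging_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer", "amount"])
    writer.writerows(rows)
```

The key point is that extraction only copies data out; no cleaning or reshaping happens yet, which keeps the load on the source system as light as possible.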

Transform:

Once the data is extracted, it is physically transported to a staging system (an intermediate system) for further processing. In this staging area, the raw data is transformed so that it is more useful for analysis and fits the schema of the target data warehouse. The first step of transformation is data cleansing, in which the data is filtered, de-duplicated, and validated. Then, various rules, calculations, and functions are applied to the clean data, and finally the data is formatted into tables or joined tables to match the schema of the target data warehouse.
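To make the cleansing and formatting steps concrete, here is a small Python sketch. The record fields and the target schema are made up for illustration; the point is the sequence of de-duplicating, validating, applying rules, and reshaping to the target schema.

```python
# Hypothetical staged records, as they might arrive from extraction.
staged = [
    {"id": 1, "customer": " Acme ", "amount": "120.0"},
    {"id": 2, "customer": "Globex", "amount": "75.5"},
    {"id": 2, "customer": "Globex", "amount": "75.5"},  # duplicate row
    {"id": 3, "customer": "Acme",   "amount": "bad"},   # fails validation
]

def transform(records):
    seen, clean = set(), []
    for r in records:
        if r["id"] in seen:               # de-duplicate on the key
            continue
        try:
            amount = float(r["amount"])   # validate and convert types
        except ValueError:
            continue                      # drop rows that fail validation
        seen.add(r["id"])
        # Apply rules and reshape to the target warehouse schema.
        clean.append({
            "order_id": r["id"],
            "customer": r["customer"].strip().lower(),
            "amount_usd": round(amount, 2),
        })
    return clean

print(transform(staged))
```

Here the invalid and duplicate rows are silently dropped; a real pipeline would usually log or quarantine them instead so that data quality issues remain visible.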

You may well ask: why don't we just do all of this during the extraction process? By performing the transformations on staged data, we avoid the risk of corrupting the source data; running all of these steps directly on the source system would also consume a lot of its resources.

Load:

In this last step, the transformed data is moved from the staging area into the target data warehouse. Typically, this involves an initial load of all the data, followed by periodic loading of incremental data changes; this process of loading only the changed data is called change data capture (CDC). CDC is a low-impact, fast-delivery mechanism that captures changed data based on table mappings and sends only the changed rows to the corresponding database or data warehouse. Another loading method is a full refresh, which erases all the data in the warehouse and replaces it with the new data. CDC is the most commonly used loading strategy because it reduces processing overhead and network traffic by sending only the data that has changed.
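As a minimal, illustrative sketch of incremental loading, the snippet below uses a simple timestamp watermark to pick out changed rows. The `updated_at` field and the in-memory `warehouse` dictionary are assumptions for this example, not the mechanism of any specific tool; real CDC systems often read the database's change log instead.

```python
warehouse = {}        # target store, keyed by primary key
last_loaded_at = 0    # watermark: highest change timestamp already loaded

def load_incremental(changes):
    """Apply only rows changed since the last load, then advance the watermark."""
    global last_loaded_at
    new_rows = [c for c in changes if c["updated_at"] > last_loaded_at]
    for row in new_rows:
        warehouse[row["id"]] = row    # upsert into the target
    if new_rows:
        last_loaded_at = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

# Initial load: everything is new relative to watermark 0.
load_incremental([
    {"id": 1, "customer": "acme",   "updated_at": 10},
    {"id": 2, "customer": "globex", "updated_at": 12},
])

# Later run: only the changed row (id 2) is picked up and applied.
load_incremental([
    {"id": 1, "customer": "acme",      "updated_at": 10},
    {"id": 2, "customer": "globex co", "updated_at": 20},
])
```

Contrast this with a full refresh, which would rewrite both rows on every run regardless of whether they changed.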

At this point you may be thinking that the ETL process can only be used to build data warehouses. You might be surprised to learn that this concept from the 1970s is still used in newer technologies like machine learning, AI, and IoT. In ML, the ETL process is used to move data into a single location for training and analysis. In IoT, we collect data from many devices: factory equipment, network servers, smartphones, and a broad range of other machines, even wearable and implanted devices. ETL helps move data from these many IoT sources to a single place where you can analyze it.

I hope that reading about the ETL process has sparked a new interest in the concept of the data warehouse; for that, you can read my previous two blogs.

If you are a data geek just like me and want to learn from the start, then my blog about data will be very useful to you.

Thank you for reading! Don't forget to follow so that you won't miss any amazing content.

Geek in the world of data, with interests in Data Analytics, Data Science, and Machine Learning.