Amazing world of data : ETL Process

Extract :

This step is carried out in two sub steps. The first and foremost important sub step is data identification and remember whenever I say data we are talking in hundreds of kilobytes up to gigabytes.Sometimes it may happen that the specific subset of interest is not available so we need to extract more data than necessary. This brings to the second sub step that is Extraction,as the word suggests, during the data extraction step, data is copied or exported from source locations to a staging area. The data can come from virtually any structured or unstructured source — SQL or NoSQL servers, CRM and ERP systems, text and document files, emails, web pages, and more. We must note that the size of data is varying so the time required for the extraction process alone can range from minutes, hours to even days.

Transform :

Once the data is extracted, it is physically transported to a staging system ( Intermediate system) for further processing. In this staging area, the raw data is transformed so as to be more useful for analysis, and to fit the schema of the target data warehouse. During transformation the first step is data cleansing, in which data is filtered, de-duplicated, validated and it is then authenticated. Then on this clean data different rules, calculations and functions are applied, and at the last the data is formatted into tables or joined tables to match the schema of the target data warehouse.

Load :

In this last step, the transformed data is moved from the staging area into a target data warehouse. Typically, this involves an initial loading of all data, followed by periodic loading of incremental data changes, this particular process of loading only the changed data is called CDC or change data capture. CDC is a low impact and fast delivery mechanism which captures changed data based on table mapping and sends only the changed data to corresponding database or to the data warehouse. Another loading method is full refresh which erases all the data from the data warehouse and replaces it with the new data. CDC is the most commonly used strategy for loading as it reduces processing overhead and network traffic by sending only the data that has changed.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Abhishek Satish Gore

Abhishek Satish Gore


Freelancer content writer with a profound understanding of data engineering, big data, and cloud. drop a mail: for further details