What Are Data Lakes?
In business intelligence, we continuously collect generated data: streaming data, engagement data, batch data, and any other internal or external data. We collect all of this raw data to get powerful insights. These insights help us understand the business, develop more intelligent applications, and make informed decisions. But raw data is not yet data we can use. To get insights from it, we need to run processes such as data cleansing, feature extraction, data preprocessing, data transformation, and so forth. Our main goal is to provide this data to a dashboarding system or a machine learning model so that business executives can see how the business is running and what steps are necessary to improve overall performance. We divide this process into four steps:
- Collect
- Organize
- Analyze
- Infuse
The Analyze and Infuse steps are done mostly by a dashboarding system or an ML model.
As data engineers, let’s focus on the first three stages. In previous blogs we discussed one method of collecting and organizing data: “data warehousing”. But what if my organization wants a data management system that can accommodate raw data of any form, without any structure, and hold it until my data scientists are ready to experiment? The answer to this question is a “data lake”.
Most data management systems get overwhelmed by constantly structuring incoming data. With the explosion in the use of IoT devices, data sources are becoming more ad hoc and we are receiving a multitude of data. To modernize their data management systems, organizations are adopting the data lake model.
What is a data lake?
A data lake is a hybrid data management solution: a centralized repository that securely stores structured, semi-structured, and unstructured data, regardless of volume and format, with virtually unlimited capacity to scale. Data lakes are different because they store data as-is, which removes the complexity of ingesting and storing all of the data. A data lake follows a flat architecture, and the lake often resides in a Hadoop system.
To understand data lakes more clearly, we need to see how they differ from data warehouses.
A data warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, so the data warehouse is said to be a “schema on write” system. Defining and maintaining that schema up front is time-consuming and complex.
On the other hand, a data lake stores relational data from line-of-business applications alongside unstructured and raw data, without a purpose defined at the time of storage. This data isn’t transformed until it is needed for analysis; a schema is applied at that point so the data can be analyzed. This is called a “schema on read” system.
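The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the events, field names, and the `sales` table are made up, SQLite stands in for a warehouse, and a plain list of JSON strings stands in for files in a lake.

```python
import json
import sqlite3

# Hypothetical raw events; field names are invented for illustration.
RAW_EVENTS = [
    '{"customer": "ana", "amount": "19.99", "device": "mobile"}',
    '{"customer": "raj", "amount": "5"}',
    '{"customer": "li", "clicks": 3}',
]

def load_schema_on_write(lines):
    """Warehouse style: the table schema exists before any row is written."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = 0
    for line in lines:
        record = json.loads(line)
        try:
            conn.execute(
                "INSERT INTO sales (customer, amount) VALUES (?, ?)",
                (record.get("customer"), record.get("amount")),
            )
        except sqlite3.IntegrityError:
            rejected += 1  # the row does not fit the predefined schema
    stored = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    conn.close()
    return stored, rejected

def read_with_schema(lines, schema):
    """Lake style: raw lines stay as-is; a schema is applied only when reading."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows

if __name__ == "__main__":
    print(load_schema_on_write(RAW_EVENTS))
    print(read_with_schema(RAW_EVENTS, {"customer": str, "amount": float}))
    print(read_with_schema(RAW_EVENTS, {"customer": str, "clicks": int}))
```

In the schema-on-write path the third event is rejected at load time because it has no `amount`. In the schema-on-read path all three raw events are retained, and two different schemas produce two different views of the same stored data.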
Data lakes and data warehouses are both repositories for big data, but each is optimized for different uses. Consider them complementary rather than competing tools; many companies need both.
What are the benefits of a data lake?
Scalability: Data lakes offer massive scalability, up to the exabyte scale. This is important because when creating a data lake you generally don’t know in advance the volume of data it will need to hold. Traditional data storage systems can’t scale in this way. Data lakes are often based on the Hadoop framework, which supports the distributed processing of huge data sets across clusters of machines using simple programming models. It scales from a single server to thousands, offering local computation and storage at each node. Hadoop supports huge clusters while keeping the price per execution roughly constant regardless of scale. To accommodate more data, one just has to add more nodes to the cluster.
High-velocity data: A data lake uses tools like Kafka, Flume, Scribe, and Chukwa to acquire high-velocity data and queue it efficiently. That data can then be integrated with large volumes of historical data.
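The queue-then-integrate idea can be sketched as follows. This is a minimal illustration, not how Kafka or Flume work internally: Python's in-memory `queue.Queue` stands in for the ingestion tool, and the sensor events and historical counts are hypothetical.

```python
import queue
from collections import Counter

def ingest(events, buffer):
    """Producer side: push each incoming event onto the buffer untransformed."""
    for event in events:
        buffer.put(event)

def drain(buffer, batch_size):
    """Consumer side: pull up to batch_size events off the buffer in arrival order."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break  # nothing left to read right now
    return batch

def merge_with_history(historical_counts, batch):
    """Combine a fresh batch with historical per-sensor event counts."""
    merged = Counter(historical_counts)
    merged.update(event["sensor"] for event in batch)
    return merged

if __name__ == "__main__":
    buffer = queue.Queue()  # stand-in for a real message queue
    ingest(
        [
            {"sensor": "a", "value": 1},
            {"sensor": "b", "value": 2},
            {"sensor": "a", "value": 3},
        ],
        buffer,
    )
    history = {"a": 10, "b": 4}  # hypothetical historical totals
    first_batch = drain(buffer, batch_size=2)
    print(merge_with_history(history, first_batch))
```

The buffer decouples the speed at which events arrive from the speed at which they are processed, which is the role a queueing tool plays in front of the lake.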
Structure: The data lake presents a unique arena where structure such as metadata, speech tagging, etc. can be applied to varied datasets in the same storage while keeping their intrinsic detail. This enables the combined data to be processed for advanced analytics.
So data lakes not only simplify data management but also speed up analytics and help organize and govern data more accurately.
Like everything, even this data management marvel has some drawbacks. Raw data is stored with no oversight of its contents, so we need to make sure there are mechanisms to catalog and secure the data. Without these, data cannot be found or trusted, resulting in a “data swamp.”
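To make the cataloging and securing idea concrete, here is a small sketch of what such a mechanism might track. The `Catalog` and `CatalogEntry` classes, the paths, and the access rule are all hypothetical and only meant to illustrate the concept; a real deployment would use a dedicated catalog and access-control service.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    path: str                 # where the raw data lives in the lake
    owner: str                # who is accountable for it
    description: str          # what the data contains
    tags: set = field(default_factory=set)
    restricted: bool = False  # True if only privileged users may read it

class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Refuse entries that nobody owns or that nobody has described."""
        if not entry.owner or not entry.description:
            raise ValueError(f"{entry.path}: owner and description are required")
        self._entries[entry.path] = entry

    def search(self, tag):
        """Find datasets by tag, so stored data can actually be discovered."""
        return sorted(e.path for e in self._entries.values() if tag in e.tags)

    def uncataloged(self, stored_paths):
        """Paths present in storage but missing from the catalog (swamp candidates)."""
        return sorted(set(stored_paths) - set(self._entries))

    def can_read(self, path, user_is_privileged):
        """Very simple access rule: restricted data needs a privileged user."""
        entry = self._entries[path]
        return user_is_privileged or not entry.restricted

if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(CatalogEntry(
        "raw/clickstream/2024", "web-team", "Site click events", {"web", "events"}))
    catalog.register(CatalogEntry(
        "raw/payments/2024", "finance", "Card payments", {"finance"}, restricted=True))
    print(catalog.search("events"))
    print(catalog.uncataloged(
        ["raw/clickstream/2024", "raw/payments/2024", "raw/tmp_dump"]))
    print(catalog.can_read("raw/payments/2024", user_is_privileged=False))
```

The `uncataloged` check is the interesting part: anything sitting in storage without an owner, description, or tags is exactly the kind of data that cannot be found or trusted later.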
In the end, we can say that a data lake might not be a complete shift but rather an additional method that aids existing approaches such as big data and data warehouses, mining scattered data across a multitude of sources and opening a gateway to new insights.