Data Lake vs. Data Warehouse vs. Data Mart

Published on: Oct 19, 2022

We all agree there is a lot of value to be gained from analysing data. But how to do that is less obvious. By now, we all understand the need to gather truckloads of data, but where to store all that? How to choose between data lake vs. data warehouse vs. data mart?

All our clients focus on collecting data for business analysis, but they also struggle with a rock-solid understanding of why. We always hear the following essential questions: How should we store data? What is the best architecture solution in our case? What is the best solution, short-term and long-term?

Most of the said questions link back to the core “why” question. Why should we collect and store humongous amounts of data – what is the core business purpose?

To help our clients (and everybody else), we prepared this post discussing some core elements regarding storing and using data. We don’t claim to be super experts in the nitty-gritty definitions of the terminology, but we aim to provide a practical overview to help you clarify things.

The core question: What is the optimal data storage solution? The honest and accurate answer depends on your situation and what you aim to achieve.

Let’s jump into a practical overview of the different terms and terminologies without further ado.

Table of Contents

What’s a database?

As we always say, start as small as possible, so let’s agree on a fundamental element of any modern and smart industrial solution: the database.

“A database is a modeled collection of data that is accessible in many ways. A data model can be designed to integrate the operational data of the organisation. The data model abstracts the key entities involved in an action and their relationship.” (Data Analytics Made Accessible, 2021).

The data has to be stored and managed (create, update, delete etc), which is where the database management software systems come in – out of many, you may remember MySQL and Postgres, which are relational database management systems (RDBMS) with rigid rules around storage and organisation, or Cassandra and Azure Cosmos DB, which are non-relational databases mainly associated with unstructured data and more flexible data storage capabilities.

Basically, a database has a purpose: collect raw data about, for example, a business-related process (like Transactions) to process, organise, and manage it. Whenever a transaction occurs, it’s instantly captured and stored in the database, together with the needed information about that transaction (order number, name, date, amount etc).

You can then access the database to search for specific information and/or another software can retrieve the data from the database for further use.

What’s a data lake?

It’s exactly like the name says and like the image you may have in your mind right now: a massive pile (lake) of raw data, structured and unstructured, coming in from many different sources and stored without a specific purpose in mind over an extended period of time.

Basically, it’s all the data from the organisation in its original, unaltered form – a huge number of data points that aren’t yet organised to service a specific business need/decision, but rather act as fuel for other data analytics tools and techniques, like machine learning. You do, however, give every data point a unique identifier and a set of metadata tags.

There are several benefits of a data lake:

Because the data is raw and unstructured/unorganised, you can store massive amounts of data for cheaper.

It’s perfect for feeding a machine learning model with raw, real-time data.

By having such a broad range of data available, you can find new ways to use it and new business needs/questions/decisions/insights to use it for.

Because you don’t have to structure and organise it, the data lake is a scalable option – the raw data can keep coming in.

“But what’s the difference between a database and a data lake?”, you may ask. Good questions. Here are some, according to Zuar:

– a database only deals with one topic, while a data lake is the “data dump” from many databases and sources.

– a database stores mainly text and numerical data, while a data lake is happy to welcome any type of data.

– the database is just the first step of the ETL journey (the data will be exported to another process), while in a data lake the ETL process is done completely. ETL (Extract, Transform, Load) is a three-phase process where data is extracted, transformed (cleaned, scrubbed), and loaded into an output data container.

What’s a data warehouse?

A data warehouse is a clean set of data you need to make specific business decisions. It’s highly structured and organised to make it easier for the stakeholders (business analysts, people from Operations, managers etc) to analyse it and make targeted decisions based on it. You get a consolidated, integrated view of the entire organisation.

The name “warehouse” is actually accurate: in a warehouse, the contents come from a source as “raw”, and they then have to be processed and then organised into sections and onto shelves (the shelves are the data marts, but we’ll talk about that in a minute). You now have a much clearer idea of what data you have (it’s also stored historically) and how you can use it to answer important business questions that will affect most or all of the enterprise.

When the amount of data gets to be quite large, you can perform data mining to unravel more meaningful and deeper patterns and aggregate the data to increase the speed of analysis.

The data warehouse is, almost always, stored in the cloud and it will become the primary source of truth for strategic decisions within the enterprise, as well as for reporting and other types of analysis.

What’s a data mart?

If the data warehouse was a highly structured and organised database, wait ‘til you hear about the data mart.

The data mart is a much smaller, simpler, and even more focused set of data to serve a specific purpose. It’s a subset of a data warehouse and it contains summarised data about and for one line/department/area of business, therefore it’s used for particular, tactical decisions (whereas a data warehouse, which is bigger and more complex, is used for more abstract, enterprise-wide, strategic decisions).

Because it’s a smaller subset of data used for reporting and providing insights on a specific functional area, the time and cost, as well as the effort, to manage it are on the lower end of the spectrum compared to a data warehouse.

Which was the first: the data mart or the data warehouse?

Similar to the good ol’ “the chicken or the egg” story, you may wonder which comes first: the data mart or the data warehouse. The answer is: both, depending on your preference and goals.

There are two ways a data warehouse can be built: either a top-down approach (Inmon model) or a bottom-up approach (Kimball model).

Top-down approach

This approach entails building the DW first as a main, centralised, and integrated source of data for future data marts. The data marts will be created for very specific business needs using the data from the DW when needed to ensure integrity and consistency across the enterprise.

Astera: The Bill Inmon design approach uses the normalized form for building entity structure, avoiding data redundancy as much as possible. This results in clearly identifying business requirements and preventing any data update irregularities. Moreover, the advantage of this top-down approach in database design is that it is robust to business changes and contains a dimensional perspective of data across data mart.

Next, the physical model is constructed, which follows the normalized structure. This Bill Inmon model creates a single source of truth for the whole business. Data loading becomes less complex due to the normalized structure of the model. However, using this arrangement for querying is challenging as it includes numerous tables and links.

Data Lake vs. Data Warehouse vs. Data Mart — Source of image: Panoply

Third normal form (3NF) is a database schema design approach for relational databases which uses normalizing principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management.

An OLAP cube is a multi-dimensional array of data. Online analytical processing (OLAP) is a computer-based technique for analysing data to look for insights. The term cube here refers to a multi-dimensional dataset. For example, a company might wish to summarize financial data by product, by time period, and by city to compare actual and budget expenses. Product, time, city, and scenario (actual and budget) are the data’s dimensions.

The bottom-up approach

If you go with this approach, you’ll build the subject-specific data marts first for different parts of your business, THEN bring them all together and integrate them into a more comprehensive data warehouse. Thus, you start with the questions that need answers first, then compile the answers into a more detailed overview/story.

Astera: The primary data sources are evaluated, and an Extract, Transform and Load (ETL) tool is used to fetch data from several sources and load it into a staging area of the relational database server. Once data is uploaded to the data warehouse staging area, the next phase includes loading data into a dimensional data warehouse model that’s denormalized by nature. This model partitions data into the fact table, which is numeric transactional data, or dimension table, which is the reference information that supports facts.

Star schema is the fundamental element of the dimensional data warehouse model. The combination of a fact table with several dimensional tables is often called the star schema. Kimball dimensional modeling allows users to construct several star schemas to fulfill various reporting needs. The advantage of star schema is that small dimensional-table queries run instantaneously.

To integrate data, Kimball’s approach to DW lifecycle suggests the idea of conformed data dimensions. It exists as a basic dimension table shared across different fact tables (such as customer and product) within a data warehouse or as the same dimension tables in various Kimball data marts. This guarantees that a single data item is used in a similar manner across all the facts.

When to use each approach?

There are several factors going into this, like preference, resources, and, maybe the most important, the type of decision that has to be made based on the data.

In favour of the bottom-up approach (Kimball model) would be a specific department of an organisation, like marketing or finance, to carry out reporting that doesn’t require enterprise-wide data. Also, it’s generally a simpler, easier, and cheaper option if you’re willing to focus on business processes (for example, you need reporting just for a particular team) rather than the bigger image of the enterprise and if you don’t foresee many changes in the source systems.

A top-down approach, however, is used when you need centralised enterprise-wide data to get an overview of the entire organisation and make strategic decisions, like an expansion, for example. This is best if you have diverse reporting requirements and you need more flexibility in updating the data warehouse in the case of changing business requirements or data sources. However, this approach also requires more resources (manpower and skills) because of its increasing complexity (including in the build, setup, and delivery) over time.

At the end of the day, both approaches have their ups and downs, there’s no one-size-fits-all, and the decisions largely depend on the unique characteristics of your operation. There’s always the option of a hybrid data model as well, where you create a dimensional data warehouse model of a data warehouse using the Inmon method AND you use the Kimball method to develop data marts using the star schema.

Data Lake vs. Data Warehouse

One thing you should know is that you can use a data lake and a data warehouse together.

While the insights from the data warehouse, which is very structured and organised, are super valuable, the insights you can get from a data lake are a great addition to that. Why? Because given the fact that there’s so much raw data from different sources, and it’s not structured in a way that could enable a bias or a pre-made idea of what you should be looking for, unexpected and/or deeper insights can unravel. Therefore, the combination of the well-organised data of the DW and the raw, massive, and real-time data of the DL may create an incredible opportunity for deep and comprehensive analysis.

Data Lake vs. Data Warehouse vs. Data Mart

So, what’s the conclusion here? As always (and that annoys us every time), it depends. Based on the above table and info, you can assess your current situation and which type of database would fit your requirements and resources.