Data lake, data mesh, or something else?


Zhamak Dehghani coined the term data mesh in 2019. Since then, the internet has been flooded with information about its advantages and all the problems it would solve. Many articles compare the data lake with the data mesh.

This article will not explain in great detail what a data mesh or a data lake is; instead, it looks at how you can adapt to these systems. We'll also dive into why organizations should put some abstraction on top of these data management systems so they can make the right decisions for their organization-specific scenarios.

If you are not already familiar with these terms, you can refer to the articles below before joining me back here.

1. Data Mesh Principles and Logical Architecture

(https://martinfowler.com/articles/data-mesh-principles.html)

2. Data Lake Principles and Logical Architecture

(https://www.guru99.com/data-lake-architecture.html)

3. Data Mesh is not a Data Lake

(https://www.linkedin.com/pulse/data-mesh-lake-jeffrey-t-pollock/)

If you have skipped the above three articles, then here is a high-level summary.

What is a data lake?

Data lakes are systems for organizing and managing data centrally. This data can be in any form: structured, semi-structured, or unstructured. Data lakes typically follow a schema-on-read approach. Going a little back in history, the data lake came into the picture because of the rise of big data, triggered by the advent of new data-generating sources and more users getting acquainted with the internet.
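To make schema-on-read concrete, here is a minimal sketch using PySpark. The bucket path and column names are hypothetical placeholders; the point is that raw data lands in the lake untyped, and a schema is applied only at read time.

```python
# A minimal schema-on-read sketch with PySpark.
# The bucket path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw events land in the lake as-is: no schema is enforced at write time.
raw = spark.read.json("s3://my-lake/raw/events/")  # hypothetical location

# A schema is applied only when the data is read for a specific use case.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])
typed = spark.read.schema(schema).json("s3://my-lake/raw/events/")
typed.show()
```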

What is a data mesh?

Data mesh is a system that focuses on decentralizing data and data ownership in an organization. This means anyone in an organization can use and manage data, as long as they have the proper credentials. Data mesh advocates that data can stay where it is, even when it lies in different databases. It focuses on serving this data as a data product that can be made accessible to all authorized stakeholders. Historically, data mesh came into the picture to overcome the challenges people faced with centralized data management systems.
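As a rough illustration of the data-as-a-product idea, here is a minimal Python sketch, with all names hypothetical: the owning domain team keeps the data wherever it already lives and serves it through an access-controlled contract.

```python
# A minimal sketch of a "data product" in the data mesh sense; all names
# here are hypothetical, not any real framework's API.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class DataProduct:
    name: str
    owner_team: str
    allowed_readers: set
    rows: list  # in reality this would stay in the domain's own store

    def serve(self, user: str) -> Iterator[dict]:
        """Serve consumable data, but only to authorized stakeholders."""
        if user not in self.allowed_readers:
            raise PermissionError(f"{user} is not authorized to read {self.name}")
        return iter(self.rows)

orders = DataProduct(
    name="sales.orders",
    owner_team="sales",
    allowed_readers={"analyst@acme"},
    rows=[{"order_id": 1, "amount": 42.0}],
)
print(list(orders.serve("analyst@acme")))  # [{'order_id': 1, 'amount': 42.0}]
```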

Why organizations should have an abstraction

Ever-changing design patterns

From a 30,000-foot view, data lakes and data meshes are design patterns. That is, they are ways of organizing and implementing data so that it is easier to access, manage, and maintain. They do not bring in any computational revolution. These patterns focus primarily on the ownership of data, its maintenance, and its distribution throughout a hierarchy.

The moment we consider these as design patterns, it becomes clear that there is no rule of thumb to them, and we cannot say one will fit all forever. These design patterns are bound to evolve. As with every new implementation in a production environment, fresh sets of challenges lead to further brainstorming and give birth to new design patterns.

The challenge of catching up with these design patterns is the threat of obsolescence. By the time you analyze a pattern, adapt it for your organization, migrate your data, and overcome all the challenges, a new, more efficient approach has already arrived.

Another challenge is functionality. With a new data management design, you don't know whether it will be more efficient for you than your existing design. Search around the internet and you'll find that some early adopters of data mesh have already started posting articles advising enterprises on how to avoid the common pitfalls of a data mesh setup.

This ultimately becomes our first reason to abstract the entire process. We cannot always spend a lot of development effort to come to a single conclusion. Instead, we can think of a system that gives us a quick leap into the future and helps us make informed decisions. The abstraction should be such that you can easily change your source and destination without any engineering effort on your core processing logic, as the sketch below illustrates.
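Here is a minimal Python sketch of such an abstraction; the class and method names are illustrative, not any real library's API. The core logic depends only on abstract Source and Sink interfaces, so swapping storage on either end requires no change to the logic itself.

```python
# A minimal source/sink abstraction: core logic never changes when storage
# moves (e.g., from a central lake to per-domain databases). All names are
# illustrative.
from abc import ABC, abstractmethod
from typing import Iterable

class Source(ABC):
    @abstractmethod
    def read(self) -> Iterable[dict]: ...

class Sink(ABC):
    @abstractmethod
    def write(self, records: Iterable[dict]) -> None: ...

def run_pipeline(source: Source, sink: Sink) -> None:
    """Core processing logic: untouched when the source or sink is swapped."""
    cleaned = ({k: v for k, v in r.items() if v is not None} for r in source.read())
    sink.write(cleaned)

class InMemorySource(Source):
    def __init__(self, rows): self.rows = rows
    def read(self): return iter(self.rows)

class PrintSink(Sink):
    def write(self, records):
        for r in records:
            print(r)

# Swapping InMemorySource for, say, a warehouse reader requires no change
# to run_pipeline itself.
run_pipeline(InMemorySource([{"id": 1, "note": None}]), PrintSink())
```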

Changes in ownership of data

When learning about the data lake and the data mesh, one of the crucial topics is data ownership. Thankfully, data mesh brings the perspective that the maintenance and ownership of data should sit with the team that understands that data best. Data owners should be responsible for serving that data in a consumable manner to end-users in the organization.

This ownership distribution creates a need for data owners to capture and maintain feedback from end-users. If the feedback calls for an enhancement, they need to take care of it and expose the newly enhanced data to the end-user again.

If individual owners start tracking this independently, they will duplicate a lot of effort brainstorming and building their own automation. And the story does not end there: what if, tomorrow, some new design pattern comes up with a unique view on data ownership?

In that case, we should abstract the access control part and develop data management systems that facilitate data access management, data discovery, data tagging, and feedback capture. This can help organizations better organize their data access hierarchy and create a sophisticated, efficient way to adopt new design patterns.
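As a rough sketch of what such an abstraction could look like, the Python below models those four concerns in one place. All names are hypothetical; a real system would back this with an actual catalog service.

```python
# A minimal sketch of an abstracted data-access layer covering access
# management, discovery, tagging, and feedback capture. Names are hypothetical.
from collections import defaultdict

class DataCatalog:
    def __init__(self):
        self.grants = defaultdict(set)     # dataset -> users allowed to read
        self.tags = defaultdict(set)       # dataset -> descriptive tags
        self.feedback = defaultdict(list)  # dataset -> end-user comments

    def grant(self, dataset: str, user: str) -> None:
        self.grants[dataset].add(user)

    def can_read(self, dataset: str, user: str) -> bool:
        return user in self.grants[dataset]

    def tag(self, dataset: str, *tags: str) -> None:
        self.tags[dataset].update(tags)

    def discover(self, tag: str) -> list:
        return [ds for ds, t in self.tags.items() if tag in t]

    def record_feedback(self, dataset: str, comment: str) -> None:
        self.feedback[dataset].append(comment)

catalog = DataCatalog()
catalog.grant("sales.orders", "analyst@acme")
catalog.tag("sales.orders", "pii", "finance")
catalog.record_feedback("sales.orders", "please add a currency column")
assert catalog.can_read("sales.orders", "analyst@acme")
print(catalog.discover("finance"))  # ['sales.orders']
```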

Shifts in the thinking process

If we try to make sense of current design patterns, there is a constant effort to move data closer to the end-user so that the end-user can query the data as needed. Again, take the data-as-a-product approach advocated by data mesh: it underlines a thought process inclined to abstract away the technical nitty-gritty of data transformations and serve consumable data.

However, what if we had an abstracted data management system where domain experts could create formulas or expressions based on their expertise, and engineers could serve the data by applying those formulas to the underlying data structure? It would make even more sense if these expressions were maintained in some central portal, so that anyone unaware of them could study and apply them to their own data. For example, suppose someone does not know how to calculate simple interest in the banking domain. Such a person could simply access the already created expression and use it on their data, as sketched below.
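Here is a minimal Python sketch of that idea, with all names illustrative: a domain expert registers a named expression once in a central registry, and anyone can look it up and apply it to their own rows.

```python
# A minimal sketch of a central expression registry; names are illustrative.
from typing import Callable

EXPRESSIONS: dict = {}

def register(name: str):
    """Let a domain expert publish a named formula to the shared registry."""
    def wrap(fn: Callable) -> Callable:
        EXPRESSIONS[name] = fn
        return fn
    return wrap

# A banking domain expert contributes the formula once...
@register("simple_interest")
def simple_interest(principal: float, rate: float, years: float) -> float:
    return principal * rate * years / 100

# ...and anyone unfamiliar with the domain can look it up and apply it.
rows = [{"principal": 10_000, "rate": 7.5, "years": 2}]
for row in rows:
    print(EXPRESSIONS["simple_interest"](**row))  # 1500.0
```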

There isn’t a one-size-fits-all solution

Both the data lake and the data mesh have their importance. One cannot conclude that a single way to organize and transform data will solve all problems. There are scenarios where a data lake is the apt choice, while a data mesh is the more sensible choice in other cases. There could even be a third scenario where both co-exist.

Organizing data centrally in a data lake makes sense when you need more performant queries, as it avoids the latency of connecting siloed data. At the same time, a data mesh suits organizations that store their data across multiple databases and do not want a dependency on a central team.

As the internet community grows, the speed of data generation is growing multi-fold, thereby introducing new forms of data. At present, data is categorized as structured, semi-structured, or unstructured; one never knows when something new will pop up.

Moreover, organizing data is a skill. No matter what groundwork you do, chances are you will learn it through iterations, because data and use cases will always differ from organization to organization.

Hence, we need abstract data management systems that help data scientists run through these iterations quickly and make them less cumbersome for you, the enterprise owner. This will help us make informed decisions based on our specific data and circumstances.

The future of data mesh and data lake

In the ever-changing space of technology and design patterns, organizations should invest in tools that enable them to migrate to and implement design patterns as required. As the world evolves, enterprises should not bind themselves to a single design pattern decided ages ago. Because it is challenging to keep pace with updated versions and new technologies, organizations should evaluate their needs and put conscious effort into the build-vs.-buy decision.

We at Spectra are already helping organizations overcome these challenges by starting clients off with smaller problem-resolution tools. If that makes sense and adds value to the organization, they can then adopt the tool more broadly for smooth sailing across different architectures and technology upgrades.

Spectra is a comprehensive enterprise DataOps (data ingestion, transformation, and preparation) platform built to manage complex, varied data pipelines using a low-code user interface. Its domain-specific features deliver remarkable data solutions at speed and scale. Maximize your ROI with faster time-to-market, faster time-to-value, and a reduced cost of ownership.

The advantage of employing an enterprise data management solution like Spectra is that you get the best of the learnings across implementations, reducing your optimization iterations. Moreover, if a new connector comes along, you don't have to take on the pain of analyzing and developing it. With Spectra, you can use it directly without writing a single line of code.

Author

Mahesh Jadhav

Technical architect for Spectra

Mahesh Jadhav is the technical architect for Spectra. He is an Oracle-certified Java professional with 10+ years of experience and a hands-on expert in Apache Spark, Kubernetes, big data, Spring Boot, and application development. Mahesh is actively involved in the technical design of the platform as well as in shaping strategies around the performance, security, and packaging aspects of the product. Tuning Apache Spark jobs is something he loves, and he has contributed immensely in this area. Building a data platform is something that challenges him.
