Data warehouses and lakes will merge

October 21, 2022

My first prediction relates to the foundation of modern data systems: the storage layer. For decades, data warehouses and lakes have enabled companies to store (and sometimes process) large volumes of operational and analytical data. While a warehouse stores data in a structured state, via schemas and tables, a lake stores raw data in its native format, much of it unstructured.

However, as technologies mature and companies seek to “win” the data storage wars, companies like AWS, Snowflake, Google and Databricks are developing solutions that marry the best of both worlds, blurring the boundaries between data warehouse and data lake architectures. Additionally, more and more businesses are adopting both warehouses and lakes — either as one solution or a patchwork of several.

Primarily to keep up with the competition, major warehouse and lake providers are developing new functionalities that bring either solution closer to parity with the other. While data warehouse software expands to cover data science and machine learning use cases, lake companies are building out tooling to help data teams make more sense out of raw data.

But what does this mean for data quality? In our opinion, this convergence of technologies is ultimately good news. Kind of.

On the one hand, a way to better operationalize data with fewer tools means there are, in theory, fewer opportunities for data to break in production. The lakehouse demands greater standardization of how data platforms work, and therefore opens the door for a more centralized approach to data quality and observability. ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees, and storage frameworks like Delta Lake that bring them to the lake, make data contracts and change management far easier to handle at scale.

We predict that this convergence will be good for consumers (both financially and in terms of resource management), but will also likely introduce additional complexity to your data pipelines.

Emergence of new roles on the data team

In 2012, the Harvard Business Review named “data scientist” the sexiest job of the 21st century. Shortly thereafter, in 2015, DJ Patil, a PhD and former data science lead at LinkedIn, was hired as the United States’ first-ever Chief Data Scientist. And in 2017, Apache Airflow creator Maxime Beauchemin predicted the “downfall of the data engineer” in a canonical blog post.

Long gone are the days of siloed database administrators or analysts. Data is emerging as its own company-wide organization with bespoke roles like data scientists, analysts and engineers. In the coming years, we predict even more specializations will emerge to handle the ingestion, cleaning, transformation, translation, analysis, productization and reliability of data.

This wave of specialization is not unique to data, of course. Specialization is common to nearly every industry and signals a market maturity indicative of the need for scale, improved speed and heightened performance.

The roles we predict will come to dominate the data organization over the next decade include:

Data product manager: The data product manager is responsible for managing the life cycle of a given data product and is often responsible for managing cross-functional stakeholders, product roadmaps and other strategic tasks.
Analytics engineer: The analytics engineer, a term made popular by dbt Labs, sits between data engineers and analysts and is responsible for transforming and modeling the data so that stakeholders can trust and use it. Analytics engineers are simultaneously specialists and generalists, often owning several tools in the stack and juggling many technical and less technical tasks.
Data reliability engineer: The data reliability engineer is dedicated to building more resilient data stacks, primarily via data observability, testing and other common approaches. Data reliability engineers often possess DevOps skills and experience that can be directly applied to their new roles.
Data designer: A data designer works closely with analysts to help them tell stories about that data through business intelligence visualizations or other frameworks. Data designers are more common in larger organizations, and often come from product design backgrounds. Data designers should not be confused with database designers, an even more specialized role that actually models and structures data for storage and production.

So, how will the rise in specialized data roles — and bigger data teams — affect data quality?

As the data team diversifies and use cases increase, so will stakeholders. Bigger data teams and more stakeholders mean more eyeballs are looking at the data. As one of my colleagues says: “The more people look at something, the more likely they’ll complain about [it].”

Rise of automation

Ask any data engineer: More automation is generally a positive thing.

Automation reduces manual toil, scales repetitive processes and makes large-scale systems more fault-tolerant. When it comes to improving data quality, there is a lot of opportunity for automation to fill the gaps where testing, cataloging and other more manual processes fail.

We foresee that over the next several years, automation will be increasingly applied to several different areas of data engineering that affect data quality and governance:

Hard-coding data pipelines: Automated ingestion solutions make it easy, and fast, to ingest data and send it to your warehouse or lake for storage and processing. In our opinion, there’s no reason why engineers should be spending their time hand-writing SQL to move data from a CSV file into a data warehouse.
Unit testing and orchestration checks: Unit testing is a classic problem of scale, and most organizations can’t possibly cover all of their pipelines end-to-end — or even have a test ready for every possible way data can go bad. One company had key pipelines that went directly to a few strategic customers. They monitored data quality meticulously, instrumenting more than 90 rules on each pipeline. Something broke and suddenly 500,000 rows were missing — all without triggering one of their tests. In the future, we anticipate teams leaning into more automated mechanisms of testing their data and orchestrating circuit breakers on broken pipelines.
Root cause analysis: Often when data breaks, the first step many teams take is to frantically ping the data engineer who has the most organizational knowledge and hope they’ve seen this type of issue before. The second step is to then manually spot-check thousands of tables. Both are painful. We hope for a future where data teams can automatically run root cause analysis as part of the data reliability workflow with a data observability platform or other type of DataOps tooling.

While this list just scratches the surface of areas where automation can benefit our quest for better data quality, I think it’s a decent start.
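To make the idea of automated checks and circuit breakers a little more concrete, here is a minimal sketch, not a description of any particular product. It assumes a pipeline records its daily row counts and compares each new load against that history, halting downstream steps when the volume looks abnormal. The names `volume_is_anomalous`, `circuit_breaker` and `CircuitOpen`, the z-score rule and the threshold of 3 are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


class CircuitOpen(Exception):
    """Hypothetical error raised to stop downstream steps when a check fails."""


def volume_is_anomalous(history, current, z_threshold=3.0):
    """Return True if `current` is more than `z_threshold` sample standard
    deviations away from the mean of `history` (a list of past row counts)."""
    if len(history) < 2:
        return False  # not enough history to judge either way
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly stable history: any change at all counts as anomalous.
        return current != mu
    return abs(current - mu) / sigma > z_threshold


def circuit_breaker(history, current, z_threshold=3.0):
    """Pass the row count through unchanged, or raise to halt the pipeline."""
    if volume_is_anomalous(history, current, z_threshold):
        raise CircuitOpen(f"row count {current} is outside the expected range")
    return current


# Illustrative numbers only: five normal days, then a load missing 500,000 rows.
history = [2_000_000, 2_010_000, 1_990_000, 2_005_000, 1_995_000]
try:
    circuit_breaker(history, 1_500_000)
except CircuitOpen as err:
    print(f"Pipeline halted: {err}")
```

A check like this needs no hand-written rule about which rows should exist, which is why it can catch a failure that dozens of explicit tests miss. Real volume data usually has seasonality, so a production check would likely need a more careful baseline than a plain mean and standard deviation.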

More distributed environments and the rise of data domains

Distributed data paradigms like the data mesh make it easier and more accessible for functional groups across the enterprise to leverage data for specific use cases. The potential of domain-oriented ownership applied to data management is high (faster data access, greater data democratization, more informed stakeholders), but so are the potential complications.

Data teams need look no further than the microservice architecture for a sneak peek of what’s to come after data mesh mania calms down and teams begin their implementations in earnest. Such distributed approaches demand more discipline at both the technical and cultural levels when it comes to enforcing data governance.

Generally speaking, splitting technical components across domains can increase data quality issues. For instance, a schema change in one domain can cause a data fire drill in another area of the business, and a duplicated copy of a critical table that is regularly updated or augmented for one part of the business can cause pandemonium if another part relies on it. Without proactively generating awareness and creating context about how to work with the data, it can be challenging to scale the data mesh approach.
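One way to picture the schema-change problem is as a comparison between the schema a consuming domain expects and the schema a producing domain actually ships. The sketch below is illustrative only: it assumes schemas can be represented as simple column-to-type mappings, and the function names, the column names and the rule that removed or retyped columns are "breaking" are hypothetical.

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report what changed."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "removed": sorted(expected_cols - observed_cols),
        "added": sorted(observed_cols - expected_cols),
        "retyped": sorted(
            col
            for col in expected_cols & observed_cols
            if expected[col] != observed[col]
        ),
    }


def is_breaking(diff):
    """Assumed policy: removed or retyped columns break consumers,
    while newly added columns are treated as safe."""
    return bool(diff["removed"] or diff["retyped"])


# Hypothetical example: what one domain expects vs. what another now ships.
expected = {"order_id": "int", "amount": "float", "region": "str"}
observed = {"order_id": "str", "amount": "float", "channel": "str"}

changes = diff_schema(expected, observed)
print(changes)
print("breaking change:", is_breaking(changes))
```

Running a comparison like this before a change ships, rather than after a dashboard breaks, is the kind of proactive awareness that distributed ownership depends on.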

So, where do we go from here?

I predict that in the coming years, achieving data quality will become both easier and harder for organizations across industries, and it’s up to data leaders to help their organizations navigate these challenges as they drive their business strategies forward.

On the one hand, increasingly complicated systems and higher volumes of data beget complication; on the other, innovations and advancements in data engineering technologies mean greater automation and improved ability to “cover our bases” when it comes to preventing broken pipelines and products. Regardless of how you slice it, however, striving for some measure of data reliability will become table stakes for even the most novice of data teams.

I anticipate that data leaders will start measuring data quality as a vector of data maturity (if they haven’t already), and in the process, work towards building more reliable systems.

Until then, here’s wishing you no data downtime.

Barr Moses is the CEO and co-founder of Monte Carlo.
