Data quality management in the age of AI

Barr Moses
3 min read · Oct 9, 2024


Over the last 12 months, data quality has become THE problem to solve for enterprise data teams, and unsurprisingly, AI is leading the charge.

As more enterprise teams look to AI as their strategic differentiator, the risks associated with bad data become exponentially greater. At the speed and scale of modern data environments, data teams need advanced data quality methods that can rise to meet these challenges.

In this week's edition, I'll consider three of the most common tactics for managing data quality: monitoring, testing, and observability. I'll also discuss how each can (and will) evolve in the age of AI.

Defining our terms: data testing, data quality monitoring, and data observability

Before we can understand the future of data quality, we need to understand the present. In the simplest terms, you can think of data quality as the problem; testing and monitoring as methods to detect it; and data observability as a comprehensive approach that combines and extends both methods to actually triage and resolve problems at scale.

Data testing

Data testing is a detection method that employs user-defined rules to identify specific known issues within a dataset. Manual data testing can be effective for specific use cases, but it naturally becomes less effective at scale. Moreover, testing can only detect the issues you expect to find, and its visibility is limited to the data itself, not the system or code that's powering it.
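To make this concrete, here's a minimal sketch of what user-defined data tests might look like in Python with pandas. The orders table, column names, and rules are hypothetical, but the pattern is representative: each assertion encodes one specific, known failure mode.

```python
import pandas as pd

def test_orders(df: pd.DataFrame) -> None:
    """User-defined rules for issues we already know to look for."""
    # Rule 1: the primary key must be present and unique.
    assert df["order_id"].notna().all(), "order_id contains nulls"
    assert df["order_id"].is_unique, "order_id contains duplicates"

    # Rule 2: amounts must fall within a plausible range.
    assert (df["amount"] >= 0).all(), "negative order amounts"

    # Rule 3: status must come from a known set of values.
    allowed = {"pending", "shipped", "delivered", "cancelled"}
    assert set(df["status"].dropna()) <= allowed, "unexpected status values"

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [25.0, 99.9, 12.5],
    "status": ["pending", "shipped", "delivered"],
})
test_orders(orders)  # raises AssertionError on the first violated rule
```

Notice that every rule here is an issue someone anticipated in advance. A failure mode no one thought to encode passes silently, which is exactly the scale limitation described above.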

Data quality monitoring

Unlike the one-to-one nature of testing, data quality monitoring is an ongoing solution that continually monitors your data and identifies anomalies based on user-defined thresholds or machine learning. Its benefits include broader coverage for unknown unknowns and the ability to track metrics and discover patterns over time. However, broad monitors can be expensive to apply effectively across a large environment, and custom monitors still need to be expressed in SQL. Like testing, monitoring is also limited to the data itself and doesn't support the root-cause analysis process.
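As an illustration, here's a hedged sketch of a simple threshold-style monitor: it tracks one metric (daily row counts) and flags any day that deviates sharply from its trailing history. The metric, window size, and z-score threshold are assumptions for the example, not a prescription.

```python
from statistics import mean, stdev

def detect_anomalies(daily_counts: list[int], window: int = 7, z_max: float = 3.0) -> list[int]:
    """Flag days whose row count deviates sharply from the trailing window."""
    flagged = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(daily_counts[i] - mu) / sigma > z_max:
            flagged.append(i)  # index of the anomalous day
    return flagged

# A sudden drop in volume on the last day gets flagged.
counts = [1000, 1020, 980, 1010, 995, 1005, 990, 1015, 120]
print(detect_anomalies(counts))  # [8]
```

Unlike a hand-written test, this catches a failure no one predicted. But note what it can't do: it knows the row count dropped, not whether the cause was a broken pipeline, a bad deploy, or an upstream schema change.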

Data observability

Inspired by software engineering best practices, data observability is an end-to-end, AI-enabled approach to data quality management that's designed to answer the what, who, why, and how of data quality issues within a single platform. It compensates for the limitations of traditional data quality methods by combining detection, triage, and resolution in a single workflow across your data, systems, and code: the three places data products can break.
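Here's a toy sketch of that idea: the same anomaly signal, but enriched with lineage and code-change context so a human (or an AI) can triage it. The lineage map and deploy log below are hypothetical stand-ins for metadata a real observability platform would collect automatically.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """A data anomaly enriched with context across data, systems, and code."""
    table: str
    symptom: str
    upstream_tables: list[str] = field(default_factory=list)
    recent_deploys: list[str] = field(default_factory=list)

# Hypothetical metadata; a real platform would harvest this automatically.
LINEAGE = {"analytics.orders": ["raw.orders", "raw.payments"]}
DEPLOY_LOG = {"raw.payments": ["2024-10-08: payments loader v2 rollout"]}

def triage(table: str, symptom: str) -> Incident:
    """Correlate a detected anomaly with upstream lineage and recent deploys."""
    upstream = LINEAGE.get(table, [])
    deploys = [d for t in upstream for d in DEPLOY_LOG.get(t, [])]
    return Incident(table, symptom, upstream, deploys)

incident = triage("analytics.orders", "row count dropped 88%")
print(incident.recent_deploys)  # ['2024-10-08: payments loader v2 rollout']
```

The point isn't the toy code; it's that the anomaly now arrives with a plausible root cause attached, which is what turns detection into resolution.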

The future of data quality management for AI applications and beyond

It isn't simply the AI that needs better data quality management, though. To maximize scalability, your data quality management will need to incorporate AI as well.

By incorporating AI into monitor creation, anomaly detection, and root-cause analysis, advanced solutions like data observability can enable hyper-scalable data quality management for real-time data streaming, RAG architectures, and other AI use cases.
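As a hedged sketch of what "AI in the monitoring loop" can mean, here's an unsupervised model learning normal behavior across several quality metrics jointly, rather than relying on a hand-set threshold per metric. The metrics and the choice of scikit-learn's IsolationForest are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical per-run quality metrics: [row_count, null_rate, freshness_minutes]
history = np.column_stack([
    rng.normal(1000, 15, 200),     # daily row counts
    rng.normal(0.02, 0.005, 200),  # null rate of a key column
    rng.normal(30, 3, 200),        # minutes since the last update
])

# Learn what "normal" looks like across all metrics at once,
# instead of maintaining one hand-tuned threshold per metric.
model = IsolationForest(random_state=0).fit(history)

# A new run: normal volume, but nulls spiked and the data arrived late.
new_run = [[1002, 0.35, 240]]
print(model.predict(new_run))  # -1 flags an anomaly, 1 a normal run
```

The same pattern extends to monitor creation (profiling data to suggest which metrics to watch) and root-cause analysis (correlating anomalies with lineage and code changes, as in the sketch above).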

As we move deeper into the AI future, I expect that we’ll see data teams continue to adopt solutions that unify not just tooling but teams and processes as well, leveraging automation and AI in intelligent ways to democratize data quality for the teams that own it.

What do you think? Agree? Disagree? Let me know in the comments.

Stay reliable,

Barr
