Your data quality strategy should be automated — here’s where to start

Barr Moses
3 min read · Sep 27, 2024

Did you know that data analysts often spend more time writing tests for data than they do driving value from it?

Data quality is a team sport: analysts, engineers, and governance teams hand off the baton across issue detection, triage, and resolution. For most data teams, the first step is 1) understanding what data matters most to your business and 2) hand-writing data quality checks to cover those critical assets. The problem is that the burden of writing rules will always grow with the scale of your data environment. The more data you have, the more rules you’ll need to write to validate it. It’s tedious. It’s expensive. And worst of all — it isn’t enough.

At a certain point, a manual approach to testing just doesn’t make sense. In my opinion, automating data quality checks is the only way for modern data teams to effectively manage data reliability at scale.

Now, I know what you’re thinking — not every rule can be automated — and I agree with you. Some business rules will always rely on the expertise of domain-specific analysts and SMEs to define them. But here are five baseline data quality rules that can and should be automated with AI, each paired with a rough sketch after the list of what that automation could look like:

  1. Uniqueness rules — If it’s a routine rule-type that you find yourself writing often, it’s a rule you shouldn’t be writing by hand. But uniqueness doesn’t mean the same thing for every field — or even every table. Instead of profiling each field separately and guessing at thresholds, a good machine learning monitor can apply a UNIQUE_RATE check across an entire table and programmatically define an appropriate uniqueness rate for each field.
  2. Validity rules and dimension drift — Validating low-cardinality fields is always tedious. Instead of hand-writing rules for every allowed value or the limitless permutations of drift, ML-powered dimension tracking can apply distribution analysis to a field’s values, showing analysts how the relative frequency of each value compares to its historical baseline.
  3. Timeliness rules — As arguably THE most common data quality rule, timeliness checks are the premier candidate for automation — and a massive burden when you don’t automate them. Instead of cloning this test over and over again with slightly different standards, you can deploy an automated monitor like ‘TIME_SINCE_LAST_ROW_COUNT_CHANGE’ to alert you when the time since new rows last arrived breaches historic thresholds — and save yourself a whole lot of time.
  4. Accuracy rules — In general, manual accuracy rules require profiling each numeric column separately to define its accepted ranges — and if you have hundreds of tables to cover, you have a whole lot of rules to write. What’s worse, numeric profiles shift over time, so hand-set ranges quickly go stale. Automated distribution monitors for things like mean, negative %, and standard deviation can programmatically check for shifts in the numeric profile of your data across tables and proactively maintain those thresholds over time.
  5. Rules for unforeseen issues — You don’t know what you don’t know, and this applies to your data environment. Analysts need some form of advanced automation that can offer coverage for the issues they can’t anticipate. Automated machine learning monitors for ‘unknown unknowns’ should cover your entire environment at the data, system, and code levels to detect issues that aren’t being monitored by a specific test — like a code change that causes an API to stop collecting data, or a JSON schema change that removes a critical column. This type of coverage should tell you not only when a break happens, but where and why as well.
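
To make these concrete, here is a minimal sketch of an automated uniqueness-rate check (rule 1), assuming a pandas DataFrame and per-field thresholds learned from historical profiling. The table, field names, and threshold values are illustrative, not any particular product’s implementation.

```python
# Minimal sketch: table-wide uniqueness-rate check against learned thresholds.
import pandas as pd

def unique_rate(series: pd.Series) -> float:
    """Fraction of non-null values that are distinct."""
    non_null = series.dropna()
    if len(non_null) == 0:
        return 0.0
    return non_null.nunique() / len(non_null)

def check_uniqueness(df: pd.DataFrame, learned_thresholds: dict[str, float]) -> list[str]:
    """Flag fields whose unique rate falls below their learned threshold."""
    breaches = []
    for field, threshold in learned_thresholds.items():
        rate = unique_rate(df[field])
        if rate < threshold:
            breaches.append(f"{field}: unique rate {rate:.2%} below expected {threshold:.2%}")
    return breaches

if __name__ == "__main__":
    orders = pd.DataFrame({
        "order_id": [1, 2, 3, 3, 4],  # a duplicate slipped in
        "customer_email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "d@x.com"],
    })
    # In practice these thresholds would be inferred from history, not hand-set.
    thresholds = {"order_id": 1.0, "customer_email": 0.6}
    for breach in check_uniqueness(orders, thresholds):
        print("ALERT:", breach)
```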
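
For rule 2, a rough sketch of dimension-drift detection on a low-cardinality field, assuming you keep a historical value distribution to compare against. The drift metric (total variation distance) and the alert threshold are illustrative choices.

```python
# Rough sketch: compare a categorical field's current value distribution to history.
import pandas as pd

def value_distribution(series: pd.Series) -> dict[str, float]:
    """Relative frequency of each value, ignoring nulls."""
    return series.dropna().value_counts(normalize=True).to_dict()

def dimension_drift(historical: dict[str, float], current: dict[str, float]) -> float:
    """Total variation distance between two categorical distributions."""
    keys = set(historical) | set(current)
    return 0.5 * sum(abs(historical.get(k, 0.0) - current.get(k, 0.0)) for k in keys)

if __name__ == "__main__":
    history = {"credit_card": 0.70, "paypal": 0.25, "gift_card": 0.05}
    today = pd.Series(["credit_card"] * 40 + ["paypal"] * 10 + ["UNKNOWN"] * 50)
    drift = dimension_drift(history, value_distribution(today))
    if drift > 0.1:  # illustrative threshold; in practice learned from history
        print(f"ALERT: payment_method distribution drifted (TVD={drift:.2f})")
```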
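
For rule 3, a sketch of a time-since-last-row-count-change check, assuming you record when each table’s row count last changed and keep a history of the gaps between changes. The table and numbers are hypothetical.

```python
# Sketch: alert when the gap since the last row-count change breaches historic norms.
from datetime import datetime
from statistics import mean, stdev

def expected_max_gap(historic_gaps_minutes: list[float], sigmas: float = 3.0) -> float:
    """Derive an alert threshold from the historic gaps between row-count changes."""
    return mean(historic_gaps_minutes) + sigmas * stdev(historic_gaps_minutes)

def check_timeliness(last_change: datetime, now: datetime,
                     historic_gaps_minutes: list[float]) -> str | None:
    gap = (now - last_change).total_seconds() / 60
    threshold = expected_max_gap(historic_gaps_minutes)
    if gap > threshold:
        return f"ALERT: no row-count change for {gap:.0f} min (expected under {threshold:.0f} min)"
    return None

if __name__ == "__main__":
    historic_gaps = [58, 61, 60, 62, 59, 63, 60]  # table normally updates hourly
    alert = check_timeliness(
        last_change=datetime(2024, 9, 27, 6, 0),
        now=datetime(2024, 9, 27, 11, 30),
        historic_gaps_minutes=historic_gaps,
    )
    if alert:
        print(alert)
```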
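
For rule 4, a sketch of a numeric-profile monitor that watches mean, negative %, and standard deviation for drift against a stored baseline. The baseline values and the relative-shift tolerance are assumptions for illustration.

```python
# Sketch: flag shifts in a numeric column's profile relative to a stored baseline.
import pandas as pd

def numeric_profile(series: pd.Series) -> dict[str, float]:
    non_null = series.dropna()
    return {
        "mean": float(non_null.mean()),
        "negative_pct": float((non_null < 0).mean()),
        "std": float(non_null.std()),
    }

def check_numeric_drift(current: dict[str, float], baseline: dict[str, float],
                        tolerance: float = 0.5) -> list[str]:
    """Flag metrics that moved more than `tolerance` (relative) from baseline."""
    breaches = []
    for metric, base_value in baseline.items():
        denom = abs(base_value) if base_value != 0 else 1.0
        shift = abs(current[metric] - base_value) / denom
        if shift > tolerance:
            breaches.append(f"{metric}: {base_value:.3f} -> {current[metric]:.3f}")
    return breaches

if __name__ == "__main__":
    baseline = {"mean": 42.0, "negative_pct": 0.0, "std": 5.0}  # learned from history
    todays_amounts = pd.Series([40, 44, 43, -120, -95, 41, 39])  # refunds snuck in
    for breach in check_numeric_drift(numeric_profile(todays_amounts), baseline):
        print("ALERT: order_amount", breach)
```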
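
And for rule 5, a deliberately simplified sketch of metadata-level anomaly detection: learn each table’s normal daily volume and flag outliers with a z-score, with no hand-written test per table. Real ‘unknown unknowns’ coverage would also watch schema, freshness, and code changes; this only shows the pattern, on made-up numbers.

```python
# Simplified sketch: detect volume anomalies across tables from daily row counts.
from statistics import mean, stdev

def volume_anomalies(daily_row_counts: dict[str, list[int]], z_threshold: float = 3.0) -> list[str]:
    """Flag tables whose latest daily volume is far outside their own history."""
    alerts = []
    for table, counts in daily_row_counts.items():
        history, latest = counts[:-1], counts[-1]
        if len(history) < 2 or stdev(history) == 0:
            continue  # not enough history to model "normal"
        z = (latest - mean(history)) / stdev(history)
        if abs(z) > z_threshold:
            alerts.append(f"{table}: volume anomaly (latest={latest}, z={z:.1f})")
    return alerts

if __name__ == "__main__":
    # Hypothetical warehouse metadata: rows loaded per day for each table.
    counts = {
        "events_api": [10_120, 9_980, 10_340, 10_050, 120],  # API quietly stopped collecting
        "orders":     [2_010, 1_995, 2_050, 2_020, 2_031],
    }
    for alert in volume_anomalies(counts):
        print("ALERT:", alert)
```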

In 2024, a modern approach to data quality is an absolute necessity. Without strategic automation for routine manual data testing (and an operational strategy for rolling it out), you’re not just sacrificing valuable insights — you’re getting left behind.

Stay reliable,
Barr Moses
