Closing the Data Downtime Gap

Image for post
Image for post
Photo by Joshua Sortino on Unsplash

“Our data is 100% reliable” — said no one ever.

In a previous blog post, I wrote about “data downtime” and its meaning for data-driven companies. As data becomes more mission-critical to organizations and more complex for data teams to manage, we’ll see the rise of data downtime in importance and urgency. Companies have various approaches to managing data downtime; those who have mastered it start by realizing it requires proactive, systemic attention.

Companies that are able to manage data downtime achieve great benefits.

Here are a few:

Much has been written about solutions, some great examples include these from Intuit and Netflix — yet I’ve found there still isn’t a standard industry best practice. Curious as to why, I interviewed and collected information from over 80 organizations about their approach to data downtime. Here are some of my observations:

Image for post
Image for post

Data downtime happens to almost everyone on their quest to become data-driven. There are various degrees of severity, but it’s not unique or isolated to a specific industry, skill set, technology or organizational structure. So in case you’re wondering — no, you’re not alone. :)

Organizations follow a journey on their path to realizing maximal data uptime — I call it the data reliability maturity curve. The journey typically starts when the company begins to experience hyper-growth, starts using data in a new and intense way, and/or as a result of an important business metric or customer which has been impacted due to bad data.

The data reliability maturity curve

There are four main steps in the data reliability journey:

#1 Reactive: data teams start their morning with the daily fire-drill, and spend the first half of their day triaging a data problem. They spend significant amounts of time in crisis mode and are slow to make progress on important initiatives. As a result, the organization struggles to use data effectively in the product, machine learning algorithms, or business decision making.

#2 Proactive: the data team develops manual sanity checks and custom QA queries to validate their work, perhaps also receive Slack or email alerts when things go wrong. This requires discipline and active collaboration between engineering, data engineering, data analysts, and data scientists. Examples:

#3 Automated: having reliable, accurate data becomes a priority. Validation queries are run on a schedule and have broader coverage of pipelines. Teams have a data health dashboard they use to view issues, troubleshoot and direct others in the organization to learn about up-to-date status of data. Examples:

#4 Scalable: the data team draws on concepts from dev ops and institutes a staging environment, reusable components for validation, and/or hard and soft alerts for data errors. There is substantial coverage of mission-critical data and the team is on top of issues before they propagate downstream. Examples:

In a recent poll I conducted among ~50 data professionals, more than 60% indicated they are currently in the earlier stages of the data reliability journey, however they are taking concrete steps in their roadmap to move up the curve.

Where are you in your data reliability journey, and where do you want to be by the end of 2019?

Co-Founder and CEO, Monte Carlo (www.montecarlodata.com). @BM_DataDowntime #datadowntime

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store