Why Third-Party Data is Still Your Biggest Risk

Barr Moses
3 min read · Aug 15, 2024

In the AI era, third-party data is your kryptonite. Here’s why and what you can do about it.

The bane of every data leader’s existence often boils down to two words: third-party data.

From converting text into vectors that train GPTs to using financial market data to inform trading decisions, companies across industries leverage third-party data to power some of their most critical products. While third-party data offers a competitive advantage, using it also introduces additional risk, and we all know how much data teams love risk.

Just last week, a federal testimony revealed that “bad data from a third-party” vendor led Florida to wrongfully deny Medicaid coverage for new mothers. But waving the “third-party data” flag isn’t going to absolve Florida of responsibility — and it won’t absolve data teams, either.

Regardless of where data comes from, the moment it lands in production, it’s the data team’s responsibility. As companies invest in AI, this accountability for data integrity and trust is even greater. In fact, according to a recent survey, 91% of data leaders say they’re actively building AI applications, but 2 out of 3 admit to not completely trusting the data those applications are trained on, which often comes from third-party sources.

Here are 5 practical approaches to ensuring third-party data quality at scale:

  • Know your data. Map dependencies between third-party data sources, downstream data and AI products, and all of the transformations in between. Bonus points if you can automate it.
  • Assign owners. Understanding data ownership is critical to creating accountability. Set clear expectations about who manages which data products so that, when data breaks, owners can triage and provide context that expedites root cause analysis.
  • Automate rule creation. There are infinite ways third-party data can break, and the more you ingest, the harder it is to scale testing. Do yourself a favor and invest in an automated approach to monitoring. Better yet if your solution provides rule suggestions, too.
  • Set up incident workflows where you work. The data quality programs that fail are the ones that don’t take into account that most incident management takes place in Slack or Microsoft Teams. Ensure that your data quality solutions are interoperable with your existing tech stack to drive greater adoption.
  • Measure performance. Track the reliability of your third-party data over time to understand what sources most frequently break and flag any coverage gaps on downstream tables. Data Quality Scorecards can help you aggregate and communicate this info to your execs.
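
To make the first and last points concrete, here is a minimal Python sketch, not a definitive implementation. It assumes you can export lineage as a simple mapping and that your checks produce pass/fail results per source. Every table name, the `LINEAGE` mapping, and both helper functions are hypothetical and only there for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage map: each asset points to the assets it feeds directly.
LINEAGE = {
    "vendor_market_feed": ["stg_prices"],
    "vendor_eligibility_file": ["stg_eligibility"],
    "stg_prices": ["fct_positions"],
    "stg_eligibility": ["fct_coverage"],
    "fct_positions": ["risk_dashboard", "pricing_model"],
    "fct_coverage": ["coverage_report"],
}


def downstream_of(source, lineage):
    """Return every asset reachable from `source`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


def reliability_by_source(check_results):
    """Turn (source, passed) pairs into a pass rate per source."""
    totals = defaultdict(lambda: [0, 0])  # source -> [passed, total]
    for source, passed in check_results:
        totals[source][0] += int(passed)
        totals[source][1] += 1
    return {src: ok / total for src, (ok, total) in totals.items()}


# Which products are exposed if the market feed breaks?
impacted = downstream_of("vendor_market_feed", LINEAGE)

# Hypothetical check history: one (source, passed) pair per check run.
history = [
    ("vendor_market_feed", True),
    ("vendor_market_feed", False),
    ("vendor_eligibility_file", True),
]
scorecard = reliability_by_source(history)
```

Here `impacted` lists the staging table, the fact table, and the dashboard and model that depend on the feed, while `scorecard` gives each source a pass rate you could roll up into a scorecard. In practice the lineage and check history would come from your catalog or observability tooling rather than hand-written dictionaries.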

Data teams may not be able to prevent third-party data downtime, but they can take proactive measures to ensure these issues are found and fixed before they impact the business.

And that’s something to celebrate.

Stay reliable,

Barr
