Data Engineers Spend Two Days Per Week Firefighting Bad Data

Monte Carlo has released the initial results of its 2022 data quality survey, which found that data professionals are spending 40 percent of their time evaluating or checking data quality and that poor data quality impacts 26 percent of their companies’ revenue.

The report, commissioned by Monte Carlo and conducted by Wakefield Research between April 28 and May 11, 2022, found that 75 percent of the 300 data professionals surveyed take four or more hours to detect a data quality incident and about half said it takes an average of nine hours to resolve the issue once identified.

Even worse, 58 percent said the total number of incidents has increased somewhat or greatly over the past year, often as a result of more complex pipelines, bigger data teams, greater volumes of data, and other factors.

Today, the average organization experiences about 61 data-related incidents per month, each of which takes an average of 13 hours to identify and resolve. This adds up to an average of about 793 hours per month per company. 
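
The monthly total follows directly from the two survey averages. As a minimal sketch of that arithmetic (the function and variable names are illustrative and do not come from the report):

```python
# Survey averages quoted in the article.
incidents_per_month = 61   # data-related incidents per month
hours_per_incident = 13    # average hours to identify and resolve one incident


def monthly_downtime_hours(incidents: int, hours_each: int) -> int:
    """Total hours per month spent identifying and resolving incidents."""
    return incidents * hours_each


total_hours = monthly_downtime_hours(incidents_per_month, hours_per_incident)
print(total_hours)  # 793
```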

However, 61 incidents represents only the number of incidents known to respondents. Proprietary data from the Monte Carlo platform suggests the average organization experiences about 70 data incidents per year for every thousand tables in its environment.
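
To put the platform figure in context, the rate of about 70 incidents per year per thousand tables can be scaled to an environment of any size. The sketch below assumes the rate scales linearly with table count, which the article does not state, and the 5,000-table environment is a hypothetical example:

```python
# Rate quoted in the article: about 70 incidents per year per 1,000 tables.
INCIDENTS_PER_YEAR_PER_1000_TABLES = 70


def estimated_incidents_per_year(table_count: int) -> float:
    """Scale the quoted rate linearly to a given number of tables (an assumption)."""
    return INCIDENTS_PER_YEAR_PER_1000_TABLES * table_count / 1000


# Hypothetical environment with 5,000 tables.
estimate = estimated_incidents_per_year(5000)
print(estimate)  # 350.0
```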

“In the mid-2010s, organizations were shocked to learn that their data scientists were spending about 60 percent of their time just getting data ready for analysis,” said Barr Moses, Monte Carlo CEO and co-founder.

“Now, even with more mature data organizations and advanced stacks, data teams are still wasting 40 percent of their time troubleshooting data downtime. Not only is this wasting valuable engineering time, but it’s also costing precious revenue and diverting attention away from initiatives that move the needle for the business. These results validate that data reliability is one of the biggest and most urgent problems facing today’s data and analytics leaders.”

Nearly half of respondent organizations measure data quality most often by the number of customer complaints their company receives, highlighting the ad hoc and reputation-damaging nature of this important element of modern data strategy.

The Business Cost of Data Downtime

“Garbage in, garbage out” aptly describes the impact data quality has on data analytics and machine learning. If the data is unreliable, so are the insights derived from it. 

In fact, on average, respondents said bad data impacts 26 percent of their revenue. This validates and supplements other industry studies that have uncovered the high cost of bad data. For example, Gartner estimates poor data quality costs organizations an average $US12.9 million every year.

Nearly half of respondents to the Monte Carlo survey said business stakeholders are impacted by issues the data team doesn’t catch most or all of the time.

According to the survey, respondents who ran at least three different types of data tests (for distribution, schema, volume, null, or freshness anomalies) at least once a week suffered fewer data incidents on average (46) than respondents with a less rigorous testing regime (61). However, testing alone was insufficient: more stringent testing did not correlate significantly with a lower impact on revenue or stakeholders.
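
For readers unfamiliar with these test types, here is a minimal sketch of three of them (null, volume, and freshness) in plain Python. It is an illustration only, not how survey respondents or Monte Carlo implement such checks. The column names (`amount`, `loaded_at`), the thresholds, and the sample rows are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def volume_ok(rows, min_rows, max_rows):
    """True if the row count falls inside the expected range."""
    return min_rows <= len(rows) <= max_rows


def is_fresh(rows, ts_column, max_age, now):
    """True if the newest timestamp is no older than `max_age` relative to `now`."""
    stamps = [row[ts_column] for row in rows if row.get(ts_column) is not None]
    if not stamps:
        return False
    return now - max(stamps) <= max_age


def run_checks(rows, now):
    """Run the three checks with hypothetical thresholds; True means the check passed."""
    return {
        "null": null_rate(rows, "amount") <= 0.05,
        "volume": volume_ok(rows, min_rows=3, max_rows=1000),
        "freshness": is_fresh(rows, "loaded_at", timedelta(hours=24), now),
    }


# Hypothetical sample batch: one null amount, newest load two hours old.
now = datetime(2022, 5, 11, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "amount": 20.0, "loaded_at": now - timedelta(hours=2)},
    {"order_id": 2, "amount": None, "loaded_at": now - timedelta(hours=3)},
    {"order_id": 3, "amount": 35.5, "loaded_at": now - timedelta(hours=30)},
]
results = run_checks(rows, now)
print(results)  # {'null': False, 'volume': True, 'freshness': True}
```

In this sample the null check fails because one of three `amount` values is missing, while the volume and freshness checks pass. Checks like these flag only the anomalies they were written for, which is consistent with the survey's finding that testing alone was insufficient.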

Within Six Months, 88 Percent of Organizations Will Invest or Plan to Invest in Data Quality

Last year, organizations spent $US39.2 billion on cloud databases such as Snowflake, Databricks, and Google BigQuery. This year, 88 percent of respondent organizations are already investing or planning to invest in data quality solutions within six months. 

Data observability is one such data quality solution. Leading data teams at organizations such as JetBlue, Vimeo, and Affirm leverage automated, end-to-end data observability to detect, resolve, and prevent data incidents and lower data downtime at scale.