Testing Data Applications is Hard

Testing a data application resembles testing any other software application in many ways, just with a stronger focus on data-related issues. But catching problems like failing data workflows, reconciliation mismatches after ETL, and data quality issues means that you’re testing not only the code but also the data itself.

Data applications need comprehensive testing because they’re often responsible for providing data for other applications to consume. They’re as critical for the day-to-day operations of your business as they are for the long-term decisions you need to make for your business to succeed.

Testing the code of an application that handles data-specific operations to transport, transform, cleanse, and integrate data from multiple data sources benefits your data team in many ways, some of which include:

  • Building trust and confidence among various teams around data quality, accuracy, timeliness, and completeness.
  • Ensuring that other applications consuming the data have a more reliable testing experience.
  • Boosting overall application reliability.
  • Driving better end-to-end application testing, which can be a nightmare for application developers, data teams, and the business.

Common Challenges with Testing Data Applications

This article will take you through some common challenges that data teams face when testing data applications. Being familiar with these challenges will enable you to take preventive and corrective action in your data application testing strategy.

Having No Tests at All

Having no tests is bad; some tests are better than no tests.

Tests build credibility and confidence in a system. So any number of unit tests, integration tests, and so on is better than no tests at all.

The good news is that if you’ve written at least a few tests, it’s easier for your team to add more because the test framework is already in place. It’s then just a matter of replicating the template to test another scenario. This encourages clarity and builds momentum for further application and test development.
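As a minimal sketch of what "replicating the template" can look like, the test below uses Python's built-in unittest module with a table of cases. The function normalize_email and its cases are hypothetical stand-ins for whatever transformation your application performs; the point is only that a new scenario becomes one more line in the table.

```python
import unittest


def normalize_email(value):
    """Hypothetical transformation under test: trim and lower-case an email."""
    return value.strip().lower()


class NormalizeEmailTest(unittest.TestCase):
    # Adding a scenario means adding one (label, input, expected) line here.
    CASES = [
        ("already clean", "a@example.com", "a@example.com"),
        ("upper case", "A@Example.COM", "a@example.com"),
        ("surrounding spaces", "  a@example.com ", "a@example.com"),
    ]

    def test_cases(self):
        for label, raw, expected in self.CASES:
            # subTest reports each failing case separately instead of
            # stopping at the first one.
            with self.subTest(label):
                self.assertEqual(normalize_email(raw), expected)


if __name__ == "__main__":
    unittest.main()
```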

Dealing with Legacy Workflows

Legacy workflows just aren’t conducive to good testing. Their data stacks come with tightly coupled testing frameworks, which are often verbose and hard to learn.

Newer data stacks take a more integration-based, modular approach to testing tools. As a result, testing tools can now work with a wide range of data tools for ETL, business intelligence, orchestration, and so on.

Deciding What to Do with Test Failures

Knowing what to do with test failures is hard. Having tests in place isn’t helpful if you don’t have a way to observe and act upon their failures.

When tests are built into CI/CD pipelines, their failures are still frustrating but are very effective at stopping you from deploying code into the next environment. Many other tests, such as scheduled data quality and profiling tests, don’t block your deployment but still need your attention. To act upon those tests, you need visibility into their failures.

This is where data testing also comes under the purview of observability. A good data testing tool will have built-in methods for observing failures and acting upon them.
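The split between failures that stop a deployment and failures that only need attention can be sketched as follows. This is an illustration rather than the behavior of any particular tool: the CheckResult type, the blocking flag, and the test names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # True for tests that gate a deployment (assumed convention)


def triage(results):
    """Split failures into deploy blockers and failures that only need attention."""
    blockers = [r.name for r in results if not r.passed and r.blocking]
    warnings = [r.name for r in results if not r.passed and not r.blocking]
    return blockers, warnings


def exit_code(results):
    """Report every failure, but return non-zero only when a blocker failed."""
    blockers, warnings = triage(results)
    for name in warnings:
        print(f"WARN  {name}: failed, needs follow-up")
    for name in blockers:
        print(f"BLOCK {name}: failed, stopping deployment")
    return 1 if blockers else 0


if __name__ == "__main__":
    results = [
        CheckResult("transform_unit_test", passed=True, blocking=True),
        CheckResult("nightly_row_count_profile", passed=False, blocking=False),
    ]
    raise SystemExit(exit_code(results))
```

A CI job can use the return value as its exit status, while the warnings would go to whatever channel your team actually watches.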

Needing Lots of Data Testing Options

In data applications, you need to run tests based not only on your application’s code but also on the source and target schema specifications. 

For instance, databases differ in the schema definitions and SQL standards they comply with. To test them effectively, your testing framework should be able to support those standards. One way of going about it is to use a tool that integrates with different databases, warehouses, data lakes, and so on. That gives you a standard method of defining tests related to structure, referential integrity, aggregates, uniqueness, completeness, accuracy, and other custom business logic-based tests specific to your data application. This way, you’ll be able to run tests across multiple systems.
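To make a few of these test types concrete, here is a small sketch that expresses uniqueness, completeness, and referential integrity checks as SQL queries that count offending rows. It uses Python's built-in sqlite3 module with an in-memory database. The orders and customers tables, the check names, and the sample rows are all made up for illustration, and a dedicated testing tool would normally generate queries like these for you.

```python
import sqlite3

# Hypothetical checks: each query counts offending rows, so 0 means "pass".
CHECKS = {
    "orders.id is unique":
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1) AS dupes",
    "orders.customer_id is not null":
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "orders.customer_id references customers.id":
        "SELECT COUNT(*) FROM orders o "
        "LEFT JOIN customers c ON o.customer_id = c.id "
        "WHERE o.customer_id IS NOT NULL AND c.id IS NULL",
}


def run_checks(conn, checks):
    """Return {check name: number of offending rows} for every check."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


def build_sample_db():
    """Create an in-memory database with deliberately flawed sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER);
        CREATE TABLE orders (id INTEGER, customer_id INTEGER);
        INSERT INTO customers VALUES (1), (2);
        INSERT INTO orders VALUES (10, 1), (11, 2), (11, 2), (12, NULL), (13, 99);
    """)
    return conn


if __name__ == "__main__":
    for name, bad_rows in run_checks(build_sample_db(), CHECKS).items():
        print("PASS" if bad_rows == 0 else "FAIL", name, bad_rows)
```

Because each check is plain SQL, the same definitions can in principle run against any system that accepts the query, which is the portability the paragraph above is after.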

Setting Up Test Data

Setting up your test data is not easy. You can test data applications using either mock data or real data, and both options come with their own sets of problems.

It is hard to emulate real scenarios with fake data, especially when you’re just starting out with a data application and you don’t have a deep understanding of the profile of the real data. Using real data has challenges related to privacy, security, compliance, and more, so let’s look at some specific issues for both of these scenarios.

Handling Data Dumps from Production

Once your team has reached an agreement that testing is a good thing, the next step is to provide data for testing the data application in lower environments.

Taking data dumps from one environment, copying them to another, and restoring them in that environment is slow, expensive, and prone to error if it’s not fully automated. If your infrastructure isn’t at a maturity level where you can bring whole environments up and down with the click of a button or a simple command, setting up real test data can be challenging since it’ll be a manual affair.

Working with Increased Data Volumes

The challenges with making real data available for testing get dramatically more painful when data volumes increase. Backing up and restoring data on demand can be a blocking I/O operation, resulting in degraded database performance and unexpected behavior in the data application.

That’s one of the reasons why automating test data availability is usually a matter for data-focused DevOps teams. They can help devise a plan or a schedule for making fresh data available for testing without impacting the data applications in production.

Developing a Cleanup Strategy

Sometimes you just can’t avoid using production data for testing purposes. When that’s the case, it’s essential to have a strategy to clean the data up after running the tests to ensure that there are no security risks related to that data.

Effective data deletion and cleanup strategies work only when good data governance, auditing, and compliance are in place. Even in lower environments, you need to ensure that people aren’t able to access highly sensitive material, such as PII and PHI data.

One of the best ways to limit this exposure is to ensure that the data doesn’t stay in the testing environment longer than it needs to. Combined with the controls above, authentication and authorization can also help prevent any significant problems concerning production data.
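One way to keep production data from lingering is a retention check that runs on a schedule. The sketch below assumes you record when each test dataset was loaded. The dataset names, the three-day window, and the expired_datasets helper are hypothetical, and the actual deletion depends on your platform.

```python
from datetime import datetime, timedelta, timezone


def expired_datasets(loaded_at, max_age, now=None):
    """Return the names of test datasets older than max_age, oldest first."""
    now = now or datetime.now(timezone.utc)
    stale = [(ts, name) for name, ts in loaded_at.items() if now - ts > max_age]
    return [name for ts, name in sorted(stale)]


if __name__ == "__main__":
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    loaded_at = {
        "orders_prod_copy": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "customers_prod_copy": datetime(2024, 1, 9, 12, tzinfo=timezone.utc),
    }
    for name in expired_datasets(loaded_at, timedelta(days=3), now=now):
        # Replace this print with the real delete for your platform.
        print("drop", name)
```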

Managing Security Concerns

On a broader note, there are a lot of privacy and security concerns with using production data for testing in lower environments. Many companies don’t have the same restriction, auditing, and compliance mechanisms for development and testing environments as they do for production ones. This leaves a gap for accidental or intentional data security incidents and breaches.

Talking about the risks of having production data in lower environments, Benjamin Ross, the director of Delphix, noted,

“Nonproduction development and testing environments, vital as they are, pose an enormous increase in the surface area of risk and are often the soft underbelly for GDPR compliance.”

Using Mock Data

Many tools generate mock data to emulate a production environment for different types of testing for your data application. Most of these tools help developers either while developing the data applications, which require very low volumes of data, or while running stress and load tests on the data applications, which require very high volumes of data.

In both these scenarios, the general intent is to make the data application work. There’s not a lot of chaos monkeying involved by, say, injecting failure-inducing data in the application. It’s tough to emulate real-world data using a data-mocking application or an API.
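If your mock data never misbehaves, you can inject failure-inducing records yourself. The following sketch uses only Python's standard library. The order schema, the three kinds of corruption, and the make_orders helper are assumptions chosen for illustration, not a substitute for profiling your real data.

```python
import random

# Each corruption mimics a kind of bad record that real feeds tend to contain.
CORRUPTIONS = [
    lambda row: {**row, "amount": None},            # missing value
    lambda row: {**row, "amount": -row["amount"]},  # negative amount
    lambda row: {**row, "currency": "???"},         # unknown currency code
]


def make_orders(n, bad_ratio=0.1, seed=0):
    """Return (rows, bad_positions): n mock orders, some deliberately corrupted."""
    rng = random.Random(seed)  # seeded so a failing test can be reproduced
    rows, bad_positions = [], []
    for i in range(n):
        row = {"id": i + 1, "amount": round(rng.uniform(1, 500), 2), "currency": "USD"}
        if rng.random() < bad_ratio:
            row = rng.choice(CORRUPTIONS)(row)
            bad_positions.append(i)
        rows.append(row)
    return rows, bad_positions


if __name__ == "__main__":
    rows, bad_positions = make_orders(1000, bad_ratio=0.05)
    print(f"{len(bad_positions)} of {len(rows)} mock orders are deliberately broken")
```

Feeding the rows through your pipeline and comparing what it rejects against bad_positions shows whether your validation catches each kind of corruption.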

Managing a Pipeline

Setting up and managing a pipeline isn’t easy. You have to integrate tests into your CI/CD pipeline or data workflows, which can initially create a lot of friction. Pipeline jobs that ran completely fine before may start failing once the tests are integrated. Setting up a testing pipeline for data applications can also be challenging; it needs a fair bit of upskilling on the data team’s part to use the pipeline efficiently.

Setting up a testing pipeline might mean more work in the short term, but in the long term, the benefits significantly outweigh the additional work. Meltano, with its modular framework and flexibility, is an example of a tool that simplifies your test pipeline setup and management experience.

Choosing Where to Test Data

Data testing can be done in many places along an application’s lifecycle, and it’s hard for data teams to know when and where to do it. If you have a complex data application lifecycle with many steps, data sources, and transformations, it’s possible to overdo testing to the point that it impacts your data application’s performance.

It’s always useful to know exactly where to test your data in the data application lifecycle, to ensure that you’re catching issues at the right steps and not brute-forcing tests onto your system.

Appreciating the Tradeoff with Data Application Testing

Testing data applications comes with both costs and benefits. From not having tests at all, you can go to a place where you have too many tests. Both situations are undesirable: running too many tests has its own consequences for your data application, like degraded performance, longer release cycles, and increased costs.

Depending on the type of data you’re dealing with and the kind of data application you’re working on, you need to strike a balance between the two extremes. You have to find the right types and the right number of tests to run for your application. Fortunately, Meltano makes it easy to test both your data and your logic.

Thanks to our guest writer Kovid Rathee for this contribution.

