Production vs Synthetic Data for Testing

#discuss #testdata #softwaretesting

When should we use one over the other?
Which is the best approach?
What criteria should guide the decision?

Top comments (4)

Dian Fay • Dec 19 '18 • Edited

Production data has some issues:

legal or regulatory requirements mandate anonymizing PII, patient data, financials, and so on, which requires extra effort
email addresses, phone numbers, and the like have to be "disarmed" to ensure integration tests can't accidentally reach users
data is changing all the time, so it's more difficult to write stable assertions
paucity of representative data for certain states: if some process requires multiple steps but in practice 99% of users complete it in one go, prod data is insufficient for testing the intermediary stages
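As a rough illustration of the "disarming" point above, here is a minimal Python sketch. The record shape and the function names (`disarm_email`, `disarm_phone`, `disarm_record`) are assumptions made up for this example. It only stops tests from reaching real people; a hash of an address is not anonymization in the legal or regulatory sense.

```python
import hashlib
import re

# Assumption: the reserved .invalid TLD (RFC 2606) is used so that no
# rewritten address can ever be delivered to a real mailbox.
SAFE_DOMAIN = "example.invalid"


def disarm_email(email: str) -> str:
    """Replace an address with a stable, undeliverable stand-in."""
    normalized = email.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).hexdigest()[:12]
    return f"user-{digest}@{SAFE_DOMAIN}"


def disarm_phone(phone: str) -> str:
    """Replace every digit with 0, keeping punctuation so format checks still pass."""
    return re.sub(r"\d", "0", phone)


def disarm_record(record: dict) -> dict:
    """Return a copy of a (hypothetical) user record with contact fields disarmed."""
    safe = dict(record)
    if safe.get("email"):
        safe["email"] = disarm_email(safe["email"])
    if safe.get("phone"):
        safe["phone"] = disarm_phone(safe["phone"])
    return safe
```

Because the stand-in address is derived from the original, the same user maps to the same fake address on every refresh, which keeps joins and uniqueness constraints intact.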

It's still important to test against production data, but it comes second. Functional coverage matters more, and you can only be sure of covering the most possibilities if you generate fixture data. If you're working in a smaller team, testing against live data sets will likely be all manual.

Fixtures are tricky to do right, and the obvious solution of a monolithic test dataset is a dead end for reasons best explained by Jorge Luis Borges. I wrote something a while ago about a more flexible modular approach based on the post-structuralist idea of rhizomes, and published a drop-in JavaScript implementation; the PHP O/RM Doctrine does something similar as well.
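To make the contrast with a monolithic dataset concrete, here is a small Python sketch of one possible modular arrangement. It is not the JavaScript implementation mentioned above, nor Doctrine's; the registry, the `fixture` decorator, `load`, and the `user`/`order`/`shipment` entities are all hypothetical. Each fixture declares what it needs, and a test loads only the pieces it asks for.

```python
from typing import Callable, Dict, List, Tuple

# Registry of small fixture builders: name -> (dependencies, builder).
# Every name below is made up for illustration.
Builder = Callable[[Dict[str, dict]], dict]
REGISTRY: Dict[str, Tuple[List[str], Builder]] = {}


def fixture(name: str, needs: Tuple[str, ...] = ()):
    """Register a builder under a name, along with the fixtures it depends on."""
    def register(fn: Builder) -> Builder:
        REGISTRY[name] = (list(needs), fn)
        return fn
    return register


@fixture("user")
def make_user(deps):
    return {"id": 1, "name": "test user"}


@fixture("order", needs=("user",))
def make_order(deps):
    return {"id": 10, "user_id": deps["user"]["id"], "status": "pending"}


@fixture("shipment", needs=("order",))
def make_shipment(deps):
    return {"id": 100, "order_id": deps["order"]["id"]}


def load(*names: str) -> Dict[str, dict]:
    """Build the requested fixtures plus whatever they depend on, nothing else."""
    built: Dict[str, dict] = {}

    def build(name: str) -> None:
        if name in built:
            return
        needs, builder = REGISTRY[name]
        for dep in needs:
            build(dep)
        built[name] = builder(built)

    for name in names:
        build(name)
    return built
```

Calling `load("order")` builds a user and an order but no shipment, so a test about orders never pays for, or breaks on, unrelated data. The sketch does not detect circular dependencies.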

Phil Ashby • Dec 18 '18

We have tremendous fun with 100+ suppliers (API vendors of various sorts), who all bring some sort of test interface, frequently synthetic data, none of them compatible with each other...what to do?

we mock them and use a consistent local synthetic data set for local testing (of our combinatorial logic or other internal testing).
we use their synthetic data for point testing such as connectivity/credentials checks (when in UAT/Stage/Live - all production)
we use production data, hopefully with test markers to avoid side effects (like marking someone's credit history), for end-to-end tests across multiple providers.
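The first of these, mocking suppliers against one consistent local data set, might look roughly like the Python sketch below. The `CreditSupplier` interface, the `approve_loan` rule, and the customer records are all invented for illustration and do not describe any real vendor API.

```python
from typing import Protocol

# Hypothetical local synthetic data set, shared by every mocked supplier
# so that all mocks agree about who the test customers are.
SYNTHETIC_CUSTOMERS = {
    "cust-001": {"name": "Test Customer One", "credit_score": 720},
    "cust-002": {"name": "Test Customer Two", "credit_score": 540},
}


class CreditSupplier(Protocol):
    """The interface our own code depends on (an assumed shape)."""

    def credit_score(self, customer_id: str) -> int: ...


class MockCreditSupplier:
    """Stands in for a vendor API and answers from the local synthetic data set."""

    def credit_score(self, customer_id: str) -> int:
        return SYNTHETIC_CUSTOMERS[customer_id]["credit_score"]


def approve_loan(
    supplier_a: CreditSupplier, supplier_b: CreditSupplier, customer_id: str
) -> bool:
    """Example internal logic: both suppliers must report a score of at least 600."""
    suppliers = (supplier_a, supplier_b)
    return all(s.credit_score(customer_id) >= 600 for s in suppliers)
```

With the mocks in place, `approve_loan(MockCreditSupplier(), MockCreditSupplier(), "cust-001")` exercises the combining logic without touching any vendor's test interface.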

Alan Barr • Dec 18 '18

This is super hard. I do not quite understand how people can abstract away the complexity of data and state. We seem to do this for configuration and for systems, but when it comes to "a user has this property at this time with this value", everything goes out the window. I am still not sure what the best approach is. Maybe it is manageable if the system is small enough, but once you cross system boundaries it feels like this all falls apart. An alternative approach to test data might entail capturing the state of a user at a given time and reproducing that in the staging system, or disabling changes in the production system for that user to reproduce an issue. This feels like one of the last properties software teams think about in development.

Synthesized • Oct 6 '19

Synthetic data is often generated to represent the production data.

It is normally used to protect the privacy and confidentiality of production data, e.g. when building and testing systems such as fraud detection and churn prediction.

There are a number of approaches to generating synthetic data, described by the folks from Synthesized (synthesized.io/) in this blog post:

blog.synthesized.io/2018/11/28/thr...
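For a sense of what "generated to represent the production data" can mean at its very simplest, here is a Python sketch that samples each column independently from summary statistics. It is an assumption-laden toy, not one of the approaches from the linked post: the function names are made up, correlations between columns are lost, categorical values are reused verbatim, and it offers no formal privacy guarantee.

```python
import random
import statistics
from collections import Counter


def fit_column(values):
    """Summarise one column: mean/stdev for numbers, frequencies for anything else."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_number:
        return ("numeric", statistics.mean(values), statistics.pstdev(values))
    counts = Counter(values)
    return ("categorical", list(counts.keys()), list(counts.values()))


def synthesize(rows, n, seed=0):
    """Generate n synthetic rows whose columns are sampled independently."""
    rng = random.Random(seed)  # seeded so test data is reproducible
    columns = {key: fit_column([row[key] for row in rows]) for key in rows[0]}
    out = []
    for _ in range(n):
        row = {}
        for key, model in columns.items():
            if model[0] == "numeric":
                row[key] = rng.gauss(model[1], model[2])
            else:
                row[key] = rng.choices(model[1], weights=model[2], k=1)[0]
        out.append(row)
    return out
```

Fixing the seed keeps the generated set stable between runs, which helps with the assertion-stability problem raised earlier in the thread.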

DEV Community
