For small projects, it's very easy -- if not always good practice -- to grab a copy of the production database and save it locally for use in development.
However, as a project grows, this becomes increasingly time-consuming and increasingly unsafe (as individual team members should probably not be walking around with tons of actual user data on their local machines).
Further, for organizations that are open source, it certainly doesn't make sense to freely distribute copies of production data to anyone who wants it.
So, my question to you is this: how do you populate your development databases?
Top comments (19)
Wow, I was just talking about this with @maestromac. We seed our database with dummy data as our default development process. Even though we could, we don't grab copies of the production DB unless we also have some safety constraints in place. I just wouldn't want to do something bad by accident in that regard.
But our process is inefficient and is not an ideal development experience. I feel like it might be better if we actively maintained a central dummy database, manicured by reseeding it through a well-thought-out process of creating dummy data or preprocessing production data. We'd store that in the cloud, and every time someone wants to reset their dev environment, they'd download the ready-to-go DB, primed for a beautiful, fresh dev setup.
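For what it's worth, a dummy-seeding step along those lines might look something like this (a hedged sketch — the names, fields, and helper are invented, not our actual seeds). A fixed RNG seed keeps every developer's database identical run to run:

```ruby
require "digest"

FIRST_NAMES = %w[Ada Grace Linus Barbara Dennis].freeze

# Build deterministic dummy user records; same seed => same data everywhere.
def build_dummy_users(count, rng: Random.new(42))
  (1..count).map do |i|
    name = FIRST_NAMES[rng.rand(FIRST_NAMES.size)]
    {
      username: "#{name.downcase}#{i}",            # unique, predictable handle
      email:    "#{name.downcase}#{i}@example.com",
      # never a real secret -- just a stable placeholder digest
      password_digest: Digest::SHA256.hexdigest("devpassword#{i}")
    }
  end
end

users = build_dummy_users(50)
# In a Rails seeds.rb you would then do something like:
# users.each { |attrs| User.create!(attrs) }
```

The point of the fixed seed is that "reset my dev environment" always produces the same data, which makes bugs reproducible across machines.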
Does this make sense or am I way off-base?
Yup, this makes a lot of sense -- thanks, @ben!
Though, I'm curious about what your data structure looks like that you're able to generate dummy data as part of your development process.
Prod has the benefit of giving us virtually every imaginable permutation of data, whereas manually generating it will likely result in blind spots that make it tougher to catch things during the QA process.
I definitely agree with respect to the security concerns around grabbing prod, and I really like the idea of trying to maintain a central dummy database!
Yeah, our dummy data system has its flaws and mostly works because it's worked so far. Far from a universal-forever solution, but we have a small team and are gradually building out concrete solutions as our team and data-scaling needs necessitate them.
The centrally maintained dummy DB could be based off production, but I think it's key to have it be its own project with dedicated maintenance (from few or many) and setting the team up for success.
I'd really love to hear from others, especially those in large, hairy, long-lived projects.
I've worked at a couple major online retailers with long-lived databases and they've been grotty. Basically, after a decade or more, the only way you end up with a prod database you could synthetically dummy up is if all the members that have ever worked on the team (including management) have kept technical debt near zero. Not. Gunna. Happen. Deadlines, deferred maintenance, and shifting business goals impose compromises that have long-lived effects on how other features are built, etc. The ripples go on and on.
It becomes insane to try to maintain a synthetic dev/test database that duplicates all the things that users do, or have done, possibly using features that no longer exist, or by exploiting bugs. At some point, trying to dummy up "production-adjacent" data is as much work as just using a copy of production.
The best solution I've seen used a nightly snapshot of production together with Docker so that all devs did their daily work with a full copy of the production DB. (With user passwords, etc. stripped.) We could let our dev DB get paved over every night, or flip a switch so that the replication process would leave it alone.
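To illustrate the stripping step (table and column names here are made up, not the actual setup), the sanitize pass can be as simple as generating a batch of UPDATEs to run against the copy before devs ever see it:

```ruby
# Map of tables to columns that must never reach a dev machine.
SENSITIVE = {
  "users"    => %w[encrypted_password reset_token],
  "payments" => %w[card_number]
}.freeze

# Emit one UPDATE per sensitive column; the nightly job would run these
# against the snapshot copy, never against production itself.
def scrub_statements(sensitive = SENSITIVE)
  sensitive.flat_map do |table, cols|
    cols.map { |col| "UPDATE #{table} SET #{col} = NULL;" }
  end
end
```

Running the scrub inside the snapshot pipeline (rather than trusting each dev to do it) is what makes the "full copy on every machine" model defensible.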
This is such an important conversation. I have had the same dilemma, and over the years I ended up with two ways of solving it for two different use cases.
For development, regression testing and performance testing, I created a product simulator that calls the same product REST APIs called by the UI. That generates synthetic data, emulating real user usage patterns. This not only creates data for development, but also exercises all the major back-end functions of the product, and it can recreate the data from scratch even after major changes to the schemas and data structures (which is problematic with snapshots). Additionally, this forms the base of a performance testing framework, meaning that I can measure the performance of various areas of the back-end product as the data is being generated.
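To make the simulator idea concrete, here's a toy sketch (the endpoints and client are stand-ins, not the actual product APIs). The key property is that it drives the same write paths the UI does, so seeding doubles as a smoke test:

```ruby
# A product simulator that emulates user sessions through the API.
# `client` is anything responding to post(path, body) -- e.g. a thin
# HTTP wrapper pointed at a dev server.
class Simulator
  def initialize(client, rng: Random.new(7))
    @client = client
    @rng = rng   # fixed seed => reproducible "usage patterns"
  end

  def simulate_user_session(user_id)
    @client.post("/api/articles", { author_id: user_id, title: "Post #{@rng.rand(1000)}" })
    # Not every user comments -- mimic that with a coin flip.
    @client.post("/api/comments", { author_id: user_id, body: "Nice!" }) if @rng.rand < 0.5
  end
end
```

Because everything goes through the public API, schema changes don't invalidate the generator the way they invalidate a snapshot.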
For testing releases of new builds and data migrations, I run a staging environment with a copy of the production environment, but only a subset of the real data, where all the PII has been obfuscated. This is important because real data has all sorts of corner cases and variants that are difficult to completely emulate synthetically.
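A minimal sketch of deterministic PII masking (my own illustration, not the commenter's actual tooling): hashing with a salt means the same input always maps to the same fake value, so foreign keys and join behavior survive the obfuscation:

```ruby
require "digest"

# Rotate this salt each staging refresh so masked values can't be
# correlated across dumps.
SALT = "rotate-me-per-refresh"

# Same real email in => same fake email out, but irreversible.
def mask_email(email)
  user_hash = Digest::SHA256.hexdigest(SALT + email)[0, 12]
  "user-#{user_hash}@example.invalid"
end
```

Determinism is the important bit: if `orders.customer_email` and `users.email` are masked independently but identically, the rows still join up in staging.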
I just published an article yesterday talking about putting a project to run fast after you get the code, here: dev.to/taq/driver-driven-developme...
As I'm using Rails for my web apps, I like to feed the seeds file with everything I need to run the project in my development environment. This gives me:
A way to run my project fast, from zero, with no extra configuration needed. Every developer who has access to the code can do that.
Samples of the data needed to run the project, and even to build my fixtures/factories.
A way to reset the development database and build it again, fast, if needed.
Even with bigger projects this works for me. When there is massive data to insert in the seeds, for example when I need to load all the states and cities, I put the data in external files and load them from the seeds.
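As a sketch of that external-file pattern (the file layout and column names are illustrative), seeds.rb stays small and just streams the file in:

```ruby
require "csv"

# Parse a states/cities CSV into attribute hashes ready for seeding.
def load_cities(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    { state: row["state"], city: row["city"] }
  end
end

# In the real project this would be File.read("db/data/cities.csv").
cities = load_cities(<<~CSV)
  state,city
  SP,Sao Paulo
  RJ,Rio de Janeiro
CSV
# In seeds.rb: cities.each { |attrs| City.find_or_create_by!(attrs) }
```

Keeping the bulk data out of seeds.rb also means diffs to the data don't churn the seed script itself.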
I use FactoryGirl to seed development data. Since I work at a financial company, I don't use production data, even sanitized production data for development. The factories share as much production code as possible, so that a seeded model is exactly the same as a manually created model, but it's just much faster.
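To illustrate why going through the real model code matters (this is a toy stand-in, not actual FactoryGirl code): the seeded record hits the same validations a production record would, so the factory can't create states the app itself couldn't:

```ruby
# A tiny "model" with the kind of validation production code enforces.
class Account
  attr_reader :email, :balance_cents

  def initialize(email:, balance_cents: 0)
    raise ArgumentError, "invalid email" unless email.include?("@")
    raise ArgumentError, "negative balance" if balance_cents < 0
    @email = email
    @balance_cents = balance_cents
  end
end

# A factory-style helper: sensible defaults, overridable per test/seed,
# but always constructed through the real model code path.
def build_account(overrides = {})
  defaults = { email: "dev@example.com", balance_cents: 10_00 }
  Account.new(**defaults.merge(overrides))
end
```

Raw SQL inserts would skip those checks entirely, which is how dev databases drift into shapes production can never reach.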
We use a slimmed down version of the production database dump which we call the "nocustomer" dump.
It's quite involved to make this all work (for my taste) but we've yet to find a better solution.
Of course, the "obvious" solution is to simply generate fake data. But: how do we create it?

The production DB is PostgreSQL, so we:

- `pg_dump` the schema
- use the same dump to import only the DDL into a new schema
- for each table, run `INSERT INTO <newschema>.<table> SELECT * FROM <oldschema>.<table> WHERE … is not production data 😀 …`
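That per-table copy step could be scripted along these lines (schema names and the example filter are placeholders — the real WHERE clauses are whatever excludes customer rows):

```ruby
# Generate one INSERT ... SELECT per table; a nil filter means "copy
# everything", a string becomes the table's WHERE clause.
def nocustomer_copy_sql(tables, from: "public", to: "nocustomer")
  tables.map do |table, where|
    sql = "INSERT INTO #{to}.#{table} SELECT * FROM #{from}.#{table}"
    sql += " WHERE #{where}" if where
    sql + ";"
  end
end

statements = nocustomer_copy_sql({
  "articles" => nil,                # safe to copy wholesale
  "users"    => "is_staff = true"   # hypothetical customer-excluding filter
})
```

Generating the statements from a table list keeps the filter logic in one reviewable place instead of scattered across a hand-written SQL file.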
Within the developer VM, a special command can then download this dump, import it, rebuild ElasticSearch, etc.
Oh, and this is only the really high-level stuff. There are so many details in between to make this work seamlessly.
The process takes a couple of hours (4-6 currently), but at least it's fully automated (except triggering it).
The instance is quite expensive (again, for my taste) but when there's a need for this dump, you usually want it as quick as possible.
While writing this down and re-reading it, I almost can't believe it but, yes, this works. It occasionally fails (like: once or twice a year) but otherwise is rock solid.
For a simple data structure, I see how it might be possible to just take a horizontal slice of production data and then anonymize it. (This is especially true for NoSQL databases.)
However, for more complicated SQL databases with lots of joins, I'm not sure this would be an option because so much of the data is apt to be reliant on so much other data outside of the 'slice' you'd want to take.
I like to use the faker.js library to populate my development databases. It's easy to use and simple to integrate into test suites or custom bootstrapping scripts.
It has a wide, well-documented API that covers almost anything you need. It supports many locales (i18n), which can be useful for seeding certain types of applications. Faker.js has been in development for seven years and currently has 124 contributors and 11k stars on GitHub.
It's a very nifty piece of software. We never use production data in testing or development environments. Everything is always generated with faker bootstrapping scripts customized to the application. Works very well. The localization part can also help with testing UTF-8 encodings.
Disclaimer: I am the author of faker.js
Ah, so, a little story, that won't answer your question in any way other than please don't do it this way.
Way back when, I worked for a lovely company making, amongst other things, telco billing systems. We wrote them largely in server-side Javascript, which was a wild idea back in ~1999 that would surely never go mainstream. Oh, we also did Agile before it was Agile (anyone remember Extreme Programming?) and a bunch of other stuff that's now rather more mundane.
Anyway, in order to actually Get Stuff Done, we needed to populate the dev systems with test data, so we could test stuff and things. Sorry, that was obvious. Some of the test accounts would need to be large to test overflow issues, others would need outstanding payments to generate those "red reminders", and so on.
To begin with, we sober-minded developers would enter data such as our own names and addresses. Soon, though, a frantic competition to come up with the most amusing test data emerged. Saddam Hussein, the late dictator of Iraq, ended up in the generic dev database backup. Then, so did one "Liz Windsor", who lived in Buckingham Palace. When we wanted to spin up a system, we'd just restore from this backup. Simples, right? What could possibly go wrong?
So, the great day arrived when we would go live. To ensure that nothing went wrong, we installed the live system in the tried and tested way we'd installed the development and test systems. That was, of course, the least risky way of doing things. Simples, right? What could possibly go... Oh.
Because the next month, several developers got a letter on their doorstep, from the customer, asking them to pay an often vast telephone bill. For many developers, the bill clearly listed all the premium-rate porn chat lines (it was the '90s, and yes, these things existed). We spent no small amount of time carefully explaining the situation to the customer and getting those bills cleared from the system. It wasn't pretty.
Then someone remembered that Her Majesty The Queen Elizabeth II, Queen of The United Kingdom, Fid Def and all that jazz, would also have had a letter carried to her by a highly trained footman. We all thought that the nice red writing demanding immediate payment would, no doubt, contrast nicely with the gold platter it would be borne on.
One other idea: for organizations that use frameworks like Rails where the associations between tables are explicitly expressed through models, perhaps it would be possible to somehow take a slice of the production database (anonymizing it of course) by taking a slice from key tables and then traversing the associations to gather just the right amount of data.
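A toy sketch of that traversal (the reference map is hard-coded here; in Rails you could derive the edges from model reflections instead). Starting from a few seed records, a breadth-first walk collects every record the slice depends on:

```ruby
# record => list of [child_table, child_id] references it requires.
# Invented data standing in for what association reflection would yield.
REFS = {
  [:users, 1]      => [[:posts, 10]],
  [:posts, 10]     => [[:comments, 100], [:comments, 101]],
  [:comments, 100] => [],
  [:comments, 101] => []
}.freeze

# BFS from the seed records, gathering a referentially complete slice.
def collect_slice(seeds, refs = REFS)
  seen  = []
  queue = seeds.dup
  until queue.empty?
    record = queue.shift
    next if seen.include?(record)
    seen << record
    queue.concat(refs.fetch(record, []))
  end
  seen
end

slice = collect_slice([[:users, 1]])
```

The `seen` check matters for real schemas, where association graphs contain cycles (a user's posts point back at the user).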
We do a ton of development against Oracle databases. In any application that has sensitive data, Oracle Data Masking and Subsetting Pack (an option for Oracle Enterprise Edition) can take all the production data and massage it so that it can't be used to identify production information. All referential integrity is maintained throughout this process.