Effective Data Management in Automated Testing

It is hard to overstate the importance of test data management in automated testing. If you choose the wrong strategy, your testing will not be effective: data management is one of the factors that either helps you or drags you down. This article covers how and where test data should be kept to make testing as painless and effective as possible.

Dark Side

If at least one of these bullets is true for your project, make sure you read the article through:

  • Data and test code are kept in separate files
  • Data is loaded once for all cases at the beginning of a test run
  • Data is hardcoded either in separate files or in the test code itself
  • Data is explicitly described even if it’s not needed for the case

Now let's go over each item in more detail. If you are only interested in the final results, jump to the end of the article.

Keep data in code

The most obvious problem with separate files is that reading a test case in full means jumping between multiple files. This may seem like an exaggerated issue, but:

  1. It's not clear what the test is doing without the data. Usually the name of the test speaks for itself, but parameterized tests often have fairly generic names, which complicates things.
  2. Once you have thousands of tests, bugs creep into them if a single test can't be read at a glance. Both test code and test data may have defects.
  3. Code reviews get harder. Instead of seeing the whole thing in one place, you need to scroll back and forth. Code reviews are usually done in browsers, which offer few ways to jump between parts of the code. People are often careless during code reviews because the process is so draining, so don't make it even harder.
  4. To create collections (50 elements?), you have to describe each element separately, which is a lo-o-ot of extra letters.
  5. Separate data files don't allow for data randomization (stay tuned).
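
As a rough illustration of point 4, here is a minimal sketch of a 50-element collection built inside the test itself. The OrderLine class and the totalQuantity method are hypothetical names invented for this example; they stand in for whatever entity and code under test your project has.

```groovy
// Hypothetical domain class, introduced only for this illustration.
class OrderLine {
    String product
    int quantity
}

// Hypothetical code under test: sums the quantities of all lines.
int totalQuantity(List<OrderLine> lines) {
    lines.sum { it.quantity } as int
}

// The data lives in the test: 50 elements described in one expression
// instead of 50 hand-written entries in a separate data file.
List<OrderLine> lines = (1..50).collect { new OrderLine(product: "product-$it", quantity: 2) }

// Data and expectation sit next to each other, so the test reads at a glance.
assert lines.size() == 50
assert totalQuantity(lines) == 100
```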

Load data for each test separately

Sometimes people have large data sets that are pre-loaded into the database before the actual test execution starts. The justification you may hear is that a bulk operation is faster than loading data per test. While this may be faster for big suites, it's certainly much slower when you need to run a small subset of the tests. Typically we run tests in these situations:

  • A per-commit/nightly/manual run of the full test suite
  • Creating and running new test cases
  • Re-running some of the tests that may've given a false-negative result
  • "Hey, Jimmy, you broke the code - please run this test to reproduce the bug"

In most of these situations you only need to run a small set of tests. If you load data into the SUT (system under test) on a per-case basis, the run time grows linearly with the number of tests, and that's natural. What's not natural is waiting 10 minutes to run a single test.
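
A minimal sketch of per-test data loading might look like the following. FakeSut is a hypothetical in-memory stand-in for a real system under test, and the test is written as a plain closure so the example runs as a standalone Groovy script; in a real suite it would be a JUnit test method.

```groovy
// Hypothetical in-memory stand-in for the system under test (SUT).
class FakeSut {
    Map<String, Map> users = [:]

    Map createUser(Map user) {
        users[user.username] = user
        user
    }

    void updateAddress(String username, String address) {
        users[username].address = address
    }
}

FakeSut sut = new FakeSut()

// The test creates exactly the data it needs, right before using it.
def updatingAddressIsPersisted = {
    Map user = sut.createUser(username: 'user-' + UUID.randomUUID(), address: 'Old Street 1')
    sut.updateAddress(user.username, 'New Street 2')
    assert sut.users[user.username].address == 'New Street 2'
}

// Running a single test loads a single record, not the whole suite's data set.
updatingAddressIsPersisted()
assert sut.users.size() == 1
```

With this shape, the cost of data setup is paid only by the tests that actually run.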

Randomize data instead of hardcoding it

The downsides of having hardcoded data are:

  • Data must be picked very carefully to avoid collisions, especially for unique fields like emails.
  • It's impossible to start several test runs simultaneously against a single SUT instance. Race conditions appear when tests try to use the same data.
  • Cleanup needs to be done before every run.
  • To create collections (50 elements?), you have to describe each element separately, which is a lo-o-ot of extra letters (again).
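
Random data generators can be very small. The sketch below shows one possible implementation of an alphanumeric(length) helper and a randomEmail() helper built on top of it. Both implementations are assumptions made for illustration, not the API of any particular library, and the example.com domain is a placeholder.

```groovy
import java.util.concurrent.ThreadLocalRandom

// Assumed helper: returns a random string of the given length
// made of ASCII letters and digits.
String alphanumeric(int length) {
    String alphabet = (('a'..'z') + ('A'..'Z') + ('0'..'9')).join()
    def random = ThreadLocalRandom.current()
    (0..<length).collect { alphabet[random.nextInt(alphabet.length())] }.join()
}

// Assumed helper: a fresh email for a unique field, so parallel runs
// against the same SUT are very unlikely to collide.
String randomEmail() {
    alphanumeric(10) + '@example.com'
}

assert alphanumeric(30).length() == 30
assert randomEmail().endsWith('@example.com')
```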

Read more about the topic in a dedicated article: Randomized Testing.

Keep unnecessary data out of tests

Suppose you're testing a user profile: you open it and then update the user's address. Do you need the user's username, email and the rest of the information in this test? No, the only thing you should care about is the address. Otherwise your test cases (or data files) grow enormously large and become extremely hard to read.

To overcome this problem, define a default set of data for each entity in the system. For example, if a user must have an age specified at registration, create a class Person with the field age set to a default value. Even better, make the value random but valid, which is especially important for unique fields. Then, when you create an instance of Person in your tests, its fields are filled in by default. Testing the age? Then change it in the test itself: override the default value with what that particular case needs.
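
One way the Person class could look is sketched below. The field names other than age, the 18 to 99 age range, and the example.com domain are assumptions chosen for illustration; use whatever your system considers valid.

```groovy
import java.util.concurrent.ThreadLocalRandom

// Every field gets a default that is random but valid,
// so a test only has to mention the fields it cares about.
class Person {
    String username = 'user-' + UUID.randomUUID()
    String email = username + '@example.com'
    // Assumption: valid ages in this hypothetical system are 18..99.
    int age = ThreadLocalRandom.current().nextInt(18, 100)
    String address = 'Default Street 1'
}

// A test about age overrides only the age; the rest keeps its defaults.
Person minor = new Person(age: 17)
assert minor.age == 17
assert minor.email == minor.username + '@example.com'
```

Groovy's named-argument constructor runs the field initializers first and then applies the overrides, which is what makes the "defaults plus one explicit value" style so compact.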

If you're not sure how these classes fit into your test architecture, read Evolution of Automation Test Engineer for details on how your test layers should look, and pay special attention to the Business Layer.

Light Side

Here is where you should end up with your test data:

  • Data is loaded for each test case separately. Avoid pre-loading data for the whole test suite at the beginning of the test run.
  • Data is defined inside of test cases. Avoid keeping data in separate files - it should reside in the code itself.
  • Data is unique for each test run. Use Randomized Testing as opposed to hardcoded values.
  • Only important data is put into tests. Unimportant data should be generated outside of the test.

An example of such a test (written in Groovy with JUnit):

@Test
void 'create client with Username Max Length must pass validation'() {
   Client client = clients.create(new Client(username: alphanumeric(30))) // only the username is set explicitly
}

Check out Test Pyramid for more examples.


Projects that don't leverage the Light Force often suffer from long development cycles because running the tests and analyzing the results takes so much time. Another big issue with these projects is that they require a lot of people to maintain the tests. Test data is usually not the only problem in such projects, and the saddest thing is that you often can't fix the rest of the issues without first fixing data management. So be a true Jedi and do it right from the beginning.

And remember - there are exceptions to every rule.