Doubling Data for Performance Testing
Or: The Most Impressive T-SQL Script I’ve Ever Written
I was recently working on a new application. After three months in the field, users were starting to complain about performance issues. We had done some limited performance tuning for the first release, and more as part of the second release, but new issues were popping up as more data got entered into the system. We could have continued fixing issues as they came up, one release at a time, but we wanted to get ahead of the problem, and the client wanted to know that the system would remain usable without developer intervention for a few years at least.
The nature of the business was such that the rate of data entry and the number of concurrent users shouldn't change much over time. That meant increasing the amount of data would be a good approximation of how the system would look in the future. But how do you increase the amount of data in a normalized relational database? Entering realistic test data by hand would be difficult, and unrealistic test data can produce unrealistic test results. Bad results mean we waste time fixing issues that would never appear, and never get a chance to detect the ones that will.
I proposed creating a T-SQL script that would double all the existing data. Running it once would simulate six months of usage; running it again, a year; twice more and we'd be simulating four years. Once it was approved, it took me a couple of days to finish, with about half of that going to testing and debugging. The result is the most impressive T-SQL script I've ever written.
The script I wrote didn't do the actual duplication itself; it combined hand-entered metadata with data from sys.tables and sys.columns to generate a much bigger script that did. Not only was it faster to write this way, it also made issues easier to fix, and it let the script be reused after columns were added to the database.
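To give a feel for the generation step, here is a minimal sketch (not the original script) that builds a column list from the catalog views and prints a generated INSERT statement. The table name is a hypothetical example, and STRING_AGG assumes SQL Server 2017 or later.

    -- Sketch: build a column list for one table from the catalog views and
    -- print a generated INSERT ... SELECT. The real script substituted new
    -- GUIDs for key columns; this only shows the generation pattern.
    DECLARE @TableName sysname = N'Customer';   -- hypothetical table
    DECLARE @Columns   nvarchar(max);

    SELECT @Columns = STRING_AGG(CAST(QUOTENAME(c.name) AS nvarchar(max)), N', ')
    FROM sys.columns AS c
    JOIN sys.tables  AS t ON t.object_id = c.object_id
    WHERE t.name = @TableName
      AND c.is_computed = 0;

    PRINT N'INSERT INTO ' + QUOTENAME(@TableName) + N' (' + @Columns + N')'
        + N' SELECT ' + @Columns
        + N' FROM '   + QUOTENAME(@TableName) + N';';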
I had one table variable with the list of tables to copy, and another defining the relationships between foreign key columns and their source tables. Most foreign key columns had the same name in every table they appeared in, so a single mapping was usually enough to cover all of them. Most of the mappings could also be derived from the table list itself, so only a handful of extra relationships and special column rules had to be entered by hand.
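The metadata looked roughly like this; the table names and the <Table>Id naming convention are illustrative stand-ins rather than the real schema.

    -- Rough shape of the metadata (names are illustrative).
    DECLARE @TablesToCopy TABLE (TableName sysname PRIMARY KEY);
    DECLARE @KeyMappings  TABLE (
        ColumnName  sysname,   -- foreign key column as it appears in child tables
        SourceTable sysname    -- table whose primary key it references
    );

    INSERT INTO @TablesToCopy (TableName)
    VALUES (N'Customer'), (N'Order'), (N'OrderLine');

    -- Most mappings follow the convention <Table>Id -> <Table>, so they can be
    -- generated from the table list; only the exceptions are entered by hand.
    INSERT INTO @KeyMappings (ColumnName, SourceTable)
    SELECT TableName + N'Id', TableName
    FROM @TablesToCopy;

    INSERT INTO @KeyMappings (ColumnName, SourceTable)
    VALUES (N'BillingAddressId', N'Address');  -- hypothetical manual exception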
Another factor that helped was our use of GUIDs for all primary keys. Because they can be generated before a row is inserted, it was possible to build the mapping from old keys to new keys at the start of the script. It also meant I could use a single INSERT statement for each table, and the order of execution only mattered where foreign key constraints existed.
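Here's a rough sketch of that idea for a single hypothetical Customer table: the old-to-new key map is filled with NEWID() up front, and the copy is then one INSERT ... SELECT joining through the map. Foreign keys in child tables would be remapped through the same kind of map.

    -- Sketch of the GUID remapping for one table (hypothetical schema).
    DECLARE @KeyMap TABLE (
        OldId uniqueidentifier PRIMARY KEY,
        NewId uniqueidentifier NOT NULL
    );

    -- New keys exist before any rows are inserted, so the mapping can be
    -- built at the start of the script.
    INSERT INTO @KeyMap (OldId, NewId)
    SELECT CustomerId, NEWID()
    FROM dbo.Customer;

    -- One INSERT per table: copy every row, swapping in the new key.
    INSERT INTO dbo.Customer (CustomerId, Name, CreatedOn)
    SELECT m.NewId, c.Name, c.CreatedOn
    FROM dbo.Customer AS c
    JOIN @KeyMap      AS m ON m.OldId = c.CustomerId;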
The results were tremendous. We found a bunch of issues we wouldn’t have found otherwise, and had a fairly solid indication of how the application would behave years in the future.