FEB 14, 2017
Written by M. Scott Ford

Check Your Work: Ensuring Your Refactoring Doesn't Introduce Bugs

Code refactoring, as defined by Wikipedia, is “the process of restructuring existing computer code — changing the factoring — without changing its external behavior.”

As such, refactored code should introduce no behavior changes. Otherwise, you’re not refactoring. You’re refactoring and doing something else.

If you’re strict about that, then the best way to ensure your refactored code works the same as the original implementation is to run the old and the new versions and make sure they still give you the same result.

Test-Driven Development

Test-Driven Development (TDD) is one way to achieve that goal. This confidence is gained by first authoring tests to capture the verification and validation criteria for your original code. If you’re on a team that practices TDD, chances are high that you won’t actually have to write this; you’ll just need to confirm that it exists.

Once you have a test that covers all of the logical decisions that are captured in the original implementation, you can use that same test to validate that any replacement implementation is working correctly.

This is my preferred approach to refactoring. But it’s not always the best tool for the job.

Use the Previous Implementation to Check Its Replacement

What if TDD doesn’t make sense for the problem you’re trying to solve? What if the only place to check to see if the code’s working correctly is by running it in a production environment? Or what if the barrier to getting your code to run inside of a test fixture is just too much for your team to achieve their goals.

Steve Maguire, in his book Writing Solid Code, suggested using debug flags that will run the old version of a block of refactored code along with the new version and halt the program if the values don’t match.

So in C#, that might look something like this:

public int SumOfIntegers(int count) {
  #if DEBUG
    // original _slow_ implementation
    var slowResult = 0;
    for (int x = 1; x <= count; x++) {
      slowResult += x;

  // fast version
  var result = (count * (count + 1)) / 2;

  #if DEBUG
    if (slowResult != result) {
      throw new RefactoringException($"Refactoring of SumOfIntegers has failed with: count={count}")

  return result;

That code shows two implementations for computing the sum of the first N integers.

The first attempt was to just N times and add up each number as it goes. That implementation has been wrapped in a conditional compilation check #if DEBUG so that the old, slow implementation will only run if the code is compiled with the DEBUG flag set to true.

A much faster approach is to rely on a formula that many mathematics students have had to write a proof for over the years.

The code above has been authored so that the fast version will run regardless of how the DEBUG flag has been set.

After the fast version executes, there’s a check that looks to see if the slow implementation and the fast implementation got the same results. If they didn’t, then an exception is thrown in with some additional information.

This is a great technique that you can start using today in production. An experimental change could be wrapped in a flag which could be set at compile time, like the example above, or at run time by storing the flag in the application’s database. Also, instead of throwing an exception, which will likely halt the program, you can just write to an error log instead.

An advantage of this technique is that it doesn’t have any outside dependencies. A disadvantage is that you have to be careful to remove experiments from the code when you’re done with them. Otherwise, they will make it more difficult to understand what’s going on in your program. Another disadvantage is that if you want to get sophisticated about how often you’re running the two versions, then that’s extra logic that will need to be woven into your code. For example, if the goal is to speed things up, then always performing this calculation twice is not going to achieve that goal. We’ll likely make performance worse, while we’re on our path to make it better.

Experiment with GitHub’s Scientist (And Friends)

Taking this approach to the next level, GitHub introduced the Scientist library, which allows you to deploy refactored code and run an experiment in production to help you determine whether the refactored code has the same behavior as its original.

Initially written in Ruby, this library has been ported to numerous languages, and chances are high that it’s been ported to your favorite language.

Here’s what our example would look like using Scientist.NET, the official .NET version of the Scientist library:

public int SumOfIntegers(int count) {
  return Scientist.Science<int>('sum-of-integers', experiment => {
    experiment.Use(() => {
      var result = 0;
      for (int x = 1; x <= count; x++) {
        result += x;
      return result;
    experiment.Try(() => {
      return (count * (count + 1)) / 2;

That looks a little more complicated than our version which relied on a couple of simple if statements, and there are some important benefits that we’re getting from using Scientist in this way.

  • The Use block is always run.

    In this case. That’s our original implementation. Since the goal is to write an experiment to determine if our replacement is better, it makes sense that our assumed-to-be-correct old version would be the default trusted version. Also, the value of the Use block is what’s going to be returned to anyone that calls SumOfIntegers.

  • The Try block will be called sometimes, not all of the time.

    The Scientist library has ways to configure how often it’ll be called. This overcomes a performance concern with the DEBUG flag approach above. We get to decide how often we want to incur the cost of running both versions.

  • The order that the Use block and the Try block are performed is randomized.

    This wouldn’t likely make a difference with our example here, but if the two implementations are meant to do the exact same thing, then the order that they are run shouldn’t matter. In fact, it would be really frustrating to learn that the replacement operation only worked, because the original was run first. Our DEBUG flag version always ran each version in the same order.

  • The amount of time that it takes to run both the Use and Try blocks is measured and saved.

    This allows us to do an analysis on wether or not our replacement is faster or slower than our original implementation. That’s an assumption that we’re making without being able to actually measure it in the DEBUG version.

  • Any bugs in the Try version won’t cause the application to halt.

    In the DEBUG version above, if an error was encountered that caused an exception to be raised, it would halt the program. The Scientist library ensures that these exceptions are handled correctly and that the program will continue to run even if an exception has been thrown in.

  • All of the information collected by Scientist is published for further analysis.

    In the DEBUG version above, we could have accomplished this goal, but we’d have to significantly complicate the code that we’ve written.

  • That’s a ton of benefits for using the library.

    There are some drawbacks as well. For one, you have to add a new dependency to your production application. In environments strict about what libraries are deployed into production, that could create a challenge for your team.

Another thing to look out for is that you’ll need to remove experiments after obtaining the results you wanted. Leaving a bunch of different experiments in your application will make it difficult to maintain the system. We had that same problem with the DEBUG approach, and it’s worth repeating.

Have you tried similar experiments in Scientist (and friends)? What did you think? Let me know in the comments below.