Platform Migrations - The Built Environment as a Metaphor

Long exposure image of city traffic at night.

Platform migrations are a common challenge among teams that we interact with. Platforms need to change frequently, to keep up with the demands and needs of the software systems they support, and the teams that use and interact with them.

Some real-life examples of systems we’ve seen, and helped clients migrate, include:

  • Web app: Python 2.7 to 3.x upgrades
  • Java web app: Apache Struts 1.x to Spring Boot 2.x
  • Java web app: Play 1.x to Spring Boot 2.x
  • Rails web app: upgrade to a version that’s still receiving security updates
  • Windows desktop app: Win32 to WinUI

We’ve also heard of the following upgrades as likely candidates, though we have not executed them ourselves:

  • Linux desktop app: X11 to Wayland
  • macOS desktop app: Cocoa to SwiftUI

One of the most common positions we encounter among teams that are facing this situation is that the best (or only) option is to completely level whatever they’ve got, and start over from scratch. The popular conception is that if the foundation needs to change, then the whole building needs to come down.

Construction (and destruction) as an imperfect metaphor for software

Construction has long had a place in the software industry, often used as justification for introducing more traditional engineering practices, big upfront design, and creating grand, detailed plans. The thinking has been: since software projects have been failing, if we treat them like (physical) construction projects, they’ll succeed.

Construction has nevertheless fallen out of favor as a metaphor for software development. The active assembly of a building is far more similar to compiling source code than to any other activity we do as software developers. Many thought pieces draw a parallel between source code and a finished blueprint (a detailed set of instructions): the physical construction crew takes those plans and executes them, which is what your compiler does when it takes the source code and translates it into something that can run.

The construction metaphor also fails to capture the difference between complicated and complex, especially as it relates to software systems. Complicated refers to the existence of many similarities between projects. In construction, there are a lot: in some housing developments, almost every single house has a nearly identical (or at least very similar) floorplan and set of attributes, and entire neighborhoods get built that way. That consistency drives efficiency: the construction business can repeat a process, and thus build more, faster and cheaper, leading to higher revenue (and profit). The vast majority of construction projects rely on materials that existed before the project started, and very few materials must be invented from scratch. Thus, it’s possible to leverage past experience from previous construction projects.

In software development, almost all of these tenets break down. There are a few types of projects that are exceptions (e.g. a basic website is relatively simple to construct, and if you always use the same tool to construct it, you can reap some of the benefits of repetition). However, this certainly doesn’t apply to an application that must grow, evolve, and adapt in response to its users and the world with which it interacts.

When it comes to complexity, while some third party libraries exist that developers can build upon, much of the work that goes into a software project is inventing the very tools and materials that are used to produce a product. Thus, much of our past experience and performance is not usually a good predictor of our success. Because much of what we’re building is brand new, it can be difficult to predict.

Demolition is put forward as a simple idea among software teams (all you need to do is rm -rf, or Remove-Item -Recurse -Force), and then the whole project’s gone, everything’s fine, and you can just move on.

When that “simple” approach is taken, a lot of hidden issues are exposed. If a building gets torn down, a necessary precursor is that nobody is still living in it, and there’s nothing left inside. For most major building demolitions, many recyclable materials are harvested from the building first as a prep step. It’s not as simple as hitting a button, and everything is gone. The utility that the structure was providing must also be considered (and noted as an implicit opportunity cost in this approach); it cannot be providing that same utility whilst being demolished, and while something new is being put in its place. The people that were previously using that building need to move all their stuff somewhere else, and find a new building for use. Destruction (in a total demolition sense) is thus also an imperfect metaphor for software.

The built environment as a metaphor for platform migrations

While construction and destruction are highly imperfect metaphors for software, we can still turn to the built world as a fitting metaphor for software maintenance. If we envision a city as a thing that is maintained, and look at it from the top down, watching buildings coming and going and people moving throughout the city, it is noticeably more of a living creature than a static item. The built environment surrounding us is composed of countless structures, constantly in flux. Those structures are connected via surfaces (e.g. highways, streets, and walkways, which all have their own governing rules and capabilities) that allow people to move from place to place. Materials being used in each location also need to flow through the built environment. Utilities like power, waste, water, and data are constantly traveling throughout the organism.

The built environment is a constantly changing organism, and no single plan exists that represents the precise structure of your city statically. We see the same with software systems: if you were to create an incredibly detailed blueprint that captured what your built environment - or your software system - is at a given moment, it would be out of date moments after it was captured. Neither is static; both are constantly in movement and experiencing change.

The built environment never stops operating, much as the human body continues operating throughout our lives, with some parts getting damaged and healing, or changing and aging over time. There’s never a pause button that says “all work stops” - instead, anything that needs to be replaced gets isolated. For example, when a bridge is being replaced or repaired, traffic is detoured to an alternate path, not stopped completely for the duration of the work. Complete disruptions are simply not tolerated.

Structures can be thought of as a good stand-in for applications. While the initial construction of a structure might not be a perfect parallel, once the structure exists, the ways in which it must be maintained are very analogous to software applications. Much like any new software application, buildings have a break-in period, too. Stewart Brand’s book and BBC mini-series How Buildings Learn discusses just this: how buildings are continuously refined and shaped over time by their occupants, and start out with their fair share of kinks that need to be worked out, such as leaky roofs and faulty pipes, just as bugs plague software applications.

Considering the surfaces that connect all of the structures around us, a decent parallel in the digital environment would be the protocols that allow different applications to talk to each other. Completely stopping traffic by removing a surface (e.g. a road) between two structures would be considered unacceptable, and removing a protocol would equally curtail the ability of two software applications to communicate with each other.

The built environment is constantly changing and adapting to meet the needs of the people around it and affected by it: new structures get built and replaced, new roads are put in, traffic lights, schools, parks, and much more.

We’ve all encountered the demolition of condemned buildings, often making way for some new type of structure or addition to the built environment. Similarly, it’s all too tempting (and seemingly easy) to scrap an aging and struggling software system and simply start from scratch in a quest to build something new and better. We’re fascinated by the cases in which neither of these approaches is taken and (more in line with the needs of reality) bulldozing isn’t required to make the changes needed to the system, whether physical or digital.

When you want a new kitchen in the house you’ve lived in for 10 years, there’s no need to bulldoze the entire house and build it again from scratch. Instead, you can simply isolate the kitchen and renovate it. This might require you to cook somewhere else for a few weeks or order takeout, but it doesn’t require you to move all of your belongings and life out of the house for months or even years. You continue using the structure, with the specific uses adapting over time to what is available. Maybe the renovators finish the stovetop before the oven, so you can start using that first. The same concept applies to software systems. Instead of getting rid of your entire application (that might have taken years to develop and refine, and is doing its job well but has some serious issues) and rebuilding from scratch, the same type of remodeling can be undertaken in a software environment.
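To make the kitchen analogy concrete, here is a minimal sketch of what isolating one part of an application might look like. Everything in it is hypothetical: the `Exporter` interface, both implementations, and the `supported` set are names we made up for illustration, and the approach resembles what is sometimes called branch by abstraction. The rest of the application talks only to the wrapper, which uses the renovated implementation for whatever it already supports (the stovetop) and falls back to the old one for everything else (the oven).

```python
from typing import List, Protocol, Set


class Exporter(Protocol):
    """The seam: callers depend on this interface, not on an implementation."""

    def export(self, fmt: str, rows: List[str]) -> str: ...


class LegacyExporter:
    """Stand-in for the existing implementation (hypothetical)."""

    def export(self, fmt: str, rows: List[str]) -> str:
        return f"legacy-{fmt}:" + ",".join(rows)


class NewExporter:
    """Stand-in for the renovated implementation; only partly finished."""

    # Formats the new implementation can already handle (an assumption).
    supported: Set[str] = {"csv"}

    def export(self, fmt: str, rows: List[str]) -> str:
        if fmt not in self.supported:
            raise NotImplementedError(fmt)
        return f"new-{fmt}:" + ",".join(rows)


class RenovatingExporter:
    """Uses the new implementation where it is ready, the old one otherwise."""

    def __init__(self, old: LegacyExporter, new: NewExporter) -> None:
        self._old = old
        self._new = new

    def export(self, fmt: str, rows: List[str]) -> str:
        if fmt in self._new.supported:
            return self._new.export(fmt, rows)
        return self._old.export(fmt, rows)


exporter = RenovatingExporter(LegacyExporter(), NewExporter())
csv_result = exporter.export("csv", ["a", "b"])  # served by the new code
pdf_result = exporter.export("pdf", ["a"])       # still served by the old code
```

As more formats are added to `supported`, more of the work shifts to the new implementation without callers changing anything; once the old implementation handles nothing, it can be removed.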

First, we were curious to discover how many parallels actually exist in the built environment, and how many cases there are of this kind of gradual transformation (moving or modifying a structure without interrupting operations). A few interesting real-world examples follow:

The school in Shanghai that walked to a new location: In 2020, an 85-year-old primary school was lifted entirely off the ground and relocated using a new technology dubbed the “walking machine”.

The raising of Chicago: By the mid 1800s, Chicago stood only four feet or so above the water line of Lake Michigan, and as a result of being built essentially on top of a swamp, encountered huge issues with standing water and poor drainage of sewage, resulting in waterborne diseases and countless other problems. Installing a sewer system was considered, though it presented only a partial solution. The breakthrough plan that was eventually proposed and undertaken over the course of the next couple of decades was to lift the entire city by several meters. The clincher: operations continued in all of these buildings, with people residing in, coming and going from, and working throughout buildings that were rolling across streets and pivoting around corners. The transformation looks massive and monumental in retrospect, but it was gradual enough that, on a day-to-day basis, nobody even noticed they were moving.

Raising Galveston: Following a devastating hurricane in 1900, Galveston, Texas was in total ruins. Those who remained following the storm wanted to rebuild and recover from the tragedy in a way that would hopefully mitigate the risk of it happening again. The solution devised included building a sea wall around Galveston and, in addition, raising the city up so there was a path for water to drain. The entire city was raised between two and five meters over the course of about a decade. The whole city continued to function while different pieces of it were at differing heights, with many buildings still in use while being lifted. One example of how this worked was the temporary addition of elevated sidewalks that made it easier to walk between buildings.

In 1930, the 11,000 ton Bell Telephone Exchange was moved without suspending operations: in the quest to better utilize the surface area of the block where the building was located (build more offices, etc.), the original plan was simply to demolish the building. Someone pointed out that the entire region’s telephone lines ran through the building, so if it were demolished, phone service disruptions would persist for months or more, which was unacceptable. The solution was to slide the building over, and then rotate it (specifically, over a 30 or so day period, the building was shifted 52 feet south, rotated 90 degrees, and then shifted again 100 feet west). The building remained operational the entire time - employees continued to report to work, all utilities remained hooked up, and customer phone service went uninterrupted. Workers were interviewed during and following the process, and nobody really noticed that the building was being moved. Following the move, a brand new building was constructed right next to it, utilizing the space to its fullest.

The CNN article about the walking school describes decades of indifference to historic buildings, which were razed to clear land for new office buildings and skyscrapers. Similarly, there’s growing awareness in the software community that hard cutovers and rewrites often do more harm than good, not to mention the huge opportunity costs. Valuable architectural heritage lost as a result of demolition is akin to the inherent (and in many cases, irreplaceable) value of legacy codebases. Just as advanced technologies allow old buildings to be relocated rather than demolished, creative approaches to platform migrations can allow legacy and problematic software systems to be salvaged and repurposed, while gradually and safely transitioning to a new and more modern platform, all executed with minimal disruption to occupants and users of the system. Just as you can go on living and working in a building that is being lifted or moved, developers can go on working productively in an application that is in flux, and the application can go on serving the needs of the business.

If we draw inspiration from such events in the built environment, how might it change the ways in which we approach software development, and more specifically, platform migrations?

It is crucial to review the consequences of doing a complete demolition. Really think hard about this before proceeding. Traffic will need to be rerouted, and the new platform will need to be built while the old one is still functional. How long will the switchover take, and will the new platform remain largely unused until the day it is turned on?

Is there perhaps a way to gradually start using that new application instead? For example, when building a new building next door, when the first floor is ready, could we start moving over people from the old first floor to the new first floor, so we’re able to make a gradual transition from the old building to the new, instead of waiting for the new building to be entirely finished, and then moving all sixty floors of employees, equipment, utilities, etc., over in one fell swoop?
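In software terms, moving one floor at a time might look something like the sketch below: a small router that sits in front of both platforms and sends each request to the new one only if that part of the application has already moved. This is an illustration under our own assumptions, not a prescription; `legacy_app`, `new_app`, `MigrationRouter`, and the URL prefixes are all hypothetical names, and the idea is close to what is often called the strangler fig pattern.

```python
from typing import Callable

# A handler takes a request path and returns a response body.
Handler = Callable[[str], str]


def legacy_app(path: str) -> str:
    """Stand-in for the old platform (hypothetical)."""
    return f"legacy:{path}"


def new_app(path: str) -> str:
    """Stand-in for the new platform (hypothetical)."""
    return f"new:{path}"


class MigrationRouter:
    """Routes each request to the old or new platform, one 'floor' at a time."""

    def __init__(self, legacy: Handler, new: Handler) -> None:
        self._legacy = legacy
        self._new = new
        self._migrated = set()  # path prefixes already served by the new platform

    def migrate(self, prefix: str) -> None:
        """Start serving everything under this prefix from the new platform."""
        self._migrated.add(prefix.rstrip("/"))

    def rollback(self, prefix: str) -> None:
        """Send this prefix back to the old platform if something goes wrong."""
        self._migrated.discard(prefix.rstrip("/"))

    def handle(self, path: str) -> str:
        for prefix in self._migrated:
            # Match the prefix itself or anything nested beneath it.
            if path == prefix or path.startswith(prefix + "/"):
                return self._new(path)
        return self._legacy(path)


router = MigrationRouter(legacy_app, new_app)
before = router.handle("/billing/invoices")  # old platform
router.migrate("/billing")                   # the first floor moves over
after = router.handle("/billing/invoices")   # new platform
untouched = router.handle("/reports")        # still the old platform
```

Because each prefix can be moved (and moved back) independently, the new platform starts carrying real traffic from the first migrated piece rather than sitting unused until a single cutover day.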

If we can think of ways to keep the system operating as a whole - even while in motion - it can deliver a much better experience to the people paying for the work, as well as the people utilizing these applications to get their work done, because they’ll gradually see the benefits of the new application over time, while the engineering team benefits from not having to maintain two disparate systems. In a complete rewrite, by contrast, whenever developers add a feature or update to the old system, they have to figure out how to add the same to the new system and ensure it’s working correctly, even though the new system is going unused, while the old system (which will be phased out) is still (wastefully) tested and strengthened through all the use it’s getting.

When considering return on investment for your software project, consider the opportunity costs involved with any platform migration or demolition and rewrite. While all efforts are being dedicated to constructing a complete replacement (regardless of how it’s done), what other activities are being forgone, and does it make sense to dedicate budgets accordingly? Often, these platform migrations must take place sooner or later. We prefer to approach this crossroads by discovering non-destructive ways that they can be executed.

Cover photo by Pawel Nolbert.