These days, migrating code to new, more powerful hardware platforms is no longer an every-so-often thing. It’s continuous and, for many businesses, unrelenting. Companies need to migrate code (software migration) to take advantage of technological innovations that drive their competitive edge. Some feel that all that is required is to click the “deploy” button in a CI/CD tool and let the automation take over. Those who’ve earned their engineering stripes understand that a lot more is required.
Migrations are complex and risky. The best way to protect a system migration and ensure its success is to take the time to develop a well-informed plan that addresses the business and technical considerations, and to give all involved teams the right level of observability into system performance. As the saying goes, few people plan to fail; instead, they fail to plan.
Let’s take a closer look.
There Are No Second Chances
The first thing to understand regarding software migration is that the odds are you won’t get a second chance. Migrations are significant events that, in many cases, involve taking a system down for a predetermined period. Once down, it must be brought back online on time and working as expected. Failure to meet this condition can be career-ending and, more importantly, can cost the company revenue, perhaps even the business itself. After the damage is done, there is rarely a do-over. So, it’s essential that everybody involved in the migration understands that failure is not an option.
Create an Optimal Performance Baseline First
The last thing you want to do is move current performance problems to a new environment. All this nets you is marginal code that will be poisonous to the new hardware. You only want to migrate code that’s proven to be performant on the old hardware.
This requires creating an optimal performance baseline for the system on the legacy hardware before it is migrated. That way, if you know the code runs successfully on the legacy system yet is molasses-slow in the new environment, reason dictates that you know where to look: the hardware. It’s a simple apples-to-apples comparison. However, when you migrate problematic code to a new environment and things go haywire, the cause could be anything.
Performance baselines require metrics: evidence of the system’s operational capacity and of how it changes under various levels of pressure. Load testing is often used to put this pressure on existing systems before migration, then again after migration but before release to users. Whether you get your metrics from system monitoring dashboards, in-product telemetry, or testing tools, you need to simulate enough pressure that unknowns come out of the woodwork beforehand. That gives you observability before the migration and reusable validation processes after it. The last thing you want is to leave the outcome to chance.
When the cause “could be anything,” you compromise your customers’ confidence in the whole system. Just look back at the horrendous debut of HealthCare.gov in 2013. Unfortunately, it never established an optimal performance baseline. In fact, it had no baseline whatsoever. Success was seemingly never in the cards.
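As a rough illustration of such a before-and-after comparison, the following Python sketch reduces two sets of latency samples to a few summary metrics and flags any metric that got worse by more than a tolerance. The function names, the choice of metrics, and the 10% tolerance are assumptions made for this example, not part of any particular load testing tool.

```python
import statistics


def summarize(latencies_ms):
    """Reduce raw latency samples (milliseconds) to a few baseline metrics."""
    ordered = sorted(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=100)[94]
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": p95,
        "max_ms": ordered[-1],
    }


def find_regressions(baseline, candidate, tolerance=0.10):
    """Return metrics where the candidate is slower than baseline by more than tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate[name]
        if new_value > base_value * (1 + tolerance):
            regressions[name] = (base_value, new_value)
    return regressions


if __name__ == "__main__":
    # Hypothetical samples: one load test on the legacy hardware, one on the new.
    legacy = summarize([100] * 20)
    migrated = summarize([100] * 19 + [200])
    for metric, (before, after) in find_regressions(legacy, migrated).items():
        print(f"{metric}: {before} -> {after}")
```

Because the same two functions run against both environments, the comparison stays apples to apples: only the samples change, not the way they are measured.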
Maintain Maximum Visibility and Fast Feedback
The rule of thumb for a new system migration is not to launch on modern hardware without establishing proper visibility into what that hardware does. You need to make sure that adequate system monitoring and reporting are in place from the start. You also need clarity into the underlying infrastructure, especially when using cloud providers like AWS, Google Cloud, or Azure. These systems are complex and often transient. Virtualized instances are created and destroyed at a moment’s notice: here now but gone in seconds. It’s entirely possible to encounter a performance testing problem on a VM that no longer exists. Guard against this with a forensic approach. Access to logs and other reports that give a comprehensive record of past events is a must. The ability to easily compare performance testing results before and after changes or component reconfigurations is a critical capability to have in your tool belt.
Once you migrate, your feedback needs to be as immediate as possible. The faster the rate of change, the more quickly information needs to be available, because the consequences of any given moment are more significant. Think of it this way: when taking a stroll down the street, you can afford to look away at something on the other side. When running fast, looking elsewhere even for a second could result in a collision with a light post. The same holds for systems under migration. The faster you go, the more you need to see, and the sooner you need to see it. Metaphorically speaking, hitting the light post will likely take your system down, and it may never come back.
Keep Feature Releases in Sync with Migration Updates
Consider a typical problem. You’re performing a late-stage test on an intended migration target, and you uncover a way to boost the software’s performance significantly. It’s not a bug fix, and it’s not a new feature either. It’s an enhancement. Nonetheless, being an Agile shop, you do what you’re supposed to do. You put the change request into the backlog. A month or two goes by before your team finally gets the go-ahead to implement the change request. There’s only one problem. The new target for your new code is nothing like the environment you designed the enhancement around. Everything is new.
Had you implemented the enhancement against the platform that was operational when the need was identified, the upgrade would have been far more manageable. Now, you have to start at the beginning and run tests on both the original environment and the new one. (Good testing practice recommends doing this so that comparisons and verification are accurate.)
Your performance toolchain should be flexible enough to match your need to re-test in different environments. A quick change to hostnames shouldn’t require re-scripting or re-work. Activities such as automatically tagging dynamic resources in APM tools or changing load testing endpoint details should be absurdly simple, whether for a human running ad-hoc sprint tests or as part of configuration sets in continuous integration jobs that trigger performance tests. Without this flexibility, you get delays during patch and feature rollout or, worse, a lack of visibility into the negative impact on end users of a necessary adjustment after migration.
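One way to keep hostname changes out of the test scripts themselves is to read the target from configuration. The Python sketch below assumes hypothetical environment variables (PERF_TARGET_HOST, PERF_TARGET_SCHEME, PERF_TARGET_ENV) and a hypothetical Target type; a real toolchain will have its own names and mechanisms.

```python
import os
from dataclasses import dataclass
from urllib.parse import urlunsplit


@dataclass(frozen=True)
class Target:
    """Where a performance test should point, plus tags for reporting."""
    scheme: str
    host: str
    tags: dict


def load_target(env=None):
    """Build a Target from environment-style settings (variable names are hypothetical)."""
    env = os.environ if env is None else env
    return Target(
        scheme=env.get("PERF_TARGET_SCHEME", "https"),
        host=env["PERF_TARGET_HOST"],  # required: fail loudly if it is missing
        tags={"environment": env.get("PERF_TARGET_ENV", "unknown")},
    )


def endpoint(target, path):
    """Join the configured host with a path used by the test script."""
    return urlunsplit((target.scheme, target.host, path, "", ""))


if __name__ == "__main__":
    # The same script runs against either environment; only the settings differ.
    legacy = load_target({"PERF_TARGET_HOST": "legacy.example.test", "PERF_TARGET_ENV": "legacy"})
    migrated = load_target({"PERF_TARGET_HOST": "new.example.test", "PERF_TARGET_ENV": "migrated"})
    print(endpoint(legacy, "/checkout"), legacy.tags)
    print(endpoint(migrated, "/checkout"), migrated.tags)
```

With this shape, a human running an ad-hoc test and a CI job can both switch environments by changing settings, and the tags travel with the results so the two runs can be told apart later.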
The moral of this tale is that there needs to be tight coordination between feature release and migration updates. It’s not a matter of just sending change requests to a backlog and working on them when there’s time available in a future sprint. There’s a big difference between “anytime” and the “right time.” If a feature release depends on a particular machine, VM or container environment, it’s best to implement that feature against that environmental dependency. To do otherwise will cost money. Many times the cost can outweigh the benefit.
Make Trending Continuous
Trends are important. They help us determine when to add resources and take them away. They enable users to understand behavior at micro and macro levels. Patterns tell us about the overall capabilities of our systems.
Trends do not happen in a moment. They take time. No single snapshot of CPU utilization will tell us when a maximum threshold is approaching. A snapshot might tell us we’re maxed out. It won’t tell us we’re about to max out. This is a significant difference.
To enable such a determination, you need to collect data over time, all the time. This is particularly relevant when it comes to ensuring that an ops migration is going according to plan.
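To make the difference between “maxed” and “about to max” concrete, here is a minimal Python sketch that fits a straight line to CPU utilization samples collected over time and estimates how long until a threshold is reached. The linear model, the function names, and the 90% threshold are assumptions for illustration; real utilization is rarely this well behaved, and a monitoring tool’s own forecasting would normally be preferable.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(hours, cpu_percent, threshold=90.0):
    """Estimate hours from the last sample until the fitted trend reaches threshold.

    Returns None when utilization is flat or falling, since no crossing is predicted.
    """
    slope, intercept = fit_line(hours, cpu_percent)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - hours[-1])


if __name__ == "__main__":
    # Hypothetical hourly samples: the latest reading (70%) looks healthy on its own.
    hours = [0, 1, 2, 3, 4]
    cpu = [50, 55, 60, 65, 70]
    print(hours_until_threshold(hours, cpu))
```

The last sample alone says nothing alarming; only the series of samples reveals that the threshold is a few hours away.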
Data collected over time shouldn’t be episodic. It takes more than saying, “We’re migrating over to the XYZ provider next month; let’s set up the hardware environment now to make sure our expectations are in line with reality.”
Of course, collecting data to determine operational trends at the onset of migration is useful. But the data collection should not stop post-migration. A good rule of thumb is to start trending on the new hardware immediately and keep it going throughout the life of the system. You can do this by creating continuous integration (CI) jobs that trigger performance checks and small load tests on the new hardware at predefined intervals. These jobs should run not only before the migration happens but also continue after the migration has occurred.
Having a continuous stream of complete and useful performance data provides the depth of information necessary to accurately determine the trends that will affect the current migration and the next one. As those who have been in the engineering community for a while understand, when it comes to ops migrations, it’s never over; it’s just one continuing story.
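The kind of check such a CI job might run can be very small. The Python sketch below times an operation several times and reports whether the slowest run stayed within a budget; the function names, the budget, and the stand-in operation are all hypothetical, and scheduling is left to whatever CI system is in use.

```python
import time


def timed_check(operation, budget_seconds, runs=5):
    """Run operation several times and compare the slowest run against a budget."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    worst = max(durations)
    return {"worst_seconds": worst, "passed": worst <= budget_seconds}


def exit_code(result):
    """Map a check result to a process exit code a CI job can act on."""
    return 0 if result["passed"] else 1


if __name__ == "__main__":
    # Stand-in for a real request against the migrated system.
    def sample_operation():
        time.sleep(0.01)

    result = timed_check(sample_operation, budget_seconds=0.5)
    print(result)
    raise SystemExit(exit_code(result))
```

Recording each run’s result alongside the pass or fail signal, rather than discarding it, is what turns a scheduled check into trend data.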
Putting It All Together
An ops migration is rarely smooth. Even under perfect conditions, adversity can present itself. The trick is to be able to handle mishaps effectively, to expect the unexpected, and to be able to meet the challenges at hand.
Preparation is paramount. You must understand that code migration to a new operating environment is a timeboxed event without the option of a mulligan. Teams also need the right platforms for testing and observability in place to identify and fix performance issues quickly, before and after migration. The code has to be performant before the migration occurs. You also need to set up monitoring mechanisms and fast feedback loops that give maximum visibility into the target environment’s state, both at the time of migration and continuously afterward. You’ll need this information to determine the operational trends that will guide future migrations.
Protecting an ops migration can make or break a career for those responsible. Keeping in mind the suggestions described herein will help you protect the current migration as well as establish proper practices that will reduce the risk associated with those in your future.