#NeotysPAC – The Myth of Continuous Performance Testing, by Stephen Townshend

[By Stephen Townshend, Assurity Consulting]

Continuous performance testing is the concept of building automated performance testing into an automated deployment pipeline. I see it being promoted everywhere – from social media and conferences through to clients specifically requesting it. However – does it deliver?

In this blog, I want to address five areas of concern.

The Limited Value of Component Testing

The most common way I’ve seen continuous performance testing implemented is to benchmark the performance of components as new code is deployed. Components could be anything from an API, an application function, or even a system within a system-of-systems.
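
To make that concrete, a component-level check in a pipeline often boils down to something like the sketch below. This is a minimal illustration only – the endpoint URL, the sample size, and the previous-build figure are invented placeholders, not anything from a real project.

```python
# Minimal sketch of a component-level benchmark: hit one API endpoint,
# take the 95th percentile, and compare it to the previous build's figure.
# The endpoint URL and the baseline value are hypothetical placeholders.
import statistics
import time

import requests

ENDPOINT = "https://test-env.example.com/api/orders"  # hypothetical endpoint
PREVIOUS_BUILD_P95 = 0.80  # seconds, recorded from the last build (placeholder)
SAMPLES = 50

timings = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    requests.get(ENDPOINT, timeout=10)
    timings.append(time.perf_counter() - start)

p95 = statistics.quantiles(timings, n=20)[18]  # 95th percentile
print(f"95th percentile: {p95:.3f}s (previous build: {PREVIOUS_BUILD_P95:.3f}s)")

# A naive gate: fail the build if this component got noticeably slower.
if p95 > PREVIOUS_BUILD_P95 * 1.15:
    raise SystemExit("Component is more than 15% slower than the previous build")
```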

However, component-focused performance testing only addresses a small percentage of the performance risk:

At best, component-focused performance testing allows us to track the response time and capacity of components in isolation, relative to previous builds. It does not necessarily tell us how the solution will perform in the real world, under real conditions.

Many performance issues occur not within the components of our solution, but in how they integrate. If we need to understand the performance of the entire solution before releasing it to our customers, then end-to-end integrated performance testing is the only way to mitigate that risk effectively. I have heard of component-only performance testing approaches which ended disastrously.

So, given that we often need end-to-end performance testing anyway, and that the proportion of risk covered by component performance testing is low – is continuously performance testing the components of our solution worth the effort?

Losing Sight of the Big Picture

A side effect of fixating on components is that it becomes easy to lose the sense of the big picture. There are different facets to that:

  • Losing sight of the overall solution we are testing
  • Losing sight of the business purpose of the solution

The second point is particularly important. Once we lose connection with what is important to the business, we lose the ability to prioritize what we do based on what matters.

For example, I did some work for a new credit card business. They built a brand-new greenfield solution from the ground up – and my job was to assess the performance risk, build a strategy, and implement it.

As I learned about this solution, I quickly became overwhelmed by the number of systems, systems within systems, integrations, network protocols… how was I ever going to performance test all of it? And then I took a step back and looked at the business purpose of the solution. In reality, customers only do a few select things most of the time – apply for new cards, check their transaction history, and make purchases using their cards.

My approach was eventually to drive load throughout the solution from the outside in (simulating the core customer activity). This achieved two things:

  • It was orders of magnitude simpler than breaking the solution up into components and testing each one
  • Because of the realism, the results were more valuable – we had a better understanding of what the customer experience would be in the real world

The caveat of this approach is that you need great monitoring – at a minimum, the ability to trace requests throughout the solution so you can understand where the time is being spent.
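
For illustration, a bare-bones version of that outside-in approach might look like the sketch below. The journey names and the traffic mix are placeholders standing in for real customer activity and production analytics.

```python
# Minimal sketch of outside-in load: simulate the few things customers
# actually do, weighted by how often they do them. The journey functions
# are empty placeholders for fully scripted customer activity.
import random
import time

def apply_for_card():
    pass  # placeholder: script the full "apply for a new card" journey

def check_transactions():
    pass  # placeholder: script the "view transaction history" journey

def make_purchase():
    pass  # placeholder: script the "purchase authorisation" journey

# Illustrative mix – in reality this should come from production analytics.
JOURNEYS = [(check_transactions, 0.60), (make_purchase, 0.35), (apply_for_card, 0.05)]

def virtual_user(duration_seconds=60):
    """One simulated customer choosing journeys at random for a while."""
    end = time.time() + duration_seconds
    while time.time() < end:
        journey = random.choices(
            [j for j, _ in JOURNEYS], weights=[w for _, w in JOURNEYS]
        )[0]
        journey()
        time.sleep(random.uniform(3, 10))  # think time between journeys

virtual_user(duration_seconds=5)
```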

Test Asset Maintenance

Most performance testing requires test assets. Whenever the application under test changes, there is a risk our test assets will break – or worse, the tests will still run, but something else has changed which means they no longer represent reality (e.g., an additional AJAX request we are not sending). It stands to reason that in a rapid delivery lifecycle, with more frequent releases, there is more opportunity for this to occur.

Performance test assets also tend to be fragile because they involve simulating network traffic. Very small changes within a system can have a profound impact on the network traffic – even bouncing an application server can regenerate every dynamic ID on the pages of a web application.
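
To show what I mean, a typical correlation rule is little more than the sketch below (the page markup and the pageToken parameter are invented for the example). The moment the application – or a bounced server – changes how that value is rendered, the rule silently stops matching.

```python
# Minimal sketch of a correlation rule: pull a dynamic ID out of one
# response and feed it into the next request. The HTML snippet and the
# pageToken parameter name are invented for illustration.
import re

previous_response = '<input type="hidden" name="pageToken" value="AbC123xYz" />'

match = re.search(r'name="pageToken" value="([^"]+)"', previous_response)
if match is None:
    # This is where scripts quietly break: the app changed (or the server
    # was bounced and regenerated its IDs) and the rule no longer matches.
    raise SystemExit("Correlation failed: pageToken not found in response")

next_request_body = {"pageToken": match.group(1), "action": "submitOrder"}
print(next_request_body)
```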

This raises the question – how much manual effort is required to maintain our performance test assets within this rapid delivery model? Will we be able to keep up with the rate of change?

Many load testing tool vendors tell a pretty consistent story about how they are CI/CD or DevOps ready and promote continuous performance testing. I find this a very interesting claim. What does it mean to be CI/CD or DevOps ready?

Naturally, there needs to be a command line interface or callable APIs to run tests and retrieve results in an automated way, but you also want some way to re-apply correlation rules automatically (to reduce the manual effort between iterations). Most load testing tools come with some form of automated correlation engine, but some manual effort is always required. For example, what about handling:

  • Client-side encryption?
  • Cookies being set by client-side JavaScript?
  • Extracting a value from the middle of a redirect chain?

On top of this, there is all the other stuff performance testers need to manually include such as:

  • Think-time
  • Pacing
  • Parameterisation (data pool)
  • Logic (loops, conditions)
  • Structuring the project for ease of use and reporting

The fundamental issue with (almost) every tool on the market is that all of this useful logic we build into our test assets is tied to the recorded network traffic. Whenever we re-record the traffic, we lose everything except those automatic correlation rules.
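
To give a sense of what gets thrown away, here is a minimal sketch of the hand-written structure that typically wraps recorded traffic – the request, the data pool, and the pacing figures are all invented placeholders:

```python
# Minimal sketch of the manual structure wrapped around recorded traffic:
# think time, pacing, a data pool, and iteration logic. The recorded request
# and the URL are stand-ins – in a real script the request is generated by
# the tool, and re-recording throws this hand-written structure away.
import random
import time

import requests

def recorded_search_request(session, term):
    # Stand-in for a tool-recorded request (URL is hypothetical).
    return session.get("https://test-env.example.com/search", params={"q": term})

terms = ["road bike", "helmet", "water bottle"]  # data pool (normally a CSV file)
PACING_SECONDS = 30                              # start a new iteration every 30s

session = requests.Session()
for iteration in range(10):                      # iteration / loop logic
    started = time.time()
    response = recorded_search_request(session, random.choice(terms))
    if response.status_code != 200:              # simple conditional logic
        print(f"Iteration {iteration}: unexpected status {response.status_code}")
    time.sleep(random.uniform(2, 6))             # think time
    remaining = PACING_SECONDS - (time.time() - started)
    if remaining > 0:
        time.sleep(remaining)                    # pacing
```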

My point is not to criticise load testing tools, but to point out the risk – that the manual intervention required to maintain load testing assets within a rapid delivery lifecycle is not scalable.

Automating a Pass or Fail Outcome

To fit the ‘continuous’ model we ideally want a “pass” or a “fail” verdict at the end of our performance tests. The question is, how do we determine this?

The simplest way is to define NFRs or SLAs. For example, say we want to automatically performance test a set of ten RESTful APIs. We define an NFR that all of them must respond within 2 seconds, 95% of the time. What happens if one of our APIs takes 2.1 seconds at the 95th percentile?

Should we fail the build? Does that make sense from a business perspective? Probably not. So how do we do this better?

The most common way I’ve seen this done is to capture a baseline or benchmark and compare back to it during future builds. If this is a sliding benchmark (e.g., we always compare to the most recent run), then we lose the ability to pick up gradual degradation over time. For example, we could have an SLA which says our API must not be more than 15% slower than the previous build, yet over 30 consecutive builds the response time could quietly degrade by 10% or more without any single build breaching that threshold.
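
A quick worked example makes the blind spot obvious. The numbers are invented: each build is only 0.4% slower than the one before, so a 15% per-build rule never fires, yet after 30 builds the drift adds up.

```python
# A numeric illustration of the sliding-benchmark blind spot: every build
# passes a "no more than 15% slower than the last build" rule, yet the
# response time quietly drifts upward across 30 builds.
baseline = 2.00            # seconds at the 95th percentile (illustrative)
per_build_drift = 1.004    # each build is only 0.4% slower than the last

current = baseline
for build in range(1, 31):
    previous = current
    current = previous * per_build_drift
    assert current / previous - 1 < 0.15   # the per-build check always passes

total_degradation = (current / baseline - 1) * 100
print(f"Total degradation after 30 builds: {total_degradation:.1f}%")
# Prints roughly 12.7% – and no single build ever breached the 15% rule.
```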

A better way to go about this is to compare back to multiple previous runs, say the past 15 successful builds. I think this is a good way to go about it – but it’s hard.

We need to build a rules engine which is capable of looking back over each of these runs to draw meaningful conclusions about performance. I know people who have attempted this, and it is much harder than it sounds. Our rules engine needs to be:

  • Maintainable
  • Configurable
  • Understandable

…and our rules need to be relevant – relating back to real business need and being mathematically justified.
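
A minimal sketch of that idea might look like the following – the historical figures, the 10% tolerance, and the outlier rule are invented examples rather than recommendations. The point is only to judge the current run against a window of previous runs instead of a single one.

```python
# Minimal sketch of comparing the current run against the last 15 successful
# builds rather than just the previous one. History values and tolerances
# are illustrative placeholders.
import statistics

# 95th percentile response times (seconds) from the last 15 successful builds.
history = [1.92, 1.88, 1.95, 1.90, 1.87, 1.93, 1.91, 1.89,
           1.94, 1.90, 1.92, 1.88, 1.96, 1.91, 1.90]
current_p95 = 2.14

reference = statistics.median(history)
spread = statistics.stdev(history)

# Two illustrative rules: a relative tolerance and a statistical outlier check.
too_slow_relative = current_p95 > reference * 1.10
too_slow_statistical = current_p95 > reference + 3 * spread

if too_slow_relative or too_slow_statistical:
    print(f"FAIL: {current_p95:.2f}s against a reference of {reference:.2f}s")
else:
    print("PASS")
```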

And here I am only exploring one dimension – response time. What about resource usage? The thing about performance is that we do not know in advance which resource is going to be the bottleneck during a test. Think about how many counters are available in Windows Perfmon – most of them are irrelevant most of the time, and each of them requires customized rules around what is acceptable.
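
To give a feel for what that means, here is a minimal sketch of per-counter rules. The counter names follow Perfmon conventions, but the thresholds are invented examples – in reality every counter needs a rule that reflects its own context.

```python
# Minimal sketch of per-counter resource rules. Counter names follow the
# Windows Perfmon path convention; thresholds are invented examples.
COUNTER_RULES = {
    r"\Processor(_Total)\% Processor Time": lambda v: v < 80,        # sustained CPU
    r"\Memory\Available MBytes": lambda v: v > 1024,                 # free memory floor
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length": lambda v: v < 2,
}

# Observed averages from a test run (placeholder values).
observed = {
    r"\Processor(_Total)\% Processor Time": 91.0,
    r"\Memory\Available MBytes": 2048.0,
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length": 0.4,
}

for counter, rule in COUNTER_RULES.items():
    value = observed.get(counter)
    status = "OK" if value is not None and rule(value) else "BREACH"
    print(f"{status}: {counter} = {value}")
```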

This is not an impossible problem, and I think we are moving in the right direction, but it’s not something I have yet seen adequately addressed in the real world.

Production-like Environments

Ideally, we would run our performance tests in production or a production-like environment. It’s one of the principles of DevOps – let’s stop building and testing our software in environments which are so radically different from the real world.

From experience, however, this is often not the case – especially in a continuous delivery context. Test environments commonly differ from production in areas such as hardware sizing, data volumes, configuration, and which surrounding systems are real rather than stubbed.

So what can we learn from an environment that does not match production? It lets us compare performance relative to previous builds in the same environment. It does not necessarily tell us what performance will be like in the real world.

Given this, my question is – is any of this testing activity worthwhile if we do it in an environment which does not match the real world?

When is it Appropriate?

I do think there are some situations where this kind of testing activity is more valuable than others. Most importantly, if you are a big multi-national business with millions of users – it makes sense. Do as much performance testing as you can, as early as possible, to find as many issues as you can (because you stand to lose a lot from performance issues). However, for a smaller company in a country such as New Zealand you will end up putting in a similar amount of effort to mitigate a much lower level of absolute risk – which raises the question: is it worthwhile?

Other factors which make a project more suitable than others include whether the software is being built bespoke, the presence or absence of production-like environments, the maturity of the deployment pipeline, and the testability of the software. Another factor, which I picked up at the PAC conference, is having really good monitoring – ideally APM.

Post-PAC Reflection

My perspective has changed slightly since attending the PAC.

In particular, I realize that I work in a unique situation – a small country with a small economy, where the value of this kind of activity is questionable. But overseas there are situations where this makes a lot more sense. I still think there are plenty of situations where continuous performance testing is inappropriate, but that is not always the case.

I also think there is a whole other discussion that needs to happen about what it takes to implement this in an organization. It’s not about technology or process; it’s a culture shift. Either we need to get our development teams more engaged in the performance of their software, or we need our performance specialists to be integrated better into our software teams.

It’s also important to understand that continuous performance testing is not a replacement for end-to-end performance testing. It is simply an extra activity we can do to find performance issues earlier. As long as we all understand that, I’m much more comfortable with continuous performance testing as a concept.

I have my thoughts about how we can move forward in a constructive way:

  • Assess performance risk early. Build a connection between what we as performance testers do and business need, so we can prioritize what we do based on what matters.
  • Build more testable software. Either build or choose software which is testable in terms of its architecture and network traffic. Then we have a chance.
  • Build better performance specialists. Let’s build a new generation of performance specialists who are not tools-focused but can stretch in both directions – relating what they do back to business need, and helping investigate and diagnose performance issues when they find them.
  • Promote a culture of performance awareness. Within our industry and organizations let’s make sure everyone understands what performance is and why it matters. And within our community of performance specialists let’s be more honest about how hard some of this stuff is, and encourage deeper discussion so we can all learn from each other and do our jobs better.

Stephen Townshend

Stephen is a software performance specialist based in Auckland, New Zealand. Performance is about more than just testing – performance risk is a business risk. Stephen’s view is that his job is to identify, prioritize, and manage performance risk in a way which is appropriate to each customer’s unique situation. This allows us to focus first on what matters most to the business.

He specializes in the two ends of engagement – the risk assessment and strategy up front, and the investigation and diagnosis of issues in production or during performance testing.

Stephen is also a professionally trained actor for theatre and film which coincidentally makes him an expert at both performance and performance.

Learn More

Do you want to know more about this event? You can read Stephen Townshend’s presentation here.
