How do you manage to run a performance test against each service of your system? And what about running each test every day? If you’re a performance tester who has worked in a traditional development environment and then shifted to CI/CD, you may have noticed that your usual approach to load testing no longer holds up.
In my Neotys “PAC to the Future” talk, I shared why the typical approach for load testing doesn’t work with continuous delivery and an alternative method designed with a different objective in mind. However, in the end, you will still need to employ both approaches for the best results! In this post, I’ll go into detail about this and share how you can detect performance degradations instantly in CD, based on my real-world experiences working with some of the clients of my testing services company, Abstracta.
Yesterday’s Performance Testing Approach
As performance testers, we simulate load, and while doing so, we measure things to detect bottlenecks (the "narrowest" part of the system, which limits how quickly everything else can flow through) and the breaking point (the amount of load after which the system significantly degrades).
Traditionally, we would follow these steps to execute performance tests:
- Define a bunch of test cases (user flows) and the load scenario.
- Automate them with a load simulation tool (at the HTTP level).
- Run tests, analyze the results, and work with devs and ops to improve.
- Continue doing this until reaching the service level agreements (SLAs) or a “good enough” level of performance (in terms of response times and resource consumption).
What changes in continuous delivery?
Teams who’ve achieved continuous delivery have already paved the way with continuous integration (CI), a practice wherein each developer’s code is merged at least once a day. A stable code repository is maintained from which anyone can start working on a change, and the build is automated with various automatic checks, such as code quality reviews, unit tests, etc.
Then, in CD, we deliver continuously to the business team, and frequently to users (to production). So, we need an automated pipeline with different checks, and the checks we’re dealing with here are performance validations.
The typical approach of doing load simulations at the end of the development cycle is not going to work anymore, since new code is pushed out so often (even daily), and these changes break the load simulation scripts, making the build fail because the scripts can no longer run properly and provide accurate results.
So, what can we do differently?
Performance Testing in CI/CD – Different Approach, Different Goal
Traditionally, the objective of performance testing is to determine whether the system will support the expected load. In CD, my suggestion is to change the goal: detect system degradations as soon as possible.
There’s a substantial difference between running a performance test before go-live (after six months of development) and releasing every day (or at least very frequently).
When you release with higher frequency, there will be fewer differences between each new version and the previous one. And, luckily, there’s also less risk.
We won’t know from these tests if the application will support the real load in production. But, we know that the current version does, and we would catch it if something is causing degradation. If we don’t identify any, the new version should support the load (or at least there’s less risk that it won’t).
In CD, we want to know if the changes introduced in the new version (the part we’re unsure about) are generating any negative impact, any degradation.
Also, there is even less risk because you should have a rollback process in place for any release, and you monitor in production. Then, if something worsens the performance and you detect that in production, you can roll it back and analyze what happened, adding more checks to your pipeline to shift the validations left.
How can we detect degradation immediately?
There are three steps (and inevitable challenges you may face) to follow for continuous performance testing, which I will go over next:
- Selecting a tool
- Defining the test cases to automate
- Designing the load scenarios
Selecting the Right Tool
In my experience, most of the time, using open-source tools has been a requirement. In such cases, JMeter is a go-to, but for CI, I wouldn’t recommend it.
The problem with JMeter is that all the scripts are stored in XML files, which most likely look something like this:
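For illustration, here is a trimmed, hypothetical fragment of a .jmx file defining a single HTTP request (the domain and labels are made up); in a real script, this verbosity repeats for every sampler, assertion, and timer:

```xml
<HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="home" enabled="true">
  <elementProp name="HTTPsampler.Arguments" elementType="Arguments" guiclass="HTTPArgumentsPanel" testclass="Arguments" enabled="true">
    <collectionProp name="Arguments.arguments"/>
  </elementProp>
  <stringProp name="HTTPSampler.domain">example.com</stringProp>
  <stringProp name="HTTPSampler.protocol">https</stringProp>
  <stringProp name="HTTPSampler.path">/</stringProp>
  <stringProp name="HTTPSampler.method">GET</stringProp>
  <boolProp name="HTTPSampler.follow_redirects">true</boolProp>
  <boolProp name="HTTPSampler.use_keepalive">true</boolProp>
</HTTPSamplerProxy>
```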
Additionally, you can’t easily see the differences between one version and another, even in simple tests, because the script is an XML file. You can diff the files, but it’s not easy to understand what actually changed.
A great alternative to JMeter for this purpose is Taurus, an open-source project that lets us specify a test in a simple YAML file. The best part of Taurus is that the script is straightforward, and the tool then generates the code to run the test with JMeter, Gatling, or more than 20 other executors, even Selenium.
Below you can see an easy-to-read Taurus script. You can read it and understand it, as it’s pretty simple:
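A minimal sketch (the URL and load values are illustrative, not from a real project):

```yaml
execution:
- concurrency: 100
  ramp-up: 1m
  hold-for: 5m
  scenario: home-page

scenarios:
  home-page:
    requests:
    - url: https://example.com/
      method: GET
      label: home
```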
Not bad! This script is equivalent to the previous XML file I showed.
Think about this classic performance testing situation – you find there’s a problem because an assertion is failing. The first thing you’d do is NOT tell everyone that there’s a problem in the system, but first ensure that your tests are correct. Then, you start analyzing, trying to see if there is any change in the test from previous versions, etc. But, this isn’t an easy task when working with JMeter, which further proves why we need plain text-based scripting.
This is how it would look when you use Taurus or any “CI-friendly” tool:
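For example, a hypothetical `git diff` of a Taurus script (file name and values are illustrative) makes the change between versions obvious at a glance:

```diff
--- a/tests/perf/home-page.yml
+++ b/tests/perf/home-page.yml
@@ -1,5 +1,5 @@
 execution:
-- concurrency: 100
+- concurrency: 200
   ramp-up: 1m
   hold-for: 5m
   scenario: home-page
```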
You can choose from many options like Taurus, Artillery, Gatling, or anything that allows you to script in a programming language or in a text format that you can manage in the same repository as your application’s code.
My advice for selecting the proper load testing tool in CI/CD is:
- Choose a CI-friendly tool that allows you to quickly compare versions and detect differences using your Git repository manager.
- JMeter is excellent for load testing, but the tests are stored as XML, so it’s not great for CI.
As a side note, if you use tools based on graphical models, like JMeter or NeoLoad, you may be able to find another way to manage versions and changes in order to take advantage of their full potential. The same applies if your protocol is supported only by NeoLoad and not by any open-source tool. The rest of the article applies to these situations as well.
Defining the Test Cases to Automate
Now, the question is: What’s the most straightforward test we can run to detect performance degradations?
First, remember we want to avoid false positives, mainly because we’ll be running these tests daily. These happen when a test reports that something is wrong with the system, but it’s actually the test itself that’s broken.
Also, we want to test early and frequently.
And lastly, take into consideration that end-to-end tests (at the HTTP level) are fragile. User simulation, which reproduces the traffic between the user and the server, is very sensitive to any change in the UI or flow, or even in the way you manage sessions, configurations, etc.
With these considerations in mind, are you familiar with the famous test automation pyramid by Mike Cohn? I’ve modified it to make one for performance testing in CI/CD:
As you go down the pyramid toward the base, you find that:
- API/unit tests are cheaper (easier to prepare, less infrastructure required)
- They’re easier to maintain, so you can run them more frequently (even daily)
- Testing can be done earlier
- As a disadvantage, they’re not conclusive about real-user metrics
And as you go up the pyramid, you’ll find:
- Load scenarios that require infrastructure and load similar to production
- More expensive tests because they’re harder to prepare and maintain
- Better results, direct correspondence with end-to-end user metrics
So, the answer to the question, “What’s the simplest test to detect degradations as soon as they’re introduced?” is unit performance tests.
Unit Performance Testing
The goal is to run the same test every time and compare the newest results with the previous ones to see if the response times have gotten worse.
In end-to-end load simulations, the load scenario and the assertions are based on business needs. But how many users should we simulate when we run tests at the API level, in a scaled-down environment?
The tests should have acceptance criteria (assertions) as tight as possible so that at the slightest system regression, before any negative impact, validation will fail, indicating the problem. We typically do this in terms of error rates, response times, and throughput.
To visualize what problem we’re solving, see the following graphs:
These graphs represent what we don’t want. The second graph shows a degradation in the requests per second, but the test keeps passing and raises no alert, because the acceptance criteria are too loose: it only verifies that the throughput is greater than 60 req/sec, so when the functionality drops from 250 to 200 req/sec, no one pays attention. We need the assertion to fail the build when this happens, or (as in the first graph) when the response times go from 200 ms in one build to more than 400 ms in the next. The difference is too significant.
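With a CI-friendly tool, tight criteria like these can be encoded directly in the test. Here is a sketch assuming Taurus’s passfail reporting module (the exact criteria syntax and the baselines are illustrative; check your tool’s documentation):

```yaml
reporting:
- module: passfail
  criteria:
  # Too loose (what we DON'T want): only fails below 60 req/s, so a drop
  # from 250 to 200 req/s would go completely unnoticed.
  # - hits<60 for 60s, stop as failed

  # Tight: fail the build as soon as the baseline degrades beyond a small margin.
  - p95>220ms for 60s, stop as failed   # baseline 200 ms + 10%
  - hits<225 for 60s, stop as failed    # baseline 250 req/s - 10%
```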
How do we define the load and the assertions? Take a look at the following experiment, which we run against each API endpoint to determine the load scenario to include in our CD pipeline:
Say we run the first test with 100 virtual users without issue; the response times are below 100 ms (at least the 95th percentile), and the throughput is 50 TPS.
Then, we run the test with 200 virtual users, and again, there are no crashes, and times are at 115 ms and the throughput at 75 TPS.
Great, it’s scaling.
If we continue on this path of testing, we will, at some point, reach a particular load in which we see that we’re no longer achieving an increase in the throughput.
Following this scenario, imagine we get to 350 concurrent users, and we have a throughput of 150 TPS, with 130ms response times and 0% errors.
Then we push to 400 virtual users and the throughput is still about 150 TPS, and with 450 users, it even drops below 150 TPS.
There’s a concept called the “knee” that we encounter at a certain point: the point where the throughput curve flattens, which tells us we have saturated some bottleneck.
The TPS is expected to increase as we increase the number of concurrent users; if it doesn’t, it’s because we have exceeded the system’s capacity.
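This reasoning can be sketched as a small script that scans (users, TPS) measurements for the point where throughput stops growing (the threshold and the data below are illustrative, not from any real run):

```python
def find_knee(results, min_gain=0.05):
    """Given (users, tps) pairs sorted by users, return the load level
    after which throughput stops growing meaningfully.

    results: list of (concurrent_users, throughput_tps) tuples.
    min_gain: minimum relative TPS increase to still count as scaling.
    """
    for (users_a, tps_a), (users_b, tps_b) in zip(results, results[1:]):
        # If adding users no longer raises TPS by at least min_gain,
        # we have saturated some bottleneck: this is the knee.
        if tps_b < tps_a * (1 + min_gain):
            return users_a
    return None  # still scaling; keep increasing the load


# Illustrative data matching the experiment described in this article
measurements = [(100, 50), (200, 75), (350, 150), (400, 152), (450, 148)]
print(find_knee(measurements))  # 350: beyond this point, TPS stays flat
```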
This is the primary method for finding the knee when doing stress testing, when we want to know how much our servers can scale under the current configuration.
So, at the end of this experiment, we arrive at the scenario we want to include in our pipeline, with these assertions:
- Load: 350 threads
- Error rate < 1%
- P95 response times < 130 ms + 10%
- Throughput >= 150 TPS - 10%
Then, the test that we will schedule to keep running frequently is the one that executes 350 users, expects less than 1% errors and response times below 130 ms (allowing a margin of 10%, maybe 20%), and, last but not least, asserts the throughput, verifying that we reach 150 TPS (also with a 10% margin).
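Putting the experiment’s outcome together, the recurring check could look like this hypothetical Taurus file (the endpoint, timings, and passfail syntax are illustrative; adapt them to your tool and your own baseline):

```yaml
execution:
- concurrency: 350
  ramp-up: 2m
  hold-for: 10m
  scenario: orders-api

scenarios:
  orders-api:
    requests:
    - url: https://test-env.example.com/api/orders
      method: GET
      label: get-orders

reporting:
- module: passfail
  criteria:
  - fail>1% for 60s, stop as failed     # error rate must stay below 1%
  - p95>143ms for 60s, stop as failed   # 130 ms baseline + 10% margin
  - hits<135 for 60s, stop as failed    # 150 TPS baseline - 10% margin
```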
This is the way we can detect right on time when something decreases system performance.
Note that for this to be valid, we need a dedicated environment for testing. With a dedicated test environment, the results will be more or less predictable. They won’t be affected if, for example, someone else runs something at the same time, causing response times to soar and generating false positives that waste a lot of time.
But wait, am I implying that just doing unit performance testing is enough for teams working in continuous delivery? Not at all!
Good testers know, as Jerry Weinberg taught us, that testing the “parts” is not enough to understand how the “whole” will behave; you need to test the whole to see how the parts interact.
So, make sure to complement your unit tests with integration and load tests, running them with a frequency that your team thinks is best.
But wait, one last thing: you can’t forget about the client side! If you test and optimize the backend while you have memory leaks or high CPU consumption on the client side, your backend optimizations won’t pay off.
For the web, there are plenty of tools to check performance on the client side, such as Google Lighthouse and WebPageTest.
For mobile, I recommend Apptim.
The Key to Performance in CD
Long gone are the days when we release a new version of software every six months (at least I hope). Here are my primary recommendations for performance engineers who need to start running performance tests continuously to maintain a high level of system performance:
- Choose a CI-friendly load testing tool.
- Run unit performance tests to detect degradations as soon as possible.
- Also, run tests as you would in a waterfall from time to time, but expect fewer surprises.
- Consider reviewing and optimizing performance end-to-end, not only the server side but also the client side.
The best part is, you can do all of this to significantly improve the performance of your system by adding these checks, which don’t require a substantial investment. For information on adding other types of automated checks to your CI pipeline, check out my Ultimate Guide to Continuous Testing.
If you’re running your performance checks like this on CD, let me know.
Learn More about the Performance Advisory Council
Want to see the full conversation? Check out the presentation here.