Setting up very large performance tests
Most performance engineers have probably faced this. You start talking with a new project lead about the next application to test, and they say they want it to handle 2 million users. You say: sure, you mean total registered users, or maybe total daily visits, not concurrent users. But no, they actually mean 2 million concurrent users. It’s for a TV show where people can participate and vote with the app. They expect tens of millions of viewers, and a serious share of them are expected to participate with the app. Or at least, they want to verify that it is possible. So, a new challenge is born. Let’s do this: let’s simulate 2 million concurrent users and, as a bonus, let them all vote within 7 seconds (2,000,000 / 7 ≈ 286,000, so up to roughly 300,000 votes per second). And let’s do it with a single controller setup, not “fake it” with multiple controllers at the same time.
Long story short, I’ve run a successful test with 2 million concurrent users using 800 load generators. This test and many other heavy tests (100,000+ virtual users) are the basis of this blog post on best practices and recommendations. Running performance tests with these high numbers of virtual users can be a real challenge. If you do not prepare with care, it’s very likely that your test setup will become the bottleneck instead of the tested application. And that would be kind of ironic.
Tune your scripts
The scripts account for a significant part of the resources your agents use, as they are the core of your test. The goal should be to minimize that usage. There are a couple of things you can do within your scripts:
- Use strict matching patterns for your regular expressions, without many wildcards. Using (multiple) wildcards can be pretty CPU-intensive. Try to tighten the pattern so it finds the match quickly, without repeated scanning and backtracking.
- Do not store previous responses. This can increase memory usage significantly. For debugging and analyzing errors it’s great, but you have to balance having more data against being able to run the test at all.
- Reconsider your assertions/verifications on pages or responses. If you already have a variable extractor, you can probably remove the specific assertion. Do this with care, because you never want to pass a response that was actually a failure. Sometimes out-of-the-box assertions, like those on HTTP status codes, are already enough. Removing redundant assertions slightly decreases CPU usage.
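To make the first point concrete, here is a minimal sketch in Python comparing a loose, wildcard-heavy extractor with a strict one. The response body, the `csrf_token` field name, and the 16-character hex format are all made up for illustration; the same idea applies to the regular expression extractors in whatever load test tool you use.

```python
import re

# Hypothetical response body; the field name and token format are assumptions.
body = (
    "<html>"
    + '<div class="row">filler</div>' * 200
    + '<input type="hidden" name="csrf_token" value="a1b2c3d4e5f60718">'
    + "</html>"
)

# Loose pattern: two greedy wildcards make the engine run to the end of the
# input and backtrack before it settles on a match.
loose = re.compile(r'name="csrf_token".*value="(.*)"')

# Strict pattern: a fixed literal prefix and a bounded character class, so the
# engine can stop as soon as the 16 hex characters and closing quote are seen.
strict = re.compile(r'name="csrf_token" value="([0-9a-f]{16})"')

loose_token = loose.search(body).group(1)
strict_token = strict.search(body).group(1)
```

Both patterns extract the same token here, but the strict one has far fewer ways to match and cannot silently capture too much if the page layout changes. Timing both with `timeit` against your own real responses is the easiest way to see whether the difference matters for your script.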
There are some other things you can do to reduce resource usage, but I’m not a fan of them, because they cripple the realism of a test. And having a realistic test is the foundation of a proper load test. But here are some actions to consider:
- Decrease think time/delays in your test. A shorter session time means fewer concurrent virtual users for the same throughput, which means less memory consumption at any one time.
- Remove the retrieval of static objects. It reduces network and CPU consumption a whole lot, but it also makes the test very unrealistic. If you want to focus more on the backend, though, it’s worth considering.
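The link between session time and concurrency follows from Little’s law: average concurrency equals arrival rate times average session duration. A small sketch, with arrival rate and session lengths chosen purely for illustration:

```python
def concurrent_users(arrivals_per_second: float, session_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x average session time."""
    return arrivals_per_second * session_seconds


# Hypothetical numbers: 1,000 new sessions start every second.
with_think_time = concurrent_users(1000, 120)     # 120 s sessions including think time
without_think_time = concurrent_users(1000, 30)   # 30 s sessions with think time cut
```

The throughput is identical in both cases, but the agents have to keep four times fewer virtual users in memory in the second one. The application, however, also sees four times fewer open sessions, which is exactly the loss of realism described above.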
Tune your scenario
On the scenario part there’s a lot to be done to decrease resources too. Here are a couple of things you can do:
- Disable live updates of virtual users. Many tools have the ability to follow virtual users during the test. Sometimes it’s a nice feature, but it comes with a lot of network usage from the agents to the controller, so disabling this can save a significant amount of network usage.
- Consider more aggregation of data. This usually doesn’t save much during a test, but it can save processing time after the test and also a bit of disk space. Gathering and computing the results after heavy tests doesn’t take a few minutes but a lot longer. It took me more than 30 minutes to process the 2 million user test on a 32-core controller machine.
- Limit the number of errors stored per second. A sudden issue on the environment can cause A LOT of errors at once, so many that the agents and/or controller can be overloaded and die instantly. Usually most errors are the same, so it doesn’t make a lot of sense to store all of them.
- Use a slow, linear increase of virtual users rather than big steps at once. Starting 10,000 users at once will kill your agents immediately. In general, I dislike stepped ramp-ups, because a lot of users start in the same second and follow the same transactions at the same time.
- And last but not least, consider creating your own variable manager. When you have to deal with CSV files or other unique values, you can save a lot of resources on the controller by avoiding the built-in variable manager. When you’re doing a million-plus user test with a steep ramp-up, you may be adding virtual users at a rate between 1,000 and 8,000 per second (in my case). Each of those is a request from an agent to the controller to get the next unique account details. That takes a huge toll on the controller, so what you can do is create an external service that provides the values to the users. I created a service in Python that acts as a simple web server and responds with the next CSV line. It was tuned to handle 10,000 requests per second, so I was good, and it saved a whole lot of CPU resources on the controller.
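As an illustration of the last point, here is a minimal sketch of such a “next CSV line” service using only the Python standard library. It is not the service used in the test described here: the names `LineFeeder`, `make_handler`, and `serve`, the port, and the choice of HTTP 410 for “no lines left” are all assumptions, and a plain `http.server` would very likely need to be replaced or heavily tuned to get anywhere near 10,000 requests per second.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class LineFeeder:
    """Hands out each line exactly once, safely across threads."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._lock = threading.Lock()

    def next_line(self):
        with self._lock:
            return next(self._lines, None)  # None once all lines are used


def make_handler(feeder):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            line = feeder.next_line()
            if line is None:
                # Assumption: 410 Gone tells the script the data is exhausted.
                self.send_response(410)
                self.end_headers()
                return
            payload = line.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # skip per-request logging; it costs CPU at high rates

    return Handler


def serve(csv_path, host="0.0.0.0", port=8080):
    """Load the CSV into memory and serve one line per GET request."""
    with open(csv_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    server = ThreadingHTTPServer((host, port), make_handler(LineFeeder(lines)))
    server.serve_forever()
```

Calling `serve("accounts.csv")` (a hypothetical file name) starts the service; each virtual user then fetches its account details with a single GET at the start of its session, so the controller no longer has to hand out values during the ramp-up.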
Tune your agents
The agents are the services that actually run your scripts. It’s important that an agent be able to handle a lot of concurrent virtual users. The more your agent can support at one time, the fewer agents you need. And the fewer agents you need, the less money you have to spend. It also eases the load on the controller, because it has to address fewer agents.
It’s a good idea to start by running a stress test with a single agent against your application to find the limits of the agent (assuming the application can handle more users). You can see how the agent behaves, how many resources it uses, and at what levels the agent starts to impact the test or results. Things that can happen are unresponsive agents, a crashing agent, and increased response times caused by the agent rather than the application. Or you run into threading issues on the agent. The potential performance problems on an agent are numerous, and finding them helps you tune your agent and eventually know how many virtual users it can safely support. In my experience, the maximum number of concurrent virtual users that a single agent can support is usually somewhere between 2,000 and 3,000, depending on tons of things. It’s a good rule of thumb for ordinary scripts, web applications, normal think times, and a machine with 4 CPUs and 8 GB of memory.
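Once you know the per-agent limit, sizing the agent pool is simple arithmetic. A small sketch (the function name is just for illustration):

```python
def agents_needed(total_users: int, users_per_agent: int) -> int:
    """Round up: a partially filled agent is still a whole agent."""
    return -(-total_users // users_per_agent)  # integer ceiling division


mid_range = agents_needed(2_000_000, 2_500)   # 800
low_end = agents_needed(2_000_000, 2_000)     # 1000
high_end = agents_needed(2_000_000, 3_000)    # 667
```

The 2 million user test with 800 load generators works out to 2,500 virtual users per agent, right in the middle of the range above. In practice it may be worth adding a few spare agents on top of the calculated number in case some fail to start.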
Here are a couple of items where you can tune your agent:
- Tune the kernel for some important settings, like the maximum number of open file descriptors and the network settings. For the network, you can widen the local port range, raise the connection tracking maximum, decrease the TIME_WAIT timeout so connections are removed from the pool earlier, etc.
- Tune the heap size (Xmx) against the memory that is really available. Usually 80% is fine, or leave at least 1 GB for off-heap memory.
- Select a proper cloud instance type. Based on your single-agent test, you can determine whether the test is more CPU-, memory-, or network-intensive.
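A rough sketch of the heap sizing rule, plus a quick check of the open file descriptor limit the agent process actually runs with. Combining the two rules by taking the smaller value is my reading of the guideline, not a fixed formula, and the `resource` module is available on Unix-like systems only.

```python
import resource  # Unix-only standard library module


def suggested_heap_mb(total_mb: int, fraction: float = 0.8,
                      min_off_heap_mb: int = 1024) -> int:
    """Take 80% of memory, but always leave at least 1 GB off-heap."""
    by_fraction = int(total_mb * fraction)
    by_reserve = total_mb - min_off_heap_mb
    return max(0, min(by_fraction, by_reserve))


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


heap_8gb = suggested_heap_mb(8192)   # 80% rule applies on an 8 GB machine
heap_4gb = suggested_heap_mb(4096)   # the 1 GB reserve applies on a 4 GB machine
jvm_flag = f"-Xmx{heap_8gb}m"
```

Every open connection uses a file descriptor, so a soft limit in the low thousands is easily exhausted by an agent running a few thousand virtual users with several connections each.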
Distribute your agents
For increased realism of the test, it’s important to distribute your agents across different providers. If you use only a single provider, you will use a single network and a single transit/peer towards the application. For testing streaming services, which are very network-heavy, it’s important that you access the client’s network over different peers of their network.
Distributing the agents across geographical locations and cloud providers also helps you determine whether an issue is specific to one location. It can happen that response times increase only for agents at the same ISP. If you store the results separately per agent, you can verify this.
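If your tool can export raw results tagged with the agent’s provider or location, a per-group comparison is only a few lines. A minimal sketch; the sample values and provider names below are invented to show the shape of the data, not measurements from any test:

```python
from collections import defaultdict
from statistics import median


def median_by_group(samples):
    """samples: iterable of (group, response_time_ms) pairs."""
    groups = defaultdict(list)
    for group, response_ms in samples:
        groups[group].append(response_ms)
    return {group: median(values) for group, values in groups.items()}


# Made-up illustrative data: (provider of the agent, response time in ms).
samples = [
    ("provider-a", 120), ("provider-a", 130), ("provider-a", 125),
    ("provider-b", 480), ("provider-b", 510), ("provider-b", 495),
]
medians = median_by_group(samples)
```

A gap like the one in this toy data, where one provider is several times slower than the other, points to the network path from that provider rather than to the application itself.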
There are many different cloud providers available where you can rent images for a couple of hours. Some great cloud providers to start tons of agents at once are AWS, Azure, Google and DigitalOcean.
Many commercial load test tools also provide out-of-the-box rentable cloud agents. Use them if you don’t want to spend the time and trouble setting things up yourself. They have agents available all over the world, ready in a few clicks. If you want to set it up yourself, use AWS for sure. It has the easiest interface available to start 100+ agents in a minute.
DNS load balancing
High-volume web applications usually are built upon cloud services. And in most cases those services rely on DNS to load balance the users across regions and servers. When it comes to simulating users, especially at high load, it is very important to “automatically” let the virtual users follow the logic of the DNS load balancing algorithm.
You need to understand how this DNS load balancing works and ensure that your virtual users behave the same way real users (browsers) would. If you don’t care about proper DNS “simulation,” all your users might end up on the same server and the test is completely unrealistic. For example, if a new node is added to the application pool during the ramp-up and the users or load generators use a cached DNS response, the users won’t end up on the new node.
Here are a couple of measures you can take to improve your DNS resolving logic for your users and load generators:
Decrease DNS caching time to 1 second
Java load generators often cache DNS lookups for the complete lifetime of the JVM, which means all users will always connect to the same IP address. Even a cache time of 30 seconds, another common default, is way too long, so set it to 1 second, and you’ll have sufficient queries to your DNS server. It helps if the hostname resolves to a randomly ordered list of IP addresses and the users pick from that list at random, instead of all connecting to the first address.
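On a JVM, the setting to look at is typically the `networkaddress.cache.ttl` security property; check your tool’s documentation for how it exposes it. The behavior you are after can be sketched independently of any tool: re-resolve after one second and pick a random address from the answer. The `ShortTtlResolver` class below is a hypothetical illustration, and the operating system or an upstream resolver may still cache underneath it.

```python
import random
import socket
import time


class ShortTtlResolver:
    """Caches lookups for ttl_seconds, then resolves the hostname again."""

    def __init__(self, ttl_seconds=1.0, lookup=None, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._lookup = lookup or self._system_lookup
        self._clock = clock
        self._cache = {}  # hostname -> (expires_at, [ip, ...])

    @staticmethod
    def _system_lookup(hostname):
        # Ask the system resolver for all addresses of the hostname.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is None or now >= entry[0]:
            # Cache miss or expired entry: look the hostname up again.
            entry = (now + self._ttl, self._lookup(hostname))
            self._cache[hostname] = entry
        # Spread users over all returned addresses instead of the first one.
        return random.choice(entry[1])
```

Each virtual user would call `resolve()` before opening a new connection, so a node added to the pool during ramp-up starts receiving traffic within about a second of appearing in DNS.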
Use a self-hosted local DNS server
By installing and using a local DNS server, you can bypass the DNS server of the ISP where your load generators are located. The trick is to override the TTL of DNS records to 1 second, which prevents the DNS server from answering out of its cache and forces it to fetch fresh records from the origin’s domain. Every update of a DNS record will be known on your load generators ASAP. A usual TTL is 60 seconds for CDN or ELB hostnames. If the CDN or other cloud service updates the DNS records, because it tries to avoid overutilized nodes or adds nodes to the pool, your DNS server, and so your load generators, and so your virtual users, will connect to the new IP addresses sooner, as they would in the real world.
Use DNS servers in different geographical locations
Practically all cloud services use the location of the requesting DNS server (so the DNS server of your load generator) to determine the user’s geographical location with the help of IP location tables, and respond with the list of IP addresses they expect to give the lowest latency to the end user. To kind of fake the location of the virtual users, you can configure DNS servers in different geographical locations on your load generators. There are tons of open DNS servers available, listed with their location. This can be important if you need to spread your virtual users more evenly/realistically across edge nodes, etc.
Balance hostnames or IPs in your script
If a service uses static hostnames for different regions, or you have a set of fixed IP addresses, it can be as simple as connecting to those directly instead of relying on the DNS system. This works only when the hostnames and IP addresses are static.
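In a script, this usually comes down to choosing a target per virtual user. A minimal sketch, assuming your tool gives each virtual user a numeric index; the hostnames are placeholders:

```python
from collections import Counter


def target_for_user(user_index, targets):
    """Round-robin: spread users evenly over a fixed list of hosts or IPs."""
    return targets[user_index % len(targets)]


# Placeholder regional hostnames; replace with the service's real static names.
TARGETS = ["eu.app.example.com", "us.app.example.com", "ap.app.example.com"]

distribution = Counter(target_for_user(i, TARGETS) for i in range(9))
```

With 9 users and 3 targets, each target gets exactly 3 users. If the regions should not receive equal load, a weighted choice is the obvious variation.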
I hope all these tips, tricks, and recommendations give you a little more guidance on how to set up such a test. It’s possible. If you’re about to test an application with a huge number of concurrent users, don’t take the shortcut of testing with fewer users and less think time, or of testing only parts of the system. Just do it and do it right.