[By Twan Koot]
We just completed a successful PAC event in Chamonix, which included two days of experience and knowledge sharing. If you’re reading this post, you’re likely interested in my presentation, but there is more to it than that! Let’s dive in, shall we?
A large part of the work we performance testers/engineers do after a load test is the analysis of the performance monitoring data collected during the test. We each have our own methods for identifying issues. One of the most common I’ve come across is what I like to call the “three-step dance.”
The first step is a check of the “Magic Three” metrics (CPU, RAM, and I/O) using graphs from the APM dashboard. The second step compares each metric against a percent-usage threshold (for instance, 85% CPU consumed); if the metric stays below this threshold, the component is considered healthy. The third step is to examine response times against resource graphs, looking for matching peaks. Should there be a spike in response times, it would stand to reason that a similar one exists in CPU usage, right?
The method described above is, in my opinion, far from optimal. It’s based on percent-usage metrics, which don’t accurately reflect server health. So is there a better approach to analyzing performance monitoring data? Yes, of course: USE (Utilization, Saturation, and Errors), a method described in Systems Performance: Enterprise and the Cloud by Brendan Gregg (brendangregg.com).
The first part is quite common among performance testers: Utilization may be the most popular metric to consider. Nearly all monitoring and APM tools support it, and for most testers it’s the metric they watch. Utilization is just the starting point, though. The second, less popular type, Saturation, is in my view the most critical of the three. Lastly, the hardest to obtain (due to either cloud platform limitations or a lack of tooling) is Errors. The following table provides an overview of the three USE metric types.
Table 1.0 USE metrics plotted on CPU, Memory, and Storage device I/O.
- Utilization – The average time a resource was busy servicing work.
- Saturation – The degree of extra work that can’t be handled immediately and is being queued.
- Errors – The count of error events.
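The table above can be sketched as a small checklist data structure. This is a hedged illustration: the specific metric names below are typical examples drawn from the USE method’s usual presentation, not quotes from the original table.

```python
# A minimal USE checklist: one example metric per U/S/E category for
# each resource type. Metric names are illustrative, not exhaustive.
USE_CHECKLIST = {
    "CPU": {
        "utilization": "per-core busy %",
        "saturation": "run-queue length / scheduler latency",
        "errors": "machine-check or perf-counter errors",
    },
    "Memory": {
        "utilization": "used memory %",
        "saturation": "swapping / paging rate",
        "errors": "failed allocations (e.g. OOM kills)",
    },
    "Storage I/O": {
        "utilization": "device busy %",
        "saturation": "I/O wait-queue length",
        "errors": "device I/O errors",
    },
}

for resource, metrics in USE_CHECKLIST.items():
    print(f"{resource}: U={metrics['utilization']}, "
          f"S={metrics['saturation']}, E={metrics['errors']}")
```

Keeping the checklist as data (rather than prose) makes it easy to confirm, before a test run, that every resource has all three metric types wired up.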
So, how do we apply the USE method? We can use the following flow:
Figure 1.0 USE process flow.
As you can see, it’s a pretty straightforward flow, starting with identifying which resources are available for analysis. It’s preferable to do this before running the test, so the correct metrics are available. Then, select a resource to check. The next step is to go through the USE metrics for that resource; reviewing each type reveals any possible problems. When following this flow, you systematically check resources in a detailed way, going much deeper than solely looking at the %usage of a resource.
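The flow can be sketched as a small loop over resources. The thresholds and sample numbers below are illustrative placeholders I’ve chosen, not values from the flow diagram; tune them per system.

```python
def use_check(resource, metrics, u_limit=0.7, s_limit=0, e_limit=0):
    """Apply the USE flow to one resource's metrics.

    Thresholds are illustrative defaults, not recommendations.
    Returns a list of findings; an empty list means no U/S/E flag.
    """
    findings = []
    if metrics["utilization"] > u_limit:
        findings.append(f"{resource}: high utilization "
                        f"({metrics['utilization']:.0%})")
    if metrics["saturation"] > s_limit:
        findings.append(f"{resource}: work is queueing "
                        f"(saturation={metrics['saturation']})")
    if metrics["errors"] > e_limit:
        findings.append(f"{resource}: {metrics['errors']} errors")
    return findings

# Walk every identified resource, as the flow prescribes
# (sample metric values are made up for the example).
resources = {
    "CPU":    {"utilization": 0.55, "saturation": 4, "errors": 0},
    "Memory": {"utilization": 0.92, "saturation": 0, "errors": 0},
}
for name, m in resources.items():
    for finding in use_check(name, m):
        print(finding)
```

Note how the CPU is flagged here for saturation even though its utilization looks comfortable, which is exactly the kind of problem a %usage-only check misses.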
To show the importance of looking beyond % usage, we will focus on the CPU resource. We’ve all looked at the CPU usage graph to see if usage is high. So, what does 80% CPU utilization mean? Is this CPU overloaded? To answer, we should look into what the metric can actually represent. When looking only at a graph, we might interpret it as something like this:
Figure 2.0 CPU Usage when looking purely at a CPU utilization graph.
Watch out when examining the CPU, as something may be happening elsewhere. While the CPU is servicing work, it may be waiting on other resources before a process can complete. This stalled process shows up in many tools as busy CPU time, even though it is actually waiting on additional resources; see Figure 3.0. Meanwhile, Figure 4.0 shows NMON on an overloaded machine. Note that on the CPU tab, 30% of the shown utilization is “waits on other resources.”
Figure 3.0 What CPU usage can actually mean.
Figure 4.0 CPU Usage Shown in NMON
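On Linux, this split is visible directly in the kernel’s CPU counters. The sketch below parses a `/proc/stat`-style aggregate `cpu` line (field order per the proc man page) to separate truly busy time from I/O wait; the sample numbers are invented to mirror the 30%-waits situation above, not taken from the NMON screenshot.

```python
def cpu_breakdown(stat_line):
    """Split a /proc/stat 'cpu' line into busy, iowait, and idle fractions.

    Field order: user nice system idle iowait irq softirq steal ...
    Real analysis should diff two snapshots; a single snapshot (as
    shown here on a sample line) gives averages since boot.
    """
    fields = [int(x) for x in stat_line.split()[1:]]
    user, nice, system, idle, iowait, irq, softirq, steal = fields[:8]
    total = sum(fields[:8])
    busy = total - idle - iowait  # time actually executing work
    return {"busy": busy / total,
            "iowait": iowait / total,
            "idle": idle / total}

# Illustrative sample: a tool lumping busy + iowait together would
# report 80% "CPU usage", but only 50% is genuinely busy.
sample = "cpu 400 0 100 200 300 0 0 0"
print(cpu_breakdown(sample))
```

Reading the live counters is just a matter of passing the first line of `/proc/stat` (twice, with a sleep in between, and diffing) instead of the sample string.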
We now know that CPU usage can mislead when not monitored in depth. But we’ve only touched the “U” of USE, so let’s go deeper and learn about the impact of Saturation on CPU resources.
Saturation, from a performance tester’s perspective, is especially interesting: it is the degree of work that can’t be handled immediately and is being queued. Queueing typically results in delay (that is, increased response time). So which metric measures CPU saturation? It’s the scheduler queue, shown in many tools as the “RunQueue”: the queue of runnable threads waiting for CPU time. This provides a quick check of whether the CPU is potentially overloaded. But what does it mean if we have a run queue of 15? Let’s take an even deeper dive into the metrics by introducing BCC and eBPF.
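One reason the raw number alone is ambiguous: a run queue of 15 means very different things on different machines. A common rule of thumb (a heuristic, not from the original text) is to normalize runnable threads by CPU count; sustained values above roughly 1.0 per CPU indicate threads waiting for CPU time.

```python
import os

def runqueue_pressure(runnable, ncpu=None):
    """Runnable threads per CPU.

    Sustained values above ~1.0 suggest CPU saturation; the ~1.0
    threshold is a rule of thumb, not a hard limit. The absolute
    queue length by itself says little.
    """
    ncpu = ncpu or os.cpu_count() or 1
    return runnable / ncpu

# The same run queue of 15 reads very differently:
print(runqueue_pressure(15, ncpu=4))   # 3.75 -> heavily saturated
print(runqueue_pressure(15, ncpu=32))  # 0.46875 -> headroom left
```

Even this normalized view only says that queueing exists, not how much latency it adds, which is where the tooling below comes in.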
eBPF stands for “extended Berkeley Packet Filter.” BPF programs run in-kernel, allowing low-overhead monitoring of many deep, low-level metrics, and opening up possibilities for additional metrics. Since BPF is a low-level programming facility, it can be hard to use directly. Luckily, there is BCC, which is described in the following manner: “a toolkit for creating efficient kernel tracing and manipulation programs, including several useful tools and examples. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.”
Figure 5.0 BCC tool collection plotted on system components
With a single BCC tool, we can check how much latency a particular run queue causes. Runqlat measures scheduler (run queue) latency: how long tasks wait before they get CPU time. Its output, shown in Figure 6.0, is a histogram of that latency (by default in microseconds) with the number of samples per bucket collected during the performance monitoring run. Now we can see whether a high run queue leads to substantial latency on the CPU resource.
Figure 6.0 Runqlat showing output overview.
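Once you have runqlat’s log2 histogram, you can reduce it to a single number for your test report, such as an approximate p99 of scheduler latency. A minimal sketch, assuming buckets have already been parsed into ((low, high), count) pairs; the bucket counts below are invented for illustration, not real measurements.

```python
def histogram_percentile(buckets, pct):
    """Estimate a percentile from a log2 latency histogram, given as
    a list of ((lo_usecs, hi_usecs), count) pairs in ascending order.
    Returns the upper bound of the bucket containing the percentile,
    so the estimate is conservative (rounds up).
    """
    total = sum(count for _, count in buckets)
    threshold = pct / 100 * total
    running = 0
    for (lo, hi), count in buckets:
        running += count
        if running >= threshold:
            return hi
    return buckets[-1][0][1]

# Illustrative bucket counts (not from Figure 6.0):
sample = [((0, 1), 120), ((2, 3), 340), ((4, 7), 210),
          ((8, 15), 80), ((16, 31), 20), ((32, 63), 5)]
print(histogram_percentile(sample, 99), "usecs")  # prints: 31 usecs
```

A conservative p99 in the tens of microseconds, as here, would tell you the run queue is adding negligible delay; the same calculation on a saturated box can easily land in the milliseconds.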
Putting it All Together
I hope I was thorough, yet brief enough, for you to follow along. By applying these tools and leveraging these metrics in your analysis, you can efficiently detect and diagnose bottlenecks and performance issues. You also now know how CPU utilization can mislead, and how to use it as a starting point rather than a conclusion.
Okay, maybe it isn’t your “three-step dance.” But its more formal name, USE, works just as well.
Learn More about the Performance Advisory Council
Want to learn more about this event? See Twan’s presentation here.