[By Andreas Grabner]
DevOps Teams are investing a lot in automating tasks that have traditionally been done manually. When it comes to performance validation, we used to run load tests and then have a team of performance experts look at the results of the load testing tool, compare it with previous builds and then give a thumbs up or down.
At the recent Neotys PAC (Performance Advisory Council) in Chamonix, France, I presented on the topic “Performance as Code – Let’s Make it a Standard” – see the presentation here. Thanks to the discussions in France and those that followed, I got a lot of great input on how to automate the validation of deployments following the “Everything as Code” principle. The result just went into the first release of Pitometer, an open source project that is part of keptn. If you want to give it a try you can either use the library on its own as explained on the GitHub page, or leverage it as part of keptn, which uses it to automate continuous deployment and continuous operations for cloud-native platforms such as Kubernetes, OpenShift or CloudFoundry. The following animation shows how Pitometer is used by keptn for automated deployment validation:
In its first version, Pitometer supports Prometheus and Dynatrace as data sources. Neotys is next on the list, as they are an active contributor to the keptn project. Before going into more details, let me cover:
- WHY we came up with Pitometer
- WHICH use cases & requirements it supports
- HOW Pitometer works and how to use it
WHY: Inspiration from Dynatrace, Intuit, T-Systems, Google, Netflix and others
To set the record straight: I was not the one who came up with the initial idea! I saw several different variations of how to automate deployment validation.
The first inspiration came from Thomas Steinmaurer, Chief Performance Architect at Dynatrace, who told me that he pulls in multiple metrics from different tools (testing, monitoring, log …) after every load test, stores them in a database table, performs basic regression analytics (comparing against thresholds or with previous timeframe) and then calculates (=scores) whether the build is good enough or not.
Automatic Performance Validation based on a set of configurable metrics: Simple, Automated, Works!
If you want to learn more from Thomas check out his video and slides from Virtual PAC.
Another inspiration came from Intuit who presented at Dynatrace PERFORM 2018 & 2019 on how they pull load testing and performance monitoring metrics into Jenkins, compare results across builds, provide a quality heat map and use this for automatic build validation.
Starting last year, I worked with Raphael Pionke and Mathias Fichtner from T-Systems MMS. They implemented the OpenSource Jenkins Performance Signature plugin which pulls in metrics specified in a “Performance Signature as Code” file that lives in the Source Code Repo. The configuration included a list of metrics as well as thresholds that were used to evaluate the overall build quality:
“Performance Signature as Code” evaluates Dynatrace metrics against defined thresholds after every Jenkins Build
If you want to learn more check out the blog and video Shift-Left in Jenkins with Performance Signature.
On top of these examples we have Kayenta, an open source framework from Netflix and Google that is used for automated canary validation. Their approach is to calculate an overall canary score based on the evaluation of a set of metrics against defined thresholds.
Last but not least is Pivotal’s Indicator Protocol which follows the same concepts: a list of indicators (metrics & thresholds) defined as “Observability as Code” in order to evaluate key performance indicators of your deployed services or applications!
Conclusion: there are a lot of different implementations, all aimed at automating the validation of deployments by comparing different metrics! But which one to choose? To answer this question I worked out the key use cases that a “Performance as Code” or “Deployment Validation as Code” framework should support, so that we can find out which framework to use.
WHICH Use Cases & Requirements it supports
At Neotys PAC I presented 7 different use cases that I came across in my 20+ years of work as a performance engineer. If you have additional ones, be my guest and let me know. Here is the list of my use cases – details can be found in the 2019 slide deck on the PAC Events Page:
- Continuous Performance Feedback to Developers
- Automated Deployment / Canary Validation
- Automate Test Generation
- Monitoring Alerting Definition
- Auto-Remediation Definition
- Pre-Deployment Environment Checks
- Event-Driven Continuous Performance
Instead of going through all these use cases, let me shortcut to the key requirements we worked out:
Req #1: Metrics beyond Classic Performance KPIs: Include Architecture, Business, Platform
For elastic environments it is no longer enough to just look at response time, failure rate or throughput. Why? Because the platforms (k8s, OpenShift, CloudFoundry, Serverless …) that host our apps will simply scale to ensure the desired response time and will therefore make “bad deployments” look like good ones.
Need an example? The following shows a Dynatrace PurePath captured from a Tomcat container running on Kubernetes. Some of these requests to Tomcat make 26k calls to the backend database:
Deployment Validation therefore needs to go beyond these classical Key Performance Indicators. Here are some examples:
- Number of Container/Pod Instances to handle a certain Throughput
- CPU & Memory Usage per Service Endpoint & Throughput
- Number of Service-to-Service Interactions per Service Endpoint
- Number of Service-to-Database Interactions per Service Endpoint
- Kubernetes Node Utilization, e.g: Requested vs Actual CPU
There is a new set of tools that provide deeper full-stack metrics on the platforms, the applications and the dependencies of deployed services. It’s important to provide access to these metrics, as they give developers and operators better insights into how good their deployments really are.
Req #2: Metric Evaluation: Smart Baselining vs just Static Thresholds
While static thresholds have a place in deployment validation, they make it harder to get started because you have to know your thresholds up front and define them. The group at the PAC event agreed that we need to go beyond static threshold analysis. Here are at least three additional approaches:
1: Compare with reference build, timeframe or environment
This is rather straightforward. Instead of comparing a metric with a static threshold, we compare it with the value captured from
- a previous build: compare Build 17 with Build 18
- a reference timeframe: compare last hour with same hour last week
- a reference environment: compare blue with green or Canary A with Canary B
Here is an example of a Blue / Green comparison clearly showing Green is violating both Response Time & Failure Rate.
Open question: which aggregations to compare – average, 90th percentile, 95th percentile or a combination?
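To make the idea concrete, here is a minimal sketch of such a reference comparison in Node.js. It assumes the metric values (e.g. p90 response times) have already been fetched from a data source; the function name, the allowed-regression percentage, and the example numbers are all illustrative, not Pitometer's actual API.

```javascript
// Compare a metric from the current build against a reference value
// (previous build, reference timeframe, or reference environment),
// allowing a configurable percentage of regression.
function compareToReference(current, reference, allowedIncreasePct) {
  // the current value may exceed the reference by at most allowedIncreasePct percent
  const limit = reference * (1 + allowedIncreasePct / 100);
  return { limit, pass: current <= limit };
}

// Example: Build 18's p90 response time (230ms) vs Build 17's (200ms),
// allowing a 10% regression
const result = compareToReference(230, 200, 10);
console.log(result.pass); // false: 230ms exceeds the ~220ms limit
```

The same function covers all three variants above; only where the reference value comes from changes (previous build, last week's timeframe, or the blue/green peer).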
2: Baselining across builds
This approach calculates a baseline from values captured from previous builds. Doing this for every metric allows us to detect a regression without having to manually specify thresholds. Here is a simple visualization of such an approach:
Open question: which aggregations to compare – average, 90th percentile, 95th percentile or a combination?
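One simple way to implement such a baseline is mean plus a few standard deviations over the metric history of previous builds. This is a sketch under that assumption; the 2-sigma tolerance and the sample values are illustrative.

```javascript
// Derive a dynamic threshold from previous builds instead of a static one:
// a new build is flagged only if its value exceeds mean + `sigmas` stddevs.
function baselineThreshold(history, sigmas = 2) {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance =
    history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  return mean + sigmas * Math.sqrt(variance);
}

// p90 response times (ms) of the last five builds
const threshold = baselineThreshold([200, 205, 198, 202, 195]);
// a new build only fails if its p90 exceeds roughly mean + 2 stddev (~207ms here)
```

The nice property is that no one has to decide "is 200ms good?" up front; the history of builds answers that question.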
3: Anomaly Detection
The problem with all evaluation approaches described so far is that we always try to capture a single value for, e.g., Response Time or Failure Rate – BUT which Response Time value would we look at when evaluating the result of a 2-hour load test? Is it the average, the 90th percentile or the max?
In our discussions we agreed that a single outlier during an observed timeframe is most often not a bad thing. The problem is if a metric is above a certain threshold for a longer time period than allowed. The following example shows an approach to evaluate a sliding window of 5 minutes and only alert in case the observed metric violates the threshold for more than 3 minutes of that sliding window:
This approach can be combined with the previous approaches by using the calculated baseline of previous builds or the values of a reference timeframe as the threshold.
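The sliding-window rule described above can be sketched in a few lines. This is an illustrative implementation, assuming one sample per second for simplicity; the window and breach durations mirror the 5-minute / 3-minute example (scaled down to seconds here).

```javascript
// Flag a violation only if the metric stays above `threshold` for more than
// `maxBreachSeconds` within any `windowSeconds` sliding window.
// `samples` is an array of [timestampSeconds, value] pairs, one per second.
function violatesWindow(samples, threshold, windowSeconds, maxBreachSeconds) {
  for (const [start] of samples) {
    // samples falling into the window starting at this timestamp
    const inWindow = samples.filter(([t]) => t >= start && t < start + windowSeconds);
    // seconds within the window where the metric breached the threshold
    const breached = inWindow.filter(([, v]) => v > threshold).length;
    if (breached > maxBreachSeconds) return true;
  }
  return false;
}

// threshold 150ms, 5-second window, alert if breached for more than 3 seconds
const samples = [[0, 160], [1, 170], [2, 180], [3, 165], [4, 140], [5, 120]];
console.log(violatesWindow(samples, 150, 5, 3)); // true: 4 of the first 5 seconds breach
```

A single outlier sample never trips this check; only sustained violations do, which matches the agreement from the PAC discussions.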
Req #3: Metrics Definition Tool Agnostic
While we had a couple of tool vendors sitting at Neotys PAC, we all agreed that a framework for deployment validation needs to be tool agnostic – meaning the framework must provide a way to pull in metrics from different data sources (= different monitoring, testing or analytics tools).
Req #4: Custom Extensions for Testing, Monitoring & Auto-Remediation Tools
While deployment validation was the key use case that drove these discussions, we all agreed that we need to think beyond metric-based validation. The framework should allow custom metadata so that other tools can build on top of these metric definitions. Here are three examples:
1: Test Script & Workload Generation
If we can pull throughput metrics of individual service endpoints over a given period of time, we can repurpose that data to generate test scripts and test workload definitions.
Example: pull the throughput metrics of the top service endpoints of the last hour in production. Based on that data we can create a load testing workload definition that simulates the same load behavior as in production.
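A sketch of that repurposing step, assuming per-endpoint hourly request counts have already been pulled from the monitoring tool. The endpoint names, the output shape, and the scaling factor are illustrative; a real integration would emit the workload format of a specific load testing tool.

```javascript
// Turn per-endpoint production throughput (requests per hour) into a
// load test workload definition with per-endpoint target request rates.
function workloadFromThroughput(throughputPerHour, loadFactor = 1.0) {
  return Object.entries(throughputPerHour).map(([endpoint, requests]) => ({
    endpoint,
    // requests per second the load generator should produce for this endpoint
    targetRps: (requests / 3600) * loadFactor,
  }));
}

// last hour in production: 36000 hits on /cart, 7200 on /checkout
const workload = workloadFromThroughput({ '/cart': 36000, '/checkout': 7200 });
// → /cart at 10 rps, /checkout at 2 rps, mirroring the production traffic mix
```

Raising `loadFactor` above 1.0 would simulate, say, a 2x traffic spike with the same production mix.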
2: Monitoring Alert Definitions
If we use metrics to validate a deployment in a staging environment, we can take these “confirmed” metric values and apply them to a monitoring tool for production alerts.
Example: a response time of 150ms was validated for the latest build in staging, which is now promoted to production. We can take this value and automatically create custom alerts for our production monitoring so that we get notified in case production behaves differently than what was validated in staging!
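As a sketch, promoting a staging-validated value into a production alert might look like this. The output shape is illustrative, not the alert format of any specific monitoring tool, and the 20% headroom is an assumption.

```javascript
// Take a metric value that was validated in staging and turn it into a
// production alert definition, with some headroom to avoid flapping alerts.
function alertFromValidatedMetric(metric, validatedValue, headroomPct = 20) {
  return {
    metric,
    // alert when production exceeds the staging-validated value plus headroom
    alertThreshold: (validatedValue * (100 + headroomPct)) / 100,
    source: 'validated-in-staging',
  };
}

const alert = alertFromValidatedMetric('response_time_ms', 150);
// alerts when production response time exceeds 180ms
```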
3: Auto-Remediation Hints
We can extend “Monitoring Alert Definitions” by not only getting an alert in case of a violation in production but by automatically calling the best-known remediation workflow.
Example: if response time exceeds our threshold of 150ms and if we observe an increase in overall traffic volume we might just as well call a script that can temporarily scale up our platform. This would most likely be the logical “manual” remediation.
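The decision logic from this example can be sketched as follows. All names, thresholds and the 1.5x traffic-growth heuristic are illustrative assumptions, not part of any shipped remediation tool.

```javascript
// Pick a remediation action: if response time breaches the threshold AND
// traffic has grown significantly, scale up instead of paging a human.
function remediationAction({ responseTimeMs, thresholdMs, trafficNow, trafficBefore }) {
  if (responseTimeMs <= thresholdMs) return 'none';
  // a clear traffic increase suggests a capacity problem, not a code regression
  return trafficNow > trafficBefore * 1.5 ? 'scale-up' : 'alert-oncall';
}

console.log(
  remediationAction({ responseTimeMs: 220, thresholdMs: 150, trafficNow: 900, trafficBefore: 500 })
); // → 'scale-up'
```

If response time is bad but traffic has not grown, the sketch falls back to alerting, since scaling up would likely not help a genuine regression.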
HOW Pitometer works
All these use cases and requirements influenced how we built Pitometer, which is part of keptn, where it is used for automatic deployment validation through the Pitometer keptn Service. You can go ahead and try Pitometer as part of keptn by following the keptn installation instructions. You can also use Pitometer standalone to execute your evaluation.
The following animation shows the workflow of how Pitometer interacts with its core components to evaluate a deployment:
- Pitometer: The core library that orchestrates the evaluation process
- Data Source: Pitometer supports pluggable data sources to query specific metrics
- Grader: Pitometer passes metric values to the grader which returns a metric score
- Specfile: Defines Indicators (Metric + Grading) and an Overall Validation Objective
The result of an evaluation lists the result of each Indicator, how many points were given, details on violations, and the overall score. Here is a sample result file and an explanation of the individual sections:
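To make the Specfile concept more tangible, here is a hypothetical sketch of what such a spec could look like: one Indicator combining a data source query with a Threshold grading, plus overall objectives. Field names and values are illustrative; see the GitHub page for the exact schema of the released version.

```json
{
  "indicators": [
    {
      "id": "response_time_p90",
      "source": "Prometheus",
      "query": "<your metric query goes here>",
      "grading": {
        "type": "Threshold",
        "thresholds": { "upperWarning": 200, "upperSevere": 400 },
        "metricScore": 50
      }
    }
  ],
  "objectives": { "pass": 90, "warning": 75 }
}
```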
A little more color on the evaluation process
When Pitometer does its evaluation it first passes the Indicator Query Definition to the respective data source, e.g. Dynatrace (average conversion rate for the sockshop-blue app). This value is then passed to the Grader, e.g. the Threshold Grader, which does a static comparison against the warning and severe thresholds. It’s up to the Grader’s implementation how many points of the maximum metricScore to give. The Threshold Grader, for instance, gives all points if the value does not exceed any limit, 50% if it exceeds the warning threshold and 0% if it exceeds the severe threshold.
At the end, Pitometer sums up all points the grader gave for each Indicator. This becomes the total score which can then be compared to the overall objective. Now we know whether this deployment has achieved enough points to be considered pass, warning or fail!
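The scoring mechanics can be sketched in a few lines. This is an illustration of the logic just described, not Pitometer's actual code; indicator values, thresholds and objectives are made up.

```javascript
// Threshold grading: full points within limits, half on a warning breach,
// none on a severe breach.
function gradeThreshold(value, { warning, severe, metricScore }) {
  if (value > severe) return 0;                 // severe breach: no points
  if (value > warning) return metricScore / 2;  // warning breach: half points
  return metricScore;                           // within limits: full points
}

// Sum the points of all indicators and compare against the objectives.
function evaluate(indicators, objectives) {
  const total = indicators.reduce(
    (sum, i) => sum + gradeThreshold(i.value, i.grading), 0);
  if (total >= objectives.pass) return { total, result: 'pass' };
  if (total >= objectives.warning) return { total, result: 'warning' };
  return { total, result: 'fail' };
}

const outcome = evaluate(
  [
    { value: 180, grading: { warning: 200, severe: 400, metricScore: 50 } },
    { value: 250, grading: { warning: 200, severe: 400, metricScore: 50 } },
  ],
  { pass: 90, warning: 70 }
);
// first indicator earns 50 points, second earns 25 → total 75 → 'warning'
```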
Testing Pitometer as Standalone Library
If you want to use Pitometer yourself simply follow the instructions on the GitHub page which also includes a sample Node.js application.
You will also learn that you can pass additional context to Pitometer which is passed to Sources and Graders. Additionally, a timeframe needs to be provided for each run. Keptn for instance passes the timeframe of the previous test execution to Pitometer to evaluate the metrics captured during the run of the performance test.
Pitometer as part of keptn
As shown in the animation at the beginning of this blog, we built a Pitometer Service for keptn which uses Pitometer for automated deployment validation. You can try it yourself by deploying the current version of keptn as described in the keptn docs.
Extending Pitometer with Sources and Graders
We already have external contributors such as Neotys and T-Systems MMS who are extending Pitometer with additional sources and graders. Neotys will provide a source to pull in Load Testing Metrics while T-Systems is building a more sophisticated grader that will allow you to compare against previous builds or against a baseline from previous builds. If you want to contribute your own extension just have a look at the current implementations such as Pitometer Source Dynatrace or Pitometer Grader Thresholds.
Thanks, Neotys for PAC
Now it’s just time to say THANK YOU Neotys for organizing these PAC events. They are a great opportunity for performance experts around the globe to get together, share experiences and bounce around new ideas. And as you could see in my case – they inspire new open source projects!
Learn More about Performance as Code
If you want to learn more about this event, see Andreas’s presentation here.