#NeotysPAC – Continuous Cluster Performance Validation, by Thomas Steinmaurer & Andy Grabner

  [By Thomas Steinmaurer & Andreas Grabner]

A day in the life of a 2.5 full-time-equivalent Cluster Performance Engineering team, tackling challenges in an agile development process with 40+ engineering teams and a bi-weekly go-to-production schedule.

Our Digital Transformation by Numbers

Dynatrace as a company – founded in 2005 – has gone through a major transformation in the last few years.

Years back, we released product updates twice a year. That has completely changed in recent years with our new 3rd-generation software intelligence monitoring solution, which meets the requirements and challenges of large-scale, cloud, container and serverless architectures. Nowadays we push production-ready features into our SaaS offering every two weeks, and to our On-Premise (Managed) customers every four weeks. 40+ globally distributed engineering teams add new features and enhancements on a daily basis in our bi-weekly agile development process. Besides many other aspects that make this happen, the likely most crucial component is our Continuous Delivery & Feedback (CDF) pipeline. CDF is the beating heart that turns engineering work into something executable, for various stages and for different purposes.

The Dynatrace Continuous Delivery & Feedback (CDF) Pipeline

Our pipeline consists of three stages: two pre-production stages called DEV and ACCEPTANCE and, of course, a PRODUCTION stage. In total, more than 30 separate Dynatrace clusters hosted on AWS in different regions, with more than 1000 EC2 instances up and running 24×7.

Engineers merge their code and fixes into Git MASTER, which the CDF pipeline automatically deploys into the DEV stage several times a day. This environment includes a dedicated Dynatrace cluster we call Demo.Dev, used for demoing purposes, including sprint review meetings, and for gathering feedback from product management at an early stage. Automation and fast feedback are the keys to success.

The ACCEPTANCE stage is also automatically updated with new cluster builds daily. But instead of pushing every Git MASTER commit, we deploy approved sprint builds, including sprint branch backports (code fixes) that happen throughout the two-week test cycle. The ACCEPTANCE stage is all about testing features from an end-to-end perspective, with two approaches: classic manual testing and, for the majority, automation, for example automated UI testing.

Both pre-production stages also have dedicated load test environments, but more on that later.

PRODUCTION is pretty self-explanatory. It provides a “fast lane to production” in case emergency hotfixes need to be applied quickly to one or more of our 20+ production SaaS clusters.

All three stages are monitored with our own product. That means we “eat our own dog food” – or, as some colleagues say, “drink our own Champagne”. Monitoring in all stages results in proactive investigation by our engineering teams. Despite lots of testing, it sometimes happens that we see certain uncommon resource utilization patterns for the first time in production. Having monitoring in production is therefore a great safety net and a chance for us to learn and improve our pre-production testing.

Local Development vs. Large Scale Production

Engineering work is done on local hardware. Not a big surprise. A typical local development area could look like this. 😊

A local environment full of cress – but only very occasionally, honeymoon-related, courtesy of some nice colleagues. As you can see, we do not lose our humor despite all the hard work we put in 😊.

On a more serious note, a local environment means a capable machine with a powerful desktop CPU, an SSD and 32 GB RAM. From a Dynatrace product perspective, working locally means that developers run a single Dynatrace server process which does all the heavy-lifting data analysis. The machine also has to run the backend storage systems like Cassandra and Elasticsearch. This means that our developers do not run a cluster (distributed system), there is no network latency and, additionally, the server process by default pre-creates 5 tenants. A tenant is simply an isolated area across our whole product, including transactional (PurePath) storage and database backend data like timeseries and real-user monitoring data. So the Smartscape UI, a prominent area in our product, may look like this locally.

A real Dynatrace agent installed locally, resulting in a single host being monitored, with a few processes and auto-detected services. Looks great and rocks, even from a performance perspective.

A typical Dynatrace in production looks a bit different though. Looking at a typical single SaaS cluster, the deployment/infrastructure looks like this:

A real distributed system with several nodes per component, hosted in a single AWS region across all three availability zones to guarantee fault tolerance and scale-out capabilities. Such a production cluster is a shared environment, usually hosting more than 1000 tenants (customers), with more than 25K agents sending monitoring data. Smartscape for a single tenant (possibly one out of 1000) could look like this:

More agents, much more complexity, our AI detecting abnormal situations and showing red bubbles in the topology model, etc. A lot different from what engineers try out locally.

Another example is our Visual Resolution Path UI, which allows the user to go back and forth in time and see how an AI-detected problem evolves over its lifetime.

This is a very complex problem detected by the AI, whose rendering looks like a “mushroom”. 😊

While this is somewhat impressive, it is not really useful for our customers due to the sheer number of involved entities and connections between them. And yes: this also caused some stability issues in our cluster.

The “negative” examples we just highlighted are mainly the result of individual features not being built for large scale. Additionally, four years ago we had no dedicated, continuous load testing with dedicated infrastructure and human resources. That’s something we had to address!

Key Performance Influencers

There are quite a few variables in our system that play an important role and can negatively affect the cluster.

  1. Number of Servers
  2. Number of Tenants (40+ periodic workers per tenant per server)
    • => 160K periodic workers in a single 4-server cluster with 1000 tenants (see the arithmetic below)
  3. Number of Agents sending monitoring data
  4. Size and complexity of the topology model
    • Monitored hosts, processes, services etc.
  5. Timeseries data
  6. Number and type of End-User UI requests
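
To make the second influencer concrete, the 160K figure is simply the product of the numbers given above:

40 periodic workers per tenant per server × 1000 tenants × 4 servers = 160,000 periodic workers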

With all these variables, 40+ engineering teams working on the product and our bi-weekly production-ready release cycle, how do we ensure:

  • Throughput
  • Scalability
  • Robustness

We strongly believe that only 24×7 Continuous Cluster Performance Validation (CCPV) allows us to tackle these sorts of non-functional requirements. By now, CCPV is established as equally important to anything else we are doing in our product!

CCPV – Environments and Scenarios

Our CCPV for Dynatrace SaaS is backed by two dedicated load test environments:

DAILY Regression

DAILY Regression is a small cluster with smaller EC2 instance types compared to production, with the goal of detecting performance regressions early, in a small-scale environment. Like our Demo.Dev cluster, DAILY Regression is automatically updated with Git MASTER builds and fed with the same simulated/incoming load to make regression detection possible: daily regression = daily validation.

Our Continuous Regression Monitoring (CRM) area – a very simple but sufficient tool – gives us feedback for each overnight run by gathering more than 70 metrics across our core components. It is fully automated, using the Dynatrace Timeseries API (remember, we eat our own dog food, ehm sorry, drink our own Champagne?) and a simple change detection, which lets us decide whether we need to dig further into a potential regression.
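
To illustrate the idea of that change detection (this is a sketch, not our actual CRM implementation), the core of it boils down to comparing per-metric averages of two consecutive nightly runs and flagging anything that moved beyond a threshold. Metric names, sample values and the 20% threshold below are purely illustrative; in the real tool the values come from the Dynatrace Timeseries API.

import java.util.Map;

// Sketch of a simple nightly change detection: compare per-metric averages of two
// consecutive runs and flag anything that moved beyond a threshold.
public class ChangeDetection {

    static final double THRESHOLD = 0.20; // flag changes beyond +/- 20% (illustrative)

    public static void main(String[] args) {
        // Averages per metric for two consecutive nightly runs (illustrative values;
        // in the real tool these come from the Dynatrace Timeseries API).
        Map<String, Double> previousRun = Map.of("correlation.responseTimeMs", 120.0, "gc.suspensionPercent", 2.1);
        Map<String, Double> currentRun  = Map.of("correlation.responseTimeMs", 180.0, "gc.suspensionPercent", 2.2);

        previousRun.forEach((metric, before) -> {
            double after = currentRun.get(metric);
            double change = (after - before) / before;
            if (Math.abs(change) > THRESHOLD) {
                System.out.printf("Potential regression in %s: %+.1f%%%n", metric, change * 100);
            }
        });
    }
}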

What we get is a daily performance signature across all cluster components, comparable across different dimensions. For example, CRM pointed us to the following concrete regression:

Our CRM detected a relative response time increase of ~50% in a core component of the server process just from updating the cluster build on November 16, with the same incoming load. This definitely needs further attention, analyzing code-level changes with the responsible team. Once a potentially impacting code change has been identified, e.g. because certain stack traces pop up as top contributors in Dynatrace, a simple revert may act as verification, e.g. the response time returns to its previous level.

While a revert is usually not the final solution – especially when talking about a new feature – we at least know what caused the regression and can work on improving it.

SPRINT Builds

SPRINT (Builds) Regression has a different purpose. This is a dedicated cluster on production-like infrastructure, equally capable and expensive as a real production cluster. It serves sprint branch builds like the ACCEPTANCE stage and executes 24×7 large-scale tests focusing on Sprint X to X+1 regression and robustness topics, including cluster rolling-update scenarios without downtime at full load, with 1500 tenants and more than 40K simulated agents attached. Cluster updates are not triggered automatically like in other DEV clusters, but manually via our orchestration layer called Cloud Control, so we can also cover longer runs (e.g. > 72 hrs without restart) in the context of detecting slowly increasing memory leaks etc.

With our bi-weekly agile development process, the timeline for sprint build load testing looks like this:

While a certain sprint (e.g. Sprint 82) is still in development, we have Sprint 81 branch builds running in a large-scale load test for two weeks. When Sprint 82 development finishes on Thursday evening, we do an additional over-the-weekend run with the latest Sprint 81 branch build, including all backports available at that time. Monday/Tuesday of the next week is targeted for a production GO / NO-GO decision. If there is a GO, a rolling cluster update from Sprint 81 to 82 is executed at full load without downtime, and a new two-week test cycle starts from scratch. During those two weeks, additional sprint branch build updates are performed – not automated on a fixed schedule, but purely on demand with a single mouse-click in our orchestration layer.

CCPV Toolbox

You may wonder how we simulate the load. Of course, it does not make sense to spin up 2K machines to simulate 2K Dynatrace OS agents. One of the most important tools in our toolbox is a load generator that simulates real-world Dynatrace OneAgent traffic. As our OneAgents send very specific and proprietary data following a particular message workflow, we couldn’t use one of the industry-leading load testing tools such as NeoLoad or JMeter. We had to build our own load generator, called Cluster Workload Simulator (CWS). Properly configured on a capable machine, a single CWS JVM can simulate up to 20K agents.
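
The CWS itself is proprietary, but the core idea of simulating thousands of agents from a single JVM can be sketched roughly as follows: a shared scheduler thread pool drives many lightweight simulated agents, each emitting its payload on its own interval. Class names, intervals and the payload below are illustrative and not the actual CWS code.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Sketch: instead of one OS process per agent, thousands of simulated agents share
// a small scheduler thread pool and each periodically emits its payload.
public class AgentSimulator {

    public static void main(String[] args) {
        int simulatedAgents = 20_000;
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(16);

        for (int i = 0; i < simulatedAgents; i++) {
            final int agentId = i;
            long initialDelay = ThreadLocalRandom.current().nextLong(60_000); // spread start-up over one minute
            scheduler.scheduleAtFixedRate(() -> sendMonitoringData(agentId),
                    initialDelay, 60_000, TimeUnit.MILLISECONDS);
        }
    }

    static void sendMonitoringData(int agentId) {
        // Placeholder: the real simulator builds and sends the proprietary OneAgent
        // messages (topology, timeseries, PurePaths) to the cluster under test.
        System.out.println("agent-" + agentId + " sending payload");
    }
}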

In addition to generating load, we also use:

  • Dynatrace for monitoring (ever heard that we drink our own Champagne? 😊)
  • Our already discussed CRM area for daily regression = daily validation
  • Common Java tools for heap and thread dump analysis, and
  • Java Flight Recorder, especially for object churn and lock contention analysis

In load tests, each JVM has additional JFR-related options in place. For example:

-XX:+UnlockCommercialFeatures
-XX:+FlightRecorder
-XX:+UnlockDiagnosticVMOptions
-XX:+DebugNonSafepoints
-XX:FlightRecorderOptions=stackdepth=1024,repository=/data/jfr
-XX:StartFlightRecording=settings=profile.jfc,duration=60m,delay=360m,filename=/data/jfr/profile_duration60min_delay360min_`(date +%Y%m%d-%H%M%S)`.jfr.zip,compress=true

This results in gathering a one-hour JFR session after the JVM has been running for six hours. So, in DAILY Regression, with daily restarts due to cluster updates, we end up with daily JFR session files as an additional source for investigation, if needed.

...

82902657 Mar 21 01:48 profile_duration60min_delay360min_20180320-184648.jfr.zip

80738948 Mar 22 01:34 profile_duration60min_delay360min_20180321-183331.jfr.zip

82080134 Mar 23 01:33 profile_duration60min_delay360min_20180322-183319.jfr.zip

...

The Role of Third-Party Components

Our code is critical and needs to pass various stages before it gets deployed to production. I cannot stress enough that third-party components need to be treated as equally important – or perhaps even as evil – from a quality perspective.

As for many of you, third-party components play an important role for us as well. Among others, we use Cassandra and Elasticsearch as database backends and Jetty as an embedded web server. Each third-party component must equally pass our quality stages, including CCPV.

The following screenshot shows a situation where a new version of Cassandra resulted in doubled JVM GC suspension under the same load conditions we always run in our SPRINT load test environment.

Because we constantly monitor our third-party components just as we do our own code, we could immediately pinpoint the issue and address it.

Another example was a minimum heap usage that slowly accumulated over days – obviously a memory leak.

We reported these issues back to the Cassandra community and did some re-tests with the potential fixes that were provided.
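
The tell-tale signal in such a case is a post-GC heap minimum that keeps climbing day after day. Purely as an illustration of how such a slow drift can be turned into an automated flag (this is not a Dynatrace feature and not our actual tooling), a simple linear fit over daily heap minima is already enough; all values and the threshold below are made up:

// Sketch: flag a slowly growing memory leak by fitting a line through the daily
// minimum (post-GC) heap usage and checking for a consistent upward slope.
public class HeapTrend {

    public static void main(String[] args) {
        double[] dailyMinHeapMb = { 2100, 2150, 2230, 2280, 2360, 2410, 2490 }; // illustrative samples
        double slopePerDay = slope(dailyMinHeapMb);
        if (slopePerDay > 20) { // more than ~20 MB growth per day (illustrative threshold)
            System.out.printf("Suspicious heap growth: ~%.0f MB/day%n", slopePerDay);
        }
    }

    // Least-squares slope of y over x = 0..n-1.
    static double slope(double[] y) {
        int n = y.length;
        double meanX = (n - 1) / 2.0, meanY = 0;
        for (double v : y) meanY += v / n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (y[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }
}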

A similar situation happened with Jetty. We faced a memory leak when comparing Jetty 9.4.8 vs. 9.4.7 under heavy load. See the suspension chart from the Dynatrace process group UI below, across three Jetty instances (in a three-node Managed Dynatrace cluster) with different versions.

This is not about bashing third-party components. They are extremely useful and add a lot of value for us, but it is a reminder to plan and test third-party component updates extensively, because getting an instant fix for a third-party component is close to impossible.

Lessons Learned

The biggest lesson learned was that we had to provide constant feedback to our engineers on performance regressions – not only at the end of a sprint, but on a continuous basis, which allows them to address performance problems right when they introduce them through a code or configuration change.

We at Dynatrace have the luxury of having Dynatrace as a product at our disposal for all our pipeline stages. Since we started promoting our story of Continuous Performance Validation, we have seen more of our customers following suit and leveraging Dynatrace to capture and provide more granular feedback for every build that gets pushed through the continuous delivery pipeline. We designed the Dynatrace API so that relevant data can be pulled into tools such as Jenkins, Bamboo, Azure DevOps or Concourse to act as quality gates – just as we have done with our performance signature validation.
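
As a rough sketch of what pulling such data could look like in Java: the endpoint path, timeseries id and query parameter below follow the general shape of the Dynatrace Timeseries API v1, but they are written from memory and should be checked against the current API documentation; environment URL and token are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of pulling timeseries data for a build quality gate. Endpoint, timeseries
// id and parameters are assumptions and may differ from the actual API.
public class QualityGatePull {

    public static void main(String[] args) throws Exception {
        String url = "https://YOUR_ENVIRONMENT.live.dynatrace.com/api/v1/timeseries/"
                + "com.dynatrace.builtin:service.responsetime?relativeTime=hour";

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Api-Token YOUR_API_TOKEN")
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A CI step (e.g. in Jenkins) would parse the JSON body, compare the values
        // against a baseline or threshold, and fail the build on a violation.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}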

If you have any questions, don’t hesitate to leave a comment.

Learn More 

If you want to learn more about this event, see Thomas Steinmaurer’s presentation here.
