[By Andreas Grabner, Dynatrace]
What happens if you lock a handful of seasoned performance engineers into a Scottish castle and let them discuss the future of performance engineering? Magic happens!
In all seriousness: thanks to Neotys for hosting the first Performance Advisory Council – simply called PAC. I believe I got invited because I have been part of the global performance engineering community for the last 15+ years, and my friends at Neotys probably thought I would have some thoughts to share on how performance engineering is evolving.
During the two days of PAC it became clear to me that the larger community understands that “Winter is coming.” And while I mean this literally (I saw snow last week), I also mean that a significant shift and disruption is happening in our profession. Performance engineering in 2020 will not be the same as it is now – because of several “buzzwords” floating around the DevOps transformation, all of which will also impact performance engineering. Wilson Mar gave a great overview at PAC, touching on several new disciplines during his presentation:
- Automation: Not only test execution but also cause analysis
- Shift-Left: Provide automated performance feedback earlier in the pipeline
- Shift-Right: Influence and leverage production monitoring
- Self-Service: Don’t become the bottleneck! Provide your expertise as self-service
- End User Monitoring: That’s the bottom line! Happy Users make Happy Business!
- Cloud-Scale: Static monitoring and planning are not how IT works in 2017!
- Artificial Intelligence: Leverage machine learning and big data for better and faster performance advice!
A key capability and skill that was discussed by almost every presenter was Application Performance Management (APM). There are several aspects in which APM supports future performance engineers:
- Code Level Root Cause Analysis and Tuning while running Performance Tests
- Leverage APM Production Data for creating more realistic test workloads
- Testing your production APM installation as part of your load tests
- Automate Performance Metrics into your CI/CD Pipeline
- Provide Performance Engineering Services based on live APM production data
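The CI/CD integration mentioned in the list above can be sketched as a simple gate: compare metrics from the current load-test run against baselines from earlier builds and flag regressions. The metric names, values, and the 20% tolerance below are illustrative assumptions of mine, not a Dynatrace API:

```python
# Minimal sketch of a CI/CD performance gate. Metric names, values, and
# the 20% tolerance are illustrative assumptions, not a Dynatrace API.

def check_performance_gate(metrics, baselines, tolerance=0.2):
    """Return violations where a metric exceeds its baseline by more
    than the given tolerance (20% by default)."""
    violations = []
    for name, value in metrics.items():
        baseline = baselines.get(name)
        if baseline is not None and value > baseline * (1 + tolerance):
            violations.append((name, value, baseline))
    return violations

# Metrics from the current load-test run vs. baselines from earlier builds
current = {"response_time_ms": 420, "error_rate_pct": 0.5}
baseline = {"response_time_ms": 300, "error_rate_pct": 1.0}

for name, value, base in check_performance_gate(current, baseline):
    print(f"FAIL: {name} = {value} (baseline {base})")
```

In a real pipeline this would run as a build step that exits non-zero on violations, failing the build before a regression reaches production.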
Working for Dynatrace, a vendor in the APM space, I got to talk about how Dynatrace evolved our monitoring capabilities to support the future performance engineering discipline. While most of you might know what APM used to be from our AppMon product (or tools like AppDynamics, New Relic, Wily, etc.), you may not yet have been exposed to our new AI-powered Dynatrace Fullstack Monitoring platform powered by Dynatrace OneAgent. Wow – that’s a lot of buzzwords in a single sentence.
To cut through the buzzword cloud, I decided to share the internals of what Dynatrace built, what the whole AI (Artificial Intelligence) is all about and why we believe our new approach of capturing, analyzing and providing access to performance data is the best approach to support the future requirements.
Dynatrace AI Demystified
I started my presentation with the statement: “I will explain what we built and how we analyze the data we have. It is up to you, in the end, to decide whether you want to call it AI or not. We strongly believe though that we built something different, something that will help you succeed with the new types of applications we are dealing with!”
I structured my talk into three sections:
- OneAgent: Why and what we built
- Insights into the AI
- AI in Action
Let’s dig into it.
#1 – OneAgent: Why and What We Built
With our AppMon product, we learned that it is important to have visibility into 100% of all transactions within individual applications. What we were missing was 100% visibility into the full stack, a good SaaS and on-premise offering, and an easy, fully automated way to deploy our agents regardless of OS and application stack!
Our OneAgent solved all these problems: One Agent to monitor them all
- A single installer for Windows, Linux, AIX, and Solaris
- Automatically monitors your host, all processes, services and log files
- Automatically monitors all network connections and therefore understands all dependencies
- Automatically injects into your Application Stack Runtimes to provide code-level visibility
- Automatically provides end-user monitoring for all your web and mobile applications
The following shows SmartScape, the visualization of all detected and monitored entities (data centers, hosts, processes, services, and applications) and all their horizontal and vertical dependencies:
SmartScape visualizes all dependencies between all automatically monitored entities based on OneAgent data capturing
End-to-End Tracing and Code-Level Visibility are capabilities that most modern APM tools provide. In the case of Dynatrace we call it PurePath – the “pure” end-to-end execution path for a single transaction through your application stack:
End-to-End Code-Level Tracing is a mandatory capability in modern APM solutions. Dynatrace calls it PurePath!
Another key capability of any APM and monitoring tool is capturing time-series data with automatic baselining and anomaly detection. At Dynatrace we do this on infrastructure-level as well as application- and service-level metrics, e.g., the response time of any REST endpoint exposed by a hosted service, or the CPU usage of a Docker container:
Automatic Time-series monitoring, baselining and anomaly detection is another key APM capability
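To make the baselining idea concrete, here is a toy sketch of anomaly detection on a response-time series: each new point is compared against the mean and standard deviation of a sliding window of recent history. The window size and the 3-sigma threshold are illustrative choices of mine, not Dynatrace's actual algorithm:

```python
# Toy illustration of baseline-based anomaly detection on a time series.
# The sliding window and 3-sigma threshold are illustrative choices,
# not the algorithm an APM product actually uses.

from statistics import mean, stdev

def detect_anomalies(series, window=10, sigmas=3.0):
    """Flag points that deviate more than `sigmas` standard deviations
    from the mean of the preceding `window` observations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sd = mean(history), stdev(history)
        if sd > 0 and abs(series[i] - mu) > sigmas * sd:
            anomalies.append(i)
    return anomalies

# Steady response times around 100ms, with a spike at index 15
response_times = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101,
                  99, 100, 102, 98, 100, 450, 101, 99]
print(detect_anomalies(response_times))  # → [15]
```

Note how the spike itself pollutes the window for subsequent points, which is one reason production-grade baselining is considerably more involved than this sketch.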
There is much more that OneAgent captures, e.g., deployment and configuration changes, log messages, or metrics from your cloud (AWS, Azure, Google, VMware…), container (Kubernetes, Docker, …), or PaaS (OpenShift, Cloud Foundry, …) platforms.
The challenge is that all this data alone does not answer the fundamental questions: Is there a problem in our system? Where are the hotspots? Who is impacted? And what can we do to fix the problem or make the application faster?
Our solution to this problem is what we call Dynatrace AI. And here is what the AI does:
#2 – Insights into the AI
The AI is first and foremost powered by very accurate Fullstack data captured by OneAgent. Additionally, you can feed more data to Dynatrace through a REST API or our Plugin technology.
The next key element is smart anomaly detection on each metric we see. For certain metrics, such as real-user-monitoring metrics, we apply hypercube baselining algorithms to detect whether there is an anomaly for certain geos, browsers, features, operating systems, bandwidths, …
Dynatrace AI leverages Hypercube baselining to detect metric anomalies across multiple dimensions
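To illustrate what baselining across multiple dimensions means, here is a minimal sketch that keeps a separate baseline per dimension combination (geo, browser) instead of one global average. The dimensions and the 1.5x threshold are my illustrative assumptions, not the actual hypercube algorithm:

```python
# Minimal sketch of multi-dimensional baselining: one baseline per
# dimension combination instead of a single global baseline.
# Dimensions and threshold are illustrative assumptions.

from collections import defaultdict
from statistics import mean

def build_baselines(samples):
    """samples: list of ((geo, browser), response_time_ms).
    Returns the mean response time per dimension tuple."""
    cells = defaultdict(list)
    for dims, value in samples:
        cells[dims].append(value)
    return {dims: mean(values) for dims, values in cells.items()}

def is_anomaly(baselines, dims, value, factor=1.5):
    """Flag a value exceeding its cell's baseline by the given factor."""
    baseline = baselines.get(dims)
    return baseline is not None and value > baseline * factor

samples = [
    (("EU", "Chrome"), 120), (("EU", "Chrome"), 130),
    (("US", "Safari"), 300), (("US", "Safari"), 320),
]
baselines = build_baselines(samples)
# 400ms is within range for US/Safari (baseline 310ms)
# but clearly anomalous for EU/Chrome (baseline 125ms)
print(is_anomaly(baselines, ("EU", "Chrome"), 400))  # → True
print(is_anomaly(baselines, ("US", "Safari"), 400))  # → False
```

The point of the per-cell approach is exactly what the global average would miss: the same absolute value can be perfectly normal in one dimension combination and a severe anomaly in another.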
An anomaly detected through a baseline violation generates an event. Events are also generated for things such as process crashes, full disks, dropped connections, database unavailability, high GC time, high network latency, … All these events are automatically analyzed based on the relationships between the entities that experience them. Remember SmartScape? That is how Dynatrace knows all the dependencies between processes, hosts, services, and applications. This allows us to look at individual events, correlate them based on dependencies, group them, and then rank them based on impact and expert knowledge. This approach lets us factor in causation, as we do not simply rely on correlation of events or time-series data. That is one magic ingredient of the Dynatrace AI:
Dynatrace sees many events, understands the dependencies and is then able to identify real problems based on causation and not just plain correlation
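The correlate-and-group step can be sketched as a traversal of the dependency graph: events on entities that are connected end up in the same problem. The entity names below are hypothetical, and this is a simplified illustration of the idea only; the product additionally ranks causes by impact and expert knowledge:

```python
# Simplified sketch of correlating events via entity dependencies:
# events on connected entities are grouped into a single problem.
# Entity names are hypothetical; this omits the ranking step.

from collections import defaultdict, deque

def group_events(events, dependencies):
    """events: {entity: event description}
    dependencies: {entity: [entities it depends on]} (treated as undirected)
    Returns a list of problems, each a dict of related entity -> event."""
    graph = defaultdict(set)
    for entity, deps in dependencies.items():
        for dep in deps:
            graph[entity].add(dep)
            graph[dep].add(entity)

    seen, problems = set(), []
    for entity in events:
        if entity in seen:
            continue
        # Breadth-first search over the graph, collecting related events
        group, queue = {}, deque([entity])
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            if node in events:
                group[node] = events[node]
            queue.extend(graph[node] - seen)
        problems.append(group)
    return problems

events = {
    "checkout-service": "response time degradation",
    "db-host": "CPU saturation",
    "login-service": "error rate increase",
}
dependencies = {
    "checkout-service": ["db-host"],  # checkout depends on the database host
    "login-service": ["auth-cache"],  # unrelated dependency chain
}
print(group_events(events, dependencies))
```

With these inputs, the checkout slowdown and the database CPU saturation are grouped into one problem because the dependency graph connects them, while the login-service event forms a separate problem.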
In my presentation, I went through additional details on how we do each step that is shown in the illustration above. One aspect I want to highlight is that we do capture every interaction of every user with your applications – and we capture this end-to-end thanks to PurePath.
Having all this information allows us to also do a multidimensional transactional correlation to highlight problems that impact your end users. The following illustration should make that clearer (to see the animations, go ahead and watch the recorded session):
Every PurePath Matters: 100% capturing allows us to do real multidimensional transactional correlation
When Dynatrace detects a problem, it can tell us who is impacted (end users or service calls), what the cause is (full disk, slow code, bad deployment), and how the problem evolved (looking at all these events). For each detected problem, the AI opens a so-called Problem Ticket that contains all this relevant information:
Dynatrace AI automates problem detection by highlighting impact, cause and problem evolution
I hope this gave you a bit more insight into what our Dynatrace AI is doing. Whether you want to call it AI or something else is up to you. We believe it is the right way forward when it comes to supporting performance engineers in a world where our applications are updated more frequently on ever more dynamic infrastructure!
#3 – AI in Action
While the AI creates problem tickets that we visualize in the problem view, we also give full access to the problem details via the Problem REST API. Dynatrace can also trigger external tools, e.g., PagerDuty, ServiceNow, Jenkins, Ansible, AWS Lambda, … to execute incident management workflows or auto-remediation actions. The great thing about this is that you have all the data available that the AI pre-analyzed for you. The following is a quick animation showing the Problem Evolution screen, which “replays” all analyzed events. Remember: all this information is also accessible via the REST API, which allows you to act on this data as problems happen, e.g., push them back to your engineers, change your load balancers, adapt your workload configuration, or spin up additional resources to handle a shortage.
Dynatrace AI gives full access to all observed and correlated problems. Either through the UI or REST Interface.
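As a hedged sketch of what consuming such a REST interface could look like, the snippet below fetches open problems and maps a problem's impact level to a remediation action. The endpoint path, token header, impact-level values, and JSON fields are assumptions for illustration; check the Dynatrace API documentation for the real contract:

```python
# Hedged sketch of consuming a problem feed via REST. Endpoint path,
# token header, and JSON shape are assumptions for illustration;
# consult the actual API documentation for the real contract.

import json
import urllib.request

def fetch_open_problems(base_url, api_token):
    """Fetch open problems from a (hypothetical) problem-feed endpoint."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/problem/feed?status=OPEN",
        headers={"Authorization": f"Api-Token {api_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def plan_remediation(problem):
    """Map a problem's impact level to a (hypothetical) remediation action."""
    actions = {
        "INFRASTRUCTURE": "scale out: spin up additional resources",
        "SERVICE": "notify engineers and consider rolling back the deployment",
        "APPLICATION": "page the on-call team: end users are impacted",
    }
    return actions.get(problem.get("impactLevel"), "open an incident ticket")

# Example payload, illustrating the decision logic without a live API call
problem = {"id": "42", "impactLevel": "SERVICE", "status": "OPEN"}
print(plan_remediation(problem))
```

A remediation script like this could run behind a webhook or on a schedule, turning the pre-analyzed problem data into the auto-remediation actions described above.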
I hope this gave you some insights into what we built, which problems we address and why you as a performance engineer should embrace this new technology. If you happen to be a Neotys user, you should also check out the YouTube Tutorial on “AI-Supported Performance Testing.” Henrik and I showed how to integrate Neoload with Dynatrace and how we can leverage the AI to do better, smarter, DevOpsy performance engineering.
Become a Performance Engineer for the Future
I want to thank Neotys for bringing us together. I hope that all existing and future performance engineers out there understand that performance engineering is no longer about wasting time creating and maintaining load testing scripts. It is not about creating lengthy load testing reports at the end of a release cycle that nobody looks at. It is about providing performance engineering as a self-service from dev all the way to ops, embracing automation, integrating into dev tools, and not being afraid to look into your production environment.
All the best,
Andreas Grabner has been a developer, tester, architect, and product evangelist for the past 18 years for CRM, eGovernment, testing, and monitoring vendors. In his current role, he helps companies inject metrics into the end-to-end delivery pipeline to make better decisions on feature and code changes, closing the feedback loops between Ops, Biz, and AppDev. Andreas has been speaking at meetups, user groups, and international conferences to share experiences and stories around architecture, agile transformation, automated testing, CI/CD, and DevOps. In his spare time, you can find him on the salsa dance floors.
About the first Neotys Performance Advisory Council #NeotysPAC
Neotys organized its first Performance Advisory Council in Scotland on the 14th & 15th of November 2017.
With 15 load testing experts from several countries (UK, France, New Zealand, Germany, USA, Australia, India…), we explored several themes around load testing such as DevOps, Shift-Right, AI, etc. By discussing their experiences, the methods they use, and how they analyze and interpret data, we created a lot of high-value content that you can use to discover the future of load testing.