What the AWS Kinesis Crash Tells Us About Cloud Co-dependence

On Nov 25, 2020, two days before Black Friday, AWS services in its most popular region (US-EAST-1) suffered a serious outage rooted in a flaw in AWS’s internal architecture.

At the core of the problem was Kinesis, a stream-processing service that many other AWS products — e.g., Cognito (authentication and identity), CloudWatch (metrics and monitoring) and services relying on CloudWatch such as Auto Scaling — depend on. Internal interdependence between services is a well-known facet of the AWS global cloud. Amazon has suffered outages of its own e-commerce revenue streams before (see the Prime Day outages), but this time an internal architectural dysfunction took out customer APIs and monitoring alike.

Just let that sink in. Your customers’ shopping carts, login access, and other Critical-Path-to-Revenue (CPR) processes just stopped working . . . and when you tried to investigate the cause, metrics and data about what was going on weren’t even available. For more than 12 hours, thousands of companies dependent on the AWS “just trust us” cloud had to sit and wait for their revenue streams to reboot (literally: a reboot of the Kinesis front-end fleet was AWS’s eventual short-term mitigation).

Your Cloud Isn’t Always Better Than You at Scaling

There is no such thing as 100% uptime, and you’d be hard-pressed to find any SaaS or cloud provider whose legal verbiage guarantees “response time” rather than the more ambiguous “uptime,” as James Pulley so eloquently explains in his recent 10/29 News of the Damned episode, “The Seduction of SaaS.” But before getting any more ranty, let me pull a few key architectural considerations and dynamics out of the official AWS post-mortem:

    • “Total threads each server must maintain is directly proportional to the number of servers in the fleet,” which means that
      • this was a known dynamic of their internal Kinesis architecture and should have been accounted for in their new capacity rollout process.
    • Adding capacity to Kinesis caused Kinesis servers to max out their thread counts, which means that
      • they’d likely never added this level of capacity before, and 
      • in preparing for estimated Black Friday demand, they made a capacity decision that had not yet been tested, just before a major US holiday.
    • Consolidating to fewer, bigger Kinesis servers was the short-term resolution, with a number of medium-term actions such as 
      • cold-boot optimization and shard-map cache service isolation, both of which could have been tested before capacity increases.
    • They had to reboot the entire Kinesis service for the above short-term solution to take effect, which means that 
      • there was likely no proven or exercised systematic capability to pause and purge per-server shard-map caches in real time.
    • To resolve Cognito’s “best effort” buffering of information streams to Kinesis, cross-service teams had to collaborate on how to avoid instantly saturating Kinesis once service was restored, which means that
      • there were many people from multiple teams having to work together to resolve this severe incident for long periods of time. Planned or otherwise, this is really hard.
    • ECS and EKS workflows and tasks were significantly delayed due to downstream effects of CloudWatch and EventBridge depending on Kinesis, another set of services that
      • weren’t proactively insulated from Kinesis downtime using alternate regions.
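The first bullet’s “proportional” dynamic is worth making concrete. Here is a back-of-the-envelope sketch of why adding servers to a fleet where each server keeps one thread per peer eventually blows past an operating-system thread ceiling; all numbers are hypothetical, since AWS has not published its actual limits:

```python
# Illustrative model of the Kinesis front-end dynamic: each server keeps one
# communication thread per peer, so threads-per-server grows linearly with
# fleet size. OS_THREAD_LIMIT is a made-up ceiling, not AWS's real number.

OS_THREAD_LIMIT = 10_000  # hypothetical per-process thread ceiling

def threads_per_server(fleet_size: int) -> int:
    # One thread per peer (every other server in the fleet).
    return fleet_size - 1

def total_fleet_threads(fleet_size: int) -> int:
    # Fleet-wide cost is quadratic: n servers times (n - 1) threads each.
    return fleet_size * (fleet_size - 1)

for fleet in (5_000, 10_001, 12_000):
    t = threads_per_server(fleet)
    status = "over the limit" if t > OS_THREAD_LIMIT else "within the limit"
    print(f"{fleet} servers -> {t} threads each ({status})")
```

Note that the fleet-wide thread count grows quadratically, which is why AWS’s short-term fix of consolidating to fewer, larger servers works: it cuts threads-per-server without reducing aggregate capacity.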

It sounds like I’m poking them in the eye, right? Maybe a little, though no more than anyone else, but this is just a sample of reasons why a number of e-commerce companies prefer not to bank their biggest seasonal demand curve on the empire that Bezos built. What I hear frequently from business leaders on this topic goes something like this: “Our business continuity plans prevent us from using a competitor’s cloud.” Though other clouds might be a little more expensive (arguable), they fundamentally don’t trust AWS, and in digging further I ultimately arrive at some senior architect who mentions the [lack of transparent] IT controls in place to prevent an outage such as this. “Who wants to subsidize the competition?” is also a very poignant question to ask your business leaders, and one that comes back to haunt them (and you) in the wee hours of the night. Sorry, but it has to be said.

Performance and Reliability Start with Planning

Fundamentally, all systems are constrained by far more than one or two factors (like thread count per CPU, memory limits or network bandwidth). It’s just that in a non-global-scale system, you usually hit these factors one or two at a time. That is not the case for a company like AWS, where many critical services depend on services that depend on other services (and so on).

This is what we see in the learnings from Google, specifically in the O’Reilly book “Site Reliability Engineering,” where they call out the fallacy of a single “root cause” and replace it with the more appropriate, pluralistic “biggest contributing factors.” Often these factors aren’t just technologies and unanticipated consequences of a rollout; rather, they are about planning and evidence.

Performance engineering, when done well, requires that we factor all stakeholders and experts into planning processes. As a result of synthesizing considerations from these folks, we should also have demonstrable evidence of how the big changes we’re about to make (yes, even seemingly innocuous “capacity increases”) will likely affect our systems and therefore our customers. Performance is a feature, not a test. It’s a feature not only of your apps and services, but of your planning and team culture. I’m not the only one who has said this, not even close (see here, here, here, here and here . . . ).

I often use the “turtle” emoji in Slack to remind community friends and colleagues of two aspects of performance engineering: (a) turtles go slow and (b) infinite regress (as in “it’s turtles all the way down”). These two aspects combine detrimentally when a system saturates on any particular resource, resulting in catastrophic failure. With AWS as internally co-dependent as it is, performance problems in one service are bound to cause unanticipated and widespread failures “all the way down” to you.

Look, you don’t know what you don’t know, and I have empathy for the engineers who worked hard to fix the Kinesis systemic service outage. It’s great that they publish their outage post-mortems (as every team should) to demonstrate that they understand how to prevent this kind of thing from happening again. Publishing how they planned the capacity increase, and picking that apart, would be awesome too, but I highly doubt they’re going to publish what they didn’t do that they should have done beforehand, for industry and competitive vultures to descend upon. But somewhere, someone knew this would happen.

What Does AWS Plan to Do About It?

From their post-mortem, we can gather a few of their ideas for how to prevent this from happening again, which as always are learnings we all can benefit from:

  • With a hard-earned better understanding of the “proportional” dynamic between fleet size and inter-server communication in their Kinesis fleet, they properly adjusted their server sizes and counts; though this is not a long-term fix, I’m sure they’ll keep tinkering with it moving forward, hopefully in smaller and better-timed increments than a day before a major holiday event.
  • Splitting caching out from front-facing inbound request processing (specifically, moving the shard-map cache off the front-end servers) is a natural move, one any seasoned architect would consider among the next architectural steps.
  • Improvements to the server warm-up process to reduce false-positive “unhealthy” indicators (I see a ton of similar problems in the Kubernetes space with uninformed container liveness and health-check thresholds) should also become a non-functional requirement, one I’m pretty sure their Kinesis team won’t ignore moving forward (as indicated by “we are making a number of changes to radically improve the cold-start time for the front-end fleet”).
  • Thread limits on both front-end and back-end servers will probably be a top-of-mind SLI (service level indicator), and variations in these metrics will likely be proactively investigated for the next months, if not longer (as indicated by “We are adding fine-grained alarming for thread consumption in the service”).
  • Hopefully we’ll see them take more seriously the disparity in technical debt between front-end and back-end systems work; we’ll see if their front-end and back-end SRE work can keep better pace in the future, as hinted at in “we will greatly accelerate the cellularization of the front-end fleet to match what we’ve done with the back-end . . . this had been under way for the front-end fleet in Kinesis, but unfortunately the work is significant and had not yet been completed.”
  • Prolonged data buffering between services like Cognito and Kinesis should be preventatively alleviated by increasing buffer capacity and careful monitoring, as indicated by “we have modified the Cognito webservers so that they can sustain Kinesis API errors without exhausting their buffers.”
  • Localizing rather than over-centralizing metrics stores will reduce the likelihood of a single point of failure (SPOF) in observability, i.e., in knowing what the heck is going on when (not if) another incident like this occurs, as discussed in “will allow . . . services requiring CloudWatch metrics (including AutoScaling), to access these recent metrics directly from the CloudWatch local metrics data store. This change has been completed in the US-EAST-1 Region and will be deployed globally in the coming weeks.”
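Two of the patterns above, surviving downstream API errors without exhausting buffers (the Cognito fix) and not saturating a service the instant it recovers, are classic client-side resilience techniques. Here is a minimal Python sketch of both; the names and numbers are my own illustrations, not anything AWS has published:

```python
import random
from collections import deque

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """'Full jitter' exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)], so recovering clients spread out
    their retries instead of all hammering the service at once."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

class BestEffortBuffer:
    """Bounded drop-oldest buffer: during a prolonged downstream outage,
    the oldest records are shed instead of memory being exhausted."""

    def __init__(self, capacity: int):
        self._buf = deque(maxlen=capacity)

    def put(self, record) -> None:
        self._buf.append(record)  # silently evicts the oldest when full

    def drain(self):
        # Yield and remove buffered records once downstream recovers.
        while self._buf:
            yield self._buf.popleft()
```

The “full jitter” variant spreads retries across the whole window, which matters most right after a restore, when every buffered client wants to flush at once.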

It’s worth noting that, beyond a tart statement of pride in their “long track record of availability,” they do acknowledge that the event severely affected customers and that they “will do everything we can to learn from this event and use it to improve our availability even further,” which at least implies that they take a “continuous learning and improvement” approach to outages such as this. In the landscape of modern software teams, this is a must, and not one we can adopt only after there has been a SEV incident; it has to be in place from the very beginning.
