AWS DynamoDB Outage: How long will increased hardware capacity hold out?

Two high-profile AWS outages occurred within days of one another, affecting thousands of users of Netflix, Tinder, Airbnb, IMDB, and several other services worldwide. Amazon was quick to identify the root cause and applied “mitigations” in the form of increased capacity. It also promised proactive monitoring and segmentation of the metadata service of DynamoDB, the database as a service at the centre of attention.

image source: cloudwards.net

Questions are being asked about the core issues that plagued users worldwide: the root cause and, more generally, the reliability of AWS. Amazon posted a detailed explanation shedding light on the root cause and the measures taken to fix it. In a nutshell, the root cause was isolated to a component called the metadata service, which did not scale with the growing dataset. These revelations come as a surprise; after all, DynamoDB, the database as a service, claims to deliver fast and predictable performance with seamless scalability. It is a service Netflix uses to scale its app.

The mitigations applied by Amazon do not address the root cause. Unless the real root cause is fixed, services will inevitably go down again for the same reasons as data continues to grow at an unbelievable rate.

Increasing hardware capacity tends to mask underlying defects. While the exact infrastructure powering DynamoDB is not documented, it is reasonable to assume that the service was hosted on a high-end server. The server would likely have large amounts of RAM and CPU and all-flash direct-attached storage; the use of flash in DynamoDB is somewhat documented. The service was designed and launched with adequate capacity and had been operating as advertised for years. Why did such a high-end configuration buckle under load? It is ironic that a service such as DynamoDB, which delivers its value by scaling out, depends on an internal service that cannot scale without additional RAM or CPU.

The real problem lies in the architecture of DynamoDB. It has two parts: a metadata service and a number of storage nodes. While it is not clearly documented, it appears that the metadata service runs on a single node; although replicas exist for high availability, they do not participate in active operation. The metadata service and the storage nodes frequently exchange information. One of the mitigations reads: “… we are reducing the rate at which storage nodes request membership data”. There is a typical tradeoff here between how often membership is refreshed and how consistent each node's view of it is. Longer intervals between membership requests mean a greater chance that a membership change is not seen in time, leaving the DynamoDB service with ghost members (storage nodes) that affect availability and quality of service. It remains to be seen how this change affects DynamoDB services. The architectural defect is acknowledged in the explanation posted: “Finally and longer term, we are segmenting the DynamoDB service so that it will have many instances of the metadata service each serving only portions of the storage server fleet”.
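The tradeoff between the two mitigations can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a description of how DynamoDB actually works: the fleet size, polling intervals, instance counts, and the assumption that storage nodes are spread evenly across metadata instances are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MembershipModel:
    """Toy model of storage nodes polling a metadata service for membership."""

    storage_nodes: int            # hypothetical fleet size
    poll_interval_s: float        # seconds between membership requests per node
    metadata_instances: int = 1   # 1 = single metadata service, >1 = segmented

    def requests_per_second_per_instance(self) -> float:
        # Every node polls once per interval; nodes are assumed to be
        # divided evenly among the metadata instances.
        return self.storage_nodes / self.poll_interval_s / self.metadata_instances

    def worst_case_staleness_s(self) -> float:
        # A membership change that lands just after a poll stays unseen
        # until the next poll, one full interval later.
        return self.poll_interval_s


# All figures below are made up for illustration.
baseline = MembershipModel(storage_nodes=10_000, poll_interval_s=5)
slower_polling = MembershipModel(storage_nodes=10_000, poll_interval_s=20)
segmented = MembershipModel(storage_nodes=10_000, poll_interval_s=5,
                            metadata_instances=4)

for name, model in [("baseline", baseline),
                    ("slower polling", slower_polling),
                    ("segmented", segmented)]:
    print(name,
          model.requests_per_second_per_instance(),
          model.worst_case_staleness_s())
```

Under these assumptions, polling four times less often and splitting the metadata service four ways relieve each metadata instance by the same amount, but only slower polling lengthens the window in which a node can act on outdated membership.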

image source: embedded-computing.com

The easy fixes have been applied, as they should have been. It remains to be seen how long increased hardware capacity will hold out. From Amazon’s point of view, hopefully long enough to let them fix the architectural issues surrounding DynamoDB.

Even that would be short-lived. Centralised data services such as DynamoDB will continue to face increasing stress and unpredictable spikes in datastore activity as more and more data is transacted every day; even Amazon cannot scale indefinitely, as recent events have demonstrated. A change in cloud application architecture is a necessity if we are to avoid a collapse of cloud services. Compute and data should cooperate to reduce stress on data services, using techniques like compute-side caching. This, however, runs against the operating principle of cloud players like Amazon, who make their money charging for datastore services. One wonders whether anyone at Amazon sees the irony.
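As a rough illustration of compute-side caching, here is a minimal read-through cache with a time-to-live, kept in the memory of the compute node. It is a sketch under stated assumptions: `ReadThroughCache`, `fetch_from_datastore`, and the 30-second TTL are hypothetical names and values, and the fetch function merely stands in for a real datastore call.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """In-process cache that only calls the datastore on a miss or expiry."""

    def __init__(self, fetch: Callable[[str], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # function that reads from the datastore
        self._ttl_s = ttl_s      # how long a cached value is trusted
        self._clock = clock      # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            self.hits += 1
            return entry[1]
        # Missing or expired: go to the datastore and remember the result.
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl_s, value)
        return value


calls = []


def fetch_from_datastore(key: str) -> str:
    # Hypothetical stand-in for a datastore read; records each call.
    calls.append(key)
    return f"value-for-{key}"


cache = ReadThroughCache(fetch_from_datastore, ttl_s=30.0)
for _ in range(100):
    cache.get("user:42")

print(f"datastore reads: {len(calls)}, "
      f"cache hits: {cache.hits}, misses: {cache.misses}")
```

Repeated reads of the same key within the TTL are served locally, so only the first one reaches the datastore. The price is that a cached value can be up to one TTL out of date, which is the same kind of freshness tradeoff the membership data faces.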
