Monday got off to a bad start for Amazon Web Services users served by the company’s US-EAST-1 region, when a DNS problem rendered the DynamoDB API unreliable, with consequences for many AWS services and customers.
Although the root cause of the incident apparently affected a single API in just one of many AWS cloud regions, that API provides access to a key database service on which many other services, both Amazon’s own and those of its customers, are built, in that region and in others.
AI search company Perplexity was one of those affected by the incident, reporting that it was “experiencing an outage related to an AWS operational issue.” And although online design tool Canva didn’t name AWS as the source of its problems, it reported a major issue with its underlying cloud provider that resulted in increased error rates for its users during the same time window.
Real-time monitoring service Downdetector noted that outages at Venmo, Roku, Lyft, Zoom, and the McDonald’s app were “possibly related to issues at Amazon Web Service.”
Increased error rates
AWS itself first reported the incident on its service health status page at 12:11 a.m. Pacific time, saying, “We are investigating increased error rates and latencies for multiple AWS services in the US-EAST-1 Region.”
A little over an hour later, it had narrowed the problem down to the DynamoDB endpoint, which it said was also affecting the other services, and half an hour after that, the company reported: “Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovering.”
By this time, it was clear that the problems were not confined to users or services on the US East Coast.
“Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues,” it said.
By 2:27 a.m. Pacific time, a little more than two hours after it began investigating the incident, the company reported that it had applied initial mitigations and recommended customers retry failed requests, warning that there may be additional latency as some services had a backlog to work through.
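For readers wondering what retrying failed requests can look like in practice, here is a minimal sketch of one common approach, exponential backoff with random jitter. It is not taken from AWS’s guidance: the function name `retry_with_backoff`, its parameters, and the delay values are all hypothetical, chosen for illustration. AWS’s own SDKs generally include retry logic, so this only shows the idea.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Run call() until it succeeds or max_attempts is reached.

    All names and default values here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # Out of attempts: let the last error reach the caller.
            if attempt == max_attempts:
                raise
            # The wait ceiling doubles after each failure, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Pick a random wait below the ceiling so that many clients
            # recovering at once do not all retry at the same moment.
            sleep(random.uniform(0, ceiling))
```

Spreading retries out in this way matters most in a situation like the one AWS described, where a recovering service still has a backlog to work through.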
Three hours after it began its investigation, the company reported that global services and features reliant on US-EAST-1 had recovered and promised further updates when it had more information to share.
Cloud dependencies
While this outage was quickly fixed, it shows that even in the cloud there are single points of failure that can have worldwide consequences.
A few months ago, it was Microsoft with egg on its face, as a problem in Azure’s US East region rippled out to affect other organizations. Before that, a series of outages at IBM Cloud had customers wondering if they had made the right design choices. The third, shorter, outage affected 54 IBM Cloud services.
