When the cloud breaks: Lessons from the AWS outage – A memo from our CTO
On Monday, October 20th, the internet experienced a seismic disruption. A major outage at Amazon Web Services (AWS), specifically in its US-EAST-1 region in Northern Virginia, caused widespread failures across apps, websites, and services used by millions globally. From banking apps and airlines to smart home devices and media platforms, the outage exposed the fragility of our digital infrastructure and the risks of over-centralisation.
What happened?
The root cause was traced to a DNS resolution failure within AWS’s DynamoDB service, a core database used by thousands of applications. This meant that apps couldn’t locate the servers they needed to function, triggering cascading failures across 113 AWS services. Even services hosted in other AWS regions were affected due to their reliance on US-EAST-1 for background operations.
And the recovery wasn’t quick. The outage lasted longer than expected because, as systems came back online, they all tried to catch up simultaneously, maxing out capacity and slowing the recovery process.
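One common way for client applications to avoid adding to this kind of recovery stampede is to retry with exponential backoff and random jitter, so that clients spread their retries out rather than all reconnecting at once. The sketch below is illustrative only: the function name `backoff_delays` and its default values are our own assumptions, not anything AWS prescribes.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return randomised wait times (in seconds) for successive retries.

    The upper bound doubles on each attempt, up to `cap`, and the actual
    delay is drawn uniformly between 0 and that bound ("full jitter").
    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Each client draws its own random delays, so retries do not line up.
    for i, delay in enumerate(backoff_delays(5), start=1):
        print(f"retry {i}: wait up to {delay:.2f}s")
```

Because every client picks a different random delay, the load on a recovering service arrives gradually instead of in a single spike.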
Why it matters
This incident highlights a critical architectural vulnerability: lack of geo-replication. Many companies had not configured their systems to fail over to other regions or providers. As a result, when Virginia went down, so did their services.
At Shoothill, we take a different approach.
Shoothill’s resilience strategy
For mission-critical applications, especially those serving large enterprise clients, we implement active geo-replication using platforms like Microsoft Azure SQL. This means:
- Data is continuously replicated to a secondary region.
- If the primary region fails, traffic fails over to the secondary region to ensure continuity (automatically, where failover groups are configured).
- Customers experience minimal disruption, even during major outages.
You can read more about how this works in Azure’s Active Geo-Replication documentation.
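To make the failover idea concrete, here is a deliberately simplified sketch of the decision involved: try regions in priority order and use the first one that passes a health check. It is not how Azure SQL implements failover, which happens at the database level inside the platform. The function `first_healthy`, the endpoint names, and the health check are all hypothetical.

```python
def first_healthy(endpoints, is_healthy):
    """Return the first endpoint that passes its health check.

    `endpoints` is a list in priority order (primary first) and
    `is_healthy` is any callable that returns True for a usable endpoint.
    Both are placeholders for illustration.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated as a failed region.
            continue
    raise RuntimeError("no healthy endpoint available")


if __name__ == "__main__":
    # Hypothetical region endpoints; the primary is simulated as down.
    regions = ["primary.example.internal", "secondary.example.internal"]
    down = {"primary.example.internal"}
    print(first_healthy(regions, lambda e: e not in down))
```

In practice the health check, timeouts, and failback policy need careful design, and a managed service usually handles most of this on the customer's behalf.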
The bigger picture
This outage is a wake-up call for businesses relying on single-region cloud deployments. It’s not just about uptime; it’s about trust, continuity, and reputation. Shoothill’s cloud architecture is designed with these principles at its core.
If you’re concerned about your cloud resilience or want to explore geo-replication for your systems, get in touch. We’re here to help you stay online, even when the cloud isn’t.