The DynamoDB Outage: Why Multi-Cloud Isn't the Answer
Posted November 8, 2025 by Trevor Roberts Jr ‐ 9 min read
When DynamoDB experienced its recent outage, many conversations started about multi-cloud strategies. I get it—the instinct to diversify is natural. But I'm going to make a counterintuitive argument: for most organizations, the answer isn't multi-cloud complexity. It's learning your cloud deeply and mastering its resilience patterns.
Introduction
I've been in tech long enough to see trends come and go. The latest trend is toward multi-cloud as a "survival strategy." Every outage triggers a new wave of interest in running workloads across multiple cloud providers. And I understand the logic—if one cloud goes down, another picks up the slack, right?
But here's what I've learned from infrastructure disasters, lessons learned sessions, and post-mortems: the problem isn't which cloud you're on. The problem is how well you understand the cloud you're on.
Recently, I gave a talk about this exact question: "Shall We Multi-Cloud?" It was prompted by a very real event that sparked the conversation worldwide. Let me share what I learned.
What Is Multi-Cloud?
Before we go further, let's define what we're actually talking about:
Multi-Cloud (adjective): Relating to or involving the use of multiple public cloud computing services within a single architecture or strategy.
Example: A multi-cloud environment can enhance resilience and flexibility.
Multi-Cloud (noun): A strategy or approach in which an organization uses cloud services from two or more public cloud providers—such as Microsoft Azure, Google Cloud, and Amazon Web Services—to optimize workloads, increase flexibility, and mitigate the risk of vendor lock-in.
The term has been around for a while, and it's become increasingly common as organizations realize the benefits (and challenges) of working across multiple clouds.
The October 2025 Catalyst: The DynamoDB Event
On October 19, 2025, at 11:48 PM PDT, Amazon DynamoDB experienced DNS issues in us-east-1. This wasn't a minor blip. This was a cascading failure that affected over 1,000 companies and generated 11 million user reports on Downdetector within hours.
The impact was massive:
- Reddit went down
- Snapchat went down
- Roblox went down
- Eight Sleep went down
- Countless other services experienced degradation
The Downdetector graph looked like someone had thrown a switch. Suddenly, the internet's conversation shifted from "multi-cloud is nice to have" to "should we be multi-cloud?!!!", question mark, three exclamation points.
It was the perfect storm for cloud diversification advocates. And I get why people thought that way. But let me share what actually happened for the teams that recovered fastest.
Single-Cloud vs Multi-Cloud: The Trade-offs
After my talk, I realized the conversation often misses important nuance. Let me lay out the actual trade-offs:
Single-Cloud Strategy
Pros:
- Access to the full breadth of a single cloud's services
- Opportunities to reduce IT operational tasks
- Deep expertise and specialization
- Faster innovation adoption (your team knows the platform inside-out)
- Simpler compliance and governance models
Cons:
- Restricted to using services from a single provider
- Vendor lock-in concerns
- If the provider has an outage, you're affected
- Potentially higher costs without competitive pressure
Multi-Cloud Strategy
Pros:
- Greater variety of services available
- Access to the best functionality in each cloud
- Reduced vendor lock-in risk
- Potential failover capabilities across clouds
Cons:
- Greater variety of services (wait, that again?)
- Lowest common denominator functionality to integrate across different providers
- Building your own inter-cloud coordination
- Operational overhead: multiple teams with different expertise
- Inconsistent architectures and naming conventions
- Exponentially complex debugging when things fail
- Knowledge drift as team members specialize in only one cloud
- Millions of dollars invested in infrastructure that operates at half-efficiency
Notice that "greater variety of services" appears in both the pros and cons columns? That's because it's both. More variety is wonderful until you realize you're maintaining two completely different systems.
The Multi-Cloud Complexity Tax
Multi-cloud strategies come with a significant burden that often gets underestimated:
Operational Overhead: Your teams need to maintain expertise across multiple cloud providers. That's not just training expenses—it's engineering time, support processes, tooling, and CI/CD infrastructure.
Inconsistent Architectures: You rarely build the exact same application on AWS and Azure. Each cloud has different services, naming conventions, and best practices. You end up with two somewhat-similar systems that behave differently under load.
Debugging Nightmare: When something goes wrong during a multi-cloud failover, which cloud is the problem? Your networking? Your application logic? The failover mechanism itself? The blast radius of unknown complexity expands exponentially.
Operational Knowledge Drift: If your team isn't actively operating across all clouds, knowledge decays. The person who knew how to optimize costs on GCP leaves. Now you're running it suboptimally. When a crisis hits, you're flying blind.
I've watched teams invest millions into multi-cloud infrastructure only to discover they're operating at half efficiency on every platform, compared to teams running everything on one cloud really well.
What Actually Happened During the Outage
When the DynamoDB outage hit on October 19th, something interesting happened. The teams that recovered fastest weren't the ones with multi-cloud fallbacks. They weren't the ones frantically spinning up instances on GCP or Azure.
The fastest-recovering teams were the ones who:
- Understood DynamoDB's architecture and could detect whether the issue was application-level or platform-level
- Knew AWS's recovery patterns and had direct relationships with AWS support
- Had pre-built failover strategies using native AWS services like DynamoDB Global Tables and cross-region replication
- Invested in redundancy efficiently, weighing the cost of standby capacity against their recovery-time requirements
These teams had invested deeply in understanding one cloud. When the outage happened, they didn't need to adapt their application. They just activated their existing disaster recovery plan.
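Activating an existing disaster recovery plan often boils down to a small decision function: probe each candidate region's health and route traffic to the first one that responds. A minimal sketch of that idea (the region names and the probe callable are illustrative, not from any specific team's runbook):

```python
from typing import Callable, Iterable, Optional

def pick_healthy_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose health probe passes, or None.

    In practice, `is_healthy` would wrap a cheap read against the
    regional endpoint (for example, a DynamoDB DescribeTable call).
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that errors out is treated like an unhealthy region.
            continue
    return None

# Simulated outage: the primary region fails its probe, the standby passes.
status = {"us-east-1": False, "us-west-2": True}
active = pick_healthy_region(["us-east-1", "us-west-2"], status.__getitem__)
```

The point isn't this particular function; it's that the failover decision was written, tested, and rehearsed before the outage, so activating it required no improvisation.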
The Comparison That Tells the Story
Team A (Multi-Cloud Strategy When DynamoDB Goes Down):
- DynamoDB in us-east-1 fails
- Decision: Switch to GCP Datastore
- Realize: Datastore API is different, missing features
- Spend 4 hours modifying code and configuration
- Deploy, but experience cascading failures in dependent services
- Finally achieve partial service restoration after 8 hours
- Customers have been offline for half a day
Team B (AWS-Focused, Single-Cloud Strategy):
- DynamoDB in us-east-1 fails
- Automatic failover: DynamoDB Global Tables activate standby region
- Application automatically routes to standby region
- Service fully operational within 15 minutes
- Customers barely notice
The difference isn't about cloud provider quality. It's about preparation and expertise. Team B invested time in mastering one platform's resilience patterns. Team A invested in spreading expertise too thin across multiple platforms.
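Team B's "application automatically routes to the standby region" step can be sketched as a thin wrapper that retries an operation against a replica region when the primary raises. Everything here is illustrative (the class name, the callable-based clients); with Global Tables, the standby replica already holds the data, which is what makes the retry useful:

```python
class RegionFailoverClient:
    """Route calls to a primary region; fall back to a standby on failure.

    `clients` maps region name to an object exposing the operation.
    Real code would hold per-region boto3 clients; plain callables
    stand in for them here to keep the sketch self-contained.
    """

    def __init__(self, clients, primary, standby):
        self.clients = clients
        self.primary = primary
        self.standby = standby

    def query(self, *args, **kwargs):
        try:
            return self.clients[self.primary](*args, **kwargs)
        except Exception:
            # The standby replica already has the data (Global Tables),
            # so retrying there can succeed immediately.
            return self.clients[self.standby](*args, **kwargs)

# Simulated outage: the primary raises, the standby serves the read.
def broken(*args, **kwargs):
    raise RuntimeError("ServiceUnavailable")

client = RegionFailoverClient(
    clients={
        "us-east-1": broken,
        "us-west-2": lambda key: {"key": key, "region": "us-west-2"},
    },
    primary="us-east-1",
    standby="us-west-2",
)
result = client.query("user#42")
```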
Learning One Cloud Well
Here's what I advocate instead of multi-cloud:
1. Master Your Cloud's Resilience Patterns
AWS has incredible resilience capabilities built in:
- DynamoDB Global Tables: Multi-active replication across regions, typically with sub-second replication lag
- RDS Multi-AZ: Automatic failover for database instances
- Aurora: Storage replicated across multiple AZs, with continuous automated backups
- S3 Cross-Region Replication: Data replication with versioning
- Application Load Balancer: Automatic health checks and routing to healthy targets
These aren't bolt-ons. They're first-class primitives. Learn them deeply.
2. Build with Failure in Mind
Use Chaos Engineering principles on your primary cloud:
```python
# Example: simulate a DynamoDB failure
import botocore.exceptions

def test_dynamodb_failover():
    """Test application behavior when DynamoDB is unavailable."""
    # Inject a fault into DynamoDB query calls
    original_query = dynamodb.query

    def mock_query_with_fault(*args, **kwargs):
        raise botocore.exceptions.ClientError(
            {'Error': {'Code': 'ServiceUnavailable'}},
            'Query'
        )

    dynamodb.query = mock_query_with_fault
    try:
        # Verify the application gracefully handles the failure
        result = application.handle_user_request()
        assert result.status == 'graceful_degradation'
        assert result.fallback_cache_used is True
    finally:
        # Always restore the original client method
        dynamodb.query = original_query
```
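A chaos test like this only passes if the application actually has a graceful-degradation path. A minimal sketch of what that handler side could look like, serving a stale cached value when the primary store is down (all names here, including `fetch` and the dict-backed cache, are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Response:
    status: str
    fallback_cache_used: bool
    body: object

def handle_user_request(fetch, cache, key):
    """Serve from the primary store; degrade to a local cache on failure.

    `fetch` stands in for the DynamoDB query; `cache` is any dict-like
    store of recently served values.
    """
    try:
        value = fetch(key)
        cache[key] = value  # keep the cache warm for the next outage
        return Response("ok", False, value)
    except Exception:
        if key in cache:
            # Stale data beats no data for most read paths.
            return Response("graceful_degradation", True, cache[key])
        return Response("unavailable", False, None)

# During an outage, a previously cached value is still served.
def down(key):
    raise RuntimeError("ServiceUnavailable")

resp = handle_user_request(down, {"user#42": {"name": "Ada"}}, "user#42")
```

Whether stale reads are acceptable is a product decision; the engineering decision is making sure the degraded path exists and is exercised before you need it.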
3. Invest in Tooling and Automation
Deep cloud expertise lets you build better automation:
```hcl
# Terraform: automated DynamoDB backup and restore
# (dynamodb-protection is a local wrapper module)
module "dynamodb_protection" {
  source     = "./modules/dynamodb-protection"
  table_name = aws_dynamodb_table.main.name

  # Daily backups plus point-in-time recovery
  backup = {
    enabled        = true
    retention_days = 35
    enable_pitr    = true
  }

  # Global tables for multi-region replication
  global_tables = {
    replica_regions = ["us-west-2", "eu-west-1"]
  }
}
```
4. Document Your Cloud Architecture
Create runbooks specific to your cloud:
- Account structure and organizational hierarchy
- Naming conventions and tagging strategy
- Cost optimization opportunities
- Service limits and quotas
- Support escalation procedures
- Disaster recovery procedures
The Right Way to Handle Outage Risk
If you're genuinely concerned about cloud provider outages (which is reasonable), here are better choices than multi-cloud:
- Use multiple availability zones (same cloud, different physical locations)
- Use services designed for resilience (managed services with failover built-in)
- Maintain operational readiness (practice failovers regularly)
- Have business continuity plans (not just technical, but business-level)
- Consider business requirements realistically (do you actually need 99.99% uptime to customers?)
Most outages that become catastrophic failures result from one of these issues:
- Lack of preparation
- Poor monitoring
- Inadequate backups
- Insufficient documentation
None of these are solved by multi-cloud. All of them are solved by deep cloud expertise and operational discipline.
The Path Forward: What Should You Do?
With the October 2025 DynamoDB outage as context, here's my nuanced take:
Am I saying use one cloud only? No.
Am I saying multi-cloud is bad? Also, no.
So what should I do? Here's the real answer:
Use the provider(s) that make the most sense for your business, and for which you have the budget, talent, and time to achieve your desired outcomes and delight your customers.
That's it. That's the answer. But let me unpack what that means:
For Most Organizations (Startups, SMBs, Growth Stage)
Pick ONE cloud and dominate it. Invest in your teams' expertise. Master the resilience patterns. Build disaster recovery into your architecture from day one. Your competitive advantage isn't cloud diversification—it's operational excellence on the platform you've chosen.
For Large Enterprises
If you're running multi-cloud, ensure you have:
- Dedicated teams per cloud (not generalists)
- Heavy investment in inter-cloud tooling and coordination
- Clear, documented guidelines for what workloads run where
- Realistic understanding that you're effectively operating a separate platform business per cloud
For Organizations Concerned About Vendor Lock-In
Run multiple availability zones within your chosen cloud first. That solves most outage scenarios. Then, if genuinely necessary, add a second cloud for specific workloads that make sense—not as a blanket strategy.
Wrapping Things Up...
The October 2025 DynamoDB outage was a wake-up call, but not for the reason everyone initially thought. It wasn't a sign that we need multi-cloud. It was a reminder that deep expertise and preparation matter more than diversification.
The teams that recovered fastest weren't the ones with fallback options. They were the teams that understood their platform deeply, had practiced their failures, and had resilience built into their architecture.
So my question back to you isn't "Should you multi-cloud?" It's this: "Do you fully understand and optimize the cloud you're already on?"
Answer that question first. The multi-cloud decision will become much clearer.
If you found this article useful, let me know on BlueSky or on LinkedIn!