Your cloud provider will fail. Here’s how to prepare

November 20, 2025

Cloudflare's recent collapse took out everything from ChatGPT to transit systems, underscoring why IT teams need multi-region strategies.
(Credits: pogonici/Shutterstock)

Early on the morning of November 18, as IT pros were grabbing their first cups of coffee, Cloudflare went down and brought a big chunk of the internet down with it. X, ChatGPT, Dropbox, Uber, Zoom, Square, Spotify, and even the New Jersey Transit system all took a hit.

Even DownDetector couldn’t load because it also runs on Cloudflare. The irony wasn’t lost on anyone trying to figure out what was going on.

Spiceworks user @CharlesHTN summed it up succinctly:

“Seems the best way to take the Internet down is to take down Cloudflare.”

Predictably, stressed-out users started flooding help desk systems with vague and incomplete support tickets, desperately seeking answers to a problem they didn’t fully understand. @The Data Master recounts:

“Yeah, my morning was excellent 😂. Of course we have a site or two that is hosted by them and release the flood gates for calls and tickets. Luckily it was only down for a few hours this time.”

Cloud services are failing more frequently

It’s not your imagination. Cloud services are buckling more often, halting business operations without warning. Just a few weeks ago, DNS issues contributed to massive AWS and Azure outages, the latter of which borked Teams and Outlook. Google Cloud had its own multi-hour outage in June. Back in July 2024, a faulty CrowdStrike update reached millions of Windows systems, costing Fortune 500 companies $5.4 billion and grounding flights for days.

The cloud promised better business continuity than individual IT teams could provide at a lower price point. The reality is messier and more complicated, as we’re finding out the hard way. Here’s how to make your business more resilient so the next cloud outage doesn’t leave your business holding the bag.

Cloud dependency presents unique risks

Cloud adoption has now reached the tipping point, with over half of enterprise and SMB workloads currently running in public clouds. Although some companies are moving cloud workloads back on-prem, that figure only stands at 21%.

In the meantime, vendor concentration has increased this dependence and heightened the corresponding risk. AWS, Microsoft Azure, and Google Cloud now control more than 60% of the cloud market. When one of these providers fails, the impact instantly spreads across thousands of dependent services.

If you’re running the IT shop for a small- or medium-sized business, you’ve got a problem on your hands. Most likely, you’re locked into a single region of a single provider. You understand the risk this involves, but you may not have figured out how to tackle it yet. Here’s where to start.

Map your critical dependencies

Pull up a whiteboard and start mapping your cloud-related dependencies. When email goes down, what else breaks? If you’re like most IT shops, the answer is probably “more than you think.” For example, your password resets probably depend on email. What then? And if your CRM goes offline, can sales still function in some fashion or does the entire pipeline freeze?

These dependencies aren’t obvious until you trace them all the way through. Your financial software might live in one cloud while your document management lives in another, connected through APIs that fail when either side disappears.
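If the whiteboard gets crowded, the same tracing exercise can be sketched in a few lines of Python. This is only an illustration: the service names and the dependency map below are hypothetical, and the function simply walks the map to list everything that breaks, directly or indirectly, when one service disappears.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it relies on.
DEPENDS_ON = {
    "password_reset": ["email"],
    "crm": ["auth"],
    "sales_pipeline": ["crm", "email"],
    "finance": ["document_mgmt"],
    "document_mgmt": ["auth"],
    "auth": [],
    "email": [],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every service that breaks, directly or indirectly, when `failed` goes down."""
    # Invert the map so we can walk from a dependency to the services that use it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


if __name__ == "__main__":
    for outage in ("email", "auth"):
        print(outage, "->", impacted_by(outage))
```

In this made-up map, an email outage takes password resets and the sales pipeline with it, while an authentication outage reaches the CRM, document management, finance, and the sales pipeline. Your own map will look different, which is the point of drawing it.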

Start with the four-hour test. Which services would halt business operations if they were unavailable for four hours? Catalog them all, then run the same exercise for eight-hour and twenty-four-hour outage scenarios. The answers will clarify where you actually need redundancy versus where you can accept some downtime.
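One way to keep the results of that exercise organized is a small script like the one below. It is a sketch under assumptions: the services and the "tolerable hours" estimates are invented for illustration, and you would replace them with figures agreed with the business.

```python
# Hypothetical estimates: hours each service can be down before operations halt.
TOLERABLE_HOURS = {
    "auth": 1,
    "payments": 2,
    "email": 6,
    "crm": 12,
    "internal_wiki": 72,
}

# The three outage scenarios described in the text.
SCENARIOS = (4, 8, 24)


def halted_services(tolerable_hours, outage_hours):
    """Services whose tolerable downtime is shorter than the outage."""
    return sorted(
        name for name, limit in tolerable_hours.items() if limit < outage_hours
    )


def run_outage_test(tolerable_hours=TOLERABLE_HOURS, scenarios=SCENARIOS):
    """Map each outage length to the services that would halt operations."""
    return {hours: halted_services(tolerable_hours, hours) for hours in scenarios}


if __name__ == "__main__":
    for hours, services in run_outage_test().items():
        print(f"{hours}-hour outage halts: {', '.join(services) or 'nothing'}")
```

Services that show up in the four-hour list are the strongest candidates for redundancy. Anything that only appears at twenty-four hours, or not at all, is where you can probably accept some downtime.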

Figure out what your cloud SLA actually covers

The 99.9% uptime promise in your contract sounds reassuring, but it still allows roughly 43 minutes of downtime in a 30-day month. More importantly, most SLAs measure whether the service is technically available, not whether you can actually use it. The devil is in the details, so it’s to your benefit to figure out exactly what’s being guaranteed and how that affects your business when a cloud service goes down.
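The arithmetic behind that downtime figure is easy to check for whatever percentage appears in your own contract. The helper below is a minimal sketch that assumes a 30-day month; the function name is ours, not any provider's.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime an uptime guarantee permits over `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows about {minutes:.1f} minutes per month")
```

At 99.9% this works out to about 43.2 minutes a month. At 99% it is about 7.2 hours, and at 99.99% it is a little over 4 minutes.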

Carefully examine your SLA’s definition of an outage. Most of them don’t compensate for slow performance, regional failures if other regions stay up, or issues caused by upstream dependencies. Even in the event that you do qualify for credits, they’re typically a small percentage of your monthly bill. That’s nowhere near the actual business losses you could face in the event of a prolonged disruption.

Pick your battles with multi-region deployment

Unless your company is particularly flush with cash, you probably can’t afford to run everything across multiple cloud providers. You likely don’t have the budget for a full multi-region deployment within a single provider, either, but you may still be able to build in some cloud resilience without breaking the bank.

First, make a short list of systems that would be showstoppers if they went offline. Authentication services, payment processing, and customer-facing APIs usually make the cut. Your internal wiki and the staging environment for the website redesign probably aren’t worth the investment, though. Consider running your critical systems across at least two regions. The cost is real, but it might be manageable if you’re selective.
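At its simplest, running a critical system across two regions means something has to decide which region gets the traffic. The sketch below shows that decision in isolation. The region names and the health check are placeholders, and in practice this logic usually lives in your provider's DNS or load-balancing service rather than in your own code.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all fail."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that errors out is treated the same as a failure.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical regions and a stubbed health check for demonstration.
    priority = ["primary-region", "secondary-region"]
    status = {"primary-region": False, "secondary-region": True}

    active = pick_region(priority, lambda region: status[region])
    print("Routing traffic to:", active or "nowhere (all regions down)")
```

The health check is passed in as a function so you can swap the stub for a real probe, such as an HTTP request to each region's status endpoint, without changing the selection logic.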

Talk to the C-suite about the risk of cloud dependency

If you want to get buy-in for resilience investments, you’ll need to speak C-suite language instead of defaulting to your native tongue, aka geek. This is one of the most important soft skills an IT leader must have. In this case, it means framing the problem in terms of business continuity instead of multi-region deployment.

You can use recent headline-grabbing outages to drive the point home. “Remember when a faulty CrowdStrike update cost Delta $500 million? Here’s what would happen to our business if AWS went offline for 15 hours.” Walk the powers that be through the revenue impact, customer commitments you couldn’t meet, and contractual penalties you might face.

Incident communication will be critical to riding out the next outage, so document who needs to know what when cloud services fail. Your support team needs a heads up as soon as a critical system has gone offline so they can update customers, and your executive team will likely expect regular updates. Build these communication trees before the next crisis hits, so you can keep everyone in the loop instead of winging it and hoping for the best.
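A communication tree doesn't need special tooling. Even a short script or spreadsheet that records who hears what, and how often, is better than memory. The sketch below is one hypothetical layout: the audiences and timings are assumptions for illustration, with support notified immediately and executives on a regular cadence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audience:
    name: str
    first_notice_minutes: int  # how soon after detection they must hear about it
    update_every_minutes: int  # cadence of follow-up updates


# Hypothetical communication tree; adjust audiences and timings to your organization.
COMM_TREE = [
    Audience("support_team", first_notice_minutes=0, update_every_minutes=30),
    Audience("executive_team", first_notice_minutes=15, update_every_minutes=60),
]


def audiences_due(elapsed_minutes, tree=COMM_TREE):
    """Names of audiences owed a first notice or a scheduled update right now."""
    due = []
    for audience in tree:
        since_first = elapsed_minutes - audience.first_notice_minutes
        if since_first >= 0 and since_first % audience.update_every_minutes == 0:
            due.append(audience.name)
    return due


if __name__ == "__main__":
    for minute in (0, 15, 30, 75):
        print(f"t+{minute} min:", audiences_due(minute) or "no updates due")
```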

Prepare for the next cloud provider failure now

The next cloud service outage is coming—it’s just a matter of time. The companies that weather these disruptions best won’t be the ones with generous budgets, though. They’ll be the ones that have done the unglamorous work of mapping dependencies and securing buy-in for selective redundancy. Start with one critical system this quarter and take it from there. And then, when the next outage hits, you’ll be ready.

Rose de Fremery

Writer, lowercase d

Former IT Director turned tech writer, Rose de Fremery built an IT department from scratch; she led it through years of head-spinning digital transformation at an international human rights organization. Rose creates content for major tech brands and is delighted to return to the Spiceworks community that once supported her own IT career.