Cloud infrastructure outages. Who's to blame?
Yet another recent outage in Yandex.Cloud nudged me toward this topic. Let's try to define who is responsible for what and how you can protect yourself.
🧨 Incidents happen, whether it's a cloud or a server in a data center. Anything can happen: fire, flood, a drone strike, malicious actions, and so on. As the folk wisdom goes: there is no cloud, it's just someone else's computer.
But when choosing cloud solutions, we want to delegate responsibility for failures to the cloud service provider. Otherwise, why would we overpay for resources that we could deploy ourselves on physical servers, even rented ones? Yes, there are other reasons to use clouds too, but when all these convenient things stop working at the most inopportune moment, reliability is exactly what comes to the fore.
It's for this very reason that many companies use their own solutions and don't get hooked on cloud ones. Of course, your own servers can also drop out of the chat at any moment, but at least you'll have options other than just sitting and waiting for everything to return to normal. And this is exactly where it gets interesting.
If you're entirely in the cloud, then in the event of an incident you simply have no options other than to wait 👽. Mind you, that holds even if you've built a fault-tolerant architecture inside the cloud, with multiple availability zones and other high-availability attributes. The sad part is that even this may not help. In that case you tell your customers: sorry, my cloud is down and all that's left is to wait. The customers, of course, don't care: they just want everything to work. But you have someone to blame, even though in fact you're the one who messed up by fully trusting a single cloud.
But most companies have no other options, and they're fine accepting the possible, and not all that large, risks for the sake of saving money. So it goes down once or twice a year; that's not so scary. But those for whom it is scary will build their infrastructure from the start so that they can switch over to a backup.
📎 It's also important to understand that an outage can be global — for example, the whole network goes down — or it can be local, in the form of a failed service or a single availability zone. In the second case, you may in theory have courses of action that help you weather the problem and restore the service faster.
Weigh the risks and define your own SLO, so that the SLA you promise your customers stays within acceptable bounds. Also keep in mind that the cloud provider's area of responsibility ends at a certain perimeter within which your application runs, and beyond that your own responsibility begins. For instance, if a failure on a virtual machine takes your database down while the machine itself is up and running, then restoring the database is your concern.
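To make the SLO/SLA point a bit more tangible, here is a minimal back-of-the-envelope sketch in Python. It converts an availability target into a yearly downtime budget and multiplies the provider's number by your own. All the percentages are made up for illustration (they are not taken from any real provider's SLA), the function names are mine, and the multiplication assumes the failures are independent, which in real life they often aren't.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def serial_availability_pct(*components_pct: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently of each other.
    """
    result = 1.0
    for pct in components_pct:
        result *= pct / 100
    return result * 100


# Hypothetical numbers: the provider promises 99.95% for the VM,
# and we estimate 99.9% for our own database and application on top of it.
provider_sla = 99.95
own_stack = 99.9

combined = serial_availability_pct(provider_sla, own_stack)
print(f"combined availability: {combined:.3f}%")
print(f"yearly downtime budget: {downtime_budget_minutes(combined):.0f} min")
```

With these made-up inputs the combined figure comes out a little under either number on its own, at roughly 13 hours of allowed downtime per year. So the SLA you promise customers can't simply copy the provider's SLA: your own part of the perimeter eats into the budget too.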