Murphy's law

How Murphy's law works in devops

Let me tell you how I spent the last week of August and the beginning of September...

It just so happened that out of our entire devops team I was the only one left. You know how it is: vacation season and all that. Fine, not the first time, nothing unusual. But I ended up looking after a project that had been built before my time and, to put it mildly, not always according to the best devops practices 👽

🚬 One lovely summer day I get a message that the database for the lab environments has given up the ghost. And it wasn't a cloud-hosted database; it was simply running in k8s, and had been for a long time at that.

I went to look at what was going on, and it turned out the pod simply had no volume attached, so any pod restart was guaranteed to wipe the database. Very thoughtfully, backups were being taken, just in case.
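For the curious, here is roughly how you could catch this kind of setup before it bites. It is a minimal sketch, not what we actually ran: the helper names, the `db` container name and the `/var/lib/db` path are all made up for illustration, and the pod spec is just a plain dict shaped like the `spec` section of a Kubernetes pod manifest. The idea is to check whether the database's data directory is mounted from a volume backed by a PersistentVolumeClaim.

```python
def persistent_volume_names(pod_spec: dict) -> set:
    """Names of volumes in the pod spec that are backed by a PersistentVolumeClaim."""
    return {
        volume["name"]
        for volume in pod_spec.get("volumes", [])
        if "persistentVolumeClaim" in volume
    }


def data_survives_restart(pod_spec: dict, container_name: str, data_path: str) -> bool:
    """True if data_path inside the container sits on a PVC-backed mount.

    Hypothetical helper: it only inspects the manifest, it does not talk to a cluster.
    """
    persistent = persistent_volume_names(pod_spec)
    for container in pod_spec.get("containers", []):
        if container.get("name") != container_name:
            continue
        for mount in container.get("volumeMounts", []):
            mount_path = mount["mountPath"].rstrip("/")
            covers_path = data_path == mount_path or data_path.startswith(mount_path + "/")
            if covers_path and mount["name"] in persistent:
                return True
    return False


# Assumed example: a pod like ours, with no volumes at all.
lab_db_spec = {
    "containers": [{"name": "db", "image": "example/db:1.0"}],
}

# The same pod after the fix: the data directory is mounted from a PVC.
fixed_db_spec = {
    "containers": [
        {
            "name": "db",
            "image": "example/db:1.0",
            "volumeMounts": [{"name": "data", "mountPath": "/var/lib/db"}],
        }
    ],
    "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "lab-db-data"}}],
}

print(data_survives_restart(lab_db_spec, "db", "/var/lib/db"))
print(data_survives_restart(fixed_db_spec, "db", "/var/lib/db"))
```

The first spec fails the check and the second passes. Note that an `emptyDir` volume would also fail here, which is on purpose: it goes away together with the pod.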

Hmm, I thought.

There really wasn't much reason for the pod to restart other than a node replacement, but by Murphy's law it happened exactly when everyone was on vacation. It was already Friday evening, and I announced that I definitely wouldn't fix anything before Monday. Luckily, on Sunday a colleague jumped in from his vacation, restored everything, and even added a volume so this wouldn't happen again.

🚬 On another, equally lovely day, I happened to notice that Vault had gone down in one of our production k8s clusters. And to bring it back up you need the unseal keys.

I went around asking, but nobody could give them to me. The thing is, Vault had been set up by an employee who had quit about a year earlier, and Vault itself hadn't been restarted in 1.5 years. So we had a black box full of secrets for the application, but no way to get them out 😑.
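To make the "black box" part concrete: Vault reports its seal state through the `sys/seal-status` endpoint, and the JSON it returns includes `sealed`, `t` (how many key shares are required), `n` (how many shares exist) and `progress` (how many have been entered so far). Below is a small sketch that turns such a response into a human-readable line. The function name is mine, and the sample values are made up; I am not quoting what our Vault actually returned.

```python
def unseal_summary(status: dict) -> str:
    """Describe a Vault seal-status response (already parsed from JSON)."""
    # Treat a missing "sealed" field as sealed, to stay on the safe side.
    if not status.get("sealed", True):
        return "unsealed"
    threshold = status["t"]
    total = status["n"]
    remaining = threshold - status.get("progress", 0)
    return (
        f"sealed: {remaining} more key share(s) needed "
        f"(threshold {threshold} of {total})"
    )


# Assumed sample response, trimmed to the fields used above.
sample_status = {"sealed": True, "t": 3, "n": 5, "progress": 0}

print(unseal_summary(sample_status))
# prints: sealed: 3 more key share(s) needed (threshold 3 of 5)
```

If nobody holds at least `t` of the shares, that number never goes down, and that is exactly where we were.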

The first thing I did was tell the developers not to deploy anything, because a running service could go down if it failed to reach Vault. As it turned out, though, the service was already down anyway: it had apparently been restarted together with Vault 🫣. And this is a critical service, damn it.

I had to hurriedly collect from everyone which variables this service needed to run, along with their values. An hour later, with the help of duct tape and chewing gum, I had it running again. Luckily we didn't have all that many secrets stored in that Vault, and within a few days everything was sorted out.

That's how it goes: an employee steps away for a couple of weeks, and a service that had been running for 1.5 years comes crashing down.

Always keep your keys in a safe place, and your hacks far away from production.