Having been part of reformulating on-call at two different companies, I have collected a few tips from many iterations of different formats, until we found one that strikes a good balance between well-being and on-call quality.
Let us define on-call for the scope of this article.
On-call is the act of an engineer being available to respond immediately to service malfunction, at any time, any day of the year.
It usually entails some sort of automatic alerting system, paired with a way to notify the engineer.
For alerting, I’m more familiar with Prometheus’ Alertmanager, and for notifying, both PagerDuty and OpsGenie. Discussing these technologies is outside the scope of this article.
As for the engineers, it is expected that they are able to respond; that is, they have the necessary tools at hand, like internet access and a laptop, and are able-minded.
There are other models for this. Notably, some companies have a dedicated team, usually called SREs, that is on-call for every system.
Two primary metrics can track the quality of an on-call rotation: Mean Time To Acknowledge (MTTA) and Mean Time To Resolve (MTTR).
The first one tracks how quickly an engineer acknowledges a page, and it reveals how healthy a given rotation is. The other tracks how quickly an acknowledged page is resolved; it shows how good the tooling and documentation are.
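As a back-of-the-envelope sketch (the record shape and timestamps here are made up, not from any particular tool’s API), both metrics can be computed from page records that carry created, acknowledged, and resolved timestamps:

```python
from datetime import datetime

# Hypothetical page records: (created, acknowledged, resolved).
pages = [
    (datetime(2023, 5, 1, 2, 0), datetime(2023, 5, 1, 2, 4), datetime(2023, 5, 1, 2, 30)),
    (datetime(2023, 5, 3, 14, 0), datetime(2023, 5, 3, 14, 2), datetime(2023, 5, 3, 14, 10)),
]

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# MTTA: created -> acknowledged, a proxy for rotation health.
mtta = mean_minutes([ack - created for created, ack, _ in pages])
# MTTR: acknowledged -> resolved, a proxy for tooling and docs quality.
mttr = mean_minutes([resolved - ack for _, ack, resolved in pages])
```

With the sample data above, MTTA comes out to 3 minutes and MTTR to 17.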
With that in mind, it would be natural to assume that the most adequate group of people to be on-call for a set of services is the same group who builds and maintains them.
However, more often than not, teams consist of two to eight people, which means each member would be on-call many days out of a month. That leads to the next point.
Even in the unlikely case that your company produces software that never malfunctions, the fact of being essentially trapped in your own home, having to bring your work phone to the bathroom, not being able to have some wine, and knowing that, at any point, you might be woken up with the dreadful, extremely loud sound of the pager, is not fun.
I have been woken up in the middle of the night a few times. Full of adrenaline, already reaching for the laptop. Going back to sleep afterwards is challenging.
Given that, it seems to be a desirable goal to have engineers on-call as few times per month as possible.
But how to pair that up with quality?
To reduce the amount of time on-call, it is necessary to have more people in the rotation.
Apart from making teams larger than intended, the only option is to have multiple teams be on-call for all the systems in the pool, meaning that the people on-call don’t necessarily maintain the systems that can alert.
This leads to anxiety. How can I be on-call for a system I don’t know?
The following are the prerequisites to make this possible.
Without proper monitoring it is impossible to achieve this goal. You must have comprehensible dashboards and troubleshooting tools.
No alert should be created without a very thorough runbook.
Runbooks should be created and tested with engineers outside of the team. When someone writes a runbook, they will inadvertently make assumptions about the knowledge of the engineer reading it. “Connect to the production server” might mean nothing to someone else.
Remember that the person reading the runbook is under a high-stress situation. They are worried and agitated. The last thing they need is to find out that the links in the runbook lead to nowhere.
A runbook starts with a link to the relevant dashboard showing the data that triggered the alert.
Then, it lists instructions, with IPs, bash commands, etc., to troubleshoot and restore service.
When a team is on-call for their own systems, they know that some alerts are a little flaky. “This one always fires on Friday evenings and it auto-heals in three minutes”, or “This one always fires when a deploy happens”. They happily acknowledge the page and go back to whatever they were doing.
This absolutely cannot happen when you have shared on-call responsibilities. The next engineer won’t know that this is the case. They will wake up, open their laptop, only to find the alert already resolved. Not a good way to make friends.
There are many possible configurations, with weekly or daily rotations being popular.
After many iterations and retrospectives, it was concluded, unsurprisingly, that people care far more about their weekends than weekdays.
Given that, my recommended setup is five shifts per week; how many hours that puts each engineer on-call outside of the normal 8h in the office depends on the shift.
Shifts usually start at the time most engineers are in the office, let’s say 10am, and finish at 10am the next day (or two days later, depending on the shift).
That means five different people are necessary to fill all the shifts for a week’s worth of on-call.
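To make the arithmetic concrete, here is a sketch of one such week under assumed shift lengths: four 24-hour weekday shifts plus a fifth shift covering Friday through Monday morning (the exact split is an assumption, not a prescription):

```python
# Assumed five-shift week: Mon-Thu are 24h each (10am to 10am),
# and the fifth shift covers the weekend, Friday 10am to Monday 10am.
shifts = {"Mon": 24, "Tue": 24, "Wed": 24, "Thu": 24, "Fri-Sun": 72}

# Hours of each shift that overlap the normal 8h office day
# (the weekend shift only overlaps Friday's office hours).
office_overlap = {"Mon": 8, "Tue": 8, "Wed": 8, "Thu": 8, "Fri-Sun": 8}

# On-call hours outside the office, per shift.
off_hours = {name: h - office_overlap[name] for name, h in shifts.items()}

assert sum(shifts.values()) == 168  # the whole week is covered
```

Under these assumptions, a weekday shift adds 16 off-hours and the weekend shift 64, which is why weekends dominate the conversation.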
I imagine most tools provide a way to have a multi-layered on-call rotation.
My recommended configuration is a two-layer setup: a mandatory primary rotation, backed by a best-effort secondary layer.
With this configuration, let’s say that the Primary engineer gets paged, follows the runbook, but is unable to restore service. When they escalate the page, it will notify all members of the secondary rotation.
The fact that the secondary rotation is best effort can make people nervous. What if no one is able to respond?
We definitely shared this concern; however, after actual experimentation and hundreds and hundreds of pages, not once did we have issues with the secondary layer not responding.
Assuming that your company already has on-call rotations in place, usually one per team, my recommendation is to start small.
First, pick two teams to merge and talk it through. If the teams have some intersection in context, it might be a little easier.
Then, get your hands on some alerting statistics: how many alerts per month, how many outside working hours. If you can, find out how many auto-resolved.
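A rough sketch of how such statistics could be derived from an exported page log (the record shape and the working-hours window are assumptions, not any tool’s actual export format):

```python
from datetime import datetime

# Hypothetical export: (fired_at, acknowledged_by_a_human).
pages = [
    (datetime(2023, 5, 1, 3, 0), False),   # fired at night, auto-resolved
    (datetime(2023, 5, 2, 14, 0), True),
    (datetime(2023, 5, 19, 23, 0), True),  # Friday night
]

def outside_working_hours(ts):
    # Working hours assumed to be 10:00-18:00, Monday to Friday.
    return ts.weekday() >= 5 or not 10 <= ts.hour < 18

total = len(pages)
after_hours = sum(outside_working_hours(fired) for fired, _ in pages)
auto_resolved = sum(not acked for _, acked in pages)
```

Those three numbers per month are usually enough to drive the cleanup conversation.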
With this data in hand, have the two teams clean up the alerts: remove some, fine-tune others. Definitely write runbooks for all of them.
Also, this might be a good moment to read, or reread, Rob Ewaschuk's excellent Philosophy on Alerting at Google. I have found that most teams have at most four to five critical alerts that should wake people in the middle of the night, and often fewer than that.
In general, critical alerts should point to symptoms that affect users, not simply elevated error rates or backlogging in a given service.
With the alerts cleaned up, it is time to configure the rotation.
For the first week or two, to get people a little more confident, you can still keep the per-team mandatory rotation, but have the alerts first route to the merged rotation.
As the teams get more comfortable, and hopefully happier with the new setup, you can slowly add teams to the rotation until you reach either every engineer in the company, or a good-enough size. If you have twenty people in the rotation, a person would be on-call only one shift per month.
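The back-of-the-envelope behind that last claim:

```python
shifts_per_week = 5
weeks_per_month = 52 / 12  # ~4.33
engineers = 20

# Roughly one shift per person per month.
shifts_per_person_per_month = shifts_per_week * weeks_per_month / engineers
```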
I also strongly recommend that you book a monthly retrospective with all the engineers in the merged rotation, at which you also share alert statistics. Failing to do this can lead to unnecessary unhappiness, as I found out.
It is also important to have a channel that includes all the engineers in the rotation. It is the place to discuss current alerts, poke people about missing runbooks, rage about the flaky alert that woke you up last night, and negotiate shift swaps.
My recommended way to ease the anxiety of a new person joining the rotation is the following:
This is a surprisingly difficult problem to solve: when people are in the office, alert the owning team instead of the shared rotation.
Although the tools I have used do have some configuration around time, not one has local holiday calendar support. Unless you are willing to remember to update the configuration manually for those, you might end up with alerts routing only to the best-effort layer.
These are the bane of shared on-call. People get unsurprisingly and rightfully upset when they happen. It is paramount that the leaders of each team take time to fine-tune flaky alerts as soon as possible: immediately, or the very next day.
Sometimes, a team is too far away in technology from the others, making writing runbooks nearly impossible.
This, however, can be a symptom of an underlying problem that, perhaps, should be addressed.
In my experience, when trying to implement this, some teams will claim that this is the case. Analyse each claim with care and work with the team to understand whether it is the moment to make foundational changes or not.
If not, the team should carry on with their own rotation.
If one team causes a disproportionate number of alerts compared to the others, people will get bitter. It is possible to set “maximum quotas” per team, per month; if a team exceeds its quota, it reverts back to local on-call duties.
I don’t have experience with this, but this possibility has been contemplated in the past. Managing the shared on-call morale is important.
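The quota idea could be as simple as this sketch (the threshold and team names are hypothetical, and, as said above, this is a contemplated policy rather than one I have run in production):

```python
# Hypothetical policy: a team exceeding its monthly alert quota
# reverts to its own local rotation for the next month.
MAX_ALERTS_PER_MONTH = 10

def should_revert_to_local(alert_counts, team):
    """alert_counts maps team name -> alerts fired this month."""
    return alert_counts.get(team, 0) > MAX_ALERTS_PER_MONTH

counts = {"payments": 14, "search": 3}
reverting = [t for t in counts if should_revert_to_local(counts, t)]
```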
Neither PagerDuty nor OpsGenie does an amazing job of keeping the existing schedule when adding or removing people from the rotation. It seems to be a difficult problem given the infinite nature of on-call schedules.
Remember to announce these changes in the rotation channel, so that people can check if anything has changed.
As rotations evolve over time, I have observed some beneficial side-effects apart from personal well-being and on-call quality:
And, most importantly, better systems that alert less and self-heal in more cases.