[lopsa-discuss] Interruptions coverage...

Trey Harris trey at lopsa.org
Tue Dec 20 22:03:25 PST 2005


I could write a book on sysadmin emergency management (the time when 
"interruptions" really matter), but here's a brain dump of a few thoughts:

--
I think that in general, merging the idea of "interruption shield" and 
"oncall" is useful.  Instead of having different methods for making 
contact during business hours versus off hours, just have oncall rotations 
24 hours a day, and instill in the site culture the idea of always 
engaging the oncall for any interruptive issue.

--
Orgs where *every* group that deals with interrupts has an oncall can work 
efficiently to deal with any emergent issue, whenever one arises.  Even 
non-tech groups like HR, legal, physical security and facilities should 
not be exempt from having oncalls, if they may need to be engaged in an 
emergency.  You don't want your junior admin scrutinizing a company org 
chart at 3am trying to figure out who to call.

--
Overall, the goal should be to reduce improvisational behavior to a 
minimum.  People should be directing their creative energies towards 
solving the problem that has created the interruption, not trying to 
engage the resources they need.  When the admin says, "I need database 
schema expertise," the difference in resolution time between following the 
well-defined procedure for engaging a database engineer and saying, 
"hmmm... who can I call?" can be an order of magnitude or more.

--
You should set SLAs (Service-Level Agreements) for your oncalls regarding 
how fast they are expected to respond by phone, how fast they must be 
online and working, and (if necessary) how fast they must be onsite. 
These SLAs should be determinable directly from the MTTR (Mean Time To 
Recovery) SLAs for the services the oncall is responsible for (you have 
internal SLAs, don't you?).
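
As a rough sketch of what "determinable from the MTTR" could look like: take 
the service's MTTR target, subtract the time you expect detection and repair 
to take, and what's left is the budget for getting the oncall responding and 
online.  The function name, the three-part breakdown, and the example numbers 
are all illustrative assumptions, not figures from any real site.

```python
def response_budget_minutes(mttr_minutes, detect_minutes, repair_minutes):
    """Minutes left for the oncall to answer the page and get online.

    Hypothetical model: MTTR = detection + human response + repair.
    """
    budget = mttr_minutes - detect_minutes - repair_minutes
    if budget <= 0:
        # The MTTR target cannot be met by a paged human at all.
        raise ValueError("MTTR target leaves no time for a human response")
    return budget


# Example: 60-minute MTTR, 5 minutes to detect, 40 minutes of typical repair.
print(response_budget_minutes(60, 5, 40))  # 15
```

If the budget comes out at or below zero, the problem is the MTTR target or 
the service design, not the oncall.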

--
I've found that setting different response times for different-severity 
issues, though common, is not terribly useful.  If the person needs to be 
available, they need to be available, and response times are an expression 
of the availability required.  Someone is no more likely to be in the 
dentist's chair or driving to work when a minor issue comes up than a 
major one.

--
When setting a rotation, you have to consider the SLAs involved and the 
frequency of interruptions.  A long-response, low-interrupt-frequency 
group (say, kernel engineering) can have long shifts, say a week or more. 
A fast-response, high-interrupt-frequency group (say, admin for a touchy, 
mission-critical database) must have short shifts.

If the person will be working full time dealing with stressful emergencies 
while they're oncall, eight hours at a time, no more than twice in any 
three days, is the limit for sanity (not to mention quality).  If the 
rotation is small and heavy (say, a group of four people or fewer), 
half-days can be more effective, in order to give everyone a chance to 
catch their breath and do non-interruptive work.
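
If you generate schedules with a script, the limit above (eight hours at a 
time, no more than twice in any three days) is easy to check mechanically. 
Here is a minimal sketch; the function name and the (date, hours) input shape 
are my own assumptions for illustration.

```python
from datetime import date, timedelta

MAX_SHIFT_HOURS = 8        # "eight hours at a time"
MAX_SHIFTS_PER_WINDOW = 2  # "no more than twice"
WINDOW_DAYS = 3            # "in any three days"


def violations(shifts):
    """shifts: list of (start_date, hours) for ONE person.

    Returns a list of human-readable problems (empty if the schedule is OK).
    """
    problems = []
    for day, hours in shifts:
        if hours > MAX_SHIFT_HOURS:
            problems.append(f"{day}: shift of {hours}h exceeds {MAX_SHIFT_HOURS}h")
    days = sorted(day for day, _ in shifts)
    # Checking the window that starts at each shift is enough: any overfull
    # three-day window also overfills the one starting at its first shift.
    for i, first in enumerate(days):
        window_end = first + timedelta(days=WINDOW_DAYS)
        in_window = [d for d in days[i:] if d < window_end]
        if len(in_window) > MAX_SHIFTS_PER_WINDOW:
            problems.append(f"{first}: {len(in_window)} shifts within {WINDOW_DAYS} days")
    return problems


# Three days in a row breaks the "twice in any three days" rule.
print(violations([(date(2005, 12, 1), 8), (date(2005, 12, 2), 8), (date(2005, 12, 3), 8)]))
```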

--
You *can* create groups whose exclusive job is dealing with interrupts and 
emergencies.  Some people even enjoy such work, for a time.  But be 
prepared to have high turnover, especially from new folks who aren't 
prepared for the real-life experience of such a job, and don't expect to 
have anyone remain in such a job for more than a couple years.  You may 
want to think about using 4x10-hour day workweeks with swing shifts in 
that case, as a regular three-day weekend can help people get over the 
adrenaline overload.

--
Speaking more to management than people on this list: You may be in a 
jurisdiction where your sysadmins are salaried workers exempt from labor 
laws.  In that case, you legally *may* (and many sites do) refuse to make 
any dispensations for afterhours oncall, expecting it as a responsibility 
above and beyond showing up for work every day.  If you do so, be aware 
that this can be a sure-fire way to kill morale and increase turnover. 
Wise sites will offer comp time for a "killer oncall shift".  Truly 
enlightened sites will recognize that simply being oncall, whether or not 
one gets called to work, is a stressful work responsibility that deserves 
compensation of some kind.

Smaller sites, especially those with fewer than 5 people in a rotation (see 
below), will often "tack on" 24x7 pager responsibility to workaday 
responsibilities.  This can work, if you give your employees flexibility 
to decide their own hours and workplace (home vs. office) on the fly. 
You can even set some reasonable boundaries (say, 11am-2pm every day is 
in-office "face time" unless you've had a killer oncall the night before). 
But setting rigid 40+ hour work weeks, and then tacking pager duty on top 
of it, is a sure recipe for unhappy admins, shoddy work, and fast 
turnover.

--
Your SLAs may, by the very nature of the response time agreement, require 
"house arrest" of oncalls, because even commute time or going to the 
grocery store will render the oncall incapable of meeting the SLA 
commitment.  If so, don't pussyfoot around this requirement--your oncalls 
will put two and two together quickly enough, and will resent it.  Or, 
more healthily for them but more troublesome for your business, they'll 
carve out their own "reasonable accommodations" that work for them--right 
up until the first time the datacenter is in flames and they're at the 
dentist.  Instead, make house arrest explicit, and have supporting 
policies in place--for example, you get the day before off to run errands, 
or the day after off to recuperate, and there's a secondary oncall to take 
over in case of personal emergencies.

--
The more rigid and onerous the oncall duty, the further out you have to 
schedule it.  And, please, do check the group's vacation calendar *before* 
assigning oncall duty.  People *will* just quit over mistakes here. (And 
good for them!)

--
If you're serious about oncall and interruption-shield responsibilities, 
then they need to be handled, in terms of project work, like vacation (for 
long shifts) or personal days (for short shifts).  Cancel all meetings, 
consider it a dead time when scoping project work, and turn on the email 
autoresponder.  If you have a quiet shift and can hammer away at a 
project, that's great.  But don't commit to anything.

--
If the response time in your SLA is greater than the oncall's commute (the 
*real* commute, mind you, not some policy document that says people should 
be able to get home within N minutes), then the shift can span workdays. 
Otherwise, you'll have to do handoffs during commutes, either ad-hoc (the 
oncall is responsible for getting coverage while they're offline) or 
explicit (a secondary rotation, where the primaries and secondaries are 
required to coordinate their offline time so it doesn't overlap).
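
The rule above boils down to a small decision, sketched here.  The function 
and parameter names are hypothetical; the one thing that matters is feeding it 
the measured commute, not the one in the policy document.

```python
def coverage_plan(response_sla_minutes, real_commute_minutes, has_secondary_rotation):
    """Decide how commute time is covered for one oncall (illustrative only)."""
    if response_sla_minutes > real_commute_minutes:
        # The oncall can still meet the SLA from the road.
        return "shift can span workdays"
    if has_secondary_rotation:
        # Primary and secondary coordinate so they are never offline together.
        return "explicit handoff to secondary during commute"
    # No secondary: the oncall must line up cover before going offline.
    return "ad-hoc handoff arranged by the oncall"


print(coverage_plan(60, 35, False))  # shift can span workdays
print(coverage_plan(15, 35, True))   # explicit handoff to secondary during commute
```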

--
A secondary rotation is a good idea even when response time doesn't 
require it, in order to prepare for the unexpected.  When the oncall's 
power goes out, when she falls ill, when her encryption token is on the 
fritz, there should be an obvious person her responsibilities fall to.

--
How does handoff from primary to secondary work?  If the primary falls ill 
and is rushed to the hospital, must the secondary immediately take over 
full responsibility?  If so, the secondary's SLA is exactly the same as 
the primary's.  Saying otherwise in the policy won't make it so.

--
Modes of communication must be specified in the SLA.  It's useful to put 
yourself in the shoes of the oncall, the customer/requester, and 
management and do some role-playing to see what the needs are.  Admin: "To 
fix this, I need to use a GUI tool, and this dialup is too slow to use 
it."  Looks like access to broadband had better be in the SLA.  VP: "I 
want status reports every 30 minutes."  Better have conference bridges at 
the ready.  Call leader: "Cut the crosstalk!  We need this bridge quiet 
for the status reports!"  Better have IRC or IM available so that people 
can coordinate and discuss without using the phone bridge.  CEO: "That 
outage was terribly handled!  I want a post-mortem!"  Better have a 
mechanism for recording both the conference call and the IMs.

--
Management must take responsibility for ensuring that oncalls, especially 
those doing "tacked on" after-hours pager duty, are not being abused.  A lot of 
us have a BOFH streak and are perfectly capable of telling a customer off 
for paging us at 4am for a non-emergency.  But some of us aren't, and none 
of us should have to.  There must be consequences for abusing the oncall.

At one site I know of, an access-control list defines who can open 
high-severity tickets or issue pages.  Anyone can get on the list after 
reading and agreeing to the definitions of what is an emergency and what 
is not.  But just one case of improper use, and the person gets removed 
from the list.  They can still call the 24x7 helpdesk to plead their case 
if they truly have an emergency, but they don't get to make that 
determination themselves anymore.
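
I don't know how that site implemented it, but the policy itself is simple 
enough to sketch.  The class and method names below are made up for 
illustration, and treating removal as permanent is my reading of "anymore," 
not something I can vouch for.

```python
class PagerAccessList:
    """Hypothetical one-strike list of who may open high-severity tickets."""

    def __init__(self):
        self._allowed = set()
        self._revoked = set()

    def enroll(self, user, agreed_to_definitions):
        # Anyone may join, but only after agreeing to the emergency definitions.
        if not agreed_to_definitions:
            raise PermissionError(f"{user} has not agreed to the definitions")
        if user in self._revoked:
            # Assumption: removal is permanent.
            raise PermissionError(f"{user} was removed for improper use")
        self._allowed.add(user)

    def record_improper_use(self, user):
        # One strike and the user is off the list.
        self._allowed.discard(user)
        self._revoked.add(user)

    def route(self, user):
        # Listed users page directly; everyone else pleads their case.
        return "page oncall" if user in self._allowed else "call 24x7 helpdesk"


acl = PagerAccessList()
acl.enroll("alice", agreed_to_definitions=True)
print(acl.route("alice"))  # page oncall
acl.record_improper_use("alice")
print(acl.route("alice"))  # call 24x7 helpdesk
```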

--
Remember that if you want to effectively cover any responsibility 24x7 
with the same response as business hours, you need a minimum of 5 people. 
Any fewer, and in many jurisdictions you'll be running afoul of labor laws, 
and in all you'll have a group of unhappy employees.  Even so, many 
people's circadian rhythms can't deal with such a structure where they're 
rotating through different work schedules including the graveyard shift.
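
One way to arrive at a number like 5 is plain arithmetic: a week has 168 
hours, and if each person works about 40 of them you need more than four 
people just to fill the seats.  The 40-hour figure is an assumption on my 
part, and the sketch ignores vacation, illness, and shift overlap, all of 
which push the real number higher.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168


def min_staff(hours_per_person_per_week=40):
    """Smallest headcount that can fill every hour of the week once."""
    return math.ceil(HOURS_PER_WEEK / hours_per_person_per_week)


print(min_staff())  # 5, since 168 / 40 = 4.2
```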

If you can afford to have at least two worksites offset eight or more 
timezones from one another, you can avoid anyone having to work graveyard, 
which vastly improves morale (and quality).  Go to three sites offset by 
eight hours, and you're truly "following the sun," and nobody has to work 
off-hours.
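
To see why three sites eight hours apart cover the clock, here is a small 
sketch that maps each site's local business hours onto UTC.  The 9-to-5 
workday and the example offsets (-8, 0, +8) are assumptions chosen for 
illustration.

```python
def covered_utc_hours(site_utc_offsets, local_start=9, local_end=17):
    """UTC hours (0-23) during which at least one site is in business hours."""
    covered = set()
    for offset in site_utc_offsets:
        for local_hour in range(local_start, local_end):
            covered.add((local_hour - offset) % 24)
    return covered


# Three sites offset by eight hours: every UTC hour is someone's daytime.
print(len(covered_utc_hours([-8, 0, 8])))  # 24

# Two sites eight hours apart: these UTC hours still need an extended shift.
print(sorted(set(range(24)) - covered_utc_hours([0, 8])))
# [0, 17, 18, 19, 20, 21, 22, 23]
```

With two sites, the leftover hours fall in one site's evening rather than 
its small hours, which is how you avoid a graveyard shift without a third 
site.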

--
I could go on, and on.  Maybe I should write that book....

Trey

