[lopsa-discuss] Interruptions coverage...
Trey Harris
trey at lopsa.org
Tue Dec 20 22:03:25 PST 2005
I could write a book on sysadmin emergency management (the time when
"interruptions" really matter), but here's a brain dump of a few thoughts:
--
I think that in general, merging the idea of "interruption shield" and
"oncall" is useful. Instead of having different methods for making
contact during business hours versus off hours, just have oncall rotations
24 hours a day, and instill in the site culture the idea of always
engaging the oncall for any interruptive issue.
--
Orgs where *every* group that deals with interrupts has an oncall can work
efficiently to deal with any emergent issue, whenever one arises. Even
non-tech groups like HR, legal, physical security and facilities should
not be exempt from having oncalls, if they may need to be engaged in an
emergency. You don't want your junior admin scrutinizing a company org
chart at 3am trying to figure out who to call.
--
Overall, the goal should be to reduce improvisational behavior to a
minimum. People should be directing their creative energies towards
solving the problem that has created the interruption, not trying to
engage the resources they need. When the admin says, "I need database
schema expertise," the difference in resolution time between following the
well-defined procedure for engaging a database engineer and saying,
"hmmm... who can I call?" can be an order of magnitude or more.
--
You should set SLAs (Service-Level Agreements) for your oncalls regarding
how fast they are expected to respond by phone, how fast they must be
online and working, and (if necessary) how fast they must be onsite.
These SLAs should be determinable directly from the MTTR (Mean Time To
Recovery) SLAs for the services the oncall is responsible for (you have
internal SLAs, don't you?).
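
As a rough sketch of how an oncall response SLA might fall out of a service's MTTR target: the function name and the model (recovery time = time to get online + typical hands-on repair time) are assumptions made for illustration, not a standard formula.

```python
from datetime import timedelta


def max_time_to_online(mttr_target: timedelta, typical_repair: timedelta) -> timedelta:
    """Longest the oncall may take to be online and still meet the MTTR target.

    Hypothetical model: recovery time = time to get online + repair time.
    """
    budget = mttr_target - typical_repair
    if budget <= timedelta(0):
        # The target cannot be met no matter how fast the oncall responds.
        raise ValueError("MTTR target leaves no time for the oncall to respond")
    return budget


# Example: a 2-hour MTTR target and repairs that typically take 90 minutes.
print(max_time_to_online(timedelta(hours=2), timedelta(minutes=90)))  # 0:30:00
```

If the budget comes out smaller than anyone's real commute, that is a sign the SLA implies the "house arrest" situation discussed further down.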
--
I've found that setting different response times for different-severity
issues, though common, is not terribly useful. If the person needs to be
available, they need to be available, and response times are an expression
of the availability required. Someone is no more likely to be in the
dentist's chair or driving to work when a minor issue comes up than a
major one.
--
When setting a rotation, you have to consider the SLAs involved and the
frequency of interruptions. A long-response, low-interrupt-frequency
group (say, kernel engineering) can have long shifts, say a week or more.
A fast-response, high-interrupt-frequency group (say, admin for a touchy,
mission-critical database) must have short shifts.
If the person will be working full time dealing with stressful emergencies
while they're oncall, eight hours at a time, no more than twice in any
three days, is the limit for sanity (not to mention quality). If the
rotation is small and heavy (say, a group of four people or fewer),
half-days can be more effective, in order to give everyone a chance to
catch their breath and do non-interruptive work.
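
A minimal sketch of checking one person's schedule against the limit above (eight hours at a time, no more than twice in any three days). It assumes "three days" means three consecutive calendar days counted by shift start date; the names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_SHIFT = timedelta(hours=8)
MAX_SHIFTS_PER_WINDOW = 2
WINDOW_DAYS = 3  # assumption: consecutive calendar days, by shift start date


def violations(shifts):
    """Return a list of problems for one person's (start, end) shifts."""
    problems = []
    for start, end in shifts:
        if end - start > MAX_SHIFT:
            problems.append(f"shift starting {start} is longer than 8 hours")

    start_days = sorted(start.date() for start, _ in shifts)
    for i, first_day in enumerate(start_days):
        window_end = first_day + timedelta(days=WINDOW_DAYS)
        # Count shifts starting inside the 3-day window that opens on first_day.
        in_window = [d for d in start_days[i:] if d < window_end]
        if len(in_window) > MAX_SHIFTS_PER_WINDOW:
            problems.append(f"more than 2 shifts in the 3 days from {first_day}")
    return problems


def shift(day, hour, length_hours):
    """Build a (start, end) pair in January 2006 for the examples below."""
    start = datetime(2006, 1, day, hour)
    return (start, start + timedelta(hours=length_hours))


print(violations([shift(1, 9, 8), shift(2, 9, 8), shift(4, 9, 8)]))  # []
print(len(violations([shift(1, 9, 8), shift(2, 9, 8), shift(3, 9, 8)])))  # 1
```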
--
You *can* create groups whose exclusive job is dealing with interrupts and
emergencies. Some people even enjoy such work, for a time. But be
prepared to have high turnover, especially from new folks who aren't
prepared for the real-life experience of such a job, and don't expect to
have anyone remain in such a job for more than a couple of years. You may
want to think about using 4x10-hour day workweeks with swing shifts in
that case, as a regular three-day weekend can help people get over the
adrenaline overload.
--
Speaking more to management than people on this list: You may be in a
jurisdiction where your sysadmins are salaried workers exempt from labor
laws. In that case, you legally *may* (and many sites do) refuse to make
any dispensations for after-hours oncall, expecting it as a responsibility
above and beyond showing up for work every day. If you do so, be aware
that this can be a sure-fire way to kill morale and increase turnover.
Wise sites will offer comp time for a "killer oncall shift". Truly
enlightened sites will recognize that simply being oncall, whether or not
one gets called to work, is a stressful work responsibility that deserves
compensation of some kind.
Smaller sites, especially those with fewer than 5 people in a rotation (see
below), will often "tack on" 24x7 pager responsibility to workaday
responsibilities. This can work, if you give your employees flexibility
to decide their own hours and workplace (home vs. office) on the fly.
You can even set some reasonable boundaries (say, 11am-2pm every day is
in-office "face time" unless you've had a killer oncall the night before).
But setting rigid 40+ hour work weeks, and then tacking pager duty on top
of it, is a sure recipe for unhappy admins, shoddy work, and fast
turnover.
--
Your SLAs may, by the very nature of the response time agreement, require
"house arrest" of oncalls, because even commute time or going to the
grocery store will render the oncall incapable of meeting the SLA
commitment. If so, don't pussyfoot around this requirement--your oncalls
will put two and two together quickly enough, and will resent it. Or,
more healthily for them but more troublesome for your business, they'll
carve out their own "reasonable accommodations" that work for them--right
up until the first time the datacenter is in flames and they're at the
dentist. Instead, make house arrest explicit, and have supporting
policies in place--for example, you get the day before off to run errands,
or the day after off to recuperate, and there's a secondary oncall to take
over in case of personal emergencies.
--
The more rigid and onerous the oncall duty, the further out you have to
schedule it. And, please, do check the group's vacation calendar *before*
assigning oncall duty. People *will* just quit over mistakes here. (And
good for them!)
--
If you're serious about oncall and interruption-shield responsibilities,
then they need to be handled, in terms of project work, like vacation (for
long shifts) or personal days (for short shifts). Cancel all meetings,
consider it a dead time when scoping project work, and turn on the email
autoresponder. If you have a quiet shift and can hammer away at a
project, that's great. But don't commit to anything.
--
If the response time in your SLA is greater than the oncall's commute (the
*real* commute, mind you, not some policy document that says people should
be able to get home within N minutes), then the shift can span workdays.
Otherwise, you'll have to do handoffs during commutes, either ad-hoc (the
oncall is responsible for getting coverage while they're offline) or
explicit (a secondary rotation, where the primaries and secondaries are
required to coordinate their offline time so it doesn't overlap).
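
For the explicit variant, a sketch of checking that the primary's and secondary's offline windows never overlap. The function name and the representation of a window as a (start, end) pair are assumptions for illustration.

```python
from datetime import datetime


def overlapping_offline(primary_offline, secondary_offline):
    """Return pairs of (primary, secondary) offline windows that overlap.

    Each window is a (start, end) pair; end is treated as exclusive.
    """
    clashes = []
    for p_start, p_end in primary_offline:
        for s_start, s_end in secondary_offline:
            # Two half-open intervals overlap when each starts before the other ends.
            if p_start < s_end and s_start < p_end:
                clashes.append(((p_start, p_end), (s_start, s_end)))
    return clashes


# Hypothetical commutes: primary 17:00-17:45, secondary 17:30-18:15 on the same day.
primary = [(datetime(2006, 1, 9, 17, 0), datetime(2006, 1, 9, 17, 45))]
secondary = [(datetime(2006, 1, 9, 17, 30), datetime(2006, 1, 9, 18, 15))]
print(len(overlapping_offline(primary, secondary)))  # 1
```

An empty result means someone is reachable at every moment, at least on paper.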
--
A secondary rotation is a good idea even when response time doesn't
require it, in order to prepare for the unexpected. When the oncall's
power goes out, when she falls ill, when her encryption token is on the
fritz, there should be an obvious person her responsibilities fall to.
--
How does handoff from primary to secondary work? If the primary falls ill
and is rushed to the hospital, must the secondary immediately take over
full responsibility? If so, the secondary's SLA is exactly the same as
the primary's. Saying otherwise in the policy won't make it so.
--
Modes of communication must be specified in the SLA. It's useful to put
yourself in the shoes of the oncall, the customer/requester, and
management and do some role-playing to see what the needs are. Admin: "To
fix this, I need to use a GUI tool, and this dialup is too slow to use
it." Looks like access to broadband had better be in the SLA. VP: "I
want status reports every 30 minutes." Better have conference bridges at
the ready. Call leader: "Cut the crosstalk! We need this bridge quiet
for the status reports!" Better have IRC or IM available so that people
can coordinate and discuss without using the phone bridge. CEO: "That
outage was terribly handled! I want a post-mortem!" Better have a
mechanism for recording both the conference call and IMs.
--
Management must take responsibility for ensuring that oncalls, especially
those doing "tacked on" after-hours pager duty, are not being abused. A lot of
us have a BOFH streak and are perfectly capable of telling a customer off
for paging us at 4am for a non-emergency. But some of us aren't, and none
of us should have to. There must be consequences for abusing the oncall.
At one site I know of, an access-control list defines who can open
high-severity tickets or issue pages. Anyone can get on the list after
reading and agreeing to the definitions of what is an emergency and what
is not. But just one case of improper use, and the person gets removed
from the list. They can still call the 24x7 helpdesk to plead their case
if they truly have an emergency, but they don't get to make that
determination themselves anymore.
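
A minimal sketch of such a list, not that site's actual implementation. The class and method names are hypothetical, and blocking re-enrollment after removal is my reading of "anymore", not something stated above.

```python
class PagerAccessList:
    """Tracks who may open high-severity tickets or issue pages."""

    def __init__(self):
        self._allowed = set()
        self._removed = set()

    def enroll(self, user, agreed_to_definitions):
        """Add a user once they have read and agreed to the emergency definitions."""
        if not agreed_to_definitions:
            raise PermissionError(f"{user} must agree to the emergency definitions")
        if user in self._removed:
            # Assumption: removal is permanent; the helpdesk decides for them now.
            raise PermissionError(f"{user} was removed for improper use")
        self._allowed.add(user)

    def may_page(self, user):
        return user in self._allowed

    def record_improper_use(self, user):
        """One case of improper use removes the user from the list."""
        self._allowed.discard(user)
        self._removed.add(user)


acl = PagerAccessList()
acl.enroll("alice", agreed_to_definitions=True)
print(acl.may_page("alice"))  # True
acl.record_improper_use("alice")
print(acl.may_page("alice"))  # False
```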
--
Remember that if you want to effectively cover any responsibility 24x7
with the same response as business hours, you need a minimum of 5 people.
Any fewer, and in many jurisdictions you'll be running afoul of labor laws,
and in all of them you'll have a group of unhappy employees. Even so, many
people's circadian rhythms can't cope with a structure where they're
rotating through different work schedules including the graveyard shift.
If you can afford to have at least two worksites offset eight or more
timezones from one another, you can avoid anyone having to work graveyard,
which vastly improves morale (and quality). Go to three sites offset by
eight hours, and you're truly "following the sun," and nobody has to work
off-hours.
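
One way to arrive at the figure of five, and to check the follow-the-sun claim, is sketched below. The 40-hour work week, the 9-to-17 local business day, and the site offsets are assumptions for illustration; none of them come from a particular site.

```python
import math

HOURS_PER_WEEK = 24 * 7
WORK_WEEK_HOURS = 40  # assumption: a standard full-time week


def min_staff_for_24x7(work_week_hours=WORK_WEEK_HOURS):
    """Fewest people whose combined work weeks add up to 168 hours."""
    return math.ceil(HOURS_PER_WEEK / work_week_hours)


def uncovered_utc_hours(site_utc_offsets, day_start=9, day_end=17):
    """UTC hours during which no site is inside its local business day."""
    covered = set()
    for offset in site_utc_offsets:
        for utc_hour in range(24):
            local_hour = (utc_hour + offset) % 24
            if day_start <= local_hour < day_end:
                covered.add(utc_hour)
    return sorted(set(range(24)) - covered)


print(min_staff_for_24x7())                  # 5  (168 / 40 = 4.2, rounded up)
print(uncovered_utc_hours([0, 8, 16]))       # []  three sites: nobody works off-hours
print(len(uncovered_utc_hours([0, 8])))      # 8   two sites: hours left for extended shifts
```

With two sites, the eight remaining hours can be split as early and late extensions of each site's day, which is how graveyard shifts are avoided.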
--
I could go on, and on. Maybe I should write that book....
Trey