Here's a tale of woe from an organisation I know - anonymised to protect the guilty.
A couple of weeks after a major hardware and operating system upgrade, there was a serious foul-up during a weekend batch process. What went wrong? What got missed in the (quite extensive) testing?
The symptom was that batch jobs run under the concurrent manager were running late. Very late. In fact, they hadn't run at all. The external scheduling software had attempted to launch them, but failed. Worse than that, there had been no alerting over the weekend. Operators should have been notified of the failure of critical processes by an Enterprise Management (EM) tool.
Cut to the explanation:
As part of the O/S upgrade, user accounts on the server were now set to lock out after three or more failed login attempts. Someone in operations-land triggered a lockout on the unix account used to run the concurrent manager, and didn't report it to anyone who could reset it. So that explained the concurrent manager failures.
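For what it's worth, a simple pre-flight check run before the batch window would have surfaced the lockout hours earlier. The sketch below is mine, not theirs: the account name is made up, and it assumes a Linux host where passwd -S reports a locked account with an "L" in the second field of its output.

    #!/usr/bin/env python
    # Hypothetical pre-flight check: warn before the batch window opens if the
    # service account that runs the concurrent manager has been locked out.
    # Assumes a Linux host where "passwd -S <user>" reports "L" for a locked
    # account; the account name below is illustrative only.

    import subprocess
    import sys

    SERVICE_ACCOUNT = "applmgr"   # hypothetical unix account for the concurrent manager

    def account_is_locked(user):
        """Return True if passwd -S reports the account as locked."""
        out = subprocess.check_output(["passwd", "-S", user]).decode()
        # Typical output: "applmgr L 06/15/2009 0 99999 7 -1"
        return out.split()[1].startswith("L")

    if __name__ == "__main__":
        if account_is_locked(SERVICE_ACCOUNT):
            print("WARNING: %s is locked out - weekend batch will fail" % SERVICE_ACCOUNT)
            sys.exit(1)
        print("%s is usable" % SERVICE_ACCOUNT)

Run something like that from cron, or as the first step of the scheduler's job stream, and a lockout becomes a Friday-afternoon warning instead of a Monday-morning post-mortem.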
The EM software that should have woken up the operators also failed. Surprise, surprise: it was using the same (locked-out) unix account.
And finally, the alerting rules recognised all kinds of warnings and errors, but no one had considered the possibility that the EM system itself would fail.
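The textbook fix for that last gap is a dead man's switch: something independent of the EM tool that notices when the EM tool itself goes quiet. Here's a minimal sketch, assuming the EM tool touches a heartbeat file every few minutes; the file path and the page_operator stub are invented for illustration, and the checker would live on a different host, running under its own credentials (not the same unix account, obviously).

    #!/usr/bin/env python
    # Hypothetical watchdog for the watchdog: page someone through an
    # independent channel if the EM tool's heartbeat file goes stale.
    # The path and page_operator() are stand-ins, not real interfaces.

    import os
    import time

    HEARTBEAT_FILE = "/var/run/em_heartbeat"   # assumed to be touched by the EM tool
    MAX_AGE_SECONDS = 15 * 60                  # alarm after 15 minutes of silence

    def page_operator(message):
        # Stand-in for whatever out-of-band alerting exists (SMS gateway, pager...)
        print("PAGE: %s" % message)

    def heartbeat_age():
        """Seconds since the EM tool last touched its heartbeat file."""
        try:
            return time.time() - os.path.getmtime(HEARTBEAT_FILE)
        except OSError:
            return float("inf")   # a missing file counts as a dead monitor

    if __name__ == "__main__":
        age = heartbeat_age()
        if age > MAX_AGE_SECONDS:
            page_operator("EM heartbeat is %.0f seconds old - the monitor itself may be down" % age)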
Well, it's only a business system; though a couple of C-level execs got their end-of-month reports a few days late, and there were plenty of red faces, nobody actually died...
Just keep an eye out for those nasty corner cases!
Monday, June 15, 2009