Monday, June 15, 2009

Be Alert!

Here's a tale of woe from an organisation I know - anonymised to protect the guilty.

A couple of weeks after a major hardware and operating system upgrade, there was a major foul-up during a weekend batch process. What went wrong? What got missed in the (quite extensive) testing?

The symptom was that batch jobs run under the concurrent manager were running late. Very late. In fact, they hadn't run at all. The external scheduling software had attempted to launch them, but failed. Worse than that, there had been no alerting over the weekend. Operators should have been notified of the failure of critical processes by an Enterprise Management (EM) tool.

Cut to the explanation:

As part of the O/S upgrade, user accounts on the server are now set to be locked out if three or more failed attempts to log in are made. Someone in operations-land triggered a lockout on a Unix account used to run the concurrent manager, and didn't report it, so nobody reset it. That explained the concurrent manager failures.

The EM software that should have woken up the operators also failed. Surprise, surprise: it was using the same (locked-out) Unix account.

And finally, the alerting rules recognised all kinds of warnings and errors, but no one had considered the possibility that the EM system itself would fail.
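One common way to cover that gap is a "dead man's switch": the monitoring system touches a heartbeat file at regular intervals, and a small independent check raises the alarm if the heartbeat goes stale. What follows is only a minimal sketch of the idea in Python, not what this organisation ran. The function name, the heartbeat file and the threshold are all hypothetical, and the check is only useful if it runs under a different account (and ideally on a different host) from the thing it watches.

```python
import os
import time
from typing import Optional


def heartbeat_is_stale(path: str, max_age_seconds: float,
                       now: Optional[float] = None) -> bool:
    """Return True if the heartbeat file is missing or too old.

    `path` is a hypothetical file that the monitoring system is assumed
    to touch on every successful polling cycle.
    """
    if now is None:
        now = time.time()
    try:
        last_beat = os.path.getmtime(path)
    except OSError:
        # No readable heartbeat file at all: treat the monitor as down.
        return True
    return (now - last_beat) > max_age_seconds
```

A scheduled job could call heartbeat_is_stale("/var/run/em.heartbeat", 600) every few minutes and page someone by a route that does not depend on the EM tool when it returns True. The path and the ten-minute threshold here are illustrative only.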

Well, it's only a business system; though a couple of C-level execs got their end of month reports a couple of days late, and there were plenty of red faces, nobody actually died...

Just keep an eye out for those nasty corner cases!

Sunday, June 07, 2009

Oracle Exadata posts #1 TPC-H result

Greg Rahn's Structured Data blog provides the data that Kevin Closson had to remove from his own blog. From an HP/Oracle point of view, it is a very good performance, reducing cost/QphH by a factor of 4.

However, it is interesting to see that the HP/Oracle solution is still more than 4 times the cost/QphH of the #2 placed Exasol solution (running on Fujitsu Primergy, and reported a year ago) - while the absolute performance improvement is relatively slight (1.16M queries/hr against 1.02M, or roughly 14%).