Tuesday, April 04, 2006

How Not to Administer Important Systems

Courtesy of SANS NewsBites (link will work shortly), I learned of a router or switch crash that shut down the San Francisco Bay Area Rapid Transit District (BART). According to the article:

BART officials promised Thursday to thoroughly investigate why technicians risked working on computers that control trains while the transit system was running, work that crashed BART's main computer, stalled 50 to 60 trains, and stranded 35,000 passengers for more than an hour at the peak of the Wednesday evening commute.

"The bottom line is we shouldn't have worked on it (during service hours)," BART spokesman Linton Johnson said.

No kidding. They're lucky the trains stopped running rather than continuing to run -- into each other.

"The network switch was not supposed to get overloaded," Johnson said. "It is not supposed to crash. But we shouldn't have been working on (the computer system) while trains were running."

Johnson described the technicians who caused the crash as conscientious workers who were frustrated by problems caused by the installation of new software on Monday and Tuesday. The software upgrade is intended to be more reliable and secure and to allow BART to limit problems instead of having them affect the entire system.

"We had some folks who have a long record of installing (software) components correctly and are proud of having very few problems," Johnson said. "When they had two, they wanted to get them fixed as soon as possible. It was a rush to do the right thing."

"Rush" is an ingredient in a recipe for disaster, despite the desire "to do the right thing." This is why frameworks like the IT Infrastructure Library (ITIL) emphasize Service Management over cowboy administrative practices.

1 comment:

Anonymous said...

For every admin who rushes a project or a fix, there is likely a manager above him making the suggestions and demands.

I've seen far too often in business that, despite pleas from the technical crews to do things properly, management moves forward with over-aggressive timelines and demands. Those demands cause the rush that in turn causes so many problems and so much wasted money, whether the problems manifest right away, like the BART issue, or remain latent in a configuration flaw that an attacker exploits a year from now.