Friday, September 22, 2006

Nisley on Failure Analysis

Since I'm not a professional software developer, the only reason I pay attention to Dr. Dobb's Journal is Ed Nisley. I cited him earlier in Ed Nisley on Professional Engineering and Insights from Dr. Dobb's. The latest issue features Failure Analysis, Ed's look at NASA's documentation on mission failures. Ed writes:

[R]eviewing your projects to discover what you do worst can pay off, if only by discouraging dumb stunts.

What works for you also works for organizations, although few such reviews make it to the outside world. NASA, however, has done a remarkable job of analyzing its failures in public documents that can help the rest of us improve our techniques.


Documenting digital disasters has been a theme of this blog, although my request for readers to share their stories went largely unheeded. This is why I would like to see (and maybe create/lead) a National Digital Security Board.

Here are a few excerpts from Ed's article. I'm not going to summarize it; it takes about 5 minutes to read. These are the concepts I want to remember.

NASA defines the "root" cause of mishap as [a]long a chain of events leading to a mishap, the first causal action or failure to act that could have been controlled systematically either by policy/practice/procedure or individual adherence to policy/practice/procedure.

The root causes of these mishaps (incorrect units, invalid inputs, inverted G-switches) seem obvious in retrospect. How could anyone have possibly made those mistakes?

In addition to the root cause, the MIB Reports also identify a "contributing" cause as [a] factor, event or circumstance which led directly or indirectly to the dominant root cause, or which contributed to the severity of the mishap.


The "chain of events" is symptomatic of disasters. A break in that chain prevents the disaster.

However, the MIB [Mishap Investigation Board] discovered that [t]he Software Interface Specification (SIS) was developed but not properly used in the small forces ground software development and testing. End-to-end testing ... did not appear to be accomplished. (emphasis added)

Lack of end-to-end testing appears to be a common theme with disasters.

Mars, the Death Planet for spacecraft, might not have been the right venue for NASA's then-new "Faster, Better, Cheaper" mission-planning process...

The Mars Program Independent Assessment Team (MPIAT) Report pointed out that overall project management decisions caused the cascading series of failed verifications and tests. One slide of their report showed the MCO and MPL project constraints: Schedule, cost, science requirements, and launch vehicle were established constraints and margins were inadequate. The only remaining variable was risk.

In this context, "Faster" means flying more missions, getting rid of "non-value-added" work, and reducing the cycle time by working smarter rather than harder. "Cheaper" has the obvious meaning: spending less to get the same result. The MCO [Mars Climate Orbiter] and MPL [Mars Polar Lander] missions together cost less than the previous (successful) Mars Pathfinder mission.

The term "Better" has an amorphous definition, which I believe is the fundamental problem. In general, management gets what it measures and, if something cannot be measured, management simply won't insist on getting it.

You can easily demonstrate that you're doing things faster, that you've eliminated "non-value-added" operations, and that you're spending less money than ever before. You cannot show that those decisions are better (or worse), because the only result that really matters is whether the mission actually returns science data. Regrettably, you can measure that aspect of "better" after the fact and, in space, there are no do-overs.
(emphasis added)

The last part is crucial. For digital security, the only result that really matters is whether you preserve confidentiality, integrity, and availability, usually by preventing and/or mitigating compromise. All the other stuff -- "percentage of systems certified and accredited," "percentage of systems with anti-virus applied," "percentage of systems with current patch levels" -- is absolutely secondary. In the Mars mission context, who cares if you build the spacecraft quicker, launch on time, and spend less money, if the vehicle crashes and the mission fails?

Thankfully NASA is taking steps to learn from its mistakes by investigating and documenting these disasters. It's time the digital security world learned something from these rocket scientists.

No comments: