Saturday, September 17, 2005

Engineering Disaster Lessons for Digital Security

I watched an episode of Modern Marvels on the History Channel this afternoon. It was Engineering Disasters 11, one in a series of videos on engineering failures. A few thoughts came to mind while watching the show. I will provide commentary on each topic addressed by the episode.

  • First discussed was the 1944 Cleveland liquefied natural gas (LNG) fire. Engineers built a new LNG tank out of material that failed when exposed to cold. The escaping gas ignited, torching nearby homes and businesses. 128 people died. Engineers were not aware of the metal's failure properties, and absolutely no defensive measures were in place around the tank to protect civilian infrastructure.

    This disaster revealed the need to (1) implement plans and defenses to contain catastrophe, (2) monitor to detect problems and warn potential victims, and (3) thoroughly test designs against possible environmental conditions prior to implementation. These days LNG tanks are surrounded by berms capable of containing a complete spill, and they are closely monitored for problems. Homes and businesses are also located far away from the tanks.

  • Next came the 1981 Kansas City Hyatt walkway collapse that killed 114 people. A construction change resulted in an incredibly weak implementation that failed under load. Cost was not to blame; a part that might have prevented failure cost less than $1. Instead, lack of oversight, poor accountability, broken processes, a rushed build, and compromise of the original design resulted in disaster. This case introduced me to the term "structural engineer of record," a person who affixes a seal to the plans used to construct a building. The two engineers of record for the Hyatt plans lost their licenses.

    I wonder what would happen if network architectures were stamped by "security engineers of record?" If they were not willing to affix their stamp, that would indicate problems they could not tolerate. If they were willing to stamp a plan, and massive failure from poor design occurred, the engineer should be fired.

  • The third event was a massive sink hole in 1993 in an Atlanta Marriott hotel parking lot. A sewer drain originally built above ground decades earlier was buried 40 feet under the parking lot. A so-called "safety net" built under the parking lot was supposed to provide additional security by giving hotel owners time to evacuate the premises if a sink hole began to develop.

    Instead, the safety net masked the presence of the sink hole and let it enlarge until it was over 100 feet wide and beyond the net's capacity. Two people standing in the parking lot died when the sewer, sink hole, and net collapsed. This disaster demonstrated the importance of not operating a system (the sewer) outside of its operating design (above ground). The event also showed how products (the net) may introduce a false sense of security and/or unintended consequences.

  • Next came the 1931 Yangzi River floods that killed 145,000 people. The floods were the result of extended rain that overcame levees built decades earlier by amateur builders, usually farmers protecting their lands. The Chinese government's relief efforts were hampered by the Japanese invasion and subsequent civil war. This disaster showed the weaknesses of defenses built by amateurs, for which no one is responsible. It also showed how other security incidents can degrade recovery operations.

    Does your organization operate critical infrastructure that someone else built before you arrived? Perhaps it's the DNS server that no one knows how to administer. Maybe it's the time service installed on the Windows server that no one touches. What amateur levee is waiting to break in your organization?

  • The final disaster revolved around the deadly substance asbestos. The story began by extolling the virtues of asbestos, such as its resistance to heat. This extremely user-friendly feature resulted in asbestos deployments in countless products and locations. In 1924 a 33-year-old, 20-year textile veteran died, and her autopsy provided the first concrete evidence of asbestos' toxicity. A 1930 British study of textile workers revealed abnormally high numbers of asbestos-related deaths. As early as 1918 insurance companies were reluctant to cover textile workers due to their susceptibility to early death. As early as the 1930s the asbestos industry suppressed conclusions in research it sponsored when that research revealed asbestos' harmful effects.

    By 1972, the US Occupational Safety and Health Administration had arrived on the scene and chosen asbestos as the first substance it would regulate. Still, today there are hundreds of thousands of pending legal cases, and asbestos is not banned in the US. This case demonstrated the importance of properly weighing risks against benefits. It also showed the need to measure and monitor risks independently, rather than relying on a vendor's promises.

I believe all of these cases can teach us something useful about digital security engineering. The main difference between the first four cases and the digital security world is that failure in the analog world is blatantly obvious. Digital failures can be far more subtle; it may take weeks or months (or years) for security failures to be detected, unlike sink holes in parking lots. The fifth case, describing asbestos, is similar to digital security because harmful effects were not immediately apparent.
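
One modest way to start looking for the "amateur levees" mentioned above is to keep an inventory that records who is accountable for each service and when someone last reviewed it. The following is only a sketch in Python under stated assumptions: the Service record, the find_amateur_levees function, the one-year review threshold, and the sample inventory are all hypothetical names and values made up for illustration, not a description of any real tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Service:
    name: str
    owner: Optional[str]            # person or team accountable; None if nobody is
    last_reviewed: Optional[date]   # last time anyone checked the configuration


def find_amateur_levees(services: List[Service], today: date,
                        max_age_days: int = 365) -> List[str]:
    """Return names of services with no owner or no recent review.

    The 365-day default is an arbitrary assumption, not a standard.
    """
    flagged = []
    for svc in services:
        unowned = svc.owner is None
        stale = (
            svc.last_reviewed is None
            or (today - svc.last_reviewed).days > max_age_days
        )
        if unowned or stale:
            flagged.append(svc.name)
    return flagged


# Hypothetical inventory; names and dates are invented for illustration.
inventory = [
    Service("dns-server", None, None),
    Service("windows-time-service", "ops-team", date(2001, 3, 1)),
    Service("mail-gateway", "ops-team", date(2005, 6, 15)),
]

flagged = find_amateur_levees(inventory, today=date(2005, 9, 17))
print(flagged)
```

With this sample data the DNS server is flagged because no one owns it, and the time service is flagged because its last review is more than a year old. A list like this does not fix anything by itself, but it does put a name (or a conspicuous blank) next to each levee.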


Anonymous said...

I wonder what would happen if network architectures were stamped by "security engineers of record?"

Interesting thought. Without going into any detail regarding how this would occur, I'd think that there would be much simpler network designs, less ad hoc additions to the network, etc.

H. Carvey
"Windows Forensics and Incident Recovery"

Kim Cameron said...

I love this piece and picked it up at

Anonymous said...

Very interesting post! The parallels with digital security and IT work in general are pretty obvious. Planning, mitigation, accountability, rushing, underbudget, untested new technologies, error trapping (or assuming a break will occur and how to deal with it)...on and on.

As with all incidents like these, though, I caution that hindsight is 20/20, much like New Orleans of present and 9/11 of recent past.

I think the biggest problem in IT is managers and senior mgmt who do not properly understand or give proper respect to these issues. Instead, many of them want "insert latest buzzword here" nownownownow and at little cost. This leaves many competent admins (such as myself) with our hands tied and knowingly making inferior or incomplete efforts on projects, despite our high personal integrity of wanting to achieve high quality (which results in low morale).

Anonymous said...

Right, but the middle ground might be just as relevant; IOW, the Security Engineer of Record's summary comments might express reservations or observed deficiencies in the design. This would mean that someone would have to overrule the recommendations of the designer, and essentially go on record as having done so.

Jim Online said...

It is unfortunate that these accidents occurred because of some engineering failures. What's shocking and saddening about all these is that people died. I hope that these do not happen again.

Apocalypse Cow said...

I was trained in college to be a Chemical Engineer, and a lot of what I learned in those classes came over to be useful in my IT job, and made me a better sysadmin today, I think. The "engineer of record" is a good idea, but there is one large difference between civil engineering projects like the Hyatt hotel walkways or the LNG tanks and IT networks, and that is public danger. The lives of the public are not at risk with IT networks. However, their financial lives and identities often are at risk, and having an engineer of record would do a great deal to help secure networks and systems, and provide accountability to those who are responsible. It would also give power to the engineers, because they would have more of a chance (hopefully) to veto a project or an idea if they thought that the project would lead to endangering the network or the information held/transported on said network.
