Tuesday, January 13, 2015

Does This Sound Familiar?

I read the following in the 2009 book Streetlights and Shadows:
Searching for the Keys to Adaptive Decision Making by Gary Klein. It reminded me of the myriad ways operational information technology and security processes fail.

This is a long excerpt, but it is compelling.

== Begin ==

A commercial airliner isn't supposed to run out of fuel at 41,000 feet. There are too many safeguards, too many redundant systems, too many regulations and checklists. So when that happened to Captain Bob Pearson on July 23, 1983, flying a twin-engine Boeing 767 from Ottawa to Edmonton with 61 passengers, he didn't have any standard flight procedures to fall back on.

First the fuel pumps for the left engine quit. Pearson could work around that problem by turning off the pumps, figuring that gravity would feed the engine. The computer showed that he had plenty of fuel for the flight.

Then the left engine itself quit. Down to one engine, Pearson made the obvious decision to divert from Edmonton to Winnipeg, only 128 miles away. Next, the fuel pumps on the right engine went.

Shortly after that, the cockpit warning system emitted a warning sound that neither Pearson nor the first officer had ever heard before. It meant that both the engines had failed.

And then the cockpit went dark. When the engines stopped, Pearson lost all electrical power, and his advanced cockpit instruments went blank, leaving him only with a few battery-powered emergency instruments that were barely enough to land; he could read the instruments because it was still early evening.

Even if Pearson did manage to come in for a landing, he didn't have any way to slow the airplane down. The engines powered the hydraulic system that controlled the flaps used in taking off and in landing. Fortunately, the designers had provided a backup generator that used wind power from the forward momentum of the airplane.

With effort, Pearson could use this generator to manipulate some of his controls to change the direction and pitch of the airplane, but he couldn't lower the flaps and slats, activate the speed brakes, or use normal braking to slow down when landing. He couldn't use reverse thrust to slow the airplane, because the engines weren't providing any thrust. None of the procedures or flight checklists covered the situation Pearson was facing.

  Pearson, a highly experienced pilot, had been flying B-767s for only three months-almost as long as the airplane had been in the Air Canada fleet. Somehow, he had to fly the plane to Winnipeg. However, "fly" is the wrong term. The airplane wasn't flying. It was gliding, and poorly. Airliners aren't designed to glide very well-they are too heavy, their wings are too short, they can't take advantage of thermal currents. Pearson's airplane was dropping more than 20 feet per second.

Pearson guessed that the best glide ratio speed would be 220 knots, and maintained that speed in order to keep the airplane going for the longest amount of time. Maurice Quintal, the first officer, calculated that they wouldn't make it to Winnipeg. He suggested instead a former Royal Canadian Air Force base that he had used years earlier. It was only 12 miles away, in Gimli, a tiny community originally settled by Icelanders in 1875.1 So Pearson changed course once again.

Pearson had never been to Gimli but he accepted Quintal's advice and headed for the Gimli runway. He steered by the texture of the clouds underneath him. He would ask Winnipeg Central for corrections in his heading, turn by about the amount requested, then ask the air traffic controllers whether he had made the correct turn. Near the end of the flight he thought he spotted the Gimli runway, but Quintal corrected him.

As Pearson got closer to the runway, he knew that the airplane was coming in too high and too fast. Normally he would try to slow to 130 knots when the wheels touched down, but that was not possible now and he was likely to crash.

Luckily, Pearson was also a skilled glider pilot. (So was Chesley Sullenberger, the pilot who landed a US Airways jetliner in the Hudson River in January of 2009. We will examine the Hudson River landing in chapter 6.) Pearson drew on some techniques that aren't taught to commercial pilots. In desperation, he tried a maneuver called a slideslip, skidding the airplane forward in the way ice skaters twist their skates to skid to a stop.

He pushed the yoke to the left, as if he was going to turn, but pressed hard on the right rudder pedal to counter the turn. That kept the airplane on course toward the runway. Pearson used the ailerons and the rudder to create more drag. Pilots use this maneuver with gliders and light aircraft to produce a rapid drop in altitude and airspeed, but it had never been tried with a commercial jet. The slide-slip maneuver was Pearson's only hope, and it worked.

  When the plane was only 40 feet off the ground, Pearson eased up on the controls, straightened out the airplane, and brought it in at 175 knots, almost precisely on the normal runway landing point. All the passengers and the crewmembers were safe, although a few had been injured in the scramble to exit the plane after it rolled to a stop.

The plane was repaired at Gimli and was flown out two days later. It returned to the Air Canada fleet and stayed in service another 25 years, until 2008.2 It was affectionately called "the Gimli Glider."

The story had a reasonably happy ending, but a mysterious beginning. How had the plane run out of fuel? Four breakdowns, four strokes of bad luck, contributed to the crisis.

Ironically, safety features built into the instruments had caused the first breakdown. The Boeing 767, like all sophisticated airplanes, monitors fuel flow very carefully. It has two parallel systems measuring fuel, just to be safe. If either channel 1 or channel 2 fails, the other serves as a backup.

However, when you have independent systems, you also have to reconcile any differences between them. Therefore, the 767 has a separate computer system to figure out which of the two systems is more trustworthy. Investigators later found that a small drop of solder in Pearson's airplane had created a partial connection in channel 2. The partial connection allowed just a small amount of current to flow-not enough for channel 2 to operate correctly, but just enough to keep the default mode from kicking in and shifting to channel 1.

The partial connection confused the computer, which gave up. This problem had been detected when the airplane had landed in Edmonton the night before. The Edmonton mechanic, Conrad Yaremko, wasn't able to diagnose what caused the fault, nor did he have a spare fuel-quantity processor. But he had figured out a workaround. If he turned channel 2 off, that circumvented the problem; channel 1 worked fine as long as the computer let it.

The airplane could fly acceptably using just one fuel-quantity processor channel. Yaremko therefore pulled the circuit breaker to channel 2 and put tape over it, marking it as inoperative. The next morning, July 23, a crew flew the plane from Edmonton to Montreal without any trouble.

The second breakdown was a Montreal mechanic's misguided attempt to fix the problem. The Montreal mechanic, jean Ouellet, took note of the problem and, out of curiosity, decided to investigate further. Ouellet had just completed a two-month training course for the 767 but had never worked on one before. He tinkered a bit with the faulty Fuel Quantity Indicator System without success. He re-enabled channel 2; as before, the fuel gauges in the cockpit went blank. Then he got distracted by another task and failed to pull the circuit breaker for channel 2, even though he left the tape in place showing the channel as inoperative. As a result, the automatic fuel-monitoring system stopped working and the fuel gauges stayed blank.

  A third breakdown was confusion about the nature of the fuel gauge problem. When Pearson saw the blank fuel gauges and consulted a list of minimum requirements, he knew that the airplane couldn't be flown in that condition. He also knew that the 767 was still very new-it had first entered into airline service in 1982. The minimum requirements list had already been changed 55 times in the four months that Air Canada had been flying 767s. Therefore, pilots depended more on the maintenance crew to guide their judgment than on the lists and manuals.

Pearson saw that the maintenance crews had approved this airplane to keep flying despite the problem with the fuel gauges. Pearson didn't understand that the crew had approved the airplane to fly using only channel 1. In talking with the pilot who had flown the previous legs, Pearson had gotten the mistaken impression that the airplane had just flown from Edmonton to Ottawa to Montreal with blank fuel gauges. That pilot had mentioned a "fuel gauge problem." When Pearson climbed into the cockpit and saw that the fuel gauges were blank, he assumed that was the problem the previous pilot had encountered, which implied that it was somehow acceptable to continue to operate that way.

The mechanics had another way to provide the pilots with fuel information. They could use a drip-stick mechanism to measure the amount of fuel currently stored in each of the tanks, and they could manually enter that information into the computer. The computer system could then calculate, fairly accurately, how much fuel was remaining all through the flight.

In this case, the mechanics carefully determined the amount of fuel in the tanks. But they made an error when they converted that to weight. This error was the fourth breakdown.

Canada had converted to the metric system only a few years earlier, in 1979. The government had pressed Air Canada to direct Boeing to build the new 767s using metric measurements of liters and kilograms instead of gallons and pounds-the first, and at that time the only, airplane in the Air Canada fleet to use the metric system. The mechanics in Montreal weren't sure about how to make the conversion (on other airplanes the flight engineer did that job, but the 767 didn't use a flight engineer), and they got it wrong.

In using the drip-stick measurements, the mechanics plugged in the weight in pounds instead of kilograms. No one caught the error. Because of the error, everyone believed they had 22,300 kg of fuel on board, the amount needed to get them to Edmonton, but in fact they had only a little more than 10,000 kg-less than half the amount they needed.

  Pearson was understandably distressed by the thought of not being able to monitor the fuel flow directly. Still, the figures had been checked repeatedly, showing that the airplane had more fuel than was necessary. The drip test had been repeated several times, just to be sure.

That morning, the airplane had gotten approval to fly from Edmonton to Montreal despite having fuel gauges that were blank. (In this Pearson was mistaken; the airplane used channel 1 and did have working fuel gauges.) Pearson had been told that maintenance control had cleared the airplane.

The burden of proof had shifted, and Pearson would have to justify a decision to cancel this flight. On the basis of what he knew, or believed he knew, he couldn't justify that decision. Thus, he took off, and everything went well until he ran out of fuel and both his engines stopped.

== End ==

This story is an example that one cannot build "unhackable systems." I also believe this story demonstrates that operational and decision-based failures will continue to plague technology. It is no use building systems that theoretically "have no vulnerabilities" so long as people operate and make decisions based on use of those systems.

If you liked this post, I've written about engineering disasters in the past.

You can but the book which published this story at Amazon.com.

1 comment:

mustu said...

Awesome excerpt to make the point. Thanks