Management by Fact: Flight Data Recorder for Windows
Whenever I fly I use the time to read ;login: magazine from USENIX. Chad Verbowski's article The Secret Lives of Computers Exposed: Flight Data Recorder for Windows in the April 2007 issue was fascinating. (Nonmembers can't access it until next year -- sorry.) Chad describes FDR:
Flight Data Recorder (FDR) collects events with virtually no system impact, achieves 350:1 compression (0.7 bytes per event), and analyzes a machine day of events in 3 seconds (10 million events per second) without a database. How is this possible, you ask? It turns out that computers tend to do highly repetitive tasks, which means that our event logs (along with nearly all other logs from Web servers, mail servers, and application traces) consist of highly repetitive activities. This is a comforting fact, because if they were truly doing 28 million distinct things every day it would be impossible for us to manage them.
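To see why repetition matters so much, here is a toy Python sketch of my own -- not FDR's actual on-disk format, and the event templates are invented -- that compresses a synthetic, highly repetitive event stream. With only a handful of distinct event templates, the per-event cost collapses far below the raw text size:

# Toy illustration (not FDR's real format): repetitive events compress well
# because most "events" are repeats of a small set of distinct templates.
import random
import zlib

# Hypothetical event templates a server might emit over and over.
templates = [
    "svchost.exe read HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion",
    "w3wp.exe opened C:\\inetpub\\wwwroot\\default.aspx",
    "lsass.exe queried SECURITY policy",
]

# One scaled-down "machine day" of highly repetitive events.
events = [random.choice(templates) for _ in range(100_000)]

raw = "\n".join(events).encode()
compressed = zlib.compress(raw, level=9)

print(f"raw bytes per event:        {len(raw) / len(events):.1f}")
print(f"compressed bytes per event: {len(compressed) / len(events):.2f}")
print(f"compression ratio:          {len(raw) / len(compressed):.0f}:1")

With only three templates the toy ratio comes out far better than 350:1, which is the point: the less distinct the activity, the cheaper it is to record everything.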
Ok, that's cool by itself. However, the insights gained from these logs are what I'd like to highlight.
Before investigating my own computer’s sordid life, I wanted to understand the state of what ought to be well-managed and well-maintained systems. To understand this I monitored hundreds of MSN production servers across multiple different properties. My goal was to learn how and when changes were being made to these systems, and to learn what software was running. Surely machines in this highly controlled environment would closely reflect the intentions of their masters? However, as you’ll see in the following, we found some of them sneaking off to the back of the server room for a virtual cigarette.
When I read this I remembered what I said in my recent Network Security Monitoring History post. The Air Force in the early 1990s thought it was pretty squared away. The idea behind deploying ASIM sensors was to "validate" the common belief that the Air Force network was "secure." When ASIM started collecting data, AFIWC and AFCERT analysts realized reality was far different.
In my post Further Thoughts on Engineering Disasters I mentioned management by belief (MBB) vs management by fact (MBF). With MBB you make decisions based on what you assume is happening. With MBF you make decisions based on what you measure to be happening. It's no accident the M in ASIM stands for Measurement.
This is exactly what Chad is doing with FDR -- moving from MBB to MBF:
To avoid problems, administrators form a secret pact they call lockdown, during which they all agree not to make changes to the servers for a specific period of time. The theory is that if no changes are made, no problems will happen and they can all try to enjoy their time outside the hum of the temperature-controlled data center.
Using FDR, I monitored these servers for over a year to check the resolve of administrators by verifying that no changes were actually made during lockdown periods. What I found was quite surprising: Each of the five properties had at least one lockdown violation during one of the eight lockdown periods. Two properties had violations in every lockdown period.
We’re not talking about someone logging in to check the server logs; these are modifications to core Line-Of-Business (LOB) and OS applications. In fact, looking across all the hundreds of machines we monitored, we found that most machines have at least one daily change that impacts LOB or OS applications. (emphasis added)
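The lockdown check itself is conceptually simple. Here is my own sketch of the idea -- not Chad's analysis code, and the properties, timestamps, and change descriptions are made up -- which just intersects observed change events with declared lockdown windows:

# Sketch of a lockdown-violation check (illustrative only, not FDR's code):
# flag any change event whose timestamp falls inside a declared lockdown window.
from datetime import datetime

# Hypothetical lockdown windows per property: (start, end) pairs.
lockdowns = {
    "property-a": [(datetime(2006, 11, 20), datetime(2006, 11, 27))],
    "property-b": [(datetime(2006, 12, 22), datetime(2007, 1, 2))],
}

# Hypothetical change events: (property, timestamp, description).
changes = [
    ("property-a", datetime(2006, 11, 23, 2, 14), "msvcr80.dll replaced"),
    ("property-b", datetime(2006, 12, 15, 9, 30), "web.config edited"),
]

def lockdown_violations(changes, lockdowns):
    """Return change events that landed inside a lockdown window."""
    violations = []
    for prop, when, what in changes:
        for start, end in lockdowns.get(prop, []):
            if start <= when <= end:
                violations.append((prop, when, what))
    return violations

for prop, when, what in lockdown_violations(changes, lockdowns):
    print(f"VIOLATION on {prop} at {when:%Y-%m-%d %H:%M}: {what}")

The hard part is not the comparison; it's having the always-on change events to compare in the first place, which is what FDR provides.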
That is an ITIL or Visible Ops nightmare. It gets better (or worse):
We would all expect server environments to be highly controlled: The only thing running should be prescribed software that has been rigorously tested and installed through a regulated process.
Using the FDR logs collected from the hundreds of monitored production servers, I learned which processes were actually running. Without FDR it is difficult to determine what is actually running on a system, which is quite different from what is installed. It turns out that only 10% of the files and settings installed on a system are actually used; consequently, very little of what is installed or sitting on the hard drives is needed.
Brief aside -- what a great argument for building a system up from scratch instead of trying to strip out unnecessary components!
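The "only 10% is actually used" observation is easy to reproduce in principle if you have both an installed-file manifest and a trace of files actually touched. A minimal sketch, with invented paths and numbers, just to show the comparison:

# Illustrative sketch: compare what is installed against what was actually
# touched during a monitoring window (paths here are invented examples).

installed = {
    r"C:\Windows\System32\kernel32.dll",
    r"C:\Windows\System32\rarely_used.dll",
    r"C:\Program Files\App\app.exe",
    r"C:\Program Files\App\help\manual.chm",
    r"C:\Program Files\App\samples\demo.dat",
}

# Files observed being opened or executed in the trace.
used = {
    r"C:\Windows\System32\kernel32.dll",
    r"C:\Program Files\App\app.exe",
}

usage_ratio = len(installed & used) / len(installed)

print(f"{usage_ratio:.0%} of installed files were actually used")
for path in sorted(installed - used):
    print(f"never touched: {path}")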
Reviewing a summary of the running processes, we found several interesting facts. Fully 29% of servers were running unauthorized processes. These ranged from client applications such as media players and email clients to more serious applications such as auto-updating Java clients. Without FDR, who can tell from where the auto-updating clients are downloading (or uploading?) files and what applications they run? Most troubling were the eight processes that could not be identified by security experts.
Again, facts show the world is not as it was assumed. Now remediation can occur.
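Remediation usually starts with something like an allowlist comparison. A rough sketch of that idea follows; the process names are made up, and a real check should key on full image paths and file hashes rather than bare names, which are trivial to spoof:

# Rough sketch of an "unauthorized process" check: compare observed process
# names against an approved baseline (illustrative names only).

approved = {"w3wp.exe", "sqlservr.exe", "svchost.exe", "lsass.exe"}

# Hypothetical set of processes observed running on a production server.
observed = {"w3wp.exe", "svchost.exe", "wmplayer.exe", "jusched.exe", "x9fk2.exe"}

for name in sorted(observed - approved):
    print(f"unauthorized process observed: {name}")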
Chad's closing thoughts are helpful:
For the past 20 years, systems management has been more of a “dark art” than a science or engineering discipline because we had to assume that we did not know what was really happening on our computer systems. Now, with FDR’s always-on tracing, scalable data collection, and analysis, we believe that systems management in the next 20 years can assume that we do know and can analyze what is happening on every machine. We believe that this is a key step to removing the “dark arts” from systems management.
The next step is to get some documentation posted on how to operationally use FDR, which is apparently in Vista. Comments are appreciated!
Update: MBB and MBF are concepts I learned from Visible Ops.
Comments
I have come to expect MBB as the rule and MBF as a kind of holy grail that the seasoned vets argue about at the end of a long day. Few people ever seem to realize or acknowledge their own violations (due to rationalizations, etc.), so expecting them to manage technology in a consistent and orderly manner is like hoping someone will operate a vehicle by fact and not belief. Good luck, if you know what I mean.
BTW, I give a talk on this once in a while that you might be interested in.