While flying to Los Angeles this week I read a great paper by Microsoft and Michigan researchers: Reclaiming Network-wide Visibility Using Ubiquitous Endsystem Monitors. From the abstract:
Network-centric tools like NetFlow and security systems like IDSes provide essential data about the availability, reliability, and security of network devices and applications. However, the increased use of encryption and tunnelling has reduced the visibility of monitoring applications into packet headers and payloads (e.g. 93% of traffic on our enterprise network is IPSec encapsulated). The result is the inability to collect the required information using network-only measurements.
To regain the lost visibility we propose that measurement systems must themselves apply the end-to-end principle: only endsystems can correctly attach semantics to traffic they send and receive. We present such an end-to-end monitoring platform that ubiquitously records per-flow data and then we show that this approach is feasible and practical using data from our enterprise network.
This is cool. How does it work?
Each endsystem in a network runs a small daemon that uses spare disk capacity to log network activity. Each desktop, laptop and server stores summaries of all network traffic it sends or receives. A network operator or management application can query some or all endsystems, asking questions about the availability, reachability, and performance of network resources and servers throughout the organization...
Ubiquitous network monitoring using endsystems is fundamentally different from other edge-based monitoring: the goal is to passively record summaries of every flow on the network rather than to collect availability and performance statistics or actively probe the network...
It also provides a far more detailed view of traffic because endsystems can associate network activity with host context such as the application and user that sent a packet. This approach restores much of the lost visibility and enables new applications such as network auditing, better data centre management, capacity planning, network forensics, and anomaly detection.
Using real data from an enterprise network we present preliminary results showing that instrumenting, collecting, and querying data from endsystems in a large network is both feasible and practical.
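To make the idea of per-flow summaries with host context concrete, here is a minimal sketch in Python. This is my own illustration, not the authors' implementation: the names FlowKey, FlowSummary, FlowTable, record_packet, and export are all hypothetical, and I assume the daemon can already learn which application and user own each packet.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple, Tuple


class FlowKey(NamedTuple):
    """Identifies a flow; the last two fields are host context that only
    the endsystem can supply (hypothetical field names)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    application: str  # process that owns the socket
    user: str         # account that owns the process


@dataclass
class FlowSummary:
    packets: int = 0
    bytes: int = 0
    first_seen: float = 0.0
    last_seen: float = 0.0


class FlowTable:
    """In-memory table of flow summaries, emptied on each export."""

    def __init__(self) -> None:
        self._flows: Dict[FlowKey, FlowSummary] = {}

    def record_packet(self, key: FlowKey, size: int, timestamp: float) -> None:
        # Update the running summary instead of storing the packet itself.
        summary = self._flows.get(key)
        if summary is None:
            summary = FlowSummary(first_seen=timestamp)
            self._flows[key] = summary
        summary.packets += 1
        summary.bytes += size
        summary.last_seen = timestamp

    def export(self) -> List[Tuple[FlowKey, FlowSummary]]:
        # Return all summaries and reset the table; a real daemon would
        # call this on an export timer and append the records to local disk.
        records = list(self._flows.items())
        self._flows.clear()
        return records
```

The point of the sketch is that the daemon keeps one small record per flow rather than per packet, so the amount written to disk depends on the number of flows, not the number of packets.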
For example, our own enterprise network contains approximately 300,000 endsystems and 2,500 routers. While it is possible to construct an endsystem monitor in an academic or ISP network there are significant additional deployment challenges that must be addressed. Thus, we focus on deployment in enterprise and government networks that have control over software and a critical need for better network visibility...
Even under ideal circumstances there will inevitably be endsystems that simply cannot easily be instrumented, such as printers and other hardware running embedded software. Thus, a key factor in the success of this approach is obtaining good visibility without requiring instrumentation of all endsystems in a network. Even if complete instrumentation were possible, deployment becomes significantly more likely where incremental benefit can be observed...
[I]nstrumenting just 1% of endsystems was enough to monitor 99.999% [of] bytes on the network. This 1% is dominated by servers of various types (e.g. backup, file, email, proxies), common in such networks.
Wow -- in other words, just pick the right systems to instrument and you end up capturing a LOT of traffic.
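Here is a rough sketch of how one might estimate that kind of coverage from a list of flows. It rests on my own assumptions, not the paper's method: a flow counts as visible if either endpoint is instrumented, and the hosts to instrument are simply the ones that touch the most bytes. The function names coverage and top_talkers are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Flow = Tuple[str, str, int]  # (source host, destination host, bytes)


def top_talkers(flows: Iterable[Flow], n: int) -> Set[str]:
    """Pick the n hosts that send or receive the most bytes."""
    volume: Counter = Counter()
    for src, dst, size in flows:
        volume[src] += size
        volume[dst] += size
    return {host for host, _ in volume.most_common(n)}


def coverage(flows: Iterable[Flow], instrumented: Set[str]) -> float:
    """Fraction of bytes seen when only `instrumented` hosts run a monitor.

    Assumes a flow is visible if at least one endpoint is instrumented.
    """
    flows = list(flows)
    total = sum(size for _, _, size in flows)
    if total == 0:
        return 0.0
    seen = sum(
        size
        for src, dst, size in flows
        if src in instrumented or dst in instrumented
    )
    return seen / total


if __name__ == "__main__":
    # Made-up traffic: two servers carry nearly all the bytes.
    sample = [
        ("pc1", "fileserver", 900),
        ("pc2", "fileserver", 800),
        ("pc3", "mail", 600),
        ("pc4", "mail", 400),
        ("pc1", "pc2", 10),
    ]
    chosen = top_talkers(sample, 2)
    print(sorted(chosen), f"{coverage(sample, chosen):.4f}")
```

In traffic shaped like this, where clients mostly talk to a handful of servers, instrumenting only the servers misses just the small amount of client-to-client traffic.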
How heavy is the load?
To evaluate the per-endsystem CPU overhead we constructed a prototype flow capture system using the ETW event system [Event Tracing for Windows]. ETW is a low overhead event posting infrastructure built into the Windows OS, and so a straightforward usage where an event is posted per-packet introduces overhead proportional to the number of packets per second processed by an endsystem.
We computed observed packets per second over all hosts, and the peak was approximately 18,000 packets per second and the mean just 35 packets per second. At this rate of events, published figures for ETW [Magpie] suggest an overhead of no more than a few percent on a reasonably provisioned server...
[F]or a 1 second export period there are periods of high traffic volume requiring a large number of records be written out. However, if the export timer is set at 300 seconds, the worst case disk bandwidth required is ≃4.5 MB in 300 seconds, an average rate of 12 kBps.
The maximum storage required by a single machine for an entire week of records is ≃1.5 GB, and the average storage just ≃64 kB. Given the capacity and cost of modern hard disks, these results indicate very low resource overhead.
This is great. I emailed the authors to see if they have an implementation I could test. The home for this work appears to be the Microsoft Anemone Project.