Saturday, January 27, 2007

My Investigative Process Using NSM

I know some of you believe that my Network Security Monitoring (NSM) methodology works and is the best option available for independent, self-reliant, network-centric collection, analysis, and escalation of security events. Some of you think NSM is impossible, a waste of time, irrelevant, whatever. I thought I would offer one introductory case based on live data from my cable line demonstrating my investigative process. Maybe after seeing how I do business the doubters will either think differently (doubtful) or offer their own answer to this problem: how do I know what happened in my enterprise?

(Please: I don't want to hear people complain that I'm using data from a cable line with one public target IP address; I'm not at liberty to disclose data from my client networks in order to satisfy the perceived need for bigger targets. The investigative methodology is the same. Ok?)

As shown in the figure at left, I'm using Sguil for my analysis. I'm not going to screen-capture the whole investigation, but I will show the data involved. I begin this investigation with an alert generated by the Bleeding Threat ruleset. This is a text-based representation of the alert from Sguil.

Count:1 Event#1.78890 2007-01-15 03:17:36
BLEEDING-EDGE DROP Dshield Block Listed Source ->
IPVer=4 hlen=5 tos=32 dlen=48 ID=39892 flags=2 offset=0 ttl=104 chksum=57066
Protocol: 6 sport=1272 -> dport=4899

Seq=2987955435 Ack=0 Off=7 Res=0 Flags=******S* Win=65535 urp=35863 chksum=0

I'm not using Sguil or Snort to drop anything, but I want to know why this alert fired. Using Sguil I can see exactly how Snort decided to fire this alert.

alert ip [,,,,,,,,,,,,,,,,,,,]
any -> $HOME_NET any (msg:"BLEEDING-EDGE DROP Dshield Block Listed Source - BLOCKING";
reference:url,; threshold: type limit, track by_src, seconds 3600,
count 1; sid:2403000; rev:319; fwsam: src, 72 hours;)

/nsm/rules/cel433/bleeding-dshield-BLOCK.rules: Line 32

Wow, that's a big list of IPs. The source IP in this case falls within one of the class C networks listed in the Snort rule, so Snort fired. Who is the source? Again, Sguil shows me without launching any new windows or Web tabs.
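The netblock match behind this alert can be sketched in a few lines of Python. This is only an illustration of the logic, not how Snort implements it. The networks and addresses below are reserved documentation ranges (RFC 5737) that I'm using as hypothetical stand-ins for the real values, and the function name is my own.

```python
import ipaddress

# Hypothetical stand-ins for the block-listed class C networks in the rule.
BLOCK_LIST = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def matching_netblock(src_ip, block_list=BLOCK_LIST):
    """Return the first listed network containing src_ip, or None."""
    addr = ipaddress.ip_address(src_ip)
    for net in block_list:
        if addr in net:
            return net
    return None


if __name__ == "__main__":
    # One address inside a listed /24 and one outside every listed /24.
    for ip in ("203.0.113.77", "192.0.3.1"):
        print(ip, "->", matching_netblock(ip))
```

Any packet whose source lands in one of the listed networks matches, whatever the ports or payload, which is why the rule says nothing about what the source was actually trying to do.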

% [ node-2]
% Whois data copyright terms

inetnum: -
person: Nguyen Manh Hai
nic-hdl: NMH2-AP
address: Vietel Corporation
address: 47 Huynh Thuc Khang, Dong Da District, Hanoi City
phone: +84-4-2661278
fax-no: +84-4-2671278
country: VN
changed: 20040825
source: APNIC

Vietnam -- probably not someone I visited and no one from whom I would expect traffic.

So I know the source IP and I know exactly (not to be taken for granted given other systems) why this alert appeared. The question now is, should I care? If I were restricted to using an alert-centric system (as described earlier), my main option would now be to query for more alerts. I'll spare the description and say this is the only alert from this source. In the alert-centric world, that's it -- end of investigation.

At this point my log-centric friends might say "Check the logs!" That's a fine idea, but what if there aren't any logs? Does this mean the Vietnam box didn't do anything else, or that it did act but generated no logs? That's an important point.

In the NSM world I have two options: check session data, and check full content data. Let's check full content first by right-clicking and asking Sguil to fetch Libpcap data into Wireshark.

Here I could show another cool screen shot of Wireshark, but the data is the important element. Since Sguil copies it to a local directory I specify, I'll just re-read it with Tshark.

1 2007-01-14 22:17:36.162428 -> TCP 1272 > 4899
[SYN] Seq=0 Len=0 MSS=1460
2 2007-01-14 22:17:36.162779 -> TCP 4899 > 1272
[RST, ACK] Seq=0 Ack=1 Win=0 Len=0
3 2007-01-14 22:17:37.230040 -> TCP 1272 > 4899
[SYN] Seq=0 Len=0 MSS=1460
4 2007-01-14 22:17:37.230393 -> TCP 4899 > 1272
[RST, ACK] Seq=0 Ack=1 Win=0 Len=0

Not very exciting -- apparently two SYNs and two RST ACK responses. As we could have recognized from the original alert, port 4899 TCP activity is really old, and since I know my network (or at least I think I do), I know I'm not offering 4899 TCP to anyone. But how do you know for all the systems you administer -- or watch -- or don't watch? This full content data, specific to the alert generated but collected independently of the alert, tells us we don't need to worry about this specific event.
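For anyone who would rather script that judgment than eyeball it, here is a minimal sketch of the reasoning applied to the trace: a SYN answered by a RST means the port is closed and nothing was exchanged. The Packet type and the classify_attempt function are hypothetical names of my own, not part of Sguil, Wireshark, or Tshark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    from_client: bool   # True if sent by the host that initiated the attempt
    flags: frozenset    # TCP flag names, e.g. {"SYN"} or {"RST", "ACK"}


def classify_attempt(packets):
    """Classify a TCP connection attempt from its flag sequence.

    Returns "established" if the three-way handshake completed,
    "refused" if the target answered a SYN with a RST, and
    "incomplete" otherwise (no reply, or the handshake never finished).
    """
    syn_seen = synack_seen = False
    for pkt in packets:
        if pkt.from_client and pkt.flags == {"SYN"}:
            syn_seen = True
        elif not pkt.from_client and pkt.flags == {"SYN", "ACK"} and syn_seen:
            synack_seen = True
        elif not pkt.from_client and "RST" in pkt.flags and syn_seen:
            return "refused"
        elif pkt.from_client and "ACK" in pkt.flags and synack_seen:
            return "established"
    return "incomplete"


if __name__ == "__main__":
    # The four packets from the trace: SYN, RST/ACK, SYN, RST/ACK.
    trace = [
        Packet(True, frozenset({"SYN"})),
        Packet(False, frozenset({"RST", "ACK"})),
        Packet(True, frozenset({"SYN"})),
        Packet(False, frozenset({"RST", "ACK"})),
    ]
    print(classify_attempt(trace))
```

A result of "refused" is the scripted version of "we don't need to worry about this specific event"; an "established" result would be the cue to keep digging.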

This next idea is crucial: just because we have no other alerts from a source, it does not mean this event is all the activity that host performed. With this IP-based Snort alert, we can have some assurance that no other activity occurred because Snort will tell us when it sees packets from that netblock. If we weren't using an IP-based alert, we could query session data, collected independently of alert and full content data, to see what else the attacking host -- or target host -- did.

(Yes, these sections are heavy on the bolding and even underlining, because after five years of writing about this methodology a lot of people still don't appreciate the real-world problems faced by people investigating network incidents.)

I already said we're confident nothing else happened from the attacker here because Snort is triggering on its specific netblock. That sort of alert is a minuscule fraction of the entire rule base. Normally I would find other activity by querying session data and getting results like this from Sguil and SANCP.

Sensor:cel433 Session ID:5020091160069307011
Start Time:2007-01-15 03:17:36 End Time:2007-01-15 03:17:37 ->
Source Packets:2 Bytes:0
Dest Packets:2 Bytes:0

As you can see, Sguil and SANCP have summarized the four packets shown earlier. There's nothing else. Now I am sure the intruder did not do anything else, at least from within the time frame I queried. For added confidence I could query on a time range involving the target to look for suspicious activity to or from other IPs, in the event the intruder switched source IPs.
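To make the idea of session data concrete, here is a rough sketch of how four packets collapse into one record with per-direction packet and byte counts. It illustrates the concept only and is not how SANCP works internally. The Session type, the summarize function, and the addresses (RFC 5737 documentation ranges) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    src: str
    sport: int
    dst: str
    dport: int
    start: float
    end: float
    src_packets: int = 0
    src_bytes: int = 0
    dst_packets: int = 0
    dst_bytes: int = 0


def summarize(packets):
    """Fold (time, src, sport, dst, dport, payload_len) tuples into sessions.

    The first packet seen for a conversation defines the session's source.
    """
    sessions = {}
    for ts, src, sport, dst, dport, length in packets:
        key = (src, sport, dst, dport)
        reverse = (dst, dport, src, sport)
        if key in sessions:            # another packet from the source side
            s = sessions[key]
            s.src_packets += 1
            s.src_bytes += length
        elif reverse in sessions:      # a reply from the destination side
            s = sessions[reverse]
            s.dst_packets += 1
            s.dst_bytes += length
        else:                          # first packet of a new conversation
            s = sessions[key] = Session(src, sport, dst, dport, ts, ts, 1, length)
        s.end = max(s.end, ts)
    return list(sessions.values())


if __name__ == "__main__":
    # Times are seconds within the minute; both addresses are placeholders.
    attacker, target = "203.0.113.77", "192.0.2.10"
    packets = [
        (36.162428, attacker, 1272, target, 4899, 0),
        (36.162779, target, 4899, attacker, 1272, 0),
        (37.230040, attacker, 1272, target, 4899, 0),
        (37.230393, target, 4899, attacker, 1272, 0),
    ]
    for s in summarize(packets):
        print(s)
```

Because records like this are built for every conversation regardless of whether a rule fires, they can answer "what else did this host do?" for traffic that never generated an alert.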

I have plenty of other cases to pursue, but I'll stop here. What do you think? I expect to hear from people who say "That takes too long," "It's too manual," etc. I picked a trivial example with a well-defined alert and a tiny scope so I could avoid spending most of Saturday night writing this blog post. Think for a moment how you could expand this methodology and the importance of this sort of data to more complex cases and I think you'll give better feedback. Thank you!


Joe said...

I think you're right on the money. Relying on alerts only would not allow you to notice that this guy is trying out some freshly compiled 0-days against your box.

However, every time I see full packet capture, I sigh because I don't have the hardware to do such (start dropping packets). I can collect session data, but not full content. What's your opinion of that?

Michael said...

I do think this is too much work for every incident.

I tried to do a track back on a post I did on my blog but evidently you don't accept trackbacks so I've linked it below:


I'd like to comment on your comment as well; I doubt many shops are truly equipped to handle zero-day exploits, let alone accurately detect them. In fact, I'd argue that if a shop truly has the ability to detect and appropriately react to a zero-day exploit, then they most definitely aren't alert-driven incident handlers.

Typically, honeypots are deployed to collect data for zero-day research and if a shop is deploying honeypots, they certainly better have every other aspect of security well covered, like solid perimeter security, end system security, and incident handling, to name a few. Honeypots and zero-day research are fairly advanced things that should come last for an enterprise (not to mention the inherent risks honeypots themselves pose to the host network).

Richard Bejtlich said...


Where did I say I do this for every incident? Nowhere. Even if I did do this amount of work for "every incident," you're assuming (1) I'm handling a large number of alerts and (2) this process takes a long time. Both assumptions would also be wrong.

Richard Bejtlich said...


Also --

In your post you criticize me investigating "a relatively insignificant alert." If you read the last paragraph of my post I say:

"I picked a trivial example with a well-defined alert and a tiny scope so I could avoid spending most of Saturday night writing this blog post."

Next time I can choose something more complicated, if that would make you happier! :)

Michael said...

I think you've gotten a bit defensive. You said yourself you expect people to challenge the real world usefulness of NSM.

You provided an incident and clearly stated it was an example of how NSM is to be done. Granted you said it was a trivial example but it's still a poor example to make your point because the example doesn't indicate an attack (at any level) and you spent too much time in your investigative process making a decision about the incident.

It would have been better to provide fictional data or even a staged attack to better demonstrate the process.

Your opening paragraph indicated you were interested in how others find the answer to the problem "how do I know what happened..."

I thought you seriously wanted to know.

Richard Bejtlich said...


You think I'm being "defensive" because you don't understand what I'm trying to say. I'm demonstrating a method -- in this case, investigating via alert -> full content -> session -> whatever -- that is impossible with alert-centric systems but possible with NSM. The nature of the event in question is absolutely irrelevant.

skippylou said...


I couldn't agree more with the principles and practices used when investigating incidents with NSM and your blog post.

After reading Michael's comments on his page, the one thing that immediately came to mind and which you also pointed out, is "prevention eventually fails".

Maybe if you are a one person IT shop, and are the only one responsible for patching, deploying, securing, etc. systems - then you can feel *a bit* better about your security posture. But this quickly breaks down when multiple humans are involved, even with change control systems, procedures, etc.

The single greatest thing about having NSM deployed, is that "good" feeling you get when you know you have more data to look at after seeing a particularly frightening alert. You aren't stuck wondering what happened, having to figure out if a server someone else deployed is properly patched, or any number of other scenarios. You have the data there! I certainly don't collect full content data everywhere, because I can't, but as much as my hardware and network allow, I will now always collect as much session and full content data as I can.

Bravo to your blog, NSM, Sguil, and the countless other tools that have made my job both more enjoyable and easier.


LonerVamp said...

I think this is excellent. Granted, some of the end results for small events are no different than an alert-based system, but doing this more manually adds much more trust and intuition into the equation.

Over time, an analyst can begin to tune the process just like an alert-based IDS is tuned. You get a feel for what is odd, but normal, for the network in question.

To me, an alert-based method is like getting coffee at McDonalds. You get exactly what you ask for and pay for, and they're designed for maximum efficiency. But they are not designed for maximum comfort, satisfaction, and quality. That's where your Starbucks coffee experience will come from. That's a very rough Monday morning analogy, but it works on some levels.

Besides, on another level, doing this stuff manually adds value to your staff. Responding to alerts does not teach much of anything, and those alert response skills won't apply anywhere else. But responding with the methods given above will equip admins with skills that can be used in many situations, from troubleshooting applications, networks, incidents, and so on. That knowledge and empowerment is worth the time spent.

David Bianco said...

If anyone's tempted to spring the "but just try it on a real-world attack" argument, I would like to remind them to view the Shmoocon 2006 presentation Richard and I gave, in which I clearly lay out the NSM process I used to investigate a real world hack attempt.

I assure you, this was a "real world" scenario, and it was plucked from a pretty sizable campus area network. Without Sguil/NSM, there's simply no way I could have gotten all the data I needed in order to investigate this. And all in just a few minutes, too. Yes, NSM works in the real world, for real networks.

Michael said...

This has really spun out...

I'm not arguing against the NSM process itself. I'm not arguing against logging and monitoring.

I'm simply arguing that the example used severely detracts from the explanation because the example is all of two SYN packets destined for a host not listening on the target port.

Richard, did you even read the first two paragraphs of my post?

In a final attempt to try to re-align the cosmos and help clarify my argument, I've stated the following on my blog:

"My point was and still is that if you’re going to explain something with an example, it matters tremendously that you use a relevant example if you want to clearly state your point. When you go to an auto racing driving school, the instructors don’t put you in a pinto and say ‘while you’re trying to learn the process of driving a fast car, you also have to imagine yourself driving a fast car.’

When you go to firearm classes to learn to shoot a real firearm, they don’t hand you a pop gun and say ‘this is what it's like to fire a real weapon, you just have to imagine it's a real weapon.’"

@Skippylou - I agree and am not arguing that prevention is 100%. Hopefully you've read my post on my blog as well as my comments here to get a full picture of where I'm coming from.

@David Bianco - I'm not arguing that NSM would break down in the real world. In fact, in my post I stated clearly that NSM "is a thorough and very successful process."

Vivek Rajan said...


Just want to share a simple but interesting NSM incident.

We are not security experts, yet we were called to help someone who was having a major problem with the notorious "simpleboard" attack on their webserver (description here). This is just a cross-site scripting variant that impacted two PHP scripts. We fixed the scripts, looked around the directory, and we thought everything was ok.

Just on a hunch we started looking at some captured TCP sessions during the attack period. We found that the attacker (correlated by IP address) had stored several MB of offensive material in an obscure directory. This did not show up in the apache logs.

We did not have access to Snort. Even if we did, we could not have figured out the directory because it was embedded in the POST data.

Anonymous said...

You know, I believe in NSM in a very thorough and realistic manner. Most recently, I spent some time dealing with a client who is unable to fully deal with the idea. They collect only session data at my request, which is actually enough to figure out what happened here. Their eyes have been opened a bit by an ordeal involving some mismanagement and no idea of how to contain an incident involving every windows machine on their network (>60 hosts).

So I was dispatched at 9pm because everything was "running slow" and "acting funny" -- the end users gave the clues that it was all bad when they'd noticed the VoIP phone system (running on windows) doing some interestingly bad things, as well as random workstations rebooting when it wasn't patch day. Nobody on site could figure out what was going on; they suspected it was a regularly occurring screw-up on the part of the windows-based architecture.

I hurriedly hooked a machine up to one of the bladed switches and grabbed a monitor port and pointed it at the firewall ingress/egress port, then fired up argus 3.0 and tcpdump with some heavy filtering.

What I found was quite astonishing, even having seen the things that I have over time:

An amazing amount of traffic flowing to the vast majority of machines on the LAN to port 2967 (TCP); and a handful to 5888 (TCP) and some others that looked like encrypted tunnels with keepalive packets. The machines were also scanning for open VNC ports externally. I started doing reverse lookups on the IPs and found that the common factor was that they all ran windows.
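The kind of tally that makes a port like 2967 jump out of session data can be sketched as below. This is a hedged illustration only: the flow tuples, the private addresses, and the function name are hypothetical, and the sketch does not reproduce argus output or its record format.

```python
from collections import Counter


def top_destination_ports(flows, n=3):
    """Return the n most common (protocol, dport) pairs with their flow counts.

    Each flow is a (src, dst, protocol, dport) tuple.
    """
    counts = Counter((proto, dport) for _src, _dst, proto, dport in flows)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up flows: one outside host touching many LAN machines on 2967/TCP.
    flows = [("198.51.100.9", f"10.0.0.{i}", "tcp", 2967) for i in range(1, 6)]
    flows += [("198.51.100.9", f"10.0.0.{i}", "tcp", 5888) for i in range(1, 3)]
    flows += [("10.0.0.20", "192.0.2.80", "tcp", 80)]
    for (proto, port), count in top_destination_ports(flows):
        print(f"{port}/{proto}: {count} flows")
```

Sorting flows by destination port is crude, but a single port hitting most of the LAN is hard to miss once the counts are side by side.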

Seeing that much, I spent 20mins and built a firewall with FreeBSD 6.2 and pf on a random p4 that was lying around, and started blocking just the inbound/outbound traffic that looked suspicious upon initial investigation. Immediately, everything started to run a little more smoothly.

More digging on the machine that was pre-existing and collecting session data showed that the compromise happened in late 2006 and hadn't shown symptoms yet. I narrowed the problem down to a static NAT in a cisco PIX that has been mismanaged from the start, and that I've recommended they remove in favor of an open source solution that runs freebsd/pf in a similar fashion to what I built.

The deal was, they had long ago set up a static NAT configuration to one internal IP to serve web traffic from, not knowing how to forward just one port through. That machine had since been taken down for service, but they'd left the static NAT configuration in place. The web server happened to be using an IP in the DHCP pool and it was later picked up by a random windows workstation.

For those who haven't already done the research, port 2967/TCP happens to be the update service port for Symantec's Corporate AV. Seems there was a vulnerability in this service a while ago and it wasn't disclosed very well and definitely not to this client's admin. The "auto-update" functionality in this particular AV software doesn't update the core software, just the scanning engine and the virus definitions. A very competent tech had earlier in the evening picked a random windows workstation that was exhibiting the problems and started live response on it, trying to figure out what was wrong.

Two hours in, they'd still not discovered the initial hole and the firewall was already in place containing the traffic inside the perimeter and the full extent of the compromise was known. Every machine running windows and the Symantec AV client was compromised. We'd already started cleaning up the machines starting at the local server that hosted the AV update service, then on to high profile workstations (management, development, and then the technicians last). By the time it was said and done, host-based inspection was effectively worthless except via random sampling. We knew exactly which hosts were affected in less than the first hour, and what to do about it.

NSM is a very powerful thing, and even if not fully practiced has an interesting effect on the way you look at things. This system was "alert based", alright. The "alerts" were generated by the people who used the machines and had an escalation procedure to follow when things didn't look right. It didn't use Sguil, although it would have been nice; but the host-based approach was clearly the wrong one to take in this situation.

I argue that "too much work" is far too subjective to claim here or anywhere else.
