Tuesday, August 11, 2009

2009 CDX Data Sets Posted

Earlier this year I posted Thoughts on 2009 CDX. Greg Conti just sent me a notice that the West Point Information Technology and Operations Center just published, for free, their Intrusion Detection Labeled Data Sets. They include packet captures generated by NSA Red Team activity, packet captures from West Point defenders, and Snort, DNS, Web server, and host logs. This is great data. Stop using the 1999 DARPA data sets. Please.

9 comments:

Dan Weber said...

As someone who worked on creating those DARPA data sets, I must say: yes, please, stop using them!

You wouldn't try to use Windows 95A today. Don't use ancient datasets, especially when there were lots of lessons learned after we made them. (TTLs, sigh.)

Erik H said...

Thanks for this info Richard! The data in the capture files look really useful (I haven't looked through them all yet though)!

I’ve added the link to the list of "Publicly available PCAP files" on the NetworkMiner wiki.

James said...

I took a brief look as well, can’t wait to dig a little deeper. Do you know if there are explanations of the "attacks" and the corresponding captures?

Richard Bejtlich said...

James, not sure. Maybe Greg will comment?

Greg Conti said...

James, we have some more details in the paper (linked from the dataset page), but probably not to the level you'd like to see :) We requested detailed attack information from the red team (they have been very supportive), but their (hand-generated) logs were a bit too informal to share. (They were in the heat of battle on their end too :) I view this data capture as an annual event, so we are interested in ideas regarding what type of data to capture in order to provide the most value. Next year, we plan on working with the red team to try and generate a solid attack log from the attacker's perspective, which would be very valuable.

Greg Conti said...

After thinking about the idea of using competitions to generate valuable/labeled datasets for the past year or so, we believe it is possible, but not trivial, to do it right. The first trick is instrumenting the competition correctly. It is reasonable to place sensors at strategic locations on the network, but a better solution would include instrumented workstations (think keystroke logging), hard drive image snapshots, firewall configuration file/log snapshots, etc., which would require more research to do correctly. The second trick is to incentivize participants so that they will participate and generate the type of data you hope to collect. We believe both challenges can largely be overcome, but success will require iterative attempts, and perhaps a set of canonical game design patterns will emerge.

Alex Raitz said...

Great to see Splunk in use at these exercises! Let us know what we can do to improve the product in regard to this use case.

steven said...

This comment has been removed by a blog administrator.

Toomas Salus said...

Is there a paper anywhere that explains exactly how those data sets were created?
It would be good to test created neural networks against real network traffic, but I have not yet found a way to convert raw network data to the "DARPA dataset format".

Sorry for my bad English :)