Thursday, April 30, 2015

The Need for Test Data

Last week at the RSA Conference, I spoke to several vendors about the challenges they face offering products and services in the security arena. One mentioned a problem I had not heard before, but which made sense to me. The topic will likely resonate with security researchers, academics, and developers.

The vendor said that his company needed access to large amounts of realistic computing evidence to test and refine its products and services. For example, if a vendor develops software that inspects network traffic, it's important to have realistic network traffic on hand. The same is true of software that works on the endpoint, or on application logs.

Nothing in the lab is quite the same as what one finds in the wild. If vendors create products that work well in the lab but fail in production, no one wins. The same is true for those who conduct research, either as coders or academics.

When I asked vendors about their challenges, I was looking for issues that might meet the criteria of Allan Friedman's new project, as reported in the Federal Register: Stakeholder Engagement on Cybersecurity in the Digital Ecosystem. Allan's work at the Department of Commerce seeks "substantive cybersecurity issues that affect the digital ecosystem and digital economic growth where broad consensus, coordinated action, and the development of best practices could substantially improve security for organizations and consumers."

I don't know if "realistic computing evidence" counts, but perhaps others have helpful ideas?


David Wilburn said...

Richard, you're absolutely correct that this is a challenge. The lack of good, labeled data, in useful quantities and in realistic formats and proportions, is a huge obstacle to evaluating products, developing security analytics, and so on. For instance, when it comes to testing network sensors, there's too much temptation to use traffic generators that don't represent application-layer traffic very well. For evaluations, I tend to favor replay of existing, real-world traffic captured from a network location similar or identical to where the proposed solution would be placed. However, that approach comes with its own baggage: the data may be sensitive, so the vendor can't get a copy to troubleshoot, and buying or building good, high-speed replay devices can be a huge undertaking in and of itself.

The freely available data sets, such as the SANS holiday challenges or Digital Corpora's, tend to be too small and too narrow in use case to be generally useful except to kick the tires. Every once in a while, I still catch people trying to use ye olde "DARPA data" set from MIT Lincoln Labs in 1999, which is ancient and completely useless. I wish I had a good solution.
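The replay workflow described above usually starts from stored pcap files. As a minimal, purely illustrative sketch (not drawn from any vendor's product), here is a standard-library Python snippet that builds a tiny classic-format pcap in memory and inventories its packet records, the kind of sanity check one might run on a capture before handing it to a replay device. The helper names and the sample capture contents are hypothetical.

```python
import io
import struct

# Classic pcap global header (24 bytes): magic, version 2.4, tz offset,
# timestamp accuracy, snaplen, link type 1 (Ethernet). Little-endian.
PCAP_GLOBAL_HDR = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)

def make_record(payload: bytes) -> bytes:
    """Build one pcap packet record: ts_sec, ts_usec, incl_len, orig_len."""
    return struct.pack("<IIII", 0, 0, len(payload), len(payload)) + payload

def count_packets(data: bytes) -> tuple[int, int]:
    """Return (packet_count, total_captured_bytes) for a little-endian pcap."""
    buf = io.BytesIO(data)
    buf.read(24)  # skip the global header
    count = total = 0
    while True:
        hdr = buf.read(16)
        if len(hdr) < 16:
            break  # end of file (or truncated record header)
        _, _, incl_len, _ = struct.unpack("<IIII", hdr)
        buf.read(incl_len)  # skip the packet bytes themselves
        count += 1
        total += incl_len
    return count, total

# Hypothetical two-packet capture: one minimum-size and one full-size frame.
capture = PCAP_GLOBAL_HDR + make_record(b"\x00" * 60) + make_record(b"\x01" * 1500)
print(count_packets(capture))  # prints (2, 1560)
```

In practice one would point such a check at a real capture file and then replay it with a tool like tcpreplay; the point here is only that even basic inventory work assumes access to realistic captures in the first place.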

Anonymous said...

The Protected Repository for the Defense of Infrastructure Against Cyber Threats

Synthetic and Real World Data sets
(Some may require user agreements)