Sunday, February 13, 2011



Title: Applying predictive modeling techniques to information security

A few months ago, I published an article on the Island titled "Using Analytics and Modeling to Predict Attacks" (https://www.infosecisland.com/blogview/6924-Using-Analytics-and-Modeling-to-Predict-Attacks.html). In that article I wondered whether analytics could assist security professionals in predicting future computer attacks. After writing a research paper on the subject during my last semester of graduate school, my short answer is yes... and, as Dr. Chuvakin commented on that article, "The devil's in the details!" The focus of my paper was on those details.

Basically, analytics can be used in any industry that produces and consumes data, and of course that includes security.

Predictive analytics and data mining may at first seem to mean the same thing, but there are differences. Data mining describes the process of exploring large amounts of data for relationships that can be exploited for proactive decision making. Data mining can support decisions through standard reports that explain what happened, and alerts can be created to flag the moments when a reaction is necessary. To me, predictive modeling goes a few steps beyond data mining and therefore adds the most value to a business. Predictive modeling starts with statistical analysis and moves past standard reporting and alerting to forecasting and optimization. Instead of focusing on what happened, predictive modeling lets us ask what will happen next, which trends will continue, and how we can do better.
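
To make the distinction concrete, here is a quick sketch (not from my paper) using a few lines of Python and NumPy. The daily alert counts are made up; the first half just summarizes what happened (the data mining / reporting view), while the last few lines fit a trivial trend and forecast what is likely to happen next (the predictive view).

```python
# A minimal sketch contrasting a descriptive report with a simple predictive
# step. The daily alert counts below are hypothetical.
import numpy as np

# Hypothetical counts of security alerts observed over the past 14 days.
alerts_per_day = np.array([120, 135, 128, 150, 160, 155, 170,
                           180, 175, 190, 205, 210, 220, 230])

# Descriptive analytics / data mining style output: summarize what happened.
print("mean alerts per day:", alerts_per_day.mean())
print("busiest day index:", alerts_per_day.argmax())

# Predictive modeling style output: fit a simple linear trend and
# forecast what is likely to happen next.
days = np.arange(len(alerts_per_day))
slope, intercept = np.polyfit(days, alerts_per_day, deg=1)
forecast_next_day = slope * len(alerts_per_day) + intercept
print("forecast for tomorrow:", round(forecast_next_day))
```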

There are considerable barriers to entry in this field. For one, analytics involves the use of advanced statistics. My limited statistical training was certainly a big hurdle as I began to put analytics into practice. I dusted off my grad school business statistics book and began to reread the sections on measures of central tendency, probability theory, and Bayesian statistics. At the same time, I was learning what exactly was meant by "business analytics" and "predictive modeling". Luckily for me, I work at one of the largest software companies in the world, and its focus on business analytics has provided me with a wealth of material and software tools to put this into hands-on practice.

The complex nature of the field leads to the next barrier: you need highly paid, highly skilled modeling professionals. That in turn leads to another barrier: you need people who know how to use modeling software. Since analytics is complicated, the software that supports it is complicated too. But even if you know statistics and learn how to use the tools, you may not be able to interpret the results you get. As a matter of fact, there is a trend in the industry to combat this complexity: several companies are planning to release tools that bring analytics to the novice end user. For my paper I evaluated two open source packages: R (the statistics package) with the Rattle data mining plugin, and Weka (a data mining workbench). I compared the open source offerings to SAS Enterprise Miner, an enterprise-strength data mining package with descriptive and predictive modeling capabilities.

In order to apply the techniques to information security I needed datasets. I used one commonly found in information security research: the network intrusion dataset from the KDD archive, popularly referred to as the KDD 99 Cup set. The KDD 99 Cup data consists of 41 attributes and 345,814 observations gathered from 9 weeks of raw TCP data from simulated United States Air Force network traffic. The intrusion dataset is quite different from a raw TCP dump. First of all, the KDD 99 Cup dataset has a number of attributes that are not found in raw TCP data. Secondly, two features that would actually improve intrusion detection models are missing from the dataset: timestamp and source IP address. Web log analysis is built on those two features, and they provide valuable insight into access patterns. The dataset's creators simulated 24 attack types, broken down into 4 classes: denial of service, Remote to Local, probing, and User to Root. I downloaded the dataset in two forms: (1) the raw dataset in CSV format for loading into SAS Enterprise Miner, and (2) the dataset in ARFF format as required by the Weka software. I immediately ran into a major problem with R and Weka: while both could load the 400,000 records, they frequently hung when I tried to build models, whereas SAS Enterprise Miner ran like a champ.
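
For readers who want to poke at the data outside of SAS, here is a rough Python/pandas sketch of pulling in the CSV. The file name is just whatever you saved the download as, and only the first few of the 41 attribute names are spelled out; the rest are generic placeholders rather than the real KDD column names.

```python
# A hedged sketch of loading the KDD 99 Cup CSV with pandas. The file name
# "kddcup99.csv" and the placeholder column names are assumptions; only the
# first handful of names follow the published KDD 99 feature list.
import pandas as pd

n_features = 41
columns = ["duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes"]
columns += [f"feature_{i}" for i in range(len(columns), n_features)]
columns.append("label")  # attack type, or "normal"

kdd = pd.read_csv("kddcup99.csv", header=None, names=columns)

# Work on a random sample first -- the full ~400,000 rows were enough to
# hang R/Rattle and Weka on my hardware, so exploring a subset is safer.
sample = kdd.sample(n=50_000, random_state=1)
print(sample["label"].value_counts().head())
```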

Next in my paper, I proposed a basic modeling framework. By using a modeling framework, modelers can apply techniques in an iterative fashion similar to software engineering. This enables modelers to share models, evaluate them for effectiveness, and determine whether the results are accurate. My framework starts with data exploration, moves on to model envisioning, continues with iterative modeling, and ends with model testing and deployment. This framework is loosely based upon the Predictive Model Markup Language (PMML) designed by the Data Mining Group.
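
Here is a loose sketch of the four phases in Python with scikit-learn. This is my own illustration rather than anything defined by PMML, and the make_classification call is just a stand-in for a real security dataset.

```python
# A loose sketch of the framework's four phases: explore the data, envision a
# candidate model, refine it iteratively, then test the winner on holdout data.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; a real project would load security records instead.
X, y = make_classification(n_samples=5_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Phase 1 -- data exploration: basic summary statistics.
print(pd.DataFrame(X_train).describe().loc[["mean", "std"]])

# Phase 2 -- model envisioning: pick a candidate model family and settings to try.
candidate_depths = [2, 4, 6, 8]

# Phase 3 -- iterative modeling: build, measure, and refine over several passes.
best_depth, best_score = None, 0.0
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(model, X_train, y_train, cv=3).mean()
    if score > best_score:
        best_depth, best_score = depth, score

# Phase 4 -- model testing (and, if it holds up, deployment): check holdout accuracy.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", final.score(X_test, y_test))
```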

By starting with data exploration you can use the software to display measures of central tendency. For example, when I imported the KDD 99 Cup dataset into the software, it showed several interesting things.

For one, the summary detected that 57% of all observations involved Smurf DDoS attacks and that 100% of the Smurf attacks used the ICMP protocol. In addition, Neptune attacks accounted for 22% of all observations and involved TCP traffic. This reflects the fact that Smurf attacks flood the target with ICMP packets, whereas Neptune attacks abuse the TCP 3-way handshake. Overall, the summary statistics showed very irregular data distributions in the KDD 99 Cup dataset. For example, the DDoS records always come in large clusters, whereas the U2R attacks are always represented by isolated records. This mirrors a common technique among attackers: launch a massive DDoS attack that overwhelms the server, and hidden inside that tremendous volume of traffic, launch the more lucrative U2R and R2L attacks. The idea is that the security analysts will be so busy mitigating the DDoS attack that they never detect the attempts to gain access through backdoors or password guessing.
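
The summaries behind those percentages are nothing exotic. Here is a small pandas sketch of the kind of cross-tabulation that surfaces the Smurf/ICMP and Neptune/TCP patterns; the tiny toy frame below stands in for the full KDD 99 Cup data.

```python
# A small sketch of the exploration step: cross-tabulating attack label
# against protocol type. The toy frame is made up and only stands in for
# the kdd DataFrame loaded earlier.
import pandas as pd

toy = pd.DataFrame({
    "label":         ["smurf", "smurf", "neptune", "normal", "smurf", "neptune"],
    "protocol_type": ["icmp",  "icmp",  "tcp",     "tcp",    "icmp",  "tcp"],
})

# Share of each attack label in the data (e.g. smurf dominating the set).
print(toy["label"].value_counts(normalize=True))

# For each label, which protocols carry it (e.g. smurf being 100% ICMP).
print(pd.crosstab(toy["label"], toy["protocol_type"], normalize="index"))
```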

When moving to model envisioning, you use agile software techniques to document candidate models that will aid in predictive modeling. A common model is the decision tree.

When using a decision tree, you identify a target variable in your dataset, and the software uses a series of IF-ELSE rules to divide the data into logical segments. Improvements to the predictive models occur during subsequent iterations, where model effectiveness is measured. In my paper, I kept subdividing the decision tree built in previous iterations by various attributes until I was relatively sure that the results I was seeing were accurate and useful. The final phase, model testing and deployment, involves determining whether the predictive models constructed in the earlier phases perform effectively. Cumulative lift charts are an excellent way to show the performance of a model visually. Lift, a measure of the effectiveness of a predictive model, is calculated as the ratio between the results obtained with and without the predictive model:

Lift = confidence / expected confidence

Basically, the greater the area between the lift curve and the baseline, the better the model will be at predicting outcomes.
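
To show how these last two phases fit together in code, here is a hedged, self-contained sketch: it builds a small decision tree on synthetic connection records (the two features and the "attack" rule are made up, not drawn from the KDD set), prints the IF-ELSE rules the tree learned, and then computes cumulative lift on a holdout set.

```python
# A sketch tying iterative modeling and model testing together on synthetic
# data: fit a decision tree, inspect its IF/ELSE rules, then compute
# cumulative lift on a holdout set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 20_000
# Two toy features loosely inspired by KDD-style attributes.
src_bytes = rng.exponential(scale=500, size=n)
conn_count = rng.integers(1, 200, size=n)
X = np.column_stack([src_bytes, conn_count])
# Made-up target: flag connections with many connections and small payloads.
y = ((conn_count > 120) & (src_bytes < 300)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Iterative modeling: a shallow tree whose IF/ELSE rules segment the data.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=["src_bytes", "conn_count"]))

# Model testing: cumulative lift on the holdout set.
# Lift = confidence / expected confidence, i.e. the share of attacks captured
# among the highest-scoring cases divided by the share of cases examined.
scores = tree.predict_proba(X_test)[:, 1]
order = np.argsort(-scores)                      # highest scores first
cum_attacks = np.cumsum(y_test[order])
depth = np.arange(1, len(y_test) + 1)
cum_lift = (cum_attacks / y_test.sum()) / (depth / len(y_test))
print("lift in the top 10% of scored connections:",
      round(cum_lift[len(y_test) // 10 - 1], 2))
```

A lift of 1.0 means the model does no better than random targeting; the higher the curve sits above that baseline, the more useful the model is for prioritizing which connections an analyst should look at first.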
There has been an increasing amount of work in the information technology field concerning predictive techniques and the need to uncover patterns in data.
Al-Shayea used artificial neural networks to predict students' academic performance, with the goal of improving student scores through preplanned strategic programs.
Fouad, Abdel-Aziz, and Nazmy conducted research on using artificial neural networks in IDSs in order to detect unknown signature patterns in network traffic.
Predictive modeling has proven to be extremely effective in solving a wide array of important business problems, but there are several hurdles to overcome before it can be used effectively by a wider audience. One problem is that a trained data analyst, experienced in modeling techniques and knowledgeable about the data sources, needs to be involved. A highly automated technology solution that incorporates the framework features presented in this paper, exposed as a web service, would enable developers and database analysts all over the world to build customizable solutions for their companies.