tag:blogger.com,1999:blog-90894430063126049612009-07-09T19:00:54.657+02:00Machine Learning for Computer Security.&gt;&gt; Good and bad times with machine learning and security research.  Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.comBlogger21125tag:blogger.com,1999:blog-9089443006312604961.post-54418732769302566842009-07-09T18:35:00.005+02:002009-07-09T18:59:11.584+02:00Securing IMS against Novel Threats.Recently, we have published an interesting article at Bell Labs Technical Journal on detecting unknown threats in IMS and VoIP networks. This article has been the outcome of a fruitful cooperation of <a href="http://www.first.fraunhofer.de">Fraunhofer FIRST</a> and <a href="http://www.alcatel-lucent.com">Alcatel-Lucent</a> in Germany. The article's abstract is here: <blockquote>Fixed mobile convergence (FMC) based on the 3GPP IP Multimedia<br />Subsystem (IMS) is considered one of the most important communication technologies of this decade. Yet this all-IP-based network technology brings about the growing danger of security vulnerabilities in communication and data services. Protecting IMS infrastructure servers against malicious exploits poses a major challenge due to the huge number of systems that may be affected. We approach this problem by proposing an architecture for an autonomous and self-sufficient monitoring and protection system for devices and infrastructure inspired by network intrusion detection techniques. The crucial feature of our system is a signature-less detection of abnormal events and zero-day attacks. These attacks may be hidden in a single message or spread across a sequence of messages. Anomalies identified at any of the network domain's ingresses can be further analyzed for discriminative patterns that can be immediately distributed to all edge nodes in the network domain.</blockquote><br /><a href="http://user.cs.tu-berlin.de/~rieck/docs/2009-bltj.pdf">Securing IMS against Novel Threats.</a> Stefan Wahl, Konrad Rieck, Pavel Laskov, Peter Domschitz and Klaus-Robert Müller. Bell Labs Technical Journal (BLTJ), 14(1), 243-257, John Wiley & Sons, May 2009.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-5441873276930256684?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-84337637270223679602009-05-29T18:05:00.009+02:002009-05-29T19:08:02.549+02:00Some Tuning for Feature Extraction.One key to efficient application of learning methods in practice is fast extraction of features from raw data. As an example, I am currently working on methods for automatic analysis of malware, where the behavior of a malware binary is represented in a textual report and mapped to a vector space using frequencies of contained substrings (see <a href="http://user.cs.tu-berlin.de/~rieck/docs/2008-dimva.pdf">here</a>). However, thousands of binaries need to be processed and often the generated report files are huge. Consequently, the task of designing analysis methods gets tedious, as one has to wait several minutes just to load data and extract appropriate feature strings.<br /><br />In quest for a remedy, I have experimented with <a href="http://www.openmp.org">OpenMP</a> (Open Multi-Processing) and <a href="http://people.freebsd.org/~kientzle/libarchive/">libarchive</a>, where the first is a simple API for multi-processing programming in C and the latter a library for reading and writing of file archives, such as zip, tar and on. On the one hand OpenMP enables loading of data and extraction of features in parallel, whereas on the other hand libarchive allows for storing the data efficiently in compressed archives in favor of directories. <br /><br /><center><br /><a href="http://user.cs.tu-berlin.de/~rieck/misc/mal_expt04.png"><img src="http://user.cs.tu-berlin.de/~rieck/misc/mal_expt04.png" width="380"></a><br /></center><br /><br />The figure shows some preliminary run-time measurements for extraction of feature vectors from malware reports. The application of multi-processing clearly accelerates the feature extraction, independent of the applied data format. For example, when reading from a directory the performance is doubled if two threads are used and enables processing up to 100 files per second (note this experiments was run on a dual-core machine). Surprisingly, the extraction performance is also high when using compressed archives. There is almost no difference between feature extraction from a zip/gz archive and a plain directory. Moreover, the zip/gz archive consumes only 5% of the original space and considerably reduces the amount of required storage. That's impressive. If you are dealing with loading and processing thousands of files, these tuning hacks might be an interesting option.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-8433763727022367960?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-6674988290492559182009-05-04T19:57:00.003+02:002009-05-04T20:39:05.845+02:00It's finally done.After a long period of hard and boring work, I finally submitted my Ph.D. thesis to the computer science faculty at <a href="http://www.tu-berlin.de">Berlin Institute of Technology</a> (TU Berlin). The thesis is entitled "<i>Machine Learning for Application-Layer Intrusion Detection</i>" and refereed by <a href="http://ml.cs.tu-berlin.de/en/klaus/index.html">Klaus-Robert Müller</a>, <a href="http://users.cs.dal.ca/~mchugh/">John McHugh</a> and <a href="http://ida.first.fraunhofer.de/~laskov/">Pavel Laskov</a>. <br /><br />In my thesis, I tackle the problem of detecting unknown and novel attacks in the application layer of network communication and present a machine learning framework for intrusion detection. In particular, I propose a generic technique for embedding of network payloads in vector spaces such that features extracted from the payloads are accessible to statistical and geometric analysis. Efficient learning in these high-dimensional vector spaces is realized using the concept of <a href="https://ml01.zrz.tu-berlin.de/twiki/pub/Main/MaschinellesLernenS08/structured2.pdf">kernel functions</a> defined over network payload data. Based on these functions, I derive methods for anomaly detection suitable for identification of unknown attacks, where normality of network data is modeled using geometric concepts such as hyperspheres and neighborhoods. <br /><br />The framework is empirically evaluated using 10 days of HTTP and FTP network traffic and over 100 real attacks unknown to the applied learning methods. A prototype of the framework outperforms related methods from <a href="http://cs.fit.edu/~pkc/id/related/krugel-sac02.ps">Kruegel et al. (2002)</a>, <a href="http://packetstorm.ussrback.com/papers/IDS/nids/Anagram.pdf">Wang et al. (2006)</a> and <a href="http://www.i-pi.com/~ingham/pubs/raid2007-revised.pdf">Ingham et al. (2007)</a>, where it identifies 80&ndash;97% unknown attacks with less than 0.002% false positives. Moreover, reasonable throughput rates between 20&ndash;60 Mbit/s are attained, though no special hardware acceleration is yet utilized.<br /><br />As the thesis is under review, I will not provide an online version now. However, I am going to present some interesting results on visualization of detected attack payloads in later posts. Stay tuned.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-667498829049255918?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-77725008058766585662009-04-07T20:06:00.004+02:002009-04-07T20:55:24.092+02:00Fun and Pain with ZFS.Recently, I found the time to experiment with <a href="http://opensolaris.org/os/community/zfs/">ZFS</a> &ndash; Sun's next-generation file system &ndash; using an external USB drive. In particular, I have been playing with ZFS to test whether it alleviates typical tasks of machine learning research, such as loading directories containing ten thousand of files or repeating experimental runs with large data chunks. As I am not running a Solaris system, I had to install the userland utilities and kernel modules for OSX (Leopard) available at <a href="http://zfs.macosforge.org/trac/">Macforge</a>. Note that Leopard natively supports read-only access to ZFS pools but lacks write functionality. <br /><br />Apart from great read access time, the first interesting issue I noticed is ZFS's ability to create hierarchical file systems on-the-fly. Instead of dumping all contents into a single volume, the hierarchical layout enables fine-grained control of different experimental data sets and allows for assigning quota and options individually per data. For example, the ability to easily split data comes handy, if one enables the transparent compression in ZFS. This snippet of commands creates two file systems named <code>tank/dataset1</code> and <code>tank/dataset2</code> where uses automatic compression.<pre>% zfs create tank/dataset1<br />% zfs create tank/dataset2<br />% zfs set compress=on tank/dataset1<br />% zfs set compress=off tank/dataset2<br /></pre>Compression might not be desired when running certain experiments. Thus, it can be disabled and enabled per file system, such that archived data is stored effectively while the current workbench is accessible with full processing power. A nice feature for working with experimental data. The compression ratio of each file system can be queried using the following command.<pre>% zfs get compressratio tank<br />NAME PROPERTY VALUE SOURCE<br />tank/dataset1 compressratio 3.14x -<br />tank/dataset2 compressratio 1.00x -<br /></pre>Another interesting issue is ZFS's ability to store snapshots of file systems. Initially, a snapshot does not consume any memory as only the differences to the original version are stored. If one is working with multiple copies of the same data, say one version described in a publication and a refined variant, snapshots are a great tool, as they allow one to quickly jump back and forth between different versions of data. One can access the content of each snapshot using the directory <code>.zfs</code> in the root of the considered file system. Here's an example how a snapshot is created for a given data set.<pre>% zfs snapshot tank/dataset1@paper<br /></pre>As it can see from a call to <code>zfs list</code> there is no memory associated with the snapshot directly after creation and thus no storage is wasted.<pre>% zfs list<br />tank/dataset1 2.38G 222G 2.38G /Volumes/tank/dataset1<br />tank/dataset1@paper 0 - 2.38G -<br /></pre>This feature is really nifty if one needs to "freeze" the state of experimental data if a paper is accepted for publication while still continuing to work with the data set.<br /><br />Unfortunately, all my enthusiasm and great plans to work with ZFS have been eliminated by the instability of the current Leopard driver. The driver does not really handle USB devices. If the device is accidentally removed or the computer falls asleep, the ZFS module simply crashes the kernel. I even managed to corrupt the ZFS partition such that any call to <code>zpool status</code> issues a kernel panic. Game over.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-7772500805876658566?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-13522145593809385542009-01-25T17:02:00.002+01:002009-01-25T17:27:57.888+01:00Call for Papers: DIMVA 2009.I am a member of the program committee of this year's conference on <a href="http://www.dimva.org">Detection of Intrusions and Malware & Vulnerability Assessment</a> (DIMVA) in Milan, Italy. The conference invites paper submissions from the domain of computer security with focus on intrusion detection and malware research. <ul><li> Deadline for paper submission: February 6, 2009<br /><li> Notification of acceptance or rejection: March 30, 2009<br /><li> Final paper camera ready copy: April 10, 2009<br /><li> Conference dates: June/July, 2009<br /> </ul>Contributions can be submitted as full papers (limited to 20 pages) or short papers (limited to 10 pages). A detailed "call for papers" is available <a href="http://www1.gi-ev.de/fachbereiche/sicherheit/fg/sidar/dimva/dimva2009/cfp2009.pdf">here</a>. I am looking forward to interesting contributions and an awesome DIMVA conference&mdash;as in the last five years.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-1352214559380938554?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-79229069430233501632008-12-18T13:55:00.025+01:002008-12-30T15:25:42.410+01:00Unpleasant Facts about Data Clustering.<a href="http://en.wikipedia.org/wiki/Data_clustering">Data clustering</a> is a popular technique of data mining and machine learning. The objective is to automatically partition a set of given objects into "meaningful" groups. Clustering is intuitive and fits a variety of problem settings, for example <a href="http://www.cs.columbia.edu/ids/publications/cluster-thesis00.pdf">intrusion detection</a> and <a href="http://www.eecs.umich.edu/~mibailey/publications/raid07_final.pdf">malware categorization</a> in computer security. However, application of clustering methods is laborious and error-prone in practice. Here are my unpleasant facts of clustering:<ul><li> <i>Generic clustering is NP-hard</i>. Formally, clustering aims at partitioning data such that a predefined quality criterion is maximized. Unfortunately, there is no generic way for determining an optimal solution in this setting except for testing all possible combinations. All practical clustering methods, such as k-means and linkage clustering, limit the search space by discarding partitionings and thus provide only an approximation to the best clustering (a local optimum). If you cluster data, you will likely obtain an approximate solution.<br /><br /><li> <i>Hierarchy is no clustering</i>. Linkage clustering is a form of hierarchical clustering, where the data is first mapped to a hierarchical representation and then partitioned into clusters. In practice, people often consider the hierarchical representation of data as the final clustering result. However, a hierarchy encodes an exponential number of possible partitionings and the more involved step is to flatten the hierarchy into a clustering. As a negative example consider the Wikipedia entry on <a href="http://en.wikipedia.org/wiki/Data_clustering">hierarchical clustering</a> which almost solely focuses on constructing hierarchies&mdash;omitting details on the subsequent clustering procedure. <br /><br /><li> <i>Statistical consistency unclear</i>. An important term from statistical learning theory is <i>consistency</i>. Shortly put a learning algorithm is consistent if it converges to the true solution the more training data is provided. Surprisingly, little is known about the consistency of clustering. For instance, most clustering methods aim at partitioning provided data but not the underlying data distributions. That is, if you provide more data to a clustering, there is no guarantee that the solution improves (see the work of <a href="http://www.kyb.mpg.de/~ule">Luxburg</a> [<a href="http://www.cse.ohio-state.edu/~mbelkin/papers/SC_AOS_07.pdf">1</a>, <a href="http://books.nips.cc/papers/files/nips20/NIPS2007_0423.pdf">2</a>]). <br /><br /><li> <i>Model selection vs. Clustering</i>. Clustering is often controlled using a model parameter that determines the granularity of the partitioning. This parameter can be the number of clusters as for k-means but may also take the form of a numeric quantity as in linkage clustering. Clearly, this parameter needs to be adapted prior to application. However, model selection contradicts with the purpose of clustering. On the one hand, to validate the parameter on given data a reference partitioning needs to be available, thus no clustering is required in this case. On the other hand on unlabeled data the parameter can never be validated directly. In practice, one resorts to model selection on reference data and application of the best model to unknown data. Consequently, this setting requires a representative reference partitioning obtained <i>without clustering</i>. <br /></ul>Besides these facts, I like clustering methods and have been successfully working with several of them. Nevertheless, special care needs to be taken when running experiments to achieve sound results&mdash;the application of clustering is more involved as it seems at a first glance.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-7922906943023350163?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com1tag:blogger.com,1999:blog-9089443006312604961.post-3365901011599841072008-12-17T17:33:00.010+01:002008-12-18T13:46:36.472+01:00Incorporation of Application Layer Protocol Syntax into Anomaly Detection.The majority of current intrusion detection, anti-virus and anti-malware products builds on the concept of <i>misuse detection</i>, that is malicious activity is identified using rules and signatures of known misuse patterns. Consequently, much effort has been devoted to defining expressive and discriminative features for construction of misuse rules. In the domain of network security such features are often derived from the syntax of application layer protocols, for example to answer the questions "who did what to which data when?" <br /><br />In contrast to misuse detection, a second strain of intrusion detection research investigates methods for anomaly detection, that is malicious activity is identified in terms of unusual and abnormal events. Surprisingly, the use of features from protocol syntax in this setting has been considered in only few research, for example <a href="http://www.cs.ucsb.edu/~vigna/publications/2003_kruegel_vigna_ccs03.pdf">Kruegel et al.</a> and <a href="http://www.cs.unm.edu/~forrest/publications/learning-DFA-Representations.pdf">Ingham et al.</a> Access to syntactical features is clearly beneficial for anomaly detection, and hence some of my colleagues (Patrick and Christian) have been working on extending previous approaches to incorporate protocol syntax into a generic anomaly detection framework. <br /><br />Recent results of this work are presented at the <a href="http://www.seclab.cs.sunysb.edu/iciss08/">International Conference on Information Systems Security</a>. In particular, we introduce new features for network intrusion detection, which combine sequential features (such as <a href="http://worminator.cs.columbia.edu/papers/2004/RAID4.PDF">byte frequencies</a> and <a href="http://ida.first.fraunhofer.de/~rieck/docs/2006-dimva.pdf">n-grams</a>) with syntactical tokens. Here is the abstract of our contribution:<blockquote>The syntax of application layer protocols carries valuable information for network intrusion detection. Hence, the majority of modern IDS perform some form of protocol analysis to refine their signatures with application layer context. Protocol analysis, however, has been mainly used for misuse detection, which limits its application for the detection of unknown and novel attacks. In this contribution we address the issue of incorporating application layer context into anomaly-based intrusion detection. We extend a payload-based anomaly detection method by incorporating structural information obtained from a protocol analyzer. The basis for our extension is computation of similarity between attributed tokens derived from a protocol grammar. The enhanced anomaly detection method is evaluated in experiments on detection of web attacks, yielding an improvement of detection accuracy of 49%. While byte-level anomaly detection is sufficient for detection of buffer overflow attacks, identification of recent attacks such as SQL and PHP code injection strongly depends on the availability of application layer context. </blockquote><br /><a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-iciss.pdf">Incorporation of Application Layer Protocol Syntax into Anomaly Detection</a>. Patrick Düssel, Christian Gehl, Pavel Laskov and Konrad Rieck. <i>Proc. of International Conference on Information Systems Security (ICISS)</i>, December 2008.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-336590101159984107?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-84169453383589200672008-12-03T21:16:00.004+01:002008-12-03T22:09:02.826+01:00An Architecture for Inline Anomaly Detection.I am currently finishing my doctoral thesis, thus there is almost no time for interesting activities and fun. Fortunately, I am not the only one, see for instance <a href="http://www.honeyblog.org/archives/3-Old-Entries-Honeypot-Presentation.html">here</a>. <br /><br />Besides all the work, the good news is that we will present an interesting paper on combining anomaly detection and intrusion prevention at the <a href="http://2008.ec2nd.org">European Conference on Computer Network Defense</a> (EC2ND). Here is the abstract from our contribution:<blockquote>In this paper we propose an intrusion prevention system (IPS) which operates inline and is capable to detect unknown attacks using anomaly detection methods. Incorporated in the framework of a packet filter each incoming packet is analyzed and&mdash;according to an internal connection state and a computed anomaly score&mdash;either delivered to the production system, redirected to a special hardened system or logged to a network sink for later analysis. Run-time measurements of an actual implementation prove that the performance overhead of the system is sufficient for inline processing. Accuracy measurements on real network data yield improvements especially in the number of false positives, which are reduced by a factor of five compared to a plain anomaly detector. <br /></blockquote><br /><a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-ec2nd.pdf">An Architecture for Inline Anomaly Detection.</a> Tammo Krueger, Christian Gehl, Konrad Rieck and Pavel Laskov. <i>Proc. of European Conference on Computer Network Defense (EC2ND)</i>, December 2008.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-8416945338358920067?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-66344876607934622112008-11-14T08:53:00.014+01:002008-12-03T22:27:12.381+01:00Generalized Suffix Trees and Distinct Substrings.This is a technical post on strings, substrings and suffix trees. This could get boring. You have been warned.<br /><br />Everybody knows the concept of <i>longest common substring</i>, where a set of strings is given and the task is to determine the longest substring shared by the set. While intuitive and straightforward to implement using a <a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree">generalized suffix tree</a>, this setting is not robust against "noise", such as corrupted or truncated strings. Moreover, finding the longest common substring is useless, if the considered strings derive from different data distributions, thus not necessary sharing any substring.<br /><br />These shortcomings can be addressed by determining a <i>set of shared substrings</i>, which are shared by at least <i>m</i> strings but not necessary all strings in the set. To restrict the amount of small shared substrings, one requires the substrings to have a minimum length. This setting has been applied in the <a href="http://www.cs.berkeley.edu/~dawnsong/papers/polygraph.pdf">PolyGraph</a> and <a href="http://www.cs.northwestern.edu/~zli109/publication/Li-Hamsa-ssp06.pdf">Hamsa</a> paper for automatic signature generation from network data. Again, an implementation is easily realized using a depth-first traversal of a <a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree">generalized suffix tree</a>. Unfortunately, many of the shared substrings will be prefixes and suffixes of each other. <br /><br />This issue is resolved by the <i>set of distinct shared substrings</i>, where a distinct substring <i>x</i> is not contained in any other distinct substring <i>y</i>, unless <i>x</i> appears in at least <i>m</i> strings independently of <i>y</i>. Implementing this setting using <a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree">generalized suffix trees</a> is a little bit more involved. Together with a colleague, I came up with this approach: <br /><blockquote>We start to determine interesting substrings by a depth-first traversal of a <a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree">generalized suffix tree</a>. <br /><br />This time, however, we are looking for distinct substrings and hence can not afford to output a substring that later turns out to be non-distinct. We thus first visit nodes which give rise to longer substrings, that is for each node we order the child nodes according to the length of possible substrings. Consequently, we find longer substrings first. <br /><br />If we have determined a first substring <i>x</i> shared by <i>m</i> strings, we can be sure that it is distinct as no longer substrings exist. Next, we need to mark all non-distinct substrings of <i>x</i>. We do this by traversing the suffix links and parent nodes of <i>x</i>, thereby processing all suffixes and prefixes of <i>x</i>. For each node we maintain a counter indicating the number of strings the corresponding substring is shared by. When processing the prefixes and suffixes of <i>x</i>, we decrement this counter by the number of occurences of <i>x</i>.<br /><br />Finally, we continue the depth-first traversal. We do not descend to nodes with a counter smaller than <i>m</i>, as no distinct shared substring can be determined on the corresponding path.<br /></blockquote><br />To be honest, I have not thoroughly thought about the complexity of this algorithm. Yet, I expect it to be "somewhat linear" in the length of the considered strings. The output of the algorithm resembles the concept of distinct substrings proposed in <a href="http://www.cs.berkeley.edu/~dawnsong/papers/polygraph.pdf">PolyGraph</a>, where the authors apply a costly post-processing of shared substrings. Note that this approach can also be implemented using suffix and LCP arrays with the traversal techniques studied by <a href="ftp://ftp.i.kyushu-u.ac.jp/pub/tr/trcs185.ps.gz ">Kasai et al.</a><br /><br />Did I mention that I like string algorithms?<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-6634487660793462211?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-28360126849930656752008-10-16T08:59:00.009+02:002008-12-03T22:18:45.125+01:00Automatic Feature Selection for Anomaly Detection.Network intrusion detection requires discriminative features from network traffic, which for the case of misuse detection allow for the precise formulation of signatures and rules, and which, on the other end, enable application of learning methods for anomaly detection. However, devising a set of features is not trivial, as attacks are reflected in all sorts of different patterns and numerical measures. A good example is the work of Kruegel at al. (see <a href="http://www.cs.ucsb.edu/~chris/research/doc/comnet05_anomaly.pdf">1</a>, <a href="http://www.cs.ucsb.edu/~chris/research/doc/ccs03_webanomaly.pdf">2</a>, <a href="http://www.cs.ucsb.edu/~chris/research/doc/2002_03.ps">3</a>), in which various heterogeneous features are proposed and manually combined into an effective anomaly detection system. <br /><br />Recently, colleagues at our lab came up with a method for automatic selection and weighting of such features for intrusion detection. Instead of defining a weighting of different features manually, the method automatically determines the mixture which optimizes the performance of anomaly detection. This optimization is realized by incorporating the process of feature selection directly into an anomaly detection method&mdash;a rather involved mathematical procedure. The paper will be presented at the <a href="http://www.aisec.info/">AISec Workshop</a> co-located with CCS. Following is its abstract:<blockquote>A frequent problem in anomaly detection is to decide among different feature sets to be used. For example, various features are known in network intrusion detection based on packet headers, content byte streams or application level protocol parsing. A method for automatic feature selection in anomaly detection is proposed which determines optimal mixture coefficients for various sets of features. The method generalizes the <a href="http://ict.ewi.tudelft.nl/~davidt/papers/ML_SVDD_04.pdf">support vector data description (SVDD)</a> and can be expressed as a semi-infinite linear program that can be solved with standard techniques. The case of a single feature set can be handled as a particular case of the proposed method. The experimental evaluation of the new method on unsanitized HTTP data demonstrates that detectors using automatically selected features attain competitive performance, while sparing practitioners from a priori decisions on feature sets to be used. <br /></blockquote><br /><a href="http://user.cs.tu-berlin.de/~brefeld/publications/aisec19-kloft.pdf">Automatic Feature Selection for Anomaly Detection</a>. M. Kloft, U. Brefeld, P. Düssel, C. Gehl, and P. Laskov. <i>Proceedings of the First ACM Workshop on AISec</i>, October 2008.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-2836012684993065675?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-43761394876395274302008-09-28T08:42:00.012+02:002008-12-03T22:18:51.574+01:00Approximate Kernels for Trees.The majority of machine learning and artificial intelligence methods are designed to work on vectorial data. In practice, however, data does not always come in the form of simple vectors. For instance, analysis of HTML documents requires processing and matching tree structures. The state-of-the-art techniques for learning with trees are convolutional <a href="http://en.wikipedia.org/wiki/Kernel_methods">kernel functions</a>. These functions are effective but painfully slow on large trees. To speed-up such kernels for security applications, we came up with <a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-first.pdf">approximate tree kernels</a>. Here is the abstract from the corresponding technical report:<blockquote>Convolution kernels for trees provide effective means for learning with tree-structured data, such as parse trees of natural language sentences. Unfortunately, the computation time of tree kernels is quadratic in the size of the trees as all pairs of nodes need to be compared: large trees render convolution kernels inapplicable. In this paper, we propose a simple but efficient approximation technique for tree kernels. The approximate tree kernel (ATK) accelerates computation by selecting a sparse and discriminative subset of subtrees using a linear program. The kernel allows for incorporating domain knowledge and controlling the overall computation time through additional constraints. Experiments on applications of natural language processing and web spam detection demonstrate the efficiency of the approximate kernels. We observe run-time improvements of two orders of magnitude while preserving the discriminative expressiveness and classification rates of regular convolution kernels.</blockquote><br /><a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-first.pdf">Approximate Kernels for Trees.</a> Konrad Rieck, Ulf Brefeld and Tammo Krueger. <i>Technical Report FIRST 5/2008, Fraunhofer Institute FIRST, September 2008.</i><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-4376139487639527430?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-14342154273392160872008-09-20T19:19:00.012+02:002008-11-14T11:14:29.306+01:00Low throughput? The imbalance of network traffic.I have been recently running experiments with a "fast" learning method for anomaly detection in HTTP and FTP payloads. While the method performed well in terms of attack detection, the realized throughput was frustrating. Incoming traffic was processed at 15 mbit/s (HTTP) and 30 mbit/s (FTP) on a single CPU, including traffic normalization, TCP reassembly and anomaly detection. <br /><br />How useful is such a low throughput? <br /><br />Well, better than one might think. Depending on the considered application-layer protocol, there is often an imbalance of ingress and egress traffic. For example, the HTTP traffic at our institute has an ingress-egress ratio of 1/50, while the FTP traffic recorded at LBNL (port 21 only) reaches a ratio of 1/6. Thus, when looking at the total throughput the considered learning method yields around 770 mbit/s (HTTP) and 200 mbit/s (FTP) throughput. That's not so frustrating, given that the performance could be easily increased using multi-core systems. <br /><br />For other protocols, however, the ingress-egress is not a small fraction and one can not indirectly benefit from the imbalance of traffic. In conclusion, when analyzing the run-time performance of an intrusion detection method it really matters which protocol and direction is monitored for evil things.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-1434215427339216087?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-17047696722866617672008-08-30T13:18:00.005+02:002008-11-14T11:16:32.895+01:00Portscanning the Internet.I just returned from a long holiday trip. While browsing through the list of blog postings I missed during this period, I noticed an interesting article on large-scale port scanning:<br /><br /><a href="http://blog.thc.org/index.php?/archives/2-Port-Scanning-%20%20%20%20the-Internet.html">Portscanning the Internet</a> <i>posted at the <a href="http://blog.thc.org">THC blog</a></i>.<br /><br />The author reports on his experience with port scanning methods, which are capable to scan the entire Internet in a reasonable time frame. Besides several remarks on how to conduct such scans and how to threshold a scanning system, there is also a funny side note on a <i>false</i> worm outbreak caused by scanning activities.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-1704769672286661767?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-68111601283447511512008-07-11T18:26:00.011+02:002008-11-14T11:16:17.755+01:00Slides for Talk on Learning Malware Behavior.<img src="http://ida.first.fraunhofer.de/~rieck/docs/talks/2008-dimva-monster.png" align="right" width="120">I just noticed that <a href="http://www.heise.de/security/">heise security</a> posted a <a href="http://www.heise.de/security/Verbessertes-Heuristikverfahren-vorgestellt--/news/meldung/110806">short note</a> on our recent <a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-dimva.pdf">paper</a> presented at DIMVA 2008. Cool. As an addition to the technical paper, I am thus also providing the <a href="http://ida.first.fraunhofer.de/~rieck/docs/talks/2008-dimva-talk.pdf">slides</a> for my talk. Have fun.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-6811160128344751151?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-27223545615600610262008-07-11T09:24:00.005+02:002008-11-14T11:16:17.756+01:00Learning and Classification of Malware Behavior.Yesterday, I have been presenting some of our joint work with the <a href="http://pi1.informatik.uni-mannheim.de/">University of Mannheim</a> on malware analysis at this year's <a href="http://www.dimva2008.org/">DIMVA</a>. The corresponding paper is available online:<br /><br /><a href="http://ida.first.fraunhofer.de/~rieck/2008-dimva.pdf">Learning and Classification of Malware Behavior</a>. <i>To appear in Proceedings of Fifth Conference on Detection of Intrusions and Malware &amp; Vulnerability Assessment (DIMVA), 2008.</i><br /><br />Thorsten Holz wrote a nice summary of our key results in his <a href="http://honeyblog.org/">blog</a> yesterday. In essence, we devise a method for automatic classification of malware families based on their behavior. The method proceeds by learning behavioral patterns from malware monitored in a <a href="http://www.cwsandbox.org/">sandbox</a>. Starting from a set of malware binaries labeled by an anti-virus scanner, the method generalizes observed behavior so that variants undetected by anti-virus products can be identified. Moreover, the method is transparent to the security practitioner as the learning model can be examined and decisions made by the system can be traced back to behavioral patterns discriminative for each malware family.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-2722354561560061026?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-88586930584092167612008-06-26T23:52:00.009+02:002008-11-14T11:14:29.306+01:00A Self-Learning System for VoIP Security.Next week I will present some of our research on learning-based security systems on the <a href="http://iptcomm.org/">IPTCOMM 2008</a> conference. The resulting paper is available at the following link:<br /><br /><a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-iptcomm.pdf">A Self-Learning System for Detection of Anomalous SIP Messages</a>. <i>To appear in Proceedings of Principles, Systems and Applications of IP Telecommunications (IPTCOMM), 2008.</i><br /><br />In this joint work with Bell Labs Germany, we propose a system for automatic detection of unknown attacks in SIP. The system<br />identifies anomalous content by cleverly mapping SIP messages to a vector space, in which deviation from normality can be expressed <i>geometrically</i>. We present some nice experiments demonstrating the high accuracy of the proposed system (no false-positives on a data set of real SIP traffic) and, also, its moderate througput rate (about ~70 mbit/s).<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-8858693058409216761?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-28642878969169915742008-05-25T18:09:00.009+02:002008-11-14T11:15:52.340+01:00Ph-Neutral on an Island.<img src="http://www.insel-berlin.net/uploads/pics/CIMG1196.JPG" align="right" width="180"> I have spent part of the weekend at <a href="http://www.ph-neutral.org">Ph-Neutral</a>, a mini-conference bringing together blackhat and whitehat people. <br /><br />This year the organizers did an excellent job choosing a location: the conference took place on an island in the middle of Berlin called <a href="http://www.insel-berlin.net/">"Die Insel"</a>. The island featured two different bars, room for talks and a cosy beach. What a nice location.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-2864287896916991574?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-37370724541292731902008-05-10T09:49:00.006+02:002008-11-14T11:16:58.741+01:00A Honeypot at Home.I am running a small <a href="http://www.honeyd.org">honeypot</a> at my home dialup capturing unsolicited traffic. Today I found the time to view through its logs. Although I could not spot anything special, I had a good time examining the catch, such as various scans for proxies, relays and vulnerabilities, e.g.<br /><pre>GET http://www.woqianqi.cn/go.php HTTP/1.0<br />CONNECT mx3.mail2000.com.tw:25 HTTP/1.0<br />GET /cacti/cmd.php HTTP/1.0<br />GET /w00tw00t.at.ISC.SANS.DFind:) <br />[...]</pre>While not a professional tool for monitoring internet threats, a honeypot at home suffices to observe incoming cruft and, eventually, have a good time.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-3737072454129273190?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-65744284902949229292008-04-17T22:19:00.006+02:002008-11-14T11:15:52.340+01:00First ACM Workshop on AISec.At this year's CCS conference there will be a special ACM workshop on artificial intelligence for security, named <i>AISec</i>. The call for papers and further information on the workshop are available at the workshop site <a href="http://www.aisec.info">www.aisec.info</a>. <br /><br />This looks like some pretty interesting event. The deadline for submissions of full papers is the 9th May 2008. Proceedings will be published by the ACM.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-6574428490294922929?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-34659127858576000952008-04-09T12:48:00.007+02:002008-04-09T15:31:04.100+02:00Machine Learning Open Source Software.In case you missed it, there is a new <a href="http://jmlr.csail.mit.edu/mloss/">track in JMLR</a> dedicated to open source software for machine learning (MLOSS). As a first result of this effort, a community-driven web site has been started providing a platform for distribution and discussion of open machine learning software. The web site is available at <a href="http://www.mloss.org">www.mloss.org</a> and features a large body of learning software to download. <br /><br />I like it a lot and plan to submit some of my recent software projects; given they reach a mature state (which of course might never happen).<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-3465912785857600095?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0tag:blogger.com,1999:blog-9089443006312604961.post-9523336091206714782008-04-03T12:03:00.013+02:002008-11-14T11:14:29.307+01:00Casting out Demons.There is an interesting paper at this year's IEEE S&P: Gabriela Cretu and colleagues propose a method for sanitizing of training data for anomaly detection. The sanitization procedure essentially builds on learning small models from random samples and performing majority voting. Sounds familiar? This is another variant of <i>bagging</i>, here in the context of security for removal of unknown attacks from training data.<br /><br /><a href="http://www.cs.gmu.edu/~astavrou/research/CretuG_SanitizingTrainingData.pdf">Casting out Demons: Sanitizing Training Data for Anomaly Sensors</a>. <i>To appear in the Proceedings of the IEEE Symposium on Security & Privacy</i>, May 2008, Oakland, CA.<br /><br />A small nit: I missed a comparison to anomaly detection methods <i>capable to learn on dirty data</i>, and thus eliminating the need of prior sanitization. But anyway...<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-952333609120671478?l=www.mlsec.org'/></div>Konrad Rieckhttp://www.blogger.com/profile/11125298703490863387noreply@blogger.com0