Automatic Analysis of Malware Behavior using Machine Learning

Monday, December 28. 2009
CWSandbox
In the last couple of years, several honeypot solutions to automatically "collect" malware samples were developed. With these tools, it is possible to obtain copies of malware samples without any human interaction. As a result, we are able to collect quite a few malware samples per day, which then also need to be analyzed. Thus, several sandbox solutions were developed that automate the analysis step by performing dynamic, behavior-based analysis. The result of the dynamic analysis is typically a report that summarizes the observed behavior. The next logical step is to use that information to perform malware classification and malware clustering: at the end of that process, we can then obtain information about which samples perform basically the same kind of activity. We can then automatically find variants of well-known threats, identify new malware families, and reduce the manual effort needed to analyze the large number of incoming malware samples.

In the last couple of months, we worked on malware classification and malware clustering. The results are summarized in a technical report. In the article, we introduce a learning-based framework for automatic analysis of malware behavior. To apply this framework in practice, it suffices to collect a large number of malware samples and monitor their behavior using a sandbox environment. By embedding the observed behavior in a vector space, reflecting behavioral patterns in its dimensions, we are able to apply learning algorithms, such as clustering and classification, for analysis of malware behavior. Both techniques are important for an automated processing of malware samples and we show in several experiments that our techniques significantly improve previous work in this area. For example, the concept of prototypes allows for efficient clustering and classification, while also enabling a security researcher to focus manual analysis on prototypes instead of all malware samples. Moreover, we introduce a technique to perform behavior-based analysis in an incremental way that avoids run-time and memory overhead inherent to previous approaches.

Abstract
Malicious software — so called malware — poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts in the Internet are infected with malware in form of computer viruses, Internet worms and Trojan horses. While obfuscation and polymorphism employed by malware largely impede detection at file level, the dynamic analysis of malware binaries during run-time provides an instrument for characterizing and defending against the threat of malicious software.
In this article, we propose a framework for automatic analysis of malware behavior using machine learning. The framework allows for automatically identifying novel classes of malware with similar behavior (clustering) and assigning unknown malware to these discovered classes (classification). Based on both, clustering and classification, we propose an incremental approach for behavior-based analysis, capable to process the behavior of thousands of malware binaries on a daily basis. The incremental analysis significantly reduces the run-time overhead of current analysis methods, while providing an accurate discovery and discrimination of novel malware variants.

The full technical report is available at http://honeyblog.org/junkyard/paper/malheur-TR-2009.pd. It was joint work with Konrad Rieck, Philipp Trinius, and Carsten Willems. And the word cloud was generated using http://www.wordle.net/.

Call for Papers: EuroSec 2010

Wednesday, November 25. 2009
admin
The next edition of the European Workshop on System Security (EuroSec 2010) will take place on the 13th of April, 2010, in Paris, France. Please find below the call for papers.

About EuroSec:
EuroSec is a new workshop associated with the Annual ACM SIGOPS EuroSys conference. The workshop aims to bring together researchers, practitioners, system administrators, system programmers, and others interested in the latest advances in the security of computer systems and networks. The focus of the workshop is on novel, practical, systems-oriented work.

Important dates:
  • Paper submission: February 7, 2010 (Hard deadline, no extensions), 5pm, PST
  • Acceptance notification: March 1, 2010
  • Final paper due: March 12, 2010
  • Workshop: April 13, 2010

Continue reading "Call for Papers: EuroSec 2010"

"Visual Analysis of Malware Behavior Using Treemaps and Thread Graphs"

Friday, August 21. 2009
CWSandbox
I continue the series of recently or upcoming papers with a paper we will publish at VizSec'09 entitled "Visual Analysis of Malware Behavior Using Treemaps and Thread Graphs". In the recent years, we saw a lot of progress in the area of automated malware analysis. Nowadays tools such as CWSandbox, Anubis, ThreatExpert, or Norman Sandbox are available. These tools analyze a given binary and generate a report which contains a summary of the observed behavior while executing the sample. Such reports are often quite long, it is for example not uncommon for a CWSandbox report to be longer than 100 lines. An analyst thus has to read the report in order to get an understanding of what a given sample is doing. In this paper we present an approach to visualize the behavior report with treemaps and behavior graphs (i.e., visualizing the behavior of the individual threads over time). This helps to get a quick overview of what a given sample does and also samples from one malware family have a similar looking treemap/behavior graph.

As an example, consider the following three pictures which each show the treemap generated for three distinct samples of the Bagle worm:


Each picture shows a treemap of the behavior: the x-axis depicts the type of action performed, e.g., whether the sample performed actions related to the filesystem, the registry, or the network. The y-axis devides the actions into operations, i.e., whether it was a read or write access to the registry. As you can see, the behavior of the Bagle sample is (more or less) consistent across different samples from the same family. Below you can find the visualization of two Swizzor samples and one Allaple sample.


Samples from the same family have a similar visualization, while samples from different families look different. This could help an analyst to quickly identify if the sample is interesting or just another small variant of a well-known family. This research will be integrated in the frontend of http://cwsandbox.org.

Abstract: We study techniques to visualize the behavior of malicious software (malware). Our aim is to help human analysts to quickly assess and classify the nature of a new malware sample. Our techniques are based on a parametrized abstraction of detailed behavioral reports automatically generated by sandbox environments. We then explore two visualization techniques: treemaps and thread graphs. We argue that both techniques can effectively support a human analyst (a) in detecting maliciousness of software, and (b) in classifying malicious behavior.

"Towards Proactive Spam Filtering"

Friday, July 31. 2009
A common technique employed by spammers is to send spam mails with the help of botnets. In a typical setting, the spammer uses so called template-based spamming: the attacker sends the bots a spam template that describes the structure of the spam message to be sent. Furthermore, the attacker sends meta-data like recipient list, subject list, and a list of URLs that are used to fill in variables in the template. The bots then construct an email based on the template and the meta-data, and send this email to the targets. As a result, the actual work of handling the SMTP communication is moved from the control server to the bots. Nowadays this technique is used by most large spam botnets, like Waledac, Bobax, Rustock, Cutwail, and a lot of the other major spam botnets as Joe Stewart explained in detail.

Since spammers nowadays use such a tactic, we can also collect spam mails in a more efficient way: Instead of waiting at the end-user's mailboxes or spamtraps for mail messages to arrive and then decide whether or not this is spam, we directly interact with the servers that are used to send spam messages. The basic idea is that we execute spambots, i.e., malicious software dedicated to sending spam emails, in a controlled (honeypot) environment and collect all email messages sent by the bots. This enables us to directly interfere with botnet control servers to collect current spam messages sent by a specific botnet.

We describe this idea in more detail in a short paper that was published at DIMVA'09. The paper is also available on this blog.

Abstract: With increasing security measures in network services, remote exploitation is getting harder. As a result, attackers concentrate on more reliable attack vectors like email: victims are infected using either malicious attachments or links leading to malicious websites. Therefore efficient filtering and blocking methods for spam messages are needed. Unfortunately, most spam filtering solutions proposed so far are reactive, they require a large amount of both ham and spam messages to efficiently generate rules to differentiate between both. In this paper, we introduce a more proactive approach that allows us to directly collect spam message by interacting with the spam botnet controllers. We are able to observe current spam runs and obtain a copy of latest spam messages in a fast and efficient way. Based on the collected information we are able to generate templates that represent a concise summary of a spam run. The collected data can then be used to improve current spam filtering techniques and develop new venues to efficiently filter mails.

"Bypassing Kernel Code Integrity Protection Mechanisms"

Saturday, July 18. 2009
A paper that we will publish next month at USENIX Security'09 is entitled "Return-Oriented Rootkits: Bypassing Kernel Code Integrity Protection Mechanisms". In return-oriented programming, an attacker re-uses existing code: he searches for short instruction sequences (typically only one instruction) which are followed by a RET. By cleverly chaining these sequences, an attacker can build a gadget that then performs an actual computation, e.g., the gadget adds two operands. By combining these gadgets, an attacker can then perform arbitrary computations. Return-oriented programming was popularized by Shacham (CCS'07 paper targeting Linux/x86, CCS'08 paper targeting Solaris/SPARC). In our paper we present a system to automatically find useful instructions, build gadgets, and then generate a return-oriented program for Windows as the target OS. In a case study, we show how this system can be used to implement a return-oriented rootkit, bypassing typical kernel code integrity mechanisms. The main insight here is that integrity mechanisms protect against injection of code - however, if the attacker re-uses existing code, these approaches typically fail.

Abstract: Protecting the kernel of an operating system against attacks, especially injection of malicious code, is an important factor for implementing secure operating systems. Several kernel integrity protection mechanism were proposed recently that all have a particular shortcoming: They cannot protect against attacks in which the attacker re-uses existing code within the kernel to perform malicious computations. In this paper, we present the design and implementation of a system that fully automates the process of constructing instruction sequences that can be used by an attacker for malicious computations. We evaluate the system on different commodity operating systems and show the portability and universality of our approach. Finally, we describe the implementation of a practical attack that can bypass existing kernel integrity protection mechanisms.

The paper contains all the details and the results of our experiments are also available. The main part of this work was performed by Ralf Hund, it was the topic of his thesis. Furthermore, Felix Freiling helped with the project. And the word cloud was generated with the help of Wordle