Automatic Analysis of Malware Behavior using Machine Learning

Over the last couple of years, several honeypot solutions have been developed that automatically "collect" malware samples. With these tools, it is possible to obtain copies of malware samples without any human interaction. As a result, we are able to collect a large number of malware samples per day, all of which then need to be analyzed. Several sandbox solutions were therefore developed that automate the analysis step by performing dynamic, behavior-based analysis. The result of the dynamic analysis is typically a report that summarizes the observed behavior. The next logical step is to use that information for malware classification and malware clustering: at the end of that process, we know which samples perform essentially the same kind of activity. We can then automatically find variants of well-known threats, identify new malware families, and reduce the manual effort needed to analyze the large number of incoming malware samples.

Over the last couple of months, we have worked on malware classification and malware clustering. The results are summarized in a technical report. In the report, we introduce a learning-based framework for automatic analysis of malware behavior. To apply this framework in practice, it suffices to collect a large number of malware samples and monitor their behavior using a sandbox environment. By embedding the observed behavior in a vector space whose dimensions reflect behavioral patterns, we are able to apply learning algorithms, such as clustering and classification, to the analysis of malware behavior. Both techniques are important for the automated processing of malware samples, and we show in several experiments that our techniques significantly improve on previous work in this area. For example, the concept of prototypes allows for efficient clustering and classification, while also enabling a security researcher to focus manual analysis on prototypes instead of all malware samples. Moreover, we introduce a technique to perform behavior-based analysis incrementally, which avoids the run-time and memory overhead inherent to previous approaches.
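To make the embedding and the prototype idea more concrete, here is a minimal sketch. It assumes that a sandbox report can be reduced to a sequence of event names, that n-grams of those events serve as the dimensions of the vector space, and that prototypes are picked greedily so that every sample lies within a fixed distance of one of them. The event names, the function names (`embed`, `distance`, `extract_prototypes`) and the threshold are illustrative assumptions, not details taken from the report.

```python
import math
from collections import Counter


def embed(events, n=2):
    """Map a sequence of behavior events to a unit-length sparse n-gram vector."""
    grams = Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def extract_prototypes(vectors, max_dist):
    """Greedy farthest-first selection (max_dist must be >= 0).

    Keeps adding the sample that is farthest from all current prototypes
    until every sample lies within max_dist of some prototype.
    """
    names = sorted(vectors)
    prototypes = [names[0]]
    while True:
        # Distance of each sample to its nearest prototype.
        nearest = {
            name: min(distance(vectors[name], vectors[p]) for p in prototypes)
            for name in names
        }
        farthest = max(names, key=lambda name: nearest[name])
        if nearest[farthest] <= max_dist:
            return prototypes
        prototypes.append(farthest)


# Hypothetical, hand-written "reports" (event names are made up).
reports = {
    "sample_a": ["open_file", "write_file", "create_process", "connect"],
    "sample_b": ["open_file", "write_file", "create_process", "connect"],
    "sample_c": ["query_registry", "set_registry", "delete_file"],
}
vectors = {name: embed(events) for name, events in reports.items()}
prototypes = extract_prototypes(vectors, max_dist=0.5)
```

In this toy data set, `sample_a` and `sample_b` behave identically and are represented by one prototype, so an analyst would only need to look at two samples instead of three.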

Abstract
Malicious software, so-called malware, poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts on the Internet are infected with malware in the form of computer viruses, Internet worms and Trojan horses. While obfuscation and polymorphism employed by malware largely impede detection at file level, the dynamic analysis of malware binaries during run-time provides an instrument for characterizing and defending against the threat of malicious software.
In this article, we propose a framework for automatic analysis of malware behavior using machine learning. The framework allows for automatically identifying novel classes of malware with similar behavior (clustering) and assigning unknown malware to these discovered classes (classification). Based on both clustering and classification, we propose an incremental approach for behavior-based analysis, capable of processing the behavior of thousands of malware binaries on a daily basis. The incremental analysis significantly reduces the run-time overhead of current analysis methods, while providing an accurate discovery and discrimination of novel malware variants.
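
The incremental idea of combining classification and clustering can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the method from the report: behavior vectors are plain dicts, a new sample is assigned to the nearest known prototype if it lies within a distance threshold, and a rejected sample simply becomes the prototype of a new class (a crude stand-in for a proper clustering step). The names `classify`, `process_batch` and the `class_N` labels are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def classify(vec, prototypes, max_dist):
    """Return the label of the nearest prototype within max_dist, else None."""
    best_label, best_dist = None, max_dist
    for label, proto in prototypes.items():
        d = distance(vec, proto)
        if d <= best_dist:
            best_label, best_dist = label, d
    return best_label


def process_batch(batch, prototypes, max_dist):
    """Process one batch of new samples against the known prototypes.

    Samples close to a known prototype are classified; all others open a
    new class. `prototypes` is updated in place, so only the prototypes,
    not every sample ever seen, are carried over to the next batch.
    """
    assignments = {}
    for name in sorted(batch):
        label = classify(batch[name], prototypes, max_dist)
        if label is None:
            label = f"class_{len(prototypes)}"
            prototypes[label] = batch[name]
        assignments[name] = label
    return assignments


# Hypothetical two-day run with toy behavior vectors.
known = {}
day1 = {"s1": {"x": 1.0}, "s2": {"x": 1.0}, "s3": {"y": 1.0}}
day2 = {"s4": {"y": 1.0}}
result_day1 = process_batch(day1, known, max_dist=0.5)
result_day2 = process_batch(day2, known, max_dist=0.5)
```

Because each batch is only compared against the stored prototypes, the work per day depends on the number of prototypes and new samples rather than on the full history of analyzed binaries.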

The full technical report is available at http://honeyblog.org/junkyard/paper/malheur-TR-2009.pd. This is joint work with Konrad Rieck, Philipp Trinius, and Carsten Willems. The word cloud was generated using http://www.wordle.net/.
