No Arabic abstract
Numerous metamorphic and polymorphic malicious variants are generated automatically on a daily basis by mutation engines that transform the code of a malicious program while retaining its functionality, in order to evade signature-based detection. These automatic processes have greatly increased the number of malware variants, deeming their fully-manual analysis impossible. Malware classification is the task of determining to which family a new malicious variant belongs. Variants of the same malware family show similar behavioral patterns. Thus, classifying newly discovered malicious programs and applications helps assess the risks they pose. Moreover, malware classification facilitates determining which of the newly discovered variants should undergo manual analysis by a security expert, in order to determine whether they belong to a new family (e.g., one whose members exploit a zero-day vulnerability) or are simply the result of a concept drift within a known malicious family. This motivated intense research in recent years on devising high-accuracy automatic tools for malware classification. In this work, we present DAEMON - a novel dataset-agnostic malware classifier. A key property of DAEMON is that the type of features it uses and the manner in which they are mined facilitate understanding the distinctive behavior of malware families, making its classification decisions explainable. Weve optimized DAEMON using a large-scale dataset of x86 binaries, belonging to a mix of several malware families targeting computers running Windows. We then re-trained it and applied it, without any algorithmic change, feature re-engineering or parameter tuning, to two other large-scale datasets of malicious Android applications consisting of numerous malware families. DAEMON obtained highly accurate classification results on all datasets, establishing that it is also platform-agnostic.
Dynamic malware analysis executes the program in an isolated environment and monitors its run-time behaviour (e.g. system API calls) for malware detection. This technique has been proven to be effective against various code obfuscation techniques and newly released (zero-day) malware. However, existing works typically only consider the API name while ignoring the arguments, or require complex feature engineering operations and expert knowledge to process the arguments. In this paper, we propose a novel and low-cost feature extraction approach, and an effective deep neural network architecture for accurate and fast malware detection. Specifically, the feature representation approach utilizes a feature hashing trick to encode the API call arguments associated with the API name. The deep neural network architecture applies multiple Gated-CNNs (convolutional neural networks) to transform the extracted features of each API call. The outputs are further processed through bidirectional LSTM (long-short term memory networks) to learn the sequential correlation among API calls. Experiments show that our solution outperforms baselines significantly on a large real dataset. Valuable insights about feature engineering and architecture design are derived from the ablation study.
Static malware analysis is well-suited to endpoint anti-virus systems as it can be conducted quickly by examining the features of an executable piece of code and matching it to previously observed malicious code. However, static code analysis can be vulnerable to code obfuscation techniques. Behavioural data collected during file execution is more difficult to obfuscate, but takes a relatively long time to capture - typically up to 5 minutes, meaning the malicious payload has likely already been delivered by the time it is detected. In this paper we investigate the possibility of predicting whether or not an executable is malicious based on a short snapshot of behavioural data. We find that an ensemble of recurrent neural networks are able to predict whether an executable is malicious or benign within the first 5 seconds of execution with 94% accuracy. This is the first time general types of malicious file have been predicted to be malicious during execution rather than using a complete activity log file post-execution, and enables cyber security endpoint protection to be advanced to use behavioural data for blocking malicious payloads rather than detecting them post-execution and having to repair the damage.
Malware is a piece of software that was written with the intent of doing harm to data, devices, or people. Since a number of new malware variants can be generated by reusing codes, malware attacks can be easily launched and thus become common in recent years, incurring huge losses in businesses, governments, financial institutes, health providers, etc. To defeat these attacks, malware classification is employed, which plays an essential role in anti-virus products. However, existing works that employ either static analysis or dynamic analysis have major weaknesses in complicated reverse engineering and time-consuming tasks. In this paper, we propose a visualized malware classification framework called VisMal, which provides highly efficient categorization with acceptable accuracy. VisMal converts malware samples into images and then applies a contrast-limited adaptive histogram equalization algorithm to enhance the similarity between malware image regions in the same family. We provided a proof-of-concept implementation and carried out an extensive evaluation to verify the performance of our framework. The evaluation results indicate that VisMal can classify a malware sample within 5.2ms and have an average accuracy of 96.0%. Moreover, VisMal provides security engineers with a simple visualization approach to further validate its performance.
Malware detection plays a vital role in computer security. Modern machine learning approaches have been centered around domain knowledge for extracting malicious features. However, many potential features can be used, and it is time consuming and difficult to manually identify the best features, especially given the diverse nature of malware. In this paper, we propose Neurlux, a neural network for malware detection. Neurlux does not rely on any feature engineering, rather it learns automatically from dynamic analysis reports that detail behavioral information. Our model borrows ideas from the field of document classification, using word sequences present in the reports to predict if a report is from a malicious binary or not. We investigate the learned features of our model and show which components of the reports it tends to give the highest importance. Then, we evaluate our approach on two different datasets and report formats, showing that Neurlux improves on the state of the art and can effectively learn from the dynamic analysis reports. Furthermore, we show that our approach is portable to other malware analysis environments and generalizes to different datasets.
For the dramatic increase of Android malware and low efficiency of manual check process, deep learning methods started to be an auxiliary means for Android malware detection these years. However, these models are highly dependent on the quality of datasets, and perform unsatisfactory results when the quality of training data is not good enough. In the real world, the quality of datasets without manually check cannot be guaranteed, even Google Play may contain malicious applications, which will cause the trained model failure. To address the challenge, we propose a robust Android malware detection approach based on selective ensemble learning, trying to provide an effective solution not that limited to the quality of datasets. The proposed model utilizes genetic algorithm to help find the best combination of the component learners and improve robustness of the model. Our results show that the proposed approach achieves a more robust performance than other approaches in the same area.