Alumni Research Award Recipient 2020: Qian Han, Computer Science

I am a final-year Ph.D. student in Computer Science working with Professor V.S. Subrahmanian, the Dartmouth College Distinguished Cybersecurity Professor. My research interests lie in cybersecurity and machine learning, with a focus on stable, automated systems for Android malware identification and analysis. 

The malware creation cycle can be described in four stages: (i) a hacker releases malware into the wild, (ii) the malware is eventually detected, (iii) a signature for the malware, and possibly a patch, is generated, and (iv) the hacker modifies the malware to bypass the signature, and the cycle starts again. This is why malware develops into a family composed of hundreds of variants, each of which needs to be identified. 

Developing signatures for identifying malware is expensive. A vendor such as Google therefore first checks whether an Android application is a variant of an established malware family; if so, existing signatures can be adapted and made more robust by malware analysts. Significant Android malware families include AsaCub, Bread, Hqwar, and Kingroot. 

Due to the large number of malware families, deciding whether an app belongs to a known family is a challenge. We review how the machine learning features associated with samples within an Android malware family evolve over time, and develop a prototype tool that can distinguish malware samples based on these features. We then gather data from separate sources concerning Android malware families and use the data to assess our findings. 
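As a rough sketch of what tracking feature evolution can look like (the input format, function name, and feature strings below are hypothetical illustrations, not the project's actual pipeline), one can compute how often each feature appears among a family's samples in each time window:

```python
from collections import defaultdict

def feature_prevalence_by_year(samples):
    """Fraction of a family's samples exhibiting each feature, per year.

    `samples` is a list of (year, feature_set) pairs; the feature strings
    used below (a permission, an API call) are purely illustrative.
    """
    counts = defaultdict(lambda: defaultdict(int))   # year -> feature -> count
    totals = defaultdict(int)                        # year -> number of samples
    for year, features in samples:
        totals[year] += 1
        for f in features:
            counts[year][f] += 1
    return {year: {f: c / totals[year] for f, c in feats.items()}
            for year, feats in counts.items()}

# Toy family in which one feature persists across variants and another is dropped.
family = [
    (2015, {"perm:SEND_SMS", "api:TelephonyManager.getDeviceId"}),
    (2016, {"perm:SEND_SMS", "api:TelephonyManager.getDeviceId"}),
    (2017, {"perm:SEND_SMS"}),
]
print(feature_prevalence_by_year(family))
```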

The next step is to establish a series of mathematical definitions for the stability of features associated with the Android applications in these families. Once we have verified our accuracy using cross-validation techniques, we build a prototype framework named Stability Analysis of Android Malware (SAAM). SAAM allows an analyst to analyze the stability of the features of a new Android application, compare these features with those of diverse established Android malware families, and prioritize the similarities, enabling the analyst to correctly and quickly recognize the family to which the new application belongs. 
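The formal definitions are not reproduced here, but as one plausible stand-in, a feature's stability within a family could be measured by how little its prevalence fluctuates across time windows. The following is a minimal sketch under that assumption, not SAAM's actual definition:

```python
from statistics import pstdev

def stability_score(prevalence_series):
    """Toy stability score for one feature within one family.

    `prevalence_series` holds the fraction of the family's samples exhibiting
    the feature in each time window.  A feature whose prevalence barely
    changes scores near 1; one that fluctuates scores lower.  This is an
    illustrative stand-in, not SAAM's formal definition.
    """
    if len(prevalence_series) < 2:
        return 1.0
    return 1.0 - pstdev(prevalence_series)

print(stability_score([0.95, 0.97, 0.93, 0.96]))  # stable feature, close to 1
print(stability_score([0.10, 0.85, 0.20, 0.90]))  # volatile feature, much lower
```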

For our analysis, we first need a consistent dataset of malware families. Our research group, the Dartmouth Security and AI Lab (DSAIL), includes three datasets in this project. The publicly available DREBIN dataset contains 100+ distinct families. The Koodous dataset contains malware families in which each malware sample is categorized into one or more specific families. Third, we have a list of features correlated with hashes of malware samples from Palo Alto Networks; since this dataset identifies samples by cryptographic hashes, we are able to retrieve the samples themselves from publicly accessible websites. To properly exploit these datasets, we need to extract the same set of features for every sample and establish a common feature set across all the datasets. 
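A minimal sketch of the alignment step, assuming each dataset has already been reduced to per-sample feature sets (the function name and input shape are hypothetical; the actual feature extraction is not shown):

```python
def build_common_feature_matrix(datasets):
    """Align several datasets onto a single shared feature vocabulary.

    `datasets` maps a dataset name (e.g., "DREBIN", "Koodous") to a dict of
    sample_id -> set of feature strings.  Returns the shared vocabulary and,
    per dataset, a binary feature vector for each sample.
    """
    vocabulary = sorted({f for samples in datasets.values()
                         for feats in samples.values()
                         for f in feats})
    index = {f: i for i, f in enumerate(vocabulary)}

    vectors = {}
    for name, samples in datasets.items():
        vectors[name] = {}
        for sample_id, feats in samples.items():
            row = [0] * len(vocabulary)
            for f in feats:
                row[index[f]] = 1
            vectors[name][sample_id] = row
    return vocabulary, vectors
```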

We conducted experiments with these various "stability" concepts on our malware family datasets to identify which notion of stability best differentiates whether a specific sample belongs to one family or another. We characterized particular families of malware by observing which features were most constant and which were most volatile (i.e., most likely to change). 
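Under the same illustrative input shape as above, picking out a family's most constant and most volatile features might look like this (a sketch only, not the evaluation used in the paper):

```python
from statistics import pstdev

def rank_by_volatility(feature_series):
    """Order a family's features from most volatile to most stable.

    `feature_series` maps each feature name to its per-year prevalence list
    (same illustrative shape as in the earlier sketches).
    """
    spread = {f: (pstdev(s) if len(s) > 1 else 0.0)
              for f, s in feature_series.items()}
    return sorted(spread, key=spread.get, reverse=True)
```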

We developed an experimental testbed called SAAM to compute, compare, and visualize the proposed stability metrics. The SAAM testbed comes pre-loaded with a set of demonstration Android malware family names, along with the feature vectors of the known Android applications belonging to those families. When a new Android application sample's feature vector is selected, SAAM computes and visualizes the proposed stability metrics, allowing users to compare and contrast the new sample's metrics with the stability properties and metrics of the pre-loaded malware families. SAAM additionally includes methods for predicting the family to which a specific sample belongs, or for reporting that the sample appears to belong to a new malware family.
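The prediction step could, in its simplest form, compare a new sample against a stored profile of each family and fall back to "new family" when nothing matches well. The sketch below uses plain Jaccard overlap and a made-up threshold; SAAM's actual prediction logic is richer than this:

```python
def predict_family(sample_features, family_profiles, threshold=0.5):
    """Assign a new sample to the closest known family, or flag a new one.

    `family_profiles` maps a family name to the set of features assumed to
    characterize that family (hypothetical input); similarity is Jaccard
    overlap, and `threshold` is an arbitrary illustrative cutoff.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    scores = {name: jaccard(sample_features, profile)
              for name, profile in family_profiles.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return "possibly a new malware family", scores
    return best, scores
```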

Through experimental evaluation, SAAM made the following four main contributions:
1) Investigated how malware samples from the same family evolve over time and how detection techniques perform at the feature level.
2) Defined the optimal partition, the stability score, and three different kinds of MG (malware over goodware) ratio on features over time in Android application families (a simple illustration follows below).
3) Conducted stability score analysis on 120 malware families from VirusTotal spanning 2012-2019 and 122 goodware families from the Google Play top free apps spanning 2012-2020, with an average of 60 samples per family.
4) Summarized the top stable and unstable features across all the collected Android application families for four kinds of features: Android API features, Android permission features, operation code features, and system command features.
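As a back-of-the-envelope illustration of the simplest such ratio (the three time-aware variants are not reproduced here, and the function below is only a hypothetical sketch), the MG ratio of a feature can be taken as its prevalence among malware samples divided by its prevalence among goodware samples:

```python
def mg_ratio(feature, malware_samples, goodware_samples, eps=1e-9):
    """Simplest form of a malware-over-goodware (MG) ratio for one feature.

    Both inputs are lists of per-sample feature sets.  A ratio well above 1
    suggests the feature is far more common in malware than in goodware.
    """
    m = sum(feature in s for s in malware_samples) / max(len(malware_samples), 1)
    g = sum(feature in s for s in goodware_samples) / max(len(goodware_samples), 1)
    return m / (g + eps)
```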

With generous support from the Alumni Research Award, I was able to improve my research in both depth and quality. After completing my Ph.D. in the coming spring, I will continue my research, exploring innovative techniques to quickly identify different types of malware and to rapidly deploy countermeasures on prevalent anti-virus engines.