nyit-malware

Is It Flappy Bird, or Is It a Trojan Horse?

Also Known As

My 2016 Summer Internship

By Monica Kumaran


The Problem

During my summer research experience, I was given a dataset of Android app *.apk files with binary true/false labels indicating whether each app is malware. The goal was to figure out how to build the best classifier from that data.

It was a very open-ended problem. I ended up focusing on a) a data-science perspective of which markers signify malware and how it all fits together semantically, and b) how to build a classifier that improves on the projects I read about.

Learning More about the Domain Area

Two approaches in the academic literature stood out:

One was the 2014 DREBIN program. By 2016, DREBIN was the gold standard in this domain. DREBIN performs a broad static analysis of the app directly on the device, detecting malware with a 94% true-positive rate. The false-positive rate was 1%; the false-negative rate was not reported. It far outperformed commercial virus scanners, which did not specialize in Android app malware. The only drawback was that it took about 10 seconds to check a newly downloaded Android app before making its prediction.

I modeled my program after their approach. DREBIN extracts features from the Android app manifest file and from the disassembled app bytecode, then runs the feature data through a linear support vector machine.
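To make that pipeline concrete, here is a minimal sketch on synthetic stand-in data; the shapes, parameters, and data are placeholders, not DREBIN's actual feature set.

```python
# Minimal sketch of a DREBIN-style pipeline on synthetic stand-in data:
# each app is a binary feature vector (one column per extracted feature)
# fed into a linear SVM. Shapes and parameters here are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 500))  # 1000 apps x 500 binary features
y = rng.integers(0, 2, size=1000)         # 1 = malware, 0 = benign

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LinearSVC(C=1.0, max_iter=10000)  # linear kernel, as in DREBIN
clf.fit(X_train, y_train)
print(f"validation accuracy: {clf.score(X_val, y_val):.3f}")
```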

Another interesting program was ICCDetector, which used fewer features than DREBIN, all drawn from the same type of data. It also used a support vector machine, but it specialized in inter-app communication. Inter-app communication allowed the detector to find apps that take advantage of phone events like the phone restarting; an event like the phone restarting could be used by malware to gain root access to the phone on startup, for example.

My Approach

Support vector machines worked well for these types of problems. I wanted to include features from the Android app manifest file. The manifest is a single file with straightforward parsing. All of the known virus families request permissions through the manifest, such as permission to access photos or hardware like the camera. I could also capture the intents used for inter-app communication by extracting them from the same file. The entire approach became stripping away features that added to runtime and measuring the impact on accuracy.
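As a sketch of what that extraction looks like (the manifest inside an *.apk is binary XML, so this assumes it has already been decoded to plain XML, e.g. with apktool; the path and function name are hypothetical):

```python
# Sketch of pulling requested permissions and intent actions out of a
# decoded AndroidManifest.xml. The manifest inside an .apk is binary XML,
# so this assumes it was first decoded to plain XML (e.g. with apktool);
# the path and function name are hypothetical.
import xml.etree.ElementTree as ET

ANDROID_NAME = "{http://schemas.android.com/apk/res/android}name"

def extract_manifest_features(path):
    root = ET.parse(path).getroot()
    permissions = {
        elem.get(ANDROID_NAME) for elem in root.iter("uses-permission")
    }
    intent_actions = {
        action.get(ANDROID_NAME)
        for intent_filter in root.iter("intent-filter")
        for action in intent_filter.iter("action")
    }
    return permissions, intent_actions

perms, intents = extract_manifest_features("decoded/AndroidManifest.xml")
print(sorted(perms))    # e.g. android.permission.CAMERA
print(sorted(intents))  # e.g. android.intent.action.BOOT_COMPLETED
```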

Semantic Findings

The initial dataset had columns for all possible requested intents and permission keywords from the app manifest file. Each app was a data point with an array of binary values indicating whether its manifest contained that feature: an indicator matrix of 1's and 0's.
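A minimal sketch of building that matrix, assuming per-app feature sets have already been extracted; the app names and the "perm:"/"intent:" column prefixes are hypothetical conventions for these examples:

```python
# Sketch of building the indicator matrix: one row per app, one column per
# manifest keyword, 1 if that app's manifest contains the feature. The app
# names and the "perm:"/"intent:" column prefixes are hypothetical.
import pandas as pd

app_features = {
    "flappy_clone.apk": {"perm:android.permission.INTERNET",
                         "intent:android.intent.action.BOOT_COMPLETED"},
    "calculator.apk":   {"perm:android.permission.INTERNET"},
}

vocabulary = sorted(set().union(*app_features.values()))
matrix = pd.DataFrame(
    [[int(f in feats) for f in vocabulary] for feats in app_features.values()],
    index=list(app_features),
    columns=vocabulary,
)
print(matrix)
```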

[Figure: Highest Validation Accuracy over Different Data Sources]

I checked the different types of features from the manifest-file dataset to build intuition. The graph shows that requested permissions are stronger indicators than inter-app communication intents: the validation accuracy of a model trained only on intents is above random chance, but not by much. The model trained on the combined dataset had higher accuracy than the one trained on permissions alone. My gut assumption is that only a few intent filters matter as features and the rest are irrelevant. The accuracy bump was 1.2%, which, projected onto the complete 1,000-app dataset, is equivalent to 12 more apps correctly classified.
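A sketch of that comparison on synthetic stand-in data, reusing the hypothetical column-prefix convention:

```python
# Sketch of the data-source comparison on synthetic stand-in data: the same
# model trained on permissions only, intents only, and both. Column prefixes
# follow the hypothetical "perm:"/"intent:" convention from the sketch above.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
cols = [f"perm:{i}" for i in range(50)] + [f"intent:{i}" for i in range(30)]
matrix = pd.DataFrame(rng.integers(0, 2, size=(1000, 80)), columns=cols)
labels = rng.integers(0, 2, size=1000)  # stand-in malware labels

perm_cols = [c for c in matrix.columns if c.startswith("perm:")]
intent_cols = [c for c in matrix.columns if c.startswith("intent:")]

for name, subset in [("permissions", perm_cols),
                     ("intents", intent_cols),
                     ("combined", perm_cols + intent_cols)]:
    scores = cross_val_score(LinearSVC(max_iter=10000),
                             matrix[subset], labels, cv=5)
    print(f"{name:12s} mean validation accuracy: {scores.mean():.3f}")
```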

I no longer have the dataset, but I suspect that if I isolated those 12 apps and created a histogram of their requested intent filters, a few filters would be common among the malicious apps. I could use that information to drop features that don't affect the classification, saving time and memory by reducing the feature space.
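Continuing from the previous sketch, that follow-up analysis might look like this (the row indices standing in for those 12 apps are hypothetical):

```python
# Continuing from the previous sketch: histogram the intent filters requested
# by the apps only the combined model gets right. The row indices standing in
# for those ~12 apps are hypothetical.
from collections import Counter

newly_correct = [17, 242, 511]  # hypothetical row indices into `matrix`

counts = Counter()
for app in newly_correct:
    row = matrix.loc[app]
    counts.update(c for c in intent_cols if row[c] == 1)

for intent, n in counts.most_common(10):
    print(f"{n:3d}  {intent}")
```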

Comparing Different Machine Learning Algorithms

[Figure: Validation Accuracy of Models Based on Different Algorithms on the Same Data]

A support vector machine with a cubic (degree-3 polynomial) kernel separating the two classes had the highest validation accuracy.
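A sketch of that comparison on stand-in data; the exact lineup of competing algorithms below is illustrative rather than the precise set I tested:

```python
# Sketch of the algorithm comparison on stand-in data, including an SVM with
# a cubic (degree-3 polynomial) kernel. The lineup of competing models here
# is illustrative, not the exact set tested.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 80))
y = rng.integers(0, 2, size=1000)

models = {
    "linear SVM": LinearSVC(max_iter=10000),
    "cubic SVM": SVC(kernel="poly", degree=3),
    "decision tree": DecisionTreeClassifier(),
    "k-NN": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:14s} mean validation accuracy: {scores.mean():.3f}")
```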

[Figure: Confusion Matrix of the Model with the Highest Validation Accuracy]

Looking further into the predictions of the cubic SVM, the false-negative rate is 10.8%. That is still close to the overall error rate of 8.3%, and I believe in commercial use it would be better to be risk-averse and accept the higher Type II error.
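For reference, here is how those two rates fall out of a confusion matrix; the counts are hypothetical, chosen only to reproduce the quoted figures:

```python
# How those rates fall out of a confusion matrix. The counts are hypothetical,
# chosen only to reproduce the quoted 10.8% false-negative rate and 8.3%
# overall error rate on a 1000-app dataset.
import numpy as np

# rows = actual (benign, malware); cols = predicted (benign, malware)
cm = np.array([[471, 29],    # benign:  471 correct, 29 false positives
               [54, 446]])   # malware: 54 false negatives, 446 correct

fn_rate = cm[1, 0] / cm[1].sum()            # misses / all actual malware
error = (cm[0, 1] + cm[1, 0]) / cm.sum()    # all mistakes / all apps
print(f"false negative rate: {fn_rate:.1%}")  # -> 10.8%
print(f"overall error rate:  {error:.1%}")    # -> 8.3%
```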

Impact

It is possible to train a model on the Android app manifest file alone and still achieve high accuracy. This is more robust than the previous approaches from the literature: because it never looks at the parts of the app that can be obfuscated, this approach is immune to code obfuscation, which could fool the other two. The runtime is also lower because it skips the second analysis step needed after disassembling the *.apk file; it only requires parsing one *.xml file per app, which across the dataset is O(N) in the number of apps.

To answer the title question: will this ML model be able to tell whether that Flappy Bird knock-off is a secret deep-web scam? Nine times out of ten, yes.

Helpful Resources

This article (link stable as of 8/9/21) was useful for keeping track of the Android app file structure.

If you found this write-up useful for a similar project, you can fork the repo or look at its scripts as a feature-extraction reference. Make sure to star it if it helps!