Malware Classification

06 March 2018

Malware Classification using Apache Spark and Deep Learning

Area: NLP, Big Data; Technologies: Spark, GCP.

The Microsoft Malware Classification Challenge, introduced on Kaggle in 2015, provides nearly half a terabyte of uncompressed data to be classified into 9 different categories. The dataset is made up of around 8500 byte files containing only hexadecimal codes, making it almost impossible for humans to interpret. The challenge for participants is to create a model that can classify around 2700 unlabeled files. To help with development, a subset of the data is available for testing code on a local machine.
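To make the "hexadecimal codes as text" framing concrete, here is a minimal Python sketch of turning byte-file lines into token counts, the bag-of-words style input that text classifiers typically consume. It assumes each line starts with an address field followed by two-character hex tokens, with `??` marking unknown bytes. The function names `hex_tokens` and `byte_histogram` and the sample lines are hypothetical and are not taken from the project code or the real dataset.

```python
from collections import Counter

def hex_tokens(line):
    """Return the two-character tokens on one line, skipping the leading address."""
    parts = line.split()
    # Assumed layout: the first field is an address, the rest are byte tokens.
    return [tok.upper() for tok in parts[1:] if len(tok) == 2]

def byte_histogram(lines):
    """Count how often each token occurs across all lines of a file."""
    counts = Counter()
    for line in lines:
        counts.update(hex_tokens(line))
    return counts

# Hypothetical sample content, made up for illustration.
sample = [
    "00401000 56 8D 44 24 08 50 8B F1",
    "00401008 E8 1C 1B 00 00 C7 06 08",
    "00401010 ?? ?? 00 00",
]
hist = byte_histogram(sample)
print(hist.most_common(3))
```

Treating each byte as a "word" is what lets standard document-classification techniques apply to binary content. At the scale described above, the same counting step would be distributed across a cluster rather than run in a single process.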

To solve such a large Natural Language Processing (NLP) problem, this project uses Apache Spark, a popular distributed computing framework, to train machine learning models. Classical machine learning techniques, including Naive Bayes, Random Forest, and Logistic Regression, are evaluated in search of the best solution. Additionally, a deep learning-based approach is implemented using PyTorch. A Dataproc cluster on Google Cloud Platform (GCP) was used to train both the distributed and deep learning models. The project was completed by a team of 3.
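As an illustration of one of the classical techniques named above, the following is a small single-machine sketch of multinomial Naive Bayes with Laplace smoothing over byte tokens. It is not the project's Spark implementation: it is plain Python, and the function names, the `alpha` parameter, and the toy documents and labels are all hypothetical, chosen only to show the idea.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_docs = Counter(labels)          # number of documents per class
    token_counts = defaultdict(Counter)   # token frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)  # Laplace-smoothed denominator
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                t: math.log((token_counts[label][t] + alpha) / denom)
                for t in vocab
            },
        }
    return model

def predict(model, tokens):
    """Return the label with the highest log-posterior; unseen tokens are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][t]
            for t in tokens
            if t in params["log_likelihood"]
        )
    return max(model, key=score)

# Hypothetical toy data: two made-up classes with made-up byte tokens.
docs = [["00", "00", "FF"], ["00", "CC", "00"],
        ["E8", "8B", "E8"], ["8B", "E8", "FF"]]
labels = [1, 1, 2, 2]
model = train_naive_bayes(docs, labels)
print(predict(model, ["00", "00", "CC"]))
print(predict(model, ["E8", "8B"]))
```

Working in log space avoids numeric underflow when a file contributes many tokens, which matters for large inputs like the byte files described here.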