Internet of Things Malware Dataset – 2018

Introduction:

Threats from malware are not new, although malware or cyber threat hunting remains an ongoing challenge. For example, with the increasing popularity of Internet of Things (IoT) devices and the general lack of security protection for such devices, IoT devices can be vulnerable to malware attacks. Existing machine learning-based IoT malware hunting approaches have focused on energy consumption patterns and OpCodes. This is not surprising, as system calls and OpCodes are two common features in malware hunting. Additionally, in recent years, deep learning methods have also been used in malware analysis and detection. However, at the time this dataset was developed, no existing work had applied deep learning to IoT malware detection.

Since the majority of Unix System-V IoT devices use ARM processors, the benign samples collected for this dataset are applications compatible with the Raspberry Pi II, taken from the Linux Debian package repository ("Linux Packages Search – https://pkgs.org/"). ARM processors have been widely used in cloud edge devices, and the Raspberry Pi II can also be considered an IoT cloud edge device.

Dataset Details:

This dataset includes samples for the Arm Cortex-M processor family, which is one of the leaders in the microcontroller market. The Cortex-R processor family is typically used in specialized controllers such as those in hard disk drives.

The malware samples were collected by searching for available 32-bit ARM-based malware in the VirusTotal Threat Intelligence platform as of September 30th, 2017.

The collected dataset consisted of 280 malware and 271 benign files. All files were unpacked using the Debian installer bundle, and the Object-Dump tool (objdump) was then used to disassemble all samples.

A Linux bash script was written to extract the OpCodes of the dataset samples:

  1. The script extracted each Debian package file (deb file).
  2. The script searched for ELF files among the extracted materials.
  3. The script fed the ELF files to the object-dump tool to disassemble them.
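
The three steps above can be sketched as follows. This is a minimal Python illustration, not the original bash script, which is not reproduced here. It assumes that `dpkg-deb` and an `objdump` able to read ARM binaries are installed; the function names `is_elf`, `find_elf_files` and `disassemble_deb` are hypothetical, and on a non-ARM host the `objdump` argument would likely need to name a cross-toolchain binary.

```python
import os
import subprocess
import tempfile

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def is_elf(path):
    """Return True if the file at `path` starts with the ELF magic bytes."""
    try:
        with open(path, "rb") as fh:
            return fh.read(4) == ELF_MAGIC
    except OSError:
        return False


def find_elf_files(root):
    """Step 2: walk an extracted package tree and collect regular ELF files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so the same binary is not listed twice.
            if os.path.isfile(path) and not os.path.islink(path) and is_elf(path):
                found.append(path)
    return sorted(found)


def disassemble_deb(deb_path, objdump="objdump"):
    """Steps 1 and 3: unpack a .deb and disassemble each ELF file inside it.

    Returns a dict mapping the path inside the package to the objdump text.
    The tool names are assumptions about the environment, not the exact
    commands used to build the dataset.
    """
    results = {}
    with tempfile.TemporaryDirectory() as workdir:
        # Step 1: extract the package's file tree into a scratch directory.
        subprocess.run(["dpkg-deb", "-x", deb_path, workdir], check=True)
        for elf in find_elf_files(workdir):
            # Step 3: `objdump -d` disassembles the executable sections.
            proc = subprocess.run(
                [objdump, "-d", elf],
                capture_output=True, text=True, errors="replace", check=True,
            )
            results[os.path.relpath(elf, workdir)] = proc.stdout
    return results
```

Checking the magic bytes rather than the file extension is a deliberate choice here, since executables inside Debian packages usually carry no extension.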

The disassembled code was then pruned to extract the sequence of OpCodes in each sample.
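
As a rough illustration of this pruning step, the Python sketch below reduces `objdump -d` text to a list of mnemonics. It assumes the usual tab-separated objdump layout (address, raw bytes, instruction); the function name `extract_opcodes` is hypothetical, and the rule of skipping assembler directives such as `.word` is an assumption rather than the documented behaviour of the original script.

```python
import re

# An instruction line starts with a hexadecimal address followed by a colon.
_ADDRESS = re.compile(r"^\s*[0-9a-fA-F]+:$")


def extract_opcodes(disassembly):
    """Return the sequence of instruction mnemonics found in objdump output."""
    opcodes = []
    for line in disassembly.splitlines():
        fields = line.split("\t")
        # Expect at least: address, raw bytes, instruction text.
        if len(fields) < 3 or not _ADDRESS.match(fields[0]):
            continue  # section headers, symbol labels, blank lines
        tokens = fields[2].split()
        if not tokens:
            continue
        mnemonic = tokens[0]
        if mnemonic.startswith("."):
            continue  # data directives such as .word are not instructions
        opcodes.append(mnemonic)
    return opcodes


if __name__ == "__main__":
    sample = (
        "00010400 <main>:\n"
        "   10400:\te92d4800 \tpush\t{fp, lr}\n"
        "   10404:\te28db004 \tadd\tfp, sp, #4\n"
        "   10408:\te12fff1e \tbx\tlr\n"
    )
    print(extract_opcodes(sample))
```

Operands are discarded on purpose, so each sample is represented only by the order of its OpCodes.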

Among these types of microprocessors, Cortex-A has the largest instruction set (OpCodes). Since Raspberry Pi II devices are based on Cortex-A, the complete set of OpCodes obtained will increase the detection rate (in comparison to, say, the Cortex-M family, which does not provide the memory management instruction set).

Acknowledgements:

We thank VirusTotal for graciously providing us with a private API key to access their data to prepare our dataset. This work is partially supported by the European Council International Incoming Fellowship, Belgium (FP7-PEOPLE-2013-IIF) grant and the last author is supported by the Cloud Technology Endowed Professorship, USA.

Citation:

Plain Text:

Hamed HaddadPajouh, Ali Dehghantanha, Raouf Khayami, Kim-Kwang Raymond Choo, A deep Recurrent Neural Network based approach for Internet of Things malware threat hunting, Future Generation Computer Systems, Volume 85, 2018, Pages 88-96, ISSN 0167-739X, https://doi.org/10.1016/j.future.2018.03.007.

BibTeX:

@article{HADDADPAJOUH201888,
  title = {A deep Recurrent Neural Network based approach for Internet of Things malware threat hunting},
  author = {Hamed HaddadPajouh and Ali Dehghantanha and Raouf Khayami and Kim-Kwang Raymond Choo},
  journal = {Future Generation Computer Systems},
  volume = {85},
  pages = {88-96},
  year = {2018},
  issn = {0167-739X},
  doi = {https://doi.org/10.1016/j.future.2018.03.007},
  url = {https://www.sciencedirect.com/science/article/pii/S0167739X1732486X}
}

Download dataset: https://github.com/CyberScienceLab/Our-Datasets/tree/master/IoT/OpCode/OpCode