BALab seminars — A systematic review of datasets for intrusion detection systems

Abstract

Intrusion Detection Systems (IDSs) based on Machine Learning (ML) techniques are essential for cybersecurity; they rely on datasets to learn to detect and mitigate malicious network or system activity. Their efficacy depends on datasets that are both extensive and representative of actual cyber threats, which in turn allows accurate and robust evaluation of system performance. This paper reports on a systematic literature review (SLR) that examines the landscape of datasets for IDS training and evaluation. The SLR explores the characteristics, creation methods, and limitations of these datasets, and identifies challenges in their generation and use. It highlights the variance in dataset quality, the slow pace of development, and the ongoing shortage of datasets in the intrusion detection domain. Because cyber threats evolve continuously, these datasets must be reassessed regularly to remain relevant and effective. Our SLR aggregates and analyzes existing research to clarify the strengths and weaknesses of current datasets and to determine their suitability for different IDS contexts. The objective is to describe the present state of IDS datasets, suggest measures for their improvement, and propose future research directions. We emphasize the need for standardized, comprehensive, and ethically sound datasets that reflect the evolving threat landscape. Future research should focus on data augmentation to cover a broader spectrum of attack scenarios, improving the robustness of IDSs against diverse threats. Additionally, fostering open-source datasets and cross-sector collaboration is crucial for integrating practical, real-world cybersecurity challenges into academic research, thereby democratizing access and promoting innovation in IDS development.