Open Access | Journal Article | 10.22624/aims/csean-smart2023p4

Detection of Algorithmically Generated Domain Names using Ensemble Machine Learning Technique

- 11 Jul 2023

- Advances in Multidisciplinary and Scient...

- Vol. 2, Iss: 2, pp 27-34

TL;DR: Abdullahi et al. propose an ensemble machine learning technique for the detection and classification of algorithmically generated domain names (AGDNs) that combines four machine learning algorithms: Naïve Bayes, SVM, Random Forest and CART.

Abstract: Previously, cyber attackers used malware with hard-coded domain names, stored in the malware binaries, to communicate with command and control (C&C) servers and launch cyber-attacks on victim computers. Botnets and ransomware are among the most prevalent forms of these attacks. Once a system is infected with malware (either a botnet or ransomware), one of the most essential steps is to establish secure communication with the botmaster (i.e., the malware author) through a C&C server. However, with simple reverse engineering, cyber security experts could detect and block these domain names, preventing the malware from communicating with the C&C servers and from receiving further instructions from the botmaster. This led cyber criminals to develop the Domain Generation Algorithm (DGA) technique, which algorithmically generates thousands or more candidate domain names for communication with the C&C server, thereby obfuscating the domain names used by the malware and making them difficult for cyber security experts to detect or block. This paper therefore proposes an ensemble machine learning technique for the detection and classification of algorithmically generated domain names (AGDNs), leveraging the combined strength of four machine learning algorithms: Naïve Bayes, SVM, Random Forest and CART. The models were trained twice, first with 4 features and then with 10 features. To make effective use of the predictions, we used a voting-based ensemble approach in which the final classification is decided by the majority vote of the algorithms. The results show that the Naïve Bayes model outperformed all the other models, with an accuracy of 97.54% when trained with 10 features and 95.99% when trained with 4 features. Keywords: WSN, DDoS, Intrusion Detection System, Random Forest, Machine Learning.
Proceedings Citation Format: Abdullahi, S.M., Mohammed, A., Ibrahim, R.Y. & Shamsuddeen, A. (2023): Detection of Algorithmically Generated Domain Names using Ensemble Machine Learning Technique. Proceedings of the Cyber Secure Nigeria Conference. Nigerian Army Resource Centre (NARC) Abuja, Nigeria. 11-12th July, 2023. Pp 27-34. https://cybersecurenigeria.org/conference-proceedings/volume-2-2023/ dx.doi.org/10.22624/AIMS/CSEAN-SMART2023P2.
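The voting step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the labels `"dga"` and `"benign"`, and the tie-breaking rule are all assumptions. With four voters a 2-2 split is possible, and the text above does not say how such ties are resolved, so the sketch falls back to a configurable label.

```python
from collections import Counter


def majority_vote(predictions, tie_label="dga"):
    """Return the label chosen by most classifiers.

    predictions: one predicted label per classifier.
    tie_label: label returned when the top two labels receive the same
    number of votes (an assumption; the source does not specify this).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Hypothetical per-model predictions for a single domain name.
votes = {
    "naive_bayes": "dga",
    "svm": "benign",
    "random_forest": "dga",
    "cart": "dga",
}
print(majority_vote(list(votes.values())))  # prints: dga
```

Defaulting ties to the malicious label is a conservative choice made here for illustration only; a deployment could equally weight votes by each model's validation accuracy.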

Most frequently asked questions

1. How do cyber criminals use DGA technique?

Cyber criminals use the Domain Generation Algorithm (DGA) technique to generate a large number of pseudo-random malicious domain names within a short period of time. They then use one of these domain names to resolve, through the Domain Name System (DNS), the address of the Command and Control (C&C) server and establish secure communication with the attacker. Once this communication is established, the malware exchanges data and instructions with the attacker. The attacker then seizes complete control of the compromised system and spreads malware (either a botnet or ransomware). The compromised system or network is then used to target one or more computers within the network in order to steal confidential data, disable or hijack systems, or launch further attacks. These attacks can include distributed denial-of-service attacks, man-in-the-middle attacks, phishing attacks, SQL injection attacks, etc.
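To make the idea concrete, the sketch below shows why hard-coded blocklists struggle against this technique: any two parties that share a seed and a date can derive the same list of candidate names independently. It is a toy for illustration only and does not reproduce any real malware family. The function name `toy_dga`, the seed string, the hashing scheme, and the use of the reserved `.example` top-level domain are all assumptions.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, length=12):
    """Derive `count` pseudo-random domain names from a shared seed and date.

    Each name is built from a SHA-256 digest of the seed, the ISO date and a
    counter, so the output is deterministic for a given (seed, day) pair.
    """
    domains = []
    for i in range(count):
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) to a letter in the range a-p.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        # `.example` is a reserved TLD, so these names never resolve.
        domains.append(label + ".example")
    return domains


# Both sides compute the same candidates for the same day.
for name in toy_dga("demo-seed", date(2023, 7, 11)):
    print(name)
```

Because the list changes with the date, a defender who blocks one day's names gains little, which is the motivation for classifying names by their lexical properties instead.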

2. What is the approach used in DGA detection?

DGA detection here uses an ensemble machine learning approach that covers both botnet and ransomware DGA malware. It uses a different training dataset, the Cisco Umbrella top 1 million most visited domain names, together with attributes extracted from the domain name data itself. The methodology follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) model. Ten features are extracted from the domain name data, including length, numbers, and special characters. Four machine learning models are deployed: Naive Bayes, support vector machine, random forest, and classification and regression tree (CART). The models are trained with 4 and with 10 features using 300,000 randomly selected training samples. They are evaluated using 80% of the data for training and 20% for validation, with 10-fold cross-validation for efficiency. The speed-accuracy trade-off, which balances execution time against prediction accuracy, is considered in choosing the best-fit model.
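The three features named above (length, numbers, special characters) can be computed directly from the domain string, as in the sketch below. The function name `lexical_features`, the dictionary keys, and the decision to count dots and hyphens as special characters are assumptions; the remaining seven features are not listed in the text above and are deliberately left out.

```python
def lexical_features(domain):
    """Compute three simple lexical features of a domain name.

    Assumption: a "special character" is any non-alphanumeric character,
    which includes the dots between labels and hyphens.
    """
    name = domain.lower().rstrip(".")  # normalise case and drop a trailing dot
    return {
        "length": len(name),
        "digit_count": sum(ch.isdigit() for ch in name),
        "special_count": sum(not ch.isalnum() for ch in name),
    }


print(lexical_features("example.com"))
# prints: {'length': 11, 'digit_count': 0, 'special_count': 1}
```

Each dictionary would become one row of the feature matrix passed to the four classifiers.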

3. Which machine learning model performed best in terms of accuracy and execution time?

The Naive Bayes model performed best in terms of accuracy and execution time, with an accuracy level of more than 90% and no malicious domain names wrongly classified as benign. This model is recommended for deployment due to its speed and accuracy. Future work could involve using live domain names data as training datasets to further improve the model's performance.
