Improvement of Malicious Software Detection Accuracy through Genetic Programming Symbolic Classifier with Application of Dataset Oversampling Techniques
Created by W.Langdon from
gp-bibliography.bib Revision:1.8051
@Article{Andelic:2023:Computers,
author = "Nikola Andelic and Sandi {Baressi Segota} and
Zlatan Car",
title = "Improvement of Malicious Software Detection Accuracy
through Genetic Programming Symbolic Classifier with
Application of Dataset Oversampling Techniques",
journal = "Computers",
year = "2023",
volume = "12",
number = "12",
pages = "242",
email = "nikola.andjelic@uniri.hr",
keywords = "genetic algorithms, genetic programming, genetic
programming symbolic classifier, 5-fold
cross-validation, malware software detection,
oversampling techniques, random hyperparameter value
search method",
ISSN = "2073-431X",
URL = "https://www.mdpi.com/2073-431X/12/12/242",
DOI = "doi:10.3390/computers12120242",
size = "19 pages",
abstract = "Malware detection using hybrid features, combining
binary and hexadecimal analysis with DLL calls, is
crucial for leveraging the strengths of both static and
dynamic analysis methods. Artificial intelligence (AI)
enhances this process by enabling automated pattern
recognition, anomaly detection, and continuous
learning, allowing security systems to adapt to
evolving threats and identify complex, polymorphic
malware that may exhibit varied behaviors. This synergy
of hybrid features with AI empowers malware detection
systems to efficiently and proactively identify and
respond to sophisticated cyber threats in real time. In
this paper, the genetic programming symbolic classifier
(GPSC) algorithm was applied to the publicly available
dataset to obtain symbolic expressions (SEs) that could
detect the malware software with high classification
performance. The initial problem with the dataset was a
high imbalance between class samples, so various
oversampling techniques were used to obtain balanced
dataset variations on which GPSC was applied. To find
the optimal combination of GPSC hyperparameter values,
the random hyperparameter value search method (RHVS)
was developed and applied to obtain SEs with high
classification accuracy. The GPSC was trained with
five-fold cross-validation (5FCV) to obtain a robust
set of SEs on each dataset variation. To choose the
best SEs, several evaluation metrics were used, i.e.,
the length and depth of SEs, accuracy score (ACC), area
under receiver operating characteristic curve (AUC),
precision, recall, f1-score, and confusion matrix. The
best-obtained SEs are applied on the original
imbalanced dataset to see if the classification
performance is the same as it was on balanced dataset
variations. The results of the investigation showed
that the proposed method generated SEs with high
classification accuracy (0.9962) in malware software
detection.",
notes = "Faculty of Engineering, University of Rijeka,
Vukovarska 58, 51000 Rijeka, Croatia",
}