Unbalanced breast cancer data classification using novel fitness functions in genetic programming
Introduction
Our body is composed of many millions of tiny cells, each a self-contained living unit. Normally, each cell coordinates with the others that compose the tissues and organs of our body. One way this coordination shows is in how our cells reproduce. Normal cells grow and divide for a period of time and then stop; thereafter, they reproduce only as necessary to replace defective or dying cells. Cancer occurs when this reproduction process goes out of control: some of the body’s cells begin to divide without stopping and spread into surrounding tissues. In other words, cancer is a disease characterized by uncontrolled, uncoordinated and undesirable cell division (Gatenby & Brown, 2017). Unlike normal cells, cancer cells continue to grow and divide throughout their lives, replicating into more and more harmful cells. As cancer cells divide and replicate, they often form a clump, a solid mass of tissue known as a tumor (Aravanis, Lee, & Klausner, 2017). Tumors cause many of the symptoms of cancer by pressing on, crushing and destroying surrounding non-cancerous cells and tissues.
Tumors come in two forms: benign and malignant (Pena-Reyes & Sipper, 1999). Benign tumors are not cancerous, so they do not grow and spread to the extent that cancerous tumors do, and they are usually not life-threatening. Malignant tumors, on the other hand, grow and spread to other areas of the body (Muto, Bussey, & Morson, 1975). The process whereby cancer cells travel from the initial tumor site to other parts of the body is known as metastasis (Müller et al., 2001).
Breast cancer is a malignant cell growth in the breast (Muto et al., 1975). If left untreated, the cancer spreads to other areas of the body. Excluding skin cancer, breast cancer is the most common type of cancer in women, accounting for one of every three cancer diagnoses. The incidence of breast cancer rises after age 40 (Moss et al., 2006). The highest incidence (approximately 80% of invasive cases) occurs in women over age 50. This disease, like other cancers, is still under research, and so far no globally available preventive measures have been proposed. The best safeguard against breast cancer is therefore to identify the disease at an early stage and to take preventive measures before it spreads to other tissues.
The early and accurate identification of this disease relies on the study of previous diagnosis data and on gathering useful information from it. This process can be aided by concepts and techniques from computer science. Machine learning (ML) techniques enable a system to learn from previous data and, based on that learning, to predict results on new, unseen data. The available breast cancer datasets are unbalanced because they contain more benign instances than malignant ones, so it is important to develop a framework that deals with such scenarios. A dataset is termed unbalanced if one of its classes (the minority class) is represented by only a small number of training instances while the other classes make up the majority (Beyan & Fisher, 2015). Because the larger majority class dominates traditional training criteria in the fitness function, the resulting classifier tends to have good accuracy on the majority class but insufficient accuracy on the minority class(es) (Japkowicz & Stephen, 2002).
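The bias described above can be illustrated with a minimal sketch. The class counts (90 benign, 10 malignant) and the always-benign classifier are hypothetical and chosen only for illustration; they are not taken from the paper's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


# Hypothetical unbalanced labels: 0 = benign (majority), 1 = malignant (minority).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a degenerate classifier that always predicts "benign"

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
overall_accuracy = (tp + tn) / len(y_true)
majority_accuracy = tn / (tn + fp)  # fraction of benign cases classified correctly
minority_accuracy = tp / (tp + fn)  # fraction of malignant cases classified correctly

print(overall_accuracy, majority_accuracy, minority_accuracy)
```

Under these assumed counts the overall accuracy looks high even though no malignant case is detected, which is the failure mode that motivates looking beyond plain accuracy.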
In recent years, unbalanced datasets have been encountered in a wide range of areas, such as face recognition (Leng, Yu, & Jingyan, 2017), the detection of oil spills in satellite imagery (Mera, Bolon-Canedo, Cotos, & Alonso-Betanzos, 2017), panel data (Ye, Xu, & Wu, 2018), and medical diagnostic decision making (Bhardwaj, Tiwari, RameshKrishna, & Vishaal Varma, Bhardwaj, Sakalle, Bhardwaj, Tiwari, 2018). The classification of such datasets has therefore received considerably more attention. The use of ML in medical science is increasing steadily (Kononenko, 2001; Polat, Güneş, & Arslan, 2008). Most frameworks use accuracy as the metric to evaluate results (Polat & Güneş, 2007a), but in unbalanced dataset classification accuracy gives biased results (Patterson & Zhang, 2007). Therefore, in this paper, we introduce two new fitness functions, F2 score and Distance score (D score), for a genetic programming (GP) framework that addresses the problem of unbalanced breast cancer data classification. F2 score improves the performance of the classifier by focusing more on learning about the minority class. Learning about the minority class is important in medical data, as the minority data contains more relevant information. D score handles unbalanced data by learning about both classes with equal importance, i.e., by being unbiased between the classes. To show the usefulness of our proposed DGP and F2GP frameworks (GP with D score and F2 score as the fitness function, respectively), we have also trained a GP framework using accuracy as the fitness function and compared the results.
To show the superiority of our methods, we have also compared our proposed DGP and F2GP frameworks with Back Propagation Neural Network (BPNN) (Hagan, Demuth, Beale, & De Jesús, 1996), Koza & Rice model (Koza & Rice, 1991), Average Class Accuracy in fitness (Ave) (Bhowan, Johnston, & Zhang, 2012) and Genetically Optimized Neural Network (GONN) on 60-40, 70-30 partition schemes and also on 10-fold cross validation scheme.
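The evaluation schemes named above can be sketched as index splits. This is a minimal illustration, assuming that "60-40" and "70-30" mean train-test percentages and that samples are shuffled before splitting; the excerpt does not state either point, nor whether the splits were stratified by class. The function names and the sample count are hypothetical.

```python
import random


def holdout_split(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into train/test index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(round(n_samples * train_fraction))
    return idx[:cut], idx[cut:]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


n = 100  # hypothetical sample count, not the size of the paper's dataset
train_60, test_40 = holdout_split(n, 0.6)
train_70, test_30 = holdout_split(n, 0.7)
folds = list(k_fold_indices(n, k=10))
print(len(train_60), len(test_40), len(train_70), len(test_30), len(folds))
```

For strongly unbalanced data, a stratified variant that preserves the class ratio in every split is a common choice, but that would be a further assumption here.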
Related works
Several works have addressed medical data diagnosis and the identification of healthy and unhealthy cases (Bhardwaj, Tiwari, Krishna, & Varma, 2016; Bhardwaj, Tiwari, Varma, & Krishna, 2014). Breast cancer specific works have also been carried out to classify malignant and benign cases properly. As medical datasets are usually unbalanced (Jafari-Marandi, Davarzani, Gharibdousti, & Smith, 2018), works have also been done on unbalanced data classification. Few of the related works on breast
Genetic programming
GP (Koza, 1992) is an ML framework inspired by Darwin’s theory of evolution. It first creates random individuals (candidate solutions), known as GP trees, and then evolves them until an optimal solution is found. It evolves the solutions using reproduction, crossover and mutation operators. Because operators such as crossover and mutation keep creating multiple new solutions, GP explores the search space broadly, which improves its chances of reaching a globally optimal solution. The steps of GP are described next.
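The evolutionary loop outlined above (random initial trees, then reproduction, crossover and mutation) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function set, tournament selection, elitism, the operator rates, the tree-size cap `MAX_NODES`, the population size, the number of generations and the toy data are all assumptions made for this example.

```python
import operator
import random

OPS = [operator.add, operator.sub, operator.mul]
MAX_NODES = 60  # illustrative cap on tree size to limit bloat


def random_tree(n_features, depth, rng):
    """Grow a random expression tree with feature/constant leaves."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("c", rng.uniform(-1.0, 1.0))
    return (rng.choice(OPS),
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))


def evaluate(tree, row):
    """Evaluate a tree on one sample (a list of feature values)."""
    if tree[0] == "x":
        return row[tree[1]]
    if tree[0] == "c":
        return tree[1]
    op, left, right = tree
    return op(evaluate(left, row), evaluate(right, row))


def paths(tree, prefix=()):
    """Positions of all nodes, as tuples of child indices (1 = left, 2 = right)."""
    found = [prefix]
    if tree[0] not in ("x", "c"):
        found += paths(tree[1], prefix + (1,))
        found += paths(tree[2], prefix + (2,))
    return found


def subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    node = list(tree)
    node[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(node)


def crossover(a, b, rng):
    """Copy a random subtree of `b` into a random position of `a`."""
    return replace(a, rng.choice(paths(a)), subtree(b, rng.choice(paths(b))))


def mutate(tree, n_features, rng):
    """Replace a random node with a freshly grown subtree."""
    return replace(tree, rng.choice(paths(tree)), random_tree(n_features, 2, rng))


def predict(tree, X):
    """Class 1 if the tree output is positive, else class 0."""
    return [1 if evaluate(tree, row) > 0 else 0 for row in X]


def evolve(X, y, fitness, pop_size=40, generations=15, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    pop = [random_tree(n_features, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(y, predict(t, X)) for t in pop]

        def pick():  # tournament selection of size 3
            return pop[max(rng.sample(range(pop_size), 3), key=lambda j: scores[j])]

        elite = pop[max(range(pop_size), key=lambda j: scores[j])]
        nxt = [elite]  # keep the best individual unchanged
        while len(nxt) < pop_size:
            r = rng.random()
            if r < 0.8:
                child = crossover(pick(), pick(), rng)
            elif r < 0.9:
                child = mutate(pick(), n_features, rng)
            else:
                child = pick()  # reproduction: copy as-is
            if len(paths(child)) <= MAX_NODES:
                nxt.append(child)
        pop = nxt
    return max(pop, key=lambda t: fitness(y, predict(t, X)))


def accuracy_fitness(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: class 1 when the first feature exceeds the second.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(60)]
y = [1 if a > b else 0 for a, b in X]
best = evolve(X, y, accuracy_fitness)
best_fit = accuracy_fitness(y, predict(best, X))
print("training accuracy of best tree:", best_fit)
```

The fitness function is passed in as an argument, so swapping accuracy for another measure changes only the `fitness` callable and leaves the evolutionary loop untouched.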
Proposed fitness functions
Due to the unbalanced nature of breast cancer datasets, most ML models fail to separate benign (non-cancerous) from malignant (cancerous) cases properly, even though they achieve high accuracy. As observed in the previous section, accuracy gives biased results with unbalanced data, which makes it unsuitable as a fitness function for GP. To resolve the issues related to the classification of unbalanced data, we propose two new fitness functions for GP: F2 score and D score.
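As a rough illustration of an F2-based fitness, the sketch below uses the standard F-beta measure, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 2, where P is precision and R is recall on the minority class. Whether the paper's F2 fitness matches this standard form exactly is an assumption, since the excerpt does not reproduce the definition. The D score is not sketched because its definition is not given here. The function name and the two example classifiers are hypothetical.

```python
def f_beta_fitness(y_true, y_pred, beta=2.0, positive=1):
    """F-beta of the positive (minority) class; beta=2 weights recall above precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    b2 = beta * beta
    # Algebraically equal to (1 + b2) * P * R / (b2 * P + R), without dividing by zero.
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical labels: 10 malignant (1) followed by 90 benign (0).
y_true = [1] * 10 + [0] * 90
always_benign = [0] * 100
# Hypothetical classifier: finds 8 of 10 malignant cases, raises 4 false alarms.
cautious = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 86

f2_benign = f_beta_fitness(y_true, always_benign)
f2_cautious = f_beta_fitness(y_true, cautious)
print(f2_benign, round(f2_cautious, 4))
```

With these assumed predictions the always-benign classifier receives the lowest possible fitness despite its 90% accuracy, so a GP run guided by this measure is pushed toward detecting the minority class.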
Results and discussions
The proposed GP classifier was implemented in Python 3.6 and run on a laptop with a 7th-generation Intel i7 processor (3.4 GHz) and 16 GB of RAM. Our GP framework was trained with the parameter values described in Table 1. The methods used for comparison (BPNN, the Koza & Rice model, Ave and GONN) were implemented on the same system with the same parameters and configuration. The activation function used for BPNN and GONN is sigmoid.
Conclusion
In this work, a novel fitness function termed D score has been proposed for addressing the problems related to the classification of unbalanced datasets. Along with D score, we have demonstrated the worth of F2 score as a fitness function for classification. GP has served as the framework in which these fitness functions are executed. The WBCD dataset is employed to test the stated methodology. The D score and F2 score methods are evaluated in comparison to the other
CRediT authorship contribution statement
Divyaansh Devarriya: Conceptualization, Investigation, Resources, Methodology, Writing - review & editing. Cairo Gulati: Data curation, Formal analysis, Resources, Writing - review & editing. Vidhi Mansharamani: Writing - original draft, Formal analysis. Aditi Sakalle: Writing - original draft, Resources, Formal analysis, Supervision, Validation. Arpit Bhardwaj: Conceptualization, Investigation, Methodology, Project administration, Supervision, Validation, Visualization, Writing - original
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (55)
et al. (2003). Supervised fuzzy clustering for the identification of fuzzy classifiers. Pattern Recognition Letters.
et al. (2017). Next-generation sequencing of circulating tumor DNA for early cancer detection. Cell.
et al. (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition.
et al. (2015). Breast cancer diagnosis using genetically optimized neural network model. Expert Systems with Applications.
et al. (2016). A novel genetic programming approach for epileptic seizure detection. Computer Methods and Programs in Biomedicine.
et al. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks.
et al. (2011). A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis. Expert Systems with Applications.
et al. (2017). Analysing the influence of the fitness function on genetically programmed bots for a real-time strategy game. Entertainment Computing.
et al. (2017). Mutations, evolution and the central role of a self-defined fitness function in the initiation and progression of cancer. Biochimica et Biophysica Acta (BBA) - Reviews on Cancer.
et al. (2017). RHSBoost: Improving classification performance in imbalance data. Computational Statistics & Data Analysis.
An optimum ANN-based breast cancer diagnosis: Bridging gaps between ANN learning and decision-making goals. Applied Soft Computing.
Comparing fitness functions for genetic feature transformation. IFAC-PapersOnLine.
Machine learning for medical diagnosis: History, state of the art and perspective. Artificial Intelligence in Medicine.
Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Information Sciences.
Data augmentation for unbalanced face recognition training sets. Neurocomputing.
Deep variance network: An iterative, improved CNN framework for unbalanced training datasets. Pattern Recognition.
Genetic programming for evolving figure-ground segmentors from multiple features. Applied Soft Computing.
WBCD breast cancer database classification applying artificial metaplasticity neural network. Expert Systems with Applications.
On the use of feature selection to improve the detection of sea oil spills in SAR images. Computers & Geosciences.
Effect of mammographic screening from age 40 years on breast cancer mortality at 10 years’ follow-up: A randomised controlled trial. The Lancet.
Obtaining interpretable fuzzy classification rules from medical data. Artificial Intelligence in Medicine.
A knowledge-based system for breast cancer classification using fuzzy logic method. Telematics and Informatics.
Comparing performances of backpropagation and genetic algorithms in the data classification. Expert Systems with Applications.
A fuzzy-genetic approach to breast cancer diagnosis. Artificial Intelligence in Medicine.
Breast cancer diagnosis using least square support vector machine. Digital Signal Processing.
A hybrid approach to medical decision support systems: Combining feature selection, fuzzy weighted pre-processing and AIRS. Computer Methods and Programs in Biomedicine.
A cascade learning system for classification of diabetes disease: Generalized discriminant analysis and least square support vector machine. Expert Systems with Applications.