Unbalanced breast cancer data classification using novel fitness functions in genetic programming

https://doi.org/10.1016/j.eswa.2019.112866

Highlights

  • Novel fitness functions, F2 score and D score, are proposed.

  • Proposed fitness functions tackle the problem of unbalanced breast cancer data classification.

  • F2 score gives more weight to the minority class.

Abstract

Breast cancer is a common disease, and to prevent it from progressing, it must be identified at an early stage. Available breast cancer datasets are unbalanced in nature, i.e. there are more instances of benign (non-cancerous) cases than malignant (cancerous) ones. It is therefore challenging for most machine learning (ML) models to distinguish benign from malignant cases properly, even when they report high accuracy. Accuracy is not a good metric for assessing ML models on breast cancer datasets because it is biased toward the majority class. To address this issue, we use Genetic Programming (GP) and propose two fitness functions. The first is the F2 score, which focuses on learning more about the minority class, which contains more relevant information. The second is a novel fitness function known as the Distance score (D score), which learns about both classes by giving them equal importance and being unbiased. The GP framework in which we implemented the D score is named D-score GP (DGP), and the framework implemented with the F2 score is named F2GP. The proposed F2GP achieved a maximum accuracy of 99.63%, 99.51% and 100% for the 60-40 partition scheme, the 70-30 partition scheme and the 10-fold cross-validation scheme respectively, and DGP achieved a maximum accuracy of 99.63%, 98.5% and 100% for the same three schemes respectively. The proposed models also achieve a recall of 100% for all the test cases. This shows that using a new fitness function for unbalanced data classification improves the performance of a classifier.

Introduction

Our body is composed of many millions of tiny cells, each a self-contained living unit. Normally, each cell coordinates with the others that compose tissues and organs of our body. One way that this coordination occurs is reflected in how our cells reproduce themselves. Normal cells in the body grow and divide for a period of time and then stop growing and dividing. Thereafter, they only reproduce themselves as necessary to replace defective or dying cells. Cancer occurs when this cellular reproduction process goes out of control and some of the body’s cells begin to divide without stopping and spread into surrounding tissues. In other words, cancer is a disease characterized by uncontrolled, uncoordinated and undesirable cell division (Gatenby & Brown, 2017). Unlike normal cells, cancer cells continue to grow and divide for their whole lives, replicating into more and more harmful cells. Cancers form solid tumors, which are masses of tissue. As cancer cells divide and replicate themselves, they often form into a clump, known as a tumor (Aravanis, Lee, & Klausner, 2017). Tumors cause many of the symptoms of cancer by pressuring, crushing and destroying surrounding non-cancerous cells and tissues.

Tumors come in two forms: benign and malignant (Pena-Reyes & Sipper, 1999). Benign tumors are not cancerous, so they do not grow and spread to the extent that cancerous tumors do, and they are usually not life-threatening. Malignant tumors, on the other hand, grow and spread to other areas of the body (Muto, Bussey, & Morson, 1975). The process whereby cancer cells travel from the initial tumor site to other parts of the body is known as metastasis (Müller et al., 2001).

Breast cancer is a malignant cell growth in the breast (Muto et al., 1975). If left untreated, the cancer spreads to other areas of the body. Excluding skin cancer, breast cancer is the most common type of cancer in women, accounting for one of every three cancer diagnoses. The incidence of breast cancer rises after age 40 (Moss et al., 2006). The highest incidence (approximately 80% of invasive cases) occurs in women over age 50. Like other cancers, this disease is still under research and, so far, no globally available preventive measures have been proposed. The best way to counter breast cancer is to identify the disease at an early stage and take preventive measures before it spreads to other tissues.

Early and accurate identification of this disease relies on studying previous diagnosis data and extracting useful information from it. This process can be aided by concepts and techniques from computer science. ML techniques enable a system to learn from previous data and, based on that learning, to make predictions on new, unseen data. The available breast cancer datasets are unbalanced because they contain more benign instances than malignant ones, so it is important to develop a framework that handles such scenarios. A dataset can be termed unbalanced if one of its classes is represented by only a small number of training instances (called the minority class) while the other classes make up the majority (Beyan & Fisher, 2015). Because the larger majority class dominates traditional training criteria in the fitness function, the resulting classifier tends to have good accuracy on the majority class but insufficient accuracy on the minority class(es) (Japkowicz & Stephen, 2002).
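The bias described above can be seen in a small sketch. The 90/10 class split and the "always predict benign" classifier below are hypothetical illustrations, not data or models from the paper; labels use 0 for benign and 1 for malignant as an assumed convention.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class instances that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical unbalanced dataset: 90 benign (0) and 10 malignant (1) cases.
y_true = [0] * 90 + [1] * 10
# A trivial classifier that always predicts the majority (benign) class.
y_pred = [0] * 100

majority_accuracy = accuracy(y_true, y_pred)  # high, despite learning nothing
minority_recall = recall(y_true, y_pred)      # no malignant case is detected
print(majority_accuracy, minority_recall)
```

The trivial classifier scores 90% accuracy while detecting none of the malignant cases, which is why accuracy alone can mislead on unbalanced data.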

In recent years, unbalanced datasets have been encountered in a wide range of areas, such as face recognition (Leng, Yu, & Jingyan, 2017), the detection of oil spills in satellite imagery (Mera, Bolon-Canedo, Cotos, & Alonso-Betanzos, 2017), panel data (Ye, Xu, & Wu, 2018), and medical diagnostic decision making (Bhardwaj, Tiwari, RameshKrishna, & Vishaal Varma, Bhardwaj, Sakalle, Bhardwaj, Tiwari, 2018). The classification of such datasets has therefore received substantially more attention. The use of ML in medical science is increasing gradually and significantly (Kononenko, 2001; Polat, Güneş, & Arslan, 2008). Most frameworks use accuracy as a metric to evaluate the results (Polat & Güneş, 2007a), but in the case of unbalanced dataset classification, accuracy gives biased results (Patterson & Zhang, 2007). Therefore, in this paper, we introduce two new fitness functions, F2 score and Distance score (D score), to evaluate the GP framework for addressing the problem of unbalanced breast cancer data classification. F2 score improves the performance of the classifier by focusing more on learning about the minority class. Learning about the minority class is important in medical data, as the minority data contains more relevant information. D score handles unbalanced data by learning about both classes and giving them equal importance, i.e., by being unbiased between the classes. To show the usefulness of our proposed DGP and F2GP frameworks, we have trained a GP framework using accuracy as the fitness function and compared the results. To show the superiority of our methods, we have also compared our proposed DGP and F2GP frameworks with Back Propagation Neural Network (BPNN) (Hagan, Demuth, Beale, & De Jesús, 1996), the Koza & Rice model (Koza & Rice, 1991), Average Class Accuracy in fitness (Ave) (Bhowan, Johnston, & Zhang, 2012) and Genetically Optimized Neural Network (GONN) on the 60-40 and 70-30 partition schemes and on the 10-fold cross-validation scheme.
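The three evaluation schemes named above can be sketched with plain index bookkeeping. This is only an illustration: the helper names `holdout_split` and `k_fold_indices`, the sample count of 100, the seed, and the use of simple (non-stratified) shuffling are all assumptions, since the text here does not say how the authors built their partitions.

```python
import random


def holdout_split(n_samples, train_fraction, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset size, chosen only to make the proportions easy to read.
train_60, test_40 = holdout_split(100, 0.6)   # 60-40 partition
train_70, test_30 = holdout_split(100, 0.7)   # 70-30 partition
folds = list(k_fold_indices(100, k=10))       # 10-fold cross validation
print(len(train_60), len(test_40), len(train_70), len(test_30), len(folds))
```

On unbalanced data a stratified variant, which preserves the benign/malignant ratio in every part, is a common alternative to the plain shuffle used here.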

Section snippets

Related works

Several studies have addressed medical data diagnosis and the identification of healthy and unhealthy cases (Bhardwaj, Tiwari, Krishna, & Varma, 2016; Bhardwaj, Tiwari, Varma, & Krishna, 2014). Breast cancer specific studies have also aimed to classify malignant and benign cases properly. Because medical datasets are usually unbalanced (Jafari-Marandi, Davarzani, Gharibdousti, & Smith, 2018), work has also been done on unbalanced data classification. Few of the related works on breast

Genetic programming

GP (Koza, 1992) is an ML framework inspired by Darwin’s theory of evolution. It first creates random individuals or solutions, known as GP trees, and then evolves them until an optimal solution is found. It evolves the solutions using reproduction, crossover and mutation operators. Because operators like crossover and mutation create multiple solutions, GP explores the search space well and is better placed to reach a globally optimal solution. The steps of GP are described next.
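As a rough illustration of the loop just described (random trees, then reproduction, crossover and mutation), the sketch below evolves small arithmetic expression trees for binary classification. It is a generic toy version, not the authors' implementation: the tuple-based tree encoding, the function set, tournament selection, the rates, the "output > 0 means class 1" rule, and the toy data are all assumptions made for the example. The fitness function is passed in as a parameter, so accuracy can be swapped for another score.

```python
import operator
import random

FUNCTIONS = [(operator.add, 2), (operator.sub, 2), (operator.mul, 2)]


def random_tree(n_features, depth, rng):
    """Grow a random tree: ("x", i), ("const", c) or (function, [children])."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("const", rng.uniform(-1.0, 1.0))
    func, arity = rng.choice(FUNCTIONS)
    return (func, [random_tree(n_features, depth - 1, rng) for _ in range(arity)])


def evaluate(tree, row):
    """Compute the numeric output of a tree for one feature row."""
    head, arg = tree
    if head == "x":
        return row[arg]
    if head == "const":
        return arg
    return head(*(evaluate(child, row) for child in arg))


def subtrees(tree):
    """Yield the tree itself and every subtree below it."""
    yield tree
    head, arg = tree
    if callable(head):
        for child in arg:
            yield from subtrees(child)


def replace_random_subtree(tree, donor, rng):
    """Return a copy of tree with one randomly chosen subtree replaced by donor."""
    head, arg = tree
    if not callable(head) or rng.random() < 0.3:
        return donor
    i = rng.randrange(len(arg))
    children = list(arg)
    children[i] = replace_random_subtree(arg[i], donor, rng)
    return (head, children)


def crossover(a, b, rng):
    return replace_random_subtree(a, rng.choice(list(subtrees(b))), rng)


def mutate(tree, n_features, rng):
    return replace_random_subtree(tree, random_tree(n_features, 2, rng), rng)


def evolve(X, y, fitness, pop_size=40, generations=10, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    population = [random_tree(n_features, 3, rng) for _ in range(pop_size)]

    def score(tree):
        preds = [1 if evaluate(tree, row) > 0 else 0 for row in X]
        return fitness(y, preds)

    for _ in range(generations):
        # Reproduction: the two fittest trees are copied unchanged.
        next_gen = sorted(population, key=score, reverse=True)[:2]
        while len(next_gen) < pop_size:
            parent_a = max(rng.sample(population, 3), key=score)
            parent_b = max(rng.sample(population, 3), key=score)
            child = crossover(parent_a, parent_b, rng)
            if rng.random() < 0.2:
                child = mutate(child, n_features, rng)
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    return best, score(best)


def accuracy_fitness(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data (hypothetical): class 1 whenever the first feature is positive.
X = [[float(i), 1.0] for i in range(-5, 5)]
y = [1 if row[0] > 0 else 0 for row in X]
best_tree, best_fitness = evolve(X, y, accuracy_fitness)
print(best_fitness)
```

Keeping the fitness function as an argument to `evolve` mirrors the idea of this work: the evolutionary loop stays the same while the score that drives selection is exchanged.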

Proposed fitness functions

Due to the unbalanced nature of breast cancer datasets, most ML models fail to distinguish benign (non-cancerous) from malignant (cancerous) cases properly, even though they have high accuracy. As observed in the previous section, accuracy gives biased results with unbalanced data, which makes it unsuitable as a fitness function for GP. To resolve the issues related to the classification of unbalanced data, we propose two new fitness functions for GP: F2 score and D score.
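For orientation, the sketch below computes the standard F-beta score with beta = 2, which in terms of precision P and recall R is F2 = 5PR / (4P + R). It assumes the paper's F2 score follows this standard definition, with the malignant class (label 1, an assumed convention) treated as positive; the exact form used by the authors is not shown in this snippet. The D score is not sketched because its definition does not appear here. The two prediction vectors are hypothetical.

```python
def fbeta_score(y_true, y_pred, beta=2.0, positive=1):
    """Standard F-beta score computed from true/false positive and negative counts.

    With beta = 2, a false negative costs four times as much as a false positive.
    """
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical labels: 4 malignant (1) and 6 benign (0) cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Classifier A misses one malignant case (one false negative).
pred_missed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Classifier B raises one false alarm (one false positive).
pred_false_alarm = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

f2_missed = fbeta_score(y_true, pred_missed)            # 15/19
f2_false_alarm = fbeta_score(y_true, pred_false_alarm)  # 20/21
print(f2_missed, f2_false_alarm)
```

Both classifiers make exactly one error and so have the same accuracy (90%), but the F2 score ranks the one that missed a malignant case lower, which is the behaviour wanted from a fitness function that prioritises the minority class.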

Results and discussions

The proposed GP framework was implemented as a classifier in Python (3.6) on a laptop with a 7th-generation Intel i7 processor (3.4 GHz) and 16 GB of RAM. Our GP framework was trained using the parameter values described in Table 1. The methods used for comparison (BPNN, the Koza & Rice model, Ave and GONN) were implemented on our system with the same parameters and configuration. The activation function used for BPNN and GONN is sigmoid.

Conclusion

In this work, a novel fitness function termed D score has been proposed to address the problems related to the classification of unbalanced datasets. Along with D score, we have demonstrated the worth of F2 score as a fitness function for classification. GP provides the framework in which these fitness functions are executed. The WBCD dataset is employed to test the stated methodology. The methods D score and F2 score are experimented and evaluated in comparison to the other

CRediT authorship contribution statement

Divyaansh Devarriya: Conceptualization, Investigation, Resources, Methodology, Writing - review & editing. Cairo Gulati: Data curation, Formal analysis, Resources, Writing - review & editing. Vidhi Mansharamani: Writing - original draft, Formal analysis. Aditi Sakalle: Writing - original draft, Resources, Formal analysis, Supervision, Validation. Arpit Bhardwaj: Conceptualization, Investigation, Methodology, Project administration, Supervision, Validation, Visualization, Writing - original

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (55)

  • R. Jafari-Marandi et al. An optimum ANN-based breast cancer diagnosis: Bridging gaps between ANN learning and decision-making goals. Applied Soft Computing (2018)
  • J. Klusáček et al. Comparing fitness functions for genetic feature transformation. IFAC-PapersOnLine (2016)
  • I. Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine (2001)
  • W. Lee et al. Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Information Sciences (2017)
  • B. Leng et al. Data augmentation for unbalanced face recognition training sets. Neurocomputing (2017)
  • S. Li et al. Deep variance network: An iterative, improved CNN framework for unbalanced training datasets. Pattern Recognition (2018)
  • Y. Liang et al. Genetic programming for evolving figure-ground segmentors from multiple features. Applied Soft Computing (2017)
  • A. Marcano-Cedeño et al. WBCD breast cancer database classification applying artificial metaplasticity neural network. Expert Systems with Applications (2011)
  • D. Mera et al. On the use of feature selection to improve the detection of sea oil spills in SAR images. Computers & Geosciences (2017)
  • S.M. Moss et al. Effect of mammographic screening from age 40 years on breast cancer mortality at 10 years’ follow-up: a randomised controlled trial. The Lancet (2006)
  • D. Nauck et al. Obtaining interpretable fuzzy classification rules from medical data. Artificial Intelligence in Medicine (1999)
  • M. Nilashi et al. A knowledge-based system for breast cancer classification using fuzzy logic method. Telematics and Informatics (2017)
  • H.H. Örkcü et al. Comparing performances of backpropagation and genetic algorithms in the data classification. Expert Systems with Applications (2011)
  • C.A. Pena-Reyes et al. A fuzzy-genetic approach to breast cancer diagnosis. Artificial Intelligence in Medicine (1999)
  • K. Polat et al. Breast cancer diagnosis using least square support vector machine. Digital Signal Processing (2007)
  • K. Polat et al. A hybrid approach to medical decision support systems: Combining feature selection, fuzzy weighted pre-processing and AIRS. Computer Methods and Programs in Biomedicine (2007)
  • K. Polat et al. A cascade learning system for classification of diabetes disease: Generalized discriminant analysis and least square support vector machine. Expert Systems with Applications (2008)