Elsevier

Applied Soft Computing

Volume 50, January 2017, Pages 124-134
Classification of human cancer diseases by gene expression profiles

https://doi.org/10.1016/j.asoc.2016.11.026

Highlights

  • The advent of DNA microarrays enabled the simultaneous monitoring of the expression levels of a large number of genes.

  • In the proposed methodology, Information Gain (IG) is first used for feature selection, then a Genetic Algorithm (GA) is employed for feature reduction, and finally Genetic Programming (GP) is used to classify cancer types.

Abstract

Cancer, in virtually any of its types, is a significant cause of death around the world. In cancer analysis, classifying the various tumor types is of the greatest importance. Analysis of microarray gene expression datasets has been shown to provide a successful framework for studying tumors and genetic diseases. Although standard machine learning (ML) strategies have been applied effectively to identify significant genes and to classify new cases, common limitations of DNA microarray data, for example the small number of samples and the very large number of features, still limit its investigative, medical and logical uses. Increasing the interpretability of prediction and forecasting approaches while retaining high accuracy would help to analyze gene expression profiles in DNA microarray datasets more reasonably and efficiently. This paper presents a new methodology based on gene expression profiles to classify human cancer diseases. The proposed methodology combines Information Gain (IG) and a Standard Genetic Algorithm (SGA). It first uses Information Gain for feature selection, then uses a Genetic Algorithm (GA) for feature reduction, and finally uses Genetic Programming (GP) to classify cancer types. The proposed system is evaluated by classifying cancer diseases in seven cancer datasets, and the results are compared with the most recent approaches. Comparing the proposed system with other machine learning methodologies on cancer datasets shows that no classification technique consistently outperforms all the others; however, the Genetic Algorithm generally improves the classification performance of other classifiers.

Introduction

The term cancer denotes diseases in which abnormal cells divide without control. This uncontrolled division causes a lump called a tumor to form, or rogue immune system cells to develop, which invade other tissues and spread to other parts of the body through the blood vessels and lymph systems. There are more than 100 different types of cancer, mostly named for the location in the body where the cancer first developed or for the type of tissue cell in which it originates (histological type). By type of tissue cell, cancers may be grouped into six major categories: Carcinoma, Sarcoma, Myeloma, Leukemia, Lymphoma and Mixed Types. By primary site of origin, cancers may be of specific types such as breast cancer, lung cancer, prostate cancer, liver cancer, renal cell carcinoma (kidney cancer), oral cancer, brain cancer, etc. In cancer diagnosis, classifying the different tumor types is of the greatest importance. Accurate prediction of the tumor type allows better treatment and minimizes toxicity for patients. Traditional methods of tackling this problem are mostly based on morphological characteristics of tumorous tissue. These conventional methods are reported to have several diagnostic limitations. Consequently, creating methodologies that can effectively distinguish between cancer subtypes is vital.

The advent of DNA microarrays enabled the simultaneous monitoring of the expression levels of a large number of genes [1] and has motivated the rise of computational analysis, including machine learning techniques. These techniques have been used to extract patterns and build classification models from gene expression data and have supported cancer prediction [2] and prognosis [3]. DNA microarray technology has been broadly used in cancer studies for disease prediction. It is a powerful platform used effectively for the analysis of gene expression in a wide variety of experimental studies [4]. Using microarray technology, one can analyze the gene expression levels of thousands of genes from two test sample cells. Depending on the source of the samples, essential investigations, such as disease progression, accurate diagnosis, drug response and prognosis after treatment, can be carried out [5].

Many successful feature selection algorithms have been devised, and a review of feature selection algorithms can be found in [6]. Several earlier researchers [7], [8], [9], [10] studied the goodness of a feature subset in determining an optimal one. Feature selection is essentially an optimization problem. The paper [11] proposed the systematic selection of discriminative genes from microarray gene expression data for cancer diagnosis. The analysis in [12] addressed dimension reduction for classification with gene expression microarray data. A comparison of general schemes for gene selection methods [13], [14], [15] is shown in Table 1.

This paper tackles the problem of classifying human cancer diseases using gene expression profiles. It presents a new methodology to analyze microarray datasets and efficiently classify cancer diseases. The new methodology first employs IG for feature selection, then employs GA for feature reduction, and finally employs GP to classify cancer diseases. This method (IG/GA) improves cancer classification accuracy by reducing the number of features and preventing the GA from being trapped in a local optimum. The proposed methodology is evaluated by classifying cancer diseases in seven cancer datasets, and the results are compared with the most recent approaches.

The rest of this paper is structured as follows. Section 2 defines the problem and its challenges. Section 3 presents a literature review of related work. Section 4 presents an overview of feature selection and information gain. Section 5 presents the proposed methodology, while the experimental results are discussed in Section 6. Finally, Section 7 lists the concluding remarks.

Problem definition and challenges

Gene classification as an area of research poses new challenges because of its unique problem characteristics. First, the challenge originates from the distinctive nature of existing gene expression datasets, where most datasets have a sample size below 200 versus thousands to hundreds of thousands of genes in each tuple. Second, only a small number of these genes are relevant to the investigated disease. Third, originates from the occurrence of

Related work

The analysis of gene expression data obtained in microarray experiments has been of great interest in the research areas of pattern recognition, machine learning, and statistics. Researchers across the world are attracted to the problem of discovering biologically interesting information in the expression data of so many genes. As stated before, the key problem is the ratio between the large number of genes measured per instance and the small number of available samples. Gene selection

Feature selection and information gain

Feature selection is a preprocessing step that aims to select the most informative genes that can separate groups, i.e., cancer subtypes. The main purpose is to find a reduced group of features from a dataset in order to lower the dimensionality of the initial feature space. Generally, cancer classification studies require formal feature selection strategies for two reasons:

  • To lower the computational requirements of experimental tasks, which helps the
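As an illustration of how information gain can rank genes, the following minimal sketch scores each gene by the reduction in class entropy obtained when the samples are split on that gene's expression value. It is not the implementation used in the paper: the function names, the toy data, and the choice to binarise each gene at its median are assumptions made for this example only.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    # Weighted entropy of the two partitions; empty partitions are skipped.
    remainder = sum(len(part) / total * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder


def rank_genes(expression, labels, gene_names):
    """Return (gene, IG) pairs sorted from most to least informative.

    `expression` is a list of samples, each a list of expression values.
    Assumption: each gene is binarised at its median expression value.
    """
    scores = []
    for j, name in enumerate(gene_names):
        column = [sample[j] for sample in expression]
        scores.append((name, information_gain(column, labels, median(column))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data: 4 samples x 2 hypothetical genes.
expression = [[1.0, 1.0], [2.0, 9.0], [8.0, 2.0], [9.0, 8.0]]
labels = ["tumor", "tumor", "normal", "normal"]
ranking = rank_genes(expression, labels, ["gene_a", "gene_b"])
```

In this toy example gene_a separates the two classes perfectly and is ranked first, while gene_b carries no class information. A filter step of this kind would keep only the top-ranked genes before any further reduction.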

Proposed methodology

Fig. 1 shows the general framework of the proposed approach. The methodology first accepts a gene microarray dataset as input and selects the significant features (feature selection) from the input patterns by using IG. The selected features are then reduced by applying GA. Finally, the methodology employs GP to classify cancer types [32].
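The feature reduction step can be pictured with a small genetic algorithm over binary masks, where each bit states whether one IG-selected gene is kept. The sketch below is only illustrative: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values, and the toy fitness function are assumptions and are not taken from the paper. In practice the fitness would be the accuracy of a classifier trained on the masked genes.

```python
import random


def run_ga(n_features, fitness, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve a 0/1 mask over features; requires n_features >= 2.

    Returns the best mask found and the best fitness per generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]

    for _ in range(generations):
        def tournament():
            # Pick two distinct individuals and keep the fitter one.
            a, b = rng.sample(population, 2)
            return a if fitness(a) >= fitness(b) else b

        next_pop = [best[:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to every gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


# Hypothetical fitness: reward keeping genes 0 and 3, penalise mask size.
INFORMATIVE = (0, 3)


def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


best_mask, best_history = run_ga(8, toy_fitness, seed=1)
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in the history never decreases. The size penalty in the toy fitness mirrors the goal of keeping as few genes as possible.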

Experimental results

This section presents the performance evaluation of the proposed IG/SGA methodology. The proposed framework is verified on seven cancer gene expression datasets. For each test, two important criteria are used to assess performance:

  • The number of selected genes.

  • Predictive accuracy on the selected genes.
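The two criteria above can be computed directly from a gene mask and a set of predictions. The following is a minimal sketch with hypothetical names and toy values; it assumes that predictions were obtained on held-out samples, and how those samples are chosen is not specified here.

```python
def evaluate(selected_mask, y_true, y_pred):
    """Return (number of selected genes, predictive accuracy)."""
    n_selected = sum(selected_mask)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return n_selected, correct / len(y_true)


# Toy example: 2 of 4 genes kept, 3 of 4 samples classified correctly.
result = evaluate([1, 0, 1, 0],
                  ["A", "B", "A", "B"],
                  ["A", "B", "B", "B"])
# result == (2, 0.75)
```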

Conclusions

Classification of cancer based on gene expression data is a promising research area in the field of data mining. The proposed algorithm addresses the problem of early cancer diagnosis, one of the world’s most serious health issues. In this paper, a new methodology is presented to classify human cancer diseases based on gene expression profiles. In the proposed methodology, IG is first used for feature selection, then GA is used for feature reduction

References (54)

  • R.M. Luque-Baena, D. Urda, J.L. Subirats, L. Franco, J.M. Jerez, Analysis of Cancer Microarray Data using Constructive...
  • G. Chakraborty et al.

    Multi-objective optimization using Pareto GA for gene-selection from microarray data for disease classification

    IEEE International Conference Systems, Man, and Cybernetics (SMC)

    (2013)
  • J. Jeyachidra et al.

    A study on statistical based feature selection methods for classification of gene microarray dataset

    J. Theor. Appl. Inf. Technol.

    (2013)
  • A. Ghaheri et al.

    The applications of genetic algorithms in medicine

    Oman. Med. J.

    (2015)
  • S. Hengpraprohm

    GA-Based classifier with SNR weighted features for cancer microarray data classification

    Int. J. Signal Process. Syst.

    (2013)
  • Y. Piao et al.

    An ensemble correlation-based gene selection algorithm for cancer classification with gene expression data

    Bioinformatics

    (2012)
  • H. Wang et al.

    Dimension reduction with gene expression data using targeted variable importance measurement

    BMC Bioinf.

    (2011)
  • M. Khan, S.M.K. Quadri, Effects of Using Filter Based Feature Selection on the Performance of Machine Learners Using...
  • G. Chakraborty et al.

    Multi-objective optimization using Pareto GA for gene-selection from microarray data for disease classification

    IEEE International Conference on Systems, Man, and Cybernetics

    (2013)
  • M. Karzynski et al.

    Using a genetic algorithm and a perceptron for feature selection and supervised class learning in DNA microarray data

    Artif. Intell. Rev.

    (2013)
  • A.C. Tan et al.

    Ensemble machine learning on gene expression data for cancer classification

    Appl. Bioinformatics

    (2003)
  • T.M. Mitchell

    Machine Learning

    (1997)
  • H.M. Alshamlan et al.

    A study of cancer microarray gene expression profile: objectives and approaches

    Proceedings of the World Congress on Engineering

    (2013)
  • Salima Omar et al.

    Machine learning techniques for anomaly detection: an overview

    Int. J. Comput. Appl.

    (2013)
  • D.K.S. Yip et al.

    Systematic exploration of autonomous modules in noisy microRNA-target networks for testing the generality of the ceRNA hypothesis

    BMC Genom.

    (2014)
  • R.M. Luque-Baena et al.

    Application of genetic algorithms and constructive neural networks for the analysis of microarray cancer data

    Theor. Biol. Med. Model.

    (2014)
  • E.B. Huerta et al.

    A hybrid LDA and genetic algorithm for gene selection and classification of microarray data

    El-Sevier Pattern Recognit. Bioinform.

    (2010)