Output-based transfer learning in genetic programming for document classification
Introduction
Document classification has been addressed by a large number of machine learning algorithms [1], [2], [3], [4]. The number of features in a document classification task is often large, and a feature selection method is typically required [5], [6]. Some categories in a document classification task, such as category “comp.graphics” and category “comp.windows.x” in [7], are very closely related to each other [3], [7], [8].
Transfer learning techniques [9], [10], [11], [12] have been employed to train classifiers on a few categories and then transfer the learned knowledge to other closely related categories [7]. In a set of similar categories, the distribution of the selected features often differs from one category to another [7]. Therefore, when a model trained on a source domain is applied to a target domain, this change in the distribution of the selected features has to be taken into account [13], [14].
Genetic programming (GP) has been effectively utilized for feature selection in various applications, e.g., multi-class classification [15], cybersecurity [16], and high-dimensional data classification [17]. GP evolves programs that automatically incorporate a subset of features. In a question–answer ranking task [18], GP selected a smaller yet effective set of features compared with other methods. This suggests that GP is capable of automatically selecting appropriate features to construct effective text classifiers. This research is motivated by the observation that a GP program condenses the features it selects into a single output-based composite feature. Between a source domain and a target domain, only the difference in this transformed output space needs to be taken into account; thus, the features selected from the source domain do not require their own distribution-change estimation when they are employed in the target domain. It is therefore promising to explore how GP programs evolved on a source domain can be effectively transferred to a target domain. Additionally, output-based transfer learning approaches [19], [20] have been proposed to train a classifier on a small dataset. The shared features are transformed from the target domain to the source domain, and the proposed target objective function considers the probability of each predicted category based on the outputs of the classifier on all the training data (from both the source and target domains). The results show that the effectiveness of transfer learning can be improved once the outputs of the trained classifier are considered.
This paper investigates an output-based transfer learning system using GP for document classification. GP programs are evolved directly from a set of sparse features on a source domain, without a separate feature selection method. After the GP programs are evolved in the source domain, they are applied directly to the target domain without considering the distribution of the input features; only the change in the outputs of the GP programs is considered. A linear model is proposed to combine a set of these GP programs, and the linear model is optimised on the training data from the target domain. Furthermore, new programs are directly mutated from these GP programs, and the features represented by the mutated programs are used to enrich the data available to the linear model on the target domain. The major contributions of this paper are as follows:
a transfer learning technique is proposed that can effectively and directly transfer GP programs evolved from the source domain to the target domain;
a method is introduced that can effectively combine GP programs evolved from the source domain and their mutated programs for predicting the test documents in the target domain; and
the evolved GP program can be explained to some extent in the context of document classification.
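The core transfer mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the three "GP programs" below are hand-written stand-ins for trees evolved on a source domain (one labelled as a mutated variant), and an ordinary least-squares fit stands in for however the paper optimises its linear model on the target-domain training data.

```python
import numpy as np

# Hypothetical "GP programs": each maps a sparse feature vector to a single
# scalar output (an output-based composite feature). In the paper these are
# evolved on the source domain; here they are illustrative stand-ins.
def prog_a(x): return x[0] - x[1]
def prog_b(x): return 0.5 * x[2] + x[0]
def prog_c(x): return max(x[1], x[3])   # e.g., a mutated variant of prog_a

programs = [prog_a, prog_b, prog_c]

def transform(X, programs):
    """Map each document's feature vector to the outputs of the GP programs."""
    return np.array([[p(x) for p in programs] for x in X])

# Tiny synthetic "target domain" training set (4 documents, 4 sparse features).
X_target = np.array([[1.0, 0.0, 2.0, 0.0],
                     [0.0, 1.0, 0.0, 2.0],
                     [2.0, 0.0, 1.0, 0.0],
                     [0.0, 2.0, 0.0, 1.0]])
y_target = np.array([1, 0, 1, 0])

# Fit a linear model (with bias) over the program-output space only;
# the input-feature distributions never enter the fit.
Z = transform(X_target, programs)
w, *_ = np.linalg.lstsq(np.c_[Z, np.ones(len(Z))], y_target, rcond=None)

def predict(x):
    z = np.append([p(x) for p in programs], 1.0)
    return int(z @ w > 0.5)
```

Note that the distribution of the raw sparse features plays no role after `transform`: the linear model sees only the (here three-dimensional) output space, which is the point of the output-based approach.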
After this section, the backgrounds of document classification, transfer learning, and GP for text classification are provided in Section 2. The output-based transfer learning system for document classification is introduced in Section 3. After Section 4 describes the design of the experiments, the results and discussions are presented in Section 5. Section 6 draws conclusions and addresses future research directions.
Document classification
Document classification is the task of assigning a document to one or more categories based on its text content. To handle document classification automatically, statistical techniques and artificial intelligence approaches have been utilized [5], [21], [22], [23].
Normally, document classification includes the stages of pre-processing, feature extraction, model training, and prediction. Some words, such as "an" and "the", are removed in the stage of pre-processing.
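As a concrete illustration of the first two stages, the sketch below performs stop-word removal and simple term-frequency feature extraction. The stop-word list and toy corpus are illustrative only, not those used in the paper.

```python
import re
from collections import Counter

# Illustrative stop-word subset; real systems use much larger lists.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is"}

def preprocess(document):
    """Lower-case, tokenise, and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(documents):
    """Extract a vocabulary and sparse term-frequency features."""
    processed = [preprocess(d) for d in documents]
    vocab = sorted({t for doc in processed for t in doc})
    return vocab, [Counter(doc) for doc in processed]

docs = ["The graphics card renders an image.",
        "The X window system manages the display."]
vocab, counts = bag_of_words(docs)
```

Each `Counter` here is a sparse feature vector over `vocab`; in a real document classification task the vocabulary is large, which is why feature selection (or, in this paper, GP's implicit selection) matters.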
The proposed output-based transfer learning approach
The proposed GP-based transfer learning system is presented in this section. The main components of the GP system are described in Section 3.1. Section 3.2 introduces how the output-based GP programs are used in the transfer learning system. Section 3.3 provides the overall structure of the system.
Dataset
The twenty newsgroup dataset [28] has been widely employed by researchers [26], [27], [42]. Some categories in the dataset are very closely related to each other, such as category "comp.graphics" and category "comp.windows.x". There are twenty categories organised into four groups, each group forming a major category; for example, "comp.graphics" and "comp.windows.x" belong to the group "comp". Following [42], six datasets are generated. Fig. 3 lists the details of the six binary classification tasks.
Results and discussions
The test results on each dataset are provided in this section. There are discussions of the test results as well.
Conclusions
This paper investigated output-based transfer learning on GP programs for document classification. After GP programs were evolved in the source domains, a linear model was proposed to combine these GP programs for classifying documents on the target domains. From the experiments, the evolved GP classifiers from source domains have been shown to be helpful to classify documents from target domains. The combinations of randomly selected GP programs from a source domain in the proposed linear
CRediT authorship contribution statement
Wenlong Fu: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Bing Xue: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Xiaoying Gao: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Mengjie Zhang:
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
All authors approved the version of the manuscript to be published.
Funding
No funding was received for this work.
Intellectual property
We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.
References (50)
- et al., Bag-of-concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base, Knowl.-Based Syst. (2020)
- et al., Deep neural network for hierarchical extreme multi-label text classification, Appl. Soft Comput. (2019)
- et al., Integrating associative rule-based classification with Naïve Bayes for text classification, Appl. Soft Comput. (2018)
- et al., Combining binary classifiers in different dichotomy spaces for text categorization, Appl. Soft Comput. (2019)
- et al., A recent overview of the state-of-the-art elements of text classification, Expert Syst. Appl. (2018)
- et al., Transfer learning using computational intelligence: A survey, Knowl.-Based Syst. (2015)
- et al., A framework for semi-supervised metric transfer learning on manifolds, Knowl.-Based Syst. (2019)
- et al., Extreme learning machine based transfer learning algorithms: A survey, Neurocomputing (2017)
- et al., Genetic programming for multiple-feature construction on high-dimensional classification, Pattern Recognit. (2019)
- et al., Output based transfer learning with least squares support vector machine and its application in bladder cancer prognosis, Neurocomputing (2020)
- Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods, Appl. Soft Comput.
- Semantic text classification: A survey of past and recent advances, Inf. Process. Manage.
- Learning document representation via topic-enhanced LSTM model, Knowl.-Based Syst.
- Syntactic N-grams as machine learning features for natural language processing, Expert Syst. Appl.
- Senti-N-Gram: An n-gram lexicon for sentiment analysis, Expert Syst. Appl.
- Improved inverse gravity moment term weighting for text classification, Expert Syst. Appl.
- Term-weighting learning via genetic programming for text classification, Knowl.-Based Syst.
- Genetic programming-based feature learning for question answering, Inf. Process. Manage.
- A filter-based feature construction and feature selection approach for classification using genetic programming, Knowl.-Based Syst.
- Text classification using capsules, Neurocomputing
- Relevance popularity: A term event model based feature selection scheme for text classification, PLoS One
- Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw.
- Making trillion correlations feasible in feature grouping and selection, IEEE Trans. Pattern Anal. Mach. Intell.
- Enhanced cross-domain sentiment classification utilizing a multi-source transfer learning approach, Soft Comput.
- A survey on transfer learning, IEEE Trans. Knowl. Data Eng.