Optimal column subset selection for image classification by genetic algorithms

  • CS and OR in Big Data and Cloud Com
Annals of Operations Research

Abstract

Many problems in operations research can be solved by combinatorial optimization. Fixed-length subset selection is a family of combinatorial optimization problems that involve the selection of a set of unique objects from a larger superset. Feature selection, the p-median problem, and the column subset selection problem are three examples of hard problems that involve a search for fixed-length subsets. Because of the high complexity of these problems, exact algorithms are often infeasible for real-world instances, and approximate methods based on various heuristic and metaheuristic (e.g. nature-inspired) approaches are often employed instead. Selecting column subsets from massive data matrices is an important technique for constructing compressed representations and low-rank approximations of high-dimensional data. The search for an optimal subset of exactly k columns of a matrix, \(A^{m\times n}\), \(k < n\), is a well-known hard optimization problem with practical implications for data processing and mining. It can be used for unsupervised feature selection, dimensionality reduction, data visualization, and so on. A compressed representation of raw real-world data can, for example, reduce algorithm training times in supervised learning, help eliminate overfitting in classification and regression, and facilitate better understanding of the data, among other benefits. This paper proposes a novel genetic algorithm for the column subset selection problem and evaluates it in a series of computational experiments with image classification. The evaluation shows that the proposed modifications improve the results obtained by artificial evolution.
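To make the search for exactly k columns concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes the commonly used column subset selection objective, the Frobenius norm of the part of A that is not explained by the span of the selected columns; the abstract does not state which fitness function the authors use. The helper names (`residual`, `best_subset_exhaustive`, `mutate`) and the toy matrix are hypothetical. The exhaustive search is only feasible for tiny instances, which is why heuristic search is used in practice; `mutate` illustrates one possible operator that keeps the subset length fixed.

```python
import itertools
import math
import random


def column(A, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in A]


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            dot = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - dot * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip columns that are numerically dependent
            basis.append([wi / norm for wi in w])
    return basis


def residual(A, subset):
    """Frobenius norm of A minus its projection onto the selected columns
    (assumed objective; smaller is better)."""
    basis = orthonormal_basis([column(A, j) for j in subset])
    total = 0.0
    for j in range(len(A[0])):
        a = column(A, j)
        # Squared distance from a to the span: |a|^2 - sum of (q . a)^2.
        sq = sum(x * x for x in a)
        for q in basis:
            sq -= sum(x * y for x, y in zip(a, q)) ** 2
        total += max(sq, 0.0)  # guard against tiny negative rounding errors
    return math.sqrt(total)


def best_subset_exhaustive(A, k):
    """Try every k-column subset; only practical for very small n."""
    n = len(A[0])
    return min(itertools.combinations(range(n), k),
               key=lambda s: residual(A, s))


def mutate(subset, n, rng):
    """Swap one selected column for an unselected one, so the subset
    keeps exactly the same length and contains no duplicates."""
    subset = list(subset)
    outside = [j for j in range(n) if j not in subset]
    subset[rng.randrange(len(subset))] = rng.choice(outside)
    return tuple(sorted(subset))


# Toy 3 x 4 matrix (hypothetical data): columns 0 and 2 are parallel.
A = [[1, 0, 2, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 3]]

best = best_subset_exhaustive(A, 2)
print(best, residual(A, best))
print(mutate(best, 4, random.Random(0)))
```

In a genetic algorithm, an operator like `mutate` would be applied to candidate subsets and `residual` would serve as the fitness to minimize; the number of candidate subsets grows as the binomial coefficient of n over k, which is what rules out exhaustive search on real data.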

Notes

  1. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

Author information

Corresponding author

Correspondence to Jana Nowaková.

Additional information

This work was supported by the Czech Science Foundation under the Grant No. GJ16-25694Y and by the Projects SP2016/97, “Parallel processing of Big Data III”, of the Student Grant System, VŠB-TU Ostrava.

About this article

Cite this article

Krömer, P., Platoš, J., Nowaková, J. et al. Optimal column subset selection for image classification by genetic algorithms. Ann Oper Res 265, 205–222 (2018). https://doi.org/10.1007/s10479-016-2331-0

  • DOI: https://doi.org/10.1007/s10479-016-2331-0
