
Identifying and Harnessing the Building Blocks of Machine Learning Pipelines for Sensible Initialization of a Data Science Automation Tool

Chapter in Genetic Programming Theory and Practice XIV

Part of the book series: Genetic and Evolutionary Computation (GEVO)

Abstract

As data science continues to grow in popularity, there will be an increasing need to make data science tools more scalable, flexible, and accessible. In particular, automated machine learning (AutoML) systems seek to automate the process of designing and optimizing machine learning pipelines. In this chapter, we present a genetic programming-based AutoML system called TPOT that optimizes a series of feature preprocessors and machine learning models with the goal of maximizing classification accuracy on a supervised classification problem. Further, we analyze a large database of pipelines that were previously used to solve various supervised classification problems and identify the 100 short series of machine learning operations that appear most frequently, which we call the building blocks of machine learning pipelines. We harness these building blocks to initialize TPOT with promising solutions, and find that this sensible initialization method significantly improves TPOT’s performance on one benchmark without significantly degrading performance on the others. Thus, sensible initialization with machine learning pipeline building blocks shows promise for GP-based AutoML systems, and should be further refined in future work.
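As a rough illustration of the building-block idea described in the abstract, the sketch below counts short runs of operations across a set of pipelines and uses the most frequent ones to seed an initial population. It is a minimal sketch under stated assumptions, not the chapter's implementation: pipelines are simplified to flat lists of operator names (TPOT represents pipelines as trees), a "short series" is assumed to be a contiguous run of one to three operators, and the function names `count_building_blocks`, `top_building_blocks`, and `seed_population` are hypothetical.

```python
from collections import Counter
import random


def count_building_blocks(pipelines, max_len=3):
    """Count every contiguous run of 1..max_len operators across all pipelines.

    Assumption: each pipeline is a flat list of operator names.
    """
    counts = Counter()
    for ops in pipelines:
        for n in range(1, max_len + 1):
            for i in range(len(ops) - n + 1):
                counts[tuple(ops[i:i + n])] += 1
    return counts


def top_building_blocks(pipelines, k=100, max_len=3):
    """Return the k most frequent operator runs (the assumed 'building blocks')."""
    counts = count_building_blocks(pipelines, max_len)
    return [block for block, _ in counts.most_common(k)]


def seed_population(blocks, pop_size, rng):
    """Build an initial population by sampling building blocks uniformly."""
    return [list(rng.choice(blocks)) for _ in range(pop_size)]


# Toy pipeline database; the operator names are illustrative only.
pipelines = [
    ["StandardScaler", "PCA", "RandomForestClassifier"],
    ["StandardScaler", "PCA", "LogisticRegression"],
    ["PCA", "RandomForestClassifier"],
]

blocks = top_building_blocks(pipelines, k=5)
population = seed_population(blocks, pop_size=4, rng=random.Random(0))
```

In a GP-based system, the seeded individuals would replace some or all of the randomly generated initial population; how many to seed, and whether to sample uniformly or by frequency, are design choices this sketch leaves open.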


Notes

  1. See https://gist.github.com/rhiever/27f795b00b95751ee38fd9e946c72b0b for a full list of building blocks.

  2. Benchmark data available at http://www.randalolson.com/data/benchmarks/.

Acknowledgements

We thank the Penn Medicine Academic Computing Services for the use of their computing resources. This work was supported by National Institutes of Health grants LM009012, LM010098, and EY022300.

Author information

Corresponding author

Correspondence to Randal S. Olson.

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Olson, R.S., Moore, J.H. (2018). Identifying and Harnessing the Building Blocks of Machine Learning Pipelines for Sensible Initialization of a Data Science Automation Tool. In: Riolo, R., Worzel, B., Goldman, B., Tozier, B. (eds) Genetic Programming Theory and Practice XIV. Genetic and Evolutionary Computation. Springer, Cham. https://doi.org/10.1007/978-3-319-97088-2_14

  • DOI: https://doi.org/10.1007/978-3-319-97088-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-97087-5

  • Online ISBN: 978-3-319-97088-2

  • eBook Packages: Computer Science, Computer Science (R0)
