FlexGP

Cloud-Based Ensemble Learning with Genetic Programming for Large Regression Problems

Journal of Grid Computing

Abstract

We describe FlexGP, the first Genetic Programming system to perform symbolic regression on large-scale datasets on the cloud via massive data-parallel ensemble learning. FlexGP provides a decentralized, fault-tolerant parallelization framework that runs many copies of Multiple Regression Genetic Programming, a sophisticated symbolic regression algorithm, on the cloud. Each copy executes with a different sample of the data and different parameters. The framework can create a fused model or ensemble on demand as the individual GP learners are evolving. We demonstrate our framework by deploying 100 independent GP instances in a massive data-parallel manner to learn from a dataset composed of 515K exemplars and 90 features, and by generating a competitive fused model in less than 10 minutes.
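
The data-parallel ensemble idea in the abstract can be illustrated with a toy sketch. The code below is not FlexGP and does not implement Multiple Regression Genetic Programming: each learner is a stand-in one-dimensional ridge regression, the fusion step is a plain average of predictions, and every name (`make_data`, `fit_learner`, `train_ensemble`, `fuse`) as well as the synthetic data are assumptions made for illustration only. It shows only the overall shape: independent learners, each given a different data sample and a different parameter, combined into one fused model.

```python
import random
import statistics


def make_data(n, seed=0):
    """Synthetic data (assumed for illustration): y = 2x + 1 plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


def fit_learner(sample, ridge):
    """Stand-in learner: closed-form 1-D ridge regression (not MRGP)."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    slope = sxy / (sxx + ridge)
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def train_ensemble(data, n_learners, sample_size, seed=1):
    """Train independent learners, each on its own data sample and parameter."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_learners):
        sample = rng.sample(data, sample_size)  # a different sample per learner
        ridge = rng.choice([0.0, 0.01, 0.1])    # a different parameter per learner
        models.append(fit_learner(sample, ridge))
    return models


def fuse(models):
    """Fuse by averaging predictions (a simple stand-in for model fusion)."""
    return lambda x: statistics.fmean(m(x) for m in models)


data = make_data(2000)
models = train_ensemble(data, n_learners=10, sample_size=200)
fused = fuse(models)
print(round(fused(0.5), 2))
```

Because the learners share no state, `fuse` can be called on whatever models exist at any moment, which loosely mirrors building an ensemble on demand while the learners are still running.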

Author information

Corresponding author

Correspondence to Ignacio Arnaldo.

About this article

Cite this article

Veeramachaneni, K., Arnaldo, I., Derby, O. et al. FlexGP. J Grid Computing 13, 391–407 (2015). https://doi.org/10.1007/s10723-014-9320-9

  • DOI: https://doi.org/10.1007/s10723-014-9320-9
