ABSTRACT
For the past six years, researchers in genetic programming and other program synthesis disciplines have used the General Program Synthesis Benchmark Suite to benchmark many aspects of automatic program synthesis systems. These problems have been used to make notable progress toward the goal of general program synthesis: automatically creating the types of software that human programmers code. Many of the systems that have attempted the problems in the original benchmark suite have used it to demonstrate performance improvements granted through new techniques. Over time, the suite has gradually become outdated, hindering the accurate measurement of further improvements. The field needs a new set of more difficult benchmark problems to move beyond what was previously possible.
In this paper, we describe the 25 new general program synthesis benchmark problems that make up PSB2, a new benchmark suite. These problems are curated from a variety of sources, including programming katas and college courses. We selected these problems to be more difficult than those in the original suite, and we report results using PushGP that demonstrate this increase in difficulty. These new problems leave plenty of room for improvement, pointing the way for the next six or more years of general program synthesis research.