Characterizing fault tolerance in genetic programming

https://doi.org/10.1016/j.future.2010.02.006

Abstract

Evolutionary algorithms, including genetic programming (GP), are frequently employed to solve difficult real-life problems, which can require days or even months of computation. One approach to reducing the time-to-solution is to use parallel computing on distributed platforms. Such large platforms are prone to failures, which can be commonplace events rather than rare occurrences. Fault tolerance and recovery techniques are thus typically necessary. The aim of this article is to show the inherent ability of parallel GP to tolerate failures in distributed platforms without using any fault-tolerance technique. This ability is quantified via simulation experiments, performed for two well-known problems using failure traces from real-world distributed platforms, namely desktop grids.

Introduction

Evolutionary algorithms (EAs) are widely employed to solve real-world problems. When difficult problems are faced, however, large and often prohibitive times-to-solution ensue. A common approach is then to use parallel versions of the algorithm, and successful parallel evolutionary algorithms (PEAs) have been proposed [1], [2], [3]. Additionally, development frameworks for PEAs have been implemented (e.g., Calypso [4]), some specifically for parallel genetic programming (PGP) [5].

PEAs can be run on potentially large-scale parallel computing platforms. Two types of parallel platforms have achieved very high scales: clusters and desktop grids. In the last decade large clusters have become mainstream, and clusters today account for more than 80% of the Top500 list, which ranks the 500 fastest supercomputers based on the LINPACK benchmark [6]. The term “desktop grid” (DG) refers to distributed networks of heterogeneous individual systems that contribute computing resources when idle. A well-known example of a DG is the Berkeley Open Infrastructure for Network Computing (BOINC) [7], which hosts the famous SETI@home project [8]. At the time of writing, BOINC enlists around 500,000 volunteer computers. Smaller, but still impressive, DG infrastructures are deployed within enterprises or data centers [9]. Such deployments comprise more powerful, more available, and less heterogeneous computers than Internet-wide DGs. The main advantage of DGs is that they provide large-scale parallel computing capabilities at very low cost for specific types of applications.

The aforementioned large-scale parallel computing platforms hold promise for running large PEAs. However, at large scale there is a higher risk that processors experience failures during the execution of an application (e.g., a crash). In this paper the terms “failure” and “fault” are used interchangeably, as the subtle distinction between them is not necessary for our purpose. Failures occur frequently in large-scale clusters [10]. In DGs, failures are the common case: a participating computer can be reclaimed by its owner at any time (e.g., when the owner launches an application or uses the keyboard/mouse). In this case, the DG application is abruptly suspended or terminated, which is seen as a failure.

To circumvent and/or alleviate failures, researchers have developed techniques that allow an application to continue executing when one or more of the participating processors fail. This ability is known as fault tolerance, and it ensures that the application behaves in a well-defined manner (e.g., with graceful degradation of performance) when a failure occurs [11]. Various fault tolerance techniques have been developed [12]. These techniques can be employed with parallel applications, and support many types of computational and communication failures [13]. In general, employing fault tolerance mechanisms requires modifying the application, and sometimes the parallel algorithms themselves, so developers can face a sharp increase in software complexity. For this reason, generic fault tolerance solutions have been developed as libraries or software environments [14], [15], [16].
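To give a flavor of what such a mechanism involves, the sketch below shows the simplest of these techniques, application-level checkpoint/restart, applied to a generational EA loop. It is written in Python for illustration only; the checkpoint file name and the evolve_one_generation stand-in are hypothetical, and no particular library's API is implied.

    import os
    import pickle
    import random

    CHECKPOINT = "ea_checkpoint.pkl"  # hypothetical checkpoint file name

    def evolve_one_generation(population):
        # Stand-in for selection, crossover, and mutation.
        return [x + random.uniform(-0.1, 0.1) for x in population]

    def run(generations=100, pop_size=50):
        # Resume from the last checkpoint if one exists (e.g., after a crash).
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                gen, population = pickle.load(f)
        else:
            gen, population = 0, [random.random() for _ in range(pop_size)]
        while gen < generations:
            population = evolve_one_generation(population)
            gen += 1
            # Write to a temporary file and rename atomically, so a failure
            # during the write cannot corrupt the previous checkpoint.
            with open(CHECKPOINT + ".tmp", "wb") as f:
                pickle.dump((gen, population), f)
            os.replace(CHECKPOINT + ".tmp", CHECKPOINT)
        return population

    if __name__ == "__main__":
        run()

Even this minimal scheme shows where the complexity comes from: the main loop must be restructured around state persistence, and care is needed (here, the atomic rename) so that a crash during checkpointing cannot corrupt the previous checkpoint.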

To the best of our knowledge, there has been little investigation of the behaviors of PEAs in general, and of PGP in particular, in the presence of failures. Nevertheless, there are different tools available that can be used to parallelize and run any EA, and thus GP, in volunteer computing environments [17], where failures are common.

In previous work [18], [19], we presented preliminary results on fault tolerance in PGP under several simplifying assumptions. We showed that PGP applications exhibit inherent fault-tolerant behavior. It therefore seems feasible to run them on large-scale computing infrastructures that suffer from failures, without the burden of implementing or using any kind of fault tolerance technique, and without significantly sacrificing overall application efficiency. We then extended those preliminary results in two ways [20]. First, two different PGP problems were used for running simulations based on host availability data collected from real-world DG deployments (instead of simplistic, and ultimately unrealistic, processor availability models). Second, the simulations used availability data from different platforms, making it possible to study the impact of different host availability profiles on application execution. To the best of our knowledge, this was the first attempt to characterize PGP from the fault tolerance point of view. The results showed that, in some specific contexts, PGP can tolerate various failure rates.

All previous results were obtained using a stringent assumption: once unavailable, a resource never becomes available again. This is, however, not the case in real-world DGs. In this paper we extend the work in [20] and run simulations in which resources can become available again. This should further improve the graceful degradation feature of PGP as the number of resources fluctuates throughout application execution instead of continuously decreasing.
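The contrast between the two availability models can be illustrated with a short sketch. Assuming each host is described by a list of (start, end) availability intervals (the values below are made up for illustration, not taken from our traces), the no-churn assumption keeps only each host's first interval, so the host count can only decrease, whereas with churn it fluctuates:

    # Each host is described by (start, end) intervals of availability.
    # These intervals are invented for illustration, not trace data.
    hosts = [
        [(0, 4), (6, 10)],   # leaves at t=4 and returns at t=6 (churn)
        [(0, 7)],            # fails at t=7 and never returns
        [(0, 2), (5, 9)],    # leaves early, later returns
    ]

    def available_with_churn(t):
        # A host counts whenever t falls inside any of its intervals.
        return sum(any(s <= t < e for s, e in h) for h in hosts)

    def available_no_churn(t):
        # Under the no-churn assumption only the first interval matters:
        # once a host becomes unavailable it is lost for good.
        return sum(h[0][0] <= t < h[0][1] for h in hosts)

    for t in range(11):
        print(t, available_no_churn(t), available_with_churn(t))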

This paper is organized as follows. Section 2 reviews related work beyond that described above. Section 3 provides an overview of the different types of failures that may arise, as well as the relevant fault tolerance techniques. Section 4 describes our experimental methodology, and results are discussed in Section 5. Section 6 concludes the paper with a summary of our results and future directions.


Background and related work

When using EAs, and especially GP, to solve real-world problems, researchers and practitioners often face prohibitively long times-to-solution on a single computer. For instance, Trujillo et al. required more than 24 h to solve a computer vision problem [21], and times-to-solution can be much longer, measured in weeks or even months. Consequently, several researchers have studied the application of parallel computing to spatially structured EAs in order to shorten times-to-solution [1], [2], [3]

Failure models

Fault tolerance can be defined as the ability of a system to behave in a well-defined manner once a failure occurs. In this paper we only take into account failures at the process level. A complete description of failures in distributed systems is beyond the scope of our discussion. According to Ghosh [13], failures can be classified as follows: crash, omission, transient, Byzantine, software, temporal, or security failures. However, in practice, any system may experience a failure due to the

Experimental methodology

We rely on simulation experiments. Simulation allows us to perform a statistically significant number of experiments in a wide range of realistic scenarios. Most importantly, our experiments are repeatable because they “replay” host availability trace data collected from real-world DG platforms [49], so that fair comparisons between simulated application executions are possible.
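As an illustration of what “replaying” a trace means, the following sketch (in Python) drives a simulated execution from per-host availability intervals. The trace here is synthetic, generated by a simple on/off process, and the one-individual-per-host mapping and generation length are assumptions made for the example; the actual experiments replay the real traces described below.

    import random

    random.seed(0)  # fixed seed: replaying the same trace gives the same run

    GENERATION_LENGTH = 60.0   # assumed simulated seconds per generation
    HORIZON = 1800.0           # length of the synthetic trace

    def make_intervals():
        # Generate non-overlapping (start, end) availability intervals
        # from a simple on/off process (a stand-in for one host's trace).
        intervals, t = [], 0.0
        while t < HORIZON:
            up = random.expovariate(1 / 300.0)        # mean uptime: 300 s
            intervals.append((t, min(t + up, HORIZON)))
            t += up + random.expovariate(1 / 100.0)   # mean downtime: 100 s
        return intervals

    trace = {h: make_intervals() for h in range(20)}  # 20 synthetic hosts

    def covers(h, t0, t1):
        # True if host h is available during the whole window [t0, t1).
        return any(s <= t0 and t1 <= e for s, e in trace[h])

    # One individual per host per generation (an assumed mapping): an
    # individual's evaluation is lost if its host fails inside the window.
    for g in range(10):
        t0 = g * GENERATION_LENGTH
        n = sum(covers(h, t0, t0 + GENERATION_LENGTH) for h in trace)
        print(f"generation {g}: {n} of {len(trace)} evaluations completed")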

Results without churn

In this section we consider the scenario in which hosts never become available again (no churn): the number of individuals per generation is non-increasing. Fig. 2 shows the evolution of the number of individuals in each generation for the EP5 and 11M problems when simulated over two 24-hour periods, denoted Day 1 and Day 2, randomly selected from each of three of our traces (entrfin, ucb, and xwtr), for a total of six experiments.
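The qualitative effect behind these curves can be reproduced with a toy experiment: an EA whose population shrinks every generation, as if individuals were lost with their hosts and never replaced, can still improve its best fitness. The problem (minimizing x^2), the loss schedule, and the operators below are stand-ins for illustration; they are not the EP5 or 11M GP benchmarks.

    import random

    random.seed(1)

    def fitness(x):
        return x * x  # toy minimization problem (lower is better)

    def run(initial_size=100, generations=50, loss_per_gen=2):
        population = [random.uniform(-10, 10) for _ in range(initial_size)]
        for g in range(generations):
            # Simulate host failures without churn: individuals are lost
            # each generation and never replaced.
            population = population[:max(4, len(population) - loss_per_gen)]
            # Binary-tournament selection plus Gaussian mutation.
            population = [
                min(random.sample(population, 2), key=fitness)
                + random.gauss(0, 0.5)
                for _ in range(len(population))
            ]
            if g % 10 == 0:
                best = min(map(fitness, population))
                print(f"gen {g:2d}  pop {len(population):3d}  best {best:.4f}")

    run()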

Table 3 shows a summary of the obtained fitness for the EP5

Conclusions

In this paper we have analyzed the behavior of parallel genetic programming (PGP) applications executing in distributed platforms with high failure rates, with the goal of characterizing the inherent fault tolerance capabilities of the PGP paradigm. We have used two well-known GP problems and, to the best of our knowledge, for the first time in this context we have used host availability traces collected on real-world desktop grid (DG) platforms.

Our main conclusion is that PGP inherently

Acknowledgements

Thanks are due to the reviewers for their detailed and constructive comments.

This work was supported by the University of Extremadura, by the regional government Junta de Extremadura, by the Spanish Ministry of Science and Education through the National Nohnes project TIN2007-68083-C02-01, and by the U.S. National Science Foundation under Award #0546688.


References (55)

  • A. Baratloo, P. Dasgupta, Z. Kedem, CALYPSO: A novel software system for fault-tolerant parallel processing on...
  • G. Folino et al., CAGE: A tool for parallel genetic programming applications
  • Top 500 Supercomputer Sites, 2009....
  • D. Anderson, BOINC: A system for public-resource computing and storage, in: Grid Computing, 2004. Proceedings. Fifth...
  • D.P. Anderson et al., SETI@home: An experiment in public-resource computing, Commun. ACM (2002)
  • B. Schroeder, G.A. Gibson, A large-scale study of failures in high-performance computing systems, in: Proc. of the...
  • F.C. Gärtner, Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys (1999)
  • D. Thain, M. Livny, The Grid 2 (2004)
  • S. Ghosh, Distributed Systems: An Algorithmic Approach (2006)
  • C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, F. Cappello, Blocking vs. non-blocking...
  • G. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, J. Dongarra, Extending the MPI...
  • J. Pruyne, M. Livny, Managing checkpoints for parallel programs, in: Workshop on Job Scheduling Strategies for Parallel...
  • D. Lombraña, F. Fernández, L. Trujillo, G. Olague, B. Segal, Customizable execution environments with virtual desktop...
  • D. Lombraña, F. Fernández, Analyzing fault tolerance on parallel genetic programming by means of dynamic-size...
  • I. Hidalgo, F. Fernández, J. Lanchares, D. Lombraña, Is the island model fault tolerant?, in: Genetic and Evolutionary...
  • D.L. González, F.F. de Vega, H. Casanova, Characterizing fault tolerance in genetic programming, in: Workshop on...
  • L. Trujillo et al. (2008)

    Daniel Lombraña González received his degree in Computer Engineering from the University of León, Spain, in 2002. He is currently finishing his Ph.D., which is centered on fault tolerance and parallel genetic programming in desktop grid computing environments. His research interests include fault tolerance, parallel genetic programming, and volunteer grid computing.

    Francisco Fernández de Vega received his Ph.D. in Computer Science from the University of Extremadura, Spain, in 2001. He was vice-head of research at the Centro Universitario de Mérida, University of Extremadura, from 2004 to 2005, and CIO of the University of Extremadura from 2005 to 2007. He is currently an associate professor of Computer Science and the director of the GEA research group (Artificial Evolution Group).

    He has published over 150 refereed papers. His research interests include bioinspired algorithms (genetic programming, cellular automata, epidemic algorithms) and cluster and grid computing. He is part of the steering committee of the Spanish conference on Evolutionary Algorithms (MAEB), and has presented invited tutorials at several international conferences (including CEC and PPSN).

    He was co-chair of the 1st and 2nd Workshops on Parallel Bioinspired Algorithms, held jointly with IEEE ICPP 2005 (Oslo) and ACM GECCO 2007 (London), and of the 1st Workshop on Parallel Architectures and Bioinspired Algorithms, held with IEEE PACT 2008 (Toronto). He has edited a special issue on Parallel Bioinspired Algorithms of the Journal of Parallel and Distributed Computing.

    Henri Casanova received his BS degree from the Ecole Nationale Superieure d’Electronique, d’Electrotechnique, d’Informatique et d’Hydraulique de Toulouse, France, in 1993, his MS degree from the Universite Paul Sabatier, Toulouse, France, in 1994, and his Ph.D. degree from the University of Tennessee, Knoxville, in 1998. He is an associate professor in the Information and Computer Science Department, University of Hawai’i, Manoa. His research interests are in the area of parallel and distributed computing. In particular, his research emphasizes the modeling and the simulation of platforms and applications, as well as both the theoretical and practical aspects of scheduling problems.
