Characterizing fault tolerance in genetic programming
Introduction
Evolutionary algorithms (EAs) are employed to solve real-world problems. For difficult problems, however, times-to-solution can be large and often prohibitive. A common approach is then to use parallel versions of the algorithm. Successful parallel evolutionary algorithms (PEAs) have been proposed [1], [2], [3]. Additionally, development frameworks for PEAs have been implemented (e.g., Calypso [4]), some specifically for parallel genetic programming (PGP) [5].
PEAs can be run on potentially large-scale parallel computing platforms. Two types of parallel platforms have achieved very high scales: clusters and desktop grids. In the last decade large clusters have become mainstream, and clusters account today for more than 80% of the Top500 list, which ranks the 500 fastest supercomputers based on the LINPACK benchmark [6]. The term “desktop grid” (DG) refers to distributed networks of heterogeneous individual systems that contribute computing resources when idle. A well-known example of a DG is the Berkeley Open Infrastructure for Network Computing (BOINC) [7], which hosts the famous SETI@home project [8]. At the time this article is being written, BOINC enlists around 500,000 volunteer computers. Smaller, but still impressive, DG infrastructures are deployed within enterprises or data centers [9]. Such deployments comprise more powerful, more available, and less heterogeneous computers than Internet-wide DGs. The main advantage of DGs is that they provide large-scale parallel computing capabilities at a very low cost for specific types of applications.
The aforementioned large-scale parallel computing platforms hold promise for running large PEAs. However, at large scale there is a higher risk that processors experience failures during the execution of an application (e.g., a crash). In this paper the terms “failure” and “fault” are used interchangeably, as the subtle distinction between them is not necessary for our purpose. Failures occur frequently in large-scale clusters [10]. In DGs, failures are the common case: a participating computer can be reclaimed by its owner at any time (e.g., when the owner launches an application, or when the keyboard/mouse is used). In this case, the DG application is abruptly suspended or terminated, which is seen as a failure.
To circumvent and/or alleviate failures, researchers have developed techniques that prevent an application from being terminated when one or more of the participating processors experience a failure. This ability is known as fault tolerance, and it ensures that the application behaves in a well-defined manner (e.g., with graceful degradation of performance) when a failure occurs [11]. Various fault tolerance techniques have been developed [12]. These techniques can be employed with parallel applications, and support many types of computational and communication failures [13]. In general, employing fault tolerance mechanisms requires modifying the application, and sometimes the parallel algorithms themselves. Developers can thus face a sharp increase in software complexity. For this reason, generic fault tolerance solutions have been developed as libraries or software environments [14], [15], [16].
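As a concrete illustration of one such generic technique, consider checkpoint/restart: the application periodically saves its state so that, after a crash, it can resume from the last checkpoint rather than start over. The following is a minimal sketch, not the mechanism of any particular library; the file name, state layout, and loop are all hypothetical:

```python
import os
import pickle

CHECKPOINT = "state.pkl"  # hypothetical checkpoint file name

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"generation": 0}

def save_checkpoint(state):
    """Write to a temporary file, then atomically rename, so a crash
    during the write cannot corrupt the previous checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def run(max_generations):
    state = load_checkpoint()
    while state["generation"] < max_generations:
        # ... one generation of the evolutionary loop would run here ...
        state["generation"] += 1
        save_checkpoint(state)  # a crash now loses at most one generation
    return state
```

After a crash, simply calling `run` again resumes from the last completed generation instead of from generation zero; the price is the periodic I/O cost of writing the checkpoint.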
To the best of our knowledge, there has been little investigation of the behaviors of PEAs in general, and of PGP in particular, in the presence of failures. Nevertheless, there are different tools available that can be used to parallelize and run any EA, and thus GP, in volunteer computing environments [17], where failures are common.
In previous work [18], [19], we presented preliminary results about fault tolerance in PGP under several simplified assumptions. We showed that PGP applications exhibit inherent fault-tolerant behaviors. Therefore, it seems feasible to run them on large-scale computing infrastructures, which suffer from failures, without the burden of implementing/using any kind of fault tolerance technique, and without sacrificing overall application efficiency significantly. We then extended those preliminary results in two ways [20]. First, two different PGP problems were used for running simulations using host availability data collected from real-world DG deployments (instead of using simplistic, and ultimately unrealistic, processor availability models). Second, the simulations used availability data from different platforms, making it possible to study the impact of different host availability profiles on application execution. To the best of our knowledge, this was the first attempt to characterize PGP from the fault tolerance point of view. The results showed that, in some specific contexts, PGP can tolerate various failure rates.
All previous results were obtained using a stringent assumption: once unavailable, a resource never becomes available again. This is, however, not the case in real-world DGs. In this paper we extend the work in [20] and run simulations in which resources can become available again. This should further improve the graceful degradation feature of PGP as the number of resources fluctuates throughout application execution instead of continuously decreasing.
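The difference between the two availability assumptions can be made concrete with a toy model. In the sketch below, each host is described by its availability intervals; the interval data is invented purely for illustration. Without churn only a host's first availability interval counts, so the host pool is non-increasing over time; with churn a host counts as available during any of its intervals, so the pool size fluctuates:

```python
# Each host is a list of (start, end) intervals during which it is
# available. These intervals are invented, purely illustrative data.
hosts = [
    [(0, 40), (60, 100)],   # fails at t=40 but returns at t=60 (churn)
    [(0, 25)],              # fails at t=25 and never returns
    [(0, 100)],             # stays up the whole time
]

def available_no_churn(intervals, t):
    """No-churn assumption: only the first interval counts, because
    once a host becomes unavailable it never comes back."""
    start, end = intervals[0]
    return start <= t < end

def available_with_churn(intervals, t):
    """With churn, a host is available during any of its intervals."""
    return any(start <= t < end for start, end in intervals)

def pool_size(t, churn):
    check = available_with_churn if churn else available_no_churn
    return sum(check(h, t) for h in hosts)

for t in (10, 50, 70):
    print(t, pool_size(t, churn=False), pool_size(t, churn=True))
```

At t=70 the no-churn model sees a single surviving host, while the churn model sees two, since the first host has rejoined; this is the fluctuation that should soften the degradation of PGP executions.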
This paper is organized as follows. Section 2 reviews related works beyond the ones described earlier. Section 3 provides an overview of the different types of failure that may arise as well as the relevant fault tolerance techniques. Section 4 describes our experimental methodology, and results are discussed in Section 5. Section 6 concludes the paper with a summary of our results and future directions.
Background and related work
When using EAs, and especially GP, to solve real-world problems, researchers and practitioners often face prohibitively long times-to-solution on a single computer. For instance, Trujillo et al. required more than 24 h to solve a computer vision problem [21], and times-to-solution can be much longer, measured in weeks or even months. Consequently, several researchers have studied the application of parallel computing to spatially structured EAs in order to shorten times-to-solution [1], [2], [3]
Failure models
Fault tolerance can be defined as the ability of a system to behave in a well-defined manner once a failure occurs. In this paper we only take into account failures at the process level. A complete description of failures in distributed systems is beyond the scope of our discussion. According to Ghosh [13], failures can be classified as follows: crash, omission, transient, Byzantine, software, temporal, or security failures. However, in practice, any system may experience a failure due to the
Experimental methodology
We rely on simulation experiments. Simulation allows us to perform a statistically significant number of experiments in a wide range of realistic scenarios. Most importantly, our experiments are repeatable, via “replaying” host availability trace data collected from real-world DG platforms [49], so that fair comparison between simulated application executions is possible.
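A trace-driven simulation of this kind can be sketched as follows. Availability records of the form (host, start, end) are “replayed” to determine how many hosts, and hence how many individuals, each generation can use. The trace records, generation length, and one-individual-per-host mapping below are simplifying assumptions for illustration; real traces such as those of [49] are far larger:

```python
# Replay a (made-up) availability trace: each record says when a host
# joined and left the platform, in seconds since the experiment start.
trace = [
    ("host-a", 0, 5000),
    ("host-b", 0, 1200),
    ("host-b", 3000, 5000),   # host-b returns after a period of churn
    ("host-c", 500, 4000),
]

GENERATION_SECONDS = 1000  # assumed fixed wall-clock time per generation

def hosts_up(trace, t):
    """Hosts whose availability interval covers instant t."""
    return {h for (h, start, end) in trace if start <= t < end}

def individuals_per_generation(trace, generations, per_host=1):
    """Individuals each generation can evaluate, assuming one
    individual per available host (a deliberate simplification)."""
    return [per_host * len(hosts_up(trace, g * GENERATION_SECONDS))
            for g in range(generations)]

print(individuals_per_generation(trace, 5))  # → [2, 3, 2, 3, 2]
```

Because the same trace can be replayed for every simulated execution, runs that differ only in the GP problem or parameters see exactly the same availability pattern, which is what makes the comparisons fair.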
Results without churn
In this section we consider the scenario in which hosts never become available again (no churn): the number of individuals per generation is non-increasing. Fig. 2 shows the evolution of the number of individuals in each generation for the EP5 and 11M problems when simulated over two 24-hour periods, denoted by Day 1 and Day 2, randomly selected out of each of three of our traces, entrfin, ucb, and xwtr, for a total of six experiments.
Table 3 shows a summary of the obtained fitness for the EP5
Conclusions
In this paper we have analyzed the behavior of parallel genetic programming (PGP) applications executing in distributed platforms with high failure rates, with the goal of characterizing the inherent fault tolerance capabilities of the PGP paradigm. We have used two well-known GP problems and, to the best of our knowledge, for the first time in this context we have used host availability traces collected on real-world desktop grid (DG) platforms.
Our main conclusion is that PGP inherently
Acknowledgements
Thanks are due to the reviewers for their detailed and constructive comments.
This work was supported by the University of Extremadura, the regional government Junta de Extremadura, the Spanish Ministry of Science and Education through National Nohnes project TIN2007-68083-C02-01, and the U.S. National Science Foundation under Award #0546688.
Daniel Lombraña González received his degree in Computer Engineering from the University of León, Spain, in 2002. Currently he is finishing his Ph.D., which is centered on fault tolerance and parallel genetic programming on desktop grid computing environments. His research interests include fault tolerance, parallel genetic programming, and volunteer grid computing.
References (55)
- et al., Resource availability in enterprise desktop grids, Journal of Future Generation Computer Systems (2007)
- et al., Grid computing for parallel bioinspired algorithms, J. Parallel Distrib. Comput. (2006)
- et al., Reliability challenges in large systems, Future Generation Computer Systems (2006)
- et al., Building with paradisEO reusable parallel and distributed evolutionary algorithms, Parallel Computing (2004)
- et al., Interfacing Condor and PVM to harness the cycles of workstation clusters, Future Generation Computer Systems (1996)
- et al., Dynamic population variation in genetic programming, Information Sciences (2009)
- et al., Population variation in genetic programming, Information Sciences (2007)
- et al., Parallel genetic programming, Spatially Structured Evolutionary Algorithms (2005)
- A survey of parallel genetic algorithms, Calculateurs Paralleles, Reseaux et Systemes Repartis (1998)
- CAGE: A tool for parallel genetic programming applications
- SETI@home: An experiment in public-resource computing, Commun. ACM
- Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys
- The Grid 2
- Distributed Systems: An Algorithmic Approach
Francisco Fernández de Vega received his Ph.D. in Computer Science from the University of Extremadura, Spain, in 2001. He was vice-head of research at Centro Universitario de Mérida, University of Extremadura, from 2004 to 2005, and CIO of the University of Extremadura from 2005 to 2007. He is currently an associate professor of Computer Science and the director of the GEA research group (Artificial Evolution Group).
He has published over 150 refereed papers. His research interests include Bioinspired Algorithms (Genetic Programming, Cellular Automata, Epidemic Algorithms) and Cluster and Grid Computing. He is part of the steering committee of the Spanish conference on Evolutionary Algorithms (MAEB), and has presented invited tutorials at several international conferences (including CEC and PPSN).
He was co-chair of the 1st and 2nd Workshops on Parallel Bioinspired Algorithms, held jointly with IEEE ICPP 2005, ACM GECCO 2007 and 1st Workshop on Parallel Architectures and Bioinspired Algorithms, IEEE PACT 2008 (Oslo 2005, London 2007, Toronto 2008). He has edited a Special Issue on Parallel Bioinspired Algorithms, Journal of Parallel and Distributed Computing.
Henri Casanova received his BS degree from the Ecole Nationale Superieure d’Electronique, d’Electrotechnique, d’Informatique et d’Hydraulique de Toulouse, France, in 1993, his MS degree from the Universite Paul Sabatier, Toulouse, France, in 1994, and his Ph.D. degree from the University of Tennessee, Knoxville, in 1998. He is an associate professor in the Information and Computer Science Department, University of Hawai’i, Manoa. His research interests are in the area of parallel and distributed computing. In particular, his research emphasizes the modeling and the simulation of platforms and applications, as well as both the theoretical and practical aspects of scheduling problems.