Interfacing knowledge discovery algorithms to large database management systems

https://doi.org/10.1016/S0950-5849(99)00024-5

Abstract

The efficient mining of large, commercially credible, databases requires a solution to at least two problems: (a) better integration between existing Knowledge Discovery algorithms and popular DBMS; (b) ability to exploit opportunities for computational speedup such as data parallelism. Both problems need to be addressed in a generic manner, since the stated requirements of end-users cover a range of data mining paradigms, DBMS, and (parallel) platforms. In this paper we present a family of generic, set-based, primitive operations for Knowledge Discovery in Databases (KDD). We show how a number of well-known KDD classification metrics, drawn from paradigms such as Bayesian classifiers, Rule-Induction/Decision Tree algorithms, Instance-Based Learning methods, and Genetic Programming, can all be computed via our generic primitives. We then show how these primitives may be mapped into SQL and, where appropriate, optimised for good performance in respect of practical factors such as client–server communication overheads. We demonstrate how our primitives can support C4.5, a widely-used rule induction system. Performance evaluation figures are presented for commercially available parallel platforms, such as the IBM SP/2.

Introduction

The efficient mining of large, commercially credible, databases requires a solution to at least two problems: (a) better integration between existing Knowledge Discovery algorithms and popular DBMS such as Oracle; (b) ability to exploit opportunities for computational speedup such as data parallelism. By ‘commercially credible’ databases, we mean databases large enough to make it impractical to retrieve the entire database and process it on a workstation. Most of the popular knowledge discovery algorithms are sequential, and assume that the database is held in main memory; as the foregoing makes clear, this assumption is unrealistic. This results in the so-called scaling problem for knowledge discovery.

In this paper we assume that the database to be mined is stored in some central place, for example in a data warehouse, in a relational database server which is part of a client/server system. This is a common situation. The challenge which we address is to provide a framework for knowledge discovery in databases (KDD) which achieves tolerable run-times on corporate data held within a DBMS. Knowledge discovery, as described in [1], [2], is a process involving several stages: initial investigation, preparation of the data, the application of knowledge discovery algorithms, and postprocessing. Our framework is compatible with this view. The discovery algorithms may cover a number of abstract problems: classification, clustering, the induction of association rules and time series, and so on. In this paper we will address the commonest of these problems, namely classification. Within classification, there is no single algorithm or paradigm that yields the most useful information for all varieties of stored data. Thus, our framework for efficient classification has to be generic, in the sense of supporting several KDD paradigms. This suggests a need for generic primitives—a topic to which we return in Section 2.

The classification problem in a nutshell is to induce a set of rules for assigning one of a given finite set of labels or classes to unseen cases, given a training set of correctly classified cases. Commonplace real-world examples of this problem include credit rating, the effective targeting of mailshots, and medical diagnosis. Ref. [3] provides a good overview of the subject. The particular emphasis of our work is on the efficient mining of large databases with minimal loss of accuracy. There are two aspects of efficiency, namely size and speed, which deserve comment.
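As a purely illustrative sketch (the table and column names below are hypothetical, not drawn from the paper), such a training set can be held as an ordinary relation in the DBMS, with one column per attribute and a designated class column; an induced classifier then predicts the class column for rows whose label is unknown:

    -- Hypothetical training relation for a credit-rating task.
    -- Each row is one correctly classified case; CLASS_LABEL holds the
    -- label that the induced rules must assign to unseen cases.
    CREATE TABLE TRAINING_SET (
        CASE_ID     INTEGER       NOT NULL PRIMARY KEY,
        AGE         INTEGER,                 -- continuous attribute
        INCOME      DECIMAL(10,2),           -- continuous attribute
        EMPLOYMENT  VARCHAR(20),             -- categorical attribute
        CLASS_LABEL VARCHAR(10)   NOT NULL   -- e.g. 'good' or 'bad' credit risk
    );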

First, let us consider size. Returning to the scaling problem noted above, current commercial practice is to “scale” knowledge discovery by retrieving a sample of the database, typically a data cube of at most a few tens of megabytes, to the client and processing it in main memory there. For some applications, sampling is a reasonable approach and works well. There are, however, three potential problems. First, if the knowledge discovery method involves inducing a model of some kind, there is a higher chance of a spurious model fitting a sample by chance. This becomes more likely the more attributes there are, since more attributes mean a larger potential model space. Second, sampling makes it harder to detect special cases or “small disjuncts”. In some applications, the concept to be described consists entirely of a large number of special cases: even if this is not so, it is often the special cases that comprise the interesting knowledge from a commercial viewpoint, because they are the least obvious. Finally, there is the converse problem: because good classification algorithms need to be able to detect special cases, they tend to “overfit” small databases. Again sampling makes it more likely that this will happen. These problems will affect any approach that depends on sampling to achieve scalability. This includes the kind of “divide and conquer” approach described for example in Ref. [4], where classifiers are initially trained on partitions of the database, and then combined using some statistical method or by voting.

Now let us consider speed. If reducing the size of the database is not possible or desirable, the alternative is to speed up the algorithm. There are essentially three ways to make a database application run faster: choose better indexes and storage structures; use fewer or more efficient queries; or put the data on a faster server, in particular a parallel server. The focus of our research is on the second and third options: choice of queries, and parallelism. Parallelism is particularly important. Any whole-database KDD algorithm, however efficient, must be at least O(N), where N is the size of the database. Parallelism offers the possibility of reducing this to O(N/P), where P is the number of processors available. Parallelism is also important because the large databases collected by commercial organizations in their data warehouses are increasingly stored on parallel database servers [5].

The rest of this paper is organised as follows. In Section 2 we describe our framework for KDD, including the design and efficient implementation of our generic KDD primitives. These primitives are set-based, thus mapping conveniently into SQL and simplifying the task of integrating the many existing KDD algorithms with popular DBMS. In Section 3 we describe the application of the primitives to two KDD paradigms, namely simple Bayesian classifiers and genetic programming. Since the decision tree paradigm is widely used in practice, in Section 4 we focus particularly on a well-known example, namely the C4.5 KDD algorithm. Based on our primitives, we describe the implementation of C4.5/DB, a high-accuracy decision tree induction scheme suitable for large databases. In Section 5 we discuss the results of experiments, using synthetic data generators, which evaluate the performance of C4.5/DB on a commercially available parallel database platform, namely the IBM SP/2 running DB2 Parallel Edition. Conclusions and scope for future research are presented in Section 6.


A framework for knowledge discovery

Fig. 1 shows our framework for knowledge discovery in diagram form. A tool provides the user with a graphical interface to a set of more or less autonomous algorithms, which access the database using KDD primitives. Each primitive is implemented using one or more database queries. A “toolbox” approach, where several algorithms are made available together, is necessary, as each knowledge discovery algorithm has an inductive bias which makes it effective on some databases but ineffective on others.
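To make the mapping from primitive to query concrete, the query below is a minimal, purely illustrative sketch of the kind of set-based counting primitive such an interface might issue (it reuses the hypothetical TRAINING_SET relation introduced earlier and is not the paper's exact primitive). It returns, for the subset of cases defined by the WHERE clause, the number of cases for each (attribute value, class) pair, so only aggregated counts rather than raw rows cross the client–server boundary:

    -- Illustrative set-based counting primitive (hypothetical names):
    -- class-distribution counts of one categorical attribute over the
    -- cases satisfying the condition that defines the current subset.
    SELECT EMPLOYMENT, CLASS_LABEL, COUNT(*) AS N
    FROM   TRAINING_SET
    WHERE  AGE > 40
    GROUP BY EMPLOYMENT, CLASS_LABEL;

The size of the result depends only on the number of distinct (value, class) pairs, not on the number of rows in the database, which is what keeps client–server communication overheads tolerable.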

Implementation of classification paradigms

The remainder of the paper concerns our implementation of several popular classification paradigms. Our toolkit currently covers: decision tree induction, simple Bayesian induction and a genetic programming method. We assume our readers are familiar with these paradigms, and will only describe them to the minimal extent necessary to make the paper coherent.
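As a hedged illustration of how a counting primitive of this kind serves the simple Bayesian paradigm, the two queries below (again using the hypothetical TRAINING_SET names) gather the counts from which a naive Bayes classifier estimates the class priors P(c) and the conditional probabilities P(value | c); the probability estimates themselves are then formed on the client from these small result sets:

    -- Class priors: number of training cases per class.
    SELECT CLASS_LABEL, COUNT(*) AS N_CLASS
    FROM   TRAINING_SET
    GROUP BY CLASS_LABEL;

    -- Conditional counts for one categorical attribute: how often each
    -- value of EMPLOYMENT co-occurs with each class.
    SELECT CLASS_LABEL, EMPLOYMENT, COUNT(*) AS N_VALUE_CLASS
    FROM   TRAINING_SET
    GROUP BY CLASS_LABEL, EMPLOYMENT;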

C4.5/DB: high-accuracy decision tree induction for large databases

A decision tree classifier approximates the class probability function by partitioning the case space based on the available attributes, and assigning the same class probabilities to all the cases in a given partition. The idea is somewhat analogous to the idea of a “step function” in mathematics. Each node of the tree represents a partition of the case space, generated by testing a single attribute. If the attribute is categorical, the partition contains one subset for each possible value of the attribute.
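For reference, the standard splitting criterion used in decision tree induction of this kind can be computed directly from per-(value, class) counts of the sort delivered by the counting primitive above. With S the set of cases at a node, p_c the proportion of cases in S belonging to class c, and S_v the subset of S taking value v for a candidate attribute A, the information gain of a split on A is

    \mathrm{Entropy}(S) = - \sum_{c} p_c \log_2 p_c,
    \qquad
    \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v).

C4.5 itself normalises this quantity by the split information to obtain the gain ratio, but the counts required from the database are the same.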

Exploitation of parallelism

One of the major attractions of our primitive-based approach to data mining is that very large databases are often stored on parallel servers, and by using SQL queries we can exploit the parallelism those servers provide. In this section we assess how efficiently C4.5/DB does this.

We measure the effects of increasing parallelism along the usual dimensions, speedup and scaleup. Speedup reflects the effect of increasing the system resources while the database size is fixed: ideally, execution time should fall in proportion to the number of processors added.
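Under one common formulation of these two measures, with T(P, N) the elapsed time on P processors for a database of size N and N_0 a fixed reference size:

    \mathrm{Speedup}(P) = \frac{T(1, N_0)}{T(P, N_0)}
    \qquad \text{(ideal: } \mathrm{Speedup}(P) = P\text{)},

    \mathrm{Scaleup}(P) = \frac{T(1, N_0)}{T(P, P \cdot N_0)}
    \qquad \text{(ideal: } \mathrm{Scaleup}(P) = 1\text{)}.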

Conclusions

One of our stated tasks was to achieve a closer integration between the KDD and DBMS cultures. We have developed and used some set-based generic KDD primitives which have permitted the following aspects of integration to be addressed:

  • 1. A standard SQL interface can be created between several existing KDD algorithms and several existing relational DBMS. This in turn allows corporate data management features such as security and integrity to be maintained throughout the data-mining process.

Acknowledgements

We would like to thank the University of Southampton for allowing us access to their IBM SP2, and Steve O'Connell and Ian Hardy for their assistance with our experiments. The research described in this paper was supported by EPSRC grant GR/L/16002.

References (20)

  • P.K. Chan et al.
  • R. Kufrin
  • U.M. Fayyad et al.
  • R.J. Brachman et al.
  • S.M. Weiss et al., Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems (1991)
  • Proc. First and Second Commercial Parallel Programming Conferences, London, IBC Technical Services Ltd., November...
  • A.A. Freitas et al., Mining Very Large Databases with Parallel Processing (1998)
  • A.A. Freitas et al., Speeding up knowledge discovery in large relational databases by means of a new discretization algorithm
  • J. Catlett, On Changing Continuous Attributes into Ordered Discrete Attributes, Proc. European Working Session on...
  • J. Gray et al., A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, Data Mining and Knowledge Discovery (1997)