abstract = "We propose GP-zip2, a new approach to loss less data
compression based on Genetic Programming (GP). GP is
used to optimally combine well-known loss-less
compression algorithms to maximise data compression.
GP-zip2 evolves programs with multiple components. One
component analyses statistical features extracted by
sequentially scanning the data to be compressed and
divides the data into blocks. These blocks are
projected onto a two-dimensional Euclidean space via
two further (evolved) program components. K-means
clustering is then applied to group similar data
blocks. Each cluster is labelled with the optimal
compression algorithm for its member blocks. After
evolution, evolved programs can be used to compress
unseen data. The compression algorithms available to
GP-zip2 are: Arithmetic coding, Lempel-Ziv-Welch,
Unbounded Prediction by Partial Matching, Run Length
Encoding, and Bzip2. Experimentation shows that the
results produced by GP-zip2 are human-competitive,
being typically superior to well-established
human-designed compression algorithms in terms of the
compression ratios achieved in heterogeneous archive
files.",
notes = "GP chooses when to switch between 5 well known
algorithms (LZW, PPMD, Run-length encoding and Bzip2)
and which to use. Three trees co-evolve to do work
between themselves (splitter tree, two feature
extraction, cf \cite{langdon:book} ). Error in previous
Kattan GP-zip work. Threshold theta on splitter tree
output, Davis Bouldin index (DBI). variable length
header in output file. Compares well to WinRar. Amazon
EC2 AWS cloud computer used. Model of use assumes
compress once (eg DVD) uncompressed many times.
Single v. two pass approaches. Arithmetic coding.",
affiliation = "School of Computer Science and Electronic Engineering,
University of Essex, Colchester, CO4 3SQ UK",