A Genetic Programming Approach to Record Deduplication
Created by W.Langdon from
gp-bibliography.bib Revision:1.7954
@Article{deCarvalho:2011:ieeeTKDE,
author = "Moises G. {de Carvalho} and Alberto H. F. Laender and
Marcos Andre Goncalves and Altigran S. {da Silva}",
title = "A Genetic Programming Approach to Record
Deduplication",
journal = "IEEE Transactions on Knowledge and Data Engineering",
year = "2012",
month = mar,
volume = "24",
number = "3",
pages = "399--412",
abstract = "Several systems that rely on consistent data to offer
high quality services, such as digital libraries and
e-commerce brokers, may be affected by the existence of
duplicates, quasi-replicas, or near-duplicate entries
in their repositories. Because of that, there have been
significant investments from private and government
organisations in developing methods for removing
replicas from their data repositories. This is due to the
fact that clean and replica-free repositories not only
allow the retrieval of higher-quality information but
also lead to more concise data and to potential savings
in computational time and resources to process this
data. In this article, we propose a genetic programming
approach to record deduplication that combines several
different pieces of evidence extracted from the data
content to find a deduplication function that is able
to identify whether two entries in a repository are
replicas or not. As shown by our experiments, our
approach outperforms an existing state-of-the-art
method found in the literature. Moreover, the suggested
functions are computationally less demanding since they
use less evidence. In addition, our genetic
programming approach is capable of automatically
adapting these functions to a given fixed replica
identification boundary, freeing the user from the
burden of having to choose and tune this parameter.",
keywords = "genetic algorithms, genetic programming, computational
time, data repositories, database administration,
database integration, digital libraries, e-commerce
brokers, fixed replica identification boundary,
information retrieval, record deduplication, replica
removal, replica-free repositories, replicated databases",
size = "14 pages",
DOI = "doi:10.1109/TKDE.2010.234",
ISSN = "1041-4347",
notes = "Also known as \cite{5645623}",
}