DOI: 10.1145/1141753.1141760
Article

Learning to deduplicate

Published: 11 June 2006

ABSTRACT

Identifying record replicas in digital libraries and other types of digital repositories is fundamental to improving the quality of their content and services, as well as to enabling eventual sharing efforts. Several deduplication strategies are available, but most of them rely on manually chosen settings to combine the evidence used to identify records as replicas. In this paper, we present the results of experiments carried out with a novel machine learning approach that we have proposed for the deduplication problem. This approach, based on Genetic Programming (GP), automatically generates similarity functions to identify record replicas in a given repository. The generated similarity functions combine and weight the best evidence available among the record fields in order to tell when two distinct records represent the same real-world entity. The results of the experiments show that our approach outperforms the baseline method by Fellegi and Sunter by more than 12% when identifying replicas in a data set containing researchers' personal data, and by more than 7% in a data set with article citation data.
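
The idea of a generated similarity function that combines and weights field-level evidence can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the tuple-based expression tree, the choice of `+` and `*` as operators, the use of Python's `difflib.SequenceMatcher` as the per-field similarity measure, the F1-style fitness, and the sample records and field names. A GP system would search over many such trees; this sketch only shows how one candidate tree could be evaluated and scored.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A candidate similarity function is a nested tuple (an expression tree):
#   ("field", name)      similarity of that field across the two records
#   ("const", value)     a numeric weight
#   ("+", left, right)   sum of two subtrees
#   ("*", left, right)   product of two subtrees
def evaluate(tree, rec_a: dict, rec_b: dict) -> float:
    """Evaluate a candidate tree on one pair of records."""
    kind = tree[0]
    if kind == "field":
        name = tree[1]
        return field_similarity(rec_a[name], rec_b[name])
    if kind == "const":
        return float(tree[1])
    left = evaluate(tree[1], rec_a, rec_b)
    right = evaluate(tree[2], rec_a, rec_b)
    if kind == "+":
        return left + right
    if kind == "*":
        return left * right
    raise ValueError(f"unknown node type: {kind!r}")


def is_replica(tree, rec_a: dict, rec_b: dict, threshold: float) -> bool:
    """Flag a pair as replicas when the combined score reaches the threshold."""
    return evaluate(tree, rec_a, rec_b) >= threshold


def f1_fitness(tree, labeled_pairs, threshold: float) -> float:
    """Score a candidate tree by F1 on labeled pairs (assumed fitness)."""
    tp = fp = fn = 0
    for rec_a, rec_b, label in labeled_pairs:
        predicted = is_replica(tree, rec_a, rec_b, threshold)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical candidate: weight the name evidence twice as much as the city.
candidate = ("+", ("*", ("const", 2.0), ("field", "name")), ("field", "city"))

# Hypothetical records, invented for this example.
r1 = {"name": "John Smith", "city": "Boston"}
r2 = {"name": "Jon Smith", "city": "Boston"}
r3 = {"name": "Maria Oliveira", "city": "Lisbon"}
pairs = [(r1, r2, True), (r1, r3, False)]

if __name__ == "__main__":
    print(evaluate(candidate, r1, r2), evaluate(candidate, r1, r3))
    print(f1_fitness(candidate, pairs, threshold=2.5))
```

In an evolutionary loop, trees like `candidate` would be mutated and recombined, and a fitness such as `f1_fitness` would decide which ones survive; the operators, evidence, and fitness actually used in the paper are not stated on this page.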

References

  1. Baeza-Yates, R. A., and Ribeiro-Neto, B. A. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.
  2. Banzhaf, W., Nordin, P., Keller, R. E., and Francone, F. D. Genetic Programming - An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers, 1998.
  3. Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. Adaptive name matching in information integration. IEEE Intelligent Systems 18, 5 (September/October 2003), 16--23.
  4. Bilenko, M., and Mooney, R. J. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003), pp. 39--48.
  5. Caldera, J. P., and Rosa, A. C. School time tabling using genetic search. In Proceedings of the 2nd International Conference on the Practice and Theory of Automated Timetabling (1997), pp. 115--122.
  6. Carvalho, J. C. P., and Silva, A. S. Finding similar identities among objects from multiple web sources. In Proceedings of the Fifth ACM International Workshop on Web Information and Data Management (2003), pp. 90--93.
  7. Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (2003), pp. 313--324.
  8. Cohen, W. W. Data integration using similarity joins and a word-based information representation language. ACM TOIS 18, 3 (2000), 288--321.
  9. Cohen, W. W., and Richman, J. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002), pp. 475--480.
  10. Freely Extensible Biomedical Record Linkage. http://sourceforge.net/projects/febrl.
  11. Fellegi, I. P., and Sunter, A. B. A theory for record linkage. Journal of the American Statistical Association 66, 1 (1969), 1183--1210.
  12. Guha, S., Koudas, N., Marathe, A., and Srivastava, D. Merging the results of approximate match operations. In Proc. of VLDB (2004), pp. 636--647.
  13. Haveliwala, T. H., Gionis, A., Klein, D., and Indyk, P. Evaluating strategies for similarity search on the web. In Proceedings of the 11th International Conference on World Wide Web (2002), pp. 432--442.
  14. Koza, J. R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
  15. Lawrence, S., Giles, C. L., and Bollacker, K. D. Autonomous citation matching. In Proceedings of the Third International Conference on Autonomous Agents (1999), pp. 392--393.
  16. Lawrence, S., Giles, C. L., and Bollacker, K. D. Digital libraries and autonomous citation indexing. IEEE Computer 32, 6 (1999), 67--71.
  17. Lee, D., On, B. W., Kang, J., and Park, S. Effective and scalable solutions for mixed and split citation problems in digital libraries. In Proceedings of the 2nd International Workshop on Information Quality in Information Systems (2005), pp. 69--76.
  18. Tejada, S., Knoblock, C. A., and Minton, S. Learning object identification rules for information integration. Information Systems 26, 8 (2001), 607--633.

Published in

JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
June 2006
402 pages
ISBN: 1595933549
DOI: 10.1145/1141753

Copyright © 2006 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 415 of 1,482 submissions, 28%
