ABSTRACT
Identifying record replicas in Digital Libraries and other types of digital repositories is fundamental to improving the quality of their content and services, as well as to enabling eventual sharing efforts. Several deduplication strategies are available, but most of them rely on manually chosen settings to combine the evidence used to identify records as replicas. In this paper, we present the results of experiments carried out with a novel Machine Learning approach we have proposed for the deduplication problem. This approach, based on Genetic Programming (GP), automatically generates similarity functions to identify record replicas in a given repository. The generated similarity functions combine and weight the best evidence available among the record fields in order to tell when two distinct records represent the same real-world entity. The results of the experiments show that our approach outperforms the baseline method by Fellegi and Sunter by more than 12% when identifying replicas in a data set containing researchers' personal data, and by more than 7% in a data set with article citation data.
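As an illustration only, the sketch below shows one way a candidate similarity function of the kind described above could be represented and applied: an expression tree whose leaves are pieces of evidence (a record field paired with a similarity measure) or numeric weights. It is not the authors' implementation. The field names, the two similarity measures, the operator set, the example tree, and the threshold are all assumptions made for this example, and the GP search that would evolve such trees is not shown.

```python
from difflib import SequenceMatcher
from operator import add, mul


def string_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for any string measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical evidence: name -> (record field, similarity measure).
EVIDENCE = {
    "name_sim": ("name", string_sim),
    "title_jac": ("title", token_jaccard),
}

# Hypothetical operator set for internal tree nodes.
OPS = {"+": add, "*": mul}


def evaluate(tree, rec_a: dict, rec_b: dict) -> float:
    """Evaluate an expression tree on a pair of records.

    A tree is a number (constant weight), a string (evidence name),
    or a tuple (operator, left_subtree, right_subtree).
    """
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        field, measure = EVIDENCE[tree]
        return measure(rec_a[field], rec_b[field])
    op, left, right = tree
    return OPS[op](evaluate(left, rec_a, rec_b), evaluate(right, rec_a, rec_b))


def is_replica(tree, rec_a: dict, rec_b: dict, threshold: float) -> bool:
    """Flag the pair as replicas when the combined score reaches the threshold."""
    return evaluate(tree, rec_a, rec_b) >= threshold


# One hand-written candidate: 2 * name_sim + title_jac.
# In a GP setting, trees like this would be generated and selected automatically.
CANDIDATE = ("+", ("*", 2.0, "name_sim"), "title_jac")

if __name__ == "__main__":
    r1 = {"name": "J. R. Koza", "title": "genetic programming"}
    r2 = {"name": "Koza, J. R.", "title": "genetic programming book"}
    print(evaluate(CANDIDATE, r1, r2), is_replica(CANDIDATE, r1, r2, threshold=2.0))
```

Representing the function as a tree keeps the choice of evidence, the weights, and the way they are combined in a single structure that a search procedure can modify; the threshold of 2.0 used here is arbitrary.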
- Baeza-Yates, R. A., and Ribeiro-Neto, B. A. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.
- Banzhaf, W., Nordin, P., Keller, R. E., and Francone, F. D. Genetic Programming - An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers, 1998.
- Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. Adaptive name matching in information integration. IEEE Intelligent Systems 18, 5 (September/October 2003), 16--23.
- Bilenko, M., and Mooney, R. J. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003), pp. 39--48.
- Caldera, J. P., and Rosa, A. C. School time tabling using genetic search. In Proceedings of the 2nd International Conference on the Practice and Theory of Automated Timetabling (1997), pp. 115--122.
- Carvalho, J. C. P., and Silva, A. S. Finding similar identities among objects from multiple web sources. In Proceedings of the Fifth ACM International Workshop on Web Information and Data Management (2003), pp. 90--93.
- Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (2003), pp. 313--324.
- Cohen, W. W. Data integration using similarity joins and a word-based information representation language. ACM TOIS 18, 3 (2000), 288--321.
- Cohen, W. W., and Richman, J. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002), pp. 475--480.
- Freely Extensible Biomedical Record Linkage. http://sourceforge.net/projects/febrl.
- Fellegi, I. P., and Sunter, A. B. A theory for record linkage. Journal of the American Statistical Association 66, 1 (1969), 1183--1210.
- Guha, S., Koudas, N., Marathe, A., and Srivastava, D. Merging the results of approximate match operations. In Proc. of VLDB (2004), pp. 636--647.
- Haveliwala, T. H., Gionis, A., Klein, D., and Indyk, P. Evaluating strategies for similarity search on the web. In Proceedings of the 11th International Conference on World Wide Web (2002), pp. 432--442.
- Koza, J. R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
- Lawrence, S., Giles, C. L., and Bollacker, K. D. Autonomous citation matching. In Proceedings of the Third International Conference on Autonomous Agents (1999), pp. 392--393.
- Lawrence, S., Giles, C. L., and Bollacker, K. D. Digital libraries and autonomous citation indexing. IEEE Computer 32, 6 (1999), 67--71.
- Lee, D., On, B. W., Kang, J., and Park, S. Effective and scalable solutions for mixed and split citation problems in digital libraries. In Proceedings of the 2nd International Workshop on Information Quality in Information Systems (2005), pp. 69--76.
- Tejada, S., Knoblock, C. A., and Minton, S. Learning object identification rules for information integration. Information Systems 26, 8 (2001), 607--633.