Abstract
Record linkage is a common task in data integration that identifies matching records, drawn from different databases, that refer to the same entity. The scalability of multidatabase record linkage (MDRL) is significantly challenged by increases in both the sizes and the number of databases to be linked. Identifying matching records across subgroups of databases is an important aspect of MDRL that has not been addressed so far. We propose a scalable subgroup blocking approach for MDRL that uses an efficient search over a graph structure to identify similar blocks of records that need to be compared across subgroups of multiple databases. We analyse our technique in terms of complexity and blocking quality. An empirical study on large real-world datasets shows that our approach scales with the size of subgroups and the number of databases, and outperforms an existing state-of-the-art blocking technique for MDRL.
This work was funded by the Australian Research Council under Discovery Projects DP130101801 and DP160101934. The authors would also like to thank Vassilios Verykios for his valuable feedback.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Ranbaduge, T., Vatsalan, D., Christen, P. (2018). A Scalable and Efficient Subgroup Blocking Scheme for Multidatabase Record Linkage. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_2
DOI: https://doi.org/10.1007/978-3-319-93040-4_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science (R0)