A Comparison of Blocking Methods for Record Linkage

Rebecca C. Steorts¹⁶,
Samuel L. Ventura¹⁶,
Mauricio Sadinle¹⁶ &
…
Stephen E. Fienberg¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8744))

Included in the following conference series:

International Conference on Privacy in Statistical Databases

1722 Accesses
66 Citations

Abstract

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as “private blocking.” We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

A Tutorial on Blocking Methods for Privacy-Preserving Record Linkage

Novel Blocking Techniques and Distance Metrics for Record Linkage

Neighbourhood Blocking for Record Linkage

References

Christen, P.: Probabilistic Data Generation for Deduplication and Data Linkage. In: Gallagher, M., Hogan, J.P., Maire, F. (eds.) IDEAL 2005. LNCS, vol. 3578, pp. 109–116. Springer, Heidelberg (2005)
Chapter Google Scholar
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24 (2012)
Google Scholar
Christen, P., Pudjijono, A.: Accurate synthetic generation of realistic personal information. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 507–514. Springer, Heidelberg (2009)
Chapter Google Scholar
Christen, P., Pudjijono, A.: Accurate Synthetic Generation of Realistic Personal Information. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 507–514. Springer, Heidelberg (2009), http://dx.doi.org/10.1007/978-3-642-01307-2_47
Chapter Google Scholar
Christen, P., Vatsalan, D.: Flexible and Extensible Generation and Corruption of Personal Data. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2013) (2013)
Google Scholar
Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large networks. Physical Review E 70, 066111 (2004)
Google Scholar
Durham, E.A.: A framework for accurate, efficient private record linkage. Ph.D. thesis, Vanderbilt University (2012)
Google Scholar
Fortunato, S.: Community detection in graphs. Physics Reports 486, 75–174 (2010)
Article MathSciNet Google Scholar
Goldenberg, A., Zheng, A.X., Fienberg, S.E., Airoldi, E.M.: A survey of statistical network models. Foundations and Trends® in Machine Learning 2, 129–233 (2010)
Article Google Scholar
Hall, R., Fienberg, S.: Valid statistical inference on automatically matched files. In: Domingo-Ferrer, J., Tinnirello, I. (eds.) PSD 2012. LNCS, vol. 7556, pp. 131–142. Springer, Heidelberg (2012)
Chapter Google Scholar
Herzog, T., Scheuren, F., Winkler, W.: Data Quality and Record Linkage Techniques. Springer, New York (2007)
MATH Google Scholar
Herzog, T., Scheuren, F., Winkler, W.: Record linkage. Wiley Interdisciplinary Reviews: Computational Statistics 2 (2010), doi:10.1002/wics.108
Google Scholar
Karakasidis, A., Verykios, V.S.: Reference table based k-anonymous private blocking. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 859–864. ACM (2012)
Google Scholar
Kuzu, M., Kantarcioglu, M., Durham, E., Malin, B.: A constraint satisfaction cryptanalysis of bloom filters in private record linkage. In: Fischer-Hübner, S., Hopper, N. (eds.) PETS 2011. LNCS, vol. 6794, pp. 226–245. Springer, Heidelberg (2011)
Chapter Google Scholar
Liang, H., Wang, Y., Christen, P., Gayler, R.: Noise-tolerant approximate blocking for dynamic real-time entity resolution. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014, Part II. LNCS, vol. 8444, pp. 449–460. Springer, Heidelberg (2014)
Chapter Google Scholar
McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM (2000)
Google Scholar
Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I.: Identity uncertainty and citation matching. In: Advances in Neural Information Processing Systems, pp. 1425–1432 (2003)
Google Scholar
Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31, 1348–1358 (2010)
Article Google Scholar
Rajaraman, A., Ullman, J.D.: Mining of massive datasets. Cambridge University Press (2012)
Google Scholar
Steorts, R.C., Hall, R., Fienberg, S.: A Bayesian approach to graphical record linkage and de-duplication (2013) (submitted)
Google Scholar
Vatsalan, D., Christen, P.: Sorted nearest neighborhood clustering for efficient private blocking. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013, Part II. LNCS, vol. 7819, pp. 341–352. Springer, Heidelberg (2013)
Chapter Google Scholar
Vatsalan, D., Christen, P., Verykios, V.S.: An efficient two-party protocol for approximate matching in private record linkage. In: Proceedings of the Ninth Australasian Data Mining Conference, vol. 121, pp. 125–136. Australian Computer Society, Inc. (2011)
Google Scholar
Winkler, W., Yancey, W., Porter, E.: Fast record linkage of very large files in support of decennial and administrative records projects. In: Proceedings of American Statistical Association Section on Survey Research Methods (2010)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, U.S.A., 15213
Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle & Stephen E. Fienberg

Authors

Rebecca C. Steorts
View author publications
You can also search for this author in PubMed Google Scholar
Samuel L. Ventura
View author publications
You can also search for this author in PubMed Google Scholar
Mauricio Sadinle
View author publications
You can also search for this author in PubMed Google Scholar
Stephen E. Fienberg
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Engineering and Mathematics, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007, Tarragona, Catalonia
Josep Domingo-Ferrer

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E. (2014). A Comparison of Blocking Methods for Record Linkage. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_20

Download citation

DOI: https://doi.org/10.1007/978-3-319-11257-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Comparison of Blocking Methods for Record Linkage

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

A Tutorial on Blocking Methods for Privacy-Preserving Record Linkage

Novel Blocking Techniques and Distance Metrics for Record Linkage

Neighbourhood Blocking for Record Linkage

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

A Comparison of Blocking Methods for Record Linkage

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

A Tutorial on Blocking Methods for Privacy-Preserving Record Linkage

Novel Blocking Techniques and Distance Metrics for Record Linkage

Neighbourhood Blocking for Record Linkage

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation