Statistics, Department of

 

The R Journal

Date of this Version

12-2010

Document Type

Article

Citation

The R Journal (December 2010) 2(2)

Comments

Copyright 2010, The R Foundation. Open access material. License: CC BY 3.0 Unported

Abstract

Record linkage deals with detecting homonyms and, mainly, synonyms in data. The package RecordLinkage provides means to perform and evaluate different record linkage methods. A stochastic framework is implemented which calculates weights through an EM algorithm. The thresholds required by this model can be determined with tools from extreme value theory. Furthermore, machine learning methods are utilized, including decision trees (rpart), bootstrap aggregating (bagging), AdaBoost (ada), neural nets (nnet) and support vector machines (svm). The generation of record pairs and comparison patterns from single data items is provided as well. Comparison patterns can be chosen to be binary or based on string metrics. In order to reduce computation time and memory usage, blocking can be used. Future development will concentrate on additional and refined methods, performance improvements and input/output facilities needed for real-world applications.
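To make two ideas from the abstract concrete, blocking and binary comparison patterns, here is a minimal, self-contained sketch in plain Python. It does not use the RecordLinkage package or its API. The toy records and the helper names `candidate_pairs` and `binary_pattern` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (first name, last name, birth year).
records = [
    ("CARSTEN", "MEIER", 1949),
    ("CARSTEN", "MEIER", 1949),
    ("GERD", "BAUER", 1968),
    ("GERT", "BAUER", 1968),
    ("ROBERT", "HARTMANN", 1930),
]


def candidate_pairs(records, block_field):
    """Blocking: only pair records that agree exactly on block_field."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[block_field]].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def binary_pattern(a, b):
    """Binary comparison pattern: 1 where two fields agree, 0 otherwise."""
    return tuple(int(x == y) for x, y in zip(a, b))


# Block on the last name (field index 1), then compare each candidate pair.
patterns = {
    (i, j): binary_pattern(records[i], records[j])
    for i, j in candidate_pairs(records, block_field=1)
}

# Number of pairs that would be compared without blocking.
all_pairs = len(list(combinations(range(len(records)), 2)))
```

With these five toy records, blocking on the last name reduces the work from ten pairwise comparisons to two. A string metric would replace the exact equality test in `binary_pattern` with a similarity score between 0 and 1.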
