Statistics, Department of
The R Journal
Date of this Version
12-2010
Document Type
Article
Citation
The R Journal (December 2010) 2(2)
Abstract
Record linkage deals with detecting homonyms and, mainly, synonyms in data. The package RecordLinkage provides means to perform and evaluate different record linkage methods. A stochastic framework is implemented which calculates weights through an EM algorithm. The thresholds required by this model can be determined with tools from extreme value theory. Furthermore, machine learning methods are utilized, including decision trees (rpart), bootstrap aggregating (bagging), AdaBoost (ada), neural nets (nnet) and support vector machines (svm). The generation of record pairs and comparison patterns from single data items is provided as well. Comparison patterns can be chosen to be binary or based on string metrics. In order to reduce computation time and memory usage, blocking can be used. Future development will concentrate on additional and refined methods, performance improvements and input/output facilities needed for real-world application.
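To make the terms "blocking" and "binary comparison pattern" concrete, here is a minimal Python sketch. It is not the RecordLinkage API: the toy records and the helper names `candidate_pairs` and `binary_pattern` are hypothetical, chosen only for illustration. It assumes that blocking means pairing only those records that agree exactly on one chosen field.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (first name, last name, birth year).
records = [
    ("ANNA", "MEYER", 1970),
    ("ANNA", "MEIER", 1970),
    ("PETER", "SCHMIDT", 1982),
    ("PETRA", "SCHMIDT", 1982),
    ("ANNA", "MEYER", 1991),
]


def candidate_pairs(records, block_field):
    """Blocking: pair only records that agree on the field at block_field."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[block_field]].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return sorted(pairs)


def binary_pattern(a, b):
    """Binary comparison pattern: 1 where two records agree on a field, else 0."""
    return tuple(int(x == y) for x, y in zip(a, b))


# Block on birth year (field index 2), then compare each candidate pair.
pairs = candidate_pairs(records, block_field=2)
patterns = {(i, j): binary_pattern(records[i], records[j]) for i, j in pairs}

for pair, pattern in patterns.items():
    print(pair, pattern)
```

Without blocking, the five records would yield ten pairs; blocking on birth year leaves two. A string-metric variant would replace the exact comparison in `binary_pattern` with a similarity score between 0 and 1.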
Included in
Numerical Analysis and Scientific Computing Commons, Programming Languages and Compilers Commons
Comments
Copyright 2010, The R Foundation. Open access material. License: CC BY 3.0 Unported