Benchmarking the Performance of Machine Learning Algorithms for Record Linkage at Different Heterogeneity Rates in a New Setting
Abstract
Record linkage identifies and links records that refer to the same entity within or across databases when a unique identifier is absent. As data volumes grow every day, machine learning has become an effective way to integrate heterogeneous data from multiple sources into more comprehensive datasets. Because it is challenging to build a high-quality labeled dataset for training good models, this research investigates which machine learning models work best, and under which conditions, when models trained in one setting are applied to a new setting. In this paper, we compare the performance of three machine learning models (random forest, linear SVM, and radial SVM) from an open-source hybrid record linkage system, each trained in a different setting, at heterogeneity rates from 0% to 60%. The record linkage (RL) heterogeneity generator introduces name errors, date errors, missing data errors, and record-level heterogeneities into the data. The models were trained on a subset of hospital record data containing nearly 10,000 pairs. We test how robust these models are on a new voter registration dataset. Model performance was evaluated using F1 score, recall, and the percentage of pairs that needed manual review. The radial and linear SVM models transferred better to the new setting than the random forest model across all heterogeneity rates. The linear SVM outperformed the radial SVM by 4% on average in the percentage of pairs that needed manual review. However, the radial SVM achieved significantly better recall than the linear SVM (80% to 48% compared with 59% to 29%) across heterogeneity rates from 0% to 60%. Overall, the radial SVM performed best in our experiments.
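The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the abstract: each candidate pair receives a match score between 0 and 1, pairs at or above a hypothetical `upper` threshold are linked automatically, pairs at or below a hypothetical `lower` threshold are rejected automatically, and pairs in between go to manual review. The function name `evaluate_linkage` and the threshold values are illustrative only, and the thesis may define these measures differently.

```python
from typing import Dict, Sequence


def evaluate_linkage(
    scores: Sequence[float],
    labels: Sequence[int],
    lower: float = 0.3,
    upper: float = 0.7,
) -> Dict[str, float]:
    """Compute recall, F1, and the manual-review share for scored pairs.

    Assumption (not from the source): score >= upper is an automatic match,
    score <= lower is an automatic non-match, and anything in between is
    sent to manual review. Recall and F1 count automatic matches only, so a
    true match left in the review band counts as a miss.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    tp = fp = fn = review = 0
    for score, label in zip(scores, labels):
        if score >= upper:
            # Automatically linked pair: correct if it is a true match.
            if label == 1:
                tp += 1
            else:
                fp += 1
        else:
            if score > lower:
                # Uncertain pair: route to a human reviewer.
                review += 1
            if label == 1:
                # True match that was not linked automatically.
                fn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    review_pct = 100.0 * review / len(scores) if len(scores) else 0.0
    return {"recall": recall, "f1": f1, "manual_review_pct": review_pct}


if __name__ == "__main__":
    # Toy scores and ground-truth labels, invented for illustration.
    toy_scores = [0.9, 0.8, 0.5, 0.2, 0.95, 0.1]
    toy_labels = [1, 1, 1, 0, 0, 0]
    print(evaluate_linkage(toy_scores, toy_labels))
```

Under this setup, widening the band between `lower` and `upper` sends more pairs to manual review and leaves fewer automatic decisions, which is one way a model can trade reviewer workload against recall.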
Citation
Sivakumar, Hariharan (2022). Benchmarking the Performance of Machine Learning Algorithms for Record Linkage at Different Heterogeneity Rates in a New Setting. Undergraduate Research Scholars Program. Available electronically from https://hdl.handle.net/1969.1/196521.