Package 'multilink'

Title: Multifile Record Linkage and Duplicate Detection
Description: Implementation of the methodology of Aleshin-Guendel & Sadinle (2022) <doi:10.1080/01621459.2021.2013242>. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.
Authors: Serge Aleshin-Guendel [aut, cre]
Maintainer: Serge Aleshin-Guendel <[email protected]>
License: GPL-3
Version: 0.1.1
Built: 2024-11-08 06:14:44 UTC
Source: https://github.com/aleshing/multilink

Help Index


Create Comparison Data

Description

Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.

Usage

create_comparison_data(
  records,
  types,
  breaks,
  file_sizes,
  duplicates,
  verbose = TRUE
)

Arguments

records

A data.frame containing the records to be linked, where each column of records is a field to be compared. If there are multiple files, records should be obtained by stacking the files on top of each other so that records[1:file_sizes[1], ] contains the records for file 1, records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ] contains the records for file 2, and so on. Missing values should be coded as NA.

types

A character vector, indicating the comparison to be used for each field (i.e. each column of records). The options are: "bi" for binary comparisons, "nu" for numeric comparisons (absolute difference), "lv" for string comparisons (normalized Levenshtein distance), "lv_sep" for string comparisons (normalized Levenshtein distance) where each string may contain multiple spellings separated by the "|" character. We assume that fields using options "bi", "lv", and "lv_sep" are of class character, and fields using the "nu" option are of class numeric. For fields using the "lv_sep" option, for each record pair the normalized Levenshtein distance is computed between each possible spelling, and the minimum normalized Levenshtein distance between spellings is then used as the comparison for that record pair.

breaks

A list, the same length as types, indicating the break points used to compute disagreement levels for each fields' comparisons. If types[f]="bi", breaks[[f]] is ignored (and thus can be set to NA). See Details for more information on specifying this argument.

file_sizes

A numeric vector indicating the size of each file.

duplicates

A numeric vector indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.

verbose

A logical indicator of whether progress messages should be print (default TRUE).

Details

The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the types and breaks arguments. The types argument specifies how each field should be compared, and the breaks argument specifies how to discretize these comparisons.

Currently, the types argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance. Please contact the package maintainer if you need a new type of comparison to be supported.

The breaks argument should be a list, with with one element for each field. If a field is being compared with a binary comparison, i.e. types[f]="bi", then the corresponding element of breaks should be NA, i.e. breaks[[f]]=NA. If a field is being compared with a numeric or string comparison, then the corresponding element of breaks should be a vector of cut points used to discretize the comparisons. To give more detail, suppose you pass in cut points breaks[[f]]=c(cut_1, ...,cut_L). These cut points discretize the range of the comparisons into L+1 intervals: I0=(,cut1],I1=(cut1,cut2],...,IL=(cutL,]I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, \infty]. The raw comparisons, which lie in [0,)[0,\infty) for numeric comparisons and [0,1][0,1] for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval I0I_0 corresponds to the lowest level of disagreement for a comparison, while the interval ILI_L corresponds to the highest level of disagreement for a comparison.

Value

a list containing:

record_pairs

A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.

comparisons

A logical matrix, where each row contains the comparisons for the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.

K

The number of files, assumed to be of class numeric.

file_sizes

A numeric vector of length K, indicating the size of each file.

duplicates

A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.

field_levels

A numeric vector indicating the number of disagreement levels for each field.

file_labels

An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.

fp_matrix

An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].

rp_to_fp

A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.

ab

An integer vector, of length ncol(comparisons) * K * (K + 1) / 2 that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.

file_sizes_not_included

A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.

ab_not_included

A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.

labels

NA. This element is not NA when reduce_comparison_data is used.

pairs_to_keep

NA. This element is not NA when reduce_comparison_data is used.

cc

0. This element is non-zero when reduce_comparison_data is used.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

## Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

Duplicate Dataset

Description

A dataset containing 867 simulated records from 3 files with no duplicate records in each file.

Usage

dup_data

Format

A list with three elements:

records

A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

data(dup_data)

# There are 500 entities represented in the records
length(unique(dup_data$IDs))

Small Duplicate Dataset

Description

A dataset containing 96 simulated records from 3 files with no duplicate records in each file, subset from dup_data.

Usage

dup_data_small

Format

A list with three elements:

records

A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

data(dup_data_small)

# There are 96 entities represented in the records
length(unique(dup_data_small$IDs))

Find the Bayes Estimate of a Partition

Description

Find the (approximate) Bayes estimate of a partition based on MCMC samples of the partition and a specified loss function.

Usage

find_bayes_estimate(
  partitions,
  burn_in,
  L_FNM = 1,
  L_FM1 = 1,
  L_FM2 = 2,
  L_A = Inf,
  max_cc_size = nrow(partitions),
  verbose = TRUE
)

Arguments

partitions

Posterior samples of the partition, where each column is one sample and the partition is represented as an integer vector of arbitrary labels, as produced by the output of a call to gibbs_sampler.

burn_in

The number of samples to discard for burn in.

L_FNM

Positive loss for a false non-match. Default is 1.

L_FM1

Positive loss for a type 1 false match. Default is 1.

L_FM2

Positive loss for a type 2 false match. Default is 2.

L_A

Positive loss for abstaining from making a decision for a record. Default is Inf, i.e. decisions are made for all records.

max_cc_size

The maximum allowable connected component size over which the posterior expected loss is minimized. Default is nrow(partitions), i.e. no approximation is used. When is.infinite(L_A), we recommend setting this argument to 50, then increasing based on a computational budget. When !is.infinite(L_A), we recommend setting this argument to 10-12, then increasing based on a computational budget (although an increase of 1 in this argument can in the worst case lead to a doubling in computation time).

verbose

A logical indicator of whether progress messages should be print (default TRUE).

Value

A vector, the same length of a column of partitions containing the (approximate) Bayes estimate of the partition. If !is.infinite(L_A) the output may be a partial estimate. A positive number l in index i indicates that record i is in the same cluster as every other record j with l in index j. A value of -1 in index i indicates that the Bayes estimate abstained from making a decision for record i.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
 alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
 dup_count_prior_family = NA, dup_count_prior_pars = NA,
 n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
 Z_init = Z_init, seed = 42)

# Find the full Bayes estimate

full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)


# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
 seed = 42)

# Find the full Bayes estimate

full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

Gibbs Sampler for Posterior Inference

Description

Run a Gibbs sampler to explore the posterior distribution of partitions of records.

Usage

gibbs_sampler(
  comparison_list,
  prior_list,
  n_iter = 2000,
  Z_init = 1:sum(comparison_list$file_sizes),
  seed = 70,
  single_likelihood = FALSE,
  chaperones_info = NA,
  verbose = TRUE
)

Arguments

comparison_list

The output from a call to create_comparison_data or reduce_comparison_data.

prior_list

The output from a call to specify_prior.

n_iter

The number of iterations of the Gibbs sampler to run.

Z_init

Initialization of the partition of records, represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes). The default initialization places each record in its own cluster. See initialize_partition for an alternative initialization when there are no duplicates in each file.

seed

The seed to use while running the Gibbs sampler.

single_likelihood

A logical indicator of whether to use a single likelihood for comparisons for all file pairs, or whether to use a separate likelihood for comparisons for each file pair. When single_likelihood=TRUE, a single likelihood is used, and the prior hyperparameters for m and u from the first file pair are used. We do not recommend using a single likelihood in general.

chaperones_info

If chaperones_info is set to NA, then Gibbs updates to the partition are used during the Gibbs sampler, as described in Aleshin-Guendel & Sadinle (2022). Else, Chaperones updates, as described in Miller et al. (2015) and Betancourt et al. (2016), are used and chaperones_info should be a list with five elements controlling Chaperones updates to the partition during the Gibbs sampler: chap_type, num_chap_iter, nonuniform_chap_type, extra_gibbs, num_restrict. chap_type is 0 if using a uniform Chaperones distribution, and 1 if using a nonuniform Chaperones distribution. num_chap_iter is the number of Chaperones updates to the partition that are made during each iteration of the Gibbs sampler. When using a nonuniform Chaperones distribution, nonuniform_chap_type is 0 if using the exact version, or 1 if using the partial version. extra_gibbs is a logical indicator of whether a Gibbs update to the partition should be done after the Chaperones updates, at each iteration of the Gibbs sampler. num_restrict is the number of restricted Gibbs steps to take during each Chaperones update to the partition.

verbose

A logical indicator of whether progress messages should be print (default TRUE).

Details

Given the prior specified using specify_prior, this function runs a Gibbs sampler to explore the posterior distribution of partitions of records, conditional on the comparison data created using create_comparison_data or reduce_comparison_data.

Value

a list containing:

m

Posterior samples of the m parameters. Each column is one sample.

u

Posterior samples of the u parameters. Each column is one sample.

partitions

Posterior samples of the partition. Each column is one sample. Note that the partition is represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes).

contingency_tables

Posterior samples of the overlap table. Each column is one sample. This incorporates counts of records determined not to be candidate matches to any other records using reduce_comparison_data.

cluster_sizes

Posterior samples of the size of each cluster (associated with an arbitrary label from 1 to sum(comparison_list$file_sizes)). Each column is one sample.

sampling_time

The time in seconds it took to run the sampler.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015). Microclustering: When the cluster sizes grow sublinearly with the size of the data set. NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series. [arXiv]

Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016). Flexible Models for Microclustering with Application to Entity Resolution. Advances in neural information processing systems. [Published] [arXiv]

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
 alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
 dup_count_prior_family = NA, dup_count_prior_pars = NA,
 n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
{
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
 Z_init = Z_init, seed = 42)
}

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)

# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
 seed = 42)
}

Initialize the Partition

Description

Generate an initialization for the partition in the case when it is assumed there are no duplicates in all files (so that the partition is a matching).

Usage

initialize_partition(comparison_list, pairs_to_keep, seed = NA)

Arguments

comparison_list

the output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the initialization, if reduce_comparison_data is used to the reduce the number of record pairs that are candidate matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.

pairs_to_keep

A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs are potential matches in the initialization.

seed

The seed to use to generate the initialization.

Details

When it is assumed that there are no duplicates in all files, and reduce_comparison_data is not used to reduce the number of potential matches, the Gibbs sampler used for posterior inference may experience slow mixing when using an initialization for the partition where each record is in its own cluster (the default option for the Gibbs sampler). The purpose of this function is to provide an alternative initialization scheme.

To use this initialization scheme, the user passes in a logical vector that indicates which record pairs are potential matches according to an indexing method (as in reduce_comparison_data). Note that this indexing is only used to generate the initialization, it is not used for inference. The initialization scheme first finds the transitive closure of the potential matches, which partitions the records into blocks. Within each block of records, the scheme randomly selects a record from each file, and these selected records are then placed in the same cluster for the partition initialization. All other records are placed in their own clusters.

Value

an integer vector of arbitrary labels of length sum(comparison_list$file_sizes), giving an initialization for the partition.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

# Find initialization for the matching
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

No Duplicate Dataset

Description

A dataset containing 730 simulated records from 3 files with no duplicate records in each file.

Usage

no_dup_data

Format

A list with three elements:

records

A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]

Examples

data(no_dup_data)

# There are 500 entities represented in the records
length(unique(no_dup_data$IDs))

Small No Duplicate Dataset

Description

A dataset containing 71 simulated records from 3 files with no duplicate records in each file, subset from no_dup_data.

Usage

no_dup_data_small

Format

A list with three elements:

records

A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).

Source

Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]

Examples

data(no_dup_data_small)

# There are 71 entities represented in the records
length(unique(no_dup_data_small$IDs))

Reduce Comparison Data Size

Description

Use indexing to reduce the number of record pairs that are potential matches.

Usage

reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)

Arguments

comparison_list

The output of a call to create_comparison_data.

pairs_to_keep

A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs should be kept as potential matches. These potential matches do not have to be transitive (see the argument cc).

cc

A numeric indicator of whether to find the transitive closure of pairs_to_keep, and use these potential matches instead of just those from pairs_to_keep. cc should be 1 if the transitive closure is being used, and cc should be 0 if the transitive closure is not being used. We recommend setting cc to 1.

Details

When using comparison-based record linkage methods, scalability is a concern, as the number of record pairs is quadratic in the number of records. In order to address these concerns, it's common to declare certain record pairs to not be potential matches a priori, using indexing methods. The user is free to index using any method they like, as long as they can produce a logical vector that indicates which record pairs are potential matches according to their indexing method. We recommend, if the user chosen indexing method does not output potential matches that are transitive, to set the cc argument to 1. By transitive we mean, for any three records ii, jj, and kk, if ii and jj are potential matches, and jj and kk are potential matches, then ii and kk are potential matches. Non-transitive indexing schemes can lead to poor mixing of the Gibbs sampler used for posterior inference, and suggests that the indexing method used may have been too stringent.

If indexing is used, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels for information on how to go between the original labeling and the new labeling.

If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they're used to inform the distribution of comparisons for non-matches.

Value

a list containing:

record_pairs

A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels).

comparisons

A logical matrix, where each row contains the comparisons between the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.

K

The number of files, assumed to be of class numeric.

file_sizes

A numeric vector of length K, indicating the size of each file. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes now represents the sizes of each file after removing such records.

duplicates

A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates.

field_levels

A numeric vector indicating the number of disagreement levels for each field.

file_labels

An integer vector of length sum(file_sizes), where file_labels[i] indicated which file record i is in.

fp_matrix

An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].

rp_to_fp

A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.

ab

An integer vector, of length ncol(comparisons) * K * (K + 1) / 2 that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.

file_sizes_not_included

If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes_not_included indicates, for each file, the number of such records that were removed.

ab_not_included

For record pairs not included according to pairs_to_keep, this is an integer vector, of length ncol(comparisons) * K * (K + 1) / 2 that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.

labels

If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled. labels provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.

pairs_to_keep

A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs were kept as potential matches. This may not be the same as the input pairs_to_keep if cc was set to 1.

cc

A numeric indicator of whether the connected components of the potential matches are closed under transitivity.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

Relabel the Bayes Estimate of a Partition

Description

Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.

Usage

relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)

Arguments

reduced_comparison_list

The output from a call to reduce_comparison_data.

bayes_estimate

The output from a call to find_bayes_estimate.

Details

When the function reduce_comparison_data is used to reduce the number of record pairs that are potential matches, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels in reduce_comparison_data for information on how to go between the original labeling and the new labeling. The purpose of this function is to relabel the output of find_bayes_estimate when the function reduce_comparison_data is used, so that the user doesn't have to do this relabeling themselves.

Value

A data.frame, with as many rows as sum(reduced_comparison_list$file_sizes + reduced_comparison_list$file_sizes_not_included), i.e. the number of records originally input to create_comparison_data, before indexing occurred. This data.frame has two columns, "original_labels" and "link_id". Given row i of records originally input to create_comparison_data, the linkage id according to bayes_estimate is given by the ith row of the link_id column. See the documentation for find_bayes_estimate for information on how to interpret this linkage id.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)

# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
 seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
 full_estimate)

partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
 partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
 full_estimate_id = full_estimate_relabel$link_id,
 partial_estimate_id = partial_estimate_relabel$link_id)
}

Specify the Prior Distributions

Description

Specify the prior distributions for the mm and uu parameters of the models for comparison data among matches and non-matches, and the partition.

Usage

specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)

Arguments

comparison_list

the output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the prior, if reduce_comparison_data is used to the reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.

mus, nus

The hyperparameters of the Dirichlet priors for the mm and uu parameters for the comparisons among matches and non-matches, respectively. These are positive numeric vectors which have length equal to the number of columns of comparison_list$comparisons times the number of file pairs (comparison_list$K * (comparison_list$K + 1) / 2). If set to NA, flat priors are used. We recommend using flat priors for mm and uu.

flat

A numeric indicator of whether a flat prior for partitions should be used. flat should be 1 if a flat prior is used, and flat should be 0 if a structured prior is used. If a flat prior is used, the remaining arguments should be set to NA. Otherwise, the remaining arguments should be specified. We do not recommend using a flat prior for partitions in general.

alphas

The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K - 1. The indexing of these hyperparameters is based on the the comparison_list$K-bit binary representation of the inclusion patterns of the overlap table. To give a few examples, suppose comparison_list$K is 3. 1 in 3-bit binary is 001, so alphas[1] is the hyperparameter for the 001 cell of the overlap table, representing clusters containing only records from the third file. 2 in 3-bit binary is 010, so alphas[2] is the hyperparameter for the 010 cell of the overlap table, representing clusters containing only records from the second file. 3 in 3-bit binary is 011, so alphas[3] is the hyperparameter for the 011 cell of the overlap table, representing clusters containing only records from the second and third files. If set to NA, the hyperparameters will all be set to 1.

dup_upper_bound

A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file. If set to NA, the upper bound for file k will be set to 1 if no duplicates are allowed for that file, or comparison_list$file_sizes[k] if duplicates are allowed for that file.

dup_count_prior_family

A character vector indicating the prior distribution family used for the number of duplicates in each cluster, for each file. Currently the only option is "Poisson" for a Poisson prior, truncated to lie between 1 and dup_upper_bound[k]. The mean parameter of the Poisson distribution is specified using the dup_count_prior_pars argument. If set to NA, a Poisson prior with mean 1 will be used.

dup_count_prior_pars

A list containing the parameters for the prior distribution for the number of duplicates in each cluster, for each file. For file k, when dup_count_prior_family[k]="Poisson", dup_count_prior_pars[[k]] is a positive constant representing the mean of the Poisson prior.

n_prior_family

A character indicating the prior distribution family used for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently the there are two options: "uniform" for a uniform prior for n, i.e. p(n)1p(n) \propto 1, and "scale" for a scale prior for n, i.e. p(n)1/np(n) \propto 1/n. If set to NA, a uniform prior will be used.

n_prior_pars

Currently set to NA. When more prior distribution families for n are implemented, this will be a vector of parameters for those priors.

Details

The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to the reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.

For the hyperparameters of the Dirichlet priors for the mm and uu parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.

For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structure prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:

  • Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.

  • Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.

  • For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.

Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.

Value

a list containing:

mus

The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.

nus

The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.

flat

A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.

no_dups

A numeric indicator of whether no duplicates are allowed in all of the files.

alphas

The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.

alpha_0

The sum of alphas.

dup_upper_bound

A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.

log_dup_count_prior

A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.

log_n_prior

A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.

nus_specified

The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
 alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
 dup_count_prior_family = NA, dup_count_prior_pars = NA,
 n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)