Synthesising Data from Marginals


Data is synthesised by sampling from a copula (a multivariate cumulative distribution function with uniform marginals), using the simstudy package.
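The idea behind copula sampling can be illustrated in base R. The following is a minimal sketch of a Gaussian copula, not the code that RESIDE or simstudy actually runs; the variable names, the target correlation of 0.6, and the chosen marginals (a normal age and an exponential length of stay) are assumptions made purely for illustration.

```r
set.seed(42)
n <- 5000

# Hypothetical target correlation between two variables
rho <- 0.6
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# 1. Draw correlated standard normals using the Cholesky factor of sigma
z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(sigma)

# 2. Map each column to the unit interval with the normal CDF
u <- pnorm(z)

# 3. Apply the inverse CDF (quantile function) of each marginal
age <- qnorm(u[, 1], mean = 50, sd = 10)
stay <- qexp(u[, 2], rate = 0.2)

# The rank-order correlation is close to, but not exactly, rho
spearman <- cor(age, stay, method = "spearman")
```

Each variable keeps its own marginal distribution, while the dependence between them comes entirely from the correlated normals drawn in step 1.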

Without Correlations

Data can be synthesised from marginal distributions using the synthesise_data() function:

library(RESIDE)
marginals <- import_marginal_distributions()
simulated_data <- synthesise_data(marginals)

With Correlations

User-specified correlations can be added to the synthesised data by supplying a correlation matrix. An empty correlation matrix can be generated using the export_empty_cor_matrix() function, which takes the marginals imported with import_marginal_distributions() and a folder path, in that order:

library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals, folder_path = tempdir())

The exported CSV file will be a symmetric table which looks like:

.cor_matrix <- utils::read.csv(file.path(tempdir(), "correlation_matrix.csv"))
.cor_matrix <- tibble::column_to_rownames(.cor_matrix, names(.cor_matrix)[1])
DT::datatable(
  .cor_matrix,
  options = list(
    pageLength = 10, scrollX = "400px"
  )
)

Correlations should then be added to the CSV file without modifying the column or row names. The values should be rank-order correlations. Categorical variables are represented as dummy variables, named in the format variable name, underscore, category name (e.g. SEX_F). Note that the correlation matrix must be symmetrical and positive semi-definite.
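As an illustration of the kind of values expected, rank-order (Spearman) correlations can be computed in base R with cor(). The sketch below uses made-up data and hypothetical variable names (AGE, BMI, SEX_F, SEX_M); it is not part of the RESIDE workflow, and the variables and dummy columns that actually appear are determined by the exported CSV file.

```r
set.seed(1)
n <- 200

# Made-up example data; SEX is categorical with categories F and M
age <- rnorm(n, mean = 50, sd = 10)
bmi <- 20 + 0.1 * age + rnorm(n, sd = 3)
sex <- sample(c("F", "M"), n, replace = TRUE)

# Dummy variables named <variable>_<category>
example_data <- data.frame(
  AGE = age,
  BMI = bmi,
  SEX_F = as.integer(sex == "F"),
  SEX_M = as.integer(sex == "M")
)

# Spearman (rank-order) correlation matrix: symmetric with a unit diagonal
example_cor <- cor(example_data, method = "spearman")
round(example_cor, 2)
```

Because the matrix is symmetric, each off-diagonal value appears twice, once above and once below the diagonal, and both cells of the CSV file need the same value.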

Once the correlations have been added to the CSV file, they can be imported using the import_cor_matrix() function:

library(RESIDE)
correlation_matrix <- import_cor_matrix()

By default, the correlation matrix is read from the current working directory using the exported filename (correlation_matrix.csv). To read a different file, pass a relative or absolute path via the file_path parameter of the import_cor_matrix() function.

The import_cor_matrix() function will produce an error if the matrix is not both symmetrical and positive semi-definite, or if the file does not exist.
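The two matrix conditions can be checked ahead of import with base R. This is a sketch of one way to test them, not the check that import_cor_matrix() itself performs; the helper name is_valid_cor_matrix and the tolerance value are assumptions.

```r
# Hypothetical helper: TRUE if m is symmetric and positive semi-definite
is_valid_cor_matrix <- function(m, tol = 1e-8) {
  m <- as.matrix(m)
  if (!is.numeric(m) || nrow(m) != ncol(m)) {
    return(FALSE)
  }
  if (!isSymmetric(unname(m), tol = tol)) {
    return(FALSE)
  }
  # Positive semi-definite: no eigenvalue below zero (within tolerance)
  eigenvalues <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(eigenvalues >= -tol)
}

valid <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
not_symmetric <- matrix(c(1, 0.5, 0.2, 1), nrow = 2)
# Symmetric, but these three pairwise correlations cannot hold together
not_psd <- matrix(
  c(1, 0.9, -0.9,
    0.9, 1, 0.9,
    -0.9, 0.9, 1),
  nrow = 3
)

is_valid_cor_matrix(valid)
is_valid_cor_matrix(not_symmetric)
is_valid_cor_matrix(not_psd)
```

The third matrix shows why symmetry alone is not enough: each entry is a plausible correlation on its own, but the combination is mutually inconsistent, so the matrix has a negative eigenvalue.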

With a correlation matrix, data can now be synthesised with the user-specified correlations by passing the matrix imported by import_cor_matrix() to the synthesise_data() function:

library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals)
# Add the correlations to correlation_matrix.csv before importing it
correlation_matrix <- import_cor_matrix()
simulated_data <- synthesise_data(
  marginals,
  correlation_matrix
)

NB: It is not possible to fully preserve all the marginal distributions when correlations are specified. This is a known limitation and is not likely to change.




RESIDE documentation built on Oct. 18, 2024, 1:07 a.m.