In caseybreen/wcensoc:

[^updated]: Last updated: r format(Sys.time(), '%d %B, %Y')

\vspace{-10truemm}

#knitr::opts_chunk$set(include=FALSE)

library(tidyverse)
library(data.table)
library(kableExtra)

Description

These files allow researchers to identify sibships in CenSoc-Numident data.

The CenSoc-Numident dataset links Social Security Numident records (the Berkeley Unified Numident Mortality Database) with the 1940 Census. Sibships (sibling groups) in the CenSoc-Numident are located by finding individuals with shared parent names as recorded in Social Security Numident records. We use two methods to match siblings in Numident records:

1) The exact match method identifies siblings only with exactly identical parent names (after names have undergone cleaning and standardization). This is the most stringent match method.

2) The flexible match method permits parents' names to be slightly different, within a threshold defined by Jaro–Winkler string distance, in addition to exact matches. This allows siblings to be matched even in cases of minor misspellings, mistranscriptions, or spelling variations in parents' names among sibships (e.g., mother's maiden name recorded as "Brannum" and "Branum"). This increases the number of siblings found, but has higher potential to falsely match unrelated individuals. Most (but not all) individuals and sibling connections identified using the exact match method are also identified using this method.

Sibling groups are identified among individuals in the CenSoc-Numident who died at age 65+ in the years 1988-2005. They may be of any gender composition and contain from 2 to 8 siblings each. An overview of the resulting sibships created by each method is presented below:

# Read numident data
numident_sibs_exact <- fread("/data/censoc/censoc_data_releases/siblings/siblings_v1/censoc_numident_siblings_v1/censoc_numident_sibs_exact_match_v1.csv")
numident_sibs_flex <- fread("/data/censoc/censoc_data_releases/siblings/siblings_v1/censoc_numident_siblings_v1/censoc_numident_sibs_flexible_match_v1.csv")

# get basic metrics to make table below

nrow(numident_sibs_exact) # total size
numident_sibs_exact %>% group_by(sib_group_id_exact) %>% filter(row_number() == 1) %>% nrow() # number of groups
numident_sibs_exact %>% group_by(sib_group_id_exact) %>% mutate(n_sibs=n()) %>% ungroup() %>% 
  group_by(n_sibs) %>% tally() # group sizes


nrow(numident_sibs_flex) # total size
numident_sibs_flex %>% group_by(sib_group_id_flexible) %>% filter(row_number() == 1) %>% nrow() # number of groups
numident_sibs_flex %>% group_by(sib_group_id_flexible) %>% mutate(n_sibs=n()) %>% ungroup() %>% 
  group_by(n_sibs) %>% tally() # group sizes

df <- tibble(
             "Match method" = c("Exact match", "Flexble match"),
             "Number of individuals" = c("908,472", "1,185,055"),
             "Number of sibships" = c("426,952", "551,641"),
             "Mean sibship size" = c("2.13", "2.15")
             ) %>% 
  knitr::kable(format = "pipe")

df %>% kable_styling(full_width = F)

We note that sibships are established using only information from Social Security Numident data. As such, no information from the 1940 Census is used to match siblings, and false linkages between Numident data and the 1940 Census may create erroneous sibships. For detailed documentation of sibling identification methodologies and characteristics of sibships in Social Security Numident data, please see the paper: Methods for Identifying Siblings in Administrative Mortality Data available online at https://censoc.berkeley.edu/documentation/.

\vspace{0truemm}

Usage

Each sibling identifier dataset consist of two columns: a unique individual identifier HISTID, and an identifier for each sibling group. The HISTID of one sibling within each sibship is used as the group identifier.

\vspace{15pt}

# Show first 5 rows the exact match
head(numident_sibs_exact, 5)

\vspace{15pt}

These files must be used in conjunction with the CenSoc-Numident dataset. Users will need to merge the datasets using the unique identifier HISTID, as in the example code below:

\vspace{2truemm}

require(data.table)
# Read CenSoc-Numident 
numident <- data.table::fread("censoc_numident_v3.csv")
# Read CenSoc-Numident siblings IDs
numident_sib_id <- data.table::fread("censoc_numident_sibs_exact_match_v1.csv")
# Attach sibling IDs to CenSoc-Numident, keeping all records in the CenSoc-Numident
numident_with_sibs <- merge(numident, numident_sib_id, by = "HISTID", all.x = TRUE)