
[^updated]: Last updated: `r format(Sys.time(), '%d %B, %Y')`

\captionsetup[table]{labelformat=empty}

| Page| Variable | Label |
|----:|:-----------------------|:---------------------------------------------|
| \hyperlink{page.2}{2} | \hyperlink{page.2}{HISTID} | Historical Unique Identifier |
| \hyperlink{page.3}{3} | \hyperlink{page.3}{byear} | Year of Birth |
| \hyperlink{page.4}{4} | \hyperlink{page.4}{bmonth} | Month of Birth |
| \hyperlink{page.5}{5} | \hyperlink{page.5}{dyear} | Year of Death |
| \hyperlink{page.6}{6} | \hyperlink{page.6}{dmonth} | Month of Death |
| \hyperlink{page.7}{7} | \hyperlink{page.7}{death_age} | Age at Death (Years) |
| \hyperlink{page.8}{8} | \hyperlink{page.8}{sex} | Sex |
| \hyperlink{page.9}{9} | \hyperlink{page.9}{race_first} | Race on First Application |
| \hyperlink{page.10}{10} | \hyperlink{page.10}{race_first_cyear} | First Race: Application Year |
| \hyperlink{page.11}{11} | \hyperlink{page.11}{race_first_cmonth} | First Race: Application Month |
| \hyperlink{page.12}{12} | \hyperlink{page.12}{race_last} | Race on Last Application |
| \hyperlink{page.13}{13} | \hyperlink{page.13}{race_last_cyear} | Last Race: Application Year |
| \hyperlink{page.14}{14} | \hyperlink{page.14}{race_last_cmonth} | Last Race: Application Month |
| \hyperlink{page.15}{15} | \hyperlink{page.15}{bpl} | Place of Birth |
| \hyperlink{page.16}{16} | \hyperlink{page.16}{zip_residence} | ZIP Code of Residence at Time of Death |
| \hyperlink{page.17}{17} | \hyperlink{page.17}{socstate} | State where Social Security Number Issued |
| \hyperlink{page.18}{18} | \hyperlink{page.18}{age_first_application} | Age at First Social Security Application |
| \hyperlink{page.19}{19} | \hyperlink{page.19}{weight} | CenSoc Sample Weight |
| \hyperlink{page.20}{20} | \hyperlink{page.20}{Additional IPUMS variables} | Additional 1940 Census variables, including: pernum, perwt, age, mbpl, fbpl, educd, educ_yrs, empstatd, hispan, incwage, incnonwg, marst, nativity, occ, occscore, ownershp, race, rent, serial, statefip, and urban. |

\vspace{100pt}

Summary: The CenSoc-Numident Version 3 Demo dataset (N = 64,686) links the IPUMS 1940 Census 1% sample to the National Archives' public release of the Social Security Numident file. Records were linked using a conservative variant of the ABE method developed by Abramitzky, Boustan, and Eriksson (\textcolor{blue}{2012}, \textcolor{blue}{2014}, \textcolor{blue}{2017}).

Note that this demo dataset is not suited to high-resolution mortality research; we recommend using it for exploratory and demonstrative purposes. To conduct research with CenSoc data, researchers may download the full CenSoc-Numident file from the CenSoc website, obtain an extract of the full-count 1940 Census from IPUMS-USA, and merge the two files on the unique individual-level identifier HISTID. Please adhere to CenSoc and IPUMS citation guidelines when using this file.

\newpage

\huge HISTID \normalsize \vspace{12pt}

Label: Historical Unique Identifier

Description: HISTID is a unique individual-level identifier. It can be used to merge the CenSoc-Numident file with the 1940 Full-Count Census from IPUMS.
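
The merge described above can be sketched with a `dplyr` join. This is a minimal illustration rather than an official CenSoc workflow: the two small tables and every value in them (including the short ID strings) are made up, and the census columns are placeholders for whatever variables an IPUMS extract contains.

```r
library(dplyr)

# Toy stand-ins for the two files; all values are invented for illustration.
censoc <- tibble(
  HISTID = c("A1", "B2", "C3"),
  byear  = c(1910, 1915, 1920),
  dyear  = c(1990, 1995, 2001)
)

census_1940 <- tibble(
  HISTID  = c("A1", "B2", "D4"),
  EDUCD   = c(30, 60, 26),
  INCWAGE = c(800, 1200, 0)
)

# Keep only persons present in both files, matched on HISTID.
linked <- inner_join(censoc, census_1940, by = "HISTID")

# HISTID should still identify exactly one row per person after the merge.
stopifnot(!anyDuplicated(linked$HISTID))
```

An inner join keeps only the linked records; a left join from the CenSoc file would instead retain every CenSoc record and fill unmatched census variables with `NA`.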

\newpage

\huge byear \normalsize \vspace{12pt}

Label: Birth Year

Description: byear reports a person's year of birth, as recorded in the Numident death records.

## Load packages
library(tidyverse)
library(data.table)
library(kableExtra)

## Read in the CenSoc-Numident demo (version 3) data file
censoc_numident_v3 <- read_csv("/data/censoc/censoc_data_releases/censoc_numident_demo/censoc_numident_demo_v3/censoc_numident_demo_v3.csv")

byear_plot <- censoc_numident_v3 %>%
  group_by(byear) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = byear, y = n)) +
  geom_line() +
  geom_point() +
  theme_minimal(base_size = 15) +
  ggtitle("Year of Birth") +
  theme(legend.position = "bottom") +
  labs(x = "Year",
       y = "Count") +
  # scale_y_continuous(labels = scales::comma, limits = c(0, 5000)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 5))

\vspace{75pt}

byear_plot

\newpage \huge bmonth \normalsize \vspace{12pt}

Label: Birth Month

Description: bmonth reports a person's month of birth, as recorded in the Numident death records.

## run in the console and copy and paste into documentation
bmonth_tabulated <- censoc_numident_v3 %>%
  group_by(bmonth) %>%
  tally() %>%
  mutate(freq = signif(n * 100 / sum(n), 2)) %>%
  # Look labels up by month number so they stay aligned even if a month is absent
  mutate(label = month.name[bmonth]) %>%
  select(bmonth, label, n, `freq %` = freq) %>%
  knitr::kable(format = "pipe")

\vspace{30pt}

bmonth_tabulated

\newpage

\huge dyear \normalsize \vspace{12pt}

Label: Death Year

Description: dyear reports a person's year of death, as recorded in the Numident death records.

dyear_plot <- censoc_numident_v3 %>%
  group_by(dyear) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = dyear, y = n)) +
  geom_line() +
  geom_point() +
  theme_minimal(base_size = 15) +
  ggtitle("Year of Death") +
  theme(legend.position = "bottom") +
  labs(x = "Year",
       y = "Count") +
  # Fix the y-axis at 0 to 7,000; without explicit limits it starts near 3,000 deaths per year
  scale_y_continuous(labels = scales::comma, limits = c(0, 7000)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 5))

\vspace{75pt}

dyear_plot

\newpage

\huge dmonth \normalsize \vspace{12pt}

Label: Death Month

Description: dmonth reports a person's month of death, as recorded in the Numident death records.

dmonth_tabulated <- censoc_numident_v3 %>%
  group_by(dmonth) %>%
  tally() %>%
  mutate(freq = signif(n * 100 / sum(n), 2)) %>%
  # Look labels up by month number so they stay aligned even if a month is absent
  mutate(label = month.name[dmonth]) %>%
  select(dmonth, label, n, `freq %` = freq) %>%
  knitr::kable(format = "pipe")

\vspace{30pt}

dmonth_tabulated

\newpage

\huge death_age \normalsize \vspace{12pt}

Label: Age at Death (Years)

Description: death_age reports a person's age at death in years, calculated using the birth and death information recorded in the Numident death records.
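
The exact rule used to construct death_age is not stated here. As a rough sketch only, one plausible calculation from the year and month variables is shown below. It assumes age in completed years and treats the birthday as already reached when the death month equals the birth month, since day of birth and day of death are not used. The records and the column name `death_age_sketch` are made up for illustration.

```r
library(dplyr)

# Invented records for illustration only.
deaths <- tibble(
  byear  = c(1910, 1910, 1920),
  bmonth = c(6, 6, 1),
  dyear  = c(1990, 1990, 2000),
  dmonth = c(3, 9, 1)
)

# Assumption: completed years of age; subtract one year when the death
# month falls before the birth month (birthday not yet reached that year).
deaths <- deaths %>%
  mutate(death_age_sketch = dyear - byear - as.integer(dmonth < bmonth))
```

Comparing such a sketch against the released death_age variable is one way to check how ties in birth and death month were actually handled.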

death_age_plot <- censoc_numident_v3 %>%
  group_by(death_age) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = death_age, y = n)) +
  geom_line() +
  geom_point() +
  theme_minimal(base_size = 15) +
  ggtitle("Age at Death") +
  theme(legend.position = "bottom") +
  labs(x = "Age at Death",
       y = "Count") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 5))

\vspace{75pt}

death_age_plot

\newpage

\huge sex \normalsize \vspace{12pt}

Label: Sex

Description: sex reports a person's sex, as recorded in the Numident death, application, or claim records.

sex_tabulated <- censoc_numident_v3 %>%
  group_by(sex) %>%
  tally() %>%
  mutate(freq = signif(n * 100 / sum(n), 3)) %>%
  mutate(label = c("Men", "Women")) %>%
  select(sex, label, n, `freq %` = freq) %>%
  knitr::kable(format = "pipe")

\vspace{30pt}

sex_tabulated

\newpage

\huge race_first \normalsize \vspace{12pt}

Label: Race First

Description: race_first reports a person's race, as recorded on their first Social Security application entry.

Note: Before 1980, the race schema in the Social Security application form contained three categories: White, Black, and Other. In 1980, the SSA added three categories: (1) Asian, Asian American, or Pacific Islander, (2) Hispanic, and (3) North American Indian or Alaskan Native. The Other category was also removed.

race_first_tabulated <- censoc_numident_v3 %>%
    group_by(race_first) %>%
    tally() %>%
    mutate(freq = signif(n*100 / sum(n), 3)) %>%
    mutate(label = c("White", "Black", "Other", "Asian", "Hispanic", "North American Native", "Missing")) %>%
    select(race_first, label, n, `freq %` = freq) %>% 
  knitr::kable(format = "pipe")

\vspace{30pt}

race_first_tabulated

\newpage

\huge race_first_cyear \normalsize \vspace{12pt}

Label: First Race: Application Year

Description: race_first_cyear is a numeric variable reporting the year of the application on which a person reported their first race.

\vspace{12pt}

\newpage

\huge race_first_cmonth \normalsize \vspace{12pt}

Label: First Race: Application Month

Description: race_first_cmonth is a numeric variable reporting the month of the application on which a person reported their first race.

\newpage

\huge race_last \normalsize \vspace{12pt}

Label: Race Last

Description: race_last reports a person's race, as recorded on their most recent Social Security application entry.

race_last_tabulated <- censoc_numident_v3 %>%
    group_by(race_last) %>%
    tally() %>%
    mutate(freq = signif(n*100 / sum(n), 3)) %>%
    mutate(label = c("White", "Black", "Other", "Asian", "Hispanic", "North American Native", "Missing")) %>%
    select(race_last, label, n, `freq %` = freq) %>% 
  knitr::kable(format = "pipe")

\vspace{30pt}

race_last_tabulated

\newpage

\huge race_last_cyear \normalsize \vspace{12pt}

Label: Last Race: Application Year

Description: race_last_cyear reports the year of the application on which a person reported their last race.

\newpage

\huge race_last_cmonth \normalsize \vspace{12pt}

Label: Last Race: Application Month

Description: race_last_cmonth is a numeric variable reporting the month of the application on which a person reported their last race.

\newpage

\huge bpl \normalsize \vspace{12pt}

Label: Birthplace

Description: bpl is a numeric variable reporting a person's place of birth, as recorded in the Numident application or claims records. The accompanying bpl_string variable reports the person's place of birth as a character string. The coding schema matches the detailed IPUMS-USA birthplace coding schema.

For a complete list of IPUMS Birthplace codes, please see: \textcolor{blue}{https://usa.ipums.org/usa-action/variables/BPL}

bpl_tabulation <- censoc_numident_v3 %>%
    filter(bpl < 10000 | is.na(bpl)) %>%
    group_by(bpl, bpl_string) %>%
    tally() %>%
    ungroup() %>%
    mutate(freq = round(n*100 / sum(n), 2)) %>%
    select(bpl, bpl_string, n, `freq %` = freq)

rows <- seq_len(nrow(bpl_tabulation) %/% 2)

knitr::kable(list(bpl_tabulation[rows,1:4],
           matrix(numeric(), nrow=0, ncol=1),
           bpl_tabulation[-rows, 1:4]),
      caption = "BPL Tabulation (Native born only)",
      label = "tables", format = "latex", booktabs = TRUE)  %>%
  kableExtra::kable_styling(latex_options = c("HOLD_position"))

\newpage

\huge zip_residence

\normalsize

\vspace{12pt}

Label: ZIP Code of Residence at Time of Death

Description: zip_residence is a 9-character string variable reporting a person's ZIP code of residence at time of death, as recorded in the Numident death records.

\newpage \huge socstate

\normalsize

\vspace{12pt}

Label: State where Social Security Number Issued

Description: socstate is a numeric variable reporting the state in which a person's Social Security card was issued. It is determined by the first three (3) digits of a person's Social Security number, as recorded in Numident death records. The accompanying socstate_string variable reports the state in which a person's Social Security card was issued as a character string. The coding schema matches the detailed IPUMS-USA birthplace coding schema.

The list of codes is also available at: \textcolor{blue}{https://usa.ipums.org/usa-action/variables/BPL}

\vspace{30pt}

socstate_tabulation <- censoc_numident_v3 %>%
    #filter(socstate < 10000 | is.na(socstate)) %>%
    group_by(socstate, socstate_string) %>%
    tally() %>%
    ungroup() %>%
    mutate(freq = round(n*100 / sum(n), 2)) %>%
    select(socstate, socstate_string, n, `freq %` = freq)

rows <- seq_len(nrow(socstate_tabulation) %/% 2)

knitr::kable(list(socstate_tabulation[rows,1:4],
           matrix(numeric(), nrow=0, ncol=1),
           socstate_tabulation[-rows, 1:4]),
      caption = "Tabulation of socstate",
      label = "tables", format = "latex", booktabs = TRUE)  %>%
  kableExtra::kable_styling(latex_options = "HOLD_position")

\newpage

\huge age_first_application

\normalsize

\vspace{12pt}

Label: Age at First Social Security Application

Description: age_first_application reports the age at which a person submitted their first Social Security application.

age_first_app_plot <- censoc_numident_v3 %>%
  group_by(age_first_application) %>%
  filter(age_first_application %in% c(0:110)) %>% 
  summarise(n = n()) %>%
  ggplot(aes(x = age_first_application, y = n)) + 
  geom_point() +
  geom_line() + 
  theme_minimal(15) + 
  theme(legend.position="bottom") +
  labs(title = "Age of First Application",
       x = "Age of First Application", 
       y = "Count") + 
  scale_y_continuous(labels = scales::comma) + 
  scale_x_continuous(breaks = scales::pretty_breaks(n=5))

\vspace{75pt}

age_first_app_plot

\newpage

\huge weight \normalsize

Label: CenSoc Sample Weight [^1]

[^1]: The IPUMS-USA 1940 1% sample also includes a person weight (perwt) that accounts for the 1940 sampling procedure; the 100% complete-count 1940 Census has no such weight. For analysis, we recommend using both sets of weights. A final weight can be constructed by multiplying the two weights together.

Description: weight is a post-stratification person weight that adjusts the sample to National Center for Health Statistics (NCHS) totals for persons (1) dying between 1988 and 2005 and (2) dying between ages 65 and 100. Weights are based on age at death, year of death, sex, race, and place of birth. Please see the \textcolor{blue}{technical documentation on weights} for more information.
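
The recommendation to multiply the CenSoc weight by the IPUMS person weight can be sketched as follows. The rows are made up for illustration, and the names `demo`, `demo_weighted`, `final_weight`, and `mean_age` are hypothetical; only `weight`, `perwt`, and `death_age` correspond to variables in this file.

```r
library(dplyr)

# Invented rows for illustration; real values come from the demo file.
demo <- tibble(
  death_age = c(70, 80, 90, 75),
  weight    = c(20, 30, NA, 50),      # CenSoc weight (NA = no weight assigned)
  perwt     = c(100, 100, 100, 100)   # IPUMS person weight
)

# Drop records without a CenSoc weight, then combine the two weights.
demo_weighted <- demo %>%
  filter(!is.na(weight)) %>%
  mutate(final_weight = weight * perwt)

# Example use: weighted mean age at death.
mean_age <- weighted.mean(demo_weighted$death_age, demo_weighted$final_weight)
```

Records with no CenSoc weight are dropped before weighting here; whether that is appropriate depends on the analysis.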

weights_tabulated <- censoc_numident_v3 %>%
  filter(!is.na(weight)) %>% 
  summarize('Min Weight' = round(min(weight),2), 'Max Weight' = round(max(weight), 2)) %>%
  mutate(id = 1:n()) %>% 
  pivot_longer(-id, names_to = "Label", values_to = "Value") %>% 
  select(Value, Label) %>% 
  add_row(Label = "No Weight Assigned", Value = NA) %>% 
  knitr::kable(format = "markdown")

weights <- censoc_numident_v3 %>% 
  filter(!is.na(weight)) %>% 
  group_by(death_age, dyear) %>% 
  summarize(weight = mean(weight))

## plot average weight as a Lexis surface (age at death by year of death)
weights_lexis <- weights %>% 
  ggplot() +
  geom_raster(aes(x = dyear, y = death_age,
                  fill = weight)) +
  ## Lexis grid
  geom_hline(yintercept = seq(65, 100, 10),
             alpha = 0.2, lty = "dotted") +
  geom_vline(xintercept = seq(1985, 2005, 10),
             alpha = 0.2, lty = "dotted") +
  geom_abline(intercept = seq(-100, 100, 10)-1910,
              alpha = 0.2, lty = "dotted") +
  scale_fill_viridis_c(option = "magma") +
  scale_x_continuous("Year", expand = c(0.02, 0),
                     breaks = seq(1988, 2005, 5)) +
  scale_y_continuous("Age", expand = c(0, 0),
                     breaks = seq(65, 100, 10)) +
  guides(fill = guide_legend(reverse = TRUE)) +
  # coord
  coord_equal() +
  # theme
  theme_void() +
  theme(
    axis.text = element_text(colour = "black"),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 45, hjust = .5), 
    plot.title = element_text(size = 10, vjust = 2),
    legend.text = element_text(size = 10), 
    axis.title=element_text(size = 10,face="bold")
  ) + 
  labs(x = "Year",
       y = "Age",
       title = "Average weight by age at death and year of death")

\vspace{30pt}

weights_tabulated

\vspace{30pt}

weights_lexis

\newpage

\huge IPUMS 1940 Census Variables \normalsize

\vspace{12pt}

The variables below are from the IPUMS-USA 1940 Census 1% sample. We recommend looking at the terrific documentation on the IPUMS-USA website: \textcolor{blue}{https://usa.ipums.org/usa/index.shtml}

| Variable | Label |
|:-------------|:---------------------------------------------|
| pernum | Person number in household |
| perwt | IPUMS person weight[^2] |
| age | Age on April 1st, 1940 |
| mbpl | Mother's place of birth[^3] |
| fbpl | Father's place of birth[^4] |
| educd | Educational attainment (detailed IPUMS codes) |
| educ_yrs | Educational attainment in years (constructed)[^5] |
| empstatd | Employment status (detailed) |
| hispan | Hispanic/Spanish/Latino origin (imputed)[^6] |
| incwage | Wage and salary income in 1939 |
| incnonwg | Had non-wage/salary income over $50 in 1939 |
| marst | Marital status |
| nativity | Foreign birthplace or parentage |
| occ | Occupation |
| occscore | Occupational income score |
| ownershp | Ownership of dwelling (tenure) |
| race | Race[^7] |
| rent | Monthly contract rent |
| serial | Household serial number |
| statefip | State of residence 1940 (FIPS codes) |
| urban | Urban/rural status |

[^2]: The IPUMS perwt weight accounts for the 1940 sampling procedure used to construct the 1% sample, and thus is only available in the 1940 1% sample. For analysis, we recommend using both the IPUMS perwt weight and the CenSoc weight. A final weight can be constructed by multiplying the two weights together.

[^3]: This variable is only available for sample-line persons (a one-in-twenty sample asked additional questions in the 1940 Census) or those living with their mother.

[^4]: This variable is only available for sample-line persons (a one-in-twenty sample asked additional questions in the 1940 Census) or those living with their father.

[^5]: educ_yrs is constructed from the IPUMS educd variable but is not directly available from IPUMS.

[^6]: The 1940 Census did not directly inquire about Hispanic ethnicity or origin. This variable is determined by IPUMS using information such as one's birthplace or a parent's birthplace.

[^7]: The IPUMS race variable reports race as recorded in the 1940 Census. In contrast, the race_first and race_last variables in this dataset contain race as self-reported on Social Security applications.