```{=html}
```r par(mar = c(2.5, 8.5, 4.1, 2.5), cex.main = .7, cex.lab = .6)
Population size estimation is a methodological approach employed across multiple scientific disciplines, which serves as the basis for research, policy formulation, and decision-making processes [cf. @bohning2018capture]. In the field of statistics, particularly official statistics, precise population estimates are essential in order to develop robust economic models, optimize resource allocation, and inform evidence-based policy [cf. @baffour2013modern]. Social scientists utilize advanced population estimation techniques to investigate hard-to-reach populations, such as homeless individuals or illicit drug users in an effort to overcome the inherent limitations of conventional census methodologies. These techniques are crucial for obtaining accurate data on populations that are typically under-represented or difficult to access using traditional sampling methods [cf. @vincent2022estimating]. In ecology and epidemiology, researchers focus on estimating the size of individual species or disease-affected populations within defined geographical regions as part of conservation efforts, ecosystem management, and public health interventions.
Population size estimation can be approached using various methodologies, each with distinct advantages and limitations. Traditional approaches include full enumeration (e.g. censuses) and comprehensive sample surveys, which, while providing detailed data, are often resource-intensive and may result in delayed estimates, particularly for human populations. Alternative methods leverage existing data sources, such as administrative registers or carefully designed small-scale studies in wildlife research, or census coverage surveys [cf. @wolter1986some; @zhang2019note]. Information from these sources is often extracted by applying statistical methods, known as capture-recapture or multiple system estimation, which rely on data from multiple enumerations of the same population [cf. @dunne2024system]. This approach can be implemented using either a single source with repeated observations, two sources, or multiple sources.
In this paper we focus on methods that involve a single data source with multiple enumerations of the same units [cf. @ztpoisson]. In human population studies, such data can be derived from police records, health system databases, or border control logs; in the case of non-human populations, data of this kind can come from veterinary records or specialized field data. These methods are often applied to estimate hard-to-reach or hidden populations, where standard sampling methods may be inappropriate because of prohibitive costs or problems with identifying population members.
While methods for two or more sources are implemented in various
open-source software packages, for instance CARE-4 [@yang2006care4]
(in GAUSS), Rcapture [@baillargeon2007rcapture], marked
[@laake2013marked] or VGAM [@yee2015vgam] (all in R),
single-source capture-recapture (SSCR) methods are either not available
at all or are only partially implemented in existing R packages or
other software. Therefore, the paper attempts to bridge this gap by
introducing the singleRcapture package, which implement
state-of-the-art SSCR methods and offer a user friendly API resembling
existing R functions (e.g., glm
). In the next subsection we
describe existing R packages and other software that could be used
for estimating population size using SSCR methods.
The majority of SSCR methods assume zero-truncated distributions or
their extensions (e.g., inclusion of one-inflation). The
Distributions.jl [@Distributionsjl] (in Julia), StatsModels
[@seabold2010statsmodels] (in Python), countreg [@countreg],
VGAM [@VGAM-main] or distributions3 [@distributions3] (in R)
implement some of those truncated distributions (e.g.
distributions3::ZTPoisson
or countreg::zerotrunc
) and the most
general distributions, such as Generally Altered, Inflated, Truncated
and Deflated, can be found in the VGAM package (e.g.
VGAM::gaitdpoisson
for the Poisson distribution, see @gaitdcount for a
detailed description). However, the estimation of parameters of a given
truncated (and possibly inflated) distribution is just the first step
(as in the case of log-linear models in capture-recapture with two
sources) and, to the best of our knowledge, there is no open-source
software that can be used to estimate population size using on SSCR
methods and includes variance estimators or diagnostics.
Therefore, the purpose of the singleRcapture, an R package, is to bridge this gap by providing scientists and other practitioners with a tool for estimating population size with SSCR methods. We have implemented state-of-the-art methods, as recently described by @bohning2018capture or @bohning2024one and provided their extensions (e.g., inclusion of covariates, different treatment of one-inflation), which will be covered in detail in Section 2. The package implements variance estimation based on various methods, can be used to create custom models and diagnostic plots (e.g. rootograms) with parameters estimated using a modified iteratively reweighted least squares (IRLS) algorithm we have implemented for estimation stability. To the best of our knowledge, no existing open-source package allows the estimation of population size by selecting an appropriate SSCR model, conducting the estimation, and providing informative diagnostics of the results.
The remaining part of the paper is structured as follows. Section
2 contains a brief description of the theoretical
background and information about fitting methods and available methods
of variance estimation. Section 3 introduces the main
functions of the package. Section 4 presents a case study
and contains an assessment of its results, diagnostics and estimates of
specific sub-populations. Section 5 describes classes
and S3methods
implemented in the package. The paper ends with
conclusions and an appendix showing how to use the estimatePopsizeFit
function which is aimed to mimic the glm.fit
or similar functions. In
replication materials we show how to implement a custom model as this
option could be of interest to users wishing to apply any new bootstrap
methods not implemented in the package.
Let $Y_{k}$ represent the number of times the $k$-th unit was observed in a single source (e.g. register). Clearly, we only observe $k:Y_{k}>0$ and do not know how many units have been missed (i.e. $Y_{k}=0$), so the population size, denoted by $N$, needs to be estimated. In general, we assume that the conditional distribution of $Y_{k}$ given a vector of covariates $\bx_{k}$ follows a version of the zero-truncated count data distribution (and its extensions). When we know the parameters of the distribution we can estimate the population size using a Horvitz-Thompson type estimator given by:
$$ \begin{equation} \hat{N}= \sum_{k=1}^{N}\frac{I_{k}}{\mathbb{P}[Y_{k}>0|\bX_{k}]}= \sum_{k=1}^{N_{obs}}\frac{1}{\mathbb{P}[Y_{k}>0|\bX_{k}]}, \end{equation} $$
where $I_{k}:=\1_{\mathbb{N}}(Y_{k})$, $N_{obs}$ is the number of observed units and $\1$ is the indicator function, while the maximum likelihood estimate of $N$ is obtained after substituting regression parameters $\bbeta$ for $\mathbb{P}[Y_{k}>0|\bx_{k}]$ in the above equation.
The basic SSCR assumes independence between counts, which is a rather naive assumption, since the first capture may significantly influence the behavior of a given unit or limit the possibility of subsequent captures (e.g. due to incarceration).
To solve these issues, @godwin2017estimation and @ztoi-oizt-poisson introduced one-inflated distributions, which explicitly model the probability of singletons by giving additional mass $\omega$ to singletons denoted as $\1_{{1}}(y)$ [cf. @bohning2024one]. In other words they considered a new random varialbe $Y^{\ast}$ that corresponds to the data collection process which exhibits one inflation:
$$ \begin{equation} \mathbb{P}\left[Y^{\ast}=y|Y^{\ast}>0\right] = \omega\1_{{1}}(y)+(1-\omega)\mathbb{P}[Y=y|Y>0]. \end{equation} $$
Analytic variance estimation is then performed by computing two parts of the decomposition according to the law of total variance given by:
$$ \begin{equation}\label{eq-law_of_total_variance_decomposition} \text{var}[\hat{N}] = \mathbb{E}\left[\text{var} \left[\hat{N}|I_{1},\dots,I_{n}\right]\right] + \text{var}\left[\mathbb{E}[\hat{N}|I_{1},\dots,I_{n}]\right], \end{equation} $$
where the first part can be estimated using the multivariate $\delta$ method given by:
$$ \begin{equation} \mathbb{E}\left[\text{var} \left[\hat{N}|I_{1},\dots,I_{n}\right]\right] = \left.\left(\frac{\partial(N|I_1,\dots,I_N)}{\partial\bbeta}\right)^\top \text{cov}\left[\hat{\bbeta}\right] \left(\frac{\partial(N|I_1,\dots,I_N)}{\partial\bbeta}\right) \right|_{\bbeta=\hat{\bbeta}}, \end{equation} $$
while the second part of the decomposition above, assuming independence of $I_{k}$'s and after some omitted simplifications, is optimally estimated by:
$$ \begin{equation} \text{var}\left[\mathbb{E}(\hat{N}|I_{1},\dots,I_{n})\right] = \text{var}\left[\sum_{k=1}^{N}\frac{I_{k}}{\mathbb{P}(Y_{k}>0)}\right] \approx\sum_{k=1}^{N_{obs}}\frac{1-\mathbb{P}(Y_{k}>0)}{\mathbb{P}(Y_{k}>0)^{2}}, \end{equation} $$
which serves as the basis for interval estimation. Confidence intervals are usually constructed under the assumption of (asymptotic) normality of $\hat{N}$ or asymptotic normality of $\ln(\hat{N}-N)$ (or log normality of $\hat{N}$). The latter is an attempt to address a common criticism of student type confidence intervals in SSCR, namely a possibly skewed distribution of $\hat{N}$, and results in the $1-\alpha$ confidence interval given by:
$$ \begin{equation} \left(N_{obs}+\frac{\hat{N}-N_{obs}}{\xi},N_{obs} + \left(\hat{N}-N_{obs}\right)\xi\right), \end{equation} $$
where:
$$ \begin{equation} \xi = \exp\left(z\left(1-\frac{\alpha}{2}\right) \sqrt{\ln\left(1+\frac{\widehat{\text{Var}}(\hat{N})}{\left(\hat{N}-N_{obs}\right)^{2}}\right)}\right), \end{equation} $$
and where $z$ is the quantile function of the standard normal distribution. The estimator $\hat{N}$ is best interpreted as being an estimator of the total number of \underline{observable} units in the population, since we have no means of estimating the number of units in the population for which the probability of being included in the data is $0$ [@ztpoisson].
The full list of models implemented in singleRcapture along with corresponding expressions for probability density functions and point estimates can be found in the collective help file for all family functions:
?ztpoisson
For the sake of simplicity, we only list the family functions together with brief descriptions. For more detailed information, please consult the relevant literature.
The current list of these family functions includes:
$$ \hat{N}{\text{Chao}} = N{obs}+ \sum_{k=1}^{\boldsymbol{f}{1}+\boldsymbol{f}{2}} \left(2\exp\left(\bx_{k}\hat{\bbeta}\right)+ 2\exp\left(2\bx_{k}\hat{\bbeta}\right)\right)^{-1}, \tag{Chao's estimator} $$
$$ \hat{N}{\text{Zelt}}=\sum{k=1}^{N_{obs}} \left(1-\exp\left(-2\exp\left(\bx_{k}\hat{\bbeta}\right)\right)\right)^{-1}. \tag{Zelterman's estimator} $$
where $\boldsymbol{f}{1}$ and $\boldsymbol{f}{2}$ denotes number of units observed once and twice.
zt
$^\ast$) and zero-one-truncated (zot
$^\ast$)
Poisson [@zotmodels], geometric, NB type II (NB2) regression, where
the non-truncated distribution is parameterized as:$$ \mathbb{P}[Y=y|\lambda,\alpha] = \frac{\Gamma\left(y+\alpha^{-1}\right)}{\Gamma\left(\alpha^{-1}\right)y!} \left(\frac{\alpha^{-1}}{\alpha^{-1}+\lambda}\right)^{\alpha^{-1}} \left(\frac{\lambda}{\lambda + \alpha^{-1}}\right)^{y}. $$
ztoi
$^\ast$) modifications, where the
count data variable $Y^{\ast}$ is defined such that its distribution
statisfies:$$ \mathbb{P}\left[Y^{\ast}=y\right]= \begin{cases} \mathbb{P}[Y=0] & y=0, \ \omega\left(1-\mathbb{P}[Y=0]\right)+(1-\omega)\mathbb{P}[Y=1] & y=1, \ (1-\omega)\mathbb{P}[Y=y] & y>1, \end{cases} $$
$$ \mathbb{P}\left[Y^{\ast}=y|Y^{\ast}>0\right]=\omega\1_{{1}}(y)+(1-\omega)\mathbb{P}[Y=y|Y>0]. $$
oizt
$^\ast$) modifications, where the
count data variable $Y^{\ast}$ is defined as:$$ \mathbb{P}\left[Y^{\ast}=y\right] = \omega \1_{{1}}(y)+(1-\omega)\mathbb{P}[Y=y], $$
$$ \mathbb{P}\left[Y^{\ast}=y|Y^{\ast}>0\right] = \omega\frac{\1_{{1}}(y)}{1-(1-\omega)\mathbb{P}[Y=0]}+ (1-\omega)\frac{\mathbb{P}[Y=y]}{1-(1-\omega)\mathbb{P}[Y=0]}. $$
Note that ztoi
$^\ast$ and oizt
$^\ast$ distributions are equivalent,
in the sense that the maximum value of the likelihood function is equal
for both of those distributions given any data, as shown by
[@bohning2023equivalence] but population size estimators are different.
In addition, we propose two new approaches to model singletons in a similar way as in hurdle models. In particular, we have proposed the following:
ztHurdle
*) for Poisson,
geometric and NB2 is defined as:$$ \mathbb{P}[Y^{\ast}=y]=\begin{cases} \frac{\mathbb{P}[Y=0]}{1-\mathbb{P}[Y=1]} & y=0, \ \pi(1-\mathbb{P}[Y=1]) & y=1, \ (1-\pi) \frac{\mathbb{P}[Y=y]}{1-\mathbb{P}[Y=1]} & y>1, \end{cases} $$
$$ \mathbb{P}[Y^{\ast}=y|Y^{\ast}>0]=\pi\mathbf{1}{{1}}(y)+ (1-\pi)\mathbf{1}{\mathbb{N}\setminus{1}}(y)\frac{\mathbb{P}[Y=y]}{1-\mathbb{P}[Y=0]-\mathbb{P}[Y=1]}. $$
where $\pi$ denotes the conditional probability of observing singletons.
Hurdlezt
*) for Poisson,
geometric and NB2 is defined as:$$ \mathbb{P}[Y^{\ast}=y]=\begin{cases} \pi & y=1, \ (1-\pi) \frac{\mathbb{P}[Y=y]}{1-\mathbb{P}[Y=1]} & y\neq1, \end{cases} $$
$$ \mathbb{P}[Y^{\ast}=y|Y^{\ast}>0]=\begin{cases} \pi\frac{1-\mathbb{P}[Y=1]}{1-\mathbb{P}[Y=0]-\mathbb{P}[Y=1]} & y=1,\ (1-\pi)\frac{\mathbb{P}[Y=y]}{1-\mathbb{P}[Y=0]-\mathbb{P}[Y=1]} & y>1, \end{cases} $$
where $\pi$ denotes the unconditional probability of observing singletons.
The approaches presented above differ in their assumptions,
computational complexity, or in the way they treat heterogeneity of
captures and singletons. For instance, the dispersion parameter $\alpha$
in the NB2 type models is often interpreted as measuring the
severity
of unobserved heterogeneity in the underlying Poisson
process [@ztnegbin]. When using any truncated NB model, the
hope is that given the class of models considered, the consistency is
not lost despite the lack of information.
While not discussed in the literature, the interpretation of
heterogeneous $\alpha$ across the population (specified in
controlModel
) would be that the unobserved heterogeneity affects the
accuracy of the prediction for the dependent variable $Y$ more severely
than others. The geometric model (NB with $\alpha=1$) is singled out in
the package and often considered in the literature because of inherent
computational issues with NB models, which are exacerbated by the fact
that data used for SSCR are usually of rather low quality. Data sparsity
is a particularly common problem in SSCR and a big challenge for all
numerical methods for fitting the (zero-truncated) NB model.
The extra mass $\omega$ in one-inflated models is an important extension
to the researcher's toolbox for SSCR models, since the inflation at
$y=1$ is likely to occur in many types of applications. For example,
when estimating the number of active people who committed criminal acts
in a given time period, the fact of being captured for the first time
following an arrest is associated with the risk of no longer being able
to be captured a second time. One constraint present in modelling via
inflated models is that attempts to include both the possibility of one
inflation and one deflation lead to both numerical and inferential
problems since the parameter space (of $(\omega, \lambda)$ or
$(\omega, \lambda, \alpha)$) is then given by
${(\omega, \lambda, \alpha) | \forall x\in \mathbb{N}: p(x|\omega, \lambda, \alpha)\geq0}$
for the probability mass function $p$. The boundary of this set is then
a $1$ or $2-$dimentional manifold, transforming this parameter space
into $\mathbb{R}^{3}$ would require using link
functions that depend
on more than one parameter.
Hurdle models represent another approach to modelling one-inflation. They can also model deflation as well as inflation and deflation simultaneously, so they are more flexible and, in the case of hurdle zero-truncated models, appear to be more numerically stable.
Although the question of how to interpret regression parameters tends to
be somewhat overlooked in SSCR studies, we should point out that the
interpretation of the $\omega$ inflation parameter (in ztoi
$^\ast$ or
oizt
$^\ast$) is more convenient than the interpretation of the $\pi$
probability parameter (in hurdle models). Additionally, the
interpretation of the $\lambda$ parameter in (one) inflated models
conforms to the following intuition: given that unit $k$ comes from the
non-inflated part of the population, it follows a Poisson distribution
(respectively geometric or negative binomial) with the $\lambda$
parameter (or $\lambda,\alpha$); no such interpretation exists for
hurdle models. Interestingly, estimates from hurdle zero-truncated and
one-inflated zero-truncated models tend to be quite close to one
another, although more rigorous studies are required to confirm this
observation.
As previously noted, the singleRcapture package can be used to model
the (linear) dependence of all parameters on covariates. A modified IRLS
algorithm is employed for this purpose as presented in Algorithm 1; full details are available in @VGAM-main.
In order to apply the algorithm, a modified model matrix
$\bX_{\text{vlm}}$ is created when the estimatePopsize
function is
called. In the context of the models implemented in singleRcapture,
this matrix can be written as:
$$ \begin{equation}\label{X_vlm-definition} \bX_{\text{vlm}}= \begin{pmatrix} \bX_{1} & \boldsymbol{0} &\dotso &\boldsymbol{0}\ \boldsymbol{0} & \bX_{2} &\dotso &\boldsymbol{0}\ \vdots & \vdots & \ddots & \vdots\ \boldsymbol{0} & \boldsymbol{0} &\dotso &\bX_{p} \end{pmatrix} \end{equation} $$
where each $\bX_{i}$ corresponds to a model matrix associated with a user specified formula.
Initialize with iter
← 1, $\boeta$ ← start
, $\bW$ ← I, $\ell$ ←
$\ell(\bbeta)$
Store values from the previous step:\ $\ell_{-}$ ← $\ell$, $\bW_{-}$ ← $\bW$, $\bbeta_{-}$ ← $\bbeta$ (the last assignment is omitted during the first iteration), and assign values in the current iteration:
$\displaystyle\boeta$ ← $\bX_{\text{vlm}}\bbeta+\bo$
$\bW_{(k)}$ ← $\mathbb{E}\left[-\frac{\partial^{2}\ell}{\partial\boeta_{(k)}^\top\partial\boeta_{(k)}}\right]$
$\boldsymbol{Z}{(k)}$ ← $\boeta{(k)}+\frac{\partial\ell}{\partial\boeta_{(k)}}\bW_{(k)}^{-1}-\bo_{(k)}$
where $\bo$ denotes offset.
Assign the current coefficient value:\ $\bbeta$ ← $\left(\bX_{\text{vlm}}\bW\bX_{\text{vlm}}\right)^{-1}\bX_{\text{vlm}}\bW\boldsymbol{Z}$
If $\ell(\bbeta)<\ell(\bbeta_{-})$ try selecting the smallest value $h$ such that for $\bbeta_{h}$ ← $2^{-h}\left(\bbeta+\bbeta_{-}\right)$ the inequality $\ell(\bbeta_{h})>\ell(\bbeta_{-})$ holds if this is successful $\bbeta$ ← $\bbeta_{h}$, else stop the algorithm.
If convergence is achieved or iter
is higher than maxiter
, stop
the algorithm, else iter
← 1 + iter
and return to step 2.
In the case of multi-parameter families, we get a matrix of linear predictors $\boeta$ instead of a vector, with the number of columns matching the number of parameters in the distribution. "Weights" (matrix $\bW$) are then modified to be information matrices $\displaystyle\mathbb{E}\left[-\frac{\partial^{2}\ell}{\partial\boeta_{(k)}^\top\partial\boeta_{(k)}}\right]$, where $\ell$ is the log-likelihood function and $\boeta_{(k)}$ is the $k$-th row of $\boeta$, while in the typical IRLS they are scalars $\displaystyle\mathbb{E}\left[-\frac{\partial^{2}\ell}{\partial\eta_{k}^{2}}\right]$, which is often just $\displaystyle-\frac{\partial^{2}\ell}{\partial\eta^{2}}$.
We have implemented three types of bootstrap algorithms: parametric (adapted from theory in @zwane, @norrpoll for multiple source setting with covariates), semi-parametric (see e.g. @BoehningFriedl2021) and nonparametric. The nonparametric version is the usual bootstrap algorithm; which will typically underestimate the variance of $\hat{N}$. In this section, the focus is on the first two approaches.
The idea of semi-parametric bootstrap is to modify the usual bootstrap to include the additional uncertainty resulting from the fact that the sample size is a random variable. This type of bootstrap is performed in steps listed in Algorithm 2.
In other words, we first draw a sample size and then a sample conditional on the sample size. Note that when using the semi-parametric bootstrap one implicitly assumes that the population size estimate $\hat{N}$ is accurate. The last implemented bootstrap type is the parametric algorithm, which first draws a finite population of size $\approx\hat{N}$ from the superpopulation model and then samples from this population according to the selected model, as described in Algorithm 3.
Note that in order for this type of algorithm to result in consistent standard error estimates, it is imperative that the estimated model for the entire superpopulation probability space is consistent, which may be much less realistic than in the case of the semi-parametric bootstrap. The parametric bootstrap algorithm is the default option in singleRcapture.
estimatePopsize
functionThe singleRcapture package is built around the estimatePopsize
function. The main design objective was to make using estimatePopsize
as similar as possible to the standard stats::glm
function or packages
for fitting zero-truncated regression models, such as countreg (e.g.
countreg::zerotrunc
function). The estimatePopsize
function is used
to first fit an appropriate (vector) generalized linear model and to
estimate the population size along with its variance. It is assumed that
the response vector (i.e. the dependent variable) corresponds to the
number of times a given unit was observed in the source. The most
important arguments are given in Table below; the obligatory ones are
formula, data, model
.
| Argument | Description |
|------------------|------------------------------------------------------|
| formula
| The main formula (i.e., for the Poisson $\lambda$ parameter); |
| data
| A data.frame
(or data.frame
coercible) object; |
| model
| Either a function, a string, or a family class object specifying which model should be used; possible values are listed in the documentation. The supplied argument should have the form model = "ztpoisson"
, model = ztpoisson
, or if a link function should be specified, then model = ztpoisson(lambdaLink = "log")
can be used; |
| method
| A numerical method used to fit regression IRLS
or optim
; |
| popVar
| A method for estimating variance of $\hat{N}$ and creating confidence intervals (either bootstrap, analytic, or skipping the estimation entirely); |
| controlMethod
, controlModel
, or controlPopVar
| Control parameters for numerical fitting, specifying additional formulas (inflation, dispersion) and population size estimation, respectively; |
| offset
| A matrix of offset values with the number of columns matching the number of distribution parameters, providing offset values to each of the linear predictors; |
| ...
| Additional optional arguments passed to other methods, e.g., estimatePopsizeFit
; |
An important step in using estimatePopsize
is specifying the model
parameter, which indicates the type of model that will be used for
estimating the unobserved part of the population. For instance, to fit
Chao's or Zelterman's model one should select chao
or zelterman
and,
assuming that one-inflation is present, one can select one of the
zero-truncated one-inflated (ztoi
$^\ast$) or one-inflated
zero-truncated (oizt
$^\ast$) models, such as oiztpoisson
for Poisson
or ztoinegbin
for NB2.
If it is assumed that heterogeneity is observed for NB2 models, one can
specify the formula in the controlModel
argument with the
controlModel
function and the alphaFormula
argument. This enables
the user to provide a formula for the dispersion parameter in the NB2
models. If heterogeneity is assumed for ztoi
$^\ast$ or oizt
$^\ast$,
one can specify the omegaFormula
argument, which corresponds to the
$\omega$ parameter in these models. Finally, if covariates are assumed
to be available for the hurdle models (ztHurdle
$^\ast$ or
Hurdlezt
$^\ast$), then piFormula
can be specified, as it provides a
formula for the probability parameter in these models.
controlPopVar
The estimatePopsize
function makes it possible to specify the variance
estimation method via popVar
(e.g. analytic or variance bootstrap) and
control the estimation process by specifying controlPopVar
. In the
control function controlPopVar
the user can specify the bootType
argument, which has three possible values:
"parametric", "semiparametric"
and "nonparametric"
. Additional
arguments accepted by the contorlPopVar
function, which are relevant
to bootstrap, include:
alpha
, B
-- the significance level and the number of bootstrap
samples to be performed, respectively, with $0.05$ and $500$ being
the default options.cores
-- the number of process cores to be used in bootstrap (1 by
default); parallel computing is enabled by doParallel
[@doParallel], foreach [@foreach] and parallel
packages [@parallel].keepbootStat
-- a logical value indicating whether to keep a
vector of statistics produced by the bootstrap.traceBootstrapSize
, bootstrapVisualTrace
-- logical values
indicating whether sample and population size should be tracked
(FALSE
by default); these work only when cores
= 1.fittingMethod
, bootstrapFitcontrol
-- the fitting method (by
default the same as the one used in the original call) and control
parameters (controlMethod
) for model fitting in the bootstrap.In addition, the user can specify the type of confidence interval by
means of confType
and the type of covariance matrix by using covType
for the analytical variance estimator (observed or the Fisher
information matrix).
In the next sections we present a case study involving the use of a
simple zero-truncated Poisson regression and a more advanced model:
one-inflated zero-truncated geometric regression with the cloglog
link
function. First, we present the example dataset, then we describe how to
estimate the population size and assess the quality and diagnostics
measures. Finally, we show how to estimate the population size in
user-specified sub-populations.
The package can be installed in the standard manner using:
install.packages("singleRcapture")
Then, we need to load the package using the following code:
library(singleRcapture)
We use a dataset from @ztpoisson, which contains information about
immigrants in four Dutch cities (Amsterdam, Rotterdam, The Hague and
Utrecht), who were staying in the country without a legal permit in 1995
and appeared in police records for that year. This dataset is included
in the package called netherlandsimmigrant
:
data(netherlandsimmigrant) head(netherlandsimmigrant)
The number of times each individual appeared in the records is included
in the capture
variable. The available covariates include
gender, age, reason, nation
; the last two represent the reason for
being captured and the region of the world a given person comes from:
summary(netherlandsimmigrant)
One notable characteristic of this dataset is that it contains a
disproportionately large number of individuals who were observed only
once (i.e. r sum(netherlandsimmigrant$capture==1)
).
table(netherlandsimmigrant$capture)
The basic syntax of estimatePopsize
is very similar to that of glm
,
the same can be said about the output of the summary method except for
additional results of population size estimates (denoted as
Population size estimation results
).
basicModel <- estimatePopsize( formula = capture ~ gender + age + nation, model = ztpoisson(), data = netherlandsimmigrant, controlMethod = controlMethod(silent = TRUE) ) summary(basicModel)
The output regarding the population size contains the point estimate, the observed proportion (based on the input dataset), the standard error and two confidence intervals: one relating to the point estimate, the second -- to the observed proportion.
According to this simple model, the population size is about 12,500, with about 15% of units observed in the register. The 95% CI under normality indicates that the true population size is likely to be between 7,000-18,000, with about 10-26% of the target population observed in the register.
Since there is a reasonable suspicion that the act of observing a unit
in the dataset may lead to undesirable consequences for the person
concerned (in this case, a possible deportation, detention or something
similar). For these reasons, the user may consider one-inflated models,
such as one-inflated zero-truncated geometric model (specified by
oiztgeom
family) and those presented below.
set.seed(123456) modelInflated <- estimatePopsize( formula = capture ~ nation, model = oiztgeom(omegaLink = "cloglog"), data = netherlandsimmigrant, controlModel = controlModel( omegaFormula = ~ gender + age ), popVar = "bootstrap", controlPopVar = controlPopVar(bootType = "semiparametric", B = 50) ) summary(modelInflated)
According to this approach, the population size is about 7,000, which is
about 5,000 less than in the case of the naive Poisson approach. A
comparison of AIC and BIC suggests that the one-inflation model fits the
data better with BIC for oiztgeom
r round(BIC(modelInflated))
and
r round(BIC(basicModel))
for ztpoisson
.
We can access population size estimates using the following code, which returns a list with numerical results.
popSizeEst(basicModel) # alternative: basicModel$populationSize popSizeEst(modelInflated) # alternative: modelInflated$populationSize
The decision whether to use a zero-truncated Poisson or one-inflated zero-truncated geometric model should be based on the assessment of the model and the assumptions regarding the data generation process. One possible method of selection is based on the likelihood ratio test, which can be computed quickly and conveniently with the lmtest (@lmtest) interface:
library(lmtest)
lrtest(basicModel, modelInflated, name = function(x) { if (family(x)$family == "ztpoisson") "Basic model" else "Inflated model" })
However, the above is not a standard method of model selection in SSCR. The next sections are dedicated to a detailed description of how to assess the results using standard statistical tests and diagnostics.
A popular method of testing the model fit in single source capture-recapture studies consists in comparing the fitted marginal frequencies $\displaystyle\sum_{j=1}^{N_{obs}}\hat{\mathbb{P}}\left[Y_{j}=k|\bx_{j}, Y_{j} > 0\right]$ with the observed marginal frequencies $\displaystyle\sum_{j=1}^{N}\1_{{k}}(Y_{j})=\sum_{j=1}^{N_{obs}}\1_{{k}}(Y_{j})$ for $k\geq1$. If the fitted model bears sufficient resemblance to the real data collection process, these quantities should be quite close and both $G$ and $\chi^{2}$ tests can be used to test the statistical significance of the discrepancy with the following singleRcapture syntax for the Poisson model (rather poor fit):
margFreq <- marginalFreq(basicModel) summary(margFreq, df = 1, dropl5 = "group")
and for the one-inflated model (better fit):
margFreq_inf <- marginalFreq(modelInflated) summary(margFreq_inf, df = 1, dropl5 = "group")
where the dropl5
argument is used to indicate how to handle cells with
less than $5$ fitted observations. Note, however, that currently there
is no continuity correction.
The singleRStaticCountData
class has a plot
method implementing
several types of quick demonstrative plots, such as the rootogram
[@rootogram], for comparing fitted and marginal frequencies,
which can be generated with the following syntax:
plot( basicModel, plotType = "rootogram", main = "ZT Poisson model") plot(modelInflated, plotType = "rootogram", main = "OI ZT Geometric model")
The above plots suggest that the oiztgeom
model fits the data better.
Another important issue in population size estimation is to conduct
model diagnostics in order to verify whether influential observations
are present in the data. For this purpose the leave-one-out (LOO)
diagnostic implemented in the dfbeta
from the stats package has
been adapted as shown below (multiplied by a factor of a hundred for
better readability):
dfb <- dfbeta(basicModel) round(t(apply(dfb, 2, quantile)*100), 4)
dfi <- dfbeta(modelInflated) round(t(apply(dfi, 2, quantile)*100), 4)
The result of the dfbeta
can be further used in the dfpopsize
function, which can be used to quantify LOO on the population size. Note
the warning when the bootstap variance estimation is applied.
dfb_pop <- dfpopsize(basicModel, dfbeta = dfb) dfi_pop <- dfpopsize(modelInflated, dfbeta = dfi) summary(dfb_pop) summary(dfi_pop)
Figure 2 shows a comparison of the effect of deleting an observation on the population size estimate and inverse probability weights, which refer to the contribution of a given observation to the population size estimate:
plot(basicModel, plotType = "dfpopContr", dfpop = dfb_pop, xlim = c(-4500, 150)) plot(modelInflated, plotType = "dfpopContr", dfpop = dfi_pop, xlim = c(-4500, 150))
These plots show how the population size changes if a given observation
is removed. For instance, if we remove observation
r which.min(dfb_pop)
, then the population size will increase by about
r abs(round(min(dfb_pop)))
for the ztpoisson
model. In the case of
oiztgeom
, the largest change is equal to r abs(round(min(dfi_pop)))
for observation r which.min(dfi_pop)
.
The full list of plot types along with the list of optional arguments
that can be passed from the call to the plot
method down to base R
and graphics functions can be found in the help file of the plot
method.
?plot.singleRStaticCountData
stratifyPopsize
functionResearchers may be interested not only in the total population size but
also in the size of specific sub-populations (e.g. males, females,
particular age groups). For this reason we have created the
stratifyPopsize
function, which estimates the size by strata defined
by the coefficients in the model (the default option). The following
output presents results based on the ztpoisson
and oiztgeom
models.
popSizestrata <- stratifyPopsize(basicModel) cols <- c("name", "Observed", "Estimated", "logNormalLowerBound", "logNormalUpperBound") popSizestrata_report <- popSizestrata[, cols] cols_custom <- c("Name", "Obs", "Estimated", "LowerBound", "UpperBound") names(popSizestrata_report) <- cols_custom popSizestrata_report
popSizestrata_inflated <- stratifyPopsize(modelInflated) popSizestrata_inflated_report <- popSizestrata_inflated[, cols] names(popSizestrata_inflated_report) <- cols_custom popSizestrata_inflated_report
The stratifyPopsize
function prepared to handle objects of the
singleRStaticCountData
class, accepts three optional parameters
strata, alpha, cov
, which are used for specifying sub-populations,
significance levels and the covariance matrix to be used for computing
standard errors. An example of the full call is presented below.
library(sandwich) popSizestrataCustom <- stratifyPopsize( object = basicModel, strata = ~ gender + age, alpha = rep(c(0.1, 0.05), each=2), cov = vcovHC(basicModel, type = "HC4") ) popSizestrataCustom_report <- popSizestrataCustom[, c(cols, "confLevel")] names(popSizestrataCustom_report) <- c(cols_custom, "alpha") popSizestrataCustom_report
We have provided integration with the sandwich [@sandwich]
package to correct the variance-covariance matrix in the $\delta$
method. In the code we have used the vcovHC
method for
singleRStaticCountData
class from the sandwich package, different
significance levels for confidence intervals in each stratum and a
formula to specify that we want estimates for both males and females to
be grouped by nation
and age
. The strata
parameter can be
specified either as:
~ gender * age
),netherlandsimmigrant$gender == "male"
),c("gender", "age")
),list( "Stratum 1" = netherlandsimmigrant$gender == "male" & netherlandsimmigrant$nation == "Suriname", "Stratum 2" = netherlandsimmigrant$gender == "female" & netherlandsimmigrant$nation == "North Africa" )
One can also specify plotType = "strata"
in the plot
function, which
results in a plot with point and CI estimates of the population size.
plot(basicModel, plotType = "strata") plot(modelInflated, plotType = "strata")
Only the logNormal
type of confidence interval is used for plotting
since the studentized confidence intervals often result in negative
lower bounds.
S3Methods
{#sec-methods}We have created a number of classes. The main ones are:
singleRStaticCountData
, singleRfamily
, and supplementary are:
popSizeEstResults
, summarysingleRmargin
and \newline
summarysingleRStaticCountData
, which make it possible to extract
relevant information regarding the population size.
For instance, the popSizeEst
function can be used to extract
information about the estimated size of the population as given below:
(popEst <- popSizeEst(basicModel))
and the resulting object popEst
of the popSizeEstResults
class
contains the following fields:
pointEstimate
, variance
-- numerics containing the point
estimate and variance of this estimate.confidenceInterval
-- a data.frame
with confidence intervals.boot
-- If the bootstrap was performed a numeric vector containing
the $\hat{N}$ values from the bootstrap, a character vector with
value "No bootstrap performed"
otherwise.control
-- a controlPopVar
object with controls used to obtain
the object.The only explicitly defined method for popSizeEstResults
,
summarysingleRmargin
and summarysingleRStaticCountData
classes is
the print
method, but the former one also accepts R primitives
like coef
:
coef(summary(basicModel))
analogously to glm
from stats. The singleRfamily
inherits the
family
class from stats and has explicitly defined print
and
simulate
methods. Example usage is presented below
| Function | Description |
|-----------------|-------------------------------------------------------|
| fitted
| It works almost exactly like glm
counterparts but returns more information, namely on fitted values for the truncated and non-truncated probability distribution; |
| logLik
| Compared to glm
method, it has the possibility of returning not just the value of the fitted log-likelihood but also the entire function (argument type = "function"
) along with two first derivatives (argument deriv = 0:2
); |
| model.matrix
| It has the possibility of returning the $X_{\text{vlm}}$ matrix defined previously; |
| simulate
| It calls the simulate
method for the chosen model and fitted $\boeta$; |
| predict
| It has the possibility of returning either fitted distribution parameters for each unit (type = "response"
), or just linear predictors (type = "link"
), or means of the fitted distributions of $Y$ and $Y|Y>0$ (type = "mean"
) or the inverse probability weights (type = "contr"
). It is possible to set the se.fit
argument to TRUE
in order to obtain standard errors for each of those by using the $\delta$ method. Also, it is possible to use a custom covariance matrix for standard error computation (argument cov
); |
| redoPopEstimation
| A function that applies all post-hoc procedures that were performed (such as heteroscedastic consistent covariance matrix estimation via countreg) to estimate the population size and standard errors; |
| residuals
| Used for obtaining residuals of several types, we refer interested readers to the manual ?singleRcapture:::residuals.singleRStaticCountData
; |
| stratifyPopsize, summary
| Compared to the glm
class, summary has the possibility of adding confidence intervals to the coefficient matrix (argument confint = TRUE
) and using a custom covariance matrix (argument cov = someMatrix
); |
| plot
| It has been discussed above; |
| popSizeEst
| An extractor showcased above; |
| cooks.distance
| It works only for single predictor models; |
| dfbeta, dfpopsize
| Multi-threading in dfbeta
is available and dfpopsize
calls dfbeta
if no dfbeta
object was provided in the call; |
| bread, estfun, vcovHC
| For (almost) full sandwich compatibility; |
| AIC, BIC, extractAIC, family, confint, df.residual, model.frame, hatvalues, nobs, print, sigma, influence, rstudent, rstandard
| These work exactly like glm
counterparts. |
set.seed(1234567890) N <- 10000 gender <- rbinom(N, 1, 0.2) eta <- -1 + 0.5*gender counts <- simulate(ztpoisson(), eta = cbind(eta), seed = 1) summary(data.frame(gender, eta, counts))
The full list of explicitly defined methods for singleRStaticCountData
methods is presented in Table above.
In this paper we have introduced the singleRcapture package for
single source capture-recapture models. The package implement
state-of-the-art methods for estimating population size based on a
single data set with multiple counts. The package implements different
methods to account for heterogeneity in capture probabilities, modelled
using covariates, as well as behavioural change, modelled using
one-inflation. We have built the package to facilitate the
implementation of new models using family
objects; their application
is exemplified in the Section 7. An example of
implementing a custom family described in Section 8 is
presented in replication materials.
Furthermore, since many R users are familiar with countreg or VGAM packages, we have implemented a lightweight extension called singleRcaptureExtra, available through Github (\url{https://github.com/ncn-foreigners/singleRcaptureExtra}), which can be used to integrate singleRcapture with these packages.
In future work we plan to implement Bayesian estimation using Stan (e.g. via the brms package; @carpenter2017stan, @brms) and for one-inflation models we can use the recent approach proposed by @tuoto2022bayesian and implement our own families using the brms package.
The authors' work has been financed by the National Science Centre in Poland, OPUS 20, grant no. 2020/39/B/HS4/00941.
The authors would like to thank Peter van der Heijden, Maarten Cruyff, Dankmar Böhning, Łukasz Chrostowski and Layna Dennett for useful comments that have helped to improve the functionality of the package. In addition, we would like to thank Marcin Szymkowiak and Tymon Świtalski for their valuable comments that have considerably improved the paper.
estimatePopsizeFit
function {#sec-estimatePopsizeFit short-title="The estimatePopsizeFit function"}In this section we provide a step-by-step description of how to prepare
data in order to use the estimatePopsizeFit
function, which may be
useful to some users, e.g. those wishing to make modifications to the
$\hat{N}$ estimate or to the bootstrap. In order to show how to apply
the function we will fit a zero truncated geometric model on the data
from @chao-generalization with covariate dependency:
\begin{align} \log(\lambda) &= \beta_{1, 1} + \beta_{1, 2} \text{log_distance} + \beta_{1, 3} \text{C_TYPE} + \beta_{1, 4} \text{log_size}, \ \text{logit}(\omega) &= \beta_{2, 1} + \beta_{2, 2} \text{log_distance} + \beta_{2, 3} \text{C_TYPE}. \end{align}
This would be equivalent to the following esimatePopsize
call:
estimatePopsize( TOTAL_SUB ~ ., data = farmsubmission, model = ztoigeom(), controlModel( omegaFormula = ~ 1 + log_size + C_TYPE ) )
X <- matrix(data = 0, nrow = 2 * NROW(farmsubmission), ncol = 7)
model.matrix
according to the
specified formula and specify the attribute attr(X, "hwm")
that
informs the function which elements of the design matrix correspond
to which linear predictor (covariates for counts and covariates for
one-inflation)X[1:NROW(farmsubmission), 1:4] <- model.matrix( ~ 1 + log_size + log_distance + C_TYPE, farmsubmission ) X[-(1:NROW(farmsubmission)), 5:7] <- model.matrix( ~ 1 + log_distance + C_TYPE, farmsubmission ) attr(X, "hwm") <- c(4, 3)
glm.fit
function.start <- glm.fit( y = farmsubmission$TOTAL_SUB, x = X[1:NROW(farmsubmission), 1:4], family = poisson() )$coefficients start
estimatePopsizeFit
function to fit the model assuming a
zero-truncated one-inflated geometric distribution as specified in
the family
argument.res <- estimatePopsizeFit( y = farmsubmission$TOTAL_SUB, X = X, method = "IRLS", priorWeights = 1, family = ztoigeom(), control = controlMethod(silent = TRUE), coefStart = c(start, 0, 0, 0), etaStart = matrix(X %*% c(start, 0, 0, 0), ncol = 2), offset = cbind(rep(0, NROW(farmsubmission)), rep(0, NROW(farmsubmission))) )
stats::optim
function.ll <- ztoigeom()$makeMinusLogLike(y = farmsubmission$TOTAL_SUB, X = X)
res2 <- estimatePopsizeFit( y = farmsubmission$TOTAL_SUB, X = X, method = "optim", priorWeights = 1, family = ztoigeom(), coefStart = c(start, 0, 0, 0), control = controlMethod(silent = TRUE, maxiter = 10000), offset = cbind(rep(0, NROW(farmsubmission)), rep(0, NROW(farmsubmission))) )
data.frame(IRLS = round(c(res$beta, -ll(res$beta), res$iter), 4), optim = round(c(res2$beta, -ll(res2$beta), res2$iter[1]), 4))
The default maxiter
parameter for "optim"
fitting is $1000$, but we
needed to increase it since the optim
does not converge in $1000$
steps and "gets stuck" at a plateau, which results in a lower
log-likelihood value compared to the standard "IRLS"
.
The above situation is rather typical. While we did not conduct any
formal numerical analyses, it seems that when one attempts to model more
than one parameter of the distribution as covariate dependent optim
algorithms, both "Nelder-Mead"
and "L-BFGS-B"
seem to be ill-suited
for the task despite being provided with the analytically computed
gradient. This is one of the reasons why "IRLS"
is the default fitting
method.
In this section we provide details regarding the family
object for the
singleRcapture package. This object contains additional parameters
in comparison to the standard family
object from the stats
package.
| Function | Description |
|--------------|----------------------------------------------------------|
| makeMinusLogLike
| A factory function for creating the following functions:
$\ell(\bbeta)$, $\frac{\partial\ell}{\partial\bbeta}$, $\frac{\partial^{2}\ell}{\partial\bbeta^\top\partial\bbeta}$
from the $\boldsymbol{y}$ vector and the $\bX_{vlm}$ matrix. The deriv
argument has possible values in c(0, 1, 2)
that determine which derivative to return; the default value is 0
, which represents the minus log-likelihood. |
| links
| A list with link functions; |
| mu.eta, variance
| Functions of linear predictors that return the expected value and variance. The type
argument with 2 possible values ("trunc"
and "nontrunc"
) specifies whether to return $\mathbb{E}[Y|Y>0], \text{var}[Y|Y>0]$ or $\mathbb{E}[Y], \text{var}[Y]$ respectively. The deriv
argument with values in c(0, 1, 2)
is used for indicating the derivative with respect to the linear predictors, which is used for providing standard errors in the predict
method. |
| family
| A string that specifies the model name; |
| valideta, validmu
| For now, it only returns TRUE
. In the near future, it will be used to check whether applied linear predictors are valid (i.e., are transformed into some elements of the parameter space subjected to the inverse link function). |
| funcZ, Wfun
| Functions that create pseudo residuals and working weights used in the IRLS algorithm; |
| devResids
| A function that returns deviance residuals given a vector of prior weights of linear predictors and the response vector; |
| pointEst, popVar
| Functions that return the point estimate for the population size and analytic estimation of its variance given prior weights of linear predictors and, in the latter case, also estimates of $\text{cov}(\hat{\bbeta})$ and $\bX_{\text{vlm}}$ matrix. There is an additional boolean parameter contr
in the former function, which, if set to TRUE
, returns the contribution of each unit. |
| etaNames
| Names of linear predictors; |
| densityFunction
| A function that returns the value of PMF at values of x
given linear predictors. The type
argument specifies whether to return $\mathbb{P}[Y|Y>0]$ or $\mathbb{P}[Y]$; |
| simulate
| A function that generates values of a dependent vector given linear predictors; |
| getStart
| An expression for generating starting points; |
Suppose we want to implement a very specific zero truncated family function in the singleRcapture, which corresponds to the following "untruncated" distribution:
$$ \mathbb{P}[Y=y|\lambda, \pi] = \begin{cases} 1 - \frac{1}{2}\lambda - \frac{1}{2}\pi & \text{when: } y=0\ \frac{1}{2}\pi & \text{when: } y=1\ \frac{1}{2}\lambda & \text{when: } y=2, \end{cases} $$
with $\lambda, \pi\in\left(0, 1\right)$ being dependent on covariates.
Below we provide a possible way of implementing the above model, with
lambda, pi
meaning $\frac{1}{2}\lambda,\frac{1}{2}\pi$. We provide a
simple example that shows that the proposed approach works as expected.
myFamilyFunction <- function(lambdaLink = c("logit", "cloglog", "probit"), piLink = c("logit", "cloglog", "probit"), ...) { if (missing(lambdaLink)) lambdaLink <- "logit" if (missing(piLink)) piLink <- "logit" links <- list() attr(links, "linkNames") <- c(lambdaLink, piLink) lambdaLink <- switch(lambdaLink, "logit" = singleRcapture:::singleRinternallogitLink, "cloglog" = singleRcapture:::singleRinternalcloglogLink, "probit" = singleRcapture:::singleRinternalprobitLink ) piLink <- switch(piLink, "logit" = singleRcapture:::singleRinternallogitLink, "cloglog" = singleRcapture:::singleRinternalcloglogLink, "probit" = singleRcapture:::singleRinternalprobitLink ) links[1:2] <- c(lambdaLink, piLink) mu.eta <- function(eta, type = "trunc", deriv = FALSE, ...) { pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 if (!deriv) { switch (type, "nontrunc" = pi + 2 * lambda, "trunc" = 1 + lambda / (pi + lambda) ) } else { # Only necessary if one wishes to use standard errors in predict method switch (type, "nontrunc" = { matrix(c(2, 1) * c( lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) / 2, piLink(eta[, 2], inverse = TRUE, deriv = 1) / 2 ), ncol = 2) }, "trunc" = { matrix(c( pi / (pi + lambda) ^ 2, -lambda / (pi + lambda) ^ 2 ) * c( lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) / 2, piLink(eta[, 2], inverse = TRUE, deriv = 1) / 2 ), ncol = 2) } ) } } variance <- function(eta, type = "nontrunc", ...) { pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 switch (type, "nontrunc" = pi * (1 - pi) + 4 * lambda * (1 - lambda - pi), "trunc" = lambda * (1 - lambda) / (pi + lambda) ) } Wfun <- function(prior, y, eta, ...) { pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 G01 <- ((lambda + pi) ^ (-2)) * piLink(eta[, 2], inverse = TRUE, deriv = 1) * lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) * prior / 4 G00 <- ((lambda + pi) ^ (-2)) - (pi ^ (-2)) - lambda / ((lambda + pi) * (pi ^ 2)) G00 <- G00 * prior * (piLink(eta[, 2], inverse = TRUE, deriv = 1) ^ 2) / 4 G11 <- ((lambda + pi) ^ (-2)) - (((lambda + pi) * lambda) ^ -1) G11 <- G11 * prior * (lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) ^ 2) / 4 matrix( -c(G11, # lambda G01, # mixed G01, # mixed G00 # pi ), dimnames = list(rownames(eta), c("lambda", "mixed", "mixed", "pi")), ncol = 4 ) } funcZ <- function(eta, weight, y, prior, ...) { pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 weight <- weight / prior G0 <- (2 - y) / pi - ((lambda + pi) ^ -1) G1 <- (y - 1) / lambda - ((lambda + pi) ^ -1) G1 <- G1 * lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) / 2 G0 <- G0 * piLink(eta[, 2], inverse = TRUE, deriv = 1) / 2 uMatrix <- matrix(c(G1, G0), ncol = 2) weight <- lapply(X = 1:nrow(weight), FUN = function (x) { matrix(as.numeric(weight[x, ]), ncol = 2) }) pseudoResid <- sapply(X = 1:length(weight), FUN = function (x) { #xx <- chol2inv(chol(weight[[x]])) # less computationally demanding xx <- solve(weight[[x]]) # more stable xx %*% uMatrix[x, ] }) pseudoResid <- t(pseudoResid) dimnames(pseudoResid) <- dimnames(eta) pseudoResid } minusLogLike <- function(y, X, offset, weight = 1, NbyK = FALSE, vectorDer = FALSE, deriv = 0, ...) { y <- as.numeric(y) if (is.null(weight)) { weight <- 1 } if (missing(offset)) { offset <- cbind(rep(0, NROW(X) / 2), rep(0, NROW(X) / 2)) } if (!(deriv %in% c(0, 1, 2))) stop("Only score function and derivatives up to 2 are supported.") deriv <- deriv + 1 switch (deriv, function(beta) { eta <- matrix(as.matrix(X) %*% beta, ncol = 2) + offset pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 -sum(weight * ((2 - y) * log(pi) + (y - 1) * log(lambda) - log(pi + lambda))) }, function(beta) { eta <- matrix(as.matrix(X) %*% beta, ncol = 2) + offset pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 G0 <- (2 - y) / pi - ((lambda + pi) ^ -1) G1 <- (y - 1) / lambda - ((lambda + pi) ^ -1) G1 <- G1 * weight * lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) / 2 G0 <- G0 * weight * piLink(eta[, 2], inverse = TRUE, deriv = 1) / 2 if (NbyK) { XX <- 1:(attr(X, "hwm")[1]) return(cbind(as.data.frame(X[1:nrow(eta), XX]) * G1, as.data.frame(X[-(1:nrow(eta)), -XX]) * G0)) } if (vectorDer) { return(cbind(G1, G0)) } as.numeric(c(G1, G0) %*% X) }, function (beta) { lambdaPredNumber <- attr(X, "hwm")[1] eta <- matrix(as.matrix(X) %*% beta, ncol = 2) + offset pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 res <- matrix(nrow = length(beta), ncol = length(beta), dimnames = list(names(beta), names(beta))) # pi^2 derivative dpi <- (2 - y) / pi - (lambda + pi) ^ -1 G00 <- ((lambda + pi) ^ (-2)) - (2 - y) / (pi ^ 2) G00 <- t(as.data.frame(X[-(1:(nrow(X) / 2)), -(1:lambdaPredNumber)] * (G00 * ((piLink(eta[, 2], inverse = TRUE, deriv = 1) / 2) ^ 2) + dpi * piLink(eta[, 2], inverse = TRUE, deriv = 2) / 2) * weight)) %*% as.matrix(X[-(1:(nrow(X) / 2)), -(1:lambdaPredNumber)]) # mixed derivative G01 <- (lambda + pi) ^ (-2) G01 <- t(as.data.frame(X[1:(nrow(X) / 2), 1:lambdaPredNumber]) * G01 * (lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) / 2) * (piLink(eta[, 2], inverse = TRUE, deriv = 1) / 2) * weight) %*% as.matrix(X[-(1:(nrow(X) / 2)), -(1:lambdaPredNumber)]) # lambda^2 derivative G11 <- ((lambda + pi) ^ (-2)) - (y - 1) / (lambda ^ 2) dlambda <- (y - 1) / lambda - ((lambda + pi) ^ -1) G11 <- t(as.data.frame(X[1:(nrow(X) / 2), 1:lambdaPredNumber] * (G11 * ((lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) / 2) ^ 2) + dlambda * lambdaLink(eta[, 1], inverse = TRUE, deriv = 2) / 2) * weight)) %*% X[1:(nrow(X) / 2), 1:lambdaPredNumber] res[-(1:lambdaPredNumber), -(1:lambdaPredNumber)] <- G00 res[1:lambdaPredNumber, 1:lambdaPredNumber] <- G11 res[1:lambdaPredNumber, -(1:lambdaPredNumber)] <- t(G01) res[-(1:lambdaPredNumber), 1:lambdaPredNumber] <- G01 res } ) } validmu <- function(mu) { (sum(!is.finite(mu)) == 0) && all(0 < mu) && all(2 > mu) } # this is optional devResids <- function(y, eta, wt, ...) { 0 } pointEst <- function (pw, eta, contr = FALSE, ...) { pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 N <- pw / (lambda + pi) if(!contr) { N <- sum(N) } N } popVar <- function (pw, eta, cov, Xvlm, ...) { pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 bigTheta1 <- -pw / (pi + lambda) ^ 2 # w.r to pi bigTheta1 <- bigTheta1 * piLink(eta[, 2], inverse = TRUE, deriv = 1) / 2 bigTheta2 <- -pw / (pi + lambda) ^ 2 # w.r to lambda bigTheta2 <- bigTheta2 * lambdaLink(eta[, 1], inverse = TRUE, deriv = 1) / 2 # w.r to lambda bigTheta <- t(c(bigTheta2, bigTheta1) %*% Xvlm) f1 <- t(bigTheta) %*% as.matrix(cov) %*% bigTheta f2 <- sum(pw * (1 - pi - lambda) / ((pi + lambda) ^ 2)) f1 + f2 } dFun <- function (x, eta, type = c("trunc", "nontrunc")) { if (missing(type)) type <- "trunc" pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 switch (type, "trunc" = { (pi * as.numeric(x == 1) + lambda * as.numeric(x == 2)) / (pi + lambda) }, "nontrunc" = { (1 - pi - lambda) * as.numeric(x == 0) + pi * as.numeric(x == 1) + lambda * as.numeric(x == 2) } ) } simulate <- function(n, eta, lower = 0, upper = Inf) { pi <- piLink(eta[, 2], inverse = TRUE) / 2 lambda <- lambdaLink(eta[, 1], inverse = TRUE) / 2 CDF <- function(x) { ifelse(x == Inf, 1, ifelse(x < 0, 0, ifelse(x < 1, 1 - pi - lambda, ifelse(x < 2, 1 - lambda, 1)))) } lb <- CDF(lower) ub <- CDF(upper) p_u <- stats::runif(n, lb, ub) sims <- rep(0, n) cond <- CDF(sims) <= p_u while (any(cond)) { sims[cond] <- sims[cond] + 1 cond <- CDF(sims) <= p_u } sims } getStart <- expression( if (method == "IRLS") { etaStart <- cbind( family$links[[1]](mean(observed == 2) * (1 + 0 * (observed == 2))), # lambda family$links[[2]](mean(observed == 1) * (1 + 0 * (observed == 1))) # pi ) + offset } else if (method == "optim") { init <- c( family$links[[1]](weighted.mean(observed == 2, priorWeights) * 1 + .0001), family$links[[2]](weighted.mean(observed == 1, priorWeights) * 1 + .0001) ) if (attr(terms, "intercept")) { coefStart <- c(init[1], rep(0, attr(Xvlm, "hwm")[1] - 1)) } else { coefStart <- rep(init[1] / attr(Xvlm, "hwm")[1], attr(Xvlm, "hwm")[1]) } if ("(Intercept):pi" %in% colnames(Xvlm)) { coefStart <- c(coefStart, init[2], rep(0, attr(Xvlm, "hwm")[2] - 1)) } else { coefStart <- c(coefStart, rep(init[2] / attr(Xvlm, "hwm")[2], attr(Xvlm, "hwm")[2])) } } ) structure( list( makeMinusLogLike = minusLogLike, densityFunction = dFun, links = links, mu.eta = mu.eta, valideta = function (eta) {TRUE}, variance = variance, Wfun = Wfun, funcZ = funcZ, devResids = devResids, validmu = validmu, pointEst = pointEst, popVar = popVar, family = "myFamilyFunction", etaNames = c("lambda", "pi"), simulate = simulate, getStart = getStart, extraInfo = c( mean = "pi / 2 + lambda", variance = paste0("(pi / 2) * (1 - pi / 2) + 2 * lambda * (1 - lambda / 2 - pi / 2)"), popSizeEst = "(1 - (pi + lambda) / 2) ^ -1", meanTr = "1 + lambda / (pi + lambda)", varianceTr = paste0("lambda * (1 - lambda / 2) / (pi + lambda)") ) ), class = c("singleRfamily", "family") ) }
A quick tests shows us that this implementation in fact works:
set.seed(123) Y <- simulate( myFamilyFunction(lambdaLink = "logit", piLink = "logit"), nsim = 1000, eta = matrix(0, nrow = 1000, ncol = 2), truncated = FALSE ) mm <- estimatePopsize( formula = Y ~ 1, data = data.frame(Y = Y[Y > 0]), model = myFamilyFunction(lambdaLink = "logit", piLink = "logit"), # the usual observed information matrix # is ill-suited for this distribution controlPopVar = controlPopVar(covType = "Fisher") ) summary(mm)
where the link functions, such as
singleRcapture:::singleRinternalcloglogLink
, are just internal
functions in singleRcapture
that compute link functions, their
inverses and derivatives of both links and inverse links up to the third
order:
singleRcapture:::singleRinternalcloglogLink
One could, of course, include the code for computing them manually.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.