This document is an R Markdown file. It is a great way to combine data analysis with reports. To learn more about R Markdown, check out the online book at https://bookdown.org/yihui/rmarkdown/.
In this session, we will use two simulated datasets to illustrate three different scenarios of small area estimation (SAE):
In the first example, we will use the simulated dataset DemoData2
in the SUMMER package.
library(SUMMER)
library(sp)
library(ggplot2)
library(gridExtra)
data(DemoData2)
head(DemoData2)
## clustid id region age weights strata tobacco.use
## 1 1 1 nairobi 30 1.1 nairobi.urban 0
## 2 1 3 nairobi 22 1.1 nairobi.urban 0
## 3 1 4 nairobi 42 1.1 nairobi.urban 0
## 4 2 4 nyanza 25 1.1 nyanza.urban 0
## 5 1 5 nairobi 25 1.1 nairobi.urban 0
## 6 1 6 nairobi 37 1.1 nairobi.urban 0
DemoData2
simulates a survey from a stratified cluster sampling design similar to the DHS surveys. Suppose we are interested in the prevalence of tobacco usage.
Let \(y_i\) and \(n_i\) denote the number of individuals using tobacco and the total number of people in areas i = 1, …, n, respectively. Ignoring the survey design, the naive estimate for the prevalence of tobacco usage is \[ \hat{p}_i = y_i / n_i, \] with associated variance being \(p_i(1 − p_i)/n_i\). This naive estimate can be easily calculated by tabulating the data.
naive <- aggregate(tobacco.use ~ region, data = DemoData2, FUN = mean)
Ignoring survey design for now, we may fit the following simple smoothing model for a simple random sample:
\[\begin{align} y_i | p_i &\sim \mbox{Binomial}(n_i, p_i), \\ \mbox{logit}(p_i) &= \mu + s_i + \epsilon_i, \end{align}\]
where the \(s_i\) and \(\epsilon_i\) are assumed to be following the ICAR and independent priors respectively. That is, the logit of prevalence can be parameterized by the BYM model.
We can fit this model with the fitGeneric
function in the SUMMER package. Notice that in order to perform the smoothing, we need to construct a spatial adjacency matrix of the regions. The associated map with this simulated dataset is the province map of Kenya, as included in the package as well.
data(DemoMap2)
smoothed <- fitGeneric(data = DemoData2, geo = DemoMap2$geo, Amat = DemoMap2$Amat,
responseType = "binary", responseVar = "tobacco.use", regionVar = "region",
strataVar = NULL, weightVar = NULL, clusterVar = NULL, CI = 0.95)
For this model, the smoothed estimates are saved in smoothed$smooth
, and the naive estimates are saved in smoothed$HT
.
head(smoothed$smooth)
## region time mean variance median lower upper mean.original
## 1 nairobi NA -3.5 0.023 -3.5 -3.8 -3.2 0.031
## 2 central NA -3.2 0.012 -3.2 -3.4 -3.0 0.040
## 3 coast NA -2.6 0.017 -2.6 -2.9 -2.4 0.067
## 4 eastern NA -3.2 0.024 -3.2 -3.5 -2.9 0.040
## 5 nyanza NA -3.4 0.030 -3.4 -3.7 -3.1 0.033
## 6 rift valley NA -2.9 0.026 -2.9 -3.2 -2.5 0.055
## variance.original median.original lower.original upper.original
## 1 0.000020 0.031 0.023 0.040
## 2 0.000017 0.039 0.032 0.048
## 3 0.000065 0.067 0.052 0.083
## 4 0.000034 0.039 0.029 0.051
## 5 0.000031 0.033 0.023 0.044
## 6 0.000069 0.054 0.040 0.072
head(smoothed$HT)
## HT.est HT.sd HT.variance HT.prec HT.est.original HT.variance.original
## 2 -3.5 0.17 0.029 35 0.028 0.000021
## 1 -3.2 0.12 0.014 72 0.039 0.000020
## 4 -2.5 0.12 0.015 65 0.073 0.000070
## 3 -3.2 0.18 0.032 31 0.038 0.000043
## 6 -3.4 0.20 0.040 25 0.031 0.000036
## 8 -2.7 0.17 0.027 37 0.061 0.000090
## n y region
## 2 1287 36 nairobi
## 1 1915 75 central
## 4 964 70 coast
## 3 843 32 eastern
## 6 839 26 nyanza
## 8 639 39 rift valley
Both the naive and smoothed estimates described before fail to account for the survey design and are thus subject to bias and incorrect variance estimation. To account for the different design weights associated with each sample, we can estimate the survey weighted estimates (the Horvitz-Thompson estimates) using the survey package.
library(survey)
design <- svydesign(ids = ~clustid + id, weights = ~weights, strata = ~strata,
data = DemoData2)
direct <- svyby(~tobacco.use, ~region, design, svymean)
head(direct)
## region tobacco.use se
## central central 0.043 0.0079
## nairobi nairobi 0.027 0.0057
## eastern eastern 0.035 0.0114
## coast coast 0.074 0.0088
## northeastern northeastern 0.037 0.0061
## nyanza nyanza 0.028 0.0067
Finally, we can smooth (the logit of) the weighted estimates directly, and obtain smoothed estimates that account for survey designs. That is, we fit the following model,
\[\begin{align}
\mbox{logit}(\hat{p_i}) &\sim \mbox{Normal}(\theta_i, \hat{V}_i), \\
\theta_i &= \mu + s_i + \epsilon_i,
\end{align}\]
where \(\hat{p_i}\) and \(\hat{V}_i\) are the design-based weighted estimate of the prevalence at region \(i\), and the associated variance of its logit, respectively. Using the fitGeneric
function, we now need to specify survey designs in order to fit this model
smoothweighted <- fitGeneric(data = DemoData2, geo = DemoMap2$geo, Amat = DemoMap2$Amat,
responseType = "binary", responseVar = "tobacco.use", regionVar = "region",
strataVar = "strata", weightVar = "weights", clusterVar = "~clustid+id",
CI = 0.95)
We can again obtain the Horvitz-Thompson estimates and the spatially smoothed estimates using
head(smoothweighted$HT)
## HT.est HT.sd HT.variance HT.prec HT.est.original HT.variance.original n
## 2 -3.6 0.21 0.046 21.9 0.027 0.000033 NA
## 1 -3.1 0.19 0.037 26.8 0.043 0.000062 NA
## 4 -2.5 0.13 0.017 59.9 0.074 0.000078 NA
## 3 -3.3 0.34 0.116 8.6 0.035 0.000129 NA
## 6 -3.5 0.24 0.060 16.7 0.028 0.000045 NA
## 8 -2.6 0.22 0.049 20.4 0.072 0.000217 NA
## y region
## 2 NA nairobi
## 1 NA central
## 4 NA coast
## 3 NA eastern
## 6 NA nyanza
## 8 NA rift valley
head(smoothweighted$smooth)
## region time mean variance median lower upper mean.original
## 1 nairobi NA -3.4 0.038 -3.4 -3.8 -3.1 0.032
## 2 central NA -3.1 0.029 -3.1 -3.4 -2.8 0.043
## 3 coast NA -2.6 0.017 -2.6 -2.9 -2.4 0.069
## 4 eastern NA -3.2 0.061 -3.2 -3.7 -2.7 0.040
## 5 nyanza NA -3.4 0.047 -3.4 -3.8 -3.0 0.033
## 6 rift valley NA -2.8 0.039 -2.8 -3.1 -2.4 0.060
## variance.original median.original lower.original upper.original
## 1 0.000036 0.032 0.022 0.045
## 2 0.000050 0.043 0.031 0.059
## 3 0.000068 0.068 0.054 0.086
## 4 0.000090 0.039 0.024 0.061
## 5 0.000047 0.032 0.021 0.048
## 6 0.000131 0.059 0.042 0.086
We now combine the four estimates and compare the results. First we construct a new data frame that hosts all four set of estimates and their variances.
prev <- NULL
prev <- rbind(prev, data.frame(region = smoothed$HT$region,
mean = smoothed$HT$HT.est.original,
var = smoothed$HT$HT.variance.original,
type = "Naive"))
prev <- rbind(prev, data.frame(region = smoothed$smooth$region,
mean = smoothed$smooth$mean.original,
var = smoothed$smooth$variance.original,
type = "Smoothed"))
prev <- rbind(prev, data.frame(region = smoothweighted$HT$region,
mean = smoothweighted$HT$HT.est.original,
var = smoothweighted$HT$HT.variance.original,
type = "Weighted"))
prev <- rbind(prev, data.frame(region = smoothweighted$smooth$region,
mean = smoothweighted$smooth$mean.original,
var = smoothweighted$smooth$variance.original,
type = "Smooth Weighted"))
We plot the estimates and variances on the map using function mapPlot
. The spatial smoothing and reduced variance can be easily seen from the following plots.
g1 <- mapPlot(prev, geo = DemoMap2$geo, by.data = "region", by.geo = "REGNAME",
variables = "type", values = "mean", is.long = TRUE, legend.label = "Estimates",
ncol = 4)
g2 <- mapPlot(prev, geo = DemoMap2$geo, by.data = "region", by.geo = "REGNAME",
variables = "type", values = "var", is.long = TRUE, legend.label = "Variance",
ncol = 4)
grid.arrange(g1, g2, ncol = 1)