This document is an R Markdown file. It is a great way to combine data analysis with reports. To learn more about R Markdown, check out the online book at https://bookdown.org/yihui/rmarkdown/.

In this session, we will use two simulated datasets to illustrate three different scenarios of small area estimation (SAE):

  • Spatial smoothing of the prevalence of a binary indicator.
  • Space-time smoothing of neonatal mortality rates (NMR).
  • Space-time smoothing of under-5 mortality rates (U5MR).

Spatial smoothing: the generic case

In the first example, we will use the simulated dataset DemoData2 in the SUMMER package.

library(SUMMER)
library(sp)
library(ggplot2)
library(gridExtra)
data(DemoData2)
head(DemoData2)
##   clustid id  region age weights        strata tobacco.use
## 1       1  1 nairobi  30     1.1 nairobi.urban           0
## 2       1  3 nairobi  22     1.1 nairobi.urban           0
## 3       1  4 nairobi  42     1.1 nairobi.urban           0
## 4       2  4  nyanza  25     1.1  nyanza.urban           0
## 5       1  5 nairobi  25     1.1 nairobi.urban           0
## 6       1  6 nairobi  37     1.1 nairobi.urban           0

DemoData2 simulates a survey from a stratified cluster sampling design similar to the DHS surveys. Suppose we are interested in the prevalence of tobacco usage.

Naive estimates

Let \(y_i\) and \(n_i\) denote the number of individuals using tobacco and the total number of people in areas i = 1, …, n, respectively. Ignoring the survey design, the naive estimate for the prevalence of tobacco usage is \[ \hat{p}_i = y_i / n_i, \] with associated variance being \(p_i(1 − p_i)/n_i\). This naive estimate can be easily calculated by tabulating the data.

naive <- aggregate(tobacco.use ~ region, data = DemoData2, FUN = mean)

Smoothed (naive) estimates

Ignoring survey design for now, we may fit the following simple smoothing model for a simple random sample:

\[\begin{align} y_i | p_i &\sim \mbox{Binomial}(n_i, p_i), \\ \mbox{logit}(p_i) &= \mu + s_i + \epsilon_i, \end{align}\]

where the \(s_i\) and \(\epsilon_i\) are assumed to be following the ICAR and independent priors respectively. That is, the logit of prevalence can be parameterized by the BYM model.

We can fit this model with the fitGeneric function in the SUMMER package. Notice that in order to perform the smoothing, we need to construct a spatial adjacency matrix of the regions. The associated map with this simulated dataset is the province map of Kenya, as included in the package as well.

data(DemoMap2)
smoothed <- fitGeneric(data = DemoData2, geo = DemoMap2$geo, Amat = DemoMap2$Amat, 
    responseType = "binary", responseVar = "tobacco.use", regionVar = "region", 
    strataVar = NULL, weightVar = NULL, clusterVar = NULL, CI = 0.95)

For this model, the smoothed estimates are saved in smoothed$smooth, and the naive estimates are saved in smoothed$HT.

head(smoothed$smooth)
##        region time mean variance median lower upper mean.original
## 1     nairobi   NA -3.5    0.023   -3.5  -3.8  -3.2         0.031
## 2     central   NA -3.2    0.012   -3.2  -3.4  -3.0         0.040
## 3       coast   NA -2.6    0.017   -2.6  -2.9  -2.4         0.067
## 4     eastern   NA -3.2    0.024   -3.2  -3.5  -2.9         0.040
## 5      nyanza   NA -3.4    0.030   -3.4  -3.7  -3.1         0.033
## 6 rift valley   NA -2.9    0.026   -2.9  -3.2  -2.5         0.055
##   variance.original median.original lower.original upper.original
## 1          0.000020           0.031          0.023          0.040
## 2          0.000017           0.039          0.032          0.048
## 3          0.000065           0.067          0.052          0.083
## 4          0.000034           0.039          0.029          0.051
## 5          0.000031           0.033          0.023          0.044
## 6          0.000069           0.054          0.040          0.072
head(smoothed$HT)
##   HT.est HT.sd HT.variance HT.prec HT.est.original HT.variance.original
## 2   -3.5  0.17       0.029      35           0.028             0.000021
## 1   -3.2  0.12       0.014      72           0.039             0.000020
## 4   -2.5  0.12       0.015      65           0.073             0.000070
## 3   -3.2  0.18       0.032      31           0.038             0.000043
## 6   -3.4  0.20       0.040      25           0.031             0.000036
## 8   -2.7  0.17       0.027      37           0.061             0.000090
##      n  y      region
## 2 1287 36     nairobi
## 1 1915 75     central
## 4  964 70       coast
## 3  843 32     eastern
## 6  839 26      nyanza
## 8  639 39 rift valley

Weighted estimates

Both the naive and smoothed estimates described before fail to account for the survey design and are thus subject to bias and incorrect variance estimation. To account for the different design weights associated with each sample, we can estimate the survey weighted estimates (the Horvitz-Thompson estimates) using the survey package.

library(survey)
design <- svydesign(ids = ~clustid + id, weights = ~weights, strata = ~strata, 
    data = DemoData2)
direct <- svyby(~tobacco.use, ~region, design, svymean)
head(direct)
##                    region tobacco.use     se
## central           central       0.043 0.0079
## nairobi           nairobi       0.027 0.0057
## eastern           eastern       0.035 0.0114
## coast               coast       0.074 0.0088
## northeastern northeastern       0.037 0.0061
## nyanza             nyanza       0.028 0.0067

Smoothed and weighted estimates

Finally, we can smooth (the logit of) the weighted estimates directly, and obtain smoothed estimates that account for survey designs. That is, we fit the following model,

\[\begin{align} \mbox{logit}(\hat{p_i}) &\sim \mbox{Normal}(\theta_i, \hat{V}_i), \\ \theta_i &= \mu + s_i + \epsilon_i, \end{align}\]
where \(\hat{p_i}\) and \(\hat{V}_i\) are the design-based weighted estimate of the prevalence at region \(i\), and the associated variance of its logit, respectively. Using the fitGeneric function, we now need to specify survey designs in order to fit this model

smoothweighted <- fitGeneric(data = DemoData2, geo = DemoMap2$geo, Amat = DemoMap2$Amat, 
    responseType = "binary", responseVar = "tobacco.use", regionVar = "region", 
    strataVar = "strata", weightVar = "weights", clusterVar = "~clustid+id", 
    CI = 0.95)

We can again obtain the Horvitz-Thompson estimates and the spatially smoothed estimates using

head(smoothweighted$HT)
##   HT.est HT.sd HT.variance HT.prec HT.est.original HT.variance.original  n
## 2   -3.6  0.21       0.046    21.9           0.027             0.000033 NA
## 1   -3.1  0.19       0.037    26.8           0.043             0.000062 NA
## 4   -2.5  0.13       0.017    59.9           0.074             0.000078 NA
## 3   -3.3  0.34       0.116     8.6           0.035             0.000129 NA
## 6   -3.5  0.24       0.060    16.7           0.028             0.000045 NA
## 8   -2.6  0.22       0.049    20.4           0.072             0.000217 NA
##    y      region
## 2 NA     nairobi
## 1 NA     central
## 4 NA       coast
## 3 NA     eastern
## 6 NA      nyanza
## 8 NA rift valley
head(smoothweighted$smooth)
##        region time mean variance median lower upper mean.original
## 1     nairobi   NA -3.4    0.038   -3.4  -3.8  -3.1         0.032
## 2     central   NA -3.1    0.029   -3.1  -3.4  -2.8         0.043
## 3       coast   NA -2.6    0.017   -2.6  -2.9  -2.4         0.069
## 4     eastern   NA -3.2    0.061   -3.2  -3.7  -2.7         0.040
## 5      nyanza   NA -3.4    0.047   -3.4  -3.8  -3.0         0.033
## 6 rift valley   NA -2.8    0.039   -2.8  -3.1  -2.4         0.060
##   variance.original median.original lower.original upper.original
## 1          0.000036           0.032          0.022          0.045
## 2          0.000050           0.043          0.031          0.059
## 3          0.000068           0.068          0.054          0.086
## 4          0.000090           0.039          0.024          0.061
## 5          0.000047           0.032          0.021          0.048
## 6          0.000131           0.059          0.042          0.086

We now combine the four estimates and compare the results. First we construct a new data frame that hosts all four set of estimates and their variances.

prev <- NULL
prev <- rbind(prev, data.frame(region = smoothed$HT$region, 
                               mean = smoothed$HT$HT.est.original,
                               var = smoothed$HT$HT.variance.original,
                               type = "Naive"))
prev <- rbind(prev, data.frame(region = smoothed$smooth$region, 
                               mean = smoothed$smooth$mean.original, 
                               var = smoothed$smooth$variance.original,
                               type = "Smoothed"))
prev <- rbind(prev, data.frame(region = smoothweighted$HT$region, 
                               mean = smoothweighted$HT$HT.est.original,
                               var = smoothweighted$HT$HT.variance.original,
                               type = "Weighted"))
prev <- rbind(prev, data.frame(region = smoothweighted$smooth$region, 
                               mean = smoothweighted$smooth$mean.original, 
                               var = smoothweighted$smooth$variance.original,
                               type = "Smooth Weighted"))

We plot the estimates and variances on the map using function mapPlot. The spatial smoothing and reduced variance can be easily seen from the following plots.

g1 <- mapPlot(prev, geo = DemoMap2$geo, by.data = "region", by.geo = "REGNAME", 
    variables = "type", values = "mean", is.long = TRUE, legend.label = "Estimates", 
    ncol = 4)
g2 <- mapPlot(prev, geo = DemoMap2$geo, by.data = "region", by.geo = "REGNAME", 
    variables = "type", values = "var", is.long = TRUE, legend.label = "Variance", 
    ncol = 4)
grid.arrange(g1, g2, ncol = 1)