This document is an R Markdown file. It is a great way to combine data analysis with reports. To learn more about R Markdown, check out the online book at https://bookdown.org/yihui/rmarkdown/.
In this session, we will use real data from Kenya 2014 DHS survey, to illustrate spatial-temporal smoothing of child mortality rates.
First, we load the package and the necessary data. INLA is not in a standard repository, so we check if it is available and install it if it is not installed. For this vignette, we used INLA version 19.09.03.
library(SUMMER)
library(ggplot2)
library(gridExtra)
The DHS data can be obtained from the DHS program website at https://dhsprogram.com/data/dataset/Kenya_Standard-DHS_2014. For the analysis of U5MR, we will use the Births Recode in .dta format. Notice that registration with the DHS program is required in order to access this dataset. The map files for this DHS can be freely downloaded from http://spatialdata.dhsprogram.com/boundaries/.
With both the DHS birth record data and the corresponding shapefiles saved in the local directory. We can load them into with packages readstata13
and rgdal
. We also automatically generates the spatial adjacency matrix Amat
using the function getAmat()
.
library(readstata13)
filename <- "../data/KEBR71DT/KEBR71FL.DTA"
births <- read.dta13(filename, generate.factors = TRUE)
library(rgdal)
mapfilename0 <- "../data/shapefiles/sdr_subnational_boundaries.shp"
geo0 <- readOGR(mapfilename0, verbose = FALSE)
mapfilename <- "../data/shapefiles/sdr_subnational_boundaries2.shp"
geo <- readOGR(mapfilename, verbose = FALSE)
Amat <- getAmat(geo, geo$REGNAME)
The Amat
matrix encodes the spatial adjacency matrix of the 47 counties, with column and row names matching the regions used in the map. This adjacency matrix will be used for the spatial smoothing model. It can also be created by hand if necessary.
We first demonstrate the method that smooths the direct estimates of subnational-level U5MR. For this analysis, we consider the \(8\) Admin-1 region groups. In order to calculate the direct estimates of U5MR, we need the full birth history data in the format so that every row corresponds to a birth and columns that contain:
v023
), cluster (v001
), and household (v002
)v025
)v008
)b3
)b5
)b7
)Since county labels are usually not in the DHS dataset, we now load the GPS location of each clusters and map them to the corresponding counties.
loc <- readOGR("../data//KEGE71FL/KEGE71FL.shp", verbose = FALSE)
loc.dat <- data.frame(cluster = loc$DHSCLUST, long = loc$LONGNUM, lat = loc$LATNUM)
gps <- mapPoints(loc.dat, geo = geo, long = "long", lat = "lat", names = c("REGNAME"))
colnames(gps)[4] <- "region"
gps0 <- mapPoints(loc.dat, geo = geo0, long = "long", lat = "lat", names = c("REGNAME"))
colnames(gps0)[4] <- "province"
gps <- merge(gps, gps0[, c("cluster", "province")])
sum(is.na(gps$region))
## [1] 9
Notice that there are \(9\) clusters that fall on the county boundaries and we need to manually assign them to a county based on best guess. In this example as an illustration, we remove these clusters without GPS coordinates.
unknown_cluster <- gps$cluster[which(is.na(gps$region))]
gps <- gps[gps$cluster %in% unknown_cluster == FALSE, ]
births <- births[births$v001 %in% unknown_cluster == FALSE, ]
births <- merge(births, gps[, c("cluster", "region", "province")], by.x = "v001",
by.y = "cluster", all.x = TRUE)
births$v024 <- births$region
The birth history data from DHS is already in this form and the getBirths
function default to use the current recode manual column names (as indicated above). The name of these fields can be defined explicitly in the function arguments too. We reorganize the data into the ‘person-month’ format with getBirths
function and reorder the columns for better readability.
dat <- getBirths(data = births, strata = c("v023"), year.cut = seq(1985, 2020,
by = 1))
dat <- dat[, c("v001", "v002", "v024", "time", "age", "v005", "strata", "died")]
colnames(dat) <- c("clustid", "id", "region", "time", "age", "weights", "strata",
"died")
years <- levels(dat$time)
head(dat)
## clustid id region time age weights strata died
## 1 1 6 nairobi 2009 0 5476381 1 0
## 2 1 6 nairobi 2009 1-11 5476381 1 0
## 3 1 6 nairobi 2009 1-11 5476381 1 0
## 4 1 6 nairobi 2009 1-11 5476381 1 0
## 5 1 6 nairobi 2009 1-11 5476381 1 0
## 6 1 6 nairobi 2009 1-11 5476381 1 0
Notice that we also need to specify the time intervals of interest. In this example, we with to calculate and predict U5MR in 5-year intervals from 1985-1990 to 2015-2019. For U5MR, we will use the discrete survival model to calculate direct estimates for each region and time. This step involves breaking down the age of each death into discrete intervals. The default option assumes a discrete survival model with six discrete hazards (probabilities of dying in a particular interval, given survival to the start of the interval) for each of the age bands: \([0,1), [1,12), [12,24), [24,36), [36,48)\), and \([48,60]\).
We may also calculate other types of mortality rates of interest using getBirths
. For example, for U1MR,
dat_infant <- getBirths(data = births, month.cut = c(1, 12), strata = c("v023"))
And the smoothing steps can be similarly carried out.
Using the person-month format data, we can calculate Horvitz-Thompson estimators using getDirect
for a single survey or getDirectList
for multiple surveys. The discrete hazards in each time interval are estimated using a logistic regression model, with weighting to account for the survey design. The direct estimates are then calculated using the discrete hazards. In order to correctly account for survey design, We need to specify the stratification and cluster variables. In the Kenya DHS example, a two-stage stratified cluster sampling design was used, where strata are specified in the strata
column, and clusters are specified by the cluster ID (clusterid
) and household ID (id
).
direct0 <- getDirect(births = dat, years = years, regionVar = "region", timeVar = "time",
clusterVar = "~clustid + id", ageVar = "age", weightsVar = "weights", geo.recode = NULL)
Sometimes additional information are available to adjust the direct estimates from the surveys. For example, in countries with high prevalence of HIV, estimates of U5MR can be biased, particularly before ART treatment became widely available. Pre-treatment HIV positive women had a high risk of dying, and such women who had given birth were therefore less likely to appear in surveys. The children of HIV positive women are also more likely to have a higher probability of dying compared to those born to HIV negative women. Thus we expect that the U5MR is underestimated if we do not adjust for the missing women.
Suppose we can obtain the ratio of the reported U5MR to the true U5MR, \(r_{it}\), at region \(i\) and time period \(t\), we can apply the adjustment factor to the direct estimates and the associated variances. The HIV adjustment factors were calculated for the 2014 Kenya DHS survey and included in the package.
data(KenData)
adj <- KenData$HIV2014.yearly
colnames(adj) <- c("years", "province", "ratio")
head(adj)
## years province ratio
## 1 2014 All 0.99
## 2 2013 All 0.98
## 3 2012 All 0.98
## 4 2011 All 0.97
## 5 2010 All 0.97
## 6 2009 All 0.96
The KenData$HIV2014.yearly
data frame contains HIV adjustment factors at both national and province levels. So we will create another column in direct0
that codes the province each county belongs to.
matched <- match(direct0$region, gps$region)
direct0$province <- as.character(gps[matched, "province"])
direct0$province[direct0$region == "All"] <- "All"
direct0$logit.lower <- logit(direct0$lower)
direct0$logit.upper <- logit(direct0$upper)
direct <- getAdjusted(data = direct0, ratio = adj, time = "years", region = "province",
adj = "ratio", logit.lower = "logit.lower", logit.upper = "logit.upper",
lower = "lower", upper = "upper")
The direct estimates calculated using getDirect
contains both national and subnational estimates for the \(47\) regions, over the \(35\) years from 1985 to 2019. We first fit a model with temporal random effects only to smooth the national estimates over time. In this part, we use the subset of data region variable being “All”.
fit0 <- fitINLA(data = direct, geo = NULL, Amat = NULL, year_label = years,
year_range = c(1985, 2019), rw = 2, m = 1)
The marginal posteriors are already stored in the fitted object. We use the following function to extract and re-arrange them.
out0 <- getSmoothed(fit0, year_range = c(1985, 2019), year_label = years)
We can compare the results visually. Notice to correctly display the period estimates, the reference year in each period needs to be specified. Here we simply take the median year in each period.
random.time <- getDiag(fit0, field = "time", year_label = years)
plot(random.time, is.subnational = FALSE) + facet_grid(~label) + ggtitle("Compare temporal random effects: National Model") +
ylab("Random Effects")
The national model also allows us to benchmark the estimates using other published national results. For example, we take the 2019 UN-IGME estimates and calculate the ratio of the estimates from national models to the published UN estimates. We will use this adjustment ratio to correct the bias from our direct estimates. We organize the adjustment ratios into a matrix of two columns, since the adjustment factor only varies over time. We can then perform the benchmarking to the UN estimates similar to the HIV adjustment before.
UN <- KenData$IGME2019
ratio <- out0$median[1:34]/UN$mean[34:67]
adj.benchmark <- data.frame(years = out0$years[1:34], ratio = ratio)
direct <- getAdjusted(data = direct, ratio = adj.benchmark, time = "years",
region = "province", adj = "ratio", logit.lower = "logit.lower", logit.upper = "logit.upper",
lower = "lower", upper = "upper")
After benchmarking, we can fit the smoothing model again on the adjusted direct estimates, and see if they align with the UN estimates.
fit0.benchmark <- fitINLA(data = direct, geo = NULL, Amat = NULL, year_label = years,
year_range = c(1985, 2019), rw = 2, m = 1)
out0.benchmark <- getSmoothed(fit0.benchmark, year_range = c(1985, 2019), year_label = years)
g1 <- plot(out0, year_label = years, year_med = as.numeric(years), is.subnational = FALSE,
data.add = UN, option.add = list(point = "mean"), label.add = "UN", color.add = "orange") +
ggtitle("National Smoothing Model: Before Benchmarking") + ylim(c(0, 0.14))
g2 <- plot(out0.benchmark, year_label = years, year_med = as.numeric(years),
is.subnational = FALSE, data.add = UN, option.add = list(point = "mean"),
label.add = "UN", color.add = "orange") + ggtitle("National Smoothing Model: After Benchmarking") +
ylim(c(0, 0.14))
grid.arrange(g1, g2, ncol = 2)