vignettes/srvyr-database.Rmd
srvyr-database.Rmd
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 4.3.3
Srvyr 0.3 has a completely rewritten database backend. Using
databases that are already stored dplyr’s tbl_lazy
objects
is now just as easy as working with data stored in regular R data.frames
as you don’t need to have a unique identifier. Additionally, it now
works more similarly to the survey package’s database code and so
shouldn’t be any slower.
During development, I have tested using SQLite (and the now defunct MonetDBLite) databases, but in theory other database backends should work as well.
This vignette shows the basics of how to use srvyr with databases. It is based on analysis from the wonderful resource asdfree ( website and github ). Many thanks to ajdamico and collaborators. Specifically, I have adapted code from American Community Survey - 2011 single year analysis and the associated data preparation scripts.
In order to focus on srvyr and databases, we start with a prepared dataset. The full code is available on Github, and the high level description of what it does is:
Download data from acs website (currently only Alaska and Hawaii to save time, though it would be easy to adapt to download to all 50 states and Puerto Rico).
Merges the household and person datasets so that we can look at the variables related to each person including those at the household level
Selects only a few variables that will be used in this analysis to save space, but again it could easily be adapted to keep all of the variables.
For more information on the specifics of the American Community Survey, see the asdfree site. Now, our code loads this prepared dataset, initiates a SQLite database, and puts the data into the dataset.
suppressMessages({
library(survey)
library(srvyr)
library(dplyr)
library(dbplyr)
library(RSQLite)
})
# Load data (Currently only Alaska and Hawaii to keep file size small and with
# limited variables, butcode that downloaded the files is available here
# https://github.com/gergness/srvyr/blob/main/vignettes/save_acs_data.R
# and could easily be adapted to download all states.)
load("acs_m.Rdata") # acs_m data
# Set up database and table
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
acs_m_db <- copy_to(db, acs_m, "acs_m", temporary = FALSE)
# Or, if the data was already stored in the database, you could do this
# acs_m_data <- tbl(db, sql("SELECT * FROM acs_m"))
Now that we have the data in the database, we can interact with the
database directly using sql commands, or we can use dplyr’s
functionality to treat it mostly the same as a local
data.frame
. However, the data is not stored in memory, so
we could work with much larger datasets (though in this case, the data
is too small for this to be a problem).
## # A tibble: 2 × 2
## sex hicov
## <int> <dbl>
## 1 1 1.14
## 2 2 1.10
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
## # Source: SQL [2 x 2]
## # Database: sqlite 3.46.0 [:memory:]
## sex hicov
## <int> <dbl>
## 1 1 1.14
## 2 2 1.10
# But smaller object size
object.size(acs_m)
## 7777312 bytes
object.size(acs_m_db)
## 10904 bytes
Note that though many commands behave exactly the same whether on a local data.frame or database, sometimes more advanced / complicated syntax around variable modification allowed in dplyr does not work on a particular database and so it is better to be more explicit. For example, creating a variable inside of a summarize call does not work in some databases. .
## # A tibble: 2 × 2
## sex hicov
## <int> <dbl>
## 1 1 0.858
## 2 2 0.895
# Works in RSQlite, but didn't in the now defunct MonetDB
# acs_m_db %>%
# group_by(sex) %>%
# summarize(hicov = mean(hicov == 1))
#
# > Error in .local(conn, statement, ...) :
# > Unable to execute statement....
# > ....
# Creating the variable separately works as an integer works though
acs_m_db %>%
group_by(sex) %>%
mutate(hicov = ifelse(hicov == 1, 1L, 0L)) %>%
summarize(hicov = mean(hicov))
## # Source: SQL [2 x 2]
## # Database: sqlite 3.46.0 [:memory:]
## sex hicov
## <int> <dbl>
## 1 1 0.858
## 2 2 0.895
Further, sometimes working with variable types can get difficult if
you are used to working in R. Notice how in the above, instead of
hicov = (hicov == 1)
, I wrote out the ifelse statement. If
I hadn’t RSQLite would be unable to calculate the mean of the boolean
variable created.
Finally, a major difference when transitioning from dplyr on local
data.frames is that not all R functions are translated to SQL. For
example, cut()
isn’t implemented in SQL, so you can’t
create a new variable in the data.frame using it.
acs_m %>%
group_by(agecat = cut(agep, c(0, 19, 35, 50, 65, 200))) %>%
summarize(hicov = mean(hicov == 1))
## # A tibble: 6 × 2
## agecat hicov
## <fct> <dbl>
## 1 (0,19] 0.908
## 2 (19,35] 0.795
## 3 (35,50] 0.837
## 4 (50,65] 0.873
## 5 (65,200] 0.992
## 6 NA 0.953
# acs_m_db %>%
# group_by(agecat = cut(agep, c(0, 19, 35, 50, 65, 200))) %>%
# summarize(hicov = mean(hicov == 1))
#
# > Error in .local(conn, statement, ...) :
# > Unable to execute statement....
# > ...
acs_m_db %>%
mutate(agecat = ifelse(agep < 19, "0-18",
ifelse(agep >= 19 & agep < 35, "19-34",
ifelse(agep >= 35 & agep < 50, "35-49",
ifelse(agep >= 50 & agep < 65, "50-64",
ifelse(agep >= 65, "65+", NA)))))) %>%
group_by(agecat) %>%
summarize(hicov = mean(hicov))
## # Source: SQL [5 x 2]
## # Database: sqlite 3.46.0 [:memory:]
## agecat hicov
## <chr> <dbl>
## 1 0-18 1.09
## 2 19-34 1.20
## 3 35-49 1.17
## 4 50-64 1.13
## 5 65+ 1.01
For more information on the specifics of databases with dplyr, see
vignette("database", package = "dplyr")
, the
DBI
package or the specific database packages, like
RSQLite
.
Srvyr commands are nearly identical to old. The only difference for setup is that you need a variable that uniquely identifies each row in the database (uid).
acs_m_db_svy <- acs_m_db %>%
as_survey_rep(
weight = pwgtp,
repweights = matches("pwgtp[0-9]+") ,
scale = 4 / 80,
rscales = rep(1 , 80),
mse = TRUE,
type = "JK1",
variables = -c(matches("^pwgtp"))
)
acs_m_db_svy
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 80 replicates and MSE variances.
## Sampling variables:
## - repweights: `pwgtp1 + pwgtp2 + pwgtp3 + pwgtp4 + pwgtp5 + pwgtp6 + pwgtp7 +
## pwgtp8 + pwgtp9 + pwgtp10 + pwgtp11 + pwgtp12 + pwgtp13 + pwgtp14 + pwgtp15
## + pwgtp16 + pwgtp17 + pwgtp18 + pwgtp19 + pwgtp20 + pwgtp21 + pwgtp22 +
## pwgtp23 + pwgtp24 + pwgtp25 + pwgtp26 + pwgtp27 + pwgtp28 + pwgtp29 +
## pwgtp30 + pwgtp31 + pwgtp32 + pwgtp33 + pwgtp34 + pwgtp35 + pwgtp36 +
## pwgtp37 + pwgtp38 + pwgtp39 + pwgtp40 + pwgtp41 + pwgtp42 + pwgtp43 +
## pwgtp44 + pwgtp45 + pwgtp46 + pwgtp47 + pwgtp48 + pwgtp49 + pwgtp50 +
## pwgtp51 + pwgtp52 + pwgtp53 + pwgtp54 + pwgtp55 + pwgtp56 + pwgtp57 +
## pwgtp58 + pwgtp59 + pwgtp60 + pwgtp61 + pwgtp62 + pwgtp63 + pwgtp64 +
## pwgtp65 + pwgtp66 + pwgtp67 + pwgtp68 + pwgtp69 + pwgtp70 + pwgtp71 +
## pwgtp72 + pwgtp73 + pwgtp74 + pwgtp75 + pwgtp76 + pwgtp77 + pwgtp78 +
## pwgtp79 + pwgtp80`
## - weights: pwgtp
## Data variables:
## - serialno (int), sporder (chr), pwgtp (int), pwgtp1 (int), pwgtp2 (int),
## pwgtp3 (int), pwgtp4 (int), pwgtp5 (int), pwgtp6 (int), pwgtp7 (int),
## pwgtp8 (int), pwgtp9 (int), pwgtp10 (int), pwgtp11 (int), pwgtp12 (int),
## pwgtp13 (int), pwgtp14 (int), pwgtp15 (int), pwgtp16 (int), pwgtp17 (int),
## pwgtp18 (int), pwgtp19 (int), pwgtp20 (int), pwgtp21 (int), pwgtp22 (int),
## pwgtp23 (int), pwgtp24 (int), pwgtp25 (int), pwgtp26 (int), pwgtp27 (int),
## pwgtp28 (int), pwgtp29 (int), pwgtp30 (int), pwgtp31 (int), pwgtp32 (int),
## pwgtp33 (int), pwgtp34 (int), pwgtp35 (int), pwgtp36 (int), pwgtp37 (int),
## pwgtp38 (int), pwgtp39 (int), pwgtp40 (int), pwgtp41 (int), pwgtp42 (int),
## pwgtp43 (int), pwgtp44 (int), pwgtp45 (int), pwgtp46 (int), pwgtp47 (int),
## pwgtp48 (int), pwgtp49 (int), pwgtp50 (int), pwgtp51 (int), pwgtp52 (int),
## pwgtp53 (int), pwgtp54 (int), pwgtp55 (int), pwgtp56 (int), pwgtp57 (int),
## pwgtp58 (int), pwgtp59 (int), pwgtp60 (int), pwgtp61 (int), pwgtp62 (int),
## pwgtp63 (int), pwgtp64 (int), pwgtp65 (int), pwgtp66 (int), pwgtp67 (int),
## pwgtp68 (int), pwgtp69 (int), pwgtp70 (int), pwgtp71 (int), pwgtp72 (int),
## pwgtp73 (int), pwgtp74 (int), pwgtp75 (int), pwgtp76 (int), pwgtp77 (int),
## pwgtp78 (int), pwgtp79 (int), pwgtp80 (int), agep (int), hicov (int), sex
## (int), st (chr), rt (chr)
Because srvyr stores the survey variables locally, the srvyr object takes up much more memory than the dplyr one. However, this object would not grow in size if you added more data variables to your survey, so if your survey is very wide, it will save a lot space.
object.size(acs_m_db_svy)
## 8391936 bytes
For very large survey data sets with replicate weights, it is
strongly recommended to specify a value in the degf
parameter of as_survey_rep()
: otherwise, the survey package
will attempt to automatically determine the design degrees of freedom
using a process that can be very slow for large data sets. The value to
use for degf
will often be specified in survey data
documentation, but a common rule of thumb is the number of columns of
replicate weights.
Analysis commands from srvyr are also similar to ones that work on local data.frames. The main differences come from the issues discussed above about explicitly creating variables difficulties in translating R commands, and variable types.
The following analysis is based on the asdfree analysis and shows some basic analysis on the total population, insurance coverage, age and sex.
# You can calculate the population of the united states #
# by state
acs_m_db_svy %>%
mutate(one = 1L) %>% # Note that because of weird behavior of MonetDB, need to use 1L not just 1
group_by(st) %>%
summarize(count = survey_total(one))
## Adding missing grouping variables: `st`
## # A tibble: 2 × 3
## st count count_se
## <chr> <dbl> <dbl>
## 1 02 722718 0
## 2 15 1374810 0
# Or the average age of downloaded states
acs_m_db_svy %>%
summarize(agep = survey_mean(agep, na.rm = TRUE))
## # A tibble: 1 × 2
## agep agep_se
## <dbl> <dbl>
## 1 37.4 0.0532
# Average age by state
acs_m_db_svy %>%
group_by(st) %>%
summarize(agep = survey_mean(agep, na.rm = TRUE))
## Adding missing grouping variables: `st`
## # A tibble: 2 × 3
## st agep agep_se
## <chr> <dbl> <dbl>
## 1 02 34.6 0.133
## 2 15 38.8 0.0404
# percent uninsured - nationwide (of downloaded states)
acs_m_db_svy %>%
mutate(hicov = as.character(hicov)) %>%
group_by(hicov) %>%
summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `hicov`
## Warning: There was 1 warning in `dplyr::summarise()`.
## ℹ In argument: `pct = survey_mean(na.rm = TRUE)`.
## ℹ In group 1: `hicov = "1"`.
## Caused by warning:
## ! na.rm argument has no effect on survey_mean when calculating grouped proportions.
## This warning is displayed once per session.
## # A tibble: 2 × 3
## hicov pct pct_se
## <chr> <dbl> <dbl>
## 1 1 0.885 0.00339
## 2 2 0.115 0.00339
# by state
acs_m_db_svy %>%
mutate(hicov = as.character(hicov)) %>%
group_by(st, hicov) %>%
summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `st` and `hicov`
## # A tibble: 4 × 4
## # Groups: st [2]
## st hicov pct pct_se
## <chr> <chr> <dbl> <dbl>
## 1 02 1 0.802 0.00793
## 2 02 2 0.198 0.00793
## 3 15 1 0.928 0.00383
## 4 15 2 0.0719 0.00383
# 25th, median, and 75th percentile of age of residents of the united states (downloaded states)
acs_m_db_svy %>%
summarize(agep = survey_quantile(agep, c(0.25, 0.5, 0.75), na.rm = TRUE))
## # A tibble: 1 × 6
## agep_q25 agep_q50 agep_q75 agep_q25_se agep_q50_se agep_q75_se
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 36 55 0.251 0.251 0.251
# Filter works, so we can restrict the acs_m object to females only
acs_m_db_svy_female <- acs_m_db_svy %>%
filter(sex == 2)
# Now any of the above commands can be re-run using the acs_m_female object
# instead of the acs_m object in order to analyze females only
# This is equivalent to using acs_m, and applying the filter every time.
# average age - nationwide (of downloaded states), restricted to females
acs_m_db_svy_female %>%
summarize(agep = survey_mean(agep, na.rm = TRUE))
## # A tibble: 1 × 2
## agep agep_se
## <dbl> <dbl>
## 1 38.2 0.103
# median age - nationwide (of downloaded states), restricted to females
acs_m_db_svy_female %>%
summarize(agep = survey_median(agep, na.rm = TRUE))
## # A tibble: 1 × 2
## agep agep_se
## <dbl> <dbl>
## 1 37 0.251
# Note that though some R functions are translated by dplyr into SQL, not
# all of them are. For example, when constructing a new age category
# variable in the dataset neither findIntervals nor cut work on databases,
# so we have to spell out groups with ifelse()
acs_m_db_svy %>%
mutate(agecat = ifelse(agep < 19, "0-18",
ifelse(agep >= 19 & agep < 35, "19-34",
ifelse(agep >= 35 & agep < 50, "35-49",
ifelse(agep >= 50 & agep < 65, "50-64",
ifelse(agep >= 65, "65+", NA)))))) %>%
group_by(agecat) %>%
summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `agecat`
## # A tibble: 5 × 3
## agecat pct pct_se
## <chr> <dbl> <dbl>
## 1 0-18 0.248 0.00101
## 2 19-34 0.228 0.00172
## 3 35-49 0.198 0.00176
## 4 50-64 0.202 0.00134
## 5 65+ 0.126 0.000980
If you’d like to run a command from the survey package, you’ll need to collect the data locally first. You can select only the variables you’ll need for the analysis so that you don’t have to store the whole dataset in memory.
acs_m_db_svy %>%
select(agep, hicov, sex) %>%
collect() %>%
{survey::svyglm(hicov ~ sex + agep, .)} %>%
summary()
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
##
## Call:
## survey::svyglm(formula = hicov ~ sex + agep, .)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1882942 0.0101068 117.573 < 2e-16 ***
## sex -0.0309300 0.0048065 -6.435 9.56e-09 ***
## agep -0.0007165 0.0001143 -6.269 1.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1015023)
##
## Number of Fisher Scoring iterations: 2
Note that srvyr does not require write access to perform calculations, the database created in this vignette was set to read-only at the beginning. This can be important when you want to make sure that your original data is not altered accidentally, or if you don’t have write access to a database.