vignettes/extending-srvyr.Rmd
I don’t expect this vignette to be helpful for most srvyr users; it is
instead intended for other package developers. An exciting feature, made
easier now that I have reworked srvyr’s non-standard evaluation to match
dplyr 0.7+, is that non-srvyr functions can be called from within
summarize. This vignette describes some of the inner workings of
summarize so that others can extend srvyr.
This is kind of a fiddly part of srvyr, and I don’t expect that many
people will want or need to understand it, so this guide is mostly aimed
at package authors who already have an understanding of how survey
objects work. If you’d like more explanation, please let me know on GitHub!
This guide has also been rewritten for srvyr 1.0, as I had to rework summarize and was unable to maintain backwards compatibility.
srvyr implements the “survey statistics” functions from the survey
package. Some examples are svymean, svytotal, svyciprop, svyquantile
and svyratio, which return a svystat (or similar) object. This object
usually prints the estimate and its standard error, and other estimates
of the variance can be calculated from it. In srvyr, these estimates are
created inside of a summarize call and the variance estimates are
specified at the same time.
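
To make the contrast concrete, here is a small sketch using the api data that ships with the survey package (the dataset and the design specification are my choice for illustration, not something this vignette depends on). In survey, the variance estimates are pulled out of the returned object afterwards; in srvyr, they are requested through vartype in the same call.

```r
library(survey)
library(srvyr)

data(api, package = "survey")

# survey: svymean() returns a svystat object ...
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw, data = apistrat)
stat <- svymean(~api00, dstrat)
class(stat)
# ... and variance estimates are extracted from it after the fact
SE(stat)
confint(stat)

# srvyr: the estimate and its variance estimates are requested together
srvyr_dstrat <- as_survey_design(apistrat, strata = stype, weights = pw)
res <- summarize(
  srvyr_dstrat,
  api00 = survey_mean(api00, vartype = c("se", "ci"))
)
res
```

The point estimate and standard error should agree between the two approaches. The confidence intervals may differ slightly, since the two packages do not necessarily use the same default degrees of freedom.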
The combination of srvyr’s group_by and summarize is analogous to the
svyby function, which applies one of the survey statistic functions to
multiple groups. However, as of srvyr 1.0, srvyr no longer uses svyby;
instead, the survey object is split into each group’s data.
srvyr’s summarize expects that the survey statistics functions will return objects that are formatted in a particular way. Below, I’ll explain some of the functions that will help create these objects for you in most cases, but the return should be a srvyr_result_df object (which is just a wrapper around a data.frame).
srvyr now exports several functions that can help convert functions designed for the survey package to this format.
cur_svy() - This function, modeled after dplyr::current_vars(), is a hidden way to send the survey object to the function (by hidden, I mean that the user doesn’t have to specify the survey in the arguments of their function call). To use it, you can call cur_svy() directly from inside your function. This survey includes only the current group’s survey data.

cur_svy_full() - Like cur_svy(), but includes the full survey data instead of just the current group’s data.

cur_svy_wts() - This helper function provides access to the full-sample weights for the current group’s data.

set_survey_vars() - Many survey functions accept either a formula indicating the variables to calculate a statistic on or a vector, but the vector version is often less well supported than the formula version. Since srvyr uses dplyr semantics, it ends up passing the values as vectors. This function adds the vector to the survey as a variable, by default named “__SRVYR_TEMP_VAR__”, so that it can be referenced in a formula.

get_var_est() - A helper function that calculates variance estimates like standard error (se), confidence interval (ci), variance (var), or coefficient of variation (cv). For functions that support it, there is a separate argument for design effects (to match survey’s conventions).

as_srvyr_result_df() - A helper function that adds the srvyr_result_df class to a data.frame.

Note that these functions may not work in all cases. In srvyr, I’ve
actually had to write multiple versions of get_var_est()
because of minor differences in the way survey objects are returned.
Hopefully they will help in most situations, or at least give you a good
place to start.
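
As a quick illustration of the context helpers, here is a minimal sketch (again using the api data from the survey package, which is my choice of example) that calls cur_svy_wts() with plain R code inside summarize. Summing the current group’s weights should match the group size that srvyr’s own survey_total() estimates when called without a variable.

```r
library(srvyr)

data(api, package = "survey")

dstrata <- as_survey_design(apistrat, strata = stype, weights = pw)

res <- dstrata %>%
  group_by(stype) %>%
  summarize(
    # Plain R applied to the current group's full-sample weights
    sum_wts = sum(cur_svy_wts()),
    # srvyr's estimate of the number of units in each group
    n_hat = survey_total()
  )
res
```

Note that sum_wts is an ordinary number with no variance estimate attached, while n_hat comes with an n_hat_se column because survey_total() returns a srvyr_result_df.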
Two less important conventions that srvyr functions follow are:
That was a lot of text, so I think it’s probably easiest to provide an
example. The convey package provides several methods for analyzing
inequality using survey data. Its svygini function calculates the Gini
coefficient. Here, we’ll write a srvyr version, survey_gini.
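# srvyr-style wrapper around convey::svygini()
survey_gini <- function(
  x, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), ...
) {
  # Default to the standard error when vartype is not supplied
  if (missing(vartype)) vartype <- "se"
  vartype <- match.arg(vartype, several.ok = TRUE)
  # Attach x to the current group's survey as `__SRVYR_TEMP_VAR__`
  .svy <- srvyr::set_survey_vars(srvyr::cur_svy(), x)
  out <- convey::svygini(~`__SRVYR_TEMP_VAR__`, na.rm = na.rm, design = .svy)
  # Compute the requested variance estimates and tag the result for summarize
  out <- srvyr::get_var_est(out, vartype)
  srvyr::as_srvyr_result_df(out)
}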
And here’s what this function looks like in practice:
# Example from ?convey::svygini
suppressPackageStartupMessages({
  library(srvyr)
  library(survey)
  library(convey)
  library(laeken)
})
data(eusilc)
names(eusilc) <- tolower(names(eusilc))
# Setup for survey package
des_eusilc <- svydesign(
  ids = ~rb030,
  strata = ~db040,
  weights = ~rb050,
  data = eusilc
)
des_eusilc <- convey_prep(des_eusilc)
# Setup for srvyr package
srvyr_eusilc <- eusilc %>%
  as_survey(
    ids = rb030,
    strata = db040,
    weights = rb050
  ) %>%
  convey_prep()
## Ungrouped
# Calculate ungrouped for survey package
svygini(~eqincome, design = des_eusilc)
#>             gini     SE
#> eqincome 0.26497 0.0019
# Use new function from summarize
srvyr_eusilc %>%
  summarize(eqincome = survey_gini(eqincome))
#> # A tibble: 1 × 2
#>   eqincome eqincome_se
#>      <dbl>       <dbl>
#> 1    0.265     0.00195
## Groups
# Calculate by groups for survey
survey::svyby(~eqincome, ~rb090, des_eusilc, convey::svygini)
#>         rb090  eqincome          se
#> male     male 0.2578983 0.002617279
#> female female 0.2702080 0.002892713
# Use new function from summarize
srvyr_eusilc %>%
  group_by(rb090) %>%
  summarize(eqincome = survey_gini(eqincome))
#> # A tibble: 2 × 3
#>   rb090  eqincome eqincome_se
#>   <fct>     <dbl>       <dbl>
#> 1 male      0.258     0.00262
#> 2 female    0.270     0.00289
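
One way to sanity-check a wrapper like this is to apply the same pattern to a statistic that srvyr already implements and compare the results. The sketch below does that with survey::svytotal. The name my_survey_total is hypothetical and exists only for this illustration (srvyr already ships survey_total()), and the api data from the survey package is my choice of example. It also shows that several variance types can be requested at once through vartype.

```r
library(srvyr)

data(api, package = "survey")

# Hypothetical wrapper, for illustration only: same pattern as survey_gini,
# but around survey::svytotal() so it can be compared with survey_total()
my_survey_total <- function(
  x, na.rm = FALSE, vartype = c("se", "ci", "var", "cv")
) {
  if (missing(vartype)) vartype <- "se"
  vartype <- match.arg(vartype, several.ok = TRUE)
  .svy <- srvyr::set_survey_vars(srvyr::cur_svy(), x)
  out <- survey::svytotal(~`__SRVYR_TEMP_VAR__`, design = .svy, na.rm = na.rm)
  out <- srvyr::get_var_est(out, vartype)
  srvyr::as_srvyr_result_df(out)
}

dstrata <- as_survey_design(apistrat, strata = stype, weights = pw)

res <- dstrata %>%
  group_by(stype) %>%
  summarize(
    mine = my_survey_total(enroll, vartype = c("se", "cv")),
    builtin = survey_total(enroll, vartype = c("se", "cv"))
  )
res
```

If the wrapper is set up correctly, the mine and builtin columns (and their _se counterparts) should match in every group.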