knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dev = "png", fig.width = 7, fig.height = 3.5, message = FALSE, warning = FALSE ) options(width = 800) pkgs <- c( "ggplot2", "splines", "gridExtra", "patchwork", "see", "survival", "datawizard", "htmltools" ) if (!all(vapply(pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1L)))) { knitr::opts_chunk$set(eval = FALSE) }
library(htmltools) callout_tip <- function(header = NULL, ...) { div( class = "callout-tip", tags$h1( tags$img(src = "../man/figures/summary.png", width = "20", height = "17", style = "vertical-align:middle"), # nolint header ), ... ) } includeCSS("../man/figures/callout.css")
After fitting a model, it is useful generate model-based estimates (expected values, or adjusted predictions) of the response variable for different combinations of predictor values. Such estimates can be used to make inferences about relationships between variables - adjusted predictions tell you: what is the expected ouctome for certain values or levels of my predictors? Even for complex models, the visualization of marginal means or adjusted predictions is far easier to understand and allows to intuitively get the idea of how predictors and outcome are associated.
There are three major goals that you can achieve with ggeffects: computing marginal means and adjusted predictions, testing these predictions for statistical significance, and creating figures (plots). What you basically would need for your workflow is: predict_response()
, test_predictions()
and plot()
.
callout_tip( "Summary of most important points:", tags$ul( tags$li("The aim of the ", tags$em("ggeffects-package"), " is to understand your model and look at how predictors are associated with the outcome. This is achieved by calculating adjusted predictions, using the ", tags$code("predict_response()"), " function."), # nolint tags$li(tags$code("predict_response()"), "estimates the outcome for meaningful values of predictors of interest (so-called ", tags$em("focal terms"), ")."), # nolint tags$li("The interpretation of the results, or the conclusion that can be drawn, also depend on how the non-focal terms are handled. This is controlled by the ", tags$code("margin"), "argument."), # nolint tags$li("The syntax is easy and intuitive: Just provide a model object, specify focal terms in the ", tags$code("terms"), " argument, and optionally specify the ", tags$code("margin"), " argument. This works for simple effects as well as more complex interaction effects.") # nolint ) )
ggeffects computes marginal means and adjusted predictions at the mean (MEM), at representative values (MER) or averaged across predictors (so called focal terms) from statistical models. The result is returned as data frame with consistent structure, especially for further use with ggplot.
At least one focal term needs to be specified for which the effects are computed. It is also possible to compute adjusted predictions for focal terms, grouped by the levels of another model's predictor. The package also allows plotting adjusted predictions for two-, three- or four-way-interactions, or for specific values of a focal term only. Examples are shown below.
predict_response()
is actually a wrapper around three "workhorse" functions, ggpredict()
, ggemmeans()
and ggaverage()
. Depending on the value of the margin
argument, predict_response()
calls one of those functions, with different arguments. The margin
argument indicates how to marginalize over the non-focal predictors, i.e. those variables that are not specified in terms
.
It is important to know, which question you would like to answer. See the following options for the margin
argument and which question is answered by each option:
"mean_reference"
and "mean_mode"
: "mean_reference"
calls ggpredict()
, i.e. non-focal predictors are set to their mean (numeric variables), reference level (factors), or "most common" value (mode) in case of character vectors. "mean_mode"
calls ggpredict(typical = c(numeric = "mean", factor = "mode"))
, i.e. non-focal predictors are set to their mean (numeric variables) or mode (factors, or "most common" value in case of character vectors).Predictions based on "mean_reference"
and "mean_mode"
represent a rather "theoretical" view on your data, which does not necessarily exactly reflect the characteristics of your sample. It helps answer the question, "What is the predicted (or: expected) value of the response at meaningful values or levels of my focal terms for a 'typical' observation in my data?", where 'typical' refers to certain characteristics of the remaining predictors.
"marginalmeans"
: calls ggemmeans()
, i.e. non-focal predictors are set to their mean (numeric variables) or marginalized over the levels or "values" for factors and character vectors. Marginalizing over the factor levels of non-focal terms computes a kind of "weighted average" for the values at which these terms are hold constant. Thus, non-focal categorical terms are conditioned on "weighted averages" of their levels. There are different weighting options that can be chosen with the weights
argument."marginalmeans"
comes closer to the sample, because it takes all possible values and levels of your non-focal predictors into account. It would answer thr question, "What is the predicted (or: expected) value of the response at meaningful values or levels of my focal terms for an 'average' observation in my data?". It refers to randomly picking a subject of your sample and the result you get on average.
"empirical"
(or on of its aliases, "counterfactual"
or "average"
): calls ggaverage()
, i.e. non-focal predictors are marginalized over the observations in your sample. The response is predicted for each subject in the data and predicted values are then averaged across all subjects, aggregated/grouped by the focal terms. In particular, averaging is applied to counterfactual predictions (Dickerman and Hernan 2020). There is a more detailed description in this vignette."empirical"
is probably the most "realistic" approach, insofar as the results can also be transferred to other contexts. It answers the question, "What is the predicted (or: expected) value of the response at meaningful values or levels of my focal terms for the 'average' observation in the population?". It does not only refer to the actual data in your sample, but also "what would be if" we had more data, or if we had data from a different population. This is where "counterfactual" refers to.
You can set a default-option for the margin
argument via options()
, e.g. options(ggeffects_margin = "empirical")
, so you don't have to specify your "default" marginalization method each time you call predict_response()
. Use options(ggeffects_margin = NULL)
to remove that setting.
The condition
argument can be used to fix non-focal terms to specific values.
By default, predict_response()
always returns predicted values for the response of a model (or response distribution for Bayesian models).
Typically, predict_response()
(or ggpredict()
) returns confidence intervals based on the standard errors as returned by the predict()
-function, assuming normal distribution (+/- 1.96 * SE
) resp. a Student's t-distribuion (if degrees of freedom are available). If predict()
for a certain model object does not return standard errors (for example, merMod-objects), these are calculated manually, by following steps: matrix-multiply X
by the parameter vector B
to get the predictions, then extract the variance-covariance matrix V
of the parameters and compute XVX'
to get the variance-covariance matrix of the predictions. The square-root of the diagonal of this matrix represent the standard errors of the predictions, which are then multiplied by the critical test-statistic value (e.g., ~1.96 for normal distribuion) for the confidence intervals.
The returned data frames always have the same, consistent structure and column names, so it's easy to create ggplot-plots without the need to re-write the arguments to be mapped in each ggplot-call. x
and predicted
are the values for the x- and y-axis. conf.low
and conf.high
could be used as ymin
and ymax
aesthetics for ribbons to add confidence bands to the plot. group
can be used as grouping-aesthetics, or for faceting.
If the original variable names are desired as column names, there is an as.data.frame()
method for objects of class ggeffects
, which has an terms_to_colnames
argument, to use the variable names as column names instead of the standardized names "x"
etc.
The examples shown here mostly use ggplot2-code for the plots, however, there is also a plot()
-method, which is described in the vignette Plotting Adjusted Predictions.
predict_response()
computes predicted values for all possible levels and values from model's predictors that are defined as focal terms. In the simplest case, a fitted model is passed as first argument, and the focal term as second argument. Use the raw name of the variable for the terms
-argument only - you don't need to write things like poly(term, 3)
or I(term^2)
for the terms
-argument.
library(ggeffects) data(efc, package = "ggeffects") fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc) predict_response(fit, terms = "c12hour")
As you can see, predict_response()
(or their lower-level functions ggpredict()
, ggeffect()
, ggaverage()
or ggemmeans()
) has a nice print()
method, which takes care of printing not too many rows (but always an equally spaced range of values, including minimum and maximum value of the term in question) and giving some extra information. This is especially useful when predicted values are shown depending on the levels of other terms (see below).
The output shows the predicted values for the response at each value from the term c12hour. The data is already in shape for ggplot:
library(ggplot2) theme_set(theme_bw()) mydf <- predict_response(fit, terms = "c12hour") ggplot(mydf, aes(x, predicted)) + geom_line()
The terms
argument accepts up to four model terms, where the second to fourth terms indicate grouping levels. This allows predictions for the term in question at different levels or values for other focal terms:
predict_response(fit, terms = c("c12hour", "c172code"))
Creating a ggplot is pretty straightforward: the colour
aesthetics is mapped with the group
column:
mydf <- predict_response(fit, terms = c("c12hour", "c172code")) ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Another focal term would stratify the result and will create another column named facet
, which - as the name implies - might be used to create a facted plot:
mydf <- predict_response(fit, terms = c("c12hour", "c172code", "c161sex")) # print a more compact table print(mydf, collapse_tables = TRUE) ggplot(mydf, aes(x, predicted, colour = group)) + geom_line() + facet_wrap(~facet)
Finally, a third differentation can be defined, creating another column named panel
. In such cases, you may create multiple plots (for each value in panel
). ggeffects takes care of this when you use plot()
and automatically creates an integrated plot with all panels in one figure.
mydf <- predict_response(fit, terms = c("c12hour", "c172code", "c161sex", "neg_c_7")) plot(mydf) + theme(legend.position = "bottom")
If the term
argument is either missing or NULL
, adjusted predictions for each model term are calculated. The result is returned as a list, which can be plotted manually (or using the plot()
function).
mydf <- predict_response(fit) mydf
Here we show examples of interaction terms, however, this section applies in general to using many focal terms. You can plot up to five focal terms. For all of these examples, you can easily use the plot()
-method. ggplot2 is just used to show how to create plots from scratch.
To plot the adjusted predictions of interaction terms, simply specify these terms in the terms
argument.
data(efc, package = "ggeffects") # make categorical efc$c161sex <- datawizard::to_factor(efc$c161sex) # fit model with interaction fit <- lm(neg_c_7 ~ c12hour + barthtot * c161sex, data = efc) # select only levels 30, 50 and 70 from continuous variable Barthel-Index mydf <- predict_response(fit, terms = c("barthtot [30,50,70]", "c161sex")) ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Since the terms
argument accepts up to five focal terms, you can also compute adjusted predictions for a 3-way-, 4-way- or 5-way-interaction. To plot the adjusted predictions of three interaction terms, just like before, specify all three terms in the terms
argument.
# fit model with 3-way-interaction fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex, data = efc) # select only levels 30, 50 and 70 from continuous variable Barthel-Index mydf <- predict_response(fit, terms = c("c12hour", "barthtot [30,50,70]", "c161sex")) ggplot(mydf, aes(x, predicted, colour = group)) + geom_line() + facet_wrap(~facet)
4-way-interactions, or more generally: four focal terms, will be plotted in a grid layout. The first focal term is plotted on the x-axis. The second focal term is mapped to different colors (groups) and appears in the legend. The third focal term is mapped to columns, and the fourth focal term is mapped to rows.
# fit model with 4-way-interaction fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex * c172code, data = efc) # adjusted predictions for all 4 interaction terms pr <- predict_response(fit, c("c12hour", "barthtot", "c161sex", "c172code")) # use plot() method, easier than own ggplot-code from scratch plot(pr) + theme(legend.position = "bottom")
5-way-interactions are rather confusing to print and plot. When plotting, multiple plots (for each level of the fifth interaction term) are plotted for the remaining four focal terms. Note that for five focal terms, n_rows
can be used to arrange the "sub-plots".
# fit model with 5-way-interaction fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex * c172code * e42dep, data = efc) # adjusted predictions for all 5 interaction terms pr <- predict_response(fit, c("c12hour", "barthtot", "c161sex", "c172code", "e42dep")) # use plot() method, easier than own ggplot-code from scratch plot(pr, n_rows = 2) + theme(legend.position = "bottom")
predict_response()
also works for models with polynomial terms or splines. Following code reproduces the plot from ?splines::bs
:
library(splines) data(women) fm1 <- lm(weight ~ bs(height, df = 5), data = women) dat <- predict_response(fm1, "height") ggplot(dat, aes(x, predicted)) + geom_line() + geom_point()
predict_response()
also supports coxph
-models from the survival-package and is able to either plot risk-scores (the default), probabilities of survival (type = "survival"
) or cumulative hazards (type = "cumulative_hazard"
).
Since probabilities of survival and cumulative hazards are changing across time, the time-variable is automatically used as x-axis in such cases, so the terms
argument only needs up to two variables for type = "survival"
or type = "cumulative_hazard"
.
library(survival) data("lung2") m <- coxph(Surv(time, status) ~ sex + age + ph.ecog, data = lung2) # predicted risk-scores predict_response(m, c("sex", "ph.ecog"))
# probability of survival predict_response(m, c("sex", "ph.ecog"), type = "survival")
ggeffects makes use of the sjlabelled-package and supports labelled data. If the data from the fitted models is labelled, the value and variable label attributes are usually copied to the model frame stored in the model object. ggeffects provides various getter-functions to access these labels, which are returned as character vector and can be used in ggplot's lab()
- or scale_*()
-functions.
get_title()
- a generic title for the plot, based on the model family, like "predicted values" or "predicted probabilities"get_x_title()
- the variable label of the first model term in terms
.get_y_title()
- the variable label of the response.get_legend_title()
- the variable label of the second model term in terms
.get_x_labels()
- value labels of the first model term in terms
.get_legend_labels()
- value labels of the second model term in terms
.The data frame returned by predict_response()
must be used as argument to one of the above function calls.
get_x_title(mydf) get_y_title(mydf) ggplot(mydf, aes(x, predicted, colour = group)) + geom_line() + facet_wrap(~facet) + labs( x = get_x_title(mydf), y = get_y_title(mydf), colour = get_legend_title(mydf) )
Dickerman BA, Hernan, MA. Counterfactual prediction is not only for causal inference. Eur J Epidemiol 35, 615–617 (2020).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.