# A tibble: 20 × 9
       pid survey    n_children_in_household expense_toys_children
     <dbl> <chr>                       <dbl>                 <dbl>
 1 5910295 Survey 03                      NA                    NA
 2 5910429 Survey 03                      NA                  665.
 3 5910295 Survey 04                       0                    NA
 4 5910429 Survey 04                       1                    NA
 5 5910295 Survey 05                      NA                    NA
 6 5910429 Survey 05                      NA                  650.
 7 5910295 Survey 06                      NA                    NA
 8 5910429 Survey 06                      NA                  949.
 9 5910295 Survey 07                      NA                    NA
10 5910429 Survey 07                      NA                  987.
11 5910295 Survey 08                       0                    NA
12 5910429 Survey 08                       1                    NA
13 5910295 Survey 09                      NA                    90
14 5910429 Survey 09                      NA                  165.
15 5910295 Survey 10                      NA                    NA
16 5910429 Survey 10                      NA                  899.
17 5910295 Survey 11                      NA                    NA
18 5910429 Survey 11                      NA                  932.
19 5910295 Survey 12                       0                    NA
20 5910429 Survey 12                       2                    NA
# ℹ 5 more variables: expense_school_children <dbl>,
#   expense_care_children <dbl>, expense_food <dbl>, expense_insurance <dbl>,
#   expense_recreation <dbl>
vlightr
Conditionally format vectors of any class in R using {cli} text formatting. {vlightr} makes interactive data explication easier, by allowing elements of vectors and dataframe columns to be found with ease.
Where’s Waldo
For the past few years I have worked behind the scenes writing analysis code for this study (and its friends). Over the course of the study, 3,000 participants received dozens of surveys, some annual, others monthly, and a few daily, which comprised thousands of questions. Many weeks, it was my job to comb through this survey data and bring my boss the most suspicious looking observations (potential typos, data collection errors, contradicting responses, and the like).
During our weekly Zoom meetings, screen-sharing my tiny RStudio console, I frantically live coded to filter() and select() ever smaller subsets of data while saying things like:
“participant ID 5910295 responded expense_toys_children of $90 in Survey 09, but in Survey 08 and Survey 12 said they had n_children_in_household == 0, so the children expense questions should have been skipped”.
Squinting at the <tibble> I had printed, hoping I’d said the correct participant ID, my boss and I would go back and forth about which row or column I was talking about - I’d un-filter to look at all of a participant’s survey data and then re-filter to spotlight the problematic observation. More difficult still was asynchronous data-sharing, which involved many screenshots of data shared over Slack and Google Docs, annotated with clip-art arrows and informative labels such as “this one” or “see, no response here”.
Play I spy and find the problem I’ve described. Note that n_children_in_household is only asked in Surveys 4, 8, and 12, while the expense questions are asked in all of the other surveys.
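For readers who would rather not squint, the check described above can be sketched in base R. This is only an illustration: the toy data frame copies the rows for participant 5910295 shown in this post (two of the nine columns), and fill_forward() is a hypothetical helper, not something from the study's code.

```r
# Toy data: the rows for participant 5910295, keeping two of the nine columns.
survey_data <- data.frame(
  pid = 5910295,
  survey = sprintf("Survey %02d", 3:12),
  n_children_in_household = c(NA, 0, NA, NA, NA, 0, NA, NA, NA, 0),
  expense_toys_children = c(NA, NA, NA, NA, NA, NA, 90, NA, NA, NA)
)

# Hypothetical helper: carry the most recent non-missing value forward.
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))      # how many non-missing values seen so far
  c(NA, x[!is.na(x)])[idx + 1]  # rows before the first answer stay NA
}

# The household size is only asked in some surveys, so use the last known answer.
last_count <- fill_forward(survey_data$n_children_in_household)

# Flag a child expense reported while the last known child count was zero.
suspicious <- !is.na(survey_data$expense_toys_children) &
  !is.na(last_count) & last_count == 0

survey_data[suspicious, c("pid", "survey", "expense_toys_children")]
```

With more than one participant, the carry-forward step would need to be applied within each pid rather than over the whole column.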
My Digital Highlighter
Several months into my potentially-problematic-data scavenger hunt I came across Davis Vaughan’s {ivs} package. {ivs}, powered by the {vctrs} package, implements an <ivs_iv> “vector-super-class” which can turn many generic vectors in R into interval vectors. Here’s an example of {ivs} in action, creating both a date interval (similar to an <Interval> in {lubridate}) and an integer interval.
# Date interval
ivs::iv(
  start = as.Date(c("2020-01-01", "2020-02-01")),
  end = as.Date(c("2020-01-05", "2020-02-12"))
)
<iv<date>[2]>
[1] [2020-01-01, 2020-01-05) [2020-02-01, 2020-02-12)
# Integer interval
ivs::iv(start = 1:3, end = 4:6)
<iv<integer>[3]>
[1] [1, 4) [2, 5) [3, 6)
Inspired by Vaughan’s work, I created my own much-less-robust vector super-class, the <highlight> vector. Below is more-or-less the full original implementation.
# Creates a new vector of class <highlight> containing a vector `x`,
# an equal length vector of locations `at`, and a `highlighter` function.
highlight <- function(x, at, highlighter = cli::col_yellow) {
  data <- if (inherits(x, "highlight")) vctrs::field(x, "data") else x
  at[is.na(at)] <- FALSE

  vctrs::new_rcrd(
    fields = list(data = data, at = at),
    highlighter = highlighter,
    class = "highlight"
  )
}

# The `format()` method of a <highlight> formats its underlying data
# and then highlights elements at the locations specified by `at`.
format.highlight <- function(x, ...) {
  at <- vctrs::field(x, "at")
  data <- vctrs::field(x, "data")
  highlighter <- attr(x, "highlighter")

  out <- format(data, ...)
  out[at] <- highlighter(out[at])
  out
}

# Nicely display the type of a highlighted vector in a <tibble>
vec_ptype_abbr.highlight <- function(x, ...) {
  data <- vctrs::field(x, "data")
  paste0("hl<", vctrs::vec_ptype_abbr(data), ">")
}
Harnessing the magic of {vctrs}, these twenty-ish lines of code allow us to modify the format() method of nearly any vector in R¹. Rather than playing Where’s Waldo with my boss, this allowed me to quickly highlight() any observation in a survey dataset.
library(dplyr, warn.conflicts = FALSE)

# simulate_survey() is a helper whose definition is not shown in this post
survey_data <- simulate_survey()
survey_data |>
  filter(pid == 5910295) |>
  mutate(across(everything(), ~highlight(.x, grepl("(8|9|12)$", survey))))
# A tibble: 10 × 9
         pid    survey n_children_in_household expense_toys_children
   <hl<dbl>> <hl<chr>>               <hl<dbl>>             <hl<dbl>>
 1   5910295 Survey 03                      NA                    NA
 2   5910295 Survey 04                       0                    NA
 3   5910295 Survey 05                      NA                    NA
 4   5910295 Survey 06                      NA                    NA
 5   5910295 Survey 07                      NA                    NA
 6   5910295 Survey 08                       0                    NA
 7   5910295 Survey 09                      NA                    90
 8   5910295 Survey 10                      NA                    NA
 9   5910295 Survey 11                      NA                    NA
10   5910295 Survey 12                       0                    NA
# ℹ 5 more variables: expense_school_children <hl<dbl>>,
#   expense_care_children <hl<dbl>>, expense_food <hl<dbl>>,
#   expense_insurance <hl<dbl>>, expense_recreation <hl<dbl>>
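The core trick, overriding format() so that chosen locations are styled, can also be tried without {vctrs} or {cli}. The sketch below is a dependency-free analogue of that idea, not the code above: mark() and the "marked" class are hypothetical names, and the yellow colouring is written as raw ANSI escape codes, so it only shows up as colour in a terminal that supports them.

```r
# Hypothetical base-R analogue: keep the data and the locations in a list.
mark <- function(x, at) {
  at[is.na(at)] <- FALSE
  structure(list(data = x, at = at), class = "marked")
}

# format() method: format the data, then wrap the marked elements in
# ANSI codes ("\033[33m" starts yellow text, "\033[39m" resets the colour).
format.marked <- function(x, ...) {
  out <- format(x$data, ...)
  out[x$at] <- paste0("\033[33m", out[x$at], "\033[39m")
  out
}

# print() method: show the formatted elements on one line.
print.marked <- function(x, ...) {
  cat(format(x, ...), "\n")
  invisible(x)
}

m <- mark(c(10, 250, 30), at = c(FALSE, TRUE, FALSE))
print(m)
```

Unlike the {vctrs} record, this plain list cannot be used as a data frame column in any sensible way, which is a large part of why building on {vctrs} is attractive.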
The {vlightr} Package
The highlight() function serves its purpose admirably, but I couldn’t resist the urge to slap some sick flame decals on it. The {vlightr} package implements a fully featured version of highlight() with all the requisite bells and whistles.
# devtools::install_github("EthanSansom/vlightr")
library(vlightr)

# Highlight numbers greater than 5
highlighted <- highlight(c(9, 0, -1), .t = ~ .x > 5, .f = color("violet"))
print(highlighted)
<highlight<double>[3]>
[1] 9 0 -1
The vlightr::highlight() function takes a vector as its first argument, a vectorized² test function or lambda .t as its second, and a formatter function .f as its third. Unlike the <highlight> of my youth, you can actually do things with a <vlightr_highlight>.
sort(c(highlighted, hl(2:8))) # `hl()` is short for `highlight()`
<highlight<double>[10]>
 [1] -1 0 2 3 4 5 6 7 8 9
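To make the roles of the test and the formatter concrete, here is a rough base-R sketch of the pattern. It is not how {vlightr} is implemented: format_where() and shout() are hypothetical names, and shout() stands in for a {cli} style function so the result is easy to read as plain text.

```r
# Hypothetical helper, NOT part of {vlightr}: format `x`, then apply
# `formatter` to the elements for which `test(x)` is TRUE.
format_where <- function(x, test, formatter) {
  at <- test(x)
  # A vectorized test returns one TRUE/FALSE/NA per element of `x`.
  stopifnot(is.logical(at), length(at) == length(x))
  at[is.na(at)] <- FALSE
  out <- format(x)
  out[at] <- formatter(out[at])
  out
}

# Stand-in formatter: append "!" instead of colouring the text.
shout <- function(txt) paste0(txt, "!")

formatted <- format_where(c(9, 0, -1), test = function(x) x > 5, formatter = shout)
formatted
```

Only the first element passes the test, so it is the only one that gets the extra "!".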
Variants highlight_mult() and highlight_case() provide a switch() or dplyr::case_when() style interface for supplying multiple conditional formats.
indicator <- highlight_case(
  c(0, 1, NA, 9),
  0 ~ label("No"),
  1 ~ label("Yes"),
  is.na ~ color("red"),
  true ~ cli::style_bold(paste(.x, "[?]"))
)
print(indicator)
<highlight_case<double>[4]>
[1] 0 [No] 1 [Yes] NA 9 [?]
The left-hand-side argument of each formula ~ may be a function or a syntactic literal³ and the right-hand-side a formatter function. true() here is a function which always returns TRUE⁴.
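One way to picture the formula interface is as a loop over cases in which the first matching case wins, much like dplyr::case_when(). The sketch below is a guess at the general shape in base R, not {vlightr}'s implementation: format_case(), tag(), and always() are hypothetical names, and the right-hand sides are limited to plain formatter functions.

```r
# Hypothetical sketch: each case is a two-sided formula `lhs ~ rhs`.
format_case <- function(x, ...) {
  out <- format(x)
  done <- rep(FALSE, length(x))  # elements already claimed by an earlier case
  for (f in list(...)) {
    lhs <- eval(f[[2]], environment(f))  # a function or a literal
    rhs <- eval(f[[3]], environment(f))  # a formatter function
    hit <- if (is.function(lhs)) lhs(x) else x == lhs
    hit <- !is.na(hit) & hit & !done     # NA never matches; first case wins
    out[hit] <- rhs(out[hit])
    done <- done | hit
  }
  out
}

# Stand-ins for the label() generator and the true() test.
tag <- function(text) function(txt) paste0(txt, " [", text, "]")
always <- function(x) rep(TRUE, length(x))

cased <- format_case(
  c(0, 1, NA, 9),
  0 ~ tag("No"),
  1 ~ tag("Yes"),
  is.na ~ tag("missing"),
  always ~ tag("?")
)
cased
```

Because the catch-all case comes last, it only touches the elements that no earlier case matched.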
{vlightr} also comes with a handful of generator functions to help quickly style text.
labelled <- label("A label")
missing <- color("red")
important <- style("bold")
rainbow <- color_rep(c(
  "red", "orange", "gold", "green", "blue", "purple", "violet"
))

highlight_case(
  c("Label", "Where?", "Ah!", "Imagination"),
  "Label" ~ labelled,
  "Where?" ~ missing,
  "Ah!" ~ important,
  "Imagination" ~ rainbow
) |> print(width = 10)
<highlight_case<character>[4]>
[1] Label [A label] Where? Ah! Imagination
I’ll admit that color_rep() does not a critical data-analysis tool make, but after seeing Danielle Navarro’s R startup message I was determined to support rainbow-styled text.
Since {vlightr} will most likely be used for quick data-exploration, I’ve added shorthand versions, hl(), hl_mult(), and hl_case(), of highlight() and its variants. Take care to use these time savings responsibly.
library(rlang)
library(purrr, warn.conflicts = FALSE)

starwars |>
  mutate(
    eye_color = hl(
      .x = eye_color,
      .t = true,
      .f = ~map_chr(.x, \(x) try_fetch(color(x)(x), error = \(cnd) x))
    ),
    height = hl_mult(height, .x == max(.x) ~ label("max")),
    species = hl_case(species, "Human" ~ "💪", "Droid" ~ "🦾")
  ) |>
  select(name, height, eye_color, species) |>
  head(10)
# A tibble: 10 × 4
   name                  height eye_color    species
   <chr>              <hl<int>> <hl<chr>> <hlc<chr>>
 1 Luke Skywalker           172 blue      💪
 2 C-3PO                    167 yellow    🦾
 3 R2-D2                     96 red       🦾
 4 Darth Vader        202 [max] yellow    💪
 5 Leia Organa              150 brown     💪
 6 Owen Lars                178 blue      💪
 7 Beru Whitesun lars       165 blue      💪
 8 R5-D4                     97 red       🦾
 9 Biggs Darklighter        183 brown     💪
10 Obi-Wan Kenobi           182 blue-gray 💪
For a more restrained approach, use templight(), which implements my original idea for highlighting vectors by location.
survey_data |>
  filter(pid == 5910295) |>
  mutate(across(everything(), ~templight(.x, grepl("(8|9|12)$", survey))))
# A tibble: 10 × 9
         pid    survey n_children_in_household expense_toys_children
   <vlghtr_t> <vlghtr_t>              <vlghtr_t>            <vlghtr_t>
 1    5910295  Survey 03                      NA                    NA
 2    5910295  Survey 04                       0                    NA
 3    5910295  Survey 05                      NA                    NA
 4    5910295  Survey 06                      NA                    NA
 5    5910295  Survey 07                      NA                    NA
 6    5910295  Survey 08                       0                    NA
 7    5910295  Survey 09                      NA                    90
 8    5910295  Survey 10                      NA                    NA
 9    5910295  Survey 11                      NA                    NA
10    5910295  Survey 12                       0                    NA
# ℹ 5 more variables: expense_school_children <vlghtr_t>,
#   expense_care_children <vlghtr_t>, expense_food <vlghtr_t>,
#   expense_insurance <vlghtr_t>, expense_recreation <vlghtr_t>
Post-Script
I’m still working on the {vlightr} package website and adding {testthat} unit tests in preparation for a CRAN submission later this year.
Footnotes
1. This comes with the minor sacrifice of destroying all non-format-related functionality of the highlighted vector.
2. {vlightr} stands for “vector-highlighter”, so instead of supplying a predicate .p, as in purrr::map_if(), we provide a test .t.
3. Literals are more-or-less the set of symbols used for creating scalar atomic vectors, e.g. FALSE, NA_real_, "Hello", 12L.
4. I’m hoping to capitalize on the muscle memory developed from using this default argument pattern: dplyr::case_when(if_this ~ that, TRUE ~ default).