vlightr

featured
package
{vctrs}

Conditionally format vectors of any class in R using {cli} text formatting. {vlightr} makes interactive data exploration easier by making elements of vectors and data frame columns easy to find.

Published

January 30, 2025


Where’s Waldo

For the past few years I have worked behind the scenes writing analysis code for this study (and its friends). Over the course of the study, 3,000 participants received dozens of surveys, some annual, others monthly, and a few daily, which comprised thousands of questions. Many weeks, it was my job to comb through this survey data and bring my boss the most suspicious looking observations (potential typos, data collection errors, contradicting responses, and the like).

During our weekly Zoom meetings, screen-sharing my tiny RStudio console, I frantically live-coded to filter() and select() ever smaller subsets of data while saying things like:

“participant ID 5910295 responded expense_toys_children of $90 in Survey 09, but in Survey 08 and Survey 12 said they had n_children_in_household == 0, so the children expense questions should have been skipped”.

Squinting at the <tibble> I had printed, hoping I’d said the correct participant ID, my boss and I would go back and forth about which row or column I meant. I’d un-filter to look at all of a participant’s survey data and then re-filter to spotlight the problematic observation. More difficult still was asynchronous data sharing, which involved many screenshots of data shared over Slack and Google Docs, annotated with clip-art arrows and informative labels such as “this one” or “see, no response here”.

Play I spy and find the problem I’ve described. Note that n_children_in_household is only asked in Surveys 4, 8, and 12, while expense questions are asked in all of the other surveys.

# A tibble: 20 × 9                                                              
       pid survey    n_children_in_household expense_toys_children              
     <dbl> <chr>                       <dbl>                 <dbl>              
 1 5910295 Survey 03                      NA                   NA               
 2 5910429 Survey 03                      NA                  665.              
 3 5910295 Survey 04                       0                   NA               
 4 5910429 Survey 04                       1                   NA               
 5 5910295 Survey 05                      NA                   NA               
 6 5910429 Survey 05                      NA                  650.              
 7 5910295 Survey 06                      NA                   NA               
 8 5910429 Survey 06                      NA                  949.              
 9 5910295 Survey 07                      NA                   NA               
10 5910429 Survey 07                      NA                  987.              
11 5910295 Survey 08                       0                   NA               
12 5910429 Survey 08                       1                   NA               
13 5910295 Survey 09                      NA                   90               
14 5910429 Survey 09                      NA                  165.              
15 5910295 Survey 10                      NA                   NA               
16 5910429 Survey 10                      NA                  899.              
17 5910295 Survey 11                      NA                   NA               
18 5910429 Survey 11                      NA                  932.              
19 5910295 Survey 12                       0                   NA               
20 5910429 Survey 12                       2                   NA               
# ℹ 5 more variables: expense_school_children <dbl>,                            
#   expense_care_children <dbl>, expense_food <dbl>, expense_insurance <dbl>,   
#   expense_recreation <dbl>                                                    

My Digital Highlighter

Several months into my potentially-problematic-data scavenger hunt I came across Davis Vaughan’s {ivs} package. {ivs}, powered by the {vctrs} package, implements an <ivs_iv> “vector-super-class” which can turn many generic vectors in R into interval vectors. Here’s an example of {ivs} in action, creating both a date interval (similar to an <Interval> in {lubridate}) and an integer interval.

# Date interval
ivs::iv(
  start = as.Date(c("2020-01-01", "2020-02-01")), 
  end = as.Date(c("2020-01-05", "2020-02-12"))
)
<iv<date>[2]>
[1] [2020-01-01, 2020-01-05) [2020-02-01, 2020-02-12)
# Integer interval
ivs::iv(start = 1:3, end = 4:6)
<iv<integer>[3]>
[1] [1, 4) [2, 5) [3, 6)

Inspired by Vaughan’s work, I created my own much-less-robust vector super-class, the <highlight> vector. Below is more-or-less the full original implementation.

# Creates a new vector of class <highlight> containing a vector `x`,
# an equal length vector of locations `at`, and a `highlighter` function.
highlight <- function(x, at, highlighter = cli::col_yellow) {
  data <- if (inherits(x, "highlight")) vctrs::field(x, "data") else x
  at[is.na(at)] <- FALSE
  
  vctrs::new_rcrd(
    fields = list(data = data, at = at),
    highlighter = highlighter,
    class = "highlight"
  )
}

# The `format()` method of a <highlight> formats its underlying data 
# and then highlights elements at the locations specified by `at`.
format.highlight <- function(x, ...) {
  at <- vctrs::field(x, "at")
  data <- vctrs::field(x, "data")
  highlighter <- attr(x, "highlighter")
  
  out <- format(data, ...)
  out[at] <- highlighter(out[at])
  out
}

# Nicely display the type of a highlighted vector in a <tibble>
vec_ptype_abbr.highlight <- function(x, ...) {
  data <- vctrs::field(x, "data")
  paste0("hl<", vctrs::vec_ptype_abbr(data), ">") 
}

Harnessing the magic of {vctrs}, these twenty-ish lines of code allow us to modify the format() method of nearly any vector in R¹. Rather than playing Where’s Waldo with my boss, this allowed me to quickly highlight() any observation in a survey dataset.

library(dplyr, warn.conflicts = FALSE)

survey_data <- simulate_survey()
survey_data |>
  filter(pid == 5910295) |>
  mutate(across(everything(), ~highlight(.x, grepl("(8|9|12)$", survey))))
# A tibble: 10 × 9                                                              
         pid    survey n_children_in_household expense_toys_children            
   <hl<dbl>> <hl<chr>>               <hl<dbl>>             <hl<dbl>>            
 1   5910295 Survey 03                      NA                    NA            
 2   5910295 Survey 04                       0                    NA            
 3   5910295 Survey 05                      NA                    NA            
 4   5910295 Survey 06                      NA                    NA            
 5   5910295 Survey 07                      NA                    NA            
 6   5910295 Survey 08                       0                    NA            
 7   5910295 Survey 09                      NA                    90            
 8   5910295 Survey 10                      NA                    NA            
 9   5910295 Survey 11                      NA                    NA            
10   5910295 Survey 12                       0                    NA            
# ℹ 5 more variables: expense_school_children <hl<dbl>>,                        
#   expense_care_children <hl<dbl>>, expense_food <hl<dbl>>,                    
#   expense_insurance <hl<dbl>>, expense_recreation <hl<dbl>>                   
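The colours that highlight() adds are lost when output is copied as plain text, so the mechanism is easier to see with a visible marker. The sketch below is a stripped-down illustration, not the original implementation: the class name `demo_flagged` and the asterisk marker are made up for this example, while `new_rcrd()` and `field()` are the same {vctrs} functions used above.

```r
library(vctrs)

# A record vector bundles equal-length fields into one vector-like object.
# Here `data` holds the values and `at` flags the elements to mark.
x <- new_rcrd(
  fields = list(data = c(10, 20, 30), at = c(TRUE, FALSE, TRUE)),
  class = "demo_flagged" # hypothetical class name, for illustration only
)

# The format() method formats the underlying data, then decorates the
# flagged elements (asterisks stand in for a {cli} colour).
format.demo_flagged <- function(x, ...) {
  out <- format(field(x, "data"), ...)
  at <- field(x, "at")
  out[at] <- paste0("*", out[at], "*")
  out
}

n <- length(x)         # the record has one element per row of its fields
formatted <- format(x) # "*10*" "20" "*30*"
formatted
```

Because printing and <tibble> display both go through format(), swapping the asterisks for a {cli} styling function is all it takes to get coloured output in a console that supports it.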

The {vlightr} Package

The highlight() function serves its purpose admirably, but I couldn’t resist the urge to slap some sick flame decals on it. The {vlightr} package implements a fully featured version of highlight() with all the requisite bells and whistles.

# devtools::install_github("EthanSansom/vlightr")
library(vlightr)

# Highlight numbers greater than 5
highlighted <- highlight(c(9, 0, -1), .t = ~ .x > 5, .f = color("violet"))
print(highlighted)
<highlight<double>[3]>                                                          
[1] 9  0  -1                                                                    

vlightr::highlight() takes a vector as its first argument, a vectorized² test function or lambda .t as its second, and a formatter function .f as its third. Unlike the <highlight> of my youth, you can actually do things with a <vlightr_highlight>.

sort(c(highlighted, hl(2:8))) # `hl()` is short for `highlight()`
<highlight<double>[10]>                                                         
[1] -1 0  2  3  4  5  6  7  8  9                                                
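As a rough mental model of the test-then-format contract, here is a base R sketch. It is not how {vlightr} is implemented, and `format_where()` is a hypothetical helper invented for this illustration. The point it shows is what "vectorized" means for the test: it receives the whole vector and must return one TRUE or FALSE per element.

```r
# Hypothetical helper (not part of {vlightr}): format `x`, then apply
# `formatter` only to the elements where the vectorized `test` is TRUE.
format_where <- function(x, test, formatter) {
  at <- test(x)
  # A vectorized test returns one logical value per element of `x`
  stopifnot(is.logical(at), length(at) == length(x))
  at[is.na(at)] <- FALSE # treat a missing test result as "don't format"
  out <- format(x, trim = TRUE)
  out[at] <- formatter(out[at])
  out
}

# Wrap numbers greater than 5 in angle brackets
res <- format_where(c(9, 0, -1), \(x) x > 5, \(s) paste0("<", s, ">"))
res # "<9>" "0" "-1"
```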

Variants highlight_mult() and highlight_case() provide a switch() or dplyr::case_when() style interface for supplying multiple conditional formats.

indicator <- highlight_case(
  c(0, 1, NA, 9),
  0 ~ label("No"),
  1 ~ label("Yes"),
  is.na ~ color("red"),
  true ~ cli::style_bold(paste(.x, "[?]"))
)
print(indicator)
<highlight_case<double>[4]>                                                     
[1] 0 [No]  1 [Yes] NA      9 [?]                                               

The left-hand-side argument of each formula ~ may be a function or a syntactic literal³ and the right-hand-side a formatter function. true() here is a function which always returns TRUE⁴.
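For comparison, this is the dplyr::case_when() pattern that the formula interface borrows from, applied to the same input. It is only an analogy: case_when() recodes the values into a new character vector, whereas a highlighted vector keeps its data and changes only how it is displayed.

```r
library(dplyr, warn.conflicts = FALSE)

x <- c(0, 1, NA, 9)

# Each formula is `condition ~ value`; the first condition that is TRUE
# for an element decides its value, and `TRUE ~ ...` is the fallback.
labels <- case_when(
  x == 0 ~ "0 [No]",
  x == 1 ~ "1 [Yes]",
  is.na(x) ~ "NA",
  TRUE ~ paste(x, "[?]")
)
labels # "0 [No]" "1 [Yes]" "NA" "9 [?]"
```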

{vlightr} also comes with a handful of generator functions to help quickly style text.

labelled <- label("A label")
missing <- color("red")
important <- style("bold")
rainbow <- color_rep(c(
  "red", "orange", "gold", "green", "blue", "purple", "violet"
))

highlight_case(
  c("Label", "Where?", "Ah!", "Imagination"),
  "Label" ~ labelled,
  "Where?" ~ missing,
  "Ah!" ~ important,
  "Imagination" ~ rainbow
) |> print(width = 10)
<highlight_case<character>[4]>                                                  
[1] Label [A label] Where?          Ah!             Imagination                 

I’ll admit that color_rep() does not a critical data-analysis tool make, but after seeing Danielle Navarro’s R startup message I was determined to support rainbow-styled text.

Since {vlightr} will most likely be used for quick data exploration, I’ve added shorthand versions, hl(), hl_mult(), and hl_case(), of highlight() and its variants. Take care to use these time savings responsibly.

library(rlang)
library(purrr, warn.conflicts = FALSE)

starwars |>
  mutate(
    eye_color = hl(
      .x = eye_color, 
      .t = true,
      .f = ~map_chr(.x, \(x) try_fetch(color(x)(x), error = \(cnd) x))
    ),
    height = hl_mult(height, .x == max(.x) ~ label("max")),
    species = hl_case(species, "Human" ~ "💪", "Droid" ~ "🦾")
  ) |>
  select(name, height, eye_color, species) |>
  head(10)
# A tibble: 10 × 4                                                              
   name                  height eye_color    species                            
   <chr>              <hl<int>> <hl<chr>> <hlc<chr>>                            
 1 Luke Skywalker           172      blue         💪                             
 2 C-3PO                    167    yellow         🦾                             
 3 R2-D2                     96       red         🦾                             
 4 Darth Vader        202 [max]    yellow         💪                             
 5 Leia Organa              150     brown         💪                             
 6 Owen Lars                178      blue         💪                             
 7 Beru Whitesun lars       165      blue         💪                             
 8 R5-D4                     97       red         🦾                             
 9 Biggs Darklighter        183     brown         💪                             
10 Obi-Wan Kenobi           182 blue-gray         💪                             

For a more restrained approach, use templight(), which implements my original idea for highlighting vectors by location.

survey_data |>
  filter(pid == 5910295) |>
  mutate(across(everything(), ~templight(.x, grepl("(8|9|12)$", survey))))
# A tibble: 10 × 9                                                              
          pid     survey n_children_in_household expense_toys_children          
   <vlghtr_t> <vlghtr_t>              <vlghtr_t>            <vlghtr_t>          
 1    5910295  Survey 03                      NA                    NA          
 2    5910295  Survey 04                       0                    NA          
 3    5910295  Survey 05                      NA                    NA          
 4    5910295  Survey 06                      NA                    NA          
 5    5910295  Survey 07                      NA                    NA          
 6    5910295  Survey 08                       0                    NA          
 7    5910295  Survey 09                      NA                    90          
 8    5910295  Survey 10                      NA                    NA          
 9    5910295  Survey 11                      NA                    NA          
10    5910295  Survey 12                       0                    NA          
# ℹ 5 more variables: expense_school_children <vlghtr_t>,                       
#   expense_care_children <vlghtr_t>, expense_food <vlghtr_t>,                  
#   expense_insurance <vlghtr_t>, expense_recreation <vlghtr_t>                 

Post-Script

I’m still working on the {vlightr} package website and adding {testthat} unit tests in preparation for a CRAN submission later this year.

Footnotes

  1. This comes with the minor sacrifice of destroying all non-format-related functionality of the highlighted vector.

  2. {vlightr} stands for “vector-highlighter”, so instead of supplying a predicate .p, as in purrr::map_if(), we provide a test .t.

  3. Literals are more-or-less the set of symbols used for creating scalar atomic vectors, e.g. FALSE, NA_real_, "Hello", 12L.

  4. I’m hoping to capitalize on the muscle memory developed from using this default argument pattern: dplyr::case_when(if_this ~ that, TRUE ~ default).