phinterval

featured

package

c++

A free DLC for the {lubridate} package which makes it fast and easy to perform set operations (e.g. intersection, union, set-difference) on datetime intervals. Disclaimer, I am not affiliated with {lubridate}!

Published

January 30, 2025

{lubridate} is Great

I love {lubridate} because I do not love working with dates - or their malicious cousins, times. This is probably a cliché among people who think about data for a living, but I will reiterate it anyway; I don’t like thinking about time zones, leap years, how many days are in April, or dealing with base-60 arithmetic.

Fortunately, {lubridate} provides a collection of datetime classes and functions which, more often than not, keep me safely within a bubble where I don’t need to think about these things. If you’re unfamiliar with {lubridate}, here are a few examples.

library(lubridate, warn.conflicts = FALSE)

dyears(1) / dminutes()  # How many minutes are in an average year?

[1] 525960

ddays(366) / dminutes() # How many minutes are in a leap year?

[1] 527040

# Convert a UTC datetime to an EST datetime
with_tz(ymd_hms("2012-03-26 10:10:00", tz = "UTC"), "EST")

[1] "2012-03-26 05:10:00 EST"

There is, however, one family of problems that {lubridate} does not like. Problems with holes. Consider work and lunch, which measure the time-spans I spent at work and on a lunch break on January 3rd 2025.

work <- interval("2025-01-03 09:00:00", "2025-01-03 17:00:00")
lunch <- interval("2025-01-03 12:00:00", "2025-01-03 13:00:00")

What I’d like to analyze is the time that I was actually doing work, excluding my lunch break. My first instinct is to punch a hole through my work day using setdiff() to remove the time I spent at lunch. Unfortunately for me, this isn’t going to work.

try(setdiff(work, lunch))

Error in setdiff.Interval(work, lunch) : 
  Cases 1 result in discontinuous intervals.

The phrase “discontinuous intervals” here is a fancier way of saying “holey intervals”.

Holey Intervals

A <phinterval> is a “potentially-holey-interval” vector, i.e. it’s an interval which might contain gaps. In particular, each element of a <phinterval> vector is a set of non-overlapping, non-abutting datetime intervals.

# devtools::install_github("EthanSansom/phinterval")
library(phinterval)

jan_seventh <- interval(as.Date("2020-01-07"), as.Date("2020-01-08"))
jan_9_to_10 <- interval(as.Date("2020-01-09"), as.Date("2020-01-10"))
jan_9_to_12 <- interval(as.Date("2020-01-09"), as.Date("2020-01-12"))

# Create a length-1 <phinterval> vector
phint <- phinterval(c(jan_seventh, jan_9_to_10, jan_9_to_12))
print(phint)

<phinterval<UTC>[1]>
[1] [2020-01-07--2020-01-08, 2020-01-09--2020-01-12] UTC

A few things of note about the <phinterval> vector phint:

The element phint[[1]] describes two non-overlapping time spans.
phint[[1]] contains a “hole” during January the 8th.
When creating phint, phinterval() combined jan_9_to_10 and jan_9_to_12.

Many date and time related concepts are naturally represented by a <phinterval>, including which portion of the day was I actually working on January 3rd, 2025.

phint_diff(work, lunch)

<phinterval<UTC>[1]>
[1] [2025-01-03 09:00:00--2025-01-03 12:00:00, 2025-01-03 13:00:00--2025-01-03 17:00:00] UTC

Hard Questions with Dates

To further motivate the existence of a <phinterval>, lets consider a fake employment dataset. We’ll pretend that we’re government employees who’ve been tasked with analyzing the employment history of respondents to the latest census survey. Each person who responded to the census provided us with the title and start/end date of every job they’ve ever held.

print(employment)

# A tibble: 9 × 3
  name    job_title      job_interval                  
  <chr>   <chr>          <Interval>                    
1 Bruce   Industrialist  1939-03-01 UTC--1940-06-01 UTC
2 Bruce   Philanthropist 1942-01-01 UTC--1945-01-01 UTC
3 Bruce   Vigilante      1943-01-01 UTC--1947-01-01 UTC
4 Clark   Journalist     1938-04-01 UTC--1944-01-01 UTC
5 Clark   Coach          1946-01-01 UTC--1947-06-01 UTC
6 Tony    Inventor       1962-12-01 UTC--1963-06-01 UTC
7 Tony    CEO            1964-01-01 UTC--1967-01-01 UTC
8 Tony    Consultant     1966-01-01 UTC--1967-01-01 UTC
9 Natasha Spy            1964-04-01 UTC--1970-01-01 UTC

Our boss has a few questions for us:

When was each respondent employed?
When did each respondent have gaps in their employment?

We’ll focus on "Bruce" first. Our office uses the {tidyverse}, so we’ll be working with {dplyr}, {ggplot2}, and {lubridate}.

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

bruce_employment <- employment |> filter(name == "Bruce")

Let’s plot the duration of each of his jobs to get a lay of the land.

Plot Code

bruce_plot <- bruce_employment |>
  mutate(
    job_start = int_start(job_interval),
    job_end = int_end(job_interval)
  )

bruce_plot |>
  ggplot() +
  geom_segment(
    aes(
      y = job_title, yend = job_title,
      x = job_start, xend = job_end
    ),
    colour = "#38B873",
    arrow = arrow(angle = 45, ends = "both", length = unit(10, "pt")),
    linewidth = 1.5
  ) +
  labs(title = NULL, x = NULL, y = NULL) +
  theme_minimal(base_size = 14) +
  theme(
    panel.background = element_blank(),
    plot.background = element_blank()
  )

The time spans we want to find, Bruce’s employment at any job and his unemployment between jobs, look like this.

Plot Code

bruce_plot <- tribble(
  ~"job_title", ~"job_start", ~"job_end",
  "Employed", as.Date("1939-03-01"), as.Date("1940-06-01"),
  "Employed", as.Date("1942-01-01"), as.Date("1947-01-01"),
  "Unemployed", as.Date("1940-06-01"), as.Date("1942-01-01")
)

bruce_plot |>
  mutate(job_title = ordered(job_title, c("Unemployed", "Employed"))) |>
  ggplot() +
  geom_segment(
    aes(
      y = job_title, yend = job_title,
      x = job_start, xend = job_end
    ),
    colour = if_else(bruce_plot$job_title == "Employed", "blue", "red"),
    arrow = arrow(angle = 45, ends = "both", length = unit(10, "pt")),
    linewidth = 1.5
  ) +
  labs(title = NULL, x = NULL, y = NULL) +
  theme_minimal(base_size = 14) +
  theme(
    panel.background = element_blank(),
    plot.background = element_blank()
  )

To answer our boss’s first question, we’re looking to find the blue time-spans where Bruce was employed. This is surprisingly un-simple. Here’s a handful of Stack Overflow posts with tens of thousands of views between them asking how to do exactly this:

Answers to these questions suggest using packages ranging from {data.table} to {IRanges} to {ivs}. Here’s a fast-ish solution using {base} R.

# Extract the start and end of each of Bruce's jobs
intervals <- sort(bruce_employment$job_interval)
starts <- int_start(intervals)
ends <- int_end(intervals)

# Do some magic to merge overlapping jobs
overlap_groups <- c(
  0, 
  cumsum(as.numeric(lead(starts)) > cummax(as.numeric(ends)))[-length(ends)]
)
new_starts <- do.call(c, split(starts, overlap_groups) |> lapply(min))
new_ends <- do.call(c, split(ends, overlap_groups) |> lapply(max))

# Turn this back into an <Interval> vector
bruce_employment_intervals <- interval(new_starts, new_ends)
print(bruce_employment_intervals)

[1] 1939-03-01 UTC--1940-06-01 UTC 1942-01-01 UTC--1947-01-01 UTC

Suffice it to say, we’ve firmly left the warm embrace of {lubridate}’s intuitive API. To answer our boss’s second question, when was Bruce unemployed between jobs, we perform another slightly less confusing dance.

# Extract the start and end of each of Bruce's employment spans
starts <- int_start(bruce_employment_intervals)
ends <- int_end(bruce_employment_intervals)

# Turn the ends of employment into the starts of unemployment
bruce_unemployment_intervals <- interval(ends[-length(ends)], starts[-1])
print(bruce_unemployment_intervals)

[1] 1940-06-01 UTC--1942-01-01 UTC

And here, is the fruit of our labour.

Plot Code

# Plot the duration of each of Bruce's jobs
bruce_plot <- bruce_employment |>
  bind_rows(
    tibble(
      job_title = c("Employed", "Employed", "Unemployed"),
      job_interval = c(bruce_employment_intervals, bruce_unemployment_intervals)
    )
  ) |>
  mutate(
    job_start = int_start(job_interval),
    job_end = int_end(job_interval),
    job_title = ordered(
      job_title, 
      rev(c(
        "Vigilante", "Philanthropist", "Industrialist", 
        "Employed", "Unemployed"
      ))
    ),
    colour = case_when(
      job_title == "Employed" ~ "blue",
      job_title == "Unemployed" ~ "red",
      TRUE ~ "#38B873"
    )
  )

bruce_plot |>
  ggplot() +
  geom_segment(
    aes(
      y = job_title, yend = job_title,
      x = job_start, xend = job_end,
      colour = colour
    ),
    arrow = arrow(angle = 45, ends = "both", length = unit(10, "pt")),
    linewidth = 1.5
  ) +
  labs(title = NULL, x = NULL, y = NULL) +
  scale_colour_identity() +
  theme_minimal(base_size = 14) +
  theme(
    panel.background = element_blank(),
    plot.background = element_blank()
  )

We present this plot to our boss. They look at us, perplexed, and ask why it took 30 minutes to make this one image. Anyhow, they grumble, for our next task we need to find all periods where Bruce and Clark were working at the same time.

Plot Code

employment |>
  filter(name %in% c("Bruce", "Clark")) |>
  mutate(
    job_start = int_start(job_interval),
    job_end = int_end(job_interval),
    job_title = ordered(
      job_title, 
      rev(c(
        "Vigilante", "Philanthropist", "Industrialist", 
        "Journalist", "Coach"
      ))
    )
  ) |>
  ggplot() +
  geom_segment(
    aes(
      y = job_title, yend = job_title,
      x = job_start, xend = job_end,
      colour = name
    ),
    arrow = arrow(angle = 45, ends = "both", length = unit(10, "pt")),
    linewidth = 1.5,
    key_glyph = draw_key_rect
  ) +
  labs(title = NULL, x = NULL, y = NULL, colour = NULL) +
  scale_colour_manual(
    labels = c("Bruce", "Clark"),
    values = c("black", "gold")
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "top",
    panel.background = element_blank(), 
    plot.background = element_blank()
  )

I’m going to spare you the details of how we’d do this, but it’s not fun. And the questions only get more complicated from here.

Enter {phinterval}

The goal of the {phinterval} package is to ever-so-slightly expand the bubble that {lubridate} created, so that we can solve problems like these using a familiar API. The <phinterval> vector class and it’s methods are designed to look right at home alongside the classes implemented by {lubridate}:

<Duration> - A length of time in seconds.
<Period> - A length of time in minutes, hours, days, weeks, months, or years.
<Interval> - A span of time between two instants.
<phinterval> - An <Interval> which may contain holes.

Returning to our boss’s first request, what we’re really trying to do is flatten/merge/combine an <Interval> vector into a <phinterval> element. We can do this via phint_squash() which squashes an <Interval> vector into a scalar <phinterval>.

bruce_employed <- phint_squash(bruce_employment$job_interval)
print(bruce_employed)

<phinterval<UTC>[1]>
[1] [1939-03-01--1940-06-01, 1942-01-01--1947-01-01] UTC

For our boss’s second request, we just want to retrieve the “holes” of our <phinterval> which represent the gaps in Bruce’s employment. We can do this using phint_invert() which returns the gaps of an existing <phinterval> vector as a new <phinterval>.

phint_invert(bruce_employed)

<phinterval<UTC>[1]>
[1] [1940-06-01--1942-01-01] UTC

{phinterval} really starts to shine when we use it alongside {dplyr}.

squashed <- employment |>
  group_by(name) |>
  summarize(employed = phint_squash(job_interval)) |>
  mutate(unemployed = phint_invert(employed))

print(squashed)

# A tibble: 4 × 3
  name                                                employed
  <chr>                                             <phintrvl>
1 Bruce   [1939-03-01--1940-06-01, 1942-01-01--1947-01-01] UTC
2 Clark   [1938-04-01--1944-01-01, 1946-01-01--1947-06-01] UTC
3 Natasha                         [1964-04-01--1970-01-01] UTC
4 Tony    [1962-12-01--1963-06-01, 1964-01-01--1967-01-01] UTC
# ℹ 1 more variable: unemployed <phintrvl>

Our boss’s intimidating third question is now just a matter of taking the intersection of Bruce’s and Clark’s employment histories with phint_intersect().

bruce_employed <- squashed$employed[squashed$name == "Bruce"]
clark_employed <- squashed$employed[squashed$name == "Clark"]

phint_intersect(bruce_employed, clark_employed)

<phinterval<UTC>[1]>
[1] [1939-03-01--1940-06-01, 1942-01-01--1944-01-01, 1946-01-01--1947-01-01] UTC

And here’s the plot to prove it works.

Plot Code

employment |>
  filter(name %in% c("Bruce", "Clark")) |>
  bind_rows(
    tibble(
      name = "Bruce + Clark",
      job_title = "Both Employed",
      job_interval = phint_to_spans(phint_intersect(bruce_employed, clark_employed))[[1]]
    )
  ) |>
  mutate(
    job_start = int_start(job_interval),
    job_end = int_end(job_interval),
    job_title = ordered(
      job_title, 
      rev(c(
        "Vigilante", "Philanthropist", "Industrialist", 
        "Journalist", "Coach", "Both Employed"
      ))
    ),
    name = ordered(name, c("Bruce", "Clark", "Bruce + Clark"))
  ) |>
  ggplot() +
  geom_segment(
    aes(
      y = job_title, yend = job_title,
      x = job_start, xend = job_end,
      colour = name
    ),
    arrow = arrow(angle = 45, ends = "both", length = unit(10, "pt")),
    linewidth = 1.5,
    key_glyph = draw_key_rect
  ) +
  labs(title = NULL, x = NULL, y = NULL, colour = NULL) +
  scale_colour_manual(
    labels = c("Bruce", "Clark", "Bruce + Clark"),
    values = c("black", "gold", "purple")
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "top",
    panel.background = element_blank(), 
    plot.background = element_blank()
  )

The Catch

Unfortunately, while all of the code in this demo works, {phinterval} is still a prototype. Because of it’s non-standard data-structure, we can’t just use fast vectorized operators or primitive functions when manipulating a <phinterval>’s data. Looking at the source code for the <phinterval> class you’ll see a liberal use of map(), a wrapper around base::lapply().

This is slow, sometimes very slow. Compare the function body of lubridate::int_end(), which retrieves the end time of an <Interval>, with the function body of phinterval::phint_end(), which retrieves the end time of a <phinterval>.

body(lubridate::int_end)

int@start + int@.Data

body(phinterval::phint_end)

{
    phint <- check_is_phinty(phint)
    reference_time <- field(phint, "reference_time")
    range_ends <- field(phint, "range_ends")
    reference_time + map_dbl(range_ends, max)
}

{phinterval}, relatively speaking, is doing a lot of work to accomplish a pretty simple task. To improve performance, I’d like to implement a portion of the {phinterval} package in C++ (with the help of the {Rcpp} package) and work on optimizing the remaining R code.

Until then, if you need fast and flexible interval operations, I’d recommend Davis Vaughn’s great {ivs} package. It can do anything shown in this article and it’s powered by fast {vctrs} functions already written in C++.

Why {phinterval}

If {ivs} works, why bother with {phinterval}? First, because writing new packages is fun. Second, because {ivs}, by design, is not meant to be an extension of the <Interval> vector (an <ivs_iv> can represent a right-open interval of any vector type which has methods for comparison and is supported by the {vctrs} package).

{phinterval}, meanwhile, is designed to be a drop-in extension of the {lubridate} package. If you’re lucky, and your boss doesn’t ask too many questions, you’ll never need it. If you’re less lucky, you can take solace in the fact that you shouldn’t need to change any of your existing {lubridate} work-flow to start working with {phinterval}. Any {lubridate} <Interval> vector can be coerced to an equivalent <phinterval> vector without loss of information, including instantaneous intervals. <phinterval> vectors also support many of the useful features of an <Interval>, including date arithmetic and coercion to a duration or period.

Users of {lubridate} will already know how to use many of the phint_*() family of functions, all of which accept an <Interval> or <phinterval> vector as input.

{lubridate}	{phinterval}	Returns
`int_length()`	`phint_length()`	Length in seconds
`int_start()`	`phint_start()`	Start date of the (ph)interval
`int_end()`	`phint_end()`	End date of the (ph)interval
`int_shift()`	`phint_end()`	A (ph)interval shifted up/down the timeline
`int_overlaps()`	`phint_overlaps()`	Whether elements of two ph(intervals) overlap
`%within%`	`%within%`	Whether a (ph)interval is within another ph(interval)

When you do need to reach for the unique functionality of {phinterval}, functions such as phint_union(), phint_intersection(), and phint_diff() all accept any <Interval> vector as input and output a <phinterval> vector with the same timezone.