4 Working with Dates and Times

One of the packages I default to when working with dates and times is the {lubridate} package, built to make working with dates and times easier than would be the case for most folks. Why is working with dates and times tricky? Well, the complexity stems from a variety of factors including time zones but not the least how the date variable has been constructed and what date/time components are included. One can also use other packages, such as {zoo} or then rely on base R. I often end up using a mix of these three. There is the {clock} package as well but I have not worked with it much.

4.1 The base R approach

Let us create some dates to work with, and we will deliberately make these messy so that they all look different.

basedate = as.Date("1970-1-1")
today = date()
tomorrow = as.Date("2017-12-18")
yesterday1 = "12/16/2017"
yesterday2 = "12-16-17"
yesterday3 = "12 16 17"
basedate; today; tomorrow; yesterday1; yesterday2; yesterday3
#> [1] "1970-01-01"
#> [1] "Thu Jan 13 19:51:34 2022"
#> [1] "2017-12-18"
#> [1] "12/16/2017"
#> [1] "12-16-17"
#> [1] "12 16 17"
#>  Date[1:1], format: "1970-01-01"
#>  chr "Thu Jan 13 19:51:34 2022"
#>  Date[1:1], format: "2017-12-18"
#>  chr "12/16/2017"
#>  chr "12-16-17"
#>  chr "12 16 17"

basedate is the default start date that R uses internally. Note that tomorrow and yesterday1, yesterday2, yesterday3 are created differently, and also that R shows the five dates as holding different formats. yesterday1, yesterday2, yesterday3, and today are read as characters. So the first thing to be done is to convert the dates read as characters into the date format.

  format = "%m/%d/%Y"
  ) -> yesterday.date1 
#>  Date[1:1], format: "2017-12-16"
  format = "%m-%d-%y"
  ) -> yesterday.date2 
#>  Date[1:1], format: "2017-12-16"
  format = "%m %d %y"
  ) -> yesterday.date3 
#>  Date[1:1], format: "2017-12-16"

Pay attention to the format = "" command since it is often the source of errors. In particular, here we are specifying that in yesterday1 the month shows up first %m then the day /%d and then the year /%Y. For yesterday2 I had to indicate that the elements are separated by - and I used %y this time to indicate that year is in a two-digit format. For yesterday3 I also had to specify that the three elements are separated by a blank space. In the end, all three dates show up with a common Date format.

This is far from the worst of it. Rather, often the data will come with date fields arranged in various ways, as shown below.

yesterday4 = "2017-16-12"
  format = "%Y-%d-%m"
  ) -> yesterday.date4 
#>  Date[1:1], format: "2017-12-16"
yesterday5 = "16Dec2017"
  format = "%d%b%Y"
  ) -> yesterday.date5 
#>  Date[1:1], format: "2017-12-16"
yesterday6 = "16 December, 2017"
  format = "%d %B, %Y"
  ) -> yesterday.date6 
#>  Date[1:1], format: "2017-12-16"

So one thing to pay attention to is the format options shown below.

  Code = c(
    "%a", "%A", "%d", "%m", "%b", "%B", "%y", "%Y"
  Represents = c(
    "Day spelled out (abbreviated)", "Day spelled out", 
    "Day of the month", "Month", "Month (abbreviated)", 
    "Month (fully spelled out)", "Year (2-digits)", 
    "Year (4-digits)"
  Example = c(
    "Mon", "Monday", "16", "12", "Dec", "December",
    "17", "2017"
  ) -> df
Table 4.1: Date formats to remember
Code Represents Example
%a Day spelled out (abbreviated) Mon
%A Day spelled out Monday
%d Day of the month 16
%m Month 12
%b Month (abbreviated) Dec
%B Month (fully spelled out) December
%y Year (2-digits) 17
%Y Year (4-digits) 2017

Another issue to keep in mind is that if dates are being exported from a software package, they may be exported as integers, for example, 35341. This number is representing the number of days lapsed since some originating date, and whatever date maps to 35341 can be recovered so long as we know the origin date. I show you two examples below, each with different originating dates. Quite obviously, if we use the wrong originating date we will end up at a different point in calendar time.

  origin = as.Date("1900-01-01")
  ) -> x
x; str(x)
#> [1] "1996-10-05"
#>  Date[1:1], format: "1996-10-05"
  origin = as.Date("1940-01-01")
  ) -> y 
y; str(y)
#> [1] "2036-10-04"
#>  Date[1:1], format: "2036-10-04"

4.1.1 Origin dates

My habit is to try and remember the default origin dates used by standard software routines. SAS, for example, uses January 1, 1960 as the origin, resulting in any date before this being assigned a negative number and any date after this being assigned a positive number. The clock changes at midnight, of course. Here is a small example of three SAS dates entered as integers.

as.Date("1960-01-01") -> sas.origin 
c(-12, 1, 43) -> x 
  origin = sas.origin
  ) -> y 
#> [1] "1959-12-20" "1960-01-02" "1960-02-13"

Often SAS date fields are exported as characters, looking something like 04aug2004 and if they have time appended to the, as 04aug2004:21:35:09. If dates can be stored and exported as integers so can time. Recovering the underlying dates and times is not difficult.

16287 -> sas.date 
"04AUG2004:21:35:09" -> sas.datetime 
  origin = sas.origin
  ) -> sas.d 
#>  Date[1:1], format: "2004-08-04"
  format = "%d%b%Y:%H:%M:%S"
  ) -> sas.dt 
#>  Date[1:1], format: "2004-08-04"

Alternatively, you could try to do the following when working with dates as integers:

16287 -> sas.date 
class(sas.date) = "Date"
#> [1] "2014-08-05"

However, the default origin date being used is R’s date of January 1, 1970, so the resulting data is 2014-08-05 and not what you see for the same integer with SAS origin date. SPSS has a different origin date – October 14, 1582 – the date of the start of the Julian calendar. Stata uses January 1, 1960 as the start date. Excel (inappropriately named) has differing origin dates:

All versions of Excel for Windows calculate dates based on the January 1, 1900 date system. Excel 2008 for Mac and earlier Excel for Mac versions calculate dates based on the January 1, 1904 date system. Excel 2016 for Mac and Excel for Mac 2011 use the January 1, 1900 date system, which guarantees date compatibility with Excel for Windows.

4.1.2 Extracting date elements

If you want to extract the year, month, day, or figure out what day of the week it is, what day of the month, the quarter, seconds lapsed, etc., these actions are easy to carry out in base R as well.

date() -> today 
#>  chr "Thu Jan 13 19:51:37 2022"
  format = "%a %b %d %H:%M:%S %Y"
  ) -> today.date 
#>  Date[1:1], format: "2022-01-13"
  ) -> today.y
today.y # year
#> [1] "2022"
  ) -> today.m1 
today.m1 # month
#> [1] "Jan"
  ) -> today.m2 
today.m2 # month
#> [1] "January"
  ) -> today.d1
today.d1 # day
#> [1] "13"
  ) -> today.d2
today.d2 # day
#> [1] "Thu"
  ) -> today.d3
today.d3 # day
#> [1] "Thursday"

You could also do:

  abbreviate = FALSE
  ) -> today.d4 
today.d4 # day
#> [1] "Thursday"
  abbreviate = FALSE
  ) -> today.m3
today.m3 # month
#> [1] "January"
  abbreviate = FALSE
  ) -> today.q
today.q # quarter
#> [1] "Q1"

What about converting today to its Julian calendar value, without specifying origin (which results in R using 1970-01-01 as the origin)?

  ) -> today.julian 
#> [1] 19005
#> attr(,"origin")
#> [1] "1970-01-01"

4.1.3 Calculations with dates and generating dates

Once we have an object or variable stored in the correct Date format it is easy to calculate anything we might need to. For example, and starting simply, what if I want to know the date 30 days from today? 79 days ago?

today.date + 30
#> [1] "2022-02-12"
today.date - 79
#> [1] "2021-10-26"

We could take two dates and then calculate the length of time lapsed between these two dates.

  ) -> date1 
date2 = today.date
lapsed.time = (date2 - date1)
#> Time difference of 1721 days

Say I want to create a vector of dates that starts and ends on specific dates, and the step function is 1 day, a week, 4 months, etc. The step is indicated with the by = "" command

  from = as.Date("2017-12-17"), 
  to = as.Date("2018-12-16"), 
  by = "day"
  ) -> my.dates1 
  from = as.Date("2017-12-17"), 
  to = as.Date("2018-12-16"), 
  by = "week"
  ) -> my.dates2 
  from = as.Date("2017-12-17"), 
  to = as.Date("2018-12-16"), 
  by = "month"
  ) -> my.dates3 
  from = as.Date("2017-12-17"), 
  to = as.Date("2018-12-16"), 
  by = "3 days"
  ) -> my.dates4 
  from = as.Date("2017-12-17"), 
  to = as.Date("2018-12-16"), 
  by = "2 weeks"
  ) -> my.dates5 
  from = as.Date("2017-12-17"), 
  to = as.Date("2018-12-16"), 
  by = "4 months"
  ) -> my.dates6 
  from = as.Date("2017-12-17"), 
  to = as.Date("2019-12-16"), 
  by = "year"
  ) -> my.dates7 
  from = as.Date("2017-12-17"), 
  to = as.Date("2022-12-16"), 
  by = "2 years"
  ) -> my.dates8 

4.1.4 Dates in the economics dataframe

As a form of practice, let us load the economics dataframe and work with the date fields in that dataframe. The descriptions of the variables follows:

Variable Description Source
date Month of data collection None
psavert personal savings rate http://research.stlouisfed.org/fred2/series/PSAVERT/
pce personal consumption expenditures, in billions of dollars http://research.stlouisfed.org/fred2/series/PCE
unemploy number of unemployed in thousands http://research.stlouisfed.org/fred2/series/UNEMPLOY
uempmed median duration of unemployment, in weeks http://research.stlouisfed.org/fred2/series/UEMPMED
pop total population, in thousands http://research.stlouisfed.org/fred2/series/POP

Now, with this dataframe, I’d like to create indicators for year, month, and quarter.

#> [1] "date"     "pce"      "pop"      "psavert"  "uempmed"  "unemploy"
economics$year = format(economics$date, "%Y")
economics$month = format(economics$date, "%B")
economics$quarter = quarters(economics$date, abbreviate = FALSE)

4.2 Using {lubridate}

Things are somehow easier to do with this package, largely because the functions seem more intuitive (at least to me) and the package does a lot of things really well. Below we start with some simple explorations of basic ways in which {lubridate} will interpret commands about dates and times.

today1 = "20171217"
today2 = "2017-12-17"
today3 = "2017 December 17"
today4 = "20171217143241"
today5 = "2017 December 17 14:32:41"
today6 = "December 17 2017 14:32:41"
today7 = "17-Dec, 2017 14:32:41"

#> [1] "2017-12-17"
#> [1] "2017-12-17"
#> [1] "2017-12-17"
#> [1] "2017-12-17 14:32:41 UTC"
#> [1] "2017-12-17 14:32:41 UTC"
#> [1] "2017-12-17 14:32:41 UTC"
#> [1] "2017-12-17 14:32:41 UTC"

Note the flexibility here; the package parses the strings and so long as you indicate whether the format is year then month then day (ymd) or month then day then year (mdy), etc., and if time in hours minutes and seconds should be read as well (hms), you need worry about nothing else.

Now, the default time zone is UTC – Coordinated Universal Time – “is the basis for civil time today. This 24-hour time standard is kept using highly precise atomic clocks combined with the Earth’s rotation.” Why UTC when it should be CUT?. So if you have data measured in a different time zone you will have to specify it thus:

dmy_hms(today7, tz = "EST")
#> [1] "2017-12-17 14:32:41 EST"

We can also use it in some surprising ways. For example, say I am flying out of London in 10 days. What is the current time in London? What is my departure time?

now("Europe/London") -> now.london
#> [1] "2022-01-14 00:51:37 GMT"
now.london + days(10) -> my.departure
#> [1] "2022-01-24 00:51:37 GMT"

Hang on a second, the boarding pass shows departure time to be at 11:41 AM. Ah, easily corrected:

  my.departure, hours = 11, minutes = 41, seconds = 0
  ) -> corrected.departure
#> [1] "2022-01-24 11:41:00 GMT"

4.2.1 Extracting date elements

This is ridiculously easy as well. I’ll stick with the running example of today and show you the extractor commands.

dmy_hms(today7) -> today 
year(today) -> today.y; today.y
#> [1] 2017
month(today) -> today.m1; today.m1
#> [1] 12
  label = TRUE, 
  abbr = TRUE
  ) -> today.m2; today.m2
#> [1] Dec
#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec
  label = TRUE, 
  abbr = FALSE
  ) -> today.m3; today.m3
#> [1] December
#> 12 Levels: January < February < March < April < May < June < July < ... < December
week(today) -> today.w; today.w
#> [1] 51
yday(today) -> today.doy; today.doy
#> [1] 351
mday(today) -> today.dom; today.dom
#> [1] 17
wday(today) -> today.dow1; today.dow1
#> [1] 1
  label = TRUE, 
  abbr = TRUE
  ) -> today.dow2; today.dow2
#> [1] Sun
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
  label = TRUE, 
  abbr = FALSE
  ) -> today.dow3; today.dow3
#> [1] Sunday
#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
hour(today) -> today.h; today.h
#> [1] 14
minute(today) -> today.m; today.m
#> [1] 32
second(today) -> today.s; today.s
#> [1] 41
tz(today) -> today.tz; today.tz
#> [1] "UTC"

Note that when label = TRUE you get the days of the weeks and the months not only labelled but ordered as well. You can set the abbreviated labels off by specifying abbr = FALSE (same as abbreviate = FALSE).

4.2.2 Creating Dates and Times

To see the {lubridate} package in action I’ll read in the on time arrival data in the nycflights13 package and create a dataframe that only has year, month, day, hour, and minute.

#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "time_hour"
flights %>% 
    year, month, day, hour, minute
    ) -> df
df %>% 
    dep.date = make_date(year, month, day),
    dep.datetime = make_datetime(year, month, day, hour, minute)
    ) -> df

Note the difference between make_date() and make_datetime().

Now, if you look at dep_time and sched_dep_time etc you will see they are in a strange format – 517, 533, etc. These are departure times in the local time zone, represented without the hour-minute separator. This can be easily fixed prior to creating appropriate date-time variables for actual versus scheduled departure, and likewise for actual versus scheduled arrival.

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
} # Hadley's function to use modulus arithmetic to pull out the hour and minute components

flights %>% 
    !is.na(dep_time), !is.na(arr_time)
    ) %>% 
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
    origin, dest, ends_with("delay"), ends_with("time")
    ) -> df2 

You could have done the same thing with:

flights %>% 
    !is.na(dep_time), !is.na(arr_time)
    ) %>% 
    dep_time = make_datetime(year, month, day, dep_time %/% 100, dep_time %% 100),
    arr_time = make_datetime(year, month, day, arr_time %/% 100, arr_time %% 100),
    sched_dep_time = make_datetime(year, month, day, sched_dep_time %/% 100, sched_dep_time %% 100),
    sched_arr_time = make_datetime(year, month, day, sched_arr_time %/% 100, sched_arr_time %% 100)
  ) %>% 
    origin, dest, ends_with("delay"), ends_with("time")
    ) -> df3 

4.2.3 Intervals, durations and periods

You may think it would be straightforward to calculate the length of time between two dates, whether length is being measured in days, years, hours, weeks, or what have you. However, it is trickier than we realize because you have to worry about daylight savings time, leap years, and leap seconds. Consequently, “At different moments in time, the lengths of months, weeks, days, hours, and even minutes will also vary. We can consider these to be relative units of time; their length is relative to when they occur. In contrast, seconds always have a consistent length. Hence, seconds are exact units of time. lubridate authors” The authors of lubridate have come up with ways that allow you to calculate both exact and relative lengths of time, respectively.

Instants are specific moments in time, such as now, when I am writing this chapter. Indeed, every time you read in and store a date (with our without time), you have created what {lubridate} calls an instant.

Timespans, however, are measured as intervals, durations, or periods. An interval captures the timespan between two instants as, for example, in the following instance.

now = now(tz = "UTC"); now
#> [1] "2022-01-14 00:51:43 UTC"
dogbarked = ymd_hms("2017-12-17 15:41:11"); dogbarked
#> [1] "2017-12-17 15:41:11 UTC"
span = interval(now, dogbarked); span
#> [1] 2022-01-14 00:51:43 UTC--2017-12-17 15:41:11 UTC
int = interval(ymd("1980-01-01"), ymd("2014-09-18"))
time_length(int, "week")
#> [1] 1811
time_length(int, "month")
#> [1] 416.6
time_length(int, "year")
#> [1] 34.71
time_length(int, "hour")
#> [1] 304296
time_length(int, "minute")
#> [1] 18257760
time_length(int, "second")
#> [1] 1.095e+09

If we now look at this span of time in seconds we will have an exact interval, and {lubridate} refers to an exact interval as a duration.

If we measure the timespan in minutes, hours, days, etc then we have a relative span of time and {lubridate} calls this a period. So for span created earlier, we have

#> [1] "-128596232.855446s (~-4.07 years)"
#> [1] "1095465600s (~34.71 years)"

… a measurement of roughly 5.51 hours but the precise span is 19825.8998761177 seconds.

As it turns out, {lubridate} has the following, often incredibly useful, helper functions …

x = 6
#> [1] "6s"
#> [1] "360s (~6 minutes)"
#> [1] "21600s (~6 hours)"
#> [1] "518400s (~6 days)"
#> [1] "3628800s (~6 weeks)"
#> [1] "189345600s (~6 years)"

Note how the same value of x = 6 is being converted into specific durations depending upon whether you ask for dminutes or dweeks.

In contrast, period objects can be created as follows:

x = 6
#> [1] "6M 0S"
#> [1] "6H 0M 0S"
#> [1] "6d 0H 0M 0S"
#> [1] "42d 0H 0M 0S"
#> [1] "6m 0d 0H 0M 0S"
#> [1] "6y 0m 0d 0H 0M 0S"

You can also generate a period object in seconds() but this is then no different from a duration and hence does not really add any value here.

When do these prove to be useful? For me, they come in handy when flipping start and end dates for some episodes or spells into a measure of how long the subject was in an episode or spell. Then, if I have start and end dates for each subject I can calculate the length of stay in days, weeks, months, years, etc. I can also do this given date-of-birth for subjects but not their age.

  "2012-08-21", "2012-09-01", "2012-08-15", "2012-09-18"
  ) -> start 
  "2012-09-16", "2012-09-06", "2012-08-22", "2012-10-11"
  ) -> end 
  start, end
  ) -> elapsed.time 
as.duration(elapsed.time) / dweeks(1)
#> [1] 3.7143 0.7143 1.0000 3.2857
as.duration(elapsed.time) / ddays(1)
#> [1] 26  5  7 23
as.duration(elapsed.time) / dminutes(1)
#> [1] 37440  7200 10080 33120
as.duration(elapsed.time) / dyears(1)
#> [1] 0.07118 0.01369 0.01916 0.06297

There is a lot more we could do with dates and times but the ground we have covered thus far should be sufficient for much of the work I end up doing.

4.3 Exercises

4.3.1 An Example

Here is a real-world example of how lubridate comes in handy. The data you will see are simulated and reflect typical fields since for children in the state’s foster care system. Each child is removed from a home on a given date. Once they are placed, the placement begins and during a placement a child can leave (with or without authorization) and then will likely return to the same placement. The child will also, eventually, be discharged from the state foster care system. Now, assume you have the following data.


Calculate (i) the median age at entry, and (ii) the median length of stay (defined here as the duration of a placement with start and end dates).

4.3.2 Shark Attacks

In keeping with the philosophy of Albert Einstein, the mission of the Global Shark Attack File is to provide current and historical data on shark/human interactions for those who seek accurate and meaningful information and verifiable references. For researchers needing more data, and for those in the media, the Global Shark Attack File provides direct contact information of the case investigators. Humans are not on the menu of sharks. Sharks bite humans out of curiosity or to defend themselves. It is the goal of the Global Shark Attack File to demonstrate and emphasize, through forensic analysis, the significance of shark/human interactions in comparison to the myriad dangers that we face in our daily lives. With a better understanding of these interactions we can minimize the risk of being injured by a shark and concentrate on the conservation of all shark species worldwide.

Unprovoked vs. Provoked - GSAF defines a provoked incident as one in which the shark was speared, hooked, captured or in which a human drew “first blood”. Although such incidents are of little interest to shark behaviorists, when the species of shark involved is known and pre-op photos of the wounds are available, the bite patterns are of value in determining species of shark involved in other cases when the species could not identified by the patient or witnesses. We know that a live human is rarely perceived as prey by a shark. Many incidents are motivated by curiosity, others may result when a shark perceives a human as a threat or competitor for a food source, and could be classed as “provoked” when examined from the shark’s perspective.

Incidents involving Boats – Incidents in which a boat was bitten or rammed by a shark are in green. However, in cases in which the shark was hooked, netted or gaffed, the entry is orange because they are classed as provoked incidents.

Casualties of War & Air/Sea Disasters - Sharks maintain the health of the marine ecosystem by removing the dead or injured animals. Many incidents result because, like other animals that don’t rely on instinct alone, sharks explore their environment. Lacking hands, they may investigate an unfamiliar object with their mouths. Unlike humans, there is no malice in sharks; they simply do what nature designed them to do. Air/Sea Disasters are accidents that place people into the day-to-day business of sharks. The wartime losses due to sharks result from mans’ cruelty to man. Air/Sea Disasters are in yellow.

Questionable incidents - Incidents in which there are insufficient data to determine if the injury was caused by a shark or the person drowned and the body was later scavenged by sharks. In a few cases, despite media reports to the contrary, evidence indicated there was no shark involvement whatsoever. Such incidents are in blue.

On December 17, 2017 I pulled the most recent data-set available from GSAF and that file can be accessed here. Read it in and then answer the following questions.

  1. Create a proper date variable, one that displays the year, month, and day.

  2. Can you create a proper time variable that displays the hour and minutes of the time the encounter occurred? Why or why not? What solutions could you try to get around the hurdles?

  3. Use the date field you create to calculate the length of days between successive shark attacks by country. Is this duration expanding or shrinking in recent years for the USA since 1940? What about for Australia?

addClassKlippyTo("pre.r, pre.markdown"); addKlippy('right', 'top', '#6495ED', '1', 'Click to copy', 'Done');