4 Working with Dates and Times
One of the packages I default to when working with dates and times is the {lubridate}
package, built to make working with dates and times easier than would be the case for most folks. Why is working with dates and times tricky? Well, the complexity stems from a variety of factors including time zones but not the least how the date variable has been constructed and what date/time components are included. One can also use other packages, such as {zoo}
or then rely on base R. I often end up using a mix of these three. There is the {clock}
package as well but I have not worked with it much.
4.1 The base R approach
Let us create some dates to work with, and we will deliberately make these messy so that they all look different.
= as.Date("1970-1-1")
basedate = date()
today = as.Date("2017-12-18")
tomorrow = "12/16/2017"
yesterday1 = "12-16-17"
yesterday2 = "12 16 17"
yesterday3
basedate; today; tomorrow; yesterday1; yesterday2; yesterday3#> [1] "1970-01-01"
#> [1] "Thu Jan 13 19:51:34 2022"
#> [1] "2017-12-18"
#> [1] "12/16/2017"
#> [1] "12-16-17"
#> [1] "12 16 17"
str(basedate)
#> Date[1:1], format: "1970-01-01"
str(today)
#> chr "Thu Jan 13 19:51:34 2022"
str(tomorrow)
#> Date[1:1], format: "2017-12-18"
str(yesterday1)
#> chr "12/16/2017"
str(yesterday2)
#> chr "12-16-17"
str(yesterday3)
#> chr "12 16 17"
basedate
is the default start date that R uses internally. Note that tomorrow
and yesterday1, yesterday2, yesterday3
are created differently, and also that R shows the five dates as holding different formats. yesterday1
, yesterday2
, yesterday3
, and today
are read as characters. So the first thing to be done is to convert the dates read as characters into the date format.
as.Date(
yesterday1, format = "%m/%d/%Y"
-> yesterday.date1
) str(yesterday.date1)
#> Date[1:1], format: "2017-12-16"
as.Date(
yesterday2, format = "%m-%d-%y"
-> yesterday.date2
) str(yesterday.date2)
#> Date[1:1], format: "2017-12-16"
as.Date(
yesterday3, format = "%m %d %y"
-> yesterday.date3
) str(yesterday.date3)
#> Date[1:1], format: "2017-12-16"
Pay attention to the format = ""
command since it is often the source of errors. In particular, here we are specifying that in yesterday1
the month shows up first %m
then the day /%d
and then the year /%Y
. For yesterday2
I had to indicate that the elements are separated by -
and I used %y
this time to indicate that year is in a two-digit format. For yesterday3
I also had to specify that the three elements are separated by a blank space. In the end, all three dates show up with a common Date format.
This is far from the worst of it. Rather, often the data will come with date fields arranged in various ways, as shown below.
= "2017-16-12"
yesterday4 as.Date(
yesterday4, format = "%Y-%d-%m"
-> yesterday.date4
) str(yesterday.date4)
#> Date[1:1], format: "2017-12-16"
= "16Dec2017"
yesterday5 as.Date(
yesterday5, format = "%d%b%Y"
-> yesterday.date5
) str(yesterday.date5)
#> Date[1:1], format: "2017-12-16"
= "16 December, 2017"
yesterday6 as.Date(
yesterday6, format = "%d %B, %Y"
-> yesterday.date6
) str(yesterday.date6)
#> Date[1:1], format: "2017-12-16"
So one thing to pay attention to is the format options shown below.
library(tidyverse)
tibble(
Code = c(
"%a", "%A", "%d", "%m", "%b", "%B", "%y", "%Y"
),Represents = c(
"Day spelled out (abbreviated)", "Day spelled out",
"Day of the month", "Month", "Month (abbreviated)",
"Month (fully spelled out)", "Year (2-digits)",
"Year (4-digits)"
), Example = c(
"Mon", "Monday", "16", "12", "Dec", "December",
"17", "2017"
)-> df )
Code | Represents | Example |
---|---|---|
%a | Day spelled out (abbreviated) | Mon |
%A | Day spelled out | Monday |
%d | Day of the month | 16 |
%m | Month | 12 |
%b | Month (abbreviated) | Dec |
%B | Month (fully spelled out) | December |
%y | Year (2-digits) | 17 |
%Y | Year (4-digits) | 2017 |
Another issue to keep in mind is that if dates are being exported from a software package, they may be exported as integers, for example, 35341
. This number is representing the number of days lapsed since some originating date, and whatever date maps to 35341 can be recovered so long as we know the origin
date. I show you two examples below, each with different originating dates. Quite obviously, if we use the wrong originating date we will end up at a different point in calendar time.
as.Date(
35341,
origin = as.Date("1900-01-01")
-> x
) str(x)
x; #> [1] "1996-10-05"
#> Date[1:1], format: "1996-10-05"
as.Date(
35341,
origin = as.Date("1940-01-01")
-> y
) str(y)
y; #> [1] "2036-10-04"
#> Date[1:1], format: "2036-10-04"
4.1.1 Origin dates
My habit is to try and remember the default origin dates used by standard software routines. SAS, for example, uses January 1, 1960 as the origin, resulting in any date before this being assigned a negative number and any date after this being assigned a positive number. The clock changes at midnight, of course. Here is a small example of three SAS dates entered as integers.
as.Date("1960-01-01") -> sas.origin
c(-12, 1, 43) -> x
as.Date(
x, origin = sas.origin
-> y
)
y#> [1] "1959-12-20" "1960-01-02" "1960-02-13"
Often SAS date fields are exported as characters, looking something like 04aug2004
and if they have time appended to the, as 04aug2004:21:35:09
. If dates can be stored and exported as integers so can time. Recovering the underlying dates and times is not difficult.
16287 -> sas.date
"04AUG2004:21:35:09" -> sas.datetime
as.Date(
sas.date, origin = sas.origin
-> sas.d
) str(sas.d)
#> Date[1:1], format: "2004-08-04"
as.Date(
sas.datetime, format = "%d%b%Y:%H:%M:%S"
-> sas.dt
) str(sas.dt)
#> Date[1:1], format: "2004-08-04"
Alternatively, you could try to do the following when working with dates as integers:
16287 -> sas.date
class(sas.date) = "Date"
sas.date#> [1] "2014-08-05"
However, the default origin date being used is R’s date of January 1, 1970, so the resulting data is 2014-08-05
and not what you see for the same integer with SAS origin date. SPSS has a different origin date – October 14, 1582
– the date of the start of the Julian calendar. Stata uses January 1, 1960
as the start date. Excel (inappropriately named) has differing origin dates:
All versions of Excel for Windows calculate dates based on the January 1, 1900 date system. Excel 2008 for Mac and earlier Excel for Mac versions calculate dates based on the January 1, 1904 date system. Excel 2016 for Mac and Excel for Mac 2011 use the January 1, 1900 date system, which guarantees date compatibility with Excel for Windows.
4.1.2 Extracting date elements
If you want to extract the year, month, day, or figure out what day of the week it is, what day of the month, the quarter, seconds lapsed, etc., these actions are easy to carry out in base R as well.
date() -> today
str(today)
#> chr "Thu Jan 13 19:51:37 2022"
as.Date(
today, format = "%a %b %d %H:%M:%S %Y"
-> today.date
) str(today.date)
#> Date[1:1], format: "2022-01-13"
format(
today.date, "%Y"
-> today.y
) # year
today.y #> [1] "2022"
format(
today.date, "%b"
-> today.m1
) # month
today.m1 #> [1] "Jan"
format(
today.date, "%B"
-> today.m2
) # month
today.m2 #> [1] "January"
format(
today.date, "%d"
-> today.d1
) # day
today.d1 #> [1] "13"
format(
today.date, "%a"
-> today.d2
) # day
today.d2 #> [1] "Thu"
format(
today.date, "%A"
-> today.d3
) # day
today.d3 #> [1] "Thursday"
You could also do:
weekdays(
today.date,abbreviate = FALSE
-> today.d4
) # day
today.d4 #> [1] "Thursday"
months(
today.date,abbreviate = FALSE
-> today.m3
) # month
today.m3 #> [1] "January"
quarters(
today.date,abbreviate = FALSE
-> today.q
) # quarter
today.q #> [1] "Q1"
What about converting today to its Julian calendar value, without specifying origin (which results in R using 1970-01-01 as the origin)?
julian(
today.date-> today.julian
)
today.julian#> [1] 19005
#> attr(,"origin")
#> [1] "1970-01-01"
4.1.3 Calculations with dates and generating dates
Once we have an object or variable stored in the correct Date format it is easy to calculate anything we might need to. For example, and starting simply, what if I want to know the date 30 days from today? 79 days ago?
+ 30
today.date #> [1] "2022-02-12"
- 79
today.date #> [1] "2021-10-26"
We could take two dates and then calculate the length of time lapsed between these two dates.
as.Date(
"2017-04-28"
-> date1
) = today.date
date2 = (date2 - date1)
lapsed.time
lapsed.time#> Time difference of 1721 days
Say I want to create a vector of dates that starts and ends on specific dates, and the step function is 1 day, a week, 4 months, etc. The step is indicated with the by = ""
command
seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "day"
-> my.dates1
) seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "week"
-> my.dates2
) seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "month"
-> my.dates3
) seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "3 days"
-> my.dates4
) seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "2 weeks"
-> my.dates5
) seq(
from = as.Date("2017-12-17"),
to = as.Date("2018-12-16"),
by = "4 months"
-> my.dates6
) seq(
from = as.Date("2017-12-17"),
to = as.Date("2019-12-16"),
by = "year"
-> my.dates7
) seq(
from = as.Date("2017-12-17"),
to = as.Date("2022-12-16"),
by = "2 years"
-> my.dates8 )
4.1.4 Dates in the economics dataframe
As a form of practice, let us load the economics
dataframe and work with the date fields in that dataframe. The descriptions of the variables follows:
Variable | Description | Source |
---|---|---|
date | Month of data collection | None |
psavert | personal savings rate | http://research.stlouisfed.org/fred2/series/PSAVERT/ |
pce | personal consumption expenditures, in billions of dollars | http://research.stlouisfed.org/fred2/series/PCE |
unemploy | number of unemployed in thousands | http://research.stlouisfed.org/fred2/series/UNEMPLOY |
uempmed | median duration of unemployment, in weeks | http://research.stlouisfed.org/fred2/series/UEMPMED |
pop | total population, in thousands | http://research.stlouisfed.org/fred2/series/POP |
Now, with this dataframe, I’d like to create indicators for year, month, and quarter.
library(ggplot2)
data(economics)
names(economics)
#> [1] "date" "pce" "pop" "psavert" "uempmed" "unemploy"
$year = format(economics$date, "%Y")
economics$month = format(economics$date, "%B")
economics$quarter = quarters(economics$date, abbreviate = FALSE) economics
4.2 Using {lubridate}
Things are somehow easier to do with this package, largely because the functions seem more intuitive (at least to me) and the package does a lot of things really well. Below we start with some simple explorations of basic ways in which {lubridate}
will interpret commands about dates and times.
= "20171217"
today1 = "2017-12-17"
today2 = "2017 December 17"
today3 = "20171217143241"
today4 = "2017 December 17 14:32:41"
today5 = "December 17 2017 14:32:41"
today6 = "17-Dec, 2017 14:32:41"
today7
library(lubridate)
ymd(today1)
#> [1] "2017-12-17"
ymd(today2)
#> [1] "2017-12-17"
ymd(today3)
#> [1] "2017-12-17"
ymd_hms(today4)
#> [1] "2017-12-17 14:32:41 UTC"
ymd_hms(today5)
#> [1] "2017-12-17 14:32:41 UTC"
mdy_hms(today6)
#> [1] "2017-12-17 14:32:41 UTC"
dmy_hms(today7)
#> [1] "2017-12-17 14:32:41 UTC"
Now, the default time zone is UTC – Coordinated Universal Time – “is the basis for civil time today. This 24-hour time standard is kept using highly precise atomic clocks combined with the Earth’s rotation.” Why UTC when it should be CUT?. So if you have data measured in a different time zone you will have to specify it thus:
dmy_hms(today7, tz = "EST")
#> [1] "2017-12-17 14:32:41 EST"
We can also use it in some surprising ways. For example, say I am flying out of London in 10 days. What is the current time in London? What is my departure time?
now("Europe/London") -> now.london
now.london#> [1] "2022-01-14 00:51:37 GMT"
+ days(10) -> my.departure
now.london
my.departure#> [1] "2022-01-24 00:51:37 GMT"
Hang on a second, the boarding pass shows departure time to be at 11:41 AM. Ah, easily corrected:
update(
hours = 11, minutes = 41, seconds = 0
my.departure, -> corrected.departure
)
corrected.departure#> [1] "2022-01-24 11:41:00 GMT"
4.2.1 Extracting date elements
This is ridiculously easy as well. I’ll stick with the running example of today
and show you the extractor commands.
dmy_hms(today7) -> today
year(today) -> today.y; today.y
#> [1] 2017
month(today) -> today.m1; today.m1
#> [1] 12
month(
today, label = TRUE,
abbr = TRUE
-> today.m2; today.m2
) #> [1] Dec
#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec
month(
today, label = TRUE,
abbr = FALSE
-> today.m3; today.m3
) #> [1] December
#> 12 Levels: January < February < March < April < May < June < July < ... < December
week(today) -> today.w; today.w
#> [1] 51
yday(today) -> today.doy; today.doy
#> [1] 351
mday(today) -> today.dom; today.dom
#> [1] 17
wday(today) -> today.dow1; today.dow1
#> [1] 1
wday(
today, label = TRUE,
abbr = TRUE
-> today.dow2; today.dow2
) #> [1] Sun
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
wday(
today, label = TRUE,
abbr = FALSE
-> today.dow3; today.dow3
) #> [1] Sunday
#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
hour(today) -> today.h; today.h
#> [1] 14
minute(today) -> today.m; today.m
#> [1] 32
second(today) -> today.s; today.s
#> [1] 41
tz(today) -> today.tz; today.tz
#> [1] "UTC"
Note that when label = TRUE
you get the days of the weeks and the months not only labelled but ordered as well. You can set the abbreviated labels off by specifying abbr = FALSE
(same as abbreviate = FALSE
).
4.2.2 Creating Dates and Times
To see the {lubridate}
package in action I’ll read in the on time arrival data in the nycflights13
package and create a dataframe that only has year, month, day, hour, and minute.
library(nycflights13)
data(flights)
names(flights)
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier" "flight" "tailnum"
#> [13] "origin" "dest" "air_time" "distance"
#> [17] "hour" "minute" "time_hour"
library(dplyr)
%>%
flights select(
year, month, day, hour, minute-> df
) %>%
df mutate(
dep.date = make_date(year, month, day),
dep.datetime = make_datetime(year, month, day, hour, minute)
-> df )
Note the difference between make_date()
and make_datetime()
.
Now, if you look at dep_time
and sched_dep_time
etc you will see they are in a strange format – 517, 533, etc. These are departure times in the local time zone, represented without the hour-minute separator. This can be easily fixed prior to creating appropriate date-time variables for actual versus scheduled departure, and likewise for actual versus scheduled arrival.
<- function(year, month, day, time) {
make_datetime_100 make_datetime(year, month, day, time %/% 100, time %% 100)
# Hadley's function to use modulus arithmetic to pull out the hour and minute components
}
library(dplyr)
%>%
flights filter(
!is.na(dep_time), !is.na(arr_time)
%>%
) mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
%>%
) select(
ends_with("delay"), ends_with("time")
origin, dest, -> df2 )
You could have done the same thing with:
require(dplyr)
%>%
flights filter(
!is.na(dep_time), !is.na(arr_time)
%>%
) mutate(
dep_time = make_datetime(year, month, day, dep_time %/% 100, dep_time %% 100),
arr_time = make_datetime(year, month, day, arr_time %/% 100, arr_time %% 100),
sched_dep_time = make_datetime(year, month, day, sched_dep_time %/% 100, sched_dep_time %% 100),
sched_arr_time = make_datetime(year, month, day, sched_arr_time %/% 100, sched_arr_time %% 100)
%>%
) select(
ends_with("delay"), ends_with("time")
origin, dest, -> df3 )
4.2.3 Intervals, durations and periods
You may think it would be straightforward to calculate the length of time between two dates, whether length is being measured in days, years, hours, weeks, or what have you. However, it is trickier than we realize because you have to worry about daylight savings time, leap years, and leap seconds. Consequently, “At different moments in time, the lengths of months, weeks, days, hours, and even minutes will also vary. We can consider these to be relative units of time; their length is relative to when they occur. In contrast, seconds always have a consistent length. Hence, seconds are exact units of time. lubridate authors” The authors of lubridate have come up with ways that allow you to calculate both exact and relative lengths of time, respectively.
Instants
are specific moments in time, such as now, when I am writing this chapter. Indeed, every time you read in and store a date (with our without time), you have created what {lubridate}
calls an instant
.
Timespans
, however, are measured as intervals
, durations
, or periods
. An interval
captures the timespan between two instants as, for example, in the following instance.
= now(tz = "UTC"); now
now #> [1] "2022-01-14 00:51:43 UTC"
= ymd_hms("2017-12-17 15:41:11"); dogbarked
dogbarked #> [1] "2017-12-17 15:41:11 UTC"
= interval(now, dogbarked); span
span #> [1] 2022-01-14 00:51:43 UTC--2017-12-17 15:41:11 UTC
= interval(ymd("1980-01-01"), ymd("2014-09-18"))
int time_length(int, "week")
#> [1] 1811
time_length(int, "month")
#> [1] 416.6
time_length(int, "year")
#> [1] 34.71
time_length(int, "hour")
#> [1] 304296
time_length(int, "minute")
#> [1] 18257760
time_length(int, "second")
#> [1] 1.095e+09
If we now look at this span of time in seconds we will have an exact interval, and {lubridate}
refers to an exact interval as a duration
.
If we measure the timespan in minutes, hours, days, etc then we have a relative span of time and {lubridate}
calls this a period
. So for span
created earlier, we have
as.duration(span)
#> [1] "-128596232.855446s (~-4.07 years)"
as.duration(int)
#> [1] "1095465600s (~34.71 years)"
… a measurement of roughly 5.51 hours but the precise span is 19825.8998761177 seconds
.
As it turns out, {lubridate}
has the following, often incredibly useful, helper functions …
= 6
x dseconds(x)
#> [1] "6s"
dminutes(x)
#> [1] "360s (~6 minutes)"
dhours(x)
#> [1] "21600s (~6 hours)"
ddays(x)
#> [1] "518400s (~6 days)"
dweeks(x)
#> [1] "3628800s (~6 weeks)"
dyears(x)
#> [1] "189345600s (~6 years)"
Note how the same value of x = 6
is being converted into specific durations depending upon whether you ask for dminutes
or dweeks
.
In contrast, period
objects can be created as follows:
= 6
x minutes(x)
#> [1] "6M 0S"
hours(x)
#> [1] "6H 0M 0S"
days(x)
#> [1] "6d 0H 0M 0S"
weeks(x)
#> [1] "42d 0H 0M 0S"
months(x)
#> [1] "6m 0d 0H 0M 0S"
years(x)
#> [1] "6y 0m 0d 0H 0M 0S"
You can also generate a period object in seconds()
but this is then no different from a duration and hence does not really add any value here.
When do these prove to be useful? For me, they come in handy when flipping start and end dates for some episodes or spells into a measure of how long the subject was in an episode or spell. Then, if I have start and end dates for each subject I can calculate the length of stay in days, weeks, months, years, etc. I can also do this given date-of-birth for subjects but not their age.
c(
"2012-08-21", "2012-09-01", "2012-08-15", "2012-09-18"
-> start
) c(
"2012-09-16", "2012-09-06", "2012-08-22", "2012-10-11"
-> end
) interval(
start, end-> elapsed.time
) as.duration(elapsed.time) / dweeks(1)
#> [1] 3.7143 0.7143 1.0000 3.2857
as.duration(elapsed.time) / ddays(1)
#> [1] 26 5 7 23
as.duration(elapsed.time) / dminutes(1)
#> [1] 37440 7200 10080 33120
as.duration(elapsed.time) / dyears(1)
#> [1] 0.07118 0.01369 0.01916 0.06297
There is a lot more we could do with dates and times but the ground we have covered thus far should be sufficient for much of the work I end up doing.
4.3 Exercises
4.3.1 An Example
Here is a real-world example of how lubridate
comes in handy. The data you will see are simulated and reflect typical fields since for children in the state’s foster care system. Each child is removed
from a home on a given date. Once they are placed, the placement begins
and during a placement a child can leave
(with or without authorization) and then will likely return
to the same placement. The child will also, eventually, be discharged
from the state foster care system. Now, assume you have the following data.
load("./data/example_dates.RData")
Calculate (i) the median age at entry, and (ii) the median length of stay (defined here as the duration of a placement with start and end dates).
Note the flexibility here; the package parses the strings and so long as you indicate whether the format is year then month then day
(ymd)
or month then day then year(mdy)
, etc., and if time in hours minutes and seconds should be read as well(hms)
, you need worry about nothing else.