In this tutorial, we will learn about the most useful date manipulation functions in Spark in detail.
DateTime functions are tricky but important, irrespective of language or framework. In this blog post, we review the DateTime functions available in Apache Spark.
PySpark and Spark SQL provide many built-in functions. The date and time functions are useful when you are working with a DataFrame that stores date and time values. Sometimes you may be tempted to write a UDF (user-defined function), only to realize later that it would have been better to check the documentation, because the function already exists. In this article, we will cover the most important Spark SQL date functions with examples.
Spark SQL Date Functions
Spark SQL date functions are both user- and performance-friendly. Spark SQL supports almost all of the date functions that are supported in Apache Hive. Listed below are the Spark SQL date functions that can be used to manipulate DataFrame columns containing date and timestamp values; the list covers pretty much all date functions supported in Apache Spark.
Setup
Here are a few key setup steps before trying the examples further down.
Import Functions in PySpark
from pyspark.sql.functions import *
Create Sample DataFrame
Let’s try to create a sample DataFrame so that we can use it for the rest of this blog to understand the various DateTime functions.
emp = [(1, "AAA", "dept1", 1000, "2019-02-01 15:12:13"),
(2, "BBB", "dept1", 1100, "2018-04-01 5:12:3"),
(3, "CCC", "dept1", 3000, "2017-06-05 1:2:13"),
(4, "DDD", "dept1", 1500, "2019-08-10 10:52:53"),
(5, "EEE", "dept2", 8000, "2016-01-11 5:52:43"),
(6, "FFF", "dept2", 7200, "2015-04-14 19:32:33"),
(7, "GGG", "dept3", 7100, "2019-02-21 15:42:43"),
(8, "HHH", "dept3", 3700, "2016-09-25 15:32:33"),
(9, "III", "dept3", 4500, "2017-10-15 15:22:23"),
(10, "JJJ", "dept5", 3400, "2018-12-17 15:14:17")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary", "date"])
add_months
This function adds months to a date, returning a new date the given number of months after the start date. In the statement below, we add 1 month to the column “date” and generate a new column “next_month”.
df = (empdf
.select("date")
.withColumn("next_month", add_months("date", 1)))
df.show(2)
Output
+-------------------+--------------+
| date | next_month|
+-------------------+--------------+
|2019-08-10 10:52:53| 2019-09-10|
| 2018-04-01 5:12:3| 2018-05-01|
+-------------------+--------------+
current_date
This function returns the current date.
df = (empdf
.withColumn("current_date", current_date())
.select("id", "current_date"))
df.show(2)
Output
+---+------------+
| id|current_date|
+---+------------+
| 5| 2019-10-07|
| 10| 2019-10-07|
+---+------------+
current_timestamp
This function returns the current timestamp.
df = (empdf
.withColumn("current_timestamp", current_timestamp())
.select("id", "current_timestamp"))
df.show(2,False)
Output
+---+-----------------------+
|id |current_timestamp |
+---+-----------------------+
|1 |2019-10-07 14:40:42.471|
|2 |2019-10-07 14:40:42.471|
+---+-----------------------+
date_add
This function adds days to a date, returning the result in a new column as “next_date”. E.g. adding 5 days to the date 1st Feb 2019 returns 6th Feb 2019.
df = (empdf
.select("date")
.withColumn("next_date", date_add("date", 5)))
df.show(2)
Output
+-------------------+----------+
| date| next_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-06|
| 2018-04-01 5:12:3|2018-04-06|
+-------------------+----------+
date_format
This function will convert the date to the specified format. For example, we can convert the date from “yyyy-MM-dd” to “dd/MM/yyyy” format.
df = (empdf
.select("date")
.withColumn("new_date", date_format("date", "dd/MM/yyyy")))
df.show(2)
Output
+-------------------+----------+
| date| new_date|
+-------------------+----------+
|2019-02-01 15:12:13|01/02/2019|
| 2018-04-01 5:12:3|01/04/2018|
+-------------------+----------+
date_sub
This function subtracts days from a date; it is the opposite of date_add. In the example below, it returns a date that is 5 days earlier in a column “new_date”. For example, the date 1st Feb 2019 returns 27th Jan 2019.
df = (empdf
.select("date")
.withColumn("new_date", date_sub("date", 5)))
df.show(2)
Output
+-------------------+----------+
| date| new_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-01-27|
| 2018-04-01 5:12:3|2018-03-27|
+-------------------+----------+
date_trunc
This function returns a timestamp truncated to the specified unit. It could be a year, month, day, hour, minute, second, week or quarter.
Let’s truncate the date by year. We can use “year”, “yyyy” or “yy” to specify the unit. For timestamp “2019-02-01 15:12:13”, truncating by year returns “2019-01-01 00:00:00”.
df = (empdf
.select("date")
.withColumn("new_date", date_trunc("year", "date")))
df.show(2)
Output
+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-01-01 00:00:00|
| 2018-04-01 5:12:3|2018-01-01 00:00:00|
+-------------------+-------------------+
Let’s truncate the date by month. We can use “month”, “mon” or “mm” to specify the unit. For timestamp “2019-02-01 15:12:13”, truncating by month returns “2019-02-01 00:00:00”.
df = (empdf
.select("date")
.withColumn("new_date", date_trunc("month", "date")))
df.show(2)
Output
+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 00:00:00|
| 2018-04-01 5:12:3|2018-04-01 00:00:00|
+-------------------+-------------------+
Let’s truncate the date by day. We can use “day” or “dd” to specify the unit. For timestamp “2019-02-01 15:12:13”, truncating by day returns “2019-02-01 00:00:00”.
df = (empdf
.select("date")
.withColumn("new_date", date_trunc("day", "date")))
df.show(2)
Output
+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 00:00:00|
| 2018-04-01 5:12:3|2018-04-01 00:00:00|
+-------------------+-------------------+
We can truncate the date at many different levels using these formats: ‘year’, ‘yyyy’, ‘yy’, ‘month’, ‘mon’, ‘mm’, ‘day’, ‘dd’, ‘hour’, ‘minute’, ‘second’, ‘week’, ‘quarter’.
datediff
This function returns the difference between two dates in days. Let’s add another column holding the current date and then take the difference between “current_date” and “date”.
Output
+-------------------+------------+---------+
| date|current_date|date_diff|
+-------------------+------------+---------+
|2019-02-01 15:12:13| 2019-10-07| 248|
| 2018-04-01 5:12:3| 2019-10-07| 554|
+-------------------+------------+---------+
dayofmonth
This function returns the day of the month. For 5th Jan 2019 (2019-01-05) it will return 5.
df = (empdf
.select("date")
.withColumn("dayofmonth", dayofmonth("date")))
df.show(2)
Output
+-------------------+----------+
| date|dayofmonth|
+-------------------+----------+
|2019-02-01 15:12:13| 1|
| 2018-04-01 5:12:3| 1|
+-------------------+----------+
dayofweek
This function returns the day of the week as an integer. It will consider Sunday as 1st and Saturday as 7th. For 1st Feb 2019 (2019-02-01) which is Friday, it will return 6. Similarly, for 1st April 2018 (2018-04-01) which is Sunday, it will return 1.
df = (empdf
.select("date")
.withColumn("dayofweek", dayofweek("date")))
df.show(2)
Output
+-------------------+---------+
| date|dayofweek|
+-------------------+---------+
|2019-02-01 15:12:13| 6|
| 2018-04-01 5:12:3| 1|
+-------------------+---------+
dayofyear
This function returns the day of the year as an integer. For 1st Feb it will return 32 (31 days of Jan + 1 day of Feb). For 1st April 2018, it will return 91 (31 days of Jan + 28 days of Feb (2018 is a non-leap year) + 31 days of Mar + 1 day of April).
df = (empdf
.select("date")
.withColumn("dayofyear", dayofyear("date")))
df.show(2)
Output
+-------------------+---------+
| date|dayofyear|
+-------------------+---------+
|2019-02-01 15:12:13| 32|
| 2018-04-01 5:12:3| 91|
+-------------------+---------+
from_utc_timestamp
This function converts UTC timestamps to timestamps in any specified timezone. By default, it assumes the given date is a UTC timestamp.
Let’s convert a UTC timestamp to “PST” time.
df = (empdf
.select("date")
.withColumn("pst_timestamp", from_utc_timestamp("date", "PST")))
df.show(2)
Output
+-------------------+-------------------+
| date| pst_timestamp|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 07:12:13|
| 2018-04-01 5:12:3|2018-03-31 22:12:03|
+-------------------+-------------------+
unix_timestamp
This function converts a timestamp or a date string in the given format to a Unix timestamp (in seconds). The default format is ‘yyyy-MM-dd HH:mm:ss’. (You can use “spark.sql.session.timeZone” to set the timezone.)
df = (empdf
.select("date")
.withColumn("unix_timestamp", unix_timestamp("date", "yyyy-MM-dd HH:mm:ss")))
df.show(2)
Output
+-------------------+--------------+
| date|unix_timestamp|
+-------------------+--------------+
|2019-02-01 15:12:13| 1549033933|
| 2018-04-01 5:12:3| 1522559523|
+-------------------+--------------+
from_unixtime
This function converts a Unix timestamp (in seconds) back to a timestamp string in the given format; the default format is ‘yyyy-MM-dd HH:mm:ss’. (You can use “spark.sql.session.timeZone” to set the timezone.) For demonstration purposes, we convert the timestamp to a Unix timestamp and then convert it back to a timestamp.
df = (empdf
.select("date")
# Convert timestamp to unix timestamp.
.withColumn("unix_timestamp", unix_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
# Convert unix timestamp to timestamp.
.withColumn("date_from_unixtime", from_unixtime("unix_timestamp")))
df.show(2)
Output
+-------------------+--------------+-------------------+
| date|unix_timestamp| date_from_unixtime|
+-------------------+--------------+-------------------+
|2019-02-01 15:12:13| 1549033933|2019-02-01 15:12:13|
| 2018-04-01 5:12:3| 1522559523|2018-04-01 05:12:03|
+-------------------+--------------+-------------------+
hour
This function will return the hour part of the date.
df = (empdf
.select("date")
.withColumn("hour", hour("date")))
df.show(2)
Output
+-------------------+----+
| date|hour|
+-------------------+----+
|2019-02-01 15:12:13| 15|
| 2018-04-01 5:12:3| 5|
+-------------------+----+
last_day
This function will return the last date of the month for a given date. For 5th Jan 2019, it will return 31st Jan 2019, since this is the last date for the month.
df = empdf.select("date").withColumn("last_date", last_day("date"))
df.show(2)
Output
+-------------------+----------+
| date| last_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-28|
| 2018-04-01 5:12:3|2018-04-30|
+-------------------+----------+
minute
This function will return the minute part of the date.
df = (empdf
.select("date")
.withColumn("minute", minute("date")))
df.show(2)
Output
+-------------------+------+
| date|minute|
+-------------------+------+
|2019-02-01 15:12:13| 12|
| 2018-04-01 5:12:3| 12|
+-------------------+------+
month
This function will return the month part of the date.
df = (empdf
.select("date")
.withColumn("month", month("date")))
df.show(2)
Output
+-------------------+-----+
| date|month|
+-------------------+-----+
|2019-02-01 15:12:13| 2|
| 2018-04-01 5:12:3| 4|
+-------------------+-----+
months_between
This function returns the difference between dates in terms of months. If the first date is later than the second one, the result is positive; otherwise it is negative. For example, between 6th Feb 2019 and 5th Jan 2019 it returns approximately 1 (the result is fractional, as the output below shows).
df = (empdf
.select("date")
# Add another date column as current date.
.withColumn("current_date", current_date())
# Take the difference between current_date and date column in terms of months.
.withColumn("months_between", months_between("current_date", "date")))
df.show(2)
Note: from Spark 2.4.0 onwards, you can pass a third argument, “roundOff” (default True), which rounds the result off to 8 digits.
Output
+-------------------+------------+--------------+
| date|current_date|months_between|
+-------------------+------------+--------------+
|2019-02-01 15:12:13| 2019-10-09| 8.23762955|
| 2018-04-01 5:12:3| 2019-10-09| 18.25107415|
+-------------------+------------+--------------+
next_day
This function will return the next date that falls on the day of the week specified in the second argument. E.g. for 1st Feb 2019 (a Friday), asking for the next Sunday returns 3rd Feb 2019.
df = (empdf
.select("date")
.withColumn("next_day", next_day("date", "sun")))
df.show(2)
Output
+-------------------+----------+
| date| next_day|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-03|
| 2018-04-01 5:12:3|2018-04-08|
+-------------------+----------+
quarter
This function will return the quarter of the given date as an integer.
df = (empdf
.select("date")
.withColumn("quarter", quarter("date")))
df.show(2)
Output
+-------------------+-------+
| date|quarter|
+-------------------+-------+
|2019-02-01 15:12:13| 1|
| 2018-04-01 5:12:3| 2|
+-------------------+-------+
second
This function will return the second part of the date.
df = (empdf
.select("date")
.withColumn("second", second("date")))
df.show(2)
Output
+-------------------+------+
| date|second|
+-------------------+------+
|2019-02-01 15:12:13| 13|
| 2018-04-01 5:12:3| 3|
+-------------------+------+
to_date
This function will convert a string or timestamp to a date.
df = (empdf
.select("date")
.withColumn("to_date", to_date("date")))
df.show(2)
Note: Check the data types of the columns “date” and “to_date”.
If the string format is ‘yyyy-MM-dd HH:mm:ss’, we need not specify the format. Otherwise, specify the format as the second argument to the to_date function.
Output
+-------------------+----------+
| date| to_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-01|
| 2018-04-01 5:12:3|2018-04-01|
+-------------------+----------+
Here we convert a string in the format ‘dd/MM/yyyy HH:mm:ss’ to the “date” data type. Note the default format is ‘yyyy-MM-dd’.
df1 = spark.createDataFrame([('15/02/2019 10:30:00',)], ['date'])
df2 = (df1
.withColumn("new_date", to_date("date", 'dd/MM/yyyy HH:mm:ss')))
df2.show(2)
Output
+-------------------+----------+
| date| new_date|
+-------------------+----------+
|15/02/2019 10:30:00|2019-02-15|
+-------------------+----------+
to_timestamp
This function converts a string to a timestamp. Here we convert a string in the format ‘dd/MM/yyyy HH:mm:ss’ to the “timestamp” data type. The default format is ‘yyyy-MM-dd HH:mm:ss’.
df1 = spark.createDataFrame([('15/02/2019 10:30:00',)], ['date'])
df2 = (df1
.withColumn("new_date", to_timestamp("date", 'dd/MM/yyyy HH:mm:ss')))
df2.show(2)
Output
+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|15/02/2019 10:30:00|2019-02-15 10:30:00|
+-------------------+-------------------+
to_utc_timestamp
This function converts the given timestamp to a UTC timestamp. Let’s convert a “PST” timestamp to a “UTC” timestamp.
df = (empdf
.select("date")
.withColumn("utc_timestamp", to_utc_timestamp("date", "PST")))
df.show(2)
Output
+-------------------+-------------------+
| date| utc_timestamp|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 23:12:13|
| 2018-04-01 5:12:3|2018-04-01 12:12:03|
+-------------------+-------------------+
weekofyear
This function will return the week of the year for the given date.
df = (empdf
.select("date")
.withColumn("weekofyear", weekofyear("date")))
df.show(2)
Output
+-------------------+----------+
| date|weekofyear|
+-------------------+----------+
|2019-02-01 15:12:13| 5|
| 2018-04-01 5:12:3| 13|
+-------------------+----------+
year
This function will return the year part of the date.
df = (empdf
.select("date")
.withColumn("year", year("date")))
df.show(2)
Output
+-------------------+----+
| date|year|
+-------------------+----+
|2019-02-01 15:12:13|2019|
| 2018-04-01 5:12:3|2018|
+-------------------+----+