Pyspark date format from multiple columns - apache-spark

I have four string columns 'hour', 'day', 'month', 'year' in my data frame. I would like to create new column fulldate in format 'dd/MM/yyyy HH:mm'.
df2 = df1.withColumn("fulldate", to_date(concat(col('day'), lit('/'), col('month'), lit('/'), col('year'), lit(' '), col('hour'), lit(':'), lit('0'), lit('0')), 'dd/MM/yyyy HH:mm'))
but it doesn't seem to work. I'm getting format "yyyy-mm-dd".
Am I missing something?

For Spark 3+, you can use the make_timestamp function to build a timestamp column from those columns, then use date_format to convert it to the desired pattern:
from pyspark.sql import functions as F

df2 = df1.withColumn(
    "fulldate",
    F.date_format(
        F.expr("make_timestamp(year, month, day, hour, 0, 0)"),
        "dd/MM/yyyy HH:mm"
    )
)
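The two steps here are: build a timestamp from integer components (make_timestamp), then render it as a string (date_format). A plain-Python sketch of the same idea, where the row dict is a made-up stand-in for one row of the dataframe:

```python
from datetime import datetime

# Hypothetical row with the four string columns from the question
row = {"year": "2020", "month": "3", "day": "31", "hour": "9"}

# make_timestamp analog: build a timestamp from integer components
ts = datetime(int(row["year"]), int(row["month"]), int(row["day"]), int(row["hour"]))

# date_format analog: render the timestamp in the desired pattern
fulldate = ts.strftime("%d/%m/%Y %H:%M")  # "31/03/2020 09:00"
```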

Use date_format instead of to_date.
to_date converts a column to date type from the given format, while date_format converts a date type column to the given format.
from pyspark.sql.functions import date_format, concat, col, lit

df2 = df1.withColumn(
    "fulldate",
    date_format(
        # use '-' separators and year-month-day order so the concatenated
        # string is implicitly castable to timestamp
        concat(col('year'), lit('-'), col('month'), lit('-'), col('day'), lit(' '), col('hour'), lit(':'), lit('00'), lit(':'), lit('00')),
        'dd/MM/yyyy HH:mm'
    )
)
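The to_date/date_format distinction maps directly onto Python's strptime (parse a string into a timestamp) and strftime (render a timestamp as a string):

```python
from datetime import datetime

s = "2020-3-31 9:00:00"

# to_date / to_timestamp analog: string -> timestamp object (no display format attached)
ts = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

# date_format analog: timestamp object -> string in the requested pattern
formatted = ts.strftime("%d/%m/%Y %H:%M")  # "31/03/2020 09:00"
```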
For better readability, you can use format_string:
from pyspark.sql.functions import date_format, format_string, col

df2 = df1.withColumn(
    "fulldate",
    date_format(
        # %s works for string-typed columns; use %d only if they are integers
        format_string('%s-%s-%s %s:00:00', col('year'), col('month'), col('day'), col('hour')),
        'dd/MM/yyyy HH:mm'
    )
)

Change the timestamp from UTC to given format in Pyspark

I have a timestamp value, e.g. "2021-08-18T16:49:42.175-06:00". How can I convert this to the "2021-08-18T16:49:42.175Z" format in PySpark?
You can use the PySpark DataFrame function date_format to reformat your timestamp to any other format.
Example:
df = df.withColumn("ts_column", date_format("ts_column", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
date_format expects a TimestampType column, so you may need to cast it to timestamp first if it is currently a StringType.
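The same reformatting can be sketched with the Python stdlib. Note that a genuine conversion to UTC shifts the wall-clock time (16:49 at -06:00 becomes 22:49 in UTC); simply relabeling the local time with "Z" would denote a different instant:

```python
from datetime import datetime, timezone

s = "2021-08-18T16:49:42.175-06:00"
ts = datetime.fromisoformat(s)            # offset-aware datetime at -06:00
utc = ts.astimezone(timezone.utc)         # shift to UTC: 22:49:42.175
result = utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
# -> "2021-08-18T22:49:42.175Z"
```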
Set the timeZone to "UTC" and read only up to 23 chars.
Try below:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
    date_format(to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)),'yyyy-MM-dd HH:mm:ss.SSSZ') as ts2 from range(1)""").show(truncate=False)
+-----------------------+----------------------------+
|ts |ts2 |
+-----------------------+----------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175+0000|
+-----------------------+----------------------------+
Note that +0000 is UTC.
If you want to get "Z", use the X pattern letter instead:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""
with t1 ( select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
    to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)) as ts2 from range(1) )
select *, date_format(ts2,'yyyy-MM-dd HH:mm:ss.SSSX') ts3 from t1
""").show(truncate=False)
+-----------------------+-----------------------+------------------------+
|ts                     |ts2                    |ts3                     |
+-----------------------+-----------------------+------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175|2021-08-18 16:49:42.175Z|
+-----------------------+-----------------------+------------------------+
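For reference, the stdlib draws the same distinction between a numeric offset and the "Z" designator:

```python
from datetime import datetime, timezone

ts = datetime(2021, 8, 18, 16, 49, 42, 175000, tzinfo=timezone.utc)

# numeric offset, like Spark's Z pattern letter -> "+0000"
numeric = ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] + ts.strftime("%z")
# -> "2021-08-18 16:49:42.175+0000"

# "Z" designator, like Spark's X pattern letter at zero offset
zulu = ts.isoformat(timespec="milliseconds").replace("+00:00", "Z")
# -> "2021-08-18T16:49:42.175Z"
```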

Transform a day of the year to day_month format

So I was wondering if it's possible with PySpark to transform a random day of the year (0-365) to day-month format. In my case, the input would be a string.
Example:
Input : "091"
Expected output (month-day): "0331"
This is possible, but you need the year as well. Convert the year to a date (January 1st of that year), add the days to get the desired result, then format it.
Here's a working example
from pyspark.sql import functions as F

df = spark.createDataFrame([("2020", "091")], ["year", "day_of_year"])

df1 = df.withColumn(
    "first_day_year",
    F.concat_ws("-", "year", F.lit("01"), F.lit("01"))
).withColumn(
    "day_month",
    F.date_format(
        F.expr("date_add(first_day_year, cast(day_of_year as int) - 1)"),
        "MMdd"
    )
).drop("first_day_year")

df1.show()
df1.show()
#+----+-----------+---------+
#|year|day_of_year|day_month|
#+----+-----------+---------+
#|2020| 091| 0331|
#+----+-----------+---------+
You can use date_add to add the number of days to New Year's Day.
import pyspark.sql.functions as F

df2 = df.withColumn(
    'day_month',
    F.expr("date_format(date_add('2020-01-01', int(day - 1)), 'MMdd')")
)
df2.show()
df2.show()
+---+---------+
|day|day_month|
+---+---------+
| 91| 0331|
+---+---------+
Note that the result will vary depending on whether it's a leap year or not.
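The same arithmetic can be checked with the stdlib, which also makes the leap-year effect visible (the helper name is made up for illustration):

```python
from datetime import date, timedelta

def day_month(year, day_of_year):
    # Jan 1 of the given year plus (day_of_year - 1) days, formatted as MMdd
    return (date(int(year), 1, 1) + timedelta(days=int(day_of_year) - 1)).strftime("%m%d")

day_month("2020", "091")  # "0331" -- 2020 is a leap year
day_month("2021", "091")  # "0401" -- 2021 is not
```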

How to convert a datetime column to firstday of month?

I have a PySpark dataframe with column which has datetime values in the format '09/19/2020 09:27:18 AM'
I want to convert to first day of month 01-Nov-2020 in this format.
I have tried F.trunc("date_col", "month"), which is resulting in a null date,
and
df_result = df_result.withColumn('gl_date', F.udf(lambda d: datetime.datetime.strptime(d, '%MM/%dd/%yyyy %HH:%mm:%S a').strftime('%Y/%m/1'), t.StringType())(F.col('date_col')))
The second method I tried errors with: date format '%MM/%dd/%yyyy %HH:%mm:%S a' is not matched with '09/19/2020 09:27:18 AM'.
You can convert the column to timestamp type before calling trunc:
import pyspark.sql.functions as F

df_result2 = df_result.withColumn(
    'gl_date',
    F.date_format(
        F.trunc(
            F.to_timestamp("date_col", "MM/dd/yyyy hh:mm:ss a"),
            "month"
        ),
        "dd-MMM-yyyy"
    )
)
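As a sanity check, the same parse/truncate/format pipeline in plain Python (strptime ≈ to_timestamp, replace(day=1) ≈ trunc to month, strftime ≈ date_format):

```python
from datetime import datetime

s = "09/19/2020 09:27:18 AM"
ts = datetime.strptime(s, "%m/%d/%Y %I:%M:%S %p")  # parse 12-hour clock with AM/PM
result = ts.replace(day=1).strftime("%d-%b-%Y")    # truncate to first of month, then format
# -> "01-Sep-2020"
```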

How to convert all the date format to a timestamp for date column?

I am using PySpark version 3.0.1. I am reading a csv file as a PySpark dataframe that has 2 date columns, but when I print the schema, both columns come out as string type.
(A screenshot of the dataframe and its schema was attached to the question.)
How can I convert the values in both date columns to timestamp format using PySpark?
Everything I have tried requires knowing the current format, but how do I convert to a proper timestamp if I am not aware of what format is coming in the csv file?
I have also tried the code below, but it creates a new column with null values:
df1 = df.withColumn('datetime', col('joining_date').cast('timestamp'))
df1.show()
df1.printSchema()
Since there are two different date formats, you need to parse using two different format strings and coalesce the results.
import pyspark.sql.functions as F
result = df.withColumn(
    'datetime',
    F.coalesce(
        F.to_timestamp('joining_date', 'MM-dd-yy'),
        F.to_timestamp('joining_date', 'MM/dd/yy')
    )
)
result.show()
+------------+-------------------+
|joining_date| datetime|
+------------+-------------------+
| 01-20-20|2020-01-20 00:00:00|
| 01/19/20|2020-01-19 00:00:00|
+------------+-------------------+
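The coalesce-over-two-formats idea can be mirrored in plain Python: try each candidate format in turn and keep the first successful parse (the function name is made up for illustration):

```python
from datetime import datetime

def parse_joining_date(s):
    # Try each candidate format; the first one that parses wins (mirrors F.coalesce)
    for fmt in ("%m-%d-%y", "%m/%d/%y"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None  # neither format matched, like coalesce over all-null columns

parse_joining_date("01-20-20")  # datetime(2020, 1, 20, 0, 0)
parse_joining_date("01/19/20")  # datetime(2020, 1, 19, 0, 0)
```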
If you want to convert all to a single format:
import pyspark.sql.functions as F
result = df.withColumn(
    'datetime',
    F.date_format(
        F.coalesce(
            F.to_timestamp('joining_date', 'MM-dd-yy'),
            F.to_timestamp('joining_date', 'MM/dd/yy')
        ),
        'MM-dd-yy'
    )
)
result.show()
+------------+--------+
|joining_date|datetime|
+------------+--------+
| 01-20-20|01-20-20|
| 01/19/20|01-19-20|
+------------+--------+

How to calculate Max(Date) and Min(Date) for DateType in pyspark dataframe?

The dataframe has a date column in string type '2017-01-01'
It is converted to DateType()
df = df.withColumn('date', col('date_string').cast(DateType()))
I would like to calculate the first and last day in the column. I tried the following, but none of it works. Can anyone give any suggestions? Thanks!
df.select('date').min()
df.select('date').max()
df.select('date').last_day()
df.select('date').first_day()
Aggregate with min and max:
from pyspark.sql.functions import min, max
df = spark.createDataFrame(
    ["2017-01-01", "2018-02-08", "2019-01-03"], "string"
).selectExpr("CAST(value AS date) AS date")
min_date, max_date = df.select(min("date"), max("date")).first()
min_date, max_date
# (datetime.date(2017, 1, 1), datetime.date(2019, 1, 3))
Alternatively, you can do it in one line:
import pyspark.sql.functions as F
df.agg(F.min("date"), F.max("date")).show()
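Outside Spark, the same aggregation is just Python's built-in min and max, since date objects compare chronologically:

```python
from datetime import date

# The same three dates as in the Spark example above
dates = [date(2017, 1, 1), date(2018, 2, 8), date(2019, 1, 3)]
min_date, max_date = min(dates), max(dates)
# -> (date(2017, 1, 1), date(2019, 1, 3))
```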
