Change the timestamp from UTC to given format in Pyspark - apache-spark

I have a timestamp value, e.g. "2021-08-18T16:49:42.175-06:00". How can I convert this to the "2021-08-18T16:49:42.175Z" format in PySpark?

You can use the PySpark DataFrame function date_format to reformat your timestamp column to any other format.
Example:
df = df.withColumn("ts_column", date_format("ts_column", "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
date_format expects a TimestampType column so you might need to cast it Timestamp first if it currently is StringType
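For example, a minimal sketch of the cast-then-format flow, assuming the column is a StringType named ts_column as above and that the session time zone is set to UTC so the X pattern letter renders a literal Z at zero offset:
from pyspark.sql.functions import col, date_format, to_timestamp

spark.conf.set("spark.sql.session.timeZone", "UTC")

# parse the offset-aware string into a TimestampType, then render the UTC instant with a literal Z
df = df.withColumn("ts_column", to_timestamp(col("ts_column"))) \
       .withColumn("ts_column", date_format("ts_column", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
Note that this shifts the wall-clock time to UTC (16:49-06:00 becomes 22:49Z); if you only want to swap the offset text for Z while keeping the local time, see the substring approach in the next answer.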

Set the timeZone to "UTC" and read only the first 23 characters (dropping the offset).
Try below:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql(""" select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
date_format(to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)),'yyyy-MM-dd HH:mm:ss.SSSZ') as ts2 from range(1) """).show(truncate=False)
+-----------------------+----------------------------+
|ts |ts2 |
+-----------------------+----------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175+0000|
+-----------------------+----------------------------+
Note that +0000 is UTC.
If you want a literal "Z" instead, use the X pattern letter:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""
with t1 as ( select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)) as ts2 from range(1) )
select *, date_format(ts2,'yyyy-MM-dd HH:mm:ss.SSSX') ts3 from t1
""").show(truncate=False)
+-----------------------+-----------------------+------------------------+
|ts |ts2 |ts3 |
+-----------------------+-----------------------+------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175|2021-08-18 16:49:42.175Z|
+-----------------------+-----------------------+------------------------+
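The same substring trick with the DataFrame API instead of SQL might look like this (a sketch, assuming the offset-aware value sits in a string column named ts):
from pyspark.sql.functions import col, date_format, substring, to_timestamp

spark.conf.set("spark.sql.session.timeZone", "UTC")
df = spark.createDataFrame([("2021-08-18T16:49:42.175-06:00",)], ["ts"])

# keep only the first 23 characters (drops the offset), parse in the UTC session zone, then format with a literal Z
df.select(
    date_format(to_timestamp(substring(col("ts"), 1, 23)), "yyyy-MM-dd'T'HH:mm:ss.SSSX").alias("ts2")
).show(truncate=False)
# expected: 2021-08-18T16:49:42.175Z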

Related

convert string with UTC offset to spark timestamp

How to store string 2018-03-21 08:15:00 +03:00 as a timestamptype, preserving the UTC offset, in spark?
I tried the below:
from pyspark.sql.functions import *
df = spark.createDataFrame([("2018-03-21 08:15:00 +03:00",)], ["timestamp"])
newDf = df.withColumn("newtimestamp", to_timestamp(col('timestamp'), "yyyy-MM-dd HH:mm:ss XXX"))
This prints the newtimestamp column with the value converted to UTC time, i.e. 2018-03-21 05:15:00.
How can I store this string as a timestamp column in the dataframe while preserving the offset, i.e. store the same string as a timestamp, or store it like 2018-03-21 08:15:00 +0300?
You need to format the timestamp you obtain from the conversion to the desired pattern using date_format:
newDf = df.withColumn(
    "newtimestamp",
    to_timestamp(col('timestamp'), "yyyy-MM-dd HH:mm:ss XXX")
).withColumn(
    "newtimestamp_formatted",
    date_format("newtimestamp", "yyyy-MM-dd HH:mm:ss Z")
)
newDf.show(truncate=False)
#+--------------------------+-------------------+-------------------------+
#|timestamp |newtimestamp |newtimestamp_formatted |
#+--------------------------+-------------------+-------------------------+
#|2018-03-21 08:15:00 +03:00|2018-03-21 06:15:00|2018-03-21 06:15:00 +0100|
#+--------------------------+-------------------+-------------------------+
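Note that the +0100 above simply reflects the session time zone of the machine the answer was run on; a Spark TimestampType does not retain the offset from the input string, so date_format always renders the offset of spark.sql.session.timeZone. A sketch of the same code with the session zone pinned, assuming rendering everything in UTC is acceptable:
spark.conf.set("spark.sql.session.timeZone", "UTC")

newDf = df.withColumn(
    "newtimestamp",
    to_timestamp(col('timestamp'), "yyyy-MM-dd HH:mm:ss XXX")
).withColumn(
    "newtimestamp_formatted",
    date_format("newtimestamp", "yyyy-MM-dd HH:mm:ss Z")
)
# expected newtimestamp_formatted: 2018-03-21 05:15:00 +0000
If the original offset itself matters, keep the raw string column alongside the parsed timestamp.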

Pyspark date format from multiple columns

I have four string columns 'hour', 'day', 'month', 'year' in my data frame. I would like to create a new column fulldate in the format 'dd/MM/yyyy HH:mm'.
df2 = df1.withColumn("fulldate", to_date(concat(col('day'), lit('/'), col('month'), lit('/'), col('year'), lit(' '), col('hour'), lit(':'), lit('0'), lit('0')), 'dd/MM/yyyy HH:mm'))
but it doesn't seem to work. I'm getting the format "yyyy-MM-dd" instead.
Am I missing something?
For Spark 3+, you can use the make_timestamp function to build a timestamp column from those columns, and date_format to convert it to the desired date pattern:
from pyspark.sql import functions as F

df2 = df1.withColumn(
    "fulldate",
    F.date_format(
        F.expr("make_timestamp(year, month, day, hour, 0, 0)"),
        "dd/MM/yyyy HH:mm"
    )
)
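A quick end-to-end sketch of that approach with a hypothetical df1 holding the four string columns (make_timestamp should cast the string arguments to numbers via Spark's implicit casts; if not, cast them explicitly):
from pyspark.sql import functions as F

# hypothetical sample row: hour=8, day=3, month=7, year=2021
df1 = spark.createDataFrame([("8", "3", "7", "2021")], ["hour", "day", "month", "year"])

df2 = df1.withColumn(
    "fulldate",
    F.date_format(
        F.expr("make_timestamp(year, month, day, hour, 0, 0)"),
        "dd/MM/yyyy HH:mm"
    )
)
df2.show()
# expected fulldate: 03/07/2021 08:00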
Use date_format instead of to_date.
to_date converts a column to date type from the given format, while date_format converts a date type column to the given format.
from pyspark.sql.functions import date_format, concat, col, lit

# build a 'y-M-d H:mm:ss' style string (dashes, so it can be cast to a timestamp), then reformat it
df2 = df1.withColumn(
    "fulldate",
    date_format(
        concat(col('year'), lit('-'), col('month'), lit('-'), col('day'), lit(' '), col('hour'), lit(':'), lit('00'), lit(':'), lit('00')),
        'dd/MM/yyyy HH:mm'
    )
)
For better readability, you can use format_string:
from pyspark.sql.functions import date_format, format_string, col

# %s because the columns are strings; dashes so the result can be cast to a timestamp
df2 = df1.withColumn(
    "fulldate",
    date_format(
        format_string('%s-%s-%s %s:00:00', col('year'), col('month'), col('day'), col('hour')),
        'dd/MM/yyyy HH:mm'
    )
)

How to convert all the date format to a timestamp for date column?

I am using PySpark version 3.0.1. I am reading a csv file as a PySpark dataframe with 2 date columns, but when I print the schema both columns come out as string type.
The attached screenshot shows the dataframe and its schema.
How can I convert the row values in both date columns to timestamp format using PySpark?
I have tried many things, but every approach requires knowing the current format in advance. How do I convert to a proper timestamp if I am not aware of what format is coming in the csv file?
I tried the below code as well, but it creates a new column with null values:
df1 = df.withColumn('datetime', col('joining_date').cast('timestamp'))
print(df1.show())
print(df1.printSchema())
Since there are two different date formats, you need to convert using two different patterns and coalesce the results.
import pyspark.sql.functions as F

result = df.withColumn(
    'datetime',
    F.coalesce(
        F.to_timestamp('joining_date', 'MM-dd-yy'),
        F.to_timestamp('joining_date', 'MM/dd/yy')
    )
)
result.show()
+------------+-------------------+
|joining_date| datetime|
+------------+-------------------+
| 01-20-20|2020-01-20 00:00:00|
| 01/19/20|2020-01-19 00:00:00|
+------------+-------------------+
If you want to convert all to a single format:
import pyspark.sql.functions as F

result = df.withColumn(
    'datetime',
    F.date_format(
        F.coalesce(
            F.to_timestamp('joining_date', 'MM-dd-yy'),
            F.to_timestamp('joining_date', 'MM/dd/yy')
        ),
        'MM-dd-yy'
    )
)
result.show()
+------------+--------+
|joining_date|datetime|
+------------+--------+
| 01-20-20|01-20-20|
| 01/19/20|01-19-20|
+------------+--------+
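One practical consequence of coalesce: under the default (non-ANSI) settings, a value that matches none of the listed patterns parses to null, which makes it easy to surface rows arriving in an unexpected format. A sketch building on the example above:
import pyspark.sql.functions as F

result = df.withColumn(
    'datetime',
    F.coalesce(
        F.to_timestamp('joining_date', 'MM-dd-yy'),
        F.to_timestamp('joining_date', 'MM/dd/yy')
    )
)

# rows whose joining_date matched neither pattern end up with datetime = null
result.filter(F.col('datetime').isNull()).show()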

Convert a string column to timestamp when read into spark

I'm trying to read a csv file into Spark with Databricks, but my time column is in string format. A time column entry looks like 2019-08-01 23:59:05-07:00, and I want to convert it into timestamp type. Here's what I tried:
df = (spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(path_to_file)
.withColumn("observed", unix_timestamp("dt", "yyyy-MM-dd hh:mm:ss.SSSZ")
.cast("double")
.cast("timestamp"))
)
But I got the error message cannot resolve '`dt`' given input columns. I'm guessing I didn't get the "yyyy-MM-dd hh:mm:ss.SSSZ" format right?
Assuming your csv looks like this:
df = spark.createDataFrame([('2019-08-01 23:59:05-07:00',)], ['dt'])
df.show()
+--------------------+
| dt|
+--------------------+
|2019-08-01 23:59:...|
+--------------------+
You can simply parse the timestamp with the to_timestamp function:
from pyspark.sql.functions import to_timestamp
df.withColumn('observed', to_timestamp('dt', "yyyy-MM-dd HH:mm:ssXXX")).show()
+--------------------+-------------------+
| dt| observed|
+--------------------+-------------------+
|2019-08-01 23:59:...|2019-08-02 08:59:05|
+--------------------+-------------------+
So, as @HristoIliev mentioned, the reason behind cannot resolve '`dt`' is that 'dt' is supposed to be the name of a column already in your dataframe, and 'observed' is supposed to be the name of a new column. Even if you adjust the names, though, it still won't work, because there is a format mismatch: "yyyy-MM-dd hh:mm:ss.SSSZ" won't parse 2019-08-01 23:59:05-07:00, but "yyyy-MM-dd HH:mm:ssXXX" will.
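Putting that together with the original read, a sketch of the corrected pipeline (assuming the CSV column really is named dt, that it comes in as a string, and that path_to_file is as in the question):
from pyspark.sql.functions import to_timestamp

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path_to_file)
      .withColumn("observed", to_timestamp("dt", "yyyy-MM-dd HH:mm:ssXXX")))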

PySpark - to_date format from column

I am currently trying to figure out how to pass the string format argument to the to_date PySpark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a','2018-01-01','yyyy-MM-dd'),
('b','2018-02-02','yyyy-MM-dd'),
('c','02-02-2018','dd-MM-yyyy')]).toDF(
["col_name","value","format"])
I am currently trying to add a new column, where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This however gives me 2 new columns, but I want a single column containing both results. Passing a column as the format to to_date does not seem to be possible:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
This throws the error "Column object is not callable".
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of Spark do not support a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr

df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
    "select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a udf (user-defined function) to apply the correct format per row. But inside a udf you cannot directly use Spark functions like to_date, so I created a little workaround. First the udf applies the Python date conversion with the appropriate format taken from the column and converts the value to an ISO format. Then another withColumn converts the ISO date to the desired format in column test3. However, you have to adapt the format strings in the original column to match the Python date-format codes, e.g. yyyy -> %Y, MM -> %m, ...
import datetime

from pyspark.sql.functions import col, to_date, udf

test_df = spark.createDataFrame([
    ('a','2018-01-01','%Y-%m-%d'),
    ('b','2018-02-02','%Y-%m-%d'),
    ('c','02-02-2018','%d-%m-%Y')
], ("col_name","value","format"))

def map_to_date(s, format):
    return datetime.datetime.strptime(s, format).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format")))\
    .withColumn("test3", to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
You don't need the format column at all. You can use coalesce to check for all possible options:
def get_right_date_format(date_string):
    from pyspark.sql import functions as F
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a','2018-01-01'),
                     ('b','2018-02-02'),
                     ('c','2018-21-02'),
                     ('d','02-02-2018')]).toDF(["col_name","value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))
The issue with this approach, though, is that a date like 2020-02-01 would be treated as 1st Feb 2020, when 2nd Jan 2020 is also possible; the order of the patterns in coalesce decides which interpretation wins.
Just an alternative approach!
