How can I store the string 2018-03-21 08:15:00 +03:00 as a TimestampType, preserving the UTC offset, in Spark?
I tried the following:
from pyspark.sql.functions import *
df = spark.createDataFrame([("2018-03-21 08:15:00 +03:00",)], ["timestamp"])
newDf = df.withColumn("newtimestamp", to_timestamp(col('timestamp'), "yyyy-MM-dd HH:mm:ss XXX"))
This prints the newtimestamp column with the value converted to UTC, i.e. 2018-03-21 05:15:00.
How can I store this string as a timestamp column in the DataFrame while preserving the offset, i.e. store the same string as a timestamp, or store it like 2018-03-21 08:15:00 +0300?
You need to format the timestamp you obtain from the conversion to the desired pattern using date_format:
newDf = df.withColumn(
    "newtimestamp",
    to_timestamp(col('timestamp'), "yyyy-MM-dd HH:mm:ss XXX")
).withColumn(
    "newtimestamp_formatted",
    date_format("newtimestamp", "yyyy-MM-dd HH:mm:ss Z")
)
newDf.show(truncate=False)
#+--------------------------+-------------------+-------------------------+
#|timestamp |newtimestamp |newtimestamp_formatted |
#+--------------------------+-------------------+-------------------------+
#|2018-03-21 08:15:00 +03:00|2018-03-21 06:15:00|2018-03-21 06:15:00 +0100|
#+--------------------------+-------------------+-------------------------+
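Note that Spark's TimestampType stores an instant and displays it in the session time zone; it cannot retain the original offset itself. If you need the offset later, one option is to keep it in a separate column. A minimal sketch (the utc_offset column name is just illustrative):
from pyspark.sql.functions import col, to_timestamp, regexp_extract

df_with_offset = df.withColumn(
    "newtimestamp",
    to_timestamp(col("timestamp"), "yyyy-MM-dd HH:mm:ss XXX")
).withColumn(
    # pull the trailing "+03:00" out of the original string
    "utc_offset",
    regexp_extract(col("timestamp"), r"([+-]\d{2}:\d{2})$", 1)
)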
Related
I'm not even sure where to start. I want to parse a column that is currently a string into a timestamp. The records look like the following:
Thu, 28 Jan 2021 02:54:17 +0000
What is the best way to parse this as a timestamp? I wasn't sure where to start, since it's not a super common way to store dates.
You could probably start from the docs Datetime Patterns for Formatting and Parsing:
import pyspark.sql.functions as F
df = spark.createDataFrame([("Thu, 28 Jan 2021 02:54:17 +0000",)], ['timestamp'])
df.withColumn(
"timestamp",
F.to_timestamp("timestamp", "E, dd MMM yyyy HH:mm:ss Z")
).show()
#+-------------------+
#| timestamp|
#+-------------------+
#|2021-01-28 02:54:17|
#+-------------------+
However, since Spark version 3.0, you can no longer use some symbols like E while parsing to timestamp:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime
formatting, e.g. date_format. They are not allowed used for datetime
parsing, e.g. to_timestamp.
You can either set the time parser to legacy:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
Or use some string functions to remove the day part from the string before using to_timestamp:
df.withColumn(
"timestamp",
F.to_timestamp(F.split("timestamp", ",")[1], " dd MMM yyyy HH:mm:ss Z")
).show()
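Note the leading space in the pattern: split leaves " 28 Jan 2021 02:54:17 +0000" as the second element. An equivalent sketch that trims instead:
df.withColumn(
    "timestamp",
    F.to_timestamp(F.trim(F.split("timestamp", ",")[1]), "dd MMM yyyy HH:mm:ss Z")
).show()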
I get null for the timestamp 27-04-2021 14:11 with this code. What mistake am I making? Why is the timestamp format string DD-MM-yyyy HH:mm not correct here?
df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df = df.select(to_timestamp(df.t, 'DD-MM-yyyy HH:mm').alias('dt'))
display(df)
D is for day of the year, and d is for day of the month.
Try this:
df = df.select(F.to_timestamp(df.t, "dd-MM-yyyy HH:mm").alias("dt"))
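Putting it together, a minimal run (output assumes the defaults):
import pyspark.sql.functions as F

df = spark.createDataFrame([('27-04-2021 14:11',)], ['t'])
df.select(F.to_timestamp(df.t, 'dd-MM-yyyy HH:mm').alias('dt')).show()
#+-------------------+
#|                 dt|
#+-------------------+
#|2021-04-27 14:11:00|
#+-------------------+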
I have a timestamp value, e.g. "2021-08-18T16:49:42.175-06:00". How can I convert this to the "2021-08-18T16:49:42.175Z" format in PySpark?
You can use the PySpark DataFrame function date_format to reformat your timestamp string to any other format.
Example:
df = df.withColumn("ts_column", date_format("ts_column", "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
date_format expects a TimestampType column so you might need to cast it Timestamp first if it currently is StringType
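For example, a minimal sketch of that cast, assuming ts_column currently holds strings like the one in the question:
from pyspark.sql.functions import col, date_format

df = df.withColumn(
    "ts_column",
    date_format(col("ts_column").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
)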
Set the timeZone to "UTC" and read only up to 23 chars.
Try below:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql(""" select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
date_format(to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)),'yyyy-MM-dd HH:mm:ss.SSSZ') as ts2 from range(1) """).show(false)
+-----------------------+----------------------------+
|ts |ts2 |
+-----------------------+----------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175+0000|
+-----------------------+----------------------------+
Note that +0000 is UTC.
If you want to get "Z", then use X:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""
with t1 ( select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)) as ts2 from range(1) )
select *, date_format(ts2,'YYYY-MM-d HH:MM:ss.SX') ts3 from t1
""").show(false)
+-----------------------+-----------------------+------------------------+
|ts |ts2 |ts3 |
+-----------------------+-----------------------+------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175|2021-08-18 16:49:42.175Z|
+-----------------------+-----------------------+------------------------+
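For comparison, the same idea through the DataFrame API (a sketch; X prints Z here only because the session time zone is UTC):
from pyspark.sql.functions import to_timestamp, date_format, substring

spark.conf.set("spark.sql.session.timeZone", "UTC")
df = spark.createDataFrame([("2021-08-18T16:49:42.175-06:00",)], ["ts_str"])
df.select(
    date_format(to_timestamp(substring("ts_str", 1, 23)), "yyyy-MM-dd HH:mm:ss.SSSX").alias("ts3")
).show(truncate=False)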
I have a PySpark DataFrame with a column that has datetime values in the format '09/19/2020 09:27:18 AM'.
I want to convert it to the first day of the month, in this format: 01-Nov-2020.
I have tried F.trunc("date_col", "month"), which results in a null date,
and
df_result = df_result.withColumn('gl_date', F.udf(lambda d: datetime.datetime.strptime(d, '%MM/%dd/%yyyy %HH:%mm:%S a').strftime('%Y/%m/1'), t.StringType())(F.col('date_col')))
The second method errors with: date format '%MM/%dd/%yyyy %HH:%mm:%S a' is not matched with '09/19/2020 09:27:18 AM'.
You can convert the column to timestamp type before calling trunc:
import pyspark.sql.functions as F
df_result2 = df_result.withColumn(
'gl_date',
F.date_format(
F.trunc(
F.to_timestamp("date_col", "MM/dd/yyyy hh:mm:ss a"),
"month"
),
"dd-MMM-yyyy"
)
)
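For example, with the sample value from the question (a hypothetical single-row run):
import pyspark.sql.functions as F

df_result = spark.createDataFrame([("09/19/2020 09:27:18 AM",)], ["date_col"])
df_result.withColumn(
    "gl_date",
    F.date_format(
        F.trunc(F.to_timestamp("date_col", "MM/dd/yyyy hh:mm:ss a"), "month"),
        "dd-MMM-yyyy"
    )
).select("gl_date").show()
#+-----------+
#|    gl_date|
#+-----------+
#|01-Sep-2020|
#+-----------+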
I have four string columns 'hour', 'day', 'month', 'year' in my data frame. I would like to create a new column fulldate in the format 'dd/MM/yyyy HH:mm'.
df2 = df1.withColumn(
    "fulldate",
    to_date(
        concat(col('day'), lit('/'), col('month'), lit('/'), col('year'), lit(' '), col('hour'), lit(':'), lit('0'), lit('0')),
        'dd/MM/yyyy HH:mm'
    )
)
but it doesn't seem to work: I'm getting the format "yyyy-MM-dd".
Am I missing something?
For Spark 3+, you can use the make_timestamp function to create a timestamp column from those columns, then use date_format to convert it to the desired date pattern:
from pyspark.sql import functions as F
df2 = df1.withColumn(
"fulldate",
F.date_format(
F.expr("make_timestamp(year, month, day, hour, 0, 0)"),
"dd/MM/yyyy HH:mm"
)
)
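Note that make_timestamp expects numeric arguments, so with four string columns you may want to cast explicitly (a sketch):
df2 = df1.withColumn(
    "fulldate",
    F.date_format(
        F.expr("make_timestamp(cast(year as int), cast(month as int), cast(day as int), cast(hour as int), 0, 0)"),
        "dd/MM/yyyy HH:mm"
    )
)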
Use date_format instead of to_date.
to_date converts a column to date type from the given format, while date_format converts a date type column to the given format.
from pyspark.sql.functions import date_format, concat, col, lit
df2 = df1.withColumn(
    "fulldate",
    date_format(
        # '-' separators so the implicit string-to-timestamp cast succeeds
        concat(col('year'), lit('-'), col('month'), lit('-'), col('day'), lit(' '), col('hour'), lit(':'), lit('00'), lit(':'), lit('00')),
        'dd/MM/yyyy HH:mm'
    )
)
For better readability, you can use format_string:
from pyspark.sql.functions import date_format, format_string, col
df2 = df1.withColumn(
    "fulldate",
    date_format(
        # %s since the columns are strings; '-' so the implicit cast to timestamp succeeds
        format_string('%s-%s-%s %s:00:00', col('year'), col('month'), col('day'), col('hour')),
        'dd/MM/yyyy HH:mm'
    )
)
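A quick sanity check with one row of made-up strings:
from pyspark.sql.functions import date_format, format_string, col

df1 = spark.createDataFrame([("2021", "5", "3", "7")], ["year", "month", "day", "hour"])
df1.withColumn(
    "fulldate",
    date_format(format_string('%s-%s-%s %s:00:00', col('year'), col('month'), col('day'), col('hour')), 'dd/MM/yyyy HH:mm')
).show()
#+----+-----+---+----+----------------+
#|year|month|day|hour|        fulldate|
#+----+-----+---+----+----------------+
#|2021|    5|  3|   7|03/05/2021 07:00|
#+----+-----+---+----+----------------+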