pyspark to_timestamp function doesn't convert certain timestamps - apache-spark

I would like to use the to_timestamp function to format timestamps in PySpark. How can I do this without the timezone shifting, and without certain timestamps coming back as null?
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf, to_timestamp
date_format = "yyyy-MM-dd'T'HH:mm:ss"
vals = [('2018-03-11T02:39:00Z'), ('2018-03-11T01:39:00Z'), ('2018-03-11T03:39:00Z')]
testdf = spark.createDataFrame(vals, StringType())
testdf.withColumn("to_timestamp", to_timestamp("value", date_format)).show(4, False)
+--------------------+-------------------+
|value |to_timestamp |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|null |
|2018-03-11T01:39:00Z|2018-03-11 01:39:00|
|2018-03-11T03:39:00Z|2018-03-11 03:39:00|
+--------------------+-------------------+
I expected 2018-03-11T02:39:00Z to format correctly to 2018-03-11 02:39:00
Then I switched to the default to_timestamp function.
testdf.withColumn("to_timestamp", to_timestamp("value")).show(4, False)
+--------------------+-------------------+
|value |to_timestamp |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|2018-03-10 20:39:00|
|2018-03-11T01:39:00Z|2018-03-10 19:39:00|
|2018-03-11T03:39:00Z|2018-03-10 21:39:00|
+--------------------+-------------------+

The shift in time when you call to_timestamp() with default values happens because your Spark instance's session time zone is set to your local time zone, not UTC. The null for 2018-03-11T02:39:00Z has the same root cause: 02:39 on 2018-03-11 falls inside the daylight-saving "spring forward" gap in many local time zones, so there is no such local wall-clock time to parse. You can check the current setting by running
spark.conf.get('spark.sql.session.timeZone')
If you want your timestamps to be interpreted and displayed in UTC, set the conf value:
spark.conf.set('spark.sql.session.timeZone', 'UTC')
Another important point in your code: when you define the date format as "yyyy-MM-dd'T'HH:mm:ss", you are essentially asking Spark to ignore the trailing 'Z' and parse each value as a wall-clock time in the session time zone. The proper format would be date_format = "yyyy-MM-dd'T'HH:mm:ssXXX", but it's a moot point if you are calling to_timestamp() with defaults.
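Putting the two fixes together, a minimal sketch (it reuses the testdf from the question above):
# Interpret and display timestamps in UTC, and keep the zone offset in the pattern
spark.conf.set('spark.sql.session.timeZone', 'UTC')
utc_format = "yyyy-MM-dd'T'HH:mm:ssXXX"
testdf.withColumn("to_timestamp", to_timestamp("value", utc_format)).show(4, False)
# With the session time zone set to UTC, 2018-03-11T02:39:00Z no longer falls into a
# local daylight-saving gap, so it parses to 2018-03-11 02:39:00 instead of null.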

Another option is the from_utc_timestamp function, which treats the input column value as a UTC timestamp. Note that it requires a second argument: the time zone you want the result rendered in (a placeholder zone is used here).
from pyspark.sql.functions import from_utc_timestamp
testdf.withColumn("to_timestamp", from_utc_timestamp("value", "America/New_York")).show(4, False)

Related

Unknown date data type (spark, parquet) [13 characters long]

I have a parquet file whose date column is filled with a data type I am having trouble with.
I understand that Hive and Impala tend to rebase their timestamps... However, I cannot seem to convert the values or find any pointers on how to solve this.
I have tried setting the int96RebaseModeInRead and datetimeRebaseModeInRead modes to legacy (a sketch of what that looks like is shown below).
I also tried to apply a date schema on the read operation, but to no avail.
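For reference, setting those rebase options at read time looks roughly like this (a sketch; the parquet path is a placeholder, and the exact config keys can differ slightly across Spark 3.x versions):
# Sketch: force LEGACY rebase for both INT96 and date/timestamp columns before reading
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "LEGACY")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "LEGACY")
df = spark.read.parquet("/path/to/source.parquet")  # placeholder path
df.printSchema()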
These are the documentation pages I've reviewed so far. Maybe there's a simple solution I am not seeing. Let's also assume that there's no way for me to ask the person who created the source file what the heck they did.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option
https://kontext.tech/article/1062/spark-2x-to-3x-date-timestamp-and-int96-rebase-modes
https://docs.cloudera.com/runtime/7.2.1/developing-spark-applications/topics/spark-timestamp-compatibility-parquet.html
Also, this thread is the only one I was able to find that shows how the timestamp is created but not how to reverse it. Please give me some pointers.
parquet int96 timestamp conversion to datetime/date via python
As I understand it, you are trying to cast the order_date column to DateType. If that's the case, the following code could help.
You can read the order_date column as StringType from the source file, and you should use your own time zone for the from_utc_timestamp method.
from pyspark.sql.functions import from_utc_timestamp, from_unixtime
from pyspark.sql.types import StringType
d = ['1374710400000']
df = spark.createDataFrame(d, StringType())
df.show()
# epoch milliseconds -> seconds -> formatted string, then shifted from UTC to GMT+1
df = df.withColumn('new_date', from_utc_timestamp(from_unixtime(df.value / 1000, "yyyy-MM-dd hh:mm:ss"), 'GMT+1'))
df.show()
Output:
+-------------+
| value|
+-------------+
|1374710400000|
+-------------+
+-------------+-------------------+
| value| new_date|
+-------------+-------------------+
|1374710400000|2013-07-25 13:00:00|
+-------------+-------------------+
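If a DateType column is the end goal (the question mentions order_date), a minimal follow-up sketch, assuming the 13-digit strings are epoch milliseconds:
from pyspark.sql import functions as F

df_dates = df.withColumn(
    "order_date",
    F.to_date(F.from_unixtime(F.col("value") / 1000))  # millis -> seconds -> date in the session time zone
)
df_dates.printSchema()  # order_date: date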

Casting date from string spark

I have a date column in my dataframe, stored as a string in dd/MM/yyyy format.
When I try to convert the string to a date, all the functions return null values.
I am looking to convert the column to DateType.
It looks like your date strings contain quotes; you need to remove them, using for example regexp_replace, before calling to_date:
import pyspark.sql.functions as F
df = spark.createDataFrame([("'31-12-2021'",), ("'30-11-2021'",), ("'01-01-2022'",)], ["Birth_Date"])
df = df.withColumn(
    "Birth_Date",
    F.to_date(F.regexp_replace("Birth_Date", "'", ""), "dd-MM-yyyy")
)
df.show()
#+----------+
#|Birth_Date|
#+----------+
#|2021-12-31|
#|2021-11-30|
#|2022-01-01|
#+----------+
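An alternative sketch keeps the quotes and escapes them in the format pattern instead; two consecutive single quotes denote a literal quote in Spark 3's datetime patterns (this assumes the Spark 3 parser, starting again from the original string column):
df2 = spark.createDataFrame([("'31-12-2021'",), ("'30-11-2021'",), ("'01-01-2022'",)], ["Birth_Date"])
df2 = df2.withColumn(
    "Birth_Date",
    F.to_date("Birth_Date", "''dd-MM-yyyy''")  # '' in the pattern matches the literal quote characters
)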

Parsing timestamps from string and rounding seconds in spark

I have a spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to the format yyyy-MM-dd HH:mm:ss, knowing that I have the value 20171107014824952 (which means 2017-11-07 01:48:25)?
The seconds part is formed of 5 digits; in the example above it is 24952, while the log file displayed 25, so I have to round 24.952 before applying the to_timestamp function, which is why I asked for help.
Assuming you have the following spark DataFrame:
df.show()
#+-----------------+
#| requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+
With the schema:
df.printSchema()
#root
# |-- requestTime: string (nullable = true)
You can use the techniques described in Convert pyspark string to date format to convert this to a timestamp. Since the solution is dependent on your spark version, I've created the following helper function:
import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    try:
        # For spark version 2.2 and above, to_timestamp is available
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        # For spark version 2.1 and below, you'll have to do it this way
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
Now call it on your data using the appropriate format:
df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddhhmmssSSS")
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+
Unfortunately, this truncates the timestamp instead of rounding.
Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you'll have to cast it to a double, divide by 1000, round, cast back to a long to chop off the decimal (you can't use int because the number is too big), and finally cast back to a string.
df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double") / 1000.0).cast('long').cast('string'),
        "yyyyMMddhhmmss"
    )
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+

Change the timestamp to UTC format in Pyspark

I have an input dataframe (ip_df); the data in this dataframe looks like this:
id timestamp_value
1 2017-08-01T14:30:00+05:30
2 2017-08-01T14:30:00+06:30
3 2017-08-01T14:30:00+07:30
I need to create a new dataframe (op_df) in which the timestamp values are converted to UTC. So the final output dataframe will look like this:
id timestamp_value
1 2017-08-01T09:00:00+00:00
2 2017-08-01T08:00:00+00:00
3 2017-08-01T07:00:00+00:00
I want to achieve this using PySpark. Can someone please help me with it? Any help will be appreciated.
If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.
However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.
Given the following representation of data
+---+-------------------------+
|id |timestamp_value |
+---+-------------------------+
|1 |2017-08-01T14:30:00+05:30|
|2 |2017-08-01T14:30:00+06:30|
|3 |2017-08-01T14:30:00+07:30|
+---+-------------------------+
as given by:
l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
where timestamp_value is a String, you could do the following (this uses to_timestamp and session local timezone support which were introduced in Spark 2.2):
from pyspark.sql.functions import to_timestamp, date_format
spark.conf.set('spark.sql.session.timeZone', 'UTC')
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value'))
which yields:
+------------------------+
|timestamp_value |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+
or, slightly differently:
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value'))
which yields:
+--------------------+
|timestamp_value |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+
You can use parser and tz from the dateutil library.
I assume you have strings and you want a string column:
from dateutil import parser, tz
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf
# Create UTC timezone
utc_zone = tz.gettz('UTC')
# Create UDF function that apply on the column
# It takes the String, parse it to a timestamp, convert to UTC, then convert to String again
func = udf(lambda x: parser.parse(x).astimezone(utc_zone).isoformat(), StringType())
# Create new column in your dataset
df = df.withColumn("new_timestamp",func(col("timestamp_value")))
It gives this result:
+---+-------------------------+-------------------------+
|id |timestamp_value |new_timestamp |
+---+-------------------------+-------------------------+
|1 |2017-08-01T14:30:00+05:30|2017-08-01T09:00:00+00:00|
|2 |2017-08-01T14:30:00+06:30|2017-08-01T08:00:00+00:00|
|3 |2017-08-01T14:30:00+07:30|2017-08-01T07:00:00+00:00|
+---+-------------------------+-------------------------+
Finally, you can drop and rename:
df = df.drop("timestamp_value").withColumnRenamed("new_timestamp","timestamp_value")

PySpark dataframe convert unusual string format to Timestamp

I am using PySpark through Spark 1.5.0.
I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp?
Something that can eventually come along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I would need to replace
_ with - in the date half
and _ with : in the time part.
I was thinking I could split the column in two using substring, counting backward from the end of the string, do the regexp_replace separately on each half, then concatenate. But that seems like too many operations. Is there an easier way?
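(For reference, the single-pass regexp route hinted at above could look like the sketch below; the answers that follow avoid the string surgery altogether.)
import pyspark.sql.functions as F

# Sketch of the regexp idea: fix the date half and the time half with capture
# groups, then cast the cleaned string to a timestamp.
df = spark.createDataFrame([("2016_08_21 11_31_08",)], ["datetime"])
df = df.withColumn(
    "date_time",
    F.regexp_replace(
        F.regexp_replace("datetime", r"^(\d{4})_(\d{2})_(\d{2})", "$1-$2-$3"),
        r"(\d{2})_(\d{2})_(\d{2})$", "$1:$2:$3"
    ).cast("timestamp")
)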
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))
## +-------------------+-------------------+
## |dt |parsed |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp
(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))
## +-------------------+---------------------+
## |dt |parsed |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.
zero323's answer answers the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly into timestamp type:
df.withColumn('datetime', col('datetime_str').cast('timestamp'))
It has the advantage of handling milliseconds, while unix_timestamp only has second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2 as zero323 stated). I tested it on Spark 2.3.0, using the following format: '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
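A minimal sketch of that direct cast on a millisecond-precision string (the column names here are illustrative):
from pyspark.sql.functions import col

df = spark.createDataFrame([("2016-07-13 14:33:53.979",)], ["datetime_str"])
df = df.withColumn("datetime", col("datetime_str").cast("timestamp"))
df.printSchema()  # datetime: timestamp, with the .979 milliseconds preserved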
I am adding a few more code lines to Florent F's answer, for better understanding and for running the snippet on a local machine:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()

# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key": "a", "date": "2016-02-01"},
                      {"key": "b", "date": "2016-02-02"}]).toDF()
df1.show()

df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()
I just want to add another resource and an example to this discussion.
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
For example, if your timestamp string is "22 Dec 2022 19:06:36 EST", then the format is "dd MMM yyyy HH:mm:ss zzz".
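A quick check of that pattern (a sketch only; zone-name parsing behaviour can vary across Spark versions and parser settings):
from pyspark.sql.functions import to_timestamp

spark.createDataFrame([("22 Dec 2022 19:06:36 EST",)], ["ts"]) \
    .select(to_timestamp("ts", "dd MMM yyyy HH:mm:ss zzz").alias("parsed")) \
    .show(truncate=False)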
