pyspark: removing microsecond from timestamp - apache-spark

I am working on a pyspark script and one of the required transformations is to convert the microsecond timestamp into a seconds timestamp:
Read the parquet file as input
Determine if any column is of type "timestamp" (it will be in microseconds).
Example - 2019-03-30 19:56:14.520138
If yes, convert it to 'yyyy-mm-dd hh:mm:ss' format
After conversion - 2019-03-30 19:56:14
Write the dataframe in parquet format back to s3.
I have tried the below, but it doesn't work. The returned dataframe still shows microseconds.
import pyspark.sql.functions as f

df = spark.read.parquet(p_input_loc)

def customize_df(df):
    getTimestampCol = list(filter(lambda x: "timestamp" in x, df.dtypes))
    print(getTimestampCol)
    """[('created_at', 'timestamp'), ('updated_at', 'timestamp')]"""
    if getTimestampCol:
        for row in getTimestampCol:
            df = df.withColumn(row[0], f.to_timestamp(row[0], 'yyyy-mm-dd hh:mm:ss'))
        return df
    else:
        return df
So I need help!!

The problem here is with your function usage.
The to_timestamp function parses the date in the format it is provided and then converts it to a timestamp, but to change the format you need to use the date_format function.
Here is an example
import pyspark.sql.functions as f
import pyspark.sql.types as t

df2 = spark.createDataFrame([("2020-01-01 11:22:59.9989", "12312020", "31122020"), ("2020-01-01 11:22:59.9989", "12312020", "31122020")], ["ID", "Start_date", "End_date"])
df2.withColumn('ss', f.date_format(df2.ID.cast(t.TimestampType()), 'yyyy-MM-dd HH:mm:ss')).select('ss', 'ID').show(2, False)
+-------------------+------------------------+
|ss |ID |
+-------------------+------------------------+
|2020-01-01 11:22:59|2020-01-01 11:22:59.9989|
|2020-01-01 11:22:59|2020-01-01 11:22:59.9989|
+-------------------+------------------------+
So replace your
df = df.withColumn(row[0], f.to_timestamp(row[0], 'yyyy-mm-dd hh:mm:ss'))
with
df = df.withColumn(row[0], f.date_format(row[0], 'yyyy-MM-dd HH:mm:ss'))
since your column is already of TimestampType.
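For completeness, here is a minimal sketch of the whole helper with that change applied (the list comprehension is just one way to pick the timestamp columns; adapt it to your dataframe):

import pyspark.sql.functions as f

def customize_df(df):
    # Columns whose dtype is "timestamp", e.g. [('created_at', 'timestamp'), ...]
    timestamp_cols = [name for name, dtype in df.dtypes if dtype == "timestamp"]
    for name in timestamp_cols:
        # date_format drops the microseconds but returns a string column;
        # cast back to timestamp afterwards if you need to keep the type.
        df = df.withColumn(name, f.date_format(name, "yyyy-MM-dd HH:mm:ss"))
    return df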
Hope it helps

Related

Casting date from string spark

I have a date in my dataframe stored as a String datatype with the format dd/MM/yyyy.
When I try to convert the string to date format, all the functions return null values.
I am looking to convert the datatype to DateType.
It looks like your date strings contain quotes; you need to remove them, for example using regexp_replace, before calling to_date:
import pyspark.sql.functions as F
df = spark.createDataFrame([("'31-12-2021'",), ("'30-11-2021'",), ("'01-01-2022'",)], ["Birth_Date"])
df = df.withColumn(
    "Birth_Date",
    F.to_date(F.regexp_replace("Birth_Date", "'", ""), "dd-MM-yyyy")
)
df.show()
#+----------+
#|Birth_Date|
#+----------+
#|2021-12-31|
#|2021-11-30|
#|2022-01-01|
#+----------+
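Since the goal is a DateType column, you can confirm the conversion worked with printSchema; for the example above it should show the column as date:

df.printSchema()
#root
# |-- Birth_Date: date (nullable = true)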

Spark 2.3 timestamp subtract milliseconds

I am using Spark 2.3 and I have read here that it does not support timestamp milliseconds (only in 2.4+), but am looking for ideas on how to do what I need to do.
The data I am processing stores dates as String datatype in Parquet files in this format: 2021-07-09T01:41:58Z
I need to subtract one millisecond from that. If it were Spark 2.4, I think I could do something like this:
to_timestamp(col("sourceStartTimestamp")) - expr("INTERVAL 0.001 SECONDS")
But since it is Spark 2.3, that does not do anything. I confirmed it can subtract 1 second, but it ignores any value less than a second.
Can anyone suggest a workaround for how to do this in Spark 2.3? Ultimately, the result will need to be a String data type, if that makes any difference.
Since millisecond timestamps aren't supported by Spark 2.3 (or below), consider using a UDF that takes a delta in millis and a date format, and uses java.time's plusNanos() to get what you need:
def getMillisTS(delta: Long, fmt: String = "yyyy-MM-dd HH:mm:ss.SSS") = udf{
  (ts: java.sql.Timestamp) =>
    import java.time.format.DateTimeFormatter
    ts.toLocalDateTime.plusNanos(delta * 1000000).format(DateTimeFormatter.ofPattern(fmt))
}
Test-running the UDF:
val df = Seq("2021-01-01 00:00:00", "2021-02-15 12:30:00").toDF("ts")
df.withColumn("millisTS", getMillisTS(-1)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2020-12-31 23:59:59.999|
|2021-02-15 12:30:00|2021-02-15 12:29:59.999|
+-------------------+-----------------------+
*/
df.withColumn("millisTS", getMillisTS(5000)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2021-01-01 00:00:05.000|
|2021-02-15 12:30:00|2021-02-15 12:30:05.000|
+-------------------+-----------------------+
*/
val df = Seq("2021-01-01T00:00:00Z", "2021-02-15T12:30:00Z").toDF("ts")
df.withColumn(
  "millisTS",
  getMillisTS(-1, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")(to_timestamp($"ts", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
).show(false)
/*
+--------------------+------------------------+
|ts                  |millisTS                |
+--------------------+------------------------+
|2021-07-09T01:41:58Z... (input strings shown as-is) |2020-12-31T23:59:59.999Z style output|
+--------------------+------------------------+
*/
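If you are on PySpark rather than Scala, a similar workaround is a Python UDF built on datetime.timedelta; the sketch below makes the same Spark 2.3 assumptions, and the column and function names are just illustrative:

from datetime import timedelta

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def minus_one_milli(ts):
    # ts arrives as a datetime.datetime when the input column is TimestampType
    if ts is None:
        return None
    shifted = ts - timedelta(milliseconds=1)
    # %f prints microseconds (6 digits); keep only the first 3 for milliseconds
    return shifted.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

df = spark.createDataFrame([("2021-07-09T01:41:58Z",)], ["sourceStartTimestamp"])
df.withColumn(
    "millisTS",
    minus_one_milli(F.to_timestamp("sourceStartTimestamp", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
).show(truncate=False)
# millisTS should come out as 2021-07-09T01:41:57.999Z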

Select a next or previous record on a dataframe (PySpark)

I have a spark dataframe that has a list of timestamps (partitioned by uid, ordered by timestamp). Now, I'd like to query the dataframe to get either previous or next record.
df = myrdd.toDF().repartition("uid").sort(desc("timestamp"))
df.show()
+------------+-------------------+
|uid         |timestamp          |
+------------+-------------------+
|Peter_Parker|2020-09-19 02:14:40|
|Peter_Parker|2020-09-19 01:07:38|
|Peter_Parker|2020-09-19 00:04:39|
|Peter_Parker|2020-09-18 23:02:36|
|Peter_Parker|2020-09-18 21:58:40|
+------------+-------------------+
So for example if I were to query:
ts=datetime.datetime(2020, 9, 19, 0, 4, 39)
I want to get the previous record on (2020-09-18 23:02:36), and only that one.
How can I get the previous one?
It's possible to do it using withColumn() and a diff, but is there a smarter, more efficient way of doing that? I really don't need to calculate the diff for ALL events, since the data is already ordered. I just want the previous/next record.
You can use a filter and order by, and then limit the results to 1 row:
df2 = (df.filter("uid = 'Peter_Parker' and timestamp < timestamp('2020-09-19 00:04:39')")
       .orderBy('timestamp', ascending=False)
       .limit(1)
)
df2.show()
+------------+-------------------+
| uid| timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-18 23:02:36|
+------------+-------------------+
Or by using row_number after filtering:
from pyspark.sql import Window
from pyspark.sql import functions as F
df1 = df.filter("timestamp < '2020-09-19 00:04:39'") \
    .withColumn("rn", F.row_number().over(Window.orderBy(F.desc("timestamp")))) \
    .filter("rn = 1").drop("rn")
df1.show()
#+------------+-------------------+
#| uid| timestamp|
#+------------+-------------------+
#|Peter_Parker|2020-09-18 23:02:36|
#+------------+-------------------+
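Getting the next record instead works the same way; a sketch with the comparison and sort direction flipped, assuming the same sample data (df_next is just an illustrative name):

df_next = (df.filter("uid = 'Peter_Parker' and timestamp > timestamp('2020-09-19 00:04:39')")
           .orderBy('timestamp')
           .limit(1)
)
df_next.show()
#+------------+-------------------+
#|         uid|          timestamp|
#+------------+-------------------+
#|Peter_Parker|2020-09-19 01:07:38|
#+------------+-------------------+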

Parsing timestamps from string and rounding seconds in spark

I have a spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to get this format: YY-MM-DD HH:MM:SS, knowing that I have the following value: 20171107014824952 (which means: 2017-11-07 01:48:25)?
The seconds part is made up of 5 digits; in the example above it is 24952, while the value displayed in the log file is 25, so I have to round 24.952 up before applying the to_timestamp function. That's why I am asking for help.
Assuming you have the following spark DataFrame:
df.show()
#+-----------------+
#| requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+
With the schema:
df.printSchema()
#root
# |-- requestTime: string (nullable = true)
You can use the techniques described in Convert pyspark string to date format to convert this to a timestamp. Since the solution is dependent on your spark version, I've created the following helper function:
import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    try:
        """For spark version 2.2 and above, to_timestamp is available"""
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        """For spark version 2.1 and below, you'll have to do it this way"""
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
Now call it on your data using the appropriate format:
df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddHHmmssSSS")
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+
Unfortunately, this truncates the timestamp instead of rounding.
Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you'll have to convert it to a double, divide by 1000, convert it back to a long (to chop off the decimal; you can't use int, as the number is too big), and finally back to a string.
df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double") / 1000.0).cast("long").cast("string"),
        "yyyyMMddHHmmss"
    )
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+
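If you specifically need the result back as a string in the 'yyyy-MM-dd HH:mm:ss' shape rather than a timestamp column, one option (a sketch reusing the helper above) is to wrap the expression in date_format:

df.withColumn(
    "requestTime",
    f.date_format(
        timestamp_from_string(
            f.round(f.col("requestTime").cast("double") / 1000.0).cast("long").cast("string"),
            "yyyyMMddHHmmss"
        ),
        "yyyy-MM-dd HH:mm:ss"
    )
).show()
# The displayed value is the same, but the column is now a string.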

pyspark to_timestamp function doesn't convert certain timestamps

I would like to use the to_timestamp function to format timestamps in pyspark. How can I do it without the timezone shifting or certain dates being omitted?
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf, to_timestamp
date_format = "yyyy-MM-dd'T'HH:mm:ss"
vals = [('2018-03-11T02:39:00Z'), ('2018-03-11T01:39:00Z'), ('2018-03-11T03:39:00Z')]
testdf = spark.createDataFrame(vals, StringType())
testdf.withColumn("to_timestamp", to_timestamp("value", date_format)).show(4, False)
+--------------------+-------------------+
|value |to_timestamp |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|null |
|2018-03-11T01:39:00Z|2018-03-11 01:39:00|
|2018-03-11T03:39:00Z|2018-03-11 03:39:00|
+--------------------+-------------------+
I expected 2018-03-11T02:39:00Z to format correctly to 2018-03-11 02:39:00
Then I switched to the default to_timestamp function.
testdf.withColumn("to_timestamp", to_timestamp("value")).show(4, False)
+--------------------+-------------------+
|value |to_timestamp |
+--------------------+-------------------+
|2018-03-11T02:39:00Z|2018-03-10 20:39:00|
|2018-03-11T01:39:00Z|2018-03-10 19:39:00|
|2018-03-11T03:39:00Z|2018-03-10 21:39:00|
+--------------------+-------------------+
The shift in time when you call to_timestamp() with default values happens because your Spark instance is set to your local timezone and not UTC. You can check by running
spark.conf.get('spark.sql.session.timeZone')
If you want your timestamp to be displayed in UTC, set the conf value.
spark.conf.set('spark.sql.session.timeZone', 'UTC')
Another important point in your code: when you define the date format as "yyyy-MM-dd'T'HH:mm:ss", you are essentially asking Spark to ignore the timezone and consider all timestamps to be in UTC/Zulu. The proper format would be date_format = "yyyy-MM-dd'T'HH:mm:ssXXX", but it's a moot point if you are calling to_timestamp() with defaults.
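For example, a quick sketch combining both points, the UTC session timezone and the timezone-aware pattern (exact parsing behavior can vary a bit between Spark's legacy and newer datetime parsers):

spark.conf.set('spark.sql.session.timeZone', 'UTC')
testdf.withColumn("to_timestamp", to_timestamp("value", "yyyy-MM-dd'T'HH:mm:ssXXX")).show(4, False)

With this, all three rows should parse, and 2018-03-11T02:39:00Z should display as 2018-03-11 02:39:00, since UTC has no daylight-saving gap.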
Alternatively, use the from_utc_timestamp function, which treats the input column value as a UTC timestamp. Note that it also requires a second argument for the timezone the result should be rendered in, for example:
from pyspark.sql.functions import from_utc_timestamp

testdf.withColumn("to_timestamp", from_utc_timestamp("value", "UTC")).show(4, False)
