How to stop timestamp in pyspark from dropping trailing zeroes - apache-spark

I have a Spark dataframe where the timestamp is in milliseconds:
+-----------------------+
|CALC_TS                |
+-----------------------+
|2021-01-27 01:35:05.043|
|2021-01-27 01:35:05.043|
|2021-01-27 01:35:05.043|
+-----------------------+
I want to make it show microseconds like so:
+--------------------------+
|CALC_TS                   |
+--------------------------+
|2021-01-27 01:35:05.043000|
|2021-01-27 01:35:05.043000|
|2021-01-27 01:35:05.043000|
+--------------------------+
So basically I would like the milliseconds portion to be shown in terms of microseconds. In the above example, the 43 milliseconds in the first dataframe would appear as 43 thousand microseconds, as shown in the second dataframe.
I have tried:
df.withColumn('TIME', to_timestamp('CALC_TS', 'yyyy-MM-dd HH:mm:ss.SSSSSS'))
and
df.withColumn('TIME', col('CALC_TS').cast("timestamp"))
But they are giving the same result and stripping the last 3 zeroes. Is there a way to achieve this?

to_timestamp(timestamp_str[, fmt]) accepts a string and returns a timestamp (type). If your CALC_TS is already a timestamp as you said, you should instead use df.withColumn('TIME', date_format('CALC_TS', 'yyyy-MM-dd HH:mm:ss.SSSSSS')) to format it as a string with microsecond precision. From the Spark datetime pattern reference:
Fraction: Use one or more (up to 9) contiguous 'S' characters, e.g. SSSSSS, to parse and format the fraction of second. For parsing, the acceptable fraction length can be [1, the number of contiguous 'S']. For formatting, the fraction length would be padded to the number of contiguous 'S' with zeros. Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.
For Spark 2.4, and just to make it look like the precision of the timestamp field is microseconds, you can perhaps "fake" the trailing zeroes while formatting it like this: date_format('CALC_TS', 'yyyy-MM-dd HH:mm:ss.SSS000')
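A minimal sketch of both variants (assuming CALC_TS is a timestamp column; the resulting TIME column is a formatted string):
from pyspark.sql.functions import date_format

# Spark 3.x: SSSSSS pads the fraction with zeros to 6 digits when formatting
df_us = df.withColumn('TIME', date_format('CALC_TS', 'yyyy-MM-dd HH:mm:ss.SSSSSS'))

# Spark 2.4: "fake" microsecond precision by appending literal zeros to the SSS fraction
df_us_24 = df.withColumn('TIME', date_format('CALC_TS', 'yyyy-MM-dd HH:mm:ss.SSS000'))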

You can use rpad: right-pad with trailing zeros up to the expected length of your timestamp string. In your case that is 26 characters (for the format yyyy-MM-dd HH:mm:ss.SSSSSS).
from pyspark.sql.functions import *
df.withColumn('CALC_TS_1', col('CALC_TS').cast("timestamp"))\
.withColumn('CALC_TS_1', rpad(col('CALC_TS_1').cast('string'),26,'0'))\
.show(truncate=False)
+--------------------------+--------------------------+
|CALC_TS                   |CALC_TS_1                 |
+--------------------------+--------------------------+
|2021-01-27 01:35:05.043   |2021-01-27 01:35:05.043000|
|2021-01-27 01:35:05.043567|2021-01-27 01:35:05.043567|
+--------------------------+--------------------------+

If the column CALC_TS is of type string, first convert it to TimestampType using the unix_timestamp and to_timestamp functions, then use date_format to format it with a 6-digit fraction:
from pyspark.sql import functions as F
df.printSchema()
#root
# |-- CALC_TS: string (nullable = true)
df1 = df.withColumn(
    'TIME',
    F.to_timestamp(
        F.unix_timestamp('CALC_TS', "yyyy-MM-dd HH:mm:ss.SSS")         # seconds
        + F.substring_index('CALC_TS', '.', -1).cast('float') / 1000   # milliseconds part
    )
).withColumn(
    "TIME_FORMAT",
    F.date_format("TIME", "yyyy-MM-dd HH:mm:ss.SSSSSS")
)
df1.show(truncate=False)
#+-----------------------+-----------------------+--------------------------+
#|CALC_TS                |TIME                   |TIME_FORMAT               |
#+-----------------------+-----------------------+--------------------------+
#|2021-01-27 01:35:05.043|2021-01-27 01:35:05.043|2021-01-27 01:35:05.000043|
#|2021-01-27 01:35:05.043|2021-01-27 01:35:05.043|2021-01-27 01:35:05.000043|
#|2021-01-27 01:35:05.043|2021-01-27 01:35:05.043|2021-01-27 01:35:05.000043|
#+-----------------------+-----------------------+--------------------------+
#root
# |-- CALC_TS: string (nullable = true)
# |-- TIME: timestamp (nullable = true)
# |-- TIME_FORMAT: string (nullable = true)
If the column is already of type timestamp, simply use date_format as in the above code.
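For instance, a minimal sketch for that case (assuming CALC_TS is already of TimestampType):
from pyspark.sql import functions as F

df2 = df.withColumn(
    "TIME_FORMAT",
    F.date_format("CALC_TS", "yyyy-MM-dd HH:mm:ss.SSSSSS")  # pads the fraction to 6 digits
)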

Related

Is there any way to handle time in pyspark?

I have a string with 6 characters which should be loaded into SQL Server as the TIME data type.
But Spark doesn't have any time data type. I have tried a few ways, but the data type does not come back as timestamp.
I read the data as a string, convert it to timestamp, and then finally try to extract the time values, but it is returned as a string again.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).printSchema()
root
|-- time_col: timestamp (nullable = true)
|-- tim2: string (nullable = true)
And the data looks like this but in a different data type.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).show(5)
+-------------------+------+
|           time_col|  tim2|
+-------------------+------+
|1970-01-01 14:44:51|144451|
|1970-01-01 14:48:37|144837|
|1970-01-01 14:46:10|144610|
|1970-01-01 11:46:39|114639|
|1970-01-01 17:44:33|174433|
+-------------------+------+
Is there any way I can get tim2 column in timestamp column or column equivalent to TIME data type of SQL Server?
I think you won't get what you are trying to do; there's no type in PySpark to handle "HH:mm:ss" on its own (see: What data type should be used for a time column).
I'd suggest you use it as a string.
In my case I converted it to timestamp in Spark and, just before sending it to SQL Server, made it a string again; that worked fine for me.
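A rough sketch of that round trip (assuming the column is called time_col and holds strings like "144451"; the column names are illustrative):
from pyspark.sql import functions as F

df = df.withColumn("time_ts", F.to_timestamp("time_col", "HHmmss"))    # timestamp; the date part defaults to 1970-01-01
df = df.withColumn("time_str", F.date_format("time_ts", "HH:mm:ss"))   # back to an "HH:mm:ss" string for SQL Server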
Maybe this will help you, but it seems to me that this turns the column into a string:
df.withColumn('TIME', date_format('datetime', 'HH:mm:ss'))
In Scala (Python will be similar):
scala> val df = Seq("144451","144837").toDF("c").select('c.cast("INT").cast("TIMESTAMP"))
df: org.apache.spark.sql.DataFrame = [c: timestamp]
scala> df.show()
+-------------------+
|                  c|
+-------------------+
|1970-01-02 17:07:31|
|1970-01-02 17:13:57|
+-------------------+
scala> df.printSchema()
root
|-- c: timestamp (nullable = true)
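A rough PySpark equivalent of the Scala snippet above (a sketch; like the cast chain above, it interprets the value as seconds since the epoch):
from pyspark.sql import functions as F

df = spark.createDataFrame([("144451",), ("144837",)], ["c"])
df = df.select(F.col("c").cast("int").cast("timestamp").alias("c"))
df.printSchema()  # c: timestamp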

Parsing timestamps from string and rounding seconds in spark

I have a spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to get this format: YY-MM-DD HH:MM:SS, knowing that I have the following value: 20171107014824952 (which means: 2017-11-07 01:48:25)?
The part devoted to the seconds is formed of 5 digits; in the example above the seconds part is 24952, while what was displayed in the log file was 25, so I have to round 24.952 up before applying the to_timestamp function. That's why I asked for help.
Assuming you have the following spark DataFrame:
df.show()
#+-----------------+
#|      requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+
With the schema:
df.printSchema()
#root
# |-- requestTime: string (nullable = true)
You can use the techniques described in Convert pyspark string to date format to convert this to a timestamp. Since the solution is dependent on your spark version, I've created the following helper function:
import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    try:
        """For spark version 2.2 and above, to_timestamp is available"""
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        """For spark version 2.1 and below, you'll have to do it this way"""
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
Now call it on your data using the appropriate format:
df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddhhmmssSSS")
).show()
#+-------------------+
#|        requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+
Unfortunately, this truncates the timestamp instead of rounding.
Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you'll have to convert it to a double, divide by 1000, round, convert it back to a long (to chop off the decimal part; you can't use int because the number is too big), and finally back to a string.
df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double") / 1000.0).cast('long').cast('string'),
        "yyyyMMddhhmmss"
    )
).show()
#+-------------------+
#|        requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+

How to convert Timestamp column to milliseconds Long column in Spark SQL

What is the shortest and most efficient way in Spark SQL to transform a Timestamp column into a Long column holding the timestamp in milliseconds?
Here is an example of a transformation from timestamp to milliseconds
scala> val ts = spark.sql("SELECT now() as ts")
ts: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> ts.show(false)
+-----------------------+
|ts                     |
+-----------------------+
|2019-06-18 12:32:02.41 |
+-----------------------+
scala> val tss = ts.selectExpr(
| "ts",
| "BIGINT(ts) as seconds_ts",
| "BIGINT(ts) * 1000 + BIGINT(date_format(ts, 'SSS')) as millis_ts"
| )
tss: org.apache.spark.sql.DataFrame = [ts: timestamp, seconds_ts: bigint ... 1 more field]
scala> tss.show(false)
+----------------------+----------+-------------+
|ts                    |seconds_ts|millis_ts    |
+----------------------+----------+-------------+
|2019-06-18 12:32:02.41|1560861122|1560861122410|
+----------------------+----------+-------------+
As you can see, the most straightforward method of getting milliseconds from a timestamp doesn't work: a cast to long returns seconds, even though the millisecond information in the timestamp is preserved.
The only way I found to extract the millisecond information is by using the date_format function, which is nothing like as simple as I would expect.
Does anybody know a simpler way to get the millisecond UNIX time out of a Timestamp column?
According to the code on Spark's DateTimeUtils:
"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."
Therefore, if you define a UDF that has a java.sql.Timestamp as input, you can simply call getTime to get a Long in milliseconds.
val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
Applying this to a variety of Timestamps:
val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
  .toDF("timestampString")
  .withColumn("timestamp", to_timestamp(col("timestampString")))
  .withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
  .withColumn("timestampCastAsLong", col("timestamp").cast(LongType))
df.printSchema()
df.show(false)
// returns
root
|-- timestampString: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampConversionToLong: long (nullable = false)
|-- timestampCastAsLong: long (nullable = true)
+-----------------------+-----------------------+-------------------------+-------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampCastAsLong|
+-----------------------+-----------------------+-------------------------+-------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600         |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600         |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600         |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600         |
+-----------------------+-----------------------+-------------------------+-------------------+
Note that the column "timestampCastAsLong" just shows that a direct cast to a Long will not return the desired result in milliseconds, but only in seconds.
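If you would rather avoid a UDF, a rough PySpark sketch of an alternative (not this answer's method, just the cast-to-double trick also used in a later answer) is to cast the timestamp to double, which keeps the fractional seconds, and scale:
from pyspark.sql import functions as F

df_ms = df.withColumn(
    "timestampMillis",
    F.round(F.col("timestamp").cast("double") * 1000).cast("long")  # seconds.fraction -> milliseconds
)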

Converting a column from string to to_date populating a different month in pyspark

I am using spark 1.6.3. When converting a column val1 (of datatype string) to date, the code is populating a different month in the result than what's in the source.
For example, suppose my source is 6/15/2017 18:32. The code below is producing 15-1-2017 as the result (Note that the month is incorrect).
My code snippet is as below
from pyspark.sql.functions import from_unixtime,unix_timestamp ,to_date
df5 = df.withColumn("val1", to_date(from_unixtime(unix_timestamp(("val1"), "mm/dd/yyyy"))))
Expected output is 6/15/2017 of date type. Please suggest.
You're using the incorrect date format. You need to use MM for the month (not mm).
For example:
df = sqlCtx.createDataFrame([('6/15/2017 18:32',)], ["val1"])
df.printSchema()
#root
# |-- val1: string (nullable = true)
As we can see val1 is a string. We can convert to date using your code with the capital M:
from pyspark.sql.functions import from_unixtime, unix_timestamp, to_date
df5 = df.withColumn("val1", to_date(from_unixtime(unix_timestamp(("val1"), "MM/dd/yyyy"))))
df5.show()
#+----------+
#| val1|
#+----------+
#|2017-06-15|
#+----------+
The new column is a date type, which will display as YYYY-MM-DD:
df5.printSchema()
#root
# |-- val1: date (nullable = true)
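If you also need the original M/d/yyyy presentation back, a small sketch (the column name val1_display is illustrative, and the result is a string, not a date):
from pyspark.sql.functions import date_format

df5.withColumn("val1_display", date_format("val1", "M/d/yyyy")).show()
# val1_display -> 6/15/2017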

Can unix_timestamp() return unix time in milliseconds in Apache Spark?

I'm trying to get the unix time from a timestamp field in milliseconds (13 digits) but currently it returns in seconds (10 digits).
scala> var df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.123", "2017-01-18 11:00:00.882", "2017-01-18 11:00:02.432").toDF()
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df = df.selectExpr("value timeString", "cast(value as timestamp) time")
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp]
scala> df = df.withColumn("unix_time", unix_timestamp(df("time")))
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp ... 1 more field]
scala> df.take(4)
res63: Array[org.apache.spark.sql.Row] = Array(
[2017-01-18 11:00:00.000,2017-01-18 11:00:00.0,1484758800],
[2017-01-18 11:00:00.123,2017-01-18 11:00:00.123,1484758800],
[2017-01-18 11:00:00.882,2017-01-18 11:00:00.882,1484758800],
[2017-01-18 11:00:02.432,2017-01-18 11:00:02.432,1484758802])
Even though 2017-01-18 11:00:00.123 and 2017-01-18 11:00:00.000 are different, I get the same unix time back 1484758800
What am I missing?
Milliseconds hide in the fractional part of the timestamp format.
Try this:
df = df.withColumn("time_in_milliseconds", col("time").cast("double"))
You'll get something like 1484758800.792, where 792 is the milliseconds.
At least it works for me (Scala, Spark, Hive).
Implementing the approach suggested in Dao Thi's answer
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME                        |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Converting the string time format (including milliseconds) to unix_timestamp (double): extract the milliseconds from the string using the substring method (start_position = -7, length_of_substring = 3) and add them separately to the unix_timestamp (cast the substring to float for the addition).
df1 = df.withColumn(
    "unix_timestamp",
    F.unix_timestamp(df.TIME, 'dd-MMM-yyyy HH:mm:ss.SSS z')
    + F.substring(df.TIME, -7, 3).cast('float') / 1000
)
Converting unix_timestamp(double) to timestamp datatype in Spark.
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This will give you following output
+----------------------------+----------------+-----------------------+
|TIME                        |unix_timestamp  |TimestampType          |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the Schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
unix_timestamp() returns the unix timestamp in seconds.
The last 3 digits of the millisecond value are the same as the last 3 fractional digits of the timestamp string (1.999 sec = 1999 milliseconds), so you can just take the last 3 digits of the timestamp string and append them to the end of the seconds string.
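A minimal sketch of that concatenation, reusing the TIME column and df1 from the snippet above (the column name unix_millis_str is illustrative):
df_ms = df1.withColumn(
    "unix_millis_str",
    F.concat(
        F.unix_timestamp("TIME", "dd-MMM-yyyy HH:mm:ss.SSS z").cast("string"),  # whole seconds, e.g. 1532233278
        F.substring("TIME", -7, 3)                                              # millisecond digits, e.g. 792
    )
)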
It cannot be done with unix_timestamp() but since Spark 3.1.0 there is a built-in function called unix_millis():
unix_millis(timestamp) - Returns the number of milliseconds since 1970-01-01 00:00:00 UTC. Truncates higher levels of precision.
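A minimal sketch of calling it (assuming Spark 3.1+; the column name ts and the alias millis_ts are illustrative):
from pyspark.sql import functions as F

df.withColumn("millis_ts", F.expr("unix_millis(ts)"))       # DataFrame API via expr
spark.sql("SELECT unix_millis(now()) AS millis_ts")         # or plain Spark SQL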
Up to Spark version 3.0.1 it is not possible to convert a timestamp into unix time in milliseconds using the SQL built-in function unix_timestamp.
According to the code on Spark's DateTimeUtils
"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."
Therefore, if you define a UDF that has a java.sql.Timestamp as input, you can call getTime to get a Long in milliseconds. If you apply unix_timestamp, you will only get unix time with second precision.
val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
Applying this to a variety of Timestamps:
val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
  .toDF("timestampString")
  .withColumn("timestamp", to_timestamp(col("timestampString")))
  .withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
  .withColumn("timestampUnixTimestamp", unix_timestamp(col("timestamp")))
df.printSchema()
df.show(false)
// returns
root
|-- timestampString: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampConversionToLong: long (nullable = false)
|-- timestampUnixTimestamp: long (nullable = true)
+-----------------------+-----------------------+-------------------------+----------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampUnixTimestamp|
+-----------------------+-----------------------+-------------------------+----------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600            |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600            |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600            |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600            |
+-----------------------+-----------------------+-------------------------+----------------------+
Same as #Тимур Залимов's answer, just cast it:
>>> df2 = df_msg.withColumn("datetime", F.col("timestamp").cast("timestamp")).withColumn("timestamp_back" , F.col("datetime").cast("double"))
>>> r = df2.rdd.take(1)[0]
>>> r.timestamp_back
1666509660.071501
>>> r.timestamp
1666509660.071501
>>> r.datetime
datetime.datetime(2022, 10, 23, 15, 21, 0, 71501)
