I am using Spark 2.3 and I have read here that it does not support timestamp milliseconds (only in 2.4+), but am looking for ideas on how to do what I need to do.
The data I am processing stores dates as String datatype in Parquet files in this format: 2021-07-09T01:41:58Z
I need to subtract one millisecond from that. If it were Spark 2.4, I think I could do something like this:
to_timestamp(col("sourceStartTimestamp")) - expr("INTERVAL 0.001 SECONDS")
But since it is Spark 2.3, that does not do anything. I confirmed it can subtract 1 second, but it ignores any value less than a second.
Can anyone suggest a workaround for how to do this in Spark 2.3? Ultimately, the result will need to be a String data type, if that makes any difference.
Since millisecond timestamps aren't supported by Spark 2.3 (or below), consider using a UDF that takes a delta in milliseconds and a date format, and uses java.time's plusNanos() to get what you need:
import org.apache.spark.sql.functions._

def getMillisTS(delta: Long, fmt: String = "yyyy-MM-dd HH:mm:ss.SSS") = udf { (ts: java.sql.Timestamp) =>
  import java.time.format.DateTimeFormatter
  ts.toLocalDateTime.plusNanos(delta * 1000000).format(DateTimeFormatter.ofPattern(fmt))
}
Test-running the UDF:
val df = Seq("2021-01-01 00:00:00", "2021-02-15 12:30:00").toDF("ts")
df.withColumn("millisTS", getMillisTS(-1)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2020-12-31 23:59:59.999|
|2021-02-15 12:30:00|2021-02-15 12:29:59.999|
+-------------------+-----------------------+
*/
df.withColumn("millisTS", getMillisTS(5000)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2021-01-01 00:00:05.000|
|2021-02-15 12:30:00|2021-02-15 12:30:05.000|
+-------------------+-----------------------+
*/
val df = Seq("2021-01-01T00:00:00Z", "2021-02-15T12:30:00Z").toDF("ts")
df.withColumn(
  "millisTS",
  getMillisTS(-1, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")(to_timestamp($"ts", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
).show(false)
/*
+--------------------+------------------------+
|ts                  |millisTS                |
+--------------------+------------------------+
|2021-01-01T00:00:00Z|2020-12-31T23:59:59.999Z|
|2021-02-15T12:30:00Z|2021-02-15T12:29:59.999Z|
+--------------------+------------------------+
*/
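If you are on PySpark rather than Scala, a minimal sketch of the same idea: a Python UDF that parses the string, subtracts a millisecond, and formats it back. The column name sourceStartTimestamp is taken from the question; the rest is illustrative.
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def minus_one_milli(ts):
    if ts is None:
        return None
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") - timedelta(milliseconds=1)
    # %f prints microseconds; drop the last three digits to keep milliseconds
    return dt.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

df.withColumn("sourceStartTimestamp", minus_one_milli(F.col("sourceStartTimestamp")))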
I have a Spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to get this format: YY-MM-DD HH:MM:SS, knowing that I have the following value: 20171107014824952 (which means: 2017-11-07 01:48:25)?
The seconds part consists of 5 digits; in the example above it is 24952, while the log file displays 25, so I have to round 24.952 up before applying the to_timestamp function. That's why I'm asking for help.
Assuming you have the following spark DataFrame:
df.show()
#+-----------------+
#| requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+
With the schema:
df.printSchema()
#root
# |-- requestTime: string (nullable = true)
You can use the techniques described in Convert pyspark string to date format to convert this to a timestamp. Since the solution is dependent on your spark version, I've created the following helper function:
import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    try:
        # For spark version 2.2 and above, to_timestamp is available
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        # For spark version 2.1 and below, you'll have to do it this way
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
Now call it on your data using the appropriate format:
df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddHHmmssSSS")
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+
Unfortunately, this truncates the timestamp instead of rounding.
Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you'll have to cast it to a double, divide by 1000, round, cast it back to a long (to chop off the decimal; you can't use int because the number is too big), and finally cast back to a string.
df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double") / 1000.0).cast("long").cast("string"),
        "yyyyMMddHHmmss"
    )
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+
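If you ultimately need the result back as a string (the question asks for a yyyy-MM-dd HH:mm:ss-style format), a small follow-up sketch wraps the same expression in date_format:
df.withColumn(
    "requestTime",
    f.date_format(
        timestamp_from_string(
            f.round(f.col("requestTime").cast("double") / 1000.0).cast("long").cast("string"),
            "yyyyMMddHHmmss"
        ),
        "yyyy-MM-dd HH:mm:ss"
    )
).show()
#+-------------------+
#|        requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+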
Spark has SQL function percentile_approx(), and its Scala counterpart is df.stat.approxQuantile().
However, the Scala counterpart cannot be used on grouped datasets, something like df.groupby("foo").stat.approxQuantile(), as answered here: https://stackoverflow.com/a/51933027.
But it's possible to do both grouping and percentiles in SQL syntax. So I'm wondering: maybe I can define a UDF from the SQL percentile_approx function and use it on my grouped dataset?
Spark >= 3.1
Corresponding functions were added to the Scala/Python functions API in Spark 3.1 - see SPARK-30569.
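For example, from PySpark 3.1+ it can be called directly on a grouped DataFrame. A minimal sketch, assuming an analogous PySpark DataFrame with the same group/value columns as the Scala example below:
from pyspark.sql import functions as F

df.groupBy("group").agg(
    F.percentile_approx("value", 0.5, 10000).alias("p50")
).show()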
Spark < 3.1
While you cannot use approxQuantile in a UDF, and there is no Scala wrapper for percentile_approx, it is not hard to implement one yourself:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column): Column = percentile_approx(
    col, percentage, lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)
  )
}
Example usage:
import PercentileApprox._
val df = (Seq.fill(100)("a") ++ Seq.fill(100)("b")).toDF("group").withColumn(
  "value", when($"group" === "a", randn(1) + 10).otherwise(randn(3))
)
df.groupBy($"group").agg(percentile_approx($"value", lit(0.5))).show
+-----+------------------------------------+
|group|percentile_approx(value, 0.5, 10000)|
+-----+------------------------------------+
| b| -0.06336346702250675|
| a| 9.818985618591595|
+-----+------------------------------------+
df.groupBy($"group").agg(
percentile_approx($"value", typedLit(Seq(0.1, 0.25, 0.75, 0.9)))
).show(false)
+-----+----------------------------------------------------------------------------------+
|group|percentile_approx(value, [0.1,0.25,0.75,0.9], 10000) |
+-----+----------------------------------------------------------------------------------+
|b |[-1.2098351202406483, -0.6640768986666159, 0.6778253126144265, 1.3255676906697658]|
|a |[8.902067202468098, 9.290417382259626, 10.41767257153993, 11.067087075488068] |
+-----+----------------------------------------------------------------------------------+
Once this is on the JVM classpath, you can also add a PySpark wrapper, using logic similar to the built-in functions.
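Alternatively, if you only need this from PySpark, you don't strictly need a wrapper at all: since percentile_approx is available as a SQL function, a minimal sketch using expr already works on grouped data (assuming an analogous PySpark DataFrame with the same group/value columns):
from pyspark.sql import functions as F

df.groupBy("group").agg(
    F.expr("percentile_approx(value, 0.5)").alias("p50")
).show()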
I am trying to read a data stream from Kafka with the schema below:
val schema = StructType(
  List(
    StructField("timestamp", LongType, true),
    StructField("id", StringType, true),
    StructField("value", DoubleType, true)
  )
)
The timestamp comes in as a long, in milliseconds since the epoch.
I converted the long value to a timestamp using the method below:
val dfNew = df.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", schema) as "record")
  .select($"record.id", $"record.value", col("record.timestamp").cast(TimestampType).as("timestamp"))
I want to test the following structured streaming query using window and watermarking:
val output = dfNew.withWatermark("timestamp", "16 seconds").groupBy(window($"timestamp", "10 seconds", "5 seconds"), $"id").count()
It gives results, but the window column displays a far-future timestamp:
+--------------------------------------------------+------------------------------------+-----+
|window |id |count|
+--------------------------------------------------+------------------------------------+-----+
|[50232-03-09 18:13:000.0, 50232-03-09 18:13:100.0]|11c7ebdb-8810-4a51-9d38-4099fd21862a|1 |
|[50232-03-09 17:49:400.0, 50232-03-09 17:49:500.0]|11c7ebdb-8810-4a51-9d38-4099fd21862a|1 |
|[50232-03-09 19:26:500.0, 50232-03-09 19:27:000.0]|58f86590-e27e-44d6-86d3-0905b126c9fd|1 |
|[50232-03-09 18:29:555.0, 50232-03-09 18:30:055.0]|11c7ebdb-8810-4a51-9d38-4099fd21862a|1 |
50232-03-09 18:13:000.0?
What could be the issue? I guess the conversion I did in the Kafka read-stream query is wrong:
col("record.timestamp").cast(TimestampType).as("timestamp")
But I couldn't find any place where this is addressed. Everybody suggests from_unixtime(), but that gives me zero results, and its resolution is only seconds anyway.
Any solutions? Please...
It's simple: when casting a numeric value to a timestamp, Spark interprets it as seconds, not milliseconds. So just divide the input by 1000 before the cast (in your query, (col("record.timestamp") / 1000).cast(TimestampType)):
Seq(1523013247000L).toDF.select(
  ($"value" / 1000).cast("timestamp"), // correct
  $"value".cast("timestamp")           // your current code
).show
// +---------------------------------+--------------------+
// |CAST((value / 1000) AS TIMESTAMP)| value|
// +---------------------------------+--------------------+
// | 2018-04-06 13:14:07|50232-05-15 05:16...|
// +---------------------------------+--------------------+
I am using PySpark through Spark 1.5.0.
I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp?
Ideally, something along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I'd need to replace _ with - in the date half and _ with : in the time part.
I was thinking I could split the column in two using substring, counting backward from the end of the time, then do the regexp_replace separately and concatenate. But this seems like too many operations. Is there an easier way?
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))
## +-------------------+-------------------+
## |dt |parsed |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp
(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))
## +-------------------+---------------------+
## |dt |parsed |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.
zero323's answer answers the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly into timestamp type:
df.withColumn('datetime', col('datetime_str').cast('timestamp'))
It has the advantage of handling milliseconds, while unix_timestamp only has second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2 as zero323 stated). I tested it on Spark 2.3.0, using the following format: '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
I'm adding some more code to Florent F's answer, for better understanding and for running the snippet on a local machine:
import os, pdb, sys
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.types import StringType
from pyspark.sql.functions import col
sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()
# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key": "a", "date": "2016-02-01"},
                      {"key": "b", "date": "2016-02-02"}]).toDF()
df1.show()
df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()
Just want to add more resources and an example to this discussion.
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
For example, if your ts string is "22 Dec 2022 19:06:36 EST", then the format is "dd MMM yyyy HH:mm:ss zzz".
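A quick sketch of applying that pattern (the DataFrame here is made up for illustration; depending on your Spark version, parsing zone abbreviations like EST may require the legacy parser):
from pyspark.sql import functions as F

df = spark.createDataFrame([("22 Dec 2022 19:06:36 EST",)], ["ts"])
# Uncomment if the new (Spark 3.x) parser rejects the zone abbreviation:
# spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.select(F.to_timestamp("ts", "dd MMM yyyy HH:mm:ss zzz").alias("parsed")).show(truncate=False)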
I have a Spark SQL DataFrame with a date column, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want to have all the rows from 7 days back preceding the given row. I figured out I need to use a Window Function like:
Window \
.partitionBy('id') \
.orderBy('start')
I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)
Spark >= 2.3
Since Spark 2.3 it is possible to use interval objects with the SQL API, but the DataFrame API support is still a work in progress.
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, mean(some_value) OVER (
PARTITION BY id
ORDER BY CAST(start AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean FROM df""").show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Spark < 2.3
As far as I know it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming the start column contains a date type:
from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
A small helper and window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
Finally, the query:
w = (Window()
    .partitionBy(col("id"))
    .orderBy(col("start").cast("timestamp").cast("long"))
    .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty but works.
* Hive Language Manual, Types
Spark 3.3 is released, but...
The answer may be as old as Spark 1.5.0: datediff.
datediff(col_name, '1000') will return the integer difference in days from 1000-01-01 to col_name.
As the first argument, it accepts dates, timestamps and even strings. As the second, it even accepts 1000. (A worked example of this approach follows the variant list below.)
The answer
Date difference in days - depending on the data type of the order column:
date
Spark 3.1+
.orderBy(F.expr("unix_date(col_name)")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
timestamp
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
long - UNIX time in microseconds (e.g. 1672534861000000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000000).rangeBetween(-7, 0)
long - UNIX time in milliseconds (e.g. 1672534861000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000).rangeBetween(-7, 0)
long - UNIX time in seconds (e.g. 1672534861)
Spark 2.1+
.orderBy(F.col("col_name") / 86400).rangeBetween(-7, 0)
long in format yyyyMMdd
Spark 3.3+
.orderBy(F.expr("unix_date(to_date(col_name, 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(cast(col_name as string), 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(cast(col_name as string), 'yyyyMMdd'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp(F.col("col_name").cast('string'), 'yyyyMMdd') / 86400).rangeBetween(-7, 0)
string in date format of 'yyyy-MM-dd'
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name))")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other date format (e.g. 'MM-dd-yyyy')
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name, 'MM-dd-yyyy'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp("col_name", 'MM-dd-yyyy') / 86400).rangeBetween(-7, 0)
string in timestamp format of 'yyyy-MM-dd HH:mm:ss'
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other timestamp format (e.g. 'MM-dd-yyyy HH:mm:ss')
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy HH:mm:ss'), '1000')")).rangeBetween(-7, 0)
Fantastic solution, @zero323. If you want to operate with minutes instead of days, as I had to, and you don't need to partition by id, you only have to modify a small part of the code, as shown:
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, sum(total) OVER (
ORDER BY CAST(reading_date AS timestamp)
RANGE BETWEEN INTERVAL 45 minutes PRECEDING AND CURRENT ROW
) AS sum_total FROM df""").show()
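The same thing can be expressed through the DataFrame API by ordering on the timestamp cast to seconds, as in the Spark < 2.3 answer above (a sketch; reading_date and total are the column names used in this example):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window
    .orderBy(F.col("reading_date").cast("timestamp").cast("long"))
    .rangeBetween(-45 * 60, 0))  # 45 minutes, expressed in seconds

df.withColumn("sum_total", F.sum("total").over(w)).show()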