I am using Spark 2.3 and I have read here that it does not support timestamp milliseconds (only in 2.4+), but am looking for ideas on how to do what I need to do.
The data I am processing stores dates as String datatype in Parquet files in this format: 2021-07-09T01:41:58Z
I need to subtract one millisecond from that. If it were Spark 2.4, I think I could do something like this:
to_timestamp(col("sourceStartTimestamp")) - expr("INTERVAL 0.001 SECONDS")
But since it is Spark 2.3, that does not do anything. I confirmed it can subtract 1 second, but it ignores any value less than a second.
Can anyone suggest a workaround for how to do this in Spark 2.3? Ultimately, the result will need to be a String data type, if that makes any difference.
Since millisecond timestamps aren't supported by Spark 2.3 (or below), consider using a UDF that takes a delta in milliseconds and a date format, and uses java.time's plusNanos() to get what you need:
import org.apache.spark.sql.functions._

def getMillisTS(delta: Long, fmt: String = "yyyy-MM-dd HH:mm:ss.SSS") = udf { (ts: java.sql.Timestamp) =>
  import java.time.format.DateTimeFormatter
  ts.toLocalDateTime.plusNanos(delta * 1000000).format(DateTimeFormatter.ofPattern(fmt))
}
Test-running the UDF:
val df = Seq("2021-01-01 00:00:00", "2021-02-15 12:30:00").toDF("ts")
df.withColumn("millisTS", getMillisTS(-1)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2020-12-31 23:59:59.999|
|2021-02-15 12:30:00|2021-02-15 12:29:59.999|
+-------------------+-----------------------+
*/
df.withColumn("millisTS", getMillisTS(5000)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2021-01-01 00:00:05.000|
|2021-02-15 12:30:00|2021-02-15 12:30:05.000|
+-------------------+-----------------------+
*/
val df = Seq("2021-01-01T00:00:00Z", "2021-02-15T12:30:00Z").toDF("ts")
df.withColumn(
  "millisTS",
  getMillisTS(-1, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")(to_timestamp($"ts", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
).show(false)
/*
+--------------------+------------------------+
|ts                  |millisTS                |
+--------------------+------------------------+
|2021-01-01T00:00:00Z|2020-12-31T23:59:59.999Z|
|2021-02-15T12:30:00Z|2021-02-15T12:29:59.999Z|
+--------------------+------------------------+
*/
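If you are on PySpark rather than Scala, a minimal sketch of the same idea: a Python UDF that parses the string, subtracts a millisecond, and formats it back. The column name sourceStartTimestamp is taken from the question; the rest is illustrative.
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def minus_one_milli(ts):
    if ts is None:
        return None
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") - timedelta(milliseconds=1)
    # %f prints microseconds; drop the last three digits to keep milliseconds
    return dt.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

df.withColumn("sourceStartTimestamp", minus_one_milli(F.col("sourceStartTimestamp")))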
I have a Spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to get this format: YY-MM-DD HH:MM:SS, knowing that I have the following value: 20171107014824952 (which means: 2017-11-07 01:48:25)?
The seconds part consists of 5 digits; in the example above it is 24952, while the log file displays 25, so I have to round 24.952 up before applying the to_timestamp function. That's why I'm asking for help.
Assuming you have the following spark DataFrame:
df.show()
#+-----------------+
#| requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+
With the schema:
df.printSchema()
#root
# |-- requestTime: string (nullable = true)
You can use the techniques described in Convert pyspark string to date format to convert this to a timestamp. Since the solution is dependent on your spark version, I've created the following helper function:
import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    try:
        # For spark version 2.2 and above, to_timestamp is available
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        # For spark version 2.1 and below, you'll have to do it this way
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
Now call it on your data using the appropriate format:
df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddHHmmssSSS")
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+
Unfortunately, this truncates the timestamp instead of rounding.
Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you'll have to cast it to a double, divide by 1000, round, cast it back to a long (to chop off the decimal; you can't use int because the number is too big), and finally cast back to a string.
df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double") / 1000.0).cast("long").cast("string"),
        "yyyyMMddHHmmss"
    )
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+
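If you ultimately need the result back as a string (the question asks for a yyyy-MM-dd HH:mm:ss-style format), a small follow-up sketch wraps the same expression in date_format:
df.withColumn(
    "requestTime",
    f.date_format(
        timestamp_from_string(
            f.round(f.col("requestTime").cast("double") / 1000.0).cast("long").cast("string"),
            "yyyyMMddHHmmss"
        ),
        "yyyy-MM-dd HH:mm:ss"
    )
).show()
#+-------------------+
#|        requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+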
Spark has SQL function percentile_approx(), and its Scala counterpart is df.stat.approxQuantile().
However, the Scala counterpart cannot be used on grouped datasets, something like df.groupby("foo").stat.approxQuantile(), as answered here: https://stackoverflow.com/a/51933027.
But it's possible to do both grouping and percentiles in SQL syntax. So I'm wondering: maybe I can define a UDF from the SQL percentile_approx function and use it on my grouped dataset?
Spark >= 3.1
Corresponding functions were added to the Scala/Python functions API in Spark 3.1 - see SPARK-30569.
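For example, from PySpark 3.1+ it can be called directly on a grouped DataFrame. A minimal sketch, assuming an analogous PySpark DataFrame with the same group/value columns as the Scala example below:
from pyspark.sql import functions as F

df.groupBy("group").agg(
    F.percentile_approx("value", 0.5, 10000).alias("p50")
).show()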
Spark < 3.1
While you cannot use approxQuantile in a UDF, and there is no Scala wrapper for percentile_approx, it is not hard to implement one yourself:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column): Column = percentile_approx(
    col, percentage, lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)
  )
}
Example usage:
import PercentileApprox._
val df = (Seq.fill(100)("a") ++ Seq.fill(100)("b")).toDF("group").withColumn(
  "value", when($"group" === "a", randn(1) + 10).otherwise(randn(3))
)
df.groupBy($"group").agg(percentile_approx($"value", lit(0.5))).show
+-----+------------------------------------+
|group|percentile_approx(value, 0.5, 10000)|
+-----+------------------------------------+
| b| -0.06336346702250675|
| a| 9.818985618591595|
+-----+------------------------------------+
df.groupBy($"group").agg(
percentile_approx($"value", typedLit(Seq(0.1, 0.25, 0.75, 0.9)))
).show(false)
+-----+----------------------------------------------------------------------------------+
|group|percentile_approx(value, [0.1,0.25,0.75,0.9], 10000) |
+-----+----------------------------------------------------------------------------------+
|b |[-1.2098351202406483, -0.6640768986666159, 0.6778253126144265, 1.3255676906697658]|
|a |[8.902067202468098, 9.290417382259626, 10.41767257153993, 11.067087075488068] |
+-----+----------------------------------------------------------------------------------+
Once this is on the JVM classpath, you can also add a PySpark wrapper, using logic similar to the built-in functions.
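Alternatively, if you only need this from PySpark, you don't strictly need a wrapper at all: since percentile_approx is available as a SQL function, a minimal sketch using expr already works on grouped data (assuming an analogous PySpark DataFrame with the same group/value columns):
from pyspark.sql import functions as F

df.groupBy("group").agg(
    F.expr("percentile_approx(value, 0.5)").alias("p50")
).show()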
I am trying to read a data stream from Kafka with the schema below:
val schema = StructType(
  List(
    StructField("timestamp", LongType, true),
    StructField("id", StringType, true),
    StructField("value", DoubleType, true)
  )
)
The timestamp comes in as a long, in milliseconds since the epoch.
I converted the long value to a timestamp using the method below:
val dfNew = df.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", schema) as "record")
  .select($"record.id", $"record.value", col("record.timestamp").cast(TimestampType).as("timestamp"))
I want to test the following structured streaming query using window and watermarking:
val output = dfNew.withWatermark("timestamp", "16 seconds").groupBy(window($"timestamp", "10 seconds", "5 seconds"), $"id").count()
It gives results, but the window column displays a far-future timestamp:
+--------------------------------------------------+------------------------------------+-----+
|window |id |count|
+--------------------------------------------------+------------------------------------+-----+
|[50232-03-09 18:13:000.0, 50232-03-09 18:13:100.0]|11c7ebdb-8810-4a51-9d38-4099fd21862a|1 |
|[50232-03-09 17:49:400.0, 50232-03-09 17:49:500.0]|11c7ebdb-8810-4a51-9d38-4099fd21862a|1 |
|[50232-03-09 19:26:500.0, 50232-03-09 19:27:000.0]|58f86590-e27e-44d6-86d3-0905b126c9fd|1 |
|[50232-03-09 18:29:555.0, 50232-03-09 18:30:055.0]|11c7ebdb-8810-4a51-9d38-4099fd21862a|1 |
50232-03-09 18:13:000.0?
What could be the issue? I guess the conversion I did in the Kafka read-stream query is wrong:
col("record.timestamp").cast(TimestampType).as("timestamp")
But I couldn't find any place where this is addressed. Everybody suggests from_unixtime(), but that gives me zero results, and its resolution is only seconds anyway.
Any solutions? Please...
It's simple: when casting a numeric value to a timestamp, Spark interprets it as seconds, not milliseconds. So just divide the input by 1000 before the cast (in your query, (col("record.timestamp") / 1000).cast(TimestampType)):
Seq(1523013247000L).toDF.select(
  ($"value" / 1000).cast("timestamp"), // correct
  $"value".cast("timestamp")           // your current code
).show
// +---------------------------------+--------------------+
// |CAST((value / 1000) AS TIMESTAMP)| value|
// +---------------------------------+--------------------+
// | 2018-04-06 13:14:07|50232-05-15 05:16...|
// +---------------------------------+--------------------+
I am using PySpark through Spark 1.5.0.
I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp?
Ideally, something along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I'd need to replace _ with - in the date half and _ with : in the time part.
I was thinking I could split the column in two using substring, counting backward from the end of the time, then do the regexp_replace separately and concatenate. But this seems like too many operations. Is there an easier way?
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))
## +-------------------+-------------------+
## |dt |parsed |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp
(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))
## +-------------------+---------------------+
## |dt |parsed |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.
zero323's answer answers the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly into timestamp type:
df.withColumn('datetime', col('datetime_str').cast('timestamp'))
It has the advantage of handling milliseconds, while unix_timestamp only has second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2 as zero323 stated). I tested it on Spark 2.3.0, using the following format: '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
I'm adding some more code to Florent F's answer, for better understanding and for running the snippet on a local machine:
import os, pdb, sys
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.types import StringType
from pyspark.sql.functions import col
sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()
# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key": "a", "date": "2016-02-01"},
                      {"key": "b", "date": "2016-02-02"}]).toDF()
df1.show()
df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()
Just want to add more resources and an example to this discussion.
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
For example, if your ts string is "22 Dec 2022 19:06:36 EST", then the format is "dd MMM yyyy HH:mm:ss zzz".
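A quick sketch of applying that pattern (the DataFrame here is made up for illustration; depending on your Spark version, parsing zone abbreviations like EST may require the legacy parser):
from pyspark.sql import functions as F

df = spark.createDataFrame([("22 Dec 2022 19:06:36 EST",)], ["ts"])
# Uncomment if the new (Spark 3.x) parser rejects the zone abbreviation:
# spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.select(F.to_timestamp("ts", "dd MMM yyyy HH:mm:ss zzz").alias("parsed")).show(truncate=False)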
I have a Spark SQL DataFrame with a date column, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want to have all the rows from 7 days back preceding the given row. I figured out I need to use a Window Function like:
Window \
.partitionBy('id') \
.orderBy('start')
I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)
Spark >= 2.3
Since Spark 2.3 it is possible to use interval objects with the SQL API, but the DataFrame API support is still a work in progress.
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, mean(some_value) OVER (
PARTITION BY id
ORDER BY CAST(start AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean FROM df""").show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Spark < 2.3
As far as I know it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming the start column contains a date type:
from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
A small helper and window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
Finally, the query:
w = (Window()
    .partitionBy(col("id"))
    .orderBy(col("start").cast("timestamp").cast("long"))
    .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty but works.
* Hive Language Manual, Types
Spark 3.3 is released, but...
The answer may be as old as Spark 1.5.0: datediff.
datediff(col_name, '1000') will return the integer difference in days from 1000-01-01 to col_name.
As the first argument, it accepts dates, timestamps and even strings. As the second, it even accepts 1000. (A worked example of this approach follows the variant list below.)
The answer
Date difference in days - depending on the data type of the order column:
date
Spark 3.1+
.orderBy(F.expr("unix_date(col_name)")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
timestamp
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
long - UNIX time in microseconds (e.g. 1672534861000000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000000).rangeBetween(-7, 0)
long - UNIX time in milliseconds (e.g. 1672534861000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000).rangeBetween(-7, 0)
long - UNIX time in seconds (e.g. 1672534861)
Spark 2.1+
.orderBy(F.col("col_name") / 86400).rangeBetween(-7, 0)
long in format yyyyMMdd
Spark 3.3+
.orderBy(F.expr("unix_date(to_date(col_name, 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(cast(col_name as string), 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(cast(col_name as string), 'yyyyMMdd'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp(F.col("col_name").cast('string'), 'yyyyMMdd') / 86400).rangeBetween(-7, 0)
string in date format of 'yyyy-MM-dd'
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name))")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other date format (e.g. 'MM-dd-yyyy')
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name, 'MM-dd-yyyy'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp("col_name", 'MM-dd-yyyy') / 86400).rangeBetween(-7, 0)
string in timestamp format of 'yyyy-MM-dd HH:mm:ss'
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other timestamp format (e.g. 'MM-dd-yyyy HH:mm:ss')
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy HH:mm:ss'), '1000')")).rangeBetween(-7, 0)
Fantastic solution, @zero323. If you want to operate with minutes instead of days, as I had to, and you don't need to partition by id, you only have to modify a small part of the code, as shown:
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, sum(total) OVER (
ORDER BY CAST(reading_date AS timestamp)
RANGE BETWEEN INTERVAL 45 minutes PRECEDING AND CURRENT ROW
) AS sum_total FROM df""").show()
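The same thing can be expressed through the DataFrame API by ordering on the timestamp cast to seconds, as in the Spark < 2.3 answer above (a sketch; reading_date and total are the column names used in this example):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window
    .orderBy(F.col("reading_date").cast("timestamp").cast("long"))
    .rangeBetween(-45 * 60, 0))  # 45 minutes, expressed in seconds

df.withColumn("sum_total", F.sum("total").over(w)).show()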