How to expire state of dropDuplicates in structured streaming to avoid OOM? - apache-spark

I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code:
.dropDuplicates("uuid")
On the next day, the state maintained for today should be dropped so that I get the right count of unique accesses for the next day and avoid OOM. The Spark documentation suggests using dropDuplicates with a watermark, for example:
.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")
but the watermark column must be included in dropDuplicates. In that case uuid and timestamp are used as a combined key, so only elements with the same uuid and the same timestamp are deduplicated, which is not what I expect.
So is there a clean solution?

After a few days of effort I finally figured it out myself.
While studying the source code of watermark and dropDuplicates, I discovered that besides an eventTime column, the watermark also accepts a window column, so we can use the following code:
.select(
  window($"timestamp", "1 day"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
Since all events in the same day have the same window, this produces the same result as deduplicating on uuid alone. Hope this helps someone.
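For completeness, a minimal end-to-end sketch of this approach; only the select / withWatermark / dropDuplicates part comes from the answer itself, while the socket source, host/port and the use of current_timestamp are assumptions for illustration.
import org.apache.spark.sql.functions.{current_timestamp, window}
import spark.implicits._

// Hypothetical input stream: one uuid per line on a local socket.
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .select($"value".as("uuid"), current_timestamp().as("timestamp"))

// Deduplicate per (uuid, daily window); state older than the watermark can be dropped.
val deduped = events
  .select(window($"timestamp", "1 day"), $"timestamp", $"uuid")
  .withWatermark("window", "1 day")
  .dropDuplicates("uuid", "window")

// A per-day unique count can then be computed downstream over the deduplicated output.
deduped.writeStream
  .outputMode("append")
  .format("console")
  .start()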

Below is a modification of the procedure proposed in the Spark documentation. The trick is to manipulate the event time, i.e. put event times into buckets. The assumption is that the event time is provided in milliseconds.
// Removes all duplicates that fall in the same 15-minute tumbling window.
// Does NOT remove duplicates that fall in different 15-minute windows!
public static Dataset<Row> removeDuplicates(Dataset<Row> df) {
    // converts the time into 15-minute buckets:
    // timestamp - (timestamp % (15 * 60))
    Column bucketCol = functions.to_timestamp(
            col("event_time").divide(1000).minus((col("event_time").divide(1000)).mod(15 * 60)));
    df = df.withColumn("bucket", bucketCol);

    String windowDuration = "15 minutes";
    df = df.withWatermark("bucket", windowDuration)
           .dropDuplicates("uuid", "bucket");

    return df.drop("bucket");
}
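The same bucketing idea might look like this in Scala (a sketch, assuming event_time in epoch milliseconds and a uuid column as above; it casts the floored epoch seconds back to a timestamp rather than using to_timestamp):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch of the same 15-minute bucketing trick in Scala.
def removeDuplicates(df: DataFrame): DataFrame = {
  val seconds = col("event_time") / 1000
  // floor to 15-minute buckets and reinterpret the epoch seconds as a timestamp
  val bucketCol = (seconds - seconds % (15 * 60)).cast("timestamp")
  df.withColumn("bucket", bucketCol)
    .withWatermark("bucket", "15 minutes")
    .dropDuplicates("uuid", "bucket")
    .drop("bucket")
}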

I found that using the window column directly didn't work for me, so I used window.start (or window.end) instead, aliased back to a column that the watermark and dropDuplicates can reference:
.select(
  window($"timestamp", "1 day")("start").as("window"),
  $"timestamp",
  $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")

Related

Kotlin - Get difference between datetimes in seconds

Is there any way to get the difference between two datetimes in seconds?
For example:
First datetime: 2022-04-25 12:09:10
Second datetime: 2022-05-24 02:46:21
There is a dedicated class for that: Duration (the same in the Android docs).
A time-based amount of time, such as '34.5 seconds'.
This class models a quantity or amount of time in terms of seconds and nanoseconds. It can be accessed using other duration-based units, such as minutes and hours. In addition, the DAYS unit can be used and is treated as exactly equal to 24 hours, thus ignoring daylight savings effects. See Period for the date-based equivalent to this class.
Here is example usage:
val date1 = LocalDateTime.now()
val date2 = LocalDateTime.now()
val duration = Duration.between(date1, date2)
val asSeconds: Long = duration.toSeconds()
val asMinutes: Long = duration.toMinutes()
If your date types are in the java.time package, in other words they implement Temporal, you can use the ChronoUnit class:
val diffSeconds = ChronoUnit.SECONDS.between(date1, date2)
Note that this can result in a negative value, so do take its absolute value (abs) if necessary.

How do I know if a file has been dumped in badRecordsPath? [Azure Databricks - Spark]

I want to use badRecordsPath in Spark on Azure Databricks to get a count of the corrupt records associated with a job, but there is no simple way to know:
if a file has been written
in which partition the file has been written
I thought maybe I could check whether the latest partition was created in the last 60 seconds, with code like this:
from datetime import datetime, timedelta, timezone
import time

# NOTE: input_path is a placeholder for the source data being read
df = (spark.read.format('csv')
      .option("badRecordsPath", corrupt_record_path)
      .load(input_path))

# dictionary of partition name -> corresponding timestamp
partition_dict = {}
for i in dbutils.fs.ls(corrupt_record_path):
    partition_dict[i.name[:-1]] = time.mktime(
        datetime.strptime(i.name[:-1], "%Y%m%dT%H%M%S").timetuple())

# here I get the timestamp of one minute ago
submit_timestamp_utc_minus_minute = datetime.now(timezone.utc) - timedelta(seconds=60)
submit_timestamp_utc_minus_minute = time.mktime(
    datetime.strptime(submit_timestamp_utc_minus_minute.strftime("%Y%m%dT%H%M%S"),
                      "%Y%m%dT%H%M%S").timetuple())

# here I compare the latest partition to check if it is more recent than 60 seconds ago
latest_partition = max(partition_dict, key=lambda k: partition_dict[k])
if partition_dict[latest_partition] > submit_timestamp_utc_minus_minute:
    corrupt_dataframe = spark.read.format('json').load(corrupt_record_path + latest_partition + '/bad_records')
    corrupt_records_count = corrupt_dataframe.count()
else:
    corrupt_records_count = 0
But I see two issues:
it is a lot of overhead (OK, the code could also be better written, but still)
I'm not even sure when the partition name gets defined during the read job. Is it at the beginning of the job or at the end? If it is at the beginning, then the 60-second check is not relevant at all.
As a side note, I cannot use a PERMISSIVE read with a corrupt records column, as I don't want to cache the dataframe (you can see my other question here).
Any suggestion or observation would be much appreciated !

Stream writes having multiple identical keys to delta lake

I am writing streams to Delta Lake through Spark Structured Streaming. Each streaming batch contains key-value pairs (and a timestamp column). Delta Lake doesn't support an update when the source (the streaming batch) contains multiple rows with the same key, so I want to update Delta Lake with only the record with the latest timestamp for each key. How can I do this?
This is the code snippet I am trying:
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {
  println(s"Executing batch $batchId ...")
  microBatchOutputDF.show()

  deltaTable.as("t")
    .merge(
      microBatchOutputDF.as("s"),
      "s.key = t.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}
Thanks in advance.
You can eliminate the records with older timestamps from your microBatchOutputDF dataframe and keep only the record with the latest timestamp for a given key.
You can use Spark's reduceByKey operation and implement a custom reduce function as below:
def getLatestEvents(input: DataFrame): RDD[Row] = {
  input.rdd.map(x => (x.getAs[String]("key"), x)).reduceByKey(reduceFun).map(_._2)
}

def reduceFun(x: Row, y: Row): Row = {
  if (x.getAs[Timestamp]("timestamp").getTime > y.getAs[Timestamp]("timestamp").getTime) x else y
}
This assumes key is of type string and timestamp of type timestamp. Call getLatestEvents on your streaming batch microBatchOutputDF; it drops the older events for each key and keeps only the latest one.
val latestRecordsDF = spark.createDataFrame(getLatestEvents(microBatchOutputDF), <schema of DF>)
Then call the Delta Lake merge operation on top of latestRecordsDF.
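Inside the foreachBatch function, the pieces might fit together as sketched below; this reuses the merge from the question, relies on getLatestEvents above, and fills the schema placeholder with microBatchOutputDF.schema. It is a sketch, not tested code.
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long): Unit = {
  // Keep only the latest record per key before merging (see getLatestEvents above).
  val latestRecordsDF =
    spark.createDataFrame(getLatestEvents(microBatchOutputDF), microBatchOutputDF.schema)

  deltaTable.as("t")
    .merge(latestRecordsDF.as("s"), "s.key = t.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}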
In streaming, a microbatch may contain more than one record for a given key. In order to update the target table, you have to figure out the latest record for each key in the microbatch. In your case you can use the max of the timestamp column and the value column to find the latest record and use that one for the merge operation.
You can refer to this link for more details on finding the latest record for a given key.
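A minimal sketch of that idea (the column names key, timestamp and value are assumed from the question; struct ordering compares fields left to right, so the max is effectively taken by timestamp):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, struct}

// Keep only the latest record per key by taking the max of a (timestamp, value) struct.
def latestPerKey(microBatchOutputDF: DataFrame): DataFrame = {
  microBatchOutputDF
    .groupBy("key")
    .agg(max(struct("timestamp", "value")).as("latest"))
    .select(col("key"), col("latest.timestamp").as("timestamp"), col("latest.value").as("value"))
}
The result can then be passed to the merge in place of microBatchOutputDF.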

Spark 1.5.2: Grouping DataFrame Rows over a Time Range

I have a df with the following schema:
ts: TimestampType
key: int
val: int
The df is sorted in ascending order of ts. Starting with row(0), I would like to group the dataframe within certain time intervals.
For example, if I say df.filter(row(0).ts + expr(INTERVAL 24 HOUR)).collect(), it should return all the rows within the 24 hr time window of row(0).
Is there a way to achieve the above within Spark DF context?
Generally speaking it is a relatively simple task. All you need is basic arithmetic on UNIX timestamps. First let's cast all timestamps to numerics:
val dfNum = df.withColumn("ts", $"ts".cast("long"))
Next let's find the minimum timestamp over all rows:
val offset = dfNum.agg(min($"ts")).first.getLong(0)
and use it to compute groups:
val aDay = lit(60 * 60 * 24)
val group = (($"ts" - lit(offset)) / aDay).cast("long")
val dfWithGroups = dfNum.withColumn("group", group)
Finally you can use it as a grouping column:
dfWithGroups.groupBy($"group").agg(min($"val"))
If you want meaningful intervals (interpretable as timestamps), just multiply the groups by aDay (and add back the offset).
Obviously this won't handle complex cases like daylight saving time or leap seconds, but it should be good enough most of the time. If you need to handle any of these properly, you can apply similar logic with Joda-Time in a UDF.

Mimic 'group by' and window function logic in Spark

I have a large CSV file with the columns id, time, location. I loaded it into an RDD and want to compute some aggregated metrics per trip, where a trip is defined as a time-contiguous set of records with the same id, separated by at least 1 hour on either side. I am new to Spark. (related)
To do that, I am thinking of creating an RDD with elements of the form (trip_id, (time, location)) and using reduceByKey to calculate all the needed metrics.
To calculate the trip_id, I am trying to implement the SQL approach from the linked question: create an indicator field marking whether a record is the start of a trip, and take a cumulative sum of this indicator field. This does not sound like a distributed approach: is there a better one?
Furthermore, how can I add this indicator field? It should be 1 if the time difference to the previous record of the same id is above an hour, and 0 otherwise. At first I thought of doing a groupBy on id and then sorting within each group, but the values end up inside an Array and are thus not amenable to sortByKey, and there is no lead function as in SQL to get the previous value.
An example of the aforementioned approach: for the RDD
(1,9:30,50)
(1,9:37,70)
(1,9:50,80)
(2,19:30,10)
(2,20:50,20)
We want to turn it first into the RDD with the time differences,
(1,9:30,50,inf)
(1,9:37,70,00:07:00)
(1,9:50,80,00:13:00)
(2,19:30,10,inf)
(2,20:50,20,01:20:00)
(The value for the earliest record is, say, Scala's PositiveInfinity constant.)
and then turn this last field into an indicator of whether it is above 1 hour, which marks whether a new trip starts,
(1,9:30,50,1)
(1,9:37,70,0)
(1,9:50,80,0)
(2,19:30,10,1)
(2,20:50,20,1)
and then turn it into a trip_id
(1,9:30,50,1)
(1,9:37,70,1)
(1,9:50,80,1)
(2,19:30,10,2)
(2,20:50,20,3)
and then use this trip_id as the key for aggregations.
The preprocessing was simply to load the file and drop the header:
val rawdata = sc.textFile("some_path")
def isHeader(line: String) = line.contains("id")
val data = rawdata.filter(!isHeader(_))
Edit
While trying to implement this with Spark SQL, I ran into an error regarding the time difference:
val lags = sqlContext.sql("""
  select time - lag(time) over (partition by id order by time) as diff_time from data
""")
since Spark doesn't know how to take the difference between two timestamps. I'm trying to check whether this difference is above one hour.
It also doesn't recognize the function getTime, which I found suggested online; the following returns an error too (Couldn't find window function time.getTime):
val lags = sqlContext.sql("""
  select time.getTime() - (lag(time)).getTime() over (partition by id order by time)
  from data
""")
Even though making a similar lag difference for a numeric attribute works:
val lag_numeric = sqlContext.sql("""
  select longitude - lag(longitude) over (partition by id order by time)
  from data
""") // works
Spark didn't recognize the function Hours.hoursBetween either. I'm using Spark 1.4.0.
I also tried to define an appropriate user-defined function, but UDFs are oddly not recognized inside queries:
val timestamp_diff: ((Timestamp, Timestamp) => Double) =
  (d1: Timestamp, d2: Timestamp) => d1.getTime() - d2.getTime()
val lags = sqlContext.sql("""
  select timestamp_diff(time, lag(time)) over (partition by id order by time) from data
""")
So, how can spark test whether the difference between timestamps is above an hour?
Full code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext // for window functions
import java.util.Date
import java.sql.Timestamp

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

case class Record(id: Int, time: Timestamp, longitude: Double, latitude: Double)

val raw_data = sc.textFile("file:///home/sygale/merged_table.csv")
val data_records =
  raw_data.map(line =>
    Record(
      line.split(',')(0).toInt,
      Timestamp.valueOf(line.split(',')(1)),
      line.split(',')(2).toDouble,
      line.split(',')(3).toDouble
    ))

val data = data_records.toDF()
data.registerTempTable("data")

val lags = sqlContext.sql("""
  select time - lag(time) over (partition by id order by time) as diff_time from data
""")
