Streaming data processing with Spark Structured Streaming - apache-spark

I have events being pushed to Kafka from the App.
The business design is such that a single interaction in the App can generate at most 3 events.
The events for one interaction share a common ID.
My goal is to combine those three events into one row in a Delta table.
I'm doing a 15-minute, timestamp-based window aggregation.
And the issue is the following.
Let's say it's the period of 12:00 - 12:15.
Event A - timestamp = 12:14:30
Event B - timestamp = 12:14:50
Event C - timestamp = 12:15:50
So Event C falls into a different time window and hence is not combined with A and B.
I could check for each batch whether the ID already exists in the Delta table, but I wonder if there is a more elegant way to handle these cases.
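For reference, a minimal sketch of the per-batch approach mentioned above, assuming a foreachBatch sink that MERGEs each micro-batch into the Delta table on the shared interaction ID (the table path, column names, and the parsedEvents stream are hypothetical):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Upsert each micro-batch into the Delta table keyed on the shared interaction ID.
// In practice you would update only the columns carried by the incoming event
// rather than replacing the whole row with updateAll().
def upsertBatch(batch: DataFrame, batchId: Long): Unit = {
  DeltaTable.forPath(spark, "/delta/interactions").alias("t")
    .merge(batch.alias("s"), "t.interaction_id = s.interaction_id")
    .whenMatched().updateAll()     // later events for an existing ID update its row
    .whenNotMatched().insertAll()  // the first event for an ID creates the row
    .execute()
}

parsedEvents.writeStream
  .foreachBatch(upsertBatch _)
  .start()

Because the MERGE keys on the ID rather than the window, this does not depend on all three events landing in the same 15-minute window.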

Related

How to count the number of messages fetched from a Kafka topic in a day?

I am fetching data from Kafka topics and storing it in Delta Lake (parquet) format. I wish to find the number of messages fetched on a particular day.
My thought process: read the directory where the data is stored in parquet format using Spark and count the ".parquet" files for a particular day. This returns a count, but I am not really sure if that's the correct way.
Is this way correct? Are there any other ways to count the number of messages fetched from a Kafka topic on a particular day (or over a duration)?
The messages we consume from a topic not only have a key and value but also carry other information, such as a timestamp, which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, depending on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the timestamp in the producer record is used by the broker, whereas if the topic is configured with LOG_APPEND_TIME, the broker overwrites the timestamp with its local time while appending the record.
So if you keep the timestamp wherever you store the messages, you can easily track the per-day or per-hour message rate.
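For example, a minimal sketch, assuming the consumer persisted the Kafka timestamp column alongside each message in a Delta table (the table path and column name here are placeholders):

import org.apache.spark.sql.functions._
import spark.implicits._

// Count persisted messages per day using the stored Kafka timestamp column.
val perDay = spark.read.format("delta").load("/delta/kafka_messages")
  .groupBy(to_date($"timestamp").as("day"))
  .agg(count("*").as("messages"))

perDay.filter($"day" === "2021-04-19").show()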
Another way is to use a Kafka dashboard such as Confluent Control Center (licensed) or Grafana (free), or any other tool, to track the message flow.
In our case, while consuming and storing or processing messages, we also route message metadata to Elasticsearch and visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information, without counting rows between two versions, is to use the Delta table history. This has several advantages: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing a MERGE operation (that isn't possible by comparing .count across versions, because an update replaces the actual value and a delete removes the row).
For example, for plain appends, the following code counts all rows written by normal append operations (for other operations, like MERGE/UPDATE/DELETE, you may need to look at other metrics):
from delta.tables import *

# Sum the 'numOutputRows' metric over every operation recorded in the table history
# for the chosen day ('begin_of_day' / 'end_of_day' are placeholders for real timestamps).
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history()\
    .filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")\
    .selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")\
    .groupBy().sum()

How does Spark Structured Streaming determine an event has arrived late?

I read through the Spark Structured Streaming documentation and I wonder how Spark Structured Streaming determines that an event has arrived late. Does it compare the event time with the processing time?
Taking the above picture as an example, does the bold right arrow labelled "Time" represent processing time? If so:
1) Where does this processing time come from? Since it's streaming, is it assumed that the upstream source carries a processing timestamp, or does Spark add a processing-timestamp field? For example, when reading messages from Kafka we do something like
Dataset<Row> kafkadf = spark.readStream().format("kafka").load()
This dataframe has a timestamp column by default, which I am assuming is the processing time. Correct? If so, does Kafka or Spark add this timestamp?
2) I can see there is a time comparison between the bold right arrow line and the time in the message. Is that how Spark determines an event is late?
The processing time of an individual job (one RDD of the DStream) is what dictates the processing time in general. It is not when the actual processing of that RDD happens, but when the RDD job was allocated to be processed.
To see clearly what this means, create a Spark Streaming application where the batch interval is 60 seconds and make sure each batch takes 2 minutes. Eventually you will see that a job is allocated to be processed at a given time but has not been picked up because the previous job has not finished.
Next:
Out-of-order data can be dealt with in two different ways.
Create a high-water mark (watermark).
It is explained in the same Spark user guide page you got your picture from.
It is easiest to understand with a key-value pair where the key is the timestamp. By setting .withWatermark("timestamp", "10 minutes") we are essentially saying that if I have received a message for 10 AM, then I will still allow messages slightly older than that (up to 9:50 AM). Any message older than that gets dropped. (A short sketch of this option follows after the second option below.)
The other way out-of-order data can be handled is within the mapGroupsWithState or mapWithState function.
Here you can decide what to do when you get a bunch of values for a key: drop anything before time X, or go even fancier than that (e.g. if it is data from A then allow a 20-minute delay, for the rest allow a 30-minute delay, etc.).
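To make the first option (the watermark) concrete, here is a minimal sketch; the events stream, its column names, and the 15-minute window size are assumptions, not taken from the answer:

import org.apache.spark.sql.functions._
import spark.implicits._

// Event-time window aggregation with a 10-minute watermark:
// records arriving more than 10 minutes behind the latest event time seen so far are dropped.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "15 minutes"), $"key")
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()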
This document from Databricks explains it pretty clearly:
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
Basically it comes down to the watermark (late data threshold) and the order in which data records arrive. Processing time does not play into it at all. You set a watermark on the column representing your event time. If a record R1 with event time T1 arrives after a record R2 with event time T2 has already been seen, and T2 > T1 + Threshold, then R1 will be discarded.
For example, suppose T1 = 09:00, T2 = 10:00, Threshold = 61 min. If a record R1 with event time T1 arrives before a record R2 with event time T2 then R1 is included in calculations. If R1 arrives after R2 then R1 is still included in calculations because T2 < T1 + Threshold.
Now suppose Threshold = 59 min. If R1 arrives before R2 then R1 is included in calculations. If R1 arrives after R2, then R1 is discarded because we've already seen a record with event time T2 and T2 > T1 + Threshold.

Solution for delaying events for N days

We're currently writing an application in Microsoft Azure and we're planning to use Event Hubs to handle processing of real-time events.
However, after an initial processing step we will have to delay further processing of the events for N days. The process will work like this:
Event triggered -> Place event in Event Hub -> Event gets fetched from Event Hub and processed -> Event is delayed for N days -> Event gets further processed (the last two steps might be a loop)
How can we achieve this delay of further event processing without using polling or similar strategies? One idea is to use Azure Queues and their visibility timeout, but 7 days is the supported maximum according to the documentation, and our business demands are in the 1-3 month range. The number of events in our system should be at most 10k per day.
Any ideas would be appreciated, thanks!
As you already mentioned, Event Hubs retains data for a window of at most 7 days.
Event Hubs are typically used as real-time telemetry data pipelines where data seek performance is critical. For 99.9% of use cases/scenarios our users typically require the last couple of hours, if not seconds.
However, after the real-time processing is over, if you still need to re-analyze the data after a while, for example to run a Hadoop job on last month's data, our seek pattern and store are not optimized for it. We recommend forwarding the messages to other data archival stores that are specialized for big-data queries.
As data archival is something most of our customers naturally ask for, we are releasing a new feature that automatically archives the data in Avro format into Azure Storage.

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 24 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have three choices:
Run 1 query at time.
Run 2 queries the first time, and then one at a time to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1 and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data of both hours 0 and 1, and start processing the data for hour 0. In the meantime, the data for hour 1 arrives, so when you finish processing hour 0 you can prefetch the data for hour 2 and start processing hour 1. And so on... In this way you can speed up data processing. Of course, depending on your timings (data processing and query times) you should tune the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;

ResultSet rs = session.execute("your query");
for (Row row : rs) {
    // trigger an asynchronous fetch of the next page before the current one runs out
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need the data for both hour 3 and hour 6, then you could try to group the data by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, you should run 24 queries in parallel and then join the results at the application level, as sketched below (you already know why you should avoid IN for multiple partitions). Remember that your data is already ordered, so your application-level effort would be very small.
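For illustration, a minimal Scala sketch of that fan-out, assuming the DataStax Java Driver 3.x mentioned in the question (contact point, keyspace, and table are placeholders; in production you would use a prepared statement instead of interpolated strings):

import com.datastax.driver.core.{Cluster, ResultSetFuture}
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("mykeyspace")

// Fire one asynchronous query per hour bucket of the logical partition.
val futures: Seq[ResultSetFuture] = (0 to 23).map { hour =>
  session.executeAsync(
    s"SELECT * FROM mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp = $hour")
}

// Collect the buckets in hour order; rows inside each bucket are already sorted by the clustering key.
val rows = futures.flatMap(f => f.getUninterruptibly().all().asScala)

cluster.close()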

Spark: grouping of data stream based on cycle time

I need your inputs regarding grouping of a data stream within Spark Streaming on the basis of cycle time.
We are receiving input data in this format: {Object_id:"vm123", time:"1469077478", metric:"cpu.usage", value:"50.8"}.
Data frames are being ingested very fast, on average one every 10 seconds. We have a use case to create bins of data based on cycle time.
Suppose the Spark bin/batch time is 1 minute for processing of data. The cycle time should be based on the message timestamp. For example, if we receive the first packet at 11:30 am, then we would have to aggregate all messages of that metric received between 11:30 am and 11:31 am (1 minute) and send them for processing with cycle time 11:31 am.
As per the Spark documentation, we only have support to bin data based on a fixed batch duration: for example, if we define the batch duration as 1 minute, it will hold the data for 1 minute and send that as a batch, and we have an option to aggregate the data received during this one-minute duration. But this approach does not follow the notion of aggregating the data based on the cycle time as defined above.
Please let us know if there is a way to achieve the above use case through Spark or some other tool.
Added details:
In our use case data frames are ingested every 10 seconds for different entities, and each object has a few metrics. We need to create bins of data before processing based on a cycle time interval (like 5 minutes), and the start of that interval should be the message timestamp.
For example:
We have messages of an object 'vm123' in a Kafka queue like the following:
message1 = {Object_id:"vm123", time:"t1", metric:"m1", value:"50.8"}
message2 = {Object_id:"vm123", time:"t1", metric:"m2", value:"55.8"}
...
Cycle time interval = 5 minutes.
So the first bin for entity 'vm123' should have all messages in the range t1 to (t1 + 5*60), and the final group of messages with a 5-minute cycle time for ob1 should look like the following:
{Object_id:"ob1", time:"t5", metrics: [{"name": "m1", value: "average of (v1,v2,v3,v4,v5)"}, {"name": "m2", value: "average of (v1,v2,v3,v4,v5)"}]}
Thanks
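As a rough illustration of the binning logic described above (not from the original thread, and not a streaming solution), here is a minimal batch-mode sketch in which each object's first message timestamp anchors its 5-minute cycle bins; column names and sample values are illustrative only:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val cycleSeconds = 5 * 60

// Sample events: (object_id, epoch-second timestamp, metric, value)
val events = Seq(
  ("vm123", 1469077478L, "m1", 50.8),
  ("vm123", 1469077478L, "m2", 55.8),
  ("vm123", 1469077700L, "m1", 48.2)
).toDF("object_id", "time", "metric", "value")

// The first message per object anchors its cycles; every later message falls into
// bin number floor((time - first_time) / cycleSeconds) for that object.
val firstTs = Window.partitionBy($"object_id")

val binned = events
  .withColumn("cycle_start", min($"time").over(firstTs))
  .withColumn("cycle", floor(($"time" - $"cycle_start") / cycleSeconds))
  .groupBy($"object_id", $"cycle", $"metric")
  .agg(avg($"value").as("avg_value"))

binned.show()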
