What is spark.streaming.receiver.maxRate? How does it work with batch interval - apache-spark

I am working with spark 1.5.2. I understand what a batch interval is, essentially the interval after which the processing part should start on the data received from the receiver.
But I do not understand what is spark.streaming.receiver.maxRate. From some research it is apparently an important parameter.
Lets consider a scenario. my batch interval is set to 60s. And spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load. What would happen? Will the additional 60*1000 records be dropped? Or would the processing happen twice during that batch interval?

Property spark.streaming.receiver.maxRate applies to number of records per second.
The receiver max rate is applied when receiving data from the stream - that means even before batch interval applies. In other words you will never get more records per second than set in spark.streaming.receiver.maxRate. The additional records will just "stay" in the stream (e.g. Kafka, network buffer, ...) and get processed in the next batch.

Related

Spark structured streaming asynchronous batch blocking

I’m using Apache Spark structured streaming for reading from Kafka. Sometimes my micro batches get processed in a greater time than specified, due to heavy writes IO operations. I was wondering if there’s an option of starting the next batch before the first one has finished, but make the second batch blocked by the first?
I mean that if the first one took 7 seconds and the batch is set for 5 seconds, then start the second batch on the fifth second. But if the second batch finishes block it so it won’t write before it’s previous batch (because of the will to keep the correct messages order).
No. Next batch only starts if previous completed. I think you mean term interval. It would become a mess otherwise.
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers

spark streaming understanding timeout setup in mapGroupsWithState

I am trying very hard to understand the timeout setup when using the mapGroupsWithState for spark structured streaming.
below link has very detailed specification, but I am not sure i understood it properly, especially the GroupState.setTimeoutTimeStamp() option. Meaning when setting up the state expiry to be sort of related to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
With EventTimeTimeout, the user also has to specify the the the event time watermark in the query using Dataset.withWatermark().
With this setting, data that is older than the watermark are filtered out.
The timeout can be set for a group by setting a timeout timestamp usingGroupState.setTimeoutTimestamp(), and the timeout would occur when the watermark advances beyond the set timestamp.
You can control the timeout delay by two parameters - watermark delay and an additional duration beyond the timestamp in the event (which is guaranteed to be newer than watermark due to the filtering).
Guarantees provided by this timeout are as follows:
Timeout will never be occur before watermark has exceeded the set timeout.
Similar to processing time timeouts, there is a no strict upper bound on the delay when the timeout actually occurs. The watermark can advance only when there is data in the stream, and the event time of the data has actually advanced.
question 1:
What is this timestamp in this sentence and the timeout would occur when the watermark advances beyond the set timestamp? is it an absolute time or is it a relative time duration to the current event time in the state? I know I could expire it by removing the state by ```
e.g. say I have some data state like below, when will it exprire by setting up what value in what settings?
+-------+-----------+-------------------+
|expired|something | timestamp|
+-------+-----------+-------------------+
| false| someKey |2020-08-02 22:02:00|
+-------+-----------+-------------------+
question 2:
Reading the sentence Data that is older than the watermark are filtered out, I understand the late arrival data is ignored after it is read from kafka, is this correct?
question reason
Without understanding these, i can not really apply them to use cases. Meaning when to use GroupState.setTimeoutDuration(), when to use GroupState.setTimeoutTimestamp()
Thanks a lot.
ps. I also tried to read below
- https://www.waitingforcode.com/apache-spark-structured-streaming/stateful-transformations-mapgroupswithstate/read
(confused me, did not understand)
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
(did not say a lot of it for my interest)
What is this timestamp in the sentence and the timeout would occur when the watermark advances beyond the set timestamp?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
is it an absolute time or is it a relative time duration to the current event time in the state?
This is a relative time (not duration) based on the current batch window.
say I have some data state (column timestamp=2020-08-02 22:02:00), when will it expire by setting up what value in what settings?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have used a watermark before applying the groupByKey and the mapGroupsWithState. I understand you want to use timeouts based on event times (as opposed to processing times, so your query will be like:
ds.withWatermark("timestamp", "10 minutes")
.groupByKey(...) // declare your key
.mapGroupsWithState(
GroupStateTimeout.EventTimeTimeout)(
...) // your custom update logic
Now, it depends on how you set the TimeoutTimestamp withing your "custom update logic". Somewhere in your custom update logic you will need to call
state.setTimeoutTimestamp()
This method has four different signatures and it is worth scanning through their documentation. As we have set a watermark in (withWatermark) we can actually make use of that time. As a general rule: It is important to set the timeout timestamp (set by state.setTimeoutTimestamp()) to a value larger then the current watermark. To continue with our example we add one hour as shown below:
state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 hour")
To conclude, your message can arrive into your stream between 22:00:00 and 22:15:00 and if that message was the last for the key it will timeout by 23:15:00 in your GroupState.
question 2: Reading the sentence Data that is older than the watermark are filtered out, I understand the late arrival data is ignored after it is read from kafka, this is correct?
Yes, this is correct. For the batch interval 22:00:00 - 22:05:00 all messages that have an event time (defined by column timestamp) arrive later then the declared watermark of 10 minutes (meaning later then 22:15:00) will be ignored anyway in your query and are not going to be processed within your "custom update logic".

Check messages in each 10 min intervals kafka - Nodejs

Consumer should check messages at each 10 min intervals this time response message should contains from uncommitted offset,
Currently messages getting once producer send message
That's not really how a Kafka Consumer works. Usually, you have an infinite loop and just take whatever messages are given to you. Unless you're changing the group.id and not committing offsets between requests, you'll always get the next batch of messages.
If you want to add some max consumption limit, followed by a 10 minutes to sleep a thread within that loop, then that's an implementation detail of your application, but not specific to Kafka

Spark Streaming Kafka backpressure

We have a Spark Streaming application, it reads data from a Kafka queue in receiver and does some transformation and output to HDFS. The batch interval is 1min, we have already tuned the backpressure and spark.streaming.receiver.maxRate parameters, so it works fine most of the time.
But we still have one problem. When HDFS is totally down, the batch job will hang for a long time (let us say the HDFS is not working for 4 hours, and the job will hang for 4 hours), but the receiver does not know that the job is not finished, so it is still receiving data for the next 4 hours. This causes OOM exception, and the whole application is down, we lost a lot of data.
So, my question is: is it possible to let the receiver know the job is not finishing so it will receive less (or even no) data, and when the job finished, it will start receiving more data to catch up. In the above condition, when HDFS is down, the receiver will read less data from Kafka and block generated in the next 4 hours is really small, the receiver and the whole application is not down, after the HDFS is ok, the receiver will read more data and start catching up.
You can enable back pressure by setting the property spark.streaming.backpressure.enabled=true. This will dynamically modify your batch sizes and will avoid situations where you get an OOM from queue build up. It has a few parameters:
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good but I simulated the response of the algorithm to various parameters here

When using netcat to process logs in spark streaming, it is dropping last few lines?

I have a spark-streaming service, where I am processing and detecting anomalies on the basis of some offline generated model. I feed data into this service from a log file, which is streamed using the following command
tail -f <logfile>| nc -lk 9999
Here the spark streaming service is taking data from port 9999. However, I observe that the last few lines are being dropped, i.e. spark streaming does not receive those log lines or they are not processed.
However, I also observed that if I simply take the logfile as standard input instead of tailing it, no lines are dropped:
nc -q 10 -lk 9999 < logfile
Can anyone explain why this behavior is happening? And what could be a better resolution to the problem of streaming log data to spark streaming instance?
In Spark Streaming, data comes in over the wire, and constitutes a block on every block interval. This block is replicated on other machines (according to your storage level as soon as formed. Once a batch interval elapses, each block formed since the last batch interval tick forms part of a new RDD. It is once you have formed this RDD that you can schedule a job, so the data collected during the batch interval n is then processed during batch interval n+1.
So, the possible culprits for "losing a bit of data towards the end" could be:
you are observing your input file at the same time as you are monitoring the input for Spark. If you consider your monitoring at instant t, a bit after n batch intervals have elapsed, your log file has produced the data for n batches and then some ("a little bit more"). Except, the beginning of the next batch (n+1) is at this stage in the data collection phase, in the form of blocks on your Receiver. No data has been lost, the processing of batch n+1 has simply not started yet.
or your application assumes it's receiving a similar number of elements in each RDD and does not process the potentially (much) smaller last batch's RDD correctly.
or you're stopping your application or data before the last batch interval elapses (you need to wait n+1 batch intervals to see the processing of n batches of data).
or there is something weird occurring with the system clock of your executors. Have you thought of synchronizing them with ntp ?

Resources