How to make the consumer know that the Producer has finished sending all the messages to the Broker? - apache-spark

1: We are working on a near real time processing or Batch Processing using Spark Streaming. Our current design has Kafka included.
2: Every 15 minutes the Producer will send the messages.
3: We plan to use Spark Streaming to consume messages from Kafka topic.

That a very broad question:
Basically, there is no such thing as "all messages" because it's stream processing (but I still understand your question).
One way would be to inject a control message at last message that "ends a burst of data"
You could also use some "side communication channel" via an RPC such that the producer send the last offset it did write to the consumer
You could put an heuristic -- if poll() does return nothing for 1 minute, you just assume that all data got consumed
And there might be other methods... But it's all hand coded -- there is no support in Kafka (cf. (1.)).

Related

Parallel ordered processing of spring integration events

am pretty new to the integration framework. I have a spring message listener which receives updates for stocks ( S1, S2, S3 ...). The updates for same stock need to be processed sequentially where as updates for different stocks should be processed parallel.
e.g. if sequence of updates are S1-1, S1-2, S2-1, S1-3, S3-1, S2-2, S1-3, S3-2 .. then there should be three parallel stream of processing
S1-1, S1-2, S1-3
S2-1, S2-1
S3-1, S3-2
Note that there could be thousands of such stocks.
Currently I am processing everything in parallel using executor on the channel. how can i achieve my requirements. please advise. thanks.
Well, since we can't parallel "thousands of such stocks" because spawning so many threads is really inefficient, so I would go something like partitioning and queue channels for each of them.
The router component could decide by the request message which partition it belongs and send to respective queue channel.
This way a poller on that queue channel would pull messages sequentially, but different queues would be processed in parallel.
See more info in docs:
https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#router
https://docs.spring.io/spring-integration/docs/current/reference/html/core.html#channel-implementations-queuechannel

Check messages in each 10 min intervals kafka - Nodejs

Consumer should check messages at each 10 min intervals this time response message should contains from uncommitted offset,
Currently messages getting once producer send message
That's not really how a Kafka Consumer works. Usually, you have an infinite loop and just take whatever messages are given to you. Unless you're changing the group.id and not committing offsets between requests, you'll always get the next batch of messages.
If you want to add some max consumption limit, followed by a 10 minutes to sleep a thread within that loop, then that's an implementation detail of your application, but not specific to Kafka

Best approach to check if Spark streaming jobs are hanging

I have Spark streaming application which basically gets a trigger message from Kafka which kick starts the batch processing which could potentially take up to 2 hours.
There were incidents where some of the jobs were hanging indefinitely and didn't get completed within the usual time and currently there is no way we could figure out the status of the job without checking the Spark UI manually. I want to have a way where the currently running spark jobs are hanging or not. So basically if it's hanging for more than 30 minutes I want to notify the users so they can take an action. What all options do I have?
I see I can use metrics from driver and executors. If I were to choose the most important one, it would be the last received batch records. When StreamingMetrics.streaming.lastReceivedBatch_records == 0 it probably means that Spark streaming job has been stopped or failed.
But in my scenario, I will receive only 1 streaming trigger event and then it will kick start the processing which may take up to 2 hours so I won't be able to rely on the records received.
Is there a better way? TIA
YARN provides the REST API to check the status of application and status of cluster resource utilization as well.
with API call it will give a list of running applications and their start times and other details. you can have simple REST client that triggers maybe once in every 30 min or so and check if the job is running for more than 2 hours then send a simple mail alert.
Here is the API documentation:
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
Maybe a simple solution like.
At the start of the processing - launch a waiting thread.
val TWO_HOURS = 2 * 60 * 60 * 1000
val t = new Thread(new Runnable {
override def run(): Unit = {
try {
Thread.sleep(TWO_HOURS)
// send an email that job didn't end
} catch {
case _: Exception => _
}
}
})
And in the place where you can say that batch processing is ended
t.interrupt()
If processing is done within 2 hours - waiter thread is interrupted and e-mail is not sent. If processing is not done - e-mail will be sent.
Let me draw your attention towards Streaming Query listeners. These are quite amazing lightweight things that can monitor your streaming query progress.
In an application that has multiple queries, you can figure out which queries are lagging or have stopped due to some exception.
Please find below sample code to understand its implementation. I hope that you can use this and convert this piece to better suit your needs. Thanks!
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(event: QueryStartedEvent) {
//logger message to show that the query has started
}
override def onQueryProgress(event: QueryProgressEvent) {
synchronized {
if(event.progress.name.equalsIgnoreCase("QueryName"))
{
recordsReadCount = recordsReadCount + event.progress.numInputRows
//Logger messages to show continuous progress
}
}
}
override def onQueryTerminated(event: QueryTerminatedEvent) {
synchronized {
//logger message to show the reason of termination.
}
}
})
I'm using Kubernetes currently with the Google Spark Operator. [1]
Some of my streaming jobs hang while using Spark 2.4.3: few tasks fail, then the current batch job never progresses.
I have set a timeout using a StreamingProgressListener so that a thread signals when no new batch is submitted for a long time. The signal is then forwarded to a Pushover client that sends a notification to an Android device. Then System.exit(1) is called. The Spark Operator will eventually restart the job.
[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
One way is to monitor the output of the spark job that was kick started. Generally, for example,
If it writes to HDFS, monitor the HDFS output directory for last modified file timestamp or file count generated
If it writes to a Database, you could have a query to check the timestamp of the last record inserted into your job output table.
If it writes to Kafka, you could use Kafka GetOffsetShell to get the output topic's current offset.
Utilize
TaskContext
This provides contextual information for a task, and supports adding listeners for task completion/failure (see addTaskCompletionListener).
More detailed information such as the task 'attemptNumber' or 'taskMetrics' is available as well.
This information can be used by your application during runtime to determine if their is a 'hang' (depending on the problem)
More information about what is 'hanging' would be useful in providing a more specific solution.
I had a similar scenario to deal with about a year ago and this is what I did -
As soon as Kafka receive's message, spark streaming job picks up the event and start processing.
Spark streaming job sends an alert email to Support group saying "Event Received and spark transformation STARTED". Start timestamp is stored.
After spark processing/transformations are done - sends an alert email to Support group saying "Spark transformation ENDED Successfully". End timestamp is stored.
Above 2 steps will help support group to track if spark processing success email is not received after it's started and they can investigate by looking at spark UI for job failure or delayed processing (maybe job is hung due to resource unavailability for a long time)
At last - store event id or details in HDFS file along with start and end timestamp. And save this file to the HDFS path where some hive log_table is pointing to. This will be helpful for future reference to how spark code is performing over the period time and can be fine tuned if required.
Hope this is helpful.

Spark Streaming Execution Flow

I am a newbie to Spark Streaming and I have some doubts regarding the same like
Do we need always more than one executor or with one we can do our job
I am pulling data from kafka using createDirectStream which is receiver less method and batch duration is one minute , so is my data is received for one batch and then processed during other batch duration or it is simultaneously processed
If it is processed simultaneously then how is it assured that my processing is finished in the batch duration
How to use the that web UI to monitor and debugging
Do we need always more than one executor or with one we can do our job
It depends :). If you have a very small volume of traffic coming in, it could very well be that one machine code suffice in terms of load. In terms of fault tolerance that might not be a very good idea, since a single executor could crash and make your entire stream fault.
I am pulling data from kafka using createDirectStream which is
receiver less method and batch duration is one minute , so is my data
is received for one batch and then processed during other batch
duration or it is simultaneously processed
Your data is read once per minute, processed, and only upon the completion of the entire job will it continue to the next. As long as your batch processing time is less than one minute, there shouldn't be a problem. If processing takes more than a minute, you will start to accumulate delays.
If it is processed simultaneously then how is it assured that my
processing is finished in the batch duration?
As long as you don't set spark.streaming.concurrentJobs to more than 1, a single streaming graph will be executed, one at a time.
How to use the that web UI to monitor and debugging
This question is generally too broad for SO. I suggest starting with the Streaming tab that gets created once you submit your application, and start diving into each batch details and continuing from there.
To add a bit more on monitoring
How to use the that web UI to monitor and debugging
Monitor your application in the Streaming tab on localhost:4040, the main metrics to look for are Processing Time and Scheduling Delay. Have a look at the offical doc : http://spark.apache.org/docs/latest/streaming-programming-guide.html#monitoring-applications
batch duration is one minute
Your batch duration a bit long, try to adjust it with lower values to improve your latency. 4 seconds can be a good start.
Also it's a good idea to monitor these metrics on Graphite and set alerts. Have a look at this post https://stackoverflow.com/a/29983398/3535853

Spark Streaming Kafka backpressure

We have a Spark Streaming application, it reads data from a Kafka queue in receiver and does some transformation and output to HDFS. The batch interval is 1min, we have already tuned the backpressure and spark.streaming.receiver.maxRate parameters, so it works fine most of the time.
But we still have one problem. When HDFS is totally down, the batch job will hang for a long time (let us say the HDFS is not working for 4 hours, and the job will hang for 4 hours), but the receiver does not know that the job is not finished, so it is still receiving data for the next 4 hours. This causes OOM exception, and the whole application is down, we lost a lot of data.
So, my question is: is it possible to let the receiver know the job is not finishing so it will receive less (or even no) data, and when the job finished, it will start receiving more data to catch up. In the above condition, when HDFS is down, the receiver will read less data from Kafka and block generated in the next 4 hours is really small, the receiver and the whole application is not down, after the HDFS is ok, the receiver will read more data and start catching up.
You can enable back pressure by setting the property spark.streaming.backpressure.enabled=true. This will dynamically modify your batch sizes and will avoid situations where you get an OOM from queue build up. It has a few parameters:
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good but I simulated the response of the algorithm to various parameters here

Resources