Apache Spark: How to send custom messages from Executor to Driver

Is there a way to send custom messages from an executor to the driver in Apache Spark? It is evident from the driver and executor logs that a lot of framework-level communication is happening, but I did not find any API for sending custom messages between the processes. Please advise.

I think you can use accumulators.
Read about them here: https://www.tutorialspoint.com/apache_spark/advanced_spark_programming.htm
You cannot send arbitrary messages, but you can aggregate data on the executors and read the result on the driver after the parallel computation ends.

Related

gRPC Server - maximum number of executor threads reached

I have implemented a gRPC server that has only one RPC method. It takes the input object (contained in the request) and writes it synchronously to an Apache Kafka topic using the Kafka clients producer API. I have set a fixed thread pool of 50 threads as the executor.
Suppose the Kafka brokers are unavailable due to a temporary fault and the gRPC server receives so many requests that all 50 handler threads become busy, because they are all blocked on the synchronous write retries for the Kafka topic.
What happens if other requests arrive while all 50 threads are busy?
Does gRPC queue them safely? Is there a risk of losing some requests?
Do you know where this behavior is described in the official documentation?
Thank you very much.
P.S.: Kafka is just an example I used to explain the question; you can think of any other service that requires a synchronous write.
I'm assuming that you are using C++. I'll also preface this answer by saying that the C++ synchronous API for gRPC is not very performant. The async API is what is generally recommended for performant code.
Yes, those bytes would get queued up in the transport layer. To enforce limits on how many bytes get queued up, configure a ResourceQuota [1][2].
[1] https://github.com/grpc/grpc/blob/6a10e41db75bd6074bf01a08d260365e44922f04/include/grpcpp/server_builder.h#L222
[2] https://github.com/grpc/grpc/blob/master/include/grpcpp/resource_quota.h
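As a rough illustration of the queuing behavior asked about, here is a sketch in plain Python with `concurrent.futures`, not gRPC itself. It blocks every worker of a small fixed pool and shows that further submissions wait instead of being dropped. The names `run_demo` and `handle` and the pool size are hypothetical stand-ins for the 50-thread executor and the blocked Kafka write.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_demo(pool_size=2, num_requests=5):
    """Block all workers, then report how many requests had to wait."""
    release = threading.Event()           # stands in for "the broker is back"
    all_workers_busy = threading.Event()  # set once every worker is blocked
    lock = threading.Lock()
    started = []

    def handle(request_id):
        with lock:
            started.append(request_id)
            if len(started) == pool_size:
                all_workers_busy.set()
        # Simulates a synchronous write that blocks until the service recovers
        # (the timeout only guards against hanging forever).
        release.wait(timeout=5)
        return request_id

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(handle, i) for i in range(num_requests)]
        all_workers_busy.wait(timeout=5)
        with lock:
            busy = len(started)
        # Everything not yet started is sitting in the pool's internal queue.
        waiting = num_requests - busy
        release.set()
        results = sorted(f.result() for f in futures)
    return busy, waiting, results


if __name__ == "__main__":
    print(run_demo())
```

Nothing is lost in this sketch, but the pool's internal queue is unbounded, so waiting work can grow without limit while the workers stay blocked. That kind of growth is what a limit such as ResourceQuota is meant to cap on the gRPC side.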

Email alert for spark streaming delay

We have Spark jobs that load data from Kafka into a Hive database. Sometimes our streaming jobs receive too much data or hang, causing delays in live streaming.
We are able to see the active and pending batches in the queue in the Spark UI.
I want to consolidate this information and send an email alert in case of any delay.
Thanks
You can use the GitHub package below for Spark email alerts.
https://github.com/NikhilSuthar/Scala-Spark-Mail
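Independent of that package, the alerting logic itself can be sketched with the Python standard library. This is a minimal sketch, assuming the scheduling delay and pending-batch count have already been collected (for example, the figures shown in the Spark UI). The function names, threshold, addresses, and SMTP host are all hypothetical.

```python
import smtplib
from email.message import EmailMessage


def build_delay_alert(app_name, scheduling_delay_ms, pending_batches,
                      threshold_ms, sender, recipient):
    """Return an alert email if the delay exceeds the threshold, else None."""
    if scheduling_delay_ms <= threshold_ms:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {app_name}: streaming delay {scheduling_delay_ms} ms"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Application: {app_name}\n"
        f"Scheduling delay: {scheduling_delay_ms} ms (threshold {threshold_ms} ms)\n"
        f"Pending batches: {pending_batches}\n"
    )
    return msg


def send_alert(msg, smtp_host, smtp_port=25):
    """Send the alert through an SMTP relay (host and port are assumptions)."""
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.send_message(msg)
```

A scheduled job could call `build_delay_alert` with fresh metrics and pass any non-`None` result to `send_alert`.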

App server Log process

I have a requirement from my client to process the application server (Tomcat) log files for a back-end REST-based app server that is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming using Kafka and process it --> store the data in Hive --> use Zeppelin to read back the processed, centralized log data and generate reports as per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the log from time to time and sends it to the Kafka broker, which I would prefer not to do: it would not be real time, and there could be synchronization issues to deal with, since we have 4 instances of the application server.
Another option, I think, is Apache Flume.
Can anyone suggest which approach would be better in this case, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages we might face in the future in either case?
I guess another option is Flume + Kafka together, but I cannot speculate much about what would happen, as I have almost no knowledge of Flume.
Any help will be highly appreciated. :-)
Thanks a lot.
You can use Kafka Connect (file source connector) to read the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and process the data:
Tomcat -> logs -> Kafka Connect -> Kafka -> Spark -> Hive
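As a hedged sketch of the first hop, the snippet below builds the JSON payload that a Kafka Connect worker accepts on its `POST /connectors` REST endpoint, using the `FileStreamSource` example connector that ships with Kafka. The connector name, log path, and topic are hypothetical, and `FileStreamSource` is a simple example connector, so treat this as a starting point rather than a production setup.

```python
import json


def file_source_connector(name, log_path, topic):
    """Build a Kafka Connect payload that tails one log file into one topic."""
    return {
        "name": name,
        "config": {
            "connector.class": "FileStreamSource",
            "tasks.max": "1",
            "file": log_path,   # hypothetical path to a Tomcat access log
            "topic": topic,     # hypothetical destination topic
        },
    }


if __name__ == "__main__":
    payload = file_source_connector(
        "tomcat-access-1", "/var/log/tomcat/access.log", "tomcat-access"
    )
    # This JSON body would be POSTed to http://<connect-host>:8083/connectors
    print(json.dumps(payload, indent=2))
```

With 4 application server instances, one such connector per instance (each with its own name and file path) could write to the same topic.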

Is it possible to implement a reliable receiver which supports non-graceful shutdown?

I'm curious whether it is an absolute must that a Spark Streaming application is brought down gracefully, or else it runs the risk of producing duplicate data via the write-ahead log. In the scenario below I outline a sequence of steps in which a queue receiver interacts with a queue that requires acknowledgements for messages.
1. Spark queue receiver pulls a batch of messages from the queue.
2. Spark queue receiver stores the batch of messages in the write-ahead log.
3. Spark application is terminated before an ack is sent to the queue.
4. Spark application starts up again.
5. The messages in the write-ahead log are processed through the streaming application.
6. Spark queue receiver pulls a batch of messages from the queue which have already been seen in step 1, because they were not acknowledged as received.
...
Is my understanding correct about how custom receivers should be implemented and the duplication problems that come with them? And is it normal to require a graceful shutdown?
Bottom line: It depends on your output operation.
Using the Direct API approach, which was introduced in Spark 1.3, eliminates inconsistencies between Spark Streaming and Kafka, so each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints.
In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets.
For further information on the Direct API and how to use it, check out this blog post by Databricks.
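To make the idempotent option concrete, here is a minimal sketch in plain Python (not Spark code), assuming every message carries a unique ID. `IdempotentSink` is a hypothetical in-memory stand-in for an external store that supports upserts by key. Replaying the same batch, as in the write-ahead-log and redelivery steps of the question, leaves the stored result unchanged.

```python
class IdempotentSink:
    """In-memory stand-in for a store that upserts rows by message ID."""

    def __init__(self):
        self._rows = {}

    def write(self, batch):
        # Writing the same (message_id, payload) pair twice overwrites the
        # existing row instead of appending a duplicate.
        for message_id, payload in batch:
            self._rows[message_id] = payload

    def rows(self):
        return dict(self._rows)


first_pull = [("m1", "a"), ("m2", "b")]

sink = IdempotentSink()
sink.write(first_pull)                  # replayed from the write-ahead log
sink.write(first_pull + [("m3", "c")])  # redelivered by the queue, plus new data
```

After both writes the sink holds exactly three rows, one per message ID, even though `m1` and `m2` were written twice.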

Spark streaming with JMS - No API

Is there any API or other way to integrate Spark Streaming with JMS? I am able to integrate with Kafka and sockets, but I am unable to integrate with a JMS queue or topic.
I think you should try the receiver API in Spark. You need to create a custom receiver:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
Also check the reply from Tathagata Das, a Spark contributor, at
www.apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-td5371.html
If you need more detailed help, let me know.
I know this is an old post, but I am working on something similar. You can use the Spark JMS receiver.
