Spark streamming task shutdown gracefully when kafka client send message asynchronously - apache-spark

i am building a spark streamming application, read input message from kafka topic, transformation message and output the result message into another kafka topic. Now i am confused how to prevent data loss when application restart, including kafka read and output. Setting the spark configuration "spark.streaming.stopGracefullyOnShutdow" true can help?

You can configure Spark to do checkpoint to HDFS and store the Kafka offsets in Zookeeper (or Hbase, or configure elsewhere for fast, fault tolerant lookups)
Though, if you process some records and write the results before you're able to commit offsets, then you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly once with Kafka, but that is a only with proper offset management, as far as I know, for example, Set enable.auto.commit to false in the Kafka priorities, then only commit after the you've processed and written the data to its destination
If you're just moving data between Kafka topics, Kafka Streams is the included Kafka library to do that, which doesn't require YARN or a cluster scheduler

Related

Parameter to control the handshake interval between Kafka and spark

While the kafka brokers are up and running, spark process running in cluster mode is able to read the messages from the kafka topic. But when the brokers were shutdown intentionally, the spark consumer is still in RUNNING status.
Is there any parameter to control the handshake interval between spark consumer and the zookeeper process, so that the spark process can fail if the brokers are not reachable. Or is there any alternate way to fail the consumer. Please suggest.
No there is not.
KAFKA & Spark Structured Steaming (SSS) are loosely coupled, and given high availability scenarios, failure etc. SSS just waits and will process rebalanced topics when KAFKA rebalances the load.
The whole premise is that KAFKA will do something to alleviate the situation - if a Broker goes down. Even if there are zero Brokers after a while, SSS will wait as you have noted already. It is of course knows nothing, but just waits.
As long as the topics still exist and the "fail on data loss" is not set to true if a topic is deleted, the SSS Apps will go on.

Structured Streaming Rollback files in case of Exception

In my Structured Streaming application, I am reading the data from MQ and doing some transformation and writing the results to kafka. I have implemented the MQ custom source.
My question is how to roll back the messages to MQ incase of exceptions during the transformation or while writing the messages to Kafka.
I am reading the messages as bulk, say 5000 messages per batch, but while writing the results, if kafka goes down, what are the ways we can rollback the messages?
Is there any approach we can rollback or recover messages when using custom source (any not distributed source like MQ).

App server Log process

I have a requirement from my client to process the application(Tomcat) server log file for a back end REST Based App server which is deployed on a cluster. Clint wants to generate "access" and "frequency" report from those data with different parameter.
My initial plan is that get those data from App server log --> push to Spark Streaming using kafka and process the data --> store those data to HIVE --> use zeppelin to get back those processed and centralized log data and generate reports as per client requirement.
But as per my knowledge Kafka does not any feature which can read data from log file and post them in Kafka broker by its own , in that case we have write a scheduler job process which will read the log time to time and send them in Kafka broker , which I do not prefer to do, as in that case it will not be a real time and there can be synchronization issue which we have to bother about as we have 4 instances of application server.
Another option, I think we have in this case is Apache Flume.
Can any one suggest me which one would be better approach in this case or if in Kafka, we have any process to read data from log file by its own and what are the advantage or disadvantages we can have in feature in both the cases?
I guess another option is Flume + kakfa together , but I can not speculate much what will happen as I have almost no knowledge about flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (file source connector) to read/consume Tomcat logs files & push them to Kafka. Spark Streaming can then consume from Kafka topics and churn the data
tomcat -> logs ---> kafka connect -> kafka -> spark -> Hive

Kafka OffsetOutOfRangeException

I am streaming loads of data through kafka. And then I have spark streaming which is consuming these messages. Basically down the line, spark streaming throws this error:
kafka.common.OffsetOutOfRangeException
Now I am aware what this error means. So I changed the retention policy to 5 days. However I still encountered the same issue. Then I listed all the messages for a topic using --from-beginning in kafka. Surely enough, ton of messages from the beginning of the kafka streaming part were not present and since spark streaming is a little behind the kafka streaming part, spark streaming tries to consume messages that have been deleted by kafka. However I thought changing the retention policy would take care of this:
--add-config retention.ms=....
What I suspect is happening that kafka is deleting messages from the topic to free up space (because we are streaming tons of data) for the new messages. Is there a property which I can configure that specifies how much bytes of data kafka can store before deleting the prior messages?
You can set the maximum size of the topic when u create the topic using the topic configuration property retention.bytes via console like:
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 --replication-factor 1 --config retention.bytes=10485760 --config
or u can use global broker configuration property log.retention.bytes to set the maximum size for all topics.
what is important to know is that log.retention.bytes doesn't enforce a hard limit on a topic size, but it just signal to Kafka when to start deleting the oldest messages
Another way to solve this problem is to specify in the configuration the spark parameter :
spark.streaming.kafka.maxRatePerPartition

How spark streaming data are stored

In spark streaming, stream data will be received by receivers which run on workers. The data will be pushed into a data block periodically and receiver will send the receivedBlockInfo to the driver. I want to know that will spark streaming distribute the block to the cluster?(In other words, will it use a distributing storage strategy). If it does not distribute the data across the cluster, how will the workload balance be guaranteed?(Image we have a cluster of 10s nodes but there are only a few receivers)
As far as I know data are received by the worker node where the receiver is running. They are not distributed across other nodes.
If you need the input stream to be repartitioned (balanced across cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
You can read more about level of parallelism in Spark documentation
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning

Resources