Structured Streaming Rollback files in case of Exception - apache-spark

In my Structured Streaming application, I read data from MQ, apply some transformations, and write the results to Kafka. I have implemented a custom MQ source.
My question is how to roll back the messages to MQ in case of exceptions during the transformation or while writing the messages to Kafka.
I read the messages in bulk, say 5000 messages per batch. If Kafka goes down while the results are being written, how can we roll back the messages?
Is there any approach to roll back or recover messages when using a custom source (any non-distributed source like MQ)?

Related

Design Question Kafka Consumer/Producer vs Kafka Stream

I'm working with Node.js microservices (MS); so far they communicate through Kafka Consumer/Producer. Now I need to build a Logger MS which must record all the messages and do some processing (parse and save to db), but I'm not sure whether the current approach could be improved using Kafka Streams or whether I should continue using Consumers.
The Streams API is a higher level abstraction that sits on top of the Consumer/Producer APIs. The Streams API allows you to filter and transform messages, and build a topology of processing steps.
For what you're describing, if you're just picking up messages and doing a single processing step, the Consumer API is probably fine. That said, you could do the same thing with the Streams API too and not use the other features.
"build a Logger MS which must record all the messages and do some processing (parse and save to db)"
I would suggest using something like the Streams API or a Node.js Producer + Consumer to parse the messages and write them back to Kafka.
From your parsed/filtered/sanitized messages, you can run a Kafka Connect cluster to sink your data into a DB.
"could be improved using Kafka Stream or if I should continue using Consumers"
Ultimately, it depends on what you need. The peek and foreach methods of the Streams DSL are functionally equivalent to a Consumer.

Spark Streaming task shutting down gracefully when the Kafka client sends messages asynchronously

I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms them, and writes the result messages to another Kafka topic. I am confused about how to prevent data loss when the application restarts, covering both the Kafka read and the output. Can setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or elsewhere that gives fast, fault-tolerant lookups).
Though, if you process some records and write the results before you're able to commit offsets, then you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but as far as I know that is only with proper offset management. For example, set enable.auto.commit to false in the Kafka properties, then commit only after you've processed and written the data to its destination.
If you're just moving data between Kafka topics, Kafka Streams is the library included with Kafka for that, and it doesn't require YARN or a cluster scheduler.
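To make the commit-after-write ordering concrete, here is a minimal pure-Python sketch. It is a simulation under stated assumptions, not Kafka or Spark code: the list `log` stands in for a topic partition, `state["committed"]` for the committed offset, and `Crash`, `run` and `crash_after_write_at` are hypothetical names made up for this illustration.

```python
class Crash(Exception):
    """Stands in for the application dying in the middle of a batch."""


def run(log, state, sink, crash_after_write_at=None):
    """Process records starting from the last committed offset.

    The offset is advanced only after the record has been written to the
    sink, which mirrors manual commits with enable.auto.commit=false.
    """
    while state["committed"] < len(log):
        offset = state["committed"]
        sink.append((offset, log[offset].upper()))   # process and write the result
        if offset == crash_after_write_at:
            raise Crash(offset)                      # dies after the write, before the commit
        state["committed"] = offset + 1              # commit only once the write succeeded


log = ["a", "b", "c", "d"]       # hypothetical input records
state = {"committed": 0}         # survives the simulated restart
sink = []                        # hypothetical destination

try:
    run(log, state, sink, crash_after_write_at=2)
except Crash:
    pass                         # the application went down

committed_after_crash = state["committed"]
run(log, state, sink)            # restart: resumes from the committed offset
written_offsets = [offset for offset, _ in sink]
print(committed_after_crash, written_offsets)
```

In this run the record at offset 2 reaches the sink twice: nothing is lost, but the record is reprocessed after the restart, which is the at-least-once behaviour described above. Avoiding the duplicate is then up to the destination.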

App server Log process

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server that is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is to get the data from the app server logs --> push it to Spark Streaming using Kafka and process it --> store it in Hive --> use Zeppelin to read back the processed, centralized log data and generate reports as per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the log from time to time and sends it to the Kafka broker, which I would prefer not to do: it would not be real time, and there could be synchronization issues to worry about, since we have 4 instances of the application server.
Another option, I think, is Apache Flume.
Can anyone suggest which would be the better approach in this case, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages we might face in future in both cases?
I guess another option is Flume + Kafka together, but I cannot speculate much about what will happen, as I have almost no knowledge of Flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (file source connector) to read the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and process the data:
Tomcat -> logs -> Kafka Connect -> Kafka -> Spark -> Hive
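As a rough sketch of the Kafka Connect step, the snippet below builds the JSON body that would be submitted to a Connect worker's REST endpoint (POST /connectors) to register a file source connector. The connector name, log path and topic are hypothetical placeholders, and `file_source_connector` is a helper invented for this example. FileStreamSourceConnector is the simple example connector that comes with Kafka, so check whether it suits your setup before relying on it.

```python
import json


def file_source_connector(name, log_path, topic):
    """Build a Kafka Connect connector definition for tailing one file."""
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": log_path,    # file to read lines from
            "topic": topic,      # Kafka topic each line is published to
        },
    }


# Hypothetical values: adjust to the actual Tomcat log location and topic name.
payload = file_source_connector(
    name="tomcat-access-log",
    log_path="/var/log/tomcat/localhost_access_log.txt",
    topic="tomcat-access",
)
print(json.dumps(payload, indent=2))
```

With 4 application server instances, one such connector per log file would be needed, each reading the file local to its own host.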

Spark Streaming from Kafka Consumer

I might need to work with Kafka and I am absolutely new to it. I understand that there are Kafka producers which publish the logs (called events, messages, or records in Kafka) to Kafka topics.
I will need to read from Kafka topics via a consumer. Do I need to set up the Consumer API first before I can stream using the Spark StreamingContext (PySpark), or can I directly use the KafkaUtils module to read from Kafka topics?
In case I need to set up a Kafka consumer application, how do I do that? Could you share links to the right docs?
Thanks in Advance!!
Spark provides a built-in Kafka stream, so you don't need to create a custom consumer. There are two approaches to connecting to Kafka: (1) the receiver-based approach and (2) the direct approach.
For more detail go through this link http://spark.apache.org/docs/latest/streaming-kafka-integration.html
There's no need to set up a Kafka consumer application; Spark itself creates a consumer, using one of two approaches. One is the receiver-based approach, which uses KafkaUtils.createStream, and the other is the direct approach, which uses KafkaUtils.createDirectStream.
In either case, if Spark Streaming fails there is no loss of data: it resumes from the offset where it left off.
For more details, use this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

Is it possible to implement a reliable receiver which supports non-graceful shutdown?

I'm curious whether it is an absolute must that a Spark Streaming application is brought down gracefully, or whether it otherwise runs the risk of causing duplicate data via the write-ahead log. In the scenario below I outline a sequence of steps in which a queue receiver interacts with a queue that requires acknowledgements for messages.
1. Spark queue receiver pulls a batch of messages from the queue.
2. Spark queue receiver stores the batch of messages into the write-ahead log.
3. Spark application is terminated before an ack is sent to the queue.
4. Spark application starts up again.
5. The messages in the write-ahead log are processed through the streaming application.
6. Spark queue receiver pulls a batch of messages from the queue which have already been seen in step 1 because they were not acknowledged as received.
...
Is my understanding correct about how custom receivers should be implemented and the duplication problems that come with them, and is it normal to require a graceful shutdown?
Bottom line: It depends on your output operation.
Using the Direct API approach, which was introduced in version 1.3, eliminates inconsistencies between Spark Streaming and Kafka, so each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints.
In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets.
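The "atomic transaction that saves results and offsets" option can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the tables `results` and `offsets`, the helpers `process_batch`, `next_offset` and `read_batch`, and the in-memory `source` list are all made up for the example, and a real job would use its own data store and source.

```python
import sqlite3

source = ["a", "b", "c", "d"]    # hypothetical replayable source


def read_batch(start, size=2):
    """Return (offset, value) pairs beginning at `start`."""
    return [(i, source[i]) for i in range(start, min(start + size, len(source)))]


def next_offset(conn):
    return conn.execute("SELECT next_offset FROM offsets WHERE id = 0").fetchone()[0]


def process_batch(conn, batch, fail=False):
    """Save the results and the new offset in a single transaction."""
    with conn:                   # commits on success, rolls back on any exception
        for offset, value in batch:
            conn.execute(
                "INSERT INTO results (record_offset, value) VALUES (?, ?)",
                (offset, value.upper()),
            )
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "UPDATE offsets SET next_offset = ? WHERE id = 0",
            (batch[-1][0] + 1,),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (record_offset INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE offsets (id INTEGER PRIMARY KEY, next_offset INTEGER)")
conn.execute("INSERT INTO offsets (id, next_offset) VALUES (0, 0)")
conn.commit()

process_batch(conn, read_batch(next_offset(conn)))                 # offsets 0-1 succeed
try:
    process_batch(conn, read_batch(next_offset(conn)), fail=True)  # offsets 2-3 crash
except RuntimeError:
    pass
offset_after_crash = next_offset(conn)
process_batch(conn, read_batch(next_offset(conn)))                 # restart redoes 2-3

rows = conn.execute(
    "SELECT record_offset, value FROM results ORDER BY record_offset"
).fetchall()
print(offset_after_crash, rows)
```

Because the failed batch rolls back both its rows and its offset update, the restart picks up at the same offset and each record ends up in `results` once. The primary key on `record_offset` would additionally reject an accidental second insert of the same record.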
For further information on the Direct API and how to use it, check out this blog post by Databricks.

Resources