Reliable SQS Receiver for Spark Streaming - apache-spark

I would like to have a Spark Streaming SQS Receiver which deletes SQS messages only after they were successfully stored on S3.
For this, a custom receiver can be implemented with the semantics of a reliable receiver.
The store(multiple-records) call blocks until the given records have been stored and replicated inside Spark.
If write-ahead logs are enabled, all the data received from a receiver gets written into a write-ahead log in the configured checkpoint directory, and the checkpoint directory can be pointed at S3.
After the store(multiple-records) blocking call finishes, are the records already stored in the checkpoint directory (and thus can be safely deleted from SQS)?
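
For reference, enabling the write-ahead log and pointing the checkpoint directory at S3 looks roughly like the following minimal sketch (the application name, batch interval and bucket path are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("sqs-reliable-receiver")                            // placeholder name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // turn on the WAL for receivers

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("s3a://my-bucket/spark/checkpoints")               // WAL data is written under this directory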

Edit: This is also explained in this Spark Summit presentation.
With write-ahead logs and checkpointing enabled, the store(multiple-records) call blocks until the given records have been written to write-ahead logs.
The call chain is:
Receiver.store(ArrayBuffer[T], ...)
  -> ReceiverSupervisorImpl.pushArrayBuffer(ArrayBuffer[T], ...)
  -> ReceiverSupervisorImpl.pushAndReportBlock(...)
  -> WriteAheadLogBasedBlockHandler.storeBlock(...)
The documentation of WriteAheadLogBasedBlockHandler.storeBlock confirms this: "This implementation stores the block into the block manager as well as a write ahead log. It does this in parallel, using Scala Futures, and returns only after the block has been stored in both places."
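
Putting this together, a reliable SQS receiver could look roughly like the sketch below (AWS SDK v1; the queue URL, batch size and wait time are placeholders, and error handling is omitted). The point is that messages are deleted from SQS only after the blocking store(...) call has returned:

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal sketch of a reliable SQS receiver, not production code.
class SqsReceiver(queueUrl: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    new Thread("SQS Receiver") {
      override def run(): Unit = poll()
    }.start()
  }

  def onStop(): Unit = {}  // the polling loop checks isStopped()

  private def poll(): Unit = {
    val sqs = AmazonSQSClientBuilder.defaultClient()
    while (!isStopped()) {
      val messages = sqs.receiveMessage(
        new ReceiveMessageRequest(queueUrl)
          .withMaxNumberOfMessages(10)
          .withWaitTimeSeconds(20)).getMessages.asScala
      if (messages.nonEmpty) {
        // Blocking call: returns only after the records have been stored inside Spark
        // (and written to the WAL when spark.streaming.receiver.writeAheadLog.enable=true).
        store(ArrayBuffer(messages.map(_.getBody): _*))
        // Only now is it safe to delete the messages from SQS.
        messages.foreach(m => sqs.deleteMessage(queueUrl, m.getReceiptHandle))
      }
    }
  }
}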

Related

With a reliable receiver, can Spark Streaming lose data when a worker crashes?

When I read the Fault-tolerance Semantics section of the Spark Streaming documentation, it says that with reliable receivers, Spark Streaming will not lose data when a worker node fails.
But consider the following scenario:
Two executors get the calculated data and save it in their memory
The receiver knows that the data has been stored
The receiver sends an acknowledgment to the source, and at that same moment both the worker holding the replica and the current worker crash
The source, however, has already received the acknowledgment
When the receiver restarts, that data is lost! Is that right?

Is spark JDBC sink transactionally safe at node level?

I have a question about opening a transaction at the partition level. If I use the JDBC connector to write to a database (Postgres), will partition-specific writes at a worker node be transactionally safe, i.e.
if a worker node goes down while writing the data, will the rows related to this partition/worker node be rolled back?
There is a transaction boundary on the partition (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L588).
But if there are failures afterwards, before the task is marked as SUCCESS (for example due to a network issue or timeout), then you might still get multiple writes.
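
For illustration, the per-partition commit/rollback boundary the answer refers to follows roughly this pattern (a simplified sketch, not the actual Spark code; the table, columns and credentials are placeholders):

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

// Simplified sketch of the per-partition transaction pattern used by JdbcUtils.savePartition.
def writeWithPartitionTransactions(df: DataFrame, url: String, user: String, password: String): Unit = {
  df.rdd.foreachPartition { rows =>
    val conn = DriverManager.getConnection(url, user, password)
    try {
      conn.setAutoCommit(false)   // one transaction per partition/task
      val stmt = conn.prepareStatement("INSERT INTO target (id, name) VALUES (?, ?)")
      rows.foreach { row =>
        stmt.setLong(1, row.getLong(0))
        stmt.setString(2, row.getString(1))
        stmt.addBatch()
      }
      stmt.executeBatch()
      conn.commit()               // all of this partition's rows become visible at once
    } catch {
      case e: Exception =>
        conn.rollback()           // a failure mid-partition rolls this partition back
        throw e
    } finally {
      conn.close()
    }
  }
}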

Spark Streaming task shutdown gracefully when the Kafka client sends messages asynchronously

I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms the messages, and outputs the result messages to another Kafka topic. Now I am confused about how to prevent data loss when the application restarts, covering both the Kafka read and the output. Does setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or somewhere else configured for fast, fault-tolerant lookups).
Though, if you process some records and write the results before you're able to commit the offsets, then you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but that only holds with proper offset management, as far as I know: for example, set enable.auto.commit to false in the Kafka properties, then commit only after you've processed and written the data to its destination, as sketched below.
If you're just moving data between Kafka topics, Kafka Streams is the Kafka library included for exactly that, and it doesn't require YARN or a cluster scheduler.
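
A rough sketch of that offset-management pattern with the spark-streaming-kafka-0-10 direct stream (the broker address, topic and group id are placeholders, and ssc is the application's StreamingContext created elsewhere):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",                 // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group",                    // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean))   // commit manually

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("input-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... transform rdd and write the results to the output topic here ...
  // Commit the offsets only after the output has been written (at-least-once).
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}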

Structured Streaming Rollback files in case of Exception

In my Structured Streaming application, I am reading data from MQ, doing some transformations, and writing the results to Kafka. I have implemented a custom MQ source.
My question is how to roll back the messages to MQ in case of exceptions during the transformation or while writing the messages to Kafka.
I am reading the messages in bulk, say 5000 messages per batch, but while writing the results, if Kafka goes down, in what ways can we roll back the messages?
Is there any approach to roll back or recover messages when using a custom source (a non-distributed source like MQ)?

Is it possible to implement a reliable receiver which supports non-graceful shutdown?

I'm curious whether it is an absolute must that a Spark Streaming application be brought down gracefully, or whether it otherwise runs the risk of producing duplicate data via the write-ahead log. In the scenario below I outline the sequence of steps in which a queue receiver interacts with a queue that requires acknowledgements for messages.
Spark queue receiver pulls a batch of messages from the queue.
Spark queue receiver stores the batch of messages into the write-ahead log.
Spark application is terminated before an ack is sent to the queue.
Spark application starts up again.
The messages in the write-ahead log are processed through the streaming application.
Spark queue receiver pulls a batch of messages from the queue which have already been seen in step 1 because they were not acknowledged as received.
...
Is my understanding correct about how custom receivers should be implemented and the duplication problems that come with them, and is it normal to require a graceful shutdown?
Bottom line: It depends on your output operation.
Using the Direct API approach, introduced in Spark 1.3, eliminates inconsistencies between Spark Streaming and Kafka, so each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints.
In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets.
For further information on the Direct API and how to use it, check out this blog post by Databricks.
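
As an illustration of the transactional option, the sketch below saves the results and the Kafka offsets in one database transaction. It assumes a direct stream named stream (as in the spark-streaming-kafka-0-10 sketch above) and hypothetical results and stream_offsets tables, so treat it as an outline rather than a drop-in implementation:

import java.sql.DriverManager
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.map(_.value).foreachPartition { partition =>
    val range = offsetRanges(TaskContext.get.partitionId)
    val conn = DriverManager.getConnection("jdbc:postgresql://db/app", "user", "pass")  // placeholders
    try {
      conn.setAutoCommit(false)
      val insert = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
      partition.foreach { value => insert.setString(1, value); insert.addBatch() }
      insert.executeBatch()
      // Save the offsets in the same transaction as the results.
      val offsets = conn.prepareStatement(
        "UPDATE stream_offsets SET until_offset = ? WHERE topic = ? AND kafka_partition = ?")
      offsets.setLong(1, range.untilOffset)
      offsets.setString(2, range.topic)
      offsets.setInt(3, range.partition)
      offsets.executeUpdate()
      conn.commit()   // results and offsets become visible together, or not at all
    } catch {
      case e: Exception => conn.rollback(); throw e
    } finally {
      conn.close()
    }
  }
}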
