I was reading the Fault-tolerance Semantics section of the Spark Streaming documentation. It says that, combined with reliable receivers, Spark Streaming will not lose data when a worker node fails.
But consider the following scenario:
Two executors receive the data and save it in their memory.
The receiver knows that the data has been stored.
The receiver sends an acknowledgment to the source, and at that same moment both the replica worker and the current worker crash.
But the source has already received the acknowledgment.
When the receiver restarts, that data is lost. Is that right?
I have a Spark Streaming (Scala) application running in CDH 5.13, consuming messages from Kafka using client 0.10.0. My Kafka cluster contains 3 brokers. The Kafka topic is divided into 12 partitions distributed evenly between these 3 brokers. My Spark Streaming consumer has 12 executors with 1 core each.
Spark Streaming starts by reading millions of messages from Kafka in each batch, but reduces the number to thousands because Spark cannot cope with the load and a queue of unprocessed batches builds up. That is fine, and my expectation is that Spark processes the small batches very quickly and returns to normal. However, I see that from time to time one of the executors that processes only a few hundred messages gets a 'request timeout' error just after reading the last offset from Kafka:
DEBUG org.apache.kafka.clients.NetworkClient Disconnecting from node 12345 due to request timeout
After this error, the executor sends several RPC requests to the driver that take ~40 seconds, and after that time the executor reconnects to the same broker from which it disconnected.
My question is how can I prevent this request timeout and what is the best way to find the root cause for it?
Thank you
The root cause of the disconnection was that the response to a data request arrived from Kafka too late, i.e. after the request.timeout.ms parameter, which was set to the default of 40000 ms. The disconnection problem was fixed when I increased this value.
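For reference, a minimal sketch of how that might look in the consumer parameters passed to the Kafka stream; the broker list, group id, and the 120000 ms value are illustrative, not recommendations.

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092,broker3:9092", // illustrative brokers
  "group.id" -> "my-streaming-consumer",                           // illustrative group id
  // Raise the request timeout above the default 40000 ms so slow broker responses
  // no longer trigger a disconnect; 120000 ms is just an example value.
  "request.timeout.ms" -> (120000: java.lang.Integer)
)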
I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms the messages, and outputs the result messages into another Kafka topic. Now I am confused about how to prevent data loss when the application restarts, covering both the Kafka read and the output. Can setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or configure them elsewhere for fast, fault-tolerant lookups).
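As a rough sketch of the driver-side checkpointing part (the checkpoint path and application name are made up), the usual pattern is StreamingContext.getOrCreate, which rebuilds the context from the checkpoint on restart:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/spark/checkpoints/kafka-to-kafka" // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-to-kafka")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Build the Kafka input stream and transformations here; this function only runs on a
  // clean start, while on restart the context is reconstructed from the checkpoint.
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()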
Though, if you process some records and write the results before you're able to commit offsets, then you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but as far as I know that is only with proper offset management: for example, set enable.auto.commit to false in the Kafka properties, then commit only after you've processed and written the data to its destination.
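A sketch of that commit-after-write pattern with the Kafka 0.10 direct stream, which commits back to Kafka itself rather than ZooKeeper (it assumes a StreamingContext named ssc, and sendToOutputTopic is a placeholder for the real producer call):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Placeholder for the actual output, e.g. a Kafka producer writing to the result topic.
def sendToOutputTopic(value: String): Unit = ???

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092,broker3:9092", // illustrative
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-streaming-consumer",                           // illustrative
  "enable.auto.commit" -> (false: java.lang.Boolean)               // commit manually instead
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("input-topic"), kafkaParams)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // 1. Process the batch and write the results to the destination first.
  rdd.map(record => record.value).foreachPartition { partition =>
    partition.foreach(sendToOutputTopic)
  }
  // 2. Only then commit the offsets back to Kafka; a crash before this point means the
  //    batch is replayed on restart rather than lost.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}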
If you're just moving data between Kafka topics, Kafka Streams is the included Kafka library to do that, which doesn't require YARN or a cluster scheduler
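For completeness, a minimal Kafka Streams sketch of topic-to-topic processing (the application id, broker address, topic names, and the toUpperCase transformation are all illustrative):

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-to-topic")   // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")  // illustrative
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)

val builder = new StreamsBuilder()
// Read from one topic, transform each value, and write to another topic.
builder.stream[String, String]("input-topic")
  .mapValues(new ValueMapper[String, String] {
    override def apply(value: String): String = value.toUpperCase // stand-in transformation
  })
  .to("output-topic")

val streams = new KafkaStreams(builder.build(), props)
streams.start()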
I'm curious whether a Spark Streaming application absolutely must be brought down gracefully, or whether it otherwise runs the risk of producing duplicate data via the write-ahead log. In the scenario below I outline the sequence of steps where a queue receiver interacts with a queue that requires acknowledgements for messages.
1. Spark queue receiver pulls a batch of messages from the queue.
2. Spark queue receiver stores the batch of messages into the write-ahead log.
3. Spark application is terminated before an ack is sent to the queue.
4. Spark application starts up again.
5. The messages in the write-ahead log are processed through the streaming application.
6. Spark queue receiver pulls a batch of messages from the queue which have already been seen in step 1 because they were not acknowledged as received.
...
Is my understanding of how custom receivers should be implemented, and of the duplication problems that come with them, correct? And is it normal to require a graceful shutdown?
Bottom line: It depends on your output operation.
Using the Direct API approach, introduced in Spark 1.3, eliminates inconsistencies between Spark Streaming and Kafka, so each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints.
In order to achieve exactly-once semantics for the output of your results, the output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves both results and offsets.
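A hedged sketch of the idempotent option (assuming a Kafka 0.10 direct stream named stream; upsert stands in for whatever datastore client you use):

import org.apache.spark.streaming.kafka010.HasOffsetRanges

// Placeholder: an upsert-style keyed write into your external store.
def upsert(key: String, value: String): Unit = ???

stream.foreachRDD { rdd =>
  // Offsets for this batch; in the transactional variant you would write these together
  // with the results in a single transaction and seek back to them on restart.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { partition =>
    partition.foreach { record =>
      // Idempotent output: key each row by topic/partition/offset so a replayed batch
      // overwrites the same rows instead of duplicating them.
      upsert(s"${record.topic}-${record.partition}-${record.offset}", record.value)
    }
  }
}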
For further information on the Direct API and how to use it, check out this blog post by Databricks.
I would like to have a Spark Streaming SQS Receiver which deletes SQS messages only after they were successfully stored on S3.
For this a Custom Receiver can be implemented with the semantics of the Reliable Receiver.
The store(multiple-records) call blocks until the given records have been stored and replicated inside Spark.
If write-ahead logs are enabled, all the data received from a receiver gets written into a write-ahead log in the configured checkpoint directory. The checkpoint directory can be pointed to S3.
After the store(multiple-records) blocking call finishes, are the records already stored in the checkpoint directory (and thus can be safely deleted from SQS)?
Edit: This is also explained in this Spark Summit presentation.
With write-ahead logs and checkpointing enabled, the store(multiple-records) call blocks until the given records have been written to write-ahead logs.
Receiver.store(ArrayBuffer[T], ...)
ReceiverSupervisorImpl.pushArrayBuffer(ArrayBuffer[T], ...)
ReceiverSupervisorImpl.pushAndReportBlock(...)
WriteAheadLogBasedBlockHandler.storeBlock(...)
This implementation stores the block into the block manager as well as a write ahead log. It does this in parallel, using Scala Futures, and returns only after the block has been stored in both places.
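Putting that together, a condensed sketch of what a reliable SQS receiver could look like (the SQS calls receiveBatch and deleteBatch are placeholders for the AWS SDK, and the write-ahead log must be enabled with spark.streaming.receiver.writeAheadLog.enable=true plus a checkpoint directory, e.g. on S3):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SqsReliableReceiver(queueUrl: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  override def onStart(): Unit = {
    new Thread("sqs-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val (messages, receiptHandles) = receiveBatch(queueUrl)
          if (messages.nonEmpty) {
            // With the WAL enabled, this call blocks until the records have been written
            // to the write-ahead log in the checkpoint directory (and to the block manager).
            store(ArrayBuffer(messages: _*))
            // Only now is it safe to acknowledge, i.e. delete the messages from SQS.
            deleteBatch(queueUrl, receiptHandles)
          }
        }
      }
    }.start()
  }

  override def onStop(): Unit = ()

  // Placeholders for brevity; a real receiver would use the AWS SQS SDK here.
  private def receiveBatch(url: String): (Seq[String], Seq[String]) = ???
  private def deleteBatch(url: String, handles: Seq[String]): Unit = ()
}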
In Spark Streaming, stream data is received by receivers which run on workers. The data is pushed into a data block periodically, and the receiver sends the receivedBlockInfo to the driver. I want to know whether Spark Streaming distributes the blocks across the cluster (in other words, whether it uses a distributed storage strategy). If it does not distribute the data across the cluster, how is workload balance guaranteed? (Imagine we have a cluster of tens of nodes but only a few receivers.)
As far as I know data are received by the worker node where the receiver is running. They are not distributed across other nodes.
If you need the input stream to be repartitioned (balanced across cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
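A minimal illustration, assuming inputStream is a receiver-based DStream and 40 is an arbitrary partition count:

val balanced = inputStream.repartition(40)
balanced.foreachRDD { rdd =>
  // Downstream work now runs across up to 40 partitions rather than only on the receiver's node.
  println(s"partitions in this batch: ${rdd.getNumPartitions}")
}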
You can read more about level of parallelism in Spark documentation
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning