How is Spark Streaming data stored? - apache-spark

In Spark Streaming, stream data is received by receivers, which run on the workers. The data is periodically pushed into a data block, and the receiver sends the receivedBlockInfo to the driver. Does Spark Streaming distribute these blocks across the cluster (in other words, does it use a distributed storage strategy)? If it does not, how is workload balance guaranteed? (Imagine a cluster of tens of nodes with only a few receivers.)

As far as I know, data is received on the worker node where the receiver is running. It is not distributed across other nodes.
If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
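
As a rough illustration of why repartitioning helps, here is a toy model in plain Python. It is not Spark's API: the `repartition` function below is a hypothetical helper that redistributes records round-robin, whereas Spark's own `repartition` shuffles data between executors. The receiver count, record count and partition count are assumptions chosen for the example.

```python
from itertools import chain

def repartition(partitions, num_partitions):
    """Toy stand-in for a shuffle: spread all records round-robin."""
    records = list(chain.from_iterable(partitions))
    out = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        out[i % num_partitions].append(record)
    return out

# Assumption: 2 receivers on a 10-node cluster, each holding 50 records.
received = [list(range(0, 50)), list(range(50, 100))]
balanced = repartition(received, 10)

sizes_before = [len(p) for p in received]   # work sits on 2 partitions only
sizes_after = [len(p) for p in balanced]    # work is spread over 10 partitions
print(sizes_before)
print(sizes_after)
```

Before repartitioning, only as many tasks as there are received partitions can run; afterwards every node can take a share, at the cost of moving the data once.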

Related

How do I set up a Spark application to pull from a single Kafka topic on multiple Spark nodes?

My application has a Kafka input stream for a single topic. It does some filtering and aggregating of the data, and then writes to Elasticsearch. What I'm seeing is that, while the application is distributed to all of the Spark nodes and processes the data properly, only one node is pulling data and the rest are idle.
Also, I am using an R53 hostname for the Kafka nodes. Should I use a comma-separated list of the Kafka nodes instead?
The topic has 20 partitions. I am running Spark 3.2.1 using only Spark Streaming (no DFS).
"The topic has 20 partitions"
Then up to 20 executors should be able to consume in parallel.
"using an R53 hostname for the Kafka nodes"
Any Kafka client, including Spark, needs to communicate with each broker individually. This means each broker's advertised.listeners setting must expose an address that Spark can reach directly, not a single DNS name or load balancer address shared by all brokers. If only one broker is resolvable, you will only be able to consume from (or produce to) that one.
"Should I use a comma-separated list of the Kafka nodes instead"
It's recommended, but not necessary. A list helps if, for example, the broker at the single address provided is not responding. The bootstrap protocol returns all advertised.listeners addresses back to the client, based on the listener protocol it connected on.
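
A small sketch of the two points above, in plain Python rather than the Spark or Kafka client API. The broker hostnames are hypothetical, and `assign_partitions` is an illustrative helper that mimics an even spread of partitions over consumers; the real assignment is done by Spark and Kafka.

```python
# Hypothetical broker addresses; each must be resolvable from every executor.
brokers = [
    "kafka-1.example.internal:9092",
    "kafka-2.example.internal:9092",
    "kafka-3.example.internal:9092",
]

# Kafka clients accept a comma-separated bootstrap list, so one unresponsive
# broker does not prevent the initial connection.
kafka_params = {"bootstrap.servers": ",".join(brokers)}

def assign_partitions(num_partitions, num_consumers):
    """Spread partition ids round-robin over consumers (illustration only)."""
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment

# With 20 partitions, 4 consumers read 5 partitions each.
four = assign_partitions(20, 4)

# With 25 consumers, 5 of them have nothing to read.
twenty_five = assign_partitions(20, 25)
idle = [c for c, parts in twenty_five.items() if not parts]
```

The partition count is the ceiling on read parallelism: adding consumers beyond 20 only adds idle ones.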

With a reliable receiver, can Spark Streaming lose data when a worker crashes?

I read the Fault-tolerance Semantics section of the Spark Streaming documentation. It says that, combined with reliable receivers, Spark Streaming will not lose data when a worker node fails.
But consider the following scenario:
Two executors get the data and save it in their memory
The receiver knows that the data has been stored
The receiver sends an acknowledgment to the source, and at the same time the replica worker and the current worker crash
The source has nevertheless received the acknowledgment
When the receiver restarts, this data is lost! Is that right?

Shutting down a Spark Streaming task gracefully when the Kafka client sends messages asynchronously

I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms them, and writes the result messages to another Kafka topic. I am confused about how to prevent data loss when the application restarts, for both the Kafka read and the output. Would setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or another store that offers fast, fault-tolerant lookups).
However, if you process some records and write the results before you are able to commit offsets, you will end up reprocessing those records on restart. It is claimed that Spark can do exactly-once with Kafka, but as far as I know that holds only with proper offset management: for example, set enable.auto.commit to false in the Kafka properties, then commit only after you have processed the data and written it to its destination.
If you're just moving data between Kafka topics, Kafka Streams is the library included with Kafka for that, and it doesn't require YARN or a cluster scheduler.
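
The commit-after-write ordering described above can be sketched with a toy model in plain Python. `run_batch`, the in-memory `sink` list and the `committed` dictionary are hypothetical stand-ins for the Spark job, the output topic and the offset store; no real Kafka or Spark API is used.

```python
def run_batch(log, committed, sink, crash_before_commit=False):
    """Process records after the committed offset, write them, then commit."""
    start = committed["offset"]
    for offset in range(start, len(log)):
        sink.append(log[offset].upper())   # transform and write the result
    if crash_before_commit:
        return                             # simulated crash: offset never stored
    committed["offset"] = len(log)         # commit only after the write succeeded

log = ["a", "b", "c"]          # records in the input topic
committed = {"offset": 0}      # last committed position
sink = []                      # records written to the output

run_batch(log, committed, sink, crash_before_commit=True)   # first attempt dies
run_batch(log, committed, sink)                             # restart replays

print(sink)
print(committed)
```

In this model nothing is lost, but the records written before the crash are written again after the restart. That is at-least-once delivery; removing the duplicates needs an idempotent or transactional sink.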

Worker Queue option in Kafka

We are developing an application which will receive time series sensor data as byte arrays from a set of devices via UDP. This data needs to be parsed and stored in a Cassandra database...
We were using RabbitMQ as the message broker, with Work Queues based consumers parsing the data and pushing it into Cassandra... Because of increasing traffic, we are concerned about RabbitMQ performance and are planning to move to Kafka... Our understanding is that the same can be implemented using a consumer group in Kafka. Is our understanding correct?
With Apache Kafka, you can scale a topic relatively easily. To process more data in the same amount of time, you can:
Run multiple consumers in the same consumer group, so that multiple messages are consumed at the same time. The number of active consumers is limited by the number of partitions of the topic.
Increase the number of partitions for the topic, and increase the number of consumers.
Increase the number of brokers, if you still need to process more data.
I would approach scalability in the order described above, but Kafka can handle a lot. In a setup with 2 brokers, 4 partitions per topic and 2 consumers (each consumer uses one thread per partition), where the consumers decode JSON into Java objects, enrich them and store them in Cassandra, it can handle 30k/s (data is batched in batches of 200 insert statements).
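
The batching mentioned above can be sketched as follows. This is plain Python with a hypothetical `batches` helper, and the actual Cassandra write is left out. The figures are the ones quoted in the answer, reading 30k/s as messages per second, which is an assumption.

```python
from itertools import islice

def batches(records, size=200):
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# One second of traffic at the quoted rate.
one_second = range(30_000)
batch_list = list(batches(one_second))
batches_per_second = len(batch_list)
print(batches_per_second)
```

Grouping inserts this way means the database sees far fewer round trips than messages, which is a large part of why a small consumer group can keep up.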

Is it possible to implement a reliable receiver which supports non-graceful shutdown?

I'm curious whether it is an absolute must that a Spark Streaming application is brought down gracefully, or whether it otherwise runs the risk of causing duplicate data via the write-ahead log. In the scenario below I outline a sequence of steps in which a queue receiver interacts with a queue that requires acknowledgements for messages.
1. Spark queue receiver pulls a batch of messages from the queue.
2. Spark queue receiver stores the batch of messages into the write-ahead log.
3. Spark application is terminated before an ack is sent to the queue.
4. Spark application starts up again.
5. The messages in the write-ahead log are processed through the streaming application.
6. Spark queue receiver pulls a batch of messages from the queue which have already been seen in step 1 because they were not acknowledged as received.
...
Is my understanding correct about how custom receivers should be implemented and the duplication problems that come with them, and is it normal to require a graceful shutdown?
Bottom line: It depends on your output operation.
Using the Direct API approach, which was introduced in Spark 1.3, eliminates inconsistencies between Spark Streaming and Kafka, so each record is received by Spark Streaming effectively exactly once despite failures, because offsets are tracked by Spark Streaming within its checkpoints.
In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets.
For further information on the Direct API and how to use it, check out this blog post by Databricks.
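
A minimal sketch of an output operation that stores results and offsets in one atomic transaction, using Python's built-in sqlite3 module as a stand-in for the external data store. The table names, the `save_batch` helper and the message ids are hypothetical, and the offset check is deliberately simplistic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (msg_id TEXT PRIMARY KEY, value INTEGER)")
conn.execute("CREATE TABLE offsets (source TEXT PRIMARY KEY, next_offset INTEGER)")
conn.execute("INSERT INTO offsets VALUES ('queue', 0)")
conn.commit()

def save_batch(conn, first_offset, messages):
    """Store results and the new offset together; skip replayed batches."""
    with conn:  # commits on success, rolls back if an exception is raised
        (next_offset,) = conn.execute(
            "SELECT next_offset FROM offsets WHERE source = 'queue'"
        ).fetchone()
        if first_offset < next_offset:
            return False  # this batch was already stored before the restart
        # INSERT OR REPLACE keeps the write idempotent per msg_id.
        conn.executemany("INSERT OR REPLACE INTO results VALUES (?, ?)", messages)
        conn.execute(
            "UPDATE offsets SET next_offset = ? WHERE source = 'queue'",
            (first_offset + len(messages),),
        )
    return True

batch = [("m1", 10), ("m2", 20)]
first = save_batch(conn, 0, batch)    # normal run
replay = save_batch(conn, 0, batch)   # same batch redelivered after a restart
rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

Because the results and the offset are committed together, a redelivered batch is either skipped or overwrites identical rows, so a non-graceful shutdown does not produce duplicates in the store.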
