Data locality in Spark Streaming - apache-spark

Recently I've been doing performance tests on Spark Streaming. I ran a receiver on one of the 6 slaves and submitted a simple Word Count application to the cluster (I know this configuration is not appropriate in practice, it's just a simple test). I analyzed the scheduling log and found that nearly 88% of tasks were scheduled to the node where the receiver ran, the locality was always PROCESS_LOCAL, and the CPU utilization on that node was very high. Why doesn't Spark Streaming distribute the data across the cluster and make full use of it? I've read the official guide and it doesn't explain this in detail, especially for Spark Streaming. Will Spark copy stream data to another node with a free CPU and start a new task there when a task lands on a node with a busy CPU? If so, how can we explain the former case?

When you run the stream receiver on just one of the 6 nodes, all the received data is processed on that node (that is the data locality you observed).
Data is not distributed to the other nodes by default. If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
This distributes the received batches of data across the specified number of machines in the cluster before further processing.
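For illustration, a minimal sketch of that call in a word-count job, written against the Scala DStream API; the socket source, host name, port, and partition count are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RepartitionedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RepartitionedWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // A single receiver: every received block initially lives on the receiver's node.
    val lines = ssc.socketTextStream("receiver-host", 9999)

    // Redistribute each batch across 12 partitions before processing,
    // so the word-count work is spread over the whole cluster.
    val counts = lines
      .repartition(12)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}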
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning

Related

Spark Streaming - Kafka Integration

We are using a small Spark cluster with 5 nodes, and all 5 nodes are connected to the Kafka brokers.
We are planning to scale the cluster by adding more nodes, and this may require configuring these additional nodes to connect to the Kafka cluster. We are assessing the best practices for this integration:
How should it actually be integrated to keep the integration as easy as possible?
Is it necessary for all the worker nodes to be connected to the brokers? In that case, wouldn't it fail to scale?
I would advise going over the documentation of the Spark-Kafka integration:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"How it actually to be integrated to make the integration as easy as possible" :
I'm not sure what do you mean - but basically when you connect to kafka you should provide the bootstrap servers : Bootstrap Servers are a list of host/port pairs to use for establishing the initial connection to the Kafka cluster.
These servers are just used for the initial connection to discover the full cluster membership. so the number of nodes of the kafka cluster will not change the way you integrate
"Is it needed for all the workers node to be connected with the brokers , in that case , it might not be scalable ?" :
spark integration works in the following way (sort of):
the sprak driver - connects to the kafka to understand the required partitions and offsets
based on part 1 the partitions are assigned to the spark "workers" - which is usually a 1 to 1 from a kafka partition to a spark partition.
not all workers (I guess you mean executors) connect to all kafka nodes - so in this case it is also scalable
side note : you can use a configuration to further break the number of spark partitions that would read from a single kafka partition - its called minPartitions and its from spark 2.4.7
last note : spark streaming with kafka is a very used and known use case and is used in very big data ecosystems as a first intuitive thought I would assume its scalable
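As a hedged sketch of what the connection can look like with the Structured Streaming Kafka source (the broker addresses, topic name, and minPartitions value below are illustrative, not a recommendation):

import org.apache.spark.sql.SparkSession

object KafkaReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaReadSketch").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      // Only used for the initial connection; full cluster membership is discovered from here.
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "events")
      // Optionally split each Kafka partition into more Spark partitions.
      .option("minPartitions", "24")
      .load()

    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}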
I came across the following passage while going through the book https://learning.oreilly.com/library/view/stream-processing-with/9781491944233/ch19.html
In particular, the phrase "The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data." It seems that all the executors (worker nodes) have to have a connection to Kafka, since the tasks might run on any executor.
The gist of data delivery is that the Spark driver queries offsets and decides offset ranges for every batch interval from Apache Kafka. After it receives those offsets, the driver dispatches them by launching a task for each partition, resulting in a 1:1 parallelism between the Kafka partitions and the Spark partitions at work. Each task retrieves data using its specific offset ranges. The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data. As a consequence, the parallelism of data ingestion from Apache Kafka is much better than the legacy receiver model, where each stream was consumed by a single machine.
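For reference, a sketch of the direct approach the quote describes, using the spark-streaming-kafka-0-10 DStream API; the broker list, group id, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSketch")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "sketch-group",
      "auto.offset.reset"  -> "latest"
    )

    // The driver only computes offset ranges; each executor task consumes its
    // assigned Kafka partition directly, 1:1 with the resulting Spark partition.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.map(record => (record.key, record.value)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}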

auto scale spark cluster

I have a Spark streaming job running on a cluster. The job pulls messages from Kafka and does the required processing before dumping the processed data to a database. I have sized my cluster for the current load, but this load may go up or down in the future. I want to know the techniques for auto scaling without restarting the job. Scaling becomes more complicated when Kafka is being used (as in my case), since I don't want partitions to be moved around in stateful streaming. Currently the cluster is completely in house, but I wouldn't mind migrating to the cloud if that helps the scaling use case.
This is not a full answer, just some notes.
"In stateful streaming": what did you mean by that? All state in Spark is distributed, and you should not rely on the local system, because if a task fails it can be sent to any other executor.
Are you talking about increasing the size of the cluster, or the resources dedicated to your Spark job within the cluster?
If the first, you need to monitor each node (memory, CPU) and, when some threshold is hit, add more nodes.
If the second: we didn't find a nice solution. Spark provides an 'autoscaling' feature, however it doesn't work properly with Kafka streaming.
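For reference, a minimal sketch of Spark's general dynamic-allocation settings, assuming that is the 'autoscaling' feature meant above; the executor bounds are illustrative, and this resizes the number of executors within the cluster rather than the cluster itself:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("autoscaling-sketch")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  // Needed so shuffle files survive executor removal.
  .set("spark.shuffle.service.enabled", "true")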

Spark Streaming and High Availability

I'm building an Apache Spark application that acts on multiple streams.
I did read the Performance Tuning section of the documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
What I didn't get is:
1) Are the streaming receivers located on multiple worker nodes, or on the driver machine?
2) What happens if one of the nodes that receives the data fails (power off/restart)?
Are the streaming receivers located on multiple worker nodes, or on the driver machine?
Receivers are located on worker nodes, which are responsible for the consumption of the source which holds the data.
What happens if one of the nodes that receives the data fails (power off/restart)?
The receiver is located on a worker node. The worker node gets its tasks from the driver. The driver can either be located on a dedicated master server if you're running in client mode, or on one of the workers if you're running in cluster mode. If a node that doesn't run the driver fails, the driver will re-assign the partitions held on the failed node to a different one, which will then be able to re-read the data from the source and do the additional processing needed to recover from the failure.
This is why a replayable source such as Kafka or AWS Kinesis is needed.
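Alongside a replayable source, recovery from driver failure is usually paired with checkpointing. A minimal sketch (Scala DStream API), assuming a fault-tolerant checkpoint directory; the path, source host, and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Must live on fault-tolerant storage such as HDFS or S3.
  val checkpointDir = "hdfs:///checkpoints/my-stream"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    val lines = ssc.socketTextStream("source-host", 9999)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, the driver rebuilds the context (and any pending batches) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}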

Spark on Mesos - running multiple Streaming jobs

I have 2 Spark streaming jobs that I want to run, while keeping some resources available for batch jobs and other operations.
I evaluated the Spark Standalone cluster manager, but I realized that I would have to fix the resources for the two jobs, which would leave almost no computing power for batch jobs.
I started evaluating Mesos, because it has a "fine-grained" execution model, where resources are shifted between Spark applications.
1) Does it mean that a single core can be shifted between 2 streaming applications?
2) Since I have Spark & Cassandra together, in order to exploit data locality, do I need a dedicated core on each of the slave machines to avoid shuffling?
3) Would you recommend running streaming jobs in "fine-grained" or "coarse-grained" mode? I know the logical answer is coarse-grained (to minimize the latency of streaming apps), but what about when total cluster resources are limited (a cluster of 3 nodes, 4 cores each, with 2 streaming applications to run and occasional batch jobs)?
4) In Mesos, when I run a Spark streaming job in cluster mode, will it occupy 1 core permanently (like the Standalone cluster manager does), or will that core run the driver process and sometimes act as an executor?
Thank you
Fine-grained mode is actually deprecated now. Even with it, each core is allocated to a task until completion, but in Spark Streaming each processing interval is a new job, so tasks only last as long as it takes to process each interval's data. Hopefully that time is less than the interval time, or your processing will back up, eventually running out of memory to store all those RDDs waiting for processing.
Note also that you'll need to have one core dedicated to each stream's Reader. Each will be pinned for the life of the stream! You'll need extra cores in case the stream ingestion needs to be restarted; Spark will try to use a different core. Plus you'll have a core tied up by your driver, if it's also running on the cluster (as opposed to on your laptop or something).
Still, Mesos is a good choice, because it will allocate the tasks to nodes that have capacity to run them. Your cluster sounds pretty small for what you're trying to do, unless the data streams are small themselves.
If you use the Datastax connector for Spark, it will try to keep input partitions local to the Spark tasks. However, I believe that connector assumes it will manage Spark itself, using Standalone mode. So, before you adopt Mesos, check to see if that's really all you need.
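To make the resource trade-off concrete, here is a sketch of capping one streaming application's share of a Mesos cluster in coarse-grained mode, so that cores stay free for the second stream and the batch jobs; the master URL and all values are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("stream-1")
  .setMaster("mesos://zk://mesos-master:2181/mesos") // placeholder Mesos master URL
  .set("spark.cores.max", "4")       // hard cap on cores for this application
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "2g")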

Data receiving in Spark Streaming

Recently I've been doing performance tests on Spark Streaming, but some problems puzzle me a lot.
In Spark Streaming, receivers are scheduled to run in executors on worker nodes.
How many receivers are there in a cluster? Can I control the number of receivers?
If not all workers run receivers to receive the stream data, will the other worker nodes not receive any data? In that case, how can I guarantee task scheduling based on data locality? Is data copied from the nodes that run receivers?
There is only one receiver per input DStream, but you can create more than one DStream and union them together to act as one. This is why it is suggested to run Spark Streaming against a cluster with at least N(receivers) + 1 cores. Once the data is past the receiving portion, it is mostly a simple Spark application and follows the same rules as a batch job. (This is why streaming is referred to as micro-batching.)
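A minimal sketch of that pattern (Scala DStream API): several receiver streams are created and unioned into one; the source host, port, and receiver count are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiverUnion {
  def main(args: Array[String]): Unit = {
    // Needs at least numReceivers + 1 cores, or the receivers starve the processing tasks.
    val conf = new SparkConf().setAppName("MultiReceiverUnion")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val numReceivers = 3
    val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("source-host", 9999))

    // Union the per-receiver DStreams so downstream code sees a single stream.
    val unified = ssc.union(streams)

    unified.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}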
