New directStream API reads topic's partitions sequentially. Why? - apache-spark

I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have a Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors of 1 core each (--num-executors 8 --executor-cores 1).
I noticed that Spark reads all of the topic's partitions sequentially on one executor - this is obviously not what I want.
I want Spark to read all partitions in parallel.
How can I achieve that?
Thank you in advance.

At job creation there is an initial communication with Kafka, solely to set the offsets of the KafkaRDD - more specifically, the offset range for each KafkaRDD partition that makes up the KafkaRDD across the cluster.
Those offset ranges are then used to fetch data on each executor once the job is actually executed. Depending on what you observed, you may simply have seen that initial communication (from the driver). If all your tasks really are executing on the same executor, then something else is going wrong beyond just using Kafka.
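For reference, a minimal sketch of the direct stream API (the broker addresses and topic name are made up): the driver computes one offset range per Kafka partition for each batch, and each range becomes one KafkaRDD partition whose records are fetched by whichever executor runs that task.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-partitions")
    val ssc  = new StreamingContext(conf, Seconds(15))

    // Hypothetical brokers and topic; the topic is assumed to have 8 partitions.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("my-topic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // Offset ranges are computed on the driver; each range becomes one RDD
      // partition, and its records are fetched on whichever executor runs
      // that partition's task.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { o =>
        println(s"topic=${o.topic} partition=${o.partition} offsets=[${o.fromOffset}, ${o.untilOffset})")
      }
      println(s"tasks in this batch: ${rdd.getNumPartitions}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With 8 topic partitions and 8 single-core executors, each batch should produce 8 tasks that can run in parallel, one per partition.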

Related

Does the receiver store its streaming content only on the executor where it is running?

Let's say there are 3 executors (ex1, ex2, ex3) and a receiver is running in one of them (ex1). What happens when data arrives at the source?
Say data arrives in a Kafka topic, "topic1". The receiver running in ex1 will consume that data, right? Now where is that data stored?
Is it stored in that executor ex1 itself?
What if the data is too big? Does the receiver break it down and distribute it over the other executors?
Let's say each executor (ex1, ex2, ex3) has a capacity of 10 GB, and 15 GB of data arrives (a hypothetical assumption). What happens to ex1? Will it fail, or will this be handled? If it is handled, how? Does the data get distributed over the cluster? If it does, how does foreachRDD fit into the picture when only one RDD is formed per batch? If the data is broken up and distributed, there is now more than one RDD in the cluster for that particular batch, right?
How many receivers run in a Spark job? Does it depend on the number of input sources? If Spark is reading from 4 different Kafka topics, does that mean 4 different receivers will run separately in different executors? What if there are only 2 executors and 4 Kafka topics/sources? In that case will the 4 receivers be spread evenly across the two executors? What if the number of sources is odd? With two executors and 3 Kafka sources, will one of the executors run two receivers? What if one of the executors dies? How will it be recovered?
Is it stored in that executor ex1 itself?
Yes, all the data fetched by the Spark driver is pushed to the executors for further processing.
What if the data is too big? Does the receiver break it down and distribute it over the other executors?
The data is hash-partitioned once it is read by the Spark receiver and then distributed fairly among the executors. If you still see data skew, try adding a custom partitioner and repartitioning the data (see the sketch after this answer).
How many receivers run in a Spark job? Does it depend on the number of input sources? If Spark is reading from 4 different Kafka topics, does that mean 4 different receivers will run separately in different executors? What if there are only 2 executors and 4 Kafka topics/sources? In that case will the 4 receivers be spread evenly across the two executors? What if the number of sources is odd? With two executors and 3 Kafka sources, will one of the executors run two receivers? What if one of the executors dies? How will it be recovered?
There is only a single receiver (whether for one or multiple topics), and it does the Kafka offset management. It hands over a range of offsets per topic to each Spark executor, and the executors read the data directly from Kafka. If any of the executors dies, all of its stages are re-executed from the last successfully saved stage.
Spark's load distribution is not based on the size of the data but on the count of events.
Guidelines say that if there are N partitions for a topic then Spark should have 2N executors to achieve optimum CPU resource utilization.
You can find more details at the link below:
https://blog.cloudera.com/reading-data-securely-from-apache-kafka-to-apache-spark/
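As a rough illustration of the custom-partitioner suggestion above (the class name, hot key, and partition count are hypothetical), a keyed RDD can be repartitioned with a Partitioner that isolates a known hot key:

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: give a known hot key its own partition and
// hash-partition every other key across the remaining partitions.
class HotKeyPartitioner(partitions: Int, hotKey: String) extends Partitioner {
  require(partitions >= 2, "need at least two partitions")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int =
    if (key == hotKey) 0
    else 1 + math.abs(key.hashCode % (partitions - 1))
}

// Usage on a keyed RDD (names are placeholders):
//   val balanced = events.partitionBy(new HotKeyPartitioner(16, "hot-session"))
```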

How to avoid a shuffle on spark if the data is already on the correct executor

We have a Spark job which reads from Kafka.
There is one executor per Kafka partition.
The topic carries session information, and the data is already written to the topic partitioned by sessionId.
In Spark we want to do a groupByKey operation using sessionId. As a developer I know that the different messages for a given session will always be on a single executor.
So Spark should not need to move data between executors to do the shuffle.
Is there a way to make sure that Spark does not move data between executors for this operation?
Or any advice on how to handle this issue?
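For context, a minimal sketch of the kind of pipeline the question describes (the event type and field names are made up); by default groupByKey hash-partitions the keys, which is what introduces the shuffle being asked about:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type; sessionId is also the Kafka partitioning key.
case class SessionEvent(sessionId: String, payload: String)

object SessionGrouping {
  def groupSessions(events: DStream[SessionEvent]): DStream[(String, Iterable[SessionEvent])] =
    events
      .map(e => (e.sessionId, e))
      // groupByKey shuffles with a HashPartitioner by default, even though all
      // records for a given sessionId already sit on the same executor.
      .groupByKey()
}
```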

Spark Structured Streaming Print Offsets Per Batch Per Executor

I have a simple job (20 executors, 8 GB memory each) that reads from Kafka (with 50 partitions), checkpoints to HDFS, and posts data to an HTTP endpoint (1000 events per second). I recently started to see some straggling executors which take far longer than the others. As part of the investigation I am trying to rule out data skew; is there a way to print partition:offset pairs per executor? Or is there any other way to track why an executor may be straggling?
I know I can implement StreamingQueryListener, but that will only give me partition:offsets per batch and won't tell me which executor is processing a specific partition.
You can have this printed if you use a foreach sink in Structured Streaming. The ForeachWriter's open method has those details, and it is executed on every executor, so you have the information there.
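A minimal sketch of that idea, assuming the stream has been cast to a Dataset[String] (the writer name and logging format are made up). Note that the partition id reported in open is the Spark partition of the micro-batch, which for the Kafka source normally corresponds to one topic partition:

```scala
import java.net.InetAddress
import org.apache.spark.TaskContext
import org.apache.spark.sql.ForeachWriter

// Hypothetical writer that logs which host handles which partition of each
// micro-batch before the records are posted to the HTTP endpoint.
class OffsetLoggingWriter extends ForeachWriter[String] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    val host = InetAddress.getLocalHost.getHostName
    println(s"host=$host batch=$epochId partition=$partitionId " +
      s"taskAttempt=${TaskContext.get().taskAttemptId()}")
    true // returning true means process() will be called for this partition
  }

  override def process(value: String): Unit = {
    // post the record to the HTTP endpoint here
  }

  override def close(errorCause: Throwable): Unit = ()
}

// Usage (assuming the stream is a Dataset[String]):
//   events.writeStream.foreach(new OffsetLoggingWriter).start()
```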

How does Spark Streaming schedule map tasks between driver and executor?

I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from Kafka, which has 20 partitions. This Spark application performs map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from Kafka with an interval of 15 seconds
Perform data validations
Execute transformations using Drools, which are map-only; no reduce transformations
Write to HBase using check-and-put
Given that executors and partitions are mapped 1:1, I wonder whether every executor will independently perform the steps above and write to HBase independently, or whether data will be shuffled between executors and operations will happen between the driver and the executors.
Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks are executed. The driver coordinates the tasks and schedules them accordingly.
With that said, I'd say the following is true:
will every executor independently perform the steps above and write to HBase independently
By the way, the answer is irrelevant to which Spark version is in use. It has always been like this (and I don't see any reason why it would, or even should, change).
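As a rough sketch of that answer applied to the pipeline in the question (the helper names and the HBase call are placeholders), the driver only schedules the work, while the validation, the map-only transformation, and the HBase writes all run inside executor tasks:

```scala
import org.apache.spark.streaming.dstream.DStream

object MapOnlyPipeline {
  // Placeholder validation / transformation steps.
  def isValid(record: String): Boolean = record.nonEmpty
  def applyDroolsRules(record: String): String = record

  def process(stream: DStream[String]): Unit = {
    val validated   = stream.filter(isValid)           // runs on executors
    val transformed = validated.map(applyDroolsRules)  // map-only, no shuffle

    transformed.foreachRDD { rdd =>
      // This outer closure runs on the driver once per batch...
      rdd.foreachPartition { records =>
        // ...but this block runs on an executor, one task per Kafka partition.
        // A real job would open an HBase connection here and use check-and-put.
        records.foreach { record =>
          // table.checkAndPut(...) - hypothetical per-record write
          ()
        }
      }
    }
  }
}
```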

Kafka topic partitions to Spark streaming

I have some use cases that I would like clarified, about Kafka topic partitioning -> Spark Streaming resource utilization.
I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic -> the RDD will have the same number of partitions as Kafka when I use the spark-kafka direct stream integration.
So if I have 1 partition in the topic and 1 executor core, that core will read from Kafka sequentially.
What happens if I have:
2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second one, so there will be no benefit in partitioning the topic?
2 partitions in the topic and 2 cores? Will 1 executor core then read from 1 partition, and the second core from the second partition?
1 Kafka partition and 2 executor cores?
Thank you.
The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it is less than the number of partitions, Spark will have threads read from one partition and then the other. So:
2 partitions, 1 executor core: reads from one partition, then the other. (I am not sure how Spark decides how much to read from each before switching.)
2p, 2c: parallel execution
1p, 2c: one thread is idle
For case #1, note that having more partitions than executors is OK, since it allows you to scale out later without having to re-partition. The trick is to make sure that the number of partitions is evenly divisible by the number of executors. Spark has to process all the partitions before passing data on to the next step in the pipeline, so 'remainder' partitions can slow down processing. For example, with 5 partitions and 4 threads, processing takes the time of 2 partitions: 4 are processed at once, then one thread runs the 5th partition by itself.
Also note that you may see better processing throughput if you keep the number of partitions / RDDs the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
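For the last point, a small sketch of passing an explicit partition count to a shuffle operation (the word-count logic and names are just placeholders), so the downstream parallelism stays aligned with the number of Kafka partitions:

```scala
import org.apache.spark.streaming.dstream.DStream

object StablePartitionCounts {
  // Keep the shuffle output at the same parallelism as the Kafka topic by
  // passing an explicit partition count to reduceByKey.
  def countsPerKey(lines: DStream[String], kafkaPartitions: Int): DStream[(String, Long)] =
    lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _, kafkaPartitions)
}
```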
