How does Spark Streaming schedule map tasks between driver and executor?

I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from Kafka that has 20 partitions. This Spark application does map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from Kafka with a batch interval of 15 seconds
Perform data validations
Execute transformations using Drools, which are map-only (no reduce transformations)
Write to HBase using check-and-put
If executors and partitions are mapped 1-1, will every executor perform the above steps and write to HBase independently, or will data be shuffled among executors, with operations happening between the driver and executors?

Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks run. The driver's job is to coordinate the tasks and schedule them accordingly.
With that said, and since map and flatMap are narrow transformations that do not trigger a shuffle, I'd say the following is true:
"will every executor independently perform above steps and write to HBase independently"
By the way, the answer does not depend on which Spark version is in use. It's always been like this (and I don't see any reason why it would or even should change).
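To make the "independently" part concrete, here is a toy model in plain Python. It is not Spark code: run_partition, run_batch and the validate/transform/write callbacks are made-up names for illustration. It only shows that a map-only pipeline touches each partition's records in isolation, which is why no data has to move between executors.

```python
from typing import Callable, Dict, Iterable, List


def run_partition(
    records: Iterable[str],
    validate: Callable[[str], bool],
    transform: Callable[[str], List[str]],
    write: Callable[[str], None],
) -> int:
    """Model of one task: validate, flatMap-style transform, then write.

    Everything here works on the records of a single partition only.
    """
    written = 0
    for record in records:
        if not validate(record):
            continue  # drop invalid records
        for out in transform(record):  # one input may yield many outputs
            write(out)
            written += 1
    return written


def run_batch(partitions: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Model of one micro-batch: one independent task per partition."""
    sinks: Dict[int, List[str]] = {pid: [] for pid in partitions}
    for pid, records in partitions.items():
        run_partition(
            records,
            validate=lambda r: bool(r.strip()),  # hypothetical validation rule
            transform=lambda r: r.split(),       # hypothetical transformation
            write=sinks[pid].append,             # stands in for the HBase write
        )
    return sinks
```

In the model, the output of partition 0 never depends on the contents of partition 1. A reduce-style operation (grouping by key, for example) would break that property and force a shuffle.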

Related

Spark Structured Streaming Print Offsets Per Batch Per Executor

I have a simple job (20 executors, 8G memory each) that reads from Kafka (with 50 partitions), checkpoints to HDFS, and posts data to an HTTP endpoint (1000 events per second). I recently started to see some straggling executors that take far longer than the others. As part of the investigation I was trying to rule out data skew; is there a way to print partition:offsets for executors? Or is there any other way to track why an executor may be straggling?
I know I can implement StreamingQueryListener but that'll only give me partition:offsets per batch, and won't tell me which executor is processing a specific partition.
You can have it printed if you use a foreach sink (ForeachWriter in Structured Streaming). Its open method receives the partition id and is executed on the executors, so you have those details there.
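A minimal sketch of such a writer in Python follows. The class name OffsetLoggingWriter is made up; the open/process/close method names follow the shape PySpark's foreach sink expects (available from Spark 2.4). It assumes the rows still carry the Kafka source columns partition and offset, and it only uses the standard library so it can be tried without a cluster.

```python
import socket


class OffsetLoggingWriter:
    """Tracks the Kafka offset range seen by each task and prints it on close."""

    def open(self, partition_id, epoch_id):
        # Called on the executor once per task partition and epoch (batch).
        self.partition_id = partition_id
        self.epoch_id = epoch_id
        self.host = socket.gethostname()
        self.ranges = {}  # kafka partition -> (min offset, max offset)
        return True       # True means: go ahead and process the rows

    def process(self, row):
        # Assumption: the row exposes the Kafka columns "partition" and "offset".
        kafka_partition, offset = row["partition"], row["offset"]
        low, high = self.ranges.get(kafka_partition, (offset, offset))
        self.ranges[kafka_partition] = (min(low, offset), max(high, offset))

    def close(self, error):
        # Output goes to the stdout log of the executor that ran the task.
        print(
            f"host={self.host} task_partition={self.partition_id} "
            f"epoch={self.epoch_id} kafka_offsets={self.ranges} error={error}"
        )
```

It would be attached with something like df.writeStream.foreach(OffsetLoggingWriter()).start(). Note that the partition id passed to open is the Spark task partition, which is not necessarily the Kafka partition number, hence the per-row tracking.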

Spark Sql Job optimization

I have a job which consists of around 9 SQL statements that pull data from Hive and write back to a Hive database. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches 11 stages in total.
I did some analysis using the Spark UI and found the following areas that can be improved:
Stage 8 in Job 5 has a shuffle output of 1.5 TB.
The time gap between job 4 and job 5 is 20 minutes. I read about this gap and found that Spark performs I/O outside of a Spark job, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue and I am using below conf to submit job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching Image of UI as well.
Now my questions are:
Where can I find driver log for this job?
In the image, I see a long list of "Executor added" events which sums to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one has around 35000 tasks; the rest have only 200 tasks. Should I increase the number of executors, or should I use Spark's dynamic allocation?
Below are some thoughts that may guide you to some extent:
Is it necessary to have one core per executor? An executor does not always need to be fat, but you can have more cores in one executor. It is a trade-off between slim and fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions.
Ensure that, when reading data from Hive, you are using a SparkSession with Hive support enabled (the replacement for HiveContext). This pulls the data from HDFS into Spark memory and the schema information from the Hive Metastore.
Yes, dynamic allocation is a feature that helps allocate the right amount of resources. It is usually better than a fixed allocation.
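The trade-offs above can be estimated with some back-of-the-envelope arithmetic. The sketch below uses the numbers from the question and assumes one CPU per task (the Spark default) and evenly sized shuffle partitions; the function names are made up for illustration. The 200-task stages are consistent with spark.sql.shuffle.partitions being left at its default of 200, although the question does not confirm that.

```python
import math


def task_waves(num_tasks: int, num_executors: int, cores_per_executor: int) -> int:
    """Number of sequential 'waves' needed to run all tasks of a stage.

    Assumes every task needs exactly one core.
    """
    slots = num_executors * cores_per_executor
    return math.ceil(num_tasks / slots)


def shuffle_gb_per_partition(total_shuffle_gb: float, shuffle_partitions: int) -> float:
    """Average shuffle data per partition, assuming no skew."""
    return total_shuffle_gb / shuffle_partitions


# 35000 tasks on 200 single-core executors
print(task_waves(35000, 200, 1))
# same stage if each executor had 4 cores
print(task_waves(35000, 200, 4))
# 1.5 TB (1536 GB) of shuffle output read by 200 partitions
print(shuffle_gb_per_partition(1536, 200))
```

With 200 partitions, each task reading the 1.5 TB shuffle would handle several gigabytes on average, which is a lot for a 6G executor. Raising spark.sql.shuffle.partitions shrinks that per-task share, and more cores (or more executors through dynamic allocation) reduce the number of waves.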

How does Apache Spark assign partition-ids to its executors

I have a long-running Spark Streaming job which uses 16 executors with only one core each.
I use the default partitioner (HashPartitioner) to distribute data equally across 16 partitions. Inside the updateStateByKey function, I checked the partition id from TaskContext.getPartitionId() over multiple batches and found that the partition id of an executor is quite consistent, but it still changes to another id after a long run.
I'm planning to do some optimization of the Spark "updateStateByKey" API, but it can't be achieved if the partition id keeps changing between batches.
So when does Spark change the partition id of an executor?
Most probably the task failed and was restarted, so the TaskContext changed, and the partitionId with it.
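It may help to separate two things: which partition a key belongs to, and which executor runs that partition. The first is a pure function of the key and the number of partitions; the second is a scheduling decision. The sketch below re-implements the idea behind HashPartitioner in plain Python for string keys (Java-style hash code, then a non-negative modulo). It is an illustration, not Spark's code, and it assumes keys contain only characters from the Basic Multilingual Plane.

```python
def java_string_hash(s: str) -> int:
    """Java's String.hashCode() as a signed 32-bit integer (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key: str, num_partitions: int) -> int:
    """Partition id for a key: hash modulo partition count, never negative."""
    # Python's % already returns a non-negative result for a positive modulus.
    return java_string_hash(key) % num_partitions


for key in ["user-1", "user-2", "user-3"]:
    print(key, "->", hash_partition(key, 16))
```

The same key maps to the same partition id in every batch as long as the partition count stays at 16. What can change is the executor that a given partition's task lands on, for example after a task failure and retry or when the preferred executor is busy.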

Does Spark ensure data locality?

When I submit my Spark job to a YARN cluster with --num-executors=4, I can see in the Spark UI that 4 executors are allocated on 4 nodes in the cluster. In my Spark application I take inputs from various HDFS locations in various steps, but the allocated executors remain the same throughout the execution.
My doubt is whether Spark does anything for data locality, since it selects the nodes at the very beginning, irrespective of where the input data is situated (at least in the case of HDFS).
I know MapReduce does it to some extent.
Yes, it does. Spark still uses the Hadoop InputFormat and RecordReader interfaces and appropriate implementations such as TextInputFormat, so Spark's behaviour in this case is very similar to common MapReduce. The Spark driver retrieves the block locations of the file and assigns tasks to executors with regard to data locality.
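As a rough illustration of what "with regard to data locality" means once the executors are fixed, here is a toy model in plain Python. It is not Spark's scheduler: pick_executor is a made-up function, and it only distinguishes two of Spark's locality levels (NODE_LOCAL and ANY). It prefers an executor on a host that stores the block and otherwise falls back to any executor.

```python
from typing import Dict, List, Tuple


def pick_executor(block_hosts: List[str], executors: Dict[str, str]) -> Tuple[str, str]:
    """Choose an executor for a task that reads one HDFS block.

    block_hosts: hosts that hold a replica of the block
    executors:   executor id -> host the executor runs on
    Returns (executor id, locality level).
    """
    for exec_id, host in sorted(executors.items()):
        if host in block_hosts:
            return exec_id, "NODE_LOCAL"  # data is read from the local disk
    # No executor sits next to the data, so the block is read over the network.
    return sorted(executors)[0], "ANY"


executors = {"1": "node-a", "2": "node-b", "3": "node-c", "4": "node-d"}
print(pick_executor(["node-b", "node-x", "node-y"], executors))
print(pick_executor(["node-x", "node-y", "node-z"], executors))
```

The model shows the limit the question hints at: locality can only be honoured among the nodes that actually host executors. In real Spark the scheduler also waits for a short time (spark.locality.wait, 3s by default) for a more local slot before accepting a less local one.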

New directStream API reads topic's partitions sequentially. Why?

I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have a Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors with 1 core each (--num-executors 8 --executor-cores 1).
I noticed that Spark reads all of the topic's partitions sequentially in one executor, which is obviously not what I want.
I want spark to read all partitions in parallel.
How can I achieve that?
Thank you, in advance.
An initial communication to Kafka at job creation occurs, solely to set the offsets of the KafkaRDD - more specifically, the offsets for each KafkaRDD partition that makes up the KafkaRDD across the cluster.
Those offsets are then used to fetch the data once the job is actually executed, on each executor. Depending on what you noticed, you may have seen that initial communication (from the driver). If you have seen all your jobs executing on the same executor, then something other than the use of Kafka is going wrong.
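The driver-side planning step described above can be sketched in plain Python. The OffsetRange tuple loosely mirrors the fields of Spark's Kafka OffsetRange (topic, partition, from and until offsets), but plan_batch and the sample offsets are made up for illustration.

```python
from typing import Dict, List, NamedTuple


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(topic: str, consumed: Dict[int, int], latest: Dict[int, int]) -> List[OffsetRange]:
    """Driver side: decide one offset range per Kafka partition for this batch.

    No message data is read here; each range later becomes one task that
    fetches its own slice from Kafka on whichever executor runs it.
    """
    return [
        OffsetRange(topic, p, consumed.get(p, 0), latest[p])
        for p in sorted(latest)
    ]


consumed = {p: 100 for p in range(8)}
latest = {p: 100 + 10 * p for p in range(8)}
for r in plan_batch("events", consumed, latest):
    print(r)
```

With 8 Kafka partitions the plan contains 8 ranges, so the batch has 8 tasks that can run at the same time if 8 cores are free. Seeing a single host talk to Kafka at the start of a batch is consistent with this planning step running on the driver.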
