How does Apache Spark assign partition-ids to its executors - apache-spark

I have a long-running Spark streaming job which uses 16 executors which only one core each.
I use default partitioner(HashPartitioner) to equally distribute data to 16 partitions. Inside updateStateByKeyfunction, i checked for the partition id from TaskContext.getPartitionId() for multiple batches and found out the partition-id of a executor is quite consistent but still changing to another id after a long run.
I'm planing to do some optimization to spark "updateStateByKey" API, but it can't be achieved if the partition-id keeps changing among batches.
So when does Spark change the partition-id of a executor?

Most probably, the task has failed and restart again, so the TaskContext has changed, and so as the partitionId.

Related

Spark Sql Job optimization

I have a job which consist around 9 sql statement to pull data from hive and write back to hive db. It is currently running for 3hrs which seems too long considering spark abitlity to process data. The application launchs total 11 stages.
I did some analysis using Spark UI and found below grey areas which can be improved:
Stage 8 in Job 5 has shuffle output of 1.5 TB.
Time gap between job 4 and job 5 is 20 Mins. I read about this time gap and found spark perform IO out of spark job which reflects as gap between two jobs which can be seen in driver logs.
We have a cluster of 800 nodes with restricted resources for each queue and I am using below conf to submit job:
-- num-executor 200
-- executor-core 1
-- executor-memory 6G
-- deployment mode client
Attaching Image of UI as well.
Now my questions are:
Where can I find driver log for this job?
In image, I see a long list of Executor added which I sum is more than 200 but in Executor tab, number is exactly 200. Any explation for this?
Out of all the stages, only one stage has TASK around 35000 but rest of stages has 200 tasks only. Should I increase number of executor or should I go for dynamic allocation facility of spark?
Below are the thought processes that may guide you to some extent:
Is it necessary to have one core per executor? The executor need not be fat always. You can have more cores in one executor. it is a trade-off between creating a slim vs fat executors.
Configure shuffle partition parameter spark.sql.shuffle.partitions
Ensure while reading data from Hive, you are using Sparksession (basically HiveContext). This will pull the data into Spark memory from HDFS and schema information from Metastore of Hive.
Yes, Dynamic allocation of resources is a feature that helps in allocating the right set of resources. It is better than having fixed allocation.

How to rebalance RDD during processing time for unbalanced executor workloads

Suppose I have an RDD with 1,000 elements and 10 executors. Right now I parallelize the RDD with 10 partitions and process 100 elements by each executor (assume 1 task per executor).
My difficulty is that some of these partitioned tasks may take much longer than others, so say 8 executors will be done quickly, while the remaining 2 will be stuck doing something for longer. So the master process will be waiting for the 2 to finished before moving on, and 8 will be idling.
What would be a way to make the idling executors 'take' some work from the busy ones? Unfortunately I can't anticipate ahead of time which ones will end up 'busier' than others, so can't balance the RDD properly ahead of time.
Can I somehow make executors communicate with each other programmatically? I was thinking of sharing a DataFrame with the executors, but based on what I see I cannot manipulate a DataFrame inside an executor?
I am using Spark 2.2.1 and JAVA
Try using spark dynamic resource allocation, which scales the number of executors registered with the application up and down based on the workload.
You can endable the below properties
spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true
You can consider to configure the below properties as well
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.minExecutors
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.

How does Spark Streaming schedule map tasks between driver and executor?

I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from Kafka that has 20 partitions. This Spark application does map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from kafka with interval of 15 seconds
Perform data validations
Execute transformations using drool which are map only. No reduce transformations
Write to HBase using check-and-put
I wonder if executors and partitions are 1-1 mapped, will every executor independently perform above steps and write to HBase independently, or data will be shuffled within multiple executors and operations will happen between driver and executors?
Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks can be executed. The driver is to coordinate the tasks and schedule them accordingly.
With that said, I'd say the following is true:
will every executor independently perform above steps and write to HBase independently
By the way, the answer is irrelevant to what Spark version is in use. It's always been like this (and don't see any reason why it would or even should change).

Does spark ensure datalocality?

When I submit my spark job into yarn cluster with --num-executers=4 , I can see in the spark UI, 4 executors are allocated in 4 nodes in the cluster. In my spark application I am taking inputs from various HDFS locations in various steps. But the allocated executors remain the same through out the execution.
My doubt is whether spark do anything for data-locality, since the nodes it selects at the very beginning irrespective of where input data situated(at least just in case of HDFS)?
I know map reduce does it in some extent.
Yes, it does. Spark still uses Hadoop InputFormat and RecordReader interfaces and appropriate implementations like i.e. TextInputFormat. So Spark's behaviour in this case is very similar to common MapReduce. Spark driver retrieves block locations of the file and assigns task to executors with regard to data locality.

what factors affect how many spark job concurrently

We recently have set up the Spark Job Server to which the spark jobs are submitted.But we found out that our 20 nodes(8 cores/128G Memory per node) spark cluster can only afford 10 spark jobs running concurrently.
Can someone share some detailed info about what factors would actually affect how many spark jobs can be run concurrently? How can we tune the conf so that we can take full advantage of the cluster?
Question is missing some context, but first - it seems like Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, Spark would queue the next incoming job waiting for CPUs to be available.
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other actions that operate on more than one RDD (e.g. union) may also change the number of partitions.
Some things that could limit the parallelism that you're seeing:
If your job consists of only map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data, it will only spawn 10 task (unless the data is splittable, in something like parquet, LZO indexed text, etc).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task and then growing until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.

Resources