How does Spark distribute GroupedData.apply(pandas_udf) into Tasks? - apache-spark

Motivation: I am applying a computationally expensive pandas_udf onto a spark GroupedData object, and I want to set my spark configuration such that each time the udf is run it has at least 4 cores to run on.
I know I can set the number of cores per spark task with the following:
spark.task.cpus = X
but I am concerned that a single spark task might try to run multiple concurrent instances of my expensive pandas_udf...
Is each application of the udf onto a group in the GroupedData object given its own Spark Task? And if so, could I achieve the optimization I'm looking for by setting the following 2 spark configurations:
spark.task.cpus (the number of cpus per task)
spark.executor.cores (the number of cpus per executor)
I'm currently thinking that setting:
spark.task.cpus = spark.executor.cores = 4
might be the solution I'm looking for, but I welcome all opinions.
Thank you!
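For concreteness, here is a minimal PySpark sketch of the setup being described; the dummy data, schema, and UDF body are assumptions for illustration, and setting the two configs this way is the idea being asked about, not a confirmed recipe:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = (
    SparkSession.builder
    .appName("grouped-pandas-udf")
    # give every task 4 cores and every executor exactly 4 cores,
    # so an executor runs one task at a time
    .config("spark.task.cpus", "4")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"])

@pandas_udf("key string, value double", PandasUDFType.GROUPED_MAP)
def expensive_udf(pdf):
    # placeholder for the computationally expensive per-group work
    pdf["value"] = pdf["value"] * 2
    return pdf

df.groupBy("key").apply(expensive_udf).show()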

Related

Decide the number of partitions in Spark (running on YARN) based on executors, cores, and memory

How do you decide the number of partitions in Spark (running on YARN) based on executors, cores, and memory?
I'm new to Spark, so I don't have much hands-on experience with real scenarios.
I know there are many things to consider when deciding on partitioning, but a detailed explanation of a typical production scenario would be very helpful.
Thanks in advance
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
The recommended number of partitions is 2-4 times the number of cores.
So if you have 7 executors with 5 cores each, you can repartition to between 7*5*2 = 70 and 7*5*4 = 140 partitions.
https://spark.apache.org/docs/latest/rdd-programming-guide.html
IMO, with Spark 3.0 (and AWS EMR's Spark 2.4.x) and adaptive query execution, you're often better off letting Spark handle it. If you do want to hand-tune it, the answer can often be complicated. One good option is to have 2 or 4 times the number of CPUs available. While this works for most data sizes, it becomes problematic with very large and very small datasets. In those cases it's useful to aim for ~128MB per partition.
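As a rough sketch of the 2-4x heuristic above (the executor and core counts are the hypothetical ones from the answer, not values to copy):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sizing").getOrCreate()

num_executors = 7        # hypothetical, as in the example above
cores_per_executor = 5   # hypothetical
target_partitions = num_executors * cores_per_executor * 2   # 70; use * 4 for 140

df = spark.range(0, 10_000_000)
df = df.repartition(target_partitions)
print(df.rdd.getNumPartitions())   # 70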

Spark: Executors have different tasks

I use Spark 2.4.3 with 12 executors, each with 5 cores and 40 GB of memory. I set defaultParallelism to 180.
I use the following code to read two single text files from hdfs.
val f1 = sc.textFile("file1", sc.defaultParallelism)
val f2 = sc.textFile("file2", sc.defaultParallelism)
val all = f1.union(f2).persist()
all.count()
When I look at the Spark UI, I find that executors get different numbers of tasks (some get only 3). Why doesn't Spark assign the same number of tasks to each executor so that maximum efficiency can be obtained? Is there a way to avoid this?
A couple of things to keep in mind...
Not all tasks take the same amount of time. I find that 2x or 4x the number of cores is often a better number of tasks.
Scheduling has a bit of overhead.
The Spark UI can make it difficult to determine the utilization rate of the cluster, since the work of a single executor task slot is scattered around the graph.
IMO you don't want to optimize executor utilization rate. What you want to optimize is either query latency or throughput per CPU. You can very easily over-parallelize Spark jobs and make them extremely inefficient.
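As a PySpark approximation of the snippet in the question, this sketch checks how many partitions (and therefore tasks) the union actually produces; the HDFS paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-tasks").getOrCreate()
sc = spark.sparkContext

f1 = sc.textFile("hdfs:///file1", sc.defaultParallelism)
f2 = sc.textFile("hdfs:///file2", sc.defaultParallelism)
all_rdd = f1.union(f2).persist()

# with no partitioner, union() just concatenates the partition lists, so the
# count stage gets one task per partition of f1 plus one per partition of f2
print(f1.getNumPartitions(), f2.getNumPartitions(), all_rdd.getNumPartitions())
print(all_rdd.count())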

How to rebalance RDD during processing time for unbalanced executor workloads

Suppose I have an RDD with 1,000 elements and 10 executors. Right now I parallelize the RDD with 10 partitions and process 100 elements by each executor (assume 1 task per executor).
My difficulty is that some of these partitioned tasks may take much longer than others, so say 8 executors will be done quickly while the remaining 2 will be stuck doing something for longer. The master process will then wait for those 2 to finish before moving on, and the other 8 will be idle.
What would be a way to make the idling executors 'take' some work from the busy ones? Unfortunately I can't anticipate ahead of time which ones will end up 'busier' than others, so can't balance the RDD properly ahead of time.
Can I somehow make executors communicate with each other programmatically? I was thinking of sharing a DataFrame with the executors, but based on what I see I cannot manipulate a DataFrame inside an executor?
I am using Spark 2.2.1 and Java.
Try using Spark dynamic resource allocation, which scales the number of executors registered with the application up and down based on the workload.
You can enable it with the properties below:
spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true
You can also consider configuring the following properties:
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.minExecutors
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
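A minimal sketch of enabling these properties at session creation; the timeout and executor bounds are placeholder values, not recommendations, and the external shuffle service must also be running on the cluster:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)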

How to make two Spark RDDs run in parallel

For example I created two RDDs in my code as following:
val rdd1=sc.esRDD("userIndex1/type1")
val rdd2=sc.esRDD("userIndex2/type2")
val rdd3=rdd1.join(rdd2)
rdd3.foreachPartition{....}
I found they were executed serially; why doesn't Spark run them in parallel?
The reason for my question is that the network is very slow: generating rdd1 takes 1 hour and generating rdd2 takes 1 hour as well. So I am asking why Spark didn't generate the two RDDs at the same time.
Spark provides asynchronous actions to run jobs asynchronously, which may help in your use case to run the computations in parallel and concurrently. By default only one RDD at a time will be computed in the Spark cluster, but you can make the actions asynchronous. You can check the Java docs for this API here: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/rdd/AsyncRDDActions.html
There is also a blog post about it here: https://blog.knoldus.com/2015/10/21/demystifying-asynchronous-actions-in-spark/
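AsyncRDDActions is a Scala/Java API with no direct PySpark equivalent, so here is a sketch of a commonly used alternative: submitting the two jobs from separate Python threads (the Spark scheduler is thread-safe). The input paths are placeholders, not the esRDD calls from the question:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concurrent-jobs").getOrCreate()
sc = spark.sparkContext

def count_file(path):
    # each call triggers its own Spark job
    return sc.textFile(path).count()

with ThreadPoolExecutor(max_workers=2) as pool:
    fut1 = pool.submit(count_file, "hdfs:///input1")
    fut2 = pool.submit(count_file, "hdfs:///input2")
    print(fut1.result(), fut2.result())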
I have found similar behavior. Running the RDDs either serially or in parallel doesn't make any difference given the number of executors and executor cores you set in your spark-submit.
Let's say we have 2 RDDs as you mentioned above, and each RDD takes 1 hour with 1 executor and 1 core. We cannot increase performance with 1 executor and 1 core (the Spark config), even if Spark runs both RDDs in parallel, unless you increase the executors and cores.
So running the two RDDs in parallel is not going to increase performance.

What factors affect how many Spark jobs run concurrently

We recently set up the Spark Job Server to which our Spark jobs are submitted, but we found that our 20-node (8 cores / 128 GB memory per node) Spark cluster can only handle 10 Spark jobs running concurrently.
Can someone share some detailed info about what factors would actually affect how many spark jobs can be run concurrently? How can we tune the conf so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems that Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, then 10 jobs fill all 160 cores, and Spark will queue the next incoming job until CPUs become available.
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other actions that operate on more than one RDD (e.g. union) may also change the number of partitions.
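A quick sketch of that point: the number of tasks in a stage follows the number of partitions, which you can inspect and change with repartition or coalesce (the sizes here are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count").getOrCreate()

df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())     # decided by the input / default parallelism

df16 = df.repartition(16)            # a stage over df16 runs 16 tasks
print(df16.rdd.getNumPartitions())   # 16

df4 = df16.coalesce(4)               # coalesce narrows without a full shuffle
print(df4.rdd.getNumPartitions())    # 4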
Some things that could limit the parallelism that you're seeing:
If your job consists only of map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors but only 10 partitions of data, it will spawn only 10 tasks (unless the data is splittable, as with Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task and then growing until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.
