I am running a set of Spark SQL queries in parallel on Google Dataproc. I am spinning up my own cluster, and ideally the queries should consume all the resources available in it. However, I see that only 40 vCores are in use even though 320 vCores are available. Do you know how I can tune the performance in this case?
I tried different numbers of cores and executors. Even though some apps are pending, the running ones don't seem to take additional resources. I am spinning up a cluster of 20 nodes, but the computation still takes a lot of time.
I am setting the number of partitions to 50 to restrict the number of output files. Even if I skip setting it, performance doesn't seem to improve.
Related
I have a Dataproc cluster with 2 worker nodes (n1s2). There is an external server which submits around 360 spark jobs within an hour (with a couple of minutes spacing between each submission). The first job completes successfully but the subsequent ones get stuck and do not proceed at all.
Each job crunches some timeseries numbers and writes to Cassandra. And the time taken is usually 3-6 minutes when the cluster is completely free.
I feel this could be solved by just scaling up the cluster, but that would become very costly for me.
What would be the other options to best solve this use case?
Running 300+ concurrent jobs on a cluster with 2 worker nodes doesn't sound feasible. First estimate how much of each resource (CPU, memory, disk) each job needs, then plan the cluster size accordingly. YARN metrics such as available CPU, available memory, and especially pending memory help identify when the cluster is short of resources.
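To make that estimate concrete, here is a rough back-of-the-envelope sketch in Python. The submission rate (360 jobs per hour) and the 6-minute job duration come from the question; the per-job demand and the usable per-node capacity are made-up placeholder numbers, and all the names (`Resources`, `expected_concurrency`, `nodes_needed`) are hypothetical helpers, not part of any Spark or Dataproc API. Replace the assumptions with what you actually measure in YARN.

```python
import math
from dataclasses import dataclass


@dataclass
class Resources:
    vcores: int
    memory_mb: int


def expected_concurrency(jobs_per_hour: float, minutes_per_job: float) -> float:
    """Average number of jobs in flight (arrival rate x job duration)."""
    return jobs_per_hour * minutes_per_job / 60.0


def jobs_per_node(node: Resources, job: Resources) -> int:
    """How many jobs fit on one node; the scarcer resource decides."""
    return min(node.vcores // job.vcores, node.memory_mb // job.memory_mb)


def nodes_needed(concurrent_jobs: float, node: Resources, job: Resources) -> int:
    """Worker nodes required to run the given number of jobs at once."""
    per_node = jobs_per_node(node, job)
    if per_node == 0:
        raise ValueError("a single job does not fit on one node")
    return math.ceil(concurrent_jobs / per_node)


# From the question: 360 submissions per hour, up to 6 minutes per job.
in_flight = expected_concurrency(jobs_per_hour=360, minutes_per_job=6)

# ASSUMPTIONS (placeholders, not measured values):
job_demand = Resources(vcores=1, memory_mb=2048)    # driver + executors of one job
node_usable = Resources(vcores=2, memory_mb=6144)   # what YARN can hand out per node

workers = nodes_needed(in_flight, node_usable, job_demand)
print(f"~{in_flight:.0f} jobs in flight, needs about {workers} worker nodes")
```

Under these assumed numbers, a 2-node cluster is short by roughly an order of magnitude, so the alternatives to scaling up are to lower the per-job footprint or to spread the submissions out so fewer jobs overlap.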
I have a scenario where all the containers (around 50) should be up and running all the time for realtime spark queries.
Even though I set spark.dynamicAllocation.minExecutors=50, it does not spin up 50 executors. It only spins up 30 containers, and the remaining cores and memory stay unused.
Please let me know whether there are any limitations, or a way to resolve this.
I am using a Spark 2.2.0 cluster configured in Standalone mode. The cluster has 2 octa-core machines. It is used exclusively for Spark jobs, and no other process runs on them. I have around 8 Spark Streaming apps running on this cluster. I explicitly set SPARK_WORKER_CORES (in spark-env.sh) to 8 and allocate one core to each app using the total-executor-cores setting.
This config reduces the ability to work on multiple tasks in parallel. If a stage works on a partitioned RDD with 200 partitions, only one task executes at a time. What I wanted Spark to do was to start a separate thread for each job and process them in parallel, but I couldn't find a separate Spark setting to control the number of threads.
So I decided to play around and inflated the number of cores (i.e. SPARK_WORKER_CORES in spark-env.sh) to 1000 on each machine. Then I gave 100 cores to each Spark application. This time Spark started processing 100 partitions in parallel, indicating that 100 threads were being used. I am not sure if this is the correct way to influence the number of threads used by a Spark job.
You mixed up two things:
Cluster manager properties - SPARK_WORKER_CORES - the total number of cores that a worker can offer. Use it to control the fraction of the machine's resources that Spark may use in total.
Application properties - --total-executor-cores / spark.cores.max - the number of cores that the application requests from the cluster manager. Use it to control in-app parallelism.
Only the second one is directly responsible for app parallelism, as long as the first one is not limiting.
Also, a core in Spark is a synonym for a thread. If you:
allocate one core to each app using total-executor-cores setting.
then you specifically assign a single data processing thread.
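A small Python sketch of how the two settings interact, using the numbers from the question. This is only an illustration, not Spark's scheduler: it assumes one core per task (the default spark.task.cpus=1), ignores data locality and stragglers, and the function names are made up for the example.

```python
import math


def concurrent_tasks(partitions: int, app_cores: int, cluster_cores: int) -> int:
    """Tasks that can run at the same time for one stage of one app.

    app_cores     -- what the app asks for (--total-executor-cores / spark.cores.max)
    cluster_cores -- sum of SPARK_WORKER_CORES over all workers
    Assumes 1 core per task and no other apps competing for the cores.
    """
    granted = min(app_cores, cluster_cores)  # the cluster cannot grant more than it offers
    return min(partitions, granted)          # and a stage has at most one task per partition


def task_waves(partitions: int, app_cores: int, cluster_cores: int) -> int:
    """Number of sequential 'waves' needed to get through all partitions."""
    if partitions == 0:
        return 0
    return math.ceil(partitions / concurrent_tasks(partitions, app_cores, cluster_cores))


# Original setup: 2 machines x 8 cores offered, 1 core per app.
print(concurrent_tasks(200, app_cores=1, cluster_cores=16))     # 1 task at a time
print(task_waves(200, app_cores=1, cluster_cores=16))           # 200 waves

# Inflated setup: 2 machines x 1000 "cores" offered, 100 cores per app.
print(concurrent_tasks(200, app_cores=100, cluster_cores=2000))  # 100 tasks at a time
print(task_waves(200, app_cores=100, cluster_cores=2000))        # 2 waves
```

Keep in mind that the inflated setup only changes how many threads Spark starts; the machines still have 8 physical cores each, so 100 threads per app help mainly when tasks spend much of their time waiting rather than computing.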
Is there any advantage to starting more than one spark instance (master or worker) on a particular machine/node?
The Spark standalone documentation doesn't explicitly say anything about starting a cluster or multiple workers on the same node. It does seem to implicitly assume that one worker equals one node.
Their hardware provisioning page says:
Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark’s standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.
So aside from working with large amounts of memory or testing cluster configuration, is there any benefit to running more than one worker per node?
I think the obvious benefit is to improve the resource utilization of the hardware per box without losing performance. In terms of parallelism, one big executor with multiple cores seems to be the same as multiple executors with fewer cores.
Is it possible to have executors with different amounts of memory on a Mesos cluster? Or am I bounded by the machine with the least memory? (Assuming I want to use all available cpus).
Short answer: No.
Unfortunately, Spark on Mesos and YARN only allows giving each machine as many resources (cores, memory, etc.) as your worst machine has (discussion). Ideally, the cluster should be homogeneous in order to take full advantage of its resources.
However, there might be a workaround for your problem. According to the linked source above, Spark standalone allows creating multiple workers on some machines. You could size your worker configuration for the worst machine and start multiple such workers on the larger machines.
For example, given two computers with 4G and 20G of memory respectively, you could create 5 workers on the latter, each configured to use just 4G of memory, as limited by the first machine.
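The arithmetic behind that example, as a minimal Python sketch. The machine names are invented, and like the example above it ignores the memory that the OS and the worker daemons themselves need, so in practice you would leave some headroom. The resulting per-machine count is what you would put into SPARK_WORKER_INSTANCES on each box.

```python
def plan_workers(machine_memory_gb: dict) -> tuple:
    """Size every worker for the smallest machine, then count how many fit per machine.

    Returns (worker_memory_gb, {machine_name: worker_instances}).
    """
    worker_memory_gb = min(machine_memory_gb.values())
    instances = {
        name: memory // worker_memory_gb
        for name, memory in machine_memory_gb.items()
    }
    return worker_memory_gb, instances


# Hypothetical machine names; memory sizes taken from the example above.
machines = {"small-box": 4, "large-box": 20}

worker_memory, instances = plan_workers(machines)
for name, count in instances.items():
    print(f"{name}: {count} worker(s) with {worker_memory}G each")
```

The same calculation can be repeated for cores if the machines also differ in CPU count; the smaller of the two results per machine is the safe number of workers.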