I have a scenario where all the containers (around 50) should be up and running all the time for real-time Spark queries.
Even though I set spark.dynamicAllocation.minExecutors=50, it is not spinning up 50 executors. It only spins up 30 containers, and the remaining cores and memory sit unused.
Please let me know whether there is any limitation here, or how to resolve this.
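For reference, here is a minimal sketch of the dynamic-allocation setup being described, assuming the application is built with a PySpark SparkSession; note the property is spelled spark.dynamicAllocation.minExecutors, and the executor sizing below is a placeholder rather than something taken from the question:

```python
# Sketch of the described setup: keep the executor pool pinned at 50.
# Executor sizing values are placeholders, not taken from the question.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("realtime-queries")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "50")
    .config("spark.dynamicAllocation.maxExecutors", "50")  # pin the pool at 50
    .config("spark.shuffle.service.enabled", "true")       # needed for dynamic allocation on YARN
    .config("spark.executor.memory", "4g")                 # placeholder sizing
    .config("spark.executor.cores", "2")                   # placeholder sizing
    .getOrCreate()
)
```

Even with minExecutors set, YARN only grants as many containers as the queue and per-node memory/vCore limits allow, so the effective executor count can end up below the requested minimum.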
I have a Spark cluster managed by YARN with 200 GB of RAM and 72 vCPUs.
I also have a number of small PySpark applications that perform Spark Streaming tasks. These applications are long-running, with each micro-batch taking between 1 and 30 minutes.
However, I can only run 7 applications stably. When I tried to run an 8th application, all the jobs would restart frequently.
With 8 jobs running, resource consumption is only 120 GB and about 30 CPUs.
Could you help me understand why the jobs become unstable even though a large amount of memory (80 GB) is left?
By the way, about 16 GB of RAM is reserved for the OS.
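A quick back-of-the-envelope check using only the numbers stated above (nothing here is measured) shows how much headroom is nominally left:

```python
# Back-of-the-envelope arithmetic using only the figures from the question.
total_ram_gb = 200
os_reserved_gb = 16            # stated as configured for the OS
used_by_8_jobs_gb = 120

usable_ram_gb = total_ram_gb - os_reserved_gb         # ~184 GB left for YARN
leftover_ram_gb = usable_ram_gb - used_by_8_jobs_gb   # ~64 GB still free

total_vcores = 72
used_vcores = 30
leftover_vcores = total_vcores - used_vcores          # ~42 vCPUs still free

print(leftover_ram_gb, leftover_vcores)
```

The 80 GB quoted in the question is 200 - 120; once the 16 GB OS reservation is subtracted, roughly 64 GB remain, which is still substantial headroom.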
I'm working with Spark and YARN on an Azure HDInsight cluster, and I have some trouble understanding the relations between the workers' resources, executors and containers.
My cluster has 10 D13 v2 workers (8 cores and 56 GB of memory each), therefore I should have 80 cores available for Spark applications. However, when I try to start an application with the following parameters:
"executorMemory": "6G",
"executorCores": 5,
"numExecutors": 20,
I see 100 cores available in the YARN UI (therefore, 20 more than I should have). I ran a heavy query, and on the executor page of the YARN UI I see all 20 executors working, with 4 or 5 active tasks in parallel. I also tried pushing numExecutors to 25, and again I see all 25 working, with several tasks in parallel on each executor.
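For concreteness, the core arithmetic behind the mismatch (all numbers taken from the question):

```python
# Core arithmetic for the mismatch described above (numbers from the question).
workers = 10
cores_per_worker = 8                              # D13 v2
physical_cores = workers * cores_per_worker       # 80 cores in the cluster

num_executors = 20
executor_cores = 5
requested_cores = num_executors * executor_cores  # 100 cores requested

print(physical_cores, requested_cores)            # 80 vs. 100: 20 cores oversubscribed
```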
It was my understanding that 1 executor core = 1 cluster core, but this is not compatible with what I observe. The official Microsoft documentation (for instance, here) is not really helpful. It states:
An Executor runs on the worker node and is responsible for the tasks
for the application. The number of worker nodes and worker node size
determines the number of executors, and executor sizes.
but it does not say what the relation is. I suspect YARN is only bound by memory limits (i.e. I can run as many executors as I want, as long as I have enough memory), but I don't understand how this works in relation to the CPUs available in the cluster.
Do you know what I am missing?
I have a Dataproc cluster with 2 worker nodes (n1s2). An external server submits around 360 Spark jobs within an hour (with a couple of minutes' spacing between submissions). The first job completes successfully, but the subsequent ones get stuck and do not proceed at all.
Each job crunches some time-series numbers and writes to Cassandra. The time taken is usually 3-6 minutes when the cluster is completely free.
I feel this could be solved by simply scaling up the cluster, but that would become very costly for me.
What other options would best address this use case?
Running 300+ concurrent jobs on a cluster with 2 worker nodes doesn't sound feasible. You want to first estimate how much resource (CPU, memory, disk) each job needs, then plan the cluster size accordingly. YARN metrics such as available CPU, available memory, and especially pending memory are helpful for identifying situations where the cluster is short on resources.
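As a sketch of how those metrics could be checked, assuming the YARN ResourceManager REST API is reachable on its default port 8088 (the host name below is a placeholder):

```python
# Sketch: poll the YARN ResourceManager REST API for cluster-level metrics.
# Assumes the RM web endpoint is reachable on the default port 8088; the host
# name is a placeholder.
import requests

RM_METRICS_URL = "http://<resourcemanager-host>:8088/ws/v1/cluster/metrics"

def cluster_pressure():
    metrics = requests.get(RM_METRICS_URL, timeout=10).json()["clusterMetrics"]
    return {
        "availableMB": metrics["availableMB"],
        "availableVirtualCores": metrics["availableVirtualCores"],
        "appsPending": metrics["appsPending"],              # applications waiting in the queue
        "containersPending": metrics["containersPending"],  # container requests not yet granted
        # pendingMB is only exposed by newer Hadoop versions, so fall back gracefully.
        "pendingMB": metrics.get("pendingMB"),
    }

if __name__ == "__main__":
    print(cluster_pressure())
```

Sustained pending containers and pending memory together with many pending applications are a fairly direct sign that the jobs are queuing for resources rather than making progress.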
I am running a set of Spark SQL queries in parallel on Google Dataproc. I spin up my own cluster, and ideally it should consume all the resources available in the cluster. However, I see that only 40 vCores are used even though 320 vCores are available. Do you know how I can tune performance in this case?
I tried different numbers of cores and executors. Although some applications are pending, they don't seem to take the additional resources. I am spinning up a cluster of 20 nodes, but the computation still takes a lot of time.
I am setting the number of partitions to 50 to restrict the number of output files; even if I skip setting it, performance doesn't seem to improve.
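One way to limit the number of output files without also capping parallelism is to reduce the partition count only at write time. A sketch, where the query, table and output path are hypothetical placeholders and the shuffle-partition value is simply matched to the 320 vCores mentioned above:

```python
# Sketch only: keep shuffle parallelism high, reduce the partition count just
# before the write. Table name and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallel-sql-example")
    # Let shuffles use the cluster's full parallelism instead of capping it at 50.
    .config("spark.sql.shuffle.partitions", "320")
    .getOrCreate()
)

result = spark.sql("SELECT col_a, COUNT(*) AS cnt FROM example_table GROUP BY col_a")

# Coalesce only for the output, so the job still produces ~50 files.
result.coalesce(50).write.mode("overwrite").parquet("gs://example-bucket/output/")
```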
I am trying to set up a DC/OS Spark-Kafka-Cassandra cluster using 1 master and 3 private AWS m3.xlarge instances (each with 4 processors and 15 GB RAM).
I have some questions regarding the strange behaviour I encountered in a spike I did several days ago.
On each of the private nodes I have the following fixed resources reserved (I am talking about CPU usage; memory is not the issue):
0.5 CPUs for Cassandra on each node
0.3 - 0.5 CPUs for Kafka on each node
0.5 CPUs of Mesos overhead (the DC/OS UI simply shows 0.5 CPUs more occupied than the sum of all the services running on a node, so this presumably belongs to some sort of Mesos overhead)
the rest of the resources, around 2.5 CPUs, is available for running Spark jobs
Now, I want to run 2 streaming jobs so that they run on every node of the cluster. This requires me to set, in the dcos spark run command, the number of executors to 3 (since I have 3 nodes in the cluster), as well as the number of CPU cores to 3 (it is impossible to set 1 or 2, because as far as I can see the minimum number of CPUs per executor is 1). Of course, for each of the streaming jobs, 1 additional CPU in the cluster is occupied by the driver program.
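For reference, a sketch of the equivalent Spark properties under coarse-grained Mesos, where the executor count works out to roughly spark.cores.max divided by spark.executor.cores; the CPU values match the setup described above, while the memory setting is a placeholder:

```python
# Sketch (not the exact dcos spark run invocation): equivalent Spark properties
# for requesting three single-core executors under coarse-grained Mesos.
# The memory value is a placeholder; CPU values match the setup described above.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.cores.max", "3")         # total CPUs the application may take from Mesos
    .set("spark.executor.cores", "1")    # 1 core per executor -> up to 3 executors
    .set("spark.executor.memory", "2g")  # placeholder; memory is not the constraint here
)

spark = (
    SparkSession.builder
    .appName("streaming-job")
    .config(conf=conf)
    .getOrCreate()
)
```

Even with these settings, this only caps the application at three one-core executors in total; whether they land on three distinct nodes still depends on the offers each Mesos agent makes.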
The first strange thing I see is that instead of running 3 executors with 1 core each, Mesos launches 2 executors on 2 nodes: one with 2 CPUs and the other with 1 CPU. Nothing is launched on the 3rd node, even though it has enough resources. How can I force Mesos to run 3 executors on the cluster?
Also, when I run 1 pipeline with 3 CPUs, I see that those CPUs are blocked and cannot be reused by the other streaming pipeline, even though they are not doing any work. Why can't Mesos shift available resources between applications? Isn't that the main benefit of using Mesos? Or are there simply not enough resources to shift?
EDITED
Also, the question is: can I assign less than one CPU per executor?
Kindest regards,
Srdjan