Spark looses all executors one minute after starting - apache-spark

I run pyspark on 8 node Google dataproc cluster with default settings.
Few seconds after starting I see 30 executor cores running (as expected):
>>> sc.defaultParallelism
30
One minute later:
>>> sc.defaultParallelism
2
From that point all actions run on only 2 cores:
>>> rng = sc.parallelize(range(1,1000000))
>>> rng.cache()
>>> rng.count()
>>> rng.getNumPartitions()
2
If I run rng.cache() while cores are still connected they stay connected and jobs get distributed.
Checking on monitoring app (port 4040 on master node) shows executors are removed:
Executor 1
Removed at 2016/02/25 16:20:14
Reason: Container container_1456414665542_0006_01_000002 exited from explicit termination request."
Is there some setting that could keep cores connected without workarounds?

For the most part, what you are seeing is actually just the differences in how Spark on YARN can be configured vs spark standalone. At the moment, YARN's reporting of "VCores Used" doesn't actually correctly correspond to a real container reservation of cores, and containers are actually just based on the memory reservation.
Overall there are a few things at play here:
Dynamic allocation causes Spark to relinquish idle executors back to YARN, and unfortunately at the moment spark prints that spammy but harmless "lost executor" message. This was the classical problem of spark on YARN where spark originally paralyzed clusters it ran on because it would grab the maximum number of containers it thought it needed and then never give them up.
With dynamic allocation, when you start a long job, spark quickly allocates new containers (with something like exponential ramp-up to quickly be able to fill a full YARN cluster within a couple minutes), and when idle, relinquishes executors with the same ramp-down at an interval of about 60 seconds (if idle for 60 seconds, relinquish some executors).
If you want to disable dynamic allocation you can run:
spark-shell --conf spark.dynamicAllocation.enabled=false
gcloud dataproc jobs submit spark --properties spark.dynamicAllocation.enabled=false --cluster <your-cluster> foo.jar
Alternatively, if you specify a fixed number of executors, it should also automatically disable dynamic allocation:
spark-shell --conf spark.executor.instances=123
gcloud dataproc jobs submit spark --properties spark.executor.instances=123 --cluster <your-cluster> foo.jar

Related

Spark is dropping all executors at the beginning of a job

I'm trying to configure a spark job to run with fixed resources on a Dataproc cluster, however after the job was running for 6 minutes I noticed that all but 7 executors had been dropped. 45 minutes later the job has not progressed at all, and I cannot find any errors or logs to explain.
When I check the timeline in the job details it shows all but 7 executors being removed at the 6 minute mark, with the message Container [really long number] exited from explicit termination request..
The command I am running is:
gcloud dataproc jobs submit spark --region us-central1 --cluster [mycluster] \
--class=path.to.class.app --jars="gs://path-to-jar-file" --project=my-project \
--properties=spark.executor.instances=72,spark.driver.memory=28g,spark.executor.memory=28g
My cluster is 1 + 24 n2-highmem16 instances if that helps.
EDIT: I terminated the job, reset, and tried again. The exact same thing happened at the same point in the job (Job 9 Stage 9/12)
Typically that message is expected to be associated with Spark Dynamic Allocation; if you want to always have a fixed number of executors, you can try to add the property:
...
--properties=spark.dynamicAllocation.enabled=false,spark.executor.instances=72...
However, that probably won't address the root problem in your case aside from seeing idle executors continue to stick around; if the dynamic allocation was relinquishing those executors, that would be due to those tasks having completed already but where your remaining executors for whatever reason are not yet done for a long time. This often indicates some kind of data skew where the remaining executors have a lot more work to do than the ones that already completed for whatever reason, unless the remaining executors were simply all equally loaded as part of a smaller phase of the pipeline, maybe in a "reduce" phase.
If you're seeing lagging tasks out of a large number of equivalent tasks, you might consider adding a repartition() step to your job to chop it up more fine-grained in the hopes of spreading out those skewed partitions, or otherwise changing the way your group or partition your data through other means.
Fixed. The job was running out of resources. Allocated some more executors to the job and it completed.

Why spark application are not running on all nodes

I installed the following spark benchmark:
https://github.com/BBVA/spark-benchmarks
I run Spark on top of YARN on 8 workers but I only get 2 running executors during the job (TestDFSIO).
I also set executor-cores to be 9 but only 2 are running.
Why would that happen?
I think the problem is coming from YARN because I get a similar (almost) issue with TestDFSIO on Hadoop. In fact, at the beginning of the job, only two nodes run, but then all the nodes execute the application in parallel!
Note that I am not using HDFS for storage!
I solved this issue. What I've done is that I set the number of cores per executor to 5 (--executor-cores) and the total number of executors to 23 (--num-executors) which was at first 2 by default.

Spark (yarn-client mode) not releasing memory after job/stage finishes

We are consistently observing this behavior with interactive spark jobs in spark-shell or running Sparklyr in RStudio etc.
Say I launched spark-shell in yarn-client mode and performed an action, which triggered several stages in a job and consumed x cores and y MB memory. Once this job finishes, and the corresponding spark session is still active, the allocated cores & memory is not released even though that job is finished.
Is this normal behavior?
Until the corresponding spark session is finished, the ip:8088/ws/v1/cluster/apps/application_1536663543320_0040/
kept showing:
y
x
z
I would assume, Yarn would dynamically allocate these unused resources to other spark jobs which are awaiting resources.
Please clarify if I am missing something here.
You need to play with configs around dynamic allocation https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation -
Set spark.dynamicAllocation.executorIdleTimeout to a smaller value say 10s. Default value of this parameter is 60s. This config tells spark that it should release the executor only when it is idle for this much time.
Check spark.dynamicAllocation.initialExecutors/spark.dynamicAllocation.minExecutors. Set these to a small number - say 1/2. The spark application will never downscale below this number unless the SparkSession is closed.
Once you set these two configs, your application should release the extra executors once they are idle for 10 seconds.
Yes the resources are allocated until the SparkSession is active. To handle this better you can use dynamic allocation.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html

Spark job fails when cluster size is large, succeeds when small

I have a spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). Most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with this setting, number of executors seen in spark jobs UI were 500 (one per node)
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem would be because of there are too many executors being spawned and driver has an overhead of tracking these executors. I tried reducing the number of executors by increasing the executor-memory to 4g. This did not help.
I tried changing the instance type of driver to r3.8xlarge, this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs file. Does any one have any other hypothesis on why this would happen?
Well this is a little bit a problem to understand how is the allocation of Spark works.
According to your information, you have 500 nodes with 4 cores each. So, you have 4000 cores. What you are doing with your request is creating 4000 executors with 3 cores each. It means that you are requesting 12000 cores for your cluster and there is no thing like that.
This error of RPC timeout is regularly associated with how many jvms you started in the same machine, and that machine is not able to respond in proper time due to much thing happens at the same time.
You need to know that, --num-executors is better been associated to you nodes, and the number of cores should be associated to the cores you have in each node.
For example, the configuration of m3.xLarge is 4 cores with 15 Gb of RAM. What is the best configuration to run a job there? That depends what you are planning to do. See if you are going to run just one job I suggest you to set up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This will allow you job to run fine, if you don't have problem to fit your data to your worker is better change the default.parallelism to 2000 or you are going to lost lot of time with shuffle.
But, the best approach I think that you can do is keeping the dynamic allocation that EMR enables it by default, just set the number of cores and the parallelism and the memory and you job will run like a charm.
I experimented with lot of configurations modifying one parameter at a time with 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So probably the driver is overwhelmed with a the large number of shuffles that has to be done when there are more partitions and hence the lower partition helps.

How to release seemingly inactive executors from a long-running PySpark framework?

Here's my problem. Let's say I have a long-running PySpark framework. It has thousands of tasks that can all be executed in parallel. I get allocated 1,000 cores at the beginning on many different hosts. Each task needs one core. Then, when those finish, the host holds onto one core and has no active tasks. Since there are a large number of hosts, what can happen is that a larger and larger percentage of my cores are allocated to executors that don't have any active tasks. So I can have 1000 cores allocated, but only 100 active tasks. The other 900 cores are allocated to executors that have no active tasks. How can I improve this? Is there a way to shut down executors that aren't doing anything? I am currently using PySpark 1.2, so it'd be great for the functionality to be in that version, but would be happy to hear about solutions (or better solutions) in newer versions. Thanks!
If you do not specify the number of executors that Spark should use, Spark allocates executors as long as Spark has at least 1 task pending in its queue. You can set an upper limit to the number of executors that Spark can dynamically allocate by using this parameter: spark.dynamicAllocation.maxExecutors.
In other word, when launching spark, use:
pyspark --master yarn-client --conf spark.dynamicAllocation.maxExecutors=1000
instead of
pyspark --master yarn-client --num-executors=1000
By default, Spark will release executors after 60s of non-activity.
Note, if you .persist() your Spark.DataFrame, make sure to .unpersist() them otherwise Spark will not release the executors.

Resources