Spark dropping executors while reading HDFS file - apache-spark

I'm observing a behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for the spark-shell session.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory=4g \
--driver-memory=4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up ~40,000 tasks. During execution, the number of running tasks starts at 500, then slowly drops to ~0 (I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and looking for possible ways to avoid it. The drop and spike happen only during the read stage; all the intermediate stages run in parallel without such huge spikes.
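A couple of quick checks from the same spark-shell session (a minimal sketch) to confirm that dynamic allocation is really off and how many executors are registered:
spark.conf.get("spark.dynamicAllocation.enabled", "false")   // expect "false"
sc.getExecutorMemoryStatus.size - 1                          // registered executors, minus the driver
sc.statusTracker.getActiveStageIds.foreach(println)          // stages currently running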
I'll be happy to provide any missing information.

Related

pyspark spark 2.4 on EMR 5.27 - cluster stop processing after listing files

Given an application converting csv to parquet (from and to S3) with little transformation:
import pyspark.sql.functions as fn  # 'fn' alias used below

for table in tables:
    df_table = spark.read.format('csv') \
        .option("header", "true") \
        .option("escape", "\"") \
        .load(path)
    df_one_seven_thirty_days = df_table \
        .filter(
            (df_table['date'] == fn.to_date(fn.lit(one_day)))
            | (df_table['date'] == fn.to_date(fn.lit(seven_days)))
            | (df_table['date'] == fn.to_date(fn.lit(thirty_days)))
        )
    for i in df_one_seven_thirty_days.schema.names:
        df_one_seven_thirty_days = df_one_seven_thirty_days.withColumnRenamed(i, colrename(i).lower())
    df_one_seven_thirty_days.createOrReplaceTempView(table)
    df_sql = spark.sql("SELECT * FROM " + table)
    df_sql.write \
        .mode("overwrite").format('parquet') \
        .partitionBy("customer_id", "date") \
        .option("path", path) \
        .saveAsTable(adwords_table)
I'm facing a difficulty with Spark on EMR.
Locally with spark-submit, this runs without difficulty (140 MB of data) and quite fast.
But on EMR, it's another story.
The first "adwords_table" is converted without problems, but the second one stays idle.
I've gone through the Spark jobs UI provided by EMR and I noticed that once this task is done:
Listing leaf files and directories for 187 paths:
Spark kills all executors, and 20 minutes later nothing more happens. All the tasks are "Completed" and no new ones are starting.
I'm waiting for the saveAsTable to start.
My local machine is 8 cores / 15 GB and the cluster is made of 10 r3.4xlarge nodes:
32 vCore, 122 GiB memory, 320 GB SSD storage, EBS storage: 200 GiB
The configuration uses maximizeResourceAllocation=true and I've only changed --num-executors / --executor-cores to 5.
Does anyone know why the cluster goes "idle" and doesn't finish the task? (It eventually crashes without errors 3 hours later.)
EDIT:
I made some progress by removing all Glue catalog connections and downgrading Hadoop to use hadoop-aws:2.7.3.
Now the saveAsTable works just fine, but once it finishes, I see the executors being removed, the cluster sits idle, and the step doesn't finish.
Thus my problem is still the same.
What I found out after many tries and headaches is that the cluster is actually still running / processing.
It is trying to write the data, but only from the master node.
Surprisingly enough, this doesn't show up in the UI, which gives the impression of being idle.
The writing takes a few hours, no matter what I do (repartition(1), bigger cluster, etc.).
The main problem here is the saveAsTable; I have no clue what it is doing that takes so long or makes the writing so slow.
So I went for write.parquet("hdfs:///tmp_loc") locally on the cluster, and then used s3-dist-cp to copy from HDFS to the S3 folder.
The performance is outstanding: I went from a saveAsTable that took 3 to 5 hours to write 17k rows / 120 MB down to 3 minutes.
As the data / schema might change at some point, I just execute a Glue save from a SQL request.
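Concretely, the two steps were roughly the following (a sketch; df_sql is the final DataFrame from the loop above, and the HDFS and S3 paths are placeholders):
df_sql.write.mode("overwrite").partitionBy("customer_id", "date").parquet("hdfs:///tmp_loc")
s3-dist-cp --src hdfs:///tmp_loc --dest s3://my-bucket/adwords_table/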
I am also facing the same issue. Is the issue related to the new version, EMR 5.27?
For me too, the job gets stuck on one executor for a very long time: 99% of the tasks complete, and it happens while reading the files.

How to stop Spark Structured Streaming from filling HDFS

I have a Spark Structured Streaming task running on AWS EMR that is essentially a join of two input streams over a one-minute time window. The input streams have a 1-minute watermark. I don't do any aggregation. I write results to S3 "by hand" with a foreachBatch and a foreachPartition per batch that converts the data to strings and writes them to S3.
I would like to run this for a long time, i.e. "forever", but unfortunately Spark slowly fills up HDFS storage on my cluster and eventually dies because of this.
There seem to be two types of data that accumulate: logs in /var, and .delta / .snapshot files in /mnt/tmp/.../. They don't get deleted when I kill the task with CTRL+C (or, when using yarn, with yarn application kill) either; I have to delete them manually.
I run my task with spark-submit. I tried adding
--conf spark.streaming.ui.retainedBatches=100 \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.cleaner.referenceTracking.cleanCheckpoints=true \
--conf spark.cleaner.periodicGC.interval=15min \
--conf spark.rdd.compress=true
without effect. When I add --master yarn the paths where the temporary files are stored change a bit, but the problem of them accumulating over time persists. Adding a --deploy-mode cluster seems to make the problem worse as more data seems to be written.
I used to have a Trigger.ProcessingTime("15 seconds") in my code, but removed it because I read that Spark might fail to clean up after itself if the trigger interval is too short compared to the compute time. This seems to have helped a bit; HDFS fills up more slowly, but temporary files are still piling up.
If I don't join the two streams, but just select on both and union the results to write them to S3, the accumulation of cruft in /mnt/tmp doesn't happen. Could it be that my cluster is too small for the input data?
I would like to understand why Spark is writing these temp files, and how to limit the space they consume. I would also like to know how to limit the amount of space consumed by logs.
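The write path looks roughly like this (a minimal Scala sketch; the join itself, the S3 upload, and the checkpoint location are placeholders):
import org.apache.spark.sql.{DataFrame, Row}
val joined: DataFrame = ???   // placeholder: the result of the windowed stream-stream join described above
def writeBatch(batch: DataFrame, batchId: Long): Unit = {
  batch.foreachPartition { rows: Iterator[Row] =>
    // convert each row to a string and upload the partition to S3 (AWS SDK call omitted)
  }
}
val query = joined.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "s3://my-bucket/checkpoints/join-job")   // placeholder; the question does not say where checkpoints go
  .start()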
Spark fills HDFS with logs because of https://issues.apache.org/jira/browse/SPARK-22783
One needs to set spark.eventLog.enabled=false so that no logs are created.
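If the session is built in code rather than configured purely through spark-submit flags, the same setting can go on the builder (a minimal sketch; the flag must be in place before the context starts):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("streaming-join")                    // placeholder application name
  .config("spark.eventLog.enabled", "false")    // equivalent to --conf spark.eventLog.enabled=false
  .getOrCreate()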
In addition to adrianN's answer, on the EMR side they retain application logs on HDFS - see https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/

Spark job fails when cluster size is large, succeeds when small

I have a spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). The most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with this setting, the number of executors seen in the Spark jobs UI was 500 (one per node)
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem was that too many executors were being spawned and the driver had the overhead of tracking all of them. I tried reducing the number of executors by increasing executor-memory to 4g. This did not help.
I tried changing the driver's instance type to r3.8xlarge; this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs fine. Does anyone have another hypothesis on why this would happen?
Well, this is partly a matter of understanding how Spark's resource allocation works.
According to your information, you have 500 nodes with 4 cores each, so 2,000 cores in total. What you are doing with your request is creating 4,000 executors with 3 cores each. That means you are requesting 12,000 cores for your cluster, and there is nothing like that available.
This kind of RPC timeout error is regularly associated with how many JVMs you start on the same machine; the machine cannot respond in time because too many things are happening at once.
You need to know that --num-executors is better tied to the number of nodes you have, and the number of cores should match the cores available on each node.
For example, an m3.xlarge has 4 cores and 15 GB of RAM. What is the best configuration to run a job there? That depends on what you are planning to do. If you are going to run just one job, I suggest you set it up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This will allow your job to run fine. If your data fits into your workers without problems, it's better to change default.parallelism to 2000, or you are going to lose a lot of time on shuffle.
But the best approach, I think, is to keep the dynamic allocation that EMR enables by default: just set the number of cores, the parallelism and the memory, and your job will run like a charm.
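To spell out the arithmetic behind this answer (a rough sketch, assuming m3.xlarge = 4 vCPU / 15 GB RAM):
val nodes          = 500
val coresPerNode   = 4
val totalCores     = nodes * coresPerNode   // 2000 cores actually available
val requestedCores = 4000 * 3               // 12000 cores asked for by the original spark-submit
// one executor per node: 500 executors * 4 cores uses the whole cluster,
// with roughly 10-12 GB per executor left after YARN / OS overhead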
I experimented with a lot of configurations, modifying one parameter at a time, with 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So the driver is probably overwhelmed by the large number of shuffle blocks it has to track when there are more partitions, and hence the lower partition count helps.

How to launch parallel spark jobs?

I think I don't understand well enough how to launch jobs.
I have one job which takes 60 seconds to finish. I run it with the following command:
spark-submit --executor-cores 1 \
--executor-memory 1g \
--driver-memory 1g \
--master yarn \
--deploy-mode cluster \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=4 \
--conf spark.dynamicAllocation.initialExecutors=4 \
--conf spark.executor.instances=4 \
If I increase the number of partitions in code and the number of executors, the app finishes faster, which is fine. But if I increase only executor-cores, the finish time is the same, and I don't understand why. I expect the time to be lower than the initial time.
My second problem is: if I launch the above code twice, I expect both jobs to finish in 60 seconds, but this doesn't happen. Both jobs finish after 120 seconds and I don't understand why.
I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2 threads). From what I saw in the default EMR configuration, YARN is set to FIFO (default) mode with the CapacityScheduler.
What do you think about these problems?
Spark creates partitions based on logic inside the data source. In your case it probably creates a number of partitions that is smaller than the number of executors * executor cores, so just increasing the cores will not make the job run faster, as the extra cores would be idle. When you also increase the number of partitions, it can work faster.
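A quick way to check this (a minimal sketch; the input path and format are placeholders for whatever the job actually reads):
val df = spark.read.parquet("hdfs:///path/to/input")
println(df.rdd.getNumPartitions)        // if this is below executors * executor-cores, extra cores sit idle
val widened = df.repartition(32)        // repartition before the heavy work so the added cores are used; 32 is arbitrary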
When you run spark-submit twice, there is a good chance that dynamic allocation reaches its maximum number of executors before the second submit starts (it takes ~4 seconds by default in your case). Depending on how YARN is configured, this might fill up all of the available executors (either because the number of threads defined is too small or because memory is filled up). In any case, if this indeed happens, the second spark-submit will not start processing until some executor is freed, meaning the total takes the sum of the two times.
BTW, remember that in cluster mode the driver also runs on the cluster and takes up a container of its own...

Spark job randomly hangs in the middle of a stage while reading data

I have a Spark job which reads data, transforms it (shuffle involved) and writes the data back to disk. Different instances of the same Spark job are used to process separate data in parallel (each has its own input/output dir). Some of the jobs, so far roughly 3 out of 200, got stuck in the middle of the reading stage. By stuck I mean no tasks finish after some point, there is no progress in the stage, and there are no new error logs from executors in the UI; a job can run for half an hour and then it stops making progress. When I rerun the whole set of jobs, everything can be fine, or some other jobs can hang again, this time different ones (another in/out dir). We use Spark 1.6.0 (CDH 5.8). We use dynamic allocation, and such a job can eat up even more resources after it is already "stuck". Any idea what can be done in such situations?
I start the jobs using these properties:
--master yarn-cluster
--driver-memory 8g
--executor-memory 4g
--conf spark.yarn.executor.memoryOverhead=1024
--conf spark.dynamicAllocation.maxExecutors=2200
--conf spark.yarn.maxAppAttempts=2
--conf spark.dynamicAllocation.enabled=true
UPDATE
Disabling dynamic allocation seems to have solved the issue; we are going to keep running our jobs for another several days to confirm that this was really the reason.
