Spark stage stays stuck in pending - apache-spark

I am running a rather simple Spark job: read a couple of Parquet datasets (10-100 GB each), do a bunch of joins, and write the result back to Parquet.
Spark always seems to get stuck on the last stage. The stage stays "Pending" even though all previous stages have completed and there are executors waiting. I've waited up to 1.5 hours and it just stays stuck.
I have tried the following desperate measures:
Using smaller datasets appears to work, but then the plan changes (e.g., some broadcast joins start to pop up) so that doesn't really help to troubleshoot.
Allocating more executor or driver memory doesn't seem to help.
Any idea?
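One way to keep the plan from changing when testing on smaller inputs might be to turn off size-based automatic broadcast joins. This is a minimal sketch, assuming an existing SparkSession is passed in as `spark`. The helper name `disable_auto_broadcast` is hypothetical; `spark.sql.autoBroadcastJoinThreshold` is a standard Spark SQL setting, and a value of -1 disables automatic broadcasting.

```python
def disable_auto_broadcast(spark):
    """Turn off size-based automatic broadcast joins on an existing session.

    `spark` is assumed to be a SparkSession (hypothetical caller-supplied
    object). Returns the value now stored in the session configuration.
    """
    key = "spark.sql.autoBroadcastJoinThreshold"
    # -1 disables automatic broadcasting based on estimated table size,
    # so small test inputs should no longer flip joins to broadcast joins.
    spark.conf.set(key, "-1")
    return spark.conf.get(key)
```

Calling `disable_auto_broadcast(spark)` before building the joins should let a scaled-down run use the same join strategy as the full-size one, which makes the smaller run more useful for troubleshooting.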
Details
Running Spark 2.3.1 on Amazon EMR (5.17)
client-mode on YARN
Driver thread dump
Appears similar to "Spark job showing unknown in active stages and stuck", although I can't be sure
Job details showing the stage staying in pending:

Related

Spark limit + write is too slow

I have a dataset of 8 billion records stored in Parquet files in Azure Data Lake Gen 2.
I wanted to copy a sample of 2 billion records to a different location for some benchmarking, so I did the following:
df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', '/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', '/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')
This job is running on 8 nodes of 8-core, 28 GB RAM machines (8 worker nodes + 1 master node). It has been running for over an hour and not a single file has been written yet. The load finished within 2 s, so I know the limit + write action is the bottleneck (although the load only infers the schema and lists the files; it does not actually read the data).
So I started inspecting the Spark UI for clues, and here are my observations:
2 Jobs have been created by Spark
The first job took 35 mins. Here's the DAG
The second job has been running for about an hour now with no progress at all. The second job has two stages in it.
Stage 3 has one running task, but if I open the stages panel, I can't see any details of the task. I also don't understand why it's doing a shuffle when all I have is a limit on my DataFrame. Does limit really need a shuffle? Even if it is shuffling, 1 hour seems awfully long to shuffle data around.
Also, if this is what's really performing the limit, what did the first job do? Just read the data? 35 minutes for that also seems too long, but for now I'd settle for the job completing.
Stage 4, which I believe is the actual write stage, is just stuck, presumably waiting for this shuffle to end.
I am new to Spark and fairly clueless about what's happening here. Any insights into what I'm doing wrong would be very useful.
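The single running task in the shuffle stage is consistent with a global limit being applied in one place. If an approximate sample size is acceptable, a fraction-based sample may sidestep that. This is a minimal sketch, assuming `df` is the DataFrame loaded above and the total row count is known. The helper names and the seed are hypothetical, and `sample` returns roughly, not exactly, the requested fraction of rows.

```python
def sample_fraction(target_rows, total_rows):
    """Fraction of rows to keep so that about `target_rows` remain."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, target_rows / total_rows)


def write_sample(df, target_rows, total_rows, out_path, seed=42):
    """Write an approximate sample of `df` as Parquet (hypothetical helper).

    `df` is assumed to be a PySpark DataFrame. Sampling is applied per
    partition, so the row count of the output is only approximate.
    """
    fraction = sample_fraction(target_rows, total_rows)
    (df.sample(withReplacement=False, fraction=fraction, seed=seed)
       .write.format("parquet")
       .save(out_path))
    return fraction
```

For the numbers in this question, `sample_fraction(2_000_000_000, 8_000_000_000)` gives 0.25, so `write_sample` would keep about a quarter of the rows.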

Spark is dropping all executors at the beginning of a job

I'm trying to configure a Spark job to run with fixed resources on a Dataproc cluster; however, after the job had been running for 6 minutes, I noticed that all but 7 executors had been dropped. 45 minutes later the job has not progressed at all, and I cannot find any errors or logs that explain why.
When I check the timeline in the job details, it shows all but 7 executors being removed at the 6-minute mark, with the message Container [really long number] exited from explicit termination request.
The command I am running is:
gcloud dataproc jobs submit spark --region us-central1 --cluster [mycluster] \
--class=path.to.class.app --jars="gs://path-to-jar-file" --project=my-project \
--properties=spark.executor.instances=72,spark.driver.memory=28g,spark.executor.memory=28g
My cluster is 1 + 24 n2-highmem-16 instances, if that helps.
EDIT: I terminated the job, reset, and tried again. The exact same thing happened at the same point in the job (Job 9 Stage 9/12)
Typically that message is expected to be associated with Spark Dynamic Allocation; if you want to always have a fixed number of executors, you can try to add the property:
...
--properties=spark.dynamicAllocation.enabled=false,spark.executor.instances=72...
However, that probably won't address the root problem in your case; it would only keep the idle executors around. If dynamic allocation was relinquishing those executors, that would be because their tasks had already completed while your remaining executors still had a long way to go. This often indicates some kind of data skew, where the remaining executors have a lot more work to do than the ones that already finished, unless the remaining executors are simply all equally loaded as part of a smaller phase of the pipeline, for example a "reduce" phase.
If you're seeing lagging tasks out of a large number of equivalent tasks, you might consider adding a repartition() step to your job to split the data into finer-grained partitions in the hope of spreading out the skewed ones, or otherwise changing the way you group or partition your data.
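A rough way to check whether a stage has lagging tasks is to compare the slowest task with the median task. This is a minimal sketch, assuming task durations in seconds have been copied from the stage page of the Spark UI. The helper names and the threshold of 5 are arbitrary assumptions, not Spark settings.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task duration."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    mid = median(task_durations_s)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(task_durations_s) / mid


def looks_skewed(task_durations_s, threshold=5.0):
    """True if the slowest task takes `threshold` times the median or more."""
    return skew_ratio(task_durations_s) >= threshold
```

A ratio close to 1 suggests evenly loaded tasks, in which case repartitioning is unlikely to help; a large ratio points at a few oversized partitions.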
Fixed. The job was running out of resources. Allocated some more executors to the job and it completed.

Spark long running jobs with dataset

I have Spark code that used to run as batch jobs (each job takes from a few seconds to a few minutes). Now I want to take the same code and run it as a long-running application. To do this, I plan to create the Spark context only once and then, in a while loop, wait for new configs/tasks to arrive and execute them.
So far, whenever I run this code, my application stops after 5-6 iterations without any exception or error printed. The long-running job has been assigned 1 executor with 10 GB of memory and a Spark driver with 4 GB of memory (which was sufficient for our batch jobs). So my question is: what do we need to change within the code itself to move from small batch jobs to long-running jobs? I have seen this useful link - http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/ - but it is mostly about Spark configurations for keeping applications running for a long time.
Spark version - 2.3 (can move to Spark 2.4.1), running on a YARN cluster
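The loop described above can be sketched roughly as follows. This is only an illustration of the pattern, assuming tasks arrive on an in-process queue. The names `run_loop`, `handle_task`, and `STOP` are hypothetical, and `session` stands in for the single long-lived Spark session. It logs every failure so that an iteration cannot end silently, but it does not explain why the application in the question stops.

```python
import logging
import queue

log = logging.getLogger("long_running_job")
STOP = object()  # hypothetical sentinel that asks the loop to exit


def run_loop(session, task_queue, handle_task, poll_seconds=1.0):
    """Process tasks with one shared session until STOP is received.

    `session` is created once by the caller and reused for every task.
    Returns the number of tasks that completed without raising.
    """
    completed = 0
    while True:
        try:
            task = task_queue.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # nothing to do yet, keep waiting
        if task is STOP:
            return completed
        try:
            handle_task(session, task)
            completed += 1
        except Exception:
            # Log with traceback and keep the loop alive.
            log.exception("task %r failed; continuing", task)
```

Keeping the session outside the loop and catching exceptions per task means one bad task does not take the whole application down with it.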

Possible reasons that spark waits and does not schedule tasks to run?

This might be a very generic question, but I hope someone can give me a hint. I found that my Spark job sometimes seems to hit a "pause", often several times in one run:
The nature of the job is: read ORC files (from a Hive table), filter by certain columns, no join, then write out to another Hive table.
There were 64K tasks in total for my job / stage (FileScan orc, followed by Filter, Project).
The application has 500 executors, each with 4 cores. Initially, about 2000 tasks were running concurrently and things looked good.
After a while, I noticed the number of running tasks dropped all the way to around 100. Many cores/executors were just waiting with nothing to do. (I checked the logs from these waiting executors and there were no errors. All their assigned tasks were done; they were just waiting.)
After about 3-5 minutes, these waiting executors suddenly got tasks assigned and were working happily again.
What could be the reasons for this? The application is running from spark-shell (--master yarn --deploy-mode client, with the number of executors, sizes, etc. specified).
Thanks!
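The numbers in this question can be checked with a little arithmetic. This is a minimal sketch, assuming one CPU per task (the default for `spark.task.cpus`) and reading "64K" as 64,000 tasks. The helper names are hypothetical.

```python
import math


def task_slots(executors, cores_per_executor, cpus_per_task=1):
    """Number of tasks that can run at the same time."""
    return executors * (cores_per_executor // cpus_per_task)


def waves(total_tasks, slots):
    """Number of full or partial rounds needed to run all tasks."""
    return math.ceil(total_tasks / slots)


def utilization(running_tasks, slots):
    """Share of task slots that are busy."""
    return running_tasks / slots
```

With 500 executors of 4 cores each, `task_slots(500, 4)` is 2000, which matches the initial concurrency. `waves(64_000, 2000)` is 32, and `utilization(100, 2000)` is 0.05, so during a pause about 95% of the slots are idle.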

OOM error - unable to acquire 261244 bytes of memory, got 0

I am trying to run a Spark job that is both data- and processing-intensive on Dataproc, and I am getting an OOM with the error below:
'OOM error - unable to acquire 261244 bytes of memory, got 0'
To give an overview: on a collect action, the job shuffles terabytes of data, roughly 6 TB.
As far as I know, this error occurs when an executor runs out of memory. But when I increase executor memory, the number of executors per node decreases, which leaves fewer vcores in use and makes the job run slowly.
Can anyone please help me with this error? I have tried everything suggested on Stack Overflow.
Dataproc configuration
I am using highmem-16 Dataproc machines.
The code can't be shared, as it is massive with a lot of transformations, and this is the first action on the data.
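The trade-off between executor memory and executors per node can be estimated with a little arithmetic. This is a minimal sketch, assuming Spark's default memory overhead of 10% of executor memory with a 384 MB minimum, and ignoring YARN's rounding to its allocation increment. The helper names are hypothetical, and the 80 GiB of YARN memory per node used in the usage note is a placeholder, not a measured Dataproc value.

```python
def container_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Approximate YARN container size for one executor, in MB."""
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead


def executors_per_node(yarn_memory_mb, node_vcores,
                       executor_memory_mb, executor_cores):
    """How many executors fit on one node, limited by memory and by cores."""
    by_memory = yarn_memory_mb // container_mb(executor_memory_mb)
    by_cores = node_vcores // executor_cores
    return min(by_memory, by_cores)
```

Under those assumptions, a 16-vcore node with 81920 MB for YARN fits four 4-core executors at 10240 MB each, but only three at 24576 MB each, which leaves 4 of the 16 vcores unused. That matches the slowdown described in the question.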
