I have a Spark EMR job that runs fine until a point in the code where it gets stuck. The step remains in the Running state and I don't see any error in the logs. I believe the code itself is fine, since I have tested it in my Jupyter notebook. I am unable to debug or figure out what the issue is, because the step is not failing and no error logs are produced in the stderr file.
When I try to execute only the part of the code that my step is stuck on, it executes and the step ends successfully. Also, the monitoring shows no issues with memory, nodes, etc. Any help identifying the issue?
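One thing that can narrow this down (a sketch, not a fix): label each Spark action with a job description, so the Spark UI and the history server show exactly which job the step is hanging on. This assumes a PySpark step; the paths and transformations below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("debug-stuck-step").getOrCreate()
sc = spark.sparkContext

# Label each action so the Spark UI / history server shows which job hangs
sc.setJobDescription("step 1: load input")
df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical path

sc.setJobDescription("step 2: aggregate")
counts = df.groupBy("key").count()  # hypothetical transformation

sc.setJobDescription("step 3: write output")
counts.write.mode("overwrite").parquet("s3://my-bucket/output/")  # hypothetical path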
Related
Instead of the expected output from a display(my_dataframe), I get Failed to fetch the result. Retry when looking at the completed run (which is also marked as a success).
The notebook runs fine, including the expected outputs, when run as an on-demand notebook (same cluster config, etc.). It seems to be a UI issue? I honestly don't even know where to look for possible causes.
I had the same problem while running a job on Azure Databricks, and restarting my computer (or maybe the explorer...) helped.
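If the failure really is only in the results view, a quick sanity check is to bypass the notebook renderer entirely; a minimal sketch, reusing my_dataframe from the question (the output path is hypothetical).
# Print a small sample to the driver log instead of the notebook widget
my_dataframe.show(20, truncate=False)

# Or persist a sample so it can be inspected outside the notebook UI
my_dataframe.limit(100).write.mode("overwrite").json("/tmp/display_check")  # hypothetical path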
I am running a notebook that executes other notebooks through the dbutils.notebook.run() command. Whenever I run this job manually, it executes without any issues. Whenever the job runs nightly, the ephemeral notebook runs return the error:
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I was able to resolve the error in some other notebooks by increasing the number of workers on the cluster. I've tried doing that on this workflow as well without any luck, and I can't find any documentation indicating that it should be necessary anyway.
Any insight would be helpful.
Increasing the number of workers on the cluster pool fixed the problem. I'm not certain of the correct number of workers needed per ephemeral run; it would seem a minimum of 2 per run is needed, and they are not necessarily returned immediately when the run is completed.
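In case it helps others: this exception is typically raised when the target path exists but contains no Parquet files at the moment of the read (for a nightly job, an upstream writer may not have finished yet), so schema inference has nothing to work with. A minimal sketch of the workaround the message itself suggests, passing an explicit schema; spark is the notebook's session, and the path and columns are hypothetical.
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the dataset the nightly run reads
expected_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

df = (spark.read
      .schema(expected_schema)              # skip schema inference entirely
      .parquet("dbfs:/mnt/data/events/"))   # hypothetical path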
My Spark application is failing with the above error.
Actually, my Spark program is writing its logs to that directory. Both stderr and stdout are being written on all the workers.
My program used to work fine earlier, but yesterday I changed the folder pointed to by SPARK_WORKER_DIR. Today I put the old setting back and restarted Spark.
Can anyone give me a clue as to why I am getting this error?
In my case the problem was caused by enabling
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
in spark-env.sh. This setting should remove old app/driver data directories, but it seems to be buggy and removes data of running apps.
Just comment out that line and see if it helps.
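For reference, the worker cleanup knobs live together in spark-env.sh; here is the excerpt with the suspect flag commented out (the interval and TTL values shown are just the Spark defaults).
# spark-env.sh -- standalone worker cleanup options, disabled per the workaround above
# SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
#   -Dspark.worker.cleanup.interval=1800 \
#   -Dspark.worker.cleanup.appDataTtl=604800"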
On running a Spark program on a Spark cluster, I get a message saying there was an error initializing the Spark context, followed by an error saying methods cannot be called on a stopped Spark context. Sometimes the same job runs successfully without error.
This does not always happen: sometimes the job runs fine, while at other times it gives the above message. I tried restarting Spark using the ./stop-all.sh and ./start-all.sh commands, and the logs say Spark started successfully. What can be the cause of this occasional error message? Can it be related to disk space, or some other reason?
Thank you...
This issue seems to have been occurring due to the disk space being full on the nodes. I do not get this error now; cleaning up the unneeded files seems to have resolved the issue.
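If you want to catch this earlier next time, a small sketch that checks free disk space on the driver before kicking off the job (shutil is in the standard library; the 10 GB threshold is an arbitrary example, and the workers would need an equivalent check).
import shutil

# Pre-flight check on the driver node only; threshold and path are arbitrary
total, used, free = shutil.disk_usage("/")
free_gb = free / 1024 ** 3
print(f"free disk on driver: {free_gb:.1f} GB")
if free_gb < 10:
    raise RuntimeError("low disk space; clean up before submitting the job")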
Sporadically, I get the following error when running Apache Spark on EC2:
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
It happens randomly, and the only way to fix it is to shut down the cluster and restart everything. Clearly, this is inefficient. Why does this error occur randomly? I can see in the console that all my instances are perfectly fine. I am trying to work with a 6 MB file; it can't possibly be out of memory. Every time I attempt to do something on Apache Spark on EC2, some random error seems to pop up. This most recent error occurred while running a program I have run 5,000+ times before, on the same cluster type. Why are there so many sporadic errors? And what does this one even mean, considering my instances and master are working perfectly?
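For what it's worth, this warning usually means the application is requesting more cores or executor memory than any registered worker can currently offer, for example because a half-dead previous application is still holding its cores. A hedged sketch of capping the request so it fits the cluster; the numbers are placeholders, not recommendations.
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Keep the resource request below what the smallest worker can offer;
# the values here are placeholders, not recommendations
conf = (SparkConf()
        .set("spark.cores.max", "4")           # total cores the app may claim
        .set("spark.executor.memory", "2g"))   # must fit within a worker's RAM

spark = SparkSession.builder.config(conf=conf).appName("ec2-job").getOrCreate()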