pyspark job execution in yarn cluster - apache-spark

I am trying to understand how spark job work in yarn cluster
I am using below commands to submit job
spark-submit --master yarn --deploy-mode cluster sparksessionexample.py
After submitting job console shows below console log
2020-05-29 20:52:48,668 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_libs__2908230569257238890.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_libs__2908230569257238890.zip
2020-05-29 20:53:14,164 INFO yarn.Client: Uploading resource file:/home/hadoop/pythonprojects/Python/src/spark_jobs/sparksessionexample.py -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/sparksessionexample.py
2020-05-29 20:53:14,610 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/pyspark.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/pyspark.zip
2020-05-29 20:53:15,984 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/py4j-0.10.7-src.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/py4j-0.10.7-src.zip
2020-05-29 20:53:18,362 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_conf__7123551182035223076.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_conf__.zip
I just to understand how yarn execute sparksessionexample.py file, i mean whether it create python virtual env on node? as above log shows only uploading lib, confs but what about python client to execute sparksessionexample.py?
Can anyone help understand this?

The "Spark client" is used to bootstrap the Spark job execution.
In your case it is the only thing that runs on your local machine, because you requested cluster execution mode:
the "client" contacts the cluster manager (here YARN Resource Manager, could be Kubernetes Master, etc.) to start the Spark driver inside an AppMaster container
then the driver contacts again the cluster manager to request some containers for the executors
then the driver runs your Python code and distributes the work to the executors
finally the driver de-allocates its executors and itself
at this point the "client" notices that the YARN job has reached success or failure status, and can terminate
In short, the "client" never gets any kind of useful information from the driver running inside the cluster. You must inspect the YARN logs for the container running the driver (it's the AppMaster, typically number 00001).
If you want to see some feedback from the driver, then run your job in client execution mode -- it means the driver will run in the same JVM as the "client", in your local machine, and spit its logs in your console.

Related

Spark Structure Streaming job failing in cluster mode

I am using spark-sql-2.4.1 v in my application.
While writing data on to hdfs folder I am facing this issue in spark-streaming application
Error:
yarn.Client: Deleted staging directory hdfs://dev/user/xyz/.sparkStaging/application_1575699597805_47
20/02/24 14:02:15 ERROR yarn.Client: Application diagnostics message: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user= xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
.
.
.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:350)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:251)
While writing data on to HDFS folder I am facing this issue in spark-streaming application. When I run in yarn-cluster mode I face this issue i.e.
--master yarn \
--deploy-mode cluster \
But when I run in “yarn-client” mode it runs fine i.e.
--master yarn \
--deploy-mode client \
What is the root cause of this problem?
Fundamental question here, why it is trying to write in "/tmp/hadoop-admin/" instead of respective user directory i.e. hdfs://qa2/user/xyz/?
I have come across this fix:
https://issues.apache.org/jira/browse/SPARK-26825
How can I implement it in my spark-sql application?
The only difference between the working --deploy-mode client and the failing --deploy-mode cluster cases is the location of the driver. In client deploy mode, the driver runs on the machine you execute spark-submit (which is usually an edge node that is configured to use a YARN cluster, but it is not part of it) while in cluster deploy mode the driver runs as part of a YARN cluster (one of the nodes under control of YARN).
It looks like you've got a misconfigured edge node.
I'd not be surprised if a regular Spark SQL-only Spark application would be failing too. I'd not be surprised to hear that it has nothing to do with a streaming query (Spark Structured Streaming) and would fail for any Spark application.

Why doesn't the pyspark driver download jar files to local storage?

I am using spark-on-k8s-operator to deploy Spark 2.4.4 on Kubernetes. However, I'm pretty sure this questions is about Spark itself, not about a Kubernetes deployment of it.
I include several files when I deploy a job to the kubernetes cluster, including jars, pyfiles and a main. In spark-on-k8s; this is done via a config file:
spec:
mainApplicationFile: "s3a://project-folder/jobs/test/db_read_k8.py"
deps:
jars:
- "s3a://project-folder/jars/mysql-connector-java-8.0.17.jar"
pyFiles:
- "s3a://project-folder/pyfiles/pyspark_jdbc.zip"
This would be equivalent to
spark-submit \
--jars s3a://project-folder/jars/mysql-connector-java-8.0.17.jar \
--py-files s3a://project-folder/pyfiles/pyspark_jdbc.zip \
s3a://project-folder/jobs/test/db_read_k8.py
In spark-on-k8s, there is a sparkapplication kubernetes pod that manages the submitted spark jobs, and that pod spark-submits to a driver pod (which then interacts with the worker pods). My issue occurs on the driver pod. Once the driver recieves the spark-submit command, it goes about its business, and pull the required files from AWS S3, as expected. Except, it does not pull the jar file:
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added JAR s3a://project-folder/jars/mysql-connector-java-8.0.17.jar at s3a://sezzle-spark/jars/mysql-connector-java-8.0.17.jar with timestamp 1572973279830
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/jobs/test/db_read_k8.py at s3a://sezzle-spark/jobs/test/db_read_k8.py with timestamp 1572973279872
spark-kubernetes-driver 19/11/05 17:01:19 INFO Utils: Fetching s3a://project-folder/jobs/test/db_read_k8.py to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp1013256051456720708.tmp
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/pyfiles/pyspark_jdbc.zip at s3a://sezzle-spark/pyfiles/pyspark_jdbc.zip with timestamp 1572973279962
spark-kubernetes-driver 19/11/05 17:01:20 INFO Utils: Fetching s3a://project-folder/pyfiles/pyspark_jdbc.zip to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp6740168219531159007.tmp
All three required files are "added" but only the main and pyfiles are "fetched." Looking through the driver pod, I can't find the jar file anywhere; it just doesn't get downloaded locally. This, of course, crashes my application, because the mysql driver isn't in the classpath.
Why doesn't spark download jar files to the driver's local filesystem the way it does for the pyfiles and python main?
PySpark has a bit unclear and not enough documented dependency management.
If your problem is with adding .jar only I would recommend you to use --packages ... instead (spark-operator should have the analogous option).
Hope it'll work for you.

Spark runs endlessly for Pi example

I just setup Spark and ran the command
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
However, it just keeps endlessly printing out messages like
16/04/25 17:34:46 INFO Client: Application report for application_1460481694166_0125 (state: ACCEPTED)
I read somewhere that I could try to kill the application. But I'm not sure what
When I try
yarn application -list
I see
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1460481694166_0118 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0124 Spark shell SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0120 Spark shell ...
Zeppelin SPARK zeppelin default RUNNING UNDEFINED 10% http://10.0.2.15:4040
application_1460481694166_0117 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0123 Spark shell
...
I'm not sure why Zeppelin is showing up because I closed it in my web browser
What do I need to do now?
I'm guessing Zeppelin is still running even though you closed your browser. Closing the browser is not the same as stopping the hosting process. Stopping the hosting process is done in the CLI tab that started the process. As a last ditch, you can yarn application -kill any of the running processes in any tab.
yarn application -kill application_1460481694166_0118
That will kill the (first) spark application.

Spark + Mesos cluster mode, who uploads the jar?

I'm trying to run Spark applications with Mesos cluster mode. (I've got client mode working but still would like to try cluster mode)
I have launched spark-mesos-dispatcher on the Mesos master node.
When I submit the assembly at local path /tmp/assembly.jar using the following command,
bin/spark-submit --master mesos://dispatcher:7077 --deploy-mode cluster --class com.example.Example /tmp/assembly.jar
It fails because the file /tmp/assembly.jar does not exist on the mesos slave nodes.
I1129 10:47:43.839771 5884 fetcher.cpp:414] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/deploy","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/assembly.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/frameworks\/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291\/executors\/driver-20151129104742-0008\/runs\/31bf5840-226e-4b87-ae76-d14bd2f17950","user":"user"}
I1129 10:47:43.840710 5884 fetcher.cpp:369] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840721 5884 fetcher.cpp:243] Fetching directly into the sandbox directory
I1129 10:47:43.840731 5884 fetcher.cpp:180] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840737 5884 fetcher.cpp:160] Copying resource with command:cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'
cp: cannot stat `/tmp/assembly.jar': No such file or directory
Failed to fetch '/tmp/assembly.jar': Failed to copy with command 'cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'', exit status: 256
Failed to synchronize with slave (it's probably exited)
In case of YARN cluster mode, Spark's YARN client implementation will upload the application jar to HDFS so that the driver and all executors have access to the jar, but I could not find such code in RestSubmissionClient, which is used by Mesos or Standalond cluster mode.
Who does the uploading in this case? or do I need to manually put the application assembly somewhere accessible via an HTTP URI?
From my understanding you could use the SparkContext addJar() method to add a local (to the driver application) JAR file path, which will then be distributed to the executor nodes (in client mode).
As you state that you want to use cluster mode, I'd suggest that you have a look at the Spark Jobserver project, which should make the running of Spark applications on Mesos easier than with the built-in tools.

Spark cluster set up error

With some research over the internet, I can use
sbin/start-master.sh
to start the spark master server spark service over my Ubuntu Linux computers
and use
bin/spark-class org.apache.spark.deploy.worker.Worker spark://...
for the slave nodes service up and running.
The good news was I can see the local web page with works found alive.
However, after such, I tried to launch the shell to work ...
MASTER=spark://localhost:7077 bin/spark-shell
but it returned:
sparkMaster#localhost:7077 ...
And therefore I modified the code to
MASTER=spark://sparkuser#localhost:7077 bin/spark-shell
where the sparkuser is the one connected to the two nodes
However, with such modification, I got:
ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
and when I tried
MASTER=local-cluster[3,2,1024] bin/spark-shell
It did come out with the spark logo in the shell but I was afraid the slave nodes were not binding in.
Did I miss anything for the Spark cluster setting?
Just launch spark-shell on cluster with --master flag as follows
./spark-shell --master spark://localhost:7077 bin/spark-shell

Resources