Why doesn't the pyspark driver download jar files to local storage? - apache-spark

I am using spark-on-k8s-operator to deploy Spark 2.4.4 on Kubernetes. However, I'm pretty sure this question is about Spark itself, not about a Kubernetes deployment of it.
I include several files when I deploy a job to the Kubernetes cluster, including jars, pyfiles, and a main application file. In spark-on-k8s, this is done via a config file:
spec:
  mainApplicationFile: "s3a://project-folder/jobs/test/db_read_k8.py"
  deps:
    jars:
      - "s3a://project-folder/jars/mysql-connector-java-8.0.17.jar"
    pyFiles:
      - "s3a://project-folder/pyfiles/pyspark_jdbc.zip"
This would be equivalent to
spark-submit \
--jars s3a://project-folder/jars/mysql-connector-java-8.0.17.jar \
--py-files s3a://project-folder/pyfiles/pyspark_jdbc.zip \
s3a://project-folder/jobs/test/db_read_k8.py
In spark-on-k8s, there is a sparkapplication kubernetes pod that manages the submitted spark jobs, and that pod spark-submits to a driver pod (which then interacts with the worker pods). My issue occurs on the driver pod. Once the driver receives the spark-submit command, it goes about its business and pulls the required files from AWS S3, as expected. Except it does not pull the jar file:
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added JAR s3a://project-folder/jars/mysql-connector-java-8.0.17.jar at s3a://sezzle-spark/jars/mysql-connector-java-8.0.17.jar with timestamp 1572973279830
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/jobs/test/db_read_k8.py at s3a://sezzle-spark/jobs/test/db_read_k8.py with timestamp 1572973279872
spark-kubernetes-driver 19/11/05 17:01:19 INFO Utils: Fetching s3a://project-folder/jobs/test/db_read_k8.py to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp1013256051456720708.tmp
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/pyfiles/pyspark_jdbc.zip at s3a://sezzle-spark/pyfiles/pyspark_jdbc.zip with timestamp 1572973279962
spark-kubernetes-driver 19/11/05 17:01:20 INFO Utils: Fetching s3a://project-folder/pyfiles/pyspark_jdbc.zip to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp6740168219531159007.tmp
All three required files are "added" but only the main and pyfiles are "fetched." Looking through the driver pod, I can't find the jar file anywhere; it just doesn't get downloaded locally. This, of course, crashes my application, because the mysql driver isn't in the classpath.
Why doesn't spark download jar files to the driver's local filesystem the way it does for the pyfiles and python main?

PySpark's dependency management is somewhat unclear and under-documented.
If your problem is only with adding a .jar, I would recommend using --packages ... instead (the spark-operator should have an analogous option).
Hope it works for you.
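For example, instead of shipping the connector jar via --jars, it can be resolved by its Maven coordinates. This is only a sketch: the coordinates below correspond to the MySQL 8.0.17 connector referenced above, and the exact spark-operator field may differ between operator versions.
spark-submit \
  --packages mysql:mysql-connector-java:8.0.17 \
  --py-files s3a://project-folder/pyfiles/pyspark_jdbc.zip \
  s3a://project-folder/jobs/test/db_read_k8.py
In the SparkApplication spec, this would presumably be expressed as a packages list under deps, if the operator version in use exposes it.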

Related

pyspark job execution in yarn cluster

I am trying to understand how a Spark job works in a YARN cluster.
I am using the command below to submit the job:
spark-submit --master yarn --deploy-mode cluster sparksessionexample.py
After submitting the job, the console shows the following log:
2020-05-29 20:52:48,668 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_libs__2908230569257238890.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_libs__2908230569257238890.zip
2020-05-29 20:53:14,164 INFO yarn.Client: Uploading resource file:/home/hadoop/pythonprojects/Python/src/spark_jobs/sparksessionexample.py -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/sparksessionexample.py
2020-05-29 20:53:14,610 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/pyspark.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/pyspark.zip
2020-05-29 20:53:15,984 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/py4j-0.10.7-src.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/py4j-0.10.7-src.zip
2020-05-29 20:53:18,362 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_conf__7123551182035223076.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_conf__.zip
I just want to understand how YARN executes the sparksessionexample.py file. Does it create a Python virtual env on the node? The log above only shows libs and confs being uploaded, but what about the Python client that executes sparksessionexample.py?
Can anyone help me understand this?
The "Spark client" is used to bootstrap the Spark job execution.
In your case it is the only thing that runs on your local machine, because you requested cluster execution mode:
the "client" contacts the cluster manager (here YARN Resource Manager, could be Kubernetes Master, etc.) to start the Spark driver inside an AppMaster container
then the driver contacts again the cluster manager to request some containers for the executors
then the driver runs your Python code and distributes the work to the executors
finally the driver de-allocates its executors and itself
at this point the "client" notices that the YARN job has reached success or failure status, and can terminate
In short, the "client" never gets any kind of useful information from the driver running inside the cluster. You must inspect the YARN logs for the container running the driver (it's the AppMaster, typically number 00001).
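For example, once the application has finished, the aggregated container logs (including the driver's) can usually be retrieved with the YARN CLI, using the application ID shown in the log above and assuming log aggregation is enabled on the cluster:
yarn logs -applicationId application_1590759398715_0003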
If you want to see some feedback from the driver, then run your job in client execution mode -- it means the driver will run in the same JVM as the "client", on your local machine, and print its logs to your console.

How to let spark cluster fetch package jars from local path rather than from master?

I found that every time I launch an application in my spark standalone cluster with external packages, say pyspark --master=spark://master:7077 --packages Azure:mmlspark:0.17, executors are always trying to fetch package jars from the driver. Here is the log:
2019-05-23 21:14:56 INFO Executor:54 - Fetching spark://Master:2653/files/com.microsoft.cntk_cntk-2.4.jar with timestamp 1558616430055
2019-05-23 21:14:56 INFO TransportClientFactory:267 - Successfully created connection to Master/192.168.100.2:2653 after 23 ms (0 ms spent in bootstraps)
2019-05-23 21:14:56 INFO Utils:54 - Fetching spark://Master:2653/files/com.microsoft.cntk_cntk-2.4.jar to /tmp/spark-0a60d982-0082-4d37-aea1-e1c0b21ee2be/executor-c9632fd2-29fc-429c-bdfb-31d870ed19e8/spark-15805ad8-ab00-41b3-b466-b0e8e95a3f56/fetchFileTemp5196357990337888981.tmp
Something like this repeats in the logs of the executors. The size of the package is quite big, so the process takes a lot of time.
I have tried using the --jars argument of pyspark to upload the required jars to each executor. The executors did fetch them from the local path, but I couldn't import the package in the shell.
So how can I solve this? What should I do so that the executors fetch the package from a local path, or maybe from HDFS?
We can copy the jars to all nodes and add their path to the spark.executor.extraClassPath config parameter, so that the jars are available on the executors' classpath.
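A minimal sketch of that approach, assuming the package jars have been copied to a hypothetical /opt/spark-jars directory on every node:
pyspark --master spark://master:7077 \
  --conf spark.executor.extraClassPath='/opt/spark-jars/*' \
  --conf spark.driver.extraClassPath='/opt/spark-jars/*'
The driver-side setting is only needed if the driver also has to load the classes; in any case the jars must exist at the same path on every machine.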

How can I run spark in headless mode in my custom version on HDP?

How can I run spark in headless mode?
Currently, I am running Spark on an HDP 2.6.4 cluster (i.e. Spark 2.2 is installed by default).
I have downloaded a Spark 2.4.1 Scala 2.11 release in headless mode (i.e. no Hadoop jars are built in) from https://spark.apache.org/downloads.html. The exact name is: pre-built with Scala 2.11 and user-provided Hadoop.
Now, when trying to run it, I follow https://spark.apache.org/docs/latest/hadoop-provided.html:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/home/<<my_user>>/development/software/spark_no_provided_hadoop
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_yarn_queue>>
Unfortunately, it fails to start:
19/05/01 07:12:23 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/05/01 07:12:38 ERROR cluster.YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
19/05/01 07:12:38 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Application application_1555489055691_64276 failed 2 times due to AM Container for appattempt_1555489055691_64276_000002 exited with exitCode: 1
When looking at the logs for details I see:
Log Type: prelaunch.err
launch_container.sh: line 30: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/*:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:/usr/hdp/2.6.4.0-91/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/2.6.4.0-91/hadoop/.//*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/./:/usr/hdp/2.6.4.0-91/hadoop-hdfs/lib/*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/.//*:/usr/hdp/2.6.4.0-91/hadoop-yarn/lib/*:/usr/hdp/2.6.4.0-91/hadoop-yarn/.//*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/lib/*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/.//*:/usr/hdp/2.6.4.0-91/tez/*:/usr/hdp/2.6.4.0-91/tez/lib/*:/usr/hdp/2.6.4.0-91/tez/conf:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
So:
/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar: bad substitution
is the cause (and similar to https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html), but this is completely inside Ambari's management domain. How can I work around it to run a more recent version of Spark (2.4.x) on the existing 2.6.x HDP platform?
edit
Assuming I had passed a wrong configuration directory for HADOOP_CONF_DIR, I unset it. But then:
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
so it must be set. Could it be that I am passing the wrong value?
According to Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark, the value could be correct. For me, no HADOOP_HOME is set by default.
Even when setting export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf, the same bad substitution error remains.
NOTE: some interesting steps:
https://community.hortonworks.com/articles/244059/steps-to-install-supplementary-spark-on-hdp-cluste.html, but not for the headless edition
https://community.hortonworks.com/questions/85757/how-to-add-the-hadoop-and-yarn-configuration-file.html
Indeed, https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html is the solution:
cd /usr/hdp
ls
2.6.xxx current share
So for me:
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>> \
  --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.xxx' \
  --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.xxx'
works.
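To avoid repeating the flags on every invocation, the same properties can presumably also be placed in conf/spark-defaults.conf of the downloaded Spark distribution (keeping the placeholder version from above):
spark.driver.extraJavaOptions    -Dhdp.version=2.6.xxx
spark.yarn.am.extraJavaOptions   -Dhdp.version=2.6.xxx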

Not able to load twitter jar file in spark cluster

I am working on a basic Spark Twitter application, but I am not able to load the Twitter jar files in the Spark cluster.
spark-shell --jars /usr/local/Twitter/spark-streaming-twitter_2.11-2.0.1.jar,\
/usr/local/Twitter/twitter4j-core-4.0.6.jar,\
/usr/local/Twitter/twitter4j-stream-4.0.6.jar
I am using the above command to add the jar files to the Spark environment, but I am getting a FileNotFoundException:
18/03/28 09:11:39 ERROR SparkContext: Failed to add file:/usr/local/Twitter/spark-streaming-twitter_2.11-2.0.1.jar to Spark environment
java.io.FileNotFoundException: Jar /usr/local/Twitter/spark-streaming-twitter_2.11-2.0.1.jar not found
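Since the exception says the jar was not found, a first sanity check (using the exact paths from the command above) is to confirm the files actually exist on the machine where spark-shell is started:
ls -l /usr/local/Twitter/spark-streaming-twitter_2.11-2.0.1.jar \
      /usr/local/Twitter/twitter4j-core-4.0.6.jar \
      /usr/local/Twitter/twitter4j-stream-4.0.6.jar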

Spark + Mesos cluster mode, who uploads the jar?

I'm trying to run Spark applications with Mesos cluster mode. (I've got client mode working but still would like to try cluster mode)
I have launched spark-mesos-dispatcher on the Mesos master node.
When I submit the assembly at local path /tmp/assembly.jar using the following command,
bin/spark-submit --master mesos://dispatcher:7077 --deploy-mode cluster --class com.example.Example /tmp/assembly.jar
It fails because the file /tmp/assembly.jar does not exist on the mesos slave nodes.
I1129 10:47:43.839771 5884 fetcher.cpp:414] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/deploy","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/assembly.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/frameworks\/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291\/executors\/driver-20151129104742-0008\/runs\/31bf5840-226e-4b87-ae76-d14bd2f17950","user":"user"}
I1129 10:47:43.840710 5884 fetcher.cpp:369] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840721 5884 fetcher.cpp:243] Fetching directly into the sandbox directory
I1129 10:47:43.840731 5884 fetcher.cpp:180] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840737 5884 fetcher.cpp:160] Copying resource with command:cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'
cp: cannot stat `/tmp/assembly.jar': No such file or directory
Failed to fetch '/tmp/assembly.jar': Failed to copy with command 'cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'', exit status: 256
Failed to synchronize with slave (it's probably exited)
In the case of YARN cluster mode, Spark's YARN client implementation will upload the application jar to HDFS so that the driver and all executors have access to it, but I could not find such code in RestSubmissionClient, which is used by the Mesos and Standalone cluster modes.
Who does the uploading in this case? Or do I need to manually put the application assembly somewhere accessible via an HTTP URI?
From my understanding you could use the SparkContext addJar() method to add a local (to the driver application) JAR file path, which will then be distributed to the executor nodes (in client mode).
As you state that you want to use cluster mode, I'd suggest that you have a look at the Spark Jobserver project, which should make the running of Spark applications on Mesos easier than with the built-in tools.
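If it does turn out that the assembly has to be staged manually (as the question suggests), here is a hedged sketch of what that could look like, assuming a hypothetical HDFS location that the Mesos agents can reach:
hdfs dfs -put /tmp/assembly.jar hdfs://namenode:8020/apps/assembly.jar
bin/spark-submit --master mesos://dispatcher:7077 --deploy-mode cluster \
  --class com.example.Example \
  hdfs://namenode:8020/apps/assembly.jar
An http:// or s3:// URI reachable from every agent would serve the same purpose.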
