Spark on k8s: spark.kubernetes.file.upload.path not cleaned

I am running Spark on k8s, which works so far, but the directory under spark.kubernetes.file.upload.path in spark-defaults.conf is not cleaned after the driver pod ends. Is there an additional option I have to activate?
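For reference, the setting in question usually points at a shared or object-store location in spark-defaults.conf (the bucket path below is a made-up example):
spark.kubernetes.file.upload.path  s3a://my-bucket/spark-upload-tmp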

Related

Python+PySpark File locally connecting to a Remote HDFS/Spark/Yarn Cluster

I've been playing around with HDFS and Spark. I've set up a five-node cluster on my network running HDFS and Spark, managed by YARN. Workers are running in client mode.
From the master node, I can launch the PySpark shell just fine. Running example jars, the job is split up to the worker nodes and executes nicely.
I have a few questions on whether and how to run python/Pyspark files against this cluster.
If I have a Python file with PySpark calls somewhere else, like on my local dev laptop or in a Docker container, is there a way to run or submit this file locally and have it executed on the remote Spark cluster? The method I'm wondering about involves running spark-submit in the local/Docker environment, but with SparkSession.builder.master() in the file configured to point at the remote cluster.
Relatedly, I see a --master option for spark-submit, but the only YARN option is to pass "yarn", which seems to only queue locally. Is there a way to specify a remote YARN?
If I can set up and run the file remotely, how do I set up SparkSession.builder.master()? Is the URL just the hdfs:// URL on port 9000, or do I submit to one of the YARN ports?
TIA!
way to run or submit this file locally and have it executed on the remote Spark cluster
Yes, well to "YARN", not a "remote Spark cluster". You set --master=yarn when running spark-submit, and it will run against the yarn-site.xml found via the HADOOP_CONF_DIR environment variable. You can define this at the OS level, or in spark-env.sh.
You can also use SparkSession.builder.master('yarn') in code. If both are supplied, the value set in code wins, since properties set directly on the SparkConf take precedence over spark-submit flags.
To run fully "in the cluster", also set --deploy-mode=cluster
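For example (a minimal sketch; the Hadoop config directory and script name are assumptions about your setup):
export HADOOP_CONF_DIR=/etc/hadoop/conf   # must contain yarn-site.xml pointing at your ResourceManager
spark-submit --master yarn --deploy-mode cluster my_job.py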
Is there a way to specify remote yarn?
As mentioned, this is configured in yarn-site.xml, which provides the ResourceManager location(s).
how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000
No - the YARN ResourceManager has its own RPC protocol, not hdfs:// ... You can use spark.read.text("hdfs://namenode:port/path") (or any other DataFrameReader method) to read HDFS files, though. As mentioned, .master('yarn') or --master yarn are the only Spark-specific configs you need.
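A minimal PySpark sketch of both pieces together (the namenode host, port, and file path below are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName("example").getOrCreate()
# Read a plain-text file from HDFS; other DataFrameReader methods work the same way.
df = spark.read.text("hdfs://namenode:9000/data/sample.txt")
df.show()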
If you want to use Docker containers, YARN does support this, but Spark's Kubernetes master will be easier to set up, and you can use Hadoop Ozone or MinIO rather than HDFS in Kubernetes.
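If you do try the Kubernetes route, submission looks roughly like this (a sketch; the API-server address, image name, class, and jar path are all assumptions):
spark-submit --master k8s://https://<k8s-apiserver>:6443 --deploy-mode cluster --conf spark.kubernetes.container.image=<your-spark-image> --class com.example.Main local:///opt/spark/jars/app.jar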

spark standalone running on docker cleanup not running

I'm running Spark in standalone mode as a Docker service, where I have one master node and one Spark worker. I followed the instructions in the Spark documentation:
https://spark.apache.org/docs/latest/spark-standalone.html
to add the properties that make the Spark cluster clean up after itself, and I set those in my docker_entrypoint:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=900 -Dspark.worker.cleanup.appDataTtl=900"
and verified that it was enabled by following the logs of the worker node service.
My question is: should we expect all directories located under the SPARK_WORKER_DIR directory to be cleaned, or does it only clean the application files?
I ask because I still see some empty directories lingering there.

Spark Standalone how to pass local .jar file to cluster

I have a cluster with two workers and one master.
To start the master & workers I use sbin/start-master.sh and sbin/start-slaves.sh on the master's machine. Then the master UI shows me that the slaves are ALIVE (so, everything OK so far). The issue comes when I want to use spark-submit.
I execute this command in my local machine:
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster /home/user/example.jar
But the following error pops up: ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /home/user/example.jar
I have been doing some research on Stack Overflow and in Spark's documentation, and it seems like I should specify the application-jar of the spark-submit command as a "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." (as indicated at https://spark.apache.org/docs/latest/submitting-applications.html).
My question is: how can I make my .jar globally visible inside the cluster? There is a similar question here, Spark Standalone cluster cannot read the files in local filesystem, but the solutions do not work for me.
Also, am I doing something wrong by initialising the cluster inside my master's machine using sbin/start-master.sh but then doing the spark-submit in my local machine? I initialise the master inside my master's terminal because I read so in Spark's documentation, but maybe this has something to do with the issue. From Spark's documentation:
Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin: [...] Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
Thank you very much
EDIT:
I have copied the .jar file to every worker and it works. But my point is to know whether there is a better way, since this method makes me copy the .jar to each worker every time I create a new jar. (This was one of the answers to the question at the link already posted, Spark Standalone cluster cannot read the files in local filesystem.)
@meisan, your spark-submit command is missing 2 things:
your jars should be added with the --jars flag
the file holding your driver code, i.e. the main function
You have not specified whether you are using Scala or Python, but in a nutshell your command will look something like this.
For Python:
spark-submit --master spark://<master>:7077 --deploy-mode cluster --jars <dependency-jars> <python-file-holding-driver-logic>
For Scala:
spark-submit --master spark://<master>:7077 --deploy-mode cluster --class <scala-driver-class> --driver-class-path <application-jar> --jars <dependency-jars> <application-jar>
Also, Spark takes care of sending the required files and jars to the executors when you use the documented flags.
If you want to omit the --driver-class-path flag, you can set the environment variable SPARK_CLASSPATH to the path where all your jars are placed.
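Alternatively, following the documentation quoted in the question, you can stage the jar at a URL that is visible from every node, such as HDFS, instead of copying it to each worker by hand (a sketch; the HDFS path and main class are made up):
hdfs dfs -put -f /home/user/example.jar /apps/example.jar
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.example.Main hdfs:///apps/example.jar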

Spark Job running even after spark Master process is killed

We are working on a Spark cluster where Spark jobs are getting submitted successfully even after the Spark "Master" process is killed.
Here are the complete details of what we are doing.
Process details:
jps
19560 NameNode
18369 QuorumPeerMain
22414 Jps
20168 ResourceManager
22235 Master
We then submitted one Spark job to this Master using a command like:
spark-1.6.1-bin-without-hadoop/bin/spark-submit --class com.test.test --master yarn-client --deploy-mode client test.jar -incomingHost hostIP
where hostIP has the correct IP address of the machine running the "Master" process.
After this, we are able to see the job in the RM Web UI as well.
Now, when we kill the "Master" process, we can see that the submitted job keeps running fine, which is expected here, since we are using YARN mode and that job will run without any issue.
But when we once again submit the same spark-submit command, pointing to the same Master IP which is currently down, we see one more job in the RM web UI (host:8088). This we are not able to understand, as the Spark "Master" is killed (and the Spark UI at host:8080 does not come up either).
Please note that we are using "yarn-client" mode, as in the code below:
sparkProcess = new SparkLauncher()
.......
.setSparkHome(System.getenv("SPARK_HOME"))
.setMaster("yarn-client")
.setDeployMode("client")
Can someone please explain this behaviour to me? I did not find an answer after reading many blogs (http://spark.apache.org/docs/latest/running-on-yarn.html) and the official docs.
Thanks
Please check the cluster overview. As per your description, you are running the Spark application in YARN client mode, with the driver placed on the instance where you launch the command. In this mode the YARN ResourceManager, not the Spark Master, accepts and schedules your jobs, which is why submissions keep working after the Master is killed. The Spark Master is related to Spark standalone cluster mode, in which case your launch command should look similar to:
spark-submit --master spark://your-spark-master-address:port
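You can confirm that it is the YARN ResourceManager, and not the standalone Master, tracking these jobs by listing them directly (sketch):
yarn application -list   # applications submitted with --master yarn-client appear here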

Not able to launch Spark cluster in Standalone mode with start-all.sh

I am new to Spark and I am trying to install Spark standalone on a 3-node cluster. I have set up password-less SSH from the master to the other nodes.
I have tried the following config changes:
Updated the hostnames for the 2 worker nodes in the conf/slaves.sh file. Created the spark-env.sh file and updated SPARK_MASTER_IP with the master URL. Also tried updating the spark.master value in the spark-defaults.conf file.
Snapshot of conf/slaves.sh
# A Spark Worker will be started on each of the machines listed below.
Spark-WorkerNode1.hadoop.com
Spark-WorkerNode2.hadoop.com
Snapshot of spark-defaults.conf
# Example:
spark.master spark://Spark-Master.hadoop.com:7077
But when I try to start the cluster by running start-all.sh on the master, it does not recognize the worker nodes and starts the cluster as local.
It does not give any error; the log files show Successfully started service 'sparkMaster' and Successfully started service 'sparkWorker' on the master.
I have tried running the start-master and start-slave scripts on the individual nodes and that seems to work fine. I can see the 2 workers in the web UI. I am using Spark 1.6.0.
Can somebody please help me with what I am missing while trying to run start-all?
Snapshot of conf/slaves.sh
The file should be named slaves, without the .sh extension.
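A quick sketch of the fix, assuming a standard SPARK_HOME layout:
mv $SPARK_HOME/conf/slaves.sh $SPARK_HOME/conf/slaves
$SPARK_HOME/sbin/stop-all.sh
$SPARK_HOME/sbin/start-all.sh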
