Spark cleanup job not running - apache-spark

Whenever I run dse spark-submit <jarname>, it copies the jar into SPARK_WORKER_DIR (in my case /var/lib/spark-worker/worker-0). I want the jar to be deleted automatically once the Spark job has completed successfully. To do this, I changed SPARK_WORKER_OPTS in spark-env.sh as follows:
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800"
But the jar is still not getting deleted. Am I doing something wrong? What should I do?

Adding this line to spark-env.sh and restarting the dse service worked for me:
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=172800 "
I restarted the DSE service with:
nodetool drain
sudo service dse restart
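With these settings the worker checks once an hour (3600 seconds) and deletes the directories of stopped applications, including the copied jar, once they are more than 2 days (172800 seconds) old.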
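As a rough sketch of what those numbers mean (the helper names below are made up for illustration and are not part of Spark or DSE): both values are in seconds. The Spark standalone documentation lists 604800 (7 days) as the default for spark.worker.cleanup.appDataTtl, which may explain why the jar in the question seemed never to be deleted when only the interval was set.

```python
# Hypothetical helpers for illustration only; they are not part of Spark or DSE.

def worker_cleanup_opts(interval_s: int, app_data_ttl_s: int) -> str:
    """Build the -D flags that enable periodic cleanup on a standalone worker."""
    if interval_s <= 0 or app_data_ttl_s <= 0:
        raise ValueError("interval and TTL must be positive numbers of seconds")
    return (
        "-Dspark.worker.cleanup.enabled=true "
        f"-Dspark.worker.cleanup.interval={interval_s} "
        f"-Dspark.worker.cleanup.appDataTtl={app_data_ttl_s}"
    )


def worst_case_retention_s(interval_s: int, app_data_ttl_s: int) -> int:
    """Rough upper bound on how long a stopped application's directory survives.

    Assumption: a directory that has just missed one sweep waits for the next
    one, so it can live for about TTL plus one interval.
    """
    return app_data_ttl_s + interval_s


opts = worker_cleanup_opts(interval_s=3600, app_data_ttl_s=172800)
# The line to place in spark-env.sh:
print(f'export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS {opts}"')
# Approximate worst-case retention, in days:
print(worst_case_retention_s(3600, 172800) / 86400)
```

Note that this cleanup only applies to directories of applications that have already stopped, so a long-running job keeps its jar for as long as it runs.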

Related

"Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher" when running spark-submit or PySpark

I am trying to run the spark-submit command on my Hadoop cluster. Here is a summary of the cluster:
The cluster is built from 5 VirtualBox VMs connected on an internal network.
There are 1 namenode and 4 datanodes.
All the VMs were built from the Bitnami Hadoop Stack VirtualBox image.
I am trying to run one of the Spark examples using the following spark-submit command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I get the following error:
[2022-07-25 13:32:39.253]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
I get the same error when trying to run a script with PySpark.
I have tried/verified the following:
The environment variables HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR are set in my .bashrc file.
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR are defined in spark-env.sh.
I added spark.master yarn, spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging and spark.yarn.jars hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ to spark-defaults.conf.
I uploaded the jars to HDFS (i.e. hadoop fs -put $SPARK_HOME/jars/* hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/).
The logs accessible via the web interface (i.e. http://hadoop-namenode:8042) do not provide any further details about the error.
This section of the Spark documentation seems relevant to the error, since the YARN libraries should be included by default, but only if you've installed the appropriate Spark distribution:
For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For no-hadoop Spark distribution, Spark will populate Yarn’s classpath by default in order to get Hadoop runtime. For with-hadoop Spark distribution, if your application depends on certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting the property mentioned above. If you run into jar conflict issue by doing so, you will need to turn it off and include this library in your application jar.
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
Otherwise, yarn.application.classpath in yarn-site.xml refers to local filesystem paths on each of the ResourceManager servers where JARs are available to all YARN applications (spark.yarn.jars or extra packages should get layered onto this).
Another problem could be file permissions. You probably shouldn't put Spark jars into an HDFS user folder if they're meant to be used by all users. Typically, I'd put them under hdfs:///apps/spark/<version>, then give that 744 HDFS permissions.
The Spark / YARN UI should show the complete classpath of the application, which helps with further debugging.
I figured out why I was getting this error. It turns out that I made a mistake when specifying spark.yarn.jars in spark-defaults.conf.
The value of this property must be
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
instead of
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
That is, the value of this property must match the jar files themselves, not the folder that contains them.
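To make the difference concrete, here is a minimal sketch of a helper that turns a jar folder into the glob form described above. The function name yarn_jars_value is hypothetical and only illustrates the string that should end up in spark-defaults.conf.

```python
# Hypothetical helper for illustration; Spark itself provides no such function.

def yarn_jars_value(jar_location: str) -> str:
    """Return a spark.yarn.jars value that matches jar files, not a folder."""
    # Already a glob or a single jar file: leave it untouched.
    if jar_location.endswith("*") or jar_location.endswith(".jar"):
        return jar_location
    # Otherwise treat it as a folder and match every file directly inside it.
    return jar_location.rstrip("/") + "/*"


folder = "hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/"
# The line to place in spark-defaults.conf:
print(f"spark.yarn.jars {yarn_jars_value(folder)}")
```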

Local file upload failed in spark application

In my code, I am trying to load a file from my local machine into a Spark application:
sc.textFile("file:///home/testpath/file1")
When I submit the job from the command line, I see the following:
Scenario 1: spark-submit --class … --master local
The job ran successfully without any issues.
Scenario 2: spark-submit --class … --master yarn --deploy-mode cluster
The job failed with a "file not found" exception for file:///home/testpath/file1.
But when I checked, file1 exists on my local machine.
Scenario 3: spark-submit --class … --master yarn --deploy-mode client
The job failed with the same "file not found" exception for file:///home/testpath/file1.
Again, file1 exists on my local machine.
Scenario 4: spark-shell --master=yarn
val file1 = sc.textFile("file:///home/testpath/file1")
The job failed with the same "file not found" exception for file:///home/testpath/file1.
In core-site.xml, the fs.default.name property is set to hdfs://mynamenode:9000.
How can I load a local file in my Spark application? (I am using Spark 2.x.)
Any ideas? Thanks in advance.
When the Spark execution mode is local, the driver and the executors run on the same local node, so the file can be found. In YARN mode, the executors are scheduled on arbitrary cluster nodes, and a file:// path is resolved on whichever node reads it, where your file does not exist. So you can either move the file to HDFS or keep a copy of it at the same path on every node.
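The rule of thumb can be sketched as a small check. This is only an illustration under a stated assumption: file:// paths are not replicated on every node, so they are only safe when everything runs on one machine. The function name and the HDFS path are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical check for illustration; it does not talk to Spark or the cluster.

def readable_by_all_executors(path: str, master: str) -> bool:
    """Guess whether every executor will resolve `path` to the same data.

    Assumption: a file:// path exists only on the machine you submit from.
    """
    scheme = urlparse(path).scheme
    if scheme == "file":
        # Safe only when driver and executors share one machine.
        return master.startswith("local")
    # hdfs:// and similar cluster-wide filesystems look the same from any node.
    return True


print(readable_by_all_executors("file:///home/testpath/file1", "local"))
print(readable_by_all_executors("file:///home/testpath/file1", "yarn"))
# Hypothetical HDFS location after copying the file with: hdfs dfs -put
print(readable_by_all_executors("hdfs://mynamenode:9000/user/testpath/file1", "yarn"))
```

In the question's setup this corresponds to copying file1 to HDFS and passing the hdfs:// URI to sc.textFile instead of the file:// one.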

Could not find valid SPARK_HOME on dataproc

A Spark job executed by a Dataproc cluster on Google Cloud gets stuck on a task at PythonRDD.scala:446.
The error log says Could not find valid SPARK_HOME while searching ... paths under /hadoop/yarn/nm-local-dir/usercache/root/
The thing is, SPARK_HOME should be set by default on a Dataproc cluster.
Other Spark jobs that don't use RDDs work just fine.
I do not reinstall Spark during cluster initialization (I tried that earlier, and at first I thought it was causing the issue).
I also found that all my executors were removed after a minute of running the task.
And yes, I have tried running the following initialization action, and it didn't help:
#!/bin/bash
cat << EOF | tee -a /etc/profile.d/custom_env.sh /etc/*bashrc >/dev/null
export SPARK_HOME=/usr/lib/spark/
EOF
Any help?
I was using a custom mapping function. When I moved the function into a separate file, the problem disappeared.

Using same jar with Spark-submit

I deploy a job in YARN cluster mode using spark-submit with my jar file. Every submission uses the same jar file, but the jar is uploaded to Hadoop each time. Uploading the same jar on every submission seems unnecessary. Is there a way to upload it once and run subsequent YARN jobs with it?
You can put your application jar in HDFS and pass its hdfs:// path to spark-submit with --master yarn-cluster (in Spark 2.0 and later, where yarn-cluster is deprecated, use --master yarn --deploy-mode cluster). That way you save the time needed to upload the jar to HDFS on every submission.
Another alternative is to put your jar on the Spark classpath on every node, which has the following drawbacks:
If you have more than 30 nodes, it would be very tedious to scp your jar to each node.
If your Hadoop cluster is upgraded and Spark is reinstalled, you would have to redeploy the jar.
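A minimal sketch of the first option, assuming the jar has already been copied to HDFS once (for example with hdfs dfs -put). The jar path, the class name and the helper function are hypothetical; only the spark-submit flags are real.

```python
import shlex

# Hypothetical helper for illustration; adjust the jar URI and class to your job.

def submit_command(jar_uri: str, main_class: str, app_args=()) -> list:
    """Build a spark-submit command that reuses a jar already stored in HDFS."""
    if not jar_uri.startswith("hdfs://"):
        raise ValueError("expected an hdfs:// URI so the jar is not re-uploaded")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        jar_uri,
        *app_args,
    ]


cmd = submit_command(
    "hdfs://namenode:8020/apps/myapp/myapp.jar",  # hypothetical location
    "com.example.Main",                           # hypothetical main class
    ["10"],
)
# Print the command as a single shell-safe line.
print(shlex.join(cmd))
```

The list form can also be handed to subprocess.run directly, which avoids quoting problems with arguments that contain spaces.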

Spark driver always fail to bind to submit host in cluster mode

Hi, I'm trying to deploy a Spark Streaming job on a standalone cluster. All the jars are installed locally on each node, and I run spark-submit from one of the nodes. The driver is then started on one of the workers at random, but it always tries to bind to the node where I submitted the job. If the driver lands on a different node, it always fails. I tried setting spark.driver.host to different values, but that didn't help.
Has anyone had the same problem? Or is there a better way to submit Spark jobs, ideally to a standalone cluster?
spark-env.sh
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_LOCAL_HOSTNAME=local_host_name
export SPARK_LOG_DIR=/var/log/spark
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOCAL_DIRS=/var/run/spark/tmp
export STANDALONE_SPARK_MASTER_HOST=master_host_name
spark-defaults.conf
spark.master spark://master_host_name:6066
spark.io.compression.codec lz4
I run it with spark-submit --deploy-mode cluster --supervise
Thanks a lot
