Apache Spark: Completed application history deleted after restarting

When I restart the Spark cluster, the entire history of completed applications in the web UI is deleted. How can I preserve this history across restarts?

Spark does not store event logs by default. To keep them, enable event logging through the spark.eventLog.* settings, for example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark:// \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir="hdfs://your path" \
/home/spark/spark-3.2.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.1.jar 8
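To avoid passing these flags on every submit, the same settings can go into spark-defaults.conf, and a History Server can then serve completed applications from the stored logs. A minimal sketch, assuming a History Server is acceptable for browsing past runs; the log location hdfs:///spark-logs and the temporary fallback config directory are placeholders, not values from the question:

```shell
#!/bin/sh
# Sketch: persist event logs and point the History Server at the same location.
# CONF_DIR and LOG_DIR are assumptions; adjust them to your installation.
CONF_DIR="${SPARK_CONF_DIR:-$(mktemp -d)}"
LOG_DIR="${SPARK_EVENT_LOG_DIR:-hdfs:///spark-logs}"

# Append rather than overwrite, so existing settings are kept.
cat >> "$CONF_DIR/spark-defaults.conf" <<EOF
spark.eventLog.enabled           true
spark.eventLog.dir               $LOG_DIR
spark.history.fs.logDirectory    $LOG_DIR
EOF

echo "Wrote event log settings to $CONF_DIR/spark-defaults.conf"
# Afterwards, start the History Server (its UI listens on port 18080 by default):
#   "$SPARK_HOME"/sbin/start-history-server.sh
```

The log directory has to exist before applications start writing to it, and spark.history.fs.logDirectory must point at the same place as spark.eventLog.dir for the History Server to find the logs.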

Don't restart the Spark master. Keep it running and submit queries to it, for example from Zeppelin.


Apache Spark application is not seen in Spark Web UI (Java)

I am trying to run an ETL job using Apache Spark (Java) in a Kubernetes cluster. The application is running, and data is being inserted into the database (MySQL), but the application does not appear in the Spark Web UI.
The command I used for submitting the application is:
./spark-submit --class com.xxxx.etl.EtlApplication \
--name MyETL \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:32" \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf "spark.kubernetes.authenticate.driver.serviceAccountName=my-spark" \
--conf "spark.kubernetes.driver.request.cores=256m" \
--conf "spark.kubernetes.driver.limit.cores=512m" \
--conf "spark.kubernetes.executor.request.cores=256m" \
--conf "spark.kubernetes.executor.limit.cores=512m" \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/EtlApplication-with-dependencies.jar 1000
I use a Jenkins job to build my code and move the jar to the /opt/bitnami/spark/examples/jars folder in the container inside the cluster.
The job is seen running in the pod when I check with kubectl get pods, and its UI is reachable at localhost:4040 after mapping the port to localhost using kubectl port-forward pod/myetl-df26f5843cb88da7-driver 4040:4040
I tried the same spark-submit command with the Spark example jar (which ships with the Spark installation in the container):
./spark-submit --class org.apache.spark.examples.SparkPi \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:5" \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=my-spark \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000
This time the application is listed in the Spark Web UI. I tried several options, and when I remove the line --conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077", the SparkPi example application is not displayed in the Spark Web UI either.
Am I missing something? Do I need to change my Java code to accept spark.kubernetes.driverEnv.SPARK_MASTER_URL? I tried several options but nothing works.
Thanks in advance.

Submitting a spark job to a kubernetes cluster using bitnami spark docker image

I have a local setup with minikube and I'm trying to use spark-submit to submit a job to the local Kubernetes cluster. The idea is to use my local machine's spark-submit to submit to the Kubernetes master, which will handle creating a Spark cluster and tearing it down when the work is finished.
I'm using the image bitnami/spark:3.2.1 and the following command:
./bin/spark-submit --master k8s:// \
--deploy-mode cluster \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=bitnami/spark:3.2.1 \
--class org.apache.spark.examples.JavaSparkPi \
--name spark-pi \
This does not seem to work and the logs in the spark driver are:
Caused by: java.io.IOException: Failed to connect to spark-master:7077
Caused by: java.net.UnknownHostException: spark-master
If I use the docker-image-tool.sh to build a custom spark docker image with the python bindings and use that, it works perfectly. How is bitnami's image special and why doesn't it recognise that the master in this case is kubernetes?
I also tried passing the option --conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark:// when submitting, but the error was similar to the one above.

Spark on K8s: issues loading jar

I am trying to run a sample Spark application (provided in the Spark examples jar) on Kubernetes and to understand its behavior. In the process, I did the following:
Built a running Kubernetes cluster with 3 nodes (1 master and 2 workers) with adequate resources (10 cores, 64 GB memory, 500 GB disk). Note that I don't have internet access on my nodes.
Installed the Spark distribution spark-2.3.3-bin-hadoop2.7.
As there is no internet access on the nodes, I preloaded a Spark image (from gcr.io/cloud-solutions-images/spark:v2.3.0-gcs) into Docker on the node running the Kubernetes master.
Ran spark-submit against Kubernetes as follows:
./bin/spark-submit --master k8s://https://test-k8:6443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=gcr.io/cloud-solutions-images/spark:v2.3.0-gcs \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
However, it fails with the following error:
Error: Could not find or load main class org.apache.spark.examples.SparkPi
Regarding the above, I have the following questions:
Do we need to provide Kubernetes with a distribution of Spark, and is that what the following setting does?
--conf spark.kubernetes.container.image=gcr.io/cloud-solutions-images/spark:v2.3.0-gcs
If I have my own Spark application, say one that processes events from Kafka, what should my approach be?
Any help in debugging the above error and answering my follow-up questions is appreciated.
spark.kubernetes.container.image should be an image that contains both the Spark binaries and the application code. Since my nodes have no internet access, the following steps let the Spark driver pick up the correct jar.
This is what I did:
On my local computer, I ran docker build:
docker build -t spark_pi_test:v1.0 -f kubernetes/dockerfiles/spark/Dockerfile .
This built a Docker image on my local computer.
Saved the built image to a tarball:
docker save spark_pi_test:v1.0 > spark_pi_test_v1.0.tar
Copied the tarball to all 3 Kubernetes nodes with scp.
Loaded the tarball into Docker on all 3 Kubernetes nodes:
docker load < spark_pi_test_v1.0.tar
Then I submitted the Spark job as follows:
./bin/spark-submit --master k8s://https://test-k8:6443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=spark_pi_test:v1.0 \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.3.jar 100000
The jar path above is a path inside the Docker container.
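The build, save, copy, and load steps can be collected into one script. This is only a sketch: the node host names, the /tmp target path, loading over ssh, and the DRY_RUN switch are assumptions added for illustration, and docker save -o / docker load -i are used as equivalents of the redirections shown above.

```shell
#!/bin/sh
# Sketch of the build / save / copy / load workflow for nodes without internet access.
# NODES, the /tmp target path, and DRY_RUN are assumptions for illustration.
IMAGE="spark_pi_test:v1.0"
TARBALL="spark_pi_test_v1.0.tar"
NODES="${NODES:-kube-node-1 kube-node-2 kube-node-3}"   # hypothetical host names
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = only print the commands

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

distribute_image() {
    # Run from the root of the Spark distribution so the Dockerfile path resolves.
    run docker build -t "$IMAGE" -f kubernetes/dockerfiles/spark/Dockerfile .
    run docker save -o "$TARBALL" "$IMAGE"
    for node in $NODES; do
        run scp "$TARBALL" "$node:/tmp/$TARBALL"
        run ssh "$node" docker load -i "/tmp/$TARBALL"
    done
}

distribute_image
```

With the default DRY_RUN=1 the script only prints what it would do. Setting DRY_RUN=0 executes the commands and assumes working ssh access to every node.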
For reference, the Dockerfile is the one at kubernetes/dockerfiles/spark/Dockerfile in the Spark distribution.

How to set spark.driver.extraClassPath through Apache Livy on Azure Spark cluster?

I would like to add some configuration when a Spark job is submitted via Apache Livy to an Azure cluster. Currently, to launch a Spark job via Apache Livy in the cluster, I use the following command:
curl -X POST --data '{"file": "/home/xxx/lib/MyJar.jar", "className": "org.springframework.boot.loader.JarLauncher"}' -H "Content-Type: application/json" localhost:8998/batches
This command generates the following process:
……. org.apache.spark.deploy.SparkSubmit --conf spark.master=yarn-cluster --conf spark.yarn.tags=livy-batch-51-qHXmHXWg --conf spark.yarn.submit.waitAppCompletion=false --class org.springframework.boot.loader.JarLauncher adl://home/home/xxx/lib/MyJar.jar
Due to a technical issue when running the jar, I need to introduce two configurations into this command:
--conf "spark.driver.extraClassPath=/home/xxx/lib/jars/*"
--conf "spark.executor.extraClassPath=/home/xxx/lib/jars/*"
This is related to a logback issue when running on Spark, which uses log4j2. The extra class path adds the logback jars.
I found here https://groups.google.com/a/cloudera.org/forum/#!topic/hue-user/fcRM3YiqAAA that this can be done by adding the conf to LIVY_SERVER_JAVA_OPTS or spark-defaults.conf.
From Ambari, I modified LIVY_SERVER_JAVA_OPTS in livy-env.sh (in the Spark2 & Livy menu) and Advanced spark2-defaults in Spark2.
Unfortunately, this does not work on our side, even though I can see that the LivyServer is launched with -Dspark.driver.extraClassPath.
Is there any specific configuration to add in Azure HDInsight to make this work?
Note that the process should look like this:
……. org.apache.spark.deploy.SparkSubmit --conf spark.master=yarn-cluster --conf spark.yarn.tags=livy-batch-51-qHXmHXWg --conf spark.yarn.submit.waitAppCompletion=false **--conf "spark.driver.extraClassPath=/home/xxx/lib/jars/*" --conf "spark.executor.extraClassPath=/home/xxx/lib/jars/*"**
--class org.springframework.boot.loader.JarLauncher adl://home/home/xxx/lib/MyJar.jar
Add the following to the JSON body of the batch request:
"conf":{ "spark.driver.extraClassPath":"wasbs:///pathtojar.jar","spark.yarn.user.classpath.first":"true"}

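For the Livy question above, the "conf" field sits next to "file" and "className" in the body posted to /batches. A minimal sketch of the full request, reusing the paths from the question rather than the wasbs path from the answer; whether those directories are visible to the driver and executors on the cluster is an assumption that needs checking:

```shell
#!/bin/sh
# Sketch: pass Spark settings through the "conf" field of a Livy batch request.
LIVY_URL="${LIVY_URL:-http://localhost:8998}"   # assumption: Livy reachable on this host and port

# Paths are taken from the question; they must exist on the cluster side.
PAYLOAD=$(cat <<'EOF'
{
  "file": "/home/xxx/lib/MyJar.jar",
  "className": "org.springframework.boot.loader.JarLauncher",
  "conf": {
    "spark.driver.extraClassPath": "/home/xxx/lib/jars/*",
    "spark.executor.extraClassPath": "/home/xxx/lib/jars/*"
  }
}
EOF
)

# Print the body so it can be inspected before sending.
echo "$PAYLOAD"
# To submit, uncomment:
# curl -X POST -H "Content-Type: application/json" --data "$PAYLOAD" "$LIVY_URL/batches"
```

Livy forwards each entry of "conf" to spark-submit as a --conf key=value pair, which is the process shape the question asks for.
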
Submit job to DCOS Spark with multiple instances?

I have two instances of Spark in my DCOS cluster. When I submit my job via the CLI:
dcos spark run --submit-args="\
--driver-cores 8 \
--driver-memory 16384M \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs://hdfs/history \
--class com.CalcPi \
<url to job -spark-test-assembly-0.0.5-SNAPSHOT.jar> 99000000"
the job is stuck in the queue forever. With only one instance, everything works fine. I have already tried
--deploy-mode cluster --supervise
The following config options are hopefully the answer you are looking for:
dcos config set spark.app_id spark-one
dcos spark run ...
dcos config set spark.app_id spark-two
dcos spark run ...
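The two-step pattern above can be wrapped in a small helper so the target instance is always set right before the submission. This is a sketch only: the helper name submit_to, the DRY_RUN switch, and the argument strings are made up for illustration, and it uses nothing beyond the two dcos commands shown in the answer.

```shell
#!/bin/sh
# Sketch: select a Spark instance via spark.app_id, then submit to it.
# DRY_RUN and the example arguments are assumptions for illustration.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

submit_to() {
    # $1 = value for spark.app_id (the instance to target, per the answer above)
    # $2 = the string passed to --submit-args
    run dcos config set spark.app_id "$1"
    run dcos spark run --submit-args="$2"
}

submit_to spark-one "--class com.CalcPi <url to jar> 99000000"
submit_to spark-two "--class com.CalcPi <url to jar> 99000000"
```

Because the CLI setting is global, two submissions running concurrently from the same machine could interfere with each other, so this helper is meant for sequential use.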
