I am using k8s to deploy Spark, invoked using spark-submit. We use Contour as our ingress class. I was wondering if there is a way to create an Ingress object for the spark-driver container to expose the Spark UI. I am trying to see if all of this can be done in a one-step process, maybe by using annotations or labels, something like hxquangnhat/kubernetes-auto-ingress, which uses an annotation to enable ingress.
All I want to do is use spark-submit to submit the Spark job and get the Spark UI exposed. Maybe create the Ingress using --conf, like:
--conf spark.kubernetes.driver.annotation.prometheus.io/scrape=true \
--conf spark.kubernetes.executor.annotation.prometheus.io/scrape=true \
--conf spark.kubernetes.driver.annotation.prometheus.io/port=XXXXX \
--conf spark.kubernetes.executor.annotation.prometheus.io/port=XXXXX \
Please let me know if you have any thoughts or have seen some examples like this.
You can simply create the Ingress right after the Spark driver is submitted. Simply provide a script, in a language of your choice, that uses a Kubernetes client library.
To have the Ingress deleted/garbage-collected automatically when the Spark driver pod is deleted, you can add a Kubernetes OwnerReference pointing to the driver pod.
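For example, a minimal bash/kubectl sketch (the pod, service, and host names are placeholders; recent Spark versions expose the UI port on the driver service Spark creates, otherwise point the Ingress at a Service you create yourself for the driver pod):

DRIVER_POD=my-spark-app-driver                       # placeholder: your driver pod name
DRIVER_UID=$(kubectl get pod "$DRIVER_POD" -o jsonpath='{.metadata.uid}')

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${DRIVER_POD}-ui
  ownerReferences:                  # GC'd automatically when the driver pod is deleted
  - apiVersion: v1
    kind: Pod
    name: ${DRIVER_POD}
    uid: ${DRIVER_UID}
spec:
  ingressClassName: contour
  rules:
  - host: spark-ui.example.com      # placeholder host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ${DRIVER_POD}-svc # the driver service created by Spark (verify the actual name)
            port:
              number: 4040          # Spark UI port, assuming defaults
EOF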
Also, you may want to refer to the Apache Livy project's Spark on Kubernetes support PR and the related Helm charts repo, which offer a way to solve Spark UI exposure as well as some other aspects of running Spark on Kubernetes.
Currently I'm running a Spark application on k8s, and I wish to scatter the Spark executor pods across different nodes as much as possible within the same application.
I notice that the executor pods are automatically given some labels.
I suppose this could be done by using podAffinity, but these labels are generated at runtime, like spark-app-name and spark-app-selector.
You can create a pod template file using --conf spark.kubernetes.executor.podTemplateFile=<your_file_here> as discussed in the docs and specify your pod affinities. I found a semi-example here, maybe this helps :)
There was a discussion about this on Spark's GitHub, and the conclusion was that using a pod template file is the preferred solution.
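A minimal sketch of that approach: add your own executor label via spark-submit so the affinity rule does not have to reference the runtime-generated labels (the label key/value scatter=myapp, the file path, and the elided submit options are made up for illustration):

cat > /tmp/executor-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: spark-kubernetes-executor    # Spark fills in the image, resources, env, etc.
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              scatter: myapp           # matches the label added via --conf below
EOF

spark-submit \
  ... \
  --conf spark.kubernetes.executor.label.scatter=myapp \
  --conf spark.kubernetes.executor.podTemplateFile=/tmp/executor-template.yaml \
  ...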
We created a sample PySpark job and ran the spark-submit command as follows on an EC2 instance:
sudo ./bin/spark-submit \
  --packages za.co.absa.spline.agent.spark:spark-3.1-spline-agent-bundle_2.12:0.6.1 \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.lineageDispatcher.http.producer.url=http://<ec2-host>:8080/producer" \
  --conf "spark.spline.lineageDispatcher=logging" \
  /home/ec2-user/spline-sandbox/mysparkjob.py
We are able to view the output in the console but are unable to view it in the Spline UI. What additional steps need to be done?
Also, through Docker, how can we embed the PySpark job on an EC2 instance?
You set spark.spline.lineageDispatcher=logging, which means that instead of sending the lineage to the Spline server, it is just written to the log. If you leave that out, or set spark.spline.lineageDispatcher=http (which is the default), the lineage data should be sent to spark.spline.lineageDispatcher.http.producer.url.
I would also recommend using the latest version of Spline, currently 0.7.12.
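For example, your original command with the logging dispatcher left out, so the default http dispatcher is used (<spline-server-host> and <agent-version> are placeholders to fill in):

sudo ./bin/spark-submit \
  --packages za.co.absa.spline.agent.spark:spark-3.1-spline-agent-bundle_2.12:<agent-version> \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.lineageDispatcher.http.producer.url=http://<spline-server-host>:8080/producer" \
  /home/ec2-user/spline-sandbox/mysparkjob.py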
You can see the documentation of dispatchers and other Spline agent features here:
https://github.com/AbsaOSS/spline-spark-agent/tree/develop
If you are using an older version, change the branch/tag to see the older version of the documentation.
I'm getting an error while trying to configure Spark with MongoDB on my EMR instance. Below is the command:
spark-shell \
  --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!#docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" \
  --conf "spark.mongodb.output.collection=ecommerceCluster" \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?
DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch, so you first need to install the CA certs on each instance. AWS has a guide for this under the Java section, with two bash scripts that make things easier.
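A simplified sketch of what those scripts do (run it on every node; the directory, bundle URL, and store password are examples to adapt):

mkdir -p /tmp/certs && cd /tmp/certs
wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
# split the bundle into individual certificates and import each one into a JKS truststore
awk 'split_after == 1 {n++; split_after = 0} /-----END CERTIFICATE-----/ {split_after = 1} {print > ("rds-ca-" n ".pem")}' < global-bundle.pem
for CERT in rds-ca-*.pem; do
  keytool -import -noprompt -alias "$CERT" -file "$CERT" \
    -keystore rds-truststore.jks -storepass <yourpassword>
done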
Once these certs are installed, your Spark command needs to reference the truststore and its password using configuration parameters you can pass to Spark. Here is an example that I ran, and it worked fine:
spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
  pytest.py
You can provide those same configuration options to spark-shell as well.
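For instance, something along these lines (the connection string values are placeholders patterned on your command):

spark-shell \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
  --conf "spark.mongodb.output.uri=mongodb://<user>:<password>@<docdb-cluster-endpoint>:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" \
  --conf "spark.mongodb.output.collection=ecommerceCluster" \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>"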
One thing I did find tricky was that the Mongo Spark connector doesn't appear to recognize the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark; the executors reference the keystore from the configuration anyway.
Apologies in advance, as I am new to Spark. I have created a Spark cluster in standalone mode with 4 workers, and after successfully configuring the worker properties, I wanted to know how to configure the master properties.
I am writing an application and connecting it to the cluster using SparkSession.builder (I do not want to submit it using spark-submit.)
I know that the workers can be configured in the conf/spark-env.sh file, which has parameters that can be set such as 'SPARK_WORKER_MEMORY' and 'SPARK_WORKER_CORES'.
My question is: How do I configure the properties for the master? Because there is no 'SPARK_MASTER_CORES' or 'SPARK_MASTER_MEMORY' in this file.
I thought about setting this in the spark-defaults.conf file; however, it seems that this is only used by spark-submit.
I thought about setting it in the application using SparkConf().set("spark.driver.cores", "XX"); however, this only specifies the number of cores for this application to use.
Any help would be greatly appreciated.
Thanks.
There are three ways of setting the configuration of the Spark master node (driver) and the Spark worker nodes. I will show examples of setting the memory of the master node; other settings can be found here.
1- Programmatically, through the SparkConf class.
Example:
new SparkConf().set("spark.driver.memory","8g")
2- Using spark-submit: make sure not to set the same configuration both in your code (programmatically, as in 1) and in spark-submit. If you have already configured a setting programmatically, any job configuration passed to spark-submit that overlaps with (1) will be ignored.
Example:
spark-submit --driver-memory 8g
3- Through spark-defaults.conf:
If none of the above is set, these settings will be the defaults.
Example:
spark.driver.memory 8g
I am trying to use Kafka (0.9.1) in secure mode. I want to read data with Spark, so I must pass the JAAS conf file to the JVM. I use this command to start my job:
/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.conf=kafka_client_jaas.conf" \
--files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
--class kafka.ConsumerSasl ./kafka.jar --topics test
I still have the same error:
Caused by: java.lang.IllegalArgumentException: You must pass java.security.auth.login.config in secure mode.
at org.apache.kafka.common.security.kerberos.Login.login(Login.java:289)
at org.apache.kafka.common.security.kerberos.Login.<init>(Login.java:104)
at org.apache.kafka.common.security.kerberos.LoginManager.<init>(LoginManager.java:44)
at org.apache.kafka.common.security.kerberos.LoginManager.acquireLoginManager(LoginManager.java:85)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:55)
I think Spark does not inject the parameter -Djava.security.auth.login.conf into the JVM!
The main cause of this issue is that you have used the wrong property name: it should be -Djava.security.auth.login.config, not -Djava.security.auth.login.conf. Moreover, if you are using a keytab file, make sure to make it available to all executors using the --files argument of spark-submit. If you are using a Kerberos ticket, make sure to set KRB5CCNAME on all executors using the property SPARK_YARN_USER_ENV.
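For example, your submit command with the corrected property name would look something like this (depending on your setup, the driver may also need the same system property, pointing at a path that exists locally):

/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
  --files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./conf/kafka_client_jaas.conf" \
  --class kafka.ConsumerSasl ./kafka.jar --topics test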
If you are using an older version of Spark, 1.6.x or earlier, there are some known issues with Spark such that this integration will not work, and you have to write a custom receiver.
For newer versions of Spark, you can see the configuration here.
In case you need to create a custom receiver, you can see this.