Unable to view the PySpark job in Spline using an EC2 instance - apache-spark

We created a sample PySpark job and submitted it on an EC2 instance with the following spark-submit command:
sudo ./bin/spark-submit --packages za.co.absa.spline.agent.spark:spark-3.1-spline-agent-bundle_2.12:0.6.1 --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" --conf "spark.spline.lineageDispatcher.http.producer.url=http://[ec2-host]:8080/producer" --conf "spark.spline.lineageDispatcher=logging" /home/ec2-user/spline-sandbox/mysparkjob.py
We are able to view the output in the console, but we cannot see the job in the Spline UI. What additional steps need to be done?
Also, how can we run the PySpark job through Docker on an EC2 instance?

You set spark.spline.lineageDispatcher=logging, which means that instead of sending the lineage to the Spline server it is just written to the log. If you leave that setting out or set spark.spline.lineageDispatcher=http (which is the default), the lineage data should be sent to spark.spline.lineageDispatcher.http.producer.url.
I would also recommend using the latest version of Spline, currently 0.7.12.
You can see documentation of the dispatchers and other Spline Agent features here:
https://github.com/AbsaOSS/spline-spark-agent/tree/develop
If you are using an older version, change the branch/tag to see the matching version of the documentation.
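For reference, here is a minimal sketch of a PySpark job wired up to send lineage to the Spline Producer API over HTTP. The host name, output path, and the 0.7.12 bundle coordinates are placeholders/assumptions; pick the agent bundle that matches your Spark and Scala versions:

# mysparkjob.py -- minimal sketch; host and output path are placeholders
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spline-lineage-example")
    # Pull the Spline agent bundle if it was not already supplied via --packages.
    .config("spark.jars.packages",
            "za.co.absa.spline.agent.spark:spark-3.1-spline-agent-bundle_2.12:0.7.12")
    .config("spark.sql.queryExecutionListeners",
            "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
    # "http" is the default dispatcher; set it explicitly instead of "logging".
    .config("spark.spline.lineageDispatcher", "http")
    .config("spark.spline.lineageDispatcher.http.producer.url",
            "http://<spline-server-host>:8080/producer")  # placeholder host
    .getOrCreate()
)

# Lineage is captured on write actions, so the job must actually persist data.
df = spark.range(10).withColumnRenamed("id", "value")
df.write.mode("overwrite").csv("/tmp/spline-demo-output")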

Related

Presto on (Py)Spark - setup

In my job as a data engineer, most of my data logic goes through:
Presto queries (small to medium data sets and calculations; easiest for working with analysts)
Spark (the big guns, when the calculations are pretty heavy), mostly in a Python environment.
There are several scenarios where I would prefer to work smarter with Presto (e.g. its data sketch functions) rather than utilize Spark.
The scenario I'm looking for is to integrate the Presto framework with PySpark, and I'm failing to set it up correctly at the session level.
I tried to convert the spark-submit example in the Presto docs to a PySpark session builder, without any luck:
/spark/bin/spark-submit
--master spark://spark-master:7077
--executor-cores 4
--conf spark.task.cpus=4
--class com.facebook.presto.spark.launcher.PrestoSparkLauncher
presto-spark-launcher-0.271.jar
--package presto-spark-package-0.271.tar.gz
--config /presto/etc/config.properties
--catalogs /presto/etc/catalogs
--catalog hive
--schema default
--file query.sql
I loaded the JSON configs successfully (config snapshot), but when I try to run spark.sql with Presto syntax I still fail.
What am I missing?
I was following this documentation:
https://prestodb.io/docs/current/installation/spark.html
This official GitHub issue:
https://github.com/prestodb/presto/issues/13856
And I tried to find more information across the web (there isn't much):
https://medium.com/@ravishankar.nair/the-ultimate-duo-in-distributed-computing-prestodb-running-on-spark-b63d0e567eeb

How to connect DocumentDB to a Spark application in an EMR instance

I'm getting an error while trying to configure Spark with MongoDB in my EMR instance. Below is the command -
spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!@docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" --conf "spark.mongodb.output.collection=ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?
DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch, so you first need to install the CA certs on each instance. AWS has a guide for this under the Java section, with two bash scripts that make things easier.
Once these certs are installed, your Spark command needs to reference the truststore and its password using the configuration parameters you can pass to Spark. Here is an example that I ran, and it worked fine.
spark-submit \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
pytest.py
You can provide these same configuration options to spark-shell as well.
One thing I did find tricky was that the Mongo Spark connector doesn't appear to know the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark, since the executors reference the keystore through the configuration anyway.
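For completeness, here is a minimal sketch of what a job like pytest.py could look like with the MongoDB Spark connector 2.x; the credentials, cluster endpoint, database and collection names are placeholders:

# pytest.py -- illustrative sketch; credentials, endpoint, database and
# collection names are placeholders
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("documentdb-write-example")
    # Connection string without ssl_ca_certs; the truststore is supplied via
    # spark.executor.extraJavaOptions on the spark-submit command line above.
    .config("spark.mongodb.output.uri",
            "mongodb://<user>:<password>@<docdb-cluster-endpoint>:27017/"
            "?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false")
    .config("spark.mongodb.output.database", "<database>")
    .config("spark.mongodb.output.collection", "ecommerceCluster")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "item-a"), (2, "item-b")], ["id", "name"])

# Write through the MongoDB Spark connector (2.x data source).
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()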

Auto ingress with contour

I am using K8s to deploy Spark, invoked via spark-submit. We use Contour as our ingress class. I was wondering if there is a way to create an Ingress object for the Spark driver container to expose the Spark UI. I am trying to see if all of this can be done in a one-step process, maybe by using annotations or labels, something like hxquangnhat/kubernetes-auto-ingress, which uses an annotation to enable ingress.
All I want to do is use spark-submit to submit the Spark job and get the Spark UI exposed, maybe by creating the ingress with --conf, like:
--conf spark.kubernetes.driver.annotation.prometheus.io/scrape=true \
--conf spark.kubernetes.executor.annotation.prometheus.io/scrape=true \
--conf spark.kubernetes.driver.annotation.prometheus.io/port=XXXXX \
--conf spark.kubernetes.executor.annotation.prometheus.io/port=XXXXX \
Please let me know if you have any thoughts or have seen some examples like this.
You can simply create the Ingress right after Spark driver submission: provide a script, in a language of your choice, that uses a Kubernetes client library (see the sketch below).
To get automatic Ingress deletion/GC when the Spark driver pod is deleted, you can add a Kubernetes OwnerReference pointing to the driver pod.
You may also want to refer to the Apache Livy project's Spark on Kubernetes support PR and the related Helm charts repo, which offer a way to solve Spark UI exposure as well as some other aspects of running Spark on Kubernetes.
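A minimal sketch of such a script using the official Python Kubernetes client; the namespace, driver pod name, driver UI service name, and host are placeholders you would substitute for your deployment (it assumes a service fronting the driver UI on port 4040):

# create_spark_ui_ingress.py -- illustrative sketch; namespace, pod, service
# and host names are placeholders for your deployment
from kubernetes import client, config

NAMESPACE = "spark"
DRIVER_POD = "my-spark-app-driver"        # driver pod created by spark-submit
DRIVER_UI_SVC = "my-spark-app-ui-svc"     # service exposing the UI on port 4040

config.load_kube_config()  # use load_incluster_config() when running in-cluster

core = client.CoreV1Api()
networking = client.NetworkingV1Api()

# Owner reference so the Ingress is garbage-collected with the driver pod.
driver = core.read_namespaced_pod(name=DRIVER_POD, namespace=NAMESPACE)
owner = client.V1OwnerReference(
    api_version="v1", kind="Pod",
    name=driver.metadata.name, uid=driver.metadata.uid,
)

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(
        name=f"{DRIVER_POD}-ui",
        namespace=NAMESPACE,
        annotations={"kubernetes.io/ingress.class": "contour"},
        owner_references=[owner],
    ),
    spec=client.V1IngressSpec(
        rules=[client.V1IngressRule(
            host="spark-ui.example.com",
            http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
                path="/", path_type="Prefix",
                backend=client.V1IngressBackend(
                    service=client.V1IngressServiceBackend(
                        name=DRIVER_UI_SVC,
                        port=client.V1ServiceBackendPort(number=4040),
                    )
                ),
            )]),
        )]
    ),
)

networking.create_namespaced_ingress(namespace=NAMESPACE, body=ingress)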

spark-submit on IBM Bluemix

I've just registered a free instance of "Analytics for Apache Spark" and followed this tutorial to use spark-submit.sh, the ad-hoc script provided by IBM to run an app from my local machine on the Bluemix cloud cluster. The issue is the following: I did everything described in the tutorial and launched this script
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster --master
https://spark.eu-gb.bluemix.net --files /home/vito/vinorosso2.csv
--conf spark.service.spark_version=2.2.0
/home/vito/workspace_2/sbt-esempi/target/scala-2.11/isolationF3.jar
--class progettoSisDis.MasterNode
Everything proceeds fine (the dataset vinorosso2.csv and my fat JAR are correctly uploaded) until the terminal says "submission complete". At that point the log file created by the script contains this error message:
Submit job result: Invalid plan and spark version combination in HTTP request (ibm.SparkService.PayGoPersonal, 2.0.0)
Submission ID:
ERROR: Problem submitting job. Exit
So, wasn't it enough to register a free instance of Analytics for Apache Spark to submit a Spark job? Hope someone can help. By the way, if it helps, on my local machine I'm using Spark with IntelliJ IDEA (Scala). Bye!
From https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic2.html#using_spark-submit you need to be using Spark version 1.6.x or 2.0.x. Your submit job is set to version 2.2.0. Try submitting using spark.service.spark_version=2.0.0 (assuming your code will work with this version of Spark).

How to submit a job (JAR) to an Azure Spark cluster through the command-line interface?

I am new to HDInsight Spark; I am trying to run a use case to learn how things work in an Azure Spark cluster. This is what I have done so far:
1) Created an Azure Spark cluster.
2) Created a JAR by following the steps described in the link: create a standalone Scala application to run on an HDInsight Spark cluster. I used the same Scala code as given in the link.
3) SSH'd into the head node.
4) Uploaded the JAR to blob storage using the link: using the Azure CLI with Azure Storage.
5) Copied the JAR to the machine with hadoop fs -copyToLocal.
I have checked that the JAR gets uploaded to the head node (machine). I want to run that JAR and get the results as stated in the link given in point 2 above.
What will be the next step? How can I submit the Spark job and get the results using the command-line interface?
For example, say you created the JAR for your program as submit.jar. In order to submit it to your cluster along with a dependency, you can use the syntax below.
spark-submit --master yarn --deploy-mode cluster --packages "com.microsoft.azure:azure-eventhubs-spark_2.11:2.2.5" --class com.ex.abc.MainMethod "wasb://space-hdfs@yourblob.blob.core.windows.net/xx/xx/submit.jar" "param1.json" "param2"
Here --packages is used to include a dependency of your program; you can also use the --jars option followed by a jar path: --jars "path/to/dependency/abc.jar"
--class : the main class of your program
After that, specify the path to your program JAR.
You can pass parameters after it if needed, as shown above.
A couple of options for submitting Spark JARs:
1) If you want to submit the job on the head node directly, you can use spark-submit.
See the Apache spark-submit documentation.
2) An easier alternative is to submit the Spark JAR via Livy after uploading the JAR to WASB storage.
See the submit-via-Livy doc, and the sketch below. You can skip step 5 if you do it this way.
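A minimal sketch of the Livy route in Python using the requests library; the cluster name, credentials, and WASB path are placeholders (HDInsight exposes Livy at https://<cluster>.azurehdinsight.net/livy and authenticates with the cluster login):

# submit_via_livy.py -- illustrative sketch; cluster name, credentials and the
# WASB path are placeholders
import json
import requests

LIVY_URL = "https://<your-cluster>.azurehdinsight.net/livy/batches"

payload = {
    "file": "wasb://space-hdfs@yourblob.blob.core.windows.net/xx/xx/submit.jar",
    "className": "com.ex.abc.MainMethod",
    "args": ["param1.json", "param2"],
}

resp = requests.post(
    LIVY_URL,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json", "X-Requested-By": "admin"},
    auth=("admin", "<cluster-login-password>"),  # HDInsight cluster login
)

# The response contains the Livy batch id and state, which you can poll later.
print(resp.status_code, resp.json())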
