Presto on (Py)Spark - setup - apache-spark

In my job as a data engineer, most of my data logic goes through:
Presto queries (small to medium data sets and calculations; easiest for working with analysts)
Spark (the big guns, when the calculations are pretty heavy), mostly in a Python environment.
There are, however, several scenarios where I'd prefer to work smarter with Presto (e.g. its data sketch functions) while still utilizing Spark.
What I'm looking for is to integrate the Presto framework with PySpark, and I'm failing to set it up correctly at the session level.
I tried to convert the spark-submit example in the Presto docs to a PySpark session builder, without any luck:
/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --executor-cores 4 \
  --conf spark.task.cpus=4 \
  --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
  presto-spark-launcher-0.271.jar \
  --package presto-spark-package-0.271.tar.gz \
  --config /presto/etc/config.properties \
  --catalogs /presto/etc/catalogs \
  --catalog hive \
  --schema default \
  --file query.sql
I loaded the JSON configs successfully (config screenshot), but when trying to run spark.sql with Presto syntax it still fails.
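Roughly, this is the kind of session-level conversion I attempted (only a sketch, reusing the jar/package/config paths from the spark-submit example above; the table in the query is made up for illustration), and the spark.sql call is where it fails:

from pyspark.sql import SparkSession

# Sketch of the attempted conversion: ship the Presto-on-Spark launcher jar,
# package and config with the session, using the same paths as the
# spark-submit example above.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.executor.cores", "4")
    .config("spark.task.cpus", "4")
    .config("spark.jars", "presto-spark-launcher-0.271.jar")
    .config("spark.files",
            "presto-spark-package-0.271.tar.gz,/presto/etc/config.properties")
    .getOrCreate()
)

# This is the part that fails: spark.sql only speaks Spark SQL, so
# Presto-specific syntax (e.g. the data sketch function approx_set)
# is not recognized here.
spark.sql("SELECT approx_set(user_id) FROM events")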
What am I missing?
I was following this documentation:
https://prestodb.io/docs/current/installation/spark.html
And this official GitHub issue:
https://github.com/prestodb/presto/issues/13856
I also tried to find more information across the web (there isn't much):
https://medium.com/@ravishankar.nair/the-ultimate-duo-in-distributed-computing-prestodb-running-on-spark-b63d0e567eeb

Related

Unable to view the pyspark job in Spline using ec2 instance

We created a sample PySpark job and ran spark-submit on the EC2 instance as follows:
sudo ./bin/spark-submit \
  --packages za.co.absa.spline.agent.spark:spark-3.1-spline-agent-bundle_2.12:0.6.1 \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.lineageDispatcher.http.producer.url=http://<ec2-host>:8080/producer" \
  --conf "spark.spline.lineageDispatcher=logging" \
  /home/ec2-user/spline-sandbox/mysparkjob.py
We are able to view the output in the console, but not in the Spline UI. What additional steps need to be done?
Also, how can we embed the PySpark job on an EC2 instance through Docker?
You set spark.spline.lineageDispatcher=logging, which means that instead of being sent to the Spline server, the lineage is just written to the log. If you leave that out, or set spark.spline.lineageDispatcher=http (which is the default), the lineage data should be sent to spark.spline.lineageDispatcher.http.producer.url.
I would also recommend using the latest version of Spline, currently 0.7.12.
You can see the documentation of dispatchers and other Spline agent features here:
https://github.com/AbsaOSS/spline-spark-agent/tree/develop
If you are using an older version, change the branch/tag to see the matching version of the documentation.
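For reference, the same settings can also be supplied from PySpark when building the session; this is only a sketch, with the producer host as a placeholder and the agent bundle taken from the question (the recommendation above is to upgrade it to 0.7.12):

from pyspark.sql import SparkSession

# Assumes the Spline agent bundle is on the classpath, e.g. submitted with
# --packages za.co.absa.spline.agent.spark:spark-3.1-spline-agent-bundle_2.12:0.6.1
spark = (
    SparkSession.builder
    .config("spark.sql.queryExecutionListeners",
            "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
    # "http" is the default dispatcher; set explicitly here only for clarity.
    .config("spark.spline.lineageDispatcher", "http")
    # Placeholder host: point this at your Spline producer endpoint.
    .config("spark.spline.lineageDispatcher.http.producer.url",
            "http://<spline-host>:8080/producer")
    .getOrCreate()
)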

Spark shell for Databricks

Notebooks are nice, but a REPL is sometimes more useful. Is there some way to run a spark-shell that executes on Databricks? Something like:
spark-shell --master https://adb-5022.2.azuredatabricks.net/
I looked through available tools related to Databricks (databricks connect, dbx, ...), but it seems there's no such functionality.
Databricks Connect is the tool you need if you want to execute code from your local machine on a Databricks cluster. As with spark-shell, the driver will be on your local machine and the executors are remote. The databricks-connect package installs a modified distribution of Apache Spark, so you can use spark-shell, pyspark, spark-submit, etc. - just make sure its bin directory is on your PATH.
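For example, once databricks-connect has been installed and configured (databricks-connect configure), a plain PySpark session in local code is executed against the remote cluster; this is just a sketch of that flow:

from pyspark.sql import SparkSession

# Assumes `pip install databricks-connect` and `databricks-connect configure`
# have already been run, so the local Spark distribution points at your
# Databricks workspace/cluster.
spark = SparkSession.builder.getOrCreate()

# Driver-side Python runs locally; the count below is executed by the
# remote Databricks cluster.
print(spark.range(100).count())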
P.S. I really don't understand why notebooks don't work for you - spark-shell doesn't have any features superior to them.

How to connect documentdb to a spark application in an emr instance

I'm getting an error while trying to configure Spark with MongoDB on my EMR instance. Below is the command:
spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!#docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" "spark.mongodb.output.collection="ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?
DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch. So you first need to install the CA certs on each instance; AWS has a guide for this under the Java section, with two bash scripts that make things easier.
Once these certs are installed, your Spark command needs to reference the truststore and its password using configuration parameters you can pass to Spark. Here is an example that I ran, and it worked fine:
spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
  pytest.py
You can provide the same configuration options to spark-shell as well.
One thing I did find tricky was that the Mongo Spark connector doesn't appear to know the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark, as the Spark executors reference the keystore from the configuration anyway.
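Putting it together in PySpark (only a sketch: the endpoint, credentials and truststore password are placeholders, the URI options and collection name come from the question, and the truststore path from the example above):

from pyspark.sql import SparkSession

# Sketch combining the connector package, the output URI/collection from the
# question, and the truststore options from the spark-submit example above.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.4.3")
    .config("spark.mongodb.output.uri",
            "mongodb://<user>:<password>@<docdb-endpoint>:27017/"
            "?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false")
    .config("spark.mongodb.output.collection", "ecommerceCluster")
    .config("spark.executor.extraJavaOptions",
            "-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks "
            "-Djavax.net.ssl.trustStorePassword=<yourpassword>")
    .getOrCreate()
)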

spark-submit on ibm bluemix

I've just registered a free instance of "Analytics for Apache Spark" and followed this tutorial to use spark-submit.sh, the ad hoc script provided by IBM, to run an app from my local machine on the Bluemix cloud cluster. The issue is the following: I did everything described in the tutorial and launched this script:
./spark-submit.sh \
  --vcap ./vcap.json \
  --deploy-mode cluster \
  --master https://spark.eu-gb.bluemix.net \
  --files /home/vito/vinorosso2.csv \
  --conf spark.service.spark_version=2.2.0 \
  /home/vito/workspace_2/sbt-esempi/target/scala-2.11/isolationF3.jar \
  --class progettoSisDis.MasterNode
Everything proceeds fine (the dataset vinorosso2.csv and my fat JAR are correctly uploaded) until the terminal says "submission complete". At that point, the log file created by the script contains this error message:
Submit job result: Invalid plan and spark version combination in HTTP request (ibm.SparkService.PayGoPersonal, 2.0.0)
Submission ID:
ERROR: Problem submitting job. Exit
So, it wasn't enough to register a free instance of Analytics for Apache Spark to submit a Spark job? Hope someone can help. By the way, if it helps, on my local machine I'm using Spark with IntelliJ IDEA (Scala). Bye!
From https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic2.html#using_spark-submit you need to be using Spark version 1.6.x or 2.0.x. Your submit job is set to version 2.2.0. Try submitting using spark.service.spark_version=2.0.0 (assuming your code will work with this version of Spark).

Spark pyspark vs spark-submit

The documentation on spark-submit says the following:
The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
Regarding the pyspark it says the following:
You can also use bin/pyspark to launch an interactive Python shell.
This question may sound stupid, but when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?
There is no practical difference between the two. If not configured otherwise, both will execute code in local mode. If a master is configured (either with the --master command line parameter or the spark.master configuration), the corresponding cluster will be used to execute the program.
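For instance, you can check which master an interactive session actually picked up; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prints 'local[*]' when no master is configured, or e.g. 'yarn' /
# 'spark://host:7077' when a cluster master was set via --master or
# spark.master.
print(spark.sparkContext.master)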
If you are using EMR, there are three ways to run an application:
using pyspark (or spark-shell)
using spark-submit without --master and --deploy-mode
using spark-submit with --master and --deploy-mode
Although all three run the application on the Spark cluster, there is a difference in how the driver program works: in the 1st and 2nd the driver runs in client mode, whereas in the 3rd the driver also runs in the cluster. In the 1st and 2nd you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel (see the snippet below for checking the deploy mode from inside a job).
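A small sketch for checking, from inside a running PySpark job, which deploy mode was actually used (assuming spark-submit set the property; the fallback value here is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.submit.deployMode is set by spark-submit to 'client' or 'cluster';
# fall back to 'client' if it was not set.
deploy_mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "client")
print(deploy_mode)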
Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):
..when I am running the commands through pyspark they also run on the "cluster", right? They do not run on the master node only, right?
As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.
As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but it runs on a different machine from the one you call spark-submit from.
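A rough illustration of that split (the DataFrame here is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain Python: runs only in the driver process.
local_total = sum(range(10))

# DataFrame operations: planned on the driver, but the actual work is
# executed by the executors on the cluster when the action (count) runs.
df = spark.range(1_000_000)
cluster_count = df.filter(df.id % 2 == 0).count()

print(local_total, cluster_count)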
PySpark, compared to Spark in Scala and Java, has some significant differences; for scheduling on a cluster, Python Spark only supports YARN.
If you are running Python Spark on a local machine, you can use pyspark. If on a cluster, use spark-submit.
If you have any dependencies in your Python Spark job, you need to package them into a zip file for submission.
