Does pyspark need a local Spark installation? - python-3.x

I'm trying to get going with Spark: creating a simple SQL connection to a database while running Spark in a Docker container.
I do not have Spark installed on my laptop, only inside my Docker container.
I have the following code on my laptop:
from pyspark.sql import SparkSession

# "spark://localhost:7077" is the Docker container running the master and worker
spark = SparkSession \
    .builder \
    .master("spark://localhost:7077") \
    .appName("sparktest") \
    .getOrCreate()
jdbcDF = spark.read.format("jdbc") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", "jdbc:sqlserver://xxx") \
.option("dbtable", "xxx") \
.option("user", "xxx") \
.option("password", "xxx").load()
I can't get it to work.
I either get java.sql.SQLException: No suitable driver or a ClassNotFoundException from Java.
I've moved the files to the container and everything seems fine over there.
I've made sure the mssql jar files are on the SPARK_CLASSPATH on both the driver and the executor.
Am I supposed to have Spark installed locally in order to use PySpark against the remote master running in my Docker container?
It looks like it's trying to find the SQL driver on my laptop?
Everything is fine if I run the code using spark-submit from inside the Docker container.
I was trying to avoid hosting Jupyter inside the Docker container, and was hoping not to have to install Spark on my Windows laptop, keeping it in my Linux container instead.

I faced this before; as a solution you can download the JDBC driver and set the driver configuration manually, giving the JDBC driver path:
from pyspark import SparkConf
conf = SparkConf()
conf.set('spark.jars', '/PATH_OF_DRIVER/driver.jar')
conf.set('spark.executor.extraClassPath', '/PATH_OF_DRIVER/driver.jar')
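As a minimal sketch (the driver path is a placeholder, and passing the SparkConf into the builder via config(conf=...) is one way to apply it), this could look like:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.jars", "/PATH_OF_DRIVER/driver.jar")                     # placeholder path to the JDBC driver jar
conf.set("spark.executor.extraClassPath", "/PATH_OF_DRIVER/driver.jar")  # make it visible to the executors too

spark = SparkSession.builder \
    .master("spark://localhost:7077") \
    .appName("sparktest") \
    .config(conf=conf) \
    .getOrCreate()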

Related

Write to local Hive metastore instead of AWS Glue Data Catalog when developing an AWS Glue job locally

I'm trying to create a local development environment for writing Glue jobs and have followed https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html to use the amazon/aws-glue-libs:glue_libs_3.0.0_image_01 Docker image.
However, in my Glue code I also want to pull data from S3 and create a database in a metastore with Spark SQL, e.g.
spark.sql(f'CREATE DATABASE IF NOT EXISTS {database_name}')
I have managed to use a local version of AWS by using localstack, and by configuring Hadoop to use my local AWS endpoint:
spark-submit --conf spark.hadoop.fs.s3a.endpoint=localstack:4566 \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
--conf spark.hadoop.fs.s3a.access.key=bar \
--conf spark.hadoop.fs.s3a.secret.key=foo \
--conf spark.hadoop.fs.s3a.path.style.access=true
However, when calling the above Spark SQL command I get an error, because it tries to use the real AWS Glue Data Catalog as a metastore:
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AWSGlueException: The security token included in the request is invalid. (Service: AWSGlue; Status Code: 400
I have tried to configure Spark to use a local metastore when initialising the Spark context, but it still tries to use Glue and I get the above error from AWS:
spark = (SparkSession.builder.appName(f"{task}")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.warehouse.dir", "/temp")
    .enableHiveSupport()
    .getOrCreate())
The main issue was that the aws-glue-libs image contained a hive-site.xml which referenced Amazon's Hive metastore. To get this to work I removed it in a Dockerfile step, and specified the full path to the local Hive store in the configuration when running spark-submit.
My Dockerfile
FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01
RUN rm /home/glue_user/spark/conf/hive-site.xml
And the conf in spark-submit
--conf spark.sql.warehouse.dir=/home/glue_user/workspace/temp/db \
--conf 'spark.driver.extraJavaOptions=-Dderby.system.home=/home/glue_user/workspace/temp/hive' \
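For reference, a hedged PySpark sketch of the same settings applied in the session builder (the paths are the ones from the answer above; the app name and database name are made up, and spark.driver.extraJavaOptions is usually safer to pass via spark-submit as shown, since the driver JVM may already be running by the time the builder is evaluated):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("local-glue-dev")  # placeholder app name
    .config("spark.sql.warehouse.dir", "/home/glue_user/workspace/temp/db")
    .config("spark.driver.extraJavaOptions",
            "-Dderby.system.home=/home/glue_user/workspace/temp/hive")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS dev_db")  # dev_db is a placeholder database name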

spark.driver.extraClassPath doesn't work in a virtual PySpark environment

I'm saving data to a Postgres database and the job failed with the following:
py4j.protocol.Py4JJavaError: An error occurred while calling
o186.jdbc. : java.lang.ClassNotFoundException: org.postgresql.Driver
This happened until I downloaded the Postgres jar into the spark/jars folder, back when I had Spark installed globally.
I have since moved to a new machine and instead only installed pyspark in a virtual environment (venv) via pip.
I tried setting the extraClassPath config value to the jars folder inside the virtual environment, but that didn't work:
session = SparkSession \
.builder \
.config("spark.driver.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.getOrCreate()
I have tried relative and absolute paths as well as a wildcard (*) and the full filename. Nothing seems to work.
Setting spark.jars.packages did correctly load the package from Maven, however:
.config('spark.jars.packages', 'org.postgresql:postgresql:42.2.6') \
How can I make the extraClassPath work?
You will also need to add the jar to the executor classpath:
session = SparkSession \
.builder \
.config("spark.driver.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.config("spark.executor.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.getOrCreate()
EDIT:
To semantically replicate spark.jars.packages you can use spark.jars with an absolute path to the jar file. Also, just to be sure, check your jar and confirm it has a proper MANIFEST for the driver.
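A short sketch of that spark.jars variant, reusing the venv jar path from the question as the absolute path:
from pyspark.sql import SparkSession

jar = "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar"

session = (SparkSession.builder
    .config("spark.jars", jar)                     # ship the jar itself
    .config("spark.driver.extraClassPath", jar)    # driver classpath
    .config("spark.executor.extraClassPath", jar)  # executor classpath
    .getOrCreate())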

Spark-Cassandra-Connector does not work with spark-submit

I am using the spark-cassandra-connector to connect to Cassandra from Spark.
I am able to connect successfully through Livy using the command below:
curl -X POST --data '{"file": "/my/path/test.py", "conf" : {"spark.jars.packages": "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0", "spark.cassandra.connection.host":"myip"}}' -H "Content-Type: application/json" localhost:8998/batches
I am also able to connect interactively through the pyspark shell using the command below:
sudo pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:2.0.10 --conf spark.cassandra.connection.host=myip
However, I am not able to connect through spark-submit. Some of the commands I have tried are below.
spark-submit test.py --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 --conf spark.cassandra.connection.host=myip
This one didn't work.
I tried passing these parameters in the Python files used for spark-submit; that still didn't work:
conf = (SparkConf().setAppName("Spark-Cassandracube")
        .set("spark.cassandra.connection.host", "myip")
        .set("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
I also tried passing these parameters using a Jupyter notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host="myip" pyspark-shell'
All the threads I have seen so far talk about the spark-cassandra-connector with spark-shell, but not much about spark-submit.
Versions used:
Livy : 0.5.0
Spark : 2.4.0
Cassandra : 3.11.4
Not tested, but the most probable cause is that you're specifying all the options:
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
--conf spark.cassandra.connection.host=myip
after the name of your script, test.py - in this case, spark-submit treats them as parameters for the script itself, not for spark-submit. Try moving the script name after the options...
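For example, the same command from the question with the options moved before the script name:
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 --conf spark.cassandra.connection.host=myip test.py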
P.S. See the Spark documentation for more details...

Submit an application to a standalone spark cluster running in GCP from Python notebook

I am trying to submit a Spark application to a standalone Spark (2.1.1) cluster of 3 VMs running in GCP from my Python 3 notebook (running on my local laptop), but for some reason the Spark session throws the error "StandaloneAppClient$ClientEndpoint: Failed to connect to master sparkmaster:7077".
Environment details: IPython and the Spark master run on one GCP VM called "sparkmaster". 3 additional GCP VMs run the Spark workers and Cassandra clusters. I connect from my local laptop (MBP) using Chrome to the IPython notebook on the "sparkmaster" GCP VM.
Please note that it works from the terminal:
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --master spark://sparkmaster:7077 ex.py 1000
Running it from Python Notebook:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.master("spark://sparkmaster:7077").appName('somatic').getOrCreate()  # This step works if I use .master('local')
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
.option("subscribe", "gene") \
.load()
So far I have tried these:
I have tried changing the Spark master node's spark-defaults.conf and spark-env.sh to add SPARK_MASTER_IP.
Tried to find the STANDALONE_SPARK_MASTER_HOST=`hostname -f` setting so that I can remove the "-f". For some reason my Spark master UI shows FQDN:7077, not hostname:7077.
Passed the FQDN as a param to .master() and os.environ["PYSPARK_SUBMIT_ARGS"].
Please let me know if you need more details.
After doing some more research I was able to resolve the issue. It was due to a simple environment variable called SPARK_HOME. In my case it was pointing to Conda's /bin (pyspark was running from that location), whereas my Spark setup was in a different path. The simple fix was to add
export SPARK_HOME="/home/<<your location path>>/spark/"
to the .bashrc file (I want this attached to my profile, not to the Spark session).
How I have done it:
Step 1: ssh to the master node; in my case it was the same as the IPython kernel/server VM in GCP.
Step 2:
cd ~
sudo nano .bashrc
Scroll down to the last line and paste the line below:
export SPARK_HOME="/home/your/path/to/spark-2.1.1-bin-hadoop2.7/"
Press Ctrl+X, then Y, then Enter to save the changes.
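To sanity-check which installation PySpark will pick up, you can print the variable from a fresh shell or notebook (this is just a check, nothing GCP-specific):
import os
print(os.environ.get("SPARK_HOME"))  # should be the standalone Spark path, not Conda's bin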
Note: I have also added a few more details to the environment section for clarity.

How to get SparkContext if Spark runs on Yarn?

We have a program based on Spark standalone, and in this program we use SparkContext and SqlContext to do lots of queries.
Now we want to deploy the system on a Spark cluster which runs on YARN. But when we change spark.master to yarn-cluster, the application throws an exception saying this works only with spark-submit. When we switch to yarn-client, although it no longer throws exceptions, it doesn't work properly.
It seems that if it runs on YARN, we can no longer use SparkContext directly; instead we should use something like yarn.Client, but then we don't know how to change our code to achieve what we did before with SparkContext and SqlContext.
Is there a good way to solve this? Can we get a SparkContext from yarn.Client, or should we change our code to use the new interfaces of yarn.Client?
Thank you!
When you run on a cluster, you need to do a spark-submit like this:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
--master will be yarn
--deploy-mode will be cluster
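A worked instance of the template above with those two values filled in (the class name and jar are placeholders):
./bin/spark-submit \
  --class com.example.Main \
  --master yarn \
  --deploy-mode cluster \
  my-app.jar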
In your application, if you have something like setMaster("local[*]"), you can remove it and build the code. When you do spark-submit with --master yarn, YARN will launch containers for you instead of the Spark standalone scheduler.
Your application code can look like this, without setting a master:
val conf = new SparkConf().setAppName("App Name")
val sc = new SparkContext(conf)
YARN deploy mode client is used when you want to launch the driver on the same machine the code is running from. On a cluster, the deploy mode should be cluster; this makes sure the driver is launched on one of the worker nodes by YARN.