How do I connect Spark to a JDBC driver in Zeppelin?

I am trying to pull in data from a SQL server to a Hive table using Spark in a Zeppelin notebook.
I am trying to run the following code:
%pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import *
spark = SparkSession.builder \
.appName('sample') \
.getOrCreate()
#set url, table, etc.
df = spark.read.format('jdbc') \
.option('url', url) \
.option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
.option('dbtable', table) \
.option('user', user) \
.option('password', password) \
.load()
However, I keep getting the exception:
...
Py4JJavaError: An error occurred while calling o81.load.
: java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
...
I have been trying to figure this out all day and I believe something is wrong with how I am trying to set up the driver. I have a driver under /tmp/sqljdbc42.jar on the instance. Can you please explain how I can let Spark know where this driver is? I have tried many different ways both through the shell and through the interpreter editor.
Thanks!
EDIT
I should also note that I downloaded the jar onto my instance through Zeppelin's shell (%sh) using
curl -o /tmp/sqljdbc42.jar http://central.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar
pyspark --driver-class-path /tmp/sqljdbc42.jar --jars /tmp/sqljdbc42.jar

Here is how I fixed this:
Copy the driver jar onto the cluster's driver node with scp.
Go to the Zeppelin interpreter settings, scroll to the Spark section, and click edit.
Write the complete path to the jar under artifacts, e.g. /home/Hadoop/mssql-jdbc.jar, and nothing else.
Click save.
Then you should be good!

You can add it through the web UI in the Interpreter settings as follows:
Click Interpreter in menu
Click 'edit' button in the Spark interpreter
Add the path for the jar in the artifact field
Then just save and restart interpreter.

Similar to Tomas, you can add the driver (or any library) using maven in the interpreter:
Click Interpreter in menu
Click 'edit' button in the Spark interpreter
In the artifact field, add the Maven coordinates (groupId:artifactId:version) instead of a path to the jar
For example, in your case, you can use com.microsoft.sqlserver:mssql-jdbc:jar:8.4.1.jre8 in artifact field.
When you restart the interpreter, it will download and add the dependency for you.
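Before pointing the interpreter at a jar, it can help to confirm that the file really contains the class named in the ClassNotFoundException. The following is a minimal sketch using only the Python standard library; the helper name jar_contains_class is hypothetical, and the jar path is the one mentioned in the question.

```python
import os
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for the fully qualified class_name."""
    # A jar is a zip archive; com.example.Foo is stored as com/example/Foo.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    jar = "/tmp/sqljdbc42.jar"  # path taken from the question
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    if os.path.exists(jar):
        print(driver, "found" if jar_contains_class(jar, driver) else "missing")
    else:
        print(jar, "does not exist on this machine")
```

If the class is present but Spark still cannot load it, the jar is most likely not on the interpreter's classpath, which is what the artifact setting described above addresses.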

Related

JavaPackage object is not callable error for pydeequ constraint suggestion

I'm getting a "JavaPackage object is not callable" error while trying to run the PyDeequ constraint suggestion method on databricks.
I have tried running this code on Apache Spark 3.1.2 cluster as well as Apache Spark 3.0.1 cluster but no luck.
suggestionResult = ConstraintSuggestionRunner(spark).onData(df).addConstraintRule(DEFAULT()).run()
print(suggestionResult)
Please refer to the second screenshot attached for the expanded error status.
PyDeequ error screenshot
Expanded PyDeequ error screenshot
I was able to combine some solutions found here, as well as other solutions, to get past the above JavaPackage error in Azure Databricks. Here are the details, if helpful for anyone.
From this link, I downloaded the appropriate JAR file to match my Spark version. In my case, that was deequ_2_0_1_spark_3_2.jar. I then installed this file using the JAR type under Libraries in my cluster configurations.
The following then worked, ran in different cells in a notebook.
%pip install pydeequ
%sh export SPARK_VERSION=3.2.1
df = spark.read.load("abfss://container-name@account.dfs.core.windows.net/path/to/data")
from pyspark.sql import SparkSession
import pydeequ
spark = (SparkSession
.builder
.getOrCreate())
from pydeequ.analyzers import *
analysisResult = AnalysisRunner(spark) \
.onData(df) \
.addAnalyzer(Size()) \
.addAnalyzer(Completeness("column_name")) \
.run()
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
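A possible variation on the SPARK_VERSION step: a %sh cell runs in its own shell process, so an export there may not reach the Python interpreter that imports pydeequ. The sketch below sets the variable from Python instead. It assumes that pydeequ reads SPARK_VERSION from the environment at import time; the value 3.2.1 is the one used in the answer above.

```python
import os

# Value taken from the answer above; adjust it to the cluster's Spark version.
SPARK_VERSION = "3.2.1"

# Set the variable inside this Python process, before pydeequ is imported.
# setdefault keeps any value the environment already provides.
os.environ.setdefault("SPARK_VERSION", SPARK_VERSION)

print("SPARK_VERSION =", os.environ["SPARK_VERSION"])
# "import pydeequ" would follow here in the notebook cell.
```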

spark.driver.extraClassPath doesn't work in virtual PySpark environment

I'm saving data to a Postgres database and the job failed with the following:
py4j.protocol.Py4JJavaError: An error occurred while calling
o186.jdbc. : java.lang.ClassNotFoundException: org.postgresql.Driver
This kept happening until I downloaded the Postgres jar into the spark/jars folder, back when I had Spark installed globally.
I have since moved to a new machine and installed only pyspark in a virtual environment (venv) via pip.
I tried setting the extraClassPath config value to my jar folder inside the virtual directory but that didn't work:
session = SparkSession \
.builder \
.config("spark.driver.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.getOrCreate()
I have tried relative and absolute paths, as well as a wildcard (*) and the full filename. Nothing seems to work.
Setting spark.jars.packages, however, did correctly load the package from Maven:
.config('spark.jars.packages', 'org.postgresql:postgresql:42.2.6') \
How can I make the extraClassPath work?
You also need to add the jar to the executor class path.
session = SparkSession \
.builder \
.config("spark.driver.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.config("spark.executor.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.getOrCreate()
EDIT:
To semantically replicate spark.jars.packages, you can use spark.jars with the absolute path to the jar file. Also, just to be sure, check your jar and confirm that it has a proper MANIFEST for the driver.
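One way to do that check from Python is to look inside the jar for the JDBC service entry. This is a sketch under an assumption: it reads META-INF/services/java.sql.Driver, the file through which JDBC 4 drivers are normally discovered, which may or may not be exactly what the answer means by MANIFEST. The helper name declared_jdbc_drivers is hypothetical.

```python
import zipfile

# Entry that JDBC 4 drivers use to announce their driver class.
SERVICE_ENTRY = "META-INF/services/java.sql.Driver"


def declared_jdbc_drivers(jar_path):
    """Return the driver class names declared in the jar's service entry."""
    with zipfile.ZipFile(jar_path) as jar:
        if SERVICE_ENTRY not in jar.namelist():
            return []
        text = jar.read(SERVICE_ENTRY).decode("utf-8")
    drivers = []
    for line in text.splitlines():
        # Service files allow '#' comments and blank lines; skip both.
        name = line.split("#", 1)[0].strip()
        if name:
            drivers.append(name)
    return drivers
```

For the jar in the question, a call such as declared_jdbc_drivers("/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") would be expected to list org.postgresql.Driver if the jar is intact.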

Setting spark.local.dir in Pyspark/Jupyter

I'm using Pyspark from a Jupyter notebook and attempting to write a large parquet dataset to S3.
I get a 'no space left on device' error. I searched around and learned that it's because /tmp is filling up.
I want to now edit spark.local.dir to point to a directory that has space.
How can I set this parameter?
Most solutions I found suggested setting it when using spark-submit. However, I am not using spark-submit and just running it as a script from Jupyter.
Edit: I'm using Sparkmagic to work with an EMR backend. I think spark.local.dir needs to be set in the config JSON, but I am not sure how to specify it there.
I tried adding it in session_configs but it didn't work.
The answer depends on where your SparkContext comes from.
If you are starting Jupyter with pyspark:
PYSPARK_DRIVER_PYTHON='jupyter' \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark
then your SparkContext is already initialized when you receive your Python kernel in Jupyter. You should therefore pass a parameter to pyspark (at the end of the command above): --conf spark.local.dir=...
If you are constructing a SparkContext in Python
If you have code in your notebook like:
import pyspark
sc = pyspark.SparkContext()
then you can configure the Spark context before creating it:
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.local.dir', '...')
sc = pyspark.SparkContext(conf=conf)
Configuring Spark from the command line:
It's also possible to configure Spark by editing a configuration file in bash. The file you want to edit is ${SPARK_HOME}/conf/spark-defaults.conf. You can append to it as follows (creating it if it doesn't exist):
echo 'spark.local.dir /foo/bar' >> ${SPARK_HOME}/conf/spark-defaults.conf
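Whichever of these mechanisms is used, the value should point at a directory that actually has room. The sketch below picks the candidate with the most free space using the standard library; the helper name pick_local_dir and the candidate paths are hypothetical and not part of Spark.

```python
import os
import shutil


def pick_local_dir(candidates, min_free_bytes=0):
    """Return the existing candidate directory with the most free space, or None."""
    best, best_free = None, -1
    for path in candidates:
        if not os.path.isdir(path):
            continue  # skip paths that do not exist on this machine
        free = shutil.disk_usage(path).free
        if free >= min_free_bytes and free > best_free:
            best, best_free = path, free
    return best


if __name__ == "__main__":
    # Hypothetical candidates; replace with directories available on your host.
    print(pick_local_dir(["/tmp", "/mnt", "/data"]))
```

The returned path could then be passed to conf.set('spark.local.dir', ...) before the SparkContext is created, as in the snippet above.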

Submit an application to a standalone spark cluster running in GCP from Python notebook

I am trying to submit a Spark application from my Python 3 notebook (accessed from my local laptop) to a standalone Spark (2.1.1) cluster of 3 VMs running in GCP, but for some reason the Spark session throws the error "StandaloneAppClient$ClientEndpoint: Failed to connect to master sparkmaster:7077".
Environment details: IPython and the Spark master run on one GCP VM called "sparkmaster". Three additional GCP VMs run the Spark workers and Cassandra clusters. I connect from my local laptop (MBP) using Chrome to the IPython notebook on the "sparkmaster" VM.
Note that submitting from the terminal works:
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --master spark://sparkmaster:7077 ex.py 1000
Running it from Python Notebook:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark=SparkSession.builder.master("spark://sparkmaster:7077").appName('somatic').getOrCreate() #This step works if make .master('local')
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
.option("subscribe", "gene") \
.load()
So far I have tried the following:
I changed spark-defaults.conf and spark-env.sh on the Spark master node to add SPARK_MASTER_IP.
I tried to find the STANDALONE_SPARK_MASTER_HOST=hostname -f setting so that I could remove "-f". For some reason my Spark master UI shows FQDN:7077, not hostname:7077.
I passed the FQDN as a parameter to .master() and to os.environ["PYSPARK_SUBMIT_ARGS"].
Please let me know if you need more details.
After doing some more research I was able to resolve the issue. It came down to a single environment variable, SPARK_HOME. In my case it pointed to Conda's /bin (pyspark was running from that location), whereas my Spark installation lived in a different path. The simple fix was to add
export SPARK_HOME="/home/<<your location path>>/spark/" to the .bashrc file (I want this attached to my profile, not to the Spark session).
How I did it:
Step 1: ssh to the master node. In my case it was the same GCP VM that hosts the IPython kernel/server.
Step 2:
cd ~
sudo nano .bashrc
Scroll down to the last line and paste the line below:
export SPARK_HOME="/home/your/path/to/spark-2.1.1-bin-hadoop2.7/"
Press Ctrl+X, then Y, then Enter to save the changes.
Note: I have also added a few more details to the environment section for clarity.
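To spot this kind of mismatch from inside the notebook, one can check whether SPARK_HOME points at something that looks like a Spark installation. This is a rough sketch that assumes a Spark home contains bin/spark-submit; the helper name looks_like_spark_home is hypothetical.

```python
import os


def looks_like_spark_home(path):
    """Return True if path contains bin/spark-submit (assumed marker of a Spark home)."""
    return os.path.isfile(os.path.join(path, "bin", "spark-submit"))


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None:
        print("SPARK_HOME is not set")
    else:
        status = "looks fine" if looks_like_spark_home(spark_home) else "has no bin/spark-submit"
        print(f"SPARK_HOME={spark_home} {status}")
```

Note that a pip- or Conda-installed pyspark ships its own bin/spark-submit, so passing this check does not prove the directory is the installation the cluster uses; comparing the printed path against the expected one is still needed.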

Invalid Spark URL in local spark session

Since updating to Spark 2.3.0, tests that run in my CI (Semaphore) fail due to an allegedly invalid Spark URL when creating the (local) Spark context:
18/03/07 03:07:11 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@LXC_trusty_1802-d57a40eb:44610
at org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)
at org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
at org.apache.spark.executor.Executor.<init>(Executor.scala:155)
at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
The Spark session is created as follows:
val sparkSession: SparkSession = SparkSession
.builder
.appName(s"LocalTestSparkSession")
.config("spark.broadcast.compress", "false")
.config("spark.shuffle.compress", "false")
.config("spark.shuffle.spill.compress", "false")
.master("local[3]")
.getOrCreate
Before updating to Spark 2.3.0, no problems were encountered in version 2.2.1 and 2.1.0. Also, running the tests locally works fine.
Set SPARK_LOCAL_HOSTNAME to localhost and try again.
export SPARK_LOCAL_HOSTNAME=localhost
This has been resolved by setting sparkSession config "spark.driver.host" to the IP address.
It seems that this change is required from 2.3 onwards.
If you don't want to change the environment variable, you can change the code to add the config in the SparkSession builder (like Hanisha said above).
In PySpark:
spark = SparkSession.builder.config("spark.driver.host", "localhost").getOrCreate()
As mentioned in the answers above, you need to set SPARK_LOCAL_HOSTNAME to localhost. On Windows, you can use the SET command: SET SPARK_LOCAL_HOSTNAME=localhost
However, SET is temporary: you have to run it again in every new terminal. Instead, you can use the SETX command, which is permanent.
SETX SPARK_LOCAL_HOSTNAME localhost
You can run the command above from any directory; just open a command prompt and run it. Note that, unlike SET, SETX does not accept an equals sign: separate the variable name and the value with a space.
On success, you will see a message like "SUCCESS: Specified value was saved".
You can also verify that the variable was added by typing SET in a different command prompt (or SET s, which lists the variables starting with the letter 'S'). SPARK_LOCAL_HOSTNAME=localhost should appear in the results, which will not happen if you use SET instead of SETX.
Change your hostname so that it contains no underscore.
spark://HeartbeatReceiver@LXC_trusty_1802-d57a40eb:44610 to spark://HeartbeatReceiver@LXCtrusty1802d57a40eb:44610
On Ubuntu, as root:
#hostnamectl status
#hostnamectl --static set-hostname LXCtrusty1802d57a40eb
#nano /etc/hosts
127.0.0.1 LXCtrusty1802d57a40eb
#reboot
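The underscore theory can be checked before renaming the machine. The sketch below approximates the usual hostname rules (labels made of letters, digits, and hyphens); it is not Spark's own validation, and the helper name is_uri_safe_hostname is hypothetical.

```python
import re

# One DNS label: letters, digits, and hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_uri_safe_hostname(host):
    """Return True if every dot-separated label of host matches the label rule."""
    if not host or len(host) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in host.split("."))


if __name__ == "__main__":
    import socket

    name = socket.gethostname()
    print(name, "passes" if is_uri_safe_hostname(name) else "contains unsupported characters")
```

Under these rules the hostname from the stack trace, LXC_trusty_1802-d57a40eb, fails the check, while LXCtrusty1802d57a40eb and localhost pass.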
Try to run Spark locally, with as many worker threads as logical cores on your machine:
.master("local[*]")
I would like to complement @Prakash Annadurai's answer by saying:
If you want the variable setting to persist after you exit the terminal, add it to your shell profile (e.g. ~/.bash_profile) with the same command:
export SPARK_LOCAL_HOSTNAME=localhost
For anyone working in a Jupyter notebook: adding %env SPARK_LOCAL_HOSTNAME=localhost to the very beginning of the cell solved it for me, like so:
%env SPARK_LOCAL_HOSTNAME=localhost
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Test")
sc = SparkContext(conf = conf)
Setting .config("spark.driver.host", "localhost") fixed the issue for me.
SparkSession spark = SparkSession
.builder()
.config("spark.master", "local")
.config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
.config("spark.hadoop.fs.s3a.buffer.dir", "/tmp")
.config("spark.driver.memory", "2048m")
.config("spark.executor.memory", "2048m")
.config("spark.driver.bindAddress", "127.0.0.1")
.config("spark.driver.host", "localhost")
.getOrCreate();
