Connect to Oracle DB using PySpark - apache-spark

I am trying to connect to an Oracle DB using PySpark.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_config = SparkConf().setMaster(config['cluster']).setAppName('sim_transactions_test').set("jars", "..\Lib\ojdbc7.jar")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df_sim_input = sqlContext.read\
.format("jdbc")\
.option("driver", "oracle.jdbc.driver.OracleDriver")\
.option("url", config["db.url"])\
.option("dbtable", query)\
.option("user", config["db.user"])\
.option("password", config["db.password"])\
.load()
This gives me a
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
So it seems the SparkContext cannot find the jar file. It is apparently possible to launch a PySpark shell with external jars, but I want to load them from the Python code.
Can someone explain to me how you can add this external jar from Python and make a query to an Oracle DB?
Extra question: how come the same code works fine for a Postgres DB without adding an external JDBC jar? Is that because, if the driver is installed on your system, Spark finds it automatically?

You should probably also set --driver-class-path, as --jars sends the jar file only to the workers, not to the driver.
That said, you should be very careful when setting JVM configuration from Python code, as you need to make sure the JVM loads with these options (you cannot add them later). You can try setting PYSPARK_SUBMIT_ARGS, e.g.:
export PYSPARK_SUBMIT_ARGS="--jars jarname --driver-class-path jarname pyspark-shell"
This tells PySpark to add these options when loading the JVM, exactly as if you had passed them on the command line.
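If you want to keep everything inside the Python script rather than exporting the variable in your shell, a common variant is to set PYSPARK_SUBMIT_ARGS through os.environ before the SparkContext is created, so the JVM starts with the jar on both the driver and executor classpaths. A minimal sketch, assuming the relative ojdbc7.jar path from the question (adjust to your environment) and a local master as a stand-in for config['cluster']:
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Must be set before the SparkContext (and therefore the JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars ../Lib/ojdbc7.jar "
    "--driver-class-path ../Lib/ojdbc7.jar "
    "pyspark-shell"
)

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("sim_transactions_test"))
sqlContext = SQLContext(sc)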

Related

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with PySpark.
I created a SparkSession with configuration that includes the JDBC jar file, but that didn't solve the problem. The current code to create the session looks like this:
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.builder.config("spark.jars", path)\
    .config("spark.driver.extraClassPath", path)\
    .config("spark.executor.extraClassPath", path)\
    .enableHiveSupport()\
    .getOrCreate()
Note that I've tried every configuration variant I know of
(checking permissions, changing the path between HDFS and local, adding or removing configuration options, ...).
And then, the code to load the data is:
sql = "SOME_SQL_TO_RETRIEVE_DATA"
df = spark.read.format('jdbc')\
    .option('dbtable', sql)\
    .option('url', 'jdbc:mariadb://{host}:{port}/{db}')\
    .option('user', SOME_USER)\
    .option('password', SOME_PASSWORD)\
    .option('driver', 'org.mariadb.jdbc.Driver')\
    .load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I tried this with spark-submit, I saw this log message:
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who suffers from the same problem:
I figured it out. The Spark documentation says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead of setting the configuration in Python code, I added the arguments to spark-submit, following that documentation:
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py
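With the jars now supplied on the spark-submit command line, the Python code itself no longer needs any jar-related configuration. A minimal sketch of what the session creation in MY_PYTHON_SCRIPT.py can then look like:
from pyspark.sql import SparkSession

# The JDBC jar now comes from --jars / --driver-class-path on spark-submit,
# so the builder no longer sets spark.jars or the extraClassPath options.
spark = SparkSession.builder \
    .enableHiveSupport() \
    .getOrCreate()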

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
from kafka import KafkaProducer, KafkaConsumer
import json

producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'], value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr', bootstrap_servers=['127.0.0.1:9092'], group_id='abc')
I've tried to connect to the stream with both spark context and spark session.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR, so from my notebook I ran
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar, and trying to find pyspark-shell as a Java class inside of that.
As the first error says, you missed a --packages after spark-submit, which means you would do
spark-submit --packages ... someApp.jar com.example.YourClass
If you are just running locally in Jupyter, you may want to try kafka-python, for example, rather than PySpark... less overhead, and no Java dependencies.
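If you do want to stay on PySpark inside Jupyter, one commonly used workaround is to set PYSPARK_SUBMIT_ARGS with --packages before the SparkContext is created, so the Kafka integration jar and its dependencies are resolved automatically. A sketch, assuming a Spark 2.3.x installation built against Scala 2.11 (adjust the coordinates to your version); note it must run before any SparkContext exists in the notebook, so restart the kernel first if needed:
import os

# Must be set before the SparkContext (and its JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
stream = KafkaUtils.createDirectStream(ssc, ["hr"], {"metadata.broker.list": "127.0.0.1:9092"})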

java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver exception for connecting BigSQL using Python

I'm new to PySpark. I'm using Python 3.5 and Spark 2.2.0 on Ubuntu 16.04. I wrote the following code to connect to BigSQL using PySpark:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_train_df = spark.read.jdbc("jdbc:db2://my bigsq url :port number:sslConnection=true;sslTrustStoreLocation=ibm-truststore.jks;sslTrustStorePassword=*password123;","schema.Table Name",
properties={"user": username,
"password": password,
'driver' : 'com.ibm.db2.jcc.DB2Driver'}) # Trust store location is defined in .bashrc
spark_train_df.registerTempTable('data_table')
train_df = spark.sql('select * from data_table')
Also, I have added my trust store and driver path to my .bashrc file.
But while running this code I'm getting the error message:
java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver exception
Can any expert please guide me to solve this problem?
You need to add the DB2 JDBC jar to your spark-submit (or spark-shell/pyspark) invocation, e.g., for Postgres:
spark-shell --master local[*] --packages org.postgresql:postgresql:9.4.1207.jre7
or (for DB2):
spark-shell --master local[*] --jars /path/to/db2/jdbc/db2.jar

Spark 2.1 - Error While instantiating HiveSessionState

With a fresh install of Spark 2.1, I am getting an error when executing the pyspark command.
Traceback (most recent call last):
File "/usr/local/spark/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
I have Hadoop and Hive on the same machine. Hive is configured to use MySQL for the metastore. I did not get this error with Spark 2.0.2.
Can someone please point me in the right direction?
I was getting the same error in a Windows environment, and the trick below worked for me.
In shell.py the Spark session is defined with .enableHiveSupport():
spark = SparkSession.builder\
.enableHiveSupport()\
.getOrCreate()
Remove the Hive support and redefine the Spark session as below:
spark = SparkSession.builder\
.getOrCreate()
You can find shell.py in your Spark installation folder.
For me it's in "C:\spark-2.1.1-bin-hadoop2.7\python\pyspark".
Hope this helps.
I had the same problem. Some of the suggested fixes, such as sudo chmod -R 777 /tmp/hive/ or downgrading Spark with Hadoop to 2.6, didn't work for me.
I realized that what caused this problem for me is that I was doing SQL queries using the sqlContext instead of using the sparkSession.
sparkSession = SparkSession.builder.master("local[*]").appName("appName").config("spark.sql.warehouse.dir", "./spark-warehouse").getOrCreate()
# previously: sqlCtx.registerDataFrameAsTable(..); register and query through the session instead
df = sparkSession.sql("SELECT ...")
This works perfectly for me now.
Spark 2.1.0 - when I run it in yarn client mode I don't see this issue, but yarn cluster mode gives "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':".
Still looking for an answer.
The issue for me was solved by unsetting the HADOOP_CONF_DIR environment variable. It was pointing to a Hadoop configuration directory, and while starting the pyspark shell the variable caused Spark to try to initialize a Hadoop cluster that hadn't been started.
So if you have the HADOOP_CONF_DIR variable set, you either have to start the Hadoop cluster before using the Spark shells,
or you need to unset the variable.
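If you go the unset route and you start Spark from a plain Python process (not an already-running pyspark shell), one way to make sure the variable is not picked up is to drop it from the environment before the session is created. A small sketch, assuming HADOOP_CONF_DIR is exported in your shell and the Hadoop cluster is intentionally down:
import os
from pyspark.sql import SparkSession

# Remove HADOOP_CONF_DIR so the JVM launched by PySpark does not try to
# reach a Hadoop cluster that is not running.
os.environ.pop("HADOOP_CONF_DIR", None)
spark = SparkSession.builder.getOrCreate()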
You are missing the spark-hive jar.
For example, if you are running on Scala 2.11, with Spark 2.1, you can use this jar.
https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.11/2.1.0
I saw this error on a new (2018) Mac, which came with Java 10. The fix was to set JAVA_HOME to Java 8:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
I too was struggling in cluster mode. I added hive-site.xml to the Spark conf directory; if you have an HDP cluster, it should be at /usr/hdp/current/spark2-client/conf. It's working for me.
I was getting this error trying to run pyspark and spark-shell when my HDFS wasn't started.
I removed ".enableHiveSupport()\" from the shell.py file and it's working perfectly.
# Before:
spark = SparkSession.builder\
    .enableHiveSupport()\
    .getOrCreate()

# After:
spark = SparkSession.builder\
    .getOrCreate()
Project location and file permissions could be the issue. I observed this error happening in spite of changes to my pom file. Then I moved my project to a user directory where I have full permissions, and this solved my issue.

Spark SQL RDD loads in pyspark but not in spark-submit: "JDBCRDD: closed connection"

I have the following simple code for loading a table from my Postgres database into an RDD.
# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("GA")#.setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# func for loading table
def get_db_rdd(table):
    url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
    print(url)
    lower = 0
    upper = 1000
    ret = sqlContext \
        .read \
        .format("jdbc") \
        .option("url", url) \
        .option("dbtable", table) \
        .option("partitionColumn", "id") \
        .option("numPartitions", 1024) \
        .option("lowerBound", lower) \
        .option("upperBound", upper) \
        .option("password", "password") \
        .load()
    ret = ret.rdd
    return ret
# load table, and print results
print(get_db_rdd("mytable").collect())
I run ./bin/pyspark then paste that into the interpreter, and it prints out the data from my table as expected.
Now, if I save that code to a file named test.py then do ./bin/spark-submit test.py, it starts to run, but then I see these messages spam my console forever:
17/02/16 02:24:21 INFO Executor: Running task 45.0 in stage 0.0 (TID 45)
17/02/16 02:24:21 INFO JDBCRDD: closed connection
17/02/16 02:24:21 INFO Executor: Finished task 45.0 in stage 0.0 (TID 45). 1673 bytes result sent to driver
Edit: This is on a single machine. I haven't started any masters or slaves; spark-submit is the only command I run after system start. I tried with the master/slave setup with the same results.
My spark-env.sh file looks like this:
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=800m
export SPARK_EXECUTOR_MEMORY=800m
export SPARK_EXECUTOR_CORES=2
export SPARK_CLASSPATH=/home/ubuntu/spark/pg_driver.jar # Postgres driver I need for SQLContext
export PYTHONHASHSEED=1337 # have to make workers use same seed in Python3
It works if I spark-submit a Python file that just creates an RDD from a list or something. I only have problems when I try to use a JDBC RDD. What piece am I missing?
When using spark-submit, you should supply the jar to the executors as well.
As mentioned in the Spark 2.1 JDBC documentation:
To get started you will need to include the JDBC driver for your particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Note: the same applies to the spark-submit command.
Troubleshooting
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
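Concretely, with the pg_driver.jar path from the spark-env.sh above, and given that SPARK_CLASSPATH is deprecated in favor of passing the jar explicitly, the submit step would look something like this (a sketch; adjust the path):
bin/spark-submit --driver-class-path /home/ubuntu/spark/pg_driver.jar --jars /home/ubuntu/spark/pg_driver.jar test.py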
This is a horrible hack. I'm not considering it the answer, but it does work.
Alright, only pyspark works? Fine, then we'll use it. I wrote this Bash script:
cat $1 | $SPARK_HOME/bin/pyspark # pipe the Python file into pyspark
I run that script from the Python script that submits my jobs. I'm also including the code I use to pass arguments between the processes, in case it helps someone:
import os
import subprocess

new_env = os.environ.copy()
new_env["pyspark_argument_1"] = "some param I need in my Spark script"  # etc...
p = subprocess.Popen("pyspark_wrapper.sh {}".format(py_fname), shell=True, env=new_env)
In my Spark script:
something_passed_from_submitter = os.environ["pyspark_argument_1"]
# do stuff in Spark...
I feel like Spark is better supported and (if this is a bug) less buggy with Scala than with Python 3, so that might be the better solution for now. But my script uses some files we wrote in Python 3, so...
