java.lang.ClassNotFoundException when connecting to BigSQL using Python

I'm new to PySpark. I'm using Python 3.5 and Spark 2.2.0 on Ubuntu 16.0. I wrote the following code to connect to BigSQL using PySpark:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_train_df = spark.read.jdbc("jdbc:db2://my bigsq url :port number:sslConnection=true;sslTrustStoreLocation=ibm-truststore.jks;sslTrustStorePassword=*password123;", "schema.Table Name",
    properties={"user": username,
                "password": password,
                'driver': ''})  # Trust store location is defined in .bashrc
train_df = spark.sql('select * from data_table')
I have also added my trust store and driver path to my .bashrc file.
But when I run this code I get this error message:
java.lang.ClassNotFoundException: exception
Can an expert please guide me in solving this problem?

You need to add the DB2 JDBC jars to your spark-submit (or spark-shell) command. For example, for Postgres:
spark-shell --master local[*] --packages org.postgresql:postgresql:9.4.1207.jre7
or, for DB2:
spark-shell --master local[*] --jars /path/to/db2/jdbc/db2.jar
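When the session is started from a Python script or notebook rather than from spark-shell, the same --jars flag can be passed through the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. A minimal sketch, assuming the placeholder jar path from the command above; `jars_submit_args` is an illustrative helper, not part of PySpark:

```python
import os


def jars_submit_args(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value that ships the given jars.

    spark-submit takes --jars as one comma-separated list, and the value
    has to end with the literal token "pyspark-shell".
    """
    return "--jars {} pyspark-shell".format(",".join(jar_paths))


# Placeholder path copied from the spark-shell example; replace with the real jar.
os.environ["PYSPARK_SUBMIT_ARGS"] = jars_submit_args(["/path/to/db2/jdbc/db2.jar"])

# The variable is only read when the JVM is launched, so it has to be set
# before the first SparkSession/SparkContext is created:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
```

If the session already exists in the current Python process, the variable has no effect until the process is restarted.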


Connecting to Cassandra from a remote client using Spark

I have two PCs: one is an Ubuntu system that runs Cassandra, and the other is a Windows PC.
I have installed the same versions of Java, Spark, Python and Scala on both PCs. My goal is to read data in a Jupyter Notebook, using Spark, from the Cassandra instance on the other PC.
On the PC that runs Cassandra, I was able to connect to Cassandra with Spark and read data. But when I try to connect to that Cassandra instance from the remote client using Spark, the connection fails with an error.
Commands run on the Ubuntu PC that hosts Cassandra:
~/spark/bin ./pyspark --master spark:// --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executor.extraJavaOptions=-Xss512m
from pyspark.sql.functions import col
hosts = {"":',,',"table":"table_one","keyspace":"log_keyspace"}
data_frame = spark.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
a = data_frame.filter(col("col_1")<100000).select("col_1","col_2","col_3","col_4","col_5").toPandas()
Running the code above displays the data read from Cassandra.
Commands run on the other PC to read data from Cassandra:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' --master spark:// --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executor.extraJavaOptions=-Xss512m pyspark '
import findspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('example')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
hosts ={"":'',"table":"table_one","keyspace":"log_keyspace"}
sqlContext = SQLContext(sc)
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
Running the code above produces the error "java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at ".
What can I do to fix this error?
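One thing that may be worth checking, offered as an observation about the notebook snippet rather than a confirmed diagnosis: the PYSPARK_SUBMIT_ARGS value there ends with `pyspark`, whereas PySpark uses `pyspark-shell` as the final token (that is also its default when the variable is unset). A minimal sketch that normalizes such a string; `with_trailing_shell_token` is a hypothetical helper, and the `--master` URL is left out because it is truncated in the question:

```python
import shlex

SHELL_TOKEN = "pyspark-shell"


def with_trailing_shell_token(raw_args):
    """Return submit args that end with exactly one "pyspark-shell" token.

    Assumes no individual argument contains whitespace, so a plain
    space-join is enough to rebuild the string.
    """
    tokens = [t for t in shlex.split(raw_args) if t not in ("pyspark", SHELL_TOKEN)]
    return " ".join(tokens + [SHELL_TOKEN])


raw = (
    "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 "
    "--conf spark.driver.extraJavaOptions=-Xss512m pyspark"
)
fixed = with_trailing_shell_token(raw)
print(fixed)
```

The result would then be assigned to `os.environ['PYSPARK_SUBMIT_ARGS']` before the SparkContext is created, as in the original snippet.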

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with pyspark.
I created the Spark session with configuration that includes the JDBC jar file, but that did not solve the problem. The current code to create the session looks like this:
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.builder.config("spark.jars", path)\
    .config("spark.driver.extraClassPath", path)\
    .config("spark.executor.extraClassPath", path)\
    .getOrCreate()
Note that I've tried every configuration variant I know of
(checking permissions, switching the jar location between HDFS and local, adding and removing settings ...).
The code to load the data is:
df = spark.read.format('jdbc').option('dbtable', sql)\
    .option('url', 'jdbc:mariadb://{host}:{port}/{db}')\
    .option("user", SOME_USER)\
    .option("password", SOME_PASSWORD)\
    .option("driver", 'org.mariadb.jdbc.Driver')\
    .load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I ran this with spark-submit, I saw this log message:
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who runs into the same problem:
I figured it out. The Spark documentation says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead of setting the configuration in Python code, I passed the arguments to spark-submit, as the documentation describes.
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
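If the job is launched from another Python script (a scheduler wrapper, for instance), the same command line can be assembled as an argument list. This is only a sketch: `spark_submit_command`, the application file `my_job.py` and the `--name` value are hypothetical, and the jar path is the placeholder used above.

```python
import shlex


def spark_submit_command(app, driver_jar, extra_args=()):
    """Assemble a spark-submit argv that puts one JDBC jar on the driver
    classpath (--driver-class-path) and also distributes it (--jars)."""
    return (
        ["spark-submit"]
        + list(extra_args)
        + ["--driver-class-path", driver_jar, "--jars", driver_jar, app]
    )


cmd = spark_submit_command(
    "my_job.py",                      # hypothetical application file
    "PATH/TO/JDBC/my-jdbc.jar",       # placeholder path from the answer
    ["--name", "mariadb-load"],       # hypothetical extra argument
)

# Print the command in a copy-pasteable form; nothing is executed here.
print(" ".join(shlex.quote(part) for part in cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Passing the jar on the command line matters because, in client mode, the driver JVM is already running by the time Python code could set `spark.driver.extraClassPath`.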

How to connect to remote Cassandra server through pyspark for write operation?

I am trying to connect to a remote Cassandra server through pyspark, but the write operation to Cassandra is not performed when the code runs as a cron job. The same code works on the server in a Jupyter notebook, but not through the cron job.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.0 --conf pyspark-shell --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions'
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "keyspace_name")
sqlContext = SQLContext(sc)
I see this error in the Cassandra logs: ERROR [Messaging-EventLoop-3-3] 2020-08-05 09:24:36,606 - Failed to handshake with peer$InvalidCrc

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
producer = KafkaProducer(bootstrap_servers=[''],value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr',bootstrap_servers=[''],group_id='abc' )
I've tried to connect to the stream with both spark context and spark session.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR to my
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar, and trying to find pyspark-shell as a Java class inside of that.
As the first error says, you are missing --packages after spark-submit, which means you would run
spark-submit --packages ... someApp.jar com.example.YourClass
If you are just working locally in Jupyter, you may want to try Kafka-Python, for example, rather than PySpark. It has less overhead and no Java dependencies.
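For a notebook, the --packages route could look roughly like the sketch below. Two assumptions are labeled in the code: the Scala suffix `2.11` and the version `2.3.2` are read off the assembly jar name in the question (the error message prints the artifact without the suffix), and `spark_package` is an illustrative helper, not a PySpark function. Whether `2.3.2` matches the installed Spark (the jar path mentions 2.3.0) is not confirmed here.

```python
import os


def spark_package(group, artifact, scala_version, version):
    """Format a Maven coordinate for a Scala-versioned Spark package."""
    return "{}:{}_{}:{}".format(group, artifact, scala_version, version)


# Assumption: Scala 2.11 and version 2.3.2, taken from the jar file name
# spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar in the question.
coordinate = spark_package(
    "org.apache.spark", "spark-streaming-kafka-0-8", "2.11", "2.3.2"
)

# Must be set before the SparkContext is created; "pyspark-shell" goes last.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {} pyspark-shell".format(coordinate)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

With --packages, Spark resolves the artifact and its dependencies itself, so no main class or local assembly jar has to be named.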

saveAsTable ends in failure in Spark-yarn cluster environment

I set up a Spark-on-YARN cluster environment and tried Spark SQL with spark-shell:
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs://hadoop_273_namenode_ip:namenode_port/
One thing to mention is that Spark runs on Windows 7. After spark-shell starts up successfully, I execute the commands below:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df_mysql_address = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://mysql_db_ip/db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "ADDRESS").option("user", "root").option("password", "root").load()
scala> df_mysql_address.write.format("parquet").saveAsTable("address_local")
The "show" command returns the result set correctly, but "saveAsTable" fails. The error message says: Mkdirs failed to create file:/C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse/address_local/_temporary/0/_temporary/attempt_20171018104423_0001_m_000000_0 (exists=false, cwd=file:/tmp/hadoop/nm-local-dir/usercache/hduser/appcache/application_1508319604173_0005/container_1508319604173_0005_01_000003)
I expected the table to be saved in the Hadoop cluster, but as you can see the directory (C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse) is a folder on my Windows 7 machine, not in HDFS, and not even on the Hadoop Ubuntu machine.
What should I do? Please advise, thanks.
The way to get rid of the problem is to provide the "path" option before the "save" operation, as shown below:
scala> df_mysql_address.write.option("path", "/spark-warehouse").format("parquet").saveAsTable("address_local")
Thanks #philantrovert.
