Connecting to Cassandra on a remote client using Spark - apache-spark

I have two PCs: one is an Ubuntu system that has Cassandra, and the other is a Windows PC.
I have installed the same versions of Java, Spark, Python and Scala on both PCs. My goal is to read data with a Jupyter Notebook using Spark from the Cassandra instance on the other PC.
On the PC that has Cassandra, I was able to read data by connecting to Cassandra using Spark. But when I try to connect to that Cassandra instance from the remote client using Spark, the connection fails with an error.
Representation of the system
Commands run on the Ubuntu PC that has Cassandra:
~/spark/bin ./pyspark --master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executor.extraJavaOptions=-Xss512m
from pyspark.sql.functions import col
hosts = {"spark.cassandra.connection.host": '10.0.0.10,10.0.0.11,10.0.0.12', "table": "table_one", "keyspace": "log_keyspace"}
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
a = data_frame.filter(col("col_1")<100000).select("col_1","col_2","col_3","col_4","col_5").toPandas()
When the code above runs, the data read from Cassandra is displayed correctly.
Commands used to read data by connecting to Cassandra from the other PC:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' --master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executor.extraJavaOptions=-Xss512m spark.cassandra.connection.host=10.0.0.10 pyspark '
import findspark
findspark.init()
findspark.find()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('example')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
hosts ={"spark.cassandra.connection.host":'10.0.0.10',"table":"table_one","keyspace":"log_keyspace"}
sqlContext = SQLContext(sc)
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
As a result of the above codes running, " :java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html " error occurs.
What can I do to fix this error?
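Judging by the related answers below, one likely cause is that the connector package never reaches the remote driver: in PYSPARK_SUBMIT_ARGS each Spark property needs its own --conf flag, and the string is normally terminated with pyspark-shell. A minimal sketch under that assumption, reusing the addresses and names from the question:
import os

# Sketch only: every property gets an explicit --conf and the string ends with
# "pyspark-shell" so that pyspark picks the arguments up.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--master spark://10.0.0.10:7077 '
    '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 '
    '--conf spark.cassandra.connection.host=10.0.0.10 '
    'pyspark-shell'
)

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

hosts = {"table": "table_one", "keyspace": "log_keyspace"}
data_frame = spark.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()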

Related

Set up jupyter on EMR to read from cassandra using cql?

When I try to set the spark context in jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'
or
spark = SparkSession.builder \
.appName('SparkCassandraApp') \
.config('spark.cassandra.connection.host', 'x.x.x.x') \
.config('spark.cassandra.connection.port', 'xxxx') \
.config('spark.cassandra.output.consistency.level','ONE') \
.master('local[2]') \
.getOrCreate()
I still cannot make a connection to the cassandra cluster with the code
dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()
dataFrame = dataFrame.limit(100)
dataFrame.show()
Comes up with error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at http://spark.apache.org/third-party-projects.html
A similar question was asked here: modify jupyter kernel to add cassandra connection in spark, but I do not see a valid answer.
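One thing to note: the builder above sets the connection host and port but never loads the connector itself. A possible variant, sketched under the assumption that spark.jars.packages can fetch the connector at session start (the coordinate and version below are assumptions and must match the local Scala/Spark build):
from pyspark.sql import SparkSession

# Sketch: same builder as in the question, plus spark.jars.packages so the
# Cassandra connector jar is resolved and put on the classpath.
spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.4.0') \
    .config('spark.cassandra.connection.host', 'x.x.x.x') \
    .config('spark.cassandra.connection.port', 'xxxx') \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .master('local[2]') \
    .getOrCreate()

dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()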

How to connect to remote Cassandra server through pyspark for write operation?

I am trying to connect to a remote Cassandra server through pyspark, but it does not perform the write operation in Cassandra when running as a cron job. The same code works on the server in a Jupyter notebook, but not through the cron job.
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions'
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "keyspace_name")
sqlContext = SQLContext(sc)
Data_to_Write.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="tablename", keyspace="keyspace_name").save()
I see this error in the Cassandra logs:
ERROR [Messaging-EventLoop-3-3] 2020-08-05 09:24:36,606 OutboundConnectionInitiator.java:373 - Failed to handshake with peer xx.xxx.xxx.xxx:9042(xx.xxx.xxx.xxx:9042) org.apache.cassandra.net.Crc$InvalidCrc
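A cron job may not inherit the environment variables that are available in the notebook session, so one hedged option is to configure everything on the session builder instead of PYSPARK_SUBMIT_ARGS. A sketch reusing the values from the question (Data_to_Write stands in for the real DataFrame):
from pyspark.sql import SparkSession

# Sketch only: all settings live on the builder, so the script does not depend
# on environment variables that cron may not provide. The package version and
# host are carried over from the question.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("keyspace_name")
         .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:2.5.0")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
         .getOrCreate())

# Data_to_Write is the placeholder DataFrame from the question.
Data_to_Write.write.format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="tablename", keyspace="keyspace_name") \
    .save()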

Pyspark Failed to find data source: kafka

I am working on Kafka streaming and trying to integrate it with Apache Spark, but I am running into issues. I am getting the error below.
This is the command I am using.
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()
ERROR:
Py4JJavaError: An error occurred while calling o77.load.: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
How can I resolve this?
NOTE: I am running this in Jupyter Notebook
import findspark
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
Spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
Everything runs fine up to this point (the code above).
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()
This is where things go wrong (the line above).
The blog which I am following: https://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/
Edit
Using spark.jars.packages works better than PYSPARK_SUBMIT_ARGS
Ref - PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition
It's not clear how you ran the code. Keep reading the blog, and you see
spark-submit \
...
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
sstreaming-spark-out.py
It seems you missed adding the --packages flag.
In Jupyter, you could add this:
import os
# setup arguments
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0'
# initialize spark
import pyspark, findspark
findspark.init()
Note: _2.11:2.4.0 needs to align with your Scala and Spark versions... Based on the question, yours should be Spark 2.1.0.
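For example, assuming the installation really is Spark 2.1.0 built against Scala 2.11 (as the note suggests), the coordinate would change like this:
import os

# Assumed versions: Spark 2.1.0 with Scala 2.11 -- adjust to the actual install.
# "pyspark-shell" at the end is the placeholder pyspark expects in PYSPARK_SUBMIT_ARGS.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 pyspark-shell'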

How to run python spark script with specific jars

I have to run a Python script on an EMR instance using pyspark to query DynamoDB. I am able to do that by querying DynamoDB in the pyspark shell, which is launched with the required jars via the following command:
pyspark --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
I ran the following Python 3 script to query data using the pyspark Python module:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
start_time = time.time()
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://nn1:9083")
sparkSession = (SparkSession
.builder
.appName('example-pyspark-read-and-write-from-hive')
.enableHiveSupport()
.getOrCreate())
df_load = sparkSession.sql("SELECT * FROM example")
df_load.show()
print(time.time() - start_time)
This caused the following runtime exception for missing jars:
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.dynamodb.DynamoDBSerDe not found
How do I convert the pyspark --jars .. command to a pythonic equivalent?
So far I have tried copying the jars from /usr/share/... to $SPARK_HOME/libs/jars and adding that path to the spark-defaults.conf extra class path, but that had no effect.
Use the spark-submit command to execute your Python script. Example:
spark-submit --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar script.py
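If the goal is a pythonic equivalent that works from a plain python3 invocation, one option, sketched here as an untested variant, is to pass the jar paths through spark.jars before the session is created:
from pyspark.sql import SparkSession

# Sketch: spark.jars takes a comma-separated list of local jar paths and must be
# set before the SparkSession (and its JVM) is started.
sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .config('spark.jars',
            '/usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,'
            '/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar')
    .enableHiveSupport()
    .getOrCreate())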

java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver exception for connecting BigSQL using Python

I'm new to pyspark. I'm using Python 3.5 and Spark 2.2.0 on Ubuntu 16.0. I wrote the following code to connect to BigSQL using pyspark:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_train_df = spark.read.jdbc("jdbc:db2://my bigsq url :port number:sslConnection=true;sslTrustStoreLocation=ibm-truststore.jks;sslTrustStorePassword=*password123;","schema.Table Name",
properties={"user": username,
"password": password,
'driver' : 'com.ibm.db2.jcc.DB2Driver'}) # Trust store location is defined in .bashrc
spark_train_df.registerTempTable('data_table')
train_df = spark.sql('select * from data_table')
Also, I have added my trust store and driver path in my .bashrc file.
But while running this code, I'm getting the error message:
java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver exception
Can an expert please guide me to solve this problem?
You need to add the DB2 JDBC jars to your spark-submit, e.g., for Postgres:
spark-shell --master local[*] --packages org.postgresql:postgresql:9.4.1207.jre7
or (for DB2):
spark-shell --master local[*] --jars /path/to/db2/jdbc/db2.jar
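If the session has to be built from Python rather than spark-shell, a hedged equivalent is to point spark.jars at the DB2 driver jar before getOrCreate(); the jar path and connection values below are placeholders:
from pyspark.sql import SparkSession

# Sketch: the driver jar path and connection details are placeholders.
spark = (SparkSession.builder
         .appName('bigsql-example')
         .config('spark.jars', '/path/to/db2/jdbc/db2.jar')
         .getOrCreate())

spark_train_df = spark.read.jdbc(
    url="jdbc:db2://<host>:<port>/<database>",
    table="schema.table_name",
    properties={"user": "username",
                "password": "password",
                "driver": "com.ibm.db2.jcc.DB2Driver"})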
