Error thrown while reading data from cassandraTable - cassandra

I am trying to read data from a Cassandra table using the pyspark-cassandra connector.
I downloaded the package from pyspark-cassandra.
My script is as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import pyspark_cassandra
from pyspark_cassandra import CassandraSparkContext,Row
conf=SparkConf().setMaster("local").setAppName("App1").set("spark.cassandra.connection.host","http://192.168.0.2")
sc = CassandraSparkContext(conf=conf)
rdd = sc.cassandraTable("keyspace1", "table1")
When I run the above script with the command:
$ bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.3.5 pyscript.py
I get the following error:
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o83.newInstance. : java.lang.NoClassDefFoundError: Could not
> initialize class com.datastax.spark.connector.types.TypeConverter$
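For reference, here is a minimal sketch of the same read with the contact point given as a bare IP (spark.cassandra.connection.host expects a hostname or IP, not a URL); the package version passed to --packages must also match your Spark and Scala build, which is an assumption to verify on your side:
from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

# Contact point is a plain IP or hostname, without an "http://" prefix
conf = SparkConf() \
    .setMaster("local") \
    .setAppName("App1") \
    .set("spark.cassandra.connection.host", "192.168.0.2")

sc = CassandraSparkContext(conf=conf)
rdd = sc.cassandraTable("keyspace1", "table1")
print(rdd.first())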

Related

No module named pyspark Error when using generic function

I am building a project in the PyCharm IDE using PySpark.
Spark installed successfully and can be called easily from the command prompt.
The interpreter is also configured correctly in the project settings. I also tried pip install pyspark.
The main.py looks like:
import os
os.environ["SPARK_HOME"] = "/usr/local/spark"
from pyspark import SparkContext
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
from genericFunc import genericFunction
from config import constants
spark = genericFunction.start_data_pipeline()
inputDf = genericFunction.read_json(constants.INPUT_FOLDER_PATH+"file-000.json")
inputDf1 = genericFunction.read_json(constants.INPUT_FOLDER_PATH+"file-001.json")
and the generic function looks like:
from pyspark.sql import SparkSession

print('w')

def start_data_pipeline():
    '''
    Set up the Spark session and return it to the __main__ function.
    '''
    try:
        spark = SparkSession \
            .builder \
            .appName("Nike ETL") \
            .getOrCreate()
        return spark
    except Exception as e:
        raise

def read_json(file_name):
    '''
    Read a JSON file into a DataFrame and return it to the __main__ function.
    '''
    try:
        spark = start_data_pipeline()
        df = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .json(file_name)
        return df
    except Exception as e:
        raise

def load_as_csv(df, file_name):
    '''
    Write the DataFrame out as a single CSV file.
    '''
    try:
        df.repartition(1).write.format('com.databricks.spark.csv') \
            .save(file_name, header='true')
    except Exception as e:
        raise
Error:
Unresolved reference 'genericFunc'
"C:\Users\MY PC\PycharmProjects\pythonProject1\venv\Scripts\python.exe" C:/Capgemini/cv/tulsi/test-tulsi/main.py
Traceback (most recent call last):
File "C:/Capgemini/cv/tulsi/test-naveen/main.py", line 6, in <module>
from pyspark import SparkContext
ImportError: No module named pyspark
Process finished with exit code 1
Please help
The problem is that PyCharm creates its own virtual environment (venv) before running a Python project, and that venv does not have the packages installed - in this case pyspark. So you need to point PyCharm to the correct Python interpreter where the packages are available.
Go to File -> Settings -> Project -> Python Interpreter
and change the Python Interpreter to the correct Python that has the packages. To find your Python, run this in your Python shell:
>>> import os
>>> import sys
>>> os.path.dirname(sys.executable)
'C:\\Doc\\'
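Once the interpreter is switched, a quick check in that same shell confirms the package is now visible (the version printed below is only an example):
>>> import pyspark
>>> pyspark.__version__
'2.4.3'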

Can connect to local master but not remote, pyspark

I'm trying out this example from the Anaconda docs:
from pyspark import SparkConf
from pyspark import SparkContext
import findspark
findspark.init('/home/Snow/anaconda3/lib/python3.8/site-packages/pyspark')
conf = SparkConf()
conf.setMaster('local[*]')
conf.setAppName('spark')
sc = SparkContext(conf=conf)
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))
rdd = sc.parallelize(range(1000)).map(mod).take(10)
Locally the script runs fine, without errors. When I change the line conf.setMaster('local[*]') to conf.setMaster('spark://remote_ip:7077') I get the error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:281)
Why is this happening? I also added SPARK_MASTER_HOST=remote_ip and
SPARK_MASTER_PORT=7077 to ~/anaconda3/lib/python3.8/site-packages/pyspark/bin/load_spark_env.sh.
My Spark version is 3.0.1 and the server's is 3.0.0.
I can ping the remote_ip.
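For comparison, here is a minimal sketch of the remote-master variant, assuming the client and cluster run identical Spark versions and that the workers can reach the driver machine; spark.driver.host and the placeholder addresses are assumptions to adjust:
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster('spark://remote_ip:7077')
conf.setAppName('spark')
# Workers connect back to the driver, so it must advertise an address they can reach
conf.set('spark.driver.host', 'driver_ip')
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())
sc.stop()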

How to read an XLSX file using PySpark without the help of pandas

I am using this code to read an XLSX file on my local PC, but I couldn't read the file even though I am also using the "com.crealytics.spark.excel" library.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark = SparkSession.builder \
    .appName("test") \
    .master("local[*]") \
    .getOrCreate()
empFile = "C:/Users/Dev/Downloads/SAMPLE.xlsx"
employeesDF = sqlContext.read.format("com.crealytics.spark.excel") \
    .option("sheetName", "Sheet1") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "false") \
    .option("location", empFile) \
    .option("addColorColumns", "False") \
    .load()
employeesDF.createOrReplaceTempView("EMP")
expLevel = sqlContext.sql("Select * from EMP")
expLevel.show()
If I run this code, I get an error like this:
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: java.lang.NoClassDefFoundError: scala/Product$class
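A scala/Product$class error usually means the spark-excel jar was built for a different Scala version than the Spark installation. Below is a minimal sketch that supplies the package at submit time; the coordinate (Scala suffix and version number) is an assumption and has to match your Spark's Scala build, and option names can differ between spark-excel releases:
$ bin/spark-submit --packages com.crealytics:spark-excel_2.12:0.13.5 read_excel.py

# read_excel.py
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("test") \
    .master("local[*]") \
    .getOrCreate()

empFile = "C:/Users/Dev/Downloads/SAMPLE.xlsx"
employeesDF = spark.read.format("com.crealytics.spark.excel") \
    .option("sheetName", "Sheet1") \
    .option("useHeader", "true") \
    .option("inferSchema", "false") \
    .load(empFile)
employeesDF.show()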

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

I use PySpark Streaming to read Kafka data, but it goes wrong:
import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream.map(lambda x: x.split(" ")).pprint()
ssc.start()
ssc.awaitTermination()
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 29, in <module>
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
My Spark version is 2.4.3 and my Kafka version is 2.1.0. I replaced os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell' with os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 pyspark-shell', but that does not work either. How can I fix this?
I think you should move your imports around so that the environment variable is set before you import and initialize the Spark classes.
You also definitely need to use the same version of the packages as your Spark version.
import os
sparkVersion = '2.4.3' # update this accordingly
# the _2.11 suffix is the Scala build of the Spark 2.4.x binaries; it must match your installation
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:{} pyspark-shell'.format(sparkVersion)
# import Spark core
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
# import extra packages
from pyspark.streaming.kafka import KafkaUtils
# begin application
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
Note: Kafka 0.8 support is deprecated as of Spark 2.3.0
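For completeness, the same dependency can be passed on the spark-submit command line instead of through PYSPARK_SUBMIT_ARGS; the _2.11 suffix below matches the default Scala build of the Spark 2.4.x binaries and is an assumption to verify against your installation:
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.3 test.py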

Pyspark Py4JJavaError

(screenshot of the Py4JJavaError traceback)
Getting this error on the sc = pyspark.SparkContext(appName="Pi") line:
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
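Since the actual stack trace only appears in the screenshot, just a hedged note: a frequent cause of a Py4JJavaError at this line, especially in notebooks, is creating a second SparkContext in the same process. SparkContext.getOrCreate reuses a running context instead; a minimal sketch under that assumption:
import findspark
findspark.init()
import pyspark

# Reuse the running context if one exists, otherwise create a new one
conf = pyspark.SparkConf().setAppName("Pi")
sc = pyspark.SparkContext.getOrCreate(conf)
print(sc.version)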
