PySpark - Connecting to HBase using PySpark - Package import failing

I am facing an issue when connecting to HBase using PySpark. It fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o42.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.execution.datasources.hbase. Please find packages at http://spark.apache.org/third-party-projects.html
HDP Version : 2.6.4.0-91
Spark Ver: 2.2.0.2.6.4.0-91
Python: 2.7.5
Jar used: /usr/hdp/2.6.4.0-91/shc/shc-core-1.1.0.2.6.4.0-91.jar
I tried importing the jar using pyspark --jars /usr/hdp/2.6.4.0-91/shc/shc-core-1.1.0.2.6.4.0-91.jar
This opens the PySpark shell, but when I try to connect to HBase, it fails with the error mentioned above.
Sample code executed:
Using Python version 2.7.5 (default, May 31 2018 09:41:32)
SparkSession available as 'spark'.
>>> catalog = ''.join("""{'table': {'namespace': 'default','name': 'books'},'rowkey': 'key','columns': {'title': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},'author': {'cf': 'info', 'col': 'author', 'type': 'string'}}}""".split())
>>>
>>> df = sqlContext.read.options(catalog=catalog).format('org.apache.spark.sql.execution.datasources.hbase').load()
It fails with the error given below:
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named org.apache.spark.sql.execution.datasources.hbase

Try using the --packages and --repositories arguments:
bash$ export SPARK_MAJOR_VERSION=2
bash$ pyspark --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> spark = SparkSession \
.builder \
.enableHiveSupport() \
.getOrCreate()
>>> catalog = ''.join("""{'table': {'namespace': 'default','name': 'books'},'rowkey': 'key','columns': {'title': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},'author': {'cf': 'info', 'col': 'author', 'type': 'string'}}}""".split())
>>> df=spark.read.options(catalog=catalog,newtable=5).format("org.apache.spark.sql.execution.datasources.hbase").load()
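The catalog above is written as a hand-built string. As a minimal sketch, the same mapping can be assembled from a Python dict and serialized with json.dumps, which avoids quoting mistakes. The table and column names are copied from the example above; build_catalog is a hypothetical helper name, and passing its result to spark.read.options(catalog=...) is assumed to behave like the string version.

```python
import json


def build_catalog(namespace, table, rowkey, columns):
    """Return an SHC-style catalog as a compact JSON string.

    `columns` maps DataFrame column names to (column family, qualifier, type).
    """
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": rowkey,
        "columns": {
            name: {"cf": cf, "col": col, "type": dtype}
            for name, (cf, col, dtype) in columns.items()
        },
    }, separators=(",", ":"))


catalog = build_catalog(
    namespace="default",
    table="books",
    rowkey="key",
    columns={
        "title": ("rowkey", "key", "string"),    # mapped to the row key
        "author": ("info", "author", "string"),  # column in family "info"
    },
)
print(catalog)
```

The resulting string contains no whitespace, which is what the ''.join(... .split()) idiom in the original was doing by hand.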

Related

No module named pyspark Error when using generic function

I am building a project in the PyCharm IDE using PySpark.
Spark installed successfully and can be called from the command prompt.
The interpreter is also configured correctly in the project settings. I also tried pip install pyspark.
The main.py looks like this:
import os
os.environ["SPARK_HOME"] = "/usr/local/spark"
from pyspark import SparkContext
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
from genericFunc import genericFunction
from config import constants
spark = genericFunction.start_data_pipeline()
inputDf = genericFunction.read_json(constants.INPUT_FOLDER_PATH+"file-000.json")
inputDf1 = genericFunction.read_json(constants.INPUT_FOLDER_PATH+"file-001.json")
and the genericFunction module looks like this:
from pyspark.sql import SparkSession
print('w')

def start_data_pipeline():
    '''
    Set up the Spark session and return it to the caller.
    '''
    try:
        spark = SparkSession\
            .builder\
            .appName("Nike ETL")\
            .getOrCreate()
        return spark
    except Exception as e:
        raise

def read_json(file_name):
    '''
    Read the JSON file at file_name and return it as a DataFrame.
    '''
    try:
        spark = start_data_pipeline()
        df = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true")\
            .json(file_name)
        return df
    except Exception as e:
        raise

def load_as_csv(df, file_name):
    '''
    Write df to file_name as a single CSV file with a header row.
    '''
    try:
        df.repartition(1).write.format('com.databricks.spark.csv')\
            .save(file_name, header='true')
    except Exception as e:
        raise
Error:
Unresolved reference 'genericFunc'
"C:\Users\MY PC\PycharmProjects\pythonProject1\venv\Scripts\python.exe" C:/Capgemini/cv/tulsi/test-tulsi/main.py
Traceback (most recent call last):
File "C:/Capgemini/cv/tulsi/test-naveen/main.py", line 6, in <module>
from pyspark import SparkContext
ImportError: No module named pyspark
Process finished with exit code 1
Please help
The problem is that PyCharm creates its own virtual environment (venv) for the project, and that venv does not have the required packages installed, in this case pyspark. You need to point PyCharm to the Python interpreter where the packages are available.
Go to File -> Settings -> Project -> Python Interpreter
and change the Python interpreter to the one that has the packages. To find its location, run this in that Python shell:
>>> import os
>>> import sys
>>> os.path.dirname(sys.executable)
'C:\\Doc\\'
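A quick way to confirm which interpreter PyCharm is actually running, and whether that interpreter can see pyspark, is sketched below. This assumes Python 3.4 or later (importlib.util.find_spec); describe_interpreter is a hypothetical helper name, not part of PyCharm or PySpark.

```python
import importlib.util
import sys


def describe_interpreter(module_name="pyspark"):
    """Report which interpreter is running and whether it can import module_name."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "module": module_name,
        "available": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


info = describe_interpreter("pyspark")
print(info)
```

Running this from both the command prompt and from PyCharm should show whether the two use different executables; if "available" is False inside PyCharm only, the project interpreter is the one missing the package.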

How to read Druid data using JDBC driver with spark?

How can I read data from Druid using spark and Avatica JDBC Driver?
Reference: the Avatica JDBC documentation.
Reading data from Druid using Python and the jaydebeapi module works, as in the code below.
$ python
import jaydebeapi
conn = jaydebeapi.connect("org.apache.calcite.avatica.remote.Driver",
"jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
{"user": "druid", "password":"druid"},
"/root/avatica-1.17.0.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
cur.fetchall()
output is:
[('druid', 'druid', 'wikipedia', 'TABLE'),
('druid', 'INFORMATION_SCHEMA', 'COLUMNS', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'SCHEMATA', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'TABLES', 'SYSTEM_TABLE'),
('druid', 'sys', 'segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'server_segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'servers', 'SYSTEM_TABLE'),
('druid', 'sys', 'supervisors', 'SYSTEM_TABLE'),
('druid', 'sys', 'tasks', 'SYSTEM_TABLE')] -> default tables
But I want to read the data using Spark and JDBC.
I tried, but it fails with Spark, as shown below.
$ pyspark --jars /root/avatica-1.17.0.jar
df = spark.read.format('jdbc') \
.option('url', 'jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/') \
.option("dbtable", 'INFORMATION_SCHEMA.TABLES') \
.option('user', 'druid') \
.option('password', 'druid') \
.option('driver', 'org.apache.calcite.avatica.remote.Driver') \
.load()
output is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2999.load.
: java.sql.SQLException: While closing connection
...
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Note:
I downloaded the Avatica jar file (avatica-1.17.0.jar) from the Maven repository.
I installed the Druid server using docker-compose with the default settings.
I found another way to solve this problem: I used spark-druid-connector to connect Druid with Spark.
I had to change some of its code to make it work in my environment.
This is my environment:
spark: 2.4.4
scala: 2.11.12
python: python 3.6.8
druid:
zookeeper: 3.5
druid: 0.17.0
However, it has a problem.
Once you have used spark-druid-connector at least once, every subsequent SQL query such as spark.sql("select * from tmep_view") is routed through the connector's planner.
If you use the DataFrame API instead, such as df.distinct().count(), there is no problem. I haven't solved this yet.
I tried with spark-shell:
./bin/spark-shell --driver-class-path avatica-1.17.0.jar --jars avatica-1.17.0.jar
val jdbcDF = spark.read.format("jdbc")
.option("url", "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/")
.option("dbtable", "INFORMATION_SCHEMA.TABLES")
.option("user", "druid")
.option("password", "druid")
.load()

Pyspark ModuleNotFoundError: No module named 'mmlspark'

My environment: Ubuntu 64 bit, Spark 2.4.5, Jupyter Notebook.
With an internet connection it works fine and I don't get any error:
spark = SparkSession.builder \
.appName("Churn Scoring LightGBM") \
.master("local[4]") \
.config("spark.jars.packages","com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
.getOrCreate()
from mmlspark.lightgbm import LightGBMClassifier
Without an internet connection, I downloaded the related jars and listed them (this style is recommended by the Cloudera docs):
import os
mmlspark_jars_dir = os.path.join(os.environ["SPARK_HOME"], "mmlspark_jars")
mmlspark_jars = [os.path.join(mmlspark_jars_dir, x) for x in os.listdir(mmlspark_jars_dir)]
print(mmlspark_jars)
['/home/erkan/spark/mmlspark_jars/com.jcraft_jsch-0.1.54.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.ml.spark_mmlspark_2.11-0.18.1.jar',
'/home/erkan/spark/mmlspark_jars/commons-codec_commons-codec-1.10.jar',
'/home/erkan/spark/mmlspark_jars/org.scalatest_scalatest_2.11-3.0.5.jar',
'/home/erkan/spark/mmlspark_jars/org.apache.httpcomponents_httpcore-4.4.10.jar',
'/home/erkan/spark/mmlspark_jars/org.openpnp_opencv-3.2.0-1.jar',
'/home/erkan/spark/mmlspark_jars/commons-logging_commons-logging-1.2.jar',
'/home/erkan/spark/mmlspark_jars/com.github.vowpalwabbit_vw-jni-8.7.0.2.jar',
'/home/erkan/spark/mmlspark_jars/org.apache.httpcomponents_httpclient-4.5.6.jar',
'/home/erkan/spark/mmlspark_jars/org.scala-lang_scala-reflect-2.11.12.jar',
'/home/erkan/spark/mmlspark_jars/org.scala-lang.modules_scala-xml_2.11-1.0.6.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.cntk_cntk-2.4.jar',
'/home/erkan/spark/mmlspark_jars/io.spray_spray-json_2.11-1.3.2.jar',
'/home/erkan/spark/mmlspark_jars/org.scalactic_scalactic_2.11-3.0.5.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.ml.lightgbm_lightgbmlib-2.2.350.jar']
And I had to modify SparkSession like this:
spark = SparkSession.builder \
.appName("Churn Scoring LightGBM") \
.master("local[4]") \
.config("spark.jars", ",".join(mmlspark_jars)) \
.getOrCreate()
I watched the terminal and everything seemed fine; the SparkSession was created. I then checked the Spark UI.
Then I tried the import:
from mmlspark.lightgbm import LightGBMClassifier
And got this error:
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-10-df498625321c> in <module>
----> 1 from mmlspark.lightgbm import LightGBMClassifier
ModuleNotFoundError: No module named 'mmlspark'
I don't understand why the import fails with the second method even though I see the same jars in the Spark UI.
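One possible explanation, offered as an assumption rather than a confirmed diagnosis: jars resolved through spark.jars.packages are also placed on the Python path, while jars listed in spark.jars only reach the JVM classpath. If the mmlspark jar bundles its Python package, Python would need that jar on sys.path to import it. The sketch below only demonstrates the underlying mechanism with a stand-in archive; demo_pkg, demo.jar and both helper functions are hypothetical names, and nothing here touches Spark or mmlspark.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def make_demo_archive(directory, package="demo_pkg"):
    """Create a jar-like zip archive that contains a tiny Python package."""
    archive = os.path.join(directory, "demo.jar")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(package + "/__init__.py", "VALUE = 42\n")
    return archive


def import_from_archive(archive, package):
    """Put the archive on sys.path so Python's zip importer can find the package."""
    if archive not in sys.path:
        sys.path.insert(0, archive)
    importlib.invalidate_caches()
    return importlib.import_module(package)


tmp_dir = tempfile.mkdtemp()
jar_path = make_demo_archive(tmp_dir)
module = import_from_archive(jar_path, "demo_pkg")
print(module.VALUE)
```

Under the same assumption, inserting the path of the mmlspark jar from mmlspark_jars into sys.path before the import would be the analogous step in the offline setup.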

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

I use PySpark Streaming to read Kafka data, but it fails:
import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream.map(lambda x: x.split(" ")).pprint()
ssc.start()
ssc.awaitTermination()
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 29, in <module>
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
My Spark version is 2.4.3 and my Kafka version is 2.1.0. I replaced os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell' with os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 pyspark-shell', but it does not work either. How can I fix this?
I think you should reorder your imports so that the environment variable is set before you import and initialize anything from Spark.
You also definitely need to use the package version that matches your Spark version.
import os
sparkVersion = '2.4.3' # update this accordingly
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:{} pyspark-shell'.format(sparkVersion)
# import Spark core
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
# import extra packages
from pyspark.streaming.kafka import KafkaUtils
# begin application
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
Note: Kafka 0.8 support is deprecated as of Spark 2.3.0
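As a small sketch of the version-matching advice, the submit arguments can be built from the Spark version so the two cannot drift apart. Note one assumption that goes beyond the answer above: Spark's Kafka artifacts on Maven Central are normally published with a Scala binary suffix (for example _2.11), so this helper includes one; kafka_submit_args is a hypothetical name and the default Scala version is a guess for a stock Spark 2.4.3 build.

```python
import os


def kafka_submit_args(spark_version, scala_version="2.11"):
    """Build PYSPARK_SUBMIT_ARGS for the Kafka 0.8 streaming connector."""
    coordinate = "org.apache.spark:spark-streaming-kafka-0-8_{}:{}".format(
        scala_version, spark_version
    )
    return "--packages {} pyspark-shell".format(coordinate)


# Set the variable before anything from pyspark is imported.
os.environ["PYSPARK_SUBMIT_ARGS"] = kafka_submit_args("2.4.3")
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

If the coordinate without the suffix resolves in your environment, pass the artifact name you verified instead; the point is only that the version is derived from a single value.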

PySpark 2.2.0 Write DataFrame to S3 AmazonServiceException Class Not Found

I'm trying to write a Spark DataFrame to S3 with pyspark. I'm using Spark version 2.2.0.
sc = SparkContext('local', 'Test')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", aws_secret)
sc._jsc.hadoopConfiguration().set("fs.s3a.multipart.uploads.enabled", "true")
spark = sql.SparkSession \
.builder \
.appName("TEST") \
.getOrCreate()
sql_context = sql.SQLContext(sc, spark)
filename = 'gerrymandering'
s3_uri = 's3a://mybucket/{}'.format(filename)
print(s3_uri)
df = sql_context.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"])
df.write.parquet(s3_uri)
The traceback I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
I'm not sure but there seems to be a jar dependency error. I've tried multiple versions of hadoop-aws-X.jar as well as aws-java-sdk-X.jar but they all produce this same error.
As of writing this my command was:
spark-submit --jars hadoop-aws-2.9.0.jar,aws-java-sdk-1.7.4.jar test.py
Any ideas on how I can resolve this NoClassDefFoundError?
Don't try to use a hadoop-aws JAR with an AWS SDK version different from the one it was built against; the AWS SDK changes too much between versions. For hadoop-aws 2.9.0 you need aws-java-sdk-bundle version 1.11.199.
See mvnrepo/hadoop-aws
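One way to avoid picking the SDK version by hand is to request only hadoop-aws as a package and let dependency resolution pull in the SDK it was built against. The sketch below assembles such settings as a plain dict; s3a_spark_conf is a hypothetical helper, the credentials are placeholders, and it assumes the hadoop-aws version matches the Hadoop libraries your Spark build uses. The s3a connector reads its credentials from fs.s3a.access.key and fs.s3a.secret.key.

```python
def s3a_spark_conf(hadoop_version, access_key, secret_key):
    """Return Spark settings that pull in hadoop-aws and configure s3a credentials."""
    return {
        # Dependency resolution fetches the AWS SDK this hadoop-aws release declares.
        "spark.jars.packages": "org.apache.hadoop:hadoop-aws:{}".format(hadoop_version),
        # The spark.hadoop. prefix copies a setting into the Hadoop configuration.
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


conf = s3a_spark_conf("2.9.0", "EXAMPLE_KEY", "EXAMPLE_SECRET")
for key in sorted(conf):
    print(key)
```

Each entry can then be applied with SparkSession.builder.config(key, value) before getOrCreate() is called.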
