PySpark 2.2.0 Write DataFrame to S3 AmazonServiceException Class Not Found - apache-spark

I'm trying to write a Spark DataFrame to S3 with pyspark. I'm using Spark version 2.2.0.
from pyspark import SparkContext, sql

sc = SparkContext('local', 'Test')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", aws_secret)
sc._jsc.hadoopConfiguration().set("fs.s3a.multipart.uploads.enabled", "true")

spark = sql.SparkSession \
    .builder \
    .appName("TEST") \
    .getOrCreate()

sql_context = sql.SQLContext(sc, spark)

filename = 'gerrymandering'
s3_uri = 's3a://mybucket/{}'.format(filename)
print(s3_uri)

df = sql_context.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"])
df.write.parquet(s3_uri)
The traceback I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
I'm not sure, but this looks like a jar dependency error. I've tried multiple versions of hadoop-aws-X.jar as well as aws-java-sdk-X.jar, but they all produce the same error.
As of writing this, my command was:
spark-submit --jars hadoop-aws-2.9.0.jar,aws-java-sdk-1.7.4.jar test.py
Any ideas on how I can resolve this NoClassDefFoundError?

Don't try to use a hadoop-aws JAR with an AWS SDK different from the one it ships with; the AWS SDK changes too much between versions. For hadoop-2.9.0 you need aws-java-sdk-bundle version 1.11.199.
See mvnrepo/hadoop-aws for the matching versions.
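A minimal sketch of the matching submit command, assuming both jars sit in the working directory (the filenames are illustrative):
spark-submit --jars hadoop-aws-2.9.0.jar,aws-java-sdk-bundle-1.11.199.jar test.py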

Related

Error while connecting to BigQuery in GCP using Spark

I was trying to connect to Google BigQuery from PySpark using the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"

spark = SparkSession.builder \
    .master("local") \
    .appName("GCP") \
    .getOrCreate()

spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "key.json")

df = spark.read.format('bigquery') \
    .option("parentProject", "project_name") \
    .option('table', 'project_name.table_name') \
    .load()
df.show()
My Spark version is 2.3 and the BigQuery jar is spark-bigquery-latest_2.12.
My service account has the "BigQuery Job User" role at the project level and the "BigQuery Data Viewer" and "BigQuery User" roles at the dataset level, but I still get the error below when executing the above code:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)
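For reference, the bigquery.readsessions.create permission named in the error is bundled in the BigQuery Read Session User role, which the connector needs when it reads through the Storage Read API; the roles listed above do not include it. A sketch of granting it with gcloud, using a hypothetical project id and service-account email:
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.readSessionUser"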

How to read Druid data using JDBC driver with spark?

How can I read data from Druid using Spark and the Avatica JDBC driver?
This is the Avatica JDBC documentation.
Reading data from Druid using Python and the Jaydebeapi module worked, as in the code below.
$ python
import jaydebeapi

conn = jaydebeapi.connect(
    "org.apache.calcite.avatica.remote.Driver",
    "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
    {"user": "druid", "password": "druid"},
    "/root/avatica-1.17.0.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
cur.fetchall()
output is:
[('druid', 'druid', 'wikipedia', 'TABLE'),
('druid', 'INFORMATION_SCHEMA', 'COLUMNS', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'SCHEMATA', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'TABLES', 'SYSTEM_TABLE'),
('druid', 'sys', 'segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'server_segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'servers', 'SYSTEM_TABLE'),
('druid', 'sys', 'supervisors', 'SYSTEM_TABLE'),
('druid', 'sys', 'tasks', 'SYSTEM_TABLE')] -> default tables
But I want to read it using Spark and JDBC.
I tried, but I ran into a problem when using Spark, as in the code below.
$ pyspark --jars /root/avatica-1.17.0.jar
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/') \
    .option("dbtable", 'INFORMATION_SCHEMA.TABLES') \
    .option('user', 'druid') \
    .option('password', 'druid') \
    .option('driver', 'org.apache.calcite.avatica.remote.Driver') \
    .load()
output is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2999.load.
: java.sql.SQLException: While closing connection
...
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Note:
I downloaded the Avatica jar file (avatica-1.17.0.jar) from the Maven repository.
I installed the Druid server with docker-compose and default settings.
I found another way to solve this problem: I used spark-druid-connector to connect Druid with Spark.
I had to change some of its code to make it work in my environment.
This is my environment:
spark: 2.4.4
scala: 2.11.12
python: 3.6.8
druid:
    zookeeper: 3.5
    druid: 0.17.0
However, it has a problem. Once you use spark-druid-connector, every SQL query issued afterwards, such as spark.sql("select * from temp_view"), goes through its planner.
If you use the DataFrame API instead, for example df.distinct().count(), there is no problem. I haven't solved this yet.
I tried with spark-shell:
./bin/spark-shell --driver-class-path avatica-1.17.0.jar --jars avatica-1.17.0.jar
val jdbcDF = spark.read.format("jdbc")
    .option("url", "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/")
    .option("dbtable", "INFORMATION_SCHEMA.TABLES")
    .option("user", "druid")
    .option("password", "druid")
    .load()

How to read an xlsx file using PySpark without the help of pandas

I am using this code to read an XLSX file on my local PC, but I couldn't read the file even though I am using the "com.crealytics.spark.excel" library.
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark = SparkSession.builder \
    .appName("test") \
    .master("local[0]") \
    .getOrCreate()

empFile = "C:/Users/Dev/Downloads/SAMPLE.xlsx"
employeesDF = sqlContext.read.format("com.crealytics.spark.excel") \
    .option("sheetName", "Sheet1") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "false") \
    .option("location", empFile) \
    .option("addColorColumns", "False") \
    .load()
employeesDF.createOrReplaceTempView("EMP")
expLevel = sqlContext.sql("Select * from EMP")
expLevel.show()
If I run this code, I get an error like this:
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: java.lang.NoClassDefFoundError: scala/Product$class
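The scala/Product$class error is the usual symptom of a Scala version mismatch: the spark-excel jar was built for a different Scala version than the Spark installation. A sketch of pulling in a matching build via --packages (the coordinate's Scala suffix must match your Spark build; the version number is only illustrative, so check mvnrepository for the current one):
pyspark --packages com.crealytics:spark-excel_2.12:0.13.1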

How to connect spark with hive using pyspark?

I am trying to read Hive tables using PySpark, remotely. The error states that it is unable to connect to the Hive metastore client.
I have read multiple answers on SO and other sources; they were mostly about configuration, but none of them addressed why I am unable to connect remotely. I read the documentation and observed that we can connect Spark with Hive without changing any configuration file. Note: I have port-forwarded the machine where Hive is running and made it available at localhost:10000. I even connected to it using Presto and was able to run queries on Hive.
The code is:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext

SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.saveAsTable('example')
I expect the output to be an acknowledgment of the table being saved, but instead I am facing this error.
The condensed error is:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I have run the command:
ssh -i ~/.ssh/id_rsa_sc -L 9000:A.B.C.D:8080 -L 9083:E.F.G.H:9083 -L 10000:E.F.G.H:10000 ubuntu@I.J.K.L
When I check for ports 10000 and 9083 via the commands:
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
Upon running the script, I get the following error:
Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
... 45 more
The catch is to set the Hive configs while creating the Spark session itself.
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )
Note that no changes to the Spark conf files are required; even serverless services like AWS Glue can make such connections.
The full code:
from pyspark import SparkContext, SparkConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

"""
SparkSession ss = SparkSession
    .builder()
    .appName(" Hive example")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate();
"""

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into Hive
#df.write.saveAsTable('example')

df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()
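As a quick sanity check that the metastore connection works, independent of any particular table, you can list what the remote metastore exposes from the same session:
sparkSession.sql("SHOW TABLES").show()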

TypeError: 'JavaPackage' object is not callable

This happens when I call the Spark SQL API hiveContext.sql():
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext

conf = SparkConf().setAppName("spark_sql")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
#rdd = sc.textFile("test.txt")
sqlContext = SQLContext(sc)
res = hc.sql("use teg_uee_app")
#for each in res.collect():
#    print(each[0])
sc.stop()
I got the following error:
File "spark_sql.py", line 23, in <module>
res = hc.sql("use teg_uee_app")
File "/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/spark/python/pyspark/sql/context.py", line 683, in _ssql_ctx
self._scala_HiveContext = self._get_hive_ctx()
File "/spark/python/pyspark/sql/context.py", line 692, in _get_hive_ctx
return self._jvm.HiveContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable
How do I add SPARK_CLASSPATH or use SparkContext.addFile? I have no idea.
Maybe this will help you: When using HiveContext I have to add three jars to the spark-submit arguments:
spark-submit --jars /usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/lib/datanucleus-core-3.2.10.jar,/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar ...
Of course the paths and versions depend on your cluster setup.
In my case this turned out to be a classpath issue: I had a Hadoop jar on the classpath from a different Hadoop version than the one I was running.
Make sure you only set the executor and/or driver classpaths in one place and that there's no system-wide default applied somewhere such as .bashrc or Spark's conf/spark-env.sh.
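To see which classpath-related settings actually reached the running session, one option is to dump the effective configuration from PySpark (a minimal sketch; run it in the session that is failing):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Print every setting that influences where jars and classes are picked up
for key, value in sc.getConf().getAll():
    if "ClassPath" in key or key in ("spark.jars", "spark.submit.pyFiles"):
        print(key, "=", value)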
