PySpark 2.2.0 Write DataFrame to S3 AmazonServiceException Class Not Found - apache-spark

I'm trying to write a Spark DataFrame to S3 with pyspark. I'm using Spark version 2.2.0.
from pyspark import SparkContext, sql

sc = SparkContext('local', 'Test')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", aws_secret)
sc._jsc.hadoopConfiguration().set("fs.s3a.multipart.uploads.enabled", "true")

spark = sql.SparkSession \
    .builder \
    .appName("TEST") \
    .getOrCreate()

sql_context = sql.SQLContext(sc, spark)

filename = 'gerrymandering'
s3_uri = 's3a://mybucket/{}'.format(filename)
print(s3_uri)

df = sql_context.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"])
df.write.parquet(s3_uri)
The traceback I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
I'm not sure, but this looks like a jar dependency error. I've tried multiple versions of hadoop-aws-X.jar as well as aws-java-sdk-X.jar, but they all produce the same error.
As of writing this, my command was:
spark-submit --jars hadoop-aws-2.9.0.jar,aws-java-sdk-1.7.4.jar test.py
Any ideas on how I can resolve this NoClassDefFoundError?

Don't try to use a hadoop-aws JAR with an AWS SDK different from the one it ships with; the AWS SDK changes too much between versions. For hadoop-2.9.0 you need aws-java-sdk-bundle version 1.11.199.
See mvnrepo/hadoop-aws for the matching versions.
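A minimal sketch of the matching submit command, assuming both jars sit in the working directory (the filenames are illustrative):
spark-submit --jars hadoop-aws-2.9.0.jar,aws-java-sdk-bundle-1.11.199.jar test.py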

Related

Error while connecting to BigQuery in GCP using Spark

I was trying to connect to Google BigQuery from PySpark using the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"

spark = SparkSession.builder \
    .master("local") \
    .appName("GCP") \
    .getOrCreate()

spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "key.json")

df = spark.read.format('bigquery') \
    .option("parentProject", "project_name") \
    .option('table', 'project_name.table_name') \
    .load()
df.show()
My Spark version is 2.3 and the BigQuery jar is spark-bigquery-latest_2.12.
My service account has the "BigQuery Job User" role at the project level and the "BigQuery Data Viewer" and "BigQuery User" roles at the dataset level, but I still get the error below when executing the above code:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)
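For reference, the bigquery.readsessions.create permission named in the error is bundled in the BigQuery Read Session User role, which the connector needs when it reads through the Storage Read API; the roles listed above do not include it. A sketch of granting it with gcloud, using a hypothetical project id and service-account email:
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.readSessionUser"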

How to read Druid data using JDBC driver with spark?

How can I read data from Druid using Spark and the Avatica JDBC driver?
This is the Avatica JDBC documentation.
Reading data from Druid using Python and the Jaydebeapi module worked, as in the code below.
$ python
import jaydebeapi

conn = jaydebeapi.connect(
    "org.apache.calcite.avatica.remote.Driver",
    "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
    {"user": "druid", "password": "druid"},
    "/root/avatica-1.17.0.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
cur.fetchall()
output is:
[('druid', 'druid', 'wikipedia', 'TABLE'),
('druid', 'INFORMATION_SCHEMA', 'COLUMNS', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'SCHEMATA', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'TABLES', 'SYSTEM_TABLE'),
('druid', 'sys', 'segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'server_segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'servers', 'SYSTEM_TABLE'),
('druid', 'sys', 'supervisors', 'SYSTEM_TABLE'),
('druid', 'sys', 'tasks', 'SYSTEM_TABLE')] -> default tables
But I want to read it using Spark and JDBC.
I tried, but I ran into a problem when using Spark, as in the code below.
$ pyspark --jars /root/avatica-1.17.0.jar
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/') \
    .option("dbtable", 'INFORMATION_SCHEMA.TABLES') \
    .option('user', 'druid') \
    .option('password', 'druid') \
    .option('driver', 'org.apache.calcite.avatica.remote.Driver') \
    .load()
output is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2999.load.
: java.sql.SQLException: While closing connection
...
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Note:
I downloaded the Avatica jar file (avatica-1.17.0.jar) from the Maven repository.
I installed the Druid server with docker-compose and default settings.
I found another way to solve this problem: I used spark-druid-connector to connect Druid with Spark.
I had to change some of its code to make it work in my environment.
This is my environment:
spark: 2.4.4
scala: 2.11.12
python: 3.6.8
druid:
    zookeeper: 3.5
    druid: 0.17.0
However, it has a problem. Once you use spark-druid-connector, every SQL query issued afterwards, such as spark.sql("select * from temp_view"), goes through its planner.
If you use the DataFrame API instead, for example df.distinct().count(), there is no problem. I haven't solved this yet.
I tried with spark-shell:
./bin/spark-shell --driver-class-path avatica-1.17.0.jar --jars avatica-1.17.0.jar
val jdbcDF = spark.read.format("jdbc")
    .option("url", "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/")
    .option("dbtable", "INFORMATION_SCHEMA.TABLES")
    .option("user", "druid")
    .option("password", "druid")
    .load()

How to read an xlsx file using PySpark without the help of pandas

I am using this code to read an XLSX file on my local PC, but I couldn't read the file even though I am using the "com.crealytics.spark.excel" library.
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark = SparkSession.builder \
    .appName("test") \
    .master("local[0]") \
    .getOrCreate()

empFile = "C:/Users/Dev/Downloads/SAMPLE.xlsx"
employeesDF = sqlContext.read.format("com.crealytics.spark.excel") \
    .option("sheetName", "Sheet1") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "false") \
    .option("location", empFile) \
    .option("addColorColumns", "False") \
    .load()
employeesDF.createOrReplaceTempView("EMP")
expLevel = sqlContext.sql("Select * from EMP")
expLevel.show()
If I run this code, I get an error like this:
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: java.lang.NoClassDefFoundError: scala/Product$class
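The scala/Product$class error is the usual symptom of a Scala version mismatch: the spark-excel jar was built for a different Scala version than the Spark installation. A sketch of pulling in a matching build via --packages (the coordinate's Scala suffix must match your Spark build; the version number is only illustrative, so check mvnrepository for the current one):
pyspark --packages com.crealytics:spark-excel_2.12:0.13.1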

How to connect spark with hive using pyspark?

I am trying to read Hive tables using PySpark, remotely. The error states that it is unable to connect to the Hive metastore client.
I have read multiple answers on SO and other sources; they were mostly about configuration, but none of them addressed why I am unable to connect remotely. I read the documentation and observed that we can connect Spark with Hive without changing any configuration file. Note: I have port-forwarded the machine where Hive is running and made it available at localhost:10000. I even connected to it using Presto and was able to run queries on Hive.
The code is:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext

SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.saveAsTable('example')
I expect the output to be an acknowledgment of the table being saved, but instead I am facing this error.
The condensed error is:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I have run the command:
ssh -i ~/.ssh/id_rsa_sc -L 9000:A.B.C.D:8080 -L 9083:E.F.G.H:9083 -L 10000:E.F.G.H:10000 ubuntu@I.J.K.L
When I check for ports 10000 and 9083 via the commands:
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
Upon running the script, I get the following error:
Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
... 45 more
The catch is to set the Hive configs while creating the Spark session itself.
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )
Note that no changes to the Spark conf files are required; even serverless services like AWS Glue can make such connections.
The full code:
from pyspark import SparkContext, SparkConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

"""
SparkSession ss = SparkSession
    .builder()
    .appName(" Hive example")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate();
"""

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into Hive
#df.write.saveAsTable('example')

df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()
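As a quick sanity check that the metastore connection works, independent of any particular table, you can list what the remote metastore exposes from the same session:
sparkSession.sql("SHOW TABLES").show()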

TypeError: 'JavaPackage' object is not callable

This happens when I call the Spark SQL API hiveContext.sql():
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext

conf = SparkConf().setAppName("spark_sql")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
#rdd = sc.textFile("test.txt")
sqlContext = SQLContext(sc)
res = hc.sql("use teg_uee_app")
#for each in res.collect():
#    print(each[0])
sc.stop()
I got the following error:
File "spark_sql.py", line 23, in <module>
res = hc.sql("use teg_uee_app")
File "/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/spark/python/pyspark/sql/context.py", line 683, in _ssql_ctx
self._scala_HiveContext = self._get_hive_ctx()
File "/spark/python/pyspark/sql/context.py", line 692, in _get_hive_ctx
return self._jvm.HiveContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable
How do I add SPARK_CLASSPATH or use SparkContext.addFile? I have no idea.
Maybe this will help you: When using HiveContext I have to add three jars to the spark-submit arguments:
spark-submit --jars /usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/lib/datanucleus-core-3.2.10.jar,/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar ...
Of course the paths and versions depend on your cluster setup.
In my case this turned out to be a classpath issue: I had a Hadoop jar on the classpath from a different Hadoop version than the one I was running.
Make sure you only set the executor and/or driver classpaths in one place and that there's no system-wide default applied somewhere such as .bashrc or Spark's conf/spark-env.sh.
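To see which classpath-related settings actually reached the running session, one option is to dump the effective configuration from PySpark (a minimal sketch; run it in the session that is failing):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Print every setting that influences where jars and classes are picked up
for key, value in sc.getConf().getAll():
    if "ClassPath" in key or key in ("spark.jars", "spark.submit.pyFiles"):
        print(key, "=", value)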
