An error occurred while calling o100.save. [Amazon][JDBC](10220) Driver not capable

I am trying to write a PySpark DataFrame to a Redshift table and get the error below.
An error occurred while calling o100.save. [Amazon][JDBC](10220) Driver not capable
Here is the code I am using to write to the Redshift table.
df1.write \
.format("jdbc") \
.option("url", url) \
.option("dbtable", table) \
.option("user",user) \
.option("password",pw) \
.mode('overwrite').save()
I tried to find out why we get this error. I suspected null values might be the cause, so I tried the following code, but I still get the same error.
df1 = df.na.drop("all")
Please suggest how and why this kind of error occurs. I would really appreciate your help.
Thanks,
Bab
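
One alternative to the generic JDBC writer is the spark-redshift connector used in the related questions on this page (format "com.databricks.spark.redshift"), which stages data in S3 instead of pushing rows through the driver. The Python sketch below shows what that could look like. Whether it avoids the 10220 error in this setup is an assumption and has not been confirmed here. The helper names, the tempdir value, and the presence of the connector jar on the classpath are all hypothetical.

```python
def redshift_write_options(url, table, tempdir):
    """Collect connector options (hypothetical helper, not part of any library)."""
    return {
        "url": url,            # JDBC URL of the Redshift cluster
        "dbtable": table,      # target table
        "tempdir": tempdir,    # S3 staging location, assumed to be writable
        "forward_spark_s3_credentials": "true",
    }


def write_to_redshift(df, options, mode="overwrite"):
    """Write a DataFrame through the connector; needs the connector jar on the classpath."""
    (df.write
       .format("com.databricks.spark.redshift")
       .options(**options)
       .mode(mode)
       .save())


# Placeholder values, borrowed from the related question further down this page.
opts = redshift_write_options(
    "jdbc:redshift://hostname:5439/dbname", "myschema.mytable", "s3a://mybucket/tmp2"
)
print(sorted(opts))
```

With a real DataFrame, the call would be write_to_redshift(df1, opts).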

Related

How do I add bigquery property to a dataproc (jobs submit spark) through gcp shell?

I am trying to load Google patent data from BigQuery with the following code (Python).
df = spark.read \
.format("bigquery") \
.option("table", table) \
.option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
.load()
Then I got the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause is that the BigQuery package is missing from my cluster (I checked the configuration). I tried to add it with the following gcloud command:
$ gcloud dataproc jobs submit spark \
--cluster=my-cluster \
--region=europe-west2 \
--jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
But I got another error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I install the BigQuery connector (with gcloud dataproc jobs submit spark)? Ideally on Dataproc.
I don't know what to pass for --class.
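
The "Failed to get main class" error is consistent with the connector jar being submitted as the job itself through --jar. A sketch of an alternative, assuming the Python code lives in a script (the name load_patents.py is hypothetical), is to submit that script as a pyspark job and attach the connector with --jars. The Python helper below only assembles the gcloud argument list. Actually running the command requires the gcloud CLI and a cluster.

```python
import shlex


def dataproc_pyspark_cmd(script, cluster, region, jars):
    """Build a 'gcloud dataproc jobs submit pyspark' argument list (hypothetical helper)."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
        "--jars=" + ",".join(jars),   # extra jars made available to the job
    ]


cmd = dataproc_pyspark_cmd(
    "load_patents.py",                # hypothetical script holding the spark.read code
    "my-cluster",
    "europe-west2",
    ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
)
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True), or the printed line can be pasted into a shell.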

Not able to configure custom hive-metastore-client in spark

We are facing some challenges working with Spark and Hive.
We need to connect to the Hive metastore from Spark, and we have to use a custom hive-metastore-client inside Spark.
Code snippet:
spark = SparkSession \
.builder \
.config('spark.sql.hive.metastore.version','3.1.2' ) \
.config('spark.sql.hive.metastore.jars', 'hive-standalone-metastore-3.1.2-sqlquery.jar') \
.config("spark.yarn.dist.jars", "hive-exec-3.1.2.jar") \
.config('hive.metastore.uris', "thrift://localhost:9090") \
.getOrCreate()
The above code works with the built-in hive-metastore-client but fails with the custom one, with this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.databaseExists.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
To resolve this issue, we configured a custom hive-exec jar in the code and also tried passing it on the command line:
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --jars hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --conf spark.executor.extraClassPath hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --driver-class-path hive-exec-3.1.2.jar
But the issue is still not resolved. Any suggestions would help us.
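
One detail worth checking in the commands above: spark-submit treats everything after the application file as arguments to the application, so options placed after spark-session.py are not read by spark-submit itself. In addition, --conf expects a single key=value token. The Python sketch below assembles a command with the options ahead of the script. The helper name is hypothetical, and whether this ordering resolves the NoClassDefFoundError is not confirmed.

```python
def spark_submit_cmd(app, jars=(), conf=None, driver_class_path=None, app_args=()):
    """Build a spark-submit argument list with all options before the application file."""
    cmd = ["spark-submit"]
    if jars:
        cmd += ["--jars", ",".join(jars)]
    if driver_class_path:
        cmd += ["--driver-class-path", driver_class_path]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]   # one key=value token per --conf
    cmd.append(app)                           # application file comes after the options
    cmd.extend(app_args)                      # anything after it goes to the application
    return cmd


cmd = spark_submit_cmd(
    "spark-session.py",
    jars=["hive-exec-3.1.2.jar"],
    conf={"spark.executor.extraClassPath": "hive-exec-3.1.2.jar"},
    driver_class_path="hive-exec-3.1.2.jar",
)
print(" ".join(cmd))
```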

Error loading spark sql context for redshift jdbc url in glue

Hello, I am trying to fetch month-wise data from several large Redshift tables in a Glue job.
As far as I know, the Glue documentation on this is very limited.
The query works fine in SQL Workbench, which I connected using the same JDBC URL ('myjdbc_url') that the Glue job uses.
Below is what I tried and the error I am seeing:
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sql_context = SQLContext(sc)
df1 = sql_context.read \
.format("jdbc") \
.option("url", myjdbc_url) \
.option("query", mnth_query) \
.option("forward_spark_s3_credentials","true") \
.option("tempdir", "s3://my-bucket/sprk") \
.load()
print("Total recs for month :"+str(mnthval)+" df1 -> "+str(df1.count()))
However, the logs show a driver error:
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I have also tried the following, to no avail. It ends in a Connection refused error.
sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", myjdbc_url) \
.option("query", mnth_query) \
.option("forward_spark_s3_credentials","true") \
.option("tempdir", "s3://my-bucket/sprk") \
.load()
What is the correct driver to use? I am using Glue, which is a managed service with a transient cluster in the background, and I am not sure what I am missing.
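
"No suitable driver" is raised by java.sql.DriverManager when no registered driver accepts the URL. One way to take the guesswork out is to name the driver class explicitly through the JDBC "driver" option. The Python sketch below picks a class from the URL prefix. The mapping and helper names are hypothetical, the Redshift class name depends on the driver jar version in use, and the jar still has to be available to the job.

```python
# Assumed mapping from URL prefix to driver class; adjust to the driver jar in use.
KNOWN_DRIVERS = {
    "jdbc:redshift:": "com.amazon.redshift.jdbc42.Driver",
    "jdbc:postgresql:": "org.postgresql.Driver",
}


def driver_for(url):
    """Return the driver class for a JDBC URL, or None if the prefix is unknown."""
    for prefix, driver in KNOWN_DRIVERS.items():
        if url.startswith(prefix):
            return driver
    return None


def jdbc_read_options(url, query):
    """Options for spark.read.format("jdbc"), with the driver named explicitly."""
    options = {"url": url, "query": query}
    driver = driver_for(url)
    if driver is not None:
        options["driver"] = driver
    return options


opts = jdbc_read_options("jdbc:redshift://hostname:5439/dbname", "select 1")
print(opts["driver"])
```

The resulting dictionary could be applied with sql_context.read.format("jdbc").options(**opts).load(). Note that tempdir and forward_spark_s3_credentials are options of the spark-redshift connector rather than of the plain JDBC source.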

Spark Session create in Master Node

I am trying to run my Spark code from a Jupyter notebook against my company's server.
So I added the following code
spark = SparkSession.builder.master("spark://host:port") \
.appName("usres information analysis") \
.config("spark.executor.memory","5g") \
.getOrCreate()
in my notebook. But I am getting the following error:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
Is there any problem with my code?

Obtaining Invalid S3 URI error when reading from Redshift

I am trying to read data from a Redshift table into a Spark 2.0 dataframe.
My call looks like this:
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://hostname:5439/dbname?user=myuser&password=pwd&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory") \
.option("dbtable", "myschema.mytable") \
.option('forward_spark_s3_credentials',"true") \
.option("tempdir", "s3a://mybucket/tmp2") \
.option("region", "us-east-1") \
.load()
This returns ok without errors.
However, when I run
df.collect()
I obtain the error below:
17/02/07 17:37:36 WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not appear to be a valid S3 endpoint: s3://mybucket/tmp2
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:65)
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:42)
at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:72)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:76)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
at ...
The data is then subsequently returned...
Out[2]: [Row(col1=1, col2=u'yyyyy', col3=datetime.date(2015, 1, 6), col4=datetime.date(2017, 1, 6), col5=Decimal('21'), col6=u'ABCDEF',...)]
Points to note:
This error occurs for both spark-submit and pyspark.
The version of Spark is 2.1 and the jars directory contains these pertinent files:
RedshiftJDBC4-1.2.1.1001.jar
aws-java-sdk-1.7.4.jar
spark-redshift_2.11-0.5.0.jar
hadoop-aws-2.7.3.jar
I have tried other combinations, especially of the aws-java-sdk jar, but in such cases I don't even get the dataframe back. I get an error from the spark.read call.
The tmp2 bucket directory in S3 exists and is written to with a split file containing the results from Redshift.
This is being run under a federated login and there is no need to supply credentials explicitly.
Any help/suggestions would be greatly appreciated.
Check whether the bucket and the Redshift DB are in the same AWS region.
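
To compare regions, the bucket name first has to be pulled out of the tempdir value. The Python sketch below does that for s3, s3a and s3n URIs. The helper name is hypothetical, and this only illustrates the parsing step; it does not reproduce what the connector or the AWS SDK does internally.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an S3-style URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3a://mybucket/tmp2")
print(bucket, prefix)
```

The bucket name could then be passed to a region lookup, for example boto3's get_bucket_location, and the result compared with the region of the Redshift cluster.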
