An error occurred while calling o100.save. [Amazon][JDBC](10220) Driver not capable

I am trying to write a PySpark DataFrame to a Redshift table and get the error below.
An error occurred while calling o100.save. [Amazon][JDBC](10220) Driver not capable
Here is the code I am using to write to the Redshift table.
df1.write \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", pw) \
    .mode("overwrite") \
    .save()
I tried to find out why this error occurs. I suspected null values might be the cause, so I tried the following code, but I still get the same error.
df1 = df.na.drop("all")
Please explain why this error occurs and how to fix it. I would really appreciate your help.
Thanks,
Bab
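
One route that bypasses Spark's generic JDBC writer is the spark-redshift connector used in other snippets on this page (format `com.databricks.spark.redshift`), which stages the data in S3 and loads it with COPY instead of inserting rows through the driver. The sketch below is only an illustration under assumptions: the connector and Redshift JDBC jars are installed, the credentials are carried in the URL, and `tempdir` points at a writable bucket. The helper names, URL and bucket path are hypothetical, and nothing here is confirmed to resolve error 10220. Note also that `df.na.drop("all")` removes only rows in which every column is null, so it does not rule out nulls in individual columns.

```python
def redshift_connector_options(url, table, tempdir):
    """Collect the options for a spark-redshift write in one dict."""
    return {
        "url": url,            # JDBC URL; credentials assumed to be in the URL
        "dbtable": table,
        "tempdir": tempdir,    # S3 staging location (hypothetical path below)
        "forward_spark_s3_credentials": "true",
    }


def write_via_connector(df, options, mode="overwrite"):
    """Write a DataFrame through the connector instead of the plain jdbc format."""
    (df.write
       .format("com.databricks.spark.redshift")
       .options(**options)
       .mode(mode)
       .save())


opts = redshift_connector_options(
    url="jdbc:redshift://hostname:5439/dbname",  # placeholder URL
    table="myschema.mytable",                    # placeholder table
    tempdir="s3a://mybucket/tmp",                # placeholder bucket
)
# write_via_connector(df1, opts)  # call inside a live Spark session
```

Keeping the options in a dict makes it easy to print and check them before the write is attempted.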

Related

How do I add bigquery property to a dataproc (jobs submit spark) through gcp shell?

I am trying to load Google Patents data from BigQuery with the following Python code.
df = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
    .load()
Then I got the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause is that the BigQuery connector package is missing from my cluster (I checked the cluster configuration). I tried to add it with the following gcloud command:
$ gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=europe-west2 \
    --jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
But I got another error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I install the BigQuery connector with gcloud dataproc jobs submit spark, ideally on Dataproc?
I don't know what to pass for --class.
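
The `--jar` flag of `gcloud dataproc jobs submit spark` names the job's main jar, which is why Spark looks for a main class inside the connector jar. For Python code, `gcloud dataproc jobs submit pyspark` takes the script as the job and accepts extra libraries through `--jars`. A minimal sketch that assembles such a command follows; the script name `load_patents.py` is a hypothetical placeholder, while the cluster, region and jar path are the ones from the question.

```python
import shlex


def dataproc_pyspark_command(script, cluster, region, jars):
    """Build the argv for a Dataproc PySpark job with extra connector jars."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
        "--jars=" + ",".join(jars),  # libraries added to the job's classpath
    ]


cmd = dataproc_pyspark_command(
    script="load_patents.py",  # hypothetical script holding the spark.read code
    cluster="my-cluster",
    region="europe-west2",
    jars=["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
)
print(shlex.join(cmd))
# Run with subprocess.run(cmd, check=True) where gcloud is installed and configured.
```

The `_2.12` suffix of the jar has to match the Scala version of the cluster's Spark build.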

Not able to configure custom hive-metastore-client in spark

We are facing some challenges working with spark and hive.
We need to connect to hive-metastore from spark and we have to use custom hive-metastore-client inside spark.
Code snippet:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.hive.metastore.version", "3.1.2") \
    .config("spark.sql.hive.metastore.jars", "hive-standalone-metastore-3.1.2-sqlquery.jar") \
    .config("spark.yarn.dist.jars", "hive-exec-3.1.2.jar") \
    .config("hive.metastore.uris", "thrift://localhost:9090") \
    .getOrCreate()
The above code works with the built-in hive-metastore-client but fails with the custom one, with this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.databaseExists.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
To resolve this, we configured the custom hive-exec jar in the code and also tried passing it on the command line:
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --jars hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --conf spark.executor.extraClassPath hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --driver-class-path hive-exec-3.1.2.jar
But the issue is still not resolved. Any suggestions would help us.
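
In all three commands the options come after `spark-session.py`. `spark-submit` treats everything after the application file as arguments to the application itself, so those flags never reach Spark, and `--conf` expects a single `key=value` token. The sketch below only assembles a command with the options ahead of the script. Whether this alone resolves the `HiveException` error is not confirmed: the classpath given in `spark.sql.hive.metastore.jars` may also need to contain hive-exec and its dependencies, which is an assumption rather than something the post establishes.

```python
def spark_submit_command(script, jars=(), driver_class_path=None, conf=None,
                         spark_submit="spark-submit"):
    """Build a spark-submit argv with every option placed before the script."""
    cmd = [spark_submit]
    if jars:
        cmd += ["--jars", ",".join(jars)]
    if driver_class_path:
        cmd += ["--driver-class-path", driver_class_path]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]  # one key=value token per --conf
    cmd.append(script)  # the application file goes last
    return cmd


cmd = spark_submit_command(
    "spark-session.py",
    jars=["hive-exec-3.1.2.jar"],
    driver_class_path="hive-exec-3.1.2.jar",
    conf={"spark.executor.extraClassPath": "hive-exec-3.1.2.jar"},
)
print(" ".join(cmd))
```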

Error loading spark sql context for redshift jdbc url in glue

Hello, I am trying to fetch month-wise data from several large Redshift tables in a Glue job.
As far as I know, the Glue documentation on this is very limited.
The query works fine in SQL Workbench, which I connected using the same JDBC URL ('myjdbc_url') that the Glue job uses.
Below is what I tried, followed by the error I see:
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)
df1 = sql_context.read \
    .format("jdbc") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
print("Total recs for month :" + str(mnthval) + " df1 -> " + str(df1.count()))
However, the logs show a driver error:
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I have also tried the following, to no avail. It ends in a Connection refused error.
sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
What is the correct driver to use? I am using Glue, which is a managed service with a transient cluster in the background, so I am not sure what I am missing.
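
`No suitable driver` is raised by `java.sql.DriverManager` when no loaded driver accepts the URL, which usually means the driver jar is not on the classpath or its class was never registered. Spark's `jdbc` format has a `driver` option for naming the class explicitly. The sketch below is a minimal illustration, assuming the Redshift JDBC jar has been made available to the Glue job (for example through the `--extra-jars` job parameter) and that the class name `com.amazon.redshift.jdbc42.Driver` matches the installed driver version. Both points are assumptions to verify, and the helper names and URL are hypothetical. `tempdir` and `forward_spark_s3_credentials` are omitted because they are options of the spark-redshift connector, not of the plain `jdbc` format.

```python
def jdbc_read_options(url, query, driver):
    """Collect the options for a plain JDBC read, naming the driver class."""
    return {"url": url, "query": query, "driver": driver}


def read_with_jdbc(spark, options):
    """Load a DataFrame through Spark's jdbc format using the given options."""
    return spark.read.format("jdbc").options(**options).load()


opts = jdbc_read_options(
    url="jdbc:redshift://hostname:5439/dbname",  # placeholder for myjdbc_url
    query="SELECT 1",                            # stands in for mnth_query
    driver="com.amazon.redshift.jdbc42.Driver",  # assumed class name
)
# df1 = read_with_jdbc(spark, opts)  # call inside the Glue job's Spark session
```

The `query` option needs Spark 2.4 or later; on older versions the same query can be passed as a parenthesised, aliased subquery in `dbtable`.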

Spark Session create in Master Node

I am trying to run my Spark code from a Jupyter notebook against my company server.
So I added the following code to my notebook:
spark = SparkSession.builder.master("spark://host:port") \
    .appName("usres information analysis") \
    .config("spark.executor.memory", "5g") \
    .getOrCreate()
But I am getting the following error:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
Is there any problem with my code?

Obtaining Invalid S3 URI error when reading from Redshift

I am trying to read data from a Redshift table into a Spark 2.0 dataframe.
My call looks like this:
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://hostname:5439/dbname?user=myuser&password=pwd&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory") \
.option("dbtable", "myschema.mytable") \
.option('forward_spark_s3_credentials',"true") \
.option("tempdir", "s3a://mybucket/tmp2") \
.option("region", "us-east-1") \
.load()
This returns without errors.
However, when I run
df.collect()
I obtain the error below:
17/02/07 17:37:36 WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not appear to be a valid S3 endpoint: s3://mybucket/tmp2
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:65)
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:42)
at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:72)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:76)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
at ...
The data is then subsequently returned...
Out[2]: [Row(col1=1, col2=u'yyyyy', col3=datetime.date(2015, 1, 6), col4=datetime.date(2017, 1, 6), col5=Decimal('21'), col6=u'ABCDEF',...)]
Points to note:
This error occurs for both spark-submit and pyspark.
The version of Spark is 2.1 and the jars directory contains these pertinent files:
RedshiftJDBC4-1.2.1.1001.jar
aws-java-sdk-1.7.4.jar
spark-redshift_2.11-0.5.0.jar
hadoop-aws-2.7.3.jar
I have tried other combinations, especially of the aws-java-sdk jar, but in those cases the dataframe is not even returned; I get an error from the spark.read call.
The tmp2 bucket directory in S3 exists and is written to with a split file containing the results from Redshift.
This is being run under a federated login and there is no need to supply credentials explicitly.
Any help/suggestions would be greatly appreciated.

Check whether the S3 bucket and the Redshift database are in the same AWS region.
