Obtaining Invalid S3 URI error when reading from Redshift - apache-spark

I am trying to read data from a Redshift table into a Spark 2.0 dataframe.
My call looks like this:
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://hostname:5439/dbname?user=myuser&password=pwd&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory") \
.option("dbtable", "myschema.mytable") \
.option('forward_spark_s3_credentials',"true") \
.option("tempdir", "s3a://mybucket/tmp2") \
.option("region", "us-east-1") \
.load()
This returns ok without errors.
However, when I run
df.collect()
I obtain the error below:
17/02/07 17:37:36 WARN Utils$: An error occurred while trying to read
the S3 bucket lifecycle configuration
java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not
appear to be a valid S3 endpoint: s3://mybucket/tmp2
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:65)
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:42)
at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:72)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:76)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
at ...
Despite the warning, the data is still returned:
Out[2]: [Row(col1=1, col2=u'yyyyy', col3=datetime.date(2015, 1, 6), col4=datetime.date(2017, 1, 6), col5=Decimal('21'), col6=u'ABCDEF',...)]
Points to note:
This error occurs with both spark-submit and pyspark.
The Spark version is 2.1, and the jars directory contains these pertinent files:
RedshiftJDBC4-1.2.1.1001.jar
aws-java-sdk-1.7.4.jar
spark-redshift_2.11-0.5.0.jar
hadoop-aws-2.7.3.jar
I have tried other combinations, especially of the aws-java-sdk jar, but in those cases I don't even get the dataframe back: the spark.read call itself fails with an error.
The tmp2 directory in the S3 bucket exists, and a split file containing the results from Redshift is written to it.
This is being run under a federated login, so there is no need to supply credentials explicitly.
Any help/suggestions would be greatly appreciated.

Check whether the S3 bucket and the Redshift database are in the same AWS region.
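Separately from the region check, note that the log line is a WARN raised during the connector's bucket lifecycle check, and the trace ends in AmazonS3URI rejecting s3://mybucket/tmp2. That would be consistent with the data still being returned afterwards. As a rough illustration only (this is not the connector's code, and the helper name split_s3_uri is made up here), the sketch below shows the bucket and key prefix that such a check has to extract from the tempdir value:

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://, s3a:// or s3n:// URI into (scheme, bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError("not an S3 URI: " + uri)
    if not parsed.netloc:
        raise ValueError("missing bucket name: " + uri)
    # The path starts with "/", which is not part of the object key prefix.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


print(split_s3_uri("s3a://mybucket/tmp2"))
```

If the bucket and prefix come out as expected, the tempdir value itself is well formed and the warning is more likely tied to how the AWS SDK version on the classpath parses it.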

Related

An error occurred while calling o100.save. [Amazon][JDBC](10220) Driver not capable

I am trying to write a PySpark dataframe to a Redshift table and I get the error below.
An error occurred while calling o100.save. [Amazon][JDBC](10220) Driver not capable
Here is the code I am using to write to the Redshift table.
df1.write \
.format("jdbc") \
.option("url", url) \
.option("dbtable", table) \
.option("user",user) \
.option("password",pw) \
.mode('overwrite').save()
I tried to find out why this error occurs. I suspected it might be caused by null values, so I tried the following code, but I still get the same error.
df1= df.na.drop("all")
Please explain why this error occurs and how to fix it. I would really appreciate your help.
Thanks,
Bab

How do I add bigquery property to a dataproc (jobs submit spark) through gcp shell?

I am trying to load google patent data using bigquery with the following code (Python).
df = spark.read \
.format("bigquery") \
.option("table", table) \
.option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
.load()
This fails with the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause is that the BigQuery connector is missing from my cluster (I checked the configuration). I tried to add it with the following gcloud command:
$ gcloud dataproc jobs submit spark \
--cluster=my-cluster \
--region=europe-west2 \
--jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
But I got another error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I make the BigQuery connector available to a job submitted with gcloud dataproc jobs submit spark, ideally on Dataproc?
I don't know what to pass for --class.
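One thing that may explain the second error: --jar names the main jar of a Spark job, which is why a main class is expected, whereas additional dependencies are normally passed with --jars. Since the code in question is Python, a sketch of a submission through gcloud dataproc jobs submit pyspark follows, built as an argument list. The script name load_patents.py and the helper build_submit_command are placeholders invented for this example; the cluster, region, and jar are copied from the command above.

```python
import shlex

CONNECTOR_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def build_submit_command(script, cluster, region, jars):
    """Return the gcloud argument list for a PySpark job with extra jars."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        "--cluster=" + cluster,
        "--region=" + region,
        # --jars takes a comma-separated list of dependency jars.
        "--jars=" + ",".join(jars),
    ]


cmd = build_submit_command(
    "load_patents.py", "my-cluster", "europe-west2", [CONNECTOR_JAR]
)
print(" ".join(shlex.quote(part) for part in cmd))
# To actually submit: subprocess.run(cmd, check=True)
```

The _2.12 suffix in the jar name refers to the Scala version, so it is worth checking that it matches the Scala version of the cluster image.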

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with PySpark.
I created a Spark session configured to include the JDBC jar file, but that did not solve the problem. The current code to create the session is shown below.
from pyspark.sql import SparkSession

path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.builder \
.config("spark.jars", path) \
.config("spark.driver.extraClassPath", path) \
.config("spark.executor.extraClassPath", path) \
.enableHiveSupport() \
.getOrCreate()
Note that I've tried every configuration variant I know of
(checking permissions, using both HDFS and local paths, adding or removing configuration options ...).
The code to load the data is:
sql = "SOME_SQL_TO_RETRIEVE_DATA"
# Assign the result to df so the SparkSession in `spark` is not overwritten.
df = spark.read.format('jdbc') \
.option('dbtable', sql) \
.option('url', 'jdbc:mariadb://{host}:{port}/{db}') \
.option("user", SOME_USER) \
.option("password", SOME_PASSWORD) \
.option("driver", 'org.mariadb.jdbc.Driver') \
.load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I ran this with spark-submit, I saw this log message:
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who runs into the same problem: I figured it out. The Spark documentation says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead of setting the configuration in Python code, I passed it as arguments to spark-submit, following the documentation.
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py
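When the script is started with plain python rather than spark-submit (for example from an IDE), the same options can usually be supplied through the PYSPARK_SUBMIT_ARGS environment variable, provided it is set before the first SparkSession is created. This is an alternative that the answer above does not mention, sketched here under that assumption. The jar path is the placeholder from the question, and build_submit_args is a made-up helper name.

```python
import os


def build_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that puts one JDBC jar on the
    driver class path and ships it to the executors."""
    return " ".join([
        "--driver-class-path", jar_path,
        "--jars", jar_path,
        "pyspark-shell",  # PySpark expects this token at the end
    ])


jar = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jar)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
# Create the SparkSession only after this point, so that the JVM is
# launched with these options.
```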

Loading data from GCS using Spark Local

I am trying to read data from GCS buckets on my local machine for testing purposes. I would like to sample some of the data in the cloud.
I have downloaded the GCS Hadoop Connector JAR
and set up the SparkConf as follows:
conf = SparkConf() \
.setMaster("local[8]") \
.setAppName("Test") \
.set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
.set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/keyfile")
sc = SparkContext(conf=conf)
spark = SparkSession.builder \
.config(conf=sc.getConf()) \
.getOrCreate()
spark.read.json("gs://gcs-bucket")
I have also tried to set the conf like so:
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.json.keyfile", "path/to/keyfile")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
I am using PySpark installed via pip and running the code with the unittest module from IntelliJ. It fails with:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.json.
: java.io.IOException: No FileSystem for scheme: gs
What should I do?
Thanks!
To solve this issue, set the fs.gs.impl property in addition to the properties you have already configured:
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
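The properties from the question and this answer can also be collected in one place and passed to Spark with the spark.hadoop. prefix, which Spark copies into the Hadoop configuration. The sketch below only builds the dictionary. gcs_spark_conf is a hypothetical helper, and the property names and paths are the ones that appear in this thread.

```python
def gcs_spark_conf(keyfile, connector_jar):
    """Return Spark properties that register the gs:// scheme and
    enable service-account authentication."""
    hadoop_props = {
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "google.cloud.auth.service.account.enable": "true",
        "google.cloud.auth.service.account.json.keyfile": keyfile,
    }
    conf = {"spark.jars": connector_jar}
    # Spark forwards each "spark.hadoop.<key>" entry to Hadoop as "<key>".
    conf.update({"spark.hadoop." + k: v for k, v in hadoop_props.items()})
    return conf


settings = gcs_spark_conf("path/to/keyfile", "path/gcs-connector-hadoop2-latest.jar")
for key, value in sorted(settings.items()):
    print(key, "=", value)
# With PySpark installed: SparkConf().setAll(list(settings.items()))
```

Keeping the settings in SparkConf means they are in place before the session starts, rather than being patched into a running context.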

PySpark HBase/Phoenix integration

I'm supposed to read Phoenix data into pyspark.
edit:
I'm using Spark HBase converters:
Here is a code snippet:
port="2181"
host="zookeperserver"
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {"hbase.zookeeper.property.clientPort":port, "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "camel", "hbase.mapreduce.scan.columns": "data:a"}
sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat","org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result",keyConverter=keyConv,valueConverter=valueConv,conf=cmdata_conf)
Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
jconf, batchSize)
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:130)
Any help would be much appreciated.
Thank you!
/Tina
Using the Phoenix Spark plugin is the recommended approach.
See the Phoenix Spark plugin documentation for details.
Environment: tested with AWS EMR 5.10 and PySpark.
The steps are as follows.
Create a table in Phoenix (https://phoenix.apache.org/language/)
Open the Phoenix shell:
/usr/lib/phoenix/bin/sqlline.py
DROP TABLE IF EXISTS TableName;
CREATE TABLE TableName (DOMAIN VARCHAR primary key);
UPSERT INTO TableName (DOMAIN) VALUES('foo');
Download the Phoenix Spark plugin jar
Download the jar from https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core/4.11.0-HBase-1.3
You need phoenix-<phoenix version>-HBase-<hbase version>-client.jar; I used phoenix-4.11.0-HBase-1.3-client.jar to match my Phoenix and HBase versions.
From your Hadoop home directory, set the following variable:
phoenix_jars=/home/user/apache-phoenix-4.11.0-HBase-1.3-bin/phoenix-4.11.0-HBase-1.3-client.jar
Start the PySpark shell and add the dependency to the driver and executor classpath
pyspark --jars ${phoenix_jars} --conf spark.executor.extraClassPath=${phoenix_jars}
# Create the ZooKeeper URL. Replace it with your cluster's ZooKeeper quorum, which you can find in hbase-site.xml.
emrMaster = "ZooKeeper URL"
df = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.load()
df.show()
df.columns
df.printSchema()
df1=df.replace(['foo'], ['foo1'], 'DOMAIN')
df1.show()
df1.write \
.format("org.apache.phoenix.spark") \
.mode("overwrite") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.save()
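The ZooKeeper URL assigned to emrMaster can be read out of hbase-site.xml instead of being typed by hand. The sketch below assumes the usual Hadoop-style layout of that file (property elements with name and value children) and a zkUrl of the form quorum:port. zk_url_from_hbase_site is a made-up helper, and the sample XML, including its host name, is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Invented sample; on a real cluster, read the text of hbase-site.xml instead.
SAMPLE_HBASE_SITE = """<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.internal</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>"""


def zk_url_from_hbase_site(xml_text, default_port="2181"):
    """Build a 'quorum:port' string from the contents of hbase-site.xml."""
    root = ET.fromstring(xml_text)
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    quorum = props.get("hbase.zookeeper.quorum")
    if not quorum:
        raise ValueError("hbase.zookeeper.quorum is not set")
    port = props.get("hbase.zookeeper.property.clientPort", default_port)
    return quorum + ":" + port


emrMaster = zk_url_from_hbase_site(SAMPLE_HBASE_SITE)
print(emrMaster)
```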
There are two ways to do this:
1) Since Phoenix has a JDBC layer, you can use Spark's JdbcRDD to read data from Phoenix: https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/rdd/JdbcRDD.html
2) Use the Spark HBase converters:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
https://github.com/apache/spark/tree/master/examples/src/main/python
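For the first option, PySpark users would typically go through the DataFrame JDBC source rather than JdbcRDD. A minimal sketch of the connection options follows. It assumes the thick-client URL form jdbc:phoenix:<quorum>:<port> and the driver class org.apache.phoenix.jdbc.PhoenixDriver from the Phoenix client jar. phoenix_jdbc_options is a hypothetical helper, and the host and table names are reused from the posts above.

```python
def phoenix_jdbc_options(zk_quorum, table, zk_port=2181, znode=None):
    """Return options for Spark's JDBC data source pointing at Phoenix."""
    url = "jdbc:phoenix:{}:{}".format(zk_quorum, zk_port)
    if znode:
        # Optional HBase root znode, e.g. "/hbase".
        url += ":" + znode
    return {
        "url": url,
        "dbtable": table,
        "driver": "org.apache.phoenix.jdbc.PhoenixDriver",
    }


opts = phoenix_jdbc_options("zookeperserver", "TableName")
print(opts["url"])
# With a running session and the Phoenix client jar on the classpath:
# df = spark.read.format("jdbc").options(**opts).load()
```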
