BigQuery 'INTERNAL: request failed: internal error' when retrieving __TABLES__ from PySpark - apache-spark

I am trying to query the __TABLES__ data in BigQuery from PySpark. I was using this code to interrogate the system table:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config('parentProject', 'my-parent-project')\
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.1')\
.getOrCreate()
spark.read.format('bigquery')\
.option("credentials", my_key)\
.option("project", 'my-parent-project') \
.option('table', 'my-dataset.__TABLES__') \
.load()
and it was working up to 2021/06/25. The following day all of a sudden for this very code I started receiving this error message instead:
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InternalException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: request failed: internal error
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:67)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
(the stack trace is even longer than this, if needed ask for it and I'll post the rest of it).
Did anything changed in BigQuery? The error message is not very helpful to me to troubleshoot, any suggestions? To add more context, with the same code I'm able to query other tables in the same dataset.
I observed this behaviour with Spark 3.0.1 and BigQuery connector com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.1. I've also experienced same behaviour using Spark 2.4.3 and BigQuery connector com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1

Notice that the __TABLES__ is not an actual BigQuery table, but a view to its metadata. A way to overcome this is to execute the following:
spark = SparkSession.builder\
.config('parentProject', 'my-parent-project')\
.config('viewsEnabled','true')\
.config('materializationDataset', DATASET)
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1')\
.getOrCreate()
tables_df = spark.read.format('bigquery')\
.option("credentials", my_key)\
.load("SELECT * FROM my-project.my-dataset.__TABLES__")

Related

Using spark MS SQL connector PySpark causes NoSuchMethodError for BulkCopy

I'm trying to use the MS SQL connector for Spark to insert high volumes of data from pyspark.
After creating a session:
SparkSession.builder
.config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.2.0,org.apache.spark:spark-avro_2.12:3.1.2,com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0')
I get the following error:
ERROR executor.Executor: Exception in task 6.0 in stage 12.0 (TID 233)
java.lang.NoSuchMethodError: 'void com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(com.microsoft.sqlserver.jdbc.ISQLServerBulkData)'
at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.bulkWrite(BulkCopyUtils.scala:110)
at com.microsoft.sqlserver.jdbc.spark.BulkCopyUtils$.savePartition(BulkCopyUtils.scala:58)
at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2(BestEffortSingleInstanceStrategy.scala:43)
at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.$anonfun$write$2$adapted(BestEffortSingleInstanceStrategy.scala:42)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
When trying to write data like this:
try:
(
df.write.format("com.microsoft.sqlserver.jdbc.spark")
.mode("append")
.option("url", url)
.option("dbtable", table_name)
.option("user", username)
.option("password", password)
.option("schemaCheckEnabled", "false")
.save()
)
except ValueError as error:
print("Connector write failed", error)
I tried different versions of spark and the sql connector but no luck so far.
I also tried using a jar for the mssql-jdbc dependency directly:
SparkSession.builder
.config('spark.jars', '/mssql-jdbc-8.4.1.jre8.jar')
.config(...)
It still complains that it can't find the method, however if you inspect the JAR file, the method is defined in the source code.
Any tips on where to look are welcome!
We reproduced the same scenario in our environment and it's correctly working now.
There is an issue in JDBC driver 8.2.2 you can use the older version for the library.
Below is the code sample,
Output:
Data got inserted into table from pyspark.
Reference: NoSuchMethodError for BulkCopy.
Using com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 is one thing but also you need proper version of MS' Spark SQL Connector, compatible with your Spark's version.
com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha and com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8 did not work for my case as I'm using AWS Glue 3.0 (which is Spark 3.1)
I had to switch to com.microsoft.azure:spark-mssql-connector_2.12:1.2.0 as it's Spark 3.1 compatible.
def write_df_to_target(self, df, schema_table):
spark = self.gc.spark_session
spark.builder.config('spark.jars.packages', 'com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0').getOrCreate()
credentials = self.get_credentials(self.replica_connection_name)
df.write \
.format("com.microsoft.sqlserver.jdbc.spark") \
.option("url", credentials["url"] + ";databaseName=" + self.database_name) \
.option("dbtable", schema_table) \
.option("user", credentials["user"]) \
.option("password", credentials["password"]) \
.option("batchsize","100000") \
.option("numPartitions","15") \
.save()
Last thing. AWS Glue job must have --user-jars-first: "true" param.
This instruction indicates that provided jars are to be used in first order (aka - you override default ones).
Try to check if equivalent parameter is on your end.

Unrecognized connection property 'url' when using Presto JDBC in Spark SQL

Here is my spark sql code, where I am trying to read a presto table based on this guide;  https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
val df = spark.read
.format("jdbc")
.option("driver", "com.facebook.presto.jdbc.PrestoDriver")
.option("url", "jdbc:presto://localhost:8889/mycatalog")
.option("query", "select * from mydb.mytable limit 1")
.option("user", "myuserid")
.load()
 
I am getting the following exception, unrecognized connection property 'url'
Exception in thread "main" java.sql.SQLException: Unrecognized connection property 'url'
at com.facebook.presto.jdbc.PrestoDriverUri.validateConnectionProperties(PrestoDriverUri.java:345)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:102)
at com.facebook.presto.jdbc.PrestoDriverUri.<init>(PrestoDriverUri.java:92)
at com.facebook.presto.jdbc.PrestoDriver.connect(PrestoDriver.java:87)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:341)
Seems like this issue is related to https://github.com/prestodb/presto/issues/9254  where the property url is not a recognized property in Presto and looks like the fix needs to be done on the Spark side? Are there any other workaround for this issue?
PS:
Spark Version: 3.1.1
presto-jdbc version: 0.245
looks like a spark bug fixed 3.3
https://issues.apache.org/jira/browse/SPARK-36163
There is no issue with spark or presto JDBC driver. I don't think URL which you specified will work.
You should change that to below format.
jdbc:presto://localhost:8889/mycatalog
UPDATE
Not sure how it's working with spark version < 3. As an workaround you can use another jar where strict config check has been removed as specified here.
#odonnry is correct that the issue was fixed in spark 3.3.x, but if anyone cannot upgrade to Spark 3.3.x and is trying to use Trino, I created a workaround below according to the Jira issue linked by #Mohana
https://github.com/amitferman/trino

SparkBQ connector issue: IllegalArgumentException: Wrong bucket: bq_temporary_folder, expected bucket: null

I need help figuring out why my Spark job that is using the spark-bigquery-connector failed during the write step to temporaryGcsBucket.
Here is the Spark code that is trying to write the data to BQ table
outDF.write
.format("bigquery")
.option("temporaryGcsBucket", "bq_temporary_folder")
.option("parentProject", "user_project")
.option("table", "user.destination_table")
.mode(SaveMode.Overwrite)
.save()
Here is the error log:
Caused by: java.lang.IllegalArgumentException: Wrong bucket: bq_temporary_folder, in path: gs://bq_temporary_folder/.spark-bigquery-application_1605725797163_1966-6f6cfc35-543b-4138-ae0e-649ce7c2ae56, expected bucket: null
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:581)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:454)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:144)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:103)
at org.apache.parquet.hadoop.ParquetOutputCommitter.<init>(ParquetOutputCommitter.java:43)
at org.apache.parquet.hadoop.ParquetOutputFormat.getOutputCommitter(ParquetOutputFormat.java:442)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupCommitter(HadoopMapReduceCommitProtocol.scala:100)
at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:40)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:217)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:226)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:175)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:405)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm using the SparkBQ connector spark-bigquery-with-dependencies_2.12-0.18.0.jar and the GCS connector gcs-connector-hadoop2-2.0.0-shaded.jar.
I do not think it's permission issue. Since the Spark job was able to write to the same GCS bucket by calling rdd.saveAsTextFile()
When I checked the GoogleHadoopFileSystem code, the expected bucket: null error message occurred because the rootBucket in the GoogleHadoopFileSystemBase class is not set.
But I don't know how I can get this field initialized in the first place.
Someone told me that there might be a mismatch GCS connector version between my Spark job and the Spark cluster. Even after confirming that the Spark job was using the GCS connector library configured for the Spark cluster, the job still failed.
I'm running out of idea what to try at this point.
Thank you in advance for any advices/helps.

How to setup Elasticsearch Structured Streaming with X-Pack enabled?

I'm trying to use Elasticsearch (ES) 6.1.1 Hadoop with installed x-pack to write data using Spark Structured Streaming 2.2.1.
This is my code (the index already exists in elastic):
val exceptions = spark
.readStream
.text(path)
val advancedQuery = exceptions
.writeStream
.format("org.elasticsearch.spark.sql")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "/checkpoint")
val runningQuery = advancedQuery.start("spark/exc")
runningQuery.awaitTermination
But I do get the following exception:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:327)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:575)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.sql.streaming.EsStreamQueryWriter.run(EsStreamQueryWriter.scala:41)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:52)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:51)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
**Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: missing authentication token for REST request [/]
null**
How can I set the required authentication data?
Figured it out:
One needs to add two additional options "es.net.http.auth.user" and "es.net.http.auth.pass" like here:
val advancedQuery = exceptions
...
.option("es.net.http.auth.user", "*your elastic user goes here*")
.option("es.net.http.auth.pass", "*your elastic password goes here*")
...

Amazon s3a returns 400 Bad Request with Spark-redshift library

I am facing java.io.IOException: s3n://bucket-name : 400 : Bad Request error while loading Redshift data through spark-redshift library:
The Redshift cluster and the s3 bucket both are in mumbai region.
Here is the full error stack:
2017-01-13 13:14:22 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, master): java.io.IOException: s3n://bucket-name : 400 : Bad Request
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy10.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at com.databricks.spark.redshift.RedshiftRecordReader.initialize(RedshiftInputFormat.scala:115)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:92)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:279)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:263)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:425)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:279)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1052)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2264)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2193)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)
... 30 more
And here is my java code for the same:
SparkContext sparkContext = SparkSession.builder().appName("CreditModeling").getOrCreate().sparkContext();
sparkContext.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
sparkContext.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", fs_s3a_awsAccessKeyId);
sparkContext.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", fs_s3a_awsSecretAccessKey);
sparkContext.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com");
SQLContext sqlContext=new SQLContext(sparkContext);
Dataset dataset= sqlContext
.read()
.format("com.databricks.spark.redshift")
.option("url", redshiftUrl)
.option("query", query)
.option("aws_iam_role", aws_iam_role)
.option("tempdir", "s3a://bucket-name/temp-dir")
.load();
I was able to solve the problem on spark local mode by doing following changes (referred this):
1) I have replaced the jets3t jar to 0.9.4
2) Changed jets3t configuration properties to support the aws4 version bucket as follows:
Jets3tProperties myProperties = Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
myProperties.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
But now i am trying to run the job in a clustered mode (spark standalone mode or with a resource manager MESOS) and the error appears again :(
Any help would be appreciated!
Actual Problem:
Updating Jets3tProperties, to support AWS s3 signature version 4, at runtime worked on local mode but not on cluster mode because the properties were only getting updated on the driver JVM but not on any of the executor JVM's.
Solution:
I found a workaround to update the Jets3tProperties on all executors by referring to this link.
By referring to the above link I have put an additional code snippet, to update the Jets3tProperties, inside .foreachPartition() function which will run it for the first partition created on any of the executors.
Here is the code:
Dataset dataset= sqlContext
.read()
.format("com.databricks.spark.redshift")
.option("url", redshiftUrl)
.option("query", query)
.option("aws_iam_role", aws_iam_role)
.option("tempdir", "s3a://bucket-name/temp-dir")
.load();
dataset.foreachPartition(rdd -> {
boolean first=true;
if(first){
Jets3tProperties myProperties =
Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
myProperties
.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
first = false;
}
});
that stack implies that you're using the older s3n connector, based on jets3t. you are setting permissions which only work with S3a, the newer one. Use a URL like s3a:// to pick up the new entry.
Given you are trying to use V4 API, you'll need to set the fs.s3a.endpoint too. The 400/bad-request response is one you'd see if you tried to auth with v4 against the central endpointd

Resources