Amazon s3a returns 400 Bad Request with Spark-redshift library - apache-spark

I am facing java.io.IOException: s3n://bucket-name : 400 : Bad Request error while loading Redshift data through spark-redshift library:
The Redshift cluster and the s3 bucket both are in mumbai region.
Here is the full error stack:
2017-01-13 13:14:22 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, master): java.io.IOException: s3n://bucket-name : 400 : Bad Request
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy10.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at com.databricks.spark.redshift.RedshiftRecordReader.initialize(RedshiftInputFormat.scala:115)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:92)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:279)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:263)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:425)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:279)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1052)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2264)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2193)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)
... 30 more
And here is my java code for the same:
SparkContext sparkContext = SparkSession.builder().appName("CreditModeling").getOrCreate().sparkContext();
sparkContext.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
sparkContext.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", fs_s3a_awsAccessKeyId);
sparkContext.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", fs_s3a_awsSecretAccessKey);
sparkContext.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com");
SQLContext sqlContext=new SQLContext(sparkContext);
Dataset dataset= sqlContext
.read()
.format("com.databricks.spark.redshift")
.option("url", redshiftUrl)
.option("query", query)
.option("aws_iam_role", aws_iam_role)
.option("tempdir", "s3a://bucket-name/temp-dir")
.load();
I was able to solve the problem on spark local mode by doing following changes (referred this):
1) I have replaced the jets3t jar to 0.9.4
2) Changed jets3t configuration properties to support the aws4 version bucket as follows:
Jets3tProperties myProperties = Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
myProperties.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
But now i am trying to run the job in a clustered mode (spark standalone mode or with a resource manager MESOS) and the error appears again :(
Any help would be appreciated!

Actual Problem:
Updating Jets3tProperties, to support AWS s3 signature version 4, at runtime worked on local mode but not on cluster mode because the properties were only getting updated on the driver JVM but not on any of the executor JVM's.
Solution:
I found a workaround to update the Jets3tProperties on all executors by referring to this link.
By referring to the above link I have put an additional code snippet, to update the Jets3tProperties, inside .foreachPartition() function which will run it for the first partition created on any of the executors.
Here is the code:
Dataset dataset= sqlContext
.read()
.format("com.databricks.spark.redshift")
.option("url", redshiftUrl)
.option("query", query)
.option("aws_iam_role", aws_iam_role)
.option("tempdir", "s3a://bucket-name/temp-dir")
.load();
dataset.foreachPartition(rdd -> {
boolean first=true;
if(first){
Jets3tProperties myProperties =
Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
myProperties
.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
first = false;
}
});

that stack implies that you're using the older s3n connector, based on jets3t. you are setting permissions which only work with S3a, the newer one. Use a URL like s3a:// to pick up the new entry.
Given you are trying to use V4 API, you'll need to set the fs.s3a.endpoint too. The 400/bad-request response is one you'd see if you tried to auth with v4 against the central endpointd

Related

BigQuery 'INTERNAL: request failed: internal error' when retrieving __TABLES__ from PySpark

I am trying to query the __TABLES__ data in BigQuery from PySpark. I was using this code to interrogate the system table:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config('parentProject', 'my-parent-project')\
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.1')\
.getOrCreate()
spark.read.format('bigquery')\
.option("credentials", my_key)\
.option("project", 'my-parent-project') \
.option('table', 'my-dataset.__TABLES__') \
.load()
and it was working up to 2021/06/25. The following day all of a sudden for this very code I started receiving this error message instead:
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InternalException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: request failed: internal error
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:67)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
(the stack trace is even longer than this, if needed ask for it and I'll post the rest of it).
Did anything changed in BigQuery? The error message is not very helpful to me to troubleshoot, any suggestions? To add more context, with the same code I'm able to query other tables in the same dataset.
I observed this behaviour with Spark 3.0.1 and BigQuery connector com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.1. I've also experienced same behaviour using Spark 2.4.3 and BigQuery connector com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1
Notice that the __TABLES__ is not an actual BigQuery table, but a view to its metadata. A way to overcome this is to execute the following:
spark = SparkSession.builder\
.config('parentProject', 'my-parent-project')\
.config('viewsEnabled','true')\
.config('materializationDataset', DATASET)
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1')\
.getOrCreate()
tables_df = spark.read.format('bigquery')\
.option("credentials", my_key)\
.load("SELECT * FROM my-project.my-dataset.__TABLES__")

SparkBQ connector issue: IllegalArgumentException: Wrong bucket: bq_temporary_folder, expected bucket: null

I need help figuring out why my Spark job that is using the spark-bigquery-connector failed during the write step to temporaryGcsBucket.
Here is the Spark code that is trying to write the data to BQ table
outDF.write
.format("bigquery")
.option("temporaryGcsBucket", "bq_temporary_folder")
.option("parentProject", "user_project")
.option("table", "user.destination_table")
.mode(SaveMode.Overwrite)
.save()
Here is the error log:
Caused by: java.lang.IllegalArgumentException: Wrong bucket: bq_temporary_folder, in path: gs://bq_temporary_folder/.spark-bigquery-application_1605725797163_1966-6f6cfc35-543b-4138-ae0e-649ce7c2ae56, expected bucket: null
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:581)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:454)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:144)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:103)
at org.apache.parquet.hadoop.ParquetOutputCommitter.<init>(ParquetOutputCommitter.java:43)
at org.apache.parquet.hadoop.ParquetOutputFormat.getOutputCommitter(ParquetOutputFormat.java:442)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupCommitter(HadoopMapReduceCommitProtocol.scala:100)
at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:40)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:217)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:226)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:175)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:405)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm using the SparkBQ connector spark-bigquery-with-dependencies_2.12-0.18.0.jar and the GCS connector gcs-connector-hadoop2-2.0.0-shaded.jar.
I do not think it's permission issue. Since the Spark job was able to write to the same GCS bucket by calling rdd.saveAsTextFile()
When I checked the GoogleHadoopFileSystem code, the expected bucket: null error message occurred because the rootBucket in the GoogleHadoopFileSystemBase class is not set.
But I don't know how I can get this field initialized in the first place.
Someone told me that there might be a mismatch GCS connector version between my Spark job and the Spark cluster. Even after confirming that the Spark job was using the GCS connector library configured for the Spark cluster, the job still failed.
I'm running out of idea what to try at this point.
Thank you in advance for any advices/helps.

write dataframe to cassandra facing BusyPoolException

I am trying to write dataframe to cassandra using these line of code,was able to write to table for someday but suddenly the error came
alertdf
.write.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "dummy", "table" -> "dummytable"))
.mode(SaveMode.Append)
.save()
I get this below error,not able to find out what is getting wrong
ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatement#7dba59e2
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: **.**.**.**/**.**.**.**:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [**.**.**.**/**.**.**.**] Pool is busy (no available connection and the queue has reached its max size 256)))
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:211)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:46)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:275)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.onFailure(RequestHandler.java:338)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
at shade.com.datastax.spark.connector.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures$ImmediateFuture.addListener(Futures.java:106)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures.addCallback(Futures.java:1322)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures.addCallback(Futures.java:1258)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.query(RequestHandler.java:297)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:272)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:115)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:95)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:40)
at com.sun.proxy.$Proxy14.executeAsync(Unknown Source)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:40)
at com.sun.proxy.$Proxy15.executeAsync(Unknown Source)
at com.datastax.spark.connector.writer.QueryExecutor$$anonfun$$lessinit$greater$1.apply(QueryExecutor.scala:11)
at com.datastax.spark.connector.writer.QueryExecutor$$anonfun$$lessinit$greater$1.apply(QueryExecutor.scala:11)
at com.datastax.spark.connector.writer.AsyncExecutor.executeAsync(AsyncExecutor.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$2.apply(TableWriter.scala:199)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$2.apply(TableWriter.scala:198)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
can anyone help me with this issue?
It looks like that your servers are overloaded, and don't process your requests on time. I recommend to try to tune write-related configuration parameters, like, output.concurrent.writes, output.throughput_mb_per_sec and other, but I would start with first 2.

How to setup Elasticsearch Structured Streaming with X-Pack enabled?

I'm trying to use Elasticsearch (ES) 6.1.1 Hadoop with installed x-pack to write data using Spark Structured Streaming 2.2.1.
This is my code (the index already exists in elastic):
val exceptions = spark
.readStream
.text(path)
val advancedQuery = exceptions
.writeStream
.format("org.elasticsearch.spark.sql")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "/checkpoint")
val runningQuery = advancedQuery.start("spark/exc")
runningQuery.awaitTermination
But I do get the following exception:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:327)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:575)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.sql.streaming.EsStreamQueryWriter.run(EsStreamQueryWriter.scala:41)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:52)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:51)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
**Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: missing authentication token for REST request [/]
null**
How can I set the required authentication data?
Figured it out:
One needs to add two additional options "es.net.http.auth.user" and "es.net.http.auth.pass" like here:
val advancedQuery = exceptions
...
.option("es.net.http.auth.user", "*your elastic user goes here*")
.option("es.net.http.auth.pass", "*your elastic password goes here*")
...

Error when read hbase within spark rdd foreach action

I am trying to use spark streaming to consume a kafka message queue, within the foreachRdd operation, I am trying to read hbase within a rdd foreach action according to the messages that taking out from kafka. But I am getting some errors, looks like there are some dead lock. Anybody can help me figure out what's the issue here?
Below is the code details.
KafkaUtils.createStream(streamingContext,kafkazk,kafkaGroup,kafkaTopic.split(",").map(_.trim -> 1).toMap)
.foreachRDD(_.processRDD(new CQRequestContext(null,new DBRequestContext(prefix,dbName,tableName,null,null,null))))
streamingContext.start()
streamingContext.awaitTermination()
I am trying to do some hbase read with in each rdd foreach action, I am trying to create a hbase connection inthe foreachPartition action, in order to reduce the connection numbers.
def processRDD(cQRequestContext: CQRequestContext) = {
rdd.foreachPartition(iterator=>{
val connection = ConnectionFactory.createConnection(HbaseUtil.getHbaseConfiguration("FdsCQ"));
val hTable= connection.getTable(TableName.valueOf("cqdev:sampleTestTable"))
try {
iterator.foreach(requestTuple =>{
//processAIPRequest(requestTuple._2,cQRequestContext,myTable)
val p = new Put(Bytes.toBytes("dabao12345"))
p.addColumn(Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.COLUMN_FAMILY), Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.RESPONSE_CONTENT), Bytes.toBytes("content"))
p.addColumn(Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.COLUMN_FAMILY), Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.PACKAGE_NOS), Bytes.toBytes("response"))
p.addColumn(Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.COLUMN_FAMILY), Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.ABUSE_PACKAGES_NO), Bytes.toBytes("response"))
p.addColumn(Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.COLUMN_FAMILY), Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.RESPONSE_CODE), Bytes.toBytes("responseCode"))
p.addColumn(Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.COLUMN_FAMILY), Bytes.toBytes(SellerConstants.CQ_ABUSE_RESULT.RESPONSE_STATUS), Bytes.toBytes("respone"))
hTable.put(p)
})
} finally {
//scanner.close
hTable.close()
connection.close()
}
})
}
And finally, I am getting below error, After debug I found this is not something like running out of connection from hbase, During testing I only put one message into kafka, and when the first time I am trying to create the hbase connection, I got the below errors.
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:222)
org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:481)
org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:86)
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:849)
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:670)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:218)
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
com.XXXXXXXX.func.CQStreamingFunction$$anonfun$processRDD$1.apply(CQStreamingFunction.scala:50)
com.XXXXXXXX.func.CQStreamingFunction$$anonfun$processRDD$1.apply(CQStreamingFunction.scala:49)
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:88)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Further more the above error only happens, when I try to run my spark streaming job in yarn-cluster mode in a yarn cluster, When I run in local mode in my local machine and connect to the remote hbase, everything works fine.
Env Details: CDH 5.5.2 with spark 1.5.0

Resources