GeoSpark show SQL results fails - apache-spark

I am using GeoSpark 1.3.1 where I am trying to find all geo points that are contained in a POLYGON. I use the sql command:
val result = spark.sql("""
|SELECT *
|FROM spatial_trace, streetCrossDf
|WHERE ST_Within (streetCrossDf.geometry, spatial_trace.geometry)
""".stripMargin)
result.show()
The query itself runs fine but fails when I try to show the result, which looks like an output issue in the library. I am running this in a Zeppelin notebook. Can someone please tell me what I am doing wrong? I get the error below:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 15, 10.42.22.236, executor 3): java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
at org.apache.spark.sql.geosparksql.strategy.join.TraitJoinQueryExec$$anonfun$toSpatialRdd$1.apply(TraitJoinQueryExec.scala:164)
at org.apache.spark.sql.geosparksql.strategy.join.TraitJoinQueryExec$$anonfun$toSpatialRdd$1.apply(TraitJoinQueryExec.scala:163)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1122)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1122)
at org.apache.spark.SparkContext$$anonfun$36.apply(SparkContext.scala:2157)
at org.apache.spark.SparkContext$$anonfun$36.apply(SparkContext.scala:2157)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

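I know I'm a bit late, but the developer has addressed this. The geometry columns appear to hold plain strings (hence the UTF8String cast error), so they need to be converted to geometries with a constructor such as ST_GeomFromWKT before being passed to ST_Within.
Example fix:
WHERE ST_Within (ST_GeomFromWKT(streetCrossDf.geometry), ST_GeomFromWKT(spatial_trace.geometry))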

Related

FileNotFoundException: Spark save fails. Cannot clear cache from Dataset[T] avro

I get the following error when saving a dataframe in avro for a second time. If I delete sub_folder/part-00000-XXX-c000.avro after saving, and then try to save the same dataset, I get the following:
FileNotFoundException: File /.../main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
If I delete the files not only from sub_folder but also from main_folder, the problem doesn't happen, but I can't afford to do that.
The problem doesn't happen when saving the dataset in any other format.
Saving an empty dataset does not cause an error.
The error message suggests that the tables need to be refreshed, but according to the output of sparkSession.catalog.listTables().show() there are no tables to refresh.
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
The previously saved dataframe looks like this. The application is supposed to update it:
+--------------------+--------------------+
| Col1 | Col2 |
+--------------------+--------------------+
|[123456, , ABC, [...|[[v1CK, RAWNAME1_,..|
|[123456, , ABC, [...|[[BG8M, RAWNAME2_...|
+--------------------+--------------------+
To me this is clearly a cache problem. However, all attempts at clearing the cache have failed:
dataset.write
  .format("avro")
  .option("path", path)
  .mode(SaveMode.Overwrite) // Any save mode gives the same error
  .save()
// Moving this either before or after saving doesn't help.
sparkSession.catalog.clearCache()
// This will not un-persist any cached data that is built upon this Dataset.
dataset.cache().unpersist()
dataset.unpersist()
And this is how I read the dataset:
private def doReadFromPath[T <: SpecificRecord with Product with Serializable: TypeTag: ClassTag](path: String): Dataset[T] = {
  val df = sparkSession.read
    .format("avro")
    .load(path)
    .select("*")
  df.as[T]
}
Finally, this is the stack trace. Thanks a lot for your help!
ERROR [task-result-getter-3] (Logging.scala:70) - Task 0 in stage 9.0 failed 1 times; aborting job
ERROR [main] (Logging.scala:91) - Aborting job 150de02a-ac6a-4d42-824d-5db44a98c19a.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 11, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/DATA/XXX/main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:241)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
... 10 more
Reading from a location and writing to the same location causes this issue. It was also discussed in this forum, along with my answer there.
The message below, which appears in the error, is misleading; the actual issue is reading from and writing to the same location.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL
My example differs from yours in that it uses parquet where you use avro.
I have 2 options for you.
Option 1 (cache plus a light action such as show, as below):
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDS outside spark-shell
val df = Seq((1, 10), (2, 20), (3, 30)).toDS.toDF("sex", "date")
df.show(false)
df.repartition(1).write.format("parquet").mode("overwrite").save(".../temp") // save it
val df1 = spark.read.format("parquet").load(".../temp") // read it back
val df2 = df1.withColumn("cleanup", lit("Rod want to cleanup")) // the kind of update you described
// The next two steps are the important ones: cache, then force a light action.
// Without them the overwrite below fails with FileNotFoundException.
df2.cache // cache to avoid FileNotFoundException
df2.show(2, false) // light action to avoid FileNotFoundException
// or println(df2.count) // any other action
df2.repartition(1).write.format("parquet").mode("overwrite").save(".../temp")
println("Rod saved in same directory where he read it from final records he saved after clean up are ")
df2.show(false)
Option 2 :
1) Save the DataFrame to a different avro folder.
2) Delete the old avro folder.
3) Finally, rename the newly created avro folder to the old name.
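The three steps of Option 2 can be sketched in plain Java on a local filesystem. This is only an illustration under stated assumptions: the class and method names (DirectorySwap, swapIn, deleteRecursively) are hypothetical, the new data is assumed to have been fully written to the staging folder already (step 1), and on HDFS the delete and rename would go through Hadoop's FileSystem API rather than java.nio.file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: local-filesystem version of "write elsewhere, delete old, rename".
final class DirectorySwap {

    /** Recursively deletes a directory tree; does nothing if it does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so that files come before the directories containing them.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /**
     * Replaces target with staged. The caller must have finished writing
     * the new data into staged before calling this (step 1).
     */
    static void swapIn(Path staged, Path target) throws IOException {
        deleteRecursively(target);   // step 2: delete the old folder
        Files.move(staged, target);  // step 3: rename the new folder to the old name
    }
}
```

The swap is not atomic: anything that reads the target between the delete and the rename finds no folder there, so the caller would need to make sure no job is reading the old folder at that moment.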
Thanks a lot Ram Ghadiyaram!
Solution 2 solved my problem, but only on my local Ubuntu machine. When I tested on HDFS, the problem remained.
Solution 1 was the definitive fix. This is how my code looks now:
private def doWriteToPath[T <: Product with Serializable: TypeTag: ClassTag](dataset: Dataset[T], path: String): Unit = {
  // clear any previously cached avro
  sparkSession.catalog.clearCache()
  // update the cache for this particular dataset, and trigger an action
  dataset.cache().show(1)
  dataset.write
    .format("avro")
    .option("path", path)
    .mode(SaveMode.Overwrite)
    .save()
}
Some remarks:
I had indeed checked that post and tried its solution without success. I ruled it out as the cause of my problem for the following reasons:
I had created a temporary folder under 'main_folder', called 'sub_folder_temp', and saving still failed.
Saving the same non-empty dataset to the same path in json format works without the workaround discussed here.
Saving an empty dataset of the same type [T] to the same path works without the workaround discussed here.

write dataframe to cassandra facing BusyPoolException

I am trying to write a dataframe to Cassandra using the lines of code below. I was able to write to the table for some days, but suddenly this error appeared:
alertdf
.write.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "dummy", "table" -> "dummytable"))
.mode(SaveMode.Append)
.save()
I get the error below and am not able to find out what is going wrong:
ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatement@7dba59e2
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: **.**.**.**/**.**.**.**:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [**.**.**.**/**.**.**.**] Pool is busy (no available connection and the queue has reached its max size 256)))
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:211)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:46)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:275)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.onFailure(RequestHandler.java:338)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
at shade.com.datastax.spark.connector.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures$ImmediateFuture.addListener(Futures.java:106)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures.addCallback(Futures.java:1322)
at shade.com.datastax.spark.connector.google.common.util.concurrent.Futures.addCallback(Futures.java:1258)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.query(RequestHandler.java:297)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:272)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:115)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:95)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:40)
at com.sun.proxy.$Proxy14.executeAsync(Unknown Source)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:40)
at com.sun.proxy.$Proxy15.executeAsync(Unknown Source)
at com.datastax.spark.connector.writer.QueryExecutor$$anonfun$$lessinit$greater$1.apply(QueryExecutor.scala:11)
at com.datastax.spark.connector.writer.QueryExecutor$$anonfun$$lessinit$greater$1.apply(QueryExecutor.scala:11)
at com.datastax.spark.connector.writer.AsyncExecutor.executeAsync(AsyncExecutor.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$2.apply(TableWriter.scala:199)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$2.apply(TableWriter.scala:198)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Can anyone help me with this issue?
It looks like your servers are overloaded and don't process your requests in time. I recommend tuning the connector's write-related configuration parameters, such as output.concurrent.writes and output.throughput_mb_per_sec (both are set with the spark.cassandra. prefix), among others, but I would start with these two.

count throws java.lang.NumberFormatException: null on the file loaded from object store with inferSchema enabled

Calling count() on a dataframe loaded from IBM Bluemix Object Storage throws the following exception when inferSchema is enabled:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 3 in stage 43.0 failed 10 times, most recent failure: Lost task 3.9 in stage 43.0 (TID 166, yp-spark-dal09-env5-0034): java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:554)
at java.lang.Integer.parseInt(Integer.java:627)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
I don't get the above exception if I disable inferSchema.
Why am I getting this exception? By default, how many rows does Databricks read when inferSchema is enabled?
This was actually an issue with the spark-csv package (null value still not correctly parsed #192) that was carried over into Spark 2.0. It has been corrected in Spark 2.1.
Here is the associated PR: [SPARK-18269][SQL] CSV datasource should read null properly when schema is lager than parsed tokens.
Since you are already using Spark 2.0, you can easily upgrade to 2.1 and drop the spark-csv package. It's not needed anyway.
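To make the failure mode concrete: the trace shows toInt being called on a null token, and the PR title points at rows that have fewer tokens than the schema has fields. The following plain-Java sketch is not Spark's code; ShortRowCast and its methods are hypothetical names used only to show how a short row ends up as a NumberFormatException and how a null-tolerant cast avoids it.

```java
import java.util.Arrays;

// Illustrative only: not Spark's implementation. All names are hypothetical.
final class ShortRowCast {

    /** Pads a parsed row with nulls up to the schema width. */
    static String[] padToSchema(String[] tokens, int schemaWidth) {
        return Arrays.copyOf(tokens, Math.max(tokens.length, schemaWidth));
    }

    /** Naive cast: throws NumberFormatException when the token is null. */
    static Integer naiveToInt(String token) {
        return Integer.parseInt(token);
    }

    /** Null-tolerant cast: a missing token becomes a null value. */
    static Integer nullSafeToInt(String token) {
        return token == null ? null : Integer.parseInt(token);
    }
}
```

With a three-column integer schema and the row "1,2", the third token is null after padding; naiveToInt fails on it while nullSafeToInt yields null.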

Amazon s3a returns 400 Bad Request with Spark-redshift library

I am getting a java.io.IOException: s3n://bucket-name : 400 : Bad Request error while loading Redshift data through the spark-redshift library.
The Redshift cluster and the S3 bucket are both in the Mumbai region.
Here is the full error stack:
2017-01-13 13:14:22 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, master): java.io.IOException: s3n://bucket-name : 400 : Bad Request
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy10.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at com.databricks.spark.redshift.RedshiftRecordReader.initialize(RedshiftInputFormat.scala:115)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:92)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:279)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:263)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:425)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:279)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1052)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2264)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2193)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)
... 30 more
And here is my Java code:
SparkContext sparkContext = SparkSession.builder().appName("CreditModeling").getOrCreate().sparkContext();
sparkContext.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
sparkContext.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", fs_s3a_awsAccessKeyId);
sparkContext.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", fs_s3a_awsSecretAccessKey);
sparkContext.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com");
SQLContext sqlContext=new SQLContext(sparkContext);
Dataset dataset = sqlContext
    .read()
    .format("com.databricks.spark.redshift")
    .option("url", redshiftUrl)
    .option("query", query)
    .option("aws_iam_role", aws_iam_role)
    .option("tempdir", "s3a://bucket-name/temp-dir")
    .load();
I was able to solve the problem in Spark local mode by making the following changes, taken from a reference I followed:
1) I replaced the jets3t jar with version 0.9.4.
2) I changed the jets3t configuration properties to support AWS4-signature buckets as follows:
Jets3tProperties myProperties = Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
myProperties.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
But now I am trying to run the job in cluster mode (Spark standalone or with Mesos as the resource manager) and the error appears again.
Any help would be appreciated!
Actual Problem:
Updating Jets3tProperties at runtime to support AWS S3 signature version 4 worked in local mode but not in cluster mode, because the properties were only updated in the driver JVM and not in any of the executor JVMs.
Solution:
I found a workaround that updates the Jets3tProperties on all executors, based on a linked reference.
Following it, I added a code snippet that updates the Jets3tProperties inside the .foreachPartition() function, which is intended to run it for the first partition processed on each executor.
Here is the code:
Dataset dataset = sqlContext
    .read()
    .format("com.databricks.spark.redshift")
    .option("url", redshiftUrl)
    .option("query", query)
    .option("aws_iam_role", aws_iam_role)
    .option("tempdir", "s3a://bucket-name/temp-dir")
    .load();
dataset.foreachPartition(rdd -> {
    // Runs on the executor that processes this partition.
    // Note: first is a local variable, so this block executes for every partition.
    boolean first = true;
    if (first) {
        Jets3tProperties myProperties =
            Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
        myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
        myProperties.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
        myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
        first = false;
    }
});
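In the snippet above the first flag is a local variable, so the property update runs for every partition rather than once. If once per executor JVM is what is wanted, a static guard is one way to get there, since static state is held separately in each JVM. A minimal sketch using only the JDK; OncePerJvm and runOnce are hypothetical names, not part of Spark or jets3t:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: runs an initialisation action at most once per JVM.
final class OncePerJvm {
    private static final AtomicBoolean DONE = new AtomicBoolean(false);

    /**
     * Runs init only for the first caller in this JVM and returns true if it ran.
     * Later callers return false immediately; they do not wait for init to finish.
     */
    static boolean runOnce(Runnable init) {
        if (DONE.compareAndSet(false, true)) {
            init.run();
            return true;
        }
        return false;
    }
}
```

Inside foreachPartition, the jets3t property updates would then go in the Runnable passed to runOnce, so that later partitions on the same executor skip them.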
That stack trace implies that you're using the older s3n connector, which is based on jets3t. You are setting options that only work with s3a, the newer one. Use a URL like s3a:// to pick up the new connector.
Given that you are trying to use the V4 API, you'll need to set fs.s3a.endpoint too. The 400 Bad Request response is one you'd see if you tried to authenticate with V4 against the central endpoint.

Spark Hbase connection issue

I am hitting the following error while trying to connect to HBase through Spark (using newAPIHadoopRDD) on HDP 2.4.2. I already tried increasing the RPC timeout in the hbase-site.xml file, but I still get the same error. Any idea how to fix it?
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Wed Nov 16 14:59:36 IST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=71216: row 'scores,,00000000000000' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hklvadcnc06.hk.standardchartered.com,16020,1478491683763, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
at org.apache.hadoop.hbase.client.MetaScanner.allTableRegions(MetaScanner.java:324)
at org.apache.hadoop.hbase.client.HRegionLocator.getAllRegionLocations(HRegionLocator.java:88)
at org.apache.hadoop.hbase.util.RegionSizeCalculator.init(RegionSizeCalculator.java:94)
at org.apache.hadoop.hbase.util.RegionSizeCalculator.<init>(RegionSizeCalculator.java:81)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:256)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:237)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at scb.Hbasetest$.main(Hbasetest.scala:85)
at scb.Hbasetest.main(Hbasetest.scala)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=71216: row 'scores,,00000000000000' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hklvadcnc06.hk.standardchartered.com,16020,1478491683763, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to hklvadcnc06.hk.standardchartered.com/10.20.235.13:16020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to hklvadcnc06.hk.standardchartered.com/10.20.235.13:16020 is closing. Call id=9, waitTime=171
at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1281)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1252)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
... 4 more
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to hklvadcnc06.hk.standardchartered.com/10.20.235.13:16020 is closing. Call id=9, waitTime=171
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.cleanupCalls(RpcClientImpl.java:1078)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.close(RpcClientImpl.java:879)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.run(RpcClientImpl.java:604)
16/11/16 14:59:36 INFO SparkContext: Invoking stop() from shutdown hook
I added the hbase-conf path to the Hadoop classpath and the issue was resolved.
Thanks!
Although the context is a bit different, I faced a similar exception while connecting Hive to HBase.
It turned out my HBase table's column mapping was misconfigured.
After I configured the HBase table's columns properly (the table's metadata), the issue vanished.
WITH SERDEPROPERTIES("hbase.columns.mapping" = "personal data:,:key")
