I created a table in Hive called Test, installed Spark, and launched spark-shell. When I ran the query below
val o = spark.sql("select * from default.test");
I got the following error:
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1567)
at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1552)
at org.apache.spark.sql.hive.client.Shim_v0_12.databaseExists(HiveShim.scala:609)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$databaseExists$1(HiveClientImpl.scala:398)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:298)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:278)
at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:398)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:223)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:170)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:168)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:70)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:122)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:122)
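For reference, this is the kind of session setup spark-shell gives me out of the box; a minimal sketch using the standard builder API (the app name here is just a placeholder):
import org.apache.spark.sql.SparkSession

// Minimal sketch of the equivalent programmatic setup (spark-shell builds this for you
// when Spark is built with Hive support). The Hive metastore client in the stack trace
// is only instantiated lazily, on the first catalog lookup.
val spark = SparkSession.builder()
  .appName("hive-test")      // placeholder app name
  .enableHiveSupport()       // requires a Hive-enabled Spark build
  .getOrCreate()

val o = spark.sql("select * from default.test")   // fails as shown above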
Related
I want to do some rollups on my data in Java, using the Spark SQL Dataset/DataFrame API. However, it throws an error:
Job aborted due to stage failure: Task serialization failed: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:118)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:295)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1163)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1071)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1014)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2069)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
My code is like this:
Dataset<Row> dataset = sparkSession.createDataFrame(rdd, MyPojo.class); // where rdd has type JavaRDD<MyPojo>
dataset.collectAsList();
Why is it throwing this error?
I am using the Couchbase Spark connector with Spark Structured Streaming and have enabled checkpointing on the streaming query. When I rerun the application against a previously checkpointed location, I get the class cast exception "java.lang.ClassCastException: class org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to class com.couchbase.spark.sql.streaming.CouchbaseSourceOffset". If I delete the contents of the checkpoint directory, Spark runs fine. Is this a bug in Spark? I am using Spark 2.4.5.
20/04/23 19:11:29 ERROR MicroBatchExecution: Query [id = 1ce2e002-20ee-401e-98de-27e70b27f1a4, runId = 0b89094f-3bae-4927-b09c-24d9deaf5901] terminated with error
java.lang.ClassCastException: class org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to class com.couchbase.spark.sql.streaming.CouchbaseSourceOffset (org.apache.spark.sql.execution.streaming.SerializedOffset and com.couchbase.spark.sql.streaming.CouchbaseSourceOffset are in unnamed module of loader 'app')
at com.couchbase.spark.sql.streaming.CouchbaseSource.$anonfun$getBatch$2(CouchbaseSource.scala:172)
at scala.Option.map(Option.scala:230)
at com.couchbase.spark.sql.streaming.CouchbaseSource.getBatch(CouchbaseSource.scala:172)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$populateStartOffsets$3(MicroBatchExecution.scala:284)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.populateStartOffsets(MicroBatchExecution.scala:281)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:169)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:349)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
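For context, the streaming query is wired up roughly like this (a sketch rather than my exact code; the source format string is only inferred from the package names in the stack trace, and the paths are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cb-stream").getOrCreate()

// Assumed Couchbase structured streaming source; the format name is an assumption
// based on the com.couchbase.spark.sql.streaming package seen in the stack trace.
val stream = spark.readStream
  .format("com.couchbase.spark.sql")
  .load()

// Checkpointing is enabled on the query, as described above; sink and paths are placeholders.
val query = stream.writeStream
  .format("parquet")
  .option("path", "/tmp/couchbase-out")
  .option("checkpointLocation", "/tmp/couchbase-checkpoint")
  .start()

query.awaitTermination()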
I created Avro data files using Spark 2 and then defined a Hive table pointing to them.
import org.apache.spark.sql.types._   // for IntegerType, LongType, DataTypes

val trades = spark.read.option("compression", "gzip").csv("file:///data/nyse_all/nyse_data")
  .select(
    $"_c0".as("stockticker"),
    $"_c1".as("tradedate").cast(IntegerType),
    $"_c2".as("openprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c3".as("highprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c4".as("lowprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c5".as("closeprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c6".as("volume").cast(LongType))

trades.repartition(4, $"tradedate", $"volume")
  .sortWithinPartitions($"tradedate".asc, $"volume".desc)
  .write.format("com.databricks.spark.avro")
  .save("/user/pawinder/spark_practice/problem6/data/nyse_data_avro")

spark.sql("""create external table pawinder.nyse_data_avro(
  stockticker string, tradedate int, openprice decimal(10,2), highprice decimal(10,2),
  lowprice decimal(10,2), closeprice decimal(10,2), volume bigint)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  location '/user/pawinder/spark_practice/problem6/data/nyse_data_avro'""")
Querying the hive table fails with the following error:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable@178270b2
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable@178270b2
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:563)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
... 8 more
Caused by: org.apache.avro.AvroTypeException: Found string, expecting union
On some debugging, I found that the columns defined as decimal(10,2) are stored as string in the Avro data files:
[pawinder#gw02 ~]$ hdfs dfs -cat /user/pawinder/spark_practice/problem6/data/nyse_data_avro/part-00003-f1ca3b0a-f0b4-4aa8-bc26-ca50a0a16fe3-c000.avro |more
Objavro.schema▒{"type":"record","name":"topLevelRecord","fields":[
  {"name":"stockticker","type":["string","null"]},
  {"name":"tradedate","type":["int","null"]},
  {"name":"openprice","type":["string","null"]},
  {"name":"highprice","type":["string","null"]},
  {"name":"lowprice","type":["string","null"]},
  {"name":"closeprice","type":["string","null"]},
  {"name":"volume","type":["long","null"]}]}
I am able to query the same Hive table in spark-shell. Is Spark SQL's DecimalType not recognised by the Avro SerDe? I am using Spark 2.3.
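For what it's worth, reading the files back with the same Avro package shows the same thing as the header above (a quick sketch, not part of the original job):
val written = spark.read
  .format("com.databricks.spark.avro")
  .load("/user/pawinder/spark_practice/problem6/data/nyse_data_avro")

// openprice, highprice, lowprice and closeprice should come back as string,
// matching the schema embedded in the Avro file header shown above.
written.printSchema()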
Our Spark setup runs on 3 servers, all of which can see the HBase cluster.
I'm using Hadoop 2.7.3, HBase 1.2.6 and Spark 2.1.3.
I connect to Spark with
/opt/spark/bin/spark-shell --master spark://master:7077
and run the following
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor}
import org.apache.hadoop.hbase.client.{HBaseAdmin, HTable, Put, Result, TableDescriptor}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import scala.collection.JavaConverters._
val conf = HBaseConfiguration.create()
val tablename = "default:Table1"
conf.set(TableInputFormat.INPUT_TABLE,tablename)
val admin = new HBaseAdmin(conf)
admin.isTableAvailable(tablename)
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
hBaseRDD.count()
When run in the spark-shell, admin.isTableAvailable(tablename) returns true, which suggests that Spark can access HBase. However, invoking hBaseRDD.count() raises the following error:
java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1600)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2280)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2204)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2062)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
2018-07-17 15:58:54,974 ERROR [task-result-getter-3] scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.11.1.12, executor 2): java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1600)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2280)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2204)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2062)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1455)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1443)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1670)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1625)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1614)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1928)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1941)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1954)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1968)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
... 52 elided
Caused by: java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2776)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1600)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2280)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2204)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2062)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:301)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
This error occurs when submitting to the cluster, while the same script works correctly in the spark-shell.
From some research, it looks like this can be caused by a version mismatch, or by how I set spark.driver.extraClassPath, which is currently:
/opt/spark/jars/*:/opt/hbase/lib/commons-collections-3.2.2.jar:/opt/hbase/lib/commons-httpclient-3.1.jar:/opt/hbase/lib/findbugs-annotations-1.3.9-1.jar:/opt/hbase/lib/hbase-annotations-1.2.6.jar:/opt/hbase/lib/hbase-annotations-1.2.6-tests.jar:/opt/hbase/lib/hbase-client-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6-tests.jar:/opt/hbase/lib/hbase-examples-1.2.6.jar:/opt/hbase/lib/hbase-external-blockcache-1.2.6.jar:/opt/hbase/lib/hbase-hadoop2-compat-1.2.6.jar:/opt/hbase/lib/hbase-hadoop-compat-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6.jar:/opt/hbase/lib/hbase-it-1.2.6-tests.jar:/opt/hbase/lib/hbase-prefix-tree-1.2.6.jar:/opt/hbase/lib/hbase-procedure-1.2.6.jar:/opt/hbase/lib/hbase-protocol-1.2.6.jar:/opt/hbase/lib/hbase-resource-bundle-1.2.6.jar:/opt/hbase/lib/hbase-rest-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6-tests.jar:/opt/hbase/lib/hbase-shell-1.2.6.jar:/opt/hbase/lib/hbase-thrift-1.2.6.jar:/opt/hbase/lib/jetty-util-6.1.26.jar:/opt/hbase/lib/ruby/hbase:/opt/hbase/lib/ruby/hbase/hbase.rb:/opt/hbase/lib/ruby/hbase.rb:/opt/hbase/lib/protobuf-java-2.5.0.jar:/opt/hbase/lib/metrics-core-2.2.0.jar:/opt/hbase/lib/htrace-core-3.1.0-incubating.jar:/opt/hbase/lib/guava-12.0.1.jar:/opt/hbase/lib/asm-3.1.jar:/opt/hbase/lib/Cdrpackage.jar:/opt/hbase/lib/commons-daemon-1.0.13.jar:/opt/hbase/lib/commons-el-1.0.jar:/opt/hbase/lib/commons-math-2.2.jar:/opt/hbase/lib/disruptor-3.3.0.jar:/opt/hbase/lib/jamon-runtime-2.4.1.jar:/opt/hbase/lib/jasper-compiler-5.5.23.jar:/opt/hbase/lib/jasper-runtime-5.5.23.jar:/opt/hbase/lib/jaxb-impl-2.2.3-1.jar:/opt/hbase/lib/jcodings-1.0.8.jar:/opt/hbase/lib/jersey-core-1.9.jar:/opt/hbase/lib/jersey-guice-1.9.jar:/opt/hbase/lib/jersey-json-1.9.jar:/opt/hbase/lib/jettison-1.3.3.jar:/opt/hbase/lib/jetty-sslengine-6.1.26.jar:/opt/hbase/lib/joni-2.1.2.jar:/opt/hbase/lib/jruby-complete-1.6.8.jar:/opt/hbase/lib/jsch-0.1.42.jar:/opt/hbase/lib/jsp-2.1-6.1.14.jar:/opt/hbase/lib/junit-4.12.jar:/opt/hbase/lib/servlet-api-2.5-6.1.14.jar:/opt/hbase/lib/servlet-api-2.5.jar:/opt/hbase/lib/spymemcached-2.11.6.jar:/opt/hive-hbase//opt/hive-hbase/hive-hbase-handler-2.0.1.jar
Is there any solution, or is this a Spark issue or bug?
I found the solution.
First I tried upgrading the Java, Hadoop, HBase and Spark versions, but I ran into the same exception.
Finally I found the reason: in spark/conf/spark-defaults.conf I had already set the spark.driver.extraClassPath parameter, but the workers also need spark.executor.extraClassPath to find those jar files.
With that set, the original versions worked correctly.
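For illustration, the relevant lines in spark-defaults.conf end up looking roughly like this (the jar list is abbreviated here; use the full list shown in the question):
spark.driver.extraClassPath    /opt/spark/jars/*:/opt/hbase/lib/hbase-client-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6.jar:/opt/hbase/lib/hbase-protocol-1.2.6.jar:/opt/hbase/lib/metrics-core-2.2.0.jar:/opt/hbase/lib/htrace-core-3.1.0-incubating.jar:/opt/hbase/lib/guava-12.0.1.jar
spark.executor.extraClassPath  /opt/spark/jars/*:/opt/hbase/lib/hbase-client-1.2.6.jar:/opt/hbase/lib/hbase-common-1.2.6.jar:/opt/hbase/lib/hbase-server-1.2.6.jar:/opt/hbase/lib/hbase-protocol-1.2.6.jar:/opt/hbase/lib/metrics-core-2.2.0.jar:/opt/hbase/lib/htrace-core-3.1.0-incubating.jar:/opt/hbase/lib/guava-12.0.1.jar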
I am running into a weird issue with Spark 2.0, using SparkSession to load a text file. Currently my Spark config looks like:
val sparkConf = new SparkConf().setAppName("name-here")
sparkConf.registerKryoClasses(Array(Class.forName("org.apache.hadoop.io.LongWritable"), Class.forName("org.apache.hadoop.io.Text")))
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.enableServerSideEncryption", "true")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
If I load an s3a file through an rdd, it works fine. However, if I try to use something like:
val blah = SparkConfig.spark.read.text("s3a://bucket-name/*/*.txt")
.select(input_file_name, col("value"))
.drop("value")
.distinct()
val x = blah.collect()
println(blah.head().get(0))
println(x.size)
I get an exception that says: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
Do I need to add some additional s3a configuration for the SQLContext or SparkSession? I haven't found any question or online resource that covers this. What is weird is that the job seems to run for 10 minutes and then fails with this exception. Again, using the same bucket, a regular RDD load has no issues.
The other weird thing is that it is complaining about s3 and not s3a. I have triple-checked my prefix, and it always says s3a.
Edit: Checked both s3a and s3; both throw the same exception.
17/04/06 21:29:14 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Globber.glob(Globber.java:240)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1732)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:374)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:486)
at com.omitted.omitted.jobs.Omitted$.doThings(Omitted.scala:18)
at com.omitted.omitted.jobs.Omitted$.main(Omitted.scala:93)
at com.omitted.omitted.jobs.Omitted.main(Omitted.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.failExpecting(URI.java:2854)
at java.net.URI$Parser.parse(URI.java:3057)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 26 more
17/04/06 21:29:14 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:)
This should work.
Get the right JARs on your classpath: Spark built against Hadoop 2.7, the matching hadoop-aws JAR, aws-java-sdk-1.7.4.jar (exactly this version), and joda-time-2.9.3.jar (or a later version).
You shouldn't need to set the fs.s3a.impl value, as that's done in the Hadoop default settings; if you find yourself doing that, it's a sign of a problem.
What's the full stack trace?
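For reference, a minimal sketch of that stripped-down read (the bucket and path are placeholders, and it assumes the JARs above are already on the classpath):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

// No fs.s3a.impl override; the s3a filesystem is picked up from the Hadoop defaults
// once hadoop-aws and the matching aws-java-sdk are on the classpath.
val spark = SparkSession.builder().appName("s3a-read").getOrCreate()

val files = spark.read.text("s3a://some-bucket/some-prefix/*.txt")   // placeholder bucket/path
  .select(input_file_name().as("file"))
  .distinct()

files.show(false)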